Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserving external images (issue #399) #447

Merged
merged 19 commits into from
Jan 19, 2020

Conversation

smaant
Copy link
Collaborator

@smaant smaant commented Oct 14, 2019

Added preserving external images into a local storage.

Configuration

Added preserve-external-images parameter (PRESERVE_EXTERNAL_IMAGES envvar) which is false by default.

Implementation details

Images Preserver acts as a CommentConverter and utilizes default image storage as it configured by Remark admin. Images Preserver extracts all external images (sourced outside of REMARK_URL), downloads and stores them into local storage, after that image URL is replaced with a local version. If at any moment of downloading/storing image any problems occur (for example image url returns 404, or image has unsupported format, or too big, etc) then original image url will be kept.

Considered and rejected

  1. Configuring it as an Image Proxy mode (e.g. IMG_PROXY=[none,proxy,store]). For some reason, even if implementation is somewhat close to Image Proxy, it feels that from a user (Remark admin) perspective Image Proxy and Image Preserving are two different things. Plus after storing an image Image Preserver no longer acts as a proxy.
  2. Configuring it as an Image Store option. Doesn't feel like an Image Store kind of property, more like an engine behavior in general.

Future improvements

  1. Given that previously all stored images were directly uploaded by users, it kind of reduced chances of storing multiple duplicates. With Images Preserving enabled, people constantly referring to the same memes/gifs/etc might fill local store with duplicates pretty quickly. Some kind of deduplication might help.

@smaant
Copy link
Collaborator Author

smaant commented Oct 14, 2019

Looks like deprecation warning from golangci-lint breaks json for GolangCI

@smaant
Copy link
Collaborator Author

smaant commented Oct 14, 2019

Option is renamed to timeout golangci/golangci-lint#793

@umputun
Copy link
Owner

umputun commented Oct 15, 2019

Thx for the PR.

My primary concern is this - " rejected: Configuring it as an Image Proxy mode" but to me, it sounds like the right way to do it. The current proxy already does most of the work of mapping picture urls, converting comment and serving back different (proxied) content on the fly. Preserving images on the FS (or other stores) is a natural extension of such a proxy. i.e. some form of cached proxy. Btw, if we treat stored images as persistent cache, we can limit the size of the cache (number of stored images or total volume on the disk) with LFU or LRU.

Besides, it feels like rest-level functionality and not the engine-level. I think with less code, it can be done less intrusively with an additional proxy.Image property and a few, relatively minor changes in Convert and Handler function. After Convert got the list of images we can store them with the original url encoded somehow into the file name, and Handle could serve from those files instead of reading from the remote URL. By doing this, we won't have images segmented by the user name, and it will allow native deduplication.

One potential disadvantage with this simplified logic - we won't be able to utilize submit+commit technic used in image.Service and won't be able to drop unlinked images from the storage. However, your implementation doesn't use it either (I think you use Service just as a Store), and this issue can be addressed as a part of LFU logic some day.

Another potentially promising solution is to combine image.Store part with image proxy solution described above. I.e. inject image.Store directly into proxy.Image and discard uder_id completely by passing in something like "system_images" as a user name.

@smaant
Copy link
Collaborator Author

smaant commented Oct 18, 2019

It might be a long answer, apologize in advance :)

I think there're two key things to clarify:

  1. What kind of guaranties this "local copy" should provide? I assumed they should be treated just like uploaded pictures therefore stored permanently. I may probably argue that cache approach won't solve stated problem ("images hosted on some external place and, sometimes, those pictures stored temporarily") because it's very much possible that Remark's cache for a particular image will be purged around the same time as image disappears from external place.
  2. When you say that this feature sounds very much like an extension for a proxy functionality, do you mean from implementation perspective only or from configuration pov also? In my opinion they should be configured separately because they solve two completely different problems. From a user (remark admin) perspective:
    • Proxy — "I want to make sure user isn't getting warnings about insecure connection when external image is coming from HTTP site" (and proxy is just one of the possible solutions for this problem);
    • Preserving images — "I want to make sure comments are preserved in original state even when they refer to external resources".

Besides these two things a few more comments:

After Convert got the list of images we can store them with the original url encoded somehow into the file name

Imho pretty dangerous approach, urls are mutable, tomorrow the same url may refer to a completely different picture but remark's storage/cache will continue serving old one.

However, your implementation doesn't use it (submit+commit technic) either

In fact it does use it. Because I put images into a permanent storage and replace urls with /api/v1/pictures/... all of them will be committed during comment posting. And cleanup is just an infinite loop which removed everything from staging older than TTL (no extra actions from me are required).

discard uder_id completely by passing in something like "system_images" as a user name

Yeah, I thought about it too. One thing that stopped me, eventually some remark installation will get an actual user called "system_images" :) Who knows what other features remark will have then, for example you may add returning all images for a user, imagine surprise of this "system_images" user :)

Let me know what you think, after all it's your project and I'm totally fine changing implementation in a way you see it.

@umputun
Copy link
Owner

umputun commented Oct 18, 2019

It might be a long answer, apologize in advance :)

No worries, mine will be even longer ;)

TLDR: up to you with a couple of extras

it's very much possible that Remark's cache for a particular image will be purged around the same time as image disappears from external place.

Sure. I mentioned LRU/LFU type of cleanup as a hypothetical "possible someday" kind of extension. Anything unlimited may run out of space someday, but let's not care about it now, like I ignored this problem for images directly uploaded to remark42 with a hope to always have big enough disk to keep them forever.

When you say that this feature sounds very much like an extension for a proxy functionality, do you mean from implementation perspective only or from configuration pov also?

Both. I don't see the ability to serve the response from the stored file as something disqualifying the thing to be a proxy. One of the most popular proxies is squid, and this is exactly what it does. Currently, the proxy we have in remark42 limited to provide http->https "forwarding" but this forwarding is for the same external assets, and I don't see anything conceptually different between serving request by calling http.Get to load a picture or some store.Load func reading the picture from FS. I still treat this thing as some sort of cached proxy. We agreed to cache forever, but the retention period doesn't make much difference technically or conceptually.

Imho pretty dangerous approach, urls are mutable, tomorrow the same url may refer to a completely different picture but remark's storage/cache will continue serving old one.

Well, yes and no. I mean, this is a typical tradeoff between size and accuracy. We may address it in different ways without killing deduplication completely, for example, checking the size of the remote image after some TTL and including size into the encoded filename. But I don't think we care about this too much, at least for now, because currently, we including images exactly the same way by URL. However, I don't have any fundamental problem with saving images under user_id if you think it may help somehow.

In fact it does use it. Because I put images into a permanent storage and replace urls with
/api/v1/pictures/... all of them will be committed during comment posting

Ok, agree.

eventually some remark installation will get an actual user called "system_images"

No, this is not possible with user_id because ids formed as <provider>_sha, i.e. github_123456779. Even if someday we add a provider system we won't be able to make sha what looks like "images" ;)

Let me know what you think, after all it's your project and I'm totally fine changing implementation in a way you see it.

I don't have a major technical problem with your current implementation. It is fine and clear code, decently covered. I didn't like the change of converter interface initially but can live with it. To me, passing user_id felt too artificial, but I guess, in some cases such ability can be handy.

Handling imgPreserver as a convertor was the most controversial idea here (from my pov), and the main reason why I started this discussion. However, I can accept it and, maybe, may begin to like it one day. It still feels like the wrong level to me, but it won't wake me up in the middle of the night screaming and crying ;)

The bottom line - I don't want you to spend too much time redoing this. If you have some bright ideas on how to keep the most of it but address the "wrong level" issue - it will be nice. If not - I'll gratefully accept the current implementation.

Some additional things to consider:

  • I don't see how you handle the lack of stored file. I think currently it has no way to extract the original url. Maybe adding encoded orig url to resImgURL (as a query param) may help here. Not a big deal, but it means if the user changed his mind and decided not to serve already preserved images, it will be impossible to do.
  • both user-uploaded images, as well as proxied images, will be stored side by side with no way to realize which is what. Not an immediate problem, but can be someday as some big site will ask to remove all images remark42 downloaded from them. Smth to think about.

@smaant
Copy link
Collaborator Author

smaant commented Oct 19, 2019

Haha, thank you very much for a very detailed response! I actually think that "proxy" idea is growing on me, I'll see what I can do with it. I'm going to spend 5 hours on a plane this weekend, may just as well do something useful :)

backend/app/rest/proxy/image.go Outdated Show resolved Hide resolved
backend/app/rest/proxy/image.go Show resolved Hide resolved
@umputun
Copy link
Owner

umputun commented Dec 16, 2019

so, you had 2 months to think about it ;) any conclusions?

@smaant
Copy link
Collaborator Author

smaant commented Dec 16, 2019

@umputun doh, I'm so sorry for disappearing! Yes, I actually pushed my changes implementing caching in proxy back in October (see the last commit e4db45a), now need only rebase it to master and add/update tests (and probably add support into BoltDB storage). I think I'll try to do it during Christmas holidays, would it work?

@umputun
Copy link
Owner

umputun commented Dec 17, 2019

sure. somehow I missed that push completely

@smaant
Copy link
Collaborator Author

smaant commented Dec 17, 2019

Yeah, it's my fault probably, I was planning to finish everything quickly so didn't add any comment after pushing. Do you want to take a look at the implementation now (I don't expect anything to change while rebasing/adding tests) or would prefer to review finished version?

@umputun
Copy link
Owner

umputun commented Dec 17, 2019

let's do it on rebased version. I don't recall what even changed in store/image package and caused the conflict, so it's hard to tell how transparent this merge is going to be

@smaant
Copy link
Collaborator Author

smaant commented Dec 30, 2019

Cleaned up everything, added tests and rebased to master. Tested without frontned (with both FS and BoldDB storages) so it's all good for review. Will do end-to-end testing with frontend tomorrow.

@smaant
Copy link
Collaborator Author

smaant commented Dec 30, 2019

linter complains about using sha1, do we need to care given that it's not being used for anything security related? I didn't use sha256 or higher to keep keys (filenames) from getting too long and potentially hitting some limits.

@umputun
Copy link
Owner

umputun commented Dec 30, 2019

sha1 is fine for our uses and //nolint will be appropriate here

@smaant
Copy link
Collaborator Author

smaant commented Dec 31, 2019

Tested with frontend, all related functions seems to be working properly (posting/viewing comments with and without proxy/caching enabled).

A few notes:

  • config related to enabling http->https proxy has changed, I didn't add backward compatibility or any warnings if previous option is detected. It might make sense to do it (added backward compatibility)
  • in order to guaranty that image ID doesn't change (which happened in case of resizing and FS image store) I removed storing images extensions for FS store at all. content-type for cached images is image/* which seems to be accepted by browsers without any problems.
  • as I mentioned in code (https://github.com/umputun/remark/pull/447/files#diff-4a38ba5920fb68398b49ccea32190da2R88-R91) turning off both proxy and caching might have unexpected for users effect, not sure if we need to do anything about it right now

@umputun umputun self-requested a review January 4, 2020 18:40
Copy link
Owner

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just one missing change in a comment and an idea about ignoring CacheExternal in proxied handler

backend/app/store/image/image.go Outdated Show resolved Hide resolved
backend/app/rest/proxy/image.go Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@paskal paskal added the backend label Jan 14, 2020
@umputun umputun mentioned this pull request Jan 16, 2020
Copy link
Owner

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx

@umputun umputun merged commit d69495a into umputun:master Jan 19, 2020
.golangci.yml Show resolved Hide resolved
README.md Show resolved Hide resolved
backend/app/store/image/fs_store.go Show resolved Hide resolved
backend/app/store/image/fs_store.go Show resolved Hide resolved
backend/app/store/image/bolt_store.go Show resolved Hide resolved
@umputun umputun added this to the v1.5 milestone Jan 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants