Add file: true feature to add_metadata #332

Closed · wants to merge 1 commit from the add-file-option-to-add-metadata branch

Conversation

@janko (Member) commented on Dec 3, 2018

Sometimes, to extract metadata, the user needs a file object that has a #path. Previously the recommended approach was to call Shrine.with_file directly:

add_metadata :custom do |io, context|
  Shrine.with_file(io) do |file|
    # ...
  end
end

Other than being a bit verbose, this introduces a performance issue when multiple metadata blocks call Shrine.with_file and the refresh_metadata plugin is used. Because UploadedFile#download doesn't implement any kind of memoization, the same file would unnecessarily be copied to disk multiple times.
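
For example, with two such blocks and the refresh_metadata plugin, each block triggers its own copy (count_pages and probe_duration are hypothetical helpers):

add_metadata :pages do |io, context|
  Shrine.with_file(io) { |file| count_pages(file.path) }    # 1st copy to disk
end

add_metadata :duration do |io, context|
  Shrine.with_file(io) { |file| probe_duration(file.path) } # 2nd copy to disk
end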

To solve this, we add the ability to pass file: true to #add_metadata, in which case Shrine will internally call Shrine.with_file around the whole metadata extraction, so the copy to disk happens only once for the entire extraction.

add_metadata :custom, file: true do |file, context|
  file # this is now a file object
end
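
Internally, the plugin can then wrap all file: true blocks in a single Shrine.with_file call. A rough sketch of the idea, not the actual implementation (file_metadata_blocks is a hypothetical name):

Shrine.with_file(io) do |file|                  # copied to disk once
  file_metadata_blocks.each do |name, block|
    metadata[name.to_s] = block.call(file, context)
    file.rewind                                 # let each block read from the start
  end
end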

This is an alternative fix for the problem described in #329.

@janko force-pushed the add-file-option-to-add-metadata branch from fd4b59a to 84ea8c2 on December 3, 2018 at 01:48
@jrochkind (Contributor) commented

I wonder if a lower-level API solution could be possible, one that's more flexibly composable.

I potentially need "local file path" access not just in metadata generation, but in versions/derivatives extraction as well. While one could write a particular versions/derivatives API to support it, similar to what you've done here with metadata extraction, I think it would be ideal if we could provide a composable utility that could be used with any API, including custom local APIs, future not-yet-thought-of APIs, etc.

Also, ideally one that could prevent even a single extra copy of the file, as well as extra downloads of bytes from the storage location.

And would the current architecture result in at least one extra retrieval for versions/derivatives too? I'm not sure.

Here's one thing I've been thinking of, in response to my own investigations and these discussions, after letting it sit in my brain a bit...

What if down provided a public #as_temp_file_path method (may not be the right name)? Calling it would a) make sure the entire file is flushed to the local temp file (which down already has, so long as the IO is rewindable), and b) just give the caller the path to that temp file.

Shrine.with_file could then include a branch that's just File.open(io.as_temp_file_path) if io.respond_to?(:as_temp_file_path). (In reality it's possibly slightly more complex, since not all down IO objects are rewindable; I'm not sure exactly what the API should look like.)
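
Concretely, that branch might look something like this (a sketch; again, #as_temp_file_path is a hypothetical method name):

def self.with_file(io)
  if io.respond_to?(:as_temp_file_path)
    # the down IO already has the bytes cached on disk -- just open that path
    File.open(io.as_temp_file_path, "rb") { |file| yield file }
  elsif io.respond_to?(:path)
    yield io # already a file object
  else
    # existing behavior: copy the IO to a new Tempfile first
  end
end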

Down can already promise (and the implementation probably already does?) that the tempfile path will remain valid as long as the returned down IO object is in scope. (I believe Down::ChunkedIO currently has no API to actually delete its backing rewindable cache; the only way it gets deleted is when the stdlib Tempfile does it on GC. While in some ways that might be unfortunate, it does mean that the temp file path should stay valid as long as anything still has a reference to the Down::ChunkedIO.)

Additionally, note that on any "unixy" file system/OS, if there's a file handle to an on-disk file, the handle will remain good, and the bytes undisturbed on disk, until the handle is closed, even if some other code has deleted that file. So if a File.open on a path succeeds, that file handle remains good until closed, even if something else unlinks the path. And I think there's no need to pass everything the same file handle and rewind it in between: each block can get its own file handle to the same on-disk path (which would also make it more feasible to later add concurrency, if someone wanted to, something not possible when sharing one file handle serially with rewinds in between).
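
A quick demonstration of that unix behavior:

require "tempfile"

tempfile = Tempfile.new("demo")
tempfile.write("hello")
tempfile.flush

handle = File.open(tempfile.path, "rb") # a second handle to the same inode
File.unlink(tempfile.path)              # the path is now gone...
handle.read                             # => "hello" (bytes still readable)
handle.close                            # disk space reclaimed once every handle closes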

There might be reasons why what I'm suggesting wouldn't work, or it might be trickier than I'm thinking. But if it could work, I think it has these advantages:

  1. Can avoid any extra local file copies, as well as extra storage retrievals
  2. Does not require changes to the add_metadata API: you don't need to know, when declaring add_metadata, whether you'll need a complete local file. The complete local file can still be requested lazily, which decreases coupling between the code that extracts metadata and the outer add_metadata declaration.
  3. Is a lower-level utility which can be used in any other context too, including versions/derivatives and architectures custom to a specific Shrine user. If you want a local file, you can still just call Shrine.with_file; there's no need to customize a higher-level API in every possible place that might need a local file.

@janko (Member, Author) commented on Dec 11, 2018

> I potentially need "local file path" access not just in metadata generation, but in versions/derivatives extraction as well. While one could write a particular versions/derivatives API to support it, similar to what you've done here with metadata extraction, I think it would be ideal if we could provide a composable utility that could be used with any API, including custom local APIs, future not-yet-thought-of APIs, etc.

Oh, right, I didn't realize this is another place where the file would be copied. OK, this is not an ideal solution then, though it's still useful for reducing copies during "regular" metadata extraction (which happens when you're attaching a raw file). But you've given me a lot to think about.

> What if down provided a public #as_temp_file_path method (may not be the right name)? Calling it would a) make sure the entire file is flushed to the local temp file (which down already has, so long as the IO is rewindable), and b) just give the caller the path to that temp file.

I just want to point out that this is an additional, third optimization we could add, on top of our discussions at #329. I don't think it affects whether we decide to merge #329, and if we do merge it without this additional optimization, we will still have gained a big performance improvement.

The reason I wanted to state this is that I don't think this down optimization is worth it. Primarily, there would be no speed benefit: downloading a remote file is still much slower than writing to disk, so it doesn't matter whether we're writing to one file or two while downloading. On Heroku these scripts executed in the same time for a 200 MB file from a nearby S3 bucket:

IO.copy_stream(Down.open(url), Tempfile.new)
# has same execution time as
IO.copy_stream(Down.open(url, rewindable: false), Tempfile.new)

But writing to an additional file does increase disk IO, which can affect real-world performance, as you explained in #329 (comment); thank you for sharing that insight. However, if Down::ChunkedIO's Tempfile is the only additional disk IO, and we find a solution where no additional copies are made as we add more metadata/derivatives, then I would rather recommend that users increase their disk IO capacity if they're hitting its limits.

An additional reason why I don't want to expose Down::ChunkedIO's underlying Tempfile is that yesterday I decided to add an additional security measure to Down::ChunkedIO (janko/down@cdc71c3), which removes the possibility of exposing the underlying Tempfile. It's still only on master, so it can change if there are strong objections, but yesterday I also saw this technique mentioned in a Reddit thread, which leads me to believe that people do care about this.

> I believe Down::ChunkedIO currently has no API to actually delete its backing rewindable cache; the only way it gets deleted is when the stdlib Tempfile does it on GC. While in some ways that might be unfortunate, it does mean that the temp file path should stay valid as long as anything still has a reference to the Down::ChunkedIO.

The Tempfile is closed and unlinked on Down::ChunkedIO#close, which is called automatically when Shrine::UploadedFile#download or Shrine::UploadedFile#open is called with a block.
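
So with the block form, the cleanup is automatic:

uploaded_file.download do |tempfile|
  # work with the local copy here
end # Down::ChunkedIO#close runs here, unlinking its backing Tempfile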

> Additionally, note that on any "unixy" file system/OS, if there's a file handle to an on-disk file, even if some other code has deleted that file, the handle will remain good, and the bytes still undisturbed on disk, until the handle has been closed.

This is an interesting observation; I will keep it in mind.


I still need more time to come up with additional arguments for or against merging this PR. On one hand, I like that it reduces the number of copies when assigning raw files. On the other, it only partially solves your use case: when you download the file to make derivatives, you'd still be making an additional download. I'm inclined to suggest this for your use case:

process(:store) do |io, context|
  io.download do |file| # one download, shared by metadata extraction and processing
    io.metadata.merge!(extract_metadata(file, context))

    processor = ImageProcessing::Vips.source(file)
    versions  = {}

    versions[:foo] = processor.resize_to_limit!(...)
    versions[:bar] = processor.resize_to_limit!(...)
    # ...

    versions
  end
end

Your solution in #329 definitely covers more ground and removes mental overhead from the user. But I'm still worried that it might lead to bugs, depending on what users do with the Tempfile.

@janko (Member, Author) commented on Dec 12, 2018

After some more thinking, I don't like this solution after all.

When assigning raw IO objects, if people want to avoid multiple copies, they should just convert their IO object to a file object first if it isn't one already (e.g. StringIO, Shrine::Plugins::DataUri::DataFile). So a good solution already exists there, and this feature isn't needed for that case.
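
For example, a StringIO can be spooled into a Tempfile before assignment (a sketch; string_io and attacher stand in for your own objects):

require "tempfile"

tempfile = Tempfile.new("upload")
tempfile.binmode
IO.copy_stream(string_io, tempfile)
tempfile.rewind

attacher.assign(tempfile) # metadata blocks can now use tempfile.path directly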

For your use case it would help, but not enough, because you'd still have to make an additional copy for processing versions/derivatives either way, which is undesirable.

So I'll close this then and continue the discussion in #329.

@janko closed this on Dec 12, 2018
@janko deleted the add-file-option-to-add-metadata branch on December 12, 2018 at 22:59