
Change open to use Down.open #8

Closed

renchap opened this issue Aug 31, 2016 · 6 comments

@renchap (Owner) commented Aug 31, 2016

From Janko:

Could #open be implemented in terms of Down.open? E.g. Down.open(url(id))? That would be useful for things like the restore_cached_data plugin, which extracts metadata from cached files (which in this case might be remote, directly uploaded to Google Cloud Storage), so that it doesn't need to download the whole file just to extract metadata.

For the restore_cached_data plugin, when a cached file is sent in a hidden field (e.g. with direct uploads or on validation errors), an attacker could technically modify the "metadata" attributes, thus bypassing file validations. The best way I found to solve this in a generic way was to give the ability to re-extract metadata on that assignment. This is also useful when doing direct uploads to services that don't extract metadata, so that metadata can be extracted manually.

Extracting metadata uses #open, which I previously implemented in storages to download the whole file, which wasn't ideal since that part is synchronous. However, I managed to come up with Down.open, which creates an IO representing the remote file that downloads only as much as is read. Yeah, that would require a signed URL. That's the way I implemented it with S3, relying on the fact that signed URLs will always work. Internally Down.open uses a generic Down::ChunkedIO, which might be usable if you don't have a URL (I used it in GridFS).
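
For context, a minimal sketch of the on-demand behaviour Down.open provides (the URL here is just a placeholder):

require "down"

io = Down.open("https://storage.googleapis.com/my-bucket/image.jpg")
io.size        # taken from the Content-Length response header
io.read(1024)  # downloads only about the first kilobyte, not the whole file
io.close       # closes the connection, aborting the rest of the download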

This will require #7 first

@renchap (Owner, Author) commented Mar 14, 2017

@janko-m I had a new look at this, and signed URLs on GCS require a service account key. You can't get one with a standard account, so it will fail in many cases.
Any other ideas on how to make it work? It is possible to retrieve the object with an HTTP GET, but that requires creating the correct HTTP auth headers, so it is no longer a simple GET.

@janko (Contributor) commented Mar 19, 2017

@renchap I just pushed a new version of Down which allows Down.open to accept request headers, so as long as you figure out which URL the Google client is hitting internally, and with which request headers, you should be able to make the same request with Down.open:

Down.open("http://example.com/image.jpg", {"Authorization" => "..."})

@janko (Contributor) commented May 24, 2017

I recently thought of a trick that would download on-demand, but still let you use get_object, thus keeping google-api-client's download resumability and not having to rely on a URL. Down.open internally uses Down::ChunkedIO, which takes any chunk generator and transforms it into an IO object, so you could use Down::ChunkedIO directly with this trick:

def open(id)
  read_pipe, write_pipe = IO.pipe

  # download in a background thread, writing into the pipe
  Thread.new do
    begin
      storage_api.get_object(@bucket, object_name(id), download_dest: write_pipe)
    ensure
      write_pipe.close
    end
  end

  # read from the pipe in 16KB chunks as the IO is consumed
  chunks = Enumerator.new { |y| y << read_pipe.readpartial(16*1024) until read_pipe.eof? }
  object = storage_api.get_object(@bucket, object_name(id)) # hopefully not providing :download_dest makes a HEAD request?

  Down::ChunkedIO.new(
    chunks: chunks,
    size: object.size,
    on_close: -> { read_pipe.close }, # will cause the download to abort with an Errno::EPIPE, and google-api-client will hopefully clean up
  )
end

I considered doing the same in the S3 storage, but since aws-sdk can stream chunks as they are downloaded if you pass a block to Aws::S3::Object#get, I managed to get away without IO.pipe:

def open(id)
  io = Down::ChunkedIO.new(chunks: (object = object(id)).enum_for(:get))
  io.size = object.content_length # Down::ChunkedIO retrieves the first chunk on initialization, so Aws::S3::Object#content_length is populated by now
  io
end
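
With either variant, a caller can read just a prefix of the file and only that much is downloaded (a hypothetical usage sketch, assuming a storage instance):

io = storage.open("path/to/image.jpg")
io.read(8 * 1024) # e.g. enough to sniff the MIME type or image dimensions
io.close          # aborts the remainder of the download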

@rosskevin (Contributor) commented

@renchap With my #16 PR, I tried:

def open(id)
  Down.open(url(id))
end

The linter tests broke with:

Minitest::UnexpectedError: Down::ClientError: 403 Forbidden
    /Users/kross/.rvm/gems/ruby-2.4.1@shrine-google_cloud_storage/gems/down-4.1.0/lib/down/net_http.rb:214:in `response_error!'
    /Users/kross/.rvm/gems/ruby-2.4.1@shrine-google_cloud_storage/gems/down-4.1.0/lib/down/net_http.rb:162:in `open'
    /Users/kross/.rvm/gems/ruby-2.4.1@shrine-google_cloud_storage/gems/down-4.1.0/lib/down/backend.rb:14:in `open'
    /Users/kross/.rvm/gems/ruby-2.4.1@shrine-google_cloud_storage/gems/down-4.1.0/lib/down.rb:14:in `open'
    /Users/kross/projects/shrine-google_cloud_storage/lib/shrine/storage/google_cloud_storage.rb:76:in `open'
    /Users/kross/.rvm/gems/ruby-2.4.1@shrine-google_cloud_storage/gems/shrine-2.8.0/lib/shrine/storage/linter.rb:68:in `lint_open'
    /Users/kross/.rvm/gems/ruby-2.4.1@shrine-google_cloud_storage/gems/shrine-2.8.0/lib/shrine/storage/linter.rb:40:in `call'
    /Users/kross/projects/shrine-google_cloud_storage/test/gcs_test.rb:51:in `block (2 levels) in <top (required)>'

I'm not sure of the intent, so I'll leave #16 as-is, but it seems like this one is not far off once that is merged.

@janko (Contributor) commented Oct 12, 2017

Down::ClientError: 403 Forbidden means that the request to the given URL responded with 403 Forbidden, so url(id) apparently didn't produce a correct URL. But it probably makes more sense to leave #16 as-is, since it preserves the current behaviour; we can figure out how to do this later.

Btw, I wouldn't use Down.open, because google-api-client has the ability to automatically retry failed/interrupted downloads (though I intend to add this ability to Down at some point). It seems that File#download forwards its first argument directly to google-api-client's #get_object, so it doesn't need to be a path; it can be any IO object. Then we could achieve streaming like this:

class ProcIO
  def initialize(&proc)
    @proc = proc
  end

  def write(data)
    @proc.call(data)
    data.bytesize # Ruby convention is to return the number of bytes written; methods like IO.copy_stream rely on it
  end
end

def open(id)
  file = get_file(id)

  chunks = Enumerator.new do |yielder|
    proc_io = ProcIO.new { |data| yielder << data }
    file.download(proc_io)
  end

  Down::ChunkedIO.new(
    chunks: chunks,
    size: file.size,
    data: { file: file } # so the user can access it if they want to
  )
end
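
What makes this stream lazily: Enumerator runs its block on a fiber, so file.download advances only when Down::ChunkedIO asks for the next chunk. A standalone illustration of that property (not the storage code):

chunks = Enumerator.new do |yielder|
  puts "producing 1"; yielder << "chunk 1"
  puts "producing 2"; yielder << "chunk 2"
end

chunks.next # prints only "producing 1"; "producing 2" runs on the following call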

@janko (Contributor) commented Oct 12, 2017

Note that the above should be possible because a while ago I expanded what :download_dest can be.

janko added a commit to janko/shrine-google_cloud_storage that referenced this issue Nov 22, 2017
This will make the storage work well with the restore_metadata Shrine plugin,
as it will allow extracting metadata by downloading only a portion of
the file content.

Closes renchap#8
janko added a commit to janko/shrine-google_cloud_storage that referenced this issue Nov 30, 2017
This will make the storage work well with the restore_metadata Shrine plugin,
as it will allow extracting metadata by downloading only a portion of
the file content.

Closes renchap#8
renchap pushed a commit that referenced this issue Dec 1, 2017
* Implement #open using Down::ChunkedIO

This will make the storage work well with the restore_metadata Shrine plugin,
as it will allow extracting metadata by downloading only a portion of
the file content.

Closes #8

* Add ProcIO#flush required by google-api-client

* Open test image file in binary mode

That way the retrieved content will be in binary encoding, which is generally
safer to use cross-platform. It also avoids comparing the same content
in binary and UTF-8 encoding in tests, which will differ even if the
bytes themselves are the same.

* Don't verify downloaded data in #open