Can't stream download (to disk/output stream) #62

Closed
bracken opened this Issue Oct 8, 2010 · 5 comments

bracken commented Oct 8, 2010

I often need to download large files as I'm meching away, but currently a file has to be read entirely into memory before it can be saved to disk, which is not good for large files. I think we need a simple download option that skips memory and goes directly to a disk/IO stream.
This kind of interface would be great:

agent.download(url, hash)

The first argument is the URL; the second (optional) argument is a hash that can have these keys:

  • :target - file handle/io stream?
  • :method - :get (default), :post, etc.
  • :post_args - the post arguments if needed
  • :follow_redirects - true/false (default true)

Code examples:
# If no target is specified, a temp file (Tempfile.new) will be created for you and returned
temp_file_handle = agent.download(url)

# You can pass your own file object for it to download into
agent.download(url, :target => my_file_handle, :method => :post)

Forms and links could also have a download method that works similarly:

# maybe submit_and_download for clarity?
temp_file = form.download
#or
form.download(:target=>io_object)

temp_file = link.download
#or
link.download(:target=>file_handle)
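As a sketch of the core idea behind the proposed API (this is not mechanize code — `stream_download` is a hypothetical helper built directly on Ruby's Net::HTTP), the response body can be read and written chunk by chunk, so only one chunk is ever held in memory:

```ruby
require "net/http"
require "tempfile"
require "uri"

# Hypothetical stream_download helper, roughly matching the proposed
# interface: streams the response body into `target` chunk by chunk,
# creating a Tempfile when no target is given.
def stream_download(url, target: nil)
  uri = URI(url)
  target ||= Tempfile.new("download")
  target.binmode if target.respond_to?(:binmode)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      # read_body with a block yields chunks as they arrive, so the
      # whole body is never buffered in memory at once
      response.read_body { |chunk| target.write(chunk) }
    end
  end
  target.rewind if target.respond_to?(:rewind)
  target
end
```

The :method, :post_args, and :follow_redirects options from the proposal are omitted here for brevity; they would wrap the same chunked read_body loop.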
flavorjones (Member) commented Oct 10, 2010

I have to admit that this has bugged me for a long time. Let me think about some of the API considerations.

cstrahan commented Mar 22, 2011

This would be awesome! There are plenty of scenarios where I need to download something that won't fit in RAM, throwing a NoMemoryError.

drbrain (Member) commented Apr 10, 2011

This will require more work to implement properly; it isn't suitable for Mechanize 2.0, so I'll target it for Mechanize 2.1.

zouzhile commented May 31, 2011

I was worried about this before using mechanize to download a file that requires an HTTP session, and mechanize ran into the memory issue not long after. It would be great if a streaming save for Mechanize::File were enabled ASAP.

/usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize/chain/response_reader.rb:16:in `write': failed to allocate memory (NoMemoryError)
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize/chain/response_reader.rb:16:in `handle'
        from /usr/local/lib/ruby/1.8/net/protocol.rb:383:in `call_block'
        from /usr/local/lib/ruby/1.8/net/protocol.rb:374:in `<<'
        from /usr/local/lib/ruby/1.8/net/protocol.rb:88:in `read'
        from /usr/local/lib/ruby/1.8/net/http.rb:2240:in `read_chunked'
        from /usr/local/lib/ruby/1.8/net/http.rb:2215:in `read_body_0'
        from /usr/local/lib/ruby/1.8/net/http.rb:2181:in `read_body'
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize/chain/response_reader.rb:14:in `handle'
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize/chain.rb:24:in `handle'
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:543:in `fetch_page'
        from /usr/local/lib/ruby/1.8/net/http.rb:1054:in `request'
        from /usr/local/lib/ruby/1.8/net/http.rb:2144:in `reading_body'
        from /usr/local/lib/ruby/1.8/net/http.rb:1053:in `request'
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:538:in `fetch_page'
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get'

drbrain added a commit that referenced this issue Oct 26, 2011

drbrain (Member) commented Oct 26, 2011

Since I made the mistake of delaying this to Mechanize 2.1, the solution isn't as nice as I would have liked (one where Mechanize::File doesn't load content into memory), because parsers expect the body to be a String.

I added Mechanize::Download; you can default unknown content types to downloading to disk using:

agent.pluggable_parser.default = Mechanize::Download

For Mechanize 3, Mechanize::Download can replace Mechanize::File, where the body parameter is always an IO-like object.
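A minimal usage sketch of the Mechanize::Download approach described above (the URL and file name are hypothetical; this assumes the Mechanize 2.1 API, where Mechanize::Download responds to save):

```ruby
require "mechanize"

agent = Mechanize.new

# Route unknown content types to Mechanize::Download, which keeps the
# response body as an IO-like object instead of a String in memory.
agent.pluggable_parser.default = Mechanize::Download

# Hypothetical URL; save writes the downloaded body to the given path.
agent.get("http://example.com/huge.iso").save("huge.iso")
```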
