Feature Request: Generate a ZIP file to a network stream #548

Open
stevecrozz opened this issue Jan 25, 2023 · 1 comment


@stevecrozz

Streaming currently works well, but only for streams that support arbitrary seeks, such as a file on disk. Specifically, when we close an archive, we go back and rewrite each entry's local header by setting the stream's pos to that entry's offset and performing the write:

def update_local_headers
  pos = @output_stream.pos
  @cdir.each do |entry|
    @output_stream.pos = entry.local_header_offset
    entry.write_local_entry(@output_stream, rewrite: true)
  end
  @output_stream.pos = pos
end

It should be possible, as zip_tricks and other libraries outside the Ruby ecosystem demonstrate, to create a streaming zip archive across a network without the ability to seek to an arbitrary point in the stream. This approach relies on writing the entry table at the end. I haven't found a formal description of this approach outside of source code, but the authors of the Python stream-zip package give a good plain-English summary:

It's not possible to completely stream-write ZIP files. Small bits of metadata for each member file, such as its name, must be placed at the end of the ZIP. In order to do this, stream-zip buffers this metadata in memory until it can be output.
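To make the idea concrete, here is a minimal sketch of a forward-only writer in plain Ruby. It is not how rubyzip, zip_tricks, or stream-zip are implemented; the class name ForwardOnlyZipWriter and its add/close methods are hypothetical. It assumes the ZIP data-descriptor mechanism (general purpose flag bit 3), where the CRC and sizes follow each entry's data instead of sitting in the local header, so nothing ever needs to be rewritten. It also omits Zip64, so it only covers entries and archives under 4 GiB, and it uses a fixed timestamp.

```ruby
require "zlib"
require "stringio"

# Hypothetical forward-only ZIP writer: it only ever appends to the sink.
class ForwardOnlyZipWriter
  FLAGS    = 0x0008 | 0x0800 # bit 3: CRC/sizes follow the data; bit 11: UTF-8 names
  DEFLATE  = 8
  DOS_TIME = 0
  DOS_DATE = 0x21            # 1980-01-01, fixed to keep the sketch short

  def initialize(sink)
    @sink = sink             # anything that responds to <<
    @offset = 0
    @entries = []            # the only metadata buffered in memory
  end

  # Streams `source` (anything with #read(n)) into the archive as `name`.
  def add(name, source, chunk_size: 64 * 1024)
    name = name.b
    header_offset = @offset
    # Local header with CRC and sizes left as zero.
    emit([0x04034b50, 20, FLAGS, DEFLATE, DOS_TIME, DOS_DATE,
          0, 0, 0, name.bytesize, 0].pack("VvvvvvVVVvv") + name)
    crc = csize = usize = 0
    deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS) # raw deflate
    while (chunk = source.read(chunk_size))
      crc = Zlib.crc32(chunk, crc)
      usize += chunk.bytesize
      csize += emit(deflater.deflate(chunk))
    end
    csize += emit(deflater.finish)
    deflater.close
    # Data descriptor: the values the local header could not know up front.
    emit([0x08074b50, crc, csize, usize].pack("VVVV"))
    @entries << [name, crc, csize, usize, header_offset]
  end

  # Writes the central directory and end-of-central-directory record.
  def close
    cdir_offset = @offset
    @entries.each do |name, crc, csize, usize, header_offset|
      emit([0x02014b50, 20, 20, FLAGS, DEFLATE, DOS_TIME, DOS_DATE,
            crc, csize, usize, name.bytesize, 0, 0, 0, 0, 0,
            header_offset].pack("VvvvvvvVVVvvvvvVV") + name)
    end
    emit([0x06054b50, 0, 0, @entries.size, @entries.size,
          @offset - cdir_offset, cdir_offset, 0].pack("VvvvvVVv"))
  end

  private

  # Appends bytes to the sink and returns how many were written.
  def emit(bytes)
    @sink << bytes
    @offset += bytes.bytesize
    bytes.bytesize
  end
end

# Usage: a binary String stands in for a socket or upload stream here.
archive = String.new(encoding: Encoding::BINARY)
zip = ForwardOnlyZipWriter.new(archive)
zip.add("hello.txt", StringIO.new("hello world\n"))
zip.close
```

Memory use in this sketch is one chunk of data plus one small array per entry, which is the "buffer the metadata until the end" trade-off the quote describes.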

I found that the zip_tricks gem can be configured to write zip files this way, and I tried to make rubyzip do the same. I created my own IO adapter wrapping a network socket to see whether rubyzip could stream an archive across a network. It could not: I had to implement pos=, and there is no way to correctly seek the underlying stream once data has already been sent to the socket. At that point the buffer has been flushed to the network and I cannot rewind.
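For illustration, an adapter along these lines shows where things break down. This is a reconstruction under assumptions, not my actual adapter: ForwardOnlySink is a hypothetical name, and an IO.pipe stands in for the network socket. The adapter can track pos by counting bytes, but pos= has nothing sensible to do once bytes have left the process.

```ruby
# Hypothetical write-only adapter around a socket-like IO.
class ForwardOnlySink
  attr_reader :pos

  def initialize(io)
    @io = io
    @pos = 0
  end

  # Forwards bytes to the underlying IO and advances the position.
  def <<(bytes)
    @io.write(bytes)
    @pos += bytes.bytesize
    self
  end

  # Seeking to the current position is a no-op; anything else is impossible
  # because earlier bytes have already been handed to the network.
  def pos=(new_pos)
    return if new_pos == @pos

    raise IOError, "cannot seek to #{new_pos}: #{@pos} bytes already sent"
  end
end

# Usage: a pipe stands in for the socket.
reader, writer = IO.pipe
sink = ForwardOnlySink.new(writer)
sink << "local header" << "file data"
begin
  sink.pos = 0 # what a header rewrite on close would need
rescue IOError => e
  warn e.message
end
writer.close
reader.close
```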

My use case: I need to build a large zip archive from objects in an object store (S3) and send the resulting archive back to an object store, all within disk space limitations. In theory, this should require no disk space and only a small amount of memory: a stream buffer plus an entries table that is written at the end of the stream.

@hainesr
Member

hainesr commented Jan 29, 2023

Hello, yes, I agree we should support this feature; as you say, the ZIP standard supports streaming where seek isn't available.

I'm not sure I'll be able to get this into v3, as it will require a fundamental change to how things are done quite deep in RubyZip. I'm working on this when I can, though, and will get it done as soon as I can.
