non-fs API #14

Closed
max-mapper opened this issue May 13, 2015 · 11 comments

@max-mapper

hey I have a weird request. I wrote this https://github.com/maxogden/punzip for a somewhat common but annoying use case: given a large zip on a server, only extract a single file from it, as a stream, without downloading the whole thing.

here's more detail on the use case: https://gist.github.com/maxogden/11a85ae12074fed0b9f6

the cool thing is that it totally works! I can mount a 500mb zip, point yauzl at it, and my code translates yauzl's calls into HTTP range requests like this:

  mount-url requested +542ms 514105344-514170879 received 65536 bytes
  mount-url requested +173ms 514170880-514172204 received 1325 bytes

those were yauzl getting the entry table at the end of the file (I think).

unfortunately I had to use fuse to make it compatible with yauzl. It would be nice, though, if I could give yauzl a function like `getBytes(offset, length)` or something, and it would use that as the data source rather than a file descriptor/path to a file.

i'm open to any suggestions or ideas you might have for this use case!
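
To make the `getBytes(offset, length)` idea concrete, here is a minimal sketch of how such a call could map onto an HTTP `Range` header. The helper name `toRangeHeader` is hypothetical and not part of yauzl or punzip. The only fact relied on is that HTTP byte ranges are inclusive on both ends, which agrees with the log lines above (514105344-514170879 is 65536 bytes).

```javascript
// Hypothetical helper: translate an (offset, length) read into an HTTP Range header value.
// HTTP byte ranges are inclusive on both ends, so the last byte is offset + length - 1.
function toRangeHeader (offset, length) {
  if (!Number.isInteger(offset) || offset < 0) throw new RangeError('offset must be a non-negative integer')
  if (!Number.isInteger(length) || length <= 0) throw new RangeError('length must be a positive integer')
  return 'bytes=' + offset + '-' + (offset + length - 1)
}

// The two requests from the log above:
console.log(toRangeHeader(514105344, 65536)) // bytes=514105344-514170879
console.log(toRangeHeader(514170880, 1325))  // bytes=514170880-514172204
```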

@max-mapper
Author

also while I'm here I should say thanks for this library. I use it in http://npmjs.com/extract-zip and https://www.npmjs.com/package/electron-prebuilt, and as you can see around 500 folks install it a day (on many OSes) and we haven't had a single bug report about the zip handling yet.

@thejoshwolfe
Owner

I like this idea.

@andrewrk
Collaborator

yauzl uses fd-slicer to abstract the fs API calls. I could see an API being added to that project.

@max-mapper
Author

@andrewrk ahh gotcha. yes that is a good idea. I even think a simple place to start would be to allow a custom fs implementation to be passed in, then I could e.g. get this whole thing working in the browser with https://www.npmjs.com/package/level-filesystem, or I could theoretically write a fs API compatible shim that did the range request stuff (though a higher level API would probably be better)

@mafintosh

I'm assuming you just need random access reads for this to work? If that's the case, passing in a simple function like

function read (start, end, cb) {
  ...
  cb(null, resultBuffer)
}

would be a really simple solution that would allow us to decouple this from files in general.

If yauzl were to support this we could even implement real time unzipping of files over bittorrent using torrent-stream:

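var torrentStream = require('torrent-stream')
var concat = require('concat-stream')

var engine = torrentStream('magnet://....')

// note: engine.files is only populated once the engine has emitted 'ready'
function read (start, end, cb) {
  engine.files[0].createReadStream({start: start, end: end})
    .pipe(concat(function (data) {
      cb(null, data)
    }))
}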

yauzl(read, function (err, zipfile) {
  // omg can random access unzip any file in the 40tb zip in realtime !
  ...
})

This would make yauzl work in the browser using browserify for a lot of use cases.

@thejoshwolfe
Owner

> I'm assuming you just need random access reads for this to work?

And we need to know the file size of the zip file.
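
The size matters because a zip reader starts from the end of the archive: the end of central directory record (EOCD) sits in the last bytes and points back at the entry table. A rough sketch of locating it in a tail buffer follows. `findEocd` is an illustrative name, not yauzl's internals, and ZIP64 is ignored.

```javascript
// Illustrative sketch (not yauzl's actual code): find the end of central
// directory record (EOCD) in the last bytes of a zip file.
var EOCD_SIGNATURE = 0x06054b50 // "PK\x05\x06" read as a little-endian uint32
var EOCD_MIN_SIZE = 22          // fixed-size part of the record, without the comment

function findEocd (tail) {
  // Scan backwards: the record may be followed by a variable-length comment.
  for (var i = tail.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (tail.readUInt32LE(i) !== EOCD_SIGNATURE) continue
    return {
      entryCount: tail.readUInt16LE(i + 10),             // total entries in the central directory
      centralDirectorySize: tail.readUInt32LE(i + 12),
      centralDirectoryOffset: tail.readUInt32LE(i + 16), // absolute offset from the start of the file
      commentLength: tail.readUInt16LE(i + 20)
    }
  }
  return null // no EOCD in this buffer
}

// An empty zip archive is just a 22-byte EOCD record with all counts set to zero.
var emptyZip = Buffer.alloc(22)
emptyZip.writeUInt32LE(EOCD_SIGNATURE, 0)
console.log(findEocd(emptyZip))
```

With the total size known, a reader only needs to fetch at most the final 22 + 65535 bytes to find this record, since the trailing comment is capped at 65535 bytes.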

@mafintosh

@thejoshwolfe gotcha. then an options map with {length: fileLength, read: readFunction} would work
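
A minimal sketch of what such an options map might look like for an in-memory buffer. This is the shape proposed in this thread, not an API that yauzl exposes. `bufferSource` is a made-up name, and `end` is assumed to be inclusive, as in the `createReadStream` options used in the torrent example.

```javascript
// Sketch of the {length, read} shape proposed above, backed by an in-memory Buffer.
// Assumption: `end` is inclusive, matching createReadStream({start, end}).
function bufferSource (buf) {
  return {
    length: buf.length,
    read: function (start, end, cb) {
      if (start < 0 || end >= buf.length || start > end) {
        return process.nextTick(cb, new RangeError('range out of bounds'))
      }
      // Always call back asynchronously, like a real file or network read would.
      process.nextTick(cb, null, buf.slice(start, end + 1))
    }
  }
}

var source = bufferSource(Buffer.from('hello zip world'))
source.read(6, 8, function (err, data) {
  if (err) throw err
  console.log(source.length, data.toString()) // 15 zip
})
```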

@nmccready

Besides thoughts, is there a development branch or someone's fork which has any of these ideas as code?

@thejoshwolfe
Owner

> ... the 40tb zip ...

I just noticed this in some example code above. Silliness aside, ZIP64 is actually not supported currently. See #6.

@thejoshwolfe
Owner

I closed this issue optimistically. I would still appreciate feedback on whether the new API suits your needs.

see https://github.com/thejoshwolfe/yauzl/blob/master/README.md#fromrandomaccessreaderreader-totalsize-options-callback

@thejoshwolfe
Owner

Here's some background on why this API turned out more complicated than originally discussed. Simply reading ranges of bytes into a buffer is really not what we want. Consider the case where we need a range of 100MB of file data. Buffering it all at once is too RAM intensive, and chunking it up into smaller reads is unfair to the underlying file-access layer. If we're downloading ranges of a file using network requests, we should be able to ask for as much continuous data at a time as possible so we don't have all the unnecessary overhead of establishing a new TCP connection for each chunk.
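
To put a number on the chunking overhead, the sketch below counts how many separate requests a range needs at a given chunk size. `splitRange` is a hypothetical helper, and the 64 KiB chunk size is only an example (it happens to match the first read in the log earlier in the thread).

```javascript
// Hypothetical helper: split the byte range [start, end) into chunks of at most chunkSize bytes.
function splitRange (start, end, chunkSize) {
  var chunks = []
  for (var pos = start; pos < end; pos += chunkSize) {
    chunks.push({start: pos, end: Math.min(pos + chunkSize, end)}) // end is exclusive
  }
  return chunks
}

var MB = 1024 * 1024
// 100 MB of file data in 64 KiB chunks means 1600 separate requests...
console.log(splitRange(0, 100 * MB, 64 * 1024).length) // 1600
// ...while a single streamed range is one request, and the consumer still
// controls memory use through stream backpressure.
console.log(splitRange(0, 100 * MB, 100 * MB).length) // 1
```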

What we really want is an API that gets a read stream for a range of bytes. Occasionally, yauzl knows that it wants a small range of bytes into a buffer, and it might be easy for the underlying file-access layer to provide that without going through a read stream, so that's an optional function to implement.

Another option is for the file-access layer to implement a close function. I can see this being useful for a torrent backend or, of course, a normal fs backend. This allows the file-access layer to take its time tearing things down asynchronously.
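
Put together, the reader described above could look roughly like this for an in-memory buffer. This is a sketch of the shape only: with yauzl the reader is expected to extend `yauzl.RandomAccessReader` (see the README link above for the exact contract), while the `BufferReader` name here is made up and the class stands alone so the example runs without yauzl installed. `end` is assumed to be exclusive.

```javascript
var PassThrough = require('stream').PassThrough

// Sketch only: method names follow the yauzl README's RandomAccessReader,
// but this class does not extend it so the example has no dependencies.
function BufferReader (buf) {
  this.buf = buf
}

// Required: a readable stream over the byte range [start, end).
BufferReader.prototype._readStreamForRange = function (start, end) {
  var stream = new PassThrough()
  stream.end(this.buf.slice(start, end))
  return stream
}

// Optional: fill `buffer` directly for small reads, skipping the stream machinery.
BufferReader.prototype.read = function (buffer, offset, length, position, callback) {
  this.buf.copy(buffer, offset, position, position + length)
  process.nextTick(callback, null)
}

// Optional: asynchronous teardown (close sockets, stop a torrent engine, ...).
BufferReader.prototype.close = function (callback) {
  this.buf = null
  process.nextTick(callback, null)
}

var reader = new BufferReader(Buffer.from('0123456789'))
reader._readStreamForRange(2, 5).on('data', function (chunk) {
  console.log(chunk.toString()) // 234
})
```

With yauzl itself, an instance of a `RandomAccessReader` subclass would be handed to `yauzl.fromRandomAccessReader(reader, totalSize, options, callback)`, per the README section linked above.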
