how to handle different values for extraFieldLength in local file header vs central directory file header #137

Closed
bennettrogers opened this issue Feb 6, 2021 · 2 comments

Comments

@bennettrogers

I'm trying to extract the data offset byte ranges for each entry in a zipfile. The files I'm working with seem to have different values for the extraFieldLength in the central directory file header vs the local file header. I've noticed that in the readme, you state that the local file headers are ignored except for checking the signature, but that doesn't seem exactly right. When creating a readStream for an entry, this library (correctly) uses the values of extraFieldLength and fileNameLength from the local file header to calculate the localFileHeaderEnd.

Do you know why the value for extraFieldLength would differ between these two locations, and why the local file header would be the correct value? What was your reason for using the value from the local file header? I also need to use the correct value to extract the true data ranges for each entry, but it seems I can't rely on the extraFieldLength that is emitted for the entries generated by readEntry(). I'm considering forking your excellent lib so I can add an extra function to get at the correct offsets, but if you've got a better idea I'd love to hear it!

Thank you!

@thejoshwolfe
Owner

Sorry for the delayed response.

The zip file format allows a central directory header and a local file header to have completely different and conflicting information for the same entry. This is rarely done, and the cases I've seen most often are when the local file header is written at a time when the program has incomplete information about the entry, which means the central directory header is generally more authoritative. However, there's no bound on how strange and frustrating zip file implementations will be, and it doesn't surprise me that you've somehow gotten zip files with conflicting extra field data. I can only speculate why the lengths and data would be different, but it's not an error; it's just silly.

If yauzl had a more complete and even lower-level API, it would fully expose all the low-level information and let the client deal with it however it likes. That's ... actually kind of a good idea. yauzl should probably do that. But as it's designed now, yauzl wants to present an API that gives you the file name of the entry, for example, instead of giving you 4 different file names that are all probably the same. (Yes, you can encode the file name 4 times in the zip file, and it's not uncommon.)

As far as why yauzl uses the local file header extraFieldLength, it's because it's the most reliable way to determine how many bytes to ignore in that area of the file. You're right that it's technically not completely ignored, but it's only used to measure the size of the data structure for the purpose of skipping over it.
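To make the "measure and skip" step concrete, here is a minimal sketch of the arithmetic, not yauzl's actual code. It assumes the standard local file header layout from the zip format: a 30-byte fixed portion with the file name length at byte 26 and the extra field length at byte 28. The helper name localFileDataStart and the synthetic buffer are made up for illustration.

```javascript
// Fixed-size portion of a local file header and its signature ("PK\x03\x04").
const LOCAL_HEADER_FIXED_SIZE = 30;
const LOCAL_HEADER_SIGNATURE = 0x04034b50;

// Hypothetical helper: given a Buffer holding zip bytes and the offset of an
// entry's local file header, return the offset where the file data begins.
function localFileDataStart(buffer, localHeaderOffset) {
  if (buffer.readUInt32LE(localHeaderOffset) !== LOCAL_HEADER_SIGNATURE) {
    throw new Error("invalid local file header signature");
  }
  // Both lengths come from the local header itself, not the central directory.
  const fileNameLength = buffer.readUInt16LE(localHeaderOffset + 26);
  const extraFieldLength = buffer.readUInt16LE(localHeaderOffset + 28);
  return localHeaderOffset + LOCAL_HEADER_FIXED_SIZE + fileNameLength + extraFieldLength;
}

// Usage with a synthetic header: 5-byte file name, 4-byte extra field.
const fixedPart = Buffer.alloc(LOCAL_HEADER_FIXED_SIZE);
fixedPart.writeUInt32LE(LOCAL_HEADER_SIGNATURE, 0);
fixedPart.writeUInt16LE(5, 26); // file name length
fixedPart.writeUInt16LE(4, 28); // extra field length
const zipBytes = Buffer.concat([
  fixedPart,
  Buffer.from("a.txt"),
  Buffer.alloc(4),
  Buffer.from("data"),
]);
console.log(localFileDataStart(zipBytes, 0)); // 30 + 5 + 4 = 39
```

If the central directory's extraFieldLength were plugged into the same sum and it differed from the local one, the computed offset would land in the wrong place, which is why the local value is the one that matters for skipping.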

It sounds like you might like to have a full readout of all the low level fields from both the central and local header. Let me know if that sounds helpful, and sorry again for the delayed response.

@thejoshwolfe
Owner

I've just released yauzl 3.1.0 which has several features that might help you.

Try running node examples/compareCentralAndLocalHeaders.js /path/to/your/file.zip. It will show a comparison of local file header info and central directory info for each item in the zip file. In my experience the extra fields that encode timestamps and other fs-related metadata are almost never the same between the two.

For extracting the true data ranges, see readLocalFileHeader(), and you probably want {minimal: true}. Then also see openReadStreamLowLevel() for more related commentary.
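As a rough sketch of how that could look for collecting byte ranges: the function below assumes zipfile is a yauzl ZipFile opened with {lazyEntries: true}, and that, as I read the yauzl README, readLocalFileHeader(entry, {minimal: true}, callback) hands the callback an object with a fileDataStart property. Please check both against the docs for the version you install. The name collectDataRanges is made up for this example.

```javascript
// Hypothetical helper: gather {fileName, start, end} for every entry.
// `start` is the first byte of the (possibly compressed) file data and
// `end` is exclusive, i.e. start + compressedSize.
function collectDataRanges(zipfile, callback) {
  const ranges = [];
  let finished = false;
  function finish(err) {
    if (finished) return;
    finished = true;
    callback(err, err ? undefined : ranges);
  }

  zipfile.on("error", finish);
  zipfile.on("end", () => finish(null));
  zipfile.on("entry", (entry) => {
    // Assumption: with {minimal: true} only the data start offset is reported.
    zipfile.readLocalFileHeader(entry, { minimal: true }, (err, localHeader) => {
      if (err) return finish(err);
      ranges.push({
        fileName: entry.fileName,
        start: localHeader.fileDataStart,
        end: localHeader.fileDataStart + entry.compressedSize,
      });
      zipfile.readEntry(); // ask for the next entry
    });
  });
  zipfile.readEntry(); // start iterating
}
```

You would call it from inside the yauzl.open(path, {lazyEntries: true}, ...) callback, passing the zipfile you get there. Note that the ranges cover the bytes as stored in the archive, so they are still compressed for any entry that isn't stored uncompressed.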

Let me know if that solves your problem, and please feel free to reopen if you have any further issues or questions!
