Chinese filename decode #84

Open
imcuttle opened this issue May 4, 2018 · 11 comments

@imcuttle

commented May 4, 2018

File: 中文测试.zip

The zip file contains 中文测试.md. When I pass decodeStrings: true, the result is:
image

When I pass decodeStrings: false, the error 'The "path" argument must be of type string' is thrown.

@thejoshwolfe

Owner

commented May 6, 2018

The problem seems to be that the filenames are encoded in UTF8, but general purpose bit 11 is not set. The zipfile claims the filenames are encoded with CP437, and in that encoding, the filename you're seeing is the correct interpretation. The zip file is expecting zipfile readers (like yauzl) to interpret the filename as UTF8 without being instructed to do so.

In other words, yauzl is behaving correctly, and the zipfile is malformed.
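For readers unfamiliar with the flag: general purpose bit 11 is the bit with mask 0x0800 in an entry's general purpose bit flag field, and it marks the file name as UTF-8. A minimal sketch of the check follows. The helper name declaresUtf8 is made up for illustration and this is not yauzl's actual code.

```typescript
// Bit 11 of the general purpose bit flag (mask 0x0800) declares that the
// entry's file name and comment are UTF-8 encoded.
const UTF8_FLAG = 0x0800;

// Hypothetical helper, not part of yauzl's API.
function declaresUtf8(generalPurposeBitFlag: number): boolean {
    return (generalPurposeBitFlag & UTF8_FLAG) !== 0;
}

console.log(declaresUtf8(0x0800)); // true: bit 11 is set
console.log(declaresUtf8(0x0008)); // false: a different bit is set, bit 11 is not
```

An archive like the one in this issue would report false here even though its names are UTF-8 bytes.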

Do you know what program created this zip file?

@thejoshwolfe

Owner

commented May 6, 2018

Is it Archive Utility on Mac?

@rossj

commented May 6, 2018

@imcuttle I have a need to handle similar not-so-standard .zip files in my application, and I wanted to share my heuristic solution.

If you only need to deal with this file and similar files that are always UTF-8 (even if they don't indicate this), you can use the decodeStrings: false option and convert the file names to strings yourself. The error 'The "path" argument must be of type string' is likely coming from some other code downstream that expects a string. You probably need to do the Buffer -> string conversion before that point.
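As a rough sketch of that conversion: with decodeStrings: false the file name arrives as a Buffer, so it can be decoded before it reaches any path or fs call. The helper name fileNameToString is hypothetical, and the Buffer below is a stand-in for a real entry's fileName.

```typescript
// Hypothetical helper: accept either form and always hand back a string.
// Assumes the raw bytes are UTF-8, which is an assumption about the archive.
function fileNameToString(fileName: Buffer | string): string {
    return typeof fileName === "string" ? fileName : fileName.toString("utf8");
}

// Stand-in for entry.fileName as delivered with decodeStrings: false.
const rawName = Buffer.from("中文测试.md", "utf8");
console.log(fileNameToString(rawName)); // 中文测试.md
```

Note that toString("utf8") does not throw on malformed input. It substitutes U+FFFD for invalid sequences, so this is only appropriate when the names really are UTF-8.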

In my case, it is a bit more complicated, as I need to simultaneously handle zip files that are UTF-8 (with and without the proper bit being set), as well as files that are CP437 encoded. My solution is to use decodeStrings: false, collect all of the ZipEntries and fileName Buffers, and then to inspect these name Buffers to try and guess the proper encoding.

Specifically, I use the code in this gist to get some information on the name Buffers, followed by this logic:

const aggs = checkStringBufs(entries.map(entry => entry.fileName as Buffer));

let encoding: string;
if (aggs.allAsciiChar) {
    // utf8 is backwards compatible with ascii
    encoding = 'utf8';
} else if (aggs.all7Bit) {
    // Hmmm, no high bits but some control chars, probably cp437
    encoding = 'cp437';
} else if (aggs.validUtf8) {
    // Some high bits set, but seems to be UTF-8
    encoding = 'utf8';
} else {
    // Some high bits set, but not UTF-8!
    encoding = 'cp437';
}

This has been working well for the .zip files that I deal with.
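The gist itself is not reproduced in this thread. The following is only a guess at what such a helper could look like, with the three field meanings inferred from the comments in the logic above: allAsciiChar is taken to mean "only printable ASCII bytes", all7Bit to mean "no byte with the high bit set", and validUtf8 to mean "every buffer is well-formed UTF-8". These definitions are assumptions and the real gist may differ.

```typescript
// Assumed shape of the aggregate result; field meanings are inferred.
interface NameBufStats {
    allAsciiChar: boolean; // every byte is printable ASCII (0x20..0x7E)
    all7Bit: boolean;      // no byte has the high bit set
    validUtf8: boolean;    // every buffer is well-formed UTF-8
}

function isValidUtf8(buf: Buffer): boolean {
    // Decoding replaces malformed sequences with U+FFFD, so only
    // well-formed input survives a decode/encode round trip unchanged.
    return Buffer.from(buf.toString("utf8"), "utf8").equals(buf);
}

// Hypothetical stand-in for the gist's checkStringBufs.
function checkStringBufs(bufs: Buffer[]): NameBufStats {
    const stats: NameBufStats = { allAsciiChar: true, all7Bit: true, validUtf8: true };
    for (const buf of bufs) {
        for (let i = 0; i < buf.length; i++) {
            const byte = buf[i];
            if (byte >= 0x80) stats.all7Bit = false;
            if (byte < 0x20 || byte > 0x7e) stats.allAsciiChar = false;
        }
        if (!isValidUtf8(buf)) stats.validUtf8 = false;
    }
    return stats;
}

// UTF-8 name without bit 11 set: high bits present, but valid UTF-8.
console.log(checkStringBufs([Buffer.from("中文测试.md", "utf8")]));
// Byte 0x82 followed by ".txt": a lone continuation byte, so not valid UTF-8.
console.log(checkStringBufs([Buffer.from([0x82, 0x2e, 0x74, 0x78, 0x74])]));
```

With these definitions the first sample would take the "seems to be UTF-8" branch and the second would fall through to cp437.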

@imcuttle

Author

commented May 7, 2018

@thejoshwolfe

Is it Archive Utility on Mac?

Yep, the zip file was created by macOS. It's puzzling that the zip file is malformed.

image

@thejoshwolfe

Owner

commented May 7, 2018

You'd think that Apple would be better at writing software, but their Archive Utility really sucks at zip files. I've been working around bugs in that software for years.

If this issue is as simple as "Archive Utility always forgets to set bit 11", then maybe yauzl should have better support for this situation. I'll think about this.

@linYeeTracy

commented Sep 27, 2018

I've run into this problem too. Has it been resolved?

@thejoshwolfe

Owner

commented Sep 27, 2018

Sorry, I haven't been working on this project lately. I'll revisit this issue next week.

@imcuttle

Author

commented Sep 28, 2018

I guess this isn't yauzl's fault.

When I pass the option decodeStrings: false, file names from zip files created by macOS are received correctly.
See the PR.

@thejoshwolfe

Owner

commented Oct 5, 2018

When I pass the option decodeStrings: false, file names from zip files created by macOS are received correctly.
See the PR.

I believe the toString() call is using "utf8" encoding by default, which is the encoding intended by the zip file creator (Mac Archive Utility). In principle this isn't necessarily safe or correct, but in practice it's probably fine.

@thejoshwolfe

Owner

commented Oct 5, 2018

I did some research into Info-ZIP's charset detection code, and in the absence of General Purpose Bit 11, Info-ZIP uses a different charset depending on the operating system. It will only use CP437 as required by the spec on some platforms, presumably DOS. However, on Linux and Mac, Info-ZIP will simply always use UTF-8 for decoding file paths, because UTF-8 is the "native" charset on those platforms, whatever that means. This suggests it's safe for yauzl to drop support for CP437 and just use UTF-8 in all situations as well. 🤔

@wizardpisces

commented Dec 19, 2018

The PR was rejected! I have no clue how to better deal with this issue. Any progress?
