Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

option to emit raw string buffers instead of decoded strings #42

Closed
dweinstein opened this issue Oct 3, 2016 · 8 comments
Closed

option to emit raw string buffers instead of decoded strings #42

dweinstein opened this issue Oct 3, 2016 · 8 comments

Comments

@dweinstein
Copy link

dweinstein commented Oct 3, 2016

I'm using version 2.6.0 FWIW, node 6.

± node debug bin.js foo.zip
< Debugger listening on [::]:5858
connecting to 127.0.0.1:5858 ... ok
break in bin.js:2
  1
> 2 'use strict'
  3 const extractExec = require('./')
  4 const fs = require('fs')
c
break in index.js:46
 44     // TODO: what if we get multiple plists?
 45     const plist = plists[0]
>46     debugger
 47     getExecStream(fd, plist.CFBundleExecutable, (err, entry, exec) => {
 48       debugger
c
break in index.js:19
 17     zip.on('entry', function onentry (entry) {
 18       if ((/XXXThing.*app\/XXXThing-.*/i).test(entry.fileName)) {
>19         debugger;
 20       }
 21       if (!isOurExec(entry, execname)) { return }
repl
Press Ctrl + C to leave debug repl
> entry.fileName
'Payload/XXXThing-╬▓.app/XXXThing-╬▓'
> execname
'XXXThing-β'

as you can see the execname is right but the entry.fileName is not right utf-8 AFAICT.

@dweinstein
Copy link
Author

dweinstein commented Oct 3, 2016

Here's a reduced testcase:

± unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  10-03-2016 15:16   ç/
        6  10-03-2016 15:16   ç/hello
---------                     -------
        6                     2 files
± node testcase.js
ç/
ç/hello
Error: not found

Testcase code:

'use strict'

const fromFd = require('yauzl').fromFd
const once = require('once')
const fs = require('fs')

// ± unzip -l test.zip
// Archive:  test.zip //   Length      Date    Time    Name
// ---------  ---------- -----   ----
//         0  10-03-2016 15:16   ç/
//         6  10-03-2016 15:16   ç/hello
// ---------                     -------
//         6                     2 files
//
function issue42 (fd, cb) {
  cb = once(cb)
  fromFd(fd, (err, zip) => {
    if (err) return cb(err)
    zip.on('entry', function onentry (entry) {
      if ((/ç\/hello/).test(entry.fileName)) {
        console.log(entry.fileName)
        cb(entry.fileName)
      }
      console.log(entry.fileName)
    })
    zip.on('end', () => {
      if (!cb.called) {
        cb(new Error('not found'))
      }
    })
  })
}

const fd = fs.openSync(__dirname + '/test.zip', 'r')

issue42(fd, (err, res) => {
  console.log(err, res)
})

test zip:

test.zip

@thejoshwolfe
Copy link
Owner

Interesting bug report. The behavior you're seeing from Info-Zip is actually non-standard behavior. yauzl is behaving "correctly" with respect to the zipfile specification.

There are multiple ways for a zipfile to indicate that the filenames are encoded in utf-8, and your zipfile does none of them. According to the spec, if no charset is specified, then cp437 is to be used, which is what yauzl is doing.

I'm not sure why Info-Zip's unzip is making an assumption about the filename being UTF-8. I've read the man page and even spent some time searching the source for the reason for that behavior. The closest I came is an excerpt from the zip man page, which may or may not be relevant:

Though the zip standard requires storing paths in an archive using a specific character set, in practice zips have stored paths in archives in whatever the local character set is.

So the question remains, what should yauzl do in this situation? Should the spec be considered correct, or should "in practice" behavior of popular tools be considered correct? It's a tough call, but I'm leaning toward the spec.

If you'd like to fix your zipfile, try setting general purpose bit 11 in all the entries. That is what yazl does to indicate the filename is to be decoded using utf8. If you're creating the zipfile at a higher level than that, then i suggest using a different library/utility for creating zipfiles, because the one you're using is non-conformant. If you didn't make the zipfile at all, but you got it from a user, then i suggest you forward this paragraph to your user.

I haven't seen general purpose bit 11 mishandled like this in any existing zipfile utility i've tested this with. I can't say for sure, but i believe i've tested this issue with Info-Zip's zip, Windows Compressed Folder, Mac's Archive Utility, and 7-Zip. I'm not as familiar with Java's ZipFile class, python's zipfile module, or WinRAR.

So I don't know how this zipfile came to exist with the filename encoding messed up, but I really don't think I should follow in Info-Zip's nonstandard footsteps on this matter. Following the spec is one of yauzl's design principles, and cp437 support is a feature.

@dweinstein
Copy link
Author

FWIW the zip was created with zip on a mac.

@dweinstein
Copy link
Author

dweinstein commented Oct 11, 2016

@thejoshwolfe would you consider using something like https://gist.github.com/dweinstein/3125bed0a478e2b0acfccfae91c90fd5#file-guess-encoding-js which is a port to javascript of libzip's _zip_guess_encoding ? So far testing has been ok.

I have a branch you can try out here https://github.com/dweinstein/yauzl/tree/guess-encoding -- all tests are passing for me at the moment.

@thejoshwolfe
Copy link
Owner

I would consider adding an option to the open() API (and related API's) that forces all strings in a zipfile to be interpreted as UTF-8. I wouldn't want to do any guessing in this library, because that seems too high-level, error prone, non-standard, etc., but if users simply tell yauzl to deviate from the spec in this regard, that seems like a compromise everyone can be happy with.

Realistically, I'd bet that it's safe to always pass in that flag, if it existed. The only time it would cause a problem is if a zipfile was created with cp437 and actually used the non-ascii part of cp437. It could happen, but it'd probably be a very old zipfile if it ever did.

Does that sound like a viable solution to your problem?

@thejoshwolfe
Copy link
Owner

Better proposal: add an option to open() that leaves all strings undecoded as Buffer objects instead of strings. Then you can use any kind of encoding guesser or assume UTF-8 as you wish. I think this is the right solution to this issue.

@dweinstein
Copy link
Author

that sounds pretty reasonable. Having the buffer along with the flags surfaced will definitely allow another library to do the guessing...

@thejoshwolfe thejoshwolfe changed the title fileEntry name encoding issue option to emit raw string buffers instead of decoded strings Oct 11, 2016
@thejoshwolfe thejoshwolfe reopened this Oct 11, 2016
@thejoshwolfe
Copy link
Owner

published in version 2.7.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants