-
-
Notifications
You must be signed in to change notification settings - Fork 31.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reading member names with bogus encodings in zipfile #72267
Comments
Could you please provide more information sjt? |
Suggested NEWS/whatsnew entry: Add a new *memberNameEncoding* argument to the ZipFile constructor, allowing Motivation: There are applications in Japan that create zipfiles with directories containing filenames encoded in Shift JIS. There may be such software in other countries as well. As this is a violation of the Zip format definition, this library implements only an option to read such files. Done: (1) Add a memberNameEncoding argument to the main() function, which may be set from the command line with "--membernameencoding={codec}". This command line option may be used with -e or -l, but not -c or -t. There is no point to it in the latter, since the member names are not printed. To do (?): (10) NEWS/whatsnew Not done: I don't think these are appropriate/needed at this time, but listed in case somebody thinks otherwise. (13) Add a subtype of RuntimeError (see 7d)? |
I should have a contributor agreement form on file. Ned Deily suggested that I try to get this patch in before the 12 noon deadline Sept. 12, so here it is. I believe the patch is "safe" in the sense that its functionality needs to be explicitly enabled, and it should be very difficult to persuade it to inadvertantly write to any file. No existing execution paths should be changed at all. |
This issue is a duplicate of bpo-10614. But proposed patch looks more ready. |
Stephen, thanks for submitting the patch. Unless another core developer has time to review and commit this prior to the feature code off tomorrow, this will probably need to wait for 3.7. Also, at the moment, your tracker user record does not indicate that there is a contributor agreement on file for you. You could followup on that with Ewa at the PSF or just submit a new form. |
Re: wait for 3.7 if reviewers are busy, understood. N.B. Contributor agreement is now on file (I received the PDF from python.org already). Re: existing patches: The Japanese patches by umedoblock are very Japanese-centric, and worse, they try to guess the encoding by the crude method of seeing what decodes successfully. They are not acceptable IMO. Aaargh. Just noticed the Japanese in test_zipfile.py. Will change it to use \u escapes soon. |
Added comments on Rietveld. Maybe I'll commit rewritten patch tomorrow before 12:00 UTC (oh, already today!). |
Can't reply on Rietveld? Lost 2 hours work! Patch updated (encoded-member-names-v2), most changes accepted. Not happy about name change or default to cp437, I want this API to be hard to use and not be part of the normal process (utf-8 or cp437). Considering errors= argument, but that must default to 'strict' -- the problem this patch solves is zip utilities extracting to files with unreadable names, surrogateescape is more of the same. Two incomplete tests (assertRaisesRegex and capture main() stderr) still in progress, must do dayjob now. |
Sorry, I can't push the patch in a haste. In needs more design discussion, comparing with other implementations (I found only that ZipFile in Java can take the charset argument, but "charset" is common name for text encoding in Java), wider discussion. The problem exists for years, and there are workarounds. |
If you have a workaround that's available to nonprogrammers, I'd like to hear about it. I have found none, that's why I went to the trouble to put together a patch even though I knew that the odds of actually getting it in to Python 3.6 was very low -- my patch (or Sergey Dorofeev's, but that needs work to be applicable to trunk) does everything I've ever needed, so I suppose it would do for all the use cases so far posted (except umedoblock's encoding-guessing approach, but that can be handled by many 3rd-party encoding-guessing codecs, I think, and IMO should be). |
Python is programming language, I don't understand what you mean saying "available to nonprogrammers". As a programmer you can recode ZipInfo name before outputting or what you want to do with it: filename = filename.encode('cp437').decode(encoding) In command line you can can use iconv or other converters:
This is not very handy, but works in most cases. |
Cleaned up a few loose ends while it's all fresh in mind. Will ping python-dev in 4-6 weeks for review for 3.7. Thanks to Serhiy for review. The current version of the patch is much improved over the initial submission due to his efforts. |
A ZipFile can be read when open in not read-only mode. Thus the encoding argument should be accepted when mode != 'r'. It would be weird to read file names and write new entries with different encodings. Thus the encoding argument should affect output encoding too. You have named the new ZipFile attribute metadataEncoding. Indeed, I missed this, and other developers missed this when ported to Python 3, but the specification says, that the UTF-8 bit affect not just the encoding of file names, but the encoding of comments. Thus a file comment must be a string, and be decoded with the same encoding as a file name. Currently it is of type bytes. I don't know what is the best way to resolve this issue without breaking backward compatibility. Perhaps add the text_comment property. |
Thanks for followup! I was just about to write you, now that 3.6 is out. Season's Greetings! First, how do you propose to proceed with bpo-28115 ("use argparse for the ZipFile module")? If you expect to commit that first (I'm in no hurry for this patch, BTW, as long as it gets into 3.7 I'm happy), this issue should depend on it and use argparse too. I don't see any good reason for allowing non-UTF-8 encoding to a file open for writing, and a good reason (the ZipFile standard) for not allowing it. Certainly the CLI should not allow it, any more than it does now. At least in my experiments InfoZip and the default zip utilities on Windows and Mac DTRT with UTF-8 zipfiles, so there is no absolute need for writing nonconforming zipfiles. If you want to block on a convert-to-UTF-8 option I can do that (but I don't need it myself). (Note to self: if writing to existing zipfile is extension of existing file, need to prevent mixed encodings. Also warn about conversion.) I thought I checked that comments were decoded. Maybe that's only on the UTF-8 path? Or maybe I needed more coffee. (Hope so, that would be a messy problem if ASCII/Latin1 returns bytes and UTF-8 returns str!) I'll think about this. Yes, it's a backwards-compatibility issue so needs care. Would be weird if names are decoded but other metadata (comments) not, though. Surely someone would complain if they actually used comments? (I'm thinking maybe a compatibility break might be OK? With deprecation cycle?) I expect to check all execution paths accessing metadata and have a proposed patch by 12/31. I think I'm still short some tests, will check and write them if needed. |
I experimented with this a lot. There is a problem with the append mode. We can read in the append mode, therefore we need an encoding. But when we close a ZipFile after appending, non-ASCII file names will be encoded in UTF-8 in the central directory. Next time when we open the archive for reading with different encoding we will get an error because filenames in the central directory and in local headers are different. We need to write non-ASCII files back with the specified encoding to get a self-consistent data. Finally I left this as it was initially. We can return to the problem with the append module later. The differences between PR 32007 and your patches:
I was going to make more changes, but left it for future. |
Your PR looks good to me. I agree with not making it easy to _write_ zipfiles with non-standard encoding used for names. There is a possibility that someone wants that ability when writing zip files (not yet clear) in https://bugs.python.org/issue40172. I don't like it, but if the need exists, that feature can be addressed on that issue via relevant options. |
I'm not going to have time to look at the PR for a couple days. I don't understand what the use case is for writing or appending with filenames in a non-UTF-8 encoding. At least in my experience, reading such files is rare, but I have never been asked to write one. The correspondents who send me zipfiles with the directory encoded in shift_jisx0213 are perfectly happy to read zipfiles with the directory encoded in UTF-8. If that is true for other users, then unzipping the file to a temporary directory with the appropriate --metadata-encoding, adding the required paths there, and zipping a new archive seems perfectly workable. In that case making this feature read-only makes the most sense to me. |
Thanks Serhiy! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: