Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reading member names with bogus encodings in zipfile #72267

Closed
yaseppochi mannequin opened this issue Sep 11, 2016 · 19 comments
Closed

Allow reading member names with bogus encodings in zipfile #72267

yaseppochi mannequin opened this issue Sep 11, 2016 · 19 comments
Assignees
Labels
3.11 stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@yaseppochi
Copy link
Mannequin

yaseppochi mannequin commented Sep 11, 2016

BPO 28080
Nosy @gpshead, @ned-deily, @yaseppochi, @methane, @serhiy-storchaka
PRs
  • bpo-28080:: Add support for the fallback encoding in ZIP files #32007
  • Files
  • encoded-member-names: Mercurial queue style patch for Lib/zipfile.py, Doc/Lib/zipfile.rst, Lib/test/test_zipfile.py
  • encoded-member-names-v2
  • encoded-member-names-v3
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2022-03-24.00:33:09.557>
    created_at = <Date 2016-09-11.18:23:14.387>
    labels = ['type-feature', 'library', '3.11']
    title = 'Allow reading member names with bogus encodings in zipfile'
    updated_at = <Date 2022-03-24.00:33:09.555>
    user = 'https://github.com/yaseppochi'

    bugs.python.org fields:

    activity = <Date 2022-03-24.00:33:09.555>
    actor = 'gregory.p.smith'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2022-03-24.00:33:09.557>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2016-09-11.18:23:14.387>
    creator = 'sjt'
    dependencies = []
    files = ['44564', '44571', '44593']
    hgrepos = []
    issue_num = 28080
    keywords = ['patch', 'needs review']
    message_count = 19.0
    messages = ['275828', '275834', '275837', '275842', '275844', '275847', '275856', '275891', '275996', '276025', '276051', '276054', '284029', '284139', '415605', '415746', '415756', '415876', '415917']
    nosy_count = 5.0
    nosy_names = ['gregory.p.smith', 'ned.deily', 'sjt', 'methane', 'serhiy.storchaka']
    pr_nums = ['32007']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28080'
    versions = ['Python 3.11']

    @yaseppochi yaseppochi mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Sep 11, 2016
    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 11, 2016

    Could you please provide more information sjt?

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Sep 11, 2016

    Suggested NEWS/whatsnew entry:

    Add a new *memberNameEncoding* argument to the ZipFile constructor, allowing
    :mod:`zipfile` to read filenames in non-conforming encodings from the
    zipfile as Unicode. This implementation assumes all member names have the same encoding.

    Motivation:

    There are applications in Japan that create zipfiles with directories containing filenames encoded in Shift JIS. There may be such software in other countries as well. As this is a violation of the Zip format definition, this library implements only an option to read such files.

    Done:

    (1) Add a memberNameEncoding argument to the main() function, which may be set from the command line with "--membernameencoding={codec}". This command line option may be used with -e or -l, but not -c or -t. There is no point to it in the latter, since the member names are not printed.
    (2) Add a memberNameEncoding argument to the ZipFile constructor. This is the only way to set it, so this is global to the ZipFile.
    (3) Add this attribute to repr.
    (4) Add a check that the mode is read in main() and in the ZipFile constructor, and if not invoke USAGE and exit or raise RuntimeError.
    (5) When retrieving member names in constructing ZipInfo instances, check if memberNameEncoding is set, and if so use it, unless the UTF-8 bit is set. In that case, obey the UTF-8 bit, as the specified encoding is surely user error.
    (6) Add a CODEC_USAGE message.
    (7) Update the docs (docstrings, library reference, NEWS).
    (8) Add tests:
    (a) List a zipfile's SJIS-encoded directory.
    (b) List a UTF-8-encoded directory and an ISO-8859-1-encoded directory as Shift-JIS.
    (c) Check that USAGE is invoked on attempts to write a zipfile in main().
    (d) Check that an appropriate error is raised on attempts to write in other functions.
    Many other tests are run as well.
    ALL TESTS PASS.
    (9) Docs build without error.

    To do (?):

    (10) NEWS/whatsnew
    (11) Check relevant code paths are all covered by tests.
    (12) Review docs for clarity and organization.

    Not done:

    I don't think these are appropriate/needed at this time, but listed in case somebody thinks otherwise.

    (13) Add a subtype of RuntimeError (see 7d)?
    (14) Issue warning if both membernameencoding and utf-8 bit are set (see 4)?
    (15) Support InfoZip encoding extension mentioned in APPNOTE.TXT - .ZIP File Format Specification, v6.3.4.
    (16) Support per-member encodings (I think the zipfile standard permits, but not sure).

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Sep 11, 2016

    I should have a contributor agreement form on file.

    Ned Deily suggested that I try to get this patch in before the 12 noon deadline Sept. 12, so here it is.

    I believe the patch is "safe" in the sense that its functionality needs to be explicitly enabled, and it should be very difficult to persuade it to inadvertantly write to any file. No existing execution paths should be changed at all.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 11, 2016

    This issue is a duplicate of bpo-10614. But proposed patch looks more ready.

    @serhiy-storchaka serhiy-storchaka self-assigned this Sep 11, 2016
    @ned-deily
    Copy link
    Member

    ned-deily commented Sep 11, 2016

    Stephen, thanks for submitting the patch. Unless another core developer has time to review and commit this prior to the feature code off tomorrow, this will probably need to wait for 3.7. Also, at the moment, your tracker user record does not indicate that there is a contributor agreement on file for you. You could followup on that with Ewa at the PSF or just submit a new form.

    @ned-deily ned-deily added the 3.7 label Sep 11, 2016
    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Sep 11, 2016

    Re: wait for 3.7 if reviewers are busy, understood. N.B. Contributor agreement is now on file (I received the PDF from python.org already).

    Re: existing patches:
    My patch is very similar in the basic approach to Sergey Dorofeev's patch in bpo-10614. Main differences:
    (1) Sergey's patch treats the "encoding" parameter as a first class citizen with a default to cp437, whereas mine treats it as a special case defaulting to None, with utf-8 and cp437 getting special treatment as the standard encodings. Subtle point, but I like it this way.
    (2) My patch includes support for the argument in the __main__ script.
    (3) Sergey's patch misses one execution path in the current code so needs update before application.

    The Japanese patches by umedoblock are very Japanese-centric, and worse, they try to guess the encoding by the crude method of seeing what decodes successfully. They are not acceptable IMO.

    Aaargh. Just noticed the Japanese in test_zipfile.py. Will change it to use \u escapes soon.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 11, 2016

    Added comments on Rietveld. Maybe I'll commit rewritten patch tomorrow before 12:00 UTC (oh, already today!).

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Sep 11, 2016

    Can't reply on Rietveld? Lost 2 hours work!

    Patch updated (encoded-member-names-v2), most changes accepted. Not happy about name change or default to cp437, I want this API to be hard to use and not be part of the normal process (utf-8 or cp437). Considering errors= argument, but that must default to 'strict' -- the problem this patch solves is zip utilities extracting to files with unreadable names, surrogateescape is more of the same. Two incomplete tests (assertRaisesRegex and capture main() stderr) still in progress, must do dayjob now.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 12, 2016

    Sorry, I can't push the patch in a haste. In needs more design discussion, comparing with other implementations (I found only that ZipFile in Java can take the charset argument, but "charset" is common name for text encoding in Java), wider discussion. The problem exists for years, and there are workarounds.

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Sep 12, 2016

    If you have a workaround that's available to nonprogrammers, I'd like to hear about it. I have found none, that's why I went to the trouble to put together a patch even though I knew that the odds of actually getting it in to Python 3.6 was very low -- my patch (or Sergey Dorofeev's, but that needs work to be applicable to trunk) does everything I've ever needed, so I suppose it would do for all the use cases so far posted (except umedoblock's encoding-guessing approach, but that can be handled by many 3rd-party encoding-guessing codecs, I think, and IMO should be).

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 12, 2016

    Python is programming language, I don't understand what you mean saying "available to nonprogrammers". As a programmer you can recode ZipInfo name before outputting or what you want to do with it:

        filename = filename.encode('cp437').decode(encoding)

    In command line you can can use iconv or other converters:

    python3 -m zipfile -l myzip.zip | iconv -t cp437 | iconv -f "$encoding"
    

    This is not very handy, but works in most cases.

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Sep 12, 2016

    Cleaned up a few loose ends while it's all fresh in mind. Will ping python-dev in 4-6 weeks for review for 3.7.

    Thanks to Serhiy for review. The current version of the patch is much improved over the initial submission due to his efforts.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Dec 26, 2016

    A ZipFile can be read when open in not read-only mode. Thus the encoding argument should be accepted when mode != 'r'.

    It would be weird to read file names and write new entries with different encodings. Thus the encoding argument should affect output encoding too.

    You have named the new ZipFile attribute metadataEncoding. Indeed, I missed this, and other developers missed this when ported to Python 3, but the specification says, that the UTF-8 bit affect not just the encoding of file names, but the encoding of comments. Thus a file comment must be a string, and be decoded with the same encoding as a file name. Currently it is of type bytes. I don't know what is the best way to resolve this issue without breaking backward compatibility. Perhaps add the text_comment property.

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Dec 27, 2016

    Thanks for followup! I was just about to write you, now that 3.6 is out. Season's Greetings!

    First, how do you propose to proceed with bpo-28115 ("use argparse for the ZipFile module")? If you expect to commit that first (I'm in no hurry for this patch, BTW, as long as it gets into 3.7 I'm happy), this issue should depend on it and use argparse too.

    I don't see any good reason for allowing non-UTF-8 encoding to a file open for writing, and a good reason (the ZipFile standard) for not allowing it. Certainly the CLI should not allow it, any more than it does now. At least in my experiments InfoZip and the default zip utilities on Windows and Mac DTRT with UTF-8 zipfiles, so there is no absolute need for writing nonconforming zipfiles. If you want to block on a convert-to-UTF-8 option I can do that (but I don't need it myself). (Note to self: if writing to existing zipfile is extension of existing file, need to prevent mixed encodings. Also warn about conversion.)

    I thought I checked that comments were decoded. Maybe that's only on the UTF-8 path? Or maybe I needed more coffee. (Hope so, that would be a messy problem if ASCII/Latin1 returns bytes and UTF-8 returns str!) I'll think about this. Yes, it's a backwards-compatibility issue so needs care. Would be weird if names are decoded but other metadata (comments) not, though. Surely someone would complain if they actually used comments? (I'm thinking maybe a compatibility break might be OK? With deprecation cycle?) I expect to check all execution paths accessing metadata and have a proposed patch by 12/31.

    I think I'm still short some tests, will check and write them if needed.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Mar 20, 2022

    I experimented with this a lot. There is a problem with the append mode. We can read in the append mode, therefore we need an encoding. But when we close a ZipFile after appending, non-ASCII file names will be encoded in UTF-8 in the central directory. Next time when we open the archive for reading with different encoding we will get an error because filenames in the central directory and in local headers are different. We need to write non-ASCII files back with the specified encoding to get a self-consistent data.

    Finally I left this as it was initially. We can return to the problem with the append module later.

    The differences between PR 32007 and your patches:

    • The parameter was renamed to metadata_encoding to avoid confusion with existing parameter of ZipFile.open() encoding. In future I am going to use it also for comments. The attribute and the CLI option were renamed correspondingly.
    • --metadata-encoding can also be used with the -t option.
    • "surrogateescape" no longer used. If the encoding in not suitable, you will get an error. Use the default and decode filenames manually in such cases. We can change this in future.
    • Updated documentation.
    • Tests were significantly rewritten. Now they test the behavior with wrong metadata_encoding, mixed UTF-8 and legacy encodings, and reading after append.

    I was going to make more changes, but left it for future.

    @serhiy-storchaka serhiy-storchaka added 3.11 and removed 3.7 labels Mar 20, 2022
    @gpshead
    Copy link
    Member

    gpshead commented Mar 22, 2022

    Your PR looks good to me.

    I agree with not making it easy to _write_ zipfiles with non-standard encoding used for names.

    There is a possibility that someone wants that ability when writing zip files (not yet clear) in https://bugs.python.org/issue40172. I don't like it, but if the need exists, that feature can be addressed on that issue via relevant options.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Mar 22, 2022

    New changeset a25a985 by Serhiy Storchaka in branch 'main':
    bpo-28080: Add support for the fallback encoding in ZIP files (GH-32007)
    a25a985

    @yaseppochi
    Copy link
    Mannequin Author

    yaseppochi mannequin commented Mar 23, 2022

    I'm not going to have time to look at the PR for a couple days.

    I don't understand what the use case is for writing or appending with filenames in a non-UTF-8 encoding. At least in my experience, reading such files is rare, but I have never been asked to write one. The correspondents who send me zipfiles with the directory encoded in shift_jisx0213 are perfectly happy to read zipfiles with the directory encoded in UTF-8.

    If that is true for other users, then unzipping the file to a temporary directory with the appropriate --metadata-encoding, adding the required paths there, and zipping a new archive seems perfectly workable. In that case making this feature read-only makes the most sense to me.

    @gpshead
    Copy link
    Member

    gpshead commented Mar 24, 2022

    Thanks Serhiy!

    @gpshead gpshead closed this as completed Mar 24, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants