ZipFile: add a filename_encoding argument #54823

Open
ocean-city mannequin opened this issue Dec 3, 2010 · 15 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@ocean-city
Mannequin

ocean-city mannequin commented Dec 3, 2010

BPO 10614
Nosy @loewis, @amauryfa, @vstinner, @methane, @serhiy-storchaka
Superseder
  • bpo-28080: Allow reading member names with bogus encodings in zipfile
  • Files
  • non-ascii-cp932.zip: built with python2.7
  • zipfile.patch: decode_filename zipfile.patch
  • encodings.py
  • 10614-zipfile-encoding.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2010-12-03.07:41:56.474>
    labels = ['3.7', 'type-feature', 'library']
    title = 'ZipFile: add a filename_encoding argument'
    updated_at = <Date 2016-12-26.13:06:43.489>
    user = 'https://bugs.python.org/ocean-city'

    bugs.python.org fields:

    activity = <Date 2016-12-26.13:06:43.489>
    actor = 'methane'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2010-12-03.07:41:56.474>
    creator = 'ocean-city'
    dependencies = []
    files = ['19935', '26372', '26376', '46043']
    hgrepos = []
    issue_num = 10614
    keywords = ['patch']
    message_count = 15.0
    messages = ['123197', '123201', '123202', '123229', '123332', '126791', '136233', '165351', '165384', '165386', '200187', '200193', '200311', '284025', '284026']
    nosy_count = 10.0
    nosy_names = ['loewis', 'amaury.forgeotdarc', 'vstinner', 'ocean-city', 'methane', 'THRlWiTi', 'Laurent.Mazuel', 'umedoblock', 'serhiy.storchaka', 'Sergey.Dorofeev']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = '28080'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue10614'
    versions = ['Python 3.6', 'Python 3.7']

    @ocean-city
    Mannequin Author

    ocean-city mannequin commented Dec 3, 2010

    Currently, ZipFile only accepts ASCII or UTF-8 as file
    name encodings. On Japanese Windows, CP932 is usually
    used instead. As a result, when we extract such a ZIP file
    with py3k, non-ASCII file names come out garbled. Can we handle
    this issue? (For example, by adding an encoding option to ZipFile.__init__.)
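
A minimal sketch of the symptom described above, assuming the archive stores the name as raw CP932 bytes without the UTF-8 flag, so that a reader falls back to cp437. The sample file name is made up for illustration.

```python
# Hypothetical file name, as a Japanese Windows tool might store it.
original = "日本語.txt"
raw = original.encode("cp932")  # the bytes written into the archive

# A reader that assumes cp437 does not fail; it silently yields wrong text.
garbled = raw.decode("cp437")
print(ascii(garbled))

# Python's cp437 codec maps every byte to a distinct character,
# so the original bytes can be recovered and decoded again.
recovered = garbled.encode("cp437").decode("cp932")
print(recovered == original)
```

The round trip only works because the wrong decoding is lossless; it says nothing about how to detect the right encoding in the first place.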

    @ocean-city ocean-city mannequin added extension-modules C modules in the Modules dir type-feature A feature request or enhancement labels Dec 3, 2010
    @amauryfa
    Member

    amauryfa commented Dec 3, 2010

    The ZIP format specification mentions only cp437 and utf8: see Appendix D of http://www.pkware.com/documents/casestudies/APPNOTE.TXT.
    Do zip files created on Japanese Windows contain some information about the encoding they use?
    Or do some programs write cp932 where they are supposed to use one of the encodings above?

    @loewis
    Mannequin

    loewis mannequin commented Dec 3, 2010

    No, there is no indication in the zipfile that it deviates from the spec. That doesn't stop people from creating such zipfiles anyway; many zip tools ignore the spec and use CP_ACP instead (which, of course, will then get misinterpreted if extracted on a different system).

    I think we must support this case somehow, but must be careful to avoid creating such files unless explicitly requested. One approach might be to have two encodings given: one to interpret the existing filenames, and one to be used for new filenames (with a recommendation to never use that parameter since zip now supports UTF-8 in a well-defined manner).
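
A rough sketch of the reading half of this idea, under the assumption that the UTF-8 flag (general purpose bit 11, 0x800) still takes precedence and the caller-supplied encoding only replaces the cp437 fallback. The helper name `decode_member_name` and its parameters are hypothetical, not part of zipfile.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def decode_member_name(raw, flag_bits, legacy_encoding="cp437"):
    """Hypothetical helper: decode the raw bytes of a ZIP member name.

    The UTF-8 flag wins when present; otherwise the caller-supplied
    legacy encoding is used instead of the historical cp437 default.
    """
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode(legacy_encoding)


name = "日本語.txt"  # made-up sample name
from_cp932 = decode_member_name(name.encode("cp932"), 0, legacy_encoding="cp932")
from_utf8 = decode_member_name(name.encode("utf-8"), UTF8_FLAG, legacy_encoding="cp932")
print(from_cp932 == name, from_utf8 == name)
```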

    @vstinner
    Member

    vstinner commented Dec 3, 2010

    @hirokazu: Can you attach a small test archive?

    Yes, we can add a "default_encoding" attribute to ZipFile and add an optional default_encoding argument to its constructor.

    @ocean-city
    Mannequin Author

    ocean-city mannequin commented Dec 4, 2010

    I'm not sure why, but I get a BadZipFile error now. Anyway,
    here is a cp932 zip file created with Python 2.7.

    @vstinner
    Member

    In bpo-10972, I proposed adding an option to set the filename encoding to UTF-8. That proposal is about forcing UTF-8 when creating a ZIP file; it doesn't concern the decompression of a ZIP file.

    Here is a proposed specification that fixes both issues at the same time.

    The name "default_encoding" is confusing because it doesn't say whether it is the encoding of the (text?) file content or the encoding of the filename. Why not simply "filename_encoding"?

    The option can be added in multiple places:

    • an argument to the ZipFile constructor: this is needed for decompression
    • an argument to ZipFile.write() and to ZipInfo, because there are 3 different ways to add files

    ZipFile.filename_encoding (and ZipInfo.filename_encoding) will be None by default: in this case, use the current algorithm (try cp437 or use UTF-8). Otherwise, use the encoding. If the encoding is UTF-8: set unicode flag.
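
The write-side rule in the paragraph above could be sketched roughly as follows. This is only an illustration of the proposal as worded in this comment: the helper `encode_member_name` is hypothetical, and the "try cp437, otherwise UTF-8" default is taken from the description here rather than from the zipfile source.

```python
import codecs

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def encode_member_name(name, flag_bits=0, filename_encoding=None):
    """Hypothetical helper returning (raw_name_bytes, flag_bits)."""
    if filename_encoding is None:
        # Default as described in the comment: try cp437, else UTF-8.
        try:
            return name.encode("cp437"), flag_bits
        except UnicodeEncodeError:
            return name.encode("utf-8"), flag_bits | UTF8_FLAG
    raw = name.encode(filename_encoding)
    # Normalize spellings such as "UTF-8", "utf8" or "U8" before comparing.
    if codecs.lookup(filename_encoding).name == "utf-8":
        flag_bits |= UTF8_FLAG
    return raw, flag_bits


print(encode_member_name("fön.txt"))
print(encode_member_name("a.txt", filename_encoding="UTF-8"))
```

With an explicit non-UTF-8 encoding such as "cp932", the flag stays clear, which is exactly the kind of file the earlier comments recommend not creating unless explicitly requested.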

    Examples:
    ---

    zipfile.ZipFile("non-ascii-cp932.zip", filename_encoding="cp932")
    
    f = zipfile.ZipFile("test.zip", "w")
    f.write(filename, filename_encoding="UTF-8")
    info = ZipInfo(filename, filename_encoding="UTF-8")
    f.writestr(info, b'data')

    Don't add filename_encoding argument to ZipFile.writestr(), because it may conflict if a ZipInfo is passed and ZipInfo.filename_encoding and filename_encoding are different.

    @vstinner vstinner changed the title ZipFile and CP932 encoding ZipFile: add a filename_encoding argument Feb 1, 2011
    @vstinner
    Member

    I closed issue bpo-12048 as a duplicate of this issue: yaoyu wants to uncompress a ZIP file whose filenames are encoded in GBK.

    @umedoblock
    Mannequin

    umedoblock mannequin commented Jul 13, 2012

    I fixed this problem.
    I added a new method, _decode_filename().

    @loewis
    Mannequin

    loewis mannequin commented Jul 13, 2012

    umedoblock: your patch is incorrect, as it produces mojibake. If there is a file name b'f\x94n', it will be decoded as sjis under your patch (to u'f\u99ac'), even though it was meant as cp437 (i.e. u'f\xf6n').
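
The ambiguity can be checked directly with the bytes quoted in this comment; the snippet below only uses the standard codecs and makes no assumption about the patch itself.

```python
raw = b"f\x94n"

# The same three bytes are valid in both encodings, so decoding alone
# cannot tell which one the archive author intended.
as_sjis = raw.decode("shift_jis")
as_cp437 = raw.decode("cp437")

print(ascii(as_sjis))
print(ascii(as_cp437))
```

Since neither decode raises, a "try one encoding, fall back to the other" strategy would always pick the first one, which is why the encoding has to be supplied by the caller.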

    @umedoblock
    Mannequin

    umedoblock mannequin commented Jul 13, 2012

    Hi, Martin.
    I tried your test case with the attached file
    and got the result below.

    p3 ./encodings.py
    encoding: sjis, filename: f馬
    encoding: cp437, filename: fön
    sjis_filename = f馬
    cp437_filename = fön

    Both cases succeed.
    So I think the patch needs to change default_encoding
    before or inside _decode_filename().

    But I have no idea how to change default_encoding.

    @SergeyDorofeev
    Mannequin

    SergeyDorofeev mannequin commented Oct 18, 2013

    I'd like to submit a patch to support zip archives created on systems that use a non-US codepage (e.g. Russian CP866).
    The codepage would be specified by an additional parameter of the ZipFile constructor, named "codepage".
    If it is not specified, the old behavior is preserved (use CP437).

    --- zipfile.py-orig 2013-09-18 16:45:56.000000000 +0400
    +++ zipfile.py 2013-10-15 00:24:06.105157572 +0400
    @@ -885,7 +885,7 @@
    fp = None # Set here since __del__ checks it
    _windows_illegal_name_trans_table = None

    -    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
    +    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False, codepage='cp437'):
             """Open the ZIP file with mode read "r", write "w" or append "a"."""
             if mode not in ("r", "w", "a"):
                 raise RuntimeError('ZipFile() requires mode "r", "w", or "a"')
    @@ -901,6 +901,7 @@
             self.mode = key = mode.replace('b', '')[0]
             self.pwd = None
             self._comment = b''
    +        self.codepage = codepage
    
             # Check if we were passed a file-like object
             if isinstance(file, str):
    @@ -1002,7 +1003,7 @@
                     filename = filename.decode('utf-8')
                 else:
                     # Historical ZIP filename encoding
    -                filename = filename.decode('cp437')
    +                filename = filename.decode(self.codepage)
                 # Create ZipInfo instance to store file information
                 x = ZipInfo(filename)
                 x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH])
    @@ -1157,7 +1158,7 @@
                     # UTF-8 filename
                     fname_str = fname.decode("utf-8")
                 else:
    -                fname_str = fname.decode("cp437")
    +                fname_str = fname.decode(self.codepage)
                 if fname_str != zinfo.orig_filename:
                     raise BadZipFile(

    @vstinner
    Member

    Please rename codepage to encoding. By the way, 437 is a codepage, cp437 is
    a (Python) encoding.

    I don't think that ZIP is limited to Windows. I have uncompressed zip files many
    times on various OSes, and GitHub also produces zip files (and GitHub is probably not
    using Windows). The term codepage is only used on Windows. Mac OS 9 users
    might produce Mac Roman filenames.

    @SergeyDorofeev
    Mannequin

    SergeyDorofeev mannequin commented Oct 18, 2013

    OK, here you are:

    --- zipfile.py-orig 2013-09-18 16:45:56.000000000 +0400
    +++ zipfile.py 2013-10-19 01:59:07.444346674 +0400
    @@ -885,7 +885,7 @@
    fp = None # Set here since __del__ checks it
    _windows_illegal_name_trans_table = None

    -    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
    +    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False, encoding='cp437'):
             """Open the ZIP file with mode read "r", write "w" or append "a"."""
             if mode not in ("r", "w", "a"):
                 raise RuntimeError('ZipFile() requires mode "r", "w", or "a"')
    @@ -901,6 +901,7 @@
             self.mode = key = mode.replace('b', '')[0]
             self.pwd = None
             self._comment = b''
    +        self.encoding = encoding
    
             # Check if we were passed a file-like object
             if isinstance(file, str):
    @@ -1001,8 +1002,8 @@
                     # UTF-8 file names extension
                     filename = filename.decode('utf-8')
                 else:
    -                # Historical ZIP filename encoding
    -                filename = filename.decode('cp437')
    +                # Historical ZIP filename encoding, default is CP437
    +                filename = filename.decode(self.encoding)
                 # Create ZipInfo instance to store file information
                 x = ZipInfo(filename)
                 x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH])
    @@ -1157,7 +1158,7 @@
                     # UTF-8 filename
                     fname_str = fname.decode("utf-8")
                 else:
    -                fname_str = fname.decode("cp437")
    +                fname_str = fname.decode(self.encoding)
                 if fname_str != zinfo.orig_filename:
                     raise BadZipFile(
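
Without such a patch, a caller can get a similar effect by undoing the cp437 decoding after the fact. The sketch below is one way to do that with an unmodified zipfile; `fix_name` is a hypothetical helper, the Russian sample name is made up, and the archive is faked in memory by overwriting an ASCII placeholder name with CP866 bytes of the same length.

```python
import io
import zipfile


def fix_name(name, encoding):
    """Hypothetical helper: undo zipfile's cp437 decoding of a legacy name."""
    return name.encode("cp437").decode(encoding)


# Build a small archive, then patch the stored name bytes so it looks like
# one written by a tool that used CP866 without setting the UTF-8 flag.
placeholder = b"XXXX.txt"
legacy = "тест.txt".encode("cp866")  # 8 bytes, same length as the placeholder
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(placeholder.decode("ascii"), b"data")
patched = buf.getvalue().replace(placeholder, legacy)

with zipfile.ZipFile(io.BytesIO(patched)) as zf:
    # Names flagged as UTF-8 (bit 0x800) are already correct; re-decode the rest.
    fixed = [
        info.filename if info.flag_bits & 0x800 else fix_name(info.filename, "cp866")
        for info in zf.infolist()
    ]

print(fixed == ["тест.txt"])
```

This only repairs the names returned to the caller; the ZipInfo objects inside the archive still carry the cp437-decoded names, so members must be opened through those.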



    @methane methane added 3.7 (EOL) end of life stdlib Python modules in the Lib dir and removed extension-modules C modules in the Modules dir labels Dec 26, 2016
    @serhiy-storchaka
    Member

    See also bpo-28080.

    @methane
    Member

    methane commented Dec 26, 2016

    Thanks. Patch posted in bpo-28080 looks better than mine.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022