Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to read zip file with extra #78787

Closed
altendky mannequin opened this issue Sep 7, 2018 · 9 comments
Closed

Unable to read zip file with extra #78787

altendky mannequin opened this issue Sep 7, 2018 · 9 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@altendky
Copy link
Mannequin

altendky mannequin commented Sep 7, 2018

BPO 34606
Nosy @serhiy-storchaka, @altendky, @tirkarthi, @vladima
PRs
  • bpo-34606: decode extra data only if zip64 end of central directory record is present #9107
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-09-07.20:40:31.649>
    created_at = <Date 2018-09-07.14:41:59.250>
    labels = ['invalid', '3.7', 'library', 'type-crash']
    title = 'Unable to read zip file with extra'
    updated_at = <Date 2019-06-25.12:21:32.555>
    user = 'https://github.com/altendky'

    bugs.python.org fields:

    activity = <Date 2019-06-25.12:21:32.555>
    actor = 'Andreas.G\xc3\xa4er'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-09-07.20:40:31.649>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2018-09-07.14:41:59.250>
    creator = 'altendky'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34606
    keywords = []
    message_count = 9.0
    messages = ['324743', '324766', '324790', '324791', '324797', '324799', '325119', '325132', '346518']
    nosy_count = 5.0
    nosy_names = ['serhiy.storchaka', 'Andreas.G\xc3\xa4er', 'altendky', 'xtreak', 'v2m']
    pr_nums = ['9107']
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue34606'
    versions = ['Python 3.7']

    @altendky
    Copy link
    Mannequin Author

    altendky mannequin commented Sep 7, 2018

    This was first found over in Twisted tests. We probably aren't too terribly worried about it but I wanted to report here anyways.

    https://twistedmatrix.com/trac/ticket/9525

    Both 3.6 and 3.7 write the same file (sha at the end) based on the script in the first snippet. 3.6 can read both files, 3.7 can't read either.

    ----
    altendky@lt:~/twisted$ cat ../z.py
    import sys
    import zipfile

        print(sys.version)
    
        fn = sys.argv[1]
        print(fn)
    
        with zipfile.ZipFile(fn, 'w') as zf:
            zi = zipfile.ZipInfo("0")
            zi.extra = b"hello, extra"
            zf.writestr(zi, b"the real data")
    
        zipfile.ZipFile(fn)

    altendky@lt:~/twisted$ venv36/bin/python ../z.py 36.zip
    3.6.6 (default, Jul 24 2018, 16:23:12)
    [GCC 6.3.0 20170516]
    36.zip
    

    ----

        altendky@lt:~/twisted$ venv37/bin/python ../z.py 37.zip
        3.7.0 (default, Jul  7 2018, 15:49:24)
        [GCC 6.3.0 20170516]
        37.zip
        Traceback (most recent call last):
          File "../z.py", line 14, in <module>
            zipfile.ZipFile(fn)
          File "/home/altendky/.pyenv/versions/3.7.0/lib/python3.7/zipfile.py", line 1200, in __init__
            self._RealGetContents()
          File "/home/altendky/.pyenv/versions/3.7.0/lib/python3.7/zipfile.py", line 1323, in _RealGetContents
            x._decodeExtra()
          File "/home/altendky/.pyenv/versions/3.7.0/lib/python3.7/zipfile.py", line 440, in _decodeExtra
            raise BadZipFile("Corrupt extra field %04x (size=%d)" % (tp, ln))
        zipfile.BadZipFile: Corrupt extra field 6568 (size=27756)

    altendky@lt:~/twisted$ cat ../z.py
    import sys
    import zipfile
    
        print(sys.version)
    
        fn = sys.argv[1]
        print(fn)
    #with zipfile.ZipFile(fn, 'w') as zf:
    #    zi = zipfile.ZipInfo("0")
    #    zi.extra = b"hello, extra"
    #    zf.writestr(zi, b"the real data")
    
        zipfile.ZipFile(fn)

    altendky@lt:~/twisted$ venv36/bin/python ../z.py 37.zip
    3.6.6 (default, Jul 24 2018, 16:23:12)
    [GCC 6.3.0 20170516]
    37.zip
    

    ----

        altendky@lt:~/twisted$ venv37/bin/python ../z.py 36.zip
        3.7.0 (default, Jul  7 2018, 15:49:24)
        [GCC 6.3.0 20170516]
        36.zip
        Traceback (most recent call last):
          File "../z.py", line 14, in <module>
            zipfile.ZipFile(fn)
          File "/home/altendky/.pyenv/versions/3.7.0/lib/python3.7/zipfile.py", line 1200, in __init__
            self._RealGetContents()
          File "/home/altendky/.pyenv/versions/3.7.0/lib/python3.7/zipfile.py", line 1323, in _RealGetContents
            x._decodeExtra()
          File "/home/altendky/.pyenv/versions/3.7.0/lib/python3.7/zipfile.py", line 440, in _decodeExtra
            raise BadZipFile("Corrupt extra field %04x (size=%d)" % (tp, ln))
        zipfile.BadZipFile: Corrupt extra field 6568 (size=27756)

    altendky@lt:~/twisted$ sha256sum 36.zip
    0f54bd6ab84facfeefc2c38f12c30eb84101b3be3d91f8826f6fa36e73b86cb6  36.zip
    

    ----
    altendky@lt:~/twisted$ sha256sum 37.zip
    0f54bd6ab84facfeefc2c38f12c30eb84101b3be3d91f8826f6fa36e73b86cb6 37.zip

    @altendky altendky mannequin added 3.7 (EOL) end of life stdlib Python modules in the Lib dir type-crash A hard crash of the interpreter, possibly with a core dump labels Sep 7, 2018
    @tirkarthi
    Copy link
    Member

    It's also reproducible on master. git blame tells me this is introduced with feccdb2 (https://bugs.python.org/issue29774)

    ➜ cpython git:(master)
    commit feccdb2
    Author: Serhiy Storchaka <storchaka@gmail.com>
    Date: Thu Mar 9 18:34:03 2017 +0200

    bpo-29774: Improve error reporting for corrupted extra field in ZIP file. (#583)
    
    diff --git a/Lib/zipfile.py b/Lib/zipfile.py
    index b5c16dbc12..8a19ca246b 100644
    --- a/Lib/zipfile.py
    +++ b/Lib/zipfile.py
    @@ -438,7 +438,9 @@ class ZipInfo (object):
             unpack = struct.unpack
             while len(extra) >= 4:
                 tp, ln = unpack('<HH', extra[:4])
    -            if tp == 1:
    +            if ln+4 > len(extra):
    +                raise BadZipFile("Corrupt extra field %04x (size=%d)" % (tp, ln))
    +            if tp == 0x0001:
                     if ln >= 24:
                         counts = unpack('<QQQ', extra[4:28])
                     elif ln == 16:

    # Master branch

    ➜  cpython git:(master) ./python.exe
    Python 3.8.0a0 (heads/master:874809ea38, Sep  7 2018, 21:03:18)
    [Clang 7.0.2 (clang-700.1.81)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    ➜  cpython git:(master) ./python.exe ../backups/bpo34606.py 36.zip
    3.8.0a0 (heads/master:874809ea38, Sep  7 2018, 21:03:18)
    [Clang 7.0.2 (clang-700.1.81)]
    36.zip
    Traceback (most recent call last):
      File "../backups/bpo34606.py", line 14, in <module>
        zipfile.ZipFile(fn)
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/zipfile.py", line 1204, in __init__
        self._RealGetContents()
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/zipfile.py", line 1327, in _RealGetContents
        x._decodeExtra()
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/zipfile.py", line 440, in _decodeExtra
        raise BadZipFile("Corrupt extra field %04x (size=%d)" % (tp, ln))
    zipfile.BadZipFile: Corrupt extra field 6568 (size=27756)
    ➜  cpython git:(master) ✗ rm 36.zip

    # checkout feccdb2

    ➜  cpython git:(master) git checkout feccdb2a249a71be330765be77dee57121866779 Lib/zipfile.py
    ➜  cpython git:(master) ✗ ./python.exe ../backups/bpo34606.py 36.zip
    3.8.0a0 (heads/master:874809ea38, Sep  7 2018, 21:03:18)
    [Clang 7.0.2 (clang-700.1.81)]
    36.zip
    Traceback (most recent call last):
      File "../backups/bpo34606.py", line 14, in <module>
        zipfile.ZipFile(fn)
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/zipfile.py", line 1105, in __init__
        self._RealGetContents()
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/zipfile.py", line 1230, in _RealGetContents
        x._decodeExtra()
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/zipfile.py", line 442, in _decodeExtra
        raise BadZipFile("Corrupt extra field %04x (size=%d)" % (tp, ln))
    zipfile.BadZipFile: Corrupt extra field 6568 (size=27756)
    ➜  cpython git:(master) ✗ rm 36.zip

    # checkout feccdb2~1

    ➜ cpython git:(master) ✗ git checkout feccdb2~1 Lib/zipfile.py
    ➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34606.py 36.zip
    3.8.0a0 (heads/master:874809ea38, Sep 7 2018, 21:03:18)
    [Clang 7.0.2 (clang-700.1.81)]
    36.zip
    ➜ cpython git:(master) ✗ gsha256sum 36.zip
    0f54bd6ab84facfeefc2c38f12c30eb84101b3be3d91f8826f6fa36e73b86cb6 36.zip

    Hope this helps.

    I am adding Serhiy who might have a better explanation. @serhiy Feel free to unassign yourself if this is irrelevant.

    Thanks

    @serhiy-storchaka
    Copy link
    Member

    The value of the extra attribute is not an arbitrary bytes object. It is a sequence of blocks in the following format: 16-bit identifier followed by 16-bit size of the block followed by the specified amount of bytes. Your example creates invalid ZIP file.

    @altendky
    Copy link
    Mannequin Author

    altendky mannequin commented Sep 7, 2018

    Turns out the docs do document this. My apologies.

    https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

    4.3.11 Archive extra data record:

        archive extra data signature    4 bytes  (0x08064b50)
        extra field length              4 bytes
        extra field data                (variable size)
    

    Aside from the discrepancy between 16-bits and 4 bytes, it seems like something should happen, even if it's something other than 'fixing' the code to handle the malformed data. Isn't it a bug for zipfile to create a non-compliant file? Shouldn't it either check or provide an interface by which a compliant file could sensibly be created? It doesn't seem great to just expect users to rewrite this each time they call.

    (42).to_bytes(4, 'little') + len(data).to_bytes(4, 'little') + data

    or, should it be 'big'? and would it be (len(data) + 4 + 4)?

    @altendky
    Copy link
    Mannequin Author

    altendky mannequin commented Sep 7, 2018

    Python 3.7 works with 2-byte elements, I managed to find the wrong section in the doc-linked docs.

    4.5 Extensible data fields
    --------------------------

       4.5.1 In order to allow different programs and different types
       of information to be stored in the 'extra' field in .ZIP
       files, the following structure MUST be used for all
       programs storing data in this field:
    
           header1+data1 + header2+data2 . . .
    
       Each header should consist of:
    
           Header ID - 2 bytes
           Data Size - 2 bytes
    
       Note: all fields stored in Intel low-byte/high-byte order.
    

    And the test showing it working.

    ----

        import sys
        import zipfile
        
        print(sys.version)
        
        fn = sys.argv[1]
        print(fn)
        
        options = {
            'endianness': ('little', 'big'),
            'header_element_bytes': (2, 4),
            'additional_size': (0, 4, 4 + 4),
        }
        
        for endianness in options['endianness']:
            for additional_size in options['additional_size']:
                for header_element_bytes in options['header_element_bytes']:
                    print('\n\n --- trying endianness: {}, additional_size: {}, header_element_bytes: {}'.format(endianness, additional_size, header_element_bytes))
                    with zipfile.ZipFile(fn, 'w') as zf:
                        zi = zipfile.ZipInfo("0")
                        extra_data = b"hello, extra, and some more just to make it longer and such so yeah"
                        zi.extra = (
                            (42).to_bytes(header_element_bytes, endianness)
                            + (len(extra_data) + additional_size).to_bytes(header_element_bytes, endianness)
                            + extra_data
                        )
                        zf.writestr(zi, b"the real data")
        
                    try:
                        zipfile.ZipFile(fn)
                    except Exception as e:
                        print(e)
                    else:
                        print('success')

    altendky@lt:~/twisted$ python3.7 ../z.py 37.py
    3.7.0 (default, Jul  7 2018, 15:49:24)
    [GCC 6.3.0 20170516]
    37.py
    
    
     --- trying endianness: little, additional_size: 0, header_element_bytes: 2
    success
    
    
     --- trying endianness: little, additional_size: 0, header_element_bytes: 4
    Corrupt extra field 6568 (size=27756)
    
    
     --- trying endianness: little, additional_size: 4, header_element_bytes: 2
    Corrupt extra field 002a (size=71)
    
    
     --- trying endianness: little, additional_size: 4, header_element_bytes: 4
    Corrupt extra field 6568 (size=27756)
    
    
     --- trying endianness: little, additional_size: 8, header_element_bytes: 2
    Corrupt extra field 002a (size=75)
    
    
     --- trying endianness: little, additional_size: 8, header_element_bytes: 4
    Corrupt extra field 6568 (size=27756)
    
    
     --- trying endianness: big, additional_size: 0, header_element_bytes: 2
    Corrupt extra field 2a00 (size=17152)
    
    
     --- trying endianness: big, additional_size: 0, header_element_bytes: 4
    Corrupt extra field 0000 (size=10752)
    
    
     --- trying endianness: big, additional_size: 4, header_element_bytes: 2
    Corrupt extra field 2a00 (size=18176)
    
    
     --- trying endianness: big, additional_size: 4, header_element_bytes: 4
    Corrupt extra field 0000 (size=10752)
    
    
     --- trying endianness: big, additional_size: 8, header_element_bytes: 2
    Corrupt extra field 2a00 (size=19200)
    
    
     --- trying endianness: big, additional_size: 8, header_element_bytes: 4
    Corrupt extra field 0000 (size=10752)
    

    @serhiy-storchaka
    Copy link
    Member

    Right, the correct section is 4.5. All integers in ZIP files are unsigned and in the little endian format.

    @vladima
    Copy link
    Mannequin

    vladima mannequin commented Sep 12, 2018

    In this particular case looks like a crux of the problem was in the fact that compression encodes extra fields only if either zip64 is set or length of the field is larger than threshold but decompression always tries to decode it. Attached PR switches decoding to be conditioned on the presence of zip64 end of central directory record.

    @altendky
    Copy link
    Mannequin Author

    altendky mannequin commented Sep 12, 2018

    Vladimir, if compression were the cause wouldn't the extra bytes I added (signature and length) not have any effect? If you've found an issue it seems like it would be a different one than I was triggering.

    @AndreasGer
    Copy link
    Mannequin

    AndreasGer mannequin commented Jun 25, 2019

    I'm a little bit puzzled that this bug is closed as "not a bug" despite the fact that all versions of python allow the user to create "invalid" ZIP files and all version up to 3.7 are able to read those "invalid" files.

    Currently we're stuck with a lot of already created ZIP files that cannot be read or fixed after an update to 3.7.

    Would it be an option to make decoding of extra data optional eg. with an extra keyword arg to __init__?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants