lzma: stream padding in xz files #88300

Open
rogdham mannequin opened this issue May 14, 2021 · 2 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

rogdham mannequin commented May 14, 2021

BPO 44134
Nosy @animalize, @Rogdham
Files
  • example1.xz
  • example2.xz
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-05-14.17:52:02.842>
    labels = ['type-bug', '3.8', '3.9', '3.10', '3.11', '3.7', 'library']
    title = 'lzma: stream padding in xz files'
    updated_at = <Date 2021-05-16.10:27:53.156>
    user = 'https://github.com/rogdham'

    bugs.python.org fields:

    activity = <Date 2021-05-16.10:27:53.156>
    actor = 'rogdham'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2021-05-14.17:52:02.842>
    creator = 'rogdham'
    dependencies = []
    files = ['50044', '50045']
    hgrepos = []
    issue_num = 44134
    keywords = []
    message_count = 2.0
    messages = ['393681', '393738']
    nosy_count = 3.0
    nosy_names = ['nadeem.vawda', 'malin', 'rogdham']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue44134'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']

    Hello,

    The lzma module does not work well with XZ stream padding. Depending on the case, decoding may succeed; it may stop prematurely without an error; it may raise an error; or it may fail to raise an error when one is required.

    In the XZ file format, stream padding is a run of null bytes (whose length is a multiple of 4) that may appear between streams and after the last stream.

    From the specification (section 2.2):

    Only the decoders that support decoding of concatenated Streams MUST support Stream Padding.

    Since the lzma module supports decoding of concatenated streams, it must support stream padding as well.
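As a rough illustration of the padding rule quoted above, a check on a candidate padding region could look like the following. The helper name `is_valid_stream_padding` is made up for this sketch; the lzma module has no such function.

```python
def is_valid_stream_padding(padding: bytes) -> bool:
    """Return True if `padding` is made only of null bytes and its
    length is a multiple of 4 (an empty padding is accepted too).

    Hypothetical helper for illustration only.
    """
    # strip() removes leading and trailing null bytes; anything left
    # over means a non-null byte was present somewhere in the region.
    return len(padding) % 4 == 0 and not padding.strip(b"\x00")
```

With this check, the 4 null bytes of example1.xz pass, while the 18 null bytes of example2.xz do not.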

    #### Examples to reproduce the issue:

    1. example1.xz:
      • made of one stream followed by 4 null bytes:
        $ (echo 'Hi!' | xz; head -c 4 /dev/zero) > example1.xz
      • will raise an exception in both modes (FORMAT_AUTO and FORMAT_XZ)
    >>> with lzma.open('/example1.xz', format=lzma.FORMAT_AUTO) as f:
    ...     f.read()
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib/python3.9/lzma.py", line 200, in read
        return self._buffer.read(size)
      File "/usr/lib/python3.9/_compression.py", line 99, in read
        raise EOFError("Compressed file ended before the "
    EOFError: Compressed file ended before the end-of-stream marker was reached
    >>> with lzma.open('/example1.xz', format=lzma.FORMAT_XZ) as f:
    ...     f.read()
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib/python3.9/lzma.py", line 200, in read
        return self._buffer.read(size)
      File "/usr/lib/python3.9/_compression.py", line 99, in read
        raise EOFError("Compressed file ended before the "
    EOFError: Compressed file ended before the end-of-stream marker was reached
    2. example2.xz:
      • made of two streams with 18 null bytes of stream padding between them
        $ (echo 'Hi!' | xz; head -c 18 /dev/zero; echo 'Second stream' | xz) > example2.xz
      • second stream will be ignored with FORMAT_XZ
      • the two streams will be decoded with FORMAT_AUTO, whereas an error should be raised (18 is not a multiple of 4, so the stream padding is invalid according to the XZ specification and the decoder “MUST indicate an error”)
    >>> with lzma.open('/tmp/example2.xz', format=lzma.FORMAT_AUTO) as f:
    ...     f.read()
    ...
    b'Hi!\nSecond stream\n'
    >>> with lzma.open('/tmp/example2.xz', format=lzma.FORMAT_XZ) as f:
    ...     f.read()
    ...
    b'Hi!\n'
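For anyone without the xz command-line tool at hand, the two inputs can probably be approximated in memory with the lzma module itself. This is only a sketch: `build_examples` is a hypothetical name, and the bytes produced by `lzma.compress` are not guaranteed to be identical to what the `xz` CLI writes, although the layout (stream, padding, stream) is the same.

```python
import lzma


def build_examples() -> tuple[bytes, bytes]:
    """Build in-memory equivalents of example1.xz and example2.xz.

    Hypothetical helper; `echo` appends a newline, hence the b"\n".
    """
    first = lzma.compress(b"Hi!\n")             # one complete .xz stream
    second = lzma.compress(b"Second stream\n")  # another complete .xz stream

    example1 = first + b"\x00" * 4              # valid padding after the stream
    example2 = first + b"\x00" * 18 + second    # invalid padding (18 % 4 != 0)
    return example1, example2
```

The results can be wrapped in `io.BytesIO` and passed to `lzma.open` to repeat the sessions shown above.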

    #### Analysis

    This issue comes from the interaction between _lzma and _compression. In _lzma, the C library is called without the LZMA_CONCATENATED flag, which means that multiple streams and stream padding must be handled in Python.

    In _compression, when an LZMADecompressor is done (.eof is True), another one is created to decompress from that point. If the new one fails to decompress the remaining data, the LZMAError is ignored and we assume we reached the end.

    So the behavior seen above can be explained as follows:

    • In FORMAT_AUTO, it seems that .eof stays False until 18 bytes have been read
    • In FORMAT_AUTO, 18 null bytes are decompressed as b'', with .eof being True afterwards
    • In FORMAT_XZ, it seems that .eof stays False until 12 bytes have been read
    • In FORMAT_XZ, stream padding is never accepted as a stream, so as soon as we have more than 12 bytes an LZMAError is raised
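The loop described in this analysis can be modeled in a few lines. The sketch below is a simplified, in-memory approximation and not the actual `_compression` code: `read_all` is a hypothetical name, and the real reader works on a file object in chunks. It only uses the public `LZMADecompressor` attributes `eof` and `unused_data`.

```python
import lzma


def read_all(data: bytes, fmt: int = lzma.FORMAT_AUTO) -> bytes:
    """Simplified model of the multi-stream loop (hypothetical helper)."""
    decomp = lzma.LZMADecompressor(format=fmt)
    chunks = [decomp.decompress(data)]
    while True:
        if not decomp.eof:
            # All input was consumed but the decoder still expects more.
            raise EOFError("Compressed file ended before the "
                           "end-of-stream marker was reached")
        rest = decomp.unused_data
        if not rest:
            break
        # Bytes remain after the stream: try them as a new stream.
        decomp = lzma.LZMADecompressor(format=fmt)
        try:
            chunks.append(decomp.decompress(rest))
        except lzma.LZMAError:
            # Leftover data is not a valid stream; silently stop here.
            break
    return b"".join(chunks)
```

In this model, a stream followed by 4 null bytes ends in `EOFError`, because the second decompressor accepts the 4 bytes without complaint and then waits for more input, which matches the traceback reported for example1.xz.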

    #### Possible solution

    A possible solution would be to add a finish method to the decompressor interface, and support it appropriately in _compression when we reach EOF on the input. Then, in the LZMADecompressor implementation, use the LZMA_CONCATENATED flag, and implement the finish method to call lzma_code with LZMA_FINISH as the action.

    I think this would be preferable to trying to solve the issue in Python, because if the format is FORMAT_AUTO we don't know whether the format is XZ (in which case we should support stream padding) or not.
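For comparison, here is what a Python-level approach might look like when the format is known to be XZ. This is explicitly not the fix proposed above: it is a sketch restricted to FORMAT_XZ, it works on a complete bytes object rather than a file, and `decompress_xz_with_padding` is a hypothetical name. It also rejects trailing junk, which the lzma module currently tolerates.

```python
import lzma


def decompress_xz_with_padding(data: bytes) -> bytes:
    """Decode concatenated .xz streams, honoring stream padding.

    Hypothetical helper, FORMAT_XZ only.
    """
    chunks = []
    while True:
        decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise lzma.LZMAError("Compressed data ended before the "
                                 "end-of-stream marker was reached")
        rest = decomp.unused_data
        # Stream padding: null bytes, length must be a multiple of 4.
        stripped = rest.lstrip(b"\x00")
        if (len(rest) - len(stripped)) % 4:
            raise lzma.LZMAError("Invalid stream padding")
        if not stripped:
            break
        data = stripped  # next stream starts right after the padding
    return b"".join(chunks)
```

The limitation mentioned above still applies: with FORMAT_AUTO the caller cannot tell whether null bytes are XZ stream padding or the start of a .lzma stream, so this kind of workaround does not generalize.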

    @rogdham rogdham mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels May 14, 2021
    rogdham mannequin commented May 16, 2021

    It must be decided what to do in the following cases, which are not valid per the XZ file specification, but supported by the lzma module (and tested):

    1. different formats concatenated together (e.g. a .xz and a .lzma); this somehow includes trailing null bytes (12 null bytes is a valid .lzma)
    2. trailing junk (i.e. non-null bytes after the stream)

    The answer may be different depending on the format arg (e.g. FORMAT_AUTO vs FORMAT_XZ).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022