Corrupt input data when using lzma to decompress a file #92018

Closed
fangq opened this issue Apr 28, 2022 · 4 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@fangq
fangq commented Apr 28, 2022

Bug report

The built-in lzma module fails to decompress a valid .lzma file with the following error:

_lzma.LZMAError: Corrupt input data

The .lzma file was compressed using the public-domain C library written by Igor Pavlov, see
https://github.com/fangq/zmat/tree/master/src/easylzma/pavlov

Before compression, the binary buffer has a length of 1966104 bytes; after compression, the file mat.lzma (can be downloaded from this link) has a length of 1536957 bytes.

Running file mat.lzma prints:

mat.lzma: LZMA compressed data, non-streamed, size 1966104
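
The size reported by file comes from the 13-byte header of the legacy .lzma container. As a minimal sketch (the function name parse_lzma_alone_header is hypothetical, and the header bytes are the ones quoted later in this thread), the header can be decoded like this:

```python
import struct

UNKNOWN_SIZE = 2**64 - 1  # all 64 bits set: size unknown, end marker expected


def parse_lzma_alone_header(header: bytes) -> dict:
    """Decode the 13-byte header of a legacy .lzma file (hypothetical helper)."""
    if len(header) < 13:
        raise ValueError("need at least 13 header bytes")
    # Layout: 1 byte properties, 4 bytes dictionary size, 8 bytes
    # uncompressed size, all little-endian with no padding.
    props, dict_size, size = struct.unpack("<BIQ", header[:13])
    # The properties byte encodes (pb * 5 + lp) * 9 + lc.
    return {
        "lc": props % 9,
        "lp": (props // 9) % 5,
        "pb": props // 45,
        "dict_size": dict_size,
        "uncompressed_size": size,
        "size_known": size != UNKNOWN_SIZE,
    }


# Header bytes of mat.lzma as quoted later in this thread.
info = parse_lzma_alone_header(bytes.fromhex("5D0000100018001E0000000000"))
print(info["uncompressed_size"])  # 1966104
```

For this file the size field is a concrete value rather than all ones, which matters for how a decoder treats a trailing end marker.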

I was able to decompress this file using either the C library mentioned above or the NodeJS/JavaScript script below (with either the lzma-purejs or lzma npm module):

const fs = require('fs')
const lzma = require('lzma-purejs')

async function main() {
  var data=lzma.decompressFile(fs.readFileSync('mat.lzma'));
  console.log(data.length)
}
main().then(() => console.log('Done'))

The above script correctly decoded the buffer:

$ node testlzma.js 
1966104
Done

However, using the Python script below, I got an error:

import lzma

filename='mat.lzma'
buf=lzma.open(filename,  format=lzma.FORMAT_ALONE).read();

print(len(buf))

Error message:

$ python3 testlzma.py
Traceback (most recent call last):
  File "testlzma.py", line 4, in <module>
    buf=lzma.open(filename,  format=lzma.FORMAT_ALONE).read();
  File "/usr/lib/python3.6/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.6/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
_lzma.LZMAError: Corrupt input data

Because Igor Pavlov's C library implements the original lzma algorithm, I believe the FORMAT_ALONE flag was used correctly.

I want to mention that the test file mat.lzma can be correctly decompressed using lzma -d on Ubuntu 20.04, but it gives an error on Ubuntu 18.04 and 22.04 (both use the xz-utils-based lzma). I believe this is because the two lzma commands are different implementations:

fangq@ubuntu20_04:~$ lzma --version
LZMA command line tool 9.22
LZMA SDK 9.22

fangq@ubuntu20_04:~$ lzma -v -d mat.lzma
mat.lzma:	 21.83% -- replaced with mat

fangq@ubuntu18_04:~$ lzma --version
xz (XZ Utils) 5.2.2
liblzma 5.2.2

fangq@ubuntu18_04$ lzma -v -d mat.lzma
mat.lzma (1/1)
 99.9 %   1,500.9 KiB / 1,920.0 KiB = 0.782                                    
lzma: mat.lzma: Compressed data is corrupt
 99.9 %   1,500.9 KiB / 1,920.0 KiB = 0.782     

Your environment

Python 3.6 on Ubuntu 18.04
Python 3.8 on Ubuntu 20.04
Python 3.10 on Ubuntu 22.04

@fangq fangq added the type-bug An unexpected behavior, bug, or error label Apr 28, 2022
@serhiy-storchaka
Member

Does it work with FORMAT_RAW?

@fangq
Author

fangq commented Apr 29, 2022

@serhiy-storchaka, no, FORMAT_RAW also failed

@kbeldan
Contributor

kbeldan commented Apr 30, 2022

It seems you are hitting a problem with how xz (and its lzma lib used by python) handles the end-of-stream marker.
As can be seen in xz's lzma_decoder.c, lzma_decode() has a problem when this marker is present while the uncompressed size is known.

This doesn't match Igor Pavlov's specs from his DOC/lzma-specification.txt:

If "Uncompressed size" field contains ones in all 64 bits, it means that
uncompressed size is unknown and there is the "end marker" in stream,
that indicates the end of decoding point.
In opposite case, if the value from "Uncompressed size" field is not
equal to ((2^64) - 1), the LZMA stream decoding must be finished after
specified number of bytes (Uncompressed size) is decoded. And if there
is the "end marker", the LZMA decoder must read that marker also.

xz and python can decompress your file after replacing the uncompressed size at the head of your file:

- 5D 00 00 10  00 18 00 1E  00 00 00 00  00 00 04 0B
+ 5D 00 00 10  00 FF FF FF  FF FF FF FF  FF 00 04 0B
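
The same header patch can be applied in memory before handing the buffer to Python's lzma module. This is only a sketch of the workaround described above, not a fix in liblzma itself; the helper names force_unknown_size and decompress_with_end_marker are hypothetical. It assumes the stream really does contain an end marker, and it gives up the check against the original size field.

```python
import lzma

# All 64 bits set in the size field means "size unknown, end marker present".
UNKNOWN_SIZE_FIELD = b"\xff" * 8


def force_unknown_size(data: bytes) -> bytes:
    """Return a copy of a .lzma buffer with its size field set to 'unknown'."""
    if len(data) < 13:
        raise ValueError("buffer is shorter than the 13-byte .lzma header")
    # Bytes 0-4 hold the properties byte and dictionary size;
    # bytes 5-12 hold the 64-bit uncompressed size that gets overwritten.
    return data[:5] + UNKNOWN_SIZE_FIELD + data[13:]


def decompress_with_end_marker(data: bytes) -> bytes:
    """Decompress a .lzma buffer, relying on its end marker instead of its size."""
    return lzma.decompress(force_unknown_size(data), format=lzma.FORMAT_ALONE)
```

For the file in this report, the call would look like decompress_with_end_marker(open('mat.lzma', 'rb').read()), after which the length of the result can be compared with the expected 1966104 bytes by hand.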

@fangq
Author

fangq commented Apr 30, 2022

Thanks @kbeldan, very helpful! After replacing the size in the header with -1, I was indeed able to decompress the file/buffer.

fangq/pyjdata@61bb884#diff-d43e2c1f1c71d64169fcaa6b02fcdaee40d8c841335a61645029ceb263fe087fR168-R169

From reading the spec you quoted above, I still consider this a bug on the liblzma/xz side. I will close this one, but leave the bug report below open:

https://bugs.launchpad.net/ubuntu/+source/xz-utils/+bug/1970762
