Corrupt input data when using lzma to decompress a file #92018

fangq · 2022-04-28T13:43:36Z

Bug report

The built-in lzma module fails to decompress a valid .lzma file, raising the following error:

_lzma.LZMAError: Corrupt input data

The .lzma file was compressed using the public-domain C library written by Igor Pavlov, see
https://github.com/fangq/zmat/tree/master/src/easylzma/pavlov

Before compression, the binary buffer has a length of 1966104 bytes; after compression, the file mat.lzma (can be downloaded from this link) has a length of 1536957 bytes.

Running the file command on mat.lzma prints:

mat.lzma: LZMA compressed data, non-streamed, size 1966104

I was able to decompress this file using either the C library mentioned above or the Node.js script below (with either the lzma-purejs or the lzma npm module):

const fs = require('fs')
const lzma = require('lzma-purejs')

async function main() {
  var data=lzma.decompressFile(fs.readFileSync('mat.lzma'));
  console.log(data.length)
}
main().then(() => console.log('Done'))

The above script correctly decoded the buffer:

$ node testlzma.js 
1966104
Done

However, using the Python script below, I got an error:

import lzma

filename='mat.lzma'
buf=lzma.open(filename,  format=lzma.FORMAT_ALONE).read();

print(len(buf))

Error message:

$ python3 testlzma.py
Traceback (most recent call last):
  File "testlzma.py", line 4, in <module>
    buf=lzma.open(filename,  format=lzma.FORMAT_ALONE).read();
  File "/usr/lib/python3.6/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.6/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
_lzma.LZMAError: Corrupt input data

Because Igor Pavlov's C library implements the original LZMA algorithm, I believe the FORMAT_ALONE flag was used correctly.

I want to mention that the test file mat.lzma can be correctly decompressed using lzma -d on Ubuntu 20.04, but it gives an error on Ubuntu 18.04 and 22.04 (both use the XZ Utils based lzma). I believe this is because the two lzma commands are different implementations:

fangq@ubuntu20_04:~$ lzma --version
LZMA command line tool 9.22
LZMA SDK 9.22

fangq@ubuntu20_04:~$ lzma -v -d mat.lzma
mat.lzma:	 21.83% -- replaced with mat

fangq@ubuntu18_04:~$ lzma --version
xz (XZ Utils) 5.2.2
liblzma 5.2.2

fangq@ubuntu18_04$ lzma -v -d mat.lzma
mat.lzma (1/1)
 99.9 %   1,500.9 KiB / 1,920.0 KiB = 0.782                                    
lzma: mat.lzma: Compressed data is corrupt
 99.9 %   1,500.9 KiB / 1,920.0 KiB = 0.782

Your environment

Python 3.6 on Ubuntu 18.04
Python 3.8 on Ubuntu 20.04
Python 3.10 on Ubuntu 22.04

serhiy-storchaka · 2022-04-29T05:18:11Z

Does it work with FORMAT_RAW?

fangq · 2022-04-29T12:21:53Z

@serhiy-storchaka, no, FORMAT_RAW also failed
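
For context, Python's FORMAT_RAW expects an explicit filter chain and a payload without the 13-byte .lzma header, so passing the whole file with only the format changed cannot work. The following is a minimal sketch of what a raw attempt might look like, assuming the standard .lzma header layout (1 properties byte, 4-byte dictionary size, 8-byte uncompressed size). The function name decompress_alone_as_raw is hypothetical, and whether this succeeds on mat.lzma was not tested in this thread.

```python
import lzma
import struct


def decompress_alone_as_raw(blob: bytes) -> bytes:
    """Decode a .lzma (FORMAT_ALONE) buffer through the raw LZMA1 decoder.

    Hypothetical helper: the header is parsed by hand and only the payload
    after the 13-byte header is handed to the decompressor.
    """
    # Little-endian, unpadded: properties byte, dictionary size, uncompressed size.
    props, dict_size, _size = struct.unpack_from("<BIQ", blob, 0)

    # The properties byte encodes (pb * 5 + lp) * 9 + lc.
    pb, rem = divmod(props, 45)
    lp, lc = divmod(rem, 9)

    filters = [{
        "id": lzma.FILTER_LZMA1,
        "dict_size": dict_size,
        "lc": lc,
        "lp": lp,
        "pb": pb,
    }]
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
    return decomp.decompress(blob[13:])
```

The declared size is deliberately ignored here; a caller could compare it with the length of the result as a sanity check.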

kbeldan · 2022-04-30T00:55:21Z

It seems you are hitting a problem with how xz (and its liblzma library, which Python uses) handles the end-of-stream marker.
As can be seen in xz's lzma_decoder.c, lzma_decode() treats the stream as corrupt when this marker is present while the uncompressed size is known.

This doesn't match Igor Pavlov's spec in DOC/lzma-specification.txt:

If "Uncompressed size" field contains ones in all 64 bits, it means that
uncompressed size is unknown and there is the "end marker" in stream,
that indicates the end of decoding point.
In opposite case, if the value from "Uncompressed size" field is not
equal to ((2^64) - 1), the LZMA stream decoding must be finished after
specified number of bytes (Uncompressed size) is decoded. And if there
is the "end marker", the LZMA decoder must read that marker also.

xz and Python can decompress your file after replacing the uncompressed size in the header of your file with all ones (size unknown):

- 5D 00 00 10  00 18 00 1E  00 00 00 00  00 00 04 0B
+ 5D 00 00 10  00 FF FF FF  FF FF FF FF  FF 00 04 0B
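
To make the two header lines above easier to read, here is a small sketch that decodes the first 13 bytes of each, assuming the usual .lzma header layout: one properties byte, a 4-byte little-endian dictionary size, and an 8-byte little-endian uncompressed size. The function name parse_alone_header is made up for this illustration.

```python
import struct


def parse_alone_header(header: bytes) -> dict:
    """Decode the 13-byte .lzma header into its fields (illustrative helper)."""
    # '<' means little-endian without padding: B = 1 byte, I = 4 bytes, Q = 8 bytes.
    props, dict_size, size = struct.unpack_from("<BIQ", header, 0)
    # The properties byte encodes (pb * 5 + lp) * 9 + lc.
    pb, rem = divmod(props, 45)
    lp, lc = divmod(rem, 9)
    return {"lc": lc, "lp": lp, "pb": pb, "dict_size": dict_size, "size": size}


# First 13 bytes of the two lines in the diff above.
original = bytes.fromhex("5D 00 00 10 00 18 00 1E 00 00 00 00 00")
patched = bytes.fromhex("5D 00 00 10 00 FF FF FF FF FF FF FF FF")

print(parse_alone_header(original))  # size is 1966104, dict_size is 1048576
print(parse_alone_header(patched))   # size is 2**64 - 1, i.e. "unknown"
```

The original header declares exactly the 1966104 bytes reported earlier in this issue, which is the "known size" case the quoted spec describes.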

fangq · 2022-04-30T03:45:16Z

Thanks @kbeldan, very helpful! After replacing the size in the header with -1, I was indeed able to decompress the file/buffer.

fangq/pyjdata@61bb884#diff-d43e2c1f1c71d64169fcaa6b02fcdaee40d8c841335a61645029ceb263fe087fR168-R169
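
A minimal sketch of this workaround in Python, not the code from the linked commit: it overwrites the 8-byte size field at offset 5 with all ones before handing the buffer to lzma.decompress, then checks the result against the size the header originally declared. The function name decompress_alone_with_end_marker is hypothetical, and the sketch assumes the stream really does contain an end marker.

```python
import lzma
import struct

# All 64 bits set means "size unknown, rely on the end marker".
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF


def decompress_alone_with_end_marker(blob: bytes) -> bytes:
    """Decompress a .lzma buffer that has both a known size and an end marker."""
    # The uncompressed size occupies bytes 5..12 of the header (little-endian).
    (declared,) = struct.unpack_from("<Q", blob, 5)

    # Replace the size field with "unknown" so liblzma accepts the end marker.
    patched = blob[:5] + struct.pack("<Q", UNKNOWN_SIZE) + blob[13:]
    data = lzma.decompress(patched, format=lzma.FORMAT_ALONE)

    # Keep the sanity check that the original header provided.
    if declared != UNKNOWN_SIZE and len(data) != declared:
        raise ValueError(f"expected {declared} bytes, got {len(data)}")
    return data
```

If a file declares its size and has no end marker, this patch would be the wrong fix: the decoder would then run out of input while waiting for a marker that never comes.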

From reading the spec you quoted above, I still consider this a bug on the liblzma/xz side. I will close this one, but leave the bug report below open:

https://bugs.launchpad.net/ubuntu/+source/xz-utils/+bug/1970762

fangq added the type-bug label Apr 28, 2022

fangq closed this as completed Apr 30, 2022

fangq mentioned this issue Apr 30, 2022

Adding lzma.bmsh to python benchmark neurolabusc/MeshFormatsJS#8

Open

anpgab mentioned this issue Oct 31, 2022

Some BGO AssetBundles cannot be read with UnityPy after decryption hexstr/FGOAssetsModifyTool#78

Closed
