
Error while reading AOD file #16

Closed
vkuznet opened this issue Nov 4, 2017 · 4 comments

Comments

@vkuznet

vkuznet commented Nov 4, 2017

Jim, now I'm trying to read the AOD file
/afs/cern.ch/user/v/valya/public/C84930B2-7C55-E711-B915-02163E014722.root
and it fails right away:

    for branchname in tree.arrays().keys():
        print(branchname)

gives

Traceback (most recent call last):
  File "vk_test.py", line 38, in <module>
    branchNames(eTree)
  File "vk_test.py", line 24, in branchNames
    for branchname in tree.arrays().keys():
  File "/Users/vk/Work/Languages/Python/GIT/uproot/uproot/tree.py", line 458, in arrays
    outi, res = branch.array(dtype, executor, False)
  File "/Users/vk/Work/Languages/Python/GIT/uproot/uproot/tree.py", line 1427, in array
    return TBranch.array(self, dtype, executor, block)
  File "/Users/vk/Work/Languages/Python/GIT/uproot/uproot/tree.py", line 1182, in array
    out[start:end] = self._basket(i, parallel=False)
  File "/Users/vk/Work/Languages/Python/GIT/uproot/uproot/tree.py", line 857, in _basket
    self._basketwalkers[i]._evaluate(parallel)
  File "/Users/vk/Work/Languages/Python/GIT/uproot/uproot/_walker/lazyarraywalker.py", line 54, in _evaluate
    string = self._original_function(walker.readbytes(length))
  File "/Users/vk/Work/Languages/Python/GIT/uproot/uproot/rootio.py", line 84, in <lambda>
    return lambda x: zlib_decompress(x[9:])
error: Error -3 while decompressing data: incorrect header check
@jpivarski
Member

Well, it's not quite immediate because you've asked uproot to interpret and convert all arrays in the tree (tree.arrays()), then only return their names (.keys()), which could be accomplished without all the heavy calculations by just tree.allbranchnames.
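
For reference, here is a minimal sketch of that lighter-weight approach (assuming the uproot 1.x API discussed in this thread, where the tree object exposes allbranchnames; the path and tree name are the ones from this report):

    import uproot

    # Open the file and grab the Events tree.
    tree = uproot.open("/afs/cern.ch/user/v/valya/public/C84930B2-7C55-E711-B915-02163E014722.root")["Events"]

    # Listing names doesn't read or decompress any baskets, unlike
    # tree.arrays().keys(), which interprets every branch first.
    for branchname in tree.allbranchnames:
        print(branchname)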

However, it's a real bug. I've looked at this in both versions and as it scans through branches, reading them all out, it encounters this one:

GlobalObjectMapRecord_hltGtStage2ObjectMap__HLT.obj.m_gtObjectMap.m_algoBitNumber

which somehow is failing to decompress. The branch has the same compression parameters as the file (in principle, they can be different, and I haven't handled that yet), and it's just zlib-7.
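
(As a reminder of how that "zlib-7" is encoded: ROOT packs the compression setting as algorithm * 100 + level, so zlib level 7 is fCompress = 107. A tiny sketch, with an illustrative helper name:)

    # Hypothetical helper: split a ROOT fCompress value into (algorithm, level).
    # ROOT encodes the setting as algorithm * 100 + level (1 = zlib, 2 = LZMA, 4 = LZ4).
    def decode_fcompress(fcompress):
        algorithm, level = divmod(fcompress, 100)
        return {1: "zlib", 2: "lzma", 4: "lz4"}.get(algorithm, "unknown"), level

    print(decode_fcompress(107))  # ('zlib', 7)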

Seeking to this point in the file, I find what looks like the right kind of header (starting with "XZ"), which is supposed to be 9 bytes long, followed by zlib data. Python's zlib then complains about the format of that data.

01613360  58 5a 00 28 02 00 00 1c  00 fd 37 7a 58 5a 00 00  |XZ.(......7zXZ..|
01613400  01 69 22 de 36 02 00 21  01 00 00 00 00 37 27 97  |.i".6..!.....7'.|
01613420  d6 e0 1b ff 01 ec 5d 00  00 60 04 08 ca 06 5b f5  |......]..`....[.|
01613440  fa 66 ca fc 2c 3c 41 57  c9 09 f4 5f f9 55 48 b6  |.f..,<AW..._.UH.|
01613460  73 22 9b fe 54 36 56 93  d7 91 c5 94 58 f5 b0 d7  |s"..T6V.....X...|
01613500  c9 03 c0 fd dc f0 9e 3d  2a 61 2e 81 2f 2e 1c 2d  |.......=*a../..-|
01613520  42 88 81 7b 45 66 a2 bb  69 f2 06 b8 f7 bb bc 1d  |B..{Ef..i.......|
01613540  41 24 7a ec 6d fc c1 08  0f 48 8b 88 11 b2 0c 76  |A$z.m....H.....v|
01613560  c0 c1 87 6b bb b5 25 16  29 da 87 3d 32 e7 24 25  |...k..%.)..=2.$%|
01613600  69 7f 08 81 a4 cd a3 f3  7f c5 be 3c 2f 6a 49 13  |i..........</jI.|

It looks wrong to me, too: after the 9-byte header, there are only 3 bytes of data before another "XZ". That could be part of the compressed data, but it would be a strange coincidence for the compressed data to also contain "XZ"; it looks more like another block of compressed data. Could the compressed payload really be just 3 bytes long? Maybe Python's zlib has a problem with that (maybe it would rather the data be padded...).

According to the basket header, that compressed data is supposed to be 561 bytes compressed and 7168 uncompressed, which makes me even more suspicious. I'll have to come back to this.
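
As a sanity check, the 9-byte header at the start of the dump can be unpacked by hand. This is a sketch that assumes ROOT's usual compressed-record layout (a 2-byte algorithm tag, a 1-byte method, then the compressed and uncompressed sizes as 3-byte little-endian integers):

    # First 9 bytes from the dump above.
    header = bytes([0x58, 0x5A, 0x00, 0x28, 0x02, 0x00, 0x00, 0x1C, 0x00])

    tag = header[0:2]                                                # b"XZ"
    method = header[2]                                               # 0
    compressed = header[3] | (header[4] << 8) | (header[5] << 16)    # 552
    uncompressed = header[6] | (header[7] << 8) | (header[8] << 16)  # 7168

    print(tag, method, compressed, uncompressed)

If that layout is right, the sizes come out to 552 and 7168: 552 bytes of payload plus the 9-byte header is the 561 compressed bytes the basket reports, and 7168 matches the expected uncompressed size.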

@jpivarski
Copy link
Member

This is one of those wake-you-up-in-the-middle-of-the-night things. It was staring me in the face. The file declared the compression to be zlib-7, but the two-character header of this compressed block is "XZ", which means LZMA. Apparently, when fCompress and that two-character header disagree, the two-character header has precedence.

So instead of using fCompress to determine which compression algorithm to use, we should use the first two bytes of the 9-byte compressed block header. Here's where it is in C++ ROOT:

https://github.com/root-project/root/blob/5b7d9393c1c0c242be452510ac8ddf08bd492d40/core/zip/src/RZip.cxx#L348

They check is_valid_header_zlib, is_valid_header_lzma, and is_valid_header_lz4 to determine the compression algorithm on the spot, rather than relying on the file's or branch's own fCompress. I would have thought that fCompress ought to agree, but okay.
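
A minimal sketch of that kind of dispatch in Python (the helper name and error handling are illustrative, not uproot's actual code; ROOT's tags are "ZL" for zlib, "XZ" for LZMA, and "L4" for LZ4):

    import zlib
    import lzma  # built into Python 3

    # Hypothetical dispatcher: choose the decompressor from the block's own
    # 2-byte tag, mirroring what RZip.cxx does, instead of trusting fCompress.
    def decompress_block(block):
        tag = block[:2]
        payload = block[9:]  # compressed data follows the 9-byte header
        if tag == b"ZL":
            return zlib.decompress(payload)
        elif tag == b"XZ":
            return lzma.decompress(payload)
        elif tag == b"L4":
            raise NotImplementedError("LZ4 needs a third-party package")
        else:
            raise ValueError("unrecognized compression tag: %r" % tag)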

Now we can read in all those arrays, including the LZMA ones. Check out version 1.6.2 from GitHub.

python -i -c 'import uproot; t = uproot.open("/afs/cern.ch/user/v/valya/public/C84930B2-7C55-E711-B915-02163E014722.root")["Events"]; print(t.arrays())'

@vkuznet
Author

vkuznet commented Nov 4, 2017 via email

@jpivarski
Member

I had committed it, but it wasn't a release. I just uploaded it to PyPI and made a formal GitHub release. (I always do these two things together to keep the version numbers in sync: it's easy to botch a version number on PyPI.)

As for explaining the problem, I was just thinking out loud. This detail (whether to trust fCompress or the compressed buffer header when the two are in conflict) is exactly the sort of reason we need multiple implementations of ROOT I/O, to spread the knowledge.
