
root file, written with uproot, size issue #345

Closed
marinang opened this issue Sep 24, 2019 · 26 comments · Fixed by #368


@marinang
Member

marinang commented Sep 24, 2019

Hi,

it seems that files written with uproot come out very large. I tried to write a subset of a dataset (less than 1%), with fewer variables, to another ROOT file, and the size of the output is of the same order of magnitude as the size of the original ROOT file.

I tried to play with adding compression as described in the README,

tree = uproot.newtree(branchdict, compression=uproot.LZ4(4))

and also reduced the data type of the branches from float64 to float32. This does reduce the size of the output ROOT file, but it is still very large.

@reikdas
Collaborator

reikdas commented Sep 24, 2019

Would it be possible for you to share the code used to write the file?

@marinang
Member Author

marinang commented Sep 24, 2019

with uproot.recreate(file_out) as f:
        f["DecayTree"] = uproot.newtree({b:np.float32 for b in allbranches}, compression=uproot.LZ4(1))
        
        for d in datasets:
            f["DecayTree"].extend(dict(d))

where datasets is the list of datasets (70 of them, ~5k events each, with 269 variables) that I write to the ROOT file.

@reikdas
Collaborator

reikdas commented Sep 24, 2019

Could you try -

with uproot.recreate(file_out) as f:
        f["DecayTree"] = uproot.newtree({b:np.float32 for b in allbranches}, flushsize="5 MB", compression=uproot.LZ4(1))
        
        for d in datasets:
            f["DecayTree"].extend(dict(d))

The flushsize was incorrectly set to 30 KB instead of our intended default of 1 MB (@jpivarski I think 1 MB as the default seems okay?). See #346.
The basket flushing behaviour is a little suspect right now (most noticeably, it is slow) but will be fixed very soon.
But please do note that ROOT files written with uproot will usually be just a bit larger. (Not by a factor of almost 100 like in this case though!)

As an aside, is the ROOT file you are comparing this to also compressed with LZ4 level 1? LZ4, in general, does not have a very good compression ratio, and level 1 is the level with the lowest compression ratio.
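
If the comparison file uses ZLIB instead, matching it would be a one-line change to the snippet above (a sketch only; this assumes uproot exposes a ZLIB compression object analogous to the LZ4 one used above):

with uproot.recreate(file_out) as f:
    # same tree definition as above, but ZLIB level 1 instead of LZ4 level 1
    f["DecayTree"] = uproot.newtree({b: np.float32 for b in allbranches},
                                    flushsize="5 MB",
                                    compression=uproot.ZLIB(1))

    for d in datasets:
        f["DecayTree"].extend(dict(d))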

@marinang
Member Author

Ok, thanks, I will try.

As an aside, is the ROOT file you are comparing this to also compressed with LZ4 level 1? LZ4, in general, does not have a very good compression ratio, and level 1 is the level with the lowest compression ratio.

I am comparing with a ROOT file compressed with ZLIB at level 1. I actually was not sure whether 1 was a low compression level or a high one.

@marinang
Member Author

marinang commented Sep 24, 2019

I confirm this works and the produced ROOT files are much smaller (by a factor of ~100).

@jpivarski
Member

jpivarski commented Sep 24, 2019

As per #340, what about using newbasket to avoid flushing altogether?

for d in datasets:
    for branch, array in d.items():
        f["DecayTree"][branch].newbasket(array)

This way, we can narrow down to see if the issue is related to extend (new code, still being worked on) or basket-writing in general (old code, which we think is working well).

From your original description, the output file is 100× larger than you expect. That sort of thing is not going to be fixed by compression, especially compression on floating point numbers. (Compression on strings can be high, depending on their regularity; compression on integers generally isn't better than a factor of 2 unless they're really degenerate; and it's hard to get even a factor of 2 on floating point numbers, especially float32, whose exponent and fraction don't line up with 8-bit boundaries and so can be missed by compression algorithms that use bytes as tokens, which I've heard is true of LZ4.) So we're not looking for any issues related to compression—you might even want to use no compression while diagnosing the error to simplify the debugging (i.e. you know exactly how big each basket should be when it's uncompressed).
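
As a rough standalone illustration of that point (standard-library zlib on random float32 data, not uproot itself):

import zlib
import numpy as np

# random float32 values: high-entropy mantissas, so a byte-oriented compressor gains little
data = np.random.normal(0, 1, 1000000).astype(np.float32).tobytes()
compressed = zlib.compress(data, 1)
print(len(compressed) / len(data))  # typically well above 0.8, nowhere near a factor of 2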

@marinang Could you share the data to be written? By your estimates ("70, ~5k events each, with 269 variables"), it should be 0.35 GB uncompressed. You can put a dictionary of arrays into a file that Numpy can easily open with numpy.savez, and because of the simplicity of these files, it should be pretty much exactly 0.35 GB. (The ROOT file will be somewhat larger than this, but the benefit is that anyone in HEP knows how to use it. Our goal is to make it not too much larger than this.)
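
A minimal sketch of that suggestion (the array names and contents here are placeholders, not the real dataset):

import numpy as np

# placeholder stand-in for the real branches: 269 float32 arrays of 350k entries each
arrays = {"branch_%d" % i: np.zeros(350000, dtype=np.float32) for i in range(269)}

np.savez("to_share.npz", **arrays)   # one .npz file containing every array by name
loaded = np.load("to_share.npz")     # loaded["branch_0"] recovers an array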

With that file, @reikdas should be able to reproduce and begin debugging the error. Thanks!

@jpivarski
Member

I confirm this works and the produced ROOT files are much smaller (by a factor of ~100).

Wait—what worked? Changing the compression or the flush size?

@reikdas
Collaborator

reikdas commented Sep 24, 2019

Wait—what worked? Changing the compression or the flush size?

@jpivarski If you look above, I modified his code with flushsize="5 MB" and told him to try it. It will also work with the intended default of 1 MB (I did not want bytes added by the ROOT format to muddy the sizes, so I just suggested a larger number :) - if the file size had not decreased as expected, a larger flushsize would have made things clearer to me).

@marinang
Member Author

marinang commented Sep 24, 2019

@marinang Could you share the data to be written? By your estimates ("70, ~5k events each, with 269 variables"), it should be 0.35 MB uncompressed. You can put a dictionary of arrays into a file that Numpy can easily open with numpy.savez, and because of the simplicity of these files, it should be pretty much exactly 0.35 MB. (The ROOT file will be somewhat larger than this, but the benefit is that anyone in HEP knows how to use it. Our goal is to make it not too much larger than this.)

I am actually getting a file that has a ~350 MB size (70 times the flushsize?).

@jpivarski
Member

A 30 kB flush size did that much damage? (This is not an unusual flush size for C++ ROOT.) A basket has about 80 bytes of overhead, so if we're really writing 30 kB to each basket, we should be in the asymptotic limit where larger basket sizes make a small percentage difference. We like megabyte baskets because it's a better Numpy-to-Python ratio (speed, not size).

5000 events × 70 extend calls × 4 bytes each (not counting different variables in different branches) means a total of about 1367 kB per branch. Thus, moving from a 30 kB flush size to a 1 MB flush size means moving from 45 flushes to 1 flush. The overhead of 45 flushes × 269 branches = 12000 flushes is one hundred times larger than his 0.35 GB of real data? The PR you just opened (#346) is not a bad idea, but I'm suspicious that it might be hiding the real problem.
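
Spelling that arithmetic out (nothing here beyond the numbers above):

per_branch = 5000 * 70 * 4                  # 1,400,000 bytes ≈ 1367 kB of data per branch
flushes_30kB = per_branch // (30 * 1024)    # ≈ 45 flushes per branch at a 30 kB flush size
total_flushes = flushes_30kB * 269          # ≈ 12,000 flushes across all branches
overhead = total_flushes * 80               # ≈ 1 MB of basket overhead, nowhere near 0.35 GB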

@jpivarski
Member

I am actually getting a file that has a ~350 MB size (70 times the flushsize?).

@marinang Bad math on my side (I fixed my comment); it's 0.35 GB because I was dividing by 1024**3, not 1024**2. Sorry!

@reikdas
Collaborator

reikdas commented Sep 24, 2019

@jpivarski The 30 KB is the default flushsize defined in the tree, which means that it gets divided by the number of branches (269 in this case). So the number of flushes per branch is way more than 45.
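
In rough numbers, that division means (the same arithmetic as above, applied per branch):

per_branch_flush = 30 * 1024 / 269          # ≈ 114 bytes of buffer per branch
values_per_flush = per_branch_flush / 4     # ≈ 28 float32 values per basket
flushes_per_branch = 350000 / values_per_flush   # ≈ 12,000 flushes per branch, not 45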

@jpivarski
Member

Ohhhhhhh. Yeah, that could be it. If that's what's going on, then it's an understood bug and the change in default is a good fix (and we might want to go higher than 1 MB, since this is for the whole tree). 200 variables is not unusual; NanoAOD has 1000 variables. How much is a reasonable "total working memory"? I'd think maybe 100 MB (because computers have on the order of GB of total RAM)?

All this time, I had forgotten and was thinking we were talking about per-branch sizes, not per-tree sizes.

@jpivarski
Member

What do you think about 100 MB?

@reikdas
Collaborator

reikdas commented Sep 24, 2019

@jpivarski We could set the default flushsize at the tree level depending on whether the number of branches crosses certain thresholds?

@jpivarski
Member

That defeats the purpose of setting it on the TTree (if it's a number that scales with the number of TBranches, then it would effectively be a size per TBranch). I was overthinking it—the parameter that a user knows most readily is how much working space they have; getting how much that means per branch is an extra step of derivation.

Since laptops and servers have several to hundreds of GB, 100 MB won't hurt anybody for available space, and 100 MB / 1000 branches is 0.1 MB per branch, which is a nicely large size.

reikdas added a commit that referenced this issue Sep 24, 2019
@reikdas
Collaborator

reikdas commented Sep 24, 2019

I have updated the PR to make the default flushsize 100 MB.

@jpivarski
Member

This looks good; I think it's resolved and you can merge/deploy the PR.

@marinang
Member Author

marinang commented Oct 4, 2019

I was looping through ROOT files with uproot.iterate, applying a selection, and then writing the selected data to a new ROOT file using the low-level interface. In total there are ~1M events and 600 branches written. In each uproot.iterate loop, ~1k events are added to each branch with newbasket. I end up with a file that is approximately 40 GB, while I was expecting around 2 GB. (The flushsize is the default one, so 30 MB.)
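
Schematically, the loop looks like this (a sketch, not the actual script; branch names, file names, and the selection are placeholders):

import numpy as np
import uproot

branches = ["branch_a", "branch_b"]          # placeholder for the ~600 real branch names

with uproot.recreate("selected.root") as f:
    f["DecayTree"] = uproot.newtree({b: np.float32 for b in branches})
    # namedecode="utf-8" gives string keys in uproot 3.x
    for df in uproot.iterate("input_*.root", "DecayTree", branches, namedecode="utf-8"):
        mask = df["branch_a"] > 0            # placeholder selection
        for b in branches:
            f["DecayTree"][b].newbasket(df[b][mask])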

EDIT:

Actually, I also got an error:

Traceback (most recent call last):
  File "trigger_merge.py", line 104, in <module>
    merge_and_apply_df("Data", "649")
  File "trigger_merge.py", line 95, in merge_and_apply_df
    f["DecayTree"][b].newbasket(df[b])
  File "/afs/cern.ch/work/m/mmarinan/anaconda3/envs/analysis/lib/python3.7/site-packages/uproot/write/objects/TTree.py", line 334, in newbasket
    tree.write(tree.file, cursor, self._treelvl1._tree.write_name, tree.write_key, copy(tree.keycursor), Util())
  File "/afs/cern.ch/work/m/mmarinan/anaconda3/envs/analysis/lib/python3.7/site-packages/uproot/write/objects/TTree.py", line 742, in write
    uproot.write.compress.write(context, copy(self.tree_write_cursor), givenbytes, None, key, copy(self.write_keycursor))
  File "/afs/cern.ch/work/m/mmarinan/anaconda3/envs/analysis/lib/python3.7/site-packages/uproot/write/compress.py", line 65, in write
    cursor.write_data(context._sink, givenbytes)
  File "/afs/cern.ch/work/m/mmarinan/anaconda3/envs/analysis/lib/python3.7/site-packages/uproot/write/sink/cursor.py", line 75, in write_data
    self.update_data(sink, data)
  File "/afs/cern.ch/work/m/mmarinan/anaconda3/envs/analysis/lib/python3.7/site-packages/uproot/write/sink/cursor.py", line 72, in update_data
    sink.write(data, self.index)
  File "/afs/cern.ch/work/m/mmarinan/anaconda3/envs/analysis/lib/python3.7/site-packages/uproot/write/sink/file.py", line 18, in write
    self._sink.write(data)
OSError: [Errno 27] File too large

@reikdas
Collaborator

reikdas commented Oct 4, 2019

If you are using the low-level newbasket method to fill data and write baskets to the file, the flush size is not taken into consideration: each newbasket call creates a new basket.

Coming back to the problem, @jpivarski, I do not think this is because of the TTree reallocation growth factor. In this case, I think the TTree is being reallocated 14 times; in the last few reallocations the basket header bytes being rewritten are large (~57 MB), and the last write leaves some extra allocated space unused (by my calculations, about ~1 MB). Even accounting for the increasing size of the basket buffers (fBasketEntry, fBasketSeek and fBasketBytes), the additional space being wasted is on the order of ~10 MB per rewrite, so that should not account for a 38 GB increase in size. Do you have any idea off the top of your head why this could be happening? I have confirmed that the basket data is not being rewritten.

@jpivarski
Member

Can you reproduce his example, logging each time that the TTree metadata (or anything else) gets rewritten? It doesn't have to be real data, just as many branches and events as his dataset. Probably the rewrite algorithm is behaving differently than we expect.

@reikdas
Collaborator

reikdas commented Oct 4, 2019

In this case, I think the TTree is being reallocated 14 times; in the last few reallocations the basket header bytes being rewritten are large (~57 MB), and the last write leaves some extra allocated space unused (by my calculations, about ~1 MB). Even accounting for the increasing size of the basket buffers (fBasketEntry, fBasketSeek and fBasketBytes), the additional space being wasted is on the order of ~10 MB per rewrite

Disregard this - I had forgotten that the basket information is local to each branch.

According to our current reallocation algorithm, the TTree is indeed rewritten 3000 times (5 times per branch for 600 branches), which means that everything is rewritten each time (all 600 branches, each of which contains fBasketEntry, fBasketSeek and fBasketBytes, which are long buffers), so I am not too surprised that the file is larger by that extent.
@jpivarski We need to find a better growth factor or let the user define some sort of preallocated buffer size.

@jpivarski
Member

At least in our discussions, we came up with an algorithm that wouldn't have TTree metadata rewriting scale with the number of branches. If one branch triggers a rewrite (it passes 10 baskets or whatever, so that there's no more room in its fBasket* arrays), didn't we decide that all branches should get larger fBasket* arrays in the rewrite?

In the simplest case (probably Matt's case), each basket has the same number of entries, so if one branch triggers a rewrite, then all the other branches are guaranteed to trigger a rewrite immediately thereafter because they'll have the same number of baskets as soon as they get filled. In more complex cases, the above is only approximately true, but still worth preparing for.

So this is the algorithm: as soon as any one branch in the TTree needs larger fBasket* arrays, rewrite the TTree with all branches having 10× larger fBasket* arrays. In this algorithm, the number of TTree-rewrites does not scale with the number of branches, and it only scales logarithmically with the number of entries.
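
A sketch of that policy (illustrative classes only, not uproot's internals):

GROWTH_FACTOR = 10

class Branch:
    def __init__(self, name, capacity=10):
        self.name = name
        self.num_baskets = 0
        self.basket_capacity = capacity      # room in the fBasket* arrays

class Tree:
    def __init__(self, branches):
        self.branches = branches
        self.metadata_rewrites = 0

    def add_basket(self, branch):
        if branch.num_baskets == branch.basket_capacity:
            # one branch ran out of room: grow every branch's arrays and rewrite once
            for b in self.branches:
                b.basket_capacity *= GROWTH_FACTOR
            self.metadata_rewrites += 1
        branch.num_baskets += 1

# 600 branches, 2000 baskets each: rewrites scale with log10(baskets), not with branches
tree = Tree([Branch("b%d" % i) for i in range(600)])
for _ in range(2000):
    for br in tree.branches:
        tree.add_basket(br)
print(tree.metadata_rewrites)   # 3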

@reikdas
Collaborator

reikdas commented Oct 4, 2019

So this is the algorithm: as soon as any one branch in the TTree needs larger fBasket* arrays, rewrite the TTree with all branches having 10× larger fBasket* arrays.

Ah okay. Thanks!
Sorry, I must have skipped over implementing this. Anyway, it is a trivial fix. :)

@reikdas
Collaborator

reikdas commented Oct 4, 2019

@marinang Can you take a look at whether #368 fixes your issue?

@marinang
Member Author

marinang commented Oct 6, 2019

@reikdas it seems to work. I tried it with the following example:

In [4]: with uproot.recreate("example.root") as f:
   ...:     f["t"] = uproot.newtree({f"branch_{i}": np.float32 for i in range(500)})
   ...:     for j in range(1000):
   ...:         for i in range(500):
   ...:             f["t"][f"branch_{i}"].newbasket(np.random.normal(0, 1, 1000))

and I obtain a 2 GB file, as expected.
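
For reference, the expected payload from that test is easy to check:

# 500 branches × 1000 newbasket calls × 1000 entries × 4 bytes (float32)
print(500 * 1000 * 1000 * 4)   # 2,000,000,000 bytes ≈ 2 GB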
