root file, written with uproot, size issue #345
Would it be possible for you to share the code used to write the file?

```python
with uproot.recreate(file_out) as f:
    f["DecayTree"] = uproot.newtree({b: np.float32 for b in allbranches}, compression=uproot.LZ4(1))
    for d in datasets:
        f["DecayTree"].extend(dict(d))
```

where `datasets` are the datasets (70, ~5k events each, with 269 variables) that I write to the ROOT file.
Could you try adding `flushsize="5 MB"` to the `uproot.newtree` call?
The flushsize was incorrectly set to 30 KB instead of our intended default of 1 MB (@jpivarski I think 1 MB as the default seems okay?). See #346. As an aside, is the ROOT file you are comparing this to also compressed with LZ4 level 1? LZ4, in general, does not have a very good compression ratio, and level 1 is the level with the lowest compression ratio.
Ok, thanks, I will try.

I am comparing with a ROOT file compressed with ZLIB at level 1. I was actually not sure whether 1 was a low compression level or a high one.
I confirm this works, and the produced ROOT files have a much smaller size (~factor 100).
As per #340, what about using

```python
for d in datasets:
    for branch, array in d.items():
        f["DecayTree"][branch].newbasket(array)
```

This way, we can narrow down whether the issue is related to `extend`.

From your original description, the output file is 100× larger than you expect. That sort of thing is not going to be fixed by compression, especially compression of floating-point numbers. (Compression of strings can be high, depending on their regularity; compression of integers generally isn't better than a factor of 2 unless they're really degenerate; and it's hard to get even a factor of 2 on floating-point numbers, especially float32, whose exponent and fraction don't line up with 8-bit boundaries, which can be missed by compression algorithms that use bytes as tokens, which I've heard is true for LZ4.) So we're not looking for any issues related to compression; you might even want to use no compression while diagnosing the error, to simplify the debugging (i.e. you know exactly how big each basket should be when it's uncompressed).

@marinang Could you share the data to be written? By your estimates ("70, ~5k events each, with 269 variables"), it should be 0.35 GB uncompressed. You can put a dictionary of arrays into a file that Numpy can easily open with numpy.savez, and because of the simplicity of these files, it should be pretty much exactly 0.35 GB. (The ROOT file will be somewhat larger than this, but the benefit is that anyone in HEP knows how to use it. Our goal is to make it not too much larger than this.) With that file, @reikdas should be able to reproduce and begin debugging the error. Thanks!
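The 0.35 GB estimate above can be checked with quick back-of-the-envelope arithmetic (assuming float32, i.e. 4 bytes per value, as in the writing code earlier in the thread):

```python
# Rough uncompressed size of the dataset described in this thread:
# 70 datasets x ~5000 events each x 269 branches of float32 (4 bytes).
n_datasets = 70
events_per_dataset = 5000
n_branches = 269
bytes_per_value = 4  # float32

total_bytes = n_datasets * events_per_dataset * n_branches * bytes_per_value
print(total_bytes)            # 376,600,000 bytes
print(total_bytes / 1024**3)  # ~0.35 GiB
```

So a plain `numpy.savez` file of these arrays should come out at roughly 0.35 GiB, which gives a concrete baseline to compare the ROOT file against.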
Wait, what worked? Changing the compression or the flush size?
@jpivarski If you look above, I modified his code with `flushsize="5 MB"` and told him to try it. It will also work with the intended default of 1 MB. (I did not want any bytes added because of the ROOT format to mess up the sizes, so I just suggested a larger number :) and if the file size did not decrease as expected, having a larger flushsize makes things clearer to me.)
I am actually getting a file with a size of ~350 MB (70 times the flushsize?).
A 30 kB flush size did that much damage? (This is not an unusual flush size for C++ ROOT.) A basket has about 80 bytes of overhead, so if we're really writing 30 kB to each basket, we should be in the asymptotic limit where larger basket sizes make a small percentage difference. We like megabyte baskets because it's a better Numpy-to-Python ratio (speed, not size). With 5000 events × 70 |
@marinang Bad math on my side (I fixed my comment); it's 0.35 GB because I was dividing by |
@jpivarski The 30 KB is the default flushsize defined in the tree, which means that it gets divided by the number of branches (269 in this case). So the number of flushes per branch is way more than 45. |
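A rough sketch of the arithmetic behind this, using only numbers from the thread (30 KB tree-level flushsize, 269 branches, 70 datasets of ~5k float32 events, ~80 bytes of per-basket overhead):

```python
# With the flushsize defined per tree, each branch only gets a sliver of it.
flushsize = 30 * 1024                 # old default: 30 KB for the whole tree
n_branches = 269
per_branch = flushsize // n_branches  # ~114 bytes buffered per branch

# Per-branch data volume: 70 datasets x ~5000 events x 4 bytes (float32).
branch_bytes = 70 * 5000 * 4          # 1.4 MB per branch

# If the 30 KB had applied per branch, each branch would flush ~45 times:
print(branch_bytes // flushsize)      # 45

# Shared across the tree, each branch instead flushes over 12,000 times,
# and each tiny basket carries ~80 bytes of header overhead:
print(branch_bytes // per_branch)     # ~12,280
print(80 / (80 + per_branch))         # overhead fraction per basket, ~0.41
```

This is consistent with the "way more than 45" figure: dividing the tree-level flushsize among 269 branches multiplies the number of flushes per branch by a factor of the branch count.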
Ohhhhhhh. Yeah, that could be it. If that's what's going on, then it's an understood bug and the change in default is a good fix (and we might want to go higher than 1 MB, since this is for the whole tree). 200 variables is not unusual; NanoAOD has 1000 variables. How much is reasonable "total working memory"? I'd think maybe 100 MB (because computers have on the order of GB of total RAM)? All this time, I had forgotten and was thinking we were talking about per-branch sizes, not per-tree sizes. |
What do you think about 100 MB? |
@jpivarski We could set the default flushsize at the tree level depending on the number of branches crossing certain thresholds? |
That defeats the purpose of setting it on the TTree (if it's a number that scales with the number of TBranches, then it would effectively be a size per TBranch). I was overthinking it—the parameter that a user knows most readily is how much working space they have; getting how much that means per branch is an extra step of derivation. Since laptops and servers have several to hundreds of GB, 100 MB won't hurt anybody for available space, and 100 MB / 1000 branches is 0.1 MB per branch, which is a nicely large size. |
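The per-branch consequence of a 100 MB tree-level default, as described above (using the ~1000-branch NanoAOD scale mentioned earlier and this issue's 269 branches):

```python
flushsize = 100 * 1024**2  # proposed default: 100 MB for the whole tree

# Even at NanoAOD scale (~1000 branches), each branch still gets ~0.1 MB:
print(flushsize / 1000 / 1024**2)  # 0.1 MB per branch

# For this issue's 269 branches, each branch gets ~0.37 MB:
print(flushsize / 269 / 1024**2)   # ~0.37 MB per branch
```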
I have updated the PR to make the default flushsize 100 MB. |
This looks good; I think it's resolved and you can merge/deploy the PR. |
I was looping through ROOT files with uproot.iterate, applying a selection, and then writing the selected data to a new ROOT file using the low-level interface. In total there are ~1M events and 600 branches written. In each uproot.iterate loop, ~1k events are added to each branch.

EDIT: Actually, I also got an error
If you are using the low-level

Coming back to the problem, @jpivarski I do not think this is because of the TTree reallocation growth factor. In this case, I think the TTree is being reallocated 14 times; for the last few reallocations, the basket header bytes being reallocated are large (~57 MB), and the last time it is written there is some extra allocated space that is not being used (but, according to my calculations, that is only about ~1 MB). Even taking into account the increasing size of the basket buffers (
Can you reproduce his example, logging each time that the TTree metadata (or anything else) gets rewritten? It doesn't have to be real data, just as many branches and events as his dataset. Probably the rewrite algorithm is behaving differently than we expect. |
Disregard this; I had forgotten that the basket information is local to each branch. According to our current reallocation algorithm, the TTree is correctly rewritten 3000 times (5 times per branch for 600 branches), which means that everything is re-written (with each rewrite, all of the 600 branches, each of which contains
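The counting above can be made explicit. With rewrites triggered per branch, the amount of metadata rewritten grows roughly quadratically in the number of branches, since every one of the 3000 rewrites touches all 600 branches' records:

```python
# Figures from this thread: 5 metadata rewrites per branch, 600 branches.
rewrites_per_branch = 5
n_branches = 600

# Each rewrite triggers a full TTree metadata rewrite:
total_rewrites = rewrites_per_branch * n_branches
print(total_rewrites)                # 3000 TTree rewrites

# and each of those rewrites re-emits all 600 branches' metadata:
print(total_rewrites * n_branches)   # 1,800,000 branch-metadata writes
```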
At least in our discussions, we came up with an algorithm that wouldn't have TTree metadata rewriting scale with the number of branches. If one branch triggers a rewrite (it passes 10 baskets or whatever, so that there's no more room in its

In the simplest case (probably Matt's case), each basket has the same number of entries, so if one branch triggers a rewrite, then all the other branches are guaranteed to trigger a rewrite immediately thereafter, because they'll have the same number of baskets as soon as they get filled. In more complex cases, the above is only approximately true, but still worth preparing for.

So this is the algorithm: as soon as any one branch in the TTree needs larger
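A minimal toy sketch of the policy described above (hypothetical names, not the actual uproot implementation): when any one branch runs out of room in its basket metadata, double the capacity for every branch at once, so a single TTree rewrite serves the whole tree:

```python
class TreeMetadata:
    """Toy model: basket capacity that grows for all branches at once."""

    def __init__(self, branch_names, initial_capacity=10):
        self.capacity = initial_capacity           # baskets of room per branch
        self.baskets = {name: 0 for name in branch_names}
        self.rewrites = 0                          # number of TTree rewrites

    def add_basket(self, name):
        if self.baskets[name] >= self.capacity:
            # One branch overflowed: double the capacity for *all* branches
            # in a single metadata rewrite, instead of one rewrite per branch.
            self.capacity *= 2
            self.rewrites += 1
        self.baskets[name] += 1

# 600 branches, each filled with 35 baskets (e.g. ~35 fill loops):
tree = TreeMetadata([f"b{i}" for i in range(600)])
for basket in range(35):
    for name in tree.baskets:
        tree.add_basket(name)

print(tree.rewrites)  # 2 rewrites in total (capacity 10 -> 20 -> 40)
```

Because the capacity jump is shared, the branches that fill immediately after the trigger find room already waiting for them, so the rewrite count depends on the number of growth steps, not on the number of branches.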
Ah okay. Thanks! |
@reikdas it seems to work. I tried with the following example

and I obtain a ~2 GB file, as expected.
Hi,
it seems that files written with uproot have a very large size. I tried to write a subset of a dataset (less than 1%), with fewer variables, to another ROOT file, and the size of the latter is of the same order of magnitude as the size of the original ROOT file.

I tried adding compression as described in the README, and also reducing the data type of the branches from float64 to float32. This does reduce the size of the output ROOT file, but it is still very large.