Problem with jagged arrays when creating a pandas dataframe from tree. #88
Could you point me to the file? If some branches have different multiplicities than others (e.g. some are electrons and some are muons), this might be impossible and just needs a better error message. Although I don't remember what I attempted to do in this case: if it's trying to put an event's worth of electrons into one row, rather than exploding them out, then it's conceptually possible to put them all in a single DataFrame, but that DataFrame would be slow and most Pandas operations wouldn't work on it. Pandas prefers a non-jagged data model.
Yeah, true, it would be very slow, but the goal was actually to apply a classifier decision function that takes a DataFrame as input, and then to save this DataFrame with the output of the classifier as a ROOT file. So indeed I don't need to load the whole ROOT file and can use numpy instead. Anyway, the ROOT file can be found here.
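For the classifier use case described above, the flat branches can bypass the DataFrame entirely. A minimal sketch with plain numpy, where the dict of arrays is a hypothetical stand-in for what a tree-reading call would return (branch names are invented for illustration):

```python
import numpy as np

# Hypothetical stand-in for flat (non-jagged) branches read from a TTree:
# a dict mapping branch name -> one 1D numpy array per branch.
branches = {
    b"B_ETA": np.array([2.5, 3.1, 4.0]),
    b"B_PT":  np.array([5.1, 7.3, 2.2]),
}

# Stack the flat branches into the rectangular feature matrix a
# classifier expects: one row per event, one column per branch.
features = np.column_stack([branches[name] for name in sorted(branches)])

print(features.shape)  # (3, 2): 3 events, 2 branches
```

A scikit-learn-style `decision_function(features)` could then be applied directly to this matrix, with no DataFrame in between.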
hum.. is it really the problematic file?
(and usually, …)
@sbinet As I understand it, the ROOT file is not broken or unreadable, and presumably it can be read into arrays, just not into a Pandas DataFrame. I'll try that myself soon (thanks, Mathieu!). It's probably a question of how uproot combines JaggedArrays of different multiplicities into a DataFrame, and whether it should at all.
@marinang How do you want this as a DataFrame, if the branches do have different multiplicities? I'm 99% sure that a classifier won't be able to deal with non-numeric types in a DataFrame, such as arrays with different lengths. Could you check with some small examples? If it's not going to work for you, it wouldn't do much good for me to make uproot output the data that way.
If, as I suspect, the classifier requires a flat (purely rectangular) table of numbers and your ROOT file has the standard variable-length lists of particles, what would it mean to classify it? When ML HEP people encounter this problem with variable-length lists of jets, they usually use a recurrent neural net with the jets in some arbitrarily chosen order and no boundaries between events. For that, I should give you an exploded table. Somehow, that miraculously gives meaningful results, despite applying an order-sensitive learner to an arbitrarily chosen order and giving up knowledge of which jets were found in which events.
But if you have different multiplicities of different types of objects, say jets, electrons, and muons, then there isn't even a way to do that. We can't alternate rows of all jets, then all electrons, then all muons, because they have different sets of fields (unless we take the union of fields and put n/a in the ones that don't apply). We can't put the different particle types side by side in different groups of columns unless we expand all particle list lengths to the longest or shortest length (inserting n/a if we choose the longest), but this creates meaningless associations between the first jet, first electron, and first muon, for instance. They'd be seen by the classifier as being part of the same feature vector, which imputes a relationship that doesn't exist.
The only "right way" to do this is to make summary variables that represent the variable-length lists of particles in a fixed-length way. Unfortunately, though, that's feature engineering (possibly automated with autoencoders on the particle lists?). I feel that this is the great unsolved problem of using ML for HEP, and it gets less attention than it deserves. Do you have in mind a way of dealing with this? If so, I can make uproot deliver DataFrames capable of achieving that end, whatever it is.
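To make the "summary variables" idea concrete, here is a minimal sketch (the jet data and the choice of summaries are invented for illustration): each variable-length list is reduced to a fixed-length vector of statistics, which is the kind of feature engineering the comment describes.

```python
import numpy as np

# Hypothetical jagged data: per-event lists of jet pT (variable length).
jet_pt = [
    [45.0, 30.2, 12.1],   # event with 3 jets
    [80.5],               # event with 1 jet
    [],                   # event with no jets
]

def summarize(event):
    """Fixed-length summary of a variable-length list: (count, max, sum)."""
    if len(event) == 0:
        return (0, 0.0, 0.0)   # pick a convention for empty events
    return (len(event), max(event), sum(event))

# Every row is now a fixed-length feature vector, regardless of jet count,
# so a rectangular classifier can consume it directly.
summary = np.array([summarize(e) for e in jet_pt])

print(summary.shape)  # (3, 3): 3 events, 3 summary features
```

The particular statistics (count, max, sum) are arbitrary; the point is only that the output shape no longer depends on the per-event multiplicity.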
It always pays for me to look at the file before talking about it. You can disregard that entire rant about DataFrames of JaggedArrays with different multiplicities: your TTree has no JaggedArrays at all; it's a completely flat ntuple. The issue was that some of the branches are non-scalar: four of them are 3x3 matrices, presumably covariance matrices. ("Jagged" refers to variable size, like the number of electrons in an event.) For our purposes, fixed-size tensors are perfectly "flat" because they can be arranged into columns without loss of information. uproot was simply lacking the code to do that; as of uproot 2.8.28, such tensor-valued branches can be read into a DataFrame. I hope the HEP and ML community put some thought into the "variable number of electrons" issue, but you don't have that issue and you should be good to go with this fix.
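The lossless column layout for a fixed-size tensor branch can be sketched with plain numpy and pandas (the data, the branch name `cov`, and the column-naming scheme are assumptions, not uproot's actual output format):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a fixed-size tensor branch: N events, each
# carrying a 3x3 covariance matrix, read as an (N, 3, 3) array.
N = 4
cov = np.arange(N * 9, dtype=np.float64).reshape(N, 3, 3)

# Because the size is fixed, the tensor flattens losslessly into 9
# columns, one per matrix element -- no jaggedness involved.
columns = ["cov[%d][%d]" % (i, j) for i in range(3) for j in range(3)]
df = pd.DataFrame(cov.reshape(N, 9), columns=columns)

print(df.shape)  # (4, 9): one row per event, one column per element
```

The reshape is invertible (`df.values.reshape(N, 3, 3)` recovers the original tensors), which is exactly why fixed-size branches pose no conceptual problem for a DataFrame.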
Incidentally, now that I've looked at the code and reminded myself what it does: if you do have JaggedArrays, they go into the DataFrame as a column of array objects. That is, a TTree containing electron pT and electron eta becomes a DataFrame with one row per event: all of the first event's electron pTs as an array object in the first row, first column; all of the first event's electron etas in the first row, second column; and so on. The "slowness" I referred to earlier is due to the overhead of having Python objects (whole arrays) in the DataFrame, and an ML classifier wouldn't know what to do with that. For a TTree of only electrons, it's the wrong choice: we'd want to explode it out so that each row is an electron (but then we'd lose information about where each event starts). Most TTrees that have JaggedArrays for one type of particle have multiple types of particles, and you can only explode one type. I've brought this up with Spark developers and SQL experts and nobody knows what to do with our case; it's uniquely challenging.
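The "explode" operation described above can be sketched in a few lines of numpy/pandas (the electron data is invented for illustration); carrying an explicit event index alongside the exploded rows is one way to keep the event-boundary information that a naive flattening discards:

```python
import numpy as np
import pandas as pd

# Hypothetical jagged electron data: per-event lists of pT and eta.
electron_pt  = [[25.0, 18.3], [40.1], [12.7, 9.9, 8.1]]
electron_eta = [[2.1, 3.0],   [2.7],  [4.1, 2.2, 3.5]]

# Explode to one row per electron. Repeating the event number per
# electron preserves which event each row came from.
counts = [len(e) for e in electron_pt]
df = pd.DataFrame({
    "event": np.repeat(np.arange(len(counts)), counts),
    "pt":    np.concatenate(electron_pt),
    "eta":   np.concatenate(electron_eta),
})

print(len(df))  # 6 rows: one per electron across all events
```

This works cleanly only because every column here belongs to the same particle type; as noted above, a second jagged particle type with a different multiplicity could not share this table.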
Hi,
I've loaded a tree from a file doing:
tree = uproot.open(f)["DecayTree"]
and when I ask for a DataFrame with all branches I get:
ValueError Traceback (most recent call last)
in ()
----> 1 df = tree.pandas.df()
~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/_connect/to_pandas.py in df(self, branches, entrystart, entrystop, cache, basketcache, keycache, executor, blocking)
41 def df(self, branches=None, entrystart=None, entrystop=None, cache=None, basketcache=None, keycache=None, executor=None, blocking=True):
42 import pandas
---> 43 return self._tree.arrays(branches=branches, outputtype=pandas.DataFrame, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)
~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/tree.py in arrays(self, branches, outputtype, entrystart, entrystop, cache, basketcache, keycache, executor, blocking)
431
432 # start the job of filling the arrays
--> 433 futures = [(branch.name, interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
434
435 # make functions that wait for the filling job to be done and return the right outputtype
~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/tree.py in (.0)
431
432 # start the job of filling the arrays
--> 433 futures = [(branch.name, interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
434
435 # make functions that wait for the filling job to be done and return the right outputtype
~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/tree.py in array(self, interpretation, entrystart, entrystop, cache, basketcache, keycache, executor, blocking)
1229 basket_entryoffset = self._basket_entryoffset(basketstart, basketstop)
1230
-> 1231 destination = interpretation.destination(basket_itemoffset[-1], basket_entryoffset[-1])
1232
1233 def fill(j):
~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/interp/numerical.py in destination(self, numitems, numentries)
257 raise ValueError("cannot reshape {0} items as {1} (groups of {2})".format(numitems, self.todims, product))
258 if _dimsprod(self.toarray.shape) < numitems:
--> 259 raise ValueError("cannot put {0} items into an array of {1} items".format(numitems, _dimsprod(self.toarray.shape)))
260 return self.toarray, numitems // product
261
ValueError: cannot put 135738 items into an array of 15082 items
Cheers,
Matt