This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Problem with jagged arrays when creating a pandas dataframe from tree. #88

Closed
marinang opened this issue Jun 1, 2018 · 6 comments

@marinang
Member

marinang commented Jun 1, 2018

Hi,

I've loaded a tree from a file doing:

tree = uproot.open(f)["DecayTree"]

and when I am asking for a dataframe with all branches I get:

ValueError Traceback (most recent call last)
in ()
----> 1 df = tree.pandas.df()

~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/_connect/to_pandas.py in df(self, branches, entrystart, entrystop, cache, basketcache, keycache, executor, blocking)
41 def df(self, branches=None, entrystart=None, entrystop=None, cache=None, basketcache=None, keycache=None, executor=None, blocking=True):
42 import pandas
---> 43 return self._tree.arrays(branches=branches, outputtype=pandas.DataFrame, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)

~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/tree.py in arrays(self, branches, outputtype, entrystart, entrystop, cache, basketcache, keycache, executor, blocking)
431
432 # start the job of filling the arrays
--> 433 futures = [(branch.name, interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
434
435 # make functions that wait for the filling job to be done and return the right outputtype

~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/tree.py in <listcomp>(.0)
431
432 # start the job of filling the arrays
--> 433 futures = [(branch.name, interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
434
435 # make functions that wait for the filling job to be done and return the right outputtype

~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/tree.py in array(self, interpretation, entrystart, entrystop, cache, basketcache, keycache, executor, blocking)
1229 basket_entryoffset = self._basket_entryoffset(basketstart, basketstop)
1230
-> 1231 destination = interpretation.destination(basket_itemoffset[-1], basket_entryoffset[-1])
1232
1233 def fill(j):

~/packages/anaconda3/envs/analysis1/lib/python3.6/site-packages/uproot/interp/numerical.py in destination(self, numitems, numentries)
257 raise ValueError("cannot reshape {0} items as {1} (groups of {2})".format(numitems, self.todims, product))
258 if _dimsprod(self.toarray.shape) < numitems:
--> 259 raise ValueError("cannot put {0} items into an array of {1} items".format(numitems, _dimsprod(self.toarray.shape)))
260 return self.toarray, numitems // product
261

ValueError: cannot put 135738 items into an array of 15082 items

Cheers,
Matt

@jpivarski
Member

Could you point me to the file? If some branches have different multiplicities than others (e.g. some are electrons and some are muons), this might be impossible and just needs a better error message.

Although I don't remember what I attempted to do in this case: if it's trying to put an event's worth of electrons into one row, rather than exploding them out, then it's conceptually possible to put them all in a single DataFrame, but that DataFrame would be slow and most Pandas operations wouldn't work on it. Pandas prefers a non-jagged data model.

@marinang
Member Author

marinang commented Jun 2, 2018

Yeah, true, it would be very slow, but what I actually wanted was to apply a classifier's decision function that takes a DataFrame as input, and then save that DataFrame with the classifier's output as a ROOT file. So indeed I don't need to load the whole ROOT file and can use NumPy instead.

Anyway, the ROOT file can be found here.

https://cernbox.cern.ch/index.php/s/lwIHERvCzgT2rlj
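A minimal sketch of that NumPy route (the branch values here are made up, standing in for what uproot's tree.arrays would return for a few scalar branches, since the real call needs the ROOT file):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for tree.arrays([...]) on scalar branches:
# a dict mapping branch names to flat NumPy arrays
arrays = {"DeltaR": np.array([0.3583733, 0.41]),
          "sum_PT": np.array([68990.96, 51234.1])}

# flat arrays assemble into a rectangular DataFrame with no trouble,
# ready for e.g. a scikit-learn decision_function
df = pd.DataFrame(arrays)
X = df.values
```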

@sbinet

sbinet commented Jun 2, 2018

Hmm... is this really the problematic file?
I was able to dump it in full with go-hep/rootio:

$> root-dump chi2emu_460000341_single_muon_2016_AllStreams_stripping28r1_MagDown.root
>>> file[chi2emu_460000341_single_muon_2016_AllStreams_stripping28r1_MagDown.root]
key[000]: DecayTree;1 "DecayTree" (TTree)
[000][BCID]: 0
[000][BCType]: 3
[000][DeltaEta]: 0.0041181203
[000][DeltaPhi]: 0.35834965
[000][DeltaR]: 0.3583733
[000][FoilMaterialDistance]: 6.5149827
[000][GpsTime]: 0
[000][HLT1TCK]: 1362630159
[000][HLT2TCK]: 1631131151
[000][L0DUTCK]: 5647
[000][ModuleMaterialDistance]: 19.47313
[000][OdinTCK]: 0
[000][Polarity]: -1
[000][TRUE_chi_10]: true
[000][TRUE_e_minus]: false
[000][TRUE_mu_plus]: false
[000][VeloMaterialDistance]: 0.7633978
[000][chi_10_BPVLTIME]: 0.009978685
[000][chi_10_BPVVDCHI2]: 161280.81
[000][chi_10_BPVVDR]: 17.400373
[000][chi_10_DIRA_OWNPV]: 0.9999805081454151
[000][chi_10_ENDVERTEX_CHI2]: 0.00019833678142893254
[000][chi_10_ENDVERTEX_COV_]: [6.268918e-05 -3.784743e-05 -0.0001679406 -3.784743e-05 0.0019657847 0.00873885 -0.0001679406 0.00873885 0.039108664]
[...]
[1312][nu_TRUE_P]: 117239.98546842115
[1312][nu_TRUE_PE]: 117239.98546842119
[1312][nu_TRUE_PT]: 17404.25197800813
[1312][nu_TRUE_PX]: -13335.18
[1312][nu_TRUE_PY]: -11183.96
[1312][nu_TRUE_PZ]: 115940.96
[1312][runNumber]: 424223243
[1312][sum_PT]: 68990.96
[1312][sum_PX]: -31684.914
[1312][sum_PY]: -61284.74

(and usually, uproot is better than go-hep/rootio to recover from broken ROOT files...)

@jpivarski
Member

@sbinet As I understand it, the ROOT file is not broken or unreadable, and presumably it can be read into arrays, just not a Pandas DataFrame. I'll try that myself soon (thanks, Mathieu!).

It's probably a question of how uproot combines JaggedArrays of different multiplicities into a DataFrame— and whether it should at all.

@marinang How do you want this as a DataFrame, if they do have different multiplicities? I'm 99% sure that a classifier won't be able to deal with non-numeric types in a DataFrame, such as arrays with different lengths. Could you check with some small examples? If it's not going to work for you, it wouldn't do much good for me to make uproot output the data that way.

If, as I suspect, the classifier requires a flat (purely rectangular) table of numbers and your ROOT file has the standard variable-length lists of particles, what would it mean to classify it? When ML HEP people encounter this problem with variable-length lists of jets, they usually do a recurrent neural net with the jets in some arbitrarily chosen order and no boundaries between events. For that, I should give you an exploded table. Somehow, that miraculously gives meaningful results, despite applying an order-sensitive learner to an arbitrarily chosen order and giving up knowledge of which jets were found in which events.

But if you have different multiplicities of different types of objects— say, jets, electrons, and muons— then there isn't even a way to do that. We can't alternate rows of all jets, then all electrons, then all muons because they have different sets of fields (unless we take the union of fields and put n/a in the ones that don't apply). We can't put the different particle types side by side in different groups of columns unless we expand all particle list lengths to the longest or shortest length (inserting n/a if we choose the longest), but this creates meaningless associations between the first jet, first electron, and first muon, for instance. They'd be seen by the classifier as being part of the same feature vector, which is imputing a relationship that doesn't exist.
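For illustration, the pad-to-longest option mentioned above might look like this (a hedged sketch with made-up jet values; it shows the mechanics, not an endorsement):

```python
import numpy as np
import pandas as pd

# made-up jet pT lists with a different length per event
events = [[10.0, 20.0, 30.0], [5.0], []]

# pad every event to the longest length, filling the gaps with NaN ("n/a")
maxlen = max(len(e) for e in events)
padded = np.full((len(events), maxlen), np.nan)
for i, e in enumerate(events):
    padded[i, :len(e)] = e

df = pd.DataFrame(padded, columns=["jet_pt[%d]" % k for k in range(maxlen)])
```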

The only "right way" to do this is to make summary variables that represent the variable-length lists of particles in a fixed length way. Unfortunately, though, that's feature engineering (possibly automated with autoencoders on the particle lists?). I feel that this is the great unsolved problem of using ML for HEP, and it gets less attention than it deserves.

Do you have in mind a way of dealing with this? If so, I can make uproot deliver DataFrames capable of achieving that end, whatever it is.

@jpivarski
Member

jpivarski commented Jun 3, 2018

It always pays for me to look at the file before talking about it. You can disregard that entire rant about DataFrames of JaggedArrays with different multiplicities. Your TTree has no JaggedArrays— it's a completely flat ntuple.

The issue was that some of the branches are non-scalar: four of them are 3x3 matrices, presumably covariance matrices. ("Jagged" refers to variable size, like the number of electrons in an event.) For our purposes, fixed-size tensors are perfectly "flat" because they can be arranged into columns without loss of information. uproot was simply lacking the code to do that.

With uproot 2.8.28, a tensor-valued branch like "chi_10_ENDVERTEX_COV_" becomes nine columns: "chi_10_ENDVERTEX_COV_[0][0]", "chi_10_ENDVERTEX_COV_[0][1]", ..., "chi_10_ENDVERTEX_COV_[2][2]". You'll have no problem running this through an ML classifier and having it do the right thing.
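A sketch of what that flattening amounts to (hypothetical values; the column naming mirrors the scheme above, and note that the item counts from the original error are an exact multiple, consistent with nine items per entry):

```python
import numpy as np
import pandas as pd

# the counts from the original ValueError are an exact multiple:
# 135738 items over 15082 entries is 9 items per entry, i.e. a 3x3 matrix
assert 135738 == 9 * 15082

# hypothetical 3x3-matrix branch with 2 entries, flattened into nine columns
arr = np.arange(18, dtype=float).reshape(2, 3, 3)
cols = {"cov[{0}][{1}]".format(i, j): arr[:, i, j]
        for i in range(3) for j in range(3)}
df = pd.DataFrame(cols)
```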

I hope the HEP and ML communities put some thought into the "variable number of electrons" issue, but you don't have that issue and you should be good to go with this fix.

@jpivarski
Member

Incidentally, now that I've just looked at the code and reminded myself what it does, if you do have JaggedArrays, they go into the DataFrame as a column of array objects. That is, a TTree containing electron pT and electron eta becomes a DataFrame with one row per event and an array object of all the first event's electron pTs in the first row, first column, all of the first event's electron etas in the first row, second column, etc.

The "slowness" I referred to earlier is due to the overhead of having Python objects (whole arrays) in the DataFrame, and an ML classifier wouldn't know what to do with that. For a TTree of only electrons, it's the wrong choice: we'd want to explode it out so that each row is an electron (but then we'd lose information about where each event starts). Most TTrees that have JaggedArrays for one type of particle have multiple types of particles, and you can only explode one type. I've brought this up with Spark developers and SQL experts and nobody knows what to do with our case— it's uniquely challenging.
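To make that trade-off concrete, here is a hedged sketch (made-up electron values) of exploding such array-valued columns with pandas, assuming pandas >= 1.3 for multi-column explode:

```python
import pandas as pd

# one row per event, each cell holding a whole list of electron values
df = pd.DataFrame({
    "el_pt":  [[10.0, 20.0], [5.0]],
    "el_eta": [[0.1, -0.2], [1.5]],
})

# one row per electron; the event is now recorded only in the
# (non-unique) index, so explicit event boundaries are gone
flat = df.explode(["el_pt", "el_eta"])
```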
