Pandas Dataframe and jagged arrays in different branches #322

sznajder · 2019-08-23T21:25:13Z

I have a ROOT tree containing several branches with different dimensions. I need to plot variables from different branches on an event-by-event basis. I am opening my tree with uproot and converting it directly into a Pandas DataFrame, and I am facing two problems:

If I use the flatten=True option, I get an error because the branches have different dimensions. How can I solve this problem?
I need to make the plot per event. How do I loop over events in a Pandas DataFrame?
Thanks,
Andre

jpivarski · 2019-08-23T22:14:40Z

A single Pandas DataFrame cannot represent flattened data with different numbers of values in each event. You'll have to create one DataFrame for electrons, one DataFrame for muons, etc., if you use flatten=True. It is normal to work with multiple DataFrames—there are many merging options.
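
As a rough sketch of the multiple-DataFrame approach, the example below builds two small tables by hand rather than reading a ROOT file. The column names, the values, and the (entry, subentry) index layout are assumptions made for illustration. Each table is reduced to one row per event and the results are merged on the event number:

```python
import pandas as pd

# Hypothetical flattened data: one row per particle, indexed by (entry, subentry).
# The values and column names are made up for illustration.
electrons = pd.DataFrame(
    {"electron_pt": [30.1, 12.4, 45.0]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (2, 0)], names=["entry", "subentry"]
    ),
)
muons = pd.DataFrame(
    {"muon_pt": [22.0, 18.5, 9.7, 51.3]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 0), (1, 1), (2, 0)], names=["entry", "subentry"]
    ),
)

# Reduce each table to one value per event (here: the highest pt in the event).
leading_electron = electrons.groupby(level="entry")["electron_pt"].max()
leading_muon = muons.groupby(level="entry")["muon_pt"].max()

# Merge on the event number; events with no electrons get NaN in that column.
per_event = pd.merge(
    leading_electron.reset_index(),
    leading_muon.reset_index(),
    on="entry",
    how="outer",
).set_index("entry")
print(per_event)
```

An outer merge keeps events that appear in only one of the tables; an inner merge would keep only events that contain both kinds of particle.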

You could set flatten=False to get a Python list of values in each cell. Then a single DataFrame could hold data from different particles because Python lists can have different lengths. The DataFrame method for applying a function to each row is called apply.
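
A minimal sketch of that pattern, using a hand-built DataFrame with list-valued cells instead of one read from a file (the column names, values, and the helper total_lepton_pt are hypothetical):

```python
import pandas as pd

# Hypothetical unflattened data: each cell holds a Python list whose
# length varies from event to event.
events = pd.DataFrame({
    "electron_pt": [[30.1, 12.4], [], [45.0]],
    "muon_pt": [[22.0], [18.5, 9.7], [51.3]],
})

def total_lepton_pt(row):
    # Called once per event in ordinary Python, so nothing here is vectorized.
    return sum(row["electron_pt"]) + sum(row["muon_pt"])

# axis=1 passes each row (one event) to the function.
events["total_pt"] = events.apply(total_lepton_pt, axis=1)

# An explicit per-event loop, for example to fill one plot per event.
for entry, row in events.iterrows():
    n_leptons = len(row["electron_pt"]) + len(row["muon_pt"])
    print(entry, n_leptons, row["total_pt"])
```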

However, if you set flatten=False or do a Pandas apply, you're just doing a Python for-loop: you gain nothing from compiled functions or vectorization. If you're okay with that (speed is not an issue), you could cut out the middleman and just do a for-loop over the jagged array:

for outer in jagged_array:
    for inner in outer:
        f(inner)

or similarly with indexes:

for i in range(len(jagged_array)):
    for j in range(len(jagged_array[i])):
        f(jagged_array[i][j])

Or you could leave awkward-array entirely with jagged_array.tolist(), which converts it into a list of lists. Looping over plain Python lists is quite a bit faster than looping directly over the jagged array, because each element lookup is simpler and runs less code.

If performance is an issue, you shouldn't use flatten=False or DataFrame.apply. Columnar analysis code has a different strategy than rowwise. The best version of my tutorials on these techniques is here.
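
To give a flavor of the columnar style, here is a small sketch in plain NumPy rather than awkward-array. It assumes the jagged data is stored as one flat array of values plus a per-event count, and all names and numbers are made up for illustration. Per-event quantities are computed for every event at once, with no Python loop:

```python
import numpy as np

# Hypothetical jagged data in columnar form: all values in one flat array,
# plus the number of values that belong to each event.
content = np.array([30.1, 12.4, 45.0, 15.0, 18.5])
counts = np.array([2, 0, 1, 2])

# Event number of every value: [0, 0, 2, 3, 3].
event_index = np.repeat(np.arange(len(counts)), counts)

# Per-event sum, computed in compiled code; empty events get 0.0.
sum_per_event = np.bincount(event_index, weights=content, minlength=len(counts))

# Per-event selection: does the event have at least one value above 20?
above = (content > 20.0).astype(float)
passes = np.bincount(event_index, weights=above, minlength=len(counts)) > 0

print(sum_per_event)
print(passes)
```

The resulting arrays have one element per event, so they can be passed directly to a histogramming function or stored as ordinary DataFrame columns.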

jpivarski closed this as completed Aug 23, 2019
