This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Pandas Dataframe and jagged arrays in different branches #322

Closed
sznajder opened this issue Aug 23, 2019 · 1 comment

Comments

@sznajder

I have a ROOT tree containing several branches with different dimensions. I need to make plots of variables in different branches on an event-by-event basis. I am opening my tree with uproot and converting it directly into a Pandas DataFrame, and I am facing two problems:

  1. If I use the flatten=True option, it gives an error because the branches have different dimensions. How can I solve this problem?
  2. I need to make the plot per event. How do I loop over events in a Pandas DataFrame?

Thanks,
Andre

@jpivarski
Member

A single Pandas DataFrame cannot represent flattened data with different numbers of values in each event. You'll have to create one DataFrame for electrons, one DataFrame for muons, etc., if you use flatten=True. It is normal to work with multiple DataFrames, and Pandas has many options for merging them.
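As a rough sketch of the multiple-DataFrame approach, the example below builds two small DataFrames by hand instead of reading a file. The column names, the values, and the (entry, subentry) index layout are assumptions made for illustration, standing in for whatever flatten=True returns for each particle type. Each collection is reduced to one row per event and the results are then joined on the event number:

```python
import pandas as pd

# Hypothetical stand-ins for two flattened DataFrames: one row per particle,
# indexed by (entry, subentry). All names and values here are made up.
electrons = pd.DataFrame(
    {"electron_pt": [30.1, 12.4, 45.0]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (2, 0)], names=["entry", "subentry"]
    ),
)
muons = pd.DataFrame(
    {"muon_pt": [22.0, 18.5, 9.7]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 0), (1, 1)], names=["entry", "subentry"]
    ),
)

# Reduce each collection to one value per event (here: the largest pt).
leading_electron = electrons.groupby(level="entry")["electron_pt"].max().to_frame()
leading_muon = muons.groupby(level="entry")["muon_pt"].max().to_frame()

# Keep only the events that have at least one electron and one muon.
per_event = leading_electron.join(leading_muon, how="inner")
print(per_event)
```

An inner join is only one choice; how="outer" would keep events that have only one of the two particle types, with NaN in the missing column.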

You could set flatten=False to get a Python list of values in each cell. Then a single DataFrame could hold data from different particles because Python lists can have different lengths. The DataFrame method for applying a function to each row is called apply.
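To illustrate what that looks like, here is a minimal sketch with a hand-built DataFrame whose cells hold Python lists. The column names and values are invented, and the helper n_leptons is hypothetical; only DataFrame.apply with axis=1 is the actual Pandas call for row-by-row processing:

```python
import pandas as pd

# Hand-built example: each cell holds a Python list whose length varies per event.
df = pd.DataFrame(
    {
        "electron_pt": [[30.1, 12.4], [], [45.0]],
        "muon_pt": [[22.0], [18.5, 9.7], []],
    }
)

def n_leptons(row):
    # row is a Series holding one event; each value is a plain Python list.
    return len(row["electron_pt"]) + len(row["muon_pt"])

# axis=1 calls the function once per row (once per event).
df["n_leptons"] = df.apply(n_leptons, axis=1)
print(df)
```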

However, if you set flatten=False or do a Pandas apply, you're just doing a Python for-loop: you gain nothing from compiled functions or vectorization. If you're okay with that (speed is not an issue), you could cut out the middleman and just do a for-loop over the jagged array:

for outer in jagged_array:
    for inner in outer:
        f(inner)

or similarly with indexes:

for i in range(len(jagged_array)):
    for j in range(len(jagged_array[i])):
        f(jagged_array[i][j])

or you could get out of awkward-array entirely with jagged_array.tolist(), which turns it into lists of lists. Plain Python lists will be quite a bit faster than doing for loops directly on the jagged array, because each element lookup is simpler and runs less code.

If performance is an issue, you shouldn't use flatten=False or DataFrame.apply. Columnar analysis code has a different strategy than rowwise. The best version of my tutorials on these techniques is here.
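To give a flavor of the difference, the sketch below computes a per-event sum without a Python loop over events. It uses only NumPy and represents the jagged data as a flat content array plus per-event counts; that layout and all the values are assumptions for illustration, not the awkward-array API:

```python
import numpy as np

# Made-up jagged data: a variable number of values per event.
jagged = [[30.1, 12.4], [], [45.0, 5.0, 1.0]]

# Columnar layout: one flat array of values plus the number of values per event.
counts = np.array([len(event) for event in jagged])
content = np.array([value for event in jagged for value in event])

# Label every value with the event it belongs to: [0, 0, 2, 2, 2].
event_index = np.repeat(np.arange(len(counts)), counts)

# Sum the values of each event in one vectorized call; empty events give 0.
sum_per_event = np.bincount(event_index, weights=content, minlength=len(counts))
print(sum_per_event)
```

The list comprehensions are only there to build the toy input. In a real analysis the flat content and the counts would come straight from the file, so the per-event work stays inside compiled NumPy code.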
