# Pandas development for RPV Delphes files

I'd like to use this notebook to explore pandas dataframes for my RPV deep-learning work. I want to find out whether they can simplify the workflow or make it more elegant.

In [1]:
from __future__ import print_function

import numpy as np
import pandas as pd

import root_numpy as rnp

Welcome to ROOTaaS 6.06/06


## Loading data in pandas
Let's start by playing with just one Delphes file.

In [2]:
#input_file = '/global/cscratch1/sd/wbhimji/DelphesOutput/PU-HighRes-2/QCDBkg_JZ5_800_1300-10k-PU-HighRes-2-1-1-1.root'
input_file = '/global/cscratch1/sd/wbhimji/DelphesOutput/NoPU/QCDBkg_JZ3_160_400-10k-NoPU-01-01.root'

In [3]:
# Branch name remapping for convenience
branch_dict = {
    'Event.Number' : 'eventNumber',
    'Event.ProcessID' : 'proc',
    'Tower.Eta' : 'clusEta',
    'Tower.Phi' : 'clusPhi',
    'Tower.E' : 'clusE',
    'Tower.Eem' : 'clusEM',
    'FatJet.PT' : 'fatJetPt',
    'FatJet.Eta' : 'fatJetEta',
    'FatJet.Phi' : 'fatJetPhi',
    'FatJet.Mass' : 'fatJetM',
    'Track.PT' : 'trackPt',
    'Track.Eta' : 'trackEta',
    'Track.Phi' : 'trackPhi',
}

In [4]:
# Convert ROOT data to numpy
data = rnp.root2array(input_file,
                      # Pass an explicit list (in Python 3, keys() returns a view)
                      branches=list(branch_dict.keys()),
                      warn_missing_tree=True)

data.dtype.names = [branch_dict[n] for n in data.dtype.names]



For some reason, the event-level quantities, such as the event number and process ID, are returned as one-element arrays rather than scalars. Weird.

Let's try to build a dataframe of just the jets.

How do I do that?

Maybe I can start by calculating how many jets are in each event. Then I could create an empty structure of the appropriate size, loop over the events, and fill the entries.
- However, I'd need to keep track of the start/stop indices when filling the ndarray.

Or maybe I can just build a list of per-event arrays tagged with the event ID and then concatenate everything together.

I can use the `np.repeat` function to specify how many times to repeat each element, which gives one event number per object.
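As a minimal sketch of the `np.repeat` idea, here is the same flattening on toy data. The `toy_*` names and all values are invented for illustration; they stand in for the per-event arrays returned by `root2array`.

```python
import numpy as np
import pandas as pd

# Toy jagged data: three events with 2, 0 and 3 jets (values are invented)
toy_event_number = np.array([0, 1, 2])
toy_jet_pt = [np.array([250.0, 120.0]),
              np.array([]),
              np.array([300.0, 210.0, 40.0])]

# Number of jets in each event
toy_num_jet = np.array([len(pt) for pt in toy_jet_pt])

# Repeat each event number once per jet and flatten the jet values to match
toy_jets = pd.DataFrame({
    'eventNumber': np.repeat(toy_event_number, toy_num_jet),
    'pt': np.concatenate(toy_jet_pt),
})
print(toy_jets)
```

Events with zero jets simply contribute no rows, so no start/stop bookkeeping is needed.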

In [5]:
# DataFrame.from_items was removed in pandas 1.0; an OrderedDict keeps the column order
from collections import OrderedDict

# Flatten the event number array
eventNumber = np.concatenate(data['eventNumber'])

# A function for counting numbers of objects in each event
vec_count_objects = np.vectorize(lambda x: x.shape[0])

# Build jets dataframe (one row per fat jet)
numJet = vec_count_objects(data['fatJetPt'])
jets = pd.DataFrame(OrderedDict([
    ('eventNumber', np.repeat(eventNumber, numJet)),
    ('pt', np.concatenate(data['fatJetPt'])),
    ('eta', np.concatenate(data['fatJetEta'])),
    ('phi', np.concatenate(data['fatJetPhi'])),
    ('m', np.concatenate(data['fatJetM']))
]))

# Build tracks dataframe (one row per track)
numTrack = vec_count_objects(data['trackPt'])
tracks = pd.DataFrame(OrderedDict([
    ('eventNumber', np.repeat(eventNumber, numTrack)),
    ('pt', np.concatenate(data['trackPt'])),
    ('eta', np.concatenate(data['trackEta'])),
    ('phi', np.concatenate(data['trackPhi']))
]))

# Build clusters dataframe (one row per calorimeter tower)
numClus = vec_count_objects(data['clusE'])
clusters = pd.DataFrame(OrderedDict([
    ('eventNumber', np.repeat(eventNumber, numClus)),
    ('E', np.concatenate(data['clusE'])),
    ('eta', np.concatenate(data['clusEta'])),
    ('phi', np.concatenate(data['clusPhi'])),
    ('emE', np.concatenate(data['clusEM'])),
]))

In [34]:
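# An event summary dataframe (one row per event)
# DataFrame.from_items was removed in pandas 1.0; an OrderedDict keeps the column order
from collections import OrderedDict

events = pd.DataFrame(OrderedDict([
    ('eventNumber', eventNumber),
    ('proc', np.concatenate(data['proc'])),
]))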

In [36]:
events.head()

| | eventNumber | proc |
|---|---|---|
| 0 | 0 | 113 |
| 1 | 1 | 111 |
| 2 | 2 | 111 |
| 3 | 3 | 113 |
| 4 | 4 | 111 |


In [35]:
jets.head()

| | eventNumber | pt | eta | phi | m |
|---|---|---|---|---|---|
| 0 | 0 | 236.33017 | 2.028603 | -0.947379 | 53.839207 |
| 1 | 0 | 178.687531 | -2.278885 | 2.245858 | 32.17028 |
| 2 | 0 | 26.583904 | 1.460624 | 1.658434 | 9.51849 |
| 3 | 1 | 178.356033 | -2.581952 | 1.30375 | 56.339207 |
| 4 | 1 | 108.5467 | 1.307942 | -2.464099 | 41.645054 |


## Physics selections

In [7]:
class units():
    GeV = 1

class cuts():
    # Object selection
    fatjet_pt_min = 200*units.GeV
    fatjet_eta_max = 2.
    # Baseline event selection
    baseline_num_fatjet_min = 3
    baseline_fatjet_pt_min = 440*units.GeV
    # Signal region event selection
    sr_deta12_max = 1.4
    sr4j_mass_min = 800*units.GeV
    sr5j_mass_min = 600*units.GeV

In [8]:
def select_jets(jets):
    return ((jets.pt > cuts.fatjet_pt_min) &
            (np.abs(jets.eta) < cuts.fatjet_eta_max))

In [21]:
selected_jets_mask = select_jets(jets)
selected_jets = jets[selected_jets_mask]

In [22]:
selected_jets_mask.head(10)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
dtype: bool

In [26]:
selected_jets_mask.sum(), selected_jets.shape

(6288, (6288, 5))

Ok, so I can perform jet selection, but how do I turn this into event selection?

I need to count the number of selected jets in each event, turn that count into a per-event decision, and then use that decision to filter all of the object dataframes.

Maybe I could filter the object dataframe first and then count the objects per event using a `groupby` command.

In [52]:
selected_jets.groupby('eventNumber').size().to_frame('numJet').head()

| eventNumber | numJet |
|---|---|
| 2 | 1 |
| 4 | 1 |
| 5 | 1 |
| 8 | 1 |
| 11 | 1 |


In [43]:
events.merge?

In [48]:
temp = events.merge(selected_jets.groupby('eventNumber').size().to_frame('numJet'),
                    left_on='eventNumber', right_index=True)

temp.head()

| | eventNumber | proc | numJet |
|---|---|---|---|
| 2 | 2 | 111 | 1 |
| 4 | 4 | 111 | 1 |
| 5 | 5 | 113 | 1 |
| 8 | 8 | 113 | 1 |
| 11 | 11 | 111 | 1 |
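Note that `merge` performs an inner join by default, which is why events with no selected jets (0, 1, 3, ...) are missing from `temp`. A hedged sketch of one way to keep every event is to use `how='left'` and fill the missing counts with zero. The `toy_*` frames below are invented for illustration.

```python
import pandas as pd

# Toy inputs (values are invented)
toy_events = pd.DataFrame({'eventNumber': [0, 1, 2, 3],
                           'proc': [113, 111, 111, 113]})
toy_selected_jets = pd.DataFrame({'eventNumber': [1, 1, 3],
                                  'pt': [250.0, 230.0, 410.0]})

# Per-event count of selected jets, indexed by eventNumber
toy_counts = toy_selected_jets.groupby('eventNumber').size().to_frame('numJet')

# A left join keeps events without any selected jets (numJet is NaN there)
toy_merged = toy_events.merge(toy_counts, how='left',
                              left_on='eventNumber', right_index=True)

# Replace the missing counts with zero and restore the integer dtype
toy_merged['numJet'] = toy_merged['numJet'].fillna(0).astype(int)
print(toy_merged)
```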


In [53]:
# This only works for now because the assignment aligns on the index and the
# event number happens to equal the RangeIndex of `events`.
# Events with no selected jets get NaN.
events['numJet'] = selected_jets.groupby('eventNumber').size()
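Because this assignment aligns on the index, it would misalign if `events` were ever reordered, filtered, or built from a file whose event numbers do not start at zero. One possible alternative, sketched here on invented toy data, is to look the counts up by value with `Series.map`:

```python
import pandas as pd

# Toy events whose index does NOT match the event number (values are invented)
toy_events = pd.DataFrame({'eventNumber': [10, 11, 12]})
toy_selected_jets = pd.DataFrame({'eventNumber': [10, 12, 12, 12]})

# Per-event count of selected jets, indexed by eventNumber
toy_counts = toy_selected_jets.groupby('eventNumber').size()

# Look up each event's count by its event number; events with no jets get 0
toy_events['numJet'] = (toy_events['eventNumber'].map(toy_counts)
                        .fillna(0).astype(int))
print(toy_events)
```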

In [57]:
(events.numJet >= 3).sum()

26
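The remaining step is to propagate the per-event decision back to the object dataframes. A minimal sketch on invented toy data uses `Series.isin`. The threshold of 3 mirrors `cuts.baseline_num_fatjet_min`; the `toy_*` frames and the `passed` name are assumptions for illustration only.

```python
import pandas as pd

# Toy inputs (values are invented)
toy_events = pd.DataFrame({'eventNumber': [0, 1, 2],
                           'numJet': [3, 1, 4]})
toy_tracks = pd.DataFrame({'eventNumber': [0, 0, 1, 2, 2],
                           'pt': [5.0, 2.0, 7.0, 1.5, 3.0]})

# Per-event decision: event numbers with at least 3 selected jets
passed = toy_events.loc[toy_events['numJet'] >= 3, 'eventNumber']

# Keep only the objects that belong to passing events
toy_selected_tracks = toy_tracks[toy_tracks['eventNumber'].isin(passed)]
print(toy_selected_tracks)
```

The same `isin` mask could be applied to the jets and clusters dataframes, since they all carry an `eventNumber` column.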