In [1]:
%load_ext autoreload
%autoreload 2

import copy
import numpy as np
import awkward as ak
import uproot
import pandas as pd
import dask
import vector
import particle
import hepunits

from memflow.dataset.data import ParquetData
vector.register_awkward()



# File system #
For ttH, we use Parquet files, each containing a single dataframe that holds our data.

For technical reasons, the hard-scattering events (`hard`) and the reconstructed events (`reco`) are stored in separate files. As a result, the two datasets have different sizes ($N(reco) < N(hard)$) and the only connection between them is a unique `event number`. We will use this integer to match hard and reco events later.

First, let us load the hard-level information.

Note : I built these data classes to be `lazy` by default (this can be turned off). This means the branches are not loaded initially; they are only loaded when requested and then saved in the object. When you print a branch for the first time, it can take a few seconds to retrieve the data, but it will be faster on the second attempt because the branch is already loaded.
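To make the lazy behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not the `ParquetData` implementation: the class `LazyBranches` and the function `fake_loader` are hypothetical names introduced only for illustration.

```python
class LazyBranches:
    """Toy container that loads a branch only on first access, then caches it."""

    def __init__(self, loader, branches):
        self._loader = loader            # callable: branch name -> data (hypothetical)
        self._branches = set(branches)   # names available in the file
        self._cache = {}                 # branches already loaded
        self.n_loads = 0                 # how many times the loader was really called

    def __getitem__(self, name):
        if name not in self._branches:
            raise KeyError(f"Branch {name!r} not found")
        if name not in self._cache:
            # First request: do the (slow) load and remember the result
            self._cache[name] = self._loader(name)
            self.n_loads += 1
        return self._cache[name]

    def keys(self):
        # Only the branches loaded so far
        return self._cache.keys()


def fake_loader(name):
    # Stand-in for a slow read from disk (illustrative values)
    return [101.0, 271.0, 227.0] if name == "higgs_pt" else [0.0, 0.0, 0.0]


data = LazyBranches(fake_loader, ["higgs_pt", "higgs_eta"])
first = data["higgs_pt"]    # triggers the load
second = data["higgs_pt"]   # served from the cache
print(data.n_loads, list(data.keys()))
```

The second access returns the cached object without calling the loader again, which is why a branch is faster to retrieve once it has been requested.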

Note : contrary to HH, the dataset is already built with variable-length objects. This means we can work directly without having to deal with the default values for absent particles.

In [2]:
data_hard = ParquetData(
    files = [
        '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
    ], # these are the files we want to load, there can be several
    # in parquet data there is no tree
    lazy = True, # Explained above.
    N = 100000, # this is to load only the first N events in the tree,
    # in case you are just playing/debugging and don't need to load all the data (can be slow)
    # to load all the events, just comment this line out
)
print (data_hard)

Data object
Loaded branches:
   ... file: 100000
   ... sample: 100000
   ... tree: 100000
Branch in files not loaded:
   ... Generator_scalePDF
   ... Generator_weight
   ... Generator_x1
   ... Generator_x2
   ... Generator_xpdf1
   ... Generator_xpdf2
   ... W_minus_from_antitop_eta
   ... W_minus_from_antitop_genPartIdxMother
   ... W_minus_from_antitop_idx
   ... W_minus_from_antitop_mass
   ... W_minus_from_antitop_pdgId
   ... W_minus_from_antitop_phi
   ... W_minus_from_antitop_pt
   ... W_minus_from_antitop_status
   ... W_minus_from_antitop_statusFlags
   ... W_plus_from_top_eta
   ... W_plus_from_top_genPartIdxMother
   ... W_plus_from_top_idx
   ... W_plus_from_top_mass
   ... W_plus_from_top_pdgId
   ... W_plus_from_top_phi
   ... W_plus_from_top_pt
   ... W_plus_from_top_status
   ... W_plus_from_top_statusFlags
   ... Z_from_higgs_eta
   ... Z_from_higgs_genPartIdxMother
   ... Z_from_higgs_idx
   ... Z_from_higgs_mass
   ... Z_from_higgs_pdgId
   ... Z_from_higgs_phi
  

When printing the data object, above, you can see three branches have been initialised from the start.
- `file` : contains the file path
- `sample` : contains the file name
- `tree` : contains the tree name

These are just metadata that can be used later, eg when comparing with the reco dataset.

The number next to them is the number of entries.

The other branches as you can see are not loaded yet.

In [3]:
# Print the metadata
data_hard['file'],data_hard['sample'],data_hard['tree']


(<Array [...] type='100000 * string'>,
 <Array ['ttH_HToInvisible_M125.parquet', ...] type='100000 * string'>,
 <Array ['tree', 'tree', 'tree', ..., 'tree', 'tree'] type='100000 * string'>)

You can see these are arrays, but not numpy arrays.

These are [awkward](https://awkward-array.org/doc/main/) arrays, which can contain a variable number of entries in each dimension.

In our case above, they contain `N * string`, meaning `N` strings (with `N` depending on the argument you passed), which makes sense because these are string identifiers.

In [4]:
# Useful tools for checking out what is contained inside
data_hard['file'].show()
data_hard['file'].type.show()

['/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet',
 ...,
 '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.par

In [5]:
# To see what branches have been loaded
print (data_hard.keys())

# To see all the branches (including the ones not loaded)
print (data_hard.branches)

dict_keys(['file', 'tree', 'sample'])
['higgs_statusFlags', 'lep_plus_from_W_plus_pdgId', 'neutrino_from_W_plus_genPartIdxMother', 'top_eta', 'antiquark_from_W_plus_status', 'antiquark_from_W_plus_phi', 'quark_from_W_minus_mass', 'Z_from_higgs_idx', 'W_plus_from_top_status', 'antiquark_from_W_plus_genPartIdxMother', 'Generator_x1', 'higgs_pdgId', 'antitop_pt', 'bottom_pdgId', 'Z_from_higgs_mass', 'quark_from_W_minus_genPartIdxMother', 'Z_from_higgs_statusFlags', 'neutrino_from_W_plus_idx', 'antitop_idx', 'higgs_idx', 'W_plus_from_top_phi', 'antineutrino_from_W_minus_mass', 'antitop_phi', 'lep_plus_from_W_plus_mass', 'lep_plus_from_W_plus_eta', 'neutrino_from_W_plus_pdgId', 'antiquark_from_W_plus_mass', 'quark_from_W_plus_pdgId', 'antineutrino_from_W_minus_pt', 'Generator_x2', 'top_mass', 'antineutrino_from_W_minus_phi', 'neutrinos_from_Z_eta', 'lep_plus_from_W_plus_statusFlags', 'bottom_phi', 'quark_from_W_minus_eta', 'Generator_xpdf2', 'event', 'antibottom_mass', 'top_phi', 'quark_fro

In [6]:
# Now let us load some data: we will check the Higgs pt #
data_hard['higgs_pt'].show()
data_hard['higgs_pt'].type.show()

# You can see an array of N variable-length lists of float64 (one entry each here) containing the Higgs pt

[[101],
 [271],
 [227],
 [188],
 [170],
 [135],
 [45.4],
 [134],
 [353],
 [152],
 ...,
 [188],
 [114],
 [62],
 [250],
 [275],
 [261],
 [194],
 [106],
 [273]]
100000 * var * float64


In [7]:
print (data_hard)
# The branch `higgs_pt` is now shown as loaded
# If you wanted to retrieve the branch again, it would be faster

Data object
Loaded branches:
   ... file: 100000
   ... higgs_pt: 100000
   ... sample: 100000
   ... tree: 100000
Branch in files not loaded:
   ... Generator_scalePDF
   ... Generator_weight
   ... Generator_x1
   ... Generator_x2
   ... Generator_xpdf1
   ... Generator_xpdf2
   ... W_minus_from_antitop_eta
   ... W_minus_from_antitop_genPartIdxMother
   ... W_minus_from_antitop_idx
   ... W_minus_from_antitop_mass
   ... W_minus_from_antitop_pdgId
   ... W_minus_from_antitop_phi
   ... W_minus_from_antitop_pt
   ... W_minus_from_antitop_status
   ... W_minus_from_antitop_statusFlags
   ... W_plus_from_top_eta
   ... W_plus_from_top_genPartIdxMother
   ... W_plus_from_top_idx
   ... W_plus_from_top_mass
   ... W_plus_from_top_pdgId
   ... W_plus_from_top_phi
   ... W_plus_from_top_pt
   ... W_plus_from_top_status
   ... W_plus_from_top_statusFlags
   ... Z_from_higgs_eta
   ... Z_from_higgs_genPartIdxMother
   ... Z_from_higgs_idx
   ... Z_from_higgs_mass
   ... Z_from_higgs_pdgId
  

Now we may want to restrict our hard-dataset to a specific decay.

Indeed, the sample here contains ttH events, with t->bW->b(lnu/qq) and H->ZZ->4nu.

We are so far only interested in the fully hadronic decay, where both W bosons decay into quarks.

You can print the branches of each decay (see names above), and try to see all the possible decays in our simulations.

To move on, we will select our specific decay using a mask.

In [8]:
mask = np.logical_and.reduce(
    [
        # Higgs decay : H->ZZ->4nu #
        ak.num(data_hard['higgs_idx']) == 1,
        ak.num(data_hard['Z_from_higgs_idx']) == 2,
        ak.num(data_hard['neutrinos_from_Z_idx']) == 4,
        # top decay : t->b q qbar #
        ak.num(data_hard['top_idx']) == 1,
        ak.num(data_hard['W_plus_from_top_idx']) == 1,
        ak.num(data_hard['quark_from_W_plus_idx']) == 1,
        ak.num(data_hard['antiquark_from_W_plus_idx']) == 1,
        # antitop decay : tbar->bbar q qbar #
        ak.num(data_hard['antitop_idx']) == 1,
        ak.num(data_hard['W_minus_from_antitop_idx']) == 1,
        ak.num(data_hard['quark_from_W_minus_idx']) == 1,
        ak.num(data_hard['antiquark_from_W_minus_idx']) == 1,
        #ak.num(data_hard['antitop_idx']) == 1,
    ]
)
print (f'Selecting {mask.sum()} events out of {len(mask)}')

Selecting 39842 events out of 100000


We now use the mask to "cut" on those events, meaning we only keep the ones passing our selection, i.e. the ones with a `True` value in our boolean array.

Note : these events are not exactly "cut" here; when you call a branch, only the events passing the mask are returned. In the background, all the events are still there in the branch. This is useful for resetting the mask later in case you want to recover these events.

Note bis : after a selection is made, you can make another mask and apply it again. But the mask must now correspond to the array obtained after the first mask has been used. Otherwise there will be a mismatch between array lengths.
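The two notes above can be illustrated with a small numpy sketch. This is only a toy model of the behaviour described, not the code behind `data_hard.cut`: the class `MaskedBranch` and its `reset` method are hypothetical names.

```python
import numpy as np


class MaskedBranch:
    """Toy branch that keeps all entries and only hides those failing the cuts."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self._mask = np.ones(len(self._values), dtype=bool)

    def get(self):
        # Only the entries passing the current selection are returned
        return self._values[self._mask]

    def cut(self, mask):
        mask = np.asarray(mask, dtype=bool)
        if len(mask) != self._mask.sum():
            raise ValueError("mask length must match the number of currently selected events")
        # Positions of the events still selected, restricted to those passing the new mask
        kept = np.flatnonzero(self._mask)
        new_mask = np.zeros_like(self._mask)
        new_mask[kept[mask]] = True
        self._mask = new_mask

    def reset(self):
        # All events become visible again
        self._mask[:] = True


pt = MaskedBranch([101.0, 271.0, 45.4, 353.0, 152.0])
pt.cut(pt.get() > 100.0)   # first cut, built from the 5 original events
pt.cut(pt.get() < 300.0)   # second mask is built from the 4 remaining events
selected = pt.get()
pt.reset()                 # the 5 events are available again
print(selected, len(pt.get()))
```

Passing a mask of length 5 to the second `cut` call would raise the `ValueError`, which is the length mismatch mentioned above.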

In [9]:
print (f'Before cut : {data_hard.events}')

data_hard.cut(mask)

print (f'After cut : {data_hard.events}')

Before cut : 100000
After cut : 39842


In [10]:
# You can see now that the higgs pt branch only contains the events passing the cut
data_hard['higgs_pt']

Now a big part in playing with particles is to assemble our branches into proper 4-vectors. This will allow us to use the powerful [vector](https://pypi.org/project/vector/) package, that can be used within awkward arrays too. 

Note : in this hard-level setup, the different components of the particles (pt,eta,phi,mass) are in separate branches. So we need to manually turn them into particles. This will be different for the reco.

Note bis : this might change in the future

Note ter : in the HH notebook we used the (px,py,pz,E) frame, but here we will use the (pt,eta,phi,m) frame. Both are natively supported by the vector package.
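For reference, the relation between the two sets of coordinates can be written in a few lines of numpy. This is only a sketch of the standard kinematic formulas; the function name `ptetaphim_to_pxpypze` is made up for the example, and the vector package performs this kind of conversion for you.

```python
import numpy as np


def ptetaphim_to_pxpypze(pt, eta, phi, mass):
    """Convert (pt, eta, phi, m) into (px, py, pz, E)."""
    pt, eta, phi, mass = map(np.asarray, (pt, eta, phi, mass))
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return px, py, pz, energy


# Illustrative particle: purely transverse, along the x axis
px, py, pz, energy = ptetaphim_to_pxpypze(pt=100.0, eta=0.0, phi=0.0, mass=125.0)
print(px, py, pz, energy)
```

With `eta = 0` the longitudinal component `pz` vanishes, so the momentum is entirely in the transverse plane.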

In [11]:
# Note : you should only call this cell once
# if you call it a second time it will tell you the higgs particle is already created
data_hard.make_particles(
    'higgs',
    {
        'pt'  : [
            'higgs_pt',
        ],
        'eta'  : [
            'higgs_eta',
        ],
        'phi'  : [
            'higgs_phi',
        ],
        'mass'  : [
            'higgs_mass',
        ],
        'pdgid'  : [
            'higgs_pdgId',
        ],
    },
)
# Here we create a higgs array, containing a single particle (we know it's there from the cells above)
# we tell the object what each branch corresponds to in terms of variables
# pt,eta,phi,m/mass are required variables, they determine the 4-vectors, but we could have used px,py,pz,E as well (just a different basis)
# any other variable (here, pdg id) is just added as a field to the array
data_hard['higgs'].type.show()

39842 * var * Momentum4D[
    pt: float64,
    eta: float64,
    phi: float64,
    mass: float64,
    pdgid: float64
]


Now we can play !

You can get many of the `vector` package methods and attributes for all the particles together, or individually

In [12]:
# we had the (pt,eta,phi,mass) frame in the array, but thanks to vector we can also get the (px,py,pz,E) components
data_hard['higgs'].px

In [13]:
# let us make all the other particles now 
# for convenience, some particles will be grouped together
# but you are free to build them as you want !
# Note : with a lot of events, this might take a while to load all the branches ...
data_hard.make_particles(
    'tops',
    {
        'pt'  : [
            'top_pt',
            'antitop_pt',
        ],
        'eta'  : [
            'top_eta',
            'antitop_eta',
        ],
        'phi'  : [
            'top_phi',
            'antitop_phi',
        ],
        'mass'  : [
            'top_mass',
            'antitop_mass',
        ],
        'pdgid'  : [
            'top_pdgId',
            'antitop_pdgId',
        ],
    },
)
data_hard.make_particles(
    'bottoms',
    {
        'pt'  : [
            'bottom_pt',
            'antibottom_pt',
        ],
        'eta'  : [
            'bottom_eta',
            'antibottom_eta',
        ],
        'phi'  : [
            'bottom_phi',
            'antibottom_phi',
        ],
        'mass'  : [
            'bottom_mass',
            'antibottom_mass',
        ],
        'pdgid'  : [
            'bottom_pdgId',
            'antibottom_pdgId',
        ],
    },
)
data_hard.make_particles(
    'Ws',
    {
        'pt'  : [
            'W_plus_from_top_pt',
            'W_minus_from_antitop_pt',
        ],
        'eta'  : [
            'W_plus_from_top_eta',
            'W_minus_from_antitop_eta',
        ],
        'phi'  : [
            'W_plus_from_top_phi',
            'W_minus_from_antitop_phi',
        ],
        'mass'  : [
            'W_plus_from_top_mass',
            'W_minus_from_antitop_mass',
        ],
        'pdgid'  : [
            'W_plus_from_top_pdgId',
            'W_minus_from_antitop_pdgId',
        ],
    },
)
# For quarks, even though we have applied the mask, the whole arrays are kept internally
# This causes an issue because in some events there are no quarks, so an error will be triggered
# To avoid this, we will provide a `pad_value` argument to fill missing values
# This is just technical: in the end we have selected only the events with quarks, so the zero-padded values are skipped anyway
data_hard.make_particles(
    'quarks',
    {
        'pt'  : [
            'quark_from_W_plus_pt',
            'antiquark_from_W_plus_pt',
            'quark_from_W_minus_pt',
            'antiquark_from_W_minus_pt',
        ],
        'eta'  : [
            'quark_from_W_plus_eta',
            'antiquark_from_W_plus_eta',
            'quark_from_W_minus_eta',
            'antiquark_from_W_minus_eta',
        ],
        'phi'  : [
            'quark_from_W_plus_phi',
            'antiquark_from_W_plus_phi',
            'quark_from_W_minus_phi',
            'antiquark_from_W_minus_phi',
        ],
        'mass'  : [
            'quark_from_W_plus_mass',
            'antiquark_from_W_plus_mass',
            'quark_from_W_minus_mass',
            'antiquark_from_W_minus_mass',
        ],
        'pdgid'  : [
            'quark_from_W_plus_pdgId',
            'antiquark_from_W_plus_pdgId',
            'quark_from_W_minus_pdgId',
            'antiquark_from_W_minus_pdgId',
        ],
    },
    pad_value = 0.,
)
# Note : the Zs are already in one array here (because I coded it that way), no need to specify them separately
data_hard.make_particles(
    'Zs',
    {
        'pt'  : [
            'Z_from_higgs_pt',
        ],
        'eta'  : [
            'Z_from_higgs_eta',
        ],
        'phi'  : [
            'Z_from_higgs_phi',
        ],
        'mass'  : [
            'Z_from_higgs_mass',
        ],
        'pdgid'  : [
            'Z_from_higgs_pdgId',
        ],
    },
)
# same for the neutrinos
data_hard.make_particles(
    'neutrinos',
    {
        'pt'  : [
            'neutrinos_from_Z_pt',
        ],
        'eta'  : [
            'neutrinos_from_Z_eta',
        ],
        'phi'  : [
            'neutrinos_from_Z_phi',
        ],
        'mass'  : [
            'neutrinos_from_Z_mass',
        ],
        'pdgid'  : [
            'neutrinos_from_Z_pdgId',
        ],
    },
)

In [14]:
data_hard['neutrinos'].pt # the pt of all neutrinos, for each event
# Note : the type is now N * var * float64
# - N events
# - var : variable length, ie variable number of neutrinos
# in this case we made sure to have 4 neutrinos

In [15]:
data_hard['neutrinos'].eta[:,[1,3]] # the eta of only the neutrinos at indices 1 and 3

In [16]:
data_hard['neutrinos'][:,0].deltaR(data_hard['quarks'][:,2])
# the deltaR between the first neutrino and the third quark

You can play with these objects as you want: there is a wide variety of methods in the [vector](https://pypi.org/project/vector/) package, and you can check the awkward documentation to see how awkward arrays work.

Note : you can convert to numpy arrays using 
```
data_hard['neutrinos'].pt.to_numpy()
```

This can get a bit tricky when the array has variable length though; here we made sure there are four neutrinos.

Let us move to reco now. We will load it exactly as before, but from a different file.

In [17]:
data_reco = ParquetData(
    files = [
        '/nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v2/reco/2018/ttH/ttH_HToInvisible_M125.parquet',
    ],
    lazy = True, # Explained above.
    #N = data_hard.N,
)
# N is not set here (the line reusing the hard-level N is commented out), so all the reco events are loaded
print (data_reco)

Data object
Loaded branches:
   ... file: 231528
   ... sample: 231528
   ... tree: 231528
Branch in files not loaded:
   ... Generator_scalePDF
   ... Generator_weight
   ... Generator_x1
   ... Generator_x2
   ... Generator_xpdf1
   ... Generator_xpdf2
   ... InputMet_phi
   ... InputMet_pt
   ... cleanedJet_btagDeepFlavB
   ... cleanedJet_eta
   ... cleanedJet_mass
   ... cleanedJet_phi
   ... cleanedJet_pt
   ... event
   ... ncleanedBJet
   ... ncleanedJet
   ... region
   ... weight_nominal
   ... xs_weight


In the reco case we only have access to the MET and the jets. Contrary to the hard level, here the objects are already stored as awkward arrays with variable length.

Let us make the particles as before.

Note : while for hard-level we only included the pdgid as additional info, we will specialise it here a bit more per object
- jets : btag score and whether it passed the btag medium threshold
- met : we only have the pt and phi (the MET is always transverse), but we will provide 0. to eta and mass to make it a proper particle

In [18]:
jets = data_reco.make_particles(
    'jets',
    {
        'pt'      : 'cleanedJet_pt',
        'eta'     : 'cleanedJet_eta',
        'phi'     : 'cleanedJet_phi',
        'mass'    : 'cleanedJet_mass',
        'btag'    : 'cleanedJet_btagDeepFlavB',
    },
)
met = data_reco.make_particles(
    'met',
    {
        'pt'      : 'InputMet_pt',
        'eta'     : 0.,
        'phi'     : 'InputMet_phi',
        'mass'    : 0.,
    },
)
jets.type.show()
met.type.show()

231528 * var * Momentum4D[
    pt: float64,
    eta: float64,
    phi: float64,
    mass: float64,
    btag: float64
]
231528 * Momentum4D[
    pt: float64,
    eta: float64,
    phi: float64,
    mass: float64
]


Now, because we use reconstructed objects, you finally see above an example of a variable-length awkward array. Events have at least 5 jets, and can have up to 15 jets.

In [19]:
# Print number of jets in each event
print (ak.num(jets,axis=1))
# print min number of jets #
print (ak.min(ak.num(jets,axis=1)))
# print max number of jets #
print (ak.max(ak.num(jets,axis=1)))

[5, 8, 7, 10, 5, 6, 6, 5, 5, 10, 6, 7, ..., 5, 6, 5, 6, 9, 6, 7, 5, 5, 5, 6, 9]
5
15


In [20]:
# as you can see below the jets are pt-ordered
jets.pt

In [21]:
# The btag is a score that tells how likely the jet is to come from a b quark (we call them bjets)
# Note : a score of 0.5 does not mean a 50% chance of being a bjet
jets.btag.show()
jets.btag.type.show()

[[0.968, 0.998, 0.0286, 0.0177, 0.0125],
 [0.648, 0.00455, 0.05, 0.00617, 0.116, 0.988, 0.00925, 0.0668],
 [0.0133, 0.938, 0.0183, 0.0387, 0.0326, 0.00654, 0.108],
 [0.997, 0.00394, 0.0171, 0.0178, 0.032, ..., 0.531, 0.00849, 0.497, 0.0256],
 [0.104, 0.0055, 0.00355, 0.0263, 0.565],
 [0.997, 0.19, 0.0316, 0.00264, 0.00582, 0.00883],
 [0.00623, 0.0418, 0.0307, 0.905, 0.0147, 0.047],
 [0.427, 0.0186, 0.0535, 0.01, 0.965],
 [1, 0.147, 0.0208, 0.00881, 0.272],
 [0.041, 0.00951, 0.00326, 0.0177, 0.0365, ..., 0.143, 0.999, 0.00962, 0.0134],
 ...,
 [0.00668, 0.404, 0.00798, 0.00748, 0.996, 0.0683],
 [0.805, 0.0529, 0.921, 0.0228, 0.00353, 0.0895, 0.00644, 0.0105, 0.00659],
 [0.00615, 0.0198, 0.00195, 0.0468, 0.00273, 0.998],
 [0.00186, 0.044, 0.00813, 0.799, 0.398, 0.878, 0.00974],
 [0.0282, 1, 0.00186, 0.0995, 0.972],
 [0.139, 0.219, 0.00331, 0.977, 0.767],
 [0.0066, 0.728, 0.792, 0.00406, 0.00406],
 [0.0047, 0.00974, 0.0671, 0.00903, 0.999, 0.963],
 [0.345, 0.00969, 0.105, 0.228, 0.0369, 0.

In [26]:
isinstance(jets,ak.Array)

True

In [22]:
# we might want to order the jets differently
# right now they are pt-ordered, but we might want to have them btag-ordered
idx = ak.argsort(jets.btag,ascending=False)
idx.show()
# these are the indices of the decreasing order of btag score
jets_sorted = jets[idx]
jets_sorted.btag.show()
jets_sorted.pt.show()
# as you can see, now the jets are b-tag ordered and not pt-ordered anymore

[[1, 0, 2, 3, 4],
 [5, 0, 4, 7, 2, 6, 3, 1],
 [1, 6, 3, 4, 2, 0, 5],
 [0, 6, 8, 4, 9, 3, 2, 7, 5, 1],
 [4, 0, 3, 1, 2],
 [0, 1, 2, 5, 4, 3],
 [3, 5, 1, 2, 4, 0],
 [4, 0, 2, 1, 3],
 [0, 4, 1, 2, 3],
 [7, 6, 5, 0, 4, 3, 9, 8, 1, 2],
 ...,
 [4, 1, 5, 2, 3, 0],
 [2, 0, 5, 1, 3, 7, 8, 6, 4],
 [5, 3, 1, 0, 4, 2],
 [5, 3, 4, 1, 6, 2, 0],
 [1, 4, 3, 0, 2],
 [3, 4, 1, 0, 2],
 [2, 1, 0, 3, 4],
 [4, 5, 2, 1, 3, 0],
 [6, 7, 0, 3, 2, 8, 4, 1, 5]]
[[0.998, 0.968, 0.0286, 0.0177, 0.0125],
 [0.988, 0.648, 0.116, 0.0668, 0.05, 0.00925, 0.00617, 0.00455],
 [0.938, 0.108, 0.0387, 0.0326, 0.0183, 0.0133, 0.00654],
 [0.997, 0.531, 0.497, 0.032, 0.0256, ..., 0.0171, 0.00849, 0.00644, 0.00394],
 [0.565, 0.104, 0.0263, 0.0055, 0.00355],
 [0.997, 0.19, 0.0316, 0.00883, 0.00582, 0.00264],
 [0.905, 0.047, 0.0418, 0.0307, 0.0147, 0.00623],
 [0.965, 0.427, 0.0535, 0.0186, 0.01],
 [1, 0.272, 0.147, 0.0208, 0.00881],
 [0.999, 0.143, 0.0425, 0.041, 0.0365, ..., 0.0134, 0.00962, 0.00951, 0.00326],
 ...,
 [0.996, 0.404

In [23]:
# Check the MET #
print (met.pt,met.phi)
# The MET is only on the transverse plane, let us check that
print (met.eta,met.pz)

[254, 238, 233, 322, 231, 275, 244, 210, ..., 246, 243, 261, 434, 227, 280, 493] [-0.374, -0.624, 3.04, 1.21, 1.84, 2.42, ..., 1.12, -0.612, -1.04, 0.434, 0.496]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [24]:
data_reco['region']

Note : contrary to the HH case, here the reco data contains all the regions we select in the analysis.

We have multiple CRs (control regions), but for now we will only deal with the SR (signal region).

For context, our SR vetoes leptons, to make sure we target the fully hadronic decays. Our CRs include leptons, and we use them to measure some of our backgrounds before using them in the SR.

In the analysis setup, the regions are integers, with 0 being the SR.

In [25]:
mask = data_reco['region'] == 0
print ('Before cut', data_reco.events)
data_reco.cut(mask)
print ('After cut', data_reco.events)

Before cut 231528
After cut 114647


Finally, we will want at some point to combine the hard and reco information to have a bijection hard<->reco.

As explained previously, the trees have different sizes, so index 413 in the hard tree will not correspond to the same event as index 413 in the reco tree.

Instead, we will use the event number, which is unique per event, to match the two cases.

BUT, one needs to be a bit careful because that event number is unique ... per sample. So you can have the same event number in sample A and sample B. If we use a single file in the data objects, this is fine; otherwise some care must be taken.

I have made a utility function to perform this, see below.

In [37]:
from memflow.dataset.utils import get_data_intersection_indices

hard_idx, reco_idx = get_data_intersection_indices(
    datas = [
        data_hard,
        data_reco
    ],
    branch = 'event', # this is the branch present in both data object that contains the event number
    different_files = True, # this is because here (for technical reasons), the reco and hard trees are in different files
)

Looking into file metadata
Will pair these files together :
   - /nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v6/hard/2018/ttH/ttH_HToInvisible_M125.parquet <-> /nfs/scratch/fynu/fbury/MEM_data/ttH/TF_v2/reco/2018/ttH/ttH_HToInvisible_M125.parquet
For entry 0 : from 39842 events, 4828 selected
For entry 1 : from 114647 events, 4828 selected


The function is a bit verbose, but the explicit printing is intentional because this is quite a crucial step.

Because we had made cuts to both the hard and reco data, only the common indices were selected.

Now `reco_idx` contains the indices in the reco data that are matched to hard data, and vice versa for `hard_idx`.
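The idea behind such a matching can be sketched with plain numpy. This is not the implementation of `get_data_intersection_indices`: it uses toy event numbers, and it assumes a single sample, so that event numbers are unique within each array.

```python
import numpy as np

# Toy event numbers left after independent cuts on each dataset (illustrative values)
hard_events = np.array([11, 42, 7, 93, 58])
reco_events = np.array([58, 5, 42, 11])

# Common event numbers and their positions in each array
common, hard_pos, reco_pos = np.intersect1d(
    hard_events, reco_events, return_indices=True
)

# The two index arrays select the same events, in the same order
matched = np.array_equal(hard_events[hard_pos], reco_events[reco_pos])
print(common, hard_pos, reco_pos, matched)
```

With several samples, the same event number can appear more than once, so the intersection would have to be done sample by sample (or on the pair of sample and event number).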

In [38]:
reco_idx,hard_idx

(array([   1,    2,    3, ..., 6020, 6021, 6022]),
 array([    3,     4,    13, ..., 39814, 39821, 39834]))

In [39]:
# Now we can compare gen and reco events
# For example, you can do some manual gen-matching !
# Compare the (pt,eta,phi) of the jets and the quarks/b-quarks, and try to link them
# In principle, the pt can vary significantly (though this is rare) but the delta_phi and delta_eta must be pretty close
# so we typically look at the angle difference mostly
# For your information, we usually do that by requiring delta_R = sqrt(delta_eta**2+delta_phi**2) < 0.4
# But for playing here, you can just compare by eye the delta_eta and delta_phi independently
# Once you find a match, compare the btag score of the jet with the pdgid of the quark
# - light jets ( |pdg id| <= 4 ) : should have a low btag score
# - bjets (pdg id = +/- 5) : should have a high btag score
for r_idx,g_idx in zip(reco_idx,hard_idx):
    print (f'Reco idx {r_idx}, hard idx {g_idx}')
    event_jets = data_reco['jets'][r_idx]
    event_quarks = data_hard['quarks'][g_idx]
    event_bottoms = data_hard['bottoms'][g_idx]
    print (f'\t {len(event_jets)} jets')
    print (f'\t\t pt  : {event_jets.pt}')
    print (f'\t\t eta : {event_jets.eta}')
    print (f'\t\t phi : {event_jets.phi}')
    print (f'\t\t btag: {event_jets.btag}')
    print (f'\t {len(event_quarks)} (light) quarks')
    print (f'\t\t pt  : {event_quarks.pt}')
    print (f'\t\t eta : {event_quarks.eta}')
    print (f'\t\t phi : {event_quarks.phi}')
    print (f'\t\t pdg : {event_quarks.pdgid}')
    print (f'\t {len(event_bottoms)} b-quarks')
    print (f'\t\t pt  : {event_bottoms.pt}')
    print (f'\t\t eta : {event_bottoms.eta}')
    print (f'\t\t phi : {event_bottoms.phi}')
    print (f'\t\t pdg : {event_bottoms.pdgid}')

Reco idx 1, hard idx 3
	 8 jets
		 pt  : [239, 211, 132, 107, 55.3, 39.1, 38.4, 30.5]
		 eta : [-1.6, 1.1, 0.177, -0.972, -0.015, 1.64, 0.717, -1.35]
		 phi : [1.88, -1.68, 1.61, -3.04, 2.57, -0.707, -0.664, 2.45]
		 btag: [0.648, 0.00455, 0.05, 0.00617, 0.116, 0.988, 0.00925, 0.0668]
	 4 (light) quarks
		 pt  : [186, 31.8, 78.2, 319]
		 eta : [1.1, 0.725, -1.43, -1.61]
		 phi : [-1.7, -0.68, 2.24, 1.79]
		 pdg : [2, -1, 3, -4]
	 2 b-quarks
		 pt  : [58.8, 28.6]
		 eta : [1.88, -0.461]
		 phi : [-0.566, 2.68]
		 pdg : [5, -5]
Reco idx 2, hard idx 4
	 7 jets
		 pt  : [381, 225, 154, 96.4, 45.5, 43.8, 31.1]
		 eta : [-1.52, 0.582, -2.15, 0.529, -1.82, -0.253, -1.6]
		 phi : [-0.572, 2.11, -1.06, 1.5, 0.803, 2.31, -3.01]
		 btag: [0.0133, 0.938, 0.0183, 0.0387, 0.0326, 0.00654, 0.108]
	 4 (light) quarks
		 pt  : [73, 47.5, 53, 51.6]
		 eta : [0.527, -0.332, -1.66, -1.45]
		 phi : [1.49, 2.4, 1.14, -0.609]
		 pdg : [4, -3, 3, -4]
	 2 b-quarks
		 pt  : [285, 68.8]
		 eta : [0.586, -1.57]
		

You can see that in some cases, we have N jets from 2 bottom + 4 light quarks. But most often we have more jets (from a variety of origins: fake jets, pileup, additional radiation, etc). We can in principle have fewer (eg, one jet has too low an energy to be detected, is outside the detector acceptance, or is in line with an electron and missed, etc). In our current selections, we ask for $\geq$ 5 jets.

You can also see the btag score above: the closer to 1, the more likely the jet comes from a b quark. In some cases, you can see we have two jets, but one has a very low btag score. This can be because the btag score (which is itself the output of an ML model) can be wrong, or just because this is an additional jet and we lost the second original bjet.
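The comparison by eye suggested in the loop above can also be automated. The sketch below applies the delta_R < 0.4 criterion to the light quarks and jets of the first printed event (reco idx 1, hard idx 3), with values copied from that output. It is a simple nearest-jet search, written for illustration: it does not enforce a one-to-one assignment, and the helper `delta_r` is defined here rather than taken from the vector package.

```python
import numpy as np


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with delta_phi wrapped into [-pi, pi)."""
    d_eta = eta1 - eta2
    d_phi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.sqrt(d_eta**2 + d_phi**2)


# Values copied from the first printed event (reco idx 1, hard idx 3)
jet_eta = np.array([-1.6, 1.1, 0.177, -0.972, -0.015, 1.64, 0.717, -1.35])
jet_phi = np.array([1.88, -1.68, 1.61, -3.04, 2.57, -0.707, -0.664, 2.45])
quark_eta = np.array([1.1, 0.725, -1.43, -1.61])
quark_phi = np.array([-1.7, -0.68, 2.24, 1.79])

# Matrix of distances: one row per quark, one column per jet
dr = delta_r(quark_eta[:, None], quark_phi[:, None], jet_eta[None, :], jet_phi[None, :])

best_jet = dr.argmin(axis=1)                        # closest jet for each quark
best_dr = dr[np.arange(len(quark_eta)), best_jet]   # corresponding distance
is_matched = best_dr < 0.4                          # usual matching threshold

for q, (j, d, ok) in enumerate(zip(best_jet, best_dr, is_matched)):
    print(f"quark {q} -> jet {j} (delta_R = {d:.3f}, matched = {ok})")
```

The wrapping of delta_phi matters for objects close to phi = +/- pi, where a naive difference would give a distance close to 2 pi instead of a small angle.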

These are basically the challenges we face when performing analyses.

As an exercise, I will let you play with electrons and muons, comparing with the hard data.

In another notebook we will assemble them into pytorch datasets. Most of the code above has been incorporated into proper python scripts with all the logic to be used in a dataset class, but this notebook allows you a finer investigation of how we structure our data in an analysis.