In [1]:
%load_ext autoreload
%autoreload 2

import copy
import numpy as np
import awkward as ak
import uproot
import pandas as pd
import dask
import vector
import particle
import hepunits

from memflow.dataset.data import RootData
vector.register_awkward()



# File system #
For HH, we use ROOT files (this might be different for ttH, but I will build the machinery so that the following should work the same).

Each ROOT file contains two trees, which are just tables whose columns are called branches. The `gen-level` tree contains the "true" information (all the events in the dataset), and the `reco-level` tree contains reconstructed events, but only for events passing our selections. This means that $N(reco) < N(gen)$ and the only connection between the two datasets is a unique `event number`. We will use this integer to match gen and reco together later.

First, let us load the gen-level information.

Note : I built these data classes to be `lazy` by default (this can be turned off). This means the branches are not loaded initially; they are only loaded when requested and then saved in the object. When you print a branch for the first time, it can take a few seconds to retrieve the data, but it will be faster on the second attempt because the branch is already loaded.
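To make the lazy behaviour concrete, here is a minimal sketch of the idea in plain Python. The `LazyBranches` class and `slow_loader` function are hypothetical names made up for illustration; this is not how `RootData` is implemented, only the same load-on-first-access pattern.

```python
import time


class LazyBranches:
    """Toy container (hypothetical, not the memflow class): loads a branch on first access."""

    def __init__(self, loader, branch_names):
        self._loader = loader                  # callable: branch name -> data
        self._available = list(branch_names)   # branches that exist "in the file"
        self._cache = {}                       # branches already read

    def __getitem__(self, name):
        if name not in self._cache:
            if name not in self._available:
                raise KeyError(name)
            self._cache[name] = self._loader(name)  # slow path, first access only
        return self._cache[name]

    def keys(self):
        return self._cache.keys()


def slow_loader(name):
    time.sleep(0.05)                # stand-in for reading the branch from disk
    return [456.0, 202.0, 513.0]    # made-up values


data = LazyBranches(slow_loader, ['H1_E', 'H1_Px'])
print(list(data.keys()))   # nothing loaded yet
first = data['H1_E']       # triggers the read
second = data['H1_E']      # served from the cache, no second read
print(list(data.keys()))   # only the requested branch is loaded
```

The second access returns the very same object that was stored on the first access, which is why repeated requests are fast.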

In [2]:
data_hard = RootData(
    files = [
        '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
    ], # these are the files we want to load, there can be several
    treenames = [
        'gen_HH;1',
    ], # this is the name of the gen-level tree
    lazy = True, # Explained above.
    N = 1000, # load only the first N events in the tree,
    # in case you are just playing/debugging and don't need to load all the data (can be slow)
    # to load everything, comment out this line
)
print (data_hard)

Data object
Loaded branches:
   ... file: 1000
   ... sample: 1000
   ... tree: 1000
Branch in files not loaded:
   ... H1_E
   ... H1_Px
   ... H1_Py
   ... H1_Pz
   ... H1_eta
   ... H1_idx
   ... H1_mass
   ... H1_pdgId
   ... H1_phi
   ... H1_pt
   ... H1_sum_E
   ... H2_E
   ... H2_Px
   ... H2_Py
   ... H2_Pz
   ... H2_eta
   ... H2_idx
   ... H2_mass
   ... H2_pdgId
   ... H2_phi
   ... H2_pt
   ... H2_sum_E
   ... ISR_10_E
   ... ISR_10_Px
   ... ISR_10_Py
   ... ISR_10_Pz
   ... ISR_10_eta
   ... ISR_10_idx
   ... ISR_10_mass
   ... ISR_10_parent
   ... ISR_10_pdgId
   ... ISR_10_phi
   ... ISR_10_pt
   ... ISR_11_E
   ... ISR_11_Px
   ... ISR_11_Py
   ... ISR_11_Pz
   ... ISR_11_eta
   ... ISR_11_idx
   ... ISR_11_mass
   ... ISR_11_parent
   ... ISR_11_pdgId
   ... ISR_11_phi
   ... ISR_11_pt
   ... ISR_12_E
   ... ISR_12_Px
   ... ISR_12_Py
   ... ISR_12_Pz
   ... ISR_12_eta
   ... ISR_12_idx
   ... ISR_12_mass
   ... ISR_12_parent
   ... ISR_12_pdgId
   ... ISR_12_phi
   .

When printing the data object, above, you can see three branches have been initialised from the start.
- `file` : contains the file path
- `sample` : contains the file name
- `tree` : contains the tree name

These are just metadata that can be used later, eg when comparing with the reco dataset.

The number next to them is the number of entries.

The other branches as you can see are not loaded yet.

In [3]:
# Print the metadata
data_hard['file'],data_hard['sample'],data_hard['tree']



(<Array [...] type='1000 * string'>,
 <Array ['GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root', ...] type='1000 * string'>,
 <Array [...] type='1000 * string'>)

You can see these are arrays, but not numpy arrays.

These are [awkward](https://awkward-array.org/doc/main/) arrays, which can contain a variable number of entries in each dimension.

In our case above, they contain `N * string`, meaning `N` strings (where `N` depends on the argument you passed). This makes sense because these are string identifiers.

In [4]:
# Useful tools for checking out what is contained inside
data_hard['file'].show()
data_hard['file'].type.show()

['/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root',
 '/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/Gl

In [5]:
# To see what branches have been loaded
print (data_hard.keys())

# To see all the branches (including the ones not loaded)
print (data_hard.branches)

dict_keys(['file', 'tree', 'sample'])
['H1_E', 'H1_Px', 'H1_Py', 'H1_Pz', 'H1_eta', 'H1_idx', 'H1_mass', 'H1_pdgId', 'H1_phi', 'H1_pt', 'H1_sum_E', 'H2_E', 'H2_Px', 'H2_Py', 'H2_Pz', 'H2_eta', 'H2_idx', 'H2_mass', 'H2_pdgId', 'H2_phi', 'H2_pt', 'H2_sum_E', 'ISR_10_E', 'ISR_10_Px', 'ISR_10_Py', 'ISR_10_Pz', 'ISR_10_eta', 'ISR_10_idx', 'ISR_10_mass', 'ISR_10_parent', 'ISR_10_pdgId', 'ISR_10_phi', 'ISR_10_pt', 'ISR_11_E', 'ISR_11_Px', 'ISR_11_Py', 'ISR_11_Pz', 'ISR_11_eta', 'ISR_11_idx', 'ISR_11_mass', 'ISR_11_parent', 'ISR_11_pdgId', 'ISR_11_phi', 'ISR_11_pt', 'ISR_12_E', 'ISR_12_Px', 'ISR_12_Py', 'ISR_12_Pz', 'ISR_12_eta', 'ISR_12_idx', 'ISR_12_mass', 'ISR_12_parent', 'ISR_12_pdgId', 'ISR_12_phi', 'ISR_12_pt', 'ISR_13_E', 'ISR_13_Px', 'ISR_13_Py', 'ISR_13_Pz', 'ISR_13_eta', 'ISR_13_idx', 'ISR_13_mass', 'ISR_13_parent', 'ISR_13_pdgId', 'ISR_13_phi', 'ISR_13_pt', 'ISR_14_E', 'ISR_14_Px', 'ISR_14_Py', 'ISR_14_Pz', 'ISR_14_eta', 'ISR_14_idx', 'ISR_14_mass', 'ISR_14_parent', 'ISR_14_pdgId', 

In [6]:
# Now let us load some data: we will check the energy of the first Higgs
data_hard['H1_E'].show()
data_hard['H1_E'].type.show()

# You can see an array of N float64 values containing the Higgs energy

[456,
 202,
 513,
 622,
 363,
 959,
 627,
 224,
 843,
 394,
 ...,
 325,
 168,
 175,
 283,
 148,
 126,
 781,
 298,
 307]
1000 * float64


In [7]:
print (data_hard)
# The branch `H1_E` is now shown as loaded
# If you retrieve the branch again, it will be faster

Data object
Loaded branches:
   ... H1_E: 1000
   ... file: 1000
   ... sample: 1000
   ... tree: 1000
Branch in files not loaded:
   ... H1_Px
   ... H1_Py
   ... H1_Pz
   ... H1_eta
   ... H1_idx
   ... H1_mass
   ... H1_pdgId
   ... H1_phi
   ... H1_pt
   ... H1_sum_E
   ... H2_E
   ... H2_Px
   ... H2_Py
   ... H2_Pz
   ... H2_eta
   ... H2_idx
   ... H2_mass
   ... H2_pdgId
   ... H2_phi
   ... H2_pt
   ... H2_sum_E
   ... ISR_10_E
   ... ISR_10_Px
   ... ISR_10_Py
   ... ISR_10_Pz
   ... ISR_10_eta
   ... ISR_10_idx
   ... ISR_10_mass
   ... ISR_10_parent
   ... ISR_10_pdgId
   ... ISR_10_phi
   ... ISR_10_pt
   ... ISR_11_E
   ... ISR_11_Px
   ... ISR_11_Py
   ... ISR_11_Pz
   ... ISR_11_eta
   ... ISR_11_idx
   ... ISR_11_mass
   ... ISR_11_parent
   ... ISR_11_pdgId
   ... ISR_11_phi
   ... ISR_11_pt
   ... ISR_12_E
   ... ISR_12_Px
   ... ISR_12_Py
   ... ISR_12_Pz
   ... ISR_12_eta
   ... ISR_12_idx
   ... ISR_12_mass
   ... ISR_12_parent
   ... ISR_12_pdgId
   ... ISR_12_ph

Now we may want to restrict our gen-dataset to a specific decay.

Indeed, the HH sample here contains HH->bblnulnu (2 b-quarks + 2 charged leptons + 2 neutrinos). 

But these leptons and neutrinos can come either from bbWW or from bbZZ. We will ultimately want to deal with both, but let us focus on bbWW for now.

To do this, we will select the events for which we have a W+, a W-, and two b-quarks (a bottom and an antibottom). The charged leptons and neutrinos must come from the Ws.

By default, all branches are filled, but when the particle has not been found in the simulation, a default value of -9999 is used. This means that to select our bbWW events, we just want to find the events for which the branches have values different from -9999, so typically we check that `E >= 0`. We do that for every branch we are interested in, and combine all the boolean masks into a single one with the `AND` condition.

Note : this branch configuration was decided by myself based on the tools I had at my disposal to process the simulations, it will be different for ttH (hopefully not too much) that uses a different framework.

In [8]:
branches_bbWW_DL = [
    'W_plus',
    'W_minus',
    'bottom',
    'antibottom',
    'lep_plus_from_W',
    'lep_minus_from_W',
    'neutrino_from_W',
    'antineutrino_from_W',
] # All the branches for which we want to find a particle

mask_bbWW_DL = np.logical_and.reduce(
    [
        data_hard[f'{br}_E'] >= 0 # ask there for a non-default value
        for br in branches_bbWW_DL
    ]
) # get the combined mask
print (f'Out of {len(mask_bbWW_DL)} events, {mask_bbWW_DL.sum()} are bbWW DL')

Out of 1000 events, 968 are bbWW DL
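The same masking logic can be tried on a toy example without the ROOT file. The branch names are reused from above, but the values are made up for illustration.

```python
import numpy as np

DEFAULT = -9999.0  # placeholder used when a particle was not found

# Toy energy branches for 5 events (made-up values)
branches = {
    'W_plus_E':  np.array([80.0, 95.0, DEFAULT, 120.0, 88.0]),
    'W_minus_E': np.array([85.0, DEFAULT, DEFAULT, 101.0, 90.0]),
    'bottom_E':  np.array([60.0, 45.0, 70.0, 52.0, DEFAULT]),
}

# One boolean mask per branch, then a logical AND across all of them
masks = [energy >= 0 for energy in branches.values()]
combined = np.logical_and.reduce(masks)

print(combined)
print(f'Out of {len(combined)} events, {combined.sum()} pass')
```

An event only survives if every requested particle was found, so a single default value in any branch is enough to reject it.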


We now use the mask to "cut" on those events, meaning we only keep the ones passing our selection, ie the ones with a `True` value in our boolean array.

Note : these events are not exactly "cut" here, it is just that when you call a branch, only the ones passing the mask will be returned. In the background all the events are still there in the branch. This is useful for resetting the mask later in case you want to recover these events.

Note bis : after a selection is made, you can make another mask and apply it again. But the mask must now correspond to the array obtained after the first mask has been used. Otherwise there will be a mismatch between array lengths.
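The length mismatch mentioned in the second note is easy to reproduce with plain numpy. This sketch only illustrates the bookkeeping; it says nothing about how `cut` is implemented internally, and the values are made up.

```python
import numpy as np

energy = np.array([456.0, 202.0, 513.0, 622.0, 363.0, 959.0])

first_mask = energy > 300        # length 6, defined on the full array
selected = energy[first_mask]    # 5 events survive

# A second mask must be built from the already-cut array (length 5) ...
second_mask = selected < 700
final = selected[second_mask]

# ... because a mask built on the original array has the wrong length
stale_mask = energy < 700        # length 6
try:
    selected[stale_mask]
except IndexError as err:
    print('Mismatch:', err)
```

In practice this means computing the second mask from branches retrieved after the first cut, so that both have the same number of events.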

In [9]:
print (f'Before cut : {data_hard.events}')

data_hard.cut(mask_bbWW_DL)

print (f'After cut : {data_hard.events}')

Before cut : 1000
After cut : 968


Now a big part in playing with particles is to assemble our branches into proper 4-vectors. This will allow us to use the powerful [vector](https://pypi.org/project/vector/) package, that can be used within awkward arrays too. 

Note : in my setup (branches and all that) you need to do that manually. In other frameworks (eg my colleagues in Zurich) the 4-vectors are already declared because they use a different data format file (parquet).

In [10]:
# Note : calling `make_particles` a second time would tell you the leptons particle is already created,
# so we delete any previous version first, which makes this cell safe to re-run
if 'leptons' in data_hard:
    data_hard.delete('leptons')
data_hard.make_particles(
    'leptons',
    {
        'px'  : [
            'lep_plus_from_W_Px',
            'neutrino_from_W_Px',
            'lep_minus_from_W_Px',
            'antineutrino_from_W_Px',
        ],
        'py'  : [
            'lep_plus_from_W_Py',
            'neutrino_from_W_Py',
            'lep_minus_from_W_Py',
            'antineutrino_from_W_Py',
        ],
        'pz'  : [
            'lep_plus_from_W_Pz',
            'neutrino_from_W_Pz',
            'lep_minus_from_W_Pz',
            'antineutrino_from_W_Pz',
        ],
        'E'  : [
            'lep_plus_from_W_E',
            'neutrino_from_W_E',
            'lep_minus_from_W_E',
            'antineutrino_from_W_E',
        ],
        'pdgId'  : [
            'lep_plus_from_W_pdgId',
            'neutrino_from_W_pdgId',
            'lep_minus_from_W_pdgId',
            'antineutrino_from_W_pdgId',
        ],
    },
    lambda vec : vec.E >= 0, # just makes sure to only load non-default values
)
# Here we create a leptons array, consisting of 4 particles : l+, nu, l-, anti-nu
# we tell the object what each branch corresponds to in terms of variables
# px,py,pz,E are required variables, they determine the 4-vectors, but we could have used pt,eta,phi,m as well (just a different basis)
# any other variable (here, the pdg id) is just added as a field of the array
data_hard['leptons'].type.show()

968 * var * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64,
    pdgId: float64
]
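The comment above mentions that (px, py, pz, E) and (pt, eta, phi, m) are just two bases for the same 4-vector. As a small worked example with plain numpy and made-up values, the standard conversion formulas look like this (the `vector` package does this for you; the sketch only spells out the arithmetic):

```python
import numpy as np

# One toy 4-vector in Cartesian components (GeV, made-up values)
px, py, pz, E = 30.0, 40.0, 120.0, 130.5

pt = np.hypot(px, py)        # transverse momentum
phi = np.arctan2(py, px)     # azimuthal angle
eta = np.arcsinh(pz / pt)    # pseudorapidity
# invariant mass, clipped at zero to protect against rounding
mass = np.sqrt(max(E**2 - (px**2 + py**2 + pz**2), 0.0))

# Going back to Cartesian components recovers the inputs
px_back = pt * np.cos(phi)
py_back = pt * np.sin(phi)
pz_back = pt * np.sinh(eta)
E_back = np.sqrt(mass**2 + px_back**2 + py_back**2 + pz_back**2)

print(pt, eta, phi, mass)
print(px_back, py_back, pz_back, E_back)
```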


Now we can play !

You can get many of the `vector` package methods and attributes for all the leptons together, or individually

In [11]:
data_hard['leptons'][:,2] # the l- lepton (index 2) in every event, with all attributes

In [12]:
data_hard['leptons'].pt # the pt of all leptons, for each event
# Note : the type is now N * var * float64
# - N events
# - var : variable length, ie variable number of leptons
# in this case we made sure to have 4 leptons, but this is not strictly necessary, as we will see later

In [13]:
data_hard['leptons']

In [14]:
data_hard['leptons'].eta[:,[1,3]] # the eta of only the neutrinos (at index 1,3)

In [15]:
data_hard['leptons'][:,0].deltaR(data_hard['leptons'][:,2])
# the deltaR between the l+ and l-
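For reference, the usual definition of this angular distance is $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$, where $\Delta\phi$ has to be wrapped back into $[-\pi, \pi)$. A minimal numpy sketch with made-up values (the `delta_r` helper is a hypothetical name, written here only to show the wrapping):

```python
import numpy as np


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(d_eta^2 + d_phi^2), with d_phi wrapped into [-pi, pi)."""
    d_eta = eta1 - eta2
    d_phi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.sqrt(d_eta**2 + d_phi**2)


# Toy values for two events: (eta, phi) of the l+ and of the l-
eta_lp = np.array([0.5, -1.2])
phi_lp = np.array([3.0, 0.1])
eta_lm = np.array([-0.5, -1.2])
phi_lm = np.array([-3.0, 0.4])

dr = delta_r(eta_lp, phi_lp, eta_lm, phi_lm)
print(dr)
```

In the first toy event the two leptons sit on either side of the $\phi = \pm\pi$ boundary, so the naive difference of 6 rad is wrapped to about 0.28 rad before entering the formula.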

You can play with these leptons as you want: there is a wide variety of methods in the [vector](https://pypi.org/project/vector/) package, and you can check how awkward works.

Note : you can convert to numpy arrays using
```
data_hard['leptons'].pt.to_numpy()
```

This can get a bit tricky when the array has variable length, though; here we made sure there are four leptons.
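The reason it gets tricky is that a numpy array must be rectangular, so events with fewer particles need padding. awkward has its own helpers for this (for example `ak.pad_none` and `ak.fill_none`); the sketch below shows the same idea with plain Python lists and numpy, on made-up values, so that it runs without the dataset.

```python
import numpy as np

# Toy per-event jet pt lists with different lengths (made-up values)
jet_pt = [[110.0, 117.0], [129.0, 179.0, 118.0], [58.9]]

max_len = max(len(event) for event in jet_pt)
padded = np.full((len(jet_pt), max_len), np.nan)   # NaN marks "no particle"
for i, event in enumerate(jet_pt):
    padded[i, :len(event)] = event

exists = ~np.isnan(padded)   # boolean mask of the real entries
print(padded)
print(exists)
```

Keeping the `exists` mask next to the padded array is one way to remember which entries are real particles and which are padding.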

Now let us build the b-quark particles in the same way as the leptons.

In [16]:
data_hard.make_particles(
    'bquarks',
    {
        'px'  : [
            'bottom_Px',
            'antibottom_Px',
        ],
        'py'  : [
            'bottom_Py',
            'antibottom_Py',
        ],
        'pz'  : [
            'bottom_Pz',
            'antibottom_Pz',
        ],
        'E'  : [
            'bottom_E',
            'antibottom_E',
        ],
        'pdgId'  : [
            'bottom_pdgId',
            'antibottom_pdgId',
        ],
    },
    lambda vec: vec.E > 0.,
)
# you can play with them as well, they behave like the leptons made above

# you can also do the same for, say, the Higgses (`H1` and `H2`), but we will not need it for now

Let us move to reco now. We load the data exactly as before, but from a different tree.

In [17]:
data_reco = RootData(
    files = data_hard.files,
    treenames = [
        'reco_DL;1',
    ],
    lazy = True, # Explained above.
    N = data_hard.N,
)
# To avoid having to manually set some arguments (eg, files and N), we will reuse the gen ones
print (data_reco)

Data object
Loaded branches:
   ... file: 1000
   ... sample: 1000
   ... tree: 1000
Branch in files not loaded:
   ... VBF1_E
   ... VBF1_Px
   ... VBF1_Py
   ... VBF1_Pz
   ... VBF1_eta
   ... VBF1_idx
   ... VBF1_mass
   ... VBF1_phi
   ... VBF1_pt
   ... VBF1_sel
   ... VBF2_E
   ... VBF2_Px
   ... VBF2_Py
   ... VBF2_Pz
   ... VBF2_eta
   ... VBF2_idx
   ... VBF2_mass
   ... VBF2_phi
   ... VBF2_pt
   ... VBF2_sel
   ... VBF3_E
   ... VBF3_Px
   ... VBF3_Py
   ... VBF3_Pz
   ... VBF3_eta
   ... VBF3_idx
   ... VBF3_mass
   ... VBF3_phi
   ... VBF3_pt
   ... VBF3_sel
   ... VBF4_E
   ... VBF4_Px
   ... VBF4_Py
   ... VBF4_Pz
   ... VBF4_eta
   ... VBF4_idx
   ... VBF4_mass
   ... VBF4_phi
   ... VBF4_pt
   ... VBF4_sel
   ... VBF5_E
   ... VBF5_Px
   ... VBF5_Py
   ... VBF5_Pz
   ... VBF5_eta
   ... VBF5_idx
   ... VBF5_mass
   ... VBF5_phi
   ... VBF5_pt
   ... VBF5_sel
   ... VBF6_E
   ... VBF6_Px
   ... VBF6_Py
   ... VBF6_Pz
   ... VBF6_eta
   ... VBF6_idx
   ... VBF6_mass
   .

In this reco case, we will reconstruct the muons, electrons and jets separately, for reasons I can explain later.

Let us make the particles as before.

In my setup, I save up to 4 muons, 4 electrons and 15 jets. This does not mean there will always be that many particles; again, some will have the default -9999 value. We will also get the MET (missing transverse energy) as a particle, but there is by definition only one of them.

To make it less boilerplate, I will use loops to avoid copy pasting too many lines.

Note : while for gen-level we only included the pdgid as additional info, we will specialise it here a bit more per object
- muons/electrons : provide the pdgId + charge
- jets : btag score and whether it passed the btag medium threshold

In [18]:
# Given we have to load many branches, this may take a bit of time
n_jets = 15
jets = data_reco.make_particles(
    'jets',
    {
        'px'      : [f'j{i}_Px' for i in range(1,n_jets+1)],
        'py'      : [f'j{i}_Py' for i in range(1,n_jets+1)],
        'pz'      : [f'j{i}_Pz' for i in range(1,n_jets+1)],
        'E'       : [f'j{i}_E' for i in range(1,n_jets+1)],
        'btag'    : [f'j{i}_btag' for i in range(1,n_jets+1)],
        'btagged' : [f'j{i}_btagged' for i in range(1,n_jets+1)],
    },
    lambda vec: vec.E > 0.,
)
n_e = 4
electrons = data_reco.make_particles(
    'electrons',
    {
        'px'      : [f'e{i}_Px' for i in range(1,n_e+1)],
        'py'      : [f'e{i}_Py' for i in range(1,n_e+1)],
        'pz'      : [f'e{i}_Pz' for i in range(1,n_e+1)],
        'E'       : [f'e{i}_E' for i in range(1,n_e+1)],
        'pdgId'   : [f'e{i}_pdgId' for i in range(1,n_e+1)],
        'charge'  : [f'e{i}_charge' for i in range(1,n_e+1)],
    },
    lambda vec: vec.E > 0.,
)
n_m = 4
muons = data_reco.make_particles(
    'muons',
    {
        'px'      : [f'm{i}_Px' for i in range(1,n_m+1)],
        'py'      : [f'm{i}_Py' for i in range(1,n_m+1)],
        'pz'      : [f'm{i}_Pz' for i in range(1,n_m+1)],
        'E'       : [f'm{i}_E' for i in range(1,n_m+1)],
        'pdgId'   : [f'm{i}_pdgId' for i in range(1,n_m+1)],
        'charge'  : [f'm{i}_charge' for i in range(1,n_m+1)],
    },
    lambda vec: vec.E > 0.,
)

met = data_reco.make_particles(
    'met',
    {
        'px'      : ['met_Px'],
        'py'      : ['met_Py'],
        'pz'      : ['met_Pz'],
        'E'       : ['met_E'],
    },
)
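The `lambda vec: vec.E > 0.` filter used above is what turns a fixed number of slots (some filled with the default value) into a variable number of particles per event. How `make_particles` does this internally is not shown here; the following is only a numpy sketch of the idea on made-up values.

```python
import numpy as np

DEFAULT = -9999.0

# Toy fixed-size slots: 3 events, up to 4 electrons each (energies in GeV, made up)
e_E = np.array([
    [45.0, 30.0, DEFAULT, DEFAULT],
    [DEFAULT, DEFAULT, DEFAULT, DEFAULT],
    [60.0, DEFAULT, DEFAULT, DEFAULT],
])

valid = e_E > 0.   # same condition as the lambda filter
electrons_per_event = [row[keep].tolist() for row, keep in zip(e_E, valid)]
n_electrons = valid.sum(axis=1)

print(electrons_per_event)
print(n_electrons)
```

The result is a ragged list, one entry per event, which is exactly the kind of structure awkward arrays are designed to hold.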


What I glossed over previously is that `make_particles` does not only create the new branch with the particle, but also returns that branch.

This is why I saved them in the different objects : `electrons`, `muons`, `jets` and `met`.

You can for example either interact with `electrons` or `data_reco['electrons']` and they should be exactly the same.

In [19]:
electrons.pt

Now, because we use reconstructed objects, you finally see above an example of a variable length awkward array. Some events have 0, some 1 and some 2 electrons. 

We can see the same for jets, with bigger numbers. 

Note : in my reco selections, I require 
- at least two leptons : can be 2 electrons (`ee` channel), 2 muons (`µµ` or `mumu` channel) or 1 electron and 1 muon (`eµ` or `emu` channel)
- at least two jets in the resolved category
- there is a boosted category, where there are >=2 jets, but we will conveniently not talk about it for now

In [20]:
jets.pt

In [21]:
ak.num(jets,axis=1) # this returns the number of jets in each event

Let us now select the events in the category we want (resolved) and make sure these are from our signal region (SR).

I have made two branches that contain 0s and 1s for specifically this purpose

In [22]:
mask = np.logical_and(
    data_reco['flag_SR'],
    data_reco['flag_resolved'],
)
print ('Before cut', data_reco.events)
data_reco.cut(mask)
print ('After cut', data_reco.events)

Before cut 1000
After cut 142


Finally, we will want at some point to combine the gen and reco information to have a bijection gen<->reco.

As explained previously, the trees have different sizes, so index 413 in the gen tree will not correspond to index 413 in the reco tree.

Instead, we will use the event number, which is unique per event, to match the two cases.

BUT, one needs to be a bit careful because that event number is unique ... per sample. So you can have the same event number in sample A and sample B. If we use a single file in the data objects, this is fine, otherwise some care must be taken.

I have made a utility function to perform this, see below.

In [32]:
from memflow.dataset.data import get_intersection_indices

gen_idx, reco_idx = get_intersection_indices(
    datas = [
        data_hard,
        data_reco
    ],
    branch = 'event', # this is the branch present in both data object that contains the event number
)

Looking into file metadata
	entry 0 : ['/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root']
	entry 1 : ['/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root']
Will only consider common files : ['/home/ucl/cp3/fbury/scratch/MEM_data/Transfermer_v3/results/GluGluToHHTo2B2VTo2L2Nu_node_cHHH0.root']
(Note : this assumes the files have the same order between the different data objects)
For entry 0 : from 968 events, 137 selected
For entry 1 : from 142 events, 137 selected


The function is a bit verbose, but the explicit printing is deliberate because this is quite a crucial step.

Because we had made cuts to both the gen and reco data, only the common indices were selected.

Now `reco_idx` contains the indices in the reco data that are matched to gen data, and same for `gen_idx`.
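To see what such matched indices look like, here is a toy version of the matching for a single sample, using `np.intersect1d` on made-up event numbers. This is only a sketch of the idea and not the implementation of `get_intersection_indices`.

```python
import numpy as np

# Toy event numbers after the gen and reco selections (made-up values)
gen_event = np.array([101, 102, 105, 107, 110])
reco_event = np.array([105, 101, 110, 999])

common, gen_idx, reco_idx = np.intersect1d(
    gen_event, reco_event, return_indices=True
)

# The two index arrays point at the same events in each dataset
assert np.array_equal(gen_event[gen_idx], reco_event[reco_idx])
print(common, gen_idx, reco_idx)
```

With several samples, the event number alone is no longer unique, so the matching would have to be done on the (sample, event number) pair instead.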

In [27]:
reco_idx,gen_idx

(array([  0,   1,   2,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
         28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
         41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  53,  54,
         55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,
         68,  69,  70,  71,  72,  73,  75,  76,  77,  78,  79,  80,  81,
         82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,
         95,  96,  97,  98,  99, 101, 102, 103, 104, 105, 106, 107, 108,
        109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
        122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
        135, 136, 137, 138, 139, 140, 141]),
 array([  0,   2,  28,  37,  44,  50,  51,  55,  60,  65,  66,  73,  80,
         84, 107, 120, 124, 147, 151, 154, 156, 164, 165, 170, 191, 202,
        211, 226, 229, 232, 235, 241, 257, 264, 273, 288, 293, 299, 301,
      

In [28]:
for r_idx,g_idx in zip(reco_idx,gen_idx):
    print (f'Reco idx {r_idx}, gen idx {g_idx}')
    jets = data_reco['jets'][r_idx]
    bquarks = data_hard['bquarks'][g_idx]
    print (f'\t {len(jets)} jets',jets.pt, jets.btag)
    print ('\t bquarks',bquarks.pdgId)

Reco idx 0, gen idx 0
	 2 jets [110, 117] [0.999, 0.896]
	 bquarks [5, -5]
Reco idx 1, gen idx 2
	 2 jets [124, 49] [0.97, 0.00702]
	 bquarks [5, -5]
Reco idx 2, gen idx 28
	 3 jets [129, 179, 118] [0.484, 0.0111, 0.00277]
	 bquarks [5, -5]
Reco idx 4, gen idx 37
	 3 jets [45.2, 47.5, 47.1] [0.966, 0.357, 0.016]
	 bquarks [5, -5]
Reco idx 5, gen idx 44
	 2 jets [233, 102] [0.583, 0.0102]
	 bquarks [5, -5]
Reco idx 6, gen idx 50
	 2 jets [58.9, 27.8] [0.998, 0.182]
	 bquarks [5, -5]
Reco idx 7, gen idx 51
	 5 jets [28.2, 29.1, 27.9, 41.1, 26.5] [0.686, 0.0486, 0.0362, 0.0288, 0.016]
	 bquarks [5, -5]
Reco idx 8, gen idx 55
	 3 jets [46, 30.2, 74.4] [0.959, 0.0552, 0.00787]
	 bquarks [5, -5]
Reco idx 9, gen idx 60
	 3 jets [169, 59.5, 29] [0.997, 0.989, 0.026]
	 bquarks [5, -5]
Reco idx 10, gen idx 65
	 2 jets [36.2, 94.8] [0.758, 0.628]
	 bquarks [5, -5]
Reco idx 11, gen idx 66
	 3 jets [177, 96.8, 81.7] [0.93, 0.0112, 0.00821]
	 bquarks [5, -5]
Reco idx 12, gen idx 73
	 4 jets [26.6, 3

You can see that in some cases, we have 2 jets from 2 bquarks. But most often we have more jets (from a variety of origins: fake jets, pileup, additional radiation, etc). We can in principle have fewer (eg, one jet has too low an energy to be detected, is outside the detector acceptance, or is in line with an electron and missed, etc), but I specifically asked for two jets so this should not be the case here.

You can also see the btag score above: the closer to 1, the more likely the jet comes from a b quark. In some cases, you can see we have two jets, but one has a very low btag score. This can be because the btag score (which is itself the output of an ML model) can be wrong, or just that this is an additional jet, and we lost the second original bjet.

These are basically the challenges we face when performing analyses.

As an exercise, I will let you play with electrons and muons, comparing with the gen data.

In another notebook we will assemble them into pytorch datasets. Most of the code above has been incorporated into proper python scripts with all the logic to be used in a dataset class, but this here allows you finer investigation about how we structure our data in an analysis.