# Jet Clustering using "Smart Topo-Clusters"

The big idea is to use neural networks for classification and/or energy calibration of topo-clusters, and use these topo-clusters for making jets. In this notebook I'll be playing around with some ideas for this, to see what works.

In this notebook we will *not* be training neural networks. That's taken care of by other notebooks in the `/classifier` and `/regression` directories of this repo. We will instead be applying the existing, trained networks to some data.

#### 1) Setup

First, let's import a bunch of packages we know we'll need right off the bat.

Note that as we've set up our environment with `conda`, our `ROOT` installation has all the bells and whistles. This includes the `pythia8` library and its associated `ROOT` wrapper, `TPythia8`. We can optionally use this for jet clustering, as it comes with `fj-core`.
Alternatively we could use the Python interface for `fastjet` or [pyjet](https://github.com/scikit-hep/pyjet), but the latter requires linking an external fastjet build for speed, and this doesn't seem to work when following their documentation.

In [1]:
import numpy as np
import ROOT as rt
import uproot as ur
#import pandas as pd
import sys, glob

Welcome to JupyROOT 6.22/02


Here's also some slightly contrived setup for `latex`. We may need this for the `atlas_mpl_style` package, which is employed in some of Max's plotting utilities that we may want to borrow. Since `latex` isn't set up on the [UChicago ML platform](https://ml.maniac.uchicago.edu) by default, our setup script may install it separately, but it still won't be on `$PATH` since we don't touch our bash profile. This cell uses some `IPython` magic to adjust `$PATH` for this notebook.

In [2]:
# Check if latex is set up already.
# We use some Jupyter magic -- alternatively one could use python's subprocess here.
has_latex = !command -v latex
has_latex = (not has_latex == [])

# If latex was not a recognized command, our setup script should have installed it
# at a fixed location, but that location is not on the $PATH. Now let's use some Jupyter magic.
# See https://ipython.readthedocs.io/en/stable/interactive/shell.html for info.
if(not has_latex):
    latex_prefix = '/usr/local/texlive/2020/bin/x86_64-linux'
    jupyter_env = %env
    path = jupyter_env['PATH']
    path = path + ':' + latex_prefix
    %env PATH = $path
    jupyter_env = %env
    path = jupyter_env['PATH']

env: PATH=/opt/conda/envs/ml4p/bin:/opt/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/texlive/2020/bin/x86_64-linux
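
The comment in the cell above mentions that Python's own tools could replace the Jupyter magic. A minimal sketch of a magic-free version follows, using `shutil.which` and `os.environ` from the standard library. The helper name `ensure_on_path` is hypothetical, and the TeX Live prefix is simply the one used above.

```python
import os
import shutil

def ensure_on_path(command, fallback_dir):
    """Return True if `command` is found, appending `fallback_dir` to PATH if needed."""
    if shutil.which(command) is not None:
        return True  # already available, nothing to do
    entries = os.environ.get('PATH', '').split(os.pathsep)
    if fallback_dir not in entries:
        entries.append(fallback_dir)
        os.environ['PATH'] = os.pathsep.join(e for e in entries if e)
    # Look again now that PATH has been extended.
    return shutil.which(command) is not None

latex_prefix = '/usr/local/texlive/2020/bin/x86_64-linux'
has_latex = ensure_on_path('latex', latex_prefix)
print('latex available:', has_latex)
```

As with `%env`, the change only applies to the running kernel and the processes it launches; the bash profile is left untouched.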


In [3]:
# some extra setup
path_prefix = '/workspace/LCStudies/'

#### 2) Fetching the data

Now we get our data. For now, our classifiers are being trained to distinguish between $\pi^+$ and $\pi^0$. Assuming that all charged pions behave the same way, we can really treat this as a $\pi^\pm$ vs. $\pi^0$ classifier. **For our toy workflow, we'll say that we only want to cluster $\pi^\pm$ topo-clusters into jets.** We will treat $\pi^0$ as a background.

Note that our original data is saved in `ROOT` files where each entry corresponds to one topo-cluster. To perform jet clustering, we want to reorganize this so that we have topo-clusters grouped by event. Our data does contain info on run numbers and event numbers, so we should be able to perform this recombination. To speed things up, we will save the clusters (grouped by event) to a new set of `ROOT` files, so that we don't need to repeat this recombination every time we run our workflow -- we'll check to see if these files have been made previously.

*Before* we perform this recombination, we might as well get our network scores for each cluster. This will only require accessing the calorimeter images from the files.
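
The "check to see if these files have been made previously" step could look something like the sketch below. The helper name `files_to_regroup` and the convention of reusing the input file name inside the output directory are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

def files_to_regroup(input_files, output_dir):
    """Return the inputs whose regrouped copy does not exist yet in output_dir."""
    output_dir = Path(output_dir)
    return [f for f in input_files if not (output_dir / Path(f).name).exists()]

# Toy demonstration in a temporary directory (no real ROOT files involved).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'jets' / 'data'
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / 'pi0.root').touch()  # pretend pi0 was already regrouped
    inputs = ['data/pi0.root', 'data/piplus.root', 'data/piminus.root']
    todo = files_to_regroup(inputs, out_dir)
    print(todo)
```

Only the files returned by the helper would then need to go through the (slow) regrouping.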

In [4]:
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from  util import ml_util as mu

# ----- Meta-data for our dataset -----
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
nlayers = len(layers)
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]
assert(len(len_phi) == nlayers)
assert(len(len_eta) == nlayers)
# -------------------------------------

# We open the files using uproot, and use our ml_util to get the images.
data_dir = '/workspace/LCStudies/data'
data_files = glob.glob(data_dir + '/*.root')
data_files = {x.split('/')[-1].replace('.root',''):x for x in data_files}
tree_name = 'ClusterTree'
trees = {key: ur.open(file)[tree_name] for key, file in data_files.items()}

pcells = {
    ifile : {
        layer : mu.setupCells(itree, layer)
        for layer in layers
    }
    for ifile, itree in trees.items()
}
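
As a quick sanity check on the meta-data above, the sketch below computes the number of cells per layer and the $\eta$-$\phi$ window each image would span. It assumes that each per-layer image is a `len_eta` by `len_phi` grid, which is an assumption about what `mu.setupCells` returns rather than something verified here; `image_info` is a hypothetical name.

```python
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]

image_info = {}
for i, layer in enumerate(layers):
    n_cells = len_eta[i] * len_phi[i]           # cells in the flattened image
    width_eta = len_eta[i] * cell_size_eta[i]   # window size in eta
    width_phi = len_phi[i] * cell_size_phi[i]   # window size in phi
    image_info[layer] = (n_cells, width_eta, width_phi)
    print('{:9s} {:3d} x {:2d} = {:3d} cells, window {:.3f} x {:.3f}'.format(
        layer, len_eta[i], len_phi[i], n_cells, width_eta, width_phi))
```

With these numbers every layer covers roughly the same 0.4 by 0.4 window, just at a different granularity.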

#### 3) Applying the classifier network


Now, let's import some `tensorflow` and `keras` stuff that we'll need for applying our trained networks to the data.

In [5]:
ngpu = 1
gpu_list = ["/gpu:"+str(i) for i in range(ngpu)]
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=gpu_list)
ngpu = strategy.num_replicas_in_sync
print ('Number of devices: {}'.format(ngpu))

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of devices: 1


2020-10-29 06:18:33.637044: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-29 06:18:33.689971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:3e:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 06:18:33.690323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 06:18:33.693031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-29 06:18:33.695482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-29 06:18:33.695868: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so

In [6]:
import pickle
from  util import qol_util as qu
modelpath = path_prefix + 'classifier/Models'
model_postfix = '_flat_do20' # for our simple, per-layer NN's
models = {}
model_history = {}
model_scores = {}

i = 0
for layer in layers:
    if(i == 0): print('Loading ',end='')
    print(layer,end='')
    if(i!= len(layers)-1): print(', ',end='')
    else: print('.')
    i += 1
    
    models[layer] = tf.keras.models.load_model(modelpath+'/model_' + layer + model_postfix + '.h5')
    # Load history object.
    with open(modelpath + '/model_' + layer + model_postfix + '.history','rb') as model_history_file:
        model_history[layer] = pickle.load(model_history_file)
    
# Recalculate network scores for the datasets.
prefix = 'Calculating network scores:'
l = len(pcells) * len(layers)
qu.printProgressBar(0, l, prefix=prefix, suffix='Complete', length=50)
i = 0
for key in pcells.keys():
    
    model_scores[key] = {}
    for layer in layers: 
        model_scores[key][layer] = models[layer].predict(pcells[key][layer])
        i += 1
        qu.printProgressBar(i, l, prefix=prefix, suffix='Complete', length=50)

Loading EMB1, EMB2, EMB3, TileBar0, TileBar1, TileBar2.
Calculating network scores: |██████████████████████████████████████████████████| 100.0% Complete


2020-10-29 06:18:36.257388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10


For each cluster, the corresponding `model_scores` entry (per layer) is a two-element array.
The first element is the "background score" -- how likely the cluster is to be a $\pi^0$. The second is the "signal score" -- how likely the cluster is to be a $\pi^\pm$. At least this seems to be the correct interpretation.
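
To make that interpretation concrete, here is a minimal sketch that pulls the signal score out of each two-element entry and applies a cut. Plain Python lists stand in for the network output, the numbers are made up, and the 0.5 threshold is a hypothetical choice rather than anything tuned in this notebook.

```python
# Toy stand-in for model_scores[key][layer]: one [background, signal] pair per cluster.
toy_scores = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.45, 0.55],
]

SIGNAL_COLUMN = 1   # assumed position of the pi+/- score, as described above
threshold = 0.5     # hypothetical working point

signal_scores = [pair[SIGNAL_COLUMN] for pair in toy_scores]
is_charged = [score > threshold for score in signal_scores]
print(signal_scores)
print(is_charged)
```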

Now that we have our network scores for all our clusters, we need to group our clusters by event -- in a new file. As each entry will correspond to one event, most of our scalar branches will now become vectors, listing properties for each cluster in the event. Besides adding network scores for each cluster, we will also add truth info -- the kind of pion to which the cluster actually corresponds.

**Note** that as we're including a signal/background flag as a new branch, this is where we determine the signal/background split.
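
The scalar-to-vector reorganization described above can be sketched in plain Python. The records and branch names below are toy stand-ins, not the actual tree contents, and the `events` dictionary is a hypothetical intermediate structure.

```python
from collections import defaultdict

# Toy per-cluster records: (runNumber, eventNumber, clusterE, signal flag).
clusters = [
    (1, 10, 5.2, 1),
    (1, 11, 3.1, 0),
    (1, 10, 1.7, 1),
    (1, 12, 8.4, 0),
    (1, 11, 0.9, 0),
]

# One dictionary of per-cluster lists ("vector branches") per event.
events = defaultdict(lambda: {'clusterE': [], 'signal': []})
for run, event, energy, signal in clusters:
    events[(run, event)]['clusterE'].append(energy)
    events[(run, event)]['signal'].append(signal)

for key in sorted(events):
    print(key, events[key])
```

Each key is a `(runNumber, eventNumber)` pair, and each value holds what would become the vector-valued branches for that event.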

In [7]:
sig_definition = {'signal':['piminus','piplus'],'background':['pi0']}

from pathlib import Path
jet_data_dir = path_prefix + 'jets/data'
Path(jet_data_dir).mkdir(parents=True, exist_ok=True)

# Get our original data files.
files = {key:rt.TFile(file,'READ') for key, file in data_files.items()}
trees = {key:file.Get(tree_name) for key, file in files.items()}

# Now we want to effectively add some new columns. We accomplish this with friend trees.
data = {
    'signal':np.zeros(1,dtype=np.dtype('i2')),
}
for layer in layers:
    bname = layer + '_NN'
    data[bname] = np.zeros(2,dtype=np.dtype('f8'))

for key in sorted(trees.keys()):
    
    friend_filename = data_files[key].split('/')[-1]
    friend_filename = jet_data_dir + '/' + friend_filename
    friend_file = rt.TFile(friend_filename,'RECREATE')
    
    friend_tree = rt.TTree(tree_name + '_friend',tree_name + '_friend')
    branches = {}

    # --- Setup the branches. This is a rather general/flexible code block. ---
    for bname, val in data.items():
        descriptor = bname
        bshape = val.shape
        if(bshape != (1,)):
            for i in range(len(bshape)):
                descriptor += '[' + str(bshape[i]) + ']'
        descriptor += '/'
        if(val.dtype == np.dtype('i2')): descriptor += 'S'
        elif(val.dtype == np.dtype('i4')): descriptor += 'I'
        elif(val.dtype == np.dtype('i8')): descriptor += 'L'
        elif(val.dtype == np.dtype('f4')): descriptor += 'F'
        elif(val.dtype == np.dtype('f8')): descriptor += 'D'
        else:
            print('Warning, setup issue for branch: ', bname, '. Skipping.')
            continue
        branches[bname] = friend_tree.Branch(bname,val,descriptor)
    # --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
    
    # Now we fill the friend tree.
    nentries = trees[key].GetEntries()
    for layer in layers: assert len(model_scores[key][layer]) == nentries

    # Signal flag will be constant since input trees are divided by particle identity.
    sig = 0
    if(key in sig_definition['signal']): sig = 1
    nn_scores = model_scores[key]
    
    prefix = 'Filling friend tree for ' + key + ':'
    if(len(prefix) < 32): prefix = prefix + ' ' * (32 - len(prefix))
    qu.printProgressBar(0, int(nentries/100), prefix=prefix, suffix='Complete', length=50)
    for i in range(nentries):
        data['signal'][0] = sig
        for layer in layers: data[layer + '_NN'][:] = nn_scores[layer][i,:]
        friend_tree.Fill()
        if(i%100 ==0): qu.printProgressBar(i/100, int(nentries/100), prefix=prefix, suffix='Complete', length=50)
    friend_tree.Write()
    friend_file.Close()

Filling friend tree for pi0:     |██████████████████████████████████████████████████| 100.0% Complete
Filling friend tree for piminus: |██████████████████████████████████████████████████| 100.0% Complete
Filling friend tree for piplus:  |██████████████████████████████████████████████████| 100.0% Complete
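
The branch-setup block in the cell above builds a `ROOT` leaflist descriptor for each buffer. The same logic can be written as a small standalone function, sketched below without `ROOT` or `numpy` so that it can be checked in isolation. `make_descriptor` and `TYPE_CODES` are hypothetical names, and the dtype is passed as a short string such as `'f8'`.

```python
# ROOT leaflist type codes for the NumPy dtypes handled in the cell above.
TYPE_CODES = {'i2': 'S', 'i4': 'I', 'i8': 'L', 'f4': 'F', 'f8': 'D'}

def make_descriptor(name, shape, dtype_str):
    """Build a leaflist string such as 'EMB1_NN[2]/D' for TTree.Branch."""
    if dtype_str not in TYPE_CODES:
        raise ValueError('Unsupported dtype for branch ' + name + ': ' + dtype_str)
    # Scalars (shape (1,)) get no brackets; arrays get one [n] per dimension.
    dims = '' if tuple(shape) == (1,) else ''.join('[{}]'.format(n) for n in shape)
    return name + dims + '/' + TYPE_CODES[dtype_str]

print(make_descriptor('signal', (1,), 'i2'))
print(make_descriptor('EMB1_NN', (2,), 'f8'))
```

Raising an error for unsupported types, instead of printing a warning and skipping the branch, is a design choice of this sketch.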


In [None]:
# rdf_array_str = 'auto to_eval = std::string("$NAME[") + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));'


# for key, file in data_files.items():
#     if(key != 'pi0'): continue
#     print(file)
#     f = rt.TFile(file,'READ')
#     t = f.Get(tree_name)

#     # Now we make an RDataFrame. This might be the simplest way to
#     # add a new column from a Python list/array.
#     df = rt.RDataFrame(t)
#     for layer in ['EMB1','EMB2']:
#         arr = model_scores[key][layer][:,0]
#         arr2 = model_scores[key][layer][:,1]
#         df = df.Define(layer + '_NN_0', rdf_array_str.replace('$NAME','arr'))
#         df = df.Define(layer + '_NN_1', rdf_array_str.replace('$NAME','arr2'))



#     display = df.Display()
#     display.Print()

In [None]:
# Let's create a chain with all the original data.
chain = rt.TChain(tree_name)
for file in data_files.values(): chain.AddFile(file)
    
# Now let's index things by eventNumber -- this is like sorting.
# See https://root-forum.cern.ch/t/usage-of-tchainindex/19074/4 (TTreeIndex vs. TChainIndex)
chain_idx = rt.TTreeIndex(chain,'eventNumber','0') # first index is eventNumber, second is empty (runNumber is always the same)

# Note that the indices are not unique, i.e. for each event number there are likely multiple entries.
# However I don't think this is an issue.
n_idx = chain_idx.GetN()
assert(n_idx == chain.GetEntries()) 
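
Conceptually, the index is a list of entry numbers sorted by event number, which is why non-unique event numbers should not be a problem for grouping. The pure-Python sketch below illustrates the idea on toy data. It is not how `TTreeIndex` is implemented, and the ordering of entries within one event is an assumption of this sketch.

```python
from itertools import groupby

# Toy event numbers in file order; note the repeats.
event_numbers = [12, 10, 11, 10, 12, 10]

# Entry numbers ordered by event number (ties keep their original order).
sorted_entries = sorted(range(len(event_numbers)), key=lambda i: event_numbers[i])
print(sorted_entries)

# Walking the sorted entries, clusters of the same event come out adjacent.
entries_by_event = {
    evt: list(group)
    for evt, group in groupby(sorted_entries, key=lambda i: event_numbers[i])
}
print(entries_by_event)
```

The sorted list has as many elements as there are entries, mirroring the `n_idx == chain.GetEntries()` check above.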

In [None]:
for i in range(20):
    chain.GetEntry(chain_idx.GetIndex()[i])
    print(chain.eventNumber,chain.runNumber)

In [None]:
chain_idx.GetEntryNumberWithIndex(1058,0)

In [None]:
for i in range(-15,15,1):
    j = 378 + i
    chain.GetEntry(j)
    print(j, chain.eventNumber)

## Code below this cell is old/unused.

In [None]:
# sys.path.append(path_prefix)
# sys.path
# from  util import ml_util as mu
# inputpath = path_prefix+'data/'

# # first making our DataFrames and taking care of scalars
# branches = ['runNumber', 'eventNumber', 'truthE', 'truthPt', 'truthEta', 'truthPhi', 'clusterIndex', 'nCluster', 'clusterE', 'clusterECalib', 'clusterPt', 'clusterEta', 'clusterPhi', 'cluster_nCells', 'cluster_sumCellE', 'cluster_ENG_CALIB_TOT', 'cluster_ENG_CALIB_OUT_T', 'cluster_ENG_CALIB_DEAD_TOT', 'cluster_EM_PROBABILITY', 'cluster_HAD_WEIGHT', 'cluster_OOC_WEIGHT', 'cluster_DM_WEIGHT', 'cluster_CENTER_MAG', 'cluster_FIRST_ENG_DENS', 'cluster_cell_dR_min', 'cluster_cell_dR_max', 'cluster_cell_dEta_min', 'cluster_cell_dEta_max', 'cluster_cell_dPhi_min', 'cluster_cell_dPhi_max', 'cluster_cell_centerCellEta', 'cluster_cell_centerCellPhi', 'cluster_cell_centerCellLayer', 'cluster_cellE_norm']
# rootfiles = ["pi0", "piplus", "piminus"]
# trees = {
#     rfile : ur.open(inputpath+rfile+".root")['ClusterTree']
#     for rfile in rootfiles
# }
# pdata = {
#     ifile : itree.pandas.df(branches, flatten=False)
#     for ifile, itree in trees.items()
# }

# np0 = len(pdata['pi0'])
# npp = len(pdata['piplus'])
# npm = len(pdata['piminus'])

# # Taking care of multi-dim branches using Max's ml_util. I think that the uproot-pandas interface doesn't handle these nicely.
# pcells = {
#     ifile : {
#         layer : mu.setupCells(itree, layer)
#         for layer in layers
#     }
#     for ifile, itree in trees.items()
# }

# print("Number of pi0 events: {}".format(np0))
# print("Number of pi+ events: {}".format(npp))
# print("Number of pi- events: {}".format(npm))
# print("Total: {}".format(np0+npp+npm))

In [None]:
# # Merge the pandas DataFrames.
# pi0_frame  = pdata['pi0']
# pipm_frame = pd.concat([pdata['piminus'],pdata['piplus']])
# pdata = {'pi0': pi0_frame, 'pipm':pipm_frame}
# assert(len(pdata['pipm']) == npp+npm)

# # Let's write a function to "nicely" merge the dicts, 
# # in case we want to do this again in some other way.
# def MergeImageDicts(dict1,dict2):
#     assert(set(dict1.keys()) == set(dict2.keys()))
#     dict3 = {}
#     for key in dict1.keys():
#         arr1 = dict1[key]
#         arr2 = dict2[key]
#         arr3 = np.concatenate((arr1,arr2),axis=0)
#         dict3[key] = arr3
#     return dict3

# # Now merge the dictionaries.
# merged_dict = MergeImageDicts(pcells['piminus'],pcells['piplus'])
# pcells['pipm'] = merged_dict
# del pcells['piplus']
# del pcells['piminus']

In [None]:
sys.path.append('/workspace/LCStudies/setup/fastjet/fastjet-install/lib/python3.8/site-packages')
import fastjet as fj

In [None]:
print(fj.__doc__)

In [None]:
print(fj.PseudoJet.__doc__)

In [None]:
class AwesomeModel:
    def predict(self, x):
        return x[0] * x[1]

model = AwesomeModel()

In [None]:
import numba
import ROOT
@ROOT.Numba.Declare(["RVec<float>", "int"], "float") 
def pysumpow(x: np.ndarray, y: int):
    return np.sum(x)**y

In [None]:
a = range(10)
b = np.random.rand(10)
df = ROOT.RDataFrame(10)
df = df.Define("x", 'auto to_eval = std::string("a[") + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
df = df.Define("y", 'auto to_eval = std::string("b[") + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')


display = df.Display()
display.Print()