# Jet Clustering using "Smart Topo-Clusters"

The big idea is to use neural networks for classification and/or energy calibration of topo-clusters, and use these topo-clusters for making jets. In this notebook I'll be playing around with some ideas for this, to see what works.

In this notebook we will *not* be training neural networks. That's taken care of by other notebooks in the `/classifier` and `/regression` directories of this repo. We will instead be applying the existing, trained networks to some data.

#### 1) Setup

First, let's import a bunch of packages we know we'll need right off-the-bat.

Note that as we've set up our environment with `conda`, our `ROOT` installation has all the bells and whistles. This includes the `pythia8` library and its associated `ROOT` wrapper, `TPythia8`. We will use this for jet clustering, as it comes with `fj-core`. (Alternatively, we could use [pyjet](https://github.com/scikit-hep/pyjet), but this requires linking an external fastjet build for speed.)

In [1]:
import numpy as np
import ROOT as rt
import uproot as ur
import pandas as pd
import sys, glob

Welcome to JupyROOT 6.22/02


Here's also some slightly contrived setup for `latex`. We may need this for the `atlas_mpl_style` package, which is employed in some of Max's plotting utilities that we may want to borrow. Since `latex` isn't set up on the [UChicago ML platform](https://ml.maniac.uchicago.edu) by default, our setup script may install it separately. Even then it is not on `$PATH`, since we don't touch our bash profile. This cell uses some `IPython` magic to adjust `$PATH` for this notebook.

In [2]:
# Check if latex is set up already.
# We use some Jupyter magic -- alternatively one could use python's subprocess here.
has_latex = !command -v latex
has_latex = len(has_latex) > 0  # empty output means `latex` was not found

# If latex was not a recognized command, our setup script should have installed it
# at a fixed location that is not on the $PATH. We use some more Jupyter magic to add it.
# See https://ipython.readthedocs.io/en/stable/interactive/shell.html for info.
if not has_latex:
    latex_prefix = '/usr/local/texlive/2020/bin/x86_64-linux'
    jupyter_env = %env
    path = jupyter_env['PATH']
    path = path + ':' + latex_prefix
    %env PATH = $path
    jupyter_env = %env
    path = jupyter_env['PATH']

env: PATH=/opt/conda/envs/ml4p/bin:/opt/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/texlive/2020/bin/x86_64-linux
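
The comment in the cell above mentions a plain-Python alternative to the Jupyter magic. A minimal sketch of that idea follows, using only the standard library so that it also works outside of IPython. Here `shutil.which` stands in for running `command -v` through `subprocess`, and the helper name `ensure_on_path` is hypothetical.

```python
import os
import shutil

def ensure_on_path(executable, fallback_dir, environ=None):
    """Return True if `executable` can be found, appending `fallback_dir`
    to PATH first if that is needed. `environ` defaults to os.environ."""
    environ = os.environ if environ is None else environ
    # Already reachable? Then leave PATH untouched.
    if shutil.which(executable, path=environ.get('PATH', '')) is not None:
        return True
    # Otherwise append the fallback directory (skipping empty PATH entries).
    parts = [p for p in environ.get('PATH', '').split(os.pathsep) if p]
    parts.append(fallback_dir)
    environ['PATH'] = os.pathsep.join(parts)
    return shutil.which(executable, path=environ['PATH']) is not None
```

With the prefix from the cell above, a call such as `ensure_on_path('latex', '/usr/local/texlive/2020/bin/x86_64-linux')` would play the same role as the `%env` lines.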


In [3]:
# some extra setup
path_prefix = '/workspace/LCStudies/'

#### 2) Fetching the data


Now we get our data. For now, our classifiers are being trained to distinguish between $\pi^+$ and $\pi^0$. Assuming that all charged pions behave the same way, we can really treat this as a $\pi^\pm$ vs. $\pi^0$ classifier.

**For our toy workflow, we'll say that we only want to cluster $\pi^\pm$ topo-clusters into jets.** We will treat $\pi^0$ as a background.

We will load the data into a `pandas` DataFrame. As a matter of taste I'm not a *huge* `pandas` fan -- it's sometimes rather slow with big datasets, compared to using `ROOT` or `h5`-based methods -- but for now it will work just fine! This code snippet is borrowed from Max's notebooks. 

Note that we want to concatenate our $\pi^-$ and $\pi^+$ data, so we end up with $2$ DataFrames instead of $3$. I haven't figured out how to handle `ROOT::TChain` in `uproot` (i.e. creating chains instead of trees, for exporting to `pandas`) but we can concatenate the DataFrames after-the-fact. Maybe this is slow?

In [4]:
## A ROOT-based way of loading data.
# data_dir = '/workspace/LCStudies/data'
# data_files = glob.glob(data_dir + '/*.root')
# data_files.sort() # pi0 will be the 1st entry now
# data_files = {'background':[data_files[0]], 'signal':[data_files[1],data_files[2]]}

# # Now, we want to get the data in the files. For signal, we will be combining 2 TTree's, so we will use the TChain functionality.
# tree_name = 'ClusterTree' # this name is the same across all our files
# data_chains = {'background':rt.TChain(),'signal':rt.TChain()}
# for key, chain in data_chains.items():
#     chain.SetTitle(key)
#     for file in data_files[key]: chain.AddFile(file,rt.TTree.kMaxEntries,tree_name)
#     report = key
#     if(key == 'signal'): report += '    '
#     report += ': ' + str(chain.GetEntries())
#     print(report)

In [5]:
#meta-data for our dataset
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
nlayers = len(layers)
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]

assert(len(len_phi) == nlayers)
assert(len(len_eta) == nlayers)

cell_shapes = {layers[i]:(len_eta[i],len_phi[i]) for i in range(nlayers)}
#for i in range(nlayers): cell_shapes[layers[i]] = (len_eta[i],len_phi[i])
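
As a quick sanity check on the metadata above, we can multiply the number of cells by the cell size in each direction. This is only a sketch that repeats the numbers from the cell above; the helper name `image_extent` is hypothetical. It suggests that every layer's image covers roughly the same $0.4 \times 0.4$ window in $(\eta, \phi)$, just at different granularity.

```python
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]

def image_extent(n_cells, cell_size):
    """Total angular extent covered by n_cells cells of the given size."""
    return n_cells * cell_size

# (eta extent, phi extent) of the image in each layer
extents = {
    layer: (image_extent(len_eta[i], cell_size_eta[i]),
            image_extent(len_phi[i], cell_size_phi[i]))
    for i, layer in enumerate(layers)
}
# number of cells ("pixels") per image in each layer
n_pixels = {layer: len_eta[i] * len_phi[i] for i, layer in enumerate(layers)}

for layer in layers:
    d_eta, d_phi = extents[layer]
    print('{:8s}: {:.4f} x {:.4f} (eta x phi), {} cells'.format(
        layer, d_eta, d_phi, n_pixels[layer]))
```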

In [6]:
sys.path.append(path_prefix)  # make the repo's util package importable
from util import ml_util as mu
inputpath = path_prefix+'data/'

# first making our DataFrames and taking care of scalars
branches = ['runNumber', 'eventNumber', 'truthE', 'truthPt', 'truthEta', 'truthPhi', 'clusterIndex', 'nCluster', 'clusterE', 'clusterECalib', 'clusterPt', 'clusterEta', 'clusterPhi', 'cluster_nCells', 'cluster_sumCellE', 'cluster_ENG_CALIB_TOT', 'cluster_ENG_CALIB_OUT_T', 'cluster_ENG_CALIB_DEAD_TOT', 'cluster_EM_PROBABILITY', 'cluster_HAD_WEIGHT', 'cluster_OOC_WEIGHT', 'cluster_DM_WEIGHT', 'cluster_CENTER_MAG', 'cluster_FIRST_ENG_DENS', 'cluster_cell_dR_min', 'cluster_cell_dR_max', 'cluster_cell_dEta_min', 'cluster_cell_dEta_max', 'cluster_cell_dPhi_min', 'cluster_cell_dPhi_max', 'cluster_cell_centerCellEta', 'cluster_cell_centerCellPhi', 'cluster_cell_centerCellLayer', 'cluster_cellE_norm']
rootfiles = ["pi0", "piplus", "piminus"]
trees = {
    rfile : ur.open(inputpath+rfile+".root")['ClusterTree']
    for rfile in rootfiles
}
pdata = {
    ifile : itree.pandas.df(branches, flatten=False)
    for ifile, itree in trees.items()
}

np0 = len(pdata['pi0'])
npp = len(pdata['piplus'])
npm = len(pdata['piminus'])

# Taking care of multi-dim branches using Max's ml_util. I think that the uproot-pandas interface doesn't handle these nicely.
pcells = {
    ifile : {
        layer : mu.setupCells(itree, layer)
        for layer in layers
    }
    for ifile, itree in trees.items()
}

print("Number of pi0 events: {}".format(np0))
print("Number of pi+ events: {}".format(npp))
print("Number of pi- events: {}".format(npm))
print("Total: {}".format(np0+npp+npm))

Number of pi0 events: 263891
Number of pi+ events: 435967
Number of pi- events: 434627
Total: 1134485


Now we merge the $\pi^+$ and $\pi^-$ DataFrames and dictionaries.

In [7]:
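# Merge the pandas DataFrames.
# We concatenate piplus first and piminus second, matching the order used for
# the image dictionaries below, so that DataFrame rows and images stay aligned.
pi0_frame  = pdata['pi0']
pipm_frame = pd.concat([pdata['piplus'],pdata['piminus']])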
pdata = {'pi0': pi0_frame, 'pipm':pipm_frame}
assert(len(pdata['pipm']) == npp+npm)

# Let's write a function to "nicely" merge the dicts, 
# in case we want to do this again in some other way.
def MergeImageDicts(dict1,dict2):
    assert(set(dict1.keys()) == set(dict2.keys()))
    dict3 = {}
    for key in dict1.keys():
        arr1 = dict1[key]
        arr2 = dict2[key]
        arr3 = np.concatenate((arr1,arr2),axis=0)
        dict3[key] = arr3
    return dict3

# Now merge the dictionaries.
merged_dict = MergeImageDicts(pcells['piplus'],pcells['piminus'])
pcells['pipm'] = merged_dict
del pcells['piplus']
del pcells['piminus']
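
Since the DataFrames and the image dictionaries are merged in two separate steps, row `i` of `pdata['pipm']` only describes image `i` of `pcells['pipm'][layer]` if both merges use the same sample order. The toy sketch below shows one way to make that order explicit. It uses made-up arrays in place of the real data, and the helper `merge_in_order` is hypothetical.

```python
import numpy as np

def merge_in_order(parts, order):
    """Concatenate parts[key] along axis 0, following the given key order."""
    return np.concatenate([parts[key] for key in order], axis=0)

# Toy stand-ins: one scalar per cluster, and one 2x2 "image" per cluster
# that is filled with that cluster's scalar.
scalars = {'piplus': np.array([1., 2., 3.]), 'piminus': np.array([-1., -2.])}
images = {
    key: np.stack([np.full((2, 2), v) for v in vals])
    for key, vals in scalars.items()
}

order = ['piplus', 'piminus']  # a single list that drives both merges
merged_scalars = merge_in_order(scalars, order)
merged_images = merge_in_order(images, order)

# Each image was filled with its own scalar, so alignment is easy to verify.
aligned = np.array_equal(merged_images[:, 0, 0], merged_scalars)
print('rows aligned:', aligned)
```

For the real data, the same `order` list could drive both the `pd.concat` call and the image merge.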

Note that for the time being, we will not have a 1:1 signal-to-background ratio. This is because we have many more $\pi^\pm$ entries than $\pi^0$ entries.
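
To put a number on this, we can use the counts printed above. The sketch below is plain arithmetic. The "balanced" weights $N / (K\,n_c)$ are a common convention, not something used elsewhere in this notebook; they are shown only to illustrate how the imbalance could be compensated if that ever becomes necessary.

```python
# Entry counts as printed above (pi+ and pi- are combined into 'pipm').
counts = {'pi0': 263891, 'pipm': 435967 + 434627}
total = sum(counts.values())

# Signal-to-background ratio, with pi+- as signal and pi0 as background.
ratio = counts['pipm'] / counts['pi0']
print('signal-to-background ratio: {:.2f}'.format(ratio))

# "Balanced" per-class weights N / (K * n_c): each class then carries
# the same total weight, total / K.
weights = {key: total / (len(counts) * n) for key, n in counts.items()}
for key, w in weights.items():
    print('weight for {:4s}: {:.3f}'.format(key, w))
```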

#### 3) Applying the classifier network


Now, let's import some `tensorflow` and `keras` stuff that we'll need for applying our trained networks to the data. We'll also repackage the data for the networks.

In [8]:
ngpu = 1
gpu_list = ["/gpu:"+str(i) for i in range(ngpu)]
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=gpu_list)
ngpu = strategy.num_replicas_in_sync
print ('Number of devices: {}'.format(ngpu))

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of devices: 1


2020-10-26 21:33:27.167619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-26 21:33:27.202115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:3e:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-26 21:33:27.202383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-26 21:33:27.204223: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-26 21:33:27.206027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-26 21:33:27.206327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so

In [9]:
from keras.utils import np_utils
training_dataset = ['pi0','pipm']

# create train/validation/test subsets containing 70%/10%/20%
# of events from each type of pion event
for p_index, plabel in enumerate(training_dataset):
    mu.splitFrameTVT(pdata[plabel],trainfrac=0.7)
    pdata[plabel]['label'] = p_index

# merge pi0 and pi+- events
pdata_merged = pd.concat([pdata[ptype] for ptype in training_dataset])

pcells_merged = {
    layer : np.concatenate([pcells[ptype][layer]
                            for ptype in training_dataset])
    for layer in layers
}

plabels = np_utils.to_categorical(pdata_merged['label'],len(training_dataset))
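
For reference, `to_categorical` converts the integer labels (`0` for `pi0`, `1` for `pipm`, following the order of `training_dataset`) into one-hot rows. A NumPy-only sketch of the same encoding, with a hypothetical helper name `one_hot` and toy labels, is:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels to one-hot rows: label k -> row with a 1 in column k."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype='float32')[labels]

toy_labels = [0, 1, 1, 0]          # 0 = pi0, 1 = pipm
encoded = one_hot(toy_labels, 2)   # shape (4, 2)
print(encoded)
```

Column 0 of the encoded labels therefore flags $\pi^0$ and column 1 flags $\pi^\pm$.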

Now let's load an NN model that we've previously trained & saved.

For now, we'll deal with our simple, per-layer networks.

In [10]:
import pickle
modelpath = path_prefix + 'classifier/Models'

models = {}
model_history = {}
model_scores = {}
for layer in layers:
    print('Loading ' + layer)
    models[layer] = tf.keras.models.load_model(modelpath+'/model_' + layer + '_flat_do20.h5')
    
    # load history object
    with open(modelpath + '/model_' + layer + '_flat_do20.history','rb') as model_history_file:
        model_history[layer] = pickle.load(model_history_file)
    
    # recalculate network scores for the dataset
    model_scores[layer] = models[layer].predict(
        pcells_merged[layer]
    )

Loading EMB1
Loading EMB2
Loading EMB3
Loading TileBar0
Loading TileBar1
Loading TileBar2


2020-10-26 21:52:21.427251: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10


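For each event, the corresponding entry of `model_scores[layer]` is a pair of numbers.
The first is the "background score" -- how likely the cluster is to be a $\pi^0$. The second is the "signal score" -- how likely the cluster is to be a $\pi^\pm$. This matches the label convention used above (`pi0` gets label 0, `pipm` gets label 1), provided the networks were trained with the same ordering. **TODO:** Check this. Seems to *empirically* be the correct interpretation, at least.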
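
One way to check this interpretation is to compare the average of the second column for true $\pi^\pm$ rows against that for true $\pi^0$ rows. The sketch below uses made-up scores in place of `model_scores[layer]` and made-up labels in place of `pdata_merged['label']`; the helper name `mean_score_by_label` is hypothetical. If column 1 really is the signal score, its mean should be clearly larger for label 1 than for label 0.

```python
import numpy as np

def mean_score_by_label(scores, labels, column=1):
    """Average of one score column, computed separately for each true label."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    return {
        int(label): float(scores[labels == label, column].mean())
        for label in np.unique(labels)
    }

# Toy stand-ins: two pi0 clusters (label 0), three pi+- clusters (label 1).
toy_scores = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.2, 0.8]])
toy_labels = np.array([0, 0, 1, 1, 1])

means = mean_score_by_label(toy_scores, toy_labels)
print(means)

# The column with the larger score gives the predicted class index.
predicted = toy_scores.argmax(axis=1)
print('agreement with labels:', (predicted == toy_labels).mean())
```

Comparing per-label means rather than overall accuracy keeps the check meaningful even though the two classes are not equally populated.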

We 
