# Jet Clustering using "Smart Topo-Clusters" \[Rewrite\]

## Instead of stitching columns with network scores and *then* making an eventTree, we're going to first make an eventTree and *then* stitch on columns with network scores.

I have also migrated the event reconstruction (`clusterTree` -> `eventTree`) to a separate workflow, because we will only need to run that once per sample.


In this notebook we will *not* be training neural networks. That's taken care of by other notebooks in the `/classifier` and `/regression` directories of this repo. We will instead be applying the existing, trained networks to some data.

#### 1) Setup

First, let's import a bunch of packages we know we'll need right off the bat.

In [1]:
import numpy as np
import ROOT as rt
import uproot as ur
import sys

Welcome to JupyROOT 6.22/02


Here is some slightly contrived setup for `latex`. We may need it for the `atlas_mpl_style` package, which is used in some of Max's plotting utilities that we may want to borrow. `latex` isn't set up on the [UChicago ML platform](https://ml.maniac.uchicago.edu) by default, so our setup script may install it separately. Even then it isn't on `$PATH`, since we don't touch our bash profile. This cell uses some `IPython` magic to adjust `$PATH` for this notebook.

In [2]:
# Check if latex is set up already.
# We use some Jupyter magic -- alternatively one could use python's subprocess here.
has_latex = !command -v latex
has_latex = (not has_latex == [])

# If latex was not a recognized command, our setup script should have installed it
# at a fixed location, but that location is not on the $PATH. Now let's use some Jupyter magic.
# See https://ipython.readthedocs.io/en/stable/interactive/shell.html for info.
if(not has_latex):
    latex_prefix = '/usr/local/texlive/2020/bin/x86_64-linux'
    jupyter_env = %env
    path = jupyter_env['PATH']
    path = path + ':' + latex_prefix
    %env PATH = $path
    jupyter_env = %env
    path = jupyter_env['PATH']

env: PATH=/opt/conda/envs/ml4p/bin:/opt/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/texlive/2020/bin/x86_64-linux
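
The comment in the cell above mentions plain Python as an alternative to the Jupyter magic. A minimal sketch of that route follows, using `shutil.which` and `os.environ` from the standard library. The helper name `ensure_on_path` is hypothetical and not part of this repo, and the demo directory is made up.

```python
import os
import shutil

def ensure_on_path(command, fallback_dir, environ=None):
    """Append fallback_dir to PATH if `command` cannot be found.

    Returns True if PATH was modified, False otherwise.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    environ = os.environ if environ is None else environ
    path = environ.get('PATH', '')
    # Command is already reachable: nothing to do.
    if shutil.which(command, path=path) is not None:
        return False
    entries = path.split(os.pathsep) if path else []
    # Avoid appending the same directory twice on repeated calls.
    if fallback_dir in entries:
        return False
    environ['PATH'] = os.pathsep.join(entries + [fallback_dir])
    return True

# Demo on a throwaway dict, so the real environment is left untouched.
demo_env = {'PATH': '/usr/bin'}
changed = ensure_on_path('some-missing-tool-xyz', '/opt/example/bin', environ=demo_env)
print(changed, demo_env['PATH'])
```

In this notebook the call would presumably be `ensure_on_path('latex', '/usr/local/texlive/2020/bin/x86_64-linux')`, which changes `$PATH` for the notebook process and anything it launches.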


In [3]:
# some extra setup
path_prefix = '/workspace/LCStudies/'
layers = ['EMB1','EMB2','EMB3','TileBar0','TileBar1','TileBar2']

#### 2) Fetching the data

In another workflow, [EventReconstruction.ipynb](todo), we built our `eventTree`. Let's get it.

For applying network scores, we will be accessing the tree using [uproot](https://uproot.readthedocs.io/en/latest/).

In [4]:
jet_data_dir = path_prefix + 'jets/data'
event_filename = jet_data_dir + '/' + 'events.root'
event_treename = 'eventTree'
event_tree_ur = ur.open(event_filename)[event_treename]

event_tree_keys =[x.decode('utf-8') for x in event_tree_ur.keys()]
for layer in layers: assert(layer in event_tree_keys)
    
# Get our calo images from the ROOT file.
calo_images =  {layer:event_tree_ur.array(layer) for layer in layers}

#### 3) Adding network scores to clusters in events

We now have our event tree, where each entry corresponds to a *full event* -- a collection of topo-clusters.

Now we want to evaluate some networks on the events' topo-clusters, and tack on the scores to our data.

**TODO**: This setup works nicely for models being applied to a single branch. Can this be nicely extended to models that use multiple branches as input?
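
One possible answer to this TODO, sketched under assumptions: Keras models with several inputs accept a list of arrays in `predict`, so each network's `'layers'` list could be mapped to a list of inputs in the same order. The helper `build_model_input` and its `prepare` argument are hypothetical. In this notebook `prepare` would presumably wrap `eu.setupCells`, whose exact behavior is not shown here.

```python
def build_model_input(images, input_layers, prepare=None):
    """Collect the model input(s) for one network.

    images       : dict mapping layer name -> image array (or any container)
    input_layers : list of layer names the network expects, in input order
    prepare      : optional callable (images, layer) -> prepared input;
                   defaults to a plain dictionary lookup (an assumption)
    Returns a single input for one layer, or a list of inputs for several.
    """
    if prepare is None:
        prepare = lambda imgs, layer: imgs[layer]
    missing = [layer for layer in input_layers if layer not in images]
    if missing:
        raise KeyError('No images available for layers: ' + ', '.join(missing))
    inputs = [prepare(images, layer) for layer in input_layers]
    # Single-input models get the bare array, multi-input models get a list.
    return inputs[0] if len(inputs) == 1 else inputs

# Toy data standing in for the calo images.
toy_images = {'EMB1': [[1, 2]], 'EMB2': [[3, 4]]}
single = build_model_input(toy_images, ['EMB1'])
multi = build_model_input(toy_images, ['EMB1', 'EMB2'])
```

With something like this, the `len(input_layers) > 1` branch in the evaluation loop would no longer need to skip multi-layer networks, provided the saved model was built with matching inputs.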

In [5]:
# Setup for TensorFlow and Keras.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # disable some of the tensorflow info printouts, only display errors
ngpu = 1
gpu_list = ["/gpu:"+str(i) for i in range(ngpu)]
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=gpu_list)
ngpu = strategy.num_replicas_in_sync
print ('Number of devices: {}'.format(ngpu))

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of devices: 1


In [6]:
classifier_prefix = path_prefix + 'classifier/Models' + '/'
# Specify the networks we want to apply, which layers to evaluate them on, and where they are saved.
# The key will ultimately give the name used in the ROOT file.
networks = {
    'EMB1_NN':     {'layers':['EMB1'],     'file':classifier_prefix + 'model_EMB1_flat_do20.h5'},
    'EMB2_NN':     {'layers':['EMB2'],     'file':classifier_prefix + 'model_EMB2_flat_do20.h5'},
    'EMB3_NN':     {'layers':['EMB3'],     'file':classifier_prefix + 'model_EMB3_flat_do20.h5'},
    'TileBar0_NN': {'layers':['TileBar0'], 'file':classifier_prefix + 'model_TileBar0_flat_do20.h5'},
    'TileBar1_NN': {'layers':['TileBar1'], 'file':classifier_prefix + 'model_TileBar1_flat_do20.h5'},
    'TileBar2_NN': {'layers':['TileBar2'], 'file':classifier_prefix + 'model_TileBar2_flat_do20.h5'}
}

# Make sure our network branch names don't conflict with existing branch names,
# and that they're being applied to images that exist.
for key in networks.keys(): 
    assert(key not in event_tree_keys)
    for element in networks[key]['layers']:
        assert(element in layers)
        
network_models = {}
for key in networks.keys():
    print('Loading ' + str(key) + ' ...',end='')
    network_models[key] = tf.keras.models.load_model(networks[key]['file'])
    print(' Done.')

Loading EMB1_NN ... Done.
Loading EMB2_NN ... Done.
Loading EMB3_NN ... Done.
Loading TileBar0_NN ... Done.
Loading TileBar1_NN ... Done.
Loading TileBar2_NN ... Done.
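
Loading a model from a missing file only fails once the loop reaches that entry. As a small optional safeguard, the configuration could be checked up front. This is a sketch using only the standard library; `find_missing_files` is a hypothetical helper and the demo path is made up.

```python
import os

def find_missing_files(network_config):
    """Return {network name: file path} for entries whose file does not exist."""
    return {
        name: cfg['file']
        for name, cfg in network_config.items()
        if not os.path.isfile(cfg['file'])
    }

# Demo with a path that should not exist on any machine.
demo_config = {
    'EMB1_NN': {'layers': ['EMB1'], 'file': '/nonexistent/dir/model_EMB1.h5'},
}
missing = find_missing_files(demo_config)
for name, path in missing.items():
    print('Missing model file for ' + name + ': ' + path)
```

Calling `find_missing_files(networks)` before the loading loop would list every problem at once instead of stopping at the first one.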


In [13]:
model_scores = {}
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from util import event_util as eu

nevents = event_tree_ur.numentries
for key, entry in networks.items():
    input_layers = entry['layers']
    if(len(input_layers) > 1):
        print('Multi-layer NN\'s not yet implemented. Skipping ' + str(key) + '.')
        continue
    
    print('Evaluating network: ' + str(key))
    input_layer = input_layers[0]    
    model_input = eu.setupCells(calo_images,input_layer)
    model_scores[key] = network_models[key].predict(model_input)
    
    # TODO: possibly reshape the output immediately, before tree fill

Evaluating network: EMB1_NN
Evaluating network: EMB2_NN
Evaluating network: EMB3_NN
Evaluating network: TileBar0_NN
Evaluating network: TileBar1_NN
Evaluating network: TileBar2_NN


In [17]:
def ShapeToString(shape):
    l = len(shape)
    shape_string = ''
    for i in range(l): shape_string += '[' + str(shape[i]) + ']'
    return shape_string
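
As a quick illustration of how `ShapeToString` is used further down: the per-cluster score shape becomes the trailing dimensions of a ROOT leaflist descriptor, with `nCluster` as the variable-length first dimension. The sketch below restates the function in compact form so that it runs on its own. The names `shape_to_string` and `score_leaflist` and the example shape `(5000, 2)` are made up for the demo.

```python
def shape_to_string(shape):
    """Turn (2, 3) into '[2][3]'."""
    return ''.join('[{}]'.format(n) for n in shape)

def score_leaflist(branch_name, scores_shape, counter='nCluster', type_code='D'):
    """Build a leaflist string for scores stored per cluster.

    scores_shape is the shape of the flat prediction array, whose first
    axis runs over clusters; only the remaining axes are kept as fixed dims.
    """
    per_cluster_shape = tuple(scores_shape)[1:]
    return '{}[{}]{}/{}'.format(
        branch_name, counter, shape_to_string(per_cluster_shape), type_code)

print(score_leaflist('EMB1_NN', (5000, 2)))  # EMB1_NN[nCluster][2]/D
```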

*TODO*: Add the network scores back to `eventTree`. This can be accomplished via a friend tree and a sequential loop.
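
The sequential loop relies on one assumption: the flat score array has one row per cluster, ordered event by event, so that `nCluster` tells us how many consecutive rows belong to each event. A minimal sketch of that regrouping in plain Python follows; `split_by_event` is a hypothetical helper and the numbers are toy values.

```python
from itertools import accumulate

def split_by_event(flat_scores, n_clusters):
    """Regroup a flat per-cluster sequence into one list per event.

    flat_scores : sequence with one entry per cluster, ordered event by event
    n_clusters  : number of clusters in each event
    """
    if sum(n_clusters) != len(flat_scores):
        raise ValueError('Cluster counts do not add up to the number of scores.')
    # offsets[i] is the index of the first cluster of event i.
    offsets = [0] + list(accumulate(n_clusters))
    return [flat_scores[offsets[i]:offsets[i + 1]] for i in range(len(n_clusters))]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
per_event = split_by_event(scores, [2, 0, 3, 1])
print(per_event)  # [[0.1, 0.2], [], [0.3, 0.4, 0.5], [0.6]]
```

The length check is worth keeping in some form: if the counts and the scores disagree, every later event would silently receive the wrong scores.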

In [22]:
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from  util import qol_util as qu

event_file = rt.TFile(event_filename,'UPDATE')
event_tree = event_file.Get(event_treename)

event_tree.SetBranchStatus('*',0)
event_tree.SetBranchStatus('eventNumber',1)
event_tree.SetBranchStatus('nCluster',1)
nevents = event_tree.GetEntriesFast()

# Get the maximum number of clusters per event.
# TTree::GetMaximum scans the branch for its exact maximum; reading the value off the
# auto-binned 'htemp' histogram could underestimate it and leave the buffers too small.
max_nCluster = int(event_tree.GetMaximum('nCluster'))
 
for model in model_scores.keys():
    t = rt.TTree(model,model)
    
    # prepare our branches. TODO: make this more flexible?
    branch_buffer = {
        'eventNumber':[np.zeros(1,dtype=np.dtype('i8')),'eventNumber/L'], # 64-bit buffer, so use the Long64_t type code
        'nCluster':[np.zeros(1,dtype=np.dtype('i2')),'nCluster/S'],
    }
    model_shape_string = '[nCluster]' + ShapeToString(list(model_scores[model].shape)[1:])
    model_shape = tuple([max_nCluster] + list(model_scores[model].shape)[1:])
    branch_buffer[model] = [np.zeros(model_shape,dtype=np.dtype('f8')), model + model_shape_string + '/D']
    
    for name, branch in branch_buffer.items(): t.Branch(name,branch[0],branch[1])
    cluster_iterator = 0
    for i in range(nevents):
        event_tree.GetEntry(i)
        
        branch_buffer['eventNumber'][0][0] = event_tree.eventNumber
        branch_buffer['nCluster'][0][0] = event_tree.nCluster
        
        for j in range(event_tree.nCluster):
            branch_buffer[model][0][j,:] = model_scores[model][cluster_iterator,:]
            cluster_iterator += 1
        t.Fill()
    t.Write(model,rt.TObject.kOverwrite)    
event_file.Close()

# # Explicitly make the trees friends of the event tree TODO: This is not doing anything
# event_file = rt.TFile(event_filename,'UPDATE')
# event_tree = event_file.Get(event_treename)
# for model in model_scores.keys(): event_tree.AddFriend(model,event_file)
# event_file.Close()

#### 4) Jet clustering


Now that we have things grouped by event, we should cluster the topo-clusters in each event into jets.

We have a few possible ways of doing this:

- `fastjet`
    - We can use the Pythonic interface. It might be fast, but it takes Python lists of `fastjet.PseudoJet` objects as clustering input, and I'm not sure whether building these will slow us down. The documentation is not very good.
    
- `pyjet`
    - The 3rd-party interface between `fastjet` and `numpy`. It seems elegant, but setup with an external `fastjet` (needed for the fastest clustering) doesn't seem to work. The instructions are outdated and the project hasn't been updated in nearly a year.
    
- `TPythia8`
    - Though it's kind of a hack, our `ROOT` installation from conda includes `pythia8` + `TPythia8`, which gives access to its `SlowJet` object. Despite the name this actually employs `fastjet` core as of a few versions ago so it's fast. It takes `Pythia8.event` objects as input, but we can artificially construct these.
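
Whichever package is used, the clustering cell below runs the anti-$k_t$ algorithm with $R = 0.4$. For reference, its distance measures are $d_{ij} = \min(p_{T,i}^{-2}, p_{T,j}^{-2})\,\Delta R_{ij}^2 / R^2$ and $d_{iB} = p_{T,i}^{-2}$, with $\Delta R_{ij}^2 = (y_i - y_j)^2 + (\phi_i - \phi_j)^2$. The sketch only evaluates these two quantities in plain Python. It is not how `fastjet` organizes the clustering internally, and the function names are illustrative.

```python
import math

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi

def antikt_dij(pt_i, y_i, phi_i, pt_j, y_j, phi_j, R=0.4):
    """Anti-kt distance between two objects (pt, rapidity, phi)."""
    delta_r2 = (y_i - y_j) ** 2 + delta_phi(phi_i, phi_j) ** 2
    return min(pt_i ** -2, pt_j ** -2) * delta_r2 / R ** 2

def antikt_diB(pt_i):
    """Anti-kt distance between an object and the beam."""
    return pt_i ** -2

# A hard object next to a soft one: the pair distance is set by the hard pt.
d_pair = antikt_dij(10.0, 0.0, 0.0, 1.0, 0.2, 0.0, R=0.4)
d_beam_hard = antikt_diB(10.0)
print(d_pair, d_beam_hard)
```

In each step the smallest distance wins: if it is a $d_{ij}$ the pair is merged, and if it is a $d_{iB}$ the object is declared a jet. In the demo the pair distance is smaller than the beam distance of the hard object, so the soft object would be absorbed first.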

In [None]:
#importing fastjet Python library, which should be made by our setup script
fj_dir = path_prefix + 'setup/fastjet/fastjet-install/lib/python3.8/site-packages' # path_prefix already ends with '/'
sys.path.append(fj_dir)
import fastjet as fj

print(fj.__doc__)

In [None]:
# Jet clustering params
R = 0.4
jet_def = fj.JetDefinition(fj.antikt_algorithm, R)
event_jets = []

# Open our events file, and perform jet clustering using fastjet
f = rt.TFile(jet_data_dir + '/' + 'events.root','read')
t = f.Get(event_treename) # the tree is named 'eventTree' (see event_treename above), not 'events'

# only activate the branches we need for clustering
tbranches = t.GetListOfBranches()
active_branches = ['nCluster','clusterPt', 'clusterEta','clusterPhi','clusterE']
t.SetBranchStatus('*',0)
for branch in active_branches: t.SetBranchStatus(branch,1)

vec_polar = rt.Math.PtEtaPhiEVector()
nevents = t.GetEntriesFast()

stride = 1000
l = int(nevents/stride)
bar_length = 50
prefix = 'Performing jet clustering:'
qu.printProgressBarColor(0, l, prefix=prefix, suffix='Complete', length=bar_length)

for i in range(nevents):
    t.GetEntry(i)
    
    nCluster = t.nCluster
    particles = nCluster * [fj.PseudoJet(0.,0.,0.,0.)]
    for j in range(nCluster):
        vec_polar.SetCoordinates(t.clusterPt[j],t.clusterEta[j],t.clusterPhi[j],t.clusterE[j])
        pj = fj.PseudoJet(vec_polar.Px(), vec_polar.Py(), vec_polar.Pz(), vec_polar.E()) # fastjet uses Cartesian
        particles[j] = pj
    event_jets.append(jet_def(particles))
    
    if(i%stride == 0): qu.printProgressBarColor(int(i/stride), l, prefix=prefix, suffix='Complete', length=bar_length)
qu.printProgressBarColor(l, l, prefix=prefix, suffix='Complete', length=bar_length)

f.Close()
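
The `PtEtaPhiEVector` in the loop above is used only to convert each cluster to Cartesian components. The same conversion can be written out directly as $p_x = p_T\cos\phi$, $p_y = p_T\sin\phi$, $p_z = p_T\sinh\eta$, with the energy passed through unchanged. A small standalone sketch follows; the helper name is hypothetical.

```python
import math

def pt_eta_phi_e_to_cartesian(pt, eta, phi, e):
    """Convert (pt, eta, phi, E) to (px, py, pz, E)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    return (px, py, pz, e)

# A central cluster pointing along the x-axis.
px, py, pz, e = pt_eta_phi_e_to_cartesian(10.0, 0.0, 0.0, 10.0)
print(px, py, pz, e)  # 10.0 0.0 0.0 10.0
```

Whether this is any faster than going through `rt.Math.PtEtaPhiEVector` inside the event loop has not been measured here.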

In [None]:
# Now make a TTree containing the jet info.
njets_max = np.max(np.array([len(x) for x in event_jets],dtype=np.dtype('i2')))

f = rt.TFile(jet_data_dir + '/' + 'events.root','update')
t = rt.TTree('jets','jets')

branch_buffer = {
    'nJet': [np.zeros(1,dtype=np.dtype('i2')), 'nJet/S'],
    'jetPt': [np.zeros(njets_max,dtype=np.dtype('f8')), 'jetPt[nJet]/D'],
    'jetEta': [np.zeros(njets_max,dtype=np.dtype('f8')), 'jetEta[nJet]/D'],
    'jetPhi': [np.zeros(njets_max,dtype=np.dtype('f8')), 'jetPhi[nJet]/D'],
    'jetE': [np.zeros(njets_max,dtype=np.dtype('f8')), 'jetE[nJet]/D']
}
for key,val in branch_buffer.items(): t.Branch(key,val[0],val[1])
stride = 1000
l = int(nevents/stride)
prefix = 'Writing jets to file:'

qu.printProgressBarColor(0, l, prefix=prefix, suffix='Complete', length=bar_length)
for i in range(nevents):
    n = len(event_jets[i])
    branch_buffer['nJet'][0][0] = n
    for j in range(n):
        branch_buffer['jetPt'][0][j] = event_jets[i][j].pt()
        branch_buffer['jetEta'][0][j] = event_jets[i][j].eta()
        branch_buffer['jetPhi'][0][j] = event_jets[i][j].phi()
        branch_buffer['jetE'][0][j] = event_jets[i][j].e()
    t.Fill()
    
    if(i%stride == 0): qu.printProgressBarColor(int(i/stride), l, prefix=prefix, suffix='Complete', length=bar_length)
    
t.Write('',rt.TObject.kOverwrite)
qu.printProgressBarColor(l, l, prefix=prefix, suffix='Complete', length=bar_length)
f.Close()

### Code below here is unused/testing.

In [None]:
# Little RDataFrame demo.

a = range(10)
b = np.random.rand(10)
df = rt.RDataFrame(10) # ROOT is imported as `rt` in this notebook
df = df.Define("x", 'auto to_eval = std::string("a[") + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
df = df.Define("y", 'auto to_eval = std::string("b[") + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
display = df.Display()
display.Print()

In [None]:
# Quick check showing that clusterIndex is not unique between files.
# Note: `files` and `tree_name` are not defined in this notebook, so this cell will not run as-is.
trees = {key:file.Get(tree_name) for key, file in files.items()}
t1 = trees['piminus']
t2 = trees['piplus']

t1_range = range(3,6)
t2_range = range(316,318)

for i in t1_range:
    t1.GetEntry(i)
    print(t1.eventNumber,'\t',t1.clusterIndex)

print('---')
for i in t2_range:
    t2.GetEntry(i)
    print(t2.eventNumber,'\t',t2.clusterIndex)