# Jet Clustering

This workflow is for use with the jet samples that contain both `ClusterTree` and `EventTree` (provided by the `MLTree` utility). It **cannot** handle data where the `EventTree` does not exist, because that tree contains the information needed to piece the clusters together into events*, as well as the baseline jet clustering.

\* This piecing together can be accomplished in workflows like `EventReconstructionPion.ipynb`, but it's rather complex.

#### TODO:

- finish up calculation of scores
- jet clustering (save to new file?)
- comparison of jets

#### 1) Setup

First, let's import a bunch of packages we know we'll need right off the bat.

Note that as we've set up our environment with `conda`, our `ROOT` installation has all the bells and whistles. This includes the `pythia8` library and its associated `ROOT` wrapper, `TPythia8`. We can optionally use this for jet clustering, as it comes with `fj-core`.
Alternatively we could use the Pythonic interface for `fastjet` or [pyjet](https://github.com/scikit-hep/pyjet), but the latter requires linking an external fastjet build for speed and this doesn't seem to work when following their documentation.
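
Whichever backend ends up being used, it helps to keep in mind what the clustering itself does. Below is a minimal pure-Python sketch of the anti-$k_t$ algorithm with E-scheme recombination, for reference only. The function names (`to_p4`, `pt_y_phi`, `antikt`) and the toy inputs are hypothetical, the radius $R=0.4$ is an assumption based on the `AntiKt4...` branch names, and this naive loop is far slower than what `fastjet` actually does.

```python
import math

def to_p4(pt, eta, phi):
    """Massless four-vector (px, py, pz, E) from (pt, eta, phi)."""
    return (pt * math.cos(phi), pt * math.sin(phi),
            pt * math.sinh(eta), pt * math.cosh(eta))

def pt_y_phi(p4):
    """Transverse momentum, rapidity and azimuth of a four-vector."""
    px, py, pz, e = p4
    return math.hypot(px, py), 0.5 * math.log((e + pz) / (e - pz)), math.atan2(py, px)

def antikt(particles, R=0.4):
    """Naive anti-kt clustering; returns jets as four-vectors, sorted by pt."""
    objs = list(particles)
    jets = []
    while objs:
        kin = [pt_y_phi(p) for p in objs]
        best = (math.inf, None, None)  # (distance, i, j); j is None for a beam distance
        for i, (pt_i, y_i, phi_i) in enumerate(kin):
            d_ib = 1.0 / pt_i**2  # beam distance
            if d_ib < best[0]:
                best = (d_ib, i, None)
            for j in range(i + 1, len(kin)):
                pt_j, y_j, phi_j = kin[j]
                dphi = math.pi - abs(math.pi - abs(phi_i - phi_j))  # wrap into [0, pi]
                dr2 = (y_i - y_j)**2 + dphi**2
                d_ij = min(1.0 / pt_i**2, 1.0 / pt_j**2) * dr2 / R**2
                if d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        if j is None:
            jets.append(objs.pop(i))  # closest to the beam: promote to a final jet
        else:
            merged = tuple(a + b for a, b in zip(objs[i], objs[j]))  # E-scheme sum
            del objs[j]  # j > i, so index i is still valid
            objs[i] = merged
    return sorted(jets, key=lambda p: math.hypot(p[0], p[1]), reverse=True)

# Toy input: two nearby objects and one far away in phi.
toy = [to_p4(50.0, 0.0, 0.0), to_p4(20.0, 0.1, 0.1), to_p4(30.0, 0.0, 3.0)]
jets = antikt(toy, R=0.4)
jet_pts = [math.hypot(px, py) for px, py, _, _ in jets]
```

In this toy case the two nearby objects merge into one jet and the distant one forms a jet of its own.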

In [1]:
import numpy as np
import ROOT as rt
import uproot as ur
import sys

Welcome to JupyROOT 6.22/02


In [2]:
# some extra setup
path_prefix = '/workspace/LCStudies/'

#### 2) Fetching the data

Now we get our data. For now, our classifiers are being trained to distinguish between $\pi^+$ and $\pi^0$. Assuming that all charged pions behave the same way, we can really treat this as a $\pi^\pm$ vs. $\pi^0$ classifier. **For our toy workflow, we'll say that we only want to cluster $\pi^\pm$ topo-clusters into jets.** We will treat $\pi^0$ as a background.

For our input data, we have `ROOT` files containing a tree called `ClusterTree`. In each tree, each entry corresponds to one topo-cluster, and the different files correspond to different topo-cluster parent particles (e.g. three files, for $\pi^+$, $\pi^-$ and $\pi^0$). Each topo-cluster entry contains information on the event from which it came (`runNumber` and `eventNumber`), and many topo-clusters (within and across files) share the same event. Our ultimate goal is to regroup this data into one file where each entry corresponds to one *event*. This is a sensible way to arrange the data before performing any jet clustering (which is performed per event), and writing the result to a file will allow us to skip this whole process after doing it once.
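
The regrouping described above boils down to collecting cluster indices under a `(runNumber, eventNumber)` key. Here is a minimal sketch with made-up columns; the variable names and values are hypothetical, and in practice the columns from several files would be concatenated first.

```python
from collections import defaultdict

# Hypothetical per-cluster columns, as they might be read from a ClusterTree.
run_number   = [1, 1, 1, 1, 2]
event_number = [7, 7, 8, 7, 7]
cluster_e    = [10.0, 5.0, 3.0, 2.0, 8.0]

# Map each (run, event) pair to the indices of the clusters it contains.
events = defaultdict(list)
for idx, key in enumerate(zip(run_number, event_number)):
    events[key].append(idx)

# Any per-event quantity can now be built from the cluster indices,
# e.g. the summed cluster energy of each event.
event_energy = {key: sum(cluster_e[i] for i in idxs) for key, idxs in events.items()}
```

Note that `eventNumber` alone is not used as the key, since the same event number can appear under different run numbers.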

In [3]:
#TODO: Some of this meta-data is unused.
# ----- Meta-data for our dataset -----
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
nlayers = len(layers)
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]
assert(len(len_phi) == nlayers)
assert(len(len_eta) == nlayers)
meta_data = {
    layers[i]:{
        'cell_size':(cell_size_eta[i],cell_size_phi[i]),
        'dimensions':(len_eta[i],len_phi[i])
    }
    for i in range(nlayers)
}
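
As a quick sanity check of the meta-data above, the sketch below computes the number of cells and the angular window covered by each layer, and reshapes a flat list of cell values into an $(\eta,\phi)$ grid. It assumes that each image is stored flattened in row-major $(\eta,\phi)$ order, which is not confirmed by the notebook; `unflatten` is a hypothetical helper.

```python
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]

# Number of cells per layer image.
n_cells = {layer: n_eta * n_phi for layer, n_eta, n_phi in zip(layers, len_eta, len_phi)}

# Angular window (eta, phi) covered by each layer image.
coverage = {
    layer: (d_eta * n_eta, d_phi * n_phi)
    for layer, d_eta, n_eta, d_phi, n_phi
    in zip(layers, cell_size_eta, len_eta, cell_size_phi, len_phi)
}

def unflatten(flat, n_eta, n_phi):
    """Reshape a flat list into n_eta rows of n_phi cells (assumed row-major)."""
    if len(flat) != n_eta * n_phi:
        raise ValueError('expected %d cells, got %d' % (n_eta * n_phi, len(flat)))
    return [flat[row * n_phi:(row + 1) * n_phi] for row in range(n_eta)]

toy_image = list(range(n_cells['TileBar2']))  # 8 placeholder cell values
grid = unflatten(toy_image, 2, 4)
```

With these numbers every layer covers a window of roughly $0.4 \times 0.4$ in $(\eta,\phi)$, just at different granularity.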

In [4]:
import glob, os
import subprocess as sub
# where the original data sits
data_dir = '/workspace/LCStudies/data/jet'
data_filenames = glob.glob(data_dir + '/' + '*.root')

# our "local" data dir, where we create modified data files
from pathlib import Path
jet_data_dir = path_prefix + 'jets/data'
Path(jet_data_dir).mkdir(parents=True, exist_ok=True)

# Get the original data.
files = {name:rt.TFile(name,'READ') for name in data_filenames}

# Some data files might be missing an EventTree.
# For now, we will skip these because our methods count on an existing EventTree.
delete_keys = []
for key, val in files.items():
    file_keys = [x.GetName() for x in val.GetListOfKeys()]
    if('ClusterTree' not in file_keys or 'EventTree' not in file_keys):
        delete_keys.append(key)

for key in delete_keys: 
    print('Ignoring file:',key,'(no EventTree/ClusterTree found).')
    del files[key]

Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000077.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000308.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000225.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000259.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000457.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000270.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000026.root (no EventTree/ClusterTree found).
Ignoring file: /workspace/LCStudies/data/jet/user.angerami.21717971.OutputStream._000481.root (no EventT

In [5]:
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from util import qol_util as qu # for progress bar

# now we make a local copy of the files in the jet_data_dir, keeping only certain branches
active_branches = {}
active_branches['cluster'] = [
    'runNumber',
    'eventNumber',
    'truthE',
    'truthPt',
    'truthEta',
    'truthPhi',
    'clusterIndex',
    'nCluster',
    'clusterE',
    'clusterECalib',
    'clusterPt',
    'clusterEta',
    'clusterPhi',
    'cluster_nCells',
    'cluster_ENG_CALIB_TOT',
    'EMB1',
    'EMB2',
    'EMB3',
    'TileBar0',
    'TileBar1',
    'TileBar2'
]
active_branches['event'] = [
    'runNumber',
    'eventNumber',
    'lumiBlock',
    'NPV',
    'nTruthPart',
    'clusterCount',
    'nCluster',
    'clusterE',
    'clusterPt',
    'clusterEta',
    'clusterPhi',
    'AntiKt4EMTopoJetsPt',
    'AntiKt4EMTopoJetsEta',
    'AntiKt4EMTopoJetsPhi',
    'AntiKt4EMTopoJetsE',
    'AntiKt4LCTopoJetsPt',
    'AntiKt4LCTopoJetsEta',
    'AntiKt4LCTopoJetsPhi',
    'AntiKt4LCTopoJetsE',
    'AntiKt4TruthJetsPt',
    'AntiKt4TruthJetsEta',
    'AntiKt4TruthJetsPhi',
    'AntiKt4TruthJetsE'
]

tree_names = {'cluster':'ClusterTree','event':'EventTree'}
data_filenames = []

l = len(files.keys())
i = 0
qu.printProgressBarColor(i, l, prefix='Copying data files:', suffix='Complete', length=50)

for path, tfile in files.items():
    filename_new = jet_data_dir + '/' + path.split('/')[-1]
    old_trees = {x:tfile.Get(tree_names[x]) for x in tree_names.keys()}
    
    for key, tree in old_trees.items():
        tree.SetBranchStatus('*',0)
        for bname in active_branches[key]: tree.SetBranchStatus(bname,1)
    
    tfile_new = rt.TFile(filename_new,'RECREATE')
    new_trees = {x:old_trees[x].CloneTree() for x in old_trees.keys()}
    tfile_new.Write()
    data_filenames.append(filename_new)
    i += 1
    qu.printProgressBarColor(i, l, prefix='Copying data files:', suffix='Complete', length=50)
    del old_trees
    del new_trees
    tfile_new.Close() # close the output file so it is fully flushed before being re-opened below

Copying data files: |██████████████████████████████████████████████████| 100.0% Complete


In [6]:
# Access the files & trees with uproot
files = {name:rt.TFile(name,'READ') for name in data_filenames}
tree_names = {'cluster':'ClusterTree','event':'EventTree'}
#trees = {key:{tree_key:file.Get(tree_name) for tree_key,tree_name in tree_names.items()} for key,file in files.items()}
ur_trees = {file:{tree_key:ur.open(file)[tree_name] for tree_key,tree_name in tree_names.items()} for file in data_filenames}

# reminder: how to get an awkward array for a particular branch, here a is a key corresponding to a filename
#ur_trees[a]['cluster'].array('EMB1')

In [7]:
# Prep the calo images. These will be network inputs.
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from util import ml_util as mu
from util import qol_util as qu # for progress bar

l = len(layers) * len(ur_trees.keys())
i = 0
pfx = 'Loading calo images:'
sfx = 'Complete'
qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)

calo_images = {}
for dfile in data_filenames:
    calo_images[dfile] = {}
    for layer in layers:
        calo_images[dfile][layer] = mu.setupCells(ur_trees[dfile]['cluster'],layer)
        i += 1
        qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)

Loading calo images: |██████████████████████████████████████████████████| 100.0% Complete


Loading the images above **and** the regression inputs below (which contain another, standardized copy of the images) gives me "out of memory" issues, so I will try to delete objects as soon as they are no longer needed.
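
A minimal sketch of this "delete as we go" pattern is shown below. `BigArray` and the file names are hypothetical stand-ins for the per-file images; the point is that `del` only removes a name, and the memory is released once no reference to the object remains.

```python
import gc
import weakref

class BigArray:
    """Hypothetical stand-in for a large array (e.g. one file's calo images)."""
    def __init__(self, n):
        self.data = bytearray(n)

images = {'file_a.root': BigArray(10**6), 'file_b.root': BigArray(10**6)}
probe = weakref.ref(images['file_a.root'])  # lets us check when the object is really gone

processed = {}
for name in list(images):            # iterate over a copy of the keys so we can remove entries
    big = images.pop(name)           # drop the dictionary's reference
    processed[name] = len(big.data)  # use the data (here: a trivial summary)
    del big                          # drop the last local reference

gc.collect()  # only needed to clean up reference cycles
```

After the loop, `probe()` returns `None`, which confirms that the first object has been freed rather than merely renamed.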

In [8]:
# Load some additional information, needed for energy regression.
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from sklearn.preprocessing import StandardScaler
from util import ml_util as mu # for passing calo images to regression networks

scaler_e = StandardScaler()
scaler_cal = StandardScaler()
scaler_eta = StandardScaler()
# epsilon = 1.0e-12 # we will clean values of calibrated energy

regression_cols = {}
for dfile in data_filenames:
    regression_cols[dfile] = {}
    
#     e_calib = ur_trees[dfile]['cluster'].array('cluster_ENG_CALIB_TOT')
#     e_calib = np.where(e_calib < epsilon, epsilon, e_calib) # cleaning: remove zeros to prevent log from blowing up
#     regression_cols[dfile]['s_logECalib'] = scaler_cal.fit_transform(np.log(e_calib).reshape(-1,1))
        
    e = ur_trees[dfile]['cluster'].array('clusterE')
    regression_cols[dfile]['s_logE'] = scaler_e.fit_transform(np.log(e).reshape(-1,1)) # dedicated scaler for log(E)
    
    eta = ur_trees[dfile]['cluster'].array('clusterEta')
    regression_cols[dfile]['s_eta'] = scaler_eta.fit_transform(eta.reshape(-1,1)) # dedicated scaler for eta

l = len(ur_trees.keys())
i = 0
pfx = 'Preparing regression inputs:'
sfx = 'Complete'
qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)
    
regression_input = {}
for dfile in data_filenames:
    combined_images = np.concatenate(tuple([calo_images[dfile][layer] for layer in layers]), axis=1)
    s_combined,scaler_combined = mu.standardCells(combined_images, layers)
    regression_input[dfile] = np.column_stack((regression_cols[dfile]['s_logE'], regression_cols[dfile]['s_eta'],s_combined))
    del regression_cols[dfile]
    del ur_trees[dfile]
    i = i + 1
    qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)

Preparing regression inputs: |██████████████████████████████████████████████████| 100.0% Complete
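
For reference, the standardization applied to `log(E)` and `eta` above is just a shift by the mean and a division by the (population) standard deviation, which is what `StandardScaler` does with its default settings. Here is a sketch with made-up energies; `standardize` is a hypothetical helper.

```python
import math
import statistics

def standardize(values):
    """Return the standardized values plus the mean and std that were used."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0), as in StandardScaler
    return [(v - mean) / std for v in values], mean, std

cluster_e = [0.5, 2.0, 8.0, 32.0]  # toy cluster energies
log_e = [math.log(e) for e in cluster_e]
s_log_e, mean, std = standardize(log_e)
```

Returning `mean` and `std` makes it possible to reuse the same transformation on another sample. Whether the statistics here should instead come from the sample the networks were trained on is an open question that this notebook does not settle.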


#### 3) Adding network scores

Now that we have our data, we want to apply our classification and regression networks to all of the topo-clusters, and save the resulting scores for each cluster.

We will save the scores in a new tree in each file. It will have exactly as many entries as the existing `ClusterTree`, so that it can be attached to `ClusterTree` as a friend tree.

**Note** that these networks have been trained elsewhere. To start, we're using networks trained on a totally different dataset, one without pile-up. In our jet dataset we don't actually know which topo-clusters are caused by $\pi^\pm$ and which by $\pi^0$, so we cannot directly determine *classification* accuracy; we can only gauge results by how the topo-cluster energy or the jet energy resolution is affected.
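
One simple way to quantify "how the jet energy resolution is affected" is to compare the spread of the response $E_\text{reco}/E_\text{truth}$ between two sets of jets. The sketch below uses the standard deviation divided by the mean as the resolution, which is only one of several possible definitions. It assumes that reco and truth jets have already been matched one-to-one, and all numbers are made up.

```python
import statistics

def response_and_resolution(reco_e, truth_e):
    """Mean response and relative resolution (std / mean) of E_reco / E_truth."""
    ratios = [r / t for r, t in zip(reco_e, truth_e)]
    mean = statistics.fmean(ratios)
    return mean, statistics.pstdev(ratios) / mean

truth_e  = [10.0, 10.0, 10.0, 20.0]  # toy truth-jet energies
reco_e_a = [9.0, 10.0, 11.0, 20.0]   # toy jets from calibration A
reco_e_b = [8.0, 10.0, 12.0, 20.0]   # toy jets from calibration B

resp_a, res_a = response_and_resolution(reco_e_a, truth_e)
resp_b, res_b = response_and_resolution(reco_e_b, truth_e)
```

Here both calibrations have a mean response of 1, but calibration A has the smaller spread and would be preferred.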

In [9]:
# Setup for TensorFlow and Keras.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # disable some of the tensorflow info printouts, only display errors
ngpu = 1
gpu_list = ["/gpu:"+str(i) for i in range(ngpu)]
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=gpu_list)
ngpu = strategy.num_replicas_in_sync
print ('Number of devices: {}'.format(ngpu))

# Dictionary for storing all our neural network models that will be evaluated
network_models = {}

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of devices: 1


In [10]:
# Load our classification model. As we use the simple combo network, we must *also* load all of our so-called "flat networks", one per calo layer.
# We must evaluate all these networks on the data, and then pass their outputs to the simple combo network.
classification_dir = path_prefix + 'classifier/Models'
regression_dir = path_prefix + 'regression/Models'

In [11]:
# flat classifiers
print('Loading flat classification models... ')
flat_model_files = glob.glob(classification_dir + '/flat/' + '*.h5')
flat_model_files.sort()
flat_model_names = []
for model in flat_model_files:
    model_name = model.split('model_')[-1].split('_flat')[0]
    print('\tLoading ' + model_name + '... ',end='')
    flat_model_names.append(model_name)
    network_models[model_name] = tf.keras.models.load_model(model)
    print('Done.')

Loading flat classification models... 
	Loading EMB1... Done.
	Loading EMB2... Done.
	Loading EMB3... Done.
	Loading TileBar0... Done.
	Loading TileBar1... Done.
	Loading TileBar2... Done.


In [12]:
# combo classifier
print('Loading simple combo classification model... ',end='')
combo_model_file = classification_dir + '/simple/' + 'model_simple_do20.h5'
network_models['combo'] = tf.keras.models.load_model(combo_model_file)
print('Done.')

Loading simple combo classification model... Done.


In [13]:
# energy regression networks
print('Loading charged-pion energy regression model... ',end='')
charged_energy_model_file = regression_dir + '/' + 'all_charged.h5'
network_models['e_charged'] = tf.keras.models.load_model(charged_energy_model_file)
print('Done.')

print('Loading neutral-pion energy regression model... ',end='')
neutral_energy_model_file = regression_dir + '/' + 'all_neutral.h5'
network_models['e_neutral'] = tf.keras.models.load_model(neutral_energy_model_file)
print('Done.')

Loading charged-pion energy regression model... Done.
Loading neutral-pion energy regression model... Done.


In [14]:
# Now get the network scores.
if(path_prefix not in sys.path): sys.path.append(path_prefix)
from util import ml_util as mu # for passing calo images to classification networks
from util import qol_util as qu # for progress bar

model_scores = {}
# We will need to evaluate scores in a particular order, since some networks depend on others' output:
# 1) Flat networks (classification)
# 2) Simple combo network (classification using flat networks) 
# 3) Energy regression networks (regression, using simple combo network)

In [15]:
# shortened file list for debugging
a = data_filenames[:2]

In [17]:
# Flat networks.
# Note: The network names correspond to the branch names they act on.
l = len(flat_model_names) * len(data_filenames) # NOTE: while we loop over the debug list `a` below, the bar stops at len(a)/len(data_filenames)
i = 0
pfx = 'Evaluating flat networks:'
sfx = 'Complete'
qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)
for layer in flat_model_names:
    model = network_models[layer]
    model_scores[layer] = {}
    
    for dfile in a: # TODO: change a -> data_filenames
        model_input = calo_images[dfile][layer]
        model_scores[layer][dfile] = model.predict(model_input)
        i += 1
        qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)

Evaluating flat networks: |██████████████------------------------------------| 28.6% Complete

In [20]:
# Combo network.
name = 'combo'
model = network_models[name]
model_scores[name] = {}

l = len(data_filenames)
i = 0
pfx = 'Evaluating combo network:'
sfx = 'Complete'
qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)
for dfile in a: # TODO: change a -> data_filenames (ur_trees was emptied while preparing the regression inputs)
    input_scores = np.column_stack([model_scores[layer][dfile][:,1] for layer in layers]) # TODO: Why the [:,1]? Based on Max's code.
    model_scores[name][dfile] = model.predict(input_scores)
    i += 1
    qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)
    
# We can safely delete the flat network scores now (maybe this saves some memory, though I should probably do better memory management in general)
for layer in layers:
    del model_scores[layer]

Evaluating combo network: |██████████████------------------------------------| 28.6% Complete
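
To make the shape of the combo-network input explicit, here is a pure-Python equivalent of the `np.column_stack([... [:,1] ...])` line above, using made-up scores. It assumes each flat network returns one pair of class scores per cluster, which the `[:,1]` indexing suggests but the notebook does not confirm; which class index 1 stands for depends on how the networks were trained.

```python
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]

# Hypothetical flat-network outputs for 3 clusters:
# one (score_class0, score_class1) pair per cluster and per layer.
flat_scores = {layer: [(0.9, 0.1), (0.4, 0.6), (0.2, 0.8)] for layer in layers}

def combo_input(scores, layer_order):
    """Build one row per cluster holding the class-1 score from every layer."""
    n_clusters = len(scores[layer_order[0]])
    return [[scores[layer][k][1] for layer in layer_order] for k in range(n_clusters)]

x = combo_input(flat_scores, layers)
```

The result has one row per cluster and one column per layer, so the column order must match the layer order the combo network was trained with.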

In [None]:
# Energy regression - charged pions.
name = 'e_charged'
model = network_models[name]
model_scores[name] = {}

l = len(data_filenames) # ur_trees was emptied while preparing the regression inputs, so count the files instead
i = 0
pfx = 'Evaluating energy regression networks (charged pions):'
sfx = 'Complete'
qu.printProgressBarColor(i, l, prefix=pfx, suffix=sfx, length=50)

    
#     combined_images = np.concatenate(tuple([calo_images[key][layer] for layer in layers]), axis=1)
#     del calo_images[key] # delete this element of calo_images, it has been copied

#     s_combined,scaler_combined = mu.standardCells(combined_images, layers)

#     All_input[key] = np.column_stack((data_frames[key]['s_logE'], data_frames[key]['s_eta'],s_combined))

In [None]:
# TODO: 
#
# regression network(s) (need to finish)
#
# save scores to files (in new tree?)
#
# perform clustering