# Particle identification

In this assignment, we will learn how to define and run deep-learning methods for particle identification in neutrino events. In the last machine-learning lecture, we implemented several classification models using standard machine-learning methods (i.e., logistic regression and decision trees). For this assignment, we will instead use deep learning, which is based on artificial neural networks with many layers.

## Prerequisites

Let's start with turning on the GPU (if available):

```
Edit -> Notebook settings -> Hardware accelerator: GPU -> Save.
```



Download the dataset, as well as load the needed Python packages and modules:

In [None]:
!wget "https://raw.githubusercontent.com/saulam/neutrinoml/main/modules.py"
!wget "https://raw.githubusercontent.com/saulam/neutrinoml/main/df_pgun_teaching.p"

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, MaxPooling2D, Activation
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from modules import *

Check whether the GPU was found:

In [None]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

## Dataset

We can now load the dataset:

In [None]:
# read dataframe
df = pd.read_pickle('df_pgun_teaching.p')

We may have a look at the dataset. It consists of 59,578 particle gun events ([from the SFGD detector](https://doi.org/10.1088/1748-0221/15/12/p12003)) with the following attributes:

- **TruePID**: PDG code for particle identification (PID); 2212 (proton), 13 (muon), 211 (pion).
- **TrueMomentum**: momentum in MeV.
- **NNodes**: number of nodes of the event (3D spatial points).
- **NodeOrder**: order of the nodes within the event.
- **NodePosX**: array with the coordinates of the nodes along the X-axis (in mm).
- **NodePosY**: array with the coordinates of the nodes along the Y-axis (in mm).
- **NodePosZ**: array with the coordinates of the nodes along the Z-axis (in mm).
- **NodeT**: array with the timestamps of the nodes (in ms).
- **Nodededx**: array with energy deposits of the nodes (dE/dx).
- **TrkLen**: length of the track (in mm).
- **TrkEDepo**: total track energy deposition (in arbitrary unit).
- **TrkDir1**: track direction, polar angle (in degrees).
- **TrkDir2**: track direction, azimuth angle (in degrees).


In [None]:
df

And check the correlations of the variables (please notice that the node features are not included since each event has a different length):

In [None]:
df.select_dtypes(include='number').corr() # numeric columns only; the array-valued node columns are skipped

The 3D spatial points of the events are usually stored in the form of hits or nodes. We chose the latter for our dataset. A hit corresponds to a cube with real energy deposition (there are usually many hits across the track signature), whilst a node corresponds to a fitted position after performing the track reconstruction.

<div>
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/hit.png" width="400"/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/node.png" width="400"/>
</div>

We may also have a look at the events by plotting the nodes within the detector space. By default, we're looking at the first event (event 0), but we can display more events by playing with the variable `event_number`.

In [None]:
event_number = 0
plot_event(df, event_number)

Regardless of the type of data we use and the algorithm chosen, it is essential to perform a **preprocessing** of the data, which allows us to prepare the data to make it understandable for the machine-learning algorithm.

As explained before, the goal is to learn to predict a label **y** from a fixed-size vector of features **X**. However, the input data is 3D, and every event (track) has a different size. A simple way to start is to use only two of the features: `TrkLen` and `TrkEDepo`. Please note that we are encoding the PDG codes of protons (2212), muons (13), and pions (211) as 0, 1, and 2, respectively.

In [None]:
X = np.zeros(shape=(len(df),2), dtype=np.float32) # array of size (n_events, 2)
y = np.zeros(shape=(len(df),), dtype=np.float32)  # array of size (n_events,)

# fill dataset
for event_n, event in df.iterrows():
    
    pid_label = event['TruePID']
    
    # global track parameters used as the two input features
    X[event_n, 0] = event['TrkLen']
    X[event_n, 1] = event['TrkEDepo']

    # PID label
    if pid_label==2212: 
      pid_label=0 # proton
    elif pid_label==13: 
      pid_label=1 # muons
    else:
      pid_label=2 # pions
    y[event_n] = pid_label

# standardize the dataset (mean=0, std=1)
X_stan = scale(X)
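
To make the two preprocessing steps above more concrete, here is a minimal NumPy-only sketch of the label encoding and the standardization. The toy values and the names `X_toy`, `true_pid`, `X_toy_stan`, and `y_toy` are hypothetical and not part of the dataset; the standardization is written by hand under the assumption that it matches what `scale` does with its default settings (per-column mean 0 and standard deviation 1).

```python
import numpy as np

# Hypothetical toy values standing in for (TrkLen, TrkEDepo) of four events.
X_toy = np.array([[100.0, 2000.0],
                  [250.0, 1500.0],
                  [400.0, 3000.0],
                  [50.0, 500.0]], dtype=np.float32)
true_pid = np.array([2212, 13, 211, 13])  # PDG codes of the toy events

# Standardize each column: subtract its mean and divide by its standard deviation.
X_toy_stan = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Encode PDG codes as class indices: proton -> 0, muon -> 1, anything else -> 2.
y_toy = np.where(true_pid == 2212, 0, np.where(true_pid == 13, 1, 2)).astype(np.float32)

print(X_toy_stan.mean(axis=0))  # close to 0 for each column
print(X_toy_stan.std(axis=0))   # close to 1 for each column
print(y_toy)
```

Standardizing matters here because `TrkLen` and `TrkEDepo` live on very different numerical scales, and neural networks usually train more smoothly when all inputs have a comparable range.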

In order to understand the training data, it's always good to visualise first. For simplicity, let's start comparing protons and muons (ignoring pions). A good way of doing it is to create a scatter plot of one feature against the other:

In [None]:
param_names = ['TrkLen', 'TrkEDepo']
y_names = ['proton', 'muon']

plot_params_pid(X[y!=2], y[y!=2], param_names, y_names)

Good! It's easy to distinguish by eye two "almost" independent distributions: one for protons and the other for muons.

## Fully connected neural networks

Training a machine-learning algorithm is usually not an easy task. The algorithm learns from some training data until it is ready to make predictions on unseen data. In order to test how the algorithm performs on new data, the dataset is divided into two groups (sometimes it is divided into three groups, but we're keeping two here for simplicity):

- Training set: the model learns from this set only. It must be the largest set.
- Test set: it is used to evaluate the model at the end of the training, only once it is fully trained. 

In this example, we keep 60% of the data for training and 40% for testing. Besides, it's always recommended to shuffle the training examples to prevent any bias during the training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_stan[y!=2], y[y!=2], test_size=0.4, random_state=7) # random shuffle and split: 60% training, 40% test
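
For intuition, the shuffle-and-split step can be sketched with plain NumPy. This is only an illustration of the idea, not the actual implementation of `train_test_split`; the helper `shuffle_split` and the toy arrays are hypothetical.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.4, seed=7):
    """Shuffle the rows of X and y together, then split them into train/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))             # random ordering of the row indices
    n_test = int(np.ceil(test_size * len(X)))   # number of examples held out for testing
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i of X_toy is [2i, 2i+1] and its label is i % 2.
X_toy = np.arange(20, dtype=np.float32).reshape(10, 2)
y_toy = np.arange(10) % 2

X_tr, X_te, y_tr, y_te = shuffle_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

The important detail is that the same permutation is applied to the features and to the labels, so every event keeps its own label after shuffling.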

This assignment deals with deep-learning methods, a subset of machine learning based on artificial neural networks. We will implement the fully connected neural network shown below (i.e., every neuron in one layer is connected to every neuron in the next layer) using the Keras interface of the TensorFlow deep-learning framework. Keras is an API well suited to neural-network prototyping. In the architecture below, each neuron computes the function $\sigma(w \cdot x + b)$, where $x$ is the input vector of the neuron, $w$ and $b$ are its weights and bias, respectively, and $\sigma$ is the [activation function](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6). For the sigmoid output neuron, an output below 0.5 is interpreted as a proton (label 0) and an output above 0.5 as a muon (label 1).
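
To see what such a network computes, here is a small NumPy sketch of one forward pass through two hidden layers of 10 ReLU neurons and a single sigmoid output. The weights are random and untrained, and all names (`relu`, `sigmoid`, `W1`, `b1`, ...) are hypothetical; Keras handles all of this internally.

```python
import numpy as np

def relu(z):
    """ReLU activation: keep positive values, set negative values to zero."""
    return np.maximum(z, 0.0)

def sigmoid(z):
    """Sigmoid activation: squash any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
x = rng.normal(size=(5, 2))  # 5 toy events with 2 standardized features each

# Random (untrained) weights and zero biases for the three layers.
W1, b1 = 0.5 * rng.normal(size=(2, 10)), np.zeros(10)
W2, b2 = 0.5 * rng.normal(size=(10, 10)), np.zeros(10)
W3, b3 = 0.5 * rng.normal(size=(10, 1)), np.zeros(1)

h1 = relu(x @ W1 + b1)       # hidden layer 1: sigma(w x + b) for each of its 10 neurons
h2 = relu(h1 @ W2 + b2)      # hidden layer 2
p = sigmoid(h2 @ W3 + b3)    # output layer: one score per event
labels = (p > 0.5).astype(int)  # 0 = proton, 1 = muon

print(p.shape, labels.ravel())
```

Training consists of adjusting the entries of the weight matrices and bias vectors so that the output scores agree with the true labels as often as possible.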

<div>
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/dense_nn.png" width="700"/>
</div>


In [None]:
tf.random.set_seed(7) # for reproducibility

num_features = 2 # TrkLen, TrkEDepo
num_classes = 1 # one output unit is enough since it's a binary classification problem

# Fully connected neural network model
input = Input(shape=(num_features,)) # input layer
x = Dense(10, activation='relu')(input) # hidden layer 1
x = Dense(10, activation='relu')(x) # hidden layer 2
output = Dense(num_classes, activation='sigmoid')(x) # output layer
model = Model(inputs=input, outputs=output)

# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

And train the model for 10 epochs and a batch size of 128:

*   Batch: a set of $n$ input examples (also called mini-batch). The input examples in a batch are processed independently, in parallel. During training, a batch results in only one update to the model (one forward pass and one backward pass).
*   Epoch: one forward pass and one backward pass of all the training examples. In other words, an epoch is one pass over the entire training set, and it is used to separate training into distinct phases. For a dataset consisting of $m$ training examples and a batch size of $n$, it takes $\lceil m / n \rceil$ iterations to complete one epoch.
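
The relation between batches and epochs is simple arithmetic, sketched below. The helper `updates_per_epoch` and the numbers are hypothetical; the sketch assumes the last, smaller batch is kept rather than dropped.

```python
import math

def updates_per_epoch(m, n):
    """Number of weight updates in one epoch for m examples and batch size n."""
    return math.ceil(m / n)

m, n, epochs = 1000, 128, 10
per_epoch = updates_per_epoch(m, n)   # 7 full batches plus one smaller batch of 104
total_updates = per_epoch * epochs

print(per_epoch, total_updates)  # 8 80
```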


In [None]:
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=1)

It's also usual to calculate some metrics to evaluate our deep-learning method's performance on the test set.

In [None]:
y_pred = model.predict(X_test).round()
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons'])
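
The numbers printed above can also be computed by hand, which helps when reading a confusion matrix. The sketch below uses hypothetical toy labels and a hypothetical helper `confusion_counts`; it places the true class on the rows and the predicted class on the columns, which is an assumption of this sketch and may differ from the layout printed by `print_conf`.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """counts[i, j] = number of events with true class i that were predicted as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true.astype(int), y_pred.astype(int)):
        counts[t, p] += 1
    return counts

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # toy truth: 0 = proton, 1 = muon
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # toy predictions

counts = confusion_counts(y_true, y_pred, n_classes=2)
per_class_acc = counts.diagonal() / counts.sum(axis=1)  # fraction of each true class predicted correctly
overall_acc = counts.trace() / counts.sum()             # fraction of all events predicted correctly

print(counts)
print(per_class_acc, overall_acc)
```

The per-class accuracy computed this way is the same quantity as the "Proton accuracy" and "Muon accuracy" printed in the cell above: the accuracy restricted to the events of one true class.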

Nice! We're getting almost perfect separation using only two input parameters! With logistic regression (first lecture), the proton accuracy was similar, but the muon accuracy was ~83%. The improvement from using neural networks is clear.

Is there any room for improving the current results? 

In the same way we did in the previous lecture, a more robust but straightforward way of making the input data interpretable for the algorithm is to keep the information of only a few nodes of each track. Our preprocessing is illustrated in the following figure (there are many combinations, we are showing just one practical example here):

<div>
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/reg.png" width="500"/>
</div>

where we keep the dE/dx of the first 3 and last 5 nodes of each track, along with the 4 global parameters of the track, building up an array of size 12. For events where the track has fewer than 8 nodes (first 3 + last 5 nodes), we simply fill the empty positions of the array with -1s.

To sum up, with this preprocessing, we should end up having our input dataset **X**, consisting of 59,578 vectors of size 12 each (a 59,578x12 matrix). The values to estimate, **y**, are the labels of each event (proton or muon).
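
The node selection with padding can be sketched on its own for a single track. The helper `first_last_dedx` is hypothetical; it assumes the dE/dx values are already sorted along the track, and for short tracks it packs the available values to the left and leaves the remaining slots at -1.

```python
import numpy as np

def first_last_dedx(dedx, n_first=3, n_last=5, pad=-1.0):
    """Return a fixed-size vector with the first n_first and last n_last dE/dx values."""
    out = np.full(n_first + n_last, pad, dtype=np.float32)
    k_first = min(len(dedx), n_first)                  # how many "first" nodes exist
    out[:k_first] = dedx[:k_first]
    k_last = min(max(len(dedx) - n_first, 0), n_last)  # how many further nodes are left
    if k_last > 0:                                     # guard: dedx[-0:] would select everything
        out[k_first:k_first + k_last] = dedx[-k_last:]
    return out

long_track = np.arange(1, 11, dtype=np.float32)  # toy track with 10 nodes
short_track = np.array([4.0, 5.0])               # toy track with only 2 nodes

print(first_last_dedx(long_track))   # first 3 values, then last 5 values
print(first_last_dedx(short_track))  # 2 values followed by six -1s
```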

In [None]:
X = np.zeros(shape=(len(df),12), dtype=np.float32) # array of size (n_event, 12)
y = np.zeros(shape=(len(df),), dtype=np.float32)   # array of size (n_event,)
X.fill(-1) # filled with -1s

# fill dataset
for event_n, event in df.iterrows():

    NodeOrder = event['NodeOrder']
    Nodededx = event['Nodededx'][NodeOrder]

    # retrieve up to the first 3 nodes
    nfirstnodes = min(Nodededx.shape[0], 3)
    X[event_n,:nfirstnodes] = Nodededx[:nfirstnodes]

    if Nodededx.shape[0]>nfirstnodes:
        # retrieve up to the last 5 nodes
        nlastnodes = min(Nodededx.shape[0]-3, 5)
        X[event_n,nfirstnodes:nfirstnodes+nlastnodes] = Nodededx[-nlastnodes:]

    # global parameters
    X[event_n,-4] = event['TrkLen']
    X[event_n,-3] = event['TrkEDepo']
    X[event_n,-2] = event['TrkDir1']
    X[event_n,-1] = event['TrkDir2']

    # PID label
    pid_label = event['TruePID']
    if pid_label==2212:
      pid_label=0 # protons
    elif pid_label==13: 
      pid_label=1 # muons
    else:
      pid_label=2 # pions
    y[event_n] = pid_label

# standardize the dataset (mean=0, std=1)
X_stan = scale(X)

In order to understand the training data, it's always good to visualise first. A good way of doing it could be creating a histogram plot of each of our 12 features:

In [None]:
param_names = ['dE/dx node 1', 'dE/dx node 2', 'dE/dx node 3', 'dE/dx node n-4',\
               'dE/dx node n-3', 'dE/dx node n-2', 'dE/dx node n-1', 'dE/dx node n', 'TrkLen',\
               'TrkEDepo', 'TrkDir1', 'TrkDir2']
y_names = ["proton", "muon","pion"]
plot_parameters(X, y, param_names, y_names, mode="classification")

We split the dataset again into training and test sets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_stan[y!=2], y[y!=2], test_size=0.4, random_state=7) # 60% training and 40% test

Define a new network (we just need to fix the input layer), train it on the new dataset and test it:

In [None]:
tf.random.set_seed(7) # for reproducibility

num_features = 12 # 8 dE/dx values + 4 global track parameters
num_classes = 1 # one output unit is enough since it's a binary classification problem

# Fully connected neural network model
input = Input(shape=(num_features,)) # input layer
x = Dense(10, activation='relu')(input) # hidden layer 1
x = Dense(10, activation='relu')(x) # hidden layer 2
output = Dense(num_classes, activation='sigmoid')(x) # output layer
model = Model(inputs=input, outputs=output)

# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# train the model
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=1)

# test the model
y_pred = model.predict(X_test).round()
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons'])

The results are amazing! However, we have solved a binary classification problem, while our dataset has a third type of particle that we have ignored (pions). Our network architecture is easily extensible to problems with a number of classes $k>2$.
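
For $k>2$ classes, the single sigmoid output is typically replaced by $k$ outputs with a softmax activation, trained with a categorical cross-entropy loss. Here is a minimal NumPy sketch of both pieces; the function names and the toy scores are hypothetical, and the loss is written under the assumption that the labels are integer class indices.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each event."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Toy raw scores for 2 events and 3 classes (proton, muon, pion).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])  # true classes: proton and pion

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)
pred = probs.argmax(axis=1)  # predicted class = most probable class

print(probs.round(3), loss, pred)
```

Taking the `argmax` over the class probabilities plays the same role that rounding the sigmoid output played in the binary case.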



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_stan, y, test_size=0.4, random_state=7) # 60% training and 40% test

In [None]:
tf.random.set_seed(7) # for reproducibility

num_features = 12 # 8 dE/dx values + 4 global track parameters
num_classes = 3 # proton, muon, and pion

# Fully connected neural network model
input = Input(shape=(num_features,)) # input layer
x = Dense(10, activation='relu')(input) # hidden layer 1
x = Dense(10, activation='relu')(x) # hidden layer 2
output = Dense(num_classes, activation='softmax')(x) # output layer
model = Model(inputs=input, outputs=output)

# compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# train the model
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=1)

# test the model
y_pred = model.predict(X_test).argmax(axis=1)
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
print(" - Pion accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==2], y_pred[y_test==2])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons', 'pions'])

The muon/pion separation looks much better than for decision trees (last lecture)!

The way to add more capacity to our model (making it more capable of learning) is to add more layers and more neurons per layer!
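
One rough way to quantify capacity is to count trainable parameters. Each dense layer has one weight per input-output pair plus one bias per output. The helper `dense_param_count` below is hypothetical; the layer sizes are those of the small network used so far (12 inputs, two hidden layers of 10, 3 outputs) and of a wider and deeper variant with four hidden layers of 100 neurons.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

small = dense_param_count([12, 10, 10, 3])
large = dense_param_count([12, 100, 100, 100, 100, 3])

print(small, large)  # 273 31903
```

These totals should agree with the "Total params" line printed by `model.summary()` for the corresponding models.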

In [None]:
tf.random.set_seed(7) # for reproducibility

num_features = 12 # 8 dE/dx values + 4 global track parameters
num_classes = 3 # proton, muon, and pion

# Fully connected neural network model
input = Input(shape=(num_features,)) # input layer
x = Dense(100, activation='relu')(input) # hidden layer 1
x = Dense(100, activation='relu')(x) # hidden layer 2
x = Dense(100, activation='relu')(x) # hidden layer 3
x = Dense(100, activation='relu')(x) # hidden layer 4
output = Dense(num_classes, activation='softmax')(x) # output layer
model = Model(inputs=input, outputs=output)

# compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# train the model
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=1)

# test the model
y_pred = model.predict(X_test).argmax(axis=1)
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
print(" - Pion accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==2], y_pred[y_test==2])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons', 'pions'])

## Convolutional neural networks

[Convolutional neural network (CNN)](https://direct.mit.edu/neco/article-abstract/1/4/541/5515/Backpropagation-Applied-to-Handwritten-Zip-Code?redirectedFrom=fulltext) algorithms that operate on images have been very successful in a number of [HEP tasks](https://iml-wg.github.io/HEPML-LivingReview/). The main feature of CNNs is that they apply a series of filters (using convolutions, hence the name of the CNN), usually followed by spatial pooling, applied in sequence to extract increasingly powerful and abstract features that allow the CNN to classify the images [[citation](http://dl.acm.org/citation.cfm?id=2999134.2999257)]. Each of the filters consists of a set of values that are learnt by the CNN through the training process.  CNNs are typically deep neural networks that consist of many convolutional layers, with the output from one convolutional layer forming the input to the next. The last layers of a CNN are usually fully connected layers, where the output layer is followed by a sigmoid or softmax activation function.
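
To illustrate the filter-then-pool idea, here is a tiny NumPy sketch with a single hand-made filter applied to a toy image. Everything in it (function names, image, filter) is hypothetical; in a real CNN the filter values are learnt, and frameworks use much faster implementations. As in most deep-learning libraries, the "convolution" here slides the filter without flipping it.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    blocks = feature_map[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                 # toy image: dark left half, bright right half
kernel = np.array([[-1.0, 1.0]])   # filter that responds to a dark-to-bright step

feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)  # convolution followed by ReLU
pooled = max_pool(feature_map)                              # spatial pooling

print(feature_map.shape, pooled)
```

The filter responds only where the brightness changes, and pooling keeps that response while shrinking the image, which is the basic mechanism stacked layer after layer in a CNN.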

Since CNNs learn from images, let's generate a 2D image for each event in the dataset. An easy way of doing it is to save the YZ projection of each 3D event. The choice of projection is not completely arbitrary: we wanted to keep the Z-axis since it corresponds to the beam direction.
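
The image size used in the next cell follows from the coordinate bounds and the 10 mm pixel size that appear in `map_value`. The short sketch below redoes that arithmetic; the helper `pixel_index` and the constant names are hypothetical, and the bounds are simply copied from the cell below.

```python
import math

MIN_Y, MAX_Y = -257.56, 317.56    # Y bounds in mm, as in map_value
MIN_Z, MAX_Z = -2888.78, -999.1   # Z bounds in mm, as in map_value
PIXEL = 10                        # pixel size in mm

def pixel_index(value, minimum, pixel=PIXEL):
    """Index of the pixel that contains the given coordinate."""
    return int((value - minimum) // pixel)

# The largest index is reached at the upper bound, so the size is that index plus one.
n_y = pixel_index(MAX_Y, MIN_Y) + 1
n_z = pixel_index(MAX_Z, MIN_Z) + 1

print(n_y, n_z)  # 58 189
```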

In [None]:
def map_value(y,z):
  min_y = -257.56
  max_y = 317.56
  min_z = -2888.78
  max_z = -999.1

  y = int((y-min_y)//10)
  z = int((z-min_z)//10)

  return y, z


X = np.zeros(shape=(len(df),58,189,1), dtype=np.float32) # array of size (n_event, 58, 189, 1)
y = np.zeros(shape=(len(df),), dtype=np.float32)   # array of size (n_event,)

# fill dataset
for event_n, event in df.iterrows():

    NodePosY = event['NodePosY']
    NodePosZ = event['NodePosZ']
    Nodededx = event['Nodededx']

    old_y, old_z, dedxs = -1, -1, []
    for index in range(len(NodePosY)):
        y_coord, z_coord = NodePosY[index], NodePosZ[index]
        y_coord, z_coord = map_value(y_coord, z_coord)

        if index==0 or (y_coord==old_y and z_coord==old_z):
            dedxs.append(Nodededx[index])
            old_y, old_z = y_coord, z_coord
        else:
            X[event_n, old_y, old_z, 0] = np.mean(dedxs)
            old_y, old_z, dedxs = y_coord, z_coord, []
            dedxs.append(Nodededx[index])

    X[event_n, old_y, old_z, 0] = np.mean(dedxs)

    # PID label
    pid_label = event['TruePID']
    if pid_label==2212:
      pid_label=0 # protons
    elif pid_label==13: 
      pid_label=1 # muons
    else:
      pid_label=2 # pions
    y[event_n] = pid_label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=7) # 60% training and 40% test

We may plot two different views of the same 3D event and the corresponding YZ projection to check everything worked as expected:

In [None]:
event_number = 0
plot_projection(df, event_number, X)

We will implement the following convolutional neural network:

<div>
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/cnn.png" width="900"/>
</div>

In [None]:
tf.random.set_seed(7) # for reproducibility

# Convolutional neural network model
inp_shape = (58,189,1)
input = Input(shape=inp_shape) # input layer
x = Conv2D(16, (6,18), padding='valid', strides=(2,3), activation='relu')(input) # conv layer 1
x = MaxPooling2D(pool_size=(2,3), strides=(2,3))(x) # max-pooling 1
x = Conv2D(32, (3,3), padding='valid', strides=(2,3), activation='relu')(x) # conv layer 2
x = MaxPooling2D(pool_size=(2,3), strides=(2,3))(x) # max-pooling 2
x = Flatten()(x) # from 3D to 1D
x = Dense(64, activation='relu')(x) # fully connected layer at the end
output = Dense(3, activation='softmax')(x) # output layer

# compile the model
model = Model(inputs=input, outputs=output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
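
It can be useful to check by hand how the image shrinks through the network. With `padding='valid'`, a layer with kernel size $k$ and stride $s$ maps a dimension of size $n$ to $\lfloor (n-k)/s \rfloor + 1$. The sketch below applies this to the layers defined above; the helper `valid_out` is hypothetical, and the sketch assumes the pooling layers also use no padding.

```python
def valid_out(size, kernel, stride):
    """Output size along one dimension for a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1

h, w = 58, 189                                       # input image (Y pixels, Z pixels)
h, w = valid_out(h, 6, 2), valid_out(w, 18, 3)       # conv layer 1, kernel (6,18), stride (2,3)
conv1 = (h, w)
h, w = valid_out(h, 2, 2), valid_out(w, 3, 3)        # max-pooling 1, pool (2,3), stride (2,3)
pool1 = (h, w)
h, w = valid_out(h, 3, 2), valid_out(w, 3, 3)        # conv layer 2, kernel (3,3), stride (2,3)
conv2 = (h, w)
h, w = valid_out(h, 2, 2), valid_out(w, 3, 3)        # max-pooling 2, pool (2,3), stride (2,3)
pool2 = (h, w)
flat = h * w * 32                                    # Flatten: 32 filters in conv layer 2

print(conv1, pool1, conv2, pool2, flat)  # (27, 58) (13, 19) (6, 6) (3, 2) 192
```

These shapes can be compared with the "Output Shape" column printed by `model.summary()`.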

In [None]:
# train the model
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=1)

# test the model
y_pred = model.predict(X_test).argmax(axis=1)
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
print(" - Pion accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==2], y_pred[y_test==2])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons', 'pions'])

How do we interpret the results? Does it mean CNNs are less powerful than fully connected networks (FCNs)? No! From the physics point of view, we are training the CNN to identify particles by looking only at their signatures in a 2D projection. In contrast, we gave our FCN reconstructed physics parameters that were already useful for performing PID. Thus, the scientist's goal should be to understand which method is best suited to each situation.

## Homework

It's your turn to beat the results above!

The idea is to add capacity to the models by designing wider (more neurons or convolutional filters per layer) and deeper (more layers) networks.

Useful links:

- How to Control Neural Network Model Capacity With Nodes and Layers: https://machinelearningmastery.com/how-to-control-neural-network-model-capacity-with-nodes-and-layers/
- TensorFlow 2 quickstart for beginners: https://www.tensorflow.org/tutorials/quickstart/beginner
- Building a Convolutional Neural Network Using TensorFlow – Keras: https://www.analyticsvidhya.com/blog/2021/06/building-a-convolutional-neural-network-using-tensorflow-keras/