# Particle identification

In this assignment, we learn how to define and run a classification model for particle identification of neutrino events. Classification is one of the most common tasks in machine learning: the goal is to predict a label for each input example.

## Prerequisites

Let's start with downloading the dataset, as well as loading the needed Python packages and modules:

In [None]:
!wget "https://raw.githubusercontent.com/saulam/neutrinoml/main/modules.py"
!wget "https://raw.githubusercontent.com/saulam/neutrinoml/main/df_pgun_teaching.p"

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pydotplus
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale, PolynomialFeatures
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from modules import *
from IPython.display import Image  

## Dataset

We can now load the dataset:

In [None]:
# read dataframe
df = pd.read_pickle('df_pgun_teaching.p')

We may have a look at the dataset. It consists of 59,578 particle gun events with the following attributes:

- **TruePID**: PDG code for particle identification (PID); 2212 (proton), 13 (muon), 211 (pion).
- **TrueMomentum**: momentum in MeV.
- **NNodes**: number of nodes of the event (3D spatial points).
- **NodeOrder**: order of the nodes within the event.
- **NodePosX**: array with the coordinates of the nodes along the X-axis (in mm).
- **NodePosY**: array with the coordinates of the nodes along the Y-axis (in mm).
- **NodePosZ**: array with the coordinates of the nodes along the Z-axis (in mm).
- **NodeT**: array with the timestamps of the nodes (in ms).
- **Nodededx**: array with energy deposits of the nodes (dE/dx).
- **TrkLen**: length of the track (in mm).
- **TrkEDepo**: total track energy deposition (in arbitrary unit).
- **TrkDir1**: track direction, polar angle (in degrees).
- **TrkDir2**: track direction, azimuth angle (in degrees).


In [None]:
df

And check the correlations of the variables (please notice that the node features are not included since each event has a different length):

In [None]:
df.select_dtypes(include='number').corr() # array-valued node columns are excluded explicitly

The 3D spatial points of the events are usually stored in the form of hits or nodes. We chose the latter for our dataset. A hit corresponds to a cube with real energy deposition (there are usually many hits across the track signature), whilst a node corresponds to a fitted position after performing the track reconstruction.

<div>
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/hit.png" width="400"/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/node.png" width="400"/>
</div>

We may also have a look at the events by plotting the nodes within the detector space. By default, we're looking at the first event (event 0), but we can display more events by playing with the variable `event_number`.

In [None]:
event_number = 0
plot_event(df, event_number)

Regardless of the type of data and the algorithm chosen, it is essential to **preprocess** the data, that is, to convert it into a form the machine-learning algorithm can work with.
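
One preprocessing step used throughout this notebook is standardisation (the `scale` call from `sklearn`). As a minimal sketch of the idea, the snippet below standardises a small made-up array column by column with plain NumPy. The function name `standardize` and the toy values are illustrative assumptions, not part of the dataset or of `modules.py`.

```python
import numpy as np

def standardize(X):
    """Column-wise standardisation: subtract the mean, divide by the standard deviation."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population standard deviation (NumPy default)
    std[std == 0] = 1.0          # constant columns are only centred, avoiding a division by zero
    return (X - mean) / std

# toy values (made up): column 0 is a length in mm, column 1 an energy deposit
X_toy = np.array([[120.0, 30.0],
                  [480.0, 55.0],
                  [900.0, 80.0]])
X_toy_stan = standardize(X_toy)
print(X_toy_stan.mean(axis=0))  # close to 0 for each column
print(X_toy_stan.std(axis=0))   # close to 1 for each column
```

After this step both features live on comparable scales, so neither dominates the fit just because of its units.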

As explained before, the goal is to learn to predict a label **y** from a fixed-size vector of features **X**. However, the input data is in 3D, and every event (track) has a different number of nodes. A simple way to start is therefore to use only two global features: `TrkLen` and `TrkEDepo`. Note that, in order to have a binary classification problem, we encode protons (PDG code 2212) as 0 and muons (13) as 1, and ignore pions.

In [None]:
# keep only protons (2212) and muons (13); pions are dropped from this dataset
df_bin = df[df['TruePID'].isin([2212, 13])]

X = np.zeros(shape=(len(df_bin),2), dtype=np.float32) # array of size (n_selected, 2)
y = np.zeros(shape=(len(df_bin),), dtype=np.float32)  # array of size (n_selected,)

# fill dataset (enumerate gives the row position, independent of the dataframe index)
for row, (_, event) in enumerate(df_bin.iterrows()):
    # the two global track features
    X[row, 0] = event['TrkLen']
    X[row, 1] = event['TrkEDepo']

    # PID label: proton -> 0, muon -> 1
    y[row] = 0 if event['TruePID']==2212 else 1

# standardize the dataset (mean=0, std=1)
X_stan = scale(X)

In order to understand the training data, it's always good to visualise first. A good way of doing it is to create a scatter plot of one feature against the other:

In [None]:
param_names = ['TrkLen', 'TrkEDepo']
y_names = ['proton', 'muon']

plot_params_pid(X, y, param_names, y_names)

Good! It's easy to distinguish by eye two "almost" independent distributions: one for protons and the other for muons.

## Logistic regression

Training a machine-learning algorithm is usually not an easy task. The algorithm learns from some training data until it is ready to make predictions on unseen data. To test how the algorithm performs on new data, the available dataset is divided into two groups (it is sometimes divided into three, adding a validation set, but we keep two here for simplicity):

- Training set: the model learns from this set only. It is usually the largest set.
- Test set: it is used to evaluate the model at the end of the training, only once it is fully trained.

In this example, we keep 60% of the data for training and 40% for testing. It is also recommended to shuffle the examples before splitting to prevent any ordering bias; `train_test_split` shuffles by default.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_stan, y, test_size=0.4, random_state=7) # 60% training, 40% test
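
To make the shuffle-and-split step concrete, here is a minimal NumPy sketch of the same idea. The helper `shuffle_split` and the toy arrays are hypothetical; it is not how `train_test_split` is implemented internally, only an illustration of what it does conceptually.

```python
import numpy as np

def shuffle_split(X, y, test_fraction=0.4, seed=7):
    """Shuffle the events, then hold out `test_fraction` of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # random order of the event indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_toy = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 toy events, 2 features
y_toy = np.array([0, 1] * 5, dtype=np.float32)
X_tr, X_te, y_tr, y_te = shuffle_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)  # (6, 2) (4, 2)
```

Every event ends up in exactly one of the two sets, which is what keeps the test accuracy an honest estimate of the performance on unseen data.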


As shown in theory, despite its name, logistic regression is a binary classification algorithm based on the principles of linear regression.

In logistic regression, the output of the linear prediction $z = mx + b$ is passed to the sigmoid function $\sigma$:

$$
\sigma(z) = \frac{1}{1+ e^{-z}}
$$

In [None]:
plot_sigmoid()

The predicted label $\hat{y}$ is then obtained by thresholding the sigmoid output at 0.5:

$$
\hat{y} =
\begin{cases}
0 & \text{if } \sigma(m x + b) < 0.5 \\
1 & \text{if } \sigma(m x + b) \geq 0.5 \\
\end{cases}
$$

Since the sigmoid function is bounded to the interval $(0,1)$, we can express the output of the logistic regression in probabilistic terms. The probability of belonging to each of the classes is therefore defined as:

$$
P(y|x) =
\begin{cases}
\sigma(m x+b)     & \text{if } y = 1 \\
1 - \sigma(m x+b) & \text{if } y = 0 \\
\end{cases}
$$
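
The two equations above can be sketched in a few lines of plain Python for a single feature $x$. The parameter values `m_toy` and `b_toy` are arbitrary numbers chosen for illustration, not learnt values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(x, m, b):
    """Return (P(y=0|x), P(y=1|x)) for a single feature x."""
    p1 = sigmoid(m * x + b)
    return 1.0 - p1, p1

def predict_label(x, m, b):
    """Threshold the sigmoid output at 0.5."""
    return 1 if sigmoid(m * x + b) >= 0.5 else 0

m_toy, b_toy = 2.0, -1.0  # arbitrary illustrative parameters
for x in (-1.0, 0.5, 2.0):
    p0, p1 = class_probabilities(x, m_toy, b_toy)
    print(x, round(p0, 3), round(p1, 3), predict_label(x, m_toy, b_toy))
```

Note that $\sigma(z) \geq 0.5$ is equivalent to $z \geq 0$, so the decision boundary is the point (or, with more features, the line) where $mx + b = 0$.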

The logistic regression algorithm learns the weights $m$ and the bias $b$ that make these probabilities match the training labels as closely as possible (that is, that maximise the likelihood of the observed labels). Fortunately, we don't have to implement the optimisation ourselves, and we may use the `LogisticRegression` class from `sklearn`:

In [None]:
log_reg = LogisticRegression(random_state=7).fit(X_train, y_train) # run the logistic regression model (random_state=7 for reproducibility)
m, b = log_reg.coef_[0], log_reg.intercept_[0]
print("m0: {}, m1: {}, b: {}".format(m[0], m[1], b))
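
For intuition about what the fit does, the following is a rough sketch of logistic regression trained with plain gradient descent on made-up data. It is not the procedure `sklearn` runs (by default `LogisticRegression` uses a different solver and adds L2 regularisation), so the learnt numbers will not coincide. The function `fit_logistic_gd`, the learning rate, and the toy blobs are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_steps=2000):
    """Minimise the mean cross-entropy (negative log-likelihood) by gradient descent."""
    m = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ m + b)))  # predicted P(y=1|x)
        grad_m = X.T @ (p - y) / len(y)          # gradient w.r.t. the weights
        grad_b = np.mean(p - y)                  # gradient w.r.t. the bias
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# toy, made-up data: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-1.0, 1.0, size=(200, 2)),
                   rng.normal(+1.0, 1.0, size=(200, 2))])
y_toy = np.concatenate([np.zeros(200), np.ones(200)])

m_toy, b_toy = fit_logistic_gd(X_toy, y_toy)
acc = np.mean(((X_toy @ m_toy + b_toy) >= 0) == (y_toy == 1))
print(m_toy, b_toy, acc)
```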

We may now either use the logistic regression model to calculate the predictions on each event, or calculate them analytically from the learnt parameters $m_0$, $m_1$, and $b$. With two features, the linear term becomes a dot product, and the probability of a muon is $\sigma(z)$:

$$
z = \mathbf{x}^T\mathbf{m} + b = 
\begin{pmatrix}
x_{0} & x_{1}
\end{pmatrix}
\begin{pmatrix}
m_{0} \\
m_{1}
\end{pmatrix}+ b
$$

In [None]:
sigmoid = lambda x: 1 / (1 + np.exp(-x))
event_number = 0
prob_alg = log_reg.predict_proba(X_train[event_number].reshape(1,2))[0,1]
prob_ana = sigmoid(np.dot(X_train[event_number].reshape(1,2),m.reshape(2,1))+b)[0,0]
print("Probability from algorithm: {:1.5}, analytical probability {:1.5}".format(prob_alg, prob_ana))
print("Actual label: {}".format(int(y_train[event_number])))

We get the same probability! In the original run it was 0.0031902 < 0.5, so the logistic regression model predicted a proton, which was correct; in general, a probability below 0.5 means proton and 0.5 or above means muon. We can also plot the line the model learnt in order to separate protons and muons.

In [None]:
param_names = ['TrkLen', 'TrkEDepo']
y_names = ['proton', 'muon']
plot_logistic_regression(log_reg, X_test, y_test, param_names, y_names) # plot the logistic regression results

It's also usual to calculate some metrics to evaluate how well our machine-learning method performs on the test set.

In [None]:
y_pred = log_reg.predict(X_test)
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons'])
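
As a sketch of what these metrics measure, the snippet below builds a confusion matrix by hand on made-up labels and derives the overall and per-class accuracy from it. The helper `confusion_counts` is hypothetical. In this sketch rows are true classes and columns are predicted classes; since the cell above passes `y_pred` as the first argument of `confusion_matrix`, its matrix may be oriented the other way round.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """counts[i, j] = number of events with true class i predicted as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[int(t), int(p)] += 1
    return counts

# made-up labels: 0 = proton, 1 = muon
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])

counts = confusion_counts(y_true_toy, y_pred_toy, n_classes=2)
per_class_acc = counts.diagonal() / counts.sum(axis=1)  # correct / all events of that true class
overall_acc = counts.trace() / counts.sum()             # correct / all events
print(counts)
print(per_class_acc, overall_acc)
```

The per-class accuracy computed this way is the same quantity as the "Proton accuracy" and "Muon accuracy" printed above: the fraction of events of a given true class that were labelled correctly.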

Nice! The muon accuracy could be slightly better, though. Let's increase the dimensionality of the problem!

A more robust but straightforward way of making the input data interpretable for the algorithm is to keep the information of only a few nodes of each track. Our preprocessing is illustrated in the following figure (there are many combinations, we are showing just one practical example here):

<div>
<img src="https://raw.githubusercontent.com/saulam/neutrinoml/main/reg.png" width="500"/>
</div>

where we keep the dE/dx of the first 3 and last 5 nodes of each track, along with the 4 global track parameters (`TrkLen`, `TrkEDepo`, `TrkDir1`, `TrkDir2`), building up an array of size 12. For events where the track has fewer than 8 nodes (first 3 + last 5 nodes), we simply fill the empty positions of the array with -1s.

To sum up, with this preprocessing, we end up with an input dataset **X** holding one vector of size 12 per event (an N x 12 matrix, where N is the number of events kept). The values to estimate, **y**, are the labels of each event (proton or muon).
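
The node selection and padding described above can be sketched on a single toy track. The helper `dedx_features` is a hypothetical name and the input values are made up; it only covers the 8 dE/dx slots, not the 4 global parameters.

```python
import numpy as np

def dedx_features(dedx, n_first=3, n_last=5, pad=-1.0):
    """Keep the first `n_first` values and up to `n_last` of the remaining last ones, padding with `pad`."""
    dedx = np.asarray(dedx, dtype=np.float32)
    out = np.full(n_first + n_last, pad, dtype=np.float32)
    k_first = min(len(dedx), n_first)
    out[:k_first] = dedx[:k_first]
    k_last = min(len(dedx) - k_first, n_last)  # nodes left after the first ones
    if k_last > 0:
        out[k_first:k_first + k_last] = dedx[-k_last:]
    return out

short_track = dedx_features([1, 2])             # 2 nodes: mostly padding
mid_track = dedx_features([1, 2, 3, 4, 5])      # 5 nodes: 3 first + 2 last
long_track = dedx_features(np.arange(1, 11))    # 10 nodes: first 3 and last 5
print(short_track)
print(mid_track)
print(long_track)
```

Whatever the track length, the output always has 8 entries, which is what lets us stack all events into one fixed-size matrix.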

In [None]:
# keep only protons (2212) and muons (13); pions are dropped from this dataset
df_bin = df[df['TruePID'].isin([2212, 13])]

X = np.zeros(shape=(len(df_bin),12), dtype=np.float32) # array of size (n_selected, 12)
y = np.zeros(shape=(len(df_bin),), dtype=np.float32)   # array of size (n_selected,)
X.fill(-1) # filled with -1s

# fill dataset (enumerate gives the row position, independent of the dataframe index)
for row, (_, event) in enumerate(df_bin.iterrows()):
    NodeOrder = event['NodeOrder']
    Nodededx = event['Nodededx'][NodeOrder] # dE/dx values reordered with NodeOrder

    # retrieve up to the first 3 nodes
    nfirstnodes = min(Nodededx.shape[0], 3)
    X[row,:nfirstnodes] = Nodededx[:nfirstnodes]

    if Nodededx.shape[0]>nfirstnodes:
        # retrieve up to the last 5 of the remaining nodes
        nlastnodes = min(Nodededx.shape[0]-3, 5)
        X[row,nfirstnodes:nfirstnodes+nlastnodes] = Nodededx[-nlastnodes:]

    # global parameters
    X[row,-4] = event['TrkLen']
    X[row,-3] = event['TrkEDepo']
    X[row,-2] = event['TrkDir1']
    X[row,-1] = event['TrkDir2']

    # PID label: proton -> 0, muon -> 1
    y[row] = 0 if event['TruePID']==2212 else 1

# standardize the dataset (mean=0, std=1)
X_stan = scale(X)

In order to understand the training data, it's always good to visualise first. A good way of doing it could be creating a histogram plot of each of our 12 features:

In [None]:
param_names = ['dE/dx node 1', 'dE/dx node 2', 'dE/dx node 3', 'dE/dx node n-4',\
               'dE/dx node n-3', 'dE/dx node n-2', 'dE/dx node n-1', 'dE/dx node n', 'TrkLen',\
               'TrkEDepo', 'TrkDir1', 'TrkDir2']
y_names = ["proton", "muon"]
plot_parameters(X, y, param_names, y_names, mode="classification")

We split the dataset again into training and test sets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_stan, y, test_size=0.4, random_state=7) # 60% training and 40% test

And run the logistic regression:

In [None]:
log_reg = LogisticRegression(random_state=7, max_iter=1000).fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
conf=confusion_matrix(y_pred, y_test)
print_conf(conf, ['protons', 'muons'])

The results are amazing! However, we have solved a binary classification problem, while our dataset has a third type of particle that we have ignored (pions). Although, in its basic form, logistic regression only applies to binary classification problems, it is easily extended to problems with a number of classes $k>2$, for instance by replacing the sigmoid with a softmax over $k$ outputs.
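
As a hedged illustration of one such extension, the sketch below replaces the sigmoid with a softmax over $k=3$ linear outputs (often called multinomial or softmax regression). The parameter values in `M_toy` and `b_toy` are made up rather than learnt, and the names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum keeps the exponentials stable."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# hypothetical parameters for k=3 classes and 2 features
M_toy = np.array([[ 1.0, -0.5, 0.0],
                  [-1.0,  0.5, 0.2]])   # shape (n_features, k)
b_toy = np.array([0.0, 0.1, -0.1])      # shape (k,)

X_toy = np.array([[ 2.0, -1.0],
                  [-1.0,  2.0]])        # two made-up events
probs = softmax(X_toy @ M_toy + b_toy)  # one probability per class and event
print(probs.sum(axis=1))                # each row sums to 1
print(probs.argmax(axis=1))             # predicted class per event
```

For $k=2$ the softmax reduces to the sigmoid applied to the difference of the two linear outputs, so the binary model above is a special case.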



## Decision trees

A **decision tree** is a tree structure similar to a flowchart where an internal node represents a feature (or attribute), the branch represents a decision rule, and each leaf node represents the outcome. The top node of a **decision tree** is known as the root node. The model learns how to make the partitions based on the value of each feature. It also partitions the tree recursively, which is called *recursive partitioning*.

The **decision tree** is a white-box ML algorithm. It exposes its internal decision-making logic, unlike black-box algorithms such as neural networks. This means that decision trees are **explanatory models**.

It is convenient to use the `DecisionTreeClassifier` from `sklearn`, since it makes the training and testing transparent for the user. We initially configure a decision tree with a depth of 3:

In [None]:
dtree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=7) # create decision tree
dtree = dtree.fit(X_train,y_train) # train decision tree on train set
y_pred = dtree.predict(X_test) # make predictions on test set

And print the performance metrics:

In [None]:
print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
conf=confusion_matrix(y_pred, y_test)
y_names = ['protons', 'muons']
print_conf(conf, y_names)

Not bad, right? Especially if we plot the tree and try to understand how the decisions are made:

In [None]:
param_names = ['dE/dx node 1', 'dE/dx node 2', 'dE/dx node 3', 'dE/dx node n-4',\
               'dE/dx node n-3', 'dE/dx node n-2', 'dE/dx node n-1', 'dE/dx node n', 'TrkLen',\
               'TrkEDepo', 'TrkDir1', 'TrkDir2']
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = param_names,class_names=y_names)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
#graph.write_png('tree.png') # optionally save the tree to disk
Image(graph.create_png())

The learnt tree is so simple and intuitive! We may increase the complexity by playing with the `max_depth` variable of `DecisionTreeClassifier`.

Last but not least, we may regenerate the dataset, but in this case considering the three types of particles (protons, muons, and pions):

In [None]:
X = np.zeros(shape=(len(df),12), dtype=np.float32) # array of size (n_event, 12)
y = np.zeros(shape=(len(df),), dtype=np.float32)   # array of size (n_event,)
X.fill(-1) # filled with -1s

# fill dataset
for event_n, event in df.iterrows():

    NodeOrder = event['NodeOrder']
    Nodededx = event['Nodededx'][NodeOrder]

    # retrieve up to the first 3 nodes
    nfirstnodes = min(Nodededx.shape[0], 3)
    X[event_n,:nfirstnodes] = Nodededx[:nfirstnodes]

    if Nodededx.shape[0]>nfirstnodes:
        # retrieve up to the last 5 nodes
        nlastnodes = min(Nodededx.shape[0]-3, 5)
        X[event_n,nfirstnodes:nfirstnodes+nlastnodes] = Nodededx[-nlastnodes:]

    # global parameters
    X[event_n,-4] = event['TrkLen']
    X[event_n,-3] = event['TrkEDepo']
    X[event_n,-2] = event['TrkDir1']
    X[event_n,-1] = event['TrkDir2']

    # PID label
    pid_label = event['TruePID']
    if pid_label==2212:
      pid_label=0 # protons
    elif pid_label==13: 
      pid_label=1 # muons
    else:
      pid_label=2 # pions
    y[event_n] = pid_label

# standardize the dataset (mean=0, std=1)
X_stan = scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_stan, y, test_size=0.4, random_state=7) # 60% training and 40% test

It is always recommended to plot the histogram of each feature:

In [None]:
param_names = ['dE/dx node 1', 'dE/dx node 2', 'dE/dx node 3', 'dE/dx node n-4',\
               'dE/dx node n-3', 'dE/dx node n-2', 'dE/dx node n-1', 'dE/dx node n', 'TrkLen',\
               'TrkEDepo', 'TrkDir1', 'TrkDir2']
y_names = ["proton", "muon", "pion"]
plot_parameters(X, y, param_names, y_names, mode="classification")

We retrain our decision tree on the new dataset and print the results:

In [None]:
dtree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=7) # create decision tree
dtree = dtree.fit(X_train,y_train) # train decision tree on train set
y_pred = dtree.predict(X_test) # make predictions on test set

print("Overall accuracy: {:2.3}\n".format(accuracy_score(y_test, y_pred)))
print(" - Proton accuracy: {:2.3}".format(accuracy_score(y_test[y_test==0], y_pred[y_test==0])))
print(" - Muon accuracy: {:2.3}".format(accuracy_score(y_test[y_test==1], y_pred[y_test==1])))
print(" - Pion accuracy: {:2.3}\n".format(accuracy_score(y_test[y_test==2], y_pred[y_test==2])))
conf=confusion_matrix(y_pred, y_test)
y_names = ['protons', 'muons', 'pions']
print_conf(conf, y_names)

The results are not excellent. Should the tree be deeper?



## Homework

It's your turn to beat the results above!

You could try to generate a new dataset based on your physics knowledge (i.e., influence the feature selection), squeeze more performance out of logistic regression and decision trees, or try different algorithms:

- Support Vector Machines (SVMs): https://scikit-learn.org/stable/modules/svm.html.
- Naive Bayes: https://scikit-learn.org/stable/modules/naive_bayes.html.
- Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
- Etc.