# Data Science at IGFAE 2024
## Lesson 2

## Pietro Vischia (Universidad de Oviedo and ICTEA), pietro.vischia@cern.ch

In [None]:
# Uncomment and run this if you are running on Colab (remove only the "#", keep the "!").
# You can run it anyway, but it will do nothing if you have already installed all dependencies
# (and it will take some time to tell you it is not gonna do anything)


#from google.colab import drive
#drive.mount('/content/drive')
#%cd "/content/drive/MyDrive/"
#! git clone https://github.com/vischia/data_science_school_igfae2024.git
#%cd machine_learning_tutorial
#!pwd
#!ls
#!pip install livelossplot shap uproot

In [None]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import glob
import os
import re
import math
import socket
import json
import pickle
import gzip
import copy
import array
import numpy as np
import numpy.lib.recfunctions as recfunc
from tqdm import tqdm

from scipy.optimize import newton
from scipy.stats import norm

import uproot

import datetime
from timeit import default_timer as timer

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import roc_curve, auc, accuracy_score
from sklearn.tree import export_graphviz
from sklearn.inspection import permutation_importance
try:
    # See #1137: this allows compatibility for scikit-learn >= 0.24
    from sklearn.utils import safe_indexing
except ImportError:
    from sklearn.utils import _safe_indexing

import pandas as pd


## Neural networks

#### Details on neural networks

Biology teaches us that the brain is constituted of neurons and connections between them: the synapses.
By comparing the brains of various animals, we now think that the larger the number of neurons and, most importantly, of synapses, the more complex the functions that the brain can execute.

Let's learn the inner workings of a very simplified mathematical model of the brain: an artificial neural network.

The first element is the neuron. The simplest model (and one of the first) we have is the [*perceptron*](https://en.wikipedia.org/wiki/Perceptron). The neuron is modelled by a mathematical function that takes some arguments as inputs, combines them linearly, and returns an output value.
We denote as *weights* the coefficients of the linear combination.

However, we want to be able to approximate nonlinear functions, so we need to plug in a degree of nonlinearity inside the neuron, and we want the neuron to fire only when a certain threshold in the output is reached (a certain amount of stimulation).

We modify the output of the neuron by an activation function $f_{act}$: the neuron is activated if the activation function returns a non-zero value. The output of the neuron is defined as:

$$
y_n = f_{act}(\sum_i w_{i,n} x_{i,n})
$$

If the activation function is not linear, we are happy because we have obtained a neuron that gives a nonlinear output and gets activated only if the stimuli it receives are large enough.

An activation function such as the sigmoid becomes larger than 0.5 at $x=0$, so the neuron would always activate at the same value of the weighted sum. We need to include the possibility of shifting the value at which the neuron activates. This is done by introducing a bias. The neuron output, which we wrote above as

$$
y_n = f_{act}(\sum_i w_{i,n} x_{i,n})
$$

(that activates for $\sum(w_i x_i)>0$) will be

$$
y_n = f_{act}(\sum_i w_{i,n} x_{i,n} + w_b)
$$

that activates at a learnable ($w_b$) value. 


You will use in this exercise mostly two activation functions:

- [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), a function $f(x)$ that returns 0 if $x<0$, and $x$ otherwise;
- [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function), a function that rescales any number into a number between 0 and 1.
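
To fix ideas, here is a minimal sketch of a single neuron with a bias and the two activation functions listed above. The input values, weights, and biases are purely illustrative, and the function names are ours, not part of any library.

```python
import math

def relu(z):
    """Return 0 for negative inputs and z otherwise."""
    return max(0.0, z)

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values, not taken from the dataset used later
x = [1.0, 2.0]
w = [0.5, -1.0]

out_no_bias = neuron_output(x, w, bias=0.0, activation=relu)     # weighted sum is -1.5
out_bias = neuron_output(x, w, bias=2.0, activation=relu)        # the bias shifts it to 0.5
out_sigmoid = neuron_output(x, w, bias=1.5, activation=sigmoid)  # weighted sum is 0
print(out_no_bias, out_bias, out_sigmoid)
```

Without the bias the ReLU neuron stays silent for these inputs; with the bias it fires. This is the shift of the activation point described above.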

Here you have a graphical representation of the perceptron, by [https://towardsdatascience.com](https://towardsdatascience.com):

![neuron](https://miro.medium.com/max/1435/1*n6sJ4yZQzwKL9wnF5wnVNg.png "Figure from https://towardsdatascience.com")

Now we have to connect the neurons. The simplest way is to build layers of neurons, and to connect all neurons of consecutive layers. Starting with the inputs, there is a first layer of neurons. Each neuron combines the inputs linearly and passes the result through the activation function to give an output value. The set of outputs of a layer will be the input of the following layer:

![neuralnet](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/800px-Colored_neural_network.svg.png "Figure from wikipedia")

A neural network is characterized by a set of weights assigned to the connections that define the structure of the network. You can see this as a mathematical function with many free parameters (the weights) that takes the inputs and gives an output. The problem of learning is then the problem of finding the values of the free parameters that minimize the difference between the output and the target distribution that we want to learn.


#### The training process

Schematically, the training process consists in:

- for each epoch
   * for each training set data point:
      1. calculate the output of each neuron, starting from the inputs to the output
      2. compare the output of the last neuron with the reference (true) value
      3. propagate the error back towards the inputs, without updating the weights (the error needs to be propagated with respect to the current values of the weights)
      4. update all the weights
      5. save the value of the loss function for each event
   * for each test set data point:
      1. save the value of the loss function for each event
   * aggregate the errors by computing a Mean Squared Error (MSE)
      1. the MSE of the errors in the training dataset events is the average training loss
      2. the MSE of the errors in the test dataset is the average validation loss (you see here I am using validation and test interchangeably)

The idea is that the training will stop when the loss function doesn't improve anymore (it remains stationary at its minimum). If the training loss keeps diminishing and the test loss begins increasing, then we might be starting to learn statistical fluctuations of the training dataset.
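
One possible way of turning this stopping idea into code is a "patience" criterion: stop when the test loss has not improved for a given number of epochs. The function name `should_stop` and the parameter `patience` are assumptions made for this sketch; the training loop used later in this notebook simply runs for a fixed number of epochs.

```python
def should_stop(test_losses, patience=5):
    """Return True if the test loss has not improved for `patience` epochs."""
    if len(test_losses) <= patience:
        # Not enough history yet to make a decision
        return False
    best_before = min(test_losses[:-patience])   # best loss seen earlier
    recent_best = min(test_losses[-patience:])   # best loss in the last epochs
    return recent_best >= best_before

# Illustrative loss histories (one value per epoch)
still_improving = [1.0, 0.8, 0.6, 0.5, 0.4]
getting_worse = [1.0, 0.5, 0.6, 0.7]

print(should_stop(still_improving, patience=2))
print(should_stop(getting_worse, patience=2))
```

In a training loop you would call this once per epoch on the list of test losses, and `break` out of the loop when it returns `True`.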


#### Clarification on the connections between neurons (to fix ideas)

If the network has the following structure: (input layer: two inputs `A` and `B`; first internal (_hidden_) layer: two neurons `1a` and `1b`; second hidden layer: two neurons `2a` and `2b`; output `y`), the list of connections (the weights) is:

- Four weights connecting the inputs to the layer 1:
    - `wA1a` (connects input `A` to neuron `1a`)
    - `wA1b` (connects input `A` to neuron `1b`)
    - `wB1a` (connects input `B` to neuron `1a`)
    - `wB1b` (connects input `B` to neuron `1b`)
- Four weights connecting the neurons of layer 1 to those of layer 2:
    - `W1a2a` (connects neuron `1a` to neuron `2a`)
    - `W1a2b` (connects neuron `1a` to neuron `2b`)
    - `W1b2a` (connects neuron `1b` to neuron `2a`)
    - `W1b2b` (connects neuron `1b` to neuron `2b`)
- Two weights connecting the neurons of layer 2 to the output y:
    - `W2ay` (connects neuron `2a` to output `y`)
    - `W2by` (connects neuron `2b` to output `y`)

####  Backpropagation

To perform backpropagation we need, for each neuron, to propagate back the error of the neurons of the following layer (so you need to go backwards). We use the chain rule.

- Error for a neuron of the output layer:

$$
\epsilon = (y_{true} - \hat{y}) * activation\_derivative(\hat{y})
$$

Here $\hat{y}$ is the output of this output neuron, and $y_{true}$ is the target value (the true label or quantity we want to predict).

- Error for a neuron $m$ of an internal layer $N$:

$$
\epsilon_{m, N} = \sum_{k} (w_{k, N+1} * \epsilon_{k, N+1}) * activation\_derivative(\hat{m})
$$

Here, $\epsilon_{k,N+1}$ is the error of the neuron $k$ of the following layer (layer $N+1$), $w_{k, N+1}$ is the weight of the connection between the neuron $m$ and the neuron $k$ of the next layer, and $\hat{m}$ is the output of neuron $m$


#### Updating the weights

After having backpropagated all the errors, you have to update all the weights using this formula:

$$
w = w + learning\_rate * error * input
$$

Here $error$ is the error calculated via backpropagation, $input$ is the input value that had originally been passed to the neuron, and $learning\_rate$ is a parameter governing how fast we climb down the gradient.
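
Here is a minimal sketch of the backpropagation and update formulas above, for a tiny network with two inputs, one hidden layer of two sigmoid neurons, and one sigmoid output. The network size, the weight values, and the learning rate are all assumptions chosen for illustration; for the sigmoid, the activation derivative can be written in terms of the neuron output as `output * (1 - output)`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(output):
    """Derivative of the sigmoid, written in terms of the neuron output."""
    return output * (1.0 - output)

def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w_row, x))) for w_row in w_hidden]
    y_hat = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, y_hat

def train_step(x, y_true, w_hidden, w_out, learning_rate):
    """One forward pass, one backward pass, and one update of all weights."""
    hidden, y_hat = forward(x, w_hidden, w_out)
    # Error of the output neuron
    eps_out = (y_true - y_hat) * sigmoid_derivative(y_hat)
    # Errors of the hidden neurons, computed with the weights before the update
    eps_hidden = [w_out[m] * eps_out * sigmoid_derivative(hidden[m])
                  for m in range(len(hidden))]
    # Update rule: w = w + learning_rate * error * input
    new_w_out = [w_out[m] + learning_rate * eps_out * hidden[m]
                 for m in range(len(hidden))]
    new_w_hidden = [[w_hidden[m][i] + learning_rate * eps_hidden[m] * x[i]
                     for i in range(len(x))]
                    for m in range(len(hidden))]
    return new_w_hidden, new_w_out

# Illustrative data point and starting weights
x, y_true = [1.0, 2.0], 1.0
w_hidden = [[0.1, 0.3], [-0.2, 0.4]]
w_out = [0.5, -0.5]

y_before = forward(x, w_hidden, w_out)[1]
for step in range(100):
    w_hidden, w_out = train_step(x, y_true, w_hidden, w_out, learning_rate=0.5)
y_after = forward(x, w_hidden, w_out)[1]
print((y_true - y_before) ** 2, (y_true - y_after) ** 2)
```

Note that all errors are computed before any weight is changed, as required in the training procedure described earlier.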

 
#### At the end of each epoch

To check for convergence of the network, a standard practice is to aggregate the errors $\hat{y} - y_{true}$ of all the events at the end of each epoch, in order to reduce the sensitivity to statistical fluctuations in the training sample. The first pillar of statistical wisdom according to Stigler is precisely aggregation.

In analogy with a $\chi^2$ fit, we can for example calculate the $MSE = \frac{1}{N} \sum_{events} (\hat{y}-y_{true})^2$ and plot the MSE as a function of the number of epochs. If the network is improving its predictions, we should see something like this:

![mse](https://cern.ch/vischia/mse_pythonCourse.png "Figure by Pietro Vischia, 2019")
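
The MSE formula above translates directly into a few lines of code. This is a plain-Python sketch with invented numbers; in the rest of the notebook the aggregation is done on the per-batch losses instead.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Illustrative numbers only: the squared errors are 0, 4, and 0
example_mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 4.0, 3.0])
print(example_mse)
```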

#### Diagnostic plots

- 1) MSE as a function of the epoch
- 2) Histogram of $\frac{\hat{y} - y_{true}}{y_{true}}$ 
- 3) Histogram of $\frac{\hat{y} - y_{true}}{y_{true}}$ as a function of $y_{true}$


### Weights initialization
- To initialize the weights at the beginning you can use a Gaussian, a truncated Gaussian (`scipy.stats.truncnorm`), or a uniform distribution in $[0,1]$
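
A minimal sketch of the three initialization options, using `numpy` and `scipy`. The layer shape and the width `sigma` are illustrative assumptions, not recommendations.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(seed=0)
shape = (4, 3)   # hypothetical layer: 4 inputs, 3 neurons
sigma = 0.1      # illustrative width

# Gaussian centered at zero
w_gauss = rng.normal(loc=0.0, scale=sigma, size=shape)
# Truncated Gaussian: truncnorm takes the bounds in units of the standard deviation,
# so here the weights are confined to [-2*sigma, 2*sigma]
w_trunc = truncnorm.rvs(-2, 2, loc=0.0, scale=sigma, size=shape, random_state=rng)
# Uniform in [0, 1)
w_unif = rng.uniform(low=0.0, high=1.0, size=shape)

print(w_gauss.shape, np.abs(w_trunc).max(), w_unif.min(), w_unif.max())
```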


In [None]:
!pip install torch torchinfo
import torchinfo
import torch
torch.manual_seed(0)

import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

import torch.nn.functional as F

## Torch and automatic differentiation

Let's briefly see how `torch` deals with gradients

Let's calculate gradients of a simple equation using autodiff

In [None]:
x0 = torch.tensor(1., requires_grad=True)
x1 = torch.tensor(2., requires_grad=True)
print("x0 = ",x0)
print("x1 = ",x1)

p = 2*x0 + x0*torch.sin(x1) + x1**3
print("p = ",p)

p.backward()
print("dp/dx0 = ",x0.grad)
print("dp/dx1 = ", x1.grad)

The importance of using operations that have been overloaded within the library: implement the same equation, but this time the function `sin` is imported from `math`.

Notice how no error message is thrown, but the gradients are completely different. This is because `torch` is blind to the portion of the equation involving `math.sin`, in the sense that it can no longer propagate the gradient through it.

In [None]:
x0 = torch.tensor(1., requires_grad=True)
x1 = torch.tensor(2., requires_grad=True)
p = 2*x0 + x0*math.sin(x1) + x1**3
print(p)
p.backward()
print(x0.grad, x1.grad)

In [None]:
#compare with analytic solution (credit: Matthias Komm)
x0_grad_analytic = 2+torch.sin(x1)
x1_grad_analytic = x0*torch.cos(x1)+3*x1**2

print("dp/dx0 = ",x0_grad_analytic)
print("dp/dx1 = ",x1_grad_analytic)

Now let's plot some activation functions.

Let's also plot the derivative of each activation function, in two ways: by plotting the explicitly coded derivative function, and by plotting the derivative computed via automatic differentiation, `x.grad`.

In [None]:
x = torch.tensor(np.linspace(-6, 6, 100), requires_grad=True)
y =torch.sigmoid(x)
yprime = lambda x: torch.sigmoid(x)*(1-torch.sigmoid(x))

func, =plt.plot(x.detach().numpy(),y.detach().numpy(), label="sigmoid(x)")
plt.xlabel("x")
Y = torch.sum(y)
Y.backward()
derivative, =plt.plot(x.detach().numpy(),x.grad.detach().numpy(), label="autodiff sigmoid'(x)")
anal_derivative = plt.plot(x.detach().numpy(), yprime(x).detach().numpy(), label="analytical sigmoid'(x)")

plt.legend()
plt.savefig("sigmoid.png")
plt.figure()
x = torch.tensor(np.linspace(-6, 6, 100), requires_grad=True)
y =torch.relu(x)
yprime = lambda x: torch.where(x>0, 1,0) 

func, =plt.plot(x.detach().numpy(),y.detach().numpy(), label="relu(x)")
plt.xlabel("x")
Y = torch.sum(y)
Y.backward()
derivative, =plt.plot(x.detach().numpy(),x.grad.detach().numpy(), label="autodiff relu'(x)")
anal_derivative = plt.plot(x.detach().numpy(), yprime(x).detach().numpy(), label="analytical relu'(x)")
plt.legend()
plt.savefig("relu.png")

# Import data

We will use simulated events corresponding to three physics processes.

- ttH production
- ttW production
- Drell-Yan production

We will select the multilepton final state, which is a challenging final state with a rich structure and nontrivial background separation.

<img src="figs/2lss.png" alt="ttH multilepton 2lss" style="width:40%;"/>


In [None]:
INPUT_FOLDER = './'
HAVE_GPU = True
# Uncomment this if you haven't installed the data yet
#!mkdir data; cd data/; wget https://www.hep.uniovi.es/vischia/cmsdas2024/ft_tth_multilep_igfae2024.tar.gz; tar xzvf ft_tth_multilep_igfae2024.tar.gz; mv igfae2024/* .; rmdir igfae2024; rm ft_tth_multilep_igfae2024.tar.gz; cd -;

In [None]:
import uproot
sig = uproot.open('data/signal.root')['Friends'].arrays(library="pd")
bk1 = uproot.open('data/background_1.root')['Friends'].arrays(library="pd")
bk2 = uproot.open('data/background_2.root')['Friends'].arrays(library="pd")


## Data inspection

We will now apply in one go all the manipulations of the input dataset that we have seen yesterday

First we drop all features that either correspond to unwanted objects (third lepton) or to labels we will need later on for regression.

In [None]:
signal = sig.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag", "Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1 )
bkg1 = bk1.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag","Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1 )
bkg2 = bk2.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag","Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1 )


Now we add the labels for classification...

In [None]:
signal['label'] = 1
bkg = pd.concat([bkg1, bkg2])
bkg['label'] = 0
data = pd.concat([signal,bkg]).sample(frac=1).reset_index(drop=True)
X = data.drop(["label"], axis=1)
y = data["label"]

and we split the data into training and test dataset.
Let's also go straight to the downsampling (you can run on your own on the whole training dataset, but for this demonstration we don't need to do that).

In [None]:
import sklearn
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
print("We have", len(X_train), "training samples and ", len(X_test), "testing samples")

Ntrain=10000
Ntest=2000
X_train = X_train[:Ntrain]
y_train = y_train[:Ntrain]
X_test = X_test[:Ntest]
y_test = y_test[:Ntest]

# LABEL OF THIS PLACE HERE, WILL BE USEFUL LATER

For neural networks we will use `pytorch`, a backend designed natively for tensor operations.
I prefer it to tensorflow, because it exposes (i.e. you have to call them explicitly in your code) the optimizer steps and the backpropagation steps.

You could also use the `tensorflow` backend, either directly or through the `keras` frontend.
Saying "I use keras" does not tell you which backend is being used. It used to be either `tensorflow` or `theano`. Nowadays `keras` is I think almost embedded inside tensorflow, but it is still good to specify.

`torch` handles the data management via the `Dataset` and `DataLoader` classes.
Here we don't need any specific `Dataset` class, because we are not doing sophisticated things, but you may need that in the future.

The `DataLoader` class takes care of providing quick access to the data by sampling batches that are then fed to the network for (mini)batch gradient descent.

In [None]:
class MyDataset(Dataset):
    def __init__(self, X, y, device=torch.device("cpu")):
        self.X = torch.Tensor(X.values if isinstance(X, pd.core.frame.DataFrame) else X).to(device)
        self.y = torch.Tensor(y.values).to(device)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        label = self.y[idx]
        datum = self.X[idx]
        
        return datum, label

batch_size=512 # Minibatch learning


train_dataset = MyDataset(X_train, y_train)
test_dataset = MyDataset(X_test, y_test)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")


For educational purposes, let's access the data loader via its iterator, and sample a single batch by calling `next` on the iterator

In [None]:
random_batch_X, random_batch_y = next(iter(train_dataloader))
print(random_batch_X.shape, random_batch_y.shape) 

Let's build a simple neural network, by inheriting from the `nn.Module` class. **This is crucial, because that class is responsible for providing the automatic differentiation infrastructure for tracking parameters and doing backpropagation**

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, ninputs, device=torch.device("cpu")):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(ninputs, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128,64),
            nn.ReLU(),
            nn.Linear(64,8),
            nn.Sigmoid(),
            nn.Linear(8, 1)
        )
        self.linear_relu_stack.to(device)

    def forward(self, x):
        # Pass data through the stack of linear layers and activations
        x = self.linear_relu_stack(x)
        return x

Let's instantiate the neural network and print some info on it

In [None]:
model = NeuralNetwork(X_train.shape[1])

print(model) # some basic info

print("Now let's see some more detailed info by using the torchinfo package")
torchinfo.summary(model, input_size=(batch_size, X_train.shape[1])) # the input size is (batch size, number of features)

Now let's introduce a crucial concept: `torch` lets you choose on which device to place your data and models, to optimize access at different stages

In [None]:
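# Pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    devicestring = "cuda:0"
elif torch.backends.mps.is_available():
    devicestring = "mps"
else:
    devicestring = "cpu"

device = torch.device(devicestring)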


# Get a batch from the dataloader
random_batch_X, random_batch_y = next(iter(train_dataloader))

print("The original dataloader resides in", random_batch_X.get_device())

# Let's reinstantiate the dataset
train_dataset = MyDataset(X_train, y_train, device=device)
test_dataset = MyDataset(X_test, y_test, device=device)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

random_batch_X, random_batch_y = next(iter(train_dataloader))

print("The new dataloader puts the batches in", random_batch_X.get_device())

# Reinstantiate the model, on the chosen device
model = NeuralNetwork(X_train.shape[1], device)


We have learned how to load the data into the GPU, and how to define and instantiate a model. Now we need to define a training loop.

In `keras`, this is hidden inside the `.fit()` method, which I think is bad because it hides the actual procedure.

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, scheduler, device):
    size = len(dataloader.dataset)
    losses=[] # Track the loss function
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    #for batch, (X, y) in enumerate(dataloader):
    for (X,y) in tqdm(dataloader):
        # Reset gradients (to avoid their accumulation)
        optimizer.zero_grad()
        # Compute prediction and loss
        pred = model(X)
        #if (all_equal3(pred.detach().numpy())):
        #    print("All equal!")
        loss = loss_fn(pred.squeeze(dim=1), y)
        losses.append(loss.detach().cpu())
        # Backpropagation
        loss.backward()
        optimizer.step()

    scheduler.step()
    return np.mean(losses)

Now we need to define the loop that is run on the test dataset.

**The test dataset is just used for evaluating the output of the model. No backpropagation is needed, therefore backpropagation must be switched off!!!**

In [None]:
def test_loop(dataloader, model, loss_fn, device):
    losses=[] # Track the loss function
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        #for X, y in dataloader:
        for (X,y) in tqdm(dataloader):
            pred = model(X)
            loss = loss_fn(pred.squeeze(dim=1), y).item()
            losses.append(loss)
            test_loss += loss
            #correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            
    return np.mean(losses)

We are now ready to train this!
At the moment we are trying to do classification. We will set our loss function to be the cross entropy.

Torch provides the functionality to use generic functions as loss functions. We will show an example of one.

In [None]:

#loss_fn = torch.nn.MSELoss()
loss_fn = torch.nn.CrossEntropyLoss()

#loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
def my_simple_loss(y_hat,y):
    loss = torch.mean( y[:,0]*torch.pow( y_hat - y[:,1], 2))
    #quad=-1,2
    #lin=-2,1
    #sm=-3,0
    return loss
# We would use this loss function in the same way as the other predefined loss functions:
# loss_fn=my_simple_loss
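
As a side note, for a network with a single output and 0/1 labels, an alternative you may want to try is the binary cross entropy. The sketch below uses `torch.nn.BCEWithLogitsLoss`, which applies the sigmoid internally and therefore expects the raw network output; the numbers are invented, and this is not the loss used in the cells above.

```python
import torch

# Hypothetical raw network outputs (logits) and 0/1 labels for four events
logits = torch.tensor([2.0, -1.0, 0.0, 3.0])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

bce = torch.nn.BCEWithLogitsLoss()
loss = bce(logits, labels)

# Same quantity written out by hand: sigmoid followed by binary cross entropy
p = torch.sigmoid(logits)
manual = -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()
print(loss.item(), manual.item())
```

It would be plugged in like the other predefined losses, with `loss_fn = torch.nn.BCEWithLogitsLoss()`.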


Time to define optimizer and scheduler, number of epochs, and finally to train!
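
Before running the training, it helps to see what an exponential scheduler does to the learning rate: `ExponentialLR` multiplies it by `gamma` every time `scheduler.step()` is called, which in our training loop happens once per epoch. The helper `exponential_lr` below is a hypothetical function written only to make this arithmetic explicit.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate in a given epoch (counting from 0) under exponential decay."""
    return initial_lr * gamma ** epoch

# Same initial learning rate and gamma as in the cell below
for epoch in (0, 10, 50, 99):
    print(epoch, exponential_lr(0.01, 0.9, epoch))
```

With `gamma=0.9`, the learning rate drops by several orders of magnitude over 100 epochs, so the last epochs barely move the weights. Keep this in mind when reading the loss curves.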

In [None]:
epochs=100
learningRate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learningRate)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

train_losses=[]
test_losses=[]
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader, model, loss_fn, optimizer, scheduler, device)
    test_loss=test_loop(test_dataloader, model, loss_fn, device)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler.get_last_lr())
print("Done!")


plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.legend(loc="best")

In [None]:
def plot_rocs(scores_labels_names):
    plt.figure()
    for score, label, name  in scores_labels_names:
        fpr, tpr, thresholds = roc_curve(label, score)
        plt.plot(
            fpr, tpr, 
            linewidth=2, 
            label=f"{name} (AUC = {100.*auc(fpr, tpr): .2f} %)"
        )
    plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
    plt.grid()
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic curve")
    plt.legend(loc="lower right")
    plt.show()
    plt.close()

plot_rocs([
    (model(torch.tensor(X_train.to_numpy(), dtype=torch.float32, device=device)).numpy(force=True), y_train, "Train"), 
    (model(torch.tensor(X_test.to_numpy(), dtype=torch.float32, device=device)).numpy(force=True), y_test, "Test")  
])

## WHOOPS! The network is not learning anything!!!

What can we do?

Go back to that cell that had this text: `# LABEL OF THIS PLACE HERE, WILL BE USEFUL LATER`
and add there the following lines:

```
from sklearn.preprocessing import StandardScaler

for column in X_train.columns:
    # StandardScaler expects a 2D input, hence the double brackets
    scaler = StandardScaler().fit(X_train[[column]])
    X_train[column] = scaler.transform(X_train[[column]]).ravel()
    X_test[column] = scaler.transform(X_test[[column]]).ravel()
```

You could also use the basic syntax recommended by the documentation, as follows. Note that `StandardScaler` computes the mean and the standard deviation separately for each column, so both versions should standardize every feature to zero mean and unit variance; if you see a different performance between the two, check what else changed between the runs (you can try ;) ).

```
scaler = StandardScaler().fit(X_train)

X_train[X_train.columns] = scaler.transform(X_train[X_train.columns])
X_test[X_test.columns] = scaler.transform(X_test[X_test.columns])
```

Rerun starting from that cell, and now check the new loss function evolution.
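
To make explicit what the scaler computes for one feature, here is a plain-Python sketch: subtract the mean and divide by the (population) standard deviation, both estimated on the training set only. The numbers and the helper names are invented for illustration.

```python
import statistics

def fit_standardizer(values):
    """Return the mean and (population) standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Shift to zero mean and rescale to unit standard deviation."""
    return [(v - mean) / std for v in values]

# Illustrative numbers: a feature with a large scale, e.g. a momentum in GeV
train_feature = [20.0, 35.0, 50.0, 120.0, 300.0]
test_feature = [25.0, 80.0]

mean, std = fit_standardizer(train_feature)          # fitted on the training set only
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)   # same transformation for the test set
print(train_scaled, test_scaled)
```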

#### Can you explain the effect of these lines on the data, and in turn on the gradient descent?

In [None]:
# Exercises!!!

## Multiclass

Go back to the original dataset, but now assign different labels to the two backgrounds

In [None]:
signal = sig.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag", "Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1 )
bkg1 = bk1.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag","Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1 )
bkg2 = bk2.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag","Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1 )

signal['label'] = 2
bkg1['label'] = 1
bkg2['label'] = 0
bkg = pd.concat([bkg1, bkg2])

data = pd.concat([signal,bkg]).sample(frac=1).reset_index(drop=True)
X = data.drop(["label"], axis=1)
y = data["label"]



Now you need to apply the technique of **one-hot encoding** to convert a categorical label (=0,1,2) into a vector (one dimension per category/class):


| Sample | Categorical | One-hot encoding |
| --- | --- | --- |
| Background 2 | $0$ | $(1,0,0)$ |
| Background 1 | $1$ | $(0,1,0)$ |
| Signal | $2$ | $(0,0,1)$ |


You can use the function `one_hot = torch.nn.functional.one_hot(target)` to one-hot encode the target labels (both for signal and background)

One-hot encoding is described in slide `81` of this morning's lecture.


In [None]:
# Add this where appropriate
for label in [0,1,2]:
    one_hot = torch.nn.functional.one_hot(torch.tensor(label), num_classes=3)
    print (f"Encoding label '{label}' as '{one_hot.numpy(force=True)}'")

Next you need to recreate the test/train split, using the same lines of code you used for the simple classification problem


Next you have to modify the neural network model: the output must be a dimension-three vector rather than a scalar.

You can use `self.output = nn.Linear(8, 3)` as last layer, and you can follow it with a `nn.Softmax` or `nn.Sigmoid` activation function, to ensure the outputs are interpretable as probabilities
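
A possible shape for such a model is sketched below. The class name, the layer sizes, and the toy batch are assumptions made for illustration. In this sketch the model returns raw scores and the softmax is applied outside it, because `nn.CrossEntropyLoss` applies a log-softmax internally and expects raw scores as input.

```python
import torch
import torch.nn as nn

class MulticlassNetwork(nn.Module):
    """Hypothetical three-class variant of the network defined earlier."""
    def __init__(self, ninputs, nclasses=3):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(ninputs, 64),
            nn.ReLU(),
            nn.Linear(64, 8),
            nn.ReLU(),
        )
        self.output = nn.Linear(8, nclasses)

    def forward(self, x):
        # Raw scores (logits), one per class
        return self.output(self.hidden(x))

toy_model = MulticlassNetwork(ninputs=5)
toy_batch = torch.randn(4, 5)            # 4 fake events with 5 features each
logits = toy_model(toy_batch)
probs = torch.softmax(logits, dim=1)     # each row now sums to 1
print(logits.shape, probs.sum(dim=1))
```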



After training, you will need to figure out how to get categorical predictions to be able to test performance (for instance to produce the ROC curve for each pair of classes, or for one class against all the others).

However, the minimal useful thing is to produce the confusion matrix we have seen in this morning's lecture:

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
pred_y = model(torch.tensor(X_test.values, device=device)).numpy(force=True)
pred_class = np.argmax(pred_y,axis=1)
true_class = np.argmax(y_test.numpy(force=True),axis=1)

print (f"{'true class':10s} | {'predicted class':15s} ")
print ('='*30)
for i in range(10):
    print (f"{true_class[i]:<10d} | {pred_class[i]:<15d}")

confusion_mat = confusion_matrix(true_class, pred_class, normalize='true')
                                 
plt.figure()
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=['bkg2', 'bkg1', 'sig'])
disp.plot()
plt.show()
plt.close()

## Exercise

Calculate the predictions for each class, then plot the ROC curve for:

- signal vs bkg1
- signal vs bkg2
- bkg1 vs bkg2

Then, in another plot, plot:

- signal vs all backgrounds


# Regression

Go back to the original dataset, but now we will use the Higgs boson transverse momentum as a target for regression.
Notice how we avoid dropping the `Hreco_HTXS_Higgs_pt` column from the dataset, and we put that one into the target `y`.

The training will be done only on the signal (you want to regress the momentum in the specific process of interest).

We will still use the backgrounds, but only for comparison: once you have the pT regressor, you can apply it to ttH (signal) events, and also separately to background events, to see the shape of the regressed pT where the regressed quantity is not supposed to exist (Drell-Yan events) or where it exists but for another particle (the W boson in ttW events).

In [None]:
signal = sig.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag", "Hreco_HTXS_Higgs_y"], axis=1 )
bkg1 = bk1.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag", "Hreco_HTXS_Higgs_y"], axis=1 )
bkg2 = bk2.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag", "Hreco_HTXS_Higgs_y"], axis=1 )

X = signal.drop(["Hreco_HTXS_Higgs_pt"], axis=1)
y = signal["Hreco_HTXS_Higgs_pt"]

`# MIMIMI HERE SOMETHING THERE WILL BE`

Now, since the target is a regression, we need to tweak two things.

First, the last activation function should not be a `nn.Sigmoid()` anymore (which forces the output to be in the range `[0,1]`). You should now use `nn.ReLU()`.

Second, the cross entropy loss is not appropriate anymore. You should use the `MSELoss()`.

With these two changes, you should be able to regress the Higgs boson transverse momentum.

Plot the loss function, and then produce a scatter plot of the neural network prediction versus the true value of the Higgs transverse momentum (`y` vs `pred=model(x)`). Finally, produce a plot where you show the shape of the pT regressor separately for the signal, for bkg1 (ttW), and for bkg2 (Drell-Yan). For this last plot, you should normalize the three distributions to 1, to check for shape differences (you can use `density=True` when plotting).

## What is going on!?!??! Why is the loss always NotANumber?

This is because the network is not managing to cope with the vast range of values for the output (the pT).

Try reducing the range of values by adding, at the location of `# MIMIMI HERE SOMETHING THERE WILL BE`, the following transformation:


`y = signal["Hreco_HTXS_Higgs_pt"].apply(np.log)`
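
Keep in mind that, once the target is transformed this way, the network predicts the logarithm of the pT, so you need to undo the transformation with an exponential before comparing with the true pT. A minimal sketch of the round trip, with invented pT values:

```python
import math

# Illustrative pT values in GeV, spanning two orders of magnitude
pt_values = [5.0, 50.0, 500.0]

log_targets = [math.log(pt) for pt in pt_values]   # what the network is trained on
recovered = [math.exp(t) for t in log_targets]     # back to pT

# The range of the targets shrinks from a factor 100 to a difference of about 4.6
print(max(log_targets) - min(log_targets), recovered)
```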

## Somehow better, but still suboptimal!

The spread of the predictions is too large to be used. Nonconvex optimization is difficult, and sometimes tweaking the model and training to success is tricky.

A way of hacking this problem is to use a more sophisticated loss function that penalizes predictions whose variance differs from that of the targets. You can try!

```
class penalized_mse(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, pred, target):
        #return ((pred-target)**2).mean() + 2*((torch.log(pred)-torch.log(target))**2).mean()
        print(pred.mean(), pred.var(),target.var())
        return ((pred-target)**2).mean()*(torch.abs(pred.var()-target.var()))

loss_fn=penalized_mse()
```

Also see if modifying the network can help to improve the prediction. For example ...
* add dropout layers
* use a different activation function
* add batch normalization layers
* reduce the number of layers
* reduce the number of nodes
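
As a starting point for the first and third suggestions, here is a sketch of a small stack with batch normalization and dropout. The layer sizes, the dropout rate, and the toy batch are illustrative assumptions. It also shows why calling `model.train()` and `model.eval()` matters once these layers are present.

```python
import torch
import torch.nn as nn

# Hypothetical smaller network; layer sizes and dropout rate are illustrative
regularized_stack = nn.Sequential(
    nn.Linear(10, 32),
    nn.BatchNorm1d(32),   # normalizes each of the 32 activations over the batch
    nn.ReLU(),
    nn.Dropout(p=0.2),    # randomly zeroes 20% of the activations during training
    nn.Linear(32, 1),
)

toy_batch = torch.randn(16, 10)   # 16 fake events with 10 features each

regularized_stack.train()         # dropout active, batch statistics used
out_train = regularized_stack(toy_batch)

regularized_stack.eval()          # dropout off, running statistics used
with torch.no_grad():
    out_eval_1 = regularized_stack(toy_batch)
    out_eval_2 = regularized_stack(toy_batch)

print(out_train.shape, torch.equal(out_eval_1, out_eval_2))
```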

### The end