# Lisbon Machine Learning School
## Exercise 3: data preprocessing, and neural network structure

(C) Pietro Vischia (Universidad de Oviedo and ICTEA), pietro.vischia@cern.ch


## Set up the environment

- If you are running locally, you don't need to run anything

- If you are running on Google Colab, uncomment and run the next cell (remove only the "#", keep the "!"). You can also run it from a local installation, but it will do nothing if you have already installed all dependencies (and it will take some time to tell you that there is nothing to do).

## Load the needed libraries

In [None]:
import os

import torch
import torch.nn as nn  
import torch.optim as optim 
from torch.utils.data import Dataset, DataLoader 
import torch.nn.functional as F 
import torchvision
import torchinfo
from tqdm import tqdm

import sklearn
import sklearn.model_selection
from sklearn.metrics import roc_curve, auc, accuracy_score

import uproot

import pandas as pd

import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 6)
matplotlib.rcParams['axes.labelsize'] = 14
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.backends.mps.is_available():
    device = torch.device("mps")
    torch.set_default_dtype(torch.float32)

print('Using torch version', torch.__version__)


## Load the data

We will use the same data we used for exercise 2, that is, simulated events corresponding to three physics processes:
- ttH production
- ttW production
- Drell-Yan ($pp\to Z/\gamma^*$+jets) production

We will select the multilepton final state, which is a challenging final state with a rich structure and nontrivial background separation.

<img src="figs/2lss.png" alt="ttH multilepton 2lss" style="width:40%"/>

We use the [uproot](https://uproot.readthedocs.io/en/latest/basic.html) library to conveniently read in a [ROOT TNtuple](https://root.cern.ch/doc/master/classTNtuple.html); uproot can automatically convert it to a [pandas dataframe](https://pandas.pydata.org/).

In [None]:
# Download the data only if you haven't done so yet

if not os.path.isfile("data/signal_blind20.root"): 
    !mkdir data; cd data/; wget https://www.hep.uniovi.es/vischia/lisbon_ml_school/lisbon_ml_school_tth.tar.gz; tar xzvf lisbon_ml_school_tth.tar.gz; rm lisbon_ml_school_tth.tar.gz; cd -;


In [None]:
INPUT_FOLDER = './data'

sig = uproot.open(os.path.join(INPUT_FOLDER,'signal_blind20.root'))['Friends'].arrays(library="pd")
bk1 = uproot.open(os.path.join(INPUT_FOLDER,'background_1.root'))['Friends'].arrays(library="pd")
bk2 = uproot.open(os.path.join(INPUT_FOLDER,'background_2.root'))['Friends'].arrays(library="pd")

## Data Inspection

Select the features you want to use for this exercise, and don't forget to remove the unnecessary ones.

Most of the variables are input features, corresponding to detector measurements of the properties of the reconstructed decay products.

There are three special variables, though:

- `Hreco_evt_tag`: this feature has values in $\{0,1\}$, where $1$ flags the event as a signal event, and $0$ flags the event as a background event;
- `Hreco_HTXS_Higgs_pt`: this feature contains the true generated Higgs boson transverse momentum at generator level (used for regression);
- `Hreco_HTXS_Higgs_y`: this feature contains the true generated Higgs boson rapidity (not pseudorapidity) at generator level (used for regression).


### Important

Twenty percent of the events have `-99` in both `Hreco_HTXS_Higgs_pt` and `Hreco_HTXS_Higgs_y`. These are the "unlabelled" events that you will have to send predictions for. You should filter them out for training and testing.
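
A minimal sketch of this filtering step, using a toy dataframe in place of the real one. The toy values, the column `some_feature`, and the names `toy`, `labelled`, and `unlabelled` are illustrative assumptions; only the two target column names and the `-99` sentinel come from the text above.

```python
import pandas as pd

# Toy stand-in for the real dataframe (all values are invented for illustration)
toy = pd.DataFrame({
    "Hreco_HTXS_Higgs_pt": [120.5, -99.0, 45.2, 300.1, -99.0],
    "Hreco_HTXS_Higgs_y": [0.3, -99.0, -1.2, 2.1, -99.0],
    "some_feature": [50.0, 60.0, 35.0, 80.0, 42.0],  # hypothetical input feature
})

# Events with -99 in both targets are unlabelled
is_unlabelled = (toy["Hreco_HTXS_Higgs_pt"] == -99) & (toy["Hreco_HTXS_Higgs_y"] == -99)

labelled = toy[~is_unlabelled]    # use these for training and testing
unlabelled = toy[is_unlabelled]   # predict on these and send the results

print(len(labelled), "labelled events,", len(unlabelled), "unlabelled events")
```

Keeping the unlabelled events in a separate dataframe, rather than discarding them, makes it easy to apply the same preprocessing to them when producing the final predictions.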

## The assignment

- For this data challenge, your target is to simultaneously regress the Higgs transverse momentum `Hreco_HTXS_Higgs_pt` and the rapidity `Hreco_HTXS_Higgs_y`, in `2lss` events. As a reminder, this means filtering the data like so:

In [None]:
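# `data` is assumed to be the dataframe built from the files loaded above.
# Filter data: keep only 2lss events, i.e. those where the third-lepton slot (Lep2) holds the value -99
data = data[data['Hreco_Lep2_pt'] == -99]
# Drop unneeded features: the third-lepton variables, the signal flag, and the regression targets.
# If you still need the targets, store them in a separate variable before this step.
data = data.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass",
                  "Hreco_evt_tag", "Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"], axis=1)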


- The loss function typically used for regression problems is the mean square error: in this case you will have to figure out how to deal with the fact that the output vector has dimension two (transverse momentum, and rapidity).
- A tricky challenge is to deal with output features that have different scales: the rapidity is of $\mathcal{O}(1)$, the transverse momentum is of $\mathcal{O}(100-1000)$
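
One possible way to handle the different scales, sketched here as an assumption rather than as the intended solution, is to standardise each target column before training and to map the predictions back to physical units afterwards. The toy targets and the helper `unscale` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy targets: column 0 mimics the transverse momentum, column 1 the rapidity
targets = np.column_stack([
    rng.uniform(0.0, 1000.0, size=1000),
    rng.normal(0.0, 1.5, size=1000),
])

# Standardise each column with statistics computed on the training targets only
mean = targets.mean(axis=0)
std = targets.std(axis=0)
scaled = (targets - mean) / std


def unscale(pred_scaled):
    """Map predictions made in the scaled space back to physical units."""
    return pred_scaled * std + mean


print("scaled means:", scaled.mean(axis=0))
print("scaled stds: ", scaled.std(axis=0))
```

After this transformation both columns contribute comparably to a mean squared error loss. Remember that a metric quoted in physical units has to be computed after calling `unscale` on the predictions.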

## Regression problems

Regression problems require the prediction to be free to adopt the same range as the target variable(s) that need to be regressed.

This is why the sigmoid activation function is not a good choice. The typical form of output layers of a regression problem is, if `n_outputs` is the dimension of the output vector:


In [None]:
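# Last layers of the model (a fragment, to be placed inside e.g. nn.Sequential)
nn.Linear(32, n_outputs),
nn.ReLU()  # note: ReLU forces outputs >= 0; drop it for targets that can be negative, such as the rapidity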
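
A sketch of a complete small model with a two-dimensional output. This variant omits the final activation so that the outputs can also be negative, as the rapidity can be; that choice, the layer sizes, and the value of `n_inputs` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

n_inputs = 10   # hypothetical number of input features
n_outputs = 2   # transverse momentum and rapidity

model = nn.Sequential(
    nn.Linear(n_inputs, 32),
    nn.ReLU(),
    nn.Linear(32, n_outputs),  # linear output: predictions can take any real value
)

x = torch.randn(4, n_inputs)   # a batch of 4 random events
pred = model(x)
print(pred.shape)              # one row per event, one column per target
```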

The other big change with respect to classification models is that the cross-entropy is not the proper loss function anymore.

The regression problem is essentially a generalization of a linear regression problem, and the typical error estimates from classical statistics apply, each with its pros and cons.

#### Mean Absolute Error (MAE)

$MAE(\hat{y}, y^{*}) = \frac{1}{N} \sum |\hat{y} - y^{*}|$

- Lower values are better.
- It estimates the average error, thus cannot distinguish between one large error and many small errors.

#### Root Mean Squared Error (RMSE)

$RMSE(\hat{y}, y^{*}) = \sqrt{\sum \frac{(\hat{y} - y^{*})^2}{N}}$

- Lower values are better.
- It estimates the spread of the residuals (standard deviation of the unexplained variance)
- It gives large weight to large errors (if you use it as loss function, it will prioritize the reduction of large errors)

#### Mean Absolute Percentage Error (MAPE)

$MAPE(\hat{y}, y^{*}) = \frac{100\%}{N} \sum \Big|\frac{\hat{y} - y^{*}}{y^{*}}\Big|$

#### R-Squared Score

$R^2(\hat{y}, y^{*}) = 1-\frac{ \sum (\hat{y} - y^{*})^2}{  \sum(\bar{y} - y^{*})^2  }$, 

where $\bar{y}$ is the arithmetic mean of the true values, $\bar{y} = \frac{1}{N}\sum_{i=0}^{N-1} y^{*}_i$

- It estimates how well the model explains the variance of the data
- It can be negative (which means that the model fits the data worse than a constant prediction equal to $\bar{y}$)
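
The four metrics above can be written in a few lines of `numpy`. This is a sketch on invented toy arrays, meant only to illustrate the properties listed in the bullets; the function names are hypothetical.

```python
import numpy as np


def mae(pred, true):
    return np.mean(np.abs(pred - true))


def rmse(pred, true):
    return np.sqrt(np.mean((pred - true) ** 2))


def mape(pred, true):
    # Undefined when a true value is exactly zero
    return 100.0 * np.mean(np.abs((pred - true) / true))


def r2(pred, true):
    ss_res = np.sum((pred - true) ** 2)
    ss_tot = np.sum((true.mean() - true) ** 2)
    return 1.0 - ss_res / ss_tot


true = np.array([1.0, 2.0, 3.0, 4.0])
one_large = np.array([1.0, 2.0, 3.0, 8.0])   # a single error of 4
many_small = true + 1.0                      # four errors of 1

for name, pred in [("one large error", one_large), ("many small errors", many_small)]:
    print(name, "| MAE:", mae(pred, true), "| RMSE:", rmse(pred, true),
          "| MAPE:", mape(pred, true), "| R2:", r2(pred, true))
```

Both toy predictions have the same MAE, while the RMSE is twice as large for the single large error. The R-squared score of `one_large` is negative, because its squared residuals exceed the spread of the true values around their mean.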


You can consult online [an overview of the available loss functions in `pytorch`](https://pytorch.org/docs/stable/nn.html#loss-functions).


## A few hints

- Remove useless input features
- Consider the possibility of applying preprocessing to the input features, to the target features, or to both
- Choose the appropriate metric
- Loss functions can be made as complicated as you want by defining your own loss function, e.g.:

In [None]:
class your_own_loss(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, pred, target):
        return ...

loss_fn=your_own_loss()
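
As an illustration of what such a class might contain, here is a sketch of a mean squared error computed per output column and combined with fixed weights. The class name `WeightedMSELoss` and the weight values are hypothetical, and this is only one of many possible choices, not the expected solution.

```python
import torch
import torch.nn as nn


class WeightedMSELoss(nn.Module):
    """MSE computed per output column, then combined with fixed weights."""

    def __init__(self, weights):
        super().__init__()
        # A buffer follows the module when it is moved to another device
        self.register_buffer("weights", torch.as_tensor(weights, dtype=torch.float32))

    def forward(self, pred, target):
        per_output = ((pred - target) ** 2).mean(dim=0)  # shape: (n_outputs,)
        return (self.weights * per_output).sum()


# Hypothetical weights: damp the pt term, which has much larger squared errors
weighted_loss_fn = WeightedMSELoss(weights=[1e-4, 1.0])

pred = torch.tensor([[100.0, 0.5], [200.0, -1.0]])
target = torch.tensor([[110.0, 0.0], [190.0, -1.0]])
loss = weighted_loss_fn(pred, target)
print(loss.item())
```

Weighting the columns is an alternative to rescaling the targets: both aim at preventing the transverse momentum term from dominating the loss.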

## The scoring system

- You will have to define a model with two output nodes: the first one must regress the Higgs boson transverse momentum, the second one must regress the Higgs boson rapidity.
- You can also use any flavour of boosted decision trees you see fit, as long as it is implemented in `torch`.
- Remember to select only `2lss` events (drop events with three leptons)
- You will have to evaluate your model on the unlabelled data, save the predictions to a csv file with commas as separators (format: pt, y), and send the csv file to [lisbon-ml-workshop@cern.ch](mailto:lisbon-ml-workshop@cern.ch).
- If you have filtered the features further, please include in the email the code that creates the `data` dataframe.

We will evaluate the results of the challenge on the unlabelled events, using as performance metric the RMSE.
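
A sketch of how the predictions could be written to a comma-separated file with `pandas`. The toy numbers and the file name are invented, and writing a header row with the column names is an assumption, since the text above does not say whether a header is expected.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy predictions for the unlabelled events: column 0 is pt, column 1 is y
predictions = np.array([[120.3, 0.41], [85.7, -1.20], [310.0, 2.05]])

# Hypothetical output location; replace it with the path you prefer
out_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
pd.DataFrame(predictions, columns=["pt", "y"]).to_csv(out_path, index=False)

# Read the file back to check the format
check = pd.read_csv(out_path)
print(check)
```

If the targets were rescaled for training, convert the predictions back to physical units before saving them, and keep the rows in the same order as the unlabelled events.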
