# Lisbon Machine Learning School
## Exercise 3: data preprocessing, and neural network structure

(C) Pietro Vischia (Universidad de Oviedo and ICTEA), pietro.vischia@cern.ch


## Set up the environment

- If you are running locally, you don't need to run anything

- If you are running on Google Colab, uncomment and run the next cell (remove only the "#", keep the "!"). You can also run it from a local installation, but it will do nothing if you have already installed all dependencies (and it will take some time to tell you that there is nothing to do).

## Load the needed libraries

In [3]:
import os

import torch
import torch.nn as nn  
import torch.optim as optim 
from torch.utils.data import Dataset, DataLoader 
import torch.nn.functional as F 
import torchvision
import torchinfo
from tqdm import tqdm

import sklearn
import sklearn.model_selection
from sklearn.metrics import roc_curve, auc, accuracy_score

import uproot

import pandas as pd

import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 6)
matplotlib.rcParams['axes.labelsize'] = 14
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.backends.mps.is_available():
    device = torch.device("mps")
    torch.set_default_dtype(torch.float32)

print('Using torch version', torch.__version__)


Using torch version 2.6.0


## Load the data

We will use the same data we used for exercise 2, that is, simulated events corresponding to three physics processes:
- ttH production
- ttW production
- Drell-Yan ($pp\to Z/\gamma^*$+jets) production

We will select the multilepton final state, which is a challenging final state with a rich structure and nontrivial background separation.

<img src="figs/2lss.png" alt="ttH multilepton 2lss" style="width:40%"/>

We use the [uproot](https://uproot.readthedocs.io/en/latest/basic.html) library to conveniently read in a [ROOT TNtuple](https://root.cern.ch/doc/master/classTNtuple.html) and automatically convert it to a [pandas dataframe](https://pandas.pydata.org/).

In [5]:
# Download the data only if you haven't done so yet

if not os.path.isfile("data/signal.root"): 
    !mkdir data; cd data/; wget https://www.hep.uniovi.es/vischia/cmsdas2024/ft_tth_multilep_igfae2024.tar.gz; tar xzvf ft_tth_multilep_igfae2024.tar.gz; mv igfae2024/* .; rmdir igfae2024; rm ft_tth_multilep_igfae2024.tar.gz; cd -;


In [7]:
INPUT_FOLDER = './data'

sig = uproot.open(os.path.join(INPUT_FOLDER,'signal.root'))['Friends'].arrays(library="pd")
bk1 = uproot.open(os.path.join(INPUT_FOLDER,'background_1.root'))['Friends'].arrays(library="pd")
bk2 = uproot.open(os.path.join(INPUT_FOLDER,'background_2.root'))['Friends'].arrays(library="pd")

## Data Inspection

Select the features you want to use for this exercise, and don't forget to remove the unnecessary ones.

Most of the variables are input features, corresponding to detector measurements of the properties of the reconstructed decay products.

There are three special variables, though:

- `Hreco_evt_tag`: this feature has values in $\{0,1\}$, where $1$ flags the event as a signal event, and $0$ flags the event as a background event;
- `Hreco_HTXS_Higgs_pt`: this feature contains the true Higgs boson transverse momentum at generator level (used for regression);
- `Hreco_HTXS_Higgs_y`: this feature contains the true generated Higgs boson rapidity (not pseudorapidity) at generator level (used for regression).
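
The separation between input features and special variables described above can be sketched as follows. This is a minimal illustration on a toy dataframe, not the actual content of `signal.root`: the column names `feature_a` and `feature_b` are hypothetical placeholders for whichever detector-level features you decide to keep. The point is that the tag and the two generator-level targets must be dropped from the inputs, otherwise the network would see the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for a dataframe such as `sig`.
# `feature_a` and `feature_b` are hypothetical names, not real branches.
df = pd.DataFrame({
    "feature_a": rng.normal(size=5),
    "feature_b": rng.normal(size=5),
    "Hreco_evt_tag": np.ones(5, dtype=int),
    "Hreco_HTXS_Higgs_pt": rng.uniform(0.0, 500.0, size=5),
    "Hreco_HTXS_Higgs_y": rng.normal(size=5),
})

# The two regression targets, in the order required by the assignment
target_columns = ["Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y"]
# Everything that must not enter the network as an input
special_columns = ["Hreco_evt_tag"] + target_columns

X = df.drop(columns=special_columns)  # input features only
y = df[target_columns]                # targets: (pt, rapidity)

print(list(X.columns))
print(y.shape)
```
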


## The assignment

- For this data challenge, your target is to simultaneously regress the Higgs transverse momentum `Hreco_HTXS_Higgs_pt` and the rapidity `Hreco_HTXS_Higgs_y`
- The loss function typically used for regression problems is the mean square error: in this case you will have to figure out how to deal with the fact that the output vector has dimension two (transverse momentum, and rapidity).
- A tricky challenge is to deal with output features that have different scales: the rapidity is of $\mathcal{O}(1)$, the transverse momentum is of $\mathcal{O}(100-1000)$.
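
One possible way to handle the two-dimensional output and the scale mismatch is to standardize each target column before computing the mean square error. The sketch below is only one option among several (others include weighting the two loss terms or regressing a transformed transverse momentum), and it uses toy random targets rather than the real data. The column order (transverse momentum first, rapidity second) and the zero-valued stand-in predictions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy targets (not real data): column 0 mimics pt, column 1 mimics rapidity
pt = 100.0 + 900.0 * torch.rand(256)
rapidity = torch.randn(256)
targets = torch.stack([pt, rapidity], dim=1)  # shape (256, 2)

# Standardize each column; in practice compute these on the training set only
mean = targets.mean(dim=0, keepdim=True)
std = targets.std(dim=0, keepdim=True)
targets_scaled = (targets - mean) / std

# Stand-in for the network output in the scaled space (hypothetical)
preds_scaled = torch.zeros_like(targets_scaled)

# nn.MSELoss averages over both the batch and the two output components
loss = nn.MSELoss()(preds_scaled, targets_scaled)

# Per-component MSE, useful to check that neither target dominates
per_output = ((preds_scaled - targets_scaled) ** 2).mean(dim=0)

# Undo the scaling to get predictions back in the original units
preds_physical = preds_scaled * std + mean

print(loss.item(), per_output.tolist())
```

After standardization both components contribute to the loss at a comparable level. If the submitted model has to return predictions in the original units, the inverse transformation has to be applied to its output.
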

## The scoring system

- You will have to define a model with two output nodes: the first one must regress the Higgs boson transverse momentum, the second one must regress the Higgs boson rapidity. To test that the model is doing what it should, you can run this small routine:


- You may also use any flavour of boosted decision trees you see fit, provided it is implemented in `torch`.
- You will have to save your full model (see below), and send it to [lisbon-ml-workshop@cern.ch](mailto:lisbon-ml-workshop@cern.ch). The model must be loadable with `torch.load()` and evaluable with `pred = model(...)`.
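
A quick way to check the two-output requirement is to pass a dummy batch through the model and inspect the shape of the result. The sketch below is not the organizers' test routine: the class `TwoOutputRegressor`, its architecture, and the value of `n_features` are hypothetical and only illustrate the expected output layout (column 0 for the transverse momentum, column 1 for the rapidity).

```python
import torch
import torch.nn as nn


class TwoOutputRegressor(nn.Module):
    """Hypothetical minimal network with two regression outputs."""

    def __init__(self, n_features, n_hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 2),  # output 0: pt, output 1: rapidity
        )

    def forward(self, x):
        return self.net(x)


n_features = 10  # assumed number of input features, adapt to your selection
model = TwoOutputRegressor(n_features)

model.eval()
with torch.no_grad():
    batch = torch.randn(4, n_features)  # dummy batch of 4 events
    pred = model(batch)

# One row per event, one column per regressed quantity
pred_pt, pred_y = pred[:, 0], pred[:, 1]
print(pred.shape)
```
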

We will evaluate the results of the challenge on our secret evaluation data set, using as performance metric XXXXXXXXXXXXXX

In [None]:
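torch.save(model, best_model_path) # Save the full model object (not only its state_dict), so that torch.load() returns a model you can call
# With torch >= 2.6, load it back with torch.load(best_model_path, weights_only=False)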
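
Before sending the file, it may be worth checking that it satisfies the requirement stated above: `torch.load()` must return an object that can be called as `pred = model(...)`. Saving only `model.state_dict()` stores the weights as a dictionary, which cannot be called, whereas `torch.save(model, path)` pickles the whole model object. The round trip below is a minimal sketch, assuming a hypothetical stand-in `nn.Sequential` model and a temporary file instead of your own `model` and `best_model_path`. Since torch 2.6, `torch.load` defaults to `weights_only=True`, so loading a full model requires passing `weights_only=False` explicitly.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the trained model (10 inputs, 2 outputs)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_full.pt")
    # Pickle the whole model object, not only its weights
    torch.save(model, model_path)
    # weights_only=False is needed to unpickle a full nn.Module
    loaded = torch.load(model_path, weights_only=False)

loaded.eval()
x = torch.randn(3, 10)
with torch.no_grad():
    pred = loaded(x)       # the loaded object is directly callable
    reference = model(x)   # same weights, so the outputs should match

print(pred.shape, torch.equal(pred, reference))
```

If the model is an instance of a custom class, the class definition must be importable in the environment where the file is loaded, because the pickle stores a reference to the class rather than its source code.
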
