<a href="https://colab.research.google.com/github/vischia/pv_data_science_school/blob/master/2d_supervised_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning School, ICNFP 2025 edition
## Exercise 2d: supervised regression

## Pietro Vischia (Universidad de Oviedo and ICTEA), pietro.vischia@cern.ch

In [None]:
runOnColab=False

In [None]:
if runOnColab:
    import os  # needed here because the main imports only come in the next cell
    from google.colab import drive
    drive.mount('/content/drive')
    %cd "/content/drive/MyDrive/"
    if not os.path.isdir("pv_data_science_school"):
        !git clone https://github.com/vischia/pv_data_science_school.git
    %cd pv_data_science_school
#!pwd
#!ls

In [None]:
import os
import torch
import torch.nn as nn  
import torch.optim as optim 
from torch.utils.data import Dataset, DataLoader 
import torch.nn.functional as F 
import torchvision
import torchinfo
from tqdm import tqdm

import sklearn
import sklearn.model_selection
from sklearn.metrics import roc_curve, auc, accuracy_score

import uproot

import pandas as pd

import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 6)
matplotlib.rcParams['axes.labelsize'] = 14
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.backends.mps.is_available():
    device = torch.device("mps")
    torch.set_default_dtype(torch.float32)

print('Using torch version', torch.__version__)


import random
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.use_deterministic_algorithms(True) #Usually overkill

# Import data

We will use simulated events corresponding to three physics processes.

- ttH production
- ttW production
- Drell-Yan production

We will select the multilepton final state, which is a challenging final state with a rich structure and nontrivial background separation.

<img src="figs/2lss.png" alt="ttH multilepton 2lss" style="width:40%;"/>


In [None]:
import uproot
sig = uproot.open('data/signal_blind20.root')['Friends'].arrays(library="pd")
bk1 = uproot.open('data/background_1.root')['Friends'].arrays(library="pd")
bk2 = uproot.open('data/background_2.root')['Friends'].arrays(library="pd")


## Data inspection

We will now apply in one go all the manipulations of the input dataset that we saw yesterday.

First we drop all features that either correspond to unwanted objects (third lepton) or to labels we will need later on for regression.

In [None]:
sig.head()


In [None]:
signal = sig.drop(["Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", "Hreco_evt_tag", "Hreco_HTXS_Higgs_y"], axis=1 )

X = signal.drop(["Hreco_HTXS_Higgs_pt"], axis=1)
y = signal["Hreco_HTXS_Higgs_pt"]


We then split the data into training and test datasets.
Let's also go straight to the downsampling (you can run on the whole training dataset on your own, but for this demonstration we don't need to).

In [None]:
import sklearn
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
print("We have", len(X_train), "training samples and ", len(X_test), "testing samples")

Ntrain=10000
Ntest=2000
X_train = X_train[:Ntrain].copy()  # copy, so that the in-place scaling below does not act on a slice
y_train = y_train[:Ntrain]
X_test = X_test[:Ntest].copy()
y_test = y_test[:Ntest]


from sklearn.preprocessing import StandardScaler

# NOTE: StandardScaler expects a 2D array of shape (n_samples, n_features), hence `.values.reshape(-1,1)` when scaling one column at a time (earlier versions were more permissive).

for column in X_train.columns:
    scaler = StandardScaler().fit(X_train[column].values.reshape(-1,1))
    X_train[column] = scaler.transform(X_train[column].values.reshape(-1,1))
    X_test[column] = scaler.transform(X_test[column].values.reshape(-1,1))
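
To make explicit what the scaling above does, here is a minimal sketch that standardizes one feature by hand with `numpy`. The data are synthetic (drawn from a Gaussian with arbitrary mean and width), and the variable names are hypothetical. The key point is that the mean and standard deviation are computed on the training split only and then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(42)
train_col = rng.normal(loc=50.0, scale=10.0, size=1000)  # synthetic feature, training split
test_col = rng.normal(loc=50.0, scale=10.0, size=200)    # synthetic feature, test split

# Statistics are computed on the training split only...
mean, std = train_col.mean(), train_col.std()  # population std (ddof=0), as StandardScaler uses

# ...and then applied unchanged to both splits
train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std

print(train_scaled.mean(), train_scaled.std())  # 0 and 1 up to rounding, by construction
print(test_scaled.mean(), test_scaled.std())    # only approximately 0 and 1
```

Fitting the scaler on the test split as well would leak information from the test data into the preprocessing.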


For neural networks we will use `pytorch`, a backend designed natively for tensor operations.
I prefer it to tensorflow, because it exposes (i.e. you have to call them explicitly in your code) the optimizer steps and the backpropagation steps.

You could also use the `tensorflow` backend, either directly or through the `keras` frontend.
Saying "I use keras" does not tell you which backend is being used. It used to be either `tensorflow` or `theano`; later `keras` was bundled inside tensorflow as `tf.keras`, and Keras 3 again supports several backends (TensorFlow, JAX, PyTorch), so it is still good to specify.

`torch` handles the data management via the `Dataset` and `DataLoader` classes.
Here we don't need any specific `Dataset` class, because we are not doing sophisticated things, but you may need that in the future.

The `DataLoader` class takes care of providing quick access to the data by sampling batches that are then fed to the network for (mini)batch gradient descent.
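
As a rough illustration of the batching, the sketch below wraps synthetic tensors (not the ttH data) in a `TensorDataset`, a ready-made `Dataset` from `torch.utils.data`, and counts the batches that a `DataLoader` produces for 10000 samples with a batch size of 2048. The number of features (5) and the `toy_` names are arbitrary assumptions.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

n_samples, n_features, batch_size = 10000, 5, 2048  # n_features is arbitrary here

# Synthetic stand-ins for the features and the regression target
X_toy = torch.randn(n_samples, n_features)
y_toy = torch.randn(n_samples)

toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=batch_size, shuffle=True)

# One epoch visits every sample once; the last batch is smaller than the others
batch_sizes = [len(xb) for xb, yb in toy_loader]
n_batches = len(toy_loader)
print(n_batches, batch_sizes)
```

With these numbers each epoch performs only a handful of gradient updates, which is worth keeping in mind when choosing the batch size.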

In [None]:
class MyDataset(Dataset):
    def __init__(self, X, y, device=torch.device("cpu")):
        self.X = torch.Tensor(X.values if isinstance(X, pd.core.frame.DataFrame) else X).to(device)
        self.y = torch.Tensor(y.values).to(device)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        label = self.y[idx]
        datum = self.X[idx]
        
        return datum, label

batch_size=2048 # Minibatch learning


train_dataset = MyDataset(X_train, y_train)
test_dataset = MyDataset(X_test, y_test)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")

print(train_features)

For educational purposes, let's access the data loader via its iterator, and sample a single batch by calling `next` on the iterator.

In [None]:
random_batch_X, random_batch_y = next(iter(train_dataloader))
print(random_batch_X.shape, random_batch_y.shape) 
print(random_batch_X)

Let's build a simple neural network, by inheriting from the `nn.Module` class. **This is crucial, because that class is responsible for providing the automatic differentiation infrastructure for tracking parameters and doing backpropagation.**
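
To see what this tracking means in practice, here is a minimal sketch with a hypothetical one-layer model (`TinyNet` is not part of the exercise). It shows that `nn.Module` collects the parameters of its submodules, and that their gradients are filled only after calling `backward()`.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical toy model, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(3, 1)  # registered automatically as a submodule

    def forward(self, x):
        return self.layer(x)

tiny = TinyNet()
# nn.Module collects the parameters of all submodules: here a (1, 3) weight and a (1,) bias
n_params = sum(p.numel() for p in tiny.parameters())

# Before backpropagation the gradients are not populated
grads_before = [p.grad for p in tiny.parameters()]

loss = tiny(torch.randn(8, 3)).pow(2).mean()
loss.backward()  # autograd fills p.grad for every tracked parameter
grad_shapes = [tuple(p.grad.shape) for p in tiny.parameters()]
print(n_params, grad_shapes)
```

This is also why `model.parameters()` can be handed directly to an optimizer.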

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, ninputs, device=torch.device("cpu")):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            #nn.Dropout(p=0.2)
            nn.Linear(ninputs, 512),
            nn.ReLU(),
            nn.Linear(512,128),
            #nn.BatchNorm1d(128),
            nn.ReLU(),
            #nn.Dropout(p=0.2)
            nn.Linear(128,64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.ReLU()
        )
        self.device=device
        self.linear_relu_stack.to(self.device)

    def forward(self, x):
        # Pass data through the stack of linear and ReLU layers
        y = self.linear_relu_stack(x)
        return y

Let's instantiate the neural network and print some info on it

In [None]:
model = NeuralNetwork(X_train.shape[1])

print(model) # some basic info

print("Now let's see some more detailed info by using the torchinfo package")
torchinfo.summary(model, input_size=(batch_size, X_train.shape[1])) # the input size is (batch size, number of features)

Now let's introduce a crucial concept: `torch` lets you choose on which device to put your data and models, to optimize access at different stages.
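
A minimal sketch of the idea, independent of the dataset: pick a device with a fallback to the CPU, move a tensor there, and check where it lives. The names `dev`, `x_cpu` and `x_dev` are hypothetical.

```python
import torch

# Pick the first available accelerator, falling back to the CPU
if torch.cuda.is_available():
    dev = torch.device("cuda")
elif torch.backends.mps.is_available():
    dev = torch.device("mps")
else:
    dev = torch.device("cpu")

x_cpu = torch.ones(4, 2)   # tensors are created on the CPU by default
x_dev = x_cpu.to(dev)      # .to() gives a tensor on the target device
print(x_cpu.device, x_dev.device)

# Operations require all operands to live on the same device
total = (x_dev * 2).sum().item()  # .item() brings the scalar back to Python
```

The same `.to(device)` call works for models, which is what the `device` argument of the classes in this notebook relies on.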

In [None]:
devicestring = "mps" # "mps" for macos. "cuda" for CUDA gpus, "cpu" for CPUs

device = torch.device("cuda:0" if torch.cuda.is_available() else devicestring)


# Get a batch from the dataloader
random_batch_X, random_batch_y = next(iter(train_dataloader))

print("The original dataloader resides in", random_batch_X.get_device())

# Let's reinstantiate the dataset on the device chosen above (no hardcoded "mps", so this also runs on CUDA GPUs)
train_dataset = MyDataset(X_train, y_train, device=device)
test_dataset = MyDataset(X_test, y_test, device=device)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

random_batch_X, random_batch_y = next(iter(train_dataloader))

print("The new dataloader puts the batches in", random_batch_X.get_device())

# Reinstantiate the model, on the chosen device
model = NeuralNetwork(X_train.shape[1], device)


We have learned how to load the data into the GPU, and how to define and instantiate a model. Now we need to define a training loop.

In `keras`, this is hidden inside the `.fit()` method, which I think is bad because it conceals the actual procedure.

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, scheduler, device):
    size = len(dataloader.dataset)
    losses=[] # Track the loss function
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    #for batch, (X, y) in enumerate(dataloader):
    for (X,y) in tqdm(dataloader):
        # Reset gradients (to avoid their accumulation)
        optimizer.zero_grad()
        # Compute prediction and loss
        yhat = model(X)
        #if (all_equal3(pred.detach().numpy())):
        #    print("All equal!")
        loss = loss_fn(yhat.squeeze(dim=1), y)
        losses.append(loss.item())  # store a plain Python float, as in the test loop
        # Backpropagation
        loss.backward()
        optimizer.step()

    scheduler.step()
    return np.mean(losses)

Now we need to define the loop that is run on the test dataset.

**The test dataset is just used for evaluating the output of the model. No backpropagation is needed, therefore backpropagation must be switched off!!!**
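
A minimal sketch of what switching off backpropagation means, using a hypothetical single linear layer: inside `torch.no_grad()` the outputs are numerically the same, but no computational graph is recorded. Note that this is separate from `model.eval()`, which changes the behaviour of layers such as dropout and batch normalization.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 1)
x = torch.randn(16, 4)

out_train = layer(x)       # graph is recorded: backpropagation is possible
with torch.no_grad():
    out_eval = layer(x)    # no graph is recorded inside this block

print(out_train.requires_grad, out_eval.requires_grad)
```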

In [None]:
def test_loop(dataloader, model, loss_fn, device):
    losses=[] # Track the loss function
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        #for X, y in dataloader:
        for (X,y) in tqdm(dataloader):
            yhat = model(X)
            loss = loss_fn(yhat.squeeze(dim=1), y).item()
            losses.append(loss)
            test_loss += loss
            #correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            
    return np.mean(losses)

We are now ready to train this!
Here we are doing regression, so we will set our loss function to be the mean squared error (MSE).

Torch also lets you use generic (differentiable) functions as loss functions. We will show a few examples.
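
As a small sanity check of this idea, the sketch below writes the MSE as a plain Python function (the name `my_mse` is hypothetical) and compares it with `torch.nn.MSELoss` on three made-up values. Autograd differentiates through the plain function just as it does through the built-in one.

```python
import torch

def my_mse(y_hat, y):  # hypothetical name, same formula as torch.nn.MSELoss()
    return torch.mean((y_hat - y) ** 2)

y_hat = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 0.0, 5.0])

custom = my_mse(y_hat, y)
builtin = torch.nn.MSELoss()(y_hat, y)
custom.backward()  # autograd differentiates through the plain Python function
print(custom.item(), builtin.item(), y_hat.grad)
```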

In [None]:

loss_fn = torch.nn.MSELoss()

class penalized_mse(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, pred, target):
        #return ((pred-target)**2).mean() + 2*((torch.log(pred)-torch.log(target))**2).mean()
        #print(pred.mean(), pred.var(),target.var())
        return ((pred-target)**2).mean()*(torch.abs(pred.var()-target.var()))

#loss_fn=penalized_mse()

#loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
def my_simple_loss(y_hat,y):
    loss = torch.mean( y[:,0]*torch.pow( y_hat - y[:,1], 2))
    #quad=-1,2
    #lin=-2,1
    #sm=-3,0
    return loss
# We would use this loss function in the same way as the other predefined loss functions:
# loss_fn=my_simple_loss


Time to define optimizer and scheduler, number of epochs, and finally to train!
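
Before training, it can help to see how fast the learning rate decays. With an exponential schedule the learning rate is multiplied by `gamma` at every scheduler step, and the training loop in this notebook takes one step per epoch. The sketch below just evaluates that formula in plain Python, assuming the values used in the next cell (initial learning rate 0.001, `gamma=0.9`, 100 epochs).

```python
initial_lr = 0.001
gamma = 0.9
epochs = 100

# Learning rate in effect during epoch n (0-based), given one scheduler step per epoch
lrs = [initial_lr * gamma**n for n in range(epochs)]

print(f"epoch 1: {lrs[0]:.2e}, epoch 10: {lrs[9]:.2e}, epoch 100: {lrs[-1]:.2e}")
```

By the last epochs the learning rate is several orders of magnitude below its starting value, so the late updates are tiny; a `gamma` closer to 1 decays more slowly.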

In [None]:
epochs=100
learningRate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learningRate)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

train_losses=[]
test_losses=[]
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader, model, loss_fn, optimizer, scheduler, device)
    test_loss=test_loop(test_dataloader, model, loss_fn, device)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler.get_last_lr())
print("Done!")


plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.legend(loc="best")

In [None]:
plt.figure()
y_pred = model(torch.tensor(X_test.values, dtype=torch.float32, device=device)).numpy(force=True)[:,0]  # float32 to match the model weights
plt.scatter(y_pred, y_test.values, marker='o', s=1.)
plt.xlabel("Predicted value")
plt.ylabel("True value")
plt.ylim(min(y_test),max(y_test))
plt.xlim(min(y_test),max(y_test))
plt.plot([0,max(y_test)],[0,max(y_test)],linestyle='--',c='black')
plt.show()
plt.close()
plt.figure()
y_pred = model(torch.tensor(X_test.values, dtype=torch.float32, device=device)).numpy(force=True)[:,0]
diff = y_pred-y_test.values
hist,_,_ = plt.hist(diff, bins=100)
fractions = [2.3,15.85,50,84.15,97.7]
percentiles = np.percentile(diff,fractions)
for i in range(len(percentiles)):
    plt.plot([percentiles[i],percentiles[i]],[0,max(hist)*1.1],c='black',linestyle='--')
    plt.text(percentiles[i],max(hist)*1.11,f"{fractions[i]:.0f}%")
plt.xlabel("Predicted value - True value")
plt.show()
plt.close()
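
The percentiles drawn on the histogram above (2.3, 15.85, 50, 84.15, 97.7) appear to be chosen as the fractions of a Gaussian distribution lying below -2σ, -1σ, 0, +1σ and +2σ. A minimal sketch that reproduces these numbers from the Gaussian cumulative distribution function, using only the standard library (the helper `gaussian_cdf` is a hypothetical name):

```python
import math

def gaussian_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigmas = [-2, -1, 0, 1, 2]
quantiles = [100.0 * gaussian_cdf(z) for z in sigmas]
for z, q in zip(sigmas, quantiles):
    print(f"{z:+d} sigma -> {q:.2f}%")
```

If the residuals were Gaussian, the dashed lines would therefore sit at the median and at one and two standard deviations on either side of it.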

In [None]:
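# NOTE: this cell assumes that the target was log-transformed (see the end of the notebook); np.exp maps back to pT
plt.figure()
y_pred = model(torch.tensor(X_test.values, dtype=torch.float32, device=device)).numpy(force=True)[:,0]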
plt.scatter(np.exp(y_pred), np.exp(y_test.values), marker='o', s=1.)
plt.xlabel("Predicted value")
plt.ylabel("True value")
plt.ylim(min(np.exp(y_test)),max(np.exp(y_test)))
plt.xlim(min(np.exp(y_test)),max(np.exp(y_test)))
plt.plot([0,max(np.exp(y_test))],[0,max(np.exp(y_test))],linestyle='--',c='black')
plt.show()
plt.close()
plt.figure()
y_pred = model(torch.tensor(X_test.values, dtype=torch.float32, device=device)).numpy(force=True)[:,0]
diff = np.exp(y_pred)-np.exp(y_test.values)
hist,_,_ = plt.hist(diff, bins=100)
fractions = [2.3,15.85,50,84.15,97.7]
percentiles = np.percentile(diff,fractions)
for i in range(len(percentiles)):
    plt.plot([percentiles[i],percentiles[i]],[0,max(hist)*1.1],c='black',linestyle='--')
    plt.text(percentiles[i],max(hist)*1.11,f"{fractions[i]:.0f}%")
plt.xlabel("Predicted value - True value")
plt.show()
plt.close()

## What is going on!?!??! Why is the loss always NaN (Not a Number)?

This is because the network is not managing to cope with the vast range of values for the output (the pT).

Try reducing the range of values by replacing, in the cell where `y` is defined, the definition of the target with the following transformation:


`y = signal["Hreco_HTXS_Higgs_pt"].apply(np.log)`
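
A minimal sketch of the effect of this transformation, on made-up values rather than the real pT spectrum: the logarithm compresses a target spanning several orders of magnitude into a narrow range, and `np.exp` maps predictions back to the original scale (which is what the plotting cell with `np.exp` does). The logarithm requires strictly positive values.

```python
import numpy as np

# Hypothetical pT-like values spanning several orders of magnitude (arbitrary units)
pt = np.array([0.5, 3.0, 40.0, 250.0, 1200.0])

log_pt = np.log(pt)         # the target the network would be trained on
recovered = np.exp(log_pt)  # inverse transformation, applied to predictions

print("range before:", pt.max() - pt.min())
print("range after: ", log_pt.max() - log_pt.min())
```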

## WHOOPS! The network is not learning anything!!!

What can we do?

- Log outputs
- Lower initial learning rate
- Change to Adam optimizer
- Increase batch size
- Do not apply scaler

- Add dropout layers
- Add batch normalization layers
- Use a different activation function
- Change the number of layers
- Change the number of nodes
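
As a starting point for these experiments, here is one possible variation of the network, not a reference solution. It combines several of the suggestions above (batch normalization, dropout, a different activation function, fewer nodes). Dropping the final `ReLU` so that the output is linear is an additional assumption of this sketch, motivated by the fact that a log-transformed target can be negative. The class name, layer sizes and the 10 input features are all hypothetical.

```python
import torch
import torch.nn as nn

class RegularizedNetwork(nn.Module):  # hypothetical variant, not the reference solution
    def __init__(self, ninputs, p_drop=0.2):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Linear(ninputs, 128),
            nn.BatchNorm1d(128),   # normalizes activations over the batch
            nn.LeakyReLU(),        # a different activation function
            nn.Dropout(p=p_drop),  # randomly zeroes activations during training
            nn.Linear(128, 64),
            nn.LeakyReLU(),
            nn.Linear(64, 1),      # linear output: no activation on the regression target
        )

    def forward(self, x):
        return self.stack(x)

net = RegularizedNetwork(ninputs=10)  # 10 input features is an arbitrary assumption
x = torch.randn(32, 10)

net.train()   # dropout active, batch norm uses batch statistics
out_train = net(x)
net.eval()    # dropout disabled, batch norm uses running statistics
with torch.no_grad():
    out_a, out_b = net(x), net(x)
print(out_train.shape, torch.equal(out_a, out_b))
```

With dropout and batch normalization in the model, the calls to `model.train()` and `model.eval()` in the training and test loops are no longer optional.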


### The end