<a href="https://colab.research.google.com/github/vischia/pv_data_science_school/blob/master/2b_supervised_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning School, ICNFP 2025 edition
## Exercise 2b: headaches in training neural networks: classification

## Pietro Vischia (Universidad de Oviedo and ICTEA), pietro.vischia@cern.ch

## Setup the environment


In [1]:
runOnColab=False

In [2]:
if runOnColab:
    import os  # needed here because the main import cell only runs later
    from google.colab import drive
    drive.mount('/content/drive')
    %cd "/content/drive/MyDrive/"
    if not os.path.isdir("pv_data_science_school"):
        !git clone https://github.com/vischia/pv_data_science_school.git
    %cd pv_data_science_school
#!pwd
#!ls

In [3]:
import os
import torch
import torch.nn as nn  
import torch.optim as optim 
from torch.utils.data import Dataset, DataLoader 
import torch.nn.functional as F 
import torchvision
import torchinfo
from tqdm import tqdm

import sklearn
import sklearn.model_selection
from sklearn.metrics import roc_curve, auc, accuracy_score

import uproot

import pandas as pd

import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 6)
matplotlib.rcParams['axes.labelsize'] = 14
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.backends.mps.is_available():
    device = torch.device("mps")
    torch.set_default_dtype(torch.float32)

print('Using torch version', torch.__version__)


import random
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.use_deterministic_algorithms(True) #Usually overkill

Using torch version 2.7.1


We will use simulated events corresponding to three physics processes.
- ttH production
- ttW production
- Drell-Yan ($pp\to Z/\gamma^*$+jets) production

We will select the multilepton final state, which is a challenging final state with a rich structure and nontrivial background separation.

<img src="figs/2lss.png" alt="ttH multilepton 2lss" style="width:40%"/>

We use the [uproot](https://uproot.readthedocs.io/en/latest/basic.html) library to conveniently read in a [ROOT TNtuple](https://root.cern.ch/doc/master/classTNtuple.html); it can automatically convert the data to a [pandas dataframe](https://pandas.pydata.org/).

In [4]:
# This line downloads the data only if you haven't done so yet

if not os.path.isfile("data/signal_blind20.root"): 
    !mkdir data; cd data/; wget https://www.hep.uniovi.es/vischia/lisbon_ml_school/lisbon_ml_school_tth.tar.gz; tar xzvf lisbon_ml_school_tth.tar.gz; rm lisbon_ml_school_tth.tar.gz; cd -;
else:
    print("Data were already downloaded, I am not downloading them again.")

Data were already downloaded, I am not downloading them again.


In [5]:
import uproot

sig = uproot.open('data/signal_blind20.root')['Friends'].arrays(library="pd")
bk1 = uproot.open('data/background_1.root')['Friends'].arrays(library="pd")
bk2 = uproot.open('data/background_2.root')['Friends'].arrays(library="pd")


In [6]:
# Create a new column 'label' and set its value to 1 or 0 for all rows (=events)
sig['label'] = 1 
bk1['label'] = 0
bk2['label'] = 0

bk1=bk1.sample(frac=sig.shape[0]/bk1.shape[0]/2).reset_index(drop=True)
bk2=bk2.sample(frac=sig.shape[0]/bk2.shape[0]/2).reset_index(drop=True)
# Merge the two backgrounds into one dataframe
bkg = pd.concat([bk1, bk2])

print(f"bkg1 shape {bk1.shape}")
print(f"bkg2 shape {bk2.shape}")
print(f"bkg1+bkg2 shape {bkg.shape}")

# Merge the signal and background into one dataframe
print(f" Signal shape {sig.shape}")
print(f" Bkg shape {bkg.shape}")

data = pd.concat([sig,bkg])

print(f" Data shape {data.shape}")
print(data.columns)

# Filter data
data=data[data['Hreco_Lep2_pt']==-99]
# Drop unneeded features
data = data.drop(["index","Hreco_Lep2_pt", "Hreco_Lep2_eta", "Hreco_Lep2_phi", "Hreco_Lep2_mass", 
                  "Hreco_evt_tag","Hreco_HTXS_Higgs_pt", "Hreco_HTXS_Higgs_y", "Hreco_More5_Jets_pt", "Hreco_More5_Jets_eta", "Hreco_More5_Jets_phi", "Hreco_More5_Jets_mass",], axis=1 )


data = data.sample(frac=1).reset_index(drop=True)


X = data.drop(["label"], axis=1)
y = data["label"]

print(f"data shape {data.shape}")
print(f"input feature shape {X.shape}")
print(f"label (=target) shape {y.shape}")

bkg1 shape (149644, 35)
bkg2 shape (149644, 35)
bkg1+bkg2 shape (299288, 35)
 Signal shape (299287, 36)
 Bkg shape (299288, 35)
 Data shape (598575, 36)
Index(['index', 'Hreco_Lep0_pt', 'Hreco_Lep1_pt', 'Hreco_Lep2_pt',
       'Hreco_HadTop_pt', 'Hreco_All5_Jets_pt', 'Hreco_More5_Jets_pt',
       'Hreco_Jets_plus_Lep_pt', 'Hreco_Lep0_eta', 'Hreco_Lep1_eta',
       'Hreco_Lep2_eta', 'Hreco_HadTop_eta', 'Hreco_All5_Jets_eta',
       'Hreco_More5_Jets_eta', 'Hreco_Jets_plus_Lep_eta', 'Hreco_Lep0_phi',
       'Hreco_Lep1_phi', 'Hreco_Lep2_phi', 'Hreco_HadTop_phi',
       'Hreco_All5_Jets_phi', 'Hreco_More5_Jets_phi',
       'Hreco_Jets_plus_Lep_phi', 'Hreco_Lep0_mass', 'Hreco_Lep1_mass',
       'Hreco_Lep2_mass', 'Hreco_HadTop_mass', 'Hreco_All5_Jets_mass',
       'Hreco_More5_Jets_mass', 'Hreco_Jets_plus_Lep_mass', 'Hreco_TopScore',
       'Hreco_met', 'Hreco_met_phi', 'Hreco_HTXS_Higgs_pt',
       'Hreco_HTXS_Higgs_y', 'Hreco_evt_tag', 'label'],
      dtype='object')
data shape (392743, 

## Train a dense neural network


For neural networks we will use `pytorch`, a backend designed natively for tensor operations.
I prefer it to tensorflow, because it exposes (i.e. you have to call them explicitly in your code) the optimizer steps and the backpropagation steps.

You could also use the `tensorflow` backend, either directly or through the `keras` frontend.
Saying "I use keras" does not tell you which backend is being used. It used to be either `tensorflow` or `theano`. Nowadays `keras` ships inside tensorflow as `tf.keras`, and since version 3 it can again run on several backends (`tensorflow`, `jax`, or `torch`), so it is still good to specify.

`torch` handles the data management via the `Dataset` and `DataLoader` classes.
Here we don't need any specific `Dataset` class, because we are not doing sophisticated things, but you may need that in the future.

The `DataLoader` class takes care of providing quick access to the data by sampling batches that are then fed to the network for (mini)batch gradient descent.

We'll also calculate the ratio of background to signal events in the training sample, to rescale the class weights appropriately.

In [None]:
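import sklearn
from sklearn.preprocessing import StandardScaler

X_train_orig, X_test_orig, y_train_orig, y_test_orig = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
print("We have", len(X_train_orig), "training samples and ", len(X_test_orig), "testing samples")

# The scaled features used below are not defined anywhere else in this notebook.
# Assumption: they are standardized copies of the unscaled ones, with the scaler fitted on the training set only.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_orig), columns=X_train_orig.columns, index=X_train_orig.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_orig), columns=X_test_orig.columns, index=X_test_orig.index)
y_train_scaled, y_test_scaled = y_train_orig, y_test_orig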


class MyDataset(Dataset):
    def __init__(self, X, y, device=torch.device("cpu")):
        self.X = torch.Tensor(X.values if isinstance(X, pd.core.frame.DataFrame) else X).to(device)
        self.y = torch.Tensor(y.values).to(device)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        label = self.y[idx]
        datum = self.X[idx]
        
        return datum, label

batch_size=512 # Minibatch learning

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

# Unscaled features
train_dataset_orig = MyDataset(X_train_orig, y_train_orig)
test_dataset_orig = MyDataset(X_test_orig, y_test_orig)

train_dataloader_orig = DataLoader(train_dataset_orig, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)
test_dataloader_orig = DataLoader(test_dataset_orig, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)

# Scaled features
train_dataset_scaled = MyDataset(X_train_scaled, y_train_scaled)
test_dataset_scaled = MyDataset(X_test_scaled, y_test_scaled)

train_dataloader_scaled = DataLoader(train_dataset_scaled, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)
test_dataloader_scaled = DataLoader(test_dataset_scaled, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)

train_features, train_labels = next(iter(train_dataloader_scaled))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")


# Assume y_train is a 1D NumPy array or pandas Series with 0/1 labels
n_pos = np.sum(y_train_orig == 1)
n_neg = np.sum(y_train_orig == 0)

# Calculate positive class weight (negatives / positives)
pos_weight = torch.tensor([n_neg / n_pos], dtype=torch.float32)
print(pos_weight)


Let's build a simple neural network by inheriting from the `nn.Module` class. **This is crucial, because that class is responsible for providing the infrastructure for tracking parameters and performing backpropagation via automatic differentiation.**
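As a minimal sketch of what inheriting from `nn.Module` gives you (the `TinyNet` class and its layer sizes are made up for this illustration and are not part of the exercise): layers assigned as attributes are registered automatically, so `parameters()` finds their weights, and calling `backward()` on a loss fills the `.grad` field of each of them.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical toy model, for illustration only
    def __init__(self):
        super().__init__()
        # Assigning a layer as an attribute registers its parameters with the module
        self.layer = nn.Linear(3, 1)

    def forward(self, x):
        return self.layer(x)

torch.manual_seed(0)
net = TinyNet()

# The module knows about its trainable tensors: a (1, 3) weight and a (1,) bias
n_params = sum(p.numel() for p in net.parameters())
print("Number of trainable parameters:", n_params)

# Before any backward pass the gradients are not populated
grads_before = [p.grad for p in net.parameters()]

# One forward and backward pass on random data fills them
loss = net(torch.randn(5, 3)).pow(2).mean()
loss.backward()
grads_after = [p.grad for p in net.parameters()]
print("Gradient shapes:", [tuple(g.shape) for g in grads_after])
```

The optimizer later receives exactly this list through `model.parameters()`, which is why the class must derive from `nn.Module`.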

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, ninputs, device=torch.device("cpu")):
        super().__init__()
        self.device = device
        
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(ninputs, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128,64),
            nn.ReLU(),
            nn.Linear(64,8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid() # No sigmoid if using BCEWithLogitsLoss
        )
        self.linear_relu_stack.to(device)

    def forward(self, x):
        # Pass the data through the stack of linear layers
        x = self.linear_relu_stack(x)
        return x

Let's instantiate the neural network and print some info about it

In [None]:
model = NeuralNetwork(X_train_orig.shape[1])

print(model) # some basic info

print("Now let's see some more detailed info by using the torchinfo package")
torchinfo.summary(model, input_size=(batch_size, X_train_orig.shape[1])) # the input size is (batch size, number of features)

Now let's introduce a crucial concept: `torch` lets you choose on which device (CPU or GPU) your data and models live, to optimize access at different stages.

For educational purposes, let's do that by accessing the data loader via its iterator and sampling a single batch by calling `next` on the iterator.

In [None]:
device = torch.device("cpu")

if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")
if torch.cuda.is_available() and torch.cuda.device_count()>0:
    device = torch.device("cuda")
    
print ("Available device: ",device)


# Get a batch from the dataloader
random_batch_X, random_batch_y = next(iter(train_dataloader_orig))

print("The original dataloader resides in", random_batch_X.get_device())

# Let's reinstantiate the dataset

# Unscaled features
train_dataset_orig = MyDataset(X_train_orig, y_train_orig, device=device)
test_dataset_orig = MyDataset(X_test_orig, y_test_orig, device=device)

train_dataloader_orig = DataLoader(train_dataset_orig, batch_size=batch_size, shuffle=True)
test_dataloader_orig = DataLoader(test_dataset_orig, batch_size=batch_size, shuffle=True)

# Scaled features
train_dataset_scaled = MyDataset(X_train_scaled, y_train_scaled, device=device)
test_dataset_scaled = MyDataset(X_test_scaled, y_test_scaled, device=device)

train_dataloader_scaled = DataLoader(train_dataset_scaled, batch_size=batch_size, shuffle=True)
test_dataloader_scaled = DataLoader(test_dataset_scaled, batch_size=batch_size, shuffle=True)

random_batch_X, random_batch_y = next(iter(train_dataloader_orig))

print("The new dataloader puts the batches in", random_batch_X.get_device())

# Reinstantiate the model, on the chosen device
model = NeuralNetwork(X_train_orig.shape[1], device)

# Check that the NN can be evaluated on some data; note: it has not been trained yet
print(model(torch.tensor(X_train_orig.values[:10], dtype=torch.float32, device=device)))

We have learned how to load the data onto the GPU, and how to define and instantiate a model. Now we need to define a training loop.

In `keras`, this is hidden inside the `.fit()` method, which I think is bad because it hides the actual procedure.

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, scheduler, best_model_path, device, disable=False):
    size = len(dataloader.dataset)
    losses=[] # Track the loss function
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    #for batch, (X, y) in enumerate(dataloader):
    best_loss = np.inf
    for (X,y) in tqdm(dataloader, disable=disable):
        # Reset gradients (to avoid their accumulation)
        optimizer.zero_grad()
        # Compute prediction and loss
        pred = model(X)
        #if (all_equal3(pred.detach().numpy())):
        #    print("All equal!")
        loss = loss_fn(pred.squeeze(dim=1), y)
        batch_loss = loss.item()
        losses.append(batch_loss)
        if batch_loss < best_loss:
            best_loss = batch_loss
            # Save the model parameters (state_dict) at the lowest batch loss seen so far in this epoch
            torch.save(model.state_dict(), best_model_path)
        # Backpropagation
        loss.backward()
        optimizer.step()

    scheduler.step()
    return np.mean(losses)

Now we need to define the loop that is run on the test dataset.

**The test dataset is just used for evaluating the output of the model. No backpropagation is needed, therefore backpropagation must be switched off!!!**
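A tiny sketch of what `torch.no_grad()` does (toy tensors, unrelated to the dataset): the numerical result is the same, but the output is not attached to the computation graph, so no gradient can flow through it. Note that `model.eval()` is a separate switch: it changes the behaviour of layers such as dropout and batch normalization, and does not by itself disable gradient tracking.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # plays the role of a trainable weight
x = torch.tensor([3.0, 4.0])                      # plays the role of the input data

# Normal mode: the result is attached to the computation graph
y_tracked = (w * x).sum()

# Inside no_grad: same number, but nothing is recorded for backpropagation
with torch.no_grad():
    y_untracked = (w * x).sum()

print(y_tracked.requires_grad, y_untracked.requires_grad)
```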

In [None]:
def test_loop(dataloader, model, loss_fn, device, disable=False):
    losses=[] # Track the loss function
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        #for X, y in dataloader:
        for (X,y) in tqdm(dataloader, disable=disable):
            pred = model(X)
            loss = loss_fn(pred.squeeze(dim=1), y).item()
            losses.append(loss)
            test_loss += loss
            #correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            
    return np.mean(losses)

We are now ready to train this network!
At the moment we are trying to do classification. We will set our loss function to be the binary cross entropy.

Torch also lets you use generic functions as loss functions. We will show an example of one.
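Any Python callable that maps predictions and labels to a scalar tensor, built from differentiable `torch` operations, can be passed as `loss_fn` to loops like the ones above. As a hedged sketch (the function name `my_bce_loss` and the clamping constant `eps` are assumptions made for this illustration), here is the binary cross entropy written by hand and compared with `torch.nn.BCELoss`:

```python
import torch

def my_bce_loss(pred, target, eps=1e-7):
    """Hand-written binary cross entropy, averaged over the batch.

    pred holds probabilities in (0, 1), target holds 0/1 labels.
    eps is an illustrative safeguard against log(0).
    """
    pred = pred.clamp(eps, 1.0 - eps)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred)).mean()

pred = torch.tensor([0.9, 0.2, 0.6, 0.4], requires_grad=True)
target = torch.tensor([1.0, 0.0, 1.0, 0.0])

custom = my_bce_loss(pred, target)
reference = torch.nn.BCELoss()(pred, target)
print(custom.item(), reference.item())

# The custom loss is differentiable, so backpropagation works as usual
custom.backward()
print(pred.grad)
```

Because the function has the same call signature as `torch.nn.BCELoss()`, it could be swapped in wherever `loss_fn` is used.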

Let's also reinstantiate the model, just because later you'll be asked to rerun the cells including the initialization of the model, and if we reinstantiate here you don't have to go too far above.

In [None]:
model = NeuralNetwork(X_train_orig.shape[1], device)

epochs=20
learningRate = 0.001

# The loss defines the metric deciding how good or bad is the prediction of the network
loss_fn = torch.nn.BCELoss() # Use this if the model ends with a sigmoid. Otherwise remove the sigmoid and use torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(device)), for instance to increase the importance of the positive class
# The optimizer decides which path to follow through the gradient of the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=learningRate)
# The scheduler reduces the learning rate so that the optimizer is able to "enter" narrow minima
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)


In [None]:
train_losses=[]
test_losses=[]
best_model_path = "best_dnn_model.h5"
for t in range(epochs):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_orig, model, loss_fn, optimizer, scheduler, best_model_path, device, disable=True)
    test_loss=test_loop(test_dataloader_orig, model, loss_fn, device, disable=True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    if t%5 == 0:
        print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler.get_last_lr())
print("Done!")

In [None]:
plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()

What if we train more? Let's train for 40 more epochs.

Now we have two choices: either we re-instantiate the model, increase the number of epochs, and retrain, or *we keep training from where we left off*!!! Let's try the latter, for another 40 epochs. Note how the loss function at the first step picks up the training from where it stopped before!
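Continuing the training also continues the learning-rate decay, because the same `optimizer` and `scheduler` objects are reused and `scheduler.step()` is called once per epoch in `train_loop`. As a rough sketch in plain Python (the helper `lr_after` is made up for this illustration), the learning rate after $n$ epochs of `ExponentialLR` is `learningRate * gamma**n`:

```python
learning_rate = 0.001  # same value as learningRate above
gamma = 0.9            # same decay factor as passed to ExponentialLR above

def lr_after(n_epochs, lr0=learning_rate, decay=gamma):
    """Learning rate after n_epochs calls to scheduler.step() (illustrative helper)."""
    return lr0 * decay ** n_epochs

for n in (0, 10, 20, 60):
    print(f"after {n:2d} epochs: lr = {lr_after(n):.2e}")
```

With the values used above, the step size at the start of the extra 40 epochs is already about eight times smaller than the initial one, and by the end it is below 2e-6, so the late epochs can only make small adjustments.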

In [None]:
for t in range(40):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_orig, model, loss_fn, optimizer, scheduler, best_model_path, device, True)
    test_loss=test_loop(test_dataloader_orig, model, loss_fn, device, True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    #print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler.get_last_lr())
print("Done!")

In [None]:
plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()

Since we kept appending to the same lists of losses, the plot already shows all the epochs, and you can see that we are now training to convergence.

Before plotting the ROC curves, let's retrain yesterday's adaptive boost BDT, as a benchmark

In [None]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

bdt_learning_rate = 0.1

bdt_ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3, criterion='log_loss'), n_estimators=100, learning_rate=bdt_learning_rate, random_state=42)
fitted_bdt_ada=bdt_ada.fit(X_train_orig, y_train_orig)


We can now plot the ROC curve

In [None]:
def plot_rocs(scores_labels_names):
    plt.figure()
    for score, label, name  in scores_labels_names:
        fpr, tpr, thresholds = roc_curve(label, score)
        plt.plot(
            fpr, tpr, 
            linewidth=2, 
            label=f"{name} (AUC = {100.*auc(fpr, tpr): .2f} %)"
        )
    plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
    plt.grid()
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic curve")
    plt.legend(loc="lower right")
    plt.show()
    plt.close()
with torch.no_grad():
    plot_rocs([
        (fitted_bdt_ada.decision_function(X_test_orig), y_test_orig, 'AdaBoost (test)'),
        (model(torch.tensor(X_train_orig.to_numpy(),device=model.device)).cpu().numpy(), y_train_orig, "Train"), 
        (model(torch.tensor(X_test_orig.to_numpy(),device=model.device)).cpu().numpy(), y_test_orig, "Test")  
        # If using BCEWithLogitsLoss, then you need to apply the sigmoid by hand to get probabilities
        #        (torch.sigmoid(model(torch.tensor(X_train_orig.to_numpy(),device=model.device))).cpu().numpy().ravel(), y_train_orig, "Train"), 
        #        (torch.sigmoid(model(torch.tensor(X_test_orig.to_numpy(),device=model.device))).cpu().numpy().ravel(), y_test_orig, "Test")  
    ])

A 73.96% AUC is not bad, but also not particularly good, and AdaBoost still outperforms the network, in particular at low FPR, which is usually our region of interest.
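Since the low-FPR region is what matters most, a single AUC number can hide the relevant difference between two classifiers. A complementary figure of merit is the signal efficiency (TPR) at a fixed background efficiency (FPR). The sketch below is a NumPy-only illustration on toy scores; `tpr_at_fpr` is a hypothetical helper, not part of `sklearn`, and it assumes 0/1 labels and no tied scores.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr):
    """Largest TPR reachable with FPR <= target_fpr (hypothetical helper).

    Assumes 0/1 labels and no ties among the scores.
    """
    order = np.argsort(-np.asarray(scores))  # sort by decreasing score
    sorted_labels = np.asarray(labels)[order]
    tpr = np.cumsum(sorted_labels == 1) / np.sum(sorted_labels == 1)
    fpr = np.cumsum(sorted_labels == 0) / np.sum(sorted_labels == 0)
    allowed = fpr <= target_fpr
    return tpr[allowed].max() if allowed.any() else 0.0

# Toy example: signal scores are shifted to higher values than background scores
rng = np.random.default_rng(42)
toy_labels = np.concatenate([np.ones(5000), np.zeros(5000)])
toy_scores = np.concatenate([rng.normal(1.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000)])

for target in (0.01, 0.1):
    print(f"TPR at FPR <= {target}: {tpr_at_fpr(toy_scores, toy_labels, target):.3f}")
```

The same function could be applied to the network and AdaBoost scores on the test set to compare them at a chosen working point.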

Exercise:
- The performance may improve if you give more weight to the positive class (to penalize its misclassification). This is done by removing the sigmoid from the model, switching to `BCEWithLogitsLoss` with a custom `pos_weight` (see commented code), and applying a sigmoid at evaluation time when computing the ROC curve (also in the commented code)
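As a starting point for this exercise, the following toy sketch (made-up logits and labels, and an illustrative `pos_weight` of 3; none of these numbers come from the dataset) shows the two ingredients: `BCEWithLogitsLoss` applied to raw logits matches `BCELoss` applied after a sigmoid when no weight is given, and `pos_weight` multiplies only the loss terms of the positive class.

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5, -0.3])  # raw network outputs, no sigmoid
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Without weights the two formulations agree
loss_logits = torch.nn.BCEWithLogitsLoss()(logits, labels)
loss_probs = torch.nn.BCELoss()(torch.sigmoid(logits), labels)

# With pos_weight, only the terms with label 1 are scaled up
toy_pos_weight = torch.tensor([3.0])  # illustrative value
loss_weighted = torch.nn.BCEWithLogitsLoss(pos_weight=toy_pos_weight)(logits, labels)

# Cross-check against the explicit formula
p = torch.sigmoid(logits)
by_hand = -(toy_pos_weight * labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()

print(loss_logits.item(), loss_probs.item())
print(loss_weighted.item(), by_hand.item())

# At evaluation time the sigmoid must be applied explicitly to get probabilities
probabilities = torch.sigmoid(logits)
```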

However, note that we have used the unscaled features. What if we train a model to use the scaled features?

In [None]:
model_scaled = NeuralNetwork(X_train_scaled.shape[1], device)

epochs=20
learningRate = 0.001

# The loss defines the metric deciding how good or bad is the prediction of the network
loss_fn = torch.nn.BCELoss() #BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
# The optimizer decides which path to follow through the gradient of the loss function
optimizer_scaled = torch.optim.SGD(model_scaled.parameters(), lr=learningRate)
# The scheduler reduces the learning rate so that the optimizer is able to "enter" narrow minima
scheduler_scaled = torch.optim.lr_scheduler.ExponentialLR(optimizer_scaled, gamma=0.9)
train_losses=[]
test_losses=[]
best_model_path_scaled = "best_dnn_model_scaled.h5"
for t in range(epochs):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_scaled, model_scaled, loss_fn, optimizer_scaled, scheduler_scaled, best_model_path_scaled, device, disable=True)
    test_loss=test_loop(test_dataloader_scaled, model_scaled, loss_fn, device, disable=True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    if t%5 == 0:
        print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler_scaled.get_last_lr())
print("Done!")


plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()

with torch.no_grad():
    plot_rocs([
        (fitted_bdt_ada.decision_function(X_test_orig), y_test_orig, 'AdaBoost (test)'),
        (model_scaled(torch.tensor(X_train_scaled.to_numpy(),device=model_scaled.device)).cpu().numpy(), y_train_scaled, "Train"), 
        (model_scaled(torch.tensor(X_test_scaled.to_numpy(),device=model_scaled.device)).cpu().numpy(), y_test_scaled, "Test")  
        # If using BCEWithLogitsLoss, then you need to apply the sigmoid by hand to get probabilities
        #(torch.sigmoid(model_scaled(torch.tensor(X_train_scaled.to_numpy(),device=model_scaled.device))).cpu().numpy().ravel(), y_train_scaled, "Train"), 
        #(torch.sigmoid(model_scaled(torch.tensor(X_test_scaled.to_numpy(),device=model_scaled.device))).cpu().numpy().ravel(), y_test_scaled, "Test")  
    ])


Need to train more!

In [None]:
for t in range(40):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_scaled, model_scaled, loss_fn, optimizer_scaled, scheduler_scaled, best_model_path_scaled, device, True)
    test_loss=test_loop(test_dataloader_scaled, model_scaled, loss_fn, device, True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    if t%5 == 0:
        print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler_scaled.get_last_lr())
print("Done!")

In [None]:
plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()
with torch.no_grad():
    plot_rocs([
        (fitted_bdt_ada.decision_function(X_test_orig), y_test_orig, 'AdaBoost (test)'),
        (model_scaled(torch.tensor(X_train_scaled.to_numpy(),device=model_scaled.device)).cpu().numpy(), y_train_scaled, "Train"), 
        (model_scaled(torch.tensor(X_test_scaled.to_numpy(),device=model_scaled.device)).cpu().numpy(), y_test_scaled, "Test")  
        # If using BCEWithLogitsLoss, then you need to apply the sigmoid by hand to get probabilities
        #(torch.sigmoid(model_scaled(torch.tensor(X_train_scaled.to_numpy(),device=model_scaled.device))).cpu().numpy().ravel(), y_train_scaled, "Train"), 
        #(torch.sigmoid(model_scaled(torch.tensor(X_test_scaled.to_numpy(),device=model_scaled.device))).cpu().numpy().ravel(), y_test_scaled, "Test")  
    ])


The AUC went down from 73.96% to 66.20%, and the training is noisy.

The scaling was supposed to have improved the classifier significantly!

What is going on? Maybe it's a matter of which features are used?

Let's check this by computing the permutation importance of the unscaled and scaled trainings:


In [None]:
from sklearn.metrics import roc_auc_score
def compute_permutation_importance(model, X_test: pd.DataFrame, y_test: pd.Series, metric_fn=roc_auc_score, n_repeats=1, device='cpu'):
    """
    Calculate permutation feature importance based on AUC drop.
    """
    # Convert inputs
    X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32).to(device)
    y_test_array = y_test.values

    # Baseline performance
    baseline_preds = model(X_test_tensor).squeeze().cpu().detach().numpy()
    baseline_score = metric_fn(y_test_array, baseline_preds)

    importances = []

    for col in X_test.columns:
        scores = []
        for _ in range(n_repeats):
            X_test_permuted = X_test.copy()
            X_test_permuted[col] = np.random.permutation(X_test_permuted[col].values)
            X_perm_tensor = torch.tensor(X_test_permuted.values, dtype=torch.float32).to(device)
            permuted_preds = model(X_perm_tensor).squeeze().cpu().detach().numpy()
            permuted_score = metric_fn(y_test_array, permuted_preds)
            scores.append(baseline_score - permuted_score)  # importance = drop in score
        importances.append(np.mean(scores))

    # Create importance dataframe
    importance_df = pd.DataFrame({
        'feature': X_test.columns,
        'importance': importances
    }).sort_values(by='importance', ascending=False)

    return importance_df


def plot_permutation_importance(importance_df, top_n=20):
    """
    Plot top_n most important features based on permutation importance.
    """
    df = importance_df.copy()
    df = df.sort_values(by="importance", ascending=False).head(top_n)

    plt.figure(figsize=(5, 6))
    plt.barh(df['feature'], df['importance'])
    plt.xlabel("Importance (AUC Drop)")
    plt.ylabel("Feature")
    plt.title(f"Top {top_n} Feature Importances (Permutation)")
    plt.gca().invert_yaxis()  # Highest importance at the top
    plt.tight_layout()
    plt.show()


importance_df = compute_permutation_importance(model, X_test_orig, y_test_orig, metric_fn=roc_auc_score, n_repeats=10, device=device)
plot_permutation_importance(importance_df)
importance_df_scaled = compute_permutation_importance(model_scaled, X_test_scaled, y_test_scaled, metric_fn=roc_auc_score, n_repeats=10, device=device)
plot_permutation_importance(importance_df_scaled)


Now the situation is a bit clearer! Some of the irrelevant, low-magnitude features that were contributing the least to the unscaled classifier (for instance, the hadronic top eta) are now super important and almost exclusively drive the output of the network!

This can happen, and the interpretation is as follows: scaling brings all features onto the same numerical footing. Features with small magnitudes are mostly ignored by a neural network unless they are so important that the network is forced to learn a very large weight for them. If some features with small original magnitudes were also unimportant for the training (maybe they were injecting just noise, maybe they were very weak), scaling amplifies their influence: the network may now be sensitive to their noise, because that noise is no longer suppressed by the need for the huge weight that was necessary to overcome the small magnitude of the feature in the unscaled version.
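The effect can be illustrated with a toy example (synthetic numbers, not the physics dataset; the two features and their magnitudes are assumptions chosen for illustration): one feature of order 100 and one pure-noise feature of order 0.01. Standardization, written here by hand as subtracting the training mean and dividing by the training standard deviation, puts both on the same scale, so the noise feature is no longer numerically negligible at the network input.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical features: a large-magnitude one and a tiny pure-noise one
large_feature = rng.normal(loc=100.0, scale=30.0, size=n)
tiny_noise = rng.normal(loc=0.0, scale=0.01, size=n)
toy = np.column_stack([large_feature, tiny_noise])

# Split first, then compute the scaling constants on the training part only
toy_train, toy_test = toy[:8000], toy[8000:]
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)
toy_train_std = (toy_train - mean) / std
toy_test_std = (toy_test - mean) / std  # reuse the training constants

print("Spread before scaling:", toy_train.std(axis=0))
print("Spread after scaling: ", toy_train_std.std(axis=0))
```

Before scaling the two spreads differ by more than three orders of magnitude; after scaling both are equal to one, which is exactly what gives the noise feature a chance to influence the network.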

A typical workaround is to use the importance of the unscaled features to prune the least important variables. After that, scaling should actually result in an improved model: let's try to do that!

In [None]:
importance_df = importance_df.sort_values(by="importance", ascending=False)

to_discard = importance_df["feature"].tail(importance_df.shape[0]//3)

X_train_orig = X_train_orig.drop(to_discard, axis=1)
X_test_orig = X_test_orig.drop(to_discard, axis=1)
X_train_scaled = X_train_scaled.drop(to_discard, axis=1)
X_test_scaled = X_test_scaled.drop(to_discard, axis=1)



# Unscaled features
train_dataset_orig = MyDataset(X_train_orig, y_train_orig, device=device)
test_dataset_orig = MyDataset(X_test_orig, y_test_orig, device=device)

train_dataloader_orig = DataLoader(train_dataset_orig, batch_size=batch_size, shuffle=True)
test_dataloader_orig = DataLoader(test_dataset_orig, batch_size=batch_size, shuffle=True)

# Scaled features
train_dataset_scaled = MyDataset(X_train_scaled, y_train_scaled, device=device)
test_dataset_scaled = MyDataset(X_test_scaled, y_test_scaled, device=device)

train_dataloader_scaled = DataLoader(train_dataset_scaled, batch_size=batch_size, shuffle=True)
test_dataloader_scaled = DataLoader(test_dataset_scaled, batch_size=batch_size, shuffle=True)


In [None]:
model = NeuralNetwork(X_train_orig.shape[1], device)
print(model)
print(X_test_orig.shape)
epochs=20
learningRate = 0.001

# The loss defines the metric deciding how good or bad is the prediction of the network
loss_fn = torch.nn.BCELoss() #BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
# The optimizer decides which path to follow through the gradient of the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=learningRate)
# The scheduler reduces the learning rate so that the optimizer is able to "enter" narrow minima
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
train_losses=[]
test_losses=[]
best_model_path = "best_dnn_model.h5"
for t in range(epochs):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_orig, model, loss_fn, optimizer, scheduler, best_model_path, device, disable=True)
    test_loss=test_loop(test_dataloader_orig, model, loss_fn, device, disable=True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    if t%5 == 0:
        print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler.get_last_lr())
print("Done!")


plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()

with torch.no_grad():
    plot_rocs([
        (model(torch.tensor(X_train_orig.to_numpy(),device=model.device)).cpu().numpy(), y_train_orig, "Train"), 
        (model(torch.tensor(X_test_orig.to_numpy(),device=model.device)).cpu().numpy(), y_test_orig, "Test")  
        # If using BCEWithLogitsLoss, then you need to apply the sigmoid by hand to get probabilities
        #(torch.sigmoid(model(torch.tensor(X_train_orig.to_numpy(),device=model.device))).cpu().numpy().ravel(), y_train_orig, "Train"), 
        #(torch.sigmoid(model(torch.tensor(X_test_orig.to_numpy(),device=model.device))).cpu().numpy().ravel(), y_test_orig, "Test")  
    ])


The ROC AUC goes to 75.53 from 73.96: a very small price for having removed the least important input features.

Let's see what happens to the scaled model:
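
One detail worth keeping in mind for the next cell: it trains for 60 epochs instead of 20, with the same `ExponentialLR(gamma=0.9)` scheduler. The sketch below only evaluates the decay formula $\mathrm{lr}_t = \mathrm{lr}_0 \cdot \gamma^t$. It assumes the scheduler is stepped once per epoch; `train_loop` is defined elsewhere in the notebook, so that is an assumption, and `lr_after` is a hypothetical helper introduced for illustration.

```python
# Learning rate under exponential decay: lr_t = lr_0 * gamma**t
# Assumption: one scheduler step per epoch (depends on how train_loop calls scheduler.step()).
lr0 = 0.001
gamma = 0.9


def lr_after(steps, lr0=lr0, gamma=gamma):
    """Learning rate after `steps` scheduler steps."""
    return lr0 * gamma ** steps


for steps in (0, 5, 20, 60):
    print(f"after {steps:2d} steps: lr = {lr_after(steps):.3e}")
```

Under that assumption the learning rate is about 1.2e-4 after 20 steps and about 1.8e-6 after 60, so the later epochs of a long run move the weights very little. If the scheduler is instead stepped once per batch, the decay is much faster.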

In [None]:
model_scaled = NeuralNetwork(X_train_scaled.shape[1], device)
print(model_scaled)
print(X_test_scaled.shape)
epochs=60
learningRate = 0.001

# The loss defines the metric that decides how good or bad the network's prediction is
loss_fn = torch.nn.BCELoss()
# The optimizer decides which path to follow through the gradient of the loss function
optimizer_scaled = torch.optim.SGD(model_scaled.parameters(), lr=learningRate)
# The scheduler gradually reduces the learning rate so that the optimizer can "enter" narrow minima
scheduler_scaled = torch.optim.lr_scheduler.ExponentialLR(optimizer_scaled, gamma=0.9)
train_losses=[]
test_losses=[]
best_model_path_scaled = "best_dnn_model_scaled.h5"
for t in range(epochs):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_scaled, model_scaled, loss_fn, optimizer_scaled, scheduler_scaled, best_model_path_scaled, device, disable=True)
    test_loss=test_loop(test_dataloader_scaled, model_scaled, loss_fn, device, disable=True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    if t%5 == 0:
        print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler_scaled.get_last_lr())
print("Done!")


plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()


plot_rocs([
    (model_scaled(torch.tensor(X_train_scaled.to_numpy(),device=model_scaled.device)).numpy(force=True), y_train_scaled, "Train"), 
    (model_scaled(torch.tensor(X_test_scaled.to_numpy(),device=model_scaled.device)).numpy(force=True), y_test_scaled, "Test")  
])

Now the AUC is 67.39%, and it behaves a bit weirdly. What is likely happening is that the network output is concentrated around a few values, so the AUC jumps around. Let's verify it:
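
To see why a concentrated output makes the ROC coarse, here is a minimal sketch on toy data (not the notebook's events). The helper `roc_points` and the synthetic scores are hypothetical and only meant to illustrate the mechanism: every distinct score value gives one operating point, so an output that clumps onto a few values gives a ROC curve made of a few straight segments.

```python
import numpy as np


def roc_points(scores, labels):
    """Return (fpr, tpr) at every distinct score threshold, highest first."""
    thresholds = np.unique(scores)[::-1]
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    tpr = np.array([np.sum((scores >= t) & (labels == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (labels == 0)) / n_neg for t in thresholds])
    return fpr, tpr


rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=1000)

# Smooth classifier output: almost every event has its own score
smooth = np.clip(0.5 + 0.2 * (labels - 0.5) + 0.2 * rng.normal(size=1000), 0, 1)
# Concentrated output: the same scores snapped onto the values 0, 0.5 and 1
clumped = np.round(smooth * 2) / 2

fpr_s, tpr_s = roc_points(smooth, labels)
fpr_c, tpr_c = roc_points(clumped, labels)
print("distinct operating points (smooth): ", len(fpr_s))
print("distinct operating points (clumped):", len(fpr_c))
```

With only a handful of operating points, moving a few events from one clump to another shifts a whole segment of the curve, which is one way the AUC can change in visible steps.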

In [None]:
plt.figure()

plt.hist(model_scaled(torch.tensor(X_test_scaled[y_test_scaled==0].to_numpy(),device=model_scaled.device)).numpy(force=True), bins=20, alpha=0.7, label="Score for background events")
plt.hist(model_scaled(torch.tensor(X_test_scaled[y_test_scaled==1].to_numpy(),device=model_scaled.device)).numpy(force=True), bins=20, alpha=0.7, label="Score for signal events")
plt.title("With scaled features")
plt.legend(loc="best")  # without this call the histogram labels are never displayed
#plt.yscale("log")
plt.show()
plt.close()

Indeed, we see a multimodal distribution, which is one of the worst cases. Compare with the output for the unscaled features: the network still mistakes some background events for signal events, but the double peak is less pronounced, and there is no signal peak at low classifier values:

In [None]:
plt.figure()

plt.hist(model(torch.tensor(X_test_orig[y_test_orig==0].to_numpy(),device=model.device)).numpy(force=True), bins=20, alpha=0.7, label="Score for background events")
plt.hist(model(torch.tensor(X_test_orig[y_test_orig==1].to_numpy(),device=model.device)).numpy(force=True), bins=20, alpha=0.7, label="Score for signal events")
plt.title("With unscaled features")
plt.legend(loc="best")  # without this call the histogram labels are never displayed
#plt.yscale("log")
plt.show()
plt.close()

Let's also look at the feature importance now:

In [None]:

importance_df = compute_permutation_importance(model, X_test_orig, y_test_orig, metric_fn=roc_auc_score, n_repeats=10, device=device)
plot_permutation_importance(importance_df)
importance_df_scaled = compute_permutation_importance(model_scaled, X_test_scaled, y_test_scaled, metric_fn=roc_auc_score, n_repeats=10, device=device)
plot_permutation_importance(importance_df_scaled)


A-ha! Before scaling, the top variables are mostly masses and transverse momenta. After scaling, the single most important variable seems to be the phi angle of the `Hadronic top` variable, which doesn't make much sense.
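
For reference, this is roughly what a permutation importance measures. The sketch below is a NumPy-only toy and not the notebook's `compute_permutation_importance`: the functions `auc_from_ranks`, `permutation_importance`, and `predict`, as well as the two-column data set, are hypothetical names introduced here. The idea is to shuffle one column at a time and record how much the AUC drops.

```python
import numpy as np


def auc_from_ranks(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def permutation_importance(predict, X, y, n_repeats=10, seed=42):
    """Mean drop in AUC when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = auc_from_ranks(predict(X), y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - auc_from_ranks(predict(X_perm), y)
    return drops / n_repeats


# Toy data: column 0 carries the signal, column 1 is pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
X = np.column_stack([y + rng.normal(size=2000), rng.normal(size=2000)])


def predict(X):
    """Stand-in "model" that relies almost entirely on column 0."""
    return X[:, 0] + 0.01 * X[:, 1]


importance = permutation_importance(predict, X, y)
print("AUC drop per column:", importance)
```

A large drop means the model relies on that column; a drop near zero means it does not. Note that this measures what the trained model uses, not what is physically meaningful, which is why a noisy angle can end up on top.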

In machine learning, you are mostly guided by some loose principles, and most of the work is done by trial and error. Sometimes things that are supposed to help (scaling) are actually detrimental (they amplify noise).
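
One plausible way to picture the "amplified noise" is the following sketch. It uses two hypothetical features (a mass-like variable and an azimuthal angle) drawn from made-up distributions, and a hand-written `min_max_scale` that mimics the default behaviour of `MinMaxScaler`. It is an illustration of the mechanism under those assumptions, not a diagnosis of this particular data set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical features: a mass-like variable (GeV) and an azimuthal angle (rad)
mass = rng.exponential(scale=150.0, size=n)
phi = rng.uniform(-np.pi, np.pi, size=n)
X = np.column_stack([mass, phi])


def min_max_scale(X):
    """Map each column linearly onto [0, 1], as MinMaxScaler does by default."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)


X_scaled = min_max_scale(X)

spread_raw = X.std(axis=0)
spread_scaled = X_scaled.std(axis=0)
ratio_raw = spread_raw[1] / spread_raw[0]
ratio_scaled = spread_scaled[1] / spread_scaled[0]
print("phi/mass spread ratio, raw:   ", ratio_raw)
print("phi/mass spread ratio, scaled:", ratio_scaled)
```

Before scaling, the angle is numerically tiny compared to the mass; after min-max scaling, the long tail of the mass squeezes most events near zero while the flat angle fills the whole interval. A feature with little physical information can therefore end up with the largest numerical spread among the inputs.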

What if we increase the amount of data and the size of the network?

In [None]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

print(f"We have {len(X_train_orig)} training samples with {sum(y_train_orig)} signal and {sum(1-y_train_orig)} background events")
print(f"We have {len(X_test_orig)} testing samples with {sum(y_test_orig)} signal and {sum(1-y_test_orig)} background events")
print(f"Input shape is {X.shape}, target shape is {y.shape}.")

X_train_scaled = X_train_orig.copy(deep=True)
X_test_scaled = X_test_orig.copy(deep=True)

y_train_scaled = y_train_orig.copy(deep=True) # These won't change anyway, but to keep consistent names let's clone them too
y_test_scaled = y_test_orig.copy(deep=True) # These won't change anyway, but to keep consistent names let's clone them too


from sklearn.preprocessing import (
    MaxAbsScaler, # maxAbs
    MinMaxScaler, # MinMax
    Normalizer, # Normalization (equal integral)
    StandardScaler# standard scaling
)
from sklearn.decomposition import PCA

# Scale the input features and the target variable

scaler = MinMaxScaler().fit(X_train_scaled)
X_train_scaled[X_train_orig.columns] = scaler.transform(X_train_scaled[X_train_orig.columns])
X_test_scaled[X_train_orig.columns] = scaler.transform(X_test_scaled[X_train_orig.columns])
#for column in X_train_scaled.columns:
#    scaler = StandardScaler().fit(X_train_scaled.filter([column], axis=1))
#    X_train_scaled[column] = scaler.transform(X_train_scaled.filter([column], axis=1))
#    X_test_scaled[column] = scaler.transform(X_test_scaled.filter([column], axis=1))

# Unscaled features
train_dataset_orig = MyDataset(X_train_orig, y_train_orig, device=device)
test_dataset_orig = MyDataset(X_test_orig, y_test_orig, device=device)

train_dataloader_orig = DataLoader(train_dataset_orig, batch_size=batch_size, shuffle=True)
test_dataloader_orig = DataLoader(test_dataset_orig, batch_size=batch_size, shuffle=True)

# Scaled features
train_dataset_scaled = MyDataset(X_train_scaled, y_train_scaled, device=device)
test_dataset_scaled = MyDataset(X_test_scaled, y_test_scaled, device=device)

train_dataloader_scaled = DataLoader(train_dataset_scaled, batch_size=batch_size, shuffle=True)
test_dataloader_scaled = DataLoader(test_dataset_scaled, batch_size=batch_size, shuffle=True)

# Now we don't subsample

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, ninputs, device=torch.device("cpu")):
        super().__init__()
        self.device = device
        
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(ninputs, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128,128),
            nn.ReLU(),
            nn.Linear(128,64),
            nn.ReLU(),
            nn.Linear(64,8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid() # No sigmoid if using BCEWithLogitsLoss
        )
        self.linear_relu_stack.to(device)

    def forward(self, x):
        # Pass data through the stack of linear layers
        x = self.linear_relu_stack(x)
        return x

In [None]:
model = NeuralNetwork(X_train_orig.shape[1], device)
print(model)
print(X_test_orig.shape)
epochs=20
learningRate = 0.001

# The loss defines the metric that decides how good or bad the network's prediction is
loss_fn = torch.nn.BCELoss() #BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
# The optimizer decides which path to follow through the gradient of the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=learningRate)
# The scheduler gradually reduces the learning rate so that the optimizer can "enter" narrow minima
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
train_losses=[]
test_losses=[]
best_model_path = "best_dnn_model.h5"
for t in range(epochs):
    if t%5 == 0:
        print(f"Epoch {t+1}\n-------------------------------")
    train_loss=train_loop(train_dataloader_orig, model, loss_fn, optimizer, scheduler, best_model_path, device, disable=True)
    test_loss=test_loop(test_dataloader_orig, model, loss_fn, device, disable=True)
    train_losses.append(train_loss)
    test_losses.append(test_loss)
    if t%5 == 0:
        print("Avg train loss", train_loss, ", Avg test loss", test_loss, "Current learning rate", scheduler.get_last_lr())
print("Done!")


plt.figure()
plt.plot(train_losses, label="Average training loss")
plt.plot(test_losses, label="Average test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
plt.close()

with torch.no_grad():
    plot_rocs([
        (model(torch.tensor(X_train_orig.to_numpy(),device=model.device)).cpu().numpy(), y_train_orig, "Train"), 
        (model(torch.tensor(X_test_orig.to_numpy(),device=model.device)).cpu().numpy(), y_test_orig, "Test")  
        # If using BCEWithLogitsLoss, then you need to apply the sigmoid by hand to get probabilities
        #(torch.sigmoid(model(torch.tensor(X_train_orig.to_numpy(),device=model.device))).cpu().numpy().ravel(), y_train_orig, "Train"), 
        #(torch.sigmoid(model(torch.tensor(X_test_orig.to_numpy(),device=model.device))).cpu().numpy().ravel(), y_test_orig, "Test")  

    ])

plt.figure()

plt.hist(model(torch.tensor(X_test_orig[y_test_orig==0].to_numpy(),device=model.device)).numpy(force=True), bins=20, alpha=0.7, label="Score for background events")
plt.hist(model(torch.tensor(X_test_orig[y_test_orig==1].to_numpy(),device=model.device)).numpy(force=True), bins=20, alpha=0.7, label="Score for signal events")
plt.title("With unscaled features")
plt.legend(loc="best")  # without this call the histogram labels are never displayed
#plt.yscale("log")
plt.show()
plt.close()
    

Wow! We gained a full 2%, reaching an AUC of 72.03%!

Let's look at the amount of training data and the amount of trainable parameters

In [None]:
print("Training data size: ", X_train_orig.shape[0])
torchinfo.summary(model, input_size=(batch_size, X_train_orig.shape[1])) # the input size is (batch size, number of features)

We have 201809 parameters for 263137 data points.
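
The parameter count can be cross-checked by hand: each `nn.Linear(n_in, n_out)` layer has `n_in * n_out` weights plus `n_out` biases. The sketch below assumes 23 input features, a number inferred from the quoted total rather than stated in the text, and `count_dense_parameters` is a hypothetical helper.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# Assumption: 23 input features, inferred from the quoted total of 201809
layer_sizes = [23, 512, 256, 128, 128, 64, 8, 1]
n_params = count_dense_parameters(layer_sizes)
n_events = 263137  # training events quoted in the text

print("trainable parameters:", n_params)
print("events per parameter:", round(n_events / n_params, 2))
```

The same helper can be called with other layer widths to estimate the size of a network before building it.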

Exercise:
- Increase the network size obscenely, to maybe 300k parameters, and retrain

In any case, this problem is pretty difficult to handle: the `bk1` background, ttW production, is very similar to the ttH signal, so lumping ttW together with Drell-Yan in training can easily confuse the network. Indeed, the final classifier recognizes that there are two regimes in the background: one that peaks near the signal (because it is similar to the signal) and one that doesn't.

Next: let's try a multiclass separation between signal, bk1, bkg2!

## That's all!