# Loading and Evaluating Multiple Models

This notebook serves as a guide to loading multiple models that were saved as described in [workbook.ipynb](workbook.ipynb). We will also evaluate them on an independent test dataset.

In [None]:
import json
import os
from pathlib import Path
from urllib.request import urlretrieve

import awkward as ak
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm

from models import from_config

# Load models

In [None]:
save_path = Path("saved_models")

In [None]:
!ls -la $save_path

In [None]:
# manually select models to evaluate
configs = ["deepset"]

# alternative: all models in `save_path`
#configs = [path.name for path in save_path.iterdir()]

In [None]:
models = {}
for tag in configs:
    model_path = save_path / tag
    with open(model_path / "config.json") as f:
        config = json.load(f)
    model = from_config(config)
    # map_location="cpu" lets checkpoints saved on a GPU load on CPU-only machines
    state = torch.load(model_path / "state.pt", map_location="cpu")
    model.load_state_dict(state)
    models[tag] = model

In [None]:
models
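
The loading loop above expects every entry in `save_path` to be a directory containing both `config.json` and `state.pt`. When using the "all models" alternative, stray files or incomplete directories would make the loop fail. The following is a minimal sketch of a filter that keeps only complete model directories. The helper name `find_model_tags` and the toy directory contents are illustrative assumptions, not part of the lab code.

```python
import json
import tempfile
from pathlib import Path


def find_model_tags(save_path):
    """Return sorted names of subdirectories that hold both expected files.

    Hypothetical helper: assumes the layout <save_path>/<tag>/config.json
    and <save_path>/<tag>/state.pt used by the loading loop.
    """
    return sorted(
        path.name
        for path in Path(save_path).iterdir()
        if path.is_dir()
        and (path / "config.json").is_file()
        and (path / "state.pt").is_file()
    )


# Build a throwaway directory tree to demonstrate the filter.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "deepset").mkdir()
    (root / "deepset" / "config.json").write_text(json.dumps({"model": "deepset"}))
    (root / "deepset" / "state.pt").write_bytes(b"")  # empty stand-in file
    (root / "incomplete").mkdir()                     # has neither file
    (root / "notes.txt").write_text("not a model")    # stray file
    tags = find_model_tags(root)

print(tags)
```

In the notebook, `configs = find_model_tags(save_path)` could then replace the plain `iterdir()` listing.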

# Load test data

In [None]:
from utils import GraphDataset, load_data, map_np, collate_fn, preprocess, get_adj, loss_fn, accuracy_fn

In [None]:
filename = "smartbkg_dataset_4k_testing.parquet"
url = ... # url for testdata will be provided on second lab day

In [None]:
if not os.path.exists(filename):
    urlretrieve(url, filename)

In [None]:
feature_columns = ["prodTime", "x", "y", "z", "energy", "px", "py", "pz"]
df, labels = load_data(filename, row_groups=[0])

In [None]:
with open("pdg_mapping.json") as f:
    pdg_mapping = dict(json.load(f))
df["pdg_mapped"] = map_np(df.pdg, pdg_mapping, fallback=len(pdg_mapping) + 1)

In [None]:
data = preprocess(df, pdg_mapping=pdg_mapping, feature_columns=feature_columns)

In [None]:
data["adj"] = [get_adj(index, mother) for index, mother in zip(data["index"], data["mother"])]

In [None]:
dl = torch.utils.data.DataLoader(
    GraphDataset(feat=data["features"], pdg=data["pdg_mapped"], adj=data["adj"], y=labels),
    batch_size=256,
    collate_fn=collate_fn
)

# Evaluation

If we pass test data through one of our trained models, each event in that dataset is mapped to a number between 0 and 1. This is due to the sigmoid activation function applied to the output of each model's last layer (in this notebook the models return logits and the sigmoid is applied explicitly). The output value can be interpreted as the confidence the model has that a particular event passes the skimming. To decide which events are thrown away prior to the detector simulation, a threshold needs to be selected: every event with an output below the threshold is discarded and the others are kept. Each threshold corresponds to a pair of true and false positive rates (TPR, FPR). Plotting them against each other gives a *receiver operating characteristic* (ROC) curve. As the figure below illustrates, this curve can be used to evaluate the model's performance, which is quantified by the area under the ROC curve (AUC).

![](figures/Roc_curve.png)
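
To make the relation between threshold, TPR and FPR concrete, here is a small sketch that computes the rates by hand on toy data and integrates the resulting ROC points with the trapezoidal rule. The helper names `sigmoid` and `rates_at_threshold`, the toy labels and logits, and the chosen thresholds are all illustrative assumptions. In the notebook itself, `sklearn.metrics.roc_curve` and `auc` do this work.

```python
import numpy as np


def sigmoid(x):
    """Map logits to values between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))


def rates_at_threshold(y_true, y_score, threshold):
    """Return (FPR, TPR) when events with score >= threshold are kept."""
    y_true = np.asarray(y_true, dtype=bool)
    keep = np.asarray(y_score) >= threshold
    tpr = (keep & y_true).sum() / y_true.sum()        # kept positives / all positives
    fpr = (keep & ~y_true).sum() / (~y_true).sum()    # kept negatives / all negatives
    return fpr, tpr


# Toy data: two negative and two positive events (made up for illustration).
y_true = np.array([0, 0, 1, 1])
logits = np.array([-2.0, 0.5, -0.5, 3.0])
y_score = sigmoid(logits)

# Scan thresholds from strict (keep nothing) to loose (keep everything).
thresholds = np.array([1.1, 0.9, 0.6, 0.3, 0.0])
points = [rates_at_threshold(y_true, y_score, t) for t in thresholds]
fpr = np.array([p[0] for p in points])
tpr = np.array([p[1] for p in points])

# Area under the ROC curve via the trapezoidal rule.
auc_value = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc_value)
```

A perfect classifier would reach an AUC of 1, while random guessing gives about 0.5.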

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, accuracy_score, auc

The following function passes the data from the DataLoader through our model once to produce pairs of model outputs (`logits`) and targets (`y`):

In [None]:
def evaluate(model, dl):
    model.eval()
    out_y = []
    out_logits = []
    for x, y, mask in tqdm(dl):
        out_y.append(y)
        with torch.no_grad():
            logits = model(x, mask=mask)
        # squeeze only the last dimension so a final batch of size 1 stays 1-dimensional
        out_logits.append(logits.squeeze(-1))
    return torch.cat(out_y), torch.cat(out_logits)

In [None]:
scores = {}
for name, model in models.items():
    print("Evaluating", name)
    scores[name] = evaluate(model, dl)

In [None]:
for name, (y_true, y_logits) in scores.items():
    y_score = y_logits.sigmoid()
    fpr, tpr, thr = roc_curve(y_true.numpy(), y_score.numpy())
    plt.plot(fpr, tpr, label=name)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.grid()

In [None]:
summary = []
for (name, model), (y_true, y_logits) in zip(models.items(), scores.values()):
    y_score = y_logits.sigmoid()
    fpr, tpr, thr = roc_curve(y_true.numpy(), y_score.numpy())
    loss = loss_fn(y_logits, y_true)
    accuracy = accuracy_fn(y_logits, y_true)
    summary.append(
        {
            "Model": name,
            "Loss": loss.item(),
            "Accuracy": accuracy.item(),
            "AUC": auc(fpr, tpr),
        }
    )

In [None]:
summary

In [None]:
pd.DataFrame(summary).set_index("Model")

# Speedup estimation

How do we finally choose a threshold? We want our model to deliver the highest possible speedup of our simulation chain. Under certain assumptions (see [labday.md](labday.md)), a formula for the speedup can be derived that depends only on the FPR and TPR. Deriving and implementing it is your job. With it, you can plot the speedup against the threshold and find its maximum as well as the corresponding optimal threshold.
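
The mechanics of scanning thresholds and picking the best one can be sketched independently of the actual formula. The example below uses a placeholder `toy_metric` (simply TPR minus FPR), which is an assumption for illustration only and is not the speedup formula from the lab instructions. The threshold, FPR and TPR arrays are made-up values in the shape returned by `roc_curve`.

```python
import numpy as np


def toy_metric(fpr, tpr):
    # Placeholder only: NOT the speedup formula, just something to maximise.
    return tpr - fpr


# Made-up ROC points, ordered by decreasing threshold as roc_curve returns them.
thr = np.array([np.inf, 0.9, 0.6, 0.3, 0.1])
fpr = np.array([0.0, 0.0, 0.5, 0.5, 1.0])
tpr = np.array([0.0, 0.4, 0.6, 1.0, 1.0])

values = toy_metric(fpr, tpr)       # one metric value per threshold
best_idx = int(np.argmax(values))   # position of the maximum
best_thr = thr[best_idx]            # threshold that achieves it
best_value = values[best_idx]

print(best_thr, best_value)
```

The same `argmax` pattern applies once `toy_metric` is replaced by your own `speedup` function.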

In [None]:
def speedup(fpr, tpr):
    return ...

In [None]:
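for i, (model_name, (y_true, y_logits)) in enumerate(scores.items()):
    # `scores` holds (targets, logits) pairs; convert logits to probabilities first
    fpr, tpr, thr = roc_curve(y_true.numpy(), y_logits.sigmoid().numpy())
    s = speedup(fpr, tpr)
    s_max = np.max(s)
    thr_max = thr[np.argmax(s)]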
    summary[i]["Max. Speedup"] = s_max
    summary[i]["Best Threshold"] = thr_max
    plt.plot(thr[1:], s[1:], label=model_name)
plt.xlabel("Threshold")
plt.ylabel("Speedup")
plt.legend()
plt.grid()

In [None]:
pd.DataFrame(summary).set_index("Model")