# pNN Tutorial
The entire workflow of *defining*, *training*, and *evaluating* parametric neural networks

### Set-up (only execute once, at the beginning)

Download the datasets (HEPMASS-IMB + HEPMASS test-set):

In [None]:
# download the HEPMASS-IMB archive from Zenodo and save it under data/
!mkdir -p data
!wget -O data/hepmass-imb.zip "https://zenodo.org/record/6453048/files/hepmass-imb.zip?download=1"

In [None]:
# extract
!unzip data/hepmass-imb.zip -d data/hep-imb

# delete archive
!rm data/hepmass-imb.zip

In [None]:
# download test-set of HEPMASS
!wget -O data/all_test.csv.gz http://archive.ics.uci.edu/ml/machine-learning-databases/00347/all_test.csv.gz

In [None]:
# extract
!gzip -d data/all_test.csv.gz

Then prepare HEPMASS before use:

In [None]:
# pre-process:
#  - specify "-d" if you want to delete "all_test.csv"
!python3 process_csv.py data/all_test.csv data/hepmass/test.csv

### Load libraries

Loading the required libraries and packages

In [None]:
import os

The following cell forces `TF` to use only the CPU, because the training sequences rely on random-number sampling, which is quite slow on the GPU. The variable is set before TensorFlow is imported so that it takes effect.
- The GPU may still be advantageous if the pNN is convolutional; in that case, comment out this line.

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from script import utils, cms
from script.utils import free_mem

from script.models.layers import Divide
from script.datasets import Hepmass, Benchmark

Useful when working in notebooks:

In [None]:
# reloads the modified source files, automatically
%load_ext autoreload
%autoreload 2

Get nice looking plots, and fix the randomness:

In [None]:
cms.plot.set_style()
utils.set_random_seed(42)

### Data Loading and Exploration
In this section we load the `HEPMASS-IMB` dataset (you can find it on [Zenodo](https://doi.org/10.5281/zenodo.6453048)), which is a modification of `HEPMASS` ([UCI ML repository](http://archive.ics.uci.edu/ml/datasets/hepmass)).
- The dataset has 28 features: 26 of them describe each event (entry), one is the class label, and the last *mass feature* represents the physics parameter (i.e. the signal mass hypothesis).
- The event-related features are already normalized to have approximately zero mean and unit variance.
- There are five mass points: $\{500,750,1000,1250,1500\}$ GeV.
- There is one signal and one background process: the signal is labeled with class `1`, and the background `0`.
- The training background samples are the same as in `HEPMASS`, while the signal has been greatly reduced to make the dataset more challenging and realistic.

In [None]:
data = Benchmark()

# the actual loading occurs here:
data.load(signal='data/hep-imb/imbalanced_signal.csv',
          bkg='data/hep-imb/imbalanced_background.csv', features=Hepmass.FEATURES)

* The `.load()` function expects the dataset to be divided into two parts: a signal part and a background part.
* Each part must be either a `csv` file (in which case you specify the path where it is stored) or a (list of) `pd.DataFrame` already loaded into memory. If you provide a list of data-frames, the function concatenates them into one.
* Each `csv` is expected to contain the following columns: `[type, mass, name, weight]`; the `name` column is optional for the signal csv.
* You can also specify which features to load through the `features` argument.
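To make the expected layout concrete, here is a minimal sketch of tiny in-memory data-frames with the columns listed above. The feature names `f0` and `f1` and all values are invented for illustration; how `.load()` validates its inputs is not shown in this tutorial, so treat this only as a picture of the expected columns.

```python
import pandas as pd

# Hypothetical toy frames: `f0`, `f1` and every value are made up.
signal = pd.DataFrame({
    'f0': [0.1, -0.3], 'f1': [1.2, 0.4],
    'type': [1, 1],           # class label: 1 = signal
    'mass': [500.0, 750.0],   # signal mass hypothesis (GeV)
    'weight': [1.0, 1.0],     # per-sample weight
})

# The background may come as a list of data-frames (e.g. one per process).
background_parts = [
    pd.DataFrame({'f0': [0.7], 'f1': [-1.1], 'type': [0], 'mass': [500.0],
                  'name': ['background'], 'weight': [1.0]}),
    pd.DataFrame({'f0': [-0.2], 'f1': [0.9], 'type': [0], 'mass': [750.0],
                  'name': ['background'], 'weight': [1.0]}),
]

# What "concatenate each into one" amounts to in plain pandas.
background = pd.concat(background_parts, ignore_index=True)
print(background)

# Following the description above, such objects could then be passed as:
# data.load(signal=signal, bkg=background_parts, features=['f0', 'f1'])
```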

In [None]:
# once loaded (or merged), we can inspect both the signal and background:
data.signal.head()

In [None]:
data.background.head()

* We can access the signal and background data-frames through the fields `.signal` and `.background`, respectively.
* The class label is stored in the `type` column and the mass feature in `mass`. The `name` column holds the name of the background process (here you see the default name, "background"), which is useful for retrieving the samples of each background process when there is more than one. Lastly, the `weight` column can be used to weight the samples: in this case each entry has weight $1$.

In [None]:
# displays all the mass hypotheses
data.signal['mass'].unique()

In [None]:
# number of training samples
len(data.signal), len(data.background)

Feature distribution

In [None]:
bins = 100
df = data.ds  # .ds contains both signal and bkg dataframes
mass = data.signal['mass'].unique()

In [None]:
axes = utils.get_plot_axes(rows=5, cols=5, size=(12, 10))
axes = np.reshape(axes, newshape=[-1])

bkg = data.background
sig = data.signal

for i, col in enumerate(data.columns['feature'][1:-1]):
    ax = axes[i]

    stats = df[col].describe()
    value_range = (stats['min'], stats['max'])

    bkg[col].plot(kind='hist', bins=bins, histtype='step', label='bkg', hatch='//', ax=ax,
                  range=value_range, linewidth=2, weights=np.ones_like(bkg[col]) / len(mass))

    for m in mass:
        sig[sig['mass'] == m][col].plot(kind='hist', bins=bins, histtype='step', ax=ax,
                                        label=f'{int(m)} GeV', range=value_range, linewidth=2)

    ax.set_ylabel('Weighted Num. Events')
    ax.set_xlabel(col)

    ax.legend(loc='best')
    free_mem()

plt.show()

---
### Model Definition
In this section we're going to instantiate a pNN model as a feed-forward neural network (since our data is tabular). In particular we define an *affine-pNN* with:
* *pre-processing layers* applied on the input nodes;
* 4 layers with $[300, 150, 100, 50]$ units, `ReLU` activation, and `Dropout`;
* the *affine-conditioning* mechanism applied at each intermediate layer.

Pre-processing layers (at `script.models.layers`):
* `Divide(v)`: divides the input by the provided value `v`.
* `Clip(v_min, v_max)`: bounds the input value between `v_min` and `v_max`.
* `StandardScaler(mean, std)`: standardizes the input by subtracting the provided `mean` and then dividing by the provided standard deviation (`std`).
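The arithmetic behind these three layers can be sketched with plain NumPy functions. The functions `divide`, `clip`, and `standard_scale` below are illustrative stand-ins written for this example, not the actual layers in `script.models.layers`.

```python
import numpy as np

def divide(x, v):
    """Divide the input by a constant v."""
    return x / v

def clip(x, v_min, v_max):
    """Bound every value of x to the interval [v_min, v_max]."""
    return np.clip(x, v_min, v_max)

def standard_scale(x, mean, std):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / std

mass = np.array([500.0, 1000.0, 1500.0])

scaled = divide(mass, 1000.0)
clipped = clip(np.array([-7.0, 0.2, 9.0]), -5.0, 5.0)
standardized = standard_scale(mass, mean=1000.0, std=500.0)

print(scaled, clipped, standardized)
```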

In [None]:
# in our case, the HEPMASS features are already standardized,
# so we only normalize the mass feature "m" by dividing it by 1000.
preproc = {'m': [Divide(1000.0)]}

* Pre-processing layers are defined by means of a `dict`.
* The dict *keys* indicate on which input the layer(s) have to be applied: possible values are `"x"` for the features, and `"m"` for the mass (physics parameter).
* The dict *values* should be a list of layers, each of them will be applied sequentially on the output of the previous one.
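The sequential application described above can be sketched in pure Python. The helper `apply_preprocessing` and the lambda functions are hypothetical stand-ins for illustration; the real model wires Keras layers together instead.

```python
from functools import reduce

def apply_preprocessing(inputs, preproc):
    """Apply each list of callables, in order, to the input named by its key.

    Inputs without an entry in `preproc` are passed through unchanged.
    """
    out = dict(inputs)
    for key, layers in preproc.items():
        # each layer receives the output of the previous one
        out[key] = reduce(lambda value, layer: layer(value), layers, out[key])
    return out

# Hypothetical stand-ins: first divide by 1000, then clip to [0, 1.2].
preproc = {'m': [lambda m: m / 1000.0,
                 lambda m: min(max(m, 0.0), 1.2)]}

result = apply_preprocessing({'x': [0.3, -1.2], 'm': 1500.0}, preproc)
print(result)
```

Order matters here: clipping before dividing would give a different value for `m`.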

To create a model we can either instantiate it (from `script.models.pnn` or `script.models.affine`) and then compile, or use the utility function `script.utils.get_compiled_pnn`:

In [None]:
model, checkpoint = utils.get_compiled_pnn(data, units=[300, 150, 100, 50], activation=tf.nn.relu,
                                           conditioning=dict(method='affine', place='all'),
                                           dropout=0.25, kernel_initializer='he_uniform',
                                           preprocess=preproc, lr=5e-4, save='tutorial/affine_pnn',
                                           kernel_regularizer=tf.keras.regularizers.l2(1e-5),
                                           bias_regularizer=tf.keras.regularizers.l2(1e-6))

* The function will return a compiled `model` (with default `Adam` optimizer) ready to be trained.
* If we specify the `save` argument (a string), the function will also create a `tf.keras.callbacks.ModelCheckpoint` instance (`checkpoint`) to save the model's weights during training.
* The hyper-parameters can be directly passed as `kwargs`. In this case we have set the `units` (which also determines the number of hidden layers), optimizer's learning rate (`lr`), pre-processing layers (`preprocess`), the `activation` function, and finally the regularization (in terms of `dropout`, and l2 weight decay with `kernel_regularizer` and `bias_regularizer`.)
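As a rough illustration of the l2 weight decay, the sketch below computes the penalty that an l2 regularizer with a given factor adds to the loss, namely the factor times the sum of squared weights. The toy `kernel` and `bias` arrays are invented for this example.

```python
import numpy as np

def l2_penalty(weights, factor):
    """Penalty of an l2 regularizer: factor * sum(weights ** 2)."""
    return factor * float(np.sum(np.square(weights)))

# Toy parameters of a single dense layer (values are made up).
kernel = np.ones((3, 2))
bias = np.ones(2)

# Same factors as in the cell above: 1e-5 for kernels, 1e-6 for biases.
penalty = l2_penalty(kernel, 1e-5) + l2_penalty(bias, 1e-6)
print(penalty)
```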

In the hyper-params, we also specified the conditioning mechanism with the `conditioning` dict:
* we can choose the **kind** (by `method`), which can be "concat" (concatenation-based), "biasing" (conditional biasing), "scaling" (conditional scaling), or "affine" (affine-conditioning).
* and the **location** (where conditioning occurs) by specifying the `place` argument, which can be one of `start` (right after the pre-processed input layers), `all` (after each non-linearity), or `end` (just before the output layer).
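The four kinds of conditioning can be sketched in NumPy as follows. This assumes the usual definitions: concatenation appends the mass to the features, biasing adds a mass-dependent shift, scaling multiplies by a mass-dependent factor, and affine does both. The helper names and the random linear maps are illustrative; the actual layers in the repository may parameterize the scale and bias differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, out_dim):
    """Random linear map standing in for a trained dense layer."""
    w = rng.normal(size=(x.shape[-1], out_dim))
    return x @ w

def condition(h, m, method):
    """Combine hidden features h (batch, units) with the mass m (batch, 1)."""
    units = h.shape[-1]
    if method == 'concat':
        return np.concatenate([h, m], axis=-1)
    if method == 'biasing':
        return h + dense(m, units)
    if method == 'scaling':
        return h * dense(m, units)
    if method == 'affine':
        return h * dense(m, units) + dense(m, units)
    raise ValueError(f'unknown method: {method}')

h = rng.normal(size=(4, 8))   # hidden activations for a batch of 4
m = np.full((4, 1), 0.5)      # mass already divided by 1000

shapes = {k: condition(h, m, k).shape
          for k in ['concat', 'biasing', 'scaling', 'affine']}
print(shapes)
```

Only concatenation changes the width of the representation; the other three keep the number of units fixed.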

This results in the following model architecture:

In [None]:
# pre-processing -> (dense -> conditioning -> dropout) x4 -> output (sigmoid)
model.summary()

---
### Training
Before training a pNN we pick a training `Sequence` that implements either or both: *parameter assignment* for the background data, and *balanced training*.

Available sequences are:
* `cms.data.IdenticalSequence`: implements the identical sampling strategy for mass assignment.
* `cms.data.UniformSequence`: assigns the mass from a uniform distribution on the mass interval.
* `cms.data.BalancedUniformSequence`: optionally samples the mass uniformly and, in addition, balances the mini-batches. It supports *class* balance (`balance_signal=False` and `balance_bkg=False`), *signal* balance (`balance_signal=True` and `balance_bkg=False`), *background* balance (`balance_signal=False` and `balance_bkg=True`), and *full* balance (`balance_signal=True` and `balance_bkg=True`).
* `cms.data.BalancedIdenticalSequence`: behaves like the `BalancedUniformSequence` but the mass is assigned following the identical strategy.
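A minimal NumPy sketch of the ideas behind these sequences follows. It assumes that "identical" means drawing the background mass from the same discrete values the signal takes, and that "uniform" means drawing it from a continuous uniform distribution over the mass interval. Only class balance is shown. All function names are hypothetical and this is not the implementation in `cms.data`.

```python
import numpy as np

rng = np.random.default_rng(42)
mass_points = np.array([500.0, 750.0, 1000.0, 1250.0, 1500.0])

def identical_assignment(n):
    """Draw background masses from the discrete signal mass values."""
    return rng.choice(mass_points, size=n)

def uniform_assignment(n):
    """Draw background masses uniformly over the whole mass interval."""
    return rng.uniform(mass_points.min(), mass_points.max(), size=n)

def class_balanced_batch(sig_idx, bkg_idx, batch_size):
    """Pick half of the mini-batch from each class (with replacement)."""
    half = batch_size // 2
    return np.concatenate([rng.choice(sig_idx, size=half),
                           rng.choice(bkg_idx, size=half)])

identical = identical_assignment(1000)
uniform = uniform_assignment(1000)

# Imbalanced toy dataset: 10 signal rows followed by 990 background rows.
sig_idx, bkg_idx = np.arange(0, 10), np.arange(10, 1000)
batch = class_balanced_batch(sig_idx, bkg_idx, batch_size=64)
print(np.unique(identical), uniform.min(), uniform.max(), len(batch))
```

Sampling with replacement lets the few signal rows fill half of every batch even when the dataset is heavily imbalanced.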

These classes have a handy method (`get_data()`) that takes care of splitting the provided data into training and validation sets (which are `tf.data.Dataset` objects). Specific arguments to the sequence can be provided as `kwargs`.
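The kind of split that `get_data()` takes care of can be pictured with a shuffled index split. The function name `train_valid_split` and the `valid_fraction` parameter are hypothetical; the actual arguments and split ratio of `get_data()` are not shown in this tutorial.

```python
import numpy as np

def train_valid_split(n_samples, valid_fraction=0.25, seed=42):
    """Shuffle the sample indices and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_valid = int(n_samples * valid_fraction)
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = train_valid_split(1000)
print(len(train_idx), len(valid_idx))
```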

In [None]:
# identical (sampled) sequence, without mini-batch balancing
train, valid = cms.data.IdenticalSequence.get_data(data, train_batch=1024,
                                                   features=data.columns['feature'])

In [None]:
history = model.fit(x=train, epochs=25, validation_data=valid, verbose=2,
                    # you can have other callbacks (e.g. early stop) as well
                    callbacks=[checkpoint])

In [None]:
# plot learning curves
utils.plot_history(history, keys=['loss', 'binary_accuracy', 'auc', 'weight-norm'])

---
### Evaluation
We evaluate our pNN on the test-set of `HEPMASS`, and measure various metrics such as the AUC of the ROC and PR (precision-recall) curves, as well as the significance ratio.
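For intuition about these metrics, the sketch below computes the ROC AUC as the fraction of (signal, background) pairs ranked correctly, plus a common approximate significance $s / \sqrt{s + b}$ at a fixed threshold. The tutorial does not define the exact formula behind the significance ratio, so the second function is an assumption and not necessarily what `script.evaluation` computes.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a signal outscores a background event."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # ties between a signal and a background score count as half
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def approx_significance(y_true, scores, threshold=0.5):
    """Assumed formula: s / sqrt(s + b) among events passing the threshold."""
    selected = scores >= threshold
    s = np.sum(selected & (y_true == 1))
    b = np.sum(selected & (y_true == 0))
    return float(s / np.sqrt(s + b)) if (s + b) > 0 else 0.0

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc(y_true, scores)
z = approx_significance(y_true, scores, threshold=0.3)
print(auc, z)
```

The pairwise AUC is quadratic in the number of events, so it is only meant for small illustrative inputs.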

In [None]:
from script import evaluation

In [None]:
# load the test-set
test = Hepmass()
test.load(path=Hepmass.TEST_PATH)

In [None]:
# load the best set of weights
utils.load_from_checkpoint(model, path=model.save_path)

In [None]:
# set-up models for comparison: in this case we only have one model.
models = {'affine-pNN': model}

Compute evaluation metrics on each mass point, per model:

In [None]:
roc, pr, ams = evaluation.compare_table(models, dataset=test, ratio=True)

In [None]:
# ROC
evaluation.pivot(df=pd.DataFrame(roc), dataset=test)

In [None]:
# precision recall
evaluation.pivot(df=pd.DataFrame(pr), dataset=test)

In [None]:
# significance ratio
evaluation.pivot(df=pd.DataFrame(ams), dataset=test)

Plot the models:

In [None]:
evaluation.compare(models, dataset=test, legend='lower right')