# Learning to pivot

Main paper: https://arxiv.org/abs/1611.01046

In this notebook, we make a classifier's predictions independent of *nuisance* parameters.
Although nuisance parameters may not be explicitly present in the dataset, they can be partially inferred from the remaining features.

In [None]:
try:
    import mlhep2019
except ModuleNotFoundError:
    import subprocess as sp
    result = sp.run(
        ['pip3', 'install', 'git+https://github.com/yandexdataschool/mlhep2019.git'],
        stdout=sp.PIPE, stderr=sp.PIPE
    )
    
    if result.returncode != 0:
        print(result.stdout.decode('utf-8'))
        print(result.stderr.decode('utf-8'))
    
    import mlhep2019

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display

from tqdm.notebook import tqdm

import numpy as np
import torch
import torch.utils.data

from mlhep2019.pivot import get_susy, split
from mlhep2019.pivot import nuisance_metric_plot, nuisance_prediction_hist

In [None]:
def get_free_gpu():
    from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlDeviceGetCount
    nvmlInit()

    return np.argmax([
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(i)).free
        for i in range(nvmlDeviceGetCount())
    ])

In [None]:
if torch.cuda.is_available():
    # query the GPUs once and reuse the result
    cuda_id = get_free_gpu()
    device = 'cuda:%d' % (cuda_id, )
    print('Selected %s' % (device, ))
else:
    device = 'cpu'
    print('WARNING: using cpu!')

### please, don't remove the following line
x = torch.tensor([1], dtype=torch.float32).to(device)

## Downloading SUSY dataset

The dataset can be found at https://archive.ics.uci.edu/ml/datasets/SUSY

The original paper is:
Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014)

### Data Set Information:

The data has been produced using Monte Carlo simulations. The first 8 features are kinematic properties measured by the particle detectors in the accelerator. The last ten features are functions of the first 8 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks and the dropout algorithm are presented in the original paper. The last 500,000 examples are used as a test set.

### Attribute Information:

The first column is the class label (1 for signal, 0 for background), followed by the 18 features (8 low-level features, then 10 high-level features): lepton 1 pT, lepton 1 eta, lepton 1 phi, lepton 2 pT, lepton 2 eta, lepton 2 phi, missing energy magnitude, missing energy phi, MET_rel, axial MET, M_R, M_TR_2, R, MT2, S_R, M_Delta_R, dPhi_r_b, cos(theta_r1). For detailed information about each feature, see the original paper.

In [None]:
def read_SUSY(path):
    import gzip
    
    SUSY_SIZE = 5000000
    SUSY_FEATURES = 19

    data = np.ndarray(shape=(SUSY_SIZE, SUSY_FEATURES), dtype='float32')
    with gzip.open(path, 'r') as f:
        for i, l in tqdm(enumerate(f), total=SUSY_SIZE):
            data[i] = [
                float(x)
                for x in l.split(b',')
            ]
    return data

In [None]:
data = read_SUSY('../../../share/SUSY.csv.gz')
data, labels = data[:, 1:], data[:, 0]

In [None]:
data_train, labels_train, data_test, labels_test = split(data, labels, split_ratios=(4, 1))

In [None]:
data_mean, data_std = np.mean(data_train, axis=0), np.std(data_train, axis=0)

In [None]:
data_train -= data_mean[None, :]
data_train /= data_std[None, :]

### never use test statistics to transform the test dataset!
data_test -= data_mean[None, :]
data_test /= data_std[None, :]

Here we select the first feature, 'lepton 1 pT', as the nuisance parameter. Feel free to try a different nuisance parameter or, perhaps, several nuisance parameters at once.

In [None]:
data_train, nuisance_train = data_train[:, 1:], data_train[:, 0]
data_test, nuisance_test = data_test[:, 1:], data_test[:, 0]

In [None]:
plt.hist(
    # order must match the legend: class 0 (background) first, then class 1 (signal)
    [nuisance_train[labels_train < 0.5], nuisance_train[labels_train > 0.5]],
    bins=100, histtype='step', label=['class 0', 'class 1'], log=True
)
plt.xlabel('nuisance')
plt.legend()
plt.show()

In [None]:
X_train, X_test, y_train, y_test, z_train, z_test = [
    torch.tensor(x, dtype=torch.float32, device=device)
    for x in [data_train, data_test, labels_train, labels_test, nuisance_train, nuisance_test]
]

In [None]:
dataset_test = torch.utils.data.TensorDataset(X_test, y_test, z_test)
loader_test = torch.utils.data.DataLoader(dataset_test, batch_size=1024, shuffle=False)

def test_predictions(model):
    with torch.no_grad():
        return np.concatenate([
            torch.sigmoid(model(X_batch)).to('cpu').detach().numpy()
            for X_batch, _, _ in loader_test
        ], axis=0)

## Task

- choose between conditional and unconditional pivoting (or both);
- implement and train an ordinary classifier;
- implement and train a pivoted classifier;
- compare the ROC AUC score as a function of the nuisance parameter;
- make sure that the Mutual Information (MI) for the pivoted classifier is lower than the MI for the ordinary classifier.
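
To make the ROC AUC comparison concrete, here is a minimal NumPy-only sketch of computing the score separately in each nuisance bin. The helper names (`roc_auc`, `roc_auc_per_nuisance_bin`), the quantile binning, and the toy data are assumptions made for illustration; this is not how `nuisance_metric_plot` is implemented, and `roc_auc` is only a stand-in for `sklearn.metrics.roc_auc_score`.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = scores[labels > 0.5]
    neg = np.sort(scores[labels < 0.5])
    if len(pos) == 0 or len(neg) == 0:
        return float('nan')  # undefined when a class is missing

    # for each positive score: number of negatives strictly below it and tied with it
    below = np.searchsorted(neg, pos, side='left')
    tied = np.searchsorted(neg, pos, side='right') - below
    return float(np.sum(below + 0.5 * tied) / (len(pos) * len(neg)))

def roc_auc_per_nuisance_bin(labels, scores, nuisance, n_bins=6):
    """Split examples into equally populated nuisance bins and score each bin."""
    edges = np.quantile(nuisance, np.linspace(0, 1, n_bins + 1))
    bin_index = np.clip(np.searchsorted(edges, nuisance, side='right') - 1, 0, n_bins - 1)
    return np.array([
        roc_auc(labels[bin_index == k], scores[bin_index == k])
        for k in range(n_bins)
    ])

# toy data (hypothetical): the score noise grows with the nuisance value,
# so the per-bin ROC AUC should drop from the first bin to the last
rng = np.random.default_rng(0)
labels_demo = rng.integers(0, 2, size=6000)
nuisance_demo = rng.normal(size=6000)
scores_demo = labels_demo + rng.normal(size=6000) * np.exp(nuisance_demo)

auc_demo = roc_auc_per_nuisance_bin(labels_demo, scores_demo, nuisance_demo)
print(auc_demo)
```

A classifier that is a good pivot would be expected to show a flatter profile across bins than the toy scores above, possibly at the cost of a lower overall score.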

### Classifier

SUSY and HIGGS datasets are quite difficult classification problems.
It is recommended to use networks with 3-5 dense layers, 100+ units each.
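
As a rough starting point, the recommendation above could be sketched as follows. The class name `DenseClassifier`, the layer sizes, and the variables with the `sketch_`/`_demo` names are assumptions for illustration, not the reference solution. The network returns logits of shape `(batch, )`, which matches the `torch.sigmoid` call in `test_predictions` and pairs with `BCEWithLogitsLoss`.

```python
import torch
from torch import nn

class DenseClassifier(nn.Module):
    """Fully connected network that returns one logit per example."""

    def __init__(self, n_features, n_hidden=128, n_layers=4):
        super().__init__()
        layers = []
        n_in = n_features
        for _ in range(n_layers):
            layers += [nn.Linear(n_in, n_hidden), nn.ReLU()]
            n_in = n_hidden
        layers.append(nn.Linear(n_in, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, X):
        # drop the trailing dimension so that the output has shape (batch, )
        return self.net(X).squeeze(1)

# 17 inputs: the 18 SUSY features minus the one held out as the nuisance parameter
sketch_clf = DenseClassifier(n_features=17)
sketch_loss_fn = nn.BCEWithLogitsLoss()

X_demo = torch.randn(3, 17)
y_demo = torch.randint(0, 2, size=(3, )).float()

logits_demo = sketch_clf(X_demo)
loss_demo = sketch_loss_fn(logits_demo, y_demo)
print(logits_demo.shape, loss_demo.item())
```

Working with logits rather than probabilities keeps the loss numerically stable; the sigmoid is applied only when predictions are needed.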

### Adversary

- The nuisance parameter is continuous; therefore, a regression loss should be used (e.g. MSE, MAE).
- The adversary should be a small network; one hidden layer with ~64 units is fine.
- Try to find a good coefficient $\lambda$ in $\mathcal{L}_\mathrm{clf} - \lambda \mathcal{L}_\mathrm{adv}$.
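
The points above can be sketched as an alternating update for unconditional pivoting: the adversary minimises $\mathcal{L}_\mathrm{adv}$ with the classifier fixed, and the classifier minimises $\mathcal{L}_\mathrm{clf} - \lambda \mathcal{L}_\mathrm{adv}$. Everything named here (`sketch_pclf`, `sketch_adv`, `adversary_step`, `classifier_step`, `lam = 1.0`, the layer sizes, and feeding the adversary logits rather than probabilities) is an assumption chosen for illustration, not the reference solution.

```python
import torch
from torch import nn

torch.manual_seed(0)

n_features = 17  # 18 SUSY features minus the nuisance parameter

# deliberately small networks, for illustration only
sketch_pclf = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, 1))
sketch_adv = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

opt_pclf = torch.optim.Adam(sketch_pclf.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(sketch_adv.parameters(), lr=1e-3)

loss_fn_pclf = nn.BCEWithLogitsLoss()
loss_fn_adv = nn.MSELoss()  # regression loss, since the nuisance is continuous
lam = 1.0                   # hypothetical value of lambda; tune it

def adversary_step(X, z):
    """One adversary update with the classifier held fixed."""
    with torch.no_grad():
        logits = sketch_pclf(X)
    loss_adv = loss_fn_adv(sketch_adv(logits).squeeze(1), z)
    opt_adv.zero_grad()
    loss_adv.backward()
    opt_adv.step()
    return loss_adv.item()

def classifier_step(X, y, z):
    """One classifier update on L_clf - lambda * L_adv; the adversary is not updated."""
    logits = sketch_pclf(X)
    loss_clf = loss_fn_pclf(logits.squeeze(1), y)
    loss_adv = loss_fn_adv(sketch_adv(logits).squeeze(1), z)
    opt_pclf.zero_grad()
    (loss_clf - lam * loss_adv).backward()
    opt_pclf.step()
    return loss_clf.item(), loss_adv.item()

# random stand-in batch
X_demo = torch.randn(32, n_features)
y_demo = torch.randint(0, 2, size=(32, )).float()
z_demo = torch.randn(32)

adv_loss_demo = adversary_step(X_demo, z_demo)
clf_loss_demo, adv_loss_seen_by_clf = classifier_step(X_demo, y_demo, z_demo)
print(adv_loss_demo, clf_loss_demo, adv_loss_seen_by_clf)
```

The backward pass in `classifier_step` also leaves gradients on the adversary's parameters; they are discarded by `opt_adv.zero_grad()` at the next adversary update. For conditional pivoting, one option is to let the adversary also see the class label, but that variant is not shown here.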


### Extra tasks
(can be implemented out of order)

- try to make the adversary loss extremely high (more than $10$ times higher than the loss of the optimal constant prediction), but:
    - use 16 steps of adversary training per one step of classifier training;
    - the adversary must converge on its own with a fixed classifier;
- implement both types of pivoting;
- include the nuisance parameter in the feature set (while keeping it as the nuisance parameter), and train a pivoted classifier;
- try several nuisance parameters at once;
- compare different regression losses for the adversary (e.g. MSE vs MAE).
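
The "loss of the optimal constant" used as a reference in the first extra task can be computed directly. A minimal sketch, assuming the adversary loss is MSE or MAE (the helper name `constant_baselines` and the toy values are hypothetical): the best constant prediction is the mean for MSE and the median for MAE.

```python
import numpy as np

def constant_baselines(z):
    """Losses of the best constant predictor of z: (MSE at the mean, MAE at the median)."""
    z = np.asarray(z, dtype='float64')
    mse = float(np.mean((z - np.mean(z)) ** 2))
    mae = float(np.mean(np.abs(z - np.median(z))))
    return mse, mae

# toy values: mean = 1, median = 0
mse_demo, mae_demo = constant_baselines([0.0, 0.0, 0.0, 4.0])
print(mse_demo, mae_demo)  # 3.0 1.0
```

In the notebook this would be called as `constant_baselines(nuisance_train)`; because the nuisance feature is standardized, the MSE baseline should be close to 1.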

In [None]:
# your code here
raise NotImplementedError

In [None]:
# your code here
raise NotImplementedError

In [None]:
example_output = clf(torch.randn(3, data_train.shape[1]).to(device))
example_output_shape = example_output.to('cpu').detach().numpy().shape
example_labels = torch.randint(0, 2, size=(3, )).float().to(device)

example_loss = loss_fn_clf(example_output, example_labels).to('cpu').detach().numpy()

assert example_output_shape == (3, ) or example_output_shape == (3, 2), \
    'Output shape must be either (3, ) or (3, 2)'

assert example_loss.shape == tuple(), 'Check loss implementation'

In [None]:
n_epoches = 64
n_batches = 4096

losses_clf = np.zeros((n_epoches, n_batches), dtype='float32')

opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-3)

for i in range(n_epoches):
    for j in range(n_batches):
        ### define training procedure here
        indx = torch.randint(0, data_train.shape[0], size=(32, ))
        X_batch, y_batch, z_batch = X_train[indx], y_train[indx], z_train[indx]
        
        ### store the actual batch loss here; `losses_clf` is checked by the cells below
        losses_clf[i, j] = 0.0

In [None]:
plot_losses(classifier=losses_clf)

In [None]:
assert np.mean(losses_clf[-1, :]) < np.log(2), 'Classifier seems to not learn anything'
assert np.mean(losses_clf[-1, :]) > 0, 'Perhaps, you forgot to fill `losses_clf` array with actual loss values?'

## Let's pivot

In [None]:
# your code here
raise NotImplementedError

In [None]:
# your code here
raise NotImplementedError

In [None]:
# the input width is the number of features (`shape[1]`), not the number of examples
assert clf(torch.randn(3, data_train.shape[1]).to(device)).to('cpu').detach().numpy().shape == (3, )

In [None]:
# your code here
raise NotImplementedError

In [None]:
plot_losses(classifier=losses_clf, adversary=losses_adv)

In [None]:
assert np.mean(losses_clf[-1, :]) < np.log(2), 'The classifier seems to not learn anything.'
assert np.mean(losses_clf[-1, :]) > 0, 'Perhaps, you forgot to fill `losses_clf` array with actual loss values?'
assert np.mean(losses_adv[-1, :]) > 0, 'Perhaps, you forgot to fill `losses_adv` array with actual loss values?'

## Results

In [None]:
nuisance_prediction_hist([
        test_predictions(clf),
        test_predictions(pclf),
    ],
    nuisance_test, nuisance_bins=6,
    labels=labels_test.astype('int'),
    names=['non-pivoted', 'pivoted']
)

In [None]:
from sklearn.metrics import roc_auc_score

The following figure shows the dependency between predictions and the nuisance parameter:
- each column corresponds to a different model;
- rows correspond to nuisance parameter bins;
- each plot shows the distribution of model predictions within the corresponding nuisance bin.

- $\mathrm{MI}$ - (unconditional) mutual information between the nuisance parameter and model predictions.
- $\mathrm{MI}_i$ - mutual information between the nuisance parameter and model predictions **within** the $i$-th class.

**Note** that the following Mutual Information estimates might be unreliable.
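
To see why such estimates can be unreliable, here is a minimal sketch of a histogram-based (plug-in) estimator. The function name `binned_mutual_information` and the bin count are assumptions, and this is not necessarily the estimator used by `mlhep2019`. The result depends on the binning and is biased upwards for finite samples.

```python
import numpy as np

def binned_mutual_information(x, y, bins=20):
    """Plug-in MI estimate (in nats) from a 2D histogram of x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)

    # sum only over occupied cells to avoid log(0)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

rng = np.random.default_rng(0)
x_demo = rng.normal(size=100_000)

mi_independent = binned_mutual_information(x_demo, rng.normal(size=100_000))
mi_dependent = binned_mutual_information(x_demo, x_demo)
print(mi_independent, mi_dependent)
```

Even for independent samples the estimate is slightly positive, so small differences in MI between two models should be interpreted with care. The within-class $\mathrm{MI}_i$ would apply the same estimator to the examples of class $i$ only.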

In [None]:
nuisance_metric_plot([
        test_predictions(clf),
        test_predictions(pclf),
    ],
    # labels are cast to integers; the continuous nuisance values are passed unchanged
    labels=labels_test.astype('int'),
    nuisance=nuisance_test,
    metric_fn=roc_auc_score, metric_name='ROC AUC',
    names=['non-pivoted', 'pivoted']
)