<table>
  <tr>
    <td><center><img src="img/mlhep-logo-transparent.png" width="400"></center></td>
    <td><h1><center>The Sixth Machine Learning in High Energy Physics Summer School (MLHEP) 2020</center></h1></td>
  </tr>
 </table>

<h1><center>Seminar</center></h1>
<h2><center>Cross-Validation, Quality Metric Uncertainty Estimation, <br>Statistical Model Comparison</center></h2>

In [1]:
import pandas as pd
import numpy as np
import numpy.testing as np_testing
import matplotlib.pyplot as plt

%matplotlib inline

# Part 1: Quality metric uncertainty estimation

### Data preparation

UCI MAGIC dataset: https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope

The data are MC generated (see below) to simulate registration of high energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. Cherenkov gamma telescope observes high energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas, and developing in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, arranged in a plane, the camera. Depending on the energy of the primary gamma, a total of few hundreds to some 10000 Cherenkov photons get collected, in patterns (called the shower image), allowing to discriminate statistically those caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background).

Features description:
- **Length:** continuous # major axis of ellipse [mm]
- **Width:** continuous # minor axis of ellipse [mm]
- **Size:** continuous # 10-log of sum of content of all pixels [in #phot]
- **Conc:** continuous # ratio of sum of two highest pixels over fSize [ratio]
- **Conc1:** continuous # ratio of highest pixel over fSize [ratio]
- **Asym:** continuous # distance from highest pixel to center, projected onto major axis [mm]
- **M3Long:** continuous # 3rd root of third moment along major axis [mm]
- **M3Trans:** continuous # 3rd root of third moment along minor axis [mm]
- **Alpha:** continuous # angle of major axis with vector to origin [deg]
- **Dist:** continuous # distance from origin to center of ellipse [mm]
- **Label:** g,h # gamma (signal), hadron (background)

g = gamma (signal): 12332 \
h = hadron (background): 6688

In [2]:
f_names = np.array(["Length", "Width", "Size", "Conc", "Conc1", "Asym", "M3Long", "M3Trans", "Alpha", "Dist"])

data = pd.read_csv("data/MAGIC/magic04.data", header=None, names=list(f_names)+["Label"])
data.head()

Unnamed: 0,Length,Width,Size,Conc,Conc1,Asym,M3Long,M3Trans,Alpha,Dist,Label
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g


In [3]:
# prepare a matrix of input features
X = data[f_names].values

# prepare a vector of true labels
y = 1 * (data['Label'].values == "g")



# scale data (apply to X, y for the further simplisity)
from sklearn.preprocessing import StandardScaler

# scale data: X = (X - mean) / sigma
X = StandardScaler().fit_transform(X)

In [4]:
from sklearn.model_selection import train_test_split

# split the data into train and test subsamples to fit and test classifiers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=11, stratify=y)

### Fit a classifier

In [5]:
# import Logistic Regression classifier
from sklearn.linear_model import LogisticRegression

# define a classifier
logreg = LogisticRegression(penalty='l2', C=1.0, max_iter=1000, class_weight=None, solver='lbfgs', random_state=11)

# fit it using the train subsample
logreg.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=11)

### Make predictions

In [6]:
# get predictions for the test subsample
y_test_proba = logreg.predict_proba(X_test)[:, 1]

# predict labels
y_test_pred = logreg.predict(X_test)

In [7]:
print("Proba: ", y_test_proba[:5])
print("Pred:  ", y_test_pred[:5])
print("True:  ", y_test[:5])

Proba:  [9.22801236e-01 5.92866289e-04 7.62689962e-01 5.53478948e-01
 7.68378253e-01]
Pred:   [1 0 1 1 1]
True:   [0 0 1 0 1]


### Compute quality metrics

In [8]:
from sklearn import metrics

def quality_metrics_report(y_true, y_pred, y_proba):
    """
    Parameters
    ----------
    y_true: array-like of shape (n_samples,)
        Ground truth (correct) target values.
    y_pred: array-like of shape (n_samples,)
        Estimated targets as returned by a classifier.
    y_proba : array, shape = [n_samples]
        Target scores, can be probability estimates of the positive
        class.
        
    Returns
    -------
    List of metric values: [accuracy, precision, recall, f1, roc_auc]
    """
    
    accuracy  = metrics.accuracy_score(y_true, y_pred)
    precision = metrics.precision_score(y_true, y_pred)
    recall    = metrics.recall_score(y_true, y_pred)
    f1        = metrics.f1_score(y_true, y_pred)
    roc_auc   = metrics.roc_auc_score(y_true, y_proba)
    
    return [accuracy, precision, recall, f1, roc_auc]

In [9]:
# compute roc auc score on the test
[accuracy, precision, recall, f1, roc_auc] = quality_metrics_report(y_test, y_test_pred, y_test_proba)

print("Test sample:")
print("Accuracy:  ", accuracy)
print("Precision: ", precision)
print("Recall:    ", recall)
print("F1-score:  ", f1)
print("ROC AUC:   ", roc_auc)

Test sample:
Accuracy:   0.7915878023133543
Precision:  0.8037166085946573
Recall:     0.8978267920856309
F1-score:   0.8481691435575303
ROC AUC:    0.8367635664478921


## Quality metric uncertainty estimation with Bootstrap

<center><img src="img/bootstrap.png" width="800"></center>

Uncertainty estimation with bootstrap:

1. Given a model fitted on a train sample
2. 𝑁 – number of objects in a test sample
3. For 𝑖=1, …, 𝐵 do: \
    3.1 Sample with replacement a subsample with 𝑁 objects from the test sample \
    3.2 Calculate quality metrics on this subsample
4. Estimate statistics of the metrics: mean, variance, confidence intervals


### Task 1
Estimate the quality metrics uncertainties for the classifier considered above. For this, complete the function below.

**Hint:** to sample indeces with replacements use `np.random.choice()` function.

In [10]:
def botstrap_uncertainties(model, X_test, y_test, iters=100):
    
    metrics = []
    
    for i in range(iters):
        
        # you need to sample a subsample indeces using np.random.choice():
        # inds_boot = ...
        inds_boot = np.random.choice(np.arange(len(X_test)), size=len(X_test))
        
        X_test_boot = X_test[inds_boot]
        y_test_boot = y_test[inds_boot]
        
        # make prediction
        y_test_proba_boot = model.predict_proba(X_test_boot)[:, 1]
        y_test_pred_boot  = model.predict(X_test_boot)
        
        # compute quaility metrics
        metrics_boot = quality_metrics_report(y_test_boot, y_test_pred_boot, y_test_proba_boot)
        metrics.append(metrics_boot)
        
    metrics = np.array(metrics)
    df = pd.DataFrame()
    df['Metrics'] = columns=['Accuracy', 'Precision', 'Recall', 'F1', 'ROC AUC']
    df['Mean']    = metrics.mean(axis=0)
    df['Std']     = metrics.std(axis=0)
    
    return df

In [11]:
df = botstrap_uncertainties(logreg, X_test, y_test, iters=100)
print(df)

     Metrics      Mean       Std
0   Accuracy  0.792294  0.003665
1  Precision  0.804392  0.004556
2     Recall  0.897828  0.003922
3         F1  0.848536  0.003179
4    ROC AUC  0.837551  0.003929


Expected output (approximately):

<center>   
    
```python
     Metrics      Mean       Std
0   Accuracy  0.789478  0.004269
1  Precision  0.801227  0.004772
2     Recall  0.897766  0.004172
3         F1  0.846743  0.003407
4    ROC AUC  0.835261  0.004732
    
``` 
    
</center>

# Part 2: Cross-validation

## K-Fold CV

<center><img src="img/kfold.png" width="600"></center>

K-Fold:
    
1. Split the data into 𝐾 folds
2. For 𝑖=1,…,𝐾 do: \
    2.1 Keep 𝑖-th fold for validation \
    2.2 Use other 𝐾−1 folds to fit a model \
    2.3 Measure its quality on the validation fold \
3. Estimate mean and standard deviation of the quality metrics


In [12]:
# define a classifier
logreg = LogisticRegression(penalty='l2', C=1.0, max_iter=1000, class_weight=None, solver='lbfgs', random_state=11)

### Task 2
Using K-Fold cross-validation estimate means and standard deviation of the quality metrics for the classifier above. 

**Hint:** use `sklearn.model_selection.KFold(shuffle=True, random_state=11)` as it is shown in https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html . Use function `quality_metrics_report` above to compute the quality metrics.

In [13]:
from sklearn.model_selection import KFold

def kfold_uncertainties(model, X, y, n_splits=10):
    
    metrics = []
    
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=11)
    for train_index, test_index in kf.split(X):
        
        # fit the model on the train subsample, 
        # get y_test_proba and y_test_pred predictions on the test
        logreg.fit(X[train_index], y[train_index])
        y_test_pred = logreg.predict(X[test_index])
        y_test_proba = logreg.predict_proba(X[test_index])[:, 1]
        
        # compute quaility metrics
        metrics_iter = quality_metrics_report(y[test_index], y_test_pred, y_test_proba)
        metrics.append(metrics_iter)
        
    metrics = np.array(metrics)
    df = pd.DataFrame()
    df['Metrics'] = columns=['Accuracy', 'Precision', 'Recall', 'F1', 'ROC AUC']
    df['Mean']    = metrics.mean(axis=0)
    df['Std']     = metrics.std(axis=0)
    
    return df

In [14]:
df_kf = kfold_uncertainties(logreg, X, y, n_splits=10)
print(df_kf)

     Metrics      Mean       Std
0   Accuracy  0.790694  0.006743
1  Precision  0.802153  0.010757
2     Recall  0.898820  0.008051
3         F1  0.847678  0.006583
4    ROC AUC  0.839133  0.006790


Expected output:

<center>   
    
```python
     Metrics      Mean       Std
0   Accuracy  0.790694  0.006743
1  Precision  0.802153  0.010757
2     Recall  0.898820  0.008051
3         F1  0.847678  0.006583
4    ROC AUC  0.839133  0.006790
    
``` 
    
</center>

# Part 3: Statistical model comparison (optional)

## Hypothesis test

<center><img src="img/hypo.png" width="600"></center>

We have two hypothesis:
* $𝐻_0$: the models have the same quality
* $𝐻_1$: the models have the different qualities

Define models we would like to compare.

In [15]:
# define a model
model_1 = LogisticRegression(penalty='l2', C=1.0, max_iter=1000, class_weight=None, solver='lbfgs', random_state=11)

# test a model
kfold_uncertainties(model_1, X, y, n_splits=10)

Unnamed: 0,Metrics,Mean,Std
0,Accuracy,0.790694,0.006743
1,Precision,0.802153,0.010757
2,Recall,0.89882,0.008051
3,F1,0.847678,0.006583
4,ROC AUC,0.839133,0.00679


In [16]:
# Import kNN classifier
from sklearn.neighbors import KNeighborsClassifier

# define a model 2
model_2 = KNeighborsClassifier(n_neighbors=10)

# test a model
kfold_uncertainties(model_2, X, y, n_splits=10)

Unnamed: 0,Metrics,Mean,Std
0,Accuracy,0.790694,0.006743
1,Precision,0.802153,0.010757
2,Recall,0.89882,0.008051
3,F1,0.847678,0.006583
4,ROC AUC,0.839133,0.00679


In [17]:
# install mlxtend lib
# !pip3 install mlxtend

### Dietterich’s 5x2-Fold CV paired t test

In [18]:
from mlxtend.evaluate import paired_ttest_5x2cv

t, p = paired_ttest_5x2cv(estimator1=model_1,
                          estimator2=model_2,
                          X=X, y=y,
                          random_seed=1)

print('t statistic: %.3f' % t)
print('p-value: %.3f' % p)

if p <= 0.05:
    print("The models are significantply different.")
else:
    print("The models are NOT significantply different.")

t statistic: -17.720
p-value: 0.000
The models are significantply different.
