# How to evaluate model calibration

Beyond knowing whether a model can provide good results, have you ever thought about how your model will be used in practice? Will analysts use it as a yes/no decider or as a probability estimator? Knowing this and using it to deliver good business solutions is essential. It is also important to choose an evaluation metric that matches what your company wants to optimize. For example, evaluating a fraud detection model with accuracy is not recommended.

In this notebook I want to touch on the second possibility raised above, that is, when our model will be used as a probability estimator. I will present an example using a real competition data set and point to reading resources to further understand the topics covered here.

# Importing libs and reading data

In [None]:
# Libs to deal with tabular data
import pandas as pd
import numpy as np

# Plotting libs
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Auxiliary libs
from tqdm.notebook import tqdm
import os
import time
from math import ceil
from IPython.display import display, Markdown

In [None]:
# Reading train split
train = pd.read_csv('../input/santander-customer-transaction-prediction/train.csv')
X = train.drop(['ID_code', 'target'], axis=1).values
y = train['target'].values

# Picking an evaluation metric

The data set used as an example is taken from a Kaggle competition called Santander Customer Transaction Prediction. According to the description, the data describes the behaviour of the customers in order to predict if a certain transaction will be made.

In [None]:
y.mean()

As shown above, the response variable is imbalanced, thus we need to pick an evaluation metric that gives us meaningful information about the model's performance. Accuracy is not recommended because even if we always predict 0, the accuracy will be 90%. Besides, it doesn't take into account the influence of false positives and false negatives as F-score does.

In this problem I will use the area under the ROC curve (AUROC, or simply AUC). It ranges from 0 to 1 and higher values mean better models; a model that scores at random gets about 0.5. The main advantage of AUROC is that it directly assesses whether the probabilities yielded by the model are ordered correctly. In other words, it answers the following question: what's the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case? For example, an AUC of 0.8 tells us that if we randomly pick a positive case, its probability will be, on average, greater than the probability of 80% of the negative cases.
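The ranking interpretation of AUROC can be checked numerically. The sketch below uses synthetic labels and scores (not the competition data; all variable names are illustrative) and compares the fraction of correctly ordered positive/negative pairs with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels (about 10% positives) and scores that are higher, on average, for positives
y_true = (rng.random(2000) < 0.1).astype(int)
scores = rng.normal(loc=y_true.astype(float), scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs in which the positive case gets the higher score;
# ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

sklearn_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sklearn_auc)
```

Both numbers should agree up to floating point error, which is why AUROC only depends on how the scores are ordered and not on their scale.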

Remember that we want a model that estimates probabilities. F1-score is not a good choice because it requires specifying a threshold to turn probabilities into labels.
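To make the threshold dependence concrete, here is a small sketch on synthetic data (the score distribution and the three cut-offs are arbitrary assumptions): the F1-score changes with the chosen threshold, while AUROC is computed from the scores directly.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.1).astype(int)

# Hypothetical scores in [0, 1], shifted upwards for the positive class
proba = np.clip(rng.normal(loc=0.2 + 0.3 * y_true, scale=0.15), 0, 1)

# F1 needs hard labels, so its value depends on the cut-off
f1_by_threshold = {t: f1_score(y_true, (proba >= t).astype(int)) for t in (0.1, 0.3, 0.5)}
print(f1_by_threshold)

# AUROC uses the scores as they are, with no cut-off involved
auc = roc_auc_score(y_true, proba)
print(auc)
```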

If you want to get a better understanding of AUC and how it's computed, the Revolution Analytics blog published a great two-post series diving into this subject:

- First post: [ROC Curves in two lines of code](https://blog.revolutionanalytics.com/2016/08/roc-curves-in-two-lines-of-code.html)
- Second post: [Calculating AUC: the area under a ROC Curve](https://blog.revolutionanalytics.com/2016/11/calculating-auc.html/)

# Calibration curves (reliability diagram)

Calibration is another important concept to understand when evaluating probabilities yielded by a model. A model is called calibrated when the probabilities estimated resemble the real frequency of positive cases. For example, if we take a sample of 100 cases scored by the model whose mean probability of being positive is 60%, then we expect the real rate of positive cases in this sample to be 60%. Calibration is an important characteristic to have in credit scoring models, for instance.

To assess the calibration, one can use a type of chart called a calibration curve. To build it, use the following steps:

1. Score the test set and sort the probabilities in increasing order.
2. Divide the test set into a number of bins of equal probability width.
3. For each bin, compute the mean probability output by the model and the true rate of positive cases.
4. Plot a graph where the x-axis is the mean predicted probability and the y-axis is the rate of positive cases. Each bin then corresponds to a dot in this graph.
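The binning steps above can be sketched with plain NumPy. This is a minimal illustration on synthetic scores that are calibrated by construction (not the competition data); it skips the explicit sort, which the binning makes unnecessary, and leaves out the plot. The result is compared with scikit-learn's `calibration_curve`.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)

# Synthetic scores and labels, drawn so that the scores are calibrated by construction
proba = rng.random(20000)
y_true = (rng.random(20000) < proba).astype(int)

n_bins = 10
edges = np.linspace(0, 1, n_bins + 1)

# Step 2: assign each case to one of the equal-width bins
bin_idx = np.digitize(proba, edges[1:-1])

# Step 3: per-bin mean predicted probability and true rate of positives (empty bins are skipped)
occupied = [b for b in range(n_bins) if np.any(bin_idx == b)]
mean_pred = np.array([proba[bin_idx == b].mean() for b in occupied])
frac_pos = np.array([y_true[bin_idx == b].mean() for b in occupied])

# Same quantities from scikit-learn, for comparison
sk_frac_pos, sk_mean_pred = calibration_curve(y_true, proba, n_bins=n_bins)
print(np.round(mean_pred, 3))
print(np.round(frac_pos, 3))
```

Plotting `mean_pred` against `frac_pos` gives the dots of the calibration curve; for these synthetic scores they should fall close to the diagonal.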

<img src="https://www.researchgate.net/profile/Anand_Avati/publication/321160854/figure/download/fig2/AS:562618466344960@1511150103061/Reliability-curve-calibration-plot-of-the-model-output-probabilities-on-the-test-set.png" width="500"/>

For example, in the calibration curve above we can see that the model curve is very close to the diagonal line. The perfectly calibrated model will have a diagonal curve because in all segments of probability the mean predicted value will be equal to the true fraction of positives.

If you want to see more examples and have a better understanding of model calibration, check [this article](https://scikit-learn.org/stable/modules/calibration.html) from the scikit-learn documentation.

Notice that it's possible to adopt other strategies to define the bins. We could, for instance, use bins with an equal number of samples. In some cases, some regions of the probability range contain few samples, which makes it hard to evaluate whether the model is really calibrated there. If we are dealing with a very imbalanced binary classification problem, it's really hard to get well-calibrated probabilities at the high end of the probability range.
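The difference between the two binning strategies shows up in the number of cases per bin. The sketch below assumes skewed scores drawn from a Beta(1, 9) distribution, a made-up stand-in for what an imbalanced problem tends to produce.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical skewed scores: most of the mass sits near 0
proba = rng.beta(1, 9, size=10000)

# Equal-width bins: the number of cases per bin collapses towards the high end
uniform_counts, _ = np.histogram(proba, bins=np.linspace(0, 1, 11))

# Equal-size bins: edges are the deciles, so the counts match but the widths differ
quantile_edges = np.quantile(proba, np.linspace(0, 1, 11))
quantile_counts, _ = np.histogram(proba, bins=quantile_edges)

print(uniform_counts)
print(quantile_counts)
print(np.round(np.diff(quantile_edges), 3))
```

With equal-width bins, the dots at the high end of the curve rest on very few cases, so their position is noisy.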

# Expected calibration error (ECE)

Instead of analyzing the calibration visually, we can use the ECE metric. As described by [Guo et al. 2017](https://arxiv.org/abs/1706.04599), the expected calibration error is a weighted average of the absolute difference between the mean predicted value and the fraction of positives across M bins. We can easily compute this metric using the bins created for the calibration curve.

$$\text{ECE} = \sum^{M}_{m=1} \frac{|B_m|}{n} |pred(B_m) - frac(B_m)|$$

Where $B_m$ is the bin *m*, *pred()* is the mean predicted value and *frac()* is the fraction of positive cases.
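As a worked example of the formula, here is a toy computation with 10 made-up cases and three equal-width bins (both the data and the number of bins are assumptions chosen to keep the arithmetic easy to follow).

```python
import numpy as np

# Toy example: 10 cases falling into three equal-width bins of the score range
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.40, 0.45, 0.50, 0.80, 0.85, 0.90])
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0])

edges = np.array([0.0, 1 / 3, 2 / 3, 1.0])
bin_idx = np.digitize(proba, edges[1:-1])

ece_value = 0.0
for m in range(len(edges) - 1):
    in_bin = bin_idx == m
    if in_bin.any():
        weight = in_bin.sum() / len(proba)  # |B_m| / n
        pred = proba[in_bin].mean()         # pred(B_m)
        frac = y_true[in_bin].mean()        # frac(B_m)
        ece_value += weight * abs(pred - frac)

print(ece_value)
```

By hand: the bins contribute 0.4 × |0.125 − 0.25| = 0.05, 0.3 × |0.45 − 1/3| = 0.035 and 0.3 × |0.85 − 2/3| = 0.055, for an ECE of 0.14.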

In [None]:
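def ece(y_test, preds, strategy='uniform', n_bins=10):
    # Expected calibration error: weighted mean of |frac(B_m) - pred(B_m)| over the bins
    df = pd.DataFrame({'target': y_test, 'proba': preds})

    if strategy == 'uniform':
        # Equal-width bins over [0, 1]; a probability of exactly 1.0 falls in the last bin
        inner_edges = np.linspace(0, 1, n_bins + 1)[1:-1]
        df['bin'] = np.digitize(df['proba'], inner_edges)
    elif strategy == 'quantile':
        # Bins holding (roughly) the same number of samples
        df['bin'] = pd.qcut(df['proba'], n_bins, labels=False, duplicates='drop')
    else:
        raise ValueError("strategy must be 'uniform' or 'quantile'")

    # Mean predicted value, fraction of positives and number of samples per bin
    df_bin_groups = df.groupby('bin').agg(
        proba=('proba', 'mean'), target=('target', 'mean'), count=('proba', 'size')
    )
    df_bin_groups['ece'] = (df_bin_groups['target'] - df_bin_groups['proba']).abs() * (df_bin_groups['count'] / df.shape[0])
    return df_bin_groups['ece'].sum()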

# Fitting the model

First, I will split the data set using a stratified sample by the rate of positive cases (10%).  

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

In [None]:
def make_report(y_test, preds):
    # Computing AUC
    auc = roc_auc_score(y_test, preds)
    display(Markdown(f'AUROC: {auc}'))
    display(Markdown(f'Gini: {2*auc-1}'))
    display(Markdown(f'Fraction of positive cases in the test set: {y_test.mean()}'))
    display(Markdown(f'Mean predicted value in the test set:       {preds.mean()}'))
    display(Markdown(f'ECE (equal width bins):       {ece(y_test, preds)}'))
    
    # Plotting probabilities
    display(Markdown('#### Histogram of the probability distribution'))
    sns.histplot(preds)
    plt.show()
    
    # Plotting KDE by class
    display(Markdown('#### KDE plots of the probability distribution by class'))
    fig, ax1 = plt.subplots()
    sns.kdeplot(preds[y_test == 0], label = 'Class 0', ax = ax1)
    ax2 = ax1.twinx()
    sns.kdeplot(preds[y_test == 1], label = 'Class 1', color = 'red', ax = ax2)
    lines, labels = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(lines + lines2, labels + labels2, loc=0)
    plt.show()
    
    # Plotting calibration
    display(Markdown('#### Calibration curve (equal width bins)'))
    fop, mpv = calibration_curve(y_test, preds, n_bins=10)
    plt.plot(mpv, fop, "s-", label='model')
    plt.plot([0,1],[0,1], label='ideal')
    plt.xlabel('Mean predicted value')
    plt.ylabel('Fraction of positives')
    plt.legend()
    plt.show()
    
    display(Markdown('#### Calibration curve (equal size bins)'))
    fop, mpv = calibration_curve(y_test, preds, n_bins=10, strategy='quantile')
    plt.plot(mpv, fop, "s-", label='model')
    plt.plot([0,1],[0,1], label='ideal')
    plt.xlabel('Mean predicted value')
    plt.ylabel('Fraction of positives')
    plt.legend()
    plt.show()

## First attempt

Let's fit a simple Random Forest model and see what happens.

In [None]:
%%time
rf = RandomForestClassifier(n_jobs=-1, max_depth=8, random_state=42).fit(X_train, y_train)
preds = rf.predict_proba(X_test)[:,1]

In [None]:
make_report(y_test, preds)

Notice that the predicted probabilities range from 0 to about 0.4, which is not good: read as probabilities, these numbers don't make much sense. In scikit-learn, a random forest's `predict_proba` is the average, over all trees, of the fraction of class 1 training samples in the leaf where each case lands. Nothing guarantees that this average is an actual probability. Also, the calibration curve looks very off, and the scikit-learn function I use to generate the bins only produced 4 of them (it builds bins of width 0.1 and drops the empty ones). Random forests are not based on a probabilistic framework, thus they don't output real probabilities.
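In scikit-learn specifically, the number returned by a random forest's `predict_proba` is the mean of the per-tree class estimates, where each tree reports the class fraction in the leaf a case falls into. A small check on synthetic data (the `_demo` names and the data are illustrative, not part of the notebook's pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (about 10% positives), not the competition data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0).fit(X_demo, y_demo)

# Average of the per-tree class 1 estimates (leaf class fractions)
per_tree = np.stack([tree.predict_proba(X_demo)[:, 1] for tree in rf_demo.estimators_])
manual = per_tree.mean(axis=0)

forest = rf_demo.predict_proba(X_demo)[:, 1]
print(np.allclose(manual, forest))
```

An average of leaf fractions from shallow trees tends to be pulled towards the base rate, which is consistent with scores that never get close to 1.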

## Second attempt

Now I'm going to fit an ensemble of random forests, and for each one I will also fit an isotonic regression that maps the RF "probabilities" to real probabilities. Isotonic regression maps a single feature to a target under the constraint that the mapping must be non-decreasing; apart from that, the curve is free-form. Since `CalibratedClassifierCV` uses 5-fold cross-validation by default, we end up with 5 pairs of models (a random forest plus its calibrator), whose predictions are averaged.
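To see what the isotonic step does on its own, here is a minimal sketch with made-up raw scores squeezed into [0, 0.4] (an assumption that mimics the compressed range seen earlier). In practice the calibrator should be fitted on data the classifier has not seen, which is what the cross-validation takes care of.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)

# Hypothetical raw scores in [0, 0.4]; the true positive rate is score / 0.4
raw = rng.random(5000) * 0.4
y_true = (rng.random(5000) < raw / 0.4).astype(int)

# Non-decreasing, free-form map from raw score to probability
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip').fit(raw, y_true)

grid = np.linspace(0, 0.4, 9)
mapped = iso.predict(grid)
print(np.round(mapped, 2))
```

The mapped values never decrease along the grid and stretch the compressed scores over a much wider part of [0, 1].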

If you want to know more about these topics, here are some good readings:

- [Isotonic regression in Wikipedia](https://en.wikipedia.org/wiki/Isotonic_regression)
- [Calibrated Classifier Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html?highlight=calibrated#sklearn.calibration.CalibratedClassifierCV)

In [None]:
%%time
rf = RandomForestClassifier(n_jobs=-1, max_depth=8, random_state=42)
cv = CalibratedClassifierCV(rf, method='isotonic').fit(X_train, y_train)
preds = cv.predict_proba(X_test)[:,1]

In [None]:
make_report(y_test, preds)

The AUC increased by about 0.01 and the calibration curve is much closer to the ideal one. This means that our ensemble yields probabilities ordered in such a way that a randomly chosen positive case has about an 80% chance of being scored above a randomly chosen negative case. Besides, the probabilities resemble the real fraction of positive cases in the test set.

# Where is the model not working well?

To see how the model above performs, I'm going to split the test set again, but with 10 bins of equal number of cases. Notice that the bins may not have the same bin width. Then I will compute the AUROC for each bin and plot the result.

In [None]:
def sliced_auc(preds, y_test, bins):
    df = pd.concat([pd.Series(preds), pd.Series(y_test)], axis=1, keys=['proba', 'label'])
    df['decile'] = pd.qcut(df['proba'], bins)
    decile_dict = df.groupby('decile', observed=True).groups
    decile_dict = {k: roc_auc_score(df.loc[v, 'label'], df.loc[v, 'proba']) for k, v in decile_dict.items()}
    return decile_dict

def plot_sliced_auc(auc_dict):
    plt.figure(figsize=(10,6))
    sns.barplot(
        x = list(map(lambda x: x.right, auc_dict.keys())), 
        y = list(auc_dict.values()),
        color = 'cornflowerblue'
    )
    plt.xlabel('Upper bound of each decile')
    plt.ylabel('AUC')
    plt.ylim(0, 1)
    plt.show()

In [None]:
auc_dict = sliced_auc(preds, y_test, 10)
plot_sliced_auc(auc_dict)

As we can see above, the first and last bars have a slightly better AUC than the other bins, although not by much. Overall, the model performs evenly along the probability range.

I don't know if there is a better way to measure what I'm attempting here. If you know something, please let me know in the comments section.