This notebook takes a look at the post processing from [this](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords) notebook and also explores the use of test time augmentation (TTA).

We will explore the ways that the predictions from each model are combined, the use of predictions from TTA and whether we extract all possible information from the models that we spend many hours training. 

The main point that will be pursued is a method for encoding the uncertainty each model has in a given data point and using this information while ensembling.

In [None]:
# Import everything

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy.stats
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from matplotlib.colors import LogNorm

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import tensorflow as tf

In [None]:
import sys
!cp ../input/rapids/rapids.0.14.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/
import cuml #, cupy

In [None]:
# Setting color palette.
orange_black = [
    '#fdc029', '#df861d', '#FF6347', '#aa3d01', '#a30e15', '#800000', '#171820'
]

# Setting plot styling.
plt.style.use('ggplot')

I have trained a network using [this](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords) notebook and saved all of the TTA predictions for both out of fold (OOF) points and the test set that each model makes.

In [None]:
# The OOF file holds 1500 TTA predictions per example (the test set has only 11 per fold model)
n_tta = 1500
file = 'tta-exploration-128x128-b0'

# Load all of the TTA predictions for the test set.
all_tta_test = pd.read_csv('../input/'+file+'/all_tta_test.csv')
all_tta_test.columns.values[0] = 'image_name'
submission = pd.read_csv('../input/'+file+'/submission.csv')

# Load the oof df where the TTA predictions are contained in the numbered columns.
tta_keys = [str(i) for i in range(n_tta)]
oof = pd.read_csv('../input/'+file+'/oof.csv')

The first thing we will look at is how the standard deviation of the TTA predictions varies with their mean. For each example we take the mean of its TTA predictions, as well as the standard deviation.

Examples with a small standard deviation across TTA predictions can be considered to sit in a region of the input distribution that the model is confident about. Those with a large standard deviation are unstable under changes (augmentations) that we have decided do not change the class of the example, and therefore come from a part of the input distribution that is close to the decision boundary. This uncertainty is what we will try to leverage.

In [None]:
def IQR(data,ax=1):
    return np.subtract(*np.percentile(data, [75, 25],axis=ax))

In [None]:
tta_preds = oof[tta_keys]
mn_pred = np.mean(tta_preds,axis=1)
std_pred = np.std(tta_preds,axis=1)


plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
mxb = oof['target']==0
h = plt.hist2d(mn_pred[mxb],std_pred[mxb],norm=LogNorm(),bins=100)
plt.colorbar(h[3])
plt.xlabel('Arithmetic mean')
plt.ylabel('Standard deviation')
plt.title('Benign')
plt.subplot(1,2,2)
mxm = oof['target']==1
h = plt.hist2d(mn_pred[mxm],std_pred[mxm],norm=LogNorm(),bins=30)
plt.colorbar(h[3])
plt.xlabel('Arithmetic mean')
plt.ylabel('Standard deviation')
plt.title('Malignant')
plt.show()

plt.figure(figsize=(8,8))
plt.scatter(mn_pred[mxb],std_pred[mxb],color='orange',s=10,label='Benign')
plt.scatter(mn_pred[mxm],std_pred[mxm],color='blue',s=10,label='Malignant')
plt.xlabel('Arithmetic mean')
plt.ylabel('Standard deviation')
plt.legend()
plt.show()

These plots show that there is a wide range in the uncertainty each model has at any given mean. If the points above traced out a parabola with no width, the standard deviation would be of no use: the relationship between mean and standard deviation would be deterministic, so all of the information in the standard deviation would already be captured by the mean, and vice versa. As the relationship is not deterministic, we might be able to gain some leverage from it, and we will explore this.
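
As a rough illustration of why the points sit under a bell-shaped envelope: for values bounded in [0, 1] with mean m, the standard deviation cannot exceed sqrt(m(1-m)). The sketch below checks this on synthetic Beta-distributed "TTA predictions". The synthetic data and the names `fake_tta`, `envelope` and `relative_spread` are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for TTA predictions: 1000 examples x 50 augmentations in [0, 1].
a = rng.uniform(0.2, 5.0, size=(1000, 1))
b = rng.uniform(0.2, 5.0, size=(1000, 1))
fake_tta = rng.beta(a, b, size=(1000, 50))

mean = fake_tta.mean(axis=1)
std = fake_tta.std(axis=1)  # population std (ddof=0), the np.std default

# Upper bound on the std of any set of values in [0, 1] with this mean.
envelope = np.sqrt(mean * (1.0 - mean))

# Fraction of the maximum possible spread that each example actually uses.
relative_spread = std / envelope
print(relative_spread.min(), relative_spread.max())
```

Dividing the standard deviation by this envelope is one possible way to compare uncertainty between examples whose means differ, since it removes the part of the spread that the mean alone dictates.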

I would like to note that I do not think this extra information will be useful on its own, but rather that it will be useful when ensembling multiple different models. The reason can be explained through an example. The model shown above is relatively confident about the points that run along the bottom of the bell. Now imagine we have another model where those same points run along the top of the bell, and where the predictions for each point differ from those of the first model. When blending these models I would like to place more weight on the confident predictions from the former model than on the unconfident predictions of the latter (with some weighting also given to each model's accuracy).

The incorporation of the standard deviation into an ensemble prediction can be done in many ways, and we will explore just one below.
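
To make the idea of confidence-aware blending concrete, here is a minimal sketch of one simple option, inverse-variance weighting. It is not the approach taken later in this notebook (which feeds the standard deviation to a stacking model); the function name `confidence_weighted_blend`, the `eps` smoothing term and the toy numbers are all assumptions for illustration.

```python
import numpy as np

def confidence_weighted_blend(means, stds, eps=1e-6):
    """Blend per-model mean predictions, weighting each by 1 / (std^2 + eps).

    means, stds: arrays of shape (n_examples, n_models).
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    weights = 1.0 / (stds ** 2 + eps)
    weights /= weights.sum(axis=1, keepdims=True)  # normalise per example
    return (weights * means).sum(axis=1)

# Two hypothetical models that disagree on a single example.
means = np.array([[0.9, 0.3]])
stds = np.array([[0.02, 0.20]])
blended = confidence_weighted_blend(means, stds)
print(blended)
```

With equal standard deviations this reduces to the plain average; when they differ, the blend is pulled towards the model whose TTA predictions are more stable.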

## Distribution and error of TTA predictions
The first thing we need to know is how the TTA predictions are distributed, how many samples we need to gather to get accurate statistics on that distribution, and what measures we should use to quantify the uncertainty in this distribution. 

Note that to be safe I have not assumed I have enough examples to do bootstrap sampling, even though I do for low TTA numbers.

In [None]:
nplot = 5

fig, axs = plt.subplots(nplot, nplot,figsize=(10,10))
fig.suptitle("Distribution of TTA predictions per example", fontsize=14)
for i,ax in enumerate(axs.flatten()):
    ax.hist(tta_preds.iloc[i,:],bins=50,color=orange_black[3])
plt.show()

Most of these distributions are skewed, but I do not think that looking at quartiles will provide much more information. The data is always skewed towards either zero or one, and the amount of skew depends on the distance from zero or one, so the information about how the predictions are skewed is already contained in the mean.

The use of quartiles would be interesting to check. For now I will keep things simple and look at the standard deviation.

In [None]:
# Look at the range and standard deviation for the TTA predictions.
ranges = np.max(tta_preds,axis=1)-np.min(tta_preds,axis=1)
iqr = IQR(tta_preds)

fig, axs = plt.subplots(1, 3,figsize=(20,5))
axs[0].hist(ranges,bins=100)
axs[0].set_title('Range')
axs[1].hist(std_pred,bins=100)
axs[1].set_title('Standard deviations')
axs[2].hist(iqr,bins=100)
axs[2].set_title('Interquartile range')
plt.show()

The tails of these distributions are quite long, suggesting that the data points in the tails should be sampled more heavily to get accurate predictions for them. The other data points do not need as many samples for an accurate mean to be calculated. Such a method would be useful for evaluation even if the standard deviation proves to be uninteresting.

Next I want to look at how many TTA predictions I need to get an accurate estimate of the mean and standard deviation for each data point. To do so we look at the standard errors for both of these statistics.
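
For reference, the textbook estimates of these standard errors can be sketched as follows. For n independent TTA predictions with sample standard deviation s, the standard error of the mean is estimated by s/sqrt(n). The standard error of s itself is approximately s/sqrt(2(n-1)), but only under a normality assumption, which the skewed TTA distributions above satisfy only roughly. The helper name `standard_errors` and the synthetic `demo` data are assumptions for illustration.

```python
import numpy as np

def standard_errors(samples):
    """Return (SE of the mean, approximate SE of the std) for each row.

    samples: array of shape (n_examples, n_tta).
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[1]
    s = samples.std(axis=1, ddof=1)       # sample standard deviation
    se_mean = s / np.sqrt(n)              # estimated standard error of the mean
    se_std = s / np.sqrt(2.0 * (n - 1))   # normal-theory approximation
    return se_mean, se_std

rng = np.random.default_rng(0)
demo = rng.normal(0.5, 0.1, size=(3, 11))  # synthetic: 11 TTA predictions per example
se_mean, se_std = standard_errors(demo)
print(se_mean, se_std)
```

The cell below estimates the same quantities empirically by splitting the TTA predictions into groups, which avoids relying on the normality assumption.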

In [None]:
tta_arr = np.array(tta_preds)
n_samp = tta_arr.shape[1]
# mn_pred, std_pred
average_t=[];max_t=[];std_t=[]
error_std=[]
# If you drop the relative part you can run this to n_samp, but that is less informative.
for TTA in range(2,100):
    end = n_samp//TTA  # number of non-overlapping groups of size TTA
    samples = tta_arr[:,:TTA*end].reshape(-1,TTA)
    all_avs = np.mean(samples,axis=1).reshape(len(tta_arr),end)
    if all_avs.shape[1]>10:
        all_stds = np.std(samples,axis=1).reshape(len(tta_arr),end)
        all_e_std = np.std(all_stds,axis=1)
        # error_std+=[np.mean(all_e_std)]
        # Make it the relative standard error of the standard deviations
        error_std+=[np.mean(all_e_std/std_pred)]
    if all_avs.shape[1] > 30:
        # Calculate the standard errors directly.
        std_errors = np.std(all_avs,axis=1)
    else:
        # Estimate the standard error per example and then take the average as the best estimate.
        std_errors = np.mean(all_stds/np.sqrt(TTA),axis=1)
    # Take the relative standard error of the mean
    std_errors = std_errors/mn_pred
    std_t+=[np.std(std_errors)]
    average_t+=[np.mean(std_errors)]
    max_t+=[np.max(std_errors)]
    

In [None]:
fig, axs = plt.subplots(1, 4,figsize=(20,5))
axs[0].plot(average_t)
axs[0].set_title('Mean standard error')
axs[0].set_xlabel('TTA')
axs[1].plot(std_t)
axs[1].set_title('Std of standard errors')
axs[1].set_xlabel('TTA')
axs[2].plot(max_t)
axs[2].set_title('Max standard error')
axs[2].set_xlabel('TTA')
# axs[3].plot([int(n_tta//TTA) for TTA in range(2,len(max_t))])
# axs[3].set_title('N samples')
# axs[3].set_xlabel('TTA')
axs[3].plot(error_std)
axs[3].set_title('Mean standard error of std')
axs[3].set_xlabel('TTA')
plt.show()

The estimate of the mean converges much faster than the estimate of the standard deviation. That leaves the question of whether the standard deviation can be successfully used for training with models that do not have a large number of TTA predictions. An importance sampling algorithm would overcome this issue. Unfortunately I do not have this implemented at present.
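
One possible shape for such a scheme is sketched below. Note that this is plain sequential (adaptive) sampling rather than importance sampling in the strict sense: it keeps drawing TTA predictions only for the examples whose standard error of the mean is still above a tolerance. Everything here is hypothetical, including the `predict_augmented` callable, the `batch`, `max_tta` and `tol` parameters, and the synthetic `fake_predict` model used to exercise it.

```python
import numpy as np

def adaptive_tta(predict_augmented, n_examples, batch=8, max_tta=256, tol=0.01):
    """Draw TTA predictions in batches until each example's SE of the mean < tol.

    predict_augmented(idx, k) is a hypothetical callable returning an array of
    shape (len(idx), k): k fresh augmented predictions for the examples in idx.
    """
    preds = [[] for _ in range(n_examples)]
    active = np.arange(n_examples)
    while active.size:
        new = predict_augmented(active, batch)
        still_active = []
        for row, i in zip(new, active):
            preds[i].extend(row)
            p = np.asarray(preds[i])
            se = p.std(ddof=1) / np.sqrt(p.size)
            # Keep sampling only if still too noisy and under the budget.
            if se >= tol and p.size < max_tta:
                still_active.append(i)
        active = np.array(still_active, dtype=int)
    means = np.array([np.mean(p) for p in preds])
    counts = np.array([len(p) for p in preds])
    return means, counts

# Synthetic stand-in for a model: example i has noise level sigma[i].
rng = np.random.default_rng(0)
sigma = np.array([0.01, 0.05, 0.20])

def fake_predict(idx, k):
    noise = rng.normal(0.5, sigma[idx][:, None], size=(len(idx), k))
    return np.clip(noise, 0, 1)

means, counts = adaptive_tta(fake_predict, n_examples=3)
print(counts)
```

The same stopping rule could be applied to the standard error of the standard deviation instead, at the cost of many more samples per example.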

## Performance of different means
Before moving on to stacking we will look at the use of different means for extracting a prediction from the TTA predictions. The distributions for different TTA are heavily skewed and so the best summary statistic is not obvious.

In [None]:
tta_preds = oof[tta_keys]
mns = dict(
    amean = np.mean(tta_preds,axis=1),
    gmean = scipy.stats.gmean(tta_preds,axis=1),
    median = np.median(tta_preds,axis=1)
)

for mn in mns:
    print('{} AUC = {}'.format(mn,roc_auc_score(oof['target'],mns[mn])))

The median is the best summary here, which is expected as the distributions are skewed. But for small sample sizes (the test set only has 11 TTA predictions) the arithmetic mean is more stable, so we will use that.

To explore the methods of combining the predictions for submission from the different models the best we can do in this case is to look at the private leaderboard scores. As the competition is over we no longer have to worry about overfitting to this and it can be considered a held out test set.

In [None]:
def make_sub(func,name='submission'):
    submission = pd.DataFrame(dict(image_name=test_tta['0'], target=func(preds_all,axis=1)))
    submission.to_csv(name+'.csv', index=False)
    return submission.head()

test_tta = pd.read_csv('../input/tta-exploration-128x128-b0/all_tta_test.csv')
FOLDS=3; n_test_tta=11
preds_all = [np.mean(test_tta[[str(i) for i in range(j*n_test_tta,(j+1)*n_test_tta)]],axis=1) for j in range(FOLDS)]
preds_all = np.vstack(preds_all).T

make_sub(np.mean,name='amean_submission')
make_sub(scipy.stats.gmean,name='gmean_submission')
make_sub(np.median,name='median_submission')

# Blend/Stack models
This is the key part of this notebook: can incorporating the uncertainty of the base models increase the OOF CV score more than blending with linear weights?

In the following we will load the predictions from two models, find the optimal linear blend of their mean predictions, and also train a stacking model (an XGBoost classifier) that takes as input their means and standard deviations.

In [None]:
# Load a new model
n_tta = 11
# n_tta = 1500
FOLDS=3
files = ['tta-exploration-128x128-b0','tta-exploration-128x128-b3']
# files = ['tta-exploration-128x128-b0','tta-exploration-128x128-b3','meta-384-b5-coarse','meta-384-b3']

# Load all of the TTA predictions for the test set.
all_tta_list = [pd.read_csv('../input/'+file+'/all_tta_test.csv') for file in files]
for df in all_tta_list:
    df.columns.values[0] = 'image_name' 
sub_list= [pd.read_csv('../input/'+file+'/submission.csv') for file in files]

# Load the oof df where the TTA predictions are contained in the numbered columns.
tta_keys = [str(i) for i in range(n_tta)]
oof_list = [pd.read_csv('../input/'+file+'/oof.csv') for file in files]

In [None]:
# Add the std as features and then drop the tta keys
def set_right(df,tta_keys):
    df['std']=np.std(df[tta_keys],axis=1)
    return df.drop(tta_keys,axis=1)

oof1 = [set_right(df,tta_keys).set_index('image_name') for df in oof_list]

In [None]:
# Assemble
df_oof = pd.concat(oof1,axis=1)
labels = df_oof['target'].iloc[:,0]

### Visualize the correlation between the two models.

In [None]:
def plot_scatter(ax,x,y,mx1,mx2,title):
    ax.scatter(x[mx1],y[mx1],color='orange',s=10,label='Benign')
    ax.scatter(x[mx2],y[mx2],color='blue',s=10,label='Malignant')
    ax.set_title(title)
    ax.legend()
    
    
fig, axs = plt.subplots(2, 2,figsize=(20,15))
mxb = labels==0
mxm = labels==1
std1 = df_oof['std'].iloc[:,0]; pred1 = df_oof['pred'].iloc[:,0]
std2 = df_oof['std'].iloc[:,1]; pred2 = df_oof['pred'].iloc[:,1]
plot_scatter(axs[0,0],std1,std2,mxb,mxm,'Standard Deviations')
plot_scatter(axs[0,1],pred1,pred2,mxb,mxm,'Predictions')
plot_scatter(axs[1,0],pred1,std2,mxb,mxm,'Prediction 1 vs Standard Deviation 2')
plot_scatter(axs[1,1],std1,pred2,mxb,mxm,'Standard Deviation 1 vs Prediction 2')
fig.show()

In [None]:
corr_disp = df_oof[['pred','std']].copy()
corr_disp.columns = ['p1','p2','s1','s2']

corr=corr_disp.astype(float).corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

In [None]:
# Look at the model spaces when including std as a feature with tsne

model = cuml.TSNE()
embed2D = model.fit_transform(df_oof[['pred','std']])

fig, axs = plt.subplots(1,2,figsize=(20,8))
mxb = labels==0
mxm = labels==1
plot_scatter(axs[0],embed2D[:,0],embed2D[:,1],mxb,mxm,'Predictions and Uncertainty')

# Repeat the t-SNE embedding using the predictions only (no std features)

model = cuml.TSNE()
embed2D = model.fit_transform(df_oof[['pred']])
plot_scatter(axs[1],embed2D[:,0],embed2D[:,1],mxb,mxm,'Predictions Only')

The clustering in both examples looks approximately the same, so we cannot say much about which will be better without fitting a model and checking.

## Linear Blend

In [None]:
# Find the best weights

roc=[]
WGTS=np.linspace(0,1,11)
for wgt in WGTS:
    pred = wgt*df_oof['pred'].iloc[:,0] + (1-wgt)*df_oof['pred'].iloc[:,1]
    roc+=[roc_auc_score(labels,pred)]
best_roc = max(roc)
loc = int(np.argmax(roc))  # index of the first best weight; roc is a plain list
best_weight = WGTS[loc]
print('The best roc = {:.6f} with weights : ({},{})'.format( best_roc, best_weight, (1-best_weight)) )

## Stack Models 

In [None]:
from xgboost import XGBClassifier

In [None]:
ids = df_oof.index
fold_finder = df_oof['fold'].iloc[:,0]

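def get_oof(DATA,LABELS,folds=10,PRINT=0,SEED=42):
    # Work on numpy arrays so the integer fold indices below select rows by position.
    DATA = np.asarray(DATA); LABELS = np.asarray(LABELS)
    # Use the `folds` argument rather than the global FOLDS.
    skf = StratifiedKFold(n_splits=folds,shuffle=True,random_state=SEED)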
    oof_pred = []; oof_tar = []; oof_val = []; oof_names = []; oof_folds = [] 

    for fold,(idxT,idxV) in enumerate(skf.split(DATA,LABELS)):
    #  The below will use the same splits as the original data, but using larger folds reduces the dependence on SEED.
    # for fold in range(FOLDS):
    #     idxT = fold_finder!=fold; idxV = fold_finder==fold;

        X = DATA[idxT]; y = LABELS[idxT]
        X_val = DATA[idxV]; y_val = LABELS[idxV]

        # Train the model
        # Note that the optimal parameters for this model were found on the data that did not include the standard
        # deviation
        model = XGBClassifier(  objective='binary:logistic',
                            seed=0,  
                            nthread=-1, 
                            learning_rate=0.01, 
                            n_estimators=1200,
                            scale_pos_weight=1,
                            min_child_weight=1,
                            max_depth=2,
                            subsample=0.7,
                            colsample_bytree=0.9,
                            gamma=5,
                            random_state=42,
                         )
        model.fit(X,y)

    #     model.load_weights('fold-%i.h5'%fold)

        # PREDICT OOF WITH THE STACKED MODEL
        pred = model.predict_proba( X_val )[:,1]
        oof_pred.append( pred )                

        # GET OOF TARGETS AND NAMES
        oof_tar.append( y_val )
        oof_names.append( ids[idxV] )
        oof_folds.append( np.ones_like(oof_tar[-1],dtype='int8')*fold )

        # REPORT RESULTS
        auc = roc_auc_score(oof_tar[-1],oof_pred[-1])
        if PRINT:
            print('#### FOLD %i OOF AUC = %.3f'%(fold+1,auc))


    # COMPUTE OVERALL OOF AUC
    oof = np.concatenate(oof_pred); true = np.concatenate(oof_tar);
    names = np.concatenate(oof_names); folds = np.concatenate(oof_folds)
    auc = roc_auc_score(true,oof)
    print('Overall stacked OOF AUC = %.5f'%auc)

In [None]:
# model.feature_importances_
FOLDS = 20
print('Including confidence estimate:')
get_oof(np.concatenate((df_oof['pred'],df_oof['std']),axis=1),labels,folds=FOLDS)
print('\nOnly predictions:')
get_oof(np.array(df_oof['pred']),labels,folds=FOLDS)

So we can see that including the standard deviation provides a slight improvement in the OOF AUC score when stacking, but it does not beat straightforward blending. The stacking and blending scores cannot be properly explored without further experiments. However, this idea can be explored more robustly in less expensive settings where statistics can be gathered easily and tests can be run where more models are included (the effect of including the standard deviation could grow or vanish as more models are added to the stacking). 

For now, this is a moderate success that implies it is worth pursuing further.