# Check your OOF predictions before Stacking


As we approach the end of the competition, "stacking" is becoming a hot topic.

From the public kernels and discussions so far, I understand that stacking has not shown any significant improvement. My early attempt wasn't successful either, so I decided to dig a bit deeper.

This immediately reminded me of a very clever kernel by @ogrellier (Olivier) from the Toxic Comment Classification Challenge, in which he investigates distribution differences between the full OOF, the per-fold OOF and the test set predictions.

See here: [things-you-need-to-be-aware-of-before-stacking](https://www.kaggle.com/ogrellier/things-you-need-to-be-aware-of-before-stacking/notebook)  


So I decided to do the same for MoA and publish a kernel with my findings, to stimulate discussion and get your feedback on this topic.
As Olivier said in his notebook:
>  I believe we have to tackle this issue before successfully stacking models.

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:lightblue; border:0; color:black' role="tab" aria-controls="home"><center>Table of Contents</center></h2>

- [Reading OOF Predictions](#2)     
- [OOF vs Test distributions](#3)
- [F1-scores vs Probability thresholds](#4)
- [Conclusions](#5)
- [References](#6)

In [None]:
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings("ignore")

import os, sys
sys.path.append('../input/iterative-stratification/iterative-stratification-master')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

from sklearn.metrics import roc_auc_score, roc_curve, log_loss, auc
from sklearn.model_selection import KFold

import matplotlib.pyplot as plt
import seaborn as sns

import bokeh
from bokeh.io import output_file, show
from bokeh.layouts import column, gridplot
from bokeh.plotting import figure
from bokeh.palettes import brewer
from bokeh.models.tools import HoverTool
bokeh.io.output_notebook()


In [None]:
# Function to calculate the mean log loss of the targets including clipping

def mean_log_loss(y_true, y_pred):
    # Clip to avoid log(0), then average the per-target binary log loss
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    metrics = []
    for target in range(y_true.shape[1]):  # 206 scored targets in MoA
        metrics.append(log_loss(y_true[:, target], y_pred[:, target], labels=[0,1]))
    return np.mean(metrics)

In [None]:
FILTER = False # True

In [None]:
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
# test_features = pd.read_csv('../input/lish-moa/test_features.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
# train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
# train_drug = pd.read_csv('../input/lish-moa/train_drug.csv')
submission = pd.read_csv('../input/lish-moa/sample_submission.csv')

if FILTER:
    # Drop control (ctl_vehicle) samples to align with the training pipeline that produced the OOF
    train_features = train_features.loc[train_features.cp_type!='ctl_vehicle']
    train_targets_scored = train_targets_scored.iloc[train_features.index]

    train_features = train_features.reset_index(drop=True)
    train_targets_scored = train_targets_scored.reset_index(drop=True)

train_targets_scored = train_targets_scored.set_index('sig_id')    
# make labels 
class_true = submission.columns[1:]
class_preds = [c_ + "_oof" for c_ in class_true]

<a id="2"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:lightblue; border:0; color:black' role="tab" aria-controls="home"><center> Read OOF predictions </center></h2>


Note: For this demo I've used a single OOF/submission pair from my own experiments (private dataset). In the next version(s) I'll try some of the public kernels too.
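
Because the next cell attaches `sig_id` to the OOF rows purely by position, a quick shape and order check can catch a mismatched file early (for example, an OOF file written with control samples removed while `FILTER = False`). The following is only a minimal sketch, not part of the original pipeline: the helper name `check_oof_alignment` and the toy frames are hypothetical, and it assumes the OOF file may optionally carry a `sig_id` column.

```python
import numpy as np
import pandas as pd

def check_oof_alignment(oof_df, train_ids, n_targets):
    """Raise if the OOF frame cannot be aligned with the training rows by position."""
    train_ids = np.asarray(train_ids)
    if len(oof_df) != len(train_ids):
        raise ValueError(f"OOF has {len(oof_df)} rows, train has {len(train_ids)}")
    if "sig_id" in oof_df.columns:
        # If ids were saved with the OOF file, their order must match exactly
        if not np.array_equal(oof_df["sig_id"].values, train_ids):
            raise ValueError("OOF sig_id order differs from train_features")
        oof_df = oof_df.drop(columns="sig_id")
    if oof_df.shape[1] != n_targets:
        raise ValueError(f"expected {n_targets} target columns, got {oof_df.shape[1]}")
    return oof_df

# Toy data standing in for the real files (hypothetical ids and values)
ids = ["id_a", "id_b", "id_c"]
toy_oof = pd.DataFrame({"sig_id": ids, "t1": [0.1, 0.2, 0.3], "t2": [0.0, 0.5, 0.9]})
aligned = check_oof_alignment(toy_oof, ids, n_targets=2)
print(aligned.shape)
```

In this notebook the call would presumably look like `check_oof_alignment(oof, train_features.sig_id.values, n_targets=len(class_true))`, run right after reading the OOF file.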

In [None]:
DIR_OOF = '../input/moa-subs-oof/'

# Read oof data 
oof = pd.read_csv(DIR_OOF +"10-oof.csv",  ) # index_col='sig_id'
oof.index = train_features.sig_id.values
# oof = oof.iloc[:, 1:]
oof.columns = class_preds

# Read submission/test data 
sub = pd.read_csv(DIR_OOF +"10-submission.csv")

print('OOF loaded - CV score (with ctl):', np.round(mean_log_loss(train_targets_scored[class_true].values, oof.values), 7)) 

In [None]:
# merge OOF preds with targets
oof = oof.merge(train_targets_scored, left_index=True, right_index=True)

# sanity check
oof[class_true].shape, oof[class_preds].shape

### Create your CV folds as in the training pipeline where you produced the OOF/Sub files
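
The per-fold curves below are only meaningful if `val_idx` reproduces the validation rows of the run that wrote the OOF file: same splitter, same `n_splits`, same `random_state` and same row order. Here is a small sketch of that dependence, using scikit-learn's plain `KFold` on toy data as a stand-in for `MultilabelStratifiedKFold` (an assumption made only to keep the example self-contained).

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # toy feature matrix with 100 rows

def val_indices(seed):
    """Return the validation indices of each fold for a given seed."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [val_idx for _, val_idx in kf.split(X)]

same_a = val_indices(34)
same_b = val_indices(34)
other = val_indices(35)

# Same seed: the validation folds are reproduced exactly
same_seed_match = all(np.array_equal(a, b) for a, b in zip(same_a, same_b))
# Different seed: the folds no longer line up with the original run
other_seed_match = all(np.array_equal(a, b) for a, b in zip(same_a, other))
print(same_seed_match, other_seed_match)
```

If the original split cannot be reproduced, saving a fold id next to each OOF row at training time would be a safer alternative.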

In [None]:
# create your CV folds as in the training pipeline where you produced OOF/Sub files

skf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=34)

<a id="3"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:lightblue; border:0; color:black' role="tab" aria-controls="home"><center> OOF vs Test distributions (per target) </center></h2>


> The problem now is we don't know if there are even further differences between OOF probabilities and Test predictions. If this is the case this would undermine any stacking attempt.

Let's see whether the probability distributions themselves show anything interesting.

In [None]:

figures = []
for i_class, class_name in enumerate(class_true):
    
    s = figure(plot_width=700, plot_height=300, title=f"Probability logits [Target: {class_name}], [ID:{i_class}]")
    # per fold
    for n_fold, (_, val_idx) in enumerate(skf.split(train_features, train_targets_scored)):
        probas = oof[class_preds[i_class]].values[val_idx]
        p_log = np.log((probas + 1e-5) / (1 - probas + 1e-5))
        hist, edges = np.histogram(p_log, density=True, bins=50)
        s.line(edges[:50], hist, legend_label="Fold %d" % n_fold, color=brewer["Set1"][7][n_fold])
    # all oof
    oof_probas = oof[class_preds[i_class]].values
    oof_logit = np.log((oof_probas + 1e-5) / (1 - oof_probas + 1e-5))
    hist, edges = np.histogram(oof_logit, density=True, bins=50)
    s.line(edges[:50], hist, legend_label="Full OOF", color=brewer["Paired"][6][1], line_width=3)
    # test 
    sub_probas = sub[class_name].values
    sub_logit = np.log((sub_probas + 1e-5) / (1 - sub_probas + 1e-5))
    hist, edges = np.histogram(sub_logit, density=True, bins=50)
    s.line(edges[:50], hist, legend_label="Test", color=brewer["Paired"][6][5], line_width=3)
    # fig specs
    s.legend.location = 'top_left'
#     s.legend.click_policy = 'hide'
    s.add_tools(HoverTool(tooltips=[('(x, y)', '(@x, @y)')]))
    figures.append(s)

# put the results in a column and show
show(column(figures))

As you can see, the OOF and Test distributions differ for some of the targets (e.g. IDs 4, 5, 10, 20, 30, 71, 90, 94 and 97).

Other OOF/Test pairs have shown:

- spikes at certain values (in the left tail at `~ -10.0`, as here, and in the middle, e.g. at `~ -7.0`)
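
One possible source of such spikes, offered here as a hypothesis rather than a finding of this notebook, is clipping of predictions in the training pipeline: every clipped value lands on the same logit. The sketch below applies the same smoothed logit as the plotting cell to toy probabilities. The clip bounds `1e-3` and `1 - 1e-3` and the Beta-distributed toy data are assumptions chosen for illustration only.

```python
import numpy as np

def to_logit(p, eps=1e-5):
    """Same smoothed logit as in the plotting cell."""
    return np.log((p + eps) / (1 - p + eps))

rng = np.random.default_rng(0)
raw = rng.beta(0.5, 50, size=10_000)     # toy probabilities, mostly near zero
clipped = np.clip(raw, 1e-3, 1 - 1e-3)   # assumed clip bounds

floor_logit = to_logit(np.array([1e-3]))[0]
share_at_floor = np.mean(clipped == 1e-3)

print(f"logit of the lower clip bound: {floor_logit:.2f}")
print(f"share of predictions sitting exactly on it: {share_at_floor:.1%}")
```

If your pipeline clips, checking that OOF and test predictions were clipped with the same bounds is a cheap first test.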

> At least, this should give us a way to check if what we do in OOF will translate to test probabilities... or not!
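
With 206 targets, eyeballing every plot is tedious. One way to rank targets by OOF/Test mismatch is a two-sample Kolmogorov-Smirnov statistic on the logits. This is a swapped-in summary, not something the analysis above computes, and the function names and toy data are hypothetical. `scipy.stats.ks_2samp` should return the same statistic together with a p-value.

```python
import numpy as np

def to_logit(p, eps=1e-5):
    """Same smoothed logit as in the plotting cell."""
    return np.log((p + eps) / (1 - p + eps))

def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

def rank_targets_by_shift(oof_probs, test_probs):
    """Return target indices sorted from most to least shifted, plus the scores."""
    scores = np.array([
        ks_statistic(to_logit(oof_probs[:, j]), to_logit(test_probs[:, j]))
        for j in range(oof_probs.shape[1])
    ])
    return np.argsort(-scores), scores

# Toy data: three targets, with an artificial shift on target 1
rng = np.random.default_rng(0)
oof_probs = rng.beta(1, 20, size=(2000, 3))
test_probs = rng.beta(1, 20, size=(1000, 3))
test_probs[:, 1] = rng.beta(1, 60, size=1000)

order, scores = rank_targets_by_shift(oof_probs, test_probs)
print(order[0], np.round(scores, 3))
```

In this notebook the inputs would presumably be `oof[class_preds].values` and `sub[class_true].values`. The statistic only flags a difference in distribution; it does not say whether the OOF or the test predictions are the ones that are off.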



<a id="4"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:lightblue; border:0; color:black' role="tab" aria-controls="home"><center> F1-scores vs probability thresholds  </center></h2>

Next, we plot the F1 score against the probability threshold for each target, one line per fold. Unfold the cell output to see the plots.

Warning: there are a lot of them and they look a bit ugly :-)
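
For reference, the F1 score at a given threshold is the harmonic mean of precision, TP / (TP + FP), and recall, TP / (TP + FN). A minimal NumPy sketch of a threshold sweep computed directly from those counts follows. The function name `f1_by_threshold` and the toy labels are hypothetical.

```python
import numpy as np

def f1_by_threshold(y_true, y_prob, thresholds):
    """F1 score per threshold, predicting positive when y_prob >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob)
    scores = []
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(pred & y_true)    # predicted positive, actually positive
        fp = np.sum(pred & ~y_true)   # predicted positive, actually negative
        fn = np.sum(~pred & y_true)   # predicted negative, actually positive
        denom = 2 * tp + fp + fn      # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return np.array(scores)

# Toy example: 3 positives among 8 samples
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
thresholds = np.array([0.25, 0.50, 0.80])
f1 = f1_by_threshold(y_true, y_prob, thresholds)
print(np.round(f1, 3))
```

Because precision depends on the ratio of positives to negatives, rare targets such as most MoA labels tend to reach much lower F1 values than balanced ones at the same ROC AUC.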

In [None]:

figures = []
for i_class, class_name in enumerate(class_true):
    # create a new plot for current class
    # Compute the ROC AUC on the full OOF (shown in the title for reference)
    full = roc_auc_score(oof[class_true[i_class]], oof[class_preds[i_class]])
    
    s = figure(plot_width=750, plot_height=280, title="F1 score vs threshold [Target: %s]: full OOF AUC: %.6f" % (class_name, full))
    
    for n_fold, (_, val_idx) in enumerate(skf.split(train_features, train_targets_scored)):
        y_val = oof[class_true[i_class]].iloc[val_idx]
        # Get false positive rates, true positive rates and the thresholds used to compute them
        fpr, tpr, thresholds = roc_curve(y_val, oof[class_preds[i_class]].iloc[val_idx])
        # Convert rates back to counts: precision = TP / (TP + FP) depends on the class balance,
        # so tpr / (fpr + tpr) would only be correct for equally many positives and negatives
        n_pos = y_val.sum()
        n_neg = len(y_val) - n_pos
        tp, fp = tpr * n_pos, fpr * n_neg
        # Compute recall, precision and f1_score
        recall = tpr
        precision = tp / (tp + fp + 1e-5)
        f1_scores = 2 * precision * recall / (precision + recall + 1e-5)
        # Finally plot the f1_scores against thresholds
        s.line(thresholds, f1_scores, legend_label="Fold %d" % n_fold, color=brewer["Set1"][7][n_fold])
    # Figure specs: set once per figure instead of once per fold
    s.legend.location = 'top_right'
    s.legend.click_policy = 'hide'
    s.add_tools(HoverTool(tooltips=[('(x, y)', '(@x, @y)')]))
    figures.append(s)

# put the results in a column and show
show(column(figures))

<a id="5"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:lightblue; border:0; color:black' role="tab" aria-controls="home"><center> Conclusion </center></h2>


> As a conclusion, I would say that OOF probabilities need to be aligned before any stacking. 

* I'd suggest checking yours too! Fork the kernel and replace the OOF/Sub pair with your own to start the analysis.

* I'd appreciate any feedback or suggestions for improvement, and of course any further insights and explanations.

<a id="6"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:lightblue; border:0; color:black' role="tab" aria-controls="home"><center> References/Credits </center></h2>


* [1] [things-you-need-to-be-aware-of-before-stacking](https://www.kaggle.com/ogrellier/things-you-need-to-be-aware-of-before-stacking/notebook) by Olivier (@ogrellier), Toxic Comment Classification Challenge