https://mlwave.com/kaggle-ensembling-guide/

https://github.com/MLWave/Kaggle-Ensemble-Guide

https://www.quora.com/What-are-the-differences-between-the-three-commonly-ensemble-learning-techniques-stacking-boosting-and-bagging

In [1]:
import os
import numpy as np 
import pandas as pd 

#### Load data

In [4]:
path = r"C:\Users\piush\Desktop\iceberg_submissions"

# keep only CSV files, in a stable (sorted) order
all_files = sorted(f for f in os.listdir(path) if f.endswith(".csv"))

# Read and concatenate submissions
outs = [pd.read_csv(os.path.join(path, f), index_col=0) for f in all_files]
concat_sub = pd.concat(outs, axis=1)
cols = ["is_iceberg_" + str(i) for i in range(len(concat_sub.columns))]
concat_sub.columns = cols
concat_sub.reset_index(inplace=True)
concat_sub.head()

Unnamed: 0,id,is_iceberg_0,is_iceberg_1,is_iceberg_2,is_iceberg_3,is_iceberg_4
0,5941774d,0.01943216,0.094676,0.008171,0.010215,0.005586
1,4023181e,0.03168809,0.952222,0.638348,0.237662,0.145927
2,b20200e4,4e-08,0.167771,0.008061,1.0,1.5e-05
3,e7f018bb,0.9925741,0.989356,0.999506,0.999533,0.999914
4,4371c8c3,0.02215107,0.900321,0.774739,0.994269,0.033843


In [5]:
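# check correlation between the five base-model predictions
# (select the prediction columns only; 'id' is a string column)
concat_sub.iloc[:, 1:6].corr()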

Unnamed: 0,is_iceberg_0,is_iceberg_1,is_iceberg_2,is_iceberg_3,is_iceberg_4
is_iceberg_0,1.0,0.420791,0.80932,0.51628,0.890434
is_iceberg_1,0.420791,1.0,0.489134,0.432111,0.493308
is_iceberg_2,0.80932,0.489134,1.0,0.490128,0.789509
is_iceberg_3,0.51628,0.432111,0.490128,1.0,0.548663
is_iceberg_4,0.890434,0.493308,0.789509,0.548663,1.0


In [6]:
# get the data fields ready for stacking
concat_sub['is_iceberg_max'] = concat_sub.iloc[:, 1:6].max(axis=1)
concat_sub['is_iceberg_min'] = concat_sub.iloc[:, 1:6].min(axis=1)
concat_sub['is_iceberg_mean'] = concat_sub.iloc[:, 1:6].mean(axis=1)
concat_sub['is_iceberg_median'] = concat_sub.iloc[:, 1:6].median(axis=1)
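
The `iloc[:, 1:6]` slice above hard-codes five submissions. A possible variation, sketched here on a toy frame, selects the prediction columns by name so the same cell keeps working if files are added or removed. The `pred_cols` name and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the same layout as concat_sub: an id plus one column per model.
toy = pd.DataFrame({
    "id": ["a", "b"],
    "is_iceberg_0": [0.1, 0.9],
    "is_iceberg_1": [0.3, 0.8],
    "is_iceberg_2": [0.2, 1.0],
})

# Pick up the base-model columns by name (prefix followed by a number).
pred_cols = [c for c in toy.columns
             if c.startswith("is_iceberg_") and c.rsplit("_", 1)[1].isdigit()]

# Row-wise statistics over the base-model predictions only.
toy["is_iceberg_max"] = toy[pred_cols].max(axis=1)
toy["is_iceberg_min"] = toy[pred_cols].min(axis=1)
toy["is_iceberg_mean"] = toy[pred_cols].mean(axis=1)
toy["is_iceberg_median"] = toy[pred_cols].median(axis=1)
print(toy)
```

Because `pred_cols` is computed before the derived columns are added, the statistics never feed back into themselves.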

In [7]:
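# Cutoff thresholds, easy to tweak.
# Mind the naming: cutoff_lo is the bound that ALL models must exceed to count as "iceberg",
# cutoff_hi is the bound that ALL models must fall below to count as "not iceberg".
cutoff_lo = 0.8
cutoff_hi = 0.2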

##### Mean Stacking

In [8]:
concat_sub['is_iceberg'] = concat_sub['is_iceberg_mean']
concat_sub[['id', 'is_iceberg']].to_csv('stack_mean.csv', 
                                        index=False, float_format='%.6f')

###### LB 0.1698, a decent first try, but there is still some gap compared with the top-line model in the stack.



##### Median Stacking

In [9]:
concat_sub['is_iceberg'] = concat_sub['is_iceberg_median']
concat_sub[['id', 'is_iceberg']].to_csv('stack_median.csv', 
                                        index=False, float_format='%.6f')

###### LB 0.1575, very close to our top-line model performance, but we want to see at least some improvement.
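
One plausible reason the median does better than the mean here is that a single overconfident model cannot drag it around. The row for id `b20200e4` in the `head()` output above shows the effect; the sketch below just recomputes both statistics for that one row.

```python
import numpy as np

# The five base-model predictions for id b20200e4, copied from the head() output.
row = np.array([4e-08, 0.167771, 0.008061, 1.0, 1.5e-05])

row_mean = row.mean()        # pulled up to about 0.235 by the single 1.0
row_median = np.median(row)  # stays at 0.008061, the middle value

print(f"mean={row_mean:.6f} median={row_median:.6f}")
```

Four of the five models say "almost certainly not an iceberg", and only the median reflects that.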

##### PushOut + Median Stacking
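The PushOut strategy is a bit aggressive given what it does: when every model is above the upper cutoff (or below the lower one), the prediction is pushed all the way to 1 (or 0).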

In [10]:
concat_sub['is_iceberg'] = np.where(np.all(concat_sub.iloc[:,1:6] > cutoff_lo, axis=1), 1, 
                                    np.where(np.all(concat_sub.iloc[:,1:6] < cutoff_hi, axis=1),
                                             0, concat_sub['is_iceberg_median']))
concat_sub[['id', 'is_iceberg']].to_csv('stack_pushout_median.csv', 
                                        index=False, float_format='%.6f')

###### LB 0.1940, not very impressive results given the base models in the pipeline...
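
A rough way to see why PushOut can hurt, assuming the leaderboard metric is binary log loss with predictions clipped to a small `eps` (the notebook does not state the metric, so treat this as an assumption). The helper `log_loss_single` is hypothetical and only meant for this illustration.

```python
import numpy as np

def log_loss_single(y_true, p, eps=1e-15):
    """Binary log loss for one prediction, with p clipped to [eps, 1 - eps]."""
    p = float(np.clip(p, eps, 1 - eps))
    return float(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# All models agree on "iceberg", but the true label is 0.
loss_median_wrong = log_loss_single(0, 0.9)  # about 2.3
loss_pushed_wrong = log_loss_single(0, 1.0)  # about 34.5 after clipping

# Same predictions when the true label really is 1.
loss_median_right = log_loss_single(1, 0.9)  # about 0.105
loss_pushed_right = log_loss_single(1, 1.0)  # essentially 0

print(loss_median_wrong, loss_pushed_wrong)
print(loss_median_right, loss_pushed_right)
```

Pushing to 1 saves about 0.1 per correct row and costs about 32 extra per wrong row under this clipping value, so the unanimous rows would have to be wrong well under 1% of the time for PushOut to pay off.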

##### MinMax + Mean Stacking
MinMax seems gentler: where all models agree, it takes the most extreme model prediction (max or min) instead of forcing 1 or 0, and it outperforms PushOut given its performance score.

In [11]:
concat_sub['is_iceberg'] = np.where(np.all(concat_sub.iloc[:,1:6] > cutoff_lo, axis=1), 
                                    concat_sub['is_iceberg_max'], 
                                    np.where(np.all(concat_sub.iloc[:,1:6] < cutoff_hi, axis=1),
                                             concat_sub['is_iceberg_min'], 
                                             concat_sub['is_iceberg_mean']))
concat_sub[['id', 'is_iceberg']].to_csv('stack_minmax_mean.csv', 
                                        index=False, float_format='%.6f')

###### LB 0.1622. We need to combine MinMax with the median to see how that compares.
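
The MinMax cells differ only in the fallback used when the models disagree, so the rule can be wrapped in one function. This is a sketch on toy data, not the notebook's code: `minmax_stack` and its `hi`/`lo` parameters are hypothetical names, and they are named by meaning (`hi=0.8`, `lo=0.2`), which is the reverse of `cutoff_lo`/`cutoff_hi` above.

```python
import numpy as np
import pandas as pd

def minmax_stack(preds, fallback, hi=0.8, lo=0.2):
    """Row-wise MinMax rule.

    preds    : DataFrame with one column of probabilities per base model
    fallback : Series used wherever the models do not all agree
    """
    all_high = (preds > hi).all(axis=1)  # every model says "iceberg"
    all_low = (preds < lo).all(axis=1)   # every model says "not iceberg"
    out = np.where(all_high, preds.max(axis=1),
                   np.where(all_low, preds.min(axis=1), fallback))
    return pd.Series(out, index=preds.index, name="is_iceberg")

# Toy predictions from three hypothetical models.
toy = pd.DataFrame({
    "m0": [0.95, 0.05, 0.70],
    "m1": [0.85, 0.10, 0.30],
    "m2": [0.99, 0.15, 0.90],
})

# Row 0: all above 0.8, take the max. Row 1: all below 0.2, take the min.
# Row 2: the models disagree, so the median fallback is used.
result = minmax_stack(toy, fallback=toy.median(axis=1))
print(result)
```

Passing `toy.mean(axis=1)` or a best-base column as `fallback` gives the other two variants without copying the nested `np.where`.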

##### MinMax + Median Stacking

In [12]:
concat_sub['is_iceberg'] = np.where(np.all(concat_sub.iloc[:,1:6] > cutoff_lo, axis=1), 
                                    concat_sub['is_iceberg_max'], 
                                    np.where(np.all(concat_sub.iloc[:,1:6] < cutoff_hi, axis=1),
                                             concat_sub['is_iceberg_min'], 
                                             concat_sub['is_iceberg_median']))
concat_sub[['id', 'is_iceberg']].to_csv('stack_minmax_median.csv', 
                                        index=False, float_format='%.6f')

###### LB 0.1488 - Great! This is an improvement over our top-line model performance (LB 0.1538). But can we do better?

##### MinMax + BestBase Stacking

In [13]:
# load the submission from the model with the best base performance
sub_base = pd.read_csv(os.path.join(path, "sub_200_ens_densenet.csv"))

In [14]:
# assigned by row position: this assumes sub_base lists the ids in the same order as concat_sub
concat_sub['is_iceberg_base'] = sub_base['is_iceberg']
concat_sub['is_iceberg'] = np.where(np.all(concat_sub.iloc[:,1:6] > cutoff_lo, axis=1), 
                                    concat_sub['is_iceberg_max'], 
                                    np.where(np.all(concat_sub.iloc[:,1:6] < cutoff_hi, axis=1),
                                             concat_sub['is_iceberg_min'], 
                                             concat_sub['is_iceberg_base']))
concat_sub[['id', 'is_iceberg']].to_csv('stack_minmax_bestbase.csv', 
                                        index=False, float_format='%.6f')

###### LB 0.1463 - Yes! This is a decent score given that none of the models in our ensemble pipeline has achieved better. I am sure there are cleverer ways to boost the score further, so I will keep updating, or just leave it to other Kagglers to discover!
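
One caveat on In [14]: the best-base column is attached by row position, which only works if both frames list the ids in the same order. A more defensive option, sketched here on toy frames with made-up values, is to join on `id` and let pandas check that the ids are unique on both sides.

```python
import pandas as pd

# Toy stand-ins for concat_sub and sub_base; note the different row order.
stack = pd.DataFrame({"id": ["a", "b", "c"],
                      "is_iceberg_median": [0.2, 0.5, 0.9]})
best = pd.DataFrame({"id": ["c", "a", "b"],
                     "is_iceberg": [0.95, 0.10, 0.60]})

# Rename first so the column does not clash with the stack's own output column.
best = best.rename(columns={"is_iceberg": "is_iceberg_base"})

# Left join on id keeps the stack's row order; validate raises on duplicate ids.
merged = stack.merge(best, on="id", how="left", validate="one_to_one")
print(merged)
```

If an id were missing from the best-base file, the left join would leave a NaN in `is_iceberg_base`, which is easier to spot than a silently misaligned row.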

P.S. As I worked through this, I came to feel strongly that building a strong and robust model is always the key component; stacking only comes last, with the promise to surprise, sometimes in an unpleasant direction.