# Avoid overfitting with a tiny sliver of training data

Inspired by the Kaggle Don’t Overfit Challenge: Tiny Training Trial. The challenge: build the best-performing model you can with a <5% training vs. >95% test split, using TF-IDF encodings on an Amazon multi-class classification problem. With so many data-hungry algorithms out there that take days or more to train, we thought it would be refreshing to go the other way and see what can be done with extremely small and noisy datasets, iterating and experimenting with training times on the order of seconds. Our split is:

Train: 1244 points

Approach overview

• Build an ensemble that includes multiple model categories: Logistic Regression, Random Forests, XGBoost, AdaBoost, and Neural Networks.

• Split the training dataset into K stratified folds. For each fold and model category, train a separate model using Grid Search.

• Combine all models into an ensemble using Averaging.
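
The three steps above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: the dataset comes from `make_classification`, only Logistic Regression is used, Grid Search is omitted, and names such as `fold_models` and `avg_proba` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a small training set and a much larger test set.
X_all, y_all = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=3, random_state=0)
X_train, y_train = X_all[:200], y_all[:200]
X_test = X_all[200:]

# One model per stratified fold, each fitted on that fold's training part.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_models = []
for train_idx, _ in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_models.append(model)

# Averaging: mean of the class-probability matrices, then argmax per row.
avg_proba = np.mean([m.predict_proba(X_test) for m in fold_models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print(avg_proba.shape, ensemble_pred[:10])
```

Each fold model sees a slightly different subset of the training rows, so averaging their probabilities smooths out some of the variance that any single fit would have on such a small sample.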

I experimented with:

    1. Which model categories to include in the ensemble
    2. How many stratified folds to use: 1, 5, 10, 20, 40
    3. How to build the ensemble: Averaging vs. Max voting
    4. Oversampling techniques such as SMOTE and ADASYN: including models trained with SMOTE data in the ensemble helped on the Public leaderboard, but not on the Private one
    5. Feature standardization: did not seem to improve anything.
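
To make the Averaging vs. Max voting distinction in item 3 concrete, here is a toy case in which the two rules disagree. The probabilities are hypothetical values chosen for illustration, not outputs of the models in this notebook.

```python
import numpy as np

# Hypothetical class probabilities from three models for one sample (3 classes).
probas = np.array([
    [0.40, 0.35, 0.25],   # model A: narrowly prefers class 0
    [0.45, 0.40, 0.15],   # model B: narrowly prefers class 0
    [0.05, 0.90, 0.05],   # model C: strongly prefers class 1
])

# Averaging: mean probability per class, then pick the largest.
avg_choice = int(probas.mean(axis=0).argmax())

# Max voting: each model casts one vote for its top class; the majority wins.
votes = probas.argmax(axis=1)
vote_choice = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

print(avg_choice, vote_choice)  # prints "1 0"
```

Averaging keeps each model's confidence, so one very confident model can outweigh two lukewarm ones, while Max voting discards that information. The example only shows that the rules can differ; it does not show which one is better on a given dataset.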

Lessons Learned

    Ensembling is the way to go, of course.
    Increasing the number of stratified folds improved performance.
    Improvements in validation accuracy on the training data did not necessarily translate to better accuracy on the Public dataset. A prime example was LR, which had lower validation accuracy than other methods such as NN. However, LR was an integral part of the overall ensemble; whenever we removed it, the Public dataset accuracy ended up much worse.
    Ensembling using Averaging always worked better than Max voting.
    We somewhat 'overfitted' to the Public Leaderboard, i.e., our best-performing model on Public was not the best on Private.
    Adding models trained with oversampled data, using either SMOTE or ADASYN, decreased accuracy on the Private dataset.
    Gini impurity appeared to work better than Entropy for tree-based models.
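
For reference, the two split criteria compared in the last lesson can be written out directly. The helper names `gini` and `entropy` below are illustrative; the functions follow the standard textbook definitions and are not taken from scikit-learn's internals.

```python
import math

def gini(p):
    """Gini impurity of a class distribution p: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(q * q for q in p)

def entropy(p):
    """Shannon entropy of p in bits: -sum(p_k * log2(p_k)), skipping zeros."""
    return -sum(q * math.log2(q) for q in p if q > 0)

pure = [1.0, 0.0, 0.0]          # node containing a single class
uniform = [1 / 3, 1 / 3, 1 / 3] # node with three equally likely classes

print(gini(pure), gini(uniform))
print(entropy(pure), entropy(uniform))
```

Both measures are zero for a pure node and largest for a uniform one, so they usually rank candidate splits similarly. The difference reported above is an empirical observation on this dataset.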

In [1]:
import pandas as pd, numpy as np, time, sys, h5py
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from keras.layers import Input, Dense , Dropout , TimeDistributed , LSTM , GRU, concatenate, BatchNormalization
from keras.models import Model
from keras.optimizers import SGD , Adadelta, RMSprop, Adam, Adamax
from keras.models import  load_model
from keras.callbacks import EarlyStopping
from keras.utils import  to_categorical 
from keras.regularizers import l1, l2, l1_l2
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifierCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import pickle
from sklearn.svm import SVC

In [2]:
# Initialize problem parameters
class Args:
    """ Class containing all model arguments """
    def __init__( self ):
        self.project    = 'MLchallenge_DontOverfit'
        self.dataPath   = '/home/harsh/Downloads/DontOverfit/'
        self.modelsPath = '/home/harsh/Downloads/DontOverfit/Models/'
        self.resultsPath= '/home/harsh/Downloads/DontOverfit/Results/'
        self.CV_folds   = 40  # split the Training data in stratified folds, to train different versions of models 
args = Args()

In [4]:
# LOAD DATA
train = pd.read_csv( args.dataPath + 'TTT_train.csv' )
test = pd.read_csv( args.dataPath + 'TTT_test_features.csv', index_col = 'ID')
print(train.describe())

                f0           f1           f2           f3           f4  \
count  1244.000000  1244.000000  1244.000000  1244.000000  1244.000000   
mean      0.000566     0.000697     0.000468     0.001733     0.000708   
std       0.019962     0.024577     0.016497     0.031072     0.024959   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     0.000000     0.000000     0.000000   
75%       0.000000     0.000000     0.000000     0.000000     0.000000   
max       0.704060     0.866833     0.581853     0.709016     0.880315   

                f5           f6           f7           f8           f9  ...  \
count  1244.000000  1244.000000  1244.000000  1244.000000  1244.000000  ...   
mean      0.000717     0.000585     0.000357     0.007151     0.000693  ...   
std       0.025296     0.020650     0.012606     0.050962     0.024434  ...   
min       0.00000

In [5]:
X = train.loc[:, train.columns != 'label']
y = train['label']
y_cat = to_categorical(y)
# Generate a set of stratified folds of Training to train different versions of each model.
folds = list(StratifiedKFold(n_splits=args.CV_folds, shuffle=True, random_state=1).split(X, y))
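
As a rough check on what these folds look like, the arithmetic below splits 1244 training rows into 40 folds. It assumes the held-out fold sizes differ by at most one row, which is consistent with per-fold accuracies that are multiples of 1/32 or 1/31; the variable names are illustrative and not part of the notebook.

```python
n_samples, n_folds = 1244, 40

# 1244 = 40 * 31 + 4, so 4 folds hold out 32 rows and 36 folds hold out 31.
base, extra = divmod(n_samples, n_folds)
fold_sizes = [base + 1] * extra + [base] * (n_folds - extra)

# Each model trains on everything outside its held-out fold.
train_sizes = [n_samples - size for size in fold_sizes]

# One misclassified held-out row changes a fold's accuracy by this much.
accuracy_step = 1 / base

print(fold_sizes[:5], train_sizes[:5], round(accuracy_step, 4))
```

With only about 31 held-out rows per fold, each fold's accuracy moves in steps of roughly 0.03, so per-fold numbers are coarse and are best read through their average.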



In [6]:
##################################################
# function to fit a model on every fold, and store trained model
def fitValidateSave( model, modelType ):
    #
    accuracies = []
    for foldIndex, fold in enumerate(folds):
        X_fold      = np.take( X, fold[0], axis=0)
        y_fold      = np.take( y, fold[0], axis=0)
        #
        #oversampler = RandomOverSampler(random_state=77)
        #X_fold, y_fold = oversampler.fit_sample(X_fold, y_fold)
        #
        X_fold_test = np.take( X, fold[1], axis=0)
        y_fold_test = np.take( y, fold[1], axis=0)
        #
        model.fit(X_fold, y_fold)
        #
        accuracies.append( model.score(X_fold_test, y_fold_test) )
        print( '{}: {}'.format(foldIndex, accuracies[-1]) )
        #print(model.best_params_)
        #
        # Use a context manager so the file handle is closed after writing.
        with open( '{}/{}_fold{}.h5'.format( args.modelsPath, modelType, foldIndex ), 'wb') as f:
            pickle.dump( model, f )
    print( 'Average accuracy for {} is:  {}'.format( modelType, np.mean(accuracies)) )  
    return model
##################################################


##################################################
# Compute accuracies across folds using an already trained model.
def validateAcrossFolds( modelType ):
    #
    accuracies = []
    for foldInd, fold in enumerate(folds):
        X_fold_test = np.take( X, fold[1], axis=0)
        y_fold_test = np.take( y, fold[1], axis=0)
        #
        if 'NN' in modelType:
            y_fold_test = to_categorical(y_fold_test)
            model = load_model( '{}/{}_fold{}.h5'.format( args.modelsPath, modelType, foldInd ) )
            accuracies.append( model.evaluate(X_fold_test, y_fold_test, batch_size=512, verbose=0 )[1] )
        else:
            # Use a context manager so the file handle is closed after reading.
            with open( '{}/{}_fold{}.h5'.format( args.modelsPath, modelType, foldInd ), 'rb') as f:
                model = pickle.load(f)
            accuracies.append( model.score(X_fold_test, y_fold_test) )
        print( '{}: {}'.format(foldInd, accuracies[-1]) )
        #
    print( 'Average accuracy for {} is:  {}'.format( modelType, np.mean(accuracies)) )  
    return model
##################################################

# Logistic Regression


In [7]:


parameters = {
    "penalty":["l2"],
    "C": [ 3., 4., 5.],
    "fit_intercept": [True],
    "class_weight":['balanced'],
    "solver":[ 'lbfgs' ],
    "multi_class": ["multinomial"],
    "random_state":[77]
    }
LR = GridSearchCV(LogisticRegression(), 
                  parameters, 
                  cv=4, 
                  n_jobs=-1)

LR = fitValidateSave( LR, 'LR' )

0: 0.75
1: 0.78125
2: 0.8125
3: 0.8125
4: 0.8709677419354839
5: 0.7096774193548387
6: 0.8709677419354839
7: 0.7741935483870968
8: 0.9032258064516129
9: 0.7741935483870968
10: 0.8709677419354839
11: 0.7741935483870968
12: 0.7741935483870968
13: 0.9354838709677419
14: 0.7419354838709677
15: 0.7419354838709677
16: 0.7741935483870968
17: 0.8387096774193549
18: 0.6774193548387096
19: 0.7419354838709677
20: 0.9032258064516129
21: 0.8064516129032258
22: 0.6774193548387096
23: 0.7419354838709677
24: 0.8064516129032258
25: 0.7741935483870968
26: 0.8709677419354839
27: 0.7419354838709677
28: 0.7419354838709677
29: 0.8387096774193549
30: 0.8709677419354839
31: 0.7096774193548387
32: 0.8709677419354839
33: 0.8709677419354839
34: 0.7096774193548387
35: 0.8064516129032258
36: 0.8387096774193549
37: 0.7419354838709677
38: 0.7419354838709677
39: 0.7419354838709677
Average accuracy for LR is:  0.793422379032258
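
The spread across folds (roughly 0.68 to 0.94) is about what sampling noise alone would produce with so few held-out rows. A back-of-the-envelope sketch, assuming each held-out fold has 31 independent rows and treating the average accuracy above as the true rate:

```python
import math

mean_acc = 0.793422379032258   # average LR accuracy printed above
n_holdout = 31                 # assumed rows per held-out fold (1244 // 40)

# Standard deviation of a binomial proportion: sqrt(p * (1 - p) / n).
fold_std = math.sqrt(mean_acc * (1.0 - mean_acc) / n_holdout)

# Averaging 40 folds shrinks the noise by about sqrt(40), if folds were independent.
mean_std = fold_std / math.sqrt(40)

print(round(fold_std, 3), round(mean_std, 3))
```

A per-fold standard deviation of about 0.07 means that differences of a few points between models on a single fold are within noise; the 40-fold average is the more reliable number, although the folds share training data and are not truly independent.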


# Random Forests


In [8]:
parameters = {
    "criterion":["gini"],
    "max_depth":[ 15, 30  ],
    "min_samples_split": [ 5 ],
    "min_samples_leaf": [1],
    "max_features":[None ],
    "random_state": [77],
    "n_estimators":[ 200 ]
    }
RF_gini = GridSearchCV(RandomForestClassifier(), 
                  parameters, 
                  cv=4, 
                  n_jobs=-1)

RF_gini = fitValidateSave( RF_gini, 'RF_gini' )

0: 0.8125
1: 0.78125
2: 0.84375
3: 0.84375
4: 0.9032258064516129
5: 0.7419354838709677
6: 0.8387096774193549
7: 0.7419354838709677
8: 0.9354838709677419
9: 0.7096774193548387
10: 0.8387096774193549
11: 0.7741935483870968
12: 0.8709677419354839
13: 0.9354838709677419
14: 0.7741935483870968
15: 0.8387096774193549
16: 0.7741935483870968
17: 0.8387096774193549
18: 0.7419354838709677
19: 0.7419354838709677
20: 0.8709677419354839
21: 0.7741935483870968
22: 0.7419354838709677
23: 0.7096774193548387
24: 0.7096774193548387
25: 0.7096774193548387
26: 0.8709677419354839
27: 0.8387096774193549
28: 0.8064516129032258
29: 0.8709677419354839
30: 0.8387096774193549
31: 0.7096774193548387
32: 0.8064516129032258
33: 0.8064516129032258
34: 0.7741935483870968
35: 0.7419354838709677
36: 0.9032258064516129
37: 0.7741935483870968
38: 0.6774193548387096
39: 0.8387096774193549
Average accuracy for RF_gini is:  0.8013860887096774


# AdaBoost

In [9]:
from sklearn.model_selection import GridSearchCV

AB_gini = AdaBoostClassifier( base_estimator = DecisionTreeClassifier( 
                             criterion         = 'gini', 
                             splitter          = 'random',
                             max_depth         = 30, 
                             min_samples_split = 5, 
                             min_samples_leaf  = 1,
                             max_features      = None,
                             random_state      = 77 
                            ),
                            learning_rate= 1,
                            n_estimators = 200
                         )
AB_gini = fitValidateSave( AB_gini, 'AB_gini' )

0: 0.8125
1: 0.8125
2: 0.8125
3: 0.875
4: 0.8064516129032258
5: 0.7741935483870968
6: 0.8709677419354839
7: 0.7741935483870968
8: 0.967741935483871
9: 0.7741935483870968
10: 0.8064516129032258
11: 0.8387096774193549
12: 0.8064516129032258
13: 0.9032258064516129
14: 0.8064516129032258
15: 0.9032258064516129
16: 0.8387096774193549
17: 0.9032258064516129
18: 0.7419354838709677
19: 0.7419354838709677
20: 0.9032258064516129
21: 0.8064516129032258
22: 0.7419354838709677
23: 0.7419354838709677
24: 0.8387096774193549
25: 0.7741935483870968
26: 0.8387096774193549
27: 0.8387096774193549
28: 0.8387096774193549
29: 0.7741935483870968
30: 0.8064516129032258
31: 0.7096774193548387
32: 0.7741935483870968
33: 0.9354838709677419
34: 0.8387096774193549
35: 0.7419354838709677
36: 0.8709677419354839
37: 0.7741935483870968
38: 0.7096774193548387
39: 0.7419354838709677
Average accuracy for AB_gini is:  0.8142641129032258


# XGBoost

In [10]:
XGB = XGBClassifier(  max_depth=6,  
                      learning_rate=0.1, 
                      n_estimators=100, 
                      verbosity=1, 
                      objective='multi:softmax', 
                      num_class=y_cat.shape[-1],
                      booster='gbtree', 
                      n_jobs=4, 
                      gamma=0, 
                      min_child_weight=1,
                      max_delta_step=0, 
                      subsample=.7, 
                      colsample_bytree=.6, 
                      colsample_bylevel=.6, 
                      colsample_bynode=.6, 
                      reg_alpha=.0, 
                      reg_lambda=.0, 
                      scale_pos_weight=1, 
                      base_score=0.1, 
                      random_state=77 
                      )
XGB = fitValidateSave( XGB, 'XGB' )

Parameters: { scale_pos_weight } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


0: 0.78125
1: 0.75
2: 0.78125
Parameters: { scale_pos_weight } might not be used.

  This may not be accurate due to some pa