# Summary

This notebook focuses on K-fold cross validation and aims to build better intuition for it. We take a set of models (all XGBoost, with a varying number of estimators) and check how well the AUC estimated on the out-of-fold part of the training set predicts the AUC actually achieved on the test set.

We find that the results on the test set depend only very little on the number of folds we use. The out-of-fold AUC undershoots the test set AUC quite notably, and the difference seems to be largest for the 10-fold CV. This is a bit surprising, though it might be related to the large number of mislabeled samples.

# Import libraries

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score
from sklearn.model_selection import KFold
np.random.seed(42)

# Define parameters of the K-fold CV, the models and the test set size

In [None]:
# size of the test set (for our purposes we need to know the targets, 
# so this is different from the competition test set)
TEST_SIZE = 0.9

# number of folds to try out
N_FOLDS = [3, 5, 7, 10]

# define model parameters
LEARNING_RATE = 0.1
N_ESTIMATORS  = [10, 25, 50, 75, 100, 125, 150, 200] # this defines different models
MAX_DEPTH     = 4
N_JOBS        = 16
TREE_METHOD   = 'hist'
VERBOSITY     = 1
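
As a rough guide to the cost of the experiment, the sketch below counts how many model fits these settings imply, assuming one fit per fold, per model and per cross-validation strategy (which is how the training loop later in the notebook is organized). The names `fits_per_strategy` and `total_fits` are introduced here for illustration only.

```python
# Values copied from the configuration cell above.
N_FOLDS = [3, 5, 7, 10]
N_ESTIMATORS = [10, 25, 50, 75, 100, 125, 150, 200]

# Assumption: one fit per fold, per model, per cross-validation strategy.
fits_per_strategy = {nf: nf * len(N_ESTIMATORS) for nf in N_FOLDS}
total_fits = sum(fits_per_strategy.values())

print(fits_per_strategy)
print(f"total fits: {total_fits}")
```

The 10-fold strategy alone accounts for 80 of the fits, so dropping it would be the cheapest way to shorten the run.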

# Load the data, separate training and test set

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')

In [None]:
train.head()

Separate train and test sets

In [None]:
ids = train['id'].values
train_ids = np.random.choice(ids, replace=False, size=int((1 - TEST_SIZE) * len(train)))
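# every id not drawn for training goes to the test set
# (np.setdiff1d avoids a slow Python-level membership check per id)
test_ids = np.setdiff1d(ids, train_ids)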

In [None]:
y_train = train[train['id'].isin(train_ids)]['target'].values
X_train = train[train['id'].isin(train_ids)].drop(['id', 'target'], axis = 1).values
y_test  = train[train['id'].isin(test_ids )]['target'].values
X_test  = train[train['id'].isin(test_ids )].drop(['id', 'target'], axis = 1).values
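
The same split can be sketched with a single shuffle and one reusable boolean mask. This is a minimal illustration on made-up ids, not the code used in this notebook: the `ids` array is a hypothetical stand-in for `train['id'].values`, and `is_train` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

TEST_SIZE = 0.9
ids = np.arange(1_000)  # hypothetical stand-in for train['id'].values

# round() rather than int(): 1 - 0.9 is slightly below 0.1 in floating point,
# so truncating the product would drop one sample.
n_train = round((1 - TEST_SIZE) * len(ids))

# Shuffle once, then cut: the first part is the training set, the rest the test set.
shuffled = rng.permutation(ids)
train_ids, test_ids = shuffled[:n_train], shuffled[n_train:]

# One boolean mask can then select both features and targets.
is_train = np.isin(ids, train_ids)
print(len(train_ids), len(test_ids), is_train.sum())
```

With a mask like this, `train[is_train]` and `train[~is_train]` would give the two parts of a DataFrame without repeating the `isin` call four times.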

# Train various XGBoost classifiers using different numbers of folds

Here we train various XGBoost classifiers using K-fold cross validation. To simulate a set of models, for each number of folds we train several XGBoost classifiers, each with a different number of estimators.

For each model and K-fold CV we store the AUC on the test set, computed from the predictions averaged over the folds. For later analysis, we also store, for each fold, the AUC on the out-of-fold part of the train set.
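
The bookkeeping described above can be sketched on synthetic data with NumPy only. Everything here is a simplified stand-in, not the notebook's actual pipeline: `rank_auc`, `fit_linear_scorer` and `kfold_evaluate` are hypothetical helpers, a least-squares scorer replaces `XGBClassifier`, and `np.array_split` produces the same contiguous folds as an unshuffled `KFold`.

```python
import numpy as np

def rank_auc(y_true, scores):
    """AUC from ranks (Mann-Whitney U statistic); assumes no tied scores."""
    y_true = np.asarray(y_true)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_linear_scorer(X, y):
    """Stand-in model (NOT XGBoost): least-squares weights plus intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ w

def kfold_evaluate(X_train, y_train, X_test, y_test, n_folds):
    """Return (AUC of fold-averaged test prediction, list of out-of-fold AUCs)."""
    all_idx = np.arange(len(X_train))
    test_pred = np.zeros(len(X_test))
    oof_aucs = []
    for valid_idx in np.array_split(all_idx, n_folds):
        train_idx = np.setdiff1d(all_idx, valid_idx)
        predict = fit_linear_scorer(X_train[train_idx], y_train[train_idx])
        # each fold model contributes 1/n_folds of the final test prediction
        test_pred += predict(X_test) / n_folds
        # the held-out fold is scored by the model that never saw it
        oof_aucs.append(rank_auc(y_train[valid_idx], predict(X_train[valid_idx])))
    return rank_auc(y_test, test_pred), oof_aucs

# Synthetic data: only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))
y = (X[:, 0] + rng.normal(size=2_000) > 0).astype(int)

test_auc, oof_aucs = kfold_evaluate(X[:500], y[:500], X[500:], y[500:], n_folds=5)
print(f"test AUC: {test_auc:.3f}, mean OOF AUC: {np.mean(oof_aucs):.3f}")
```

Note that the test AUC is computed once from the averaged prediction, whereas each out-of-fold AUC comes from a single fold model evaluated on a much smaller sample.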

In [None]:
# Save final AUC for each K-fold and for each XGB model (defined by the number of estimators)
auc = [[] for x in N_FOLDS]

# Also save the results for out-of-fold part of the train set
# Three indices: 1. how many K-folds (3, 5, ...) 
#                2. which model (how many estimators) 
#                3. which out-of-fold
oof_auc = [[] for x in N_FOLDS]
    
#Iterate over all possible K-fold cross-validation strategies
for i, nf in enumerate(N_FOLDS):
    kf = KFold(n_splits=nf) 
    
    print(f'\nRunning {nf}-fold splitting')
    print('-----')

    # Iterate over XGB models (determined by the number of estimators)
    for nest in N_ESTIMATORS:

        print(f'Running {nest} estimators')

        # define the model
        xgb = XGBClassifier(learning_rate = LEARNING_RATE, n_estimators = nest, max_depth = MAX_DEPTH, 
                                n_jobs = N_JOBS, tree_method = TREE_METHOD, verbosity=VERBOSITY, 
                                eval_metric = 'logloss', use_label_encoder = False)


        # predictions on the test set - we will average over the K-folds
        y_test_pred = np.zeros(len(X_test))

        # out-of-fold AUC for the current CV strategy/model
        c_oof_auc = []

        # iterate over the K-folds
        for train_index, valid_index in kf.split(X_train):

            # fit the model on the train set
            model_xgb = xgb.fit(X_train[train_index], y_train[train_index])

            # predict on the test set - average over the K-folds
            y_test_pred += model_xgb.predict_proba(X_test)[:,1]/nf

            # predict on the out-of-fold part of the train set
            y_valid_pred = model_xgb.predict_proba(X_train[valid_index])[:,1]
            c_oof_auc.append(roc_auc_score(y_train[valid_index], y_valid_pred))

        # Save area under the curve for the final prediction 
        auc[i].append(roc_auc_score(y_test, y_test_pred))
        # and also for the out-of-fold part of the training set
        oof_auc[i].append(c_oof_auc)
        
        # Keep us informed about what is going on
        print(f'     auc: {auc[i][-1]}')
        print(f'     oof auc: {oof_auc[i][-1]}')

# Investigate the results

First let's compare the final results on the test set. We find that regardless of the cross-validation strategy, we get very similar results! The choice of model matters way more than the number of K-folds we choose.

In [None]:
cols = ['r', 'g', 'b', 'k']

for idx, col in zip(range(4), cols):
    plt.scatter(N_ESTIMATORS, auc[idx], color = col, label = f'{N_FOLDS[idx]} folds')
    
plt.legend()
plt.ylabel('AUC')
plt.xlabel('Number of XGBoost estimators');

Here we look at the differences. We subtract the result obtained with the 10-fold CV and confirm that the differences are really tiny.

In [None]:
for idx, col in zip(range(4), cols):
    # convert both lists to np.array, as subtracting two lists is not defined
    diff = np.array(auc[idx]) - np.array(auc[-1])
    plt.scatter(N_ESTIMATORS, diff, color = col, label = f'{N_FOLDS[idx]} folds')
    
plt.legend()
plt.ylabel(f'AUC relative to the {N_FOLDS[-1]}-fold result')
plt.xlabel('Number of XGBoost estimators');

Here we compare the results on the test set (which do not depend much on the number of folds, so we plot only the 10-fold result, with black points) with the out-of-fold results (in color, showing the mean ± one standard deviation over the folds).

In [None]:
cols = ['r', 'g', 'b', 'orange']

for i, ne in enumerate(N_ESTIMATORS):
    for j, col in enumerate(cols):
        plt.errorbar(
            ne + 2*j - 3,  # small horizontal offset so the strategies do not overlap
            np.mean(oof_auc[j][i]), 
            yerr = np.std(oof_auc[j][i]), 
            color = col,
            # label only the first model to avoid having too large a legend
            label = f'{N_FOLDS[j]} folds' if i == 0 else None
        )

plt.scatter(N_ESTIMATORS, auc[-1], color = 'k')
plt.legend();
plt.ylabel('AUC')
plt.xlabel('Number of XGBoost estimators');

We see that the out-of-fold results undershoot the black dots, which represent the test set results. This is not that surprising: each out-of-fold score comes from a single model trained on only a subset of the training data, while the test set prediction averages the models from all folds, so we expect the out-of-fold results to trail a bit.
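
To put a number on the undershoot, one could subtract the mean out-of-fold AUC from the test AUC for every strategy and model. The helper below is a sketch that assumes the nested-list layout of `auc` and `oof_auc` used in this notebook. `oof_gap` is a hypothetical name, and the demo values are placeholders, not results from this notebook.

```python
import numpy as np

def oof_gap(auc, oof_auc):
    """Test AUC minus mean out-of-fold AUC, shape (n_strategies, n_models)."""
    # oof_auc[j][i] holds one AUC per fold; the number of folds differs
    # between strategies, so we average cell by cell.
    mean_oof = np.array([[np.mean(folds) for folds in per_model]
                         for per_model in oof_auc])
    return np.asarray(auc) - mean_oof

# Placeholder numbers for illustration only, NOT results from this notebook.
auc_demo = [[0.70, 0.72],          # strategy 0: two models
            [0.70, 0.72]]          # strategy 1: two models
oof_demo = [
    [[0.68, 0.69, 0.70], [0.70, 0.71, 0.72]],                          # 3 folds
    [[0.66, 0.68, 0.69, 0.70, 0.67], [0.69, 0.70, 0.71, 0.72, 0.68]],  # 5 folds
]

gap = oof_gap(auc_demo, oof_demo)
print(gap)
```

A positive entry means the out-of-fold estimate is below the test result for that strategy and model.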

What is a bit surprising is the magnitude of the difference and the fact that the 10-fold CV leads to the worst out-of-fold result. It also has the largest standard deviation across folds.
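
Part of the larger spread for 10 folds may simply be sample size: each validation fold is smaller, so its AUC is noisier. The simulation below is a hedged illustration of that effect only. It uses synthetic scores rather than model predictions, draws random validation subsets instead of a true partition, and all names and sizes (`rank_auc`, `n_train`, 200 repetitions) are assumptions made for this sketch.

```python
import numpy as np

def rank_auc(y_true, scores):
    """AUC from ranks (Mann-Whitney U statistic); assumes no tied scores."""
    y_true = np.asarray(y_true)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n_train = 6_000                              # hypothetical training set size
y = rng.integers(0, 2, size=n_train)         # synthetic labels
scores = rng.normal(loc=y, scale=1.0)        # positives score higher on average

spread = {}
for k in [3, 5, 7, 10]:
    fold_size = n_train // k                 # size of one validation fold
    aucs = []
    for _ in range(200):
        # simplification: a random subset of fold size, not a real K-fold partition
        idx = rng.choice(n_train, size=fold_size, replace=False)
        aucs.append(rank_auc(y[idx], scores[idx]))
    spread[k] = np.std(aucs)
    print(f"{k:>2} folds: validation size {fold_size}, std of AUC {spread[k]:.4f}")
```

This only addresses the spread. It does not explain why the mean out-of-fold AUC would be lowest for 10 folds.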

Overall we expected a somewhat tighter relation between the out-of-fold results and the results on the test set, but the weaker relation might be caused by the large number of mislabeled samples that [appear to be present](https://www.kaggle.com/motloch/nov21-mislabeled-25).