After some discussion in this [thread](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/275233) about whether the OOF (out-of-fold) AUC or the per-fold CV AUC is the better estimate, I want to clarify a few points empirically:

- CV AUC is a less biased estimate than OOF AUC (in particular for a high number of folds K)
- the high standard deviation across folds can be explained by the bias-variance tradeoff



# Regular 5 and 200-fold CV

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
import numpy as np
import pandas as pd

X = pd.read_csv("../input/leak-in-metadata/X_train.csv", usecols=['T2w_Percent Phase Field of View',
                                        'FLAIR_Echo Train Length',
                                        'T2w_shape',
                                        'target'])
X.fillna(0, inplace=True)
y = X["target"].values
X.drop("target", axis=1, inplace=True)
X = X.values

o = []   # per-fold AUC
o2 = []  # per-fold log loss
for fold, (train_index, val_index) in enumerate(StratifiedKFold(n_splits=200).split(X, y)):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    regr = LogisticRegression()
    regr.fit(X_train, y_train)
    
    y_pred = regr.predict_proba(X_val)[...,1]
    
    auc = roc_auc_score(y_val, y_pred)
    val_loss = log_loss(y_val, y_pred)
    
    o.append(auc)
    o2.append(val_loss)

print("Loss", np.mean(o2), np.std(o2))
print("AUC", np.mean(o), np.std(o))

We can see that the standard deviation is extremely high. This is not surprising given the [bias-variance tradeoff](https://stats.stackexchange.com/questions/61783/bias-and-variance-in-leave-one-out-vs-k-fold-cross-validation). Even log loss, a proper scoring rule, is affected by the high number of folds. Let us reduce the number of folds to 5.

In [None]:
o = []   # per-fold AUC
o2 = []  # per-fold log loss
for fold, (train_index, val_index) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    regr = LogisticRegression()
    regr.fit(X_train, y_train)
    
    y_pred = regr.predict_proba(X_val)[...,1]
    
    auc = roc_auc_score(y_val, y_pred)
    val_loss = log_loss(y_val, y_pred)
    
    o.append(auc)
    o2.append(val_loss)

print("Loss", np.mean(o2), np.std(o2))
print("AUC", np.mean(o), np.std(o))

The standard deviation is closer to the true standard deviation. However, we increased the bias. The AUC has decreased to 0.57.
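One part of this effect can be isolated without training any model: the more folds, the fewer samples each per-fold AUC is computed on, and the noisier it gets. The following is a minimal sketch under stated assumptions, not the notebook's data. The scores are synthetic Gaussian draws, and `pair_auc`, `per_fold_auc_std` and the sample counts are hypothetical names and values chosen for illustration.

```python
import random
import statistics


def pair_auc(pos, neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_fold_auc_std(pos_scores, neg_scores, k):
    """Cut both classes into k equal chunks (a stratified split) and return the std of the per-fold AUC."""
    p_size = len(pos_scores) // k
    n_size = len(neg_scores) // k
    aucs = []
    for i in range(k):
        pos = pos_scores[i * p_size:(i + 1) * p_size]
        neg = neg_scores[i * n_size:(i + 1) * n_size]
        aucs.append(pair_auc(pos, neg))
    # Population std, which is also what np.std computes by default.
    return statistics.pstdev(aucs)


# Synthetic scores from a fixed, weak scorer (assumption: not the competition data).
rng = random.Random(0)
pos_scores = [rng.gauss(0.5, 1.0) for _ in range(400)]
neg_scores = [rng.gauss(0.0, 1.0) for _ in range(400)]

std_5 = per_fold_auc_std(pos_scores, neg_scores, 5)      # 80 positives and 80 negatives per fold
std_200 = per_fold_auc_std(pos_scores, neg_scores, 200)  # 2 positives and 2 negatives per fold

print("std of per-fold AUC,   5 folds:", std_5)
print("std of per-fold AUC, 200 folds:", std_200)
```

With only a handful of samples per fold, each AUC can take just a few distinct values, so the spread across folds is large even though the underlying scorer is the same in both cases.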

# OOF 5 and 200-fold CV

In [None]:
oof_score = np.zeros((X.shape[0],))
for fold, (train_index, val_index) in enumerate(StratifiedKFold(n_splits=200).split(X, y)):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    regr = LogisticRegression()
    regr.fit(X_train, y_train)
    
    y_pred = regr.predict_proba(X_val)[...,1]
    
    oof_score[val_index] = y_pred

print("OOF AUC", roc_auc_score(y, oof_score))
print("OOF Loss", log_loss(y, oof_score))

By using OOF, we actually get a higher AUC! The OOF AUC is **0.6806**, while the regular CV AUC is **0.6575**. Let us look at 5-fold CV again.

In [None]:
oof_score = np.zeros((X.shape[0],))
for fold, (train_index, val_index) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    regr = LogisticRegression()
    regr.fit(X_train, y_train)
    
    y_pred = regr.predict_proba(X_val)[...,1]
    
    oof_score[val_index] = y_pred

print("OOF AUC", roc_auc_score(y, oof_score))
print("OOF Loss", log_loss(y, oof_score))

The OOF AUC has increased from 0.68 (200 folds) to 0.7 (5 folds). In contrast, the regular 5-fold CV AUC is 0.5768147135783306.
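Why can the two numbers differ at all? The per-fold CV AUC only compares positives and negatives within the same fold, while the OOF AUC pools all predictions and also compares samples that were scored by different fold models. A toy sketch follows, with hand-picked hypothetical scores (not the notebook's data) and a hypothetical helper `auc`. In this toy case the pooled value comes out lower, but the gap can go in either direction.

```python
def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two hypothetical validation folds as (labels, predicted probabilities).
# Each fold is ranked perfectly on its own, but fold B's model outputs
# systematically higher probabilities than fold A's.
fold_a = ([0, 0, 1, 1], [0.10, 0.20, 0.30, 0.40])
fold_b = ([0, 0, 1, 1], [0.35, 0.45, 0.60, 0.70])

# Regular CV AUC: mean of the per-fold AUCs.
per_fold = [auc(*fold_a), auc(*fold_b)]
cv_auc = sum(per_fold) / len(per_fold)

# OOF AUC: a single AUC over the pooled out-of-fold predictions.
pooled_y = fold_a[0] + fold_b[0]
pooled_score = fold_a[1] + fold_b[1]
oof_auc = auc(pooled_y, pooled_score)

print("CV AUC ", cv_auc)   # 1.0
print("OOF AUC", oof_auc)  # 0.8125
```

The pooled AUC is penalised here because fold B's negatives (0.35, 0.45) outrank fold A's positives (0.30, 0.40), a cross-fold comparison that the per-fold AUC never makes.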

# Train-test-split and CV

From the last section, we found out that:
- CV AUC < OOF AUC
- By increasing the number of folds, we increase the variance

Now, I want to look at the number of folds in relation to the bias of the estimator.

The train dataset is 22% of the total data and the test dataset is 78% of the total data (`train_size=0.22`, `test_size=0.78`). I am using 60 folds because the dataset is too small.

In [None]:
from sklearn.model_selection import train_test_split
from tqdm import tqdm

FOLDS = 60

X = pd.read_csv("../input/leak-in-metadata/X_train.csv", usecols=['T2w_Percent Phase Field of View',
                                        'FLAIR_Echo Train Length',
                                        'T2w_shape',
                                        'target'])
X.fillna(0, inplace=True)
y = X["target"].values
X.drop("target", axis=1, inplace=True)
X = X.values

X, X_test, y, y_test = train_test_split(X, y, test_size=0.78,
                                        train_size=0.22, random_state=3,
                                        shuffle=True, stratify=y)
print("train", X.shape, "test", X_test.shape)

avg_auc_shape = []
roc = []
loss = []
skf = StratifiedKFold(n_splits=FOLDS, random_state=None, shuffle=False)
oof = np.zeros((X.shape[0],), dtype=np.float64)
y_pred_test = np.zeros((y_test.shape[0],), dtype=np.float64)
for train_index, val_index in tqdm(skf.split(X, y), total=FOLDS):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    model = LogisticRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[...,1]

    y_pred_test += model.predict_proba(X_test)[...,1] / FOLDS

    oof[val_index] = y_pred

    roc.append(roc_auc_score(y_val, y_pred))
    loss.append(log_loss(y_val, y_pred))
    avg_auc_shape.append(y_val.shape[0])

print("Each AUC was computed on", np.mean(avg_auc_shape), "samples")
print("CV AUC", np.mean(roc), np.std(roc))
print("CV loss", np.mean(loss), np.std(loss))
print()
print("OOF AUC", roc_auc_score(y, oof))
print("OOF loss", log_loss(y, oof))
print()
print("AUC test", roc_auc_score(y_test, y_pred_test))
print("Loss test", log_loss(y_test, y_pred_test))

This time we see that OOF AUC < AUC, but the regular CV AUC is much closer to the AUC on the test dataset. Again, the estimate of the variance is wrong.

Let us reduce the number of folds to 5.

In [None]:
FOLDS = 5

avg_auc_shape = []
roc = []
loss = []
skf = StratifiedKFold(n_splits=FOLDS, random_state=None, shuffle=False)
oof = np.zeros((X.shape[0],), dtype=np.float64)
y_pred_test = np.zeros((y_test.shape[0],), dtype=np.float64)
for train_index, val_index in tqdm(skf.split(X, y), total=FOLDS):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    model = LogisticRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[...,1]

    y_pred_test += model.predict_proba(X_test)[...,1] / FOLDS

    oof[val_index] = y_pred

    roc.append(roc_auc_score(y_val, y_pred))
    loss.append(log_loss(y_val, y_pred))
    avg_auc_shape.append(y_val.shape[0])

print("Each AUC was computed on", np.mean(avg_auc_shape), "samples")
print("CV AUC", np.mean(roc), np.std(roc))
print("CV loss", np.mean(loss), np.std(loss))
print()
print("OOF AUC", roc_auc_score(y, oof))
print("OOF loss", log_loss(y, oof))
print()
print("AUC test", roc_auc_score(y_test, y_pred_test))
print("Loss test", log_loss(y_test, y_pred_test))

We get a better estimate of the standard deviation by reducing the number of folds (bias-variance tradeoff). 0.522 + 0.0622 = 0.5842, which is quite close to 0.5756.

However, the bias is again too high. Note that OOF AUC > CV AUC.

# Distribution

In the last section, we only considered one dataset by using `train_test_split` with a single seed. However, this is only one particular instance of the dataset. In the next experiment, we draw 300 train/test splits (seeds 0 to 299) from the whole dataset to get a better estimate of how CV behaves.

We use 60 folds again.

In [None]:
FOLDS = 60

avg_auc_shape = []
cv_auc = []
cv_loss = []

oof_auc = []
oof_loss = []

test_auc = []
test_loss = []
for i in tqdm(range(300)):
    X = pd.read_csv("../input/leak-in-metadata/X_train.csv", usecols=['T2w_Percent Phase Field of View',
                                            'FLAIR_Echo Train Length',
                                            'T2w_shape',
                                            'target'])
    X.fillna(0, inplace=True)
    y = X["target"].values
    X.drop("target", axis=1, inplace=True)
    X = X.values

    X, X_test, y, y_test = train_test_split(X, y, test_size=0.78,
                                            train_size=0.22, random_state=i,
                                            shuffle=True, stratify=y)
    #print("train", X.shape, "test", X_test.shape)

    skf = StratifiedKFold(n_splits=FOLDS, random_state=None, shuffle=False)
    oof = np.zeros((X.shape[0],), dtype=np.float64)
    y_pred_test = np.zeros((y_test.shape[0],), dtype=np.float64)
    for train_index, val_index in skf.split(X, y):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        model = LogisticRegression()
        model.fit(X_train, y_train)

        y_pred = model.predict_proba(X_val)[...,1]

        y_pred_test += model.predict_proba(X_test)[...,1] / FOLDS

        oof[val_index] = y_pred

        cv_auc.append(roc_auc_score(y_val, y_pred))
        cv_loss.append(log_loss(y_val, y_pred))
        avg_auc_shape.append(y_val.shape[0])
    
    oof_auc.append(roc_auc_score(y, oof))
    oof_loss.append(log_loss(y, oof))
    
    test_auc.append(roc_auc_score(y_test, y_pred_test))
    test_loss.append(log_loss(y_test, y_pred_test))

print("Each AUC was computed on", np.mean(avg_auc_shape), "samples")
print("CV AUC", np.mean(cv_auc), np.std(cv_auc))
print("CV loss", np.mean(cv_loss), np.std(cv_loss))
print()
print("OOF AUC", np.mean(oof_auc))
print("OOF loss", np.mean(oof_loss))
print()
print("AUC test", np.mean(test_auc))
print("Loss test", np.mean(test_loss))

Again we see that CV AUC is closer to AUC test than OOF AUC. Next, we test the results with 5 folds.

In [None]:
FOLDS = 5

avg_auc_shape = []
cv_auc = []
cv_loss = []

oof_auc = []
oof_loss = []

test_auc = []
test_loss = []
for i in tqdm(range(300)):
    X = pd.read_csv("../input/leak-in-metadata/X_train.csv", usecols=['T2w_Percent Phase Field of View',
                                            'FLAIR_Echo Train Length',
                                            'T2w_shape',
                                            'target'])
    X.fillna(0, inplace=True)
    y = X["target"].values
    X.drop("target", axis=1, inplace=True)
    X = X.values

    X, X_test, y, y_test = train_test_split(X, y, test_size=0.78,
                                            train_size=0.22, random_state=i,
                                            shuffle=True, stratify=y)
    #print("train", X.shape, "test", X_test.shape)

    skf = StratifiedKFold(n_splits=FOLDS, random_state=None, shuffle=False)
    oof = np.zeros((X.shape[0],), dtype=np.float64)
    y_pred_test = np.zeros((y_test.shape[0],), dtype=np.float64)
    for train_index, val_index in skf.split(X, y):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        model = LogisticRegression()
        model.fit(X_train, y_train)

        y_pred = model.predict_proba(X_val)[...,1]

        y_pred_test += model.predict_proba(X_test)[...,1] / FOLDS

        oof[val_index] = y_pred

        cv_auc.append(roc_auc_score(y_val, y_pred))
        cv_loss.append(log_loss(y_val, y_pred))
        avg_auc_shape.append(y_val.shape[0])
    
    oof_auc.append(roc_auc_score(y, oof))
    oof_loss.append(log_loss(y, oof))
    
    test_auc.append(roc_auc_score(y_test, y_pred_test))
    test_loss.append(log_loss(y_test, y_pred_test))

print("Each AUC was computed on", np.mean(avg_auc_shape), "samples")
print("CV AUC", np.mean(cv_auc), np.std(cv_auc))
print("CV loss", np.mean(cv_loss), np.std(cv_loss))
print()
print("OOF AUC", np.mean(oof_auc))
print("OOF loss", np.mean(oof_loss))
print()
print("AUC test", np.mean(test_auc))
print("Loss test", np.mean(test_loss))

Before, we saw a strong effect on the bias. This time the effect is not as strong. For 60-fold CV we had a CV AUC of 0.575583; here, with 5 folds, we have 0.57384. The test AUC is 0.57694. The gaps are |0.575583 - 0.57694| = 0.001357 (60-fold CV) and |0.57384 - 0.57694| = 0.0031 (5-fold CV). Hence, 60-fold CV is closer to the test AUC.
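The comparison can be re-checked in a few lines. This sketch only reuses the three AUC values quoted above; the variable names are hypothetical.

```python
# Values quoted in the text above.
auc_test = 0.57694
cv_auc = {"60-fold": 0.575583, "5-fold": 0.57384}

# Absolute gap between each CV AUC and the test AUC.
gaps = {name: abs(value - auc_test) for name, value in cv_auc.items()}
closest = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name} CV: |CV AUC - AUC test| = {gap:.6f}")
print("closest to AUC test:", closest)
```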