<h1><center>Addressing Overconfidence: Repeated K-Fold CV</center></h1>

## Introduction
I have seen quite a few public notebooks for this task that report unrealistically high accuracy and AUC values. These notebooks rely on a specific train-test split random state that makes the test data very easy to predict, rather than on producing high-quality, generalizable models. I thought it would be a good idea to explore what's really going on and how dangerous it is to trust models produced in this way. Then I will demonstrate how easy it is to implement a more realistic method of evaluating accuracy and apply it to some of the high-performing models shared by other users.

This is not intended to disparage the great contributions of other users participating in this task, but to help fellow Kagglers avoid overconfidence and disappointment.

### Import Required Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import time
import statistics
import os

import matplotlib.pyplot as plt
import xgboost
import lightgbm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, cross_val_predict
from sklearn import linear_model
from sklearn import metrics

### Read Dataset

In [None]:
heart_df = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
heart_df.head()

## Getting Started

Since there are already a number of very well-presented explorations of this dataset, we will skip EDA and feature selection and go directly to model fitting, where we encounter the problem at hand. We start by performing a simple train-test split and fitting a basic model, as seen in [this](https://www.kaggle.com/nayansakhiya/heart-fail-analysis-and-quick-prediction-96-rate) very nice notebook.


In [None]:
selected_features = ['time','ejection_fraction','serum_creatinine','age']
X = heart_df[selected_features]
X_all_features = heart_df[heart_df.columns.difference(['DEATH_EVENT'])]
y = heart_df['DEATH_EVENT']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2698)


### Fitting Simple Model

In [None]:

r_clf = RandomForestClassifier(max_features=0.5, max_depth=15, random_state=1)
r_clf.fit(x_train, y_train)
acc =  r_clf.score(x_test,y_test)
print(f"Random Forest Test Accuracy: {round(acc*100, 3)}")

## Mission Accomplished! Or is it?

Our Random Forest Classifier with just 4 features scores accuracy over 96% on unseen test data - a fabulous result. Since the test data is unseen we expect our model to have comparable accuracy when we share it with colleagues and reach deployment, right?

This is a very dangerous trap!

Even though the test data is not seen by the model, the test accuracy is seen by the person choosing the random state, and that person likes high test accuracy numbers.

Let's fit a model with exactly the same parameters to different random splits of the data to see what accuracy we might expect if our random state was actually random.

In [None]:
# Same parameters for the RF Classifier
r_clf = RandomForestClassifier(max_features=0.5, max_depth=15, random_state=1)

# Refit and score on three different random splits of the same data
for n, seed in enumerate([1234, 4321, 1337], start=1):
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    r_clf.fit(x_train, y_train)
    acc = r_clf.score(x_test, y_test)
    print(f"Random Forest Test {n} Accuracy: {round(acc*100, 3)}")

Yikes! That's not good: accuracy in all three cases is more than 10% lower than before. But should we be worried? Is it possible that the model generated from our first seed is just a better, more generalizable model?

It is possible but unlikely. Let's try to create a situation where we can check. This time we will split the data into 3 parts (train, validation and test), then try to find the seed that gives great validation accuracy and see what this implies for test accuracy.

We start with an initial split. The test set from this first split is set aside for later; it represents what we might encounter as real-world data.


In [None]:
#start with initial split
x_train_val, x_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

Now we will split the remaining data and this time we will exploit the random state to get high validation accuracy:

In [None]:
for i in range(1000):
    x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, test_size=0.25, random_state=i)
    # Same parameters for the RF Classifier
    r_clf = RandomForestClassifier(max_features=0.5, max_depth=15, random_state=1)
    r_clf.fit(x_train, y_train)
    acc =  r_clf.score(x_val,y_val)*100   
    if acc > 94:
        print(f"Random Forest Val Accuracy: {round(acc, 3)}")
        print(f"High Accuracy at: {i}")
        break


OK! Our cheesy exploits have produced a model with high validation accuracy. Let's see how it generalizes to the test set we split out earlier:

In [None]:
acc = r_clf.score(x_test,y_test)*100
print(f"Random Forest Test Accuracy: {round(acc, 3)}")

## Oh No!

Even though our model scored high accuracy on the validation set, we lost more than 16% accuracy moving to test. If this test set is representative, we might have had a very embarrassing deployment. We definitely can't trust accuracy gained by changing the random state.

**Random state is for reproducibility, not optimization!** Selecting a split that gives great test results does not make your model better and will lead to all kinds of embarrassing disappointments.

So we know not to manipulate the random state, but an unusually easy or hard split can still happen by chance!
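To get a feel for how far chance alone can move a test score, here is a minimal simulation sketch using only the standard library. It assumes a model whose true accuracy is fixed at a hypothetical 83% (an illustrative number, not a measured one) and a test set of about 60 rows, roughly what a 20% split of this dataset gives. The function name `simulate_best_of_seeds` is made up for this illustration.

```python
import random

def simulate_best_of_seeds(true_acc=0.83, n_test=60, n_seeds=1000, seed=0):
    """Simulate test accuracy over many splits for a model of fixed true accuracy.

    true_acc is a hypothetical value chosen for illustration only.
    """
    rng = random.Random(seed)
    accs = []
    for _ in range(n_seeds):
        # Each test row is predicted correctly with probability true_acc
        correct = sum(rng.random() < true_acc for _ in range(n_test))
        accs.append(correct / n_test)
    return accs

accs = simulate_best_of_seeds()
mean_acc = sum(accs) / len(accs)
print(f"Mean simulated accuracy: {mean_acc:.3f}")
print(f"Lowest / highest across seeds: {min(accs):.3f} / {max(accs):.3f}")
```

The model never changes in this simulation, so any gap between the highest score and the mean is pure sampling noise from the small test set. Picking the seed with the highest score reports that noise as if it were skill.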


## How Can We Be More Confident in Our Models?

We saw above that using a 3-way train, validation, test split can help to avoid overconfidence by giving us an additional check on unseen data. To really have confidence in the accuracy of our models, though, there are a number of further tools at our disposal.

Now we will demonstrate how easy it is to get a more stable and reliable estimate of our model's accuracy using 3 ideas, all of which are well supported by sklearn:
1. Stratification - Ensuring the test data has representative proportions of the response variable.
2. K-Fold Cross Validation - Splitting data into K equal groups, where each group is used as the test set once as multiple models are fit.
3. Repeated Estimates - Repeating the entire process and aggregating the accuracy to reduce random noise.

Please read [this](https://machinelearningmastery.com/k-fold-cross-validation/) excellent article for a complete explanation.
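As a rough sketch of how the three ideas fit together, sklearn's `RepeatedStratifiedKFold` splitter can be passed straight to `cross_val_score`. The example below runs on synthetic stand-in data from `make_classification` rather than the heart failure dataset, and the names `X_demo`, `y_demo` and the model settings are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, NOT the heart failure dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.68, 0.32], random_state=0,
)

# Stratified 5-fold CV, repeated 5 times with different shuffles
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold per repeat
scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
print(f"{len(scores)} fold scores, mean {scores.mean():.3f}, std {scores.std():.3f}")
```

Here `random_state` only fixes the shuffles so the run can be reproduced. It is set once and not tuned against the scores.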


We set our random seed once in our setup section and never touch it.

### Setup

In [None]:
np.random.seed(19566390)
FOLDS = 5
REPEATS = 5

Now we will define a function that takes a model, a dataset and some options and applies repeated cross validation scoring. This kind of function will work for any model or pipeline that conforms to the sklearn-style interface.

In [None]:
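def repeat_cross_validation_accuracy(model, x, y, n_folds = 5, n_repeats = 5, metric = 'accuracy'):
    """Return per-repeat mean CV scores and out-of-fold predictions."""
    oof_acc = []
    oof_predictions = []
    for i in range(n_repeats):
        # Draw one seed per repeat from the global RNG seeded in Setup, so that
        # scoring and prediction below are computed on the same folds
        fold_seed = np.random.randint(0, 2**31 - 1)
        kf = StratifiedKFold(n_folds, shuffle=True, random_state=fold_seed)
        acc = cross_val_score(model, x.values, y=y.values, scoring=metric, cv=kf)
        oof_acc.append(acc.mean())
        # Pass arrays here too, matching the input used for scoring
        predictions = cross_val_predict(model, x.values, y=y.values, cv=kf)
        oof_predictions.append(predictions)
    return oof_acc, oof_predictions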

Now we will set up a number of models and select features based on the public notebooks shared by other Kaggle users reporting 90%+ accuracy.

In [None]:

selected_features = ['time','ejection_fraction','serum_creatinine','age']

X = heart_df[selected_features]
#X_all_features = heart_df[heart_df.columns.difference(['DEATH_EVENT'])]
y = heart_df['DEATH_EVENT']

In [None]:
rf_model = RandomForestClassifier(max_features=0.5, max_depth=15)
knn_model = KNeighborsClassifier(n_neighbors=6)
dt_model = DecisionTreeClassifier(max_leaf_nodes=10, criterion='entropy')
gb_model = GradientBoostingClassifier(max_depth=2 )
xgb_model = xgboost.XGBRFClassifier(max_depth=3 )
lgb_model = lightgbm.LGBMClassifier(max_depth=2)
cat_model = CatBoostClassifier(verbose=0)

models = dict()
models['Random Forest'] = rf_model
models['KNN'] = knn_model
models['Decision Tree'] = dt_model
models['Gradient Boosting'] = gb_model
models['XGB'] = xgb_model
models['LGB'] = lgb_model
models['Cat Boost'] = cat_model

Now we will fit and score the models multiple times to get a reliable estimate of out-of-fold (OOF) accuracy.

In [None]:
accuracies = []
training_times = []

print(f"Fitting models with {REPEATS} iterations of {FOLDS} fold CV\n")
for k in models.keys():
    print(f"####################################\nTraining Model: {k}")
    start = time.time()
    acc, preds = repeat_cross_validation_accuracy(models[k], X, y, n_folds = FOLDS, n_repeats = REPEATS)
    end = time.time()
    elapsed = end - start
    print(f"Total Training Time: {round(elapsed,4)} seconds")
    print(f"Mean OOF Accuracy: {round(statistics.mean(acc)*100, 2)} %")
    print(f"####################################\n\n")
    accuracies.append(statistics.mean(acc))
    training_times.append(elapsed)

## Results Table

In [None]:
res_df = pd.DataFrame({'Model': list(models.keys()), 'Accuracy': accuracies, f"Train Time ({FOLDS} by {REPEATS})": training_times})
res_df
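The table reports only the mean over repeats. A small helper like the one below could also report how much the repeats disagree, which gives a sense of how far to trust that mean. This is a sketch: `summarize_repeats` is a hypothetical name, and the input list holds placeholder values for illustration, not results from this notebook.

```python
import statistics

def summarize_repeats(repeat_accuracies):
    """Return (mean, sample standard deviation) of per-repeat accuracies."""
    mean_acc = statistics.mean(repeat_accuracies)
    # stdev needs at least two values; report 0.0 spread for a single repeat
    if len(repeat_accuracies) > 1:
        spread = statistics.stdev(repeat_accuracies)
    else:
        spread = 0.0
    return mean_acc, spread

# Placeholder values for illustration only; NOT results from this notebook
example_acc = [0.80, 0.82, 0.81, 0.79, 0.83]
mean_acc, spread = summarize_repeats(example_acc)
print(f"Mean OOF accuracy: {mean_acc:.3f} +/- {spread:.3f}")
```

In the notebook, the list of per-repeat means returned by `repeat_cross_validation_accuracy` could be passed in place of the placeholder list.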

## Conclusion

Accuracy estimates for all models are significantly lower than the previously reported values but better reflect what we would expect to find on actually unseen data.

It's obvious that a method like this, using repeated cross validation, requires significantly more computation and resources, and may not be suitable for models that require extensive training. However, as the results table and the code above show, the training time and coding needed to obtain a more reliable measure of accuracy are trivial for most models, even with 5 repeats of 5-fold CV.

There are also concerns about the time variable and whether it is valid to include it in predictive models. I may revisit this notebook to address that at a later date.

Thanks to everyone who has contributed public notebooks and participated in discussions regarding this task. I hope someone finds this helpful.