## Why should you care about Cross-Validation?
There are three main reasons why a clean and robust cross-validation will help:
1. To estimate the generalization error of a given model
2. To select the best performing model from a group of models
3. To select the best hyperparameters for your model

## K-Fold Cross-Validation
- We divide the train set into `k` folds of (approximately) equal size
- The model is trained on `k-1` folds and validated on the remaining fold; this is repeated `k` times so that each fold serves as the validation set exactly once
- We obtain `k` performance values, one for each fold
- The final performance metric is the mean of the per-fold performances +/- their standard deviation
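
The steps above can be sketched by hand as follows. This is only an illustration: the synthetic data from `make_classification` and the `LogisticRegression` model are stand-ins chosen to keep the example light, not the data or model used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the competition data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])      # train on k-1 folds
    proba = model.predict_proba(X[val_idx])    # validate on the held-out fold
    fold_scores.append(log_loss(y[val_idx], proba, labels=[0, 1]))

# One score per fold, summarised as mean +/- standard deviation
fold_scores = np.array(fold_scores)
mean_score = fold_scores.mean()
std_score = fold_scores.std(ddof=1)
print(f"log loss: {mean_score:.4f} +/- {std_score:.4f}")
```

In practice `cross_validate` runs this loop for you, as shown in the demo below.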

Typical values of `k` are 5 or 10.
- Use 5 if you are short on computational resources, typically at the beginning of the model development lifecycle
- Use 10 if you can afford the computational cost and need tighter confidence intervals around the model performance metric. This is typical when you need to evaluate model performance carefully, and it is standard in most notebooks for machine learning competitions.

**With higher `k`:**
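- You get bigger train sets, which reduces the bias of the performance estimate because each model is trained on nearly all of the available data
- You can get more variance in the per-fold scores, since each validation fold is smaller and therefore noisier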
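
The effect of `k` on the train/validation sizes can be checked directly. The sketch below uses a dummy array of 1000 rows (an arbitrary size picked for the illustration) and records the sizes of the first split for a few values of `k`.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000                      # arbitrary size, for illustration only
X = np.zeros((n_samples, 1))          # the feature values are irrelevant here

train_sizes = {}
for k in (2, 5, 10):
    # Take the first split; all splits have the same sizes when k divides n_samples
    train_idx, val_idx = next(iter(KFold(n_splits=k).split(X)))
    train_sizes[k] = (len(train_idx), len(val_idx))
    print(f"k={k:>2}: train={len(train_idx)}, validation={len(val_idx)}")
```

Each model sees a fraction `(k-1)/k` of the data, so going from `k = 5` to `k = 10` moves the train set from 80% to 90% of the rows while halving the validation fold.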

## Repeated K-Fold Cross-Validation
- This repeats K-Fold Cross-Validation `n` times, each time with a different randomized data split
- Before each of the `n` splits into `k` folds, the training set is shuffled
- As a result, we obtain `k x n` performance metrics
- Warning: validation sets from different repeats overlap, so the `k x n` scores are not independent of each other.
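
A minimal sketch of the repeated scheme with scikit-learn's `RepeatedKFold` is shown below. The synthetic data and the `LogisticRegression` model are stand-ins for the illustration only. The last part counts how many times each row lands in a validation set, which makes the overlap between repeats explicit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the competition data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

k, n = 5, 3
rkf = RepeatedKFold(n_splits=k, n_repeats=n, random_state=0)

# One score per (repeat, fold) pair, i.e. k x n scores in total
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="neg_log_loss", cv=rkf)
print("number of scores:", scores.shape[0])

# Count how often each row is used for validation across all splits
val_counts = np.zeros(len(X), dtype=int)
for _, val_idx in rkf.split(X):
    val_counts[val_idx] += 1
print("validation uses per row:", np.unique(val_counts))
```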

## Stratified K-Fold Cross-Validation
- Only used for classification problems
- The procedure is identical to K-Fold Cross-Validation, except that the folds are built to preserve the class proportions of the full train set
- It is useful with highly imbalanced datasets, because it ensures that each fold has a similar proportion of observations for each class
- You also get `k` performance metrics
- No overlap of validation sets
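
To see what stratification changes, the sketch below compares the share of the minority class in each validation fold for `KFold` and `StratifiedKFold`. The 90/10 label vector is made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels: 900 negatives, 100 positives
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))             # features are irrelevant for the split

kf = KFold(n_splits=10, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Positive-class rate in each validation fold
plain_rates = [y[val_idx].mean() for _, val_idx in kf.split(X)]
strat_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

print("KFold           :", np.round(plain_rates, 2))
print("StratifiedKFold :", np.round(strat_rates, 2))
```

With plain `KFold` the positive rate drifts from fold to fold, while the stratified folds all stay at the overall rate of 10%.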

There are other cross-validation schemes such as LeaveOneOut or LeavePOut, but they are computationally very expensive on anything but small datasets.
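
The cost difference is easy to quantify by counting how many model fits each scheme requires. The sketch below does this for a dummy dataset of only 100 rows (an arbitrary size chosen for the illustration).

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.zeros((100, 1))                # 100 dummy rows, values are irrelevant

# Number of train/validation splits, i.e. number of model fits required
n_fits = {
    "KFold(k=10)": KFold(n_splits=10).get_n_splits(X),
    "LeaveOneOut": LeaveOneOut().get_n_splits(X),
    "LeavePOut(p=2)": LeavePOut(p=2).get_n_splits(X),
}
for name, n in n_fits.items():
    print(f"{name:<15} {n} fits")
```

LeaveOneOut needs one fit per row and LeavePOut one fit per combination of `p` rows, so both grow quickly with the dataset size.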

**Takeaway 1**
> As a baseline when starting out a Kaggle competition, I recommend using K-Fold Cross-Validation with `k = 10`, or Stratified K-Fold if you are working with a highly imbalanced dataset.

# Demo

In [None]:
# libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics import log_loss
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold, RepeatedKFold, StratifiedKFold, cross_validate, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import lightgbm as lgb
from time import time
from scipy import stats

In [None]:
# CONFIGURATION
PATH_TO_TRAIN_SET = '../input/tabular-playground-series-jun-2021/train.csv'
PATH_TO_TEST_SET = '../input/tabular-playground-series-jun-2021/test.csv'
RANDOM_STATE = 6
TARGET = 'target'
K = 10

In [None]:
# Utils
def load_data():
    """ load data to build the model and submission data on which to make predictions """
    X_all = pd.read_csv(PATH_TO_TRAIN_SET) # data for model development (train set + val sets )
    X_sub = pd.read_csv(PATH_TO_TEST_SET) # data for prediction to submit on kaggle
    return X_all, X_sub

# Preprocessing Functions
def baseline_preprocessing(X_all, X_sub):
    
    # extract target
    y_all = X_all[TARGET].copy()
    
    # drop id and target columns for model inputs
    X_all = X_all.drop(['id', TARGET], axis=1)
    X_sub = X_sub.drop('id', axis=1)
    
    return X_all, X_sub, y_all

In [None]:
# load and preprocess the data
X_all, X_sub = load_data()
sub_idx = X_sub['id']
X_all, X_sub, y_all = baseline_preprocessing(X_all, X_sub)

## KFold Cross-Validation

In [None]:
# model
model = lgb.LGBMClassifier(random_state=RANDOM_STATE, n_estimators=50)

# K-Fold Cross-Validation
kf = KFold(n_splits=K, shuffle=True, random_state=RANDOM_STATE)

In [None]:
# estimate generalization error
clf =  cross_validate(model,
                      X_all,
                      y_all,
                      scoring='neg_log_loss',
                      return_train_score=True,
                      cv=kf,
                      n_jobs=-1)

In [None]:
clf['train_score'].mean()

In [None]:
# print elapsed time
print("Elapsed time:", round(clf['fit_time'].sum(),3), "s\n")

# print the mean test score (negative log loss) with a one-standard-deviation band
print("Mean - std:", clf['test_score'].mean() - np.std(clf['test_score'], ddof=1))
print("Mean score:", clf['test_score'].mean())
print("Mean + std:", clf['test_score'].mean() + np.std(clf['test_score'], ddof=1))

## Stratified KFold Cross-Validation

In [None]:
# model
model = lgb.LGBMClassifier(random_state=RANDOM_STATE, n_estimators=50)

# specify cross validation scheme
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=RANDOM_STATE)

# estimate generalization error
clf =  cross_validate(model,
                      X_all,
                      y_all,
                      scoring='neg_log_loss',
                      return_train_score=True,
                      cv=skf,
                      n_jobs=-1)

In [None]:
# print elapsed time
print("Elapsed time:", round(clf['fit_time'].sum(),3), "s\n")

# print the mean test score (negative log loss) with a one-standard-deviation band
print("Mean - std:", clf['test_score'].mean() - np.std(clf['test_score'], ddof=1))
print("Mean score:", clf['test_score'].mean())
print("Mean + std:", clf['test_score'].mean() + np.std(clf['test_score'], ddof=1))

We can see that the margin of error around the mean performance metric is tighter. This is because Stratified KFold keeps the class proportions similar across folds, which gives a more reliable estimate of the model's generalisation error.

**Stratified KFold is used in the remainder of the notebook**

### Simple Hyperparameter Tuning using Stratified KFold Cross-Validation

In [None]:
# model initialisation
LGBMC = lgb.LGBMClassifier(random_state=RANDOM_STATE)

# cross validation scheme
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=RANDOM_STATE)

# hyperparameter space
param_grid = dict(n_estimators=stats.randint(10, 200),
                  max_depth=[-1, 2, 4, 8, 16, 32],
                  num_leaves=stats.randint(2,50),
                 )

# set up the search
search =  RandomizedSearchCV(LGBMC,
                            param_grid,
                            scoring='neg_log_loss',
                            return_train_score=True,
                            cv=skf,
                            refit=True,
                            n_jobs=-1,
                            n_iter=20,
                            verbose=2,
                            random_state=RANDOM_STATE)

# find the best hyperparameters
search.fit(X_all, y_all)

In [None]:
print("best score:", search.best_score_, "\n")
print("best params:\n", search.best_params_)

In [None]:
results = pd.DataFrame(search.cv_results_)
print(results.shape)
results.columns

In [None]:
# we can sort the models by performance and keep only the key columns for readability
results.sort_values(by='mean_test_score', ascending=False, inplace=True)
results.reset_index(drop=True, inplace=True)
results[['mean_fit_time', 'std_fit_time','param_max_depth', 'param_n_estimators', 'param_num_leaves', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']].head(10)

## Predict and Submit

In [None]:
y_pred_sub = search.predict_proba(X_sub)

In [None]:
sub_idx_array = sub_idx.to_numpy()
sub_idx_array = sub_idx_array.reshape(-1, 1)
DATA = np.concatenate((sub_idx_array, y_pred_sub), axis=1)
sub_columns = ['id','Class_1','Class_2','Class_3','Class_4','Class_5','Class_6','Class_7','Class_8','Class_9']

In [None]:
my_submission = pd.DataFrame(data=DATA, columns=sub_columns)
my_submission['id']=my_submission['id'].astype('int')
my_submission.to_csv("submissionCV.csv", index=False)
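
The submission above relies on the hard-coded column list matching the order of the columns returned by `predict_proba`. A possible alternative, sketched below on made-up data with a `LogisticRegression` stand-in, is to take the column names from the fitted estimator's `classes_` attribute. In this notebook the fitted search object should expose the same information as `search.classes_`, but that has not been checked against the competition data here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data with string class labels (assumption: illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))
y_train = rng.choice(["Class_1", "Class_2", "Class_3"], size=300)
X_new = rng.normal(size=(5, 4))
ids = pd.Series([100, 101, 102, 103, 104], name="id")

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_new)      # columns follow the order of clf.classes_

# Name the probability columns from the estimator, then prepend the ids
submission = pd.DataFrame(proba, columns=clf.classes_)
submission.insert(0, "id", ids.to_numpy())
print(submission.head())
```

Building the frame this way keeps `id` as an integer column, so no cast is needed after concatenation.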