# Cross-validation  
This notebook shows how to use cross-validation with a LightGBM model.  
The dataset comes from the Kaggle competition at https://www.kaggle.com/c/ieee-fraud-detection


# CV concept
Adapted from https://www.kaggle.com/kyakovlev/ieee-cv-options
## Basics
Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data.

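In k-fold cross-validation, you split the input data into k subsets of data (also known as folds). Each fold is used once as the validation set while the model is trained on the remaining k-1 folds.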
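
To make the fold idea concrete, here is a minimal sketch of how the indices of a dataset can be cut into k contiguous folds. The helper `kfold_indices` is a hypothetical name written for this illustration only; it handles the unshuffled case and is not the implementation any library uses.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) lists for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(n_samples=7, k=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every row lands in exactly one validation fold, and the training indices of a fold are simply everything else.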

## Main strategy
1. Divide the train set into subsets (a training set and a validation set)
2. Define the validation metric (in our case it is ROC-AUC)
3. Stop training when the validation metric stops improving
4. Make predictions for the test set
5. It seems simple, but the devil is always in the details
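
Step 3 is the part that is easiest to get subtly wrong, so here is a small sketch of the stopping rule on a made-up list of validation scores. The function `best_round_with_patience` and the scores are hypothetical; the sketch only assumes a higher-is-better metric such as ROC-AUC and a fixed patience.

```python
def best_round_with_patience(valid_scores, patience):
    """Return (best_round, rounds_run) for a higher-is-better metric.

    Rounds are 1-based. Training stops once `patience` consecutive rounds
    fail to beat the best score seen so far.
    """
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    return best_round, len(valid_scores)


# Made-up validation scores, one per boosting round.
scores = [0.80, 0.85, 0.88, 0.87, 0.879, 0.86, 0.90]
best, ran = best_round_with_patience(scores, patience=3)
print(best, ran)
```

Note that the number of rounds actually run is larger than the best round, and that a later improvement (the final 0.90 here) is never seen once training has stopped.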

In [5]:
## General imports
import numpy as np
import pandas as pd
import os, sys, gc, warnings, random, datetime
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
import lightgbm as lgb
import math
warnings.filterwarnings('ignore')

def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)

SEED = 42
seed_everything(SEED)

TARGET = 'isFraud'
train_df = pd.read_csv('./train_transaction.csv')

In [6]:
# Hyperparameters for LightGBM
lgb_params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'n_jobs': -1,
    'learning_rate': 0.01,
    'num_leaves': 2**8,
    'max_depth': -1,
    'tree_learner': 'serial',
    'colsample_bytree': 0.7,
    'subsample_freq': 1,
    'subsample': 0.7,
    'n_estimators': 5000,
    'max_bin': 255,
    'verbose': -1,
    'seed': SEED,
    'early_stopping_rounds': 100,
}

In [20]:
# The data is time-ordered, so we derive a month index and split it into two sets by time
START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
train_df['DT_M'] = train_df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds=x)))
train_df['DT_M'] = (train_df['DT_M'].dt.year - 2017) * 12 + train_df['DT_M'].dt.month

# Use the last month as the hold-out (verification) data
test_df = train_df[train_df['DT_M'] == train_df['DT_M'].max()].reset_index(drop=True)
train_df = train_df[train_df['DT_M'] < (train_df['DT_M'].max())].reset_index(drop=True)

print('Shape control:', train_df.shape, test_df.shape)

for col in list(train_df):
    if train_df[col].dtype == 'O':
        # print(col)
        train_df[col] = train_df[col].fillna('unseen_before_label')
        test_df[col] = test_df[col].fillna('unseen_before_label')

        train_df[col] = train_df[col].astype(str)
        test_df[col] = test_df[col].astype(str)

        le = LabelEncoder()
        le.fit(list(train_df[col]) + list(test_df[col]))
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])

        train_df[col] = train_df[col].astype('category')
        test_df[col] = test_df[col].astype('category')

rm_cols = [
    'TransactionID','TransactionDT', # These columns are pure noise right now
    TARGET,                          # The target must not be used as a feature
    'DT_M'                           # Column that we used to simulate the test set
]

# Remove V columns (for faster training)
rm_cols += ['V'+str(i) for i in range(1,340)]

# Final features
features_columns = [col for col in list(train_df.columns) if col not in rm_cols]

Shape control: (417559, 395) (83655, 395)
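
To check the month arithmetic behind `DT_M`, here is a small standalone sketch that maps a few `TransactionDT` offsets to the month index. It reuses the reference date chosen in the cell above; the helper name `month_index` and the sample offsets are made up for this illustration.

```python
import datetime

START_DATE = datetime.datetime(2017, 11, 30)  # same reference date as the cell above


def month_index(transaction_dt):
    """Map a TransactionDT offset (seconds) to a running month number."""
    ts = START_DATE + datetime.timedelta(seconds=transaction_dt)
    return (ts.year - 2017) * 12 + ts.month


day = 86400  # seconds per day
for offset in (0, 1 * day, 40 * day):
    print(offset, month_index(offset))
```

Months of 2017 keep their calendar number (November is 11) and months of 2018 continue from 13, so the largest value always marks the most recent month.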


In [17]:
# Choose 3 as number_splits for faster training
N_SPLITS = 3
RESULTS = test_df[['TransactionID',TARGET]].copy()  # copy to avoid SettingWithCopyWarning
RESULTS["kfold"] = 0.0
train_x, train_y = train_df[features_columns], train_df[TARGET]
test_x, test_y = test_df[features_columns], test_df[TARGET]

In [13]:
# No CV
# Without a validation set we don't know when the model is at its best,
# so we grid-search the number of boosting rounds on the hold-out month

for n_rounds in [500,1000,1500,2000]:   # n_rounds avoids shadowing the builtin round()
    params = lgb_params.copy()
    params["n_estimators"] = n_rounds
    params["early_stopping_rounds"] = None   # no validation set, so no early stopping

    train = lgb.Dataset(train_x, train_y)

    clf = lgb.train(
        params,
        train,
    )
    pred = clf.predict(test_x)
    print("rounds "+str(n_rounds)+"  auc: ", roc_auc_score(test_y, pred))

rounds 500  auc:  0.9288169585447247
rounds 1000  auc:  0.9333970353168322
rounds 1500  auc:  0.9334095099797453
rounds 2000  auc:  0.9327899934117092
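
As a quick reading aid for the numbers above, the following sketch picks the best setting from the printed AUC values and shows how small the difference between 1000 and 1500 rounds is. The dictionary is copied by hand from the output; the variable names are mine.

```python
# AUC values copied from the output above.
auc_by_rounds = {
    500: 0.9288169585447247,
    1000: 0.9333970353168322,
    1500: 0.9334095099797453,
    2000: 0.9327899934117092,
}

best_rounds = max(auc_by_rounds, key=auc_by_rounds.get)
gain_over_1000 = auc_by_rounds[best_rounds] - auc_by_rounds[1000]
print(best_rounds, f"{gain_over_1000:.6f}")
```

The curve is almost flat between 1000 and 1500 rounds and drops again at 2000. Keep in mind that the rounds are selected on the hold-out month itself here, so the best score is probably a little optimistic.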


In [19]:
# KFold/StratifiedKFold
# The difference between the two is that StratifiedKFold keeps the target ratio the same in each split.
# shuffle=True shuffles the rows once before they are split into folds.
# I tried both shuffle=True and shuffle=False: the validation score differs, but the hold-out AUC is similar.
# With shuffle=True the validation score is higher, likely because shuffled folds mix rows
# from the same time periods into both the training and the validation part.
from sklearn.model_selection import KFold, StratifiedKFold

folds = KFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
# folds = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

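for i, (train_idx, valid_idx) in enumerate(folds.split(train_x, train_y)):
    # split() returns positions, so index with iloc
    tr_x, tr_y = train_x.iloc[train_idx,:], train_y.iloc[train_idx]
    vl_x, vl_y = train_x.iloc[valid_idx,:], train_y.iloc[valid_idx]

    train = lgb.Dataset(tr_x, tr_y)
    valid = lgb.Dataset(vl_x, vl_y)

    clf = lgb.train(
        lgb_params,
        train,
        valid_sets = [train, valid],
        # verbose_eval=1000 was removed in LightGBM 4.0; this callback logs every 1000 rounds instead
        callbacks = [lgb.log_evaluation(1000)],
    )
    # Average the hold-out predictions of the fold models
    RESULTS["kfold"] += clf.predict(test_x)/N_SPLITS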

print("kfold avg auc:", roc_auc_score(test_y, RESULTS["kfold"]))

Training until validation scores don't improve for 100 rounds.
[1000]	training's auc: 0.997964	valid_1's auc: 0.962431
[2000]	training's auc: 0.999892	valid_1's auc: 0.966918
[3000]	training's auc: 0.999996	valid_1's auc: 0.967824
Early stopping, best iteration is:
[2983]	training's auc: 0.999996	valid_1's auc: 0.967841
Training until validation scores don't improve for 100 rounds.
[1000]	training's auc: 0.997748	valid_1's auc: 0.964897
[2000]	training's auc: 0.999889	valid_1's auc: 0.968642
Early stopping, best iteration is:
[2817]	training's auc: 0.99999	valid_1's auc: 0.969374
Training until validation scores don't improve for 100 rounds.
[1000]	training's auc: 0.997801	valid_1's auc: 0.960305
[2000]	training's auc: 0.999887	valid_1's auc: 0.964193
[3000]	training's auc: 0.999995	valid_1's auc: 0.96514
Early stopping, best iteration is:
[2995]	training's auc: 0.999995	valid_1's auc: 0.965149
kfold avg auc: 0.9310010973233493
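
To see what "keeps the target ratio" means in practice, here is a small sketch on made-up labels that counts the positives in each validation fold for both splitters. The toy data and the helper `positives_per_fold` are assumptions for illustration; the fraud data is far larger and far more imbalanced.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 9 negatives followed by 3 positives (25% positive rate).
y = np.array([0] * 9 + [1] * 3)
X = np.arange(len(y)).reshape(-1, 1)


def positives_per_fold(splitter):
    """Count positive labels in each validation fold."""
    return [int(y[valid_idx].sum()) for _, valid_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=3))
stratified = positives_per_fold(StratifiedKFold(n_splits=3))
print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, plain KFold puts all positives into the last fold of this sorted toy set, while StratifiedKFold gives each fold the same share. With a rare target like `isFraud`, that can make fold scores more comparable.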


In [23]:
# LBO (leave one block out)
# The last month of data (test_x, test_y) is the validation set for early stopping,
# which tells us the best number of rounds on future data
estimators_bestround = []

for i, (tr_i, _) in enumerate(folds.split(train_x,train_y)):
    tr_x, tr_y = train_x.iloc[tr_i,:], train_y.iloc[tr_i]
    train = lgb.Dataset(tr_x,tr_y)
    # Unlike the KFold cell above, validate on the held-out month rather than on a fold
    test = lgb.Dataset(test_x,test_y)
    clf = lgb.train(
        lgb_params,
        train,
        valid_sets=[train,test],
        callbacks=[lgb.log_evaluation(1000)],  # replaces verbose_eval, which LightGBM 4.0 removed
    )
    # current_iteration() counts every round trained, including those run after the best one;
    # clf.best_iteration holds the best round itself
    estimators_bestround.append(clf.current_iteration())

Training until validation scores don't improve for 100 rounds.
[1000]	training's auc: 0.997964	valid_1's auc: 0.92682
Early stopping, best iteration is:
[1277]	training's auc: 0.999116	valid_1's auc: 0.927278
Training until validation scores don't improve for 100 rounds.
[1000]	training's auc: 0.997748	valid_1's auc: 0.930164
Early stopping, best iteration is:
[1145]	training's auc: 0.998546	valid_1's auc: 0.930523
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[751]	training's auc: 0.995247	valid_1's auc: 0.927125
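
The next cell averages the per-fold round counts into a single fixed number of rounds. As a rough sanity check, this sketch does the same arithmetic on the best iterations printed in the log above. It assumes the list holds exactly those logged values; `current_iteration()` may report a larger number, since training continues past the best round before early stopping triggers.

```python
# Best iterations copied from the early-stopping log above.
best_iterations = [1277, 1145, 751]

# Same truncating average as int(np.mean(...)).
mean_rounds = int(sum(best_iterations) / len(best_iterations))
spread = max(best_iterations) - min(best_iterations)
print(mean_rounds, spread)
```

The folds disagree by several hundred rounds, so the averaged value is only an approximate setting.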


In [25]:
# Use the average number of rounds from the previous cell as a fixed setting, to reduce overfitting
corrected_lgb_params = lgb_params.copy()
corrected_lgb_params['n_estimators'] = int(np.mean(estimators_bestround))
corrected_lgb_params['early_stopping_rounds'] = None

RESULTS['lbo'] = 0

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x,train_y)):    
    tr_x, tr_y = train_x.iloc[trn_idx, :], train_y[trn_idx]    
    train = lgb.Dataset(tr_x, label=tr_y)
    clf = lgb.train(
        corrected_lgb_params,
        train
    )
    RESULTS['lbo'] += clf.predict(test_x) / N_SPLITS

print('lbo auc score: ', metrics.roc_auc_score(test_y, RESULTS['lbo']))

lbo auc score:  0.9319127497119095
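
Both the `kfold` and the `lbo` columns are built by adding each fold model's prediction divided by `N_SPLITS`. The sketch below, on made-up probabilities, shows that this accumulation is just the mean of the fold predictions. The arrays are hypothetical and stand in for the output of `clf.predict(test_x)`.

```python
import numpy as np

N_SPLITS = 3
# Hypothetical per-fold probabilities for four test rows (made-up numbers).
fold_preds = [
    np.array([0.10, 0.80, 0.30, 0.60]),
    np.array([0.20, 0.70, 0.30, 0.90]),
    np.array([0.30, 0.90, 0.30, 0.60]),
]

blended = np.zeros(4)
for pred in fold_preds:
    blended += pred / N_SPLITS  # same accumulation pattern as RESULTS[...] +=

print(blended.round(3))
```

Starting the accumulator at zero matters: if a cell is re-run without resetting the column, the old predictions are added again.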


In [44]:
# GroupKFold, adapted from https://www.kaggle.com/kyakovlev/ieee-cv-options#GroupKFold
# The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

# Why might we use it? Imagine that we want to split the train data by time blocks, client IDs or some other grouping.
# With GroupKFold we can be sure that the validation fold only contains group IDs that are not in the training folds.
# Sometimes this helps against data leakage and overfitting.

# A toy example
import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8],[9,10],[11,12]])
y = np.array([1, 2, 3, 4,5,6])
groups = np.array([0, 0, 1, 1, 2, 2])
group_kfold = GroupKFold(n_splits=2)
print(group_kfold)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

GroupKFold(n_splits=2)
TRAIN: [2 3] TEST: [0 1 4 5]
[[5 6]
 [7 8]] [[ 1  2]
 [ 3  4]
 [ 9 10]
 [11 12]] [3 4] [1 2 5 6]
TRAIN: [0 1 4 5] TEST: [2 3]
[[ 1  2]
 [ 3  4]
 [ 9 10]
 [11 12]] [[5 6]
 [7 8]] [1 2 5 6] [3 4]
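
Applied to this notebook, the natural group would be a time block such as the month. The following sketch uses made-up month labels standing in for a column like `DT_M` and checks that no month appears on both sides of a split. The toy arrays are assumptions; on the real data you would pass `train_df['DT_M']` as `groups`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy month labels standing in for a column like DT_M (values are made up).
months = np.array([12, 12, 12, 13, 13, 14, 14, 15, 15, 15])
X = np.arange(len(months)).reshape(-1, 1)
y = np.zeros(len(months), dtype=int)

splits = list(GroupKFold(n_splits=4).split(X, y, groups=months))
for train_idx, valid_idx in splits:
    train_months = set(months[train_idx].tolist())
    valid_months = set(months[valid_idx].tolist())
    print("train months:", sorted(train_months), "valid months:", sorted(valid_months))
```

With four months and four splits, each validation fold is exactly one month, which is close to the "leave one block out" idea used earlier.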
