While stage 1 is running, it is easy to build models that look too good to be true. A quick look at the leaderboard shows that, [as happens every time](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/discussion/130649), a lot of us found a way to submit a perfect score.

This notebook offers some ideas on how to benchmark your models for both the classic competitions (where we predict winning probabilities) and the new ones (where we predict the point spread).

We will make use of [**TubesML**](https://pypi.org/project/tubesml/), which helps us not worry about information leakage during the validation process, no matter how complex the model pipeline gets (and using it here helps me develop it and get to the next release faster).

In [None]:
!pip install tubesml==0.2.0

In [None]:
import numpy as np
import pandas as pd

import tubesml as tml

from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, train_test_split
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error, log_loss
from sklearn.model_selection import KFold

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import mm_data_manipulation as mm

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Quick data preparation

The goal of the notebook is just to give some ideas about model validation, not to build an actual model. Therefore, let's just make a simple training dataset. What follows should work with any training set and any model.

In [None]:
def make_training_data(details, targets):
    tmp = details.copy()
    tmp.columns = ['Season', 'Team1'] + \
                ['T1_'+col for col in tmp.columns if col not in ['Season', 'TeamID']]
    total = pd.merge(targets, tmp, on=['Season', 'Team1'], how='left')

    tmp = details.copy()
    tmp.columns = ['Season', 'Team2'] + \
                ['T2_'+col for col in tmp.columns if col not in ['Season', 'TeamID']]
    total = pd.merge(total, tmp, on=['Season', 'Team2'], how='left')
    
    if total.isnull().any().any():
        raise ValueError('Something went wrong')
        
    stats = [col[3:] for col in total.columns if 'T1_' in col and 'region' not in col]

    for stat in stats:
        total['delta_'+stat] = total['T1_'+stat] - total['T2_'+stat]
        
    try:
        total['delta_off_edge'] = total['T1_off_rating'] - total['T2_def_rating']
        total['delta_def_edge'] = total['T2_off_rating'] - total['T1_def_rating']
    except KeyError:
        pass
        
    return total


def add_seed(seed_location, total):
    seed_data = pd.read_csv(seed_location)
    seed_data['region'] = seed_data['Seed'].apply(lambda x: x[0])
    seed_data['Seed'] = seed_data['Seed'].apply(lambda x: int(x[1:3]))
    total = pd.merge(total, seed_data, how='left', on=['TeamID', 'Season'])
    return total


def make_teams_target(data, league):
    if league == 'men':
        limit = 2003
    else:
        limit = 2010

    df = data[data.Season >= limit].copy()

    df['Team1'] = np.where((df.WTeamID < df.LTeamID), df.WTeamID, df.LTeamID)
    df['Team2'] = np.where((df.WTeamID > df.LTeamID), df.WTeamID, df.LTeamID)
    df['target'] = np.where((df['WTeamID'] < df['LTeamID']),1,0)
    df['target_points'] = np.where((df['WTeamID'] < df['LTeamID']),df.WScore - df.LScore,df.LScore - df.WScore)
    df.loc[df.WLoc == 'N', 'LLoc'] = 'N'
    df.loc[df.WLoc == 'H', 'LLoc'] = 'A'
    df.loc[df.WLoc == 'A', 'LLoc'] = 'H'
    df['T1_Loc'] = np.where((df.WTeamID < df.LTeamID), df.WLoc, df.LLoc)
    df['T2_Loc'] = np.where((df.WTeamID > df.LTeamID), df.WLoc, df.LLoc)
    df['T1_Loc'] = df['T1_Loc'].map({'H': 1, 'A': -1, 'N': 0})
    df['T2_Loc'] = df['T2_Loc'].map({'H': 1, 'A': -1, 'N': 0})

    reverse = data[data.Season >= limit].copy()
    reverse['Team1'] = np.where((reverse.WTeamID > reverse.LTeamID), reverse.WTeamID, reverse.LTeamID)
    reverse['Team2'] = np.where((reverse.WTeamID < reverse.LTeamID), reverse.WTeamID, reverse.LTeamID)
    reverse['target'] = np.where((reverse['WTeamID'] > reverse['LTeamID']),1,0)
    reverse['target_points'] = np.where((reverse['WTeamID'] > reverse['LTeamID']),
                                        reverse.WScore - reverse.LScore,
                                        reverse.LScore - reverse.WScore)
    reverse.loc[reverse.WLoc == 'N', 'LLoc'] = 'N'
    reverse.loc[reverse.WLoc == 'H', 'LLoc'] = 'A'
    reverse.loc[reverse.WLoc == 'A', 'LLoc'] = 'H'
    reverse['T1_Loc'] = np.where((reverse.WTeamID > reverse.LTeamID), reverse.WLoc, reverse.LLoc)
    reverse['T2_Loc'] = np.where((reverse.WTeamID < reverse.LTeamID), reverse.WLoc, reverse.LLoc)
    reverse['T1_Loc'] = reverse['T1_Loc'].map({'H': 1, 'A': -1, 'N': 0})
    reverse['T2_Loc'] = reverse['T2_Loc'].map({'H': 1, 'A': -1, 'N': 0})
    
    df = pd.concat([df, reverse], ignore_index=True)

    to_drop = ['WScore','WTeamID', 'LTeamID', 'LScore', 'WLoc', 'LLoc', 'NumOT']
    for col in to_drop:
        del df[col]
    
    df.loc[:,'ID'] = df.Season.astype(str) + '_' + df.Team1.astype(str) + '_' + df.Team2.astype(str)
    return df


def prepare_data(league):
    save_loc = 'processed_data/' + league + '/'

    if league == 'women':
        regular_season = '/kaggle/input/ncaaw-march-mania-2021-spread/WRegularSeasonDetailedResults.csv'
        playoff = '/kaggle/input/ncaaw-march-mania-2021/WNCAATourneyDetailedResults.csv'
        playoff_compact = '/kaggle/input/ncaaw-march-mania-2021/WNCAATourneyCompactResults.csv'
        seed = '/kaggle/input/ncaaw-march-mania-2021/WNCAATourneySeeds.csv'
        save_loc = 'data/processed_women/'
    else:
        regular_season = '/kaggle/input/ncaam-march-mania-2021-spread/MRegularSeasonDetailedResults.csv'
        playoff = '/kaggle/input/ncaam-march-mania-2021/MNCAATourneyDetailedResults.csv'
        playoff_compact = '/kaggle/input/ncaam-march-mania-2021/MNCAATourneyCompactResults.csv'
        seed = '/kaggle/input/ncaam-march-mania-2021/MNCAATourneySeeds.csv'
        save_loc = 'data/processed_men/'
    
    # Season stats
    reg = pd.read_csv(regular_season)
    reg = mm.process_details(reg)
    regular_stats = mm.full_stats(reg)
    
    regular_stats = add_seed(seed, regular_stats)    
    
    # Target data generation 
    target_data = pd.read_csv(playoff_compact)
    target_data = make_teams_target(target_data, league)
    
    all_reg = make_training_data(regular_stats, target_data)
    all_reg = all_reg[all_reg.DayNum >= 136]  # remove pre tourney 
    
    return all_reg

In [None]:
train_men = prepare_data('men')[['Season', 'target', 'target_points', 'ID', 'delta_Seed', 'delta_Score']]
train_women = prepare_data('women')[['Season', 'target', 'target_points', 'ID', 'delta_Seed', 'delta_Score']]
train_men.head()

# Random split of data

This method is fairly quick and the most basic one: set a test set aside and evaluate your model on it. We define the following functions just to keep things clean.
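
To make the idea concrete without TubesML, here is a minimal sketch of a hold-out split stratified on the season, using scikit-learn's `train_test_split`. The toy data is made up, and this is only my assumption of what a split with `strat_feat='Season'` roughly does, not the actual implementation of `tml.make_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 3 seasons with 20 games each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Season': np.repeat([2017, 2018, 2019], 20),
    'delta_Seed': rng.integers(-15, 16, size=60),
    'target_points': rng.normal(0, 10, size=60).round(),
})

# Hold out 20% of the games, keeping the share of each season
# the same in the training and in the test set
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['Season'], random_state=324
)

print(toy_train['Season'].value_counts().sort_index())
print(toy_test['Season'].value_counts().sort_index())
```

Every season contributes the same fraction of games to the test set, so no year is over- or under-represented when we evaluate.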

In [None]:
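def _clean_columns(train, test):
    # drop targets and identifiers so that only the features are left;
    # errors='ignore' skips the columns that are not in the data
    to_drop = ['target', 'target_points', 'ID', 'DayNum', 
               'Team1', 'Team2', 'Season', 'competitive', 'competitive_score']
    train = train.drop(columns=to_drop, errors='ignore')
    test = test.drop(columns=to_drop, errors='ignore')
    return train, test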


def _make_preds(train, y_train, test, model, kfolds, predict_proba):
    # this function can be made fancier with, for example, the usual kfold 
    # with early stopping and prediction on the test set
    # We keep it simpler here
    from sklearn.base import clone
    
    oof, imp_coef = tml.cv_score(data=train, target=y_train, estimator=model, 
                                 cv=kfolds, imp_coef=True, predict_proba=predict_proba)
    
    # fit a fresh copy of the model: fitting `model` itself would make every
    # call with the same estimator return (and overwrite) the very same object
    fit_model = clone(model).fit(train, y_train)
    if predict_proba:
        predictions = fit_model.predict_proba(test)[:,1]
    else:
        predictions = fit_model.predict(test)
    
    return fit_model, oof, imp_coef, predictions


def random_split(data, model, kfolds, target, test_size=0.2, predict_proba=False, tune=False, param_grid=None):
    
    # split the data, it is possible to stratify on the years
    train, test = tml.make_test(data, test_size=test_size, strat_feat='Season', random_state=324)
    
    y_train = train[target]
    y_test = test[target]
    
    # make sure unwanted columns are not there
    train, test = _clean_columns(train, test)
    
    if tune:  # optional if you like it
        if predict_proba:
            grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, 
                                cv=5, scoring='neg_log_loss')
        else:
            grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, 
                                cv=5, scoring='neg_mean_absolute_error')
        grid.fit(train, y_train)
        model = grid.best_estimator_
        print(grid.best_score_)
        print(grid.best_params_)
    
    # Cross validation with Kfold on train set + retraining and prediction on the test set
    fit_model, oof, imp_coef, predictions = _make_preds(train, y_train, test, model, kfolds, predict_proba)
    
    return fit_model, oof, predictions, imp_coef, train, y_train, test, y_test

We can then build a very simple pipeline and predict **the point spread** with this validation setup.

In [None]:
pipe = Pipeline([('scl', tml.DfScaler()), ('ridge', Ridge())])

kfolds = KFold(n_splits=5, shuffle=True, random_state=345)

fitted, oof_pred, test_pred, imp_coef, train, y_train, test, y_test = random_split(train_men, pipe, kfolds, 'target_points')

imp_coef

Perfect, now we just need a function to evaluate what we have produced.
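
Before looking at the report on real predictions, a toy example with made-up numbers may help to read the metrics it prints. The spreads and predictions below are hypothetical; the 3-point threshold for an "unsure" prediction is the one used by the report function in this notebook.

```python
import numpy as np

# Hypothetical point spreads (Team1 score minus Team2 score) and predictions
y_true = np.array([10, -4, 2, -15, 7])
y_pred = np.array([6.0, 1.0, 2.5, -9.0, -2.0])

errors = y_pred - y_true
mae = np.abs(errors).mean()                        # average size of the miss, in points
rmse = np.sqrt((errors ** 2).mean())               # like MAE, but big misses weigh more
sign_acc = ((y_pred > 0) == (y_true > 0)).mean()   # did we pick the right winner?
unsure = (np.abs(y_pred) < 3).mean() * 100         # % of games predicted within 3 points

print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}, accuracy: {sign_acc:.2f}, unsure: {unsure:.0f}%')
```

The accuracy only looks at the sign of the spread, so a model can have a decent accuracy and still be far off in points, which is why we report both.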

In [None]:
def report_points(train, test, y_train, y_test, oof, preds, plot=True):
    mae_oof = round(mean_absolute_error(y_true=y_train, y_pred=oof), 4)
    mae_test = round(mean_absolute_error(y_true=y_test, y_pred=preds), 4)
    mse_oof = round(np.sqrt(mean_squared_error(y_true=y_train, y_pred=oof)), 4)
    mse_test = round(np.sqrt(mean_squared_error(y_true=y_test, y_pred=preds)), 4)
    acc_oof = round(accuracy_score(y_true=(y_train>0).astype(int), y_pred=(oof>0).astype(int)),4)
    acc_test = round(accuracy_score(y_true=(y_test>0).astype(int), y_pred=(preds>0).astype(int)),4)
    n_unsure_oof = round((abs(oof) < 3).mean() * 100, 2)
    n_unsure_test = round((abs(preds) < 3).mean() * 100, 2)

    if plot:
        # plot predictions
        tml.plot_regression_predictions(train, y_train, oof)
        tml.plot_regression_predictions(test, y_test, preds)
    
    print(f'MAE train: \t\t\t {mae_oof}')
    print(f'MAE test: \t\t\t {mae_test}')
    print(f'RMSE train: \t\t\t {mse_oof}')
    print(f'RMSE test: \t\t\t {mse_test}')
    print(f'Accuracy train: \t\t {acc_oof}')
    print(f'Accuracy test: \t\t\t {acc_test}')
    print(f'Unsure train: \t\t\t {n_unsure_oof}%')
    print(f'Unsure test: \t\t\t {n_unsure_test}%')
    
    
report_points(train, test, y_train, y_test, oof_pred, test_pred)

Not a great model, but we knew this already. 

The same setup also works to predict **the probability of winning**.

In [None]:

pipe = Pipeline([('scl', tml.DfScaler()), 
                 ('logit', LogisticRegression(solver='lbfgs', multi_class='auto'))])

fitted, oof_pred, test_pred, imp_coef, train, y_train, test, y_test = random_split(train_women, pipe, 
                                                                                   kfolds, 'target', 
                                                                                   predict_proba=True)

imp_coef

Which needs a slightly different function for reporting.
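
One of the numbers in this report is the log loss. As a rough reference for reading it, here is a minimal hand-rolled sketch: the helper `binary_log_loss`, the toy outcomes, and the `eps` clipping value are all assumptions made for illustration (the report itself relies on scikit-learn's `log_loss`).

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the observed outcomes."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

outcomes = [1, 0, 1, 1]  # hypothetical results

coin_flip = binary_log_loss(outcomes, [0.5] * 4)                 # no information at all
confident_right = binary_log_loss(outcomes, [0.9, 0.1, 0.9, 0.9])
one_bad_miss = binary_log_loss(outcomes, [0.9, 0.1, 0.9, 0.01])  # last game badly wrong

print(round(coin_flip, 4), round(confident_right, 4), round(one_bad_miss, 4))
```

Always predicting 0.5 gives ln(2), about 0.693, and a single confident miss is enough to end up worse than that. A score close to 0 would require being confidently right on every game.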

In [None]:
def plot_pred_prob(oof, test, y_train, y_test):
    
    fig, ax = plt.subplots(1,2, figsize=(15, 6))
    
    df = pd.DataFrame()
    df['true'] = np.where(y_train > 0, 1, 0)
    df['Prediction'] = oof
    
    df[df.true==1]['Prediction'].hist(bins=50, ax=ax[0], alpha=0.5, color='g', label='Victory')
    df[df.true==0]['Prediction'].hist(bins=50, ax=ax[0], alpha=0.5, color='r', label='Loss')
    
    df = pd.DataFrame()
    df['true'] = np.where(y_test > 0, 1, 0)
    df['Prediction'] = test

    df[df.true==1]['Prediction'].hist(bins=50, ax=ax[1], alpha=0.5, color='g', label='Victory')
    df[df.true==0]['Prediction'].hist(bins=50, ax=ax[1], alpha=0.5, color='r', label='Loss')
    
    ax[0].axvline(0.5, color='k', linestyle='--')
    ax[1].axvline(0.5, color='k', linestyle='--')
    
    ax[0].set_title('Training data')
    ax[1].set_title('Test data')
    ax[0].grid(False)
    ax[1].grid(False)
    ax[0].legend()
    ax[1].legend()
    fig.suptitle('Probabilities of victory', fontsize=15)

In [None]:
def report_victory(y_train, y_test, oof, preds, probs=True):
    
    if not probs:
        # all the metrics below need predicted probabilities
        raise NotImplementedError('report_victory only supports probs=True')

    acc_oof = round(accuracy_score(y_true=y_train, y_pred=(oof>0.5).astype(int)),4)
    acc_test = round(accuracy_score(y_true=y_test, y_pred=(preds>0.5).astype(int)),4)
    n_unsure_oof = round((abs(oof - 0.5) < 0.1).mean() * 100, 4)
    n_unsure_test = round((abs(preds - 0.5) < 0.1).mean() * 100, 4)
    logloss_oof = round(log_loss(y_true=y_train, y_pred=oof), 4)
    logloss_test = round(log_loss(y_true=y_test, y_pred=preds), 4)

    plot_pred_prob(oof, preds, y_train, y_test)
    
    print(f'Accuracy train: \t\t {acc_oof}')
    print(f'Accuracy test: \t\t\t {acc_test}')
    print(f'Logloss train: \t\t\t {logloss_oof}')
    print(f'Logloss test: \t\t\t {logloss_test}')
    print(f'Unsure train: \t\t\t {n_unsure_oof}%')
    print(f'Unsure test: \t\t\t {n_unsure_test}%')
    
report_victory(y_train, y_test, oof_pred, test_pred)

# Yearly split of the data

In the simple train/test split we are also using future tournaments to predict past ones, which is quick but arguably doesn't give a good read on how the models will do this year.

A different validation strategy is to **simulate this competition** by training only on a set of years and predicting on the next one.

The results on the test set with this strategy can be easily compared with last year's competitions.
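
The core of this strategy fits in a few lines. The following is a minimal sketch on toy data: the generator name `expanding_season_splits` and the toy frame are hypothetical, and the function defined next does the same thing with the modeling steps added.

```python
import pandas as pd

def expanding_season_splits(data, test_seasons, season_col='Season'):
    """Yield (season, train, test): train on all earlier seasons, test on one."""
    for season in test_seasons:
        train = data[data[season_col] < season]
        test = data[data[season_col] == season]
        yield season, train, test

# Toy data (hypothetical): 4 games per season
toy = pd.DataFrame({'Season': [2016] * 4 + [2017] * 4 + [2018] * 4 + [2019] * 4,
                    'target': [1, 0] * 8})

for season, train, test in expanding_season_splits(toy, [2018, 2019]):
    train_seasons = sorted(train['Season'].unique().tolist())
    print(season, 'train seasons:', train_seasons, 'test games:', len(test))
```

The training set grows with every tested season and never contains games from the season being predicted, or from later ones.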

In [None]:
def yearly_split(data, model, kfolds, target, predict_proba=False, tune=False, param_grid=None):
    
    fit_model = {}
    oof = {}
    imp_coef = {}
    train = {}
    test = {}
    y_train = {}
    y_test = {}
    predictions = {}
    
    years = [2015, 2016, 2017, 2018, 2019]
    
    for year in years:
        yr = str(year)
        train[yr] = data[data.Season < year].copy()
        test[yr] = data[data.Season == year].copy()
    
        y_train[yr] = train[yr][target]
        y_test[yr] = test[yr][target]

        train[yr], test[yr] = _clean_columns(train[yr], test[yr])
        
        if tune:
            if predict_proba:
                grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, 
                                    cv=5, scoring='neg_log_loss')
            else:
                grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, 
                                    cv=5, scoring='neg_mean_absolute_error')
            grid.fit(train[yr], y_train[yr])
            model = grid.best_estimator_
            print(grid.best_score_)
            print(grid.best_params_)
        
        fit_model[yr], oof[yr], imp_coef[yr], predictions[yr] = _make_preds(train[yr], 
                                                                            y_train[yr], 
                                                                            test[yr], 
                                                                            model, 
                                                                            kfolds, 
                                                                            predict_proba)
    
    return fit_model, oof, predictions, imp_coef, train, y_train, test, y_test

In [None]:
pipe = Pipeline([('scl', tml.DfScaler()), ('ridge', Ridge())])

fitted, oof_pred, test_pred, imp_coef, train, y_train, test, y_test = yearly_split(train_men, pipe, kfolds, 'target_points')

fitted.keys()

The function returns dictionaries with the results year by year.

We can then leverage the previous functions and just wrap them to see a report by year.

In [None]:
def yearly_wrapper(train, test, y_train, y_test, oof, preds, proba=False):
    y_train_total = []
    y_test_total = []
    oof_total = []
    preds_total = []
    for yr in train.keys():
        print(yr)
        print('\n')
        if proba:
            report_victory(y_train[yr], y_test[yr], oof[yr], preds[yr], probs=True)
        else:
            report_points(train[yr], test[yr], y_train[yr], y_test[yr], oof[yr], preds[yr], plot=False)
        print('\n')
        print('_'*40)
        print('\n')
        y_train_total.append(y_train[yr])
        y_test_total.append(y_test[yr])
        oof_total += list(oof[yr])
        preds_total += list(preds[yr])
        
    print('Total predictions')
    print('\n')
    y_train_total = pd.concat(y_train_total, ignore_index=True)
    y_test_total = pd.concat(y_test_total, ignore_index=True)
    oof_total = pd.Series(oof_total)
    preds_total = pd.Series(preds_total)
    if proba:
        report_victory(y_train_total, y_test_total, oof_total, preds_total)
    else:
        report_points(train[yr], test[yr], y_train_total, y_test_total, oof_total, preds_total, plot=False)
    

yearly_wrapper(train, test, y_train, y_test, oof_pred, test_pred)

And, again, we can have a similar report when predicting the probability of winning

In [None]:
pipe = Pipeline([('scl', tml.DfScaler()), 
                 ('logit', LogisticRegression(solver='lbfgs', multi_class='auto'))])

fitted, oof_pred, test_pred, imp_coef, train, y_train, test, y_test = yearly_split(train_women, pipe, kfolds, 'target', predict_proba=True)

yearly_wrapper(train, test, y_train, y_test, oof_pred, test_pred, proba=True)

*Note: this model would have placed you 213th in the 2019 competition and, as you can see, it takes about 35 seconds of work.*

*Another note: this model outscored my submission (with a coding error, but still) by more than 70 positions...*

# Conclusion

Validating your model appropriately will enable you to make informed modeling decisions without worrying that your results look better than they really are. The two methods presented here are simple but effective in simulating how the model is likely to behave in stage 2, when you will need to wait for the actual games to happen to know your score. Each of the functions presented here merely illustrates the concept, and it is not difficult to increase their complexity to get better insights into how the model is performing.

Good luck and enjoy the best yearly competitions on Kaggle!


P.S. For an even better model evaluation, moving the processing of the data inside a pipeline might look like a lot of work, but it is also a great skill to master for real ML applications. For examples of how to do so, you can read [this notebook](https://www.kaggle.com/lucabasa/understand-and-use-a-pipeline).
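
As a small taste of why this matters, here is a minimal sketch with plain scikit-learn on synthetic data. I swap `tml.DfScaler` for `StandardScaler` so that the example runs on a NumPy array; the data and the coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 200 games, 3 features, noisy linear target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([5.0, -2.0, 0.0]) + rng.normal(scale=3.0, size=200)

# The scaler lives inside the pipeline, so in every fold it is fitted
# on the training part only and merely applied to the validation part
pipe = Pipeline([('scl', StandardScaler()), ('ridge', Ridge())])
kfolds = KFold(n_splits=5, shuffle=True, random_state=345)

scores = cross_val_score(pipe, X, y, cv=kfolds, scoring='neg_mean_absolute_error')
print(f'MAE per fold: {np.round(-scores, 2)}')
```

Scaling the whole dataset before the cross-validation would let the validation folds influence the scaler. With a simple scaler the effect is small, but the same pattern protects you when the processing steps get more complex.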