# Simple MLP for March Madness Predictions 🏀🏀🏀

In this notebook, I predict March Madness scores using a combination of feature engineering and neural networks! The notebook includes a simple neural network, which is trained on a large amount of somewhat related data, and an advanced neural network, which is trained on a small amount of related data. I'll cover the preprocessing, implementations, predictions, and strengths/weaknesses of each model, and explain my thought process and steps as the notebook progresses. So without further ado, let's begin!

Note: You can follow along with the code even if you don't know basketball very well; however, there are some nuances in the feature engineering which involve some knowledge of basketball (references are provided though). I also use the terms neural network, NN (short for neural network), and <a href="https://machinelearningmastery.com/neural-networks-crash-course/">MLP</a> (short for multilayer perceptron) interchangeably throughout this notebook.

**If you are interested in using this code to generate results which you can use to fill out your March Madness bracket, check out my related notebook <a href="https://www.kaggle.com/ironicninja/generate-a-march-madness-bracket"><strong>here</strong></a>!**

# Essential Imports

Note, not every import is used here; this is just a convenient template I like to use (+ it doesn't hurt to have additional imports here)!

In [None]:
#-----General------#
import numpy as np
import pandas as pd
import os
import sys
import math
import random

#-----Plotting-----#
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import seaborn as sns
from pandas_profiling import ProfileReport

#-----Utility-----#
import itertools
import warnings
warnings.filterwarnings("ignore")
import re
import gc

#-----DS Packages-----#
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.model_selection import GroupKFold, StratifiedKFold, RepeatedStratifiedKFold, KFold, train_test_split
from sklearn.decomposition import PCA
from scipy.stats import pearsonr, spearmanr
import xgboost as xgb
import keras
import keras.layers as layers

#-----Random Keras Stuff for Quick Implementation-----#
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2
from keras.initializers import GlorotNormal
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Load in the Data

For this notebook, I use only 5 of the 20 csv files provided. The main reason is that most of the data is unnecessary for predictions (e.g. cities, teams, etc.), not detailed enough for predictions (compact versions), or too complicated to effectively include in the model (e.g. coaches). I'll give a quick description of each file I use:

* ```STAGE (1 or 2)``` -  Determines which folder to take data from.
* ```regular_df (MRegularSeasonDetailedResults.csv)``` - Detailed regular season games starting from 2003.
* ```tourney_df (MNCAATourneyDetailedResults.csv)``` - Detailed MNCAA "March Madness" Tournament games starting from 2003.
* ```nit_df (MSecondaryTourneyCompactResults.csv)``` - National Invitation Tournament (secondary tournament) games starting from 2003.
* ```rankings_df (MMasseyOrdinals.csv)``` - College basketball rankings organized by Kenneth Massey.
* ```teams_df (MTeams.csv)``` - Contains the team name associated with each TeamID.

I'll also give a brief rundown of other files I experimented with but did not use in my final predictions:

* ```seeding_df (MNCAATourneySeeds.csv)``` - Seeds for the MNCAA Tournament. Did not include because of the difficulty quantifying differences between conferences (i.e., Seed 1 in Conference A is sometimes only as good as Seed 3 in Conference B).
* ```conference_tourney_df (MConferenceTourneyGames.csv)``` - Compact results of conference tourney games. There was no score differential for these games and therefore they were not included in the analysis.

In [None]:
%%time

STAGE = 2

regular_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MRegularSeasonDetailedResults.csv")
tourney_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MNCAATourneyDetailedResults.csv")
nit_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MSecondaryTourneyCompactResults.csv")
rankings_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MMasseyOrdinals.csv")
teams_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MTeams.csv")

# Not formally used but experimented with before

seeding_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MNCAATourneySeeds.csv")
conference_tourney_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MConferenceTourneyGames.csv")

# Data Preprocessing

In this section, I prepare the DataFrames that I will use as my training data. I'll provide some commentary for each function in the markdown above the code.

<h3> Function Descriptions </h3>

* ```poss(game, college)``` - Calculates the number of possessions for the game.
* ```offense_calc(game, college=True)``` - Calculates the offensive rating (ORTG) of a team. For a team, this is points per 100 possessions.
* ```defense_calc(game, college=True)``` - Calculates the defensive rating (DRTG) of a team. For a team, this is points allowed per 100 possessions.

Since we are in the context of college basketball, I opt to use the college-specific calculation of ORTG and DRTG.

<h3> References </h3>

* <a href="https://www.sports-reference.com/cbb/about/glossary.html"> College Basketball Reference </a>
* <a href="https://www.basketball-reference.com/about/glossary.html"> NBA Basketball Reference </a>

In [None]:
def poss(game, college):
    PTS = game['Score']
    FGM = game['FGM']
    FGA = game['FGA']
    FTA = game['FTA']
    TOV = game['TO']
    ORB = game['OR']
    DRB = game['DR']
    
    OppPTS = game['OppScore']
    OppFGM = game['OppFGM']
    OppFGA = game['OppFGA']
    OppFTA = game['OppFTA']
    OppTOV = game['OppTO']
    OppORB = game['OppOR']
    OppDRB = game['OppDR']
    
    if college:
        return 0.5 * (FGA + 0.475 * FTA - ORB + TOV) + 0.5 * (OppFGA + 0.475 * OppFTA - OppORB + OppTOV)
    else:
        return 0.5 * ((FGA + 0.4 * FTA - 1.07 * (ORB / (ORB + OppDRB)) * (FGA - FGM) + TOV) + 
               (OppFGA + 0.4 * OppFTA - 1.07 * (OppORB / (OppORB + DRB)) * (OppFGA - OppFGM) + OppTOV))

def offense_calc(game, college=True):
    PTS = game['Score']
    POSS = poss(game, college)
    return PTS/POSS*100

def defense_calc(game, college=True):
    PTS = game['OppScore']
    POSS = poss(game, college)
    return PTS/POSS*100
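
As a quick sanity check of the college formula, here is a minimal sketch with made-up box-score numbers. The values and the helper name `college_possessions` are hypothetical and not taken from the dataset; the arithmetic mirrors the `college=True` branch of ```poss```.

```python
def college_possessions(fga, fta, orb, tov, opp_fga, opp_fta, opp_orb, opp_tov):
    """Average of both teams' estimated possessions (college formula, 0.475 FTA factor)."""
    team = fga + 0.475 * fta - orb + tov
    opp = opp_fga + 0.475 * opp_fta - opp_orb + opp_tov
    return 0.5 * (team + opp)

# Hypothetical game: our team scores 75 and allows 68
possessions = college_possessions(fga=60, fta=20, orb=10, tov=12,
                                  opp_fga=58, opp_fta=16, opp_orb=8, opp_tov=14)
ortg = 75 / possessions * 100   # points scored per 100 possessions
drtg = 68 / possessions * 100   # points allowed per 100 possessions
print(round(possessions, 2), round(ortg, 1), round(drtg, 1))
```

Because both ratings share the same possession estimate, a team that outscores its opponent always ends up with ORTG above DRTG.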

<h3> Function Description </h3>

```extract_weighted_avg(season, weight_type='linear')``` - Calculates the average stats for each team in a given season. Utilizes a weighted function using the parameter ```weight_type```.

<h3> Options for parameter weight_type</h3>

* ```linear``` - A game played on Day 1 of the season has a weight of 1; a game played on Day 100 of the season has a weight of 100. More recent games have **a lot more** weight than older games.
* ```log``` - A game played on Day 1 of the season has a weight of ~0.7 (ln 2); a game played on Day 100 of the season has a weight of ~4.6 (ln 101). More recent games have **somewhat more** weight than older games, far less than with ```linear```.
* ```normal``` - A game played on Day 1 of the season has a weight of 1; a game played on Day 100 of the season has a weight of 1. Recent games are indistinguishable from older games (no weight difference).
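
To make the three options concrete, here is a small sketch that mirrors the weight expressions used in ```extract_weighted_avg```. The helper `day_weight` and the three-game schedule are hypothetical, chosen only for illustration.

```python
import numpy as np

def day_weight(day_offset, weight_type="linear"):
    """Weight for a game played `day_offset` days after the season's first game day (0-based)."""
    if weight_type == "linear":
        return day_offset + 1
    if weight_type == "log":
        return np.log(day_offset + 2)   # natural log, as np.log is in the notebook
    if weight_type == "normal":
        return 1
    raise ValueError("weight_type must be 'linear', 'log', or 'normal'")

# Hypothetical team: three games with improving scores over the season
offsets = np.array([0, 49, 99])
scores = np.array([60.0, 70.0, 80.0])
averages = {}
for wt in ("linear", "log", "normal"):
    weights = np.array([day_weight(d, wt) for d in offsets], dtype=float)
    averages[wt] = np.average(scores, weights=weights)
    print(wt, round(averages[wt], 2))
```

For a team that improves over the season, the linear average sits closest to the latest score, the unweighted average sits in the middle, and the log average falls in between.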

In [None]:
def extract_weighted_avg(season, weight_type='linear'):
    season_df = regular_df.loc[regular_df['Season'] == season]
    min_daynum = min(season_df['DayNum'])

    winner_cols = ['WScore', 'LScore', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR',
       'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3',
       'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
    loser_cols = ['LScore', 'WScore', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3',
       'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR',
       'WAst', 'WTO', 'WStl', 'WBlk', 'WPF']
    stats_dict = {}
    weight_dict = {}
    wl_dict = {}
    
    for i in season_df.index:
        game_ser = season_df.loc[i]
        
        if weight_type.lower() == 'linear':
            day_weight = game_ser['DayNum'] - min_daynum + 1
        elif weight_type.lower() == 'log':
            day_weight = np.log(game_ser['DayNum'] - min_daynum + 2)
        elif weight_type.lower() == 'normal':
            day_weight = 1
        else:
            raise ValueError("Please enter a correct weight_type (linear, log, normal).")
        
        # Compute Winner
        winner_arr = day_weight*game_ser[winner_cols].to_numpy()
        wteamid = game_ser['WTeamID']
        if wteamid in stats_dict:
            stats_dict[wteamid] += winner_arr
            weight_dict[wteamid] += day_weight
            wl_dict[wteamid][0] += 1
        else:
            stats_dict[wteamid] = winner_arr
            weight_dict[wteamid] = day_weight
            wl_dict[wteamid] = [1, 0]
            
        # Compute Loser
        loser_arr = day_weight*game_ser[loser_cols].to_numpy()
        lteamid = game_ser['LTeamID']
        if lteamid in stats_dict:
            stats_dict[lteamid] += loser_arr
            weight_dict[lteamid] += day_weight
            wl_dict[lteamid][1] += 1
        else:
            stats_dict[lteamid] = loser_arr
            weight_dict[lteamid] = day_weight
            wl_dict[lteamid] = [0, 1]
        
    for team in stats_dict:
        stats_dict[team] /= weight_dict[team]
        
    w_dict, l_dict = {}, {}
    for team in wl_dict:
        w_dict[team] = wl_dict[team][0]
        l_dict[team] = wl_dict[team][1]
        
    cols = ['Score', 'OppScore', 'FGM', 'FGA', 'FGM3', 'FGA3',
       'FTM', 'FTA', 'OR', 'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF', 'OppFGM', 'OppFGA', 'OppFGM3', 'OppFGA3',
       'OppFTM', 'OppFTA', 'OppOR', 'OppDR', 'OppAst', 'OppTO', 'OppStl', 'OppBlk', 'OppPF']
    
    weighted_df = pd.DataFrame.from_dict(stats_dict).T
    weighted_df.columns = cols
    weighted_df.sort_index(inplace=True)
    weighted_df['Diff'] = weighted_df['Score'] - weighted_df['OppScore']
    weighted_df['FG%'] = weighted_df['FGM']/weighted_df['FGA']
    weighted_df['FG3%'] = weighted_df['FGM3']/weighted_df['FGA3']
    weighted_df['FT%'] = weighted_df['FTM']/weighted_df['FTA']
    weighted_df['OppFG%'] = weighted_df['OppFGM']/weighted_df['OppFGA']
    weighted_df['OppFG3%'] = weighted_df['OppFGM3']/weighted_df['OppFGA3']
    weighted_df['OppFT%'] = weighted_df['OppFTM']/weighted_df['OppFTA']
    
    weighted_df['Wins'] = weighted_df.index.map(w_dict)
    weighted_df['Losses'] = weighted_df.index.map(l_dict)
    
    season_rankings_df = rankings_df.loc[rankings_df['Season'] == season]
    rankings_dict = season_rankings_df.loc[season_rankings_df['RankingDayNum'] == max(season_rankings_df['RankingDayNum'])].set_index("TeamID")['OrdinalRank'].to_dict()
    weighted_df['Ranking'] = weighted_df.index.map(rankings_dict)
    
    ortg_list = []
    drtg_list = []
    for i in weighted_df.index:
        game = weighted_df.loc[i]
        ortg_list.append(offense_calc(game))
        drtg_list.append(defense_calc(game))
    
    weighted_df['ORTG'] = ortg_list
    weighted_df['DRTG'] = drtg_list
    return weighted_df

In this snippet of code, I store the data for each season in a dictionary so it's easily accessible.

In [None]:
%%time

all_seasons_data = {}
for season in regular_df['Season'].unique():
    print(f"Extracting {season-1}-{season} Season...")
    all_seasons_data[season] = extract_weighted_avg(season, weight_type='log')

<h3> As a Demonstration... </h3>

In [None]:
test_df = extract_weighted_avg(2003)
test_log_df = extract_weighted_avg(2003, weight_type='log')
test_normal_df = extract_weighted_avg(2003, weight_type='normal')

test_comb_df = pd.concat((test_df['Score'], test_log_df['Score'], test_normal_df['Score']), axis=1)
test_comb_df.columns = ['Linear', 'Log', 'Normal']
test_comb_df

The differences are subtle but can be very relevant. For the rest of this notebook, I will use ```weight_type``` of log since it weighs recent games a little more than older games, but not by much.

# Simple Model

In this first model, I decided to try to predict the score of games based on the statistics of each game. In other words, given my knowledge of Team 1's ```Field Goals Made```, ```Field Goal Percentage```, ```Turnovers```, etc., can I predict the score of the game?

For my training data, I combine a team's offensive attributes (FGM, FGA, etc.) with the other team's defensive attributes (Steals, Blocks, etc.), which should be sufficient to predict the outcome of a game.

Note: I later opt to not use ```FGM```, ```FGM3```, and ```FTM``` since those attributes had a tendency to overfit the model (since it could just map out the function $2*\text{FGM} + \text{FGM3} + \text{FTM}$; ```FGM``` already counts made three-pointers, so each one only adds a single extra point).
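
The leakage is easy to demonstrate on synthetic data. The sketch below assumes that FGM counts every made field goal, three-pointers included (as in standard box scores); the numbers are randomly generated and not from the dataset. A plain least-squares fit then recovers the score exactly from the three "made" columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games = 200
fgm3 = rng.integers(2, 12, size=n_games)    # made three-pointers
fgm2 = rng.integers(10, 25, size=n_games)   # made two-pointers
ftm = rng.integers(5, 25, size=n_games)     # made free throws
fgm = fgm2 + fgm3                           # assumption: FGM includes threes
score = 2 * fgm2 + 3 * fgm3 + ftm

# Fit score ~ a*FGM + b*FGM3 + c*FTM with ordinary least squares
X = np.column_stack([fgm, fgm3, ftm]).astype(float)
coef, *_ = np.linalg.lstsq(X, score.astype(float), rcond=None)
max_error = np.max(np.abs(X @ coef - score))
print(np.round(coef, 6), max_error)
```

A network given these columns has nothing to learn about the matchup itself, which is why they are dropped from the inputs.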

<h3> Extra Model Preprocessing </h3>

The functions here should be pretty self-explanatory, so I won't provide any commentary for them.

In [None]:
def add_data(df, c='W'):
    df[f'{c}FG%'] = df[f'{c}FGM']/df[f'{c}FGA']
    df[f'{c}FG3%'] = df[f'{c}FGM3']/df[f'{c}FGA3']
    df[f'{c}FT%'] = df[f'{c}FTM']/df[f'{c}FTA']
    return df

extra_df = regular_df.copy()
extra_df = add_data(extra_df, c='W')
extra_df = add_data(extra_df, c='L')
extra_df

In [None]:
#offense_cols = ['FGM', 'FG%', 'FGM3', 'FG3%', 'FTM', 'FT%', 'OR', 'Ast', 'TO']
offense_cols = ['FG%', 'FG3%', 'FT%', 'OR', 'Ast', 'TO'] # For testing without FGM, FGM3, and FTM

defense_cols = ['DR', 'Stl', 'Blk', 'PF']
cols = offense_cols + defense_cols + ['Score']

w_cols = ['W' + t for t in offense_cols] + ['L' + t for t in defense_cols]
l_cols = ['L' + t for t in offense_cols] + ['W' + t for t in defense_cols]

w_tmp = extra_df[w_cols + ['WScore']]
w_tmp.columns = cols
l_tmp = extra_df[l_cols + ['LScore']]
l_tmp.columns = cols
all_data = pd.concat((w_tmp, l_tmp)).reset_index(drop=True)
all_data

We prepare the data by scaling the inputs with ```StandardScaler()``` but not scaling the targets (since it's unnecessary).

In [None]:
all_data = all_data.loc[np.count_nonzero(np.isnan(all_data), axis=1) == 0]

X_scaler_simple = StandardScaler()
X_simple = X_scaler_simple.fit_transform(all_data.loc[:, ~all_data.columns.str.contains('Score')])
y_simple = all_data['Score'].to_numpy().reshape(-1, 1)
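
For readers unfamiliar with standardization, here is a numpy-only sketch of the same computation: subtract the column mean and divide by the population standard deviation. The feature values are hypothetical. The key point is that new data is transformed with the statistics learned from the training data, which is what the later calls to ```X_scaler_simple.transform``` do.

```python
import numpy as np

X_train = np.array([[0.45, 10.0], [0.50, 14.0], [0.40, 12.0]])  # hypothetical FG%, OR
X_new = np.array([[0.48, 11.0]])

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)               # population std (ddof=0)
X_train_scaled = (X_train - mean) / std
X_new_scaled = (X_new - mean) / std     # reuse training statistics, never refit on new data
print(X_train_scaled.round(3))
print(X_new_scaled.round(3))
```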

<h3> Creating and Training a Simple MLP </h3>

This model is quite simple, with some batch normalization and dropout layers. I opt to use the <a href="https://medium.com/@neuralnets/swish-activation-function-by-google-53e1ea86f820"> swish activation function </a> and <a href="https://www.pyimagesearch.com/2019/10/07/is-rectified-adam-actually-better-than-adam/"> RectifiedAdam optimizer</a>.

In [None]:
def create_simple_mlp(num_columns, num_labels, hidden_layers, hidden_units, dropout_rates, learning_rate, regularizer_rate):
    inp = layers.Input(shape=(num_columns,))
    x = layers.BatchNormalization()(inp)
    for i in range(hidden_layers):
        x = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x)
        x = layers.Dropout(dropout_rates)(x)
        x = layers.BatchNormalization()(x)

    out = layers.Dense(num_labels)(x)

    model = tf.keras.models.Model(inputs=inp, outputs=out)
    model.compile(
        optimizer=tfa.optimizers.RectifiedAdam(learning_rate=learning_rate),
        loss='mse',
        metrics=['mae', tf.keras.metrics.RootMeanSquaredError()],
    )

    return model
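
For reference, swish is simply the input multiplied by its sigmoid. A minimal numpy sketch (independent of the Keras implementation used above):

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-x))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(swish(xs), 4))
```

Unlike ReLU, swish is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.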

You can take a look at the model parameters and architecture here.

In [None]:
hidden_layers_simple = 2
hidden_units_simple = 64
dropout_rates_simple = 0.25
learning_rate_simple = 2e-3
regularizer_rate_simple = 1e-4
model_simple = create_simple_mlp(X_simple.shape[1], y_simple.shape[1], hidden_layers_simple, 
                                 hidden_units_simple, dropout_rates_simple, learning_rate_simple, regularizer_rate_simple)

tf.keras.utils.plot_model(model_simple)

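The architecture is quite simple: two hidden blocks, each consisting of a Dense layer, a Dropout layer, and BatchNormalization. To train the model, I use KFold with 5 splits (20 epochs per fold) plus 10 epochs of finetuning on the full dataset at the end. Note that the same model instance is reused across folds, so from the second fold onward the validation samples have already been seen during training and the reported validation errors are optimistic.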

In [None]:
%%time

kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
all_history = {}

c = 1
for train_index, test_index in kf.split(X_simple, y_simple):
    print(f"Fold {c} initiating...")
    history_simple = model_simple.fit(X_simple[train_index], y_simple[train_index], epochs=20, 
                                      verbose=0, batch_size=256, validation_data=(X_simple[test_index], y_simple[test_index]))
    
    if c == 1:
        all_history = history_simple.history
    else:
        for error in history_simple.history:
            all_history[error] += history_simple.history[error]
    
    c += 1
    
history_simple = model_simple.fit(X_simple, y_simple, epochs=10, 
                    verbose=1, batch_size=256, validation_data=(X_simple[test_index], y_simple[test_index]))

model_simple.save_weights("MNCAAsimple.hdf5")

for error in history_simple.history:
    all_history[error] += history_simple.history[error]
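
The loop above keeps training one model across all folds. For an unbiased validation estimate, a more conventional setup builds a fresh model in every fold. Below is a minimal numpy-only sketch of that pattern, with a least-squares linear model standing in for the MLP; the data is synthetic and the helpers `fit_linear` and `predict_linear` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + 70 + rng.normal(scale=2.0, size=500)

def fit_linear(X_tr, y_tr):
    """Least-squares fit with an intercept; stands in for creating and training a new MLP."""
    A = np.column_stack([X_tr, np.ones(len(X_tr))])
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return w

def predict_linear(w, X_te):
    return np.column_stack([X_te, np.ones(len(X_te))]) @ w

# Shuffle once, then split the indices into 5 folds
folds = np.array_split(rng.permutation(len(X)), 5)
fold_mae = []
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    w = fit_linear(X[train_idx], y[train_idx])   # fresh model for this fold
    fold_mae.append(np.mean(np.abs(predict_linear(w, X[test_idx]) - y[test_idx])))

print(np.round(fold_mae, 3), round(float(np.mean(fold_mae)), 3))
```

In this notebook's setting, the equivalent change would be to call ```create_simple_mlp(...)``` inside the loop, so that each fold's validation data is genuinely unseen.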

<h3> Model Analysis </h3>

Error history (MAE, Validation MAE, Validation RMSE) is graphed here.

In [None]:
plt.figure(figsize=(10, 8))
plt.plot(all_history['mae'])
plt.plot(all_history['val_mae'])
plt.plot(all_history['val_root_mean_squared_error'])
plt.ylim(0, 2*np.mean(all_history['val_root_mean_squared_error']))
plt.legend(['mae', 'val_mae', 'val_rmse'])
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Error", fontsize=14)
plt.title("Model Error Over Epochs", fontsize=16)
plt.show()

The training error doesn't seem to decrease much past epoch 5, and the validation MAE and RMSE greatly fluctuate around 1 for every epoch. It's important to note, however, that an error of 1 is extremely good, since it means the network is predicting the actual score to a margin of 1 point.

Here we prepare the validation data, built from each team's season averages. **Note that for these predictions, we include both the MNCAA and NIT Tournaments, but NOT the conference tournaments.** Including the NIT gives us more games to validate against, so the results depend less on a small sample.

In [None]:
my_cols = offense_cols + ["Opp" + t for t in defense_cols]

def create_data_simple(season):
    year_df = all_seasons_data[season]
    tourney_year_df = tourney_df.loc[tourney_df['Season'] == season]
    nit_year_df = nit_df.loc[nit_df['Season'] == season]
    
    X_1 = []
    y_1 = []
    X_2 = []
    y_2 = []
    
    def helper(df):
        for i in df.index:
            game = df.loc[i]
            w_df = year_df.loc[game['WTeamID']]
            l_df = year_df.loc[game['LTeamID']]
            
            X_1.append(w_df[offense_cols].tolist() + l_df[defense_cols].tolist())
            X_2.append(l_df[offense_cols].tolist() + w_df[defense_cols].tolist())
                
            w_score = game['WScore']
            l_score = game['LScore']
            y_1.append(w_score)
            y_2.append(l_score)
                       
    helper(tourney_year_df)
    helper(nit_year_df)
    
    return X_1, y_1, X_2, y_2

In [None]:
%%time

X_val_1_simple, y_true_1_simple = [], []
X_val_2_simple, y_true_2_simple = [], []
for season in regular_df['Season'].unique():
    print(season)
    X_1, y_1, X_2, y_2 = create_data_simple(season)
    X_val_1_simple += X_1
    y_true_1_simple += y_1
    X_val_2_simple += X_2
    y_true_2_simple += y_2

And now we can perform some preliminary validation tests.

In [None]:
def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_pred, y_true):
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

X1_simple = X_scaler_simple.transform(np.array(X_val_1_simple, dtype='float64'))
X2_simple = X_scaler_simple.transform(np.array(X_val_2_simple, dtype='float64'))
y_true_1_simple = np.array(y_true_1_simple, dtype='float64')
y_true_2_simple = np.array(y_true_2_simple, dtype='float64')

X_pred_1_simple = model_simple.predict(X1_simple).reshape(-1)
X_pred_2_simple = model_simple.predict(X2_simple).reshape(-1)

print("Validation MAE 1: %.3f" % (mae(X_pred_1_simple, y_true_1_simple)))
print("Validation MAE 2: %.3f" % (mae(X_pred_2_simple, y_true_2_simple)))

pred_diff_simple = X_pred_1_simple - X_pred_2_simple
true_diff_simple = y_true_1_simple - y_true_2_simple

print("True Difference MAE: %.3f" % mae(pred_diff_simple, true_diff_simple))
print("True Difference RMSE: %.3f" % rmse(pred_diff_simple, true_diff_simple))

print("Correct Winner: %.3f%%" % (100*np.count_nonzero(pred_diff_simple*true_diff_simple > 0)/len(pred_diff_simple)))

The RMSE of the predicted score difference is ~13.7, which isn't great but is also not too bad. The model predicts the correct winner only 55.1% of the time though, which is somewhat disappointing.
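
To see how these three numbers relate, here is a toy version of the same calculation with four hypothetical games. Each row is ordered winner first, so the true margin is always positive and a prediction counts as correct whenever the predicted margin is positive too.

```python
import numpy as np

pred_winner = np.array([72.0, 65.0, 80.0, 70.0])   # predicted score of the actual winner
pred_loser = np.array([68.0, 66.0, 71.0, 70.5])    # predicted score of the actual loser
true_diff = np.array([5.0, 3.0, 12.0, 1.0])        # actual winning margins

pred_diff = pred_winner - pred_loser
mae_diff = np.mean(np.abs(pred_diff - true_diff))
rmse_diff = np.sqrt(np.mean((pred_diff - true_diff) ** 2))
winner_acc = np.mean(pred_diff * true_diff > 0)     # same sign means correct winner
print(mae_diff, round(rmse_diff, 3), winner_acc)
```

Note that a model can have a small margin error and still pick the wrong winner in close games, which is why accuracy and RMSE can tell different stories.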

<h3> Prediction File </h3>

In [None]:
%%time

def prediction_simple(season):
    year_df = seeding_df[seeding_df['Season'] == season]
    all_ids = list(set(year_df['TeamID']))
    all_ids.sort()
    
    season_df = all_seasons_data[season]
    
    id1_list = []
    id2_list = []
    index_list = []
    for id1 in all_ids:
        for id2 in all_ids:
            if id2 > id1:
                team1_df = season_df.loc[id1]
                team2_df = season_df.loc[id2]
                id1_list.append(team1_df[offense_cols].tolist() + team2_df[defense_cols].tolist())
                id2_list.append(team2_df[offense_cols].tolist() + team1_df[defense_cols].tolist())
                index_list.append(f"{season}_{id1}_{id2}")
                
    pred_1 = model_simple.predict(X_scaler_simple.transform(np.array(id1_list, dtype='float64')))
    pred_2 = model_simple.predict(X_scaler_simple.transform(np.array(id2_list, dtype='float64')))
    diff_pred = pred_1 - pred_2
    predictions_df = pd.DataFrame(data=diff_pred, index=index_list, columns=['Pred'])
    return predictions_df
    
predictions_list = []
seasons_pred_list = [2015, 2016, 2017, 2018, 2019] if STAGE == 1 else [2021]
for season in seasons_pred_list:
    print(f"Season {season-1}-{season} Processing...")
    predictions_list.append(prediction_simple(season))
    
final_pred_simple_df = pd.concat(predictions_list).reset_index()
final_pred_simple_df.columns = ['ID', 'Pred']
final_pred_simple_df.to_csv("submission_simple.csv", index=False)
final_pred_simple_df

<h3> Simulating an Actual Tournament </h3>

In the code below, I simulate a real MNCAA Tournament and use my ```model_simple``` to predict which teams would win.

**The function has two parameters:**
* ```season (int)``` - Simulation of the MNCAA Tournament for a certain season. Note, I've only tested this code on the 2018, 2019, and 2021 seasons, so there could be issues with prior seasons.
* ```PRINT (bool)``` - Determines whether the results should be printed to the terminal or not.

Note, this code is outdated and has been improved on in <a href="https://www.kaggle.com/ironicninja/generate-a-march-madness-bracket">this notebook</a>. I've kept it here since it was my first version of my improved code.

In [None]:
class predictSeasonSimple():
    def __init__(self, season, PRINT=True):
        self.year_df = seeding_df[seeding_df['Season'] == season]
        self.season_df = all_seasons_data[season]
        self.season = season
        self.PRINT = PRINT
        self.res_list = []
        
    def predict_id(self, id_1, id_2):
        pred_arr = np.array([self.season_df.loc[id_1][offense_cols].tolist() + self.season_df.loc[id_2][defense_cols].tolist()])
        return model_simple.predict_step(X_scaler_simple.transform(pred_arr)).numpy()[0][0]
    
    def team_id(self, id_test):
        return teams_df.loc[teams_df['TeamID'] == id_test]['TeamName'].iloc[0]
    
    def playin_round(self, id_tuple, div_df, playin_df, seed, div):
        id1, id2 = id_tuple

        res1 = self.predict_id(id1, id2)
        res2 = self.predict_id(id2, id1)

        team1 = f"{seed} {self.team_id(id1)}"
        team2 = f"{seed} {self.team_id(id2)}"
        
        final_res = res1-res2
        final_res_str = "won" if final_res >= 0 else "lost"
        self.res_list.append([team1, team2, team1 if final_res >= 0 else team2, div])
        print(f"Play-in round, seeds {team1} played {team2} and {final_res_str} by %.2f points." % (abs(final_res))) if self.PRINT else 0
        
        winner_row = playin_df.loc[playin_df['TeamID'] == (id1 if final_res >= 0 else id2)]
        return pd.concat([div_df, winner_row])  # DataFrame.append was removed in pandas 2.0
    
    def predict_div(self, div):
        div_df = self.year_df.loc[self.year_df['Seed'].str.contains(div)]
        div_df['Seed'] = div_df['Seed'].str.replace(div, '')
        div_df['Seed'] = div_df['Seed'].str.lstrip('0')

        # Check for play-in rounds
        playin_df = div_df.loc[div_df['Seed'].str.len() > 2]

        if len(playin_df):         
            div_df = div_df.loc[div_df['Seed'].str.len() <= 2]
            playin_df['Seed'] = playin_df['Seed'].str.slice(0, 2)
            seed = playin_df.iloc[0]['Seed']
            id_first = playin_df.iloc[:2]['TeamID'].tolist()
            div_df = self.playin_round(id_first, div_df, playin_df, seed, div)
            
            if len(playin_df) == 4:
                id_second = playin_df.iloc[-2:]['TeamID'].tolist()
                div_df = self.playin_round(id_second, div_df, playin_df, 16, div)

        matchup_df = div_df[['Seed', 'TeamID']].set_index('Seed')
        play_list = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

        while True:
            tmp_list = []
            for i in range(0, len(play_list), 2):
                seed1 = str(play_list[i])
                seed2 = str(play_list[i+1])

                id1 = matchup_df.loc[seed1]['TeamID']
                id2 = matchup_df.loc[seed2]['TeamID']

                res1 = self.predict_id(id1, id2)
                res2 = self.predict_id(id2, id1)

                team1 = f"{seed1} {self.team_id(id1)}"
                team2 = f"{seed2} {self.team_id(id2)}"

                final_res = res1-res2
                final_res_str = "won" if final_res >= 0 else "lost"
                self.res_list.append([team1, team2, team1 if final_res >= 0 else team2, div])
                print(f"Seed {team1} played seed {team2} and {final_res_str} by %.2f points." % (abs(final_res))) if self.PRINT else 0

                tmp_list.append(seed1 if final_res >= 0 else seed2)

            play_list = tmp_list.copy()
            if len(play_list) <= 1:
                break

        return matchup_df.loc[play_list[0]]['TeamID']
    
    def predict_tour(self):
        div_list = ['W', 'X', 'Y', 'Z']
        div_winners = {}
        for div in div_list:
            print(f"Division {div}") if self.PRINT else 0
            winning_seed = self.predict_div(div)
            div_winners[div] = winning_seed
            print("\n") if self.PRINT else 0
        
        # Under the assumption that Division W always plays Division X in the Final Four
        id1 = div_winners['W']
        id2 = div_winners['X']
        id3 = div_winners['Y']
        id4 = div_winners['Z']
        
        team1 = self.team_id(id1)
        team2 = self.team_id(id2)
        team3 = self.team_id(id3)
        team4 = self.team_id(id4)
        
        f4_res_1 = self.predict_id(id1, id2)
        f4_res_2 = self.predict_id(id2, id1)
        final_res_1 = f4_res_1-f4_res_2
        final_res_str_1 = "won" if final_res_1 >= 0 else "lost"
        f4_winner_1 = [id1, team1] if final_res_1 >= 0 else [id2, team2]
        self.res_list.append([team1, team2, team1 if final_res_1 >= 0 else team2, "F4"])
        print(f"{team1} played {team2} and {final_res_str_1} by %.2f points." % (abs(final_res_1))) if self.PRINT else 0
        
        f4_res_3 = self.predict_id(id3, id4)
        f4_res_4 = self.predict_id(id4, id3)
        final_res_2 = f4_res_3-f4_res_4
        final_res_str_2 = "won" if final_res_2 >= 0 else "lost"
        f4_winner_2 = [id3, team3] if final_res_2 >= 0 else [id4, team4]
        self.res_list.append([team3, team4, team3 if final_res_2 >= 0 else team4, "F4"])
        print(f"{team3} played {team4} and {final_res_str_2} by %.2f points." % (abs(final_res_2))) if self.PRINT else 0
        
        champ_res_1 = self.predict_id(f4_winner_1[0], f4_winner_2[0])
        champ_res_2 = self.predict_id(f4_winner_2[0], f4_winner_1[0])
        champ_res_final = champ_res_1 - champ_res_2
        champ_res_str = "won" if champ_res_final >= 0 else "lost"
        champion = f4_winner_1[1] if champ_res_final >= 0 else f4_winner_2[1]
        self.res_list.append([f4_winner_1[1], f4_winner_2[1], champion, "Finals"])
        print(f"{f4_winner_1[1]} played {f4_winner_2[1]} and {champ_res_str} by %.2f points." % (abs(champ_res_final))) if self.PRINT else 0
        print(f"THE CHAMPION FOR THE {self.season} SEASON IS {champion}!!! Final score is {int(champ_res_1)} to {int(champ_res_2)}.")
        
        res_df = pd.DataFrame(data=self.res_list, columns=["Team 1", "Team 2", "Winner", "Division"])
        return res_df
    
my_season = predictSeasonSimple(2021, False)
res_df = my_season.predict_tour()
pd.set_option('display.max_rows', 68)
res_df

In [None]:
pd.set_option('display.max_rows', 10)
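
The core of ```predict_div``` is a single-elimination loop over a fixed seed order. The stripped-down sketch below shows just that loop; the function `simulate_region` and the `strength` lookup are hypothetical stand-ins, with the strength difference replacing the margin that ```model_simple``` would predict.

```python
def simulate_region(seed_strength,
                    bracket_order=(1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15)):
    """Single elimination within one region; returns the winning seed and all games played."""
    alive = list(bracket_order)
    games = []
    while len(alive) > 1:
        next_round = []
        # Adjacent entries in the list play each other
        for i in range(0, len(alive), 2):
            a, b = alive[i], alive[i + 1]
            margin = seed_strength[a] - seed_strength[b]   # stand-in for the predicted margin
            winner = a if margin >= 0 else b
            games.append((a, b, winner))
            next_round.append(winner)
        alive = next_round
    return alive[0], games

# Hypothetical strengths: a lower seed number means a stronger team
strength = {seed: 17 - seed for seed in range(1, 17)}
champ, games = simulate_region(strength)
print(champ, len(games))
```

Keeping winners in bracket order is what makes the pairing work: after each round, neighbours in the shortened list are exactly the teams that meet next.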

<h3> Simple Model Strengths & Weaknesses </h3>

**Strengths:**
* The biggest strength of this model is the sheer amount of data provided as both training and validation data. There are over 180,000 different scores (excluding the tournament scores), which means a neural network can better learn patterns within the data.
* The model is also quite simple which makes it easier to replicate and less likely to overfit when training.

**Weaknesses:**
* The biggest weakness of this model is that predicting the outcome of a game when you already have the stats for that game is much different than predicting how many points a team will score based on their season statistics. 
* Another weakness is that this score-predicting model may be significantly overfit; in theory, the neural network would just have to learn the function $2*\text{FGM} + \text{FGM3} + \text{FTM}$ to accurately predict the score (```FGM``` already counts made three-pointers). 
    * Indeed, after conducting experiments testing this hypothesis (by removing FGM, FGM3, and FTM from the training data), the neural network can only predict the score to within $\pm$ 3 to 4 points; however, the relative validation error decreases after removing these attributes, which have a tendency to overfit.

# Advanced Model 1 ("Sparse")

The advanced model is well... more advanced. In essence, after aggregating the stats for each team in a given season, each MNCAA/NIT Tournament matchup is split into a team's stats (e.g. PTS, FGM, Ast), the average opponent team's stats (e.g. Points allowed, Field Goals allowed), and then that same process for the other team. To put the methodology a bit more concretely, if Team A plays Team B, then the training data would be Team A's average stats, Team A's average opponent's stats, Team B's average stats, and Team B's average opponent's stats. The target data would be the difference in score between the two teams. I implement this methodology in the next two blocks of code.

Note: I call this model "sparse" since it is smaller than Advanced Model 2.
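
As an illustration of this layout, the sketch below builds one training row for a single matchup. The team averages, the three-stat feature set, and the helper `matchup_features` are all hypothetical and much smaller than the real column lists.

```python
import numpy as np

# Hypothetical season averages (three stats each, for brevity):
# "own" = the team's stats, "opp" = what its opponents averaged against it
team_stats = {
    "A": {"own": [75.0, 0.46, 13.0], "opp": [68.0, 0.42, 11.0]},
    "B": {"own": [70.0, 0.44, 12.0], "opp": [71.0, 0.45, 12.5]},
}

def matchup_features(t1, t2):
    """Concatenate t1's own stats, t1's opponents' stats, then the same for t2."""
    return np.array(team_stats[t1]["own"] + team_stats[t1]["opp"]
                    + team_stats[t2]["own"] + team_stats[t2]["opp"])

x = matchup_features("A", "B")
y = 78 - 71   # target: score difference (Team A minus Team B) in the actual game
print(x.shape, y)
```

Swapping the two teams yields a second, mirrored row whose target is the negated score difference.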

In [None]:
my_team_cols = ['Score', 'FGM', 'FG%', 'FGM3', 'FG3%', 'FTM', 'FT%', 'OR', 'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF']
#my_team_cols = ['FG%', 'FG3%', 'FT%', 'OR', 'DR', 'Ast', 'TO', 'Stl', 'Blk', 'PF']

opp_team_cols = ['Opp' + t for t in my_team_cols]

other_cols = ['Wins', 'Losses', 'Diff', 'Ranking', 'ORTG', 'DRTG'] # Used for Advanced Model 2

def create_data_season(season):
    year_df = all_seasons_data[season]
    tourney_year_df = tourney_df.loc[tourney_df['Season'] == season]
    nit_year_df = nit_df.loc[nit_df['Season'] == season]
    
    w_my_team_train, w_opp_team_train = [], []
    l_my_team_train, l_opp_team_train = [], []
    w_other_train, l_other_train = [], [] # Used for Advanced Model 2
    diff_train = []
    
    def helper(df):
        for i in df.index:
            game = df.loc[i]
            w_game = year_df.loc[game['WTeamID']]
            l_game = year_df.loc[game['LTeamID']]
            
            w_my_team_train.append(w_game[my_team_cols].tolist())
            w_opp_team_train.append(w_game[opp_team_cols].tolist())
            l_my_team_train.append(l_game[my_team_cols].tolist())
            l_opp_team_train.append(l_game[opp_team_cols].tolist())
            
            w_other_train.append(w_game[other_cols].tolist()) # Used for Advanced Model 2
            l_other_train.append(l_game[other_cols].tolist()) # Used for Advanced Model 2
            
            diff_train.append(game['WScore'] - game['LScore'])       
                       
    helper(tourney_year_df)
    helper(nit_year_df)
    
    return w_my_team_train, w_opp_team_train, l_my_team_train, l_opp_team_train, w_other_train, l_other_train, diff_train

In [None]:
%%time

my_team_train1, opp_team_train1 = [], []
my_team_train2, opp_team_train2 = [], []
my_team_test1, opp_team_test1 = [], []
my_team_test2, opp_team_test2 = [], []
other_train1, other_train2 = [], []
other_test1, other_test2 = [], []
y_train, y_test = [], []

for season in regular_df['Season'].unique():
    print(season)
    w_my_team_train, w_opp_team_train, l_my_team_train, l_opp_team_train, w_other_train, l_other_train, diff_train = create_data_season(season)
    if season < 2015:
        my_team_train1 += w_my_team_train
        opp_team_train1 += w_opp_team_train
        my_team_train2 += l_my_team_train
        opp_team_train2 += l_opp_team_train
        other_train1 += w_other_train
        other_train2 += l_other_train
        y_train += diff_train
    else:
        my_team_test1 += w_my_team_train
        opp_team_test1 += w_opp_team_train
        my_team_test2 += l_my_team_train
        opp_team_test2 += l_opp_team_train
        other_test1 += w_other_train
        other_test2 += l_other_train
        y_test += diff_train

<h3> Processing the Data </h3>

Two things are important to notice here. First, once again, the inputs are scaled using the ```StandardScaler()``` but the targets are not scaled. Second, the training data is concatenated with a mirrored copy of itself to compensate for the asymmetry of the data: every game is stored winner-first, so without the mirrored copy every target would be positive. In other words, there should be no difference between putting Team A then Team B versus Team B then Team A. **While certainly not a perfect solution, it works decently in implementation.**
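
The concatenation step can be sketched on toy arrays as follows. The feature values and margins are hypothetical; the point is only the pattern of appending the swapped order with a negated target.

```python
import numpy as np

# Hypothetical scaled features for 3 games: winner-side and loser-side inputs
winner_x = np.array([[0.5, 1.0], [0.2, -0.3], [1.1, 0.4]])
loser_x = np.array([[-0.4, 0.1], [0.0, 0.6], [-1.0, -0.2]])
margin = np.array([7.0, 2.0, 15.0])  # WScore - LScore, always positive

# Original order (winner first) followed by the mirrored order (loser first)
X_first = np.concatenate((winner_x, loser_x))
X_second = np.concatenate((loser_x, winner_x))
y = np.concatenate((margin, -margin))

print(X_first.shape, X_second.shape, y)
```

After mirroring, the targets are balanced around zero, so the network cannot lower its loss by simply predicting that the first-listed team always wins.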

In [None]:
# Convert to np.array

my_team_train1, opp_team_train1 = np.array(my_team_train1, dtype='float64'), np.array(opp_team_train1, dtype='float64')
my_team_train2, opp_team_train2 = np.array(my_team_train2, dtype='float64'), np.array(opp_team_train2, dtype='float64')
my_team_test1, opp_team_test1 = np.array(my_team_test1, dtype='float64'), np.array(opp_team_test1, dtype='float64')
my_team_test2, opp_team_test2 = np.array(my_team_test2, dtype='float64'), np.array(opp_team_test2, dtype='float64')
other_train1, other_train2 = np.array(other_train1, dtype='float64'), np.array(other_train2, dtype='float64')
other_test1, other_test2 = np.array(other_test1, dtype='float64'), np.array(other_test2, dtype='float64')
y_train, y_test = np.array(y_train, dtype='float64'), np.array(y_test, dtype='float64')

# Fit on data

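# Each call to fit() discards the previous statistics, so fit once on the stacked
# arrays (team and opponent columns share this single scaler).
X_scaler = StandardScaler()
X_scaler.fit(np.concatenate((my_team_train1, opp_team_train1, my_team_train2, opp_team_train2)))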

# Scale all the data based on fit

my_team_train1_t = X_scaler.transform(my_team_train1)
opp_team_train1_t = X_scaler.transform(opp_team_train1)
my_team_train2_t = X_scaler.transform(my_team_train2)
opp_team_train2_t = X_scaler.transform(opp_team_train2)

my_team_test1_t = X_scaler.transform(my_team_test1)
opp_team_test1_t = X_scaler.transform(opp_team_test1)
my_team_test2_t = X_scaler.transform(my_team_test2)
opp_team_test2_t = X_scaler.transform(opp_team_test2)

# Scale other columns; only relevant for Advanced Model 2

other_scaler = StandardScaler()
other_scaler.fit(np.concatenate((other_train1, other_train2)))  # fit() overwrites, so fit once on both arrays

other_train1_t = other_scaler.transform(other_train1)
other_train2_t = other_scaler.transform(other_train2)
other_test1_t = other_scaler.transform(other_test1)
other_test2_t = other_scaler.transform(other_test2)

# Concatenate the data

y_all = np.concatenate((y_train, -1*y_train))
X_team_1 = np.concatenate((my_team_train1_t, my_team_train2_t))
X_team_2 = np.concatenate((my_team_train2_t, my_team_train1_t))
X_opp_1 = np.concatenate((opp_team_train1_t, opp_team_train2_t))
X_opp_2 = np.concatenate((opp_team_train2_t, opp_team_train1_t))
X_other_1 = np.concatenate((other_train1_t, other_train2_t))
X_other_2 = np.concatenate((other_train2_t, other_train1_t))
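
One detail worth keeping in mind when a single scaler is meant to cover several arrays: in scikit-learn, `StandardScaler.fit` resets the stored statistics on every call, while `partial_fit` accumulates them. The sketch below uses two tiny hypothetical arrays to show the difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

a = np.array([[0.0], [2.0]])    # mean 1
b = np.array([[10.0], [14.0]])  # mean 12

# Calling fit twice: only the second call's statistics remain
overwritten = StandardScaler()
overwritten.fit(a)
overwritten.fit(b)

# Fitting once on the stacked data uses all four rows
stacked = StandardScaler().fit(np.concatenate((a, b)))

# partial_fit accumulates statistics across calls
incremental = StandardScaler()
incremental.partial_fit(a)
incremental.partial_fit(b)

print(overwritten.mean_, stacked.mean_, incremental.mean_)
```

Stacking the arrays is the simplest option when everything fits in memory; `partial_fit` gives the same mean and variance when the data arrives in chunks.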

In [None]:
def create_mlp_sparse(num_columns, num_labels, num_layers, hidden_units, dropout_rates, learning_rate, regularizer_rate):
    inp1 = layers.Input(shape=(num_columns,))
    x1 = layers.BatchNormalization()(inp1)
    for i in range(num_layers):
        x1 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x1)
        x1 = layers.Dropout(dropout_rates)(x1)
        x1 = layers.BatchNormalization()(x1)
    
    inp2 = layers.Input(shape=(num_columns,))
    x2 = layers.BatchNormalization()(inp2)
    for i in range(num_layers):
        x2 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x2)
        x2 = layers.Dropout(dropout_rates)(x2)
        x2 = layers.BatchNormalization()(x2)
        
    inp3 = layers.Input(shape=(num_columns,))
    x3 = layers.BatchNormalization()(inp3)
    for i in range(num_layers):
        x3 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x3)
        x3 = layers.Dropout(dropout_rates)(x3)
        x3 = layers.BatchNormalization()(x3)
        
    inp4 = layers.Input(shape=(num_columns,))
    x4 = layers.BatchNormalization()(inp4)
    for i in range(num_layers):
        x4 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x4)
        x4 = layers.Dropout(dropout_rates)(x4)
        x4 = layers.BatchNormalization()(x4)
        
    merged1 = layers.Concatenate(axis=1)([x1, x2])
    merged1 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(merged1)
    merged1 = layers.Dropout(dropout_rates)(merged1)
    merged1 = layers.BatchNormalization()(merged1)
    
    merged2 = layers.Concatenate(axis=1)([x3, x4])
    merged2 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(merged2)
    merged2 = layers.Dropout(dropout_rates)(merged2)
    merged2 = layers.BatchNormalization()(merged2)
    
    final = layers.Concatenate(axis=1)([merged1, merged2])
    out = layers.Dense(num_labels)(final)

    model = tf.keras.models.Model(inputs=[inp1, inp2, inp3, inp4], outputs=out)
    model.compile(
        optimizer=tfa.optimizers.RectifiedAdam(learning_rate=learning_rate),
        loss='mse',
        metrics=['mae', tf.keras.metrics.RootMeanSquaredError()],
    )

    return model

I initialize the MLP and visualize its architecture here.

In [None]:
num_layers_sparse = 2
hidden_units_sparse = 64
dropout_rates_sparse = 0.25
learning_rate_sparse = 2e-3
regularizer_rate_sparse = 1e-4
model_sparse = create_mlp_sparse(my_team_train1_t.shape[1], 1, num_layers_sparse, hidden_units_sparse, 
                                 dropout_rates_sparse, learning_rate_sparse, regularizer_rate_sparse)

tf.keras.utils.plot_model(model_sparse)

Notice that this network consists of a dense top (2 hidden layers for each of the 4 inputs) and bottlenecks into a sparser bottom (1 hidden layer for the concatenated layer). This is a key difference between this model and Advanced Model 2.
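
For a sense of scale, here is a back-of-the-envelope count of the Dense-layer parameters in this architecture. It is only a sketch: it assumes 14 input columns (the length of `my_team_cols`), uses the hyperparameters from the cell above, and ignores the BatchNormalization parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def sparse_dense_param_count(num_columns, hidden_units, num_layers):
    """Count Dense-layer parameters only (BatchNormalization parameters are ignored)."""
    # One input branch: num_layers Dense layers, the first fed by the raw columns
    branch = dense_params(num_columns, hidden_units)
    branch += (num_layers - 1) * dense_params(hidden_units, hidden_units)
    # Each merge layer sees two concatenated branches
    merge = dense_params(2 * hidden_units, hidden_units)
    # The output layer sees the two concatenated merge layers
    out = dense_params(2 * hidden_units, 1)
    return 4 * branch + 2 * merge + out

total = sparse_dense_param_count(num_columns=14, hidden_units=64, num_layers=2)
print(total)  # 37121
```

Roughly 37,000 weights is a lot relative to a training set of a few thousand rows, which is worth remembering when reading the error curves.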

In [None]:
history_sparse = model_sparse.fit([X_team_1, X_opp_1, X_team_2, X_opp_2], y_all, epochs=200, verbose=0, 
                    batch_size=32, validation_data=([my_team_test1_t, opp_team_test1_t, my_team_test2_t, opp_team_test2_t], y_test))

model_sparse.save_weights("MCNAAsparse.hdf5")

<h3> Model Analysis </h3>

In [None]:
plt.figure(figsize=(10, 8))
plt.plot(history_sparse.history['mae'])
plt.plot(history_sparse.history['val_mae'])
plt.plot(history_sparse.history['val_root_mean_squared_error'])
plt.legend(['mae', 'val_mae', 'val_rmse'])
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Error", fontsize=14)
plt.title("Model Error Over Epochs", fontsize=16)
plt.show()

Unfortunately, it seems like this model overfits very quickly: although the training error continuously decreases, the validation error gradually increases. This is definitely not what I would like to see, which is why I opt not to use this model for my actual predictions.
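
One simple way to quantify this kind of overfitting is to find the epoch where the validation error bottoms out. The sketch below uses a hypothetical list shaped like `history_sparse.history['val_mae']`; the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical validation MAE per epoch, shaped like history.history['val_mae']
val_mae = [11.2, 10.4, 9.9, 9.7, 9.8, 10.1, 10.5, 10.9]

best_epoch = int(np.argmin(val_mae))  # 0-based index of the lowest validation error
epochs_past_best = len(val_mae) - 1 - best_epoch

print(f"Best epoch: {best_epoch + 1}, val_mae={val_mae[best_epoch]:.2f}")
print(f"Epochs trained past the best point: {epochs_past_best}")
```

In Keras, the `tf.keras.callbacks.EarlyStopping` callback (for example with `monitor='val_mae'` and `restore_best_weights=True`) can automate the same idea during training, though I have not tuned it for this notebook.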

In [None]:
# Flatten the (n, 1) predictions to (n,) so they align element-wise with y_test
y_pred_norm = model_sparse.predict([my_team_test1_t, opp_team_test1_t, my_team_test2_t, opp_team_test2_t]).reshape(-1)
y_pred_alt = model_sparse.predict([my_team_test2_t, opp_team_test2_t, my_team_test1_t, opp_team_test1_t]).reshape(-1)
y_comb = ((y_pred_norm-y_pred_alt)/2).reshape(-1)
print("Normal MAE: %.3f" % (mae(y_pred_norm, y_test)))
print("Alternate MAE: %.3f" % (mae(y_pred_alt, -1*y_test)))
print("Combined MAE: %.3f" % (mae(y_comb, y_test)))
print("Correct Winner: %.3f%%" % (100*np.count_nonzero(y_comb*y_test > 0)/len(y_test)))
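
The "combined" prediction in the cell above averages the normal-order prediction with the negated swapped-order prediction. A small sketch with hypothetical model outputs shows why this is useful: the combined value flips sign exactly when the two teams are swapped, and the winner accuracy is just the share of games where the predicted and actual margins have the same sign.

```python
import numpy as np

def combine(pred_ab, pred_ba):
    """Average the A-vs-B prediction with the negated B-vs-A prediction."""
    return (np.asarray(pred_ab) - np.asarray(pred_ba)) / 2

# Hypothetical model outputs for 4 games, in both input orders
pred_ab = np.array([6.0, -1.5, 3.0, 0.5])   # predicted margin, Team A listed first
pred_ba = np.array([-4.0, 2.5, 1.0, -1.5])  # predicted margin, Team B listed first
y_true = np.array([8.0, 3.0, 5.0, 2.0])     # actual margin from Team A's side

combined = combine(pred_ab, pred_ba)
swapped = combine(pred_ba, pred_ab)

winner_acc = 100 * np.count_nonzero(combined * y_true > 0) / len(y_true)
print(combined, swapped, winner_acc)
```

Here `swapped` is exactly `-combined`, so the final prediction no longer depends on which team happens to be listed first.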

<h3> Sparse Model Strengths & Weaknesses </h3>

**Strengths**:
* Trains directly on tournament matchups (MNCAA and NIT), which is the same kind of game the model is asked to predict and should make its predictions more accurate.

**Weaknesses**:
* Low amount of training data (2898 when including both tournaments and concatenation trick).
* No feature engineering in terms of which features to include, what features not to include, and how to reduce the dimensionality of the data, which may affect accuracy and the model's tendency to overfit.

# Advanced Model 2 ("Dense")

This "dense" model is very similar to Advanced Model 1, except it includes an extra set of inputs for each team (the `other_cols` features defined earlier) and has a slightly different NN architecture.

In [None]:
def create_mlp_dense(shape1, shape2, shape3, num_labels, num_layers, hidden_units, dropout_rates, learning_rate, regularizer_rate):
    
    # Left Side
    
    inp11 = layers.Input(shape=(shape1,))
    x1 = layers.BatchNormalization()(inp11)
    x1 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x1)
    x1 = layers.Dropout(dropout_rates)(x1)
    x1 = layers.BatchNormalization()(x1)
    
    inp12 = layers.Input(shape=(shape2,))
    x2 = layers.BatchNormalization()(inp12)
    x2 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x2)
    x2 = layers.Dropout(dropout_rates)(x2)
    x2 = layers.BatchNormalization()(x2)
        
    inp13 = layers.Input(shape=(shape3,))
    x3 = layers.BatchNormalization()(inp13)
    x3 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x3)
    x3 = layers.Dropout(dropout_rates)(x3)
    x3 = layers.BatchNormalization()(x3)
    
    merged11 = layers.Concatenate(axis=1)([x1, x2])
    for i in range(num_layers):
        merged11 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(merged11)
        merged11 = layers.Dropout(dropout_rates)(merged11)
        merged11 = layers.BatchNormalization()(merged11)
    
    merged12 = layers.Concatenate(axis=1)([merged11, x3])
    merged12 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(merged12)
    merged12 = layers.Dropout(dropout_rates)(merged12)
    merged12 = layers.BatchNormalization()(merged12)
        
    # Right Side
    
    inp21 = layers.Input(shape=(shape1,))
    x4 = layers.BatchNormalization()(inp21)
    x4 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x4)
    x4 = layers.Dropout(dropout_rates)(x4)
    x4 = layers.BatchNormalization()(x4)
    
    inp22 = layers.Input(shape=(shape2,))
    x5 = layers.BatchNormalization()(inp22)
    x5 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x5)
    x5 = layers.Dropout(dropout_rates)(x5)
    x5 = layers.BatchNormalization()(x5)
    
    inp23 = layers.Input(shape=(shape3,))
    x6 = layers.BatchNormalization()(inp23)
    x6 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(x6)
    x6 = layers.Dropout(dropout_rates)(x6)
    x6 = layers.BatchNormalization()(x6)
        
    merged21 = layers.Concatenate(axis=1)([x4, x5])
    for i in range(num_layers):
        merged21 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(merged21)
        merged21 = layers.Dropout(dropout_rates)(merged21)
        merged21 = layers.BatchNormalization()(merged21)
    
    merged22 = layers.Concatenate(axis=1)([merged21, x6])
    merged22 = layers.Dense(hidden_units, activation=tf.keras.activations.swish, kernel_regularizer=tf.keras.regularizers.l2(regularizer_rate))(merged22)
    merged22 = layers.Dropout(dropout_rates)(merged22)
    merged22 = layers.BatchNormalization()(merged22)
    
    final = layers.Concatenate(axis=1)([merged12, merged22])
    out = layers.Dense(num_labels)(final)

    model = tf.keras.models.Model(inputs=[inp11, inp12, inp13, inp21, inp22, inp23], outputs=out)
    model.compile(
        optimizer=tfa.optimizers.RectifiedAdam(learning_rate=learning_rate),
        loss='mse',
        metrics=['mae', tf.keras.metrics.RootMeanSquaredError()],
    )

    return model

In [None]:
num_layers = 2
hidden_units = 16
dropout_rates = 0.25
learning_rate = 2e-3
regularizer_rate = 1e-4
model_dense = create_mlp_dense(X_team_1.shape[1], X_opp_1.shape[1], X_other_1.shape[1], 1, num_layers, hidden_units, dropout_rates, learning_rate, regularizer_rate)

tf.keras.utils.plot_model(model_dense)

Notice the distinct differences in architecture between this "Dense" model and the above "Sparse" model. This model has a sparse top but a dense center, with two extra inputs (one `other_cols` input per team).

In [None]:
history_dense = model_dense.fit([X_team_1, X_opp_1, X_other_1, X_team_2, X_opp_2, X_other_2], y_all, epochs=400, verbose=0, 
    batch_size=32, validation_data=([my_team_test1_t, opp_team_test1_t, other_test1_t, my_team_test2_t, opp_team_test2_t, other_test2_t], y_test))

model_dense.save_weights("MCNAAdense.hdf5")

<h3> Model Analysis </h3>

In [None]:
plt.figure(figsize=(10, 8))
plt.plot(history_dense.history['mae'])
plt.plot(history_dense.history['val_mae'])
plt.plot(history_dense.history['val_root_mean_squared_error'])
plt.legend(['mae', 'val_mae', 'val_rmse'])
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Error", fontsize=14)
plt.title("Model Error Over Epochs", fontsize=16)
plt.show()

This graph looks somewhat similar to the error for the sparse model, except the validation error here is not as high nor does it increase as much as the validation error for the sparse model.

In [None]:
# Flatten the (n, 1) predictions to (n,) so they align element-wise with y_test
y_pred_norm2 = model_dense.predict([my_team_test1_t, opp_team_test1_t, other_test1_t, my_team_test2_t, opp_team_test2_t, other_test2_t]).reshape(-1)
y_pred_alt2 = model_dense.predict([my_team_test2_t, opp_team_test2_t, other_test2_t, my_team_test1_t, opp_team_test1_t, other_test1_t]).reshape(-1)
y_comb2 = ((y_pred_norm2-y_pred_alt2)/2).reshape(-1)
print("Normal MAE: %.3f" % (mae(y_pred_norm2, y_test)))
print("Alternate MAE: %.3f" % (mae(y_pred_alt2, -1*y_test)))
print("Combined MAE: %.3f" % (mae(y_comb2, y_test)))
print("Normal RMSE: %.3f" % (rmse(y_pred_norm2, y_test)))
print("Alternate RMSE: %.3f" % (rmse(y_pred_alt2, -1*y_test)))
print("Combined RMSE: %.3f" % (rmse(y_comb2, y_test)))
print("Correct Winner: %.3f%%" % (100*np.count_nonzero(y_comb2*y_test > 0)/len(y_test)))

This model predicts the correct winner an astonishing 65% of the time! Compare that to our sparse model, which predicted the correct winner 56% of the time, and our simple model, which predicted the correct winner 55% of the time, and it becomes strikingly clear that this model is much more robust than the others. Its overall error is also lower than that of the other models, which is why I opt to use it as my primary predictor going into the 2021 March Madness Tournament.

<h3> Prediction File </h3>

In [None]:
%%time

def prediction_advanced(season, dense=True):
    year_df = seeding_df[seeding_df['Season'] == season]
    all_ids = list(set(year_df['TeamID']))
    all_ids.sort()
    
    season_df = all_seasons_data[season]
    
    my_team_1, opp_team_1 = [], []
    my_team_2, opp_team_2 = [], []
    other1, other2 = [], []
    index_list = []
    
    for id1 in all_ids:
        for id2 in all_ids:
            if id2 > id1:
                id1_df = season_df.loc[id1]
                id2_df = season_df.loc[id2]
                my_team_1.append(id1_df[my_team_cols].tolist())
                opp_team_1.append(id1_df[opp_team_cols].tolist())
                my_team_2.append(id2_df[my_team_cols].tolist())
                opp_team_2.append(id2_df[opp_team_cols].tolist())
                other1.append(id1_df[other_cols].tolist())
                other2.append(id2_df[other_cols].tolist())
                index_list.append(f"{season}_{id1}_{id2}")
                
    my_team_1, opp_team_1 = X_scaler.transform(np.array(my_team_1, dtype='float64')), X_scaler.transform(np.array(opp_team_1, dtype='float64'))
    my_team_2, opp_team_2 = X_scaler.transform(np.array(my_team_2, dtype='float64')), X_scaler.transform(np.array(opp_team_2, dtype='float64'))
    other1, other2 = other_scaler.transform(np.array(other1, dtype='float64')), other_scaler.transform(np.array(other2, dtype='float64'))

    if dense:
        pred_1 = model_dense.predict([my_team_1, opp_team_1, other1, my_team_2, opp_team_2, other2])
        pred_2 = model_dense.predict([my_team_2, opp_team_2, other2, my_team_1, opp_team_1, other1])
    else:
        pred_1 = model_sparse.predict([my_team_1, opp_team_1, my_team_2, opp_team_2])
        pred_2 = model_sparse.predict([my_team_2, opp_team_2, my_team_1, opp_team_1])
        
    true_pred = (pred_1-pred_2)/2
    predictions_df = pd.DataFrame(data=true_pred, index=index_list, columns=['Pred'])
    return predictions_df
    
predictions_adv_list = []
for season in seasons_pred_list:
    print(f"Season {season-1}-{season} Processing...")
    predictions_adv_list.append(prediction_advanced(season))
    
final_pred_adv_df = pd.concat(predictions_adv_list).reset_index()
final_pred_adv_df.columns = ['ID', 'Pred']
final_pred_adv_df.to_csv("submission.csv", index=False)
final_pred_adv_df
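
The ID convention used in the prediction file (`season_lowerID_higherID`, with the prediction taken from the lower ID's point of view) can be sketched on its own. The TeamIDs and margins below are hypothetical, and `predicted_margin` is an illustrative helper rather than a function from the notebook.

```python
from itertools import combinations

import pandas as pd

season = 2021
team_ids = [1104, 1101, 1116]  # hypothetical TeamIDs

# Each unordered pair appears once, with the lower ID first
ids = [f"{season}_{a}_{b}" for a, b in combinations(sorted(team_ids), 2)]
pred_df = pd.DataFrame({"ID": ids, "Pred": [3.5, -2.0, 6.0]})  # made-up margins

def predicted_margin(df, season, id1, id2):
    """Margin from id1's point of view, flipping the sign if id1 is the higher ID."""
    key = f"{season}_{min(id1, id2)}_{max(id1, id2)}"
    pred = df.loc[df["ID"] == key, "Pred"].iloc[0]
    return pred if id1 < id2 else -pred

print(ids)
print(predicted_margin(pred_df, season, 1116, 1101))  # 2.0
```

`itertools.combinations` on the sorted IDs produces the same pairs as the nested `id2 > id1` loop, just without visiting the pairs that get skipped.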

<h3> Simulate MNCAA Tournament with Advanced Model </h3>

Similar to my simulation with the simple model, there is better code provided <a href="https://www.kaggle.com/ironicninja/generate-a-march-madness-bracket"> here</a>.

In [None]:
class predictSeasonAdvanced():
    def __init__(self, season, dense=True, PRINT=True):
        self.year_df = seeding_df[seeding_df['Season'] == season]
        self.season_df = all_seasons_data[season]
        self.season = season
        self.PRINT = PRINT
        self.res_list = []
        self.dense = dense
        
    def predict_id(self, id_1, id_2):
        id1_df = self.season_df.loc[id_1]
        id2_df = self.season_df.loc[id_2]
        mt1 = X_scaler.transform(np.array([id1_df[my_team_cols].tolist()]))
        ot1 = X_scaler.transform(np.array([id1_df[opp_team_cols].tolist()]))
        o1 = other_scaler.transform(np.array([id1_df[other_cols].tolist()]))
        mt2 = X_scaler.transform(np.array([id2_df[my_team_cols].tolist()]))
        ot2 = X_scaler.transform(np.array([id2_df[opp_team_cols].tolist()]))
        o2 = other_scaler.transform(np.array([id2_df[other_cols].tolist()]))
        
        if self.dense:
            pred_1 = model_dense.predict_step([mt1, ot1, o1, mt2, ot2, o2]).numpy()[0][0]
            pred_2 = model_dense.predict_step([mt2, ot2, o2, mt1, ot1, o1]).numpy()[0][0]
        else:
            pred_1 = model_sparse.predict_step([mt1, ot1, mt2, ot2]).numpy()[0][0]
            pred_2 = model_sparse.predict_step([mt2, ot2, mt1, ot1]).numpy()[0][0]
        
        pred_final = (pred_1-pred_2)/2
        return pred_final
    
    def team_id(self, id_test):
        return teams_df.loc[teams_df['TeamID'] == id_test]['TeamName'].iloc[0]
    
    def playin_round(self, id_tuple, div_df, playin_df, seed, div):
        id1, id2 = id_tuple
        final_res = self.predict_id(id1, id2)
        final_res_str = "won" if final_res >= 0 else "lost"
        
        team1 = f"{seed} {self.team_id(id1)}"
        team2 = f"{seed} {self.team_id(id2)}"
        
        self.res_list.append([team1, team2, team1 if final_res >= 0 else team2, div])
        print(f"Play-in round, seeds {team1} played {team2} and {final_res_str} by %.2f points." % (abs(final_res))) if self.PRINT else 0
        
        # DataFrame.append was removed in pandas 2.0, so use pd.concat to add the play-in winner
        return pd.concat([div_df, playin_df.loc[playin_df['TeamID'] == (id1 if final_res >= 0 else id2)]])
    
    def predict_div(self, div):
        div_df = self.year_df.loc[self.year_df['Seed'].str.contains(div)]
        div_df['Seed'] = div_df['Seed'].str.replace(div, '')
        div_df['Seed'] = div_df['Seed'].str.lstrip('0')

        # Check for play-in rounds
        playin_df = div_df.loc[div_df['Seed'].str.len() > 2]

        if len(playin_df):         
            div_df = div_df.loc[div_df['Seed'].str.len() <= 2]
            playin_df['Seed'] = playin_df['Seed'].str.slice(0, 2)
            seed = playin_df.iloc[0]['Seed']
            id_first = playin_df.iloc[:2]['TeamID'].tolist()
            div_df = self.playin_round(id_first, div_df, playin_df, seed, div)
            
            if len(playin_df) == 4:
                id_second = playin_df.iloc[-2:]['TeamID'].tolist()
                div_df = self.playin_round(id_second, div_df, playin_df, 16, div)

        matchup_df = div_df[['Seed', 'TeamID']].set_index('Seed')
        play_list = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

        while True:
            tmp_list = []
            for i in range(0, len(play_list), 2):
                seed1 = str(play_list[i])
                seed2 = str(play_list[i+1])

                id1 = matchup_df.loc[seed1]['TeamID']
                id2 = matchup_df.loc[seed2]['TeamID']

                final_res = self.predict_id(id1, id2)

                team1 = f"{seed1} {self.team_id(id1)}"
                team2 = f"{seed2} {self.team_id(id2)}"

                final_res_str = "won" if final_res >= 0 else "lost"
                self.res_list.append([team1, team2, team1 if final_res >= 0 else team2, div])
                print(f"Seed {team1} played seed {team2} and {final_res_str} by %.2f points." % (abs(final_res))) if self.PRINT else 0

                tmp_list.append(seed1 if final_res >= 0 else seed2)

            play_list = tmp_list.copy()
            if len(play_list) <= 1:
                break

        return matchup_df.loc[play_list[0]]['TeamID']
    
    def predict_tour(self):
        div_list = ['W', 'X', 'Y', 'Z']
        div_winners = {}
        for div in div_list:
            print(f"Division {div}") if self.PRINT else 0
            winning_seed = self.predict_div(div)
            div_winners[div] = winning_seed
            print("\n") if self.PRINT else 0
        
        # Under the assumption that Division W always plays Division X in the Final Four
        id1 = div_winners['W']
        id2 = div_winners['X']
        id3 = div_winners['Y']
        id4 = div_winners['Z']
        
        team1 = self.team_id(id1)
        team2 = self.team_id(id2)
        team3 = self.team_id(id3)
        team4 = self.team_id(id4)
        
        final_res_1 = self.predict_id(id1, id2)
        final_res_str_1 = "won" if final_res_1 >= 0 else "lost"
        f4_winner_1 = [id1, team1] if final_res_1 >= 0 else [id2, team2]
        self.res_list.append([team1, team2, team1 if final_res_1 >= 0 else team2, "F4"])
        print(f"{team1} played {team2} and {final_res_str_1} by %.2f points." % (abs(final_res_1))) if self.PRINT else 0
        
        final_res_2 = self.predict_id(id3, id4)
        final_res_str_2 = "won" if final_res_2 >= 0 else "lost"
        f4_winner_2 = [id3, team3] if final_res_2 >= 0 else [id4, team4]
        self.res_list.append([team3, team4, team3 if final_res_2 >= 0 else team4, "F4"])
        print(f"{team3} played {team4} and {final_res_str_2} by %.2f points." % (abs(final_res_2))) if self.PRINT else 0
        
        champ_res_final = self.predict_id(f4_winner_1[0], f4_winner_2[0])
        champ_res_str = "won" if champ_res_final >= 0 else "lost"
        champion = f4_winner_1[1] if champ_res_final >= 0 else f4_winner_2[1]
        self.res_list.append([f4_winner_1[1], f4_winner_2[1], champion, "Finals"])
        print(f"{f4_winner_1[1]} played {f4_winner_2[1]} and {champ_res_str} by %.2f points." % (abs(champ_res_final))) if self.PRINT else 0
        print(f"THE CHAMPION FOR THE {self.season} SEASON IS {champion}!!! They will win by {int(abs(champ_res_final))} points in the final round.")
        
        res_df = pd.DataFrame(data=self.res_list, columns=["Team 1", "Team 2", "Winner", "Division"])
        return res_df
    
my_season = predictSeasonAdvanced(2021, dense=True, PRINT=False)
res_df = my_season.predict_tour()
pd.set_option('display.max_rows', 68)
res_df

In [None]:
pd.set_option('display.max_rows', 10)

I won't provide a separate strengths & weaknesses section here since it would reiterate many of the same points as in Advanced Model 1; the added strengths here are a bit more data to work with and a slightly better architecture.

# Generating a March Madness Bracket

The code is from <a href="https://www.kaggle.com/ironicninja/generate-a-march-madness-bracket"> here</a>; a sample (and working) implementation is below.

In [None]:
class predictSeason():
    def __init__(self, pred_df, season=2021, PRINT=False):
        assert 'ID' in pred_df.columns, "Column 'ID' is not found in your input DataFrame. Check your spelling and capitalization."
        assert 'Pred' in pred_df.columns, "Column 'Pred' is not found in your input DataFrame. Check your spelling and capitalization."
        assert str(season) in pred_df['ID'].str.slice(0, 4).unique(), f"{season} season not found in your prediction DataFrame."
        
        try:
            self.teams_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage{STAGE}_Spread/MTeams.csv")
            self.seeding_df = pd.read_csv(f"../input/ncaam-march-mania-2021-spread/MDataFiles_Stage2_Spread/MNCAATourneySeeds.csv")
        except:
            raise Exception("Some files are not found. Ensure the file paths are correct.")
            
        try:
            self.year_df = self.seeding_df[self.seeding_df['Season'] == season]
        except:
            raise Exception(f"The {season} season is out of range, please try again.")
            
            
        self.pred_df = pred_df
        self.season = season
        self.PRINT = PRINT
        self.res_list = []
        
    def predict_id(self, id1, id2):
        """
        Returns a boolean stating whether the team with id1 wins.
        """
        
        id_str = f"{self.season}_{min(id1, id2)}_{max(id1, id2)}"
        pred = self.pred_df.loc[self.pred_df['ID'] == id_str]['Pred'].iloc[0]
        res = bool(pred >= 0)
        # The stored prediction is oriented as (lower ID, higher ID), so flip it when id1 is the higher ID
        return (not res) if id1 > id2 else res
    
    def team_id(self, id_test):
        """
        Returns the name of a team with a certain ID.
        """
        
        return self.teams_df.loc[self.teams_df['TeamID'] == id_test]['TeamName'].iloc[0]
    
    def playin_round(self, id_tuple, div_df, playin_df, seed, div):
        """
        Handles logic for playin round (before Round of 64).
        """
        
        id1, id2 = id_tuple
        
        final_res = self.predict_id(id1, id2)

        team1 = f"{seed} {self.team_id(id1)}"
        team2 = f"{seed} {self.team_id(id2)}"
        
        final_res_str = "won" if final_res else "lost"  # final_res is a boolean here, so `>= 0` would always be True
        self.res_list.append([team1, team2, team1 if final_res else team2, "play-in", div])
        print(f"Play-in round, seeds {team1} played {team2} and {final_res_str}.") if self.PRINT else 0
        
        # DataFrame.append was removed in pandas 2.0, so use pd.concat to add the play-in winner
        return pd.concat([div_df, playin_df.loc[playin_df['TeamID'] == (id1 if final_res else id2)]])
    
    def predict_div(self, div):
        """
        Simulate and return the division winner.
        """
        
        div_df = self.year_df.loc[self.year_df['Seed'].str.contains(div)]
        div_df['Seed'] = div_df['Seed'].str.replace(div, '')
        div_df['Seed'] = div_df['Seed'].str.lstrip('0')

        # Check for play-in rounds
        playin_df = div_df.loc[div_df['Seed'].str.len() > 2]

        if len(playin_df):         
            div_df = div_df.loc[div_df['Seed'].str.len() <= 2]
            playin_df['Seed'] = playin_df['Seed'].str.slice(0, 2)
            seed = playin_df.iloc[0]['Seed']
            id_first = playin_df.iloc[:2]['TeamID'].tolist()
            div_df = self.playin_round(id_first, div_df, playin_df, seed, div)
            
            if len(playin_df) == 4:
                id_second = playin_df.iloc[-2:]['TeamID'].tolist()
                div_df = self.playin_round(id_second, div_df, playin_df, 16, div)

        matchup_df = div_df[['Seed', 'TeamID']].set_index('Seed')
        play_list = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15] # This initial order handles all of the logic we need for the bracket
        round_num = 64

        # Continue until there's a division winner.
        while True:
            
            tmp_list = []
            for i in range(0, len(play_list), 2):
                seed1 = str(play_list[i])
                seed2 = str(play_list[i+1])

                id1 = matchup_df.loc[seed1]['TeamID']
                id2 = matchup_df.loc[seed2]['TeamID']

                final_res = self.predict_id(id1, id2)

                team1 = f"{seed1} {self.team_id(id1)}"
                team2 = f"{seed2} {self.team_id(id2)}"

                final_res_str = "won" if final_res else "lost"
                self.res_list.append([team1, team2, team1 if final_res else team2, f"R{int(round_num)}", div])
                print(f"Seed {team1} played seed {team2} and {final_res_str}.") if self.PRINT else 0

                tmp_list.append(seed1 if final_res else seed2)

            play_list = tmp_list.copy()
            round_num /= 2
            
            if len(play_list) <= 1:
                break

        return matchup_df.loc[play_list[0]]['TeamID']
    
    def predict_tour(self):
        """
        Driver function for creating the bracket. Run this and only this function.
        """
        
        div_list = ['W', 'X', 'Y', 'Z']
        div_winners = {}
        for div in div_list:
            if self.PRINT:
                print(f"Division {div}")
            winning_id = self.predict_div(div)  # predict_div returns the winner's TeamID, not its seed
            div_winners[div] = winning_id
            if self.PRINT:
                print("\n")
        
        # Assumes Division W always plays Division X (and Y plays Z) in the Final Four
        
        id1 = div_winners['W']
        id2 = div_winners['X']
        id3 = div_winners['Y']
        id4 = div_winners['Z']
        
        team1 = self.team_id(id1)
        team2 = self.team_id(id2)
        team3 = self.team_id(id3)
        team4 = self.team_id(id4)
        
        # F4 W & X
        final_res_1 = self.predict_id(id1, id2)
        final_res_str_1 = "won" if final_res_1 else "lost"
        f4_winner_1 = [id1, team1] if final_res_1 else [id2, team2]
        self.res_list.append([team1, team2, team1 if final_res_1 else team2, "R4", "F4"])
        if self.PRINT:
            print(f"{team1} played {team2} and {final_res_str_1}.")
        
        # F4 Y & Z
        final_res_2 = self.predict_id(id3, id4)
        final_res_str_2 = "won" if final_res_2 else "lost"
        f4_winner_2 = [id3, team3] if final_res_2 else [id4, team4]
        self.res_list.append([team3, team4, team3 if final_res_2 else team4, "R4", "F4"])
        if self.PRINT:
            print(f"{team3} played {team4} and {final_res_str_2}.")
        
        # Championship
        champ_res_final = self.predict_id(f4_winner_1[0], f4_winner_2[0])
        champ_res_str = "won" if champ_res_final else "lost"
        champion = f4_winner_1[1] if champ_res_final else f4_winner_2[1]
        self.res_list.append([f4_winner_1[1], f4_winner_2[1], champion, "R2", "Finals"])
        print(f"THE CHAMPION FOR THE {self.season} SEASON IS {champion}!!!")
        
        res_df = pd.DataFrame(data=self.res_list, columns=["Team 1", "Team 2", "Winner", "Round", "Division"])
        return res_df
    
my_season = predictSeason(final_pred_simple_df, season=2021)
res_df = my_season.predict_tour()
pd.set_option('display.max_rows', 68)
res_df
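
To see why the seed order in `play_list` is enough to drive a whole division, here is a minimal standalone sketch of the same pairing loop. It is not part of the `predictSeason` class: `simulate_rounds` and `first_wins` are hypothetical names used only for illustration, and the "lower seed number always wins" rule is an assumption standing in for `self.predict_id`.

```python
# Seed order copied from predict_div; adjacent entries form the first-round games.
PLAY_ORDER = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]


def simulate_rounds(order, first_wins):
    """Pair adjacent seeds each round and keep the winners in their original order.

    `first_wins(a, b)` is a hypothetical stand-in for self.predict_id: it returns
    True when seed `a` beats seed `b`.
    """
    rounds = []
    current = list(order)
    while len(current) > 1:
        # Adjacent entries play each other, exactly like the loop in predict_div.
        games = [(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        rounds.append(games)
        current = [a if first_wins(a, b) else b for a, b in games]
    return rounds, current[0]


# Assumption for illustration only: the lower seed number always wins.
rounds, winner = simulate_rounds(PLAY_ORDER, lambda a, b: a < b)
for games in rounds:
    print(games)
print("Division winner seed:", winner)
```

Under that assumption the second round pairs 1 with 8, 5 with 4, 6 with 3, and 7 with 2, which matches the standard bracket layout, so no extra bookkeeping is needed beyond the initial order.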

# Concluding Remarks

Thank you for reading through this notebook! I put these models together in about 3 days, so they are by no means perfect. I wanted to share this work in case someone wants to try using neural networks to predict March Madness scores or is interested in MLPs in general. If you found this notebook interesting or helpful, I would really appreciate an <span style="color: green"> upvote </span> or a <span style="color: blue"> comment</span>!