March Madness, the greatest sporting event in the world, is upon us once again. Perhaps the hallmark of this event is its propensity for producing stunning upsets and thrilling Cinderella stories year after year. Well, that and the never-ending quest to fill out a perfect bracket. What I'd like to do in this notebook is try to demystify March Madness, and ultimately try to predict which teams are more likely to pull off the big upsets in 2021.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline
pd.set_option('display.max_columns', None)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/college-basketball-dataset/cbb.csv')
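tdf = df[df['POSTSEASON'].notnull()]
tdf = tdf[tdf['POSTSEASON']!='R68'].copy()  # drop First Four losers; copy so the edits below don't warn
# Escape the period so only a literal ' St.' is expanded to ' State'
tdf['TEAM'] = tdf['TEAM'].replace(to_replace=r' St\.', value=' State', regex=True)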
tdf = tdf.replace(to_replace='Mississippi', value='Ole Miss')


Before I continue any further, let's define an upset. For this notebook, I will count as an upset any game won by a team seeded 3 or more spots below its opponent. For example, 10 beats 7 and 4 beats 1 would both be upsets, but 9 beats 8 and 3 beats 1 wouldn't.
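
As a quick sanity check on that definition, here is a minimal sketch. The helper name `is_upset` and the `min_gap` parameter are my own illustration and are not used elsewhere in the notebook.

```python
def is_upset(winner_seed: int, loser_seed: int, min_gap: int = 3) -> bool:
    """True when the winner's seed number is at least `min_gap` higher than the loser's."""
    return winner_seed - loser_seed >= min_gap

# The four examples from the definition above
for w, l in [(10, 7), (4, 1), (9, 8), (3, 1)]:
    print(f"{w} beats {l}: upset={is_upset(w, l)}")
```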

My initial hypothesis is that teams with stronger offenses will be more prone to pulling off upsets because they will have the most chances to close the gap between them and a stronger team. Explained another way, every time two teams play, there is an expected point differential and a standard deviation around it, and when teams play more aggressively, they have more chances to 'make up' that standard deviation.

In [None]:
round1_low = tdf[(tdf['SEED']>=10) & (tdf['SEED']<16)]
round1_high = tdf[(tdf['SEED']<=7) & (tdf['SEED']>1)]

upsets_1st = round1_low[(round1_low['POSTSEASON']!='R64')]
non_upsets_1st = round1_low[(round1_low['POSTSEASON']=='R64')]

labels = tdf.columns[4:20] #.append(tdf.columns[7:20])
upset_means = upsets_1st[labels].mean()
non_up_means = non_upsets_1st[labels].mean()

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots(figsize=(14, 6))
g1 = ax.bar(x - width/2, upset_means, width, label='Upsets')
g2 = ax.bar(x + width/2, non_up_means, width, label='Non-Upsets')

ax.set_title('Upset Teams vs Non-Upset Teams Season Stats')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_xlabel('Stat')
ax.set_ylabel('Value')
ax.legend()

mean_diffs = upset_means-non_up_means
print(mean_diffs)


This initial EDA doesn't seem to say much so far. The upset teams outperform the non-upset teams in nearly every category (a positive difference for offensive stats, a negative difference for defensive stats). This holds even after I separated out all the 16 and 1 seeds, since that upset has happened just once in well over 100 matchups (sorry, Virginia), and some of the 16 seeds that make the tournament are just pretty bad.

Now let's do a bootstrap analysis to see which of these differences are statistically significant.
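
To make the resampling idea concrete, here is a rough sketch, assuming the null hypothesis is that both groups come from one pooled distribution. The function name `pooled_bootstrap_p_value` and the synthetic numbers are illustrative only; they are not the notebook's data.

```python
import numpy as np

def pooled_bootstrap_p_value(group_a, group_b, n_reps=1000, seed=0):
    """Two-sided p-value for the difference in means under a pooled null."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    pooled = np.concatenate([group_a, group_b])
    observed = group_a.mean() - group_b.mean()

    reps = np.empty(n_reps)
    for i in range(n_reps):
        # Resample both groups, at their original sizes, from the pooled values
        sample_a = rng.choice(pooled, size=len(group_a), replace=True)
        sample_b = rng.choice(pooled, size=len(group_b), replace=True)
        reps[i] = sample_a.mean() - sample_b.mean()

    # Share of resampled differences at least as extreme as the observed one
    return float(np.mean(np.abs(reps) >= np.abs(observed)))

# Synthetic stand-ins for a stat in the upset and non-upset groups
rng = np.random.default_rng(1)
same = pooled_bootstrap_p_value(rng.normal(100, 5, 40), rng.normal(100, 5, 120))
shifted = pooled_bootstrap_p_value(rng.normal(110, 5, 40), rng.normal(100, 5, 120))
print('no real difference:', same)
print('shifted group:', shifted)
```

A small p-value means the observed gap is rarely matched by chance resampling alone.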

In [None]:

def bs_rep_2d(data):
    bs_sample1 = np.random.choice(data, len(upsets_1st))
    bs_sample2 = np.random.choice(data, len(non_upsets_1st))
    return (bs_sample1.mean()-bs_sample2.mean())

def draw_bs_reps(data, size=1):
    
    bs_replicates = np.empty(size)
    
    for i in range(size):
        bs_replicates[i] = bs_rep_2d(data)
    return bs_replicates


In [None]:
p = []
for i in labels:
    bs_reps = draw_bs_reps(round1_low[i], size=1000)
    
    p.append(np.sum(np.abs(bs_reps)>np.abs(mean_diffs[i]))/len(bs_reps))

p_vals = tuple(zip(labels, p))

print(p_vals)
    

After doing the bootstrap analysis, we see that the upset teams have a statistically significant advantage in Adjusted Offensive Efficiency (ADJOE), Adjusted Defensive Efficiency (ADJDE), Effective FG Percentage Allowed (EFG_D), steal rate (TORD), and 3-Point FG Percentage Allowed (3P_D). Based on this initial analysis, it seems that the lower seeded teams that create more possessions (through steals and defensive efficiency) and are more efficient with those possessions are more likely to pull off an upset.

Not the most groundbreaking insight, but there is another side of this equation to look at: the higher seeded team being upset.

In [None]:

h_upsets_1st = round1_high[round1_high['POSTSEASON']=='R64']
h_survive_1st = round1_high[round1_high['POSTSEASON']!='R64']

h_upset_means = h_upsets_1st[labels].mean()
h_surv_means = h_survive_1st[labels].mean()

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots(figsize=(14, 6))
g1 = ax.bar(x - width/2, h_upset_means, width, label='Upset')
g2 = ax.bar(x + width/2, h_surv_means, width, label='Survived')

ax.set_title('Higher Seeded Upset Teams vs Surviving Teams Season Stats')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_xlabel('Stat')
ax.set_ylabel('Value')
ax.legend()

mean_diffs = h_upset_means-h_surv_means
print(mean_diffs)

p = []
for i in labels:
    # Resample from the higher seeds, since those are the teams being compared in this cell
    bs_reps = draw_bs_reps(round1_high[i], size=1000)
    
    p.append(np.sum(np.abs(bs_reps)>np.abs(mean_diffs[i]))/len(bs_reps))

p_vals = tuple(zip(labels, p))

print(p_vals)

After doing the same analysis for teams that got upset, we see that they have a statistically significant disadvantage in Offensive Efficiency. These analyses paint a clear picture showing that teams who are efficient on both sides of the court and generate turnovers are most likely to pull off upsets while teams that are less efficient (though not necessarily prone to turnovers) are more likely to get upset. 

This of course leads us to our next question: where is the threshold between lower seeded efficiency and higher seeded inefficiency that makes an upset predictable, or even likely?

In [None]:
tdf.loc[(tdf['POSTSEASON']!='R64') & (tdf['SEED']>=10), 'Upset_W1st'] = 1
tdf.loc[(tdf['POSTSEASON']=='R64') & (tdf['SEED']>=10), 'Upset_W1st'] = 0
tdf.loc[(tdf['POSTSEASON']=='R64') & (tdf['SEED']<=7), 'Upset_L1st'] = 1
tdf.loc[(tdf['POSTSEASON']!='R64') & (tdf['SEED']<=7), 'Upset_L1st'] = 0

tdf.fillna(0, inplace=True)

udf = tdf[(tdf['Upset_W1st']==1) | (tdf['Upset_L1st']==1)]

udf.groupby('SEED').count().tail(7)

In [None]:
upset_1st_w = udf[udf['Upset_W1st']==1]
upset_1st_l = udf[udf['Upset_L1st']==1]

u1w_means = upset_1st_w[labels].mean()
u1l_means = upset_1st_l[labels].mean()

x = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots(figsize=(14, 6))
g1 = ax.bar(x - width/2, u1w_means, width, label='Winners')
g2 = ax.bar(x + width/2, u1l_means, width, label='Losers')

ax.set_title('Upset Winners vs Upset Losers Season Stats')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_xlabel('Stat')
ax.set_ylabel('Value')
ax.legend()

mean_diffs = u1w_means-u1l_means
print(mean_diffs)

In [None]:
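def bs_rep_2d(data):
    # Resample groups the same size as the two groups being compared (upset winners vs. upset losers)
    bs_sample1 = np.random.choice(data, len(upset_1st_w))
    bs_sample2 = np.random.choice(data, len(upset_1st_l))
    return (bs_sample1.mean()-bs_sample2.mean())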

p = []
for i in labels:
    bs_reps = draw_bs_reps(udf[i], size=1000)
    
    p.append(np.sum(np.abs(bs_reps)>np.abs(mean_diffs[i]))/len(bs_reps))

p_vals = tuple(zip(labels, p))

print(p_vals)

Not surprisingly, the upset losers have an advantage in most categories, as they are the higher ranked teams; however, the upset winners have a clear advantage when it comes to the turnover battle. They have a statistically significant advantage in steal rate. In fact, this advantage is larger than the one they had over other low seeds that didn't pull off upsets.

This suggests that teams with better offensive and defensive efficiency will perform better over the course of an entire season. However, less efficient teams that remain opportunistic through turnovers can pull off an upset in one game. 

Now that we have a general idea of what stats make an upset in a given game more likely, let's look at each specific upset and see if those stats reinforce our current ideas.

In [None]:
gdf = pd.read_csv('/kaggle/input/ncaatourneygamesandseasonstats1519/NCAATourneyGamesAndSeasonStats_15-19.csv', index_col=0)
gdf.drop_duplicates(inplace=True)
gdf.reset_index(inplace=True)
gdf.drop('index', axis=1, inplace=True)

winner = []
low_seed_win = []
for index, item in gdf.iterrows():
    if item['SEED_winner'] > item['SEED_loser']:
        win = 'Lower'
        low = 1
        
    elif item['SEED_winner'] <= item['SEED_loser']:
        win = 'Higher'
        low = 0
        
    winner.append(win)
    low_seed_win.append(low)
    
gdf['winner'] = winner
gdf['low_seed_win'] = low_seed_win


all_upsets = gdf[(gdf['SEED_winner']>= (gdf['SEED_loser']+3))].copy()  # copy so the column edits below act on an independent frame

drop_cols = ['G_winner', 'G_loser', 'WAB_winner', 'WAB_loser']
all_upsets.drop(drop_cols, axis=1, inplace=True)
new_cols = ['ADJOE_diff', 'ADJDE_diff', 'BARTHAG_diff', 'EFG_O_diff', 'EFG_D_diff', 'TOR_diff', 'TORD_diff', 'ORB_diff', 'DRB_diff', 'FTR_diff', 'FTRD_diff', '2P_O_diff', '2P_D_diff', '3P_O_diff', '3P_D_diff', 'ADJ_T_diff']
winner_cols = all_upsets.columns[5:21]
loser_cols = all_upsets.columns[27:43]


for (a, b, c) in zip(new_cols, winner_cols, loser_cols):
    all_upsets[a] = all_upsets[b]-all_upsets[c]


diff_means = []
for i in new_cols:
    diff_means.append(all_upsets[i].mean())
    print(i, all_upsets[i].mean())
    


I imported each individual March Madness matchup from the past 7 tournaments using box score data from the sportsreference Python module. I will also use this to build a win probability model later.

Interestingly enough, the lower seeded winning team underperforms compared to the higher seeded team in every category except steal rate. This seems to strengthen the case even further that teams that create the most turnovers have the highest chance of pulling off the upset.

In [None]:
tdf['Turnover_diff'] = tdf['TORD']-tdf['TOR']
to_low = tdf[tdf['SEED']>=10]
to_top_50 = to_low.sort_values('Turnover_diff', ascending=False).head(50)
to_tail_rest = to_low.sort_values('Turnover_diff', ascending=False).tail(147)

upset_pct = to_top_50[to_top_50['POSTSEASON']!='R64']['TEAM'].count()/50
print('Top 50 TO Diff. 1st Rnd Upset Pct:', upset_pct)

rest_upset_pct = to_tail_rest[to_tail_rest['POSTSEASON']!='R64']['TEAM'].count()/147
print('Bottom rest TO Diff. 1st Rnd Upset Pct:', rest_upset_pct)

round1_low = tdf[tdf['SEED']>=10]
std_upset_pct = round1_low[round1_low['POSTSEASON']!='R64']['TEAM'].count()/len(round1_low)
print('Standard 1st Rnd Upset Pct:', std_upset_pct)


These percentages seem to support my case even further, but for comparison's sake let's look at a few other stats.

In [None]:
tdf['E_diff'] = tdf['ADJOE']-tdf['ADJDE']
tdf['ETO_Avg'] = (tdf['E_diff']+tdf['Turnover_diff'])/2
t_low = tdf[tdf['SEED']>=10]
t_50 = t_low.sort_values('ETO_Avg', ascending=False).head(50)
t_rest = t_low.sort_values('ETO_Avg', ascending=False).tail(147)
e_low = tdf[tdf['SEED']>=10]
e_top50 = e_low.sort_values('E_diff', ascending=False).head(50)
e_tailrest = e_low.sort_values('E_diff', ascending=False).tail(147)

upset_pct = e_top50[e_top50['POSTSEASON']!='R64']['TEAM'].count()/50
print('Top 50 Efficiency Diff. 1st Rnd Upset Pct:', upset_pct)

rest_upset_pct = e_tailrest[e_tailrest['POSTSEASON']!='R64']['TEAM'].count()/147
print('Bottom rest Efficiency Diff. 1st Rnd Upset Pct:', rest_upset_pct)

eto_pct = t_50[t_50['POSTSEASON']!='R64']['TEAM'].count()/50
print('Top 50 Turnover/Efficiency Avg. 1st Rnd Upset Pct:', eto_pct)

eto_rest = t_rest[t_rest['POSTSEASON']!='R64']["TEAM"].count()/147
print('Bottom rest Turnover/Efficiency Avg. 1st Rnd Upset Pct:', eto_rest)

tdf['2P_diff'] = tdf['2P_O'] - tdf['2P_D']
tdf['3P_diff'] = tdf['3P_O'] - tdf['3P_D']

p2_low = tdf[tdf['SEED']>=10]
p2_50 = p2_low.sort_values('2P_diff', ascending=False).head(50)
p2_rest = p2_low.sort_values('2P_diff', ascending=False).tail(147)
p3_low = tdf[tdf['SEED']>=10]
p3_top50 = p3_low.sort_values('3P_diff', ascending=False).head(50)
p3_tailrest = p3_low.sort_values('3P_diff', ascending=False).tail(147)

p2_upset_pct = p2_50[p2_50['POSTSEASON']!='R64']['TEAM'].count()/50
print('Top 50 2P Diff. 1st Rnd Upset Pct:', p2_upset_pct)

p2_rest_upset_pct = p2_rest[p2_rest['POSTSEASON']!='R64']['TEAM'].count()/147
print('Bottom rest 2P Diff. 1st Rnd Upset Pct:', p2_rest_upset_pct)

p3_upset_pct = p3_top50[p3_top50['POSTSEASON']!='R64']['TEAM'].count()/50
print('Top 50 3P Diff. 1st Rnd Upset Pct:', p3_upset_pct)

p3_rest_upset_pct = p3_tailrest[p3_tailrest['POSTSEASON']!='R64']['TEAM'].count()/147
print('Bottom rest 3P Diff. 1st Rnd Upset Pct:', p3_rest_upset_pct)

So looking at efficiency, 2-point, and 3-point shooting differences between a team's offense and defense, efficiency actually seems like a better predictor than turnover differential. However, the best predictor is the average of turnover and efficiency differential.
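
The repeated top-50-versus-rest comparisons above all follow one pattern, which could be sketched as a single helper. The name `upset_rate_split` and the toy rows are hypothetical; only the `POSTSEASON` and `E_diff` column names come from the notebook.

```python
import pandas as pd

def upset_rate_split(teams, stat, top_n=50):
    """First-round win rate for the top `top_n` teams by `stat` and for everyone else."""
    ranked = teams.sort_values(stat, ascending=False)
    won_first_round = ranked['POSTSEASON'] != 'R64'
    return won_first_round.iloc[:top_n].mean(), won_first_round.iloc[top_n:].mean()

# Made-up low seeds, purely to show the mechanics
toy = pd.DataFrame({
    'TEAM': list('ABCDEF'),
    'POSTSEASON': ['R32', 'R64', 'S16', 'R64', 'R64', 'R64'],
    'E_diff': [12.0, 3.0, 9.0, 1.0, 5.0, -2.0],
})
top_rate, rest_rate = upset_rate_split(toy, 'E_diff', top_n=2)
print('top:', top_rate, 'rest:', rest_rate)
```

Splitting with `iloc[:top_n]` and `iloc[top_n:]` keeps the two groups disjoint without hard-coding the size of the remainder.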

Now let's try to predict using a team's season long stats whether or not they will lose in the first round.

In [None]:
tdf.loc[tdf['POSTSEASON']=='R64', 'Exit_1st'] = 1
tdf.loc[tdf['POSTSEASON']!='R64', 'Exit_1st'] = 0
rel_cols = tdf.columns[4:20].append(tdf.columns[22:23].append(tdf.columns[28:29]))

team_info = tdf[['TEAM', 'YEAR', 'SEED']]


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

rf = RandomForestClassifier(n_estimators=220, max_depth=10, random_state=0, n_jobs=-1)

X = tdf[rel_cols]
y = tdf['Exit_1st']

scores = cross_val_score(rf, X, y, cv=5)

X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.25, random_state=0)

model = rf.fit(X_train, y_train)

y_predict = model.predict(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Exit_1st':y_predict})
output = output.merge(team_info, how='left', left_on='ID', right_on=team_info.index)

print(scores)
print('Avg. Acc:', scores.mean())

cm = confusion_matrix(y_test, y_predict)

plt.figure()
sns.heatmap(cm, cmap='YlGnBu', yticklabels=['1st Rnd W', '1st Rnd L'], xticklabels=['1st Rnd W', '1st Rnd L'], annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')

plt.figure()
sns.barplot(x= model.feature_importances_, y = X_test.columns)

print(rf.get_params())

Here we see that the model also agrees that Efficiency and Turnover differential average is the most important single stat in predicting tournament success.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 6)

X = tdf[rel_cols]
y = tdf['Exit_1st']

scores = cross_val_score(knn, X, y, cv=5)

X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.25, random_state=0)

model = knn.fit(X_train, y_train)

y_predict = model.predict(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Exit_1st':y_predict})
output = output.merge(team_info, how='left', left_on='ID', right_on=team_info.index)

print(scores)
print('Avg. Acc:', scores.mean())

In [None]:
from sklearn.svm import SVC

svc = SVC()

X = tdf[rel_cols]
y = tdf['Exit_1st']

scores = cross_val_score(svc, X, y, cv=5)

X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.25, random_state=0)

model = svc.fit(X_train, y_train)

y_predict = model.predict(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Exit_1st':y_predict})
output = output.merge(team_info, how='left', left_on='ID', right_on=team_info.index)

print(scores)
print('Avg. Acc:', scores.mean())

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
grd = GradientBoostingClassifier(random_state=0)

X = tdf[rel_cols]
y = tdf['Exit_1st']

scores = cross_val_score(grd, X, y, cv=5)

X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.25, random_state=0)

model = grd.fit(X_train, y_train)

y_predict = model.predict(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Exit_1st':y_predict})
output = output.merge(team_info, how='left', left_on='ID', right_on=team_info.index)

print(scores)
print('Avg. Acc:', scores.mean())

After running a few different classifiers, Random Forest and SVC seem to be the most accurate. I will continue with Random Forest to see the feature importances, however, I may also use SVC later on especially if it runs quicker with many iterations.
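
For reference, the same comparison can be written as one loop over candidate models. This is only a sketch on synthetic data (`X_demo` and `y_demo` are made up), so the scores it prints say nothing about the tournament data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification problem standing in for the team stats
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

candidates = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=6),
    'SVC': SVC(),
    'Gradient Boosting': GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy per model
mean_acc = {name: cross_val_score(clf, X_demo, y_demo, cv=5).mean()
            for name, clf in candidates.items()}

for name, acc in sorted(mean_acc.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.3f}')
```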

In [None]:
ndf = pd.read_csv('/kaggle/input/fullseasonstats-1319/NCAATourneyFullSeasonStats_13-19.csv', index_col=0)
ndf.drop_duplicates(inplace=True)
ndf.drop(['SEED_higher.1', 'SEED_lower.1'], axis=1, inplace=True)
ndf.reset_index(inplace=True)
ndf.drop('index', axis=1, inplace=True)
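# Escape the period so only a literal ' St.' is expanded, and assign back instead of relying on inplace
ndf['TEAM_higher'] = ndf['TEAM_higher'].replace(to_replace=r' St\.', value=' State', regex=True)
ndf['TEAM_lower'] = ndf['TEAM_lower'].replace(to_replace=r' St\.', value=' State', regex=True)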

low_wins = []
for index, item in ndf.iterrows():
    if item['winner']=='Lower':
        low_win = 1
        
    if item['winner']=='Higher':
        low_win = 0
        
    low_wins.append(low_win)
    
ndf['Low_Seed_Win']=low_wins

ndf.loc[(ndf['POSTSEASON_lower']=='R68') & (ndf['Low_Seed_Win']==1), 'Round'] = 0
ndf.loc[(ndf['POSTSEASON_higher']=='R68') & (ndf['Low_Seed_Win']==1), 'Round'] = 0
ndf.loc[(ndf['POSTSEASON_lower']=='R64') & (ndf['Low_Seed_Win']==0), 'Round'] = 1
ndf.loc[(ndf['POSTSEASON_higher']=='R64') & (ndf['Low_Seed_Win']==1), 'Round'] = 1
ndf.loc[(ndf['POSTSEASON_lower']=='R32') & (ndf['Low_Seed_Win']==0), 'Round'] = 2
ndf.loc[(ndf['POSTSEASON_higher']=='R32') & (ndf['Low_Seed_Win']==1), 'Round'] = 2
ndf.loc[(ndf['POSTSEASON_lower']=='S16') & (ndf['Low_Seed_Win']==0), 'Round'] = 3
ndf.loc[(ndf['POSTSEASON_higher']=='S16') & (ndf['Low_Seed_Win']==1), 'Round'] = 3
ndf.loc[(ndf['POSTSEASON_lower']=='E8') & (ndf['Low_Seed_Win']==0), 'Round'] = 4
ndf.loc[(ndf['POSTSEASON_higher']=='E8') & (ndf['Low_Seed_Win']==1), 'Round'] = 4
ndf.loc[(ndf['POSTSEASON_lower']=='F4') & (ndf['Low_Seed_Win']==0), 'Round'] = 5
ndf.loc[(ndf['POSTSEASON_higher']=='F4') & (ndf['Low_Seed_Win']==1), 'Round'] = 5
ndf.loc[(ndf['POSTSEASON_lower']=='2ND') & (ndf['Low_Seed_Win']==0), 'Round'] = 6
ndf.loc[(ndf['POSTSEASON_higher']=='2ND') & (ndf['Low_Seed_Win']==1), 'Round'] = 6

seed_w = []
seed_l = []
for index, item in ndf.iterrows():
    if item['winner'] == 'Higher':
        seed_w.append(item['SEED_higher'])
        seed_l.append(item['SEED_lower'])
        
    elif item['winner'] == 'Lower':
        seed_w.append(item['SEED_lower'])
        seed_l.append(item['SEED_higher'])
        
ndf['Seed_winner'] = seed_w
ndf['Seed_loser'] = seed_l

In [None]:
ndf.iloc[11] = ndf.iloc[11].fillna(6)
ndf.iloc[29] = ndf.iloc[29].fillna(6)
ndf.iloc[49] = ndf.iloc[49].fillna(5)
ndf.iloc[54] = ndf.iloc[54].fillna(5)
ndf = ndf.replace(to_replace='UNC', value='North Carolina')
ndf = ndf.replace(to_replace='UC-Irvine', value='UC Irvine')
ndf = ndf.replace(to_replace='Mississippi', value='Ole Miss')

# Drop First Four (R68) games with a boolean mask instead of dropping rows while iterating
is_r68 = (ndf['POSTSEASON_lower']=='R68') | (ndf['POSTSEASON_higher']=='R68')
ndf = ndf[~is_r68].copy()


f415 = {'Round':6}
f415c = {'Round':5}
ndf.iloc[11:12] = ndf.iloc[11:12].replace(to_replace=f415, value=f415c)

Here I imported the game-by-game data again, with one important difference that I didn't catch at first. In the previous dataset (the dataframe named gdf) I suffixed each column with '_winner' and '_loser'. This made my initial models far too accurate, I assume because the model could tell whether a team won or lost purely from which column its stats were in, which leaks the label into the features. With this new dataset, I suffixed each column with '_higher' and '_lower', and I got an accuracy much closer to the models I ran above to predict first round exit.
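
To illustrate the difference, here is a small sketch of re-labelling winner/loser columns by seed so that column position no longer reveals the result. The three games and their numbers are invented, and `by_seed` is a hypothetical helper; the actual dataset was built separately.

```python
import pandas as pd

# Invented games in the leaky winner/loser layout
games = pd.DataFrame({
    'SEED_winner': [1, 12, 3],
    'SEED_loser': [16, 5, 14],
    'ADJOE_winner': [120.0, 108.0, 115.0],
    'ADJOE_loser': [98.0, 112.0, 101.0],
})

# The target: did the lower seeded team (the larger seed number) win?
low_seed_won = games['SEED_winner'] > games['SEED_loser']

def by_seed(stat):
    """Return (higher seed's value, lower seed's value) for one stat."""
    higher = games[f'{stat}_winner'].where(~low_seed_won, games[f'{stat}_loser'])
    lower = games[f'{stat}_loser'].where(~low_seed_won, games[f'{stat}_winner'])
    return higher, lower

blind = pd.DataFrame({'Low_Seed_Win': low_seed_won.astype(int)})
for stat in ['SEED', 'ADJOE']:
    blind[f'{stat}_higher'], blind[f'{stat}_lower'] = by_seed(stat)

print(blind)
```

In this layout the features are ordered the same way whether or not the upset happened, so the model has to rely on the stats themselves.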

Now, I will be using this dataset to predict the results of the 2021 tournament, however, I will first test my methods on the 2019 tournament.

In [None]:
new_game_info = ndf[['winning_name', 'Seed_winner','losing_name', 'Seed_loser', 'Year', 'Round']]
rnd1_info = ndf[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower', 'Year']]
blnd_rel_cols = ndf.columns[10:26].append(ndf.columns[30:31].append(ndf.columns[34:50].append(ndf.columns[54:55].append(ndf.columns[5:7]))))

In [None]:
rf = RandomForestClassifier(random_state=0)

pre_2019 = ndf[ndf['Year']<2019]
g_2019 = ndf[ndf['Year']==2019]

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']
X_test = g_2019[blnd_rel_cols]
y_test = g_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_predict = model.predict(X_test)
y_score = model.score(X_test, y_test)
y_win_prob = model.predict_proba(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_win_prob[:,1]})
output = output.merge(new_game_info, how='left', left_on='ID', right_on=new_game_info.index)

cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, cmap='YlGnBu', annot=True, yticklabels = ['Higher Wins', 'Lower Wins'], xticklabels=['Higher Wins', 'Lower Wins'])
plt.ylabel('Actual')
plt.xlabel('Predicted')

plt.figure(figsize=(6, 10))
sns.barplot(x=rf.feature_importances_, y=X_test.columns)
print(y_score)
output.sort_values('Low_Seed_Win_Prob', ascending=False).head(20)

This model simply predicts the result of every game from the 2019 tournament. While this is useful to see how accurate the model can be, it won't help me when it comes to predicting the 2021 tournament, because I won't know any of the matchups beyond the first round in advance.

For that, I will have to predict round by round, refilling each iteration with only the projected winner. That model will probably be far less accurate.
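
The round-by-round idea can be pictured with a toy simulation of one region. Here `pick_winner` is a hypothetical stand-in for the trained model, and passing `min` simply advances the better seed every time; the bracket order is the standard 1 v 16, 8 v 9, 5 v 12, 4 v 13, 6 v 11, 3 v 14, 7 v 10, 2 v 15 layout.

```python
# Seeds of one region in bracket order, so adjacent slots play each other
BRACKET_ORDER = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

def simulate_region(slots, pick_winner):
    """Advance projected winners round by round until one team remains."""
    rounds = [list(slots)]
    while len(rounds[-1]) > 1:
        current = rounds[-1]
        # Only the projected winners are carried into the next round
        rounds.append([pick_winner(current[i], current[i + 1])
                       for i in range(0, len(current), 2)])
    return rounds

chalk = simulate_region(BRACKET_ORDER, pick_winner=min)
for rnd, teams in enumerate(chalk):
    print(rnd, teams)
```

Because each round is built from the previous round's picks, one wrong pick removes the true winner from every later matchup, which is why this approach should score lower than predicting known matchups.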

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_score = []
output_all = pd.DataFrame()

X_test = g_2019[blnd_rel_cols][g_2019['Round']==1]
y_test = g_2019['Low_Seed_Win'][g_2019['Round']==1]
y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)
y_score.append(model.score(X_test, y_test))

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_prob[:,1]})
output = output.merge(rnd1_info, how='left', left_on='ID', right_on=rnd1_info.index)
output_all = pd.concat([output_all, output])
    
winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])
        
output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

print(y_score)
#output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

In [None]:
reg = pd.read_html('https://en.wikipedia.org/wiki/2019_NCAA_Division_I_Men%27s_Basketball_Tournament#Tournament_seeds', skiprows=0)


east = reg[4]
west = reg[5]
south = reg[6]
mwest = reg[7]
#mwest.iloc[0] = mwest.iloc[0].replace(to_replace='North Carolina', value='UNC')
#south.iloc[12] = south.iloc[12].replace(to_replace='UC Irvine', value='UC-Irvine')



region = []
for index, item in output_all.iterrows():
    if item['Proj_Winner'] in (list(east['School'])):
        region.append('East')
        
    elif item['Proj_Winner'] in (list(west['School'])):
        region.append('West')
        
    elif item['Proj_Winner'] in (list(south['School'])):
        region.append('South')
        
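    elif item['Proj_Winner'] in (list(mwest['School'])):
        region.append('MWest')

    else:
        # Keep the list aligned with output_all when a name matches no region table
        region.append(None)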

output_all['Region'] = region

output_all

My first round here was actually fairly accurate with a few upsets being correctly picked (Murray State, 3 of the 9 seeds), and one being nearly picked correctly (Oregon 45% win prob). On the actual model I may lower the win prob threshold for the first round to a little under 50%. This would be to account for the fact that in any model with a lot of training data, worse teams will become increasingly unlikely to win in the model. Of course this makes it more accurate in the long run, but if any lower seeded team (other than 8 v 9) is even close to pulling off an upset, it could be one worth taking a chance on in your real bracket.
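
Lowering the threshold could look something like the sketch below. The probabilities are made up for illustration (they are not model output), and `pick_upsets` is a hypothetical helper.

```python
import numpy as np

def pick_upsets(low_seed_win_prob, threshold=0.5):
    """Flag a game as a projected upset when the lower seed's win probability reaches the threshold."""
    return (np.asarray(low_seed_win_prob) >= threshold).astype(int)

# Made-up lower seed win probabilities for five first-round games
probs = np.array([0.62, 0.45, 0.30, 0.48, 0.51])

default_picks = pick_upsets(probs)           # standard 50% cut-off
lenient_picks = pick_upsets(probs, 0.45)     # slightly under 50%, as discussed above

print(default_picks)
print(lenient_picks)
```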

In [None]:
# (seeds sharing a first-round game, their possible second-round opponents in assignment order)
SECOND_ROUND_PAIRS = [
    ((1, 16), (8, 9)),
    ((2, 15), (7, 10)),
    ((3, 14), (11, 6)),
    ((4, 13), (12, 5)),
]

def get_2nd_seed(east, mwest, west, south):
    # Tag each projected winner with the seed it would face next. If both candidate
    # opponents appear in the frame, the later one in the tuple is kept.
    for region in (east, mwest, west, south):
        for seeds, opponents in SECOND_ROUND_PAIRS:
            for opp in opponents:
                mask = region['Proj_Seed'].isin(seeds) & (opp in region.values)
                region.loc[mask, 'Plays_Next'] = opp

    return east, mwest, west, south


In [None]:
# (seeds from one quarter of the region, their possible Sweet 16 opponents in assignment order)
THIRD_ROUND_PAIRS = [
    ((1, 16, 8, 9), (4, 5, 12, 13)),
    ((2, 15, 7, 10), (3, 6, 11, 14)),
]

def get_3rd_seed(east, mwest, west, south):
    # Same idea as get_2nd_seed, one round later.
    for region in (east, mwest, west, south):
        for seeds, opponents in THIRD_ROUND_PAIRS:
            for opp in opponents:
                mask = region['Proj_Seed'].isin(seeds) & (opp in region.values)
                region.loc[mask, 'Plays_Next'] = opp

    return east, mwest, west, south

The two functions above match each seed in a region with the seed it would face next. The function below uses them to build the matchups for the following round.
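
The seed pairings hard-coded in those functions follow the standard 16-team regional bracket, so they could also be derived from the bracket order. Here is a minimal sketch of that idea, assuming the usual 1-16-8-9-5-12-4-13-6-11-3-14-7-10-2-15 layout. `BRACKET_ORDER`, `possible_opponents` and `plays_next` are hypothetical names for illustration, not functions from this notebook.

```python
# Hypothetical helpers for illustration; not the notebook's own implementation.
# Standard top-to-bottom order of seeds within one region of the bracket.
BRACKET_ORDER = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]


def possible_opponents(seed, rnd):
    """Return the set of seeds that `seed` could meet in round `rnd` (1 to 4)."""
    pos = BRACKET_ORDER.index(seed)
    pod = 2 ** rnd                    # number of bracket slots feeding this game
    half = pod // 2
    start = (pos // pod) * pod        # first slot of the pod containing `seed`
    block = BRACKET_ORDER[start:start + pod]
    # The opponent comes from the half of the pod that `seed` is not in.
    if pos - start < half:
        return set(block[half:])
    return set(block[:half])


def plays_next(seed, rnd, surviving):
    """Return the single surviving seed that `seed` faces in round `rnd`, or None."""
    alive = possible_opponents(seed, rnd) & set(surviving)
    return alive.pop() if len(alive) == 1 else None


print(sorted(possible_opponents(1, 3)))   # [4, 5, 12, 13]
print(plays_next(1, 3, [1, 12, 3, 2]))    # 12
```

This reproduces the groupings used above: in the third round, seeds 1, 16, 8 and 9 can only meet 4, 5, 12 or 13, and seeds 2, 15, 7 and 10 can only meet 3, 6, 11 or 14.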

In [None]:
#maybe try output_all as an input instead of the tuple triples
def new_matchups(output_all, rnd, season_df, blnd_rel_cols):
    
    if rnd == 2:
        
        rel_out = output_all[['Proj_Winner', 'Proj_Seed', 'Region', 'Year']]

        east = rel_out[rel_out['Region']=='East'].sort_values('Proj_Seed').reset_index()
        east.drop('index', axis=1, inplace=True)
        west = rel_out[rel_out['Region']=='West'].sort_values('Proj_Seed').reset_index()
        west.drop('index', axis=1, inplace=True)
        south = rel_out[rel_out['Region']=='South'].sort_values('Proj_Seed').reset_index()
        south.drop('index', axis=1, inplace=True)
        mwest = rel_out[rel_out['Region']=='MWest'].sort_values('Proj_Seed').reset_index()
        mwest.drop('index', axis=1, inplace=True)

        east2, mwest2, west2, south2 = get_2nd_seed(east, mwest, west, south)


        east2_null = east2[east2['Plays_Next'].isnull()]
        east2 = east2.dropna()
        mwest2_null = mwest2[mwest2['Plays_Next'].isnull()]
        mwest2 = mwest2.dropna()
        west2_null = west2[west2['Plays_Next'].isnull()]
        west2 = west2.dropna()
        south2_null = south2[south2['Plays_Next'].isnull()]
        south2 = south2.dropna()

        east2 = east2.merge(east2_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])
        mwest2 = mwest2.merge(mwest2_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])
        west2 = west2.merge(west2_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])
        south2 = south2.merge(south2_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])

        df = pd.concat([east2, mwest2, west2, south2])

        SEED_higher = []
        SEED_lower = []
        TEAM_higher = []
        TEAM_lower = []
        for index, item in df.iterrows():
            if item['Proj_Seed_x'] < item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_x'])
                TEAM_lower.append(item['Proj_Winner_y'])
                SEED_higher.append(item['Proj_Seed_x'])
                SEED_lower.append(item['Proj_Seed_y'])

            elif item['Proj_Seed_x'] >= item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_y'])
                TEAM_lower.append(item['Proj_Winner_x'])
                SEED_higher.append(item['Proj_Seed_y'])
                SEED_lower.append(item['Proj_Seed_x'])


        df['TEAM_higher'] = TEAM_higher
        df['SEED_higher'] = SEED_higher
        df['TEAM_lower'] = TEAM_lower
        df['SEED_lower'] = SEED_lower


        df.drop(['Proj_Winner_x', 'Proj_Seed_x', 'Plays_Next_x', 'Proj_Winner_y', 'Plays_Next_y', 'Year_y'], axis=1, inplace=True)

        df = df.merge(season_df, how='left', left_on=['TEAM_higher', 'Year_x'], right_on=['TEAM', 'YEAR'])
        df = df.merge(season_df, how='left', left_on=['TEAM_lower', 'Year_x'], right_on=['TEAM', 'YEAR'], suffixes=['_higher', '_lower'])

        rnd2_df = df[blnd_rel_cols]
        rnd2_df = rnd2_df.loc[:, ~rnd2_df.columns.duplicated()]
        rnd2_info = df[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower', 'Region', 'YEAR_higher']]
        rnd2_info = rnd2_info.loc[:, ~rnd2_info.columns.duplicated()]
        
        return rnd2_df, rnd2_info
    
    if rnd == 3:
        
        rel_out = output_all[['Proj_Winner', 'Proj_Seed', 'Region', 'YEAR_higher']]

        east = rel_out[rel_out['Region']=='East'].sort_values('Proj_Seed').reset_index()
        east.drop('index', axis=1, inplace=True)
        west = rel_out[rel_out['Region']=='West'].sort_values('Proj_Seed').reset_index()
        west.drop('index', axis=1, inplace=True)
        south = rel_out[rel_out['Region']=='South'].sort_values('Proj_Seed').reset_index()
        south.drop('index', axis=1, inplace=True)
        mwest = rel_out[rel_out['Region']=='MWest'].sort_values('Proj_Seed').reset_index()
        mwest.drop('index', axis=1, inplace=True)
        
        east3, mwest3, west3, south3 = get_3rd_seed(east, mwest, west, south)
        
        east3_null = east3[east3['Plays_Next'].isnull()]
        east3 = east3.dropna()
        mwest3_null = mwest3[mwest3['Plays_Next'].isnull()]
        mwest3 = mwest3.dropna()
        west3_null = west3[west3['Plays_Next'].isnull()]
        west3 = west3.dropna()
        south3_null = south3[south3['Plays_Next'].isnull()]
        south3 = south3.dropna()

        east3 = east3.merge(east3_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])
        mwest3 = mwest3.merge(mwest3_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])
        west3 = west3.merge(west3_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])
        south3 = south3.merge(south3_null, how='left', left_on=['Plays_Next', 'Region'], right_on=['Proj_Seed', 'Region'])

        df = pd.concat([east3, mwest3, west3, south3])
        
        SEED_higher = []
        SEED_lower = []
        TEAM_higher = []
        TEAM_lower = []
        for index, item in df.iterrows():
            if item['Proj_Seed_x'] < item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_x'])
                TEAM_lower.append(item['Proj_Winner_y'])
                SEED_higher.append(item['Proj_Seed_x'])
                SEED_lower.append(item['Proj_Seed_y'])

            elif item['Proj_Seed_x'] >= item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_y'])
                TEAM_lower.append(item['Proj_Winner_x'])
                SEED_higher.append(item['Proj_Seed_y'])
                SEED_lower.append(item['Proj_Seed_x'])


        df['TEAM_higher'] = TEAM_higher
        df['SEED_higher'] = SEED_higher
        df['TEAM_lower'] = TEAM_lower
        df['SEED_lower'] = SEED_lower

        df.drop(['Proj_Winner_x', 'Proj_Seed_x', 'Plays_Next_x', 'Proj_Winner_y', 'Proj_Seed_y', 'YEAR_higher_y', 'Plays_Next_y'], axis=1, inplace=True)
 
        df = df.merge(season_df, how='left', left_on=['TEAM_higher', 'YEAR_higher_x'], right_on=['TEAM', 'YEAR'])
        df = df.merge(season_df, how='left', left_on=['TEAM_lower', 'YEAR_higher_x'], right_on=['TEAM', 'YEAR'], suffixes=['_higher', '_lower'])

        rnd3_df = df[blnd_rel_cols]
        rnd3_df = rnd3_df.loc[:, ~rnd3_df.columns.duplicated()]
        rnd3_info = df[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower', 'YEAR_higher', 'Region']]
        rnd3_info = rnd3_info.loc[:, ~rnd3_info.columns.duplicated()]
        
        return rnd3_df, rnd3_info
    
    if rnd == 4:
        
        rel_out = output_all[['Proj_Winner', 'Proj_Seed', 'Region', 'YEAR_higher']]

        east4 = rel_out[rel_out['Region']=='East'].sort_values('Proj_Seed').reset_index()
        east4.drop('index', axis=1, inplace=True)
        west4 = rel_out[rel_out['Region']=='West'].sort_values('Proj_Seed').reset_index()
        west4.drop('index', axis=1, inplace=True)
        south4 = rel_out[rel_out['Region']=='South'].sort_values('Proj_Seed').reset_index()
        south4.drop('index', axis=1, inplace=True)
        mwest4 = rel_out[rel_out['Region']=='MWest'].sort_values('Proj_Seed').reset_index()
        mwest4.drop('index', axis=1, inplace=True)
        
        east4_1 = east4.iloc[0:1]
        east4_2 = east4.iloc[1:2]
        mwest4_1 = mwest4.iloc[0:1]
        mwest4_2 = mwest4.iloc[1:2]
        west4_1 = west4.iloc[0:1]
        west4_2 = west4.iloc[1:2]
        south4_1 = south4.iloc[0:1]
        south4_2 = south4.iloc[1:2]
        
        east4 = east4_1.merge(east4_2, how='left', on=['Region', 'YEAR_higher'])
        mwest4 = mwest4_1.merge(mwest4_2, how='left', on=['Region', 'YEAR_higher'])
        west4 = west4_1.merge(west4_2, how='left', on=['Region', 'YEAR_higher'])
        south4 = south4_1.merge(south4_2, how='left', on=['Region', 'YEAR_higher'])
        
        df = pd.concat([east4, mwest4, west4, south4])
        
        SEED_higher = []
        SEED_lower = []
        TEAM_higher = []
        TEAM_lower = []
        for index, item in df.iterrows():
            if item['Proj_Seed_x'] < item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_x'])
                TEAM_lower.append(item['Proj_Winner_y'])
                SEED_higher.append(item['Proj_Seed_x'])
                SEED_lower.append(item['Proj_Seed_y'])

            elif item['Proj_Seed_x'] >= item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_y'])
                TEAM_lower.append(item['Proj_Winner_x'])
                SEED_higher.append(item['Proj_Seed_y'])
                SEED_lower.append(item['Proj_Seed_x'])


        df['TEAM_higher'] = TEAM_higher
        df['SEED_higher'] = SEED_higher
        df['TEAM_lower'] = TEAM_lower
        df['SEED_lower'] = SEED_lower
        
        df.drop(['Proj_Winner_x', 'Proj_Seed_x', 'Proj_Winner_y', 'Proj_Seed_y'], axis=1, inplace=True)
        
        df = df.merge(season_df, how='left', left_on=['TEAM_higher', 'YEAR_higher'], right_on=['TEAM', 'YEAR'])
        df = df.merge(season_df, how='left', left_on=['TEAM_lower', 'YEAR_higher'], right_on=['TEAM', 'YEAR'], suffixes=['_higher', '_lower'])
        
        rnd4_df = df[blnd_rel_cols]
        rnd4_df = rnd4_df.loc[:, ~rnd4_df.columns.duplicated()]
        rnd4_info = df[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower', 'YEAR_higher', 'Region']]
        rnd4_info = rnd4_info.loc[:, ~rnd4_info.columns.duplicated()]
        
        return rnd4_df, rnd4_info
    
    if rnd == 5:
        
        rel_out = output_all[['Proj_Winner', 'Proj_Seed', 'Region', 'YEAR_higher']].copy()

        # East plays West and MWest plays South; the other two regions stay null and become the merge targets.
        rel_out.loc[rel_out['Region']=='East', 'Plays_Next'] = 'West'
        rel_out.loc[rel_out['Region']=='MWest', 'Plays_Next'] = 'South'

        rel_null = rel_out[rel_out['Plays_Next'].isnull()]
        rel_out = rel_out.dropna()
        df = rel_out.merge(rel_null, how='left', left_on=['YEAR_higher', 'Plays_Next'], right_on=['YEAR_higher', 'Region'])
        
        SEED_higher = []
        SEED_lower = []
        TEAM_higher = []
        TEAM_lower = []
        for index, item in df.iterrows():
            if item['Proj_Seed_x'] < item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_x'])
                TEAM_lower.append(item['Proj_Winner_y'])
                SEED_higher.append(item['Proj_Seed_x'])
                SEED_lower.append(item['Proj_Seed_y'])

            elif item['Proj_Seed_x'] >= item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_y'])
                TEAM_lower.append(item['Proj_Winner_x'])
                SEED_higher.append(item['Proj_Seed_y'])
                SEED_lower.append(item['Proj_Seed_x'])


        df['TEAM_higher'] = TEAM_higher
        df['SEED_higher'] = SEED_higher
        df['TEAM_lower'] = TEAM_lower
        df['SEED_lower'] = SEED_lower
        
        df.drop(['Proj_Winner_x', 'Proj_Seed_x', 'Proj_Winner_y', 'Proj_Seed_y', 'Plays_Next_x', 'Plays_Next_y', 'Region_y'], axis=1, inplace=True)
        
        df = df.merge(season_df, how='left', left_on=['TEAM_higher', 'YEAR_higher'], right_on=['TEAM', 'YEAR'])
        df = df.merge(season_df, how='left', left_on=['TEAM_lower', 'YEAR_higher'], right_on=['TEAM', 'YEAR'], suffixes=['_higher', '_lower'])
        
        rnd5_df = df[blnd_rel_cols]
        rnd5_df = rnd5_df.loc[:, ~rnd5_df.columns.duplicated()]
        rnd5_info = df[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower', 'YEAR_higher', 'Region_x']]
        rnd5_info = rnd5_info.loc[:, ~rnd5_info.columns.duplicated()]
        
        return rnd5_df, rnd5_info
    
    if rnd == 6:
        
        rel_out = output_all[['Proj_Winner', 'Proj_Seed', 'Region_x', 'YEAR_higher']]

        rel_1 = rel_out.iloc[0:1]
        rel_2 = rel_out.iloc[1:2]

        df = rel_1.merge(rel_2, how='left', on='YEAR_higher')
        
        SEED_higher = []
        SEED_lower = []
        TEAM_higher = []
        TEAM_lower = []
        for index, item in df.iterrows():
            if item['Proj_Seed_x'] < item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_x'])
                TEAM_lower.append(item['Proj_Winner_y'])
                SEED_higher.append(item['Proj_Seed_x'])
                SEED_lower.append(item['Proj_Seed_y'])

            elif item['Proj_Seed_x'] >= item['Proj_Seed_y']:
                TEAM_higher.append(item['Proj_Winner_y'])
                TEAM_lower.append(item['Proj_Winner_x'])
                SEED_higher.append(item['Proj_Seed_y'])
                SEED_lower.append(item['Proj_Seed_x'])


        df['TEAM_higher'] = TEAM_higher
        df['SEED_higher'] = SEED_higher
        df['TEAM_lower'] = TEAM_lower
        df['SEED_lower'] = SEED_lower

        df.drop(['Proj_Winner_x', 'Proj_Seed_x', 'Region_x_x', 'Proj_Winner_y', 'Proj_Seed_y', 'Region_x_y'], axis=1, inplace=True)
        
        df = df.merge(season_df, how='left', left_on=['TEAM_higher', 'YEAR_higher'], right_on=['TEAM', 'YEAR'])
        df = df.merge(season_df, how='left', left_on=['TEAM_lower', 'YEAR_higher'], right_on=['TEAM', 'YEAR'], suffixes=['_higher', '_lower'])
        
        rnd6_df = df[blnd_rel_cols]
        rnd6_df = rnd6_df.loc[:, ~rnd6_df.columns.duplicated()]
        rnd6_info = df[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower', 'YEAR_higher']]
        rnd6_info = rnd6_info.loc[:, ~rnd6_info.columns.duplicated()]
        
        return rnd6_df, rnd6_info

This function creates a dataframe of matchups for each round, based on who won in the previous round.
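
Every branch of `new_matchups` repeats the same `iterrows` loop to decide which side of a merged row is the higher seed. As a sketch only, the same labelling could be done in one vectorized step with `np.where`. `label_higher_lower` is a hypothetical helper name, and the demo teams are made up; the column names are the ones the merge above produces.

```python
import numpy as np
import pandas as pd


def label_higher_lower(df):
    """Add TEAM/SEED higher and lower columns from the merged _x/_y columns."""
    out = df.copy()
    # Same rule as the loop: x is the higher seed only when its number is
    # strictly smaller; on a tie the y side is treated as the higher seed.
    x_is_higher = out['Proj_Seed_x'] < out['Proj_Seed_y']
    out['TEAM_higher'] = np.where(x_is_higher, out['Proj_Winner_x'], out['Proj_Winner_y'])
    out['TEAM_lower'] = np.where(x_is_higher, out['Proj_Winner_y'], out['Proj_Winner_x'])
    out['SEED_higher'] = np.where(x_is_higher, out['Proj_Seed_x'], out['Proj_Seed_y'])
    out['SEED_lower'] = np.where(x_is_higher, out['Proj_Seed_y'], out['Proj_Seed_x'])
    return out


# Made-up rows, only to show the shape of the input and output.
demo = pd.DataFrame({
    'Proj_Winner_x': ['Team A', 'Team C'],
    'Proj_Seed_x': [1, 12],
    'Proj_Winner_y': ['Team B', 'Team D'],
    'Proj_Seed_y': [8, 4],
})
labeled = label_higher_lower(demo)
print(labeled[['TEAM_higher', 'SEED_higher', 'TEAM_lower', 'SEED_lower']])
```

Pulling the loop into one helper would also mean a change to the tie rule only has to be made in one place.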

In [None]:
rnd2_df, rnd2_info = new_matchups(output_all, 2, tdf, blnd_rel_cols)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_score = []
output_all = pd.DataFrame()

X_test = rnd2_df

y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_prob[:,1]})
output = output.merge(rnd2_info, how='left', left_on='ID', right_on=rnd2_info.index)
output_all = pd.concat([output_all, output])
    
winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])
        
output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

Again, in the 2nd round a threshold of 0.45 for the lower-seeded team to win would have correctly predicted Auburn's upset of Kansas (technically only a one-seed difference, but a major upset based on program legacies).
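
To see how sensitive the number of called upsets is to the cutoff, the probabilities can be compared against a few thresholds. This is a small sketch with made-up probabilities, not the model's actual output; `upset_calls` is a hypothetical helper that uses a strict `>` comparison.

```python
import numpy as np


def upset_calls(low_seed_win_prob, threshold=0.5):
    """Return 1 where the lower seed's win probability exceeds the threshold, else 0."""
    probs = np.asarray(low_seed_win_prob, dtype=float)
    return (probs > threshold).astype(int)


# Made-up probabilities for four games, for illustration only.
demo_probs = [0.62, 0.47, 0.41, 0.12]
for t in (0.50, 0.45, 0.40):
    print(t, int(upset_calls(demo_probs, t).sum()))
# 0.5 -> 1 upset, 0.45 -> 2 upsets, 0.4 -> 3 upsets
```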

In [None]:
rnd3_df, rnd3_info = new_matchups(output_all, 3, tdf, blnd_rel_cols)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_score = []
output_all = pd.DataFrame()

X_test = rnd3_df

y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_prob[:,1]})
output = output.merge(rnd3_info, how='left', left_on='ID', right_on=rnd3_info.index)
output_all = pd.concat([output_all, output])
    
winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])
        
output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

In [None]:
rnd4_df, rnd4_info = new_matchups(output_all, 4, tdf, blnd_rel_cols)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_score = []
output_all = pd.DataFrame()

X_test = rnd4_df

y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_prob[:,1]})
output = output.merge(rnd4_info, how='left', left_on='ID', right_on=rnd4_info.index)
output_all = pd.concat([output_all, output])
    
winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])
        
output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

By this point, the model may be favoring the highest seeds a little too much. Only once have all four 1 seeds made the Final Four.

In [None]:
rnd5_df, rnd5_info = new_matchups(output_all, 5, tdf, blnd_rel_cols)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_score = []
output_all = pd.DataFrame()

X_test = rnd5_df

y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_prob[:,1]})
output = output.merge(rnd5_info, how='left', left_on='ID', right_on=rnd5_info.index)
output_all = pd.concat([output_all, output])
    
winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])
        
output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

In [None]:
rnd6_df, rnd6_info = new_matchups(output_all, 6, tdf, blnd_rel_cols)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

y_score = []
output_all = pd.DataFrame()

X_test = rnd6_df

y_predict = model.predict(X_test)
y_prob = model.predict_proba(X_test)

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win':y_predict, 'Low_Seed_Win_Prob':y_prob[:,1]})
output = output.merge(rnd6_info, how='left', left_on='ID', right_on=rnd6_info.index)
output_all = pd.concat([output_all, output])
    
winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])
        
output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

Ultimately, the model came very close to predicting the actual winner (Virginia). Beyond the first two rounds, however, it was pretty poor at picking upsets. I will try to fix this in my final model.

In [None]:
tdf21 = pd.read_csv('/kaggle/input/college-basketball-dataset/cbb21.csv')
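# Assign the result back instead of calling replace(..., inplace=True) on the column,
# which may act on a temporary copy in recent pandas versions.
# The dot is escaped so that only a literal " St." is rewritten.
tdf21['TEAM'] = tdf21['TEAM'].replace(to_replace=r' St\.', value=' State', regex=True)
# The rule above also rewrites "Mount St. Mary's", so restore it.
tdf21['TEAM'] = tdf21['TEAM'].replace(to_replace=r"Mount State Mary's", value="Mount St. Mary's", regex=True)
tdf21['TEAM'] = tdf21['TEAM'].replace(to_replace=r'Loyola Chicago', value='Loyola-Chicago', regex=True)
tdf21['TEAM'] = tdf21['TEAM'].replace(to_replace=r'Connecticut', value='UConn', regex=True)
# Keep only seeded teams; .copy() avoids SettingWithCopyWarning when columns are added below.
tdf21 = tdf21[tdf21['SEED'].notnull()].copy()
# Drop four rows by their hard-coded index labels.
tdf21.drop([43, 45, 66, 67], inplace=True)
tdf21 = tdf21.reset_index(drop=True)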
tdf21['Turnover_diff'] = tdf21['TORD']-tdf21['TOR']
tdf21['E_diff'] = tdf21['ADJOE']-tdf21['ADJDE']
tdf21['ETO_Avg'] = (tdf21['E_diff']+tdf21['Turnover_diff'])/2
tdf21['YEAR'] = 2021


tdf21.sort_values('ETO_Avg', ascending=False).head(25)

Here are the top 25 teams by the average of efficiency differential and turnover differential (ETO_Avg). The top of the list is mostly high seeds, but a few lower seeds show up too, such as Georgia Tech, Loyola-Chicago, Wisconsin and St. Bonaventure. Even before looking at their opponents, these are teams to watch out for.
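
As a worked example of the metric, here is the same arithmetic applied to two made-up teams. `add_eto_avg` is a hypothetical wrapper around the three column formulas from the cell above; the numbers are invented and only show how the average is formed.

```python
import pandas as pd


def add_eto_avg(df):
    """Add Turnover_diff, E_diff and their average (ETO_Avg) to a copy of df."""
    out = df.copy()
    out['Turnover_diff'] = out['TORD'] - out['TOR']      # turnover differential
    out['E_diff'] = out['ADJOE'] - out['ADJDE']          # efficiency differential
    out['ETO_Avg'] = (out['E_diff'] + out['Turnover_diff']) / 2
    return out


# Invented stats for two teams.
demo = pd.DataFrame({
    'TEAM': ['Team A', 'Team B'],
    'ADJOE': [120.0, 105.0],
    'ADJDE': [90.0, 100.0],
    'TOR': [15.0, 20.0],
    'TORD': [20.0, 18.0],
})
ranked = add_eto_avg(demo).sort_values('ETO_Avg', ascending=False)
print(ranked[['TEAM', 'E_diff', 'Turnover_diff', 'ETO_Avg']])
# Team A: (30 + 5) / 2 = 17.5, Team B: (5 - 2) / 2 = 1.5
```

Because the efficiency differential is usually much larger in magnitude than the turnover differential, it dominates this simple average.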

In [None]:
reg = pd.read_html('https://en.wikipedia.org/wiki/2021_NCAA_Division_I_Men%27s_Basketball_Tournament', skiprows=0)

west = reg[4]
east = reg[5]
south = reg[6]
mwest = reg[7]

east.drop([11, 17], inplace=True)
west.drop([11, 17], inplace=True)
east.replace(to_replace='11*', value=11, inplace=True)
west.replace(to_replace='11*', value=11, inplace=True)
east.replace(to_replace='16*', value=16, inplace=True)
west.replace(to_replace='16*', value=16, inplace=True)
east['Seed'] = east['Seed'].astype('int64')
west['Seed'] = west['Seed'].astype('int64')


region = []
for index, item in tdf21.iterrows():
    if item['TEAM'] in (list(east['School'])):
        region.append('East')
        
    elif item['TEAM'] in (list(west['School'])):
        region.append('West')
        
    elif item['TEAM'] in (list(south['School'])):
        region.append('South')
        
    elif item['TEAM'] in (list(mwest['School'])):
        region.append('MWest')
        

tdf21['Region'] = region
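
The loop above only appends a region when a team name matches one of the four tables, so a single spelling mismatch makes the list shorter than the dataframe and the assignment fails. A sketch of an alternative that makes mismatches visible, assuming each region table has a `School` column as above; `assign_regions` is a hypothetical helper and the demo names are made up.

```python
import pandas as pd


def assign_regions(teams, region_tables):
    """Map each team name to a region; unmatched names come back as NaN."""
    school_to_region = {}
    for region_name, table in region_tables.items():
        for school in table['School']:
            # setdefault keeps the first region found, like the if/elif chain.
            school_to_region.setdefault(school, region_name)
    return teams.map(school_to_region)


# Made-up names; 'Team C' is deliberately missing from both tables.
demo_teams = pd.Series(['Team A', 'Team B', 'Team C'])
demo_tables = {
    'East': pd.DataFrame({'School': ['Team A']}),
    'West': pd.DataFrame({'School': ['Team B']}),
}
regions = assign_regions(demo_teams, demo_tables)
unmatched = demo_teams[regions.isna()].tolist()
print(unmatched)   # names that still need a spelling fix
```

With the real data, the call would look something like `assign_regions(tdf21['TEAM'], {'East': east, 'West': west, 'South': south, 'MWest': mwest})`.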


In [None]:
def get_1st_seed(east, mwest, west, south):

    east.loc[(east['Seed']==1), 'Plays_Next'] = 16
    east.loc[(east['Seed']==2), 'Plays_Next'] = 15
    east.loc[(east['Seed']==3), 'Plays_Next'] = 14
    east.loc[(east['Seed']==4), 'Plays_Next'] = 13
    east.loc[(east['Seed']==5), 'Plays_Next'] = 12
    east.loc[(east['Seed']==6), 'Plays_Next'] = 11
    east.loc[(east['Seed']==7), 'Plays_Next'] = 10
    east.loc[(east['Seed']==8), 'Plays_Next'] = 9

    mwest.loc[(mwest['Seed']==1), 'Plays_Next'] = 16
    mwest.loc[(mwest['Seed']==2), 'Plays_Next'] = 15
    mwest.loc[(mwest['Seed']==3), 'Plays_Next'] = 14
    mwest.loc[(mwest['Seed']==4), 'Plays_Next'] = 13
    mwest.loc[(mwest['Seed']==5), 'Plays_Next'] = 12
    mwest.loc[(mwest['Seed']==6), 'Plays_Next'] = 11
    mwest.loc[(mwest['Seed']==7), 'Plays_Next'] = 10
    mwest.loc[(mwest['Seed']==8), 'Plays_Next'] = 9

    west.loc[(west['Seed']==1), 'Plays_Next'] = 16
    west.loc[(west['Seed']==2), 'Plays_Next'] = 15
    west.loc[(west['Seed']==3), 'Plays_Next'] = 14
    west.loc[(west['Seed']==4), 'Plays_Next'] = 13
    west.loc[(west['Seed']==5), 'Plays_Next'] = 12
    west.loc[(west['Seed']==6), 'Plays_Next'] = 11
    west.loc[(west['Seed']==7), 'Plays_Next'] = 10
    west.loc[(west['Seed']==8), 'Plays_Next'] = 9
    
    south.loc[(south['Seed']==1), 'Plays_Next'] = 16
    south.loc[(south['Seed']==2), 'Plays_Next'] = 15
    south.loc[(south['Seed']==3), 'Plays_Next'] = 14
    south.loc[(south['Seed']==4), 'Plays_Next'] = 13
    south.loc[(south['Seed']==5), 'Plays_Next'] = 12
    south.loc[(south['Seed']==6), 'Plays_Next'] = 11
    south.loc[(south['Seed']==7), 'Plays_Next'] = 10
    south.loc[(south['Seed']==8), 'Plays_Next'] = 9
    
    return east, mwest, west, south


In [None]:
east1, mwest1, west1, south1 = get_1st_seed(east, mwest, west, south)

east1 = east1[['Seed', 'School', 'Plays_Next']]
mwest1 = mwest1[['Seed', 'School', 'Plays_Next']]
west1 = west1[['Seed', 'School', 'Plays_Next']]
south1 = south1[['Seed', 'School', 'Plays_Next']]

d = {'School':'TEAM'}
# Assign the renamed frames back; renaming a column selection in place can trigger SettingWithCopyWarning.
east1 = east1.rename(columns=d)
mwest1 = mwest1.rename(columns=d)
west1 = west1.rename(columns=d)
south1 = south1.rename(columns=d)

east1_null = east1[east1['Plays_Next'].isnull()]
east1 = east1.dropna()
mwest1_null = mwest1[mwest1['Plays_Next'].isnull()]
mwest1 = mwest1.dropna()
west1_null = west1[west1['Plays_Next'].isnull()]
west1 = west1.dropna()
south1_null = south1[south1['Plays_Next'].isnull()]
south1 = south1.dropna()

east1 = east1.merge(east1_null, how='left', left_on=['Plays_Next'], right_on=['Seed'], suffixes=['_higher', '_lower'])
mwest1 = mwest1.merge(mwest1_null, how='left', left_on=['Plays_Next'], right_on=['Seed'], suffixes=['_higher', '_lower'])
west1 = west1.merge(west1_null, how='left', left_on=['Plays_Next'], right_on=['Seed'], suffixes=['_higher', '_lower'])
south1 = south1.merge(south1_null, how='left', left_on=['Plays_Next'], right_on=['Seed'], suffixes=['_higher', '_lower'])

df = pd.concat([east1, mwest1, west1, south1])
df.drop(['Plays_Next_higher', 'Plays_Next_lower'], axis=1, inplace=True)

df = df.merge(tdf21, how='left', left_on=['TEAM_higher', 'Seed_higher'], right_on=['TEAM', 'SEED'])
df = df.merge(tdf21, how='left', left_on=['TEAM_lower', 'Seed_lower'], right_on=['TEAM', 'SEED'], suffixes=['_higher', '_lower'])
df = df.loc[:, ~df.columns.duplicated()]

rel_cols_21 = df.columns[7:23].append(df.columns[27:28].append(df.columns[33:49].append(df.columns[53:54].append(df.columns[24:25].append(df.columns[50:51])))))
rnd1_df = df[rel_cols_21]

cl = {'Region_higher':'Region', 'YEAR_higher':'Year'}
# Rename on the selection and assign back, rather than renaming a slice in place.
rnd1_info = df[['TEAM_higher','SEED_higher', 'TEAM_lower', 'SEED_lower', 'Region_higher', 'YEAR_higher']].rename(columns=cl)

I didn't have a function to get first-round matchups because I took the earlier ones from box scores, and the 2021 games haven't been played yet, so there are no box scores. The cells above build those matchups instead.
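
In the first round every pairing adds up to 17 (1 vs 16, 2 vs 15, and so on), so the 32 `.loc` assignments can be expressed as one rule. A minimal sketch for a single region, assuming a table with `Seed` and `TEAM` columns; `first_round_pairs` and the demo team names are hypothetical.

```python
import pandas as pd


def first_round_pairs(region_df):
    """Pair seeds 1-8 with seeds 16-9 inside one region table."""
    higher = region_df[region_df['Seed'] <= 8].copy()
    lower = region_df[region_df['Seed'] > 8]
    higher['Plays_Next'] = 17 - higher['Seed']   # 1 plays 16, 2 plays 15, ...
    pairs = higher.merge(lower, how='left', left_on='Plays_Next',
                         right_on='Seed', suffixes=('_higher', '_lower'))
    return pairs[['Seed_higher', 'TEAM_higher', 'Seed_lower', 'TEAM_lower']]


# A made-up region with placeholder team names.
demo_region = pd.DataFrame({
    'Seed': list(range(1, 17)),
    'TEAM': [f'Team {s}' for s in range(1, 17)],
})
pairs = first_round_pairs(demo_region)
print(pairs)
```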

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

X_test = rnd1_df

y_prob = model.predict_proba(X_test)
y_prob = y_prob[:,1]

output_all = pd.DataFrame()

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win_Prob':y_prob})
output = output.merge(rnd1_info, how='left', left_on='ID', right_on=rnd1_info.index)
output_all = pd.concat([output_all, output])

low_win = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win_Prob'] > 0.45:
        low = 1
    else:
        low = 0
        
    low_win.append(low)
    
output_all['Low_Seed_Win'] = low_win

winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])

output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

I switched the threshold for the first three rounds from 0.5 to 0.45. This resulted in the model predicting 5 upsets, which is still probably too low. However, there are a few more teams with a win probability over 40% that I will definitely keep my eye on (St. Bonaventure, Utah State).

This model also predicts that no 12 or 11 seeds will come even that close to winning, although at least one of each (and often several) wins in most years. I think it does appropriately rank the likelihood of each 12 winning. Georgetown has also been hot recently and is a very popular pick to upset Colorado. I think Michigan State is more likely to pull off an upset than Utah State, primarily because of coaching, but the model could be right about that one.

Again, the point of this model isn't to fill out a bracket for you; it's meant to be a guide as you fill it out yourself.
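
The prediction cell is repeated almost unchanged for every round, with only the input frames and the threshold differing. One way to sketch it as a single function is shown below. `project_round` is a hypothetical helper; it assumes `X` and `info` are row-aligned and that column 1 of `predict_proba` is the "lower seed wins" class.

```python
import numpy as np
import pandas as pd


def project_round(model, X, info, threshold=0.5):
    """Score one round and pick a projected winner for each matchup.

    model: any fitted classifier exposing predict_proba.
    X: feature rows for the matchups; info: matching rows with
    TEAM_higher, SEED_higher, TEAM_lower and SEED_lower columns.
    """
    prob = model.predict_proba(X)[:, 1]            # P(lower seed wins), by assumption
    out = info.reset_index(drop=True).copy()
    out['Low_Seed_Win_Prob'] = prob
    out['Low_Seed_Win'] = (prob > threshold).astype(int)
    low_wins = out['Low_Seed_Win'] == 1
    out['Proj_Winner'] = np.where(low_wins, out['TEAM_lower'], out['TEAM_higher'])
    out['Proj_Seed'] = np.where(low_wins, out['SEED_lower'], out['SEED_higher'])
    return out
```

With something like this, a round would reduce to `output_all = project_round(model, rnd1_df, rnd1_info, threshold=0.45)`, and the forest would only need to be fit once rather than in every cell.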

In [None]:
rnd2_df, rnd2_info = new_matchups(output_all, 2, tdf21, rel_cols_21)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

X_test = rnd2_df

y_prob = model.predict_proba(X_test)
y_prob = y_prob[:,1]
    
output_all = pd.DataFrame()

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win_Prob':y_prob})
output = output.merge(rnd2_info, how='left', left_on='ID', right_on=rnd2_info.index)
output_all = pd.concat([output_all, output])

low_win = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win_Prob'] > 0.45:
        low = 1
    else:
        low = 0
        
    low_win.append(low)
    
output_all['Low_Seed_Win'] = low_win

winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])

output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

The model also predicts 5 upsets in the second round, with three 6 seeds beating 3s. This could just be an indicator that the 6 seeds are very strong this year.
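
A quick way to check where the called upsets cluster is to count them by seed pairing. This is a sketch on made-up rows with the same column names as `output_all`; `upsets_by_matchup` is a hypothetical helper.

```python
import pandas as pd


def upsets_by_matchup(output):
    """Count games where the lower seed is projected to win, per seed pairing."""
    called = output[output['Low_Seed_Win'] == 1]
    return (called.groupby(['SEED_higher', 'SEED_lower'])
                  .size()
                  .rename('Upsets')
                  .reset_index())


# Made-up round results, for illustration only.
demo = pd.DataFrame({
    'SEED_higher': [3, 3, 3, 1, 2],
    'SEED_lower': [6, 6, 6, 8, 7],
    'Low_Seed_Win': [1, 1, 1, 0, 1],
})
summary = upsets_by_matchup(demo)
print(summary)
```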

In [None]:
rnd3_df, rnd3_info = new_matchups(output_all, 3, tdf21, rel_cols_21)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

X_test = rnd3_df

y_prob = model.predict_proba(X_test)
y_prob = y_prob[:,1]
 
output_all = pd.DataFrame()

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win_Prob':y_prob})
output = output.merge(rnd3_info, how='left', left_on='ID', right_on=rnd3_info.index)
output_all = pd.concat([output_all, output])

low_win = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win_Prob'] > 0.45:
        low = 1
    else:
        low = 0
        
    low_win.append(low)
    
output_all['Low_Seed_Win'] = low_win

winning_team = []
winning_seed = []
for index, item in output_all.iterrows():
    if item['Low_Seed_Win'] == 1:
        winning_team.append(item['TEAM_lower'])
        winning_seed.append(item['SEED_lower'])
    else:
        winning_team.append(item['TEAM_higher'])
        winning_seed.append(item['SEED_higher'])

output_all['Proj_Winner'] = winning_team
output_all['Proj_Seed'] = winning_seed

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

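The model again predicts an Elite Eight made up entirely of 1 and 2 seeds. This is a very unlikely result, and I almost certainly won't pick it in my own bracket. However, it could be a side effect of so many 6-over-3 upsets in the previous round, which leave the 2 seeds facing lower-seeded opponents.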
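
Every round repeats the same three steps: score the matchups, apply a threshold, and record the projected winner. A minimal sketch of folding those steps into one helper is below. The function name `project_round` and the toy teams are hypothetical, and the sketch assumes the probabilities line up row by row with the matchup info (the cells above join them on the index instead).

```python
import numpy as np
import pandas as pd


def project_round(y_prob, info, threshold=0.5):
    """Attach upset probabilities to matchup info and pick a winner per game.

    y_prob    : 1-D sequence of P(lower seed wins), one value per row of `info`
    info      : DataFrame with TEAM_lower, SEED_lower, TEAM_higher, SEED_higher
    threshold : probability above which the lower seed is projected to win
    """
    out = info.reset_index(drop=True).copy()
    out['Low_Seed_Win_Prob'] = np.asarray(y_prob)
    out['Low_Seed_Win'] = (out['Low_Seed_Win_Prob'] > threshold).astype(int)
    upset = out['Low_Seed_Win'] == 1
    out['Proj_Winner'] = np.where(upset, out['TEAM_lower'], out['TEAM_higher'])
    out['Proj_Seed'] = np.where(upset, out['SEED_lower'], out['SEED_higher'])
    return out.sort_values('Low_Seed_Win_Prob', ascending=False)


# Hypothetical matchups and probabilities, for illustration only
toy_info = pd.DataFrame({
    'TEAM_lower': ['Team A', 'Team B'],
    'SEED_lower': [6, 7],
    'TEAM_higher': ['Team C', 'Team D'],
    'SEED_higher': [3, 2],
})
toy_result = project_round([0.52, 0.30], toy_info, threshold=0.45)
print(toy_result[['Proj_Winner', 'Proj_Seed', 'Low_Seed_Win_Prob']])
```

With a helper like this, each round would reduce to one call with that round's threshold (0.45 in the early rounds, 0.5 from the fourth round on), passing `model.predict_proba(X_test)[:, 1]` as `y_prob`.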

In [None]:
rnd4_df, rnd4_info = new_matchups(output_all, 4, tdf21, rel_cols_21)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

X_test = rnd4_df

y_prob = model.predict_proba(X_test)
y_prob = y_prob[:,1]

output_all = pd.DataFrame()

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win_Prob':y_prob})
output = output.merge(rnd4_info, how='left', left_on='ID', right_on=rnd4_info.index)
output_all = pd.concat([output_all, output])

# Flag a projected lower-seed win when the probability clears the 0.5 threshold
output_all['Low_Seed_Win'] = (output_all['Low_Seed_Win_Prob'] > 0.5).astype(int)

# Pick the projected winner (and its seed) for each matchup
upset = output_all['Low_Seed_Win'] == 1
output_all['Proj_Winner'] = np.where(upset, output_all['TEAM_lower'], output_all['TEAM_higher'])
output_all['Proj_Seed'] = np.where(upset, output_all['SEED_lower'], output_all['SEED_higher'])

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

This round is a bit different: the model predicts a Final Four of two 1 seeds and two 2 seeds, which has also happened only once. However, there have been 10 Final Fours with two 1s and a 2, or two 2s and a 1, so this is definitely a high-percentage Final Four pick.
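
A quick way to check the seed makeup of any projected round is to tally the `Proj_Seed` column. The frame below is a hypothetical stand-in for `output_all`, with placeholder team names, so treat it as a sketch rather than the actual projection.

```python
import pandas as pd

# Hypothetical stand-in for output_all after the Elite Eight, for illustration only
toy_round = pd.DataFrame({
    'Proj_Winner': ['Team A', 'Team B', 'Team C', 'Team D'],
    'Proj_Seed': [1, 2, 1, 2],
})

# Count how many projected winners carry each seed
seed_makeup = toy_round['Proj_Seed'].value_counts().sort_index()
print(seed_makeup)
```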

In [None]:
rnd5_df, rnd5_info = new_matchups(output_all, 5, tdf21, rel_cols_21)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

X_test = rnd5_df

y_prob = model.predict_proba(X_test)
y_prob = y_prob[:,1]
    
output_all = pd.DataFrame()

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win_Prob':y_prob})
output = output.merge(rnd5_info, how='left', left_on='ID', right_on=rnd5_info.index)
output_all = pd.concat([output_all, output])

# Flag a projected lower-seed win when the probability clears the 0.5 threshold
output_all['Low_Seed_Win'] = (output_all['Low_Seed_Win_Prob'] > 0.5).astype(int)

# Pick the projected winner (and its seed) for each matchup
upset = output_all['Low_Seed_Win'] == 1
output_all['Proj_Winner'] = np.where(upset, output_all['TEAM_lower'], output_all['TEAM_higher'])
output_all['Proj_Seed'] = np.where(upset, output_all['SEED_lower'], output_all['SEED_higher'])

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

In [None]:
rnd6_df, rnd6_info = new_matchups(output_all, 6, tdf21, rel_cols_21)

In [None]:
rf = RandomForestClassifier(random_state=0)

X_train= pre_2019[blnd_rel_cols]
y_train = pre_2019['Low_Seed_Win']

model = rf.fit(X_train, y_train)

X_test = rnd6_df

y_prob = model.predict_proba(X_test)
y_prob = y_prob[:,1]
    
output_all = pd.DataFrame()

output = pd.DataFrame({'ID':X_test.index, 'Low_Seed_Win_Prob':y_prob})
output = output.merge(rnd6_info, how='left', left_on='ID', right_on=rnd6_info.index)
output_all = pd.concat([output_all, output])

# Flag a projected lower-seed win when the probability clears the 0.5 threshold
output_all['Low_Seed_Win'] = (output_all['Low_Seed_Win_Prob'] > 0.5).astype(int)

# Pick the projected winner (and its seed) for each matchup
upset = output_all['Low_Seed_Win'] == 1
output_all['Proj_Winner'] = np.where(upset, output_all['TEAM_lower'], output_all['TEAM_higher'])
output_all['Proj_Seed'] = np.where(upset, output_all['SEED_lower'], output_all['SEED_higher'])

output_all.sort_values('Low_Seed_Win_Prob', ascending=False)

Ultimately, Houston is the projected winner of March Madness 2021 over Michigan. This isn't the most common pick this year (that would be either Gonzaga or Baylor), but it is certainly a possible one. I'm not sure this is the exact finals matchup I will pick in my bracket, but I will probably pick something close to this model's Final Four.

Anyway, here's my final model. Any feedback would be greatly appreciated. Good luck filling out your brackets!