
With this year's March Madness cancelled due to the coronavirus, I thought it would be fun to try to simulate it!

First, let's import some packages to get the ball rolling.

In [0]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re

First, I'm going to pull in each team's efficiency metrics from Kenpom.com. This is a great website that provides advanced analytics on men's college basketball.

In [0]:
base_url = "https://kenpom.com/index.php"

def url_year(year):
    # Build the ratings URL for one season, e.g. https://kenpom.com/index.php?y=2020
    return '%s?y=%s' % (base_url, year)

years = range(2015, 2020)

To use the metrics from Kenpom, I needed to scrape them from the website. To do this, I used code written by Kaggle user WalterHan, who scraped similar data from Kenpom.com for a Kaggle exercise in 2016.

In [0]:
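from io import StringIO

def import_raw_year(year):
    """
    Imports raw data from a KenPom year into a dataframe
    """
    f = requests.get(url_year(year))
    soup = BeautifulSoup(f.text, features='lxml')
    table_html = soup.find_all('table', {'id': 'ratings-table'})

    # Collect every <thead> block in the ratings table so it can be removed
    thead = table_html[0].find_all('thead')

    # Strip the header blocks from the table's HTML so only data rows remain
    table = table_html[0]
    for x in thead:
        table = str(table).replace(str(x), '')

    # Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal string
    df = pd.read_html(StringIO(str(table)))[0]
    df['year'] = year
    return df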

I also manually typed the latest projected bracket from ESPN's Joe Lunardi into an Excel spreadsheet. I then joined each team's efficiency ratings to the bracket to facilitate the simulation.

In [0]:
df20 = import_raw_year(2020)
df20.columns = ['Rank', 'Team', 'Conference', 'W-L', 'AdjEM', 
             'AdjustO', 'AdjustO Rank', 'AdjustD', 'AdjustD Rank',
             'AdjustT', 'AdjustT Rank', 'Luck', 'Luck Rank', 
             'SOS Pyth', 'SOS Pyth Rank', 'SOS OppO', 'SOS OppO Rank',
             'SOS OppD', 'SOS OppD Rank', 'NCSOS Pyth', 'NCSOS Pyth Rank', 'Year']

df_bracket = pd.read_csv("https://raw.githubusercontent.com/zjserapin/MarchMadness2020/master/Bracketology.csv")
df_bracket = pd.merge(df_bracket, df20, on="Team", how="left")
df_bracket['AdjEM'] = df_bracket['AdjEM'].astype('float64')
df_bracket.head()

Unnamed: 0,Team,Region,Seed,Rank,Conference,W-L,AdjEM,AdjustO,AdjustO Rank,AdjustD,AdjustD Rank,AdjustT,AdjustT Rank,Luck,Luck Rank,SOS Pyth,SOS Pyth Rank,SOS OppO,SOS OppO Rank,SOS OppD,SOS OppD Rank,NCSOS Pyth,NCSOS Pyth Rank,Year
0,Kansas,Midwest,1,1,B12,28-3,30.23,115.8,8,85.5,2,67.3,233,0.04,79,12.66,2,107.4,26,94.7,1,9.58,10,2020
1,Siena,Midwest,16,145,MAAC,20-10,2.06,107.2,87,105.1,236,66.7,261,-0.001,189,-5.42,277,98.3,329,103.7,205,0.29,171,2020
2,Houston,Midwest,8,14,Amer,23-8,20.39,112.7,22,92.3,21,65.7,300,-0.052,308,6.33,68,104.2,99,97.9,38,2.56,88,2020
3,Marquette,Midwest,9,31,BE,18-12,17.19,114.0,14,96.9,73,70.7,60,-0.046,293,10.78,20,107.6,21,96.9,18,2.13,100,2020
4,Auburn,Midwest,5,33,SEC,25-6,15.91,111.4,33,95.5,55,69.2,127,0.103,7,6.7,62,107.0,32,100.3,90,1.38,133,2020
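
Because the bracket was typed by hand, a team name that does not exactly match KenPom's spelling would come out of the left join with missing ratings. Here is a minimal sketch of a check for that, using two small made-up frames (`toy_bracket` and `toy_ratings` are illustrative stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for df_bracket and df20; names and values are illustrative only
toy_bracket = pd.DataFrame({
    "Team": ["Kansas", "Houston", "Hoston"],  # last name is deliberately misspelled
    "Seed": [1, 8, 9],
})
toy_ratings = pd.DataFrame({
    "Team": ["Kansas", "Houston"],
    "AdjEM": [30.23, 20.39],
})

# indicator=True adds a _merge column that marks rows found only in the left frame
merged = pd.merge(toy_bracket, toy_ratings, on="Team", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Team"].tolist()
print(unmatched)
```

Any name that shows up in `unmatched` has no rating attached and would need its spelling fixed before simulating.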


In [0]:
from random import random
import numpy as np
from itertools import compress
from numpy.random import multinomial

Next we move into the fun part! I am going to use a Pythagorean-style ratio to generate win probabilities for the two teams in a given matchup. For example, if Team 1 has a rating of 60 and Team 2 has a rating of 40, the win probability for Team 1 is 60%, since 60 / (60 + 40) = 0.6. To supplement this, I will use a function that makes a random choice weighted by the two win probabilities, in an attempt to replicate some of the "unpredictable" events that take place in every tournament.

In [0]:
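def weightedChoice(weights, objects):
    # Draw one multinomial sample (a one-hot vector) and return the matching object
    return next(compress(objects, multinomial(1, weights, 1)[0]))

def predictor(team1, team2):
    # Each argument is a one-row DataFrame; .sum() collapses AdjEM to a scalar
    rating1 = team1['AdjEM'].sum()
    rating2 = team2['AdjEM'].sum()
    # Each team's win probability is its share of the combined rating
    prob1 = rating1 / (rating1 + rating2)
    prob2 = rating2 / (rating1 + rating2)
    return weightedChoice([prob1, prob2], [team1, team2])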
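
To see what this rule does in practice, here is a small standalone sketch that uses only the standard library. It swaps `numpy.random.multinomial` for `random.choices` (same idea, different sampler), and `win_probability` and `pick_winner` are illustrative names that are not part of the notebook.

```python
import random

def win_probability(rating1, rating2):
    """Share of the combined rating held by team 1 (assumes both ratings are positive)."""
    return rating1 / (rating1 + rating2)

def pick_winner(name1, rating1, name2, rating2, rng=random):
    """Pick one winner at random, weighted by each team's win probability."""
    p1 = win_probability(rating1, rating2)
    return rng.choices([name1, name2], weights=[p1, 1 - p1], k=1)[0]

# The example from the text: ratings of 60 and 40
print(win_probability(60, 40))

# Over many simulated games the observed win rate should sit close to that probability
rng = random.Random(0)
games = 10_000
wins = sum(pick_winner("Team 1", 60, "Team 2", 40, rng) == "Team 1" for _ in range(games))
print(wins / games)
```

A single game can go either way, which is the point: the favorite wins most of the time, but upsets still happen at the rate the ratings imply.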

A few teams had an efficiency score below 0. This would cause problems when generating the win probabilities, so I changed those scores to 0.1 to give the teams at least a chance in a given matchup.

In [0]:
df_bracket.loc[(df_bracket['AdjEM'] < 0), 'AdjEM'] = 0.1
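
To put that floor in perspective, the short sketch below recomputes a matchup probability after clamping. The 30.23 rating is Kansas's AdjEM from the table above; the -5.0 rating and the helper name `floor_rating` are made up for illustration.

```python
def floor_rating(adj_em, floor=0.1):
    """Replace a negative efficiency margin with a small positive floor."""
    return floor if adj_em < 0 else adj_em

underdog = floor_rating(-5.0)   # hypothetical negative AdjEM, becomes 0.1
favorite = floor_rating(30.23)  # Kansas's AdjEM, left unchanged

# The underdog's share of the combined rating is its win probability
p_underdog = underdog / (underdog + favorite)
print(round(p_underdog, 4))
```

Under this rule a floored team beats a top seed roughly once in 300 tries, rather than never.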

Then I divided the original dataframe into the different regions to help me better keep track of matchups and victors.

In [0]:
south = df_bracket[(df_bracket['Region']=='South')]
east = df_bracket[(df_bracket['Region']=='East')]
midwest = df_bracket[(df_bracket['Region']=='Midwest')]
west = df_bracket[(df_bracket['Region']=='West')]

I stored the winner of each game as a variable and carried it forward through the entire tournament. I'm sure there is a more efficient way to do it, but this code runs quickly enough.

In [0]:
########### PLAY IN GAMES #####################

play1 = predictor(east.loc[east['Team'] == "Boston Univeristy"], 
                            east.loc[east['Team'] == "Robert Morris"]) 
play2 = predictor(east.loc[east['Team'] == "Texas"], 
                            east.loc[east['Team'] == "Richmond"]) 
play3 = predictor(east.loc[east['Team'] == "UCLA"], 
                            east.loc[east['Team'] == "N.C. State"]) 
play4 = predictor(west.loc[west['Team'] == "Prairie View A&M"], 
                            west.loc[west['Team'] == "North Carolina Central"]) 

############### First Round Games

south_1v16 = predictor(south.loc[south['Seed'] == 1], south.loc[south['Seed'] == 16])                                    
south_8v9 = predictor(south.loc[south['Seed'] == 8], south.loc[south['Seed'] == 9])
south_4v13 = predictor(south.loc[south['Seed'] == 4], south.loc[south['Seed'] == 13])
south_5v12 = predictor(south.loc[south['Seed'] == 5], south.loc[south['Seed'] == 12])
south_2v15 = predictor(south.loc[south['Seed'] == 2], south.loc[south['Seed'] == 15])
south_7v10 = predictor(south.loc[south['Seed'] == 7], south.loc[south['Seed'] == 10])
south_3v14 = predictor(south.loc[south['Seed'] == 3], south.loc[south['Seed'] == 14])
south_6v11 = predictor(south.loc[south['Seed'] == 6], south.loc[south['Seed'] == 11])

west_1v16 = predictor(west.loc[west['Seed'] == 1], play4)                                    
west_8v9 = predictor(west.loc[west['Seed'] == 8], west.loc[west['Seed'] == 9])
west_4v13 = predictor(west.loc[west['Seed'] == 4], west.loc[west['Seed'] == 13])
west_5v12 = predictor(west.loc[west['Seed'] == 5], west.loc[west['Seed'] == 12])
west_2v15 = predictor(west.loc[west['Seed'] == 2], west.loc[west['Seed'] == 15])
west_7v10 = predictor(west.loc[west['Seed'] == 7], west.loc[west['Seed'] == 10])
west_3v14 = predictor(west.loc[west['Seed'] == 3], west.loc[west['Seed'] == 14])
west_6v11 = predictor(west.loc[west['Seed'] == 6], west.loc[west['Seed'] == 11])

midwest_1v16 = predictor(midwest.loc[midwest['Seed'] == 1], midwest.loc[midwest['Seed'] == 16])                                    
midwest_8v9 = predictor(midwest.loc[midwest['Seed'] == 8], midwest.loc[midwest['Seed'] == 9])
midwest_4v13 = predictor(midwest.loc[midwest['Seed'] == 4], midwest.loc[midwest['Seed'] == 13])
midwest_5v12 = predictor(midwest.loc[midwest['Seed'] == 5], midwest.loc[midwest['Seed'] == 12])
midwest_2v15 = predictor(midwest.loc[midwest['Seed'] == 2], midwest.loc[midwest['Seed'] == 15])
midwest_7v10 = predictor(midwest.loc[midwest['Seed'] == 7], midwest.loc[midwest['Seed'] == 10])
midwest_3v14 = predictor(midwest.loc[midwest['Seed'] == 3], midwest.loc[midwest['Seed'] == 14])
midwest_6v11 = predictor(midwest.loc[midwest['Seed'] == 6], midwest.loc[midwest['Seed'] == 11])

east_1v16 = predictor(east.loc[east['Seed'] == 1], play1)                                    
east_8v9 = predictor(east.loc[east['Seed'] == 8], east.loc[east['Seed'] == 9])
east_4v13 = predictor(east.loc[east['Seed'] == 4], east.loc[east['Seed'] == 13])
east_5v12 = predictor(east.loc[east['Seed'] == 5], play2)
east_2v15 = predictor(east.loc[east['Seed'] == 2], east.loc[east['Seed'] == 15])
east_7v10 = predictor(east.loc[east['Seed'] == 7], east.loc[east['Seed'] == 10])
east_3v14 = predictor(east.loc[east['Seed'] == 3], east.loc[east['Seed'] == 14])
east_6v11 = predictor(east.loc[east['Seed'] == 6], play3)


############### Second Round Games

south_s161 = predictor(south_1v16, south_8v9)
south_s162 = predictor(south_4v13, south_5v12)
south_s163 = predictor(south_2v15, south_7v10)
south_s164 = predictor(south_3v14, south_6v11)

west_s161 = predictor(west_1v16, west_8v9)
west_s162 = predictor(west_4v13, west_5v12)
west_s163 = predictor(west_2v15, west_7v10)
west_s164 = predictor(west_3v14, west_6v11)

midwest_s161 = predictor(midwest_1v16, midwest_8v9)
midwest_s162 = predictor(midwest_4v13, midwest_5v12)
midwest_s163 = predictor(midwest_2v15, midwest_7v10)
midwest_s164 = predictor(midwest_3v14, midwest_6v11)

east_s161 = predictor(east_1v16, east_8v9)
east_s162 = predictor(east_4v13, east_5v12)
east_s163 = predictor(east_2v15, east_7v10)
east_s164 = predictor(east_3v14, east_6v11)

################ Sweet 16

south_e81 = predictor(south_s161, south_s162)
south_e82 = predictor(south_s163, south_s164)

west_e81 = predictor(west_s161, west_s162)
west_e82 = predictor(west_s163, west_s164)

midwest_e81 = predictor(midwest_s161, midwest_s162)
midwest_e82 = predictor(midwest_s163, midwest_s164)

east_e81 = predictor(east_s161, east_s162)
east_e82 = predictor(east_s163, east_s164)

################ Elite 8

south_winner = predictor(south_e81, south_e82)
west_winner = predictor(west_e81, west_e82)
midwest_winner = predictor(midwest_e81, midwest_e82)
east_winner = predictor(east_e81, east_e82)

############### Final Four
# Each winner is a one-row DataFrame; .iloc[0] pulls out the team name as a plain string
print('------------------------------------')
print("The Final 4 Teams are:", south_winner['Team'].iloc[0], west_winner['Team'].iloc[0], midwest_winner['Team'].iloc[0], east_winner['Team'].iloc[0])

final1 = predictor(midwest_winner, east_winner)
final2 = predictor(west_winner, south_winner)
print('------------------------------------')
print('The championship game is between', final1['Team'].iloc[0], "and", final2['Team'].iloc[0])
print('------------------------------------')
champion = predictor(final1, final2)
print('Congratulations to', champion['Team'].iloc[0], "They are the 2020 NCAA Champions")

------------------------------------
The Final 4 Teams are: Illinois Gonzaga Houston West Virginia
------------------------------------
The championship game is between West Virginia and Gonzaga
------------------------------------
Congratulations to Gonzaga They are the 2020 NCAA Champions
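
The round-by-round bookkeeping above can also be written as a loop that pairs off adjacent winners until one team is left. This is only a sketch under stated assumptions: teams are plain `(name, rating)` tuples rather than DataFrame rows, the list is already in bracket order with a power-of-two length, and `play_game` and `play_bracket` are hypothetical helpers. The four ratings are AdjEM values from the table above, but the pairing itself is made up.

```python
import random

def play_game(team1, team2, rng):
    """Return one of two (name, rating) tuples, weighted by rating share."""
    return rng.choices([team1, team2], weights=[team1[1], team2[1]], k=1)[0]

def play_bracket(teams, rng):
    """Play successive rounds on a bracket-ordered list until one team remains."""
    rounds = [teams]
    while len(teams) > 1:
        # Adjacent entries meet each other: (0 vs 1), (2 vs 3), ...
        teams = [play_game(teams[i], teams[i + 1], rng) for i in range(0, len(teams), 2)]
        rounds.append(teams)
    return rounds

rng = random.Random(2020)
field = [("Kansas", 30.23), ("Houston", 20.39), ("Marquette", 17.19), ("Auburn", 15.91)]
for survivors in play_bracket(field, rng):
    print([name for name, _ in survivors])
```

Each printed list is one round's survivors, so the last line holds the champion of this toy bracket.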


And we finally have our Final Four and projected champion. Next steps would be to run a Monte Carlo simulation, add some fun visualizations, and eventually deploy it as an application.
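
As a rough sketch of the Monte Carlo idea, the snippet below replays a toy four-team bracket many times and counts how often each team wins it all. It uses only the standard library and the same rating-share rule as `predictor`; the function names, the four-team field, and the number of runs are assumptions made for illustration, not the notebook's actual bracket.

```python
import random
from collections import Counter

def play_game(team1, team2, rng):
    """Return one of two (name, rating) tuples, weighted by rating share."""
    return rng.choices([team1, team2], weights=[team1[1], team2[1]], k=1)[0]

def simulate_champion(field, rng):
    """Play one full bracket (bracket-ordered, power-of-two size) and return the champion's name."""
    teams = list(field)
    while len(teams) > 1:
        teams = [play_game(teams[i], teams[i + 1], rng) for i in range(0, len(teams), 2)]
    return teams[0][0]

def title_odds(field, runs, seed=0):
    """Estimate each team's title probability as its share of simulated championships."""
    rng = random.Random(seed)
    counts = Counter(simulate_champion(field, rng) for _ in range(runs))
    return {name: counts[name] / runs for name, _ in field}

# AdjEM values from the table above; the pairing is hypothetical
field = [("Kansas", 30.23), ("Houston", 20.39), ("Marquette", 17.19), ("Auburn", 15.91)]
odds = title_odds(field, runs=5000)
for name, share in sorted(odds.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

Running many brackets turns a single random champion into a set of estimated title odds, which would be the natural input for the visualizations.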