# NCAA Tournament Predictions

## Download data from Kaggle

I used the Kaggle API to download the data. This requires an account in order to obtain an API key and to accept the terms and conditions of the [Google Cloud & Men's 2019 NCAA Tournament ML Competition](https://www.kaggle.com/c/mens-machine-learning-competition-2019/).

In [150]:
%%bash
kaggle competitions download -c mens-machine-learning-competition-2019

SampleSubmissionStage1.csv: Skipping, found more recently modified local copy (use --force to force download)
Downloading MasseyOrdinals.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2010.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2011.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2012.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2013.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2014.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2015.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2016.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2017.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2018.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading DataFiles.zip to /Users/tljoh

  0%|          | 0.00/13.7M [00:00<?, ?B/s]  7%|▋         | 1.00M/13.7M [00:00<00:08, 1.63MB/s] 15%|█▍        | 2.00M/13.7M [00:00<00:05, 2.09MB/s] 22%|██▏       | 3.00M/13.7M [00:01<00:04, 2.44MB/s] 29%|██▉       | 4.00M/13.7M [00:01<00:03, 3.02MB/s] 36%|███▋      | 5.00M/13.7M [00:01<00:02, 3.80MB/s] 44%|████▎     | 6.00M/13.7M [00:01<00:02, 4.04MB/s] 51%|█████     | 7.00M/13.7M [00:01<00:01, 4.12MB/s] 58%|█████▊    | 8.00M/13.7M [00:01<00:01, 4.49MB/s] 65%|██████▌   | 9.00M/13.7M [00:02<00:01, 4.78MB/s] 73%|███████▎  | 10.0M/13.7M [00:02<00:00, 4.91MB/s] 80%|████████  | 11.0M/13.7M [00:02<00:00, 5.05MB/s] 87%|████████▋ | 12.0M/13.7M [00:02<00:00, 5.40MB/s] 95%|█████████▍| 13.0M/13.7M [00:02<00:00, 5.44MB/s]100%|██████████| 13.7M/13.7M [00:03<00:00, 5.51MB/s]
  0%|          | 0.00/19.9M [00:00<?, ?B/s]  5%|▌         | 1.00M/19.9M [00:00<00:03, 6.03MB/s] 10%|█         | 2.00M/19.9M [00:00<00:03, 6.14MB/s] 15%|█▌        | 3.00M/19.9M [00:00<00:02, 6.34MB/s] 20%|██  

We care mostly about the files in ```DataFiles.zip```, so we will only unzip this directory. All the others contain information about play-by-play events. It would be really cool to incorporate individual player stats into a ML algorithm, but for now, I will only use team stats for each individual game.

In [151]:
import zipfile
zip_ref = zipfile.ZipFile('DataFiles.zip', 'r')
zip_ref.extractall('DataFiles')
zip_ref.close()

I like to use Pandas to work with tabular data. There are two types of game-by-game data for the NCAA: compact results which give the simple box scores for each game (teamIDs, scores, and who was the home team or if the game was played at a neutral site) and detailed results with box scores along with all the statisics like field goal attempts/completions, etc. The former goes back all the way to 1985 and the latter only to 2003.

In [152]:
df1 = pd.read_csv('DataFiles/RegularSeasonCompactResults.csv')
df1['Playoff'] = 0
df2 = pd.read_csv('DataFiles/NCAATourneyCompactResults.csv')
df2['Playoff'] = 1
df = pd.concat([df1,df2])
del df1
del df2

In [153]:
df = df.sort_values(by=['Season','DayNum'])

In [154]:
df['WProb_Elo'] = np.nan
df['WElo'] = np.nan
df['LElo'] = np.nan

In [84]:
df.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [93]:
def elo_prob(eloW,eloL):
    elo_diff = eloW-eloL
    probW = 1 / (10**(-elo_diff/400) + 1)
    return probW

In [94]:
def update_elo(eloW,ptsW,eloL,ptsL,K=20):
    PD = ptsW-ptsL
    mult = np.log(PD+1) * (2.2/((eloW-eloL)*.001+2.2))
    shift = (K * mult) * (1 - elo_prob(eloW,eloL))
    return eloW + shift, eloL - shift

In [95]:
def home_adv(WLoc,HCA=100):
    if WLoc == 'N':
        return 0
    elif WLoc == 'H':
        return HCA
    else:
        return -HCA

In [137]:
def season_revert(team_elo,R=1/3):
    for team,elo in team_elo.items():
        team_elo[team] = 1505*R + elo*(1-R)
    return team_elo

In [97]:
teams_df = pd.read_csv('DataFiles/teams.csv')

In [98]:
teams_df.head()

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2019
1,1102,Air Force,1985,2019
2,1103,Akron,1985,2019
3,1104,Alabama,1985,2019
4,1105,Alabama A&M,2000,2019


In [139]:
%time
team_elo = {t:1500 for t in teams_df[teams_df['FirstD1Season'] == 1985]['TeamID']}
season = 1985
for i,game in df.iterrows():
    
    if game['Season'] > season:
        update_teams = season_revert(team_elo)
        new_teams = {t:1300 for t in teams_df[teams_df['FirstD1Season'] == game['Season']]['TeamID']}
        team_elo = {**update_teams,**new_teams}
        season = game['Season']
    
    teamW = game['WTeamID']
    ptsW = game['WScore']
    eloW = team_elo[teamW]
    teamL = game['LTeamID']
    ptsL = game['LScore']
    eloL = team_elo[teamL]

    df.at[i,'WElo'] = eloW
    df.at[i,'LElo'] = eloL

    HA = home_adv(game['WLoc'])
    df.at[i,'WProb_Elo'] = elo_prob(eloW+HA,eloL-HA)
    
    eloW,eloL = update_elo(eloW,ptsW,eloL,ptsL)
    team_elo[teamW] = eloW
    team_elo[teamL] = eloL
    

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.05 µs


In [140]:
team_elo

{1102: 1399.7170347400233,
 1103: 1408.1251949629293,
 1104: 1712.35889266977,
 1106: 1193.951183462544,
 1108: 1206.5984999340717,
 1109: 1504.9930030096352,
 1110: 1216.0819506747337,
 1111: 1432.7766612267487,
 1112: 1900.6946327827552,
 1113: 1646.0529221352583,
 1114: 1292.526394075765,
 1116: 1760.7563370237299,
 1117: 1311.5236885693762,
 1119: 1291.2649729622678,
 1120: 1774.5641598723191,
 1121: 1504.9982253263622,
 1122: 1472.0405257328596,
 1123: 1527.852342308862,
 1124: 1715.5510504783142,
 1126: 1385.8083448818359,
 1129: 1692.6045789530208,
 1130: 1634.1265273841602,
 1131: 1412.3983738493287,
 1132: 1416.727331297526,
 1133: 1523.5110797140267,
 1134: 1504.9894317051612,
 1135: 1302.2253980537723,
 1137: 1715.9228350552132,
 1139: 1735.5211394433131,
 1140: 1692.9939520806938,
 1141: 1514.5123143253431,
 1143: 1380.6826147048384,
 1144: 1395.0538470532692,
 1145: 1613.4287549466417,
 1147: 1480.5510902647634,
 1149: 1410.967934945654,
 1150: 1254.9583554254457,
 1151: 1

In [142]:
df_tourney = pd.read_csv('DataFiles/NCAATourneyCompactResults.csv')

In [143]:
df_tourney.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


In [148]:
%ls

[1m[34mDataFiles[m[m/                  SampleSubmissionStage1.csv
[1m[34mMasseyOrdinals[m[m/             elo_scores.ipynb
[1m[34mPlayByPlay[m[m/                 [1m[34mzipfiles[m[m/
