# Men's NCAA Basketball Tournament Predictions

## Download data from Kaggle

I used the Kaggle API to download the data. This requires an account in order to obtain an API key and to accept the terms and conditions of the [Google Cloud & Men's 2019 NCAA Tournament ML Competition](https://www.kaggle.com/c/mens-machine-learning-competition-2019/).

In [150]:
%%bash
kaggle competitions download -c mens-machine-learning-competition-2019

SampleSubmissionStage1.csv: Skipping, found more recently modified local copy (use --force to force download)
Downloading MasseyOrdinals.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2010.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2011.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2012.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2013.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2014.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2015.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2016.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2017.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading PlayByPlay_2018.zip to /Users/tljohn/data_science/kaggle/ncaa-mens-2019

Downloading DataFiles.zip to /Users/tljoh

  0%|          | 0.00/13.7M [00:00<?, ?B/s]  7%|▋         | 1.00M/13.7M [00:00<00:08, 1.63MB/s] 15%|█▍        | 2.00M/13.7M [00:00<00:05, 2.09MB/s] 22%|██▏       | 3.00M/13.7M [00:01<00:04, 2.44MB/s] 29%|██▉       | 4.00M/13.7M [00:01<00:03, 3.02MB/s] 36%|███▋      | 5.00M/13.7M [00:01<00:02, 3.80MB/s] 44%|████▎     | 6.00M/13.7M [00:01<00:02, 4.04MB/s] 51%|█████     | 7.00M/13.7M [00:01<00:01, 4.12MB/s] 58%|█████▊    | 8.00M/13.7M [00:01<00:01, 4.49MB/s] 65%|██████▌   | 9.00M/13.7M [00:02<00:01, 4.78MB/s] 73%|███████▎  | 10.0M/13.7M [00:02<00:00, 4.91MB/s] 80%|████████  | 11.0M/13.7M [00:02<00:00, 5.05MB/s] 87%|████████▋ | 12.0M/13.7M [00:02<00:00, 5.40MB/s] 95%|█████████▍| 13.0M/13.7M [00:02<00:00, 5.44MB/s]100%|██████████| 13.7M/13.7M [00:03<00:00, 5.51MB/s]
  0%|          | 0.00/19.9M [00:00<?, ?B/s]  5%|▌         | 1.00M/19.9M [00:00<00:03, 6.03MB/s] 10%|█         | 2.00M/19.9M [00:00<00:03, 6.14MB/s] 15%|█▌        | 3.00M/19.9M [00:00<00:02, 6.34MB/s] 20%|██  

We care mostly about the files in ```DataFiles.zip```, so we will only unzip this directory. All the others contain information about play-by-play events. It would be really cool to incorporate individual player stats into a ML algorithm, but for now, I will only use team stats for each individual game.

In [151]:
import zipfile
zip_ref = zipfile.ZipFile('DataFiles.zip', 'r')
zip_ref.extractall('DataFiles')
zip_ref.close()

I like to use Pandas to work with tabular data. There are two types of game-by-game data for the NCAA: compact results which give the simple box scores for each game (teamIDs, scores, and who was the home team or if the game was played at a neutral site) and detailed results with box scores along with all the statisics like field goal attempts/completions, etc. The former goes back all the way to 1985 and the latter only to 2003. We can use all the box scores dating back from 1985 to establish a long enough baseline to determine accurage Elo scores.

In [152]:
df1 = pd.read_csv('DataFiles/RegularSeasonCompactResults.csv')
df1['Playoff'] = 0
df2 = pd.read_csv('DataFiles/NCAATourneyCompactResults.csv')
df2['Playoff'] = 1
df = pd.concat([df1,df2])
del df1
del df2

In [153]:
df = df.sort_values(by=['Season','DayNum'])

In [154]:
df['WProb_Elo'] = np.nan
df['WElo'] = np.nan
df['LElo'] = np.nan

In [168]:
df.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,Playoff,WProb_Elo,WElo,LElo
0,1985,20,1228,81,1328,64,N,0,0,,,
1,1985,25,1106,77,1354,70,H,0,0,,,
2,1985,25,1112,63,1223,56,H,0,0,,,
3,1985,25,1165,70,1432,54,H,0,0,,,
4,1985,25,1192,86,1447,74,H,0,0,,,


Elo scores were first developed for chess but have been adopted to predict outcomes of other games and sports. I have chosen to use the NBA Elo scores developed by [FiveThirtyEight](https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/).

We can calculate the probability of winning based on the Elo score difference between any two teams.

In [194]:
def elo_prob(eloW,eloL,locW):
    if locW == 'H':
        value = HCA
    elif locW == 'A':
        value = -HCA
    else:
        value = 0
        
    elo_diff = eloW-eloL+value
        
    probW = 1 / (10**(-elo_diff/400) + 1)
    return probW

Elo scores are updated after a game occurs based on how surprising the outcome is. The larger the difference in the scores, the larger the expected point spread is. If the actual point differential is larger than expected, the more the Elo scores change. Elo scores are a zero-sum game, so points are added to the team who beats the point spread while points are taken from the other team.

In [199]:
K=20
HCA=100
R=1/3

In [200]:
def update_elo(eloW,ptsW,eloL,ptsL,locW):
    if locW == 'H':
        value = HCA
    elif locW == 'A':
        value = -HCA
    else:
        value = 0
        
    MOV = ptsW-ptsL
    elo_diff = eloW-eloL+value
    
    mult = (MOV+3)**0.8 / (7.5 + 0.006*elo_diff)
    probW = elo_prob(eloW,eloL,locW)
    
    shift = K*mult*(1-probW)
    
    return eloW + shift, eloL - shift

At the end of the season, the Elo scores revert back toward the mean of 1500.

In [137]:
def season_revert(team_elo):
    for team,elo in team_elo.items():
        team_elo[team] = 1505*R + elo*(1-R)
    return team_elo

Not all the teams have been Division 1 since 1985, so we will have to add in the new teams at the beginning of the season. For that, we will need the data file indicating their first D1 season. Every team's first D1 season (starting with 1985) will have an Elo score of 1300. The next block of code propagates the Elo scores in chronological order.

In [201]:
%time
team_elo = {t:1300 for t in df['WTeamID'].unique()}
season = 1985
for i,game in df.iterrows():
    
    if game['Season'] > season:
        team_elo = season_revert(team_elo)
        season = game['Season']
    
    teamW = game['WTeamID']
    ptsW = game['WScore']
    eloW = team_elo[teamW]
    
    teamL = game['LTeamID']
    ptsL = game['LScore']
    eloL = team_elo[teamL]
    
    locW = game['WLoc']

    df.at[i,'WElo'] = eloW
    df.at[i,'LElo'] = eloL

    df.at[i,'WProb_Elo'] = elo_prob(eloW,eloL,locW)
    
    eloW,eloL = update_elo(eloW,ptsW,eloL,ptsL,locW)
    team_elo[teamW] = eloW
    team_elo[teamL] = eloL

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.25 µs


In [203]:
df.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,Playoff,WProb_Elo,WElo,LElo
2179,2018,146,1242,85,1181,81,N,1,1,0.457003,1785.669008,1815.620234
2180,2018,146,1437,71,1403,59,N,0,1,0.710045,1875.016362,1719.434648
2181,2018,152,1276,69,1260,57,N,0,1,0.611211,1799.783746,1721.192994
2182,2018,152,1437,95,1242,79,N,0,1,0.624422,1881.017408,1792.705842
2183,2018,154,1437,79,1276,62,N,0,1,0.616659,1890.880748,1808.29659


How well does Elo do on its own? One way to check would be to see how many times the Elo scores accurately predict a single game. We can say that if the probability that the winning team would win given the Elo scores of both teams is greater than 50%, then we would have been correct. Here's how we compute the percentage of games we got correct.

In [207]:
(df['WProb_Elo'] > 0.5).sum()/len(df)

0.7301687590429196

But what if we are really off in our prediction? Obviously, if we predicted 10% for the winning team to win, that would be bad. We then could add up all the probabilities and see how close we get to the total number of games.

In [209]:
df['WProb_Elo'].sum()/len(df)

0.6117655949079803

## Game statistics

In [359]:
data1 = pd.read_csv('DataFiles/RegularSeasonDetailedResults.csv')
data1['Playoff'] = 0
data2 = pd.read_csv('DataFiles/NCAATourneyDetailedResults.csv')
data2['Playoff'] = 1
data = pd.concat([data1,data2])
del data1
del data2
data = data.sort_values(by=['Season','DayNum'])

In [360]:
data.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF,Playoff
0,2003,10,1104,68,1328,62,N,0,27,58,...,16,22,10,22,8,18,9,2,20,0
1,2003,10,1272,70,1393,63,N,0,26,62,...,9,20,20,25,7,12,8,6,16,0
2,2003,11,1266,73,1437,61,N,0,24,58,...,14,23,31,22,9,12,2,5,23,0
3,2003,11,1296,56,1457,50,N,0,18,38,...,8,15,17,20,9,19,4,3,23,0
4,2003,11,1400,77,1208,71,N,0,30,61,...,17,27,21,15,12,10,7,1,14,0


In [361]:
data.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR',
       'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3',
       'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF',
       'Playoff'],
      dtype='object')

Here are the statistics for each game since 2003.
- WFGM - field goals made (by the winning team)
- WFGA - field goals attempted (by the winning team)
- WFGM3 - three pointers made (by the winning team)
- WFGA3 - three pointers attempted (by the winning team)
- WFTM - free throws made (by the winning team)
- WFTA - free throws attempted (by the winning team)
- WOR - offensive rebounds (pulled by the winning team)
- WDR - defensive rebounds (pulled by the winning team)
- WAst - assists (by the winning team)
- WTO - turnovers committed (by the winning team)
- WStl - steals (accomplished by the winning team)
- WBlk - blocks (accomplished by the winning team)
- WPF - personal fouls committed (by the winning team)

And the same for the losing teams with an "L" in front.

Let's append the Elo scores and probabilities to this new table.

In [364]:
df_short = df[df['Season'] >= 2003].reset_index()
data['WProb_Elo'] = df_short['WProb_Elo']
data['WElo'] = df_short['WElo']
data['LElo'] = df_short['LElo']

In [365]:
data.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LDR,LAst,LTO,LStl,LBlk,LPF,Playoff,WProb_Elo,WElo,LElo
0,2003,10,1104,68,1328,62,N,0,27,58,...,22,8,18,9,2,20,0,0.383434,1622.525383,1705.041109
1,2003,10,1272,70,1393,63,N,0,26,62,...,25,7,12,8,6,16,0,0.530485,1594.201242,1572.991687
2,2003,11,1266,73,1437,61,N,0,24,58,...,22,9,12,2,5,23,0,0.566627,1623.303453,1576.729194
3,2003,11,1296,56,1457,50,N,0,18,38,...,20,9,19,4,3,23,0,0.406155,1446.022915,1512.015483
4,2003,11,1400,77,1208,71,N,0,30,61,...,15,12,10,7,1,14,0,0.522671,1621.719614,1605.955571


Right now, the Elo scores are a good measure of the historical performance of a basketball program; however, game stats will help us to determine how a team has been performing as of late. We'll do this by taking a rolling average of some of the statistics over previous games. We have to be careful to not include the current game stats in this average because it will help to reveal the outcome of the game, which we are trying to predict.

In [366]:
to_roll = ['FGM','FGA','FGM3','FGA3']
averages = dict()
for ID in data['WTeamID'].unique():
    averages[ID] = dict()
    team = data[(data['WTeamID'] == ID) | (data['LTeamID'] == ID)]
    for r in to_roll:
        averages[ID][r] = team.apply(lambda x: x['W%s' % r] if x['WTeamID'] == 1104 else x['L%s' % r],axis=1).\
            rolling(4).mean().shift(1).bfill()

The predictors we use for our comprehensive algorithm will be the ratio of the winning team's statistic to that the losing team.

In [367]:
for r in to_roll:
    win = data.apply(lambda row: averages[row['WTeamID']][r].loc[row.name],axis=1)
    los = data.apply(lambda row: averages[row['LTeamID']][r].loc[row.name],axis=1)
    data[r] = win/los

In [368]:
data.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LBlk,LPF,Playoff,WProb_Elo,WElo,LElo,FGM,FGA,FGM3,FGA3
0,2003,10,1104,68,1328,62,N,0,27,58,...,2,20,0,0.383434,1622.525383,1705.041109,0.959184,1.09314,1.16667,1.29688
1,2003,10,1272,70,1393,63,N,0,26,62,...,6,16,0,0.530485,1594.201242,1572.991687,0.887755,0.905109,0.931034,0.868687
2,2003,11,1266,73,1437,61,N,0,24,58,...,5,23,0,0.566627,1623.303453,1576.729194,1.13924,1.08145,1.26087,1.06173
3,2003,11,1296,56,1457,50,N,0,18,38,...,3,23,0,0.406155,1446.022915,1512.015483,1.01205,1.04902,0.76,0.987013
4,2003,11,1400,77,1208,71,N,0,30,61,...,1,14,0,0.522671,1621.719614,1605.955571,0.761468,0.912,0.666667,0.753425
