Elo scores were first developed for chess but have been adopted to predict outcomes of other games and sports. I have chosen to use the NBA Elo scores developed by [FiveThirtyEight](https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/).

We can calculate the probability of winning based on the Elo score difference between any two teams.
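As a minimal sketch of that calculation, the logistic formula used in the forecast code further down can be written as a standalone function. The name `elo_win_prob` is illustrative and does not appear in the notebook's own code.

```python
def elo_win_prob(elo_diff):
    """Probability that the first team wins, given its Elo advantage over the second."""
    return 1 / (10 ** (-elo_diff / 400) + 1)


print(elo_win_prob(0))               # evenly matched teams: 0.5
print(round(elo_win_prob(100), 2))   # a 100-point favorite wins about 64% of the time
```

The two teams' probabilities always sum to 1, so only one of them needs to be computed.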

Elo scores are updated after each game based on how surprising the outcome is. The larger the difference between the two teams' scores, the larger the expected point spread. The less expected the win and the larger the margin of victory, the more the Elo scores change. Elo is zero-sum: whatever points the winning team gains are taken from the losing team.
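The update can be sketched as a single function, using the same margin-of-victory multiplier that appears in the forecast code further down. The function name `elo_update` and the argument `hca_adj` are illustrative assumptions, and the ratings in the usage lines are made up.

```python
def elo_update(elo_w, elo_l, mov, K=20, hca_adj=0):
    """Return (new_winner_elo, new_loser_elo) after one game.

    mov is the winner's margin of victory. hca_adj is +HCA if the winner
    played at home, -HCA if it played away, and 0 on a neutral court.
    """
    elo_diff = elo_w - elo_l + hca_adj
    prob_w = 1 / (10 ** (-elo_diff / 400) + 1)
    # Bigger wins move more points; the denominator damps wins by heavy favorites.
    mult = (mov + 3) ** 0.8 / (7.5 + 0.006 * elo_diff)
    shift = K * mult * (1 - prob_w)
    return elo_w + shift, elo_l - shift


# An upset moves more points than the same margin by the favorite.
upset = elo_update(1400, 1600, mov=10)
expected = elo_update(1600, 1400, mov=10)
print(upset[0] - 1400 > expected[0] - 1600)  # True
```

Because the same `shift` is added to one team and subtracted from the other, the sum of the two ratings is unchanged by the game.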

At the end of each season, every team's Elo score reverts partway toward the mean, which is roughly 1500 (the code below uses 1505).

Not all teams have been in Division 1 (D1) since 1985, so new teams have to be added as they join. A data file listing each team's first D1 season would let us do this explicitly. The code below takes a shortcut instead: every team starts at 1300, and the end-of-season reversion is skipped for any team whose score is still exactly 1300, that is, a team that has not played yet. The `forecast` function below then propagates the Elo scores game by game in chronological order.
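One way to avoid needing a first-season file at all is to assign the starting score lazily, the first time a team appears in a game. This is a minimal sketch of that alternative, not what the notebook's code does; `get_elo` and `NEW_TEAM_ELO` are illustrative names and the team IDs are made up.

```python
NEW_TEAM_ELO = 1300  # starting score for a team's first D1 season


def get_elo(team_elo, team_id):
    """Return the team's current Elo, registering it at 1300 on first appearance."""
    return team_elo.setdefault(team_id, NEW_TEAM_ELO)


team_elo = {1101: 1450.0}        # one established team
print(get_elo(team_elo, 1101))   # existing score is returned unchanged
print(get_elo(team_elo, 1102))   # unseen team is added at 1300
```

With this approach, a team that has not yet played is simply absent from the dictionary, so it is never touched by the end-of-season reversion.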

The NBA Elo forecast from FiveThirtyEight used the parameters K=20, HCA=100, and R=1/3. The K factor scales how many Elo points are transferred from the losing team to the winning team. HCA is the home court advantage, which is added to the home team's Elo score when the win probability is computed. R is the fraction by which each Elo score reverts toward the mean between seasons. For example, with a mean of 1500, a team that finished the season at 1800 would start the next one at 1700: 1800 minus 1/3 of the 300-point gap.
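The reversion step is a weighted average of the old score and the mean. The sketch below reproduces the 1800 example; `revert_to_mean` is an illustrative name, and its default mean of 1500 matches the example in the text, whereas the forecast code further down hard-codes 1505.

```python
def revert_to_mean(elo, R, mean=1500):
    """Move a rating a fraction R of the way back toward the mean."""
    return mean * R + elo * (1 - R)


print(round(revert_to_mean(1800, 1/3)))  # 1700
print(round(revert_to_mean(1300, 1/3)))  # 1367: below-average teams move up
```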

We expect the NCAA to have similar values, but to be sure, we'll run the forecast with different parameters.
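Each parameter combination is scored by log loss on the probability that was assigned to the actual winner, which is how the `forecast` function below computes its result. A minimal standalone sketch follows, with `winner_log_loss` as an illustrative name and made-up probabilities.

```python
import numpy as np


def winner_log_loss(prob_winner):
    """Mean negative log of the probability assigned to each game's actual winner."""
    prob_winner = np.asarray(prob_winner, dtype=float)
    return -np.log(prob_winner).mean()


print(round(winner_log_loss([0.5, 0.5]), 3))  # 0.693, the coin-flip baseline (ln 2)
print(winner_log_loss([0.8, 0.7, 0.4]) < winner_log_loss([0.5, 0.5, 0.5]))  # True
```

Lower is better, and any parameter set worth keeping should beat the 0.693 baseline of always predicting 50%.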

In [1]:
import time

import numpy as np
import pandas as pd

# Combine regular-season and tournament games into one table, in chronological order.
df = pd.concat([pd.read_csv('DataFiles/RegularSeasonCompactResults.csv'),
                pd.read_csv('DataFiles/NCAATourneyCompactResults.csv')]).reset_index(drop=True)
df = df.sort_values(by=['Season','DayNum'])

In [6]:
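def forecast(df,K,HCA,R,return_df=False):
    """Run Elo over every game in df (already sorted by date). Returns the log loss
    for the given parameters, or the annotated DataFrame if return_df is True."""
    start = time.time()
    new_df = df.copy()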
    
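    # Start every team at 1300, including any team that only ever appears as a loser.
    team_elo = {t:1300 for t in pd.concat([df['WTeamID'],df['LTeamID']]).unique()}
    season = 1985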
    for i,game in df.iterrows():
        
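        # A new season has started: pull every team that has already played
        # back toward the mean (1505 here) by a fraction R.
        if game['Season'] > season:
            for team,elo in team_elo.items():
                if elo != 1300:
                    team_elo[team] = 1505*R + elo*(1-R)
            season = game['Season']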
        
        locW = game['WLoc']
        if locW == 'H':
            value = HCA
        elif locW == 'A':
            value = -HCA
        else:
            value = 0
        
        teamW = game['WTeamID']
        teamL = game['LTeamID']
        
        eloW = team_elo[teamW]
        eloL = team_elo[teamL]
        
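        # Winner's Elo advantage, including home court, and the implied win probability.
        elo_diff = eloW-eloL+value
        probW = 1 / (10**(-elo_diff/400) + 1)
        
        # Margin-of-victory multiplier: bigger wins move more points,
        # damped when the winner was already a heavy favorite.
        MOV = game['WScore']-game['LScore']
        mult = (MOV+3)**0.8 / (7.5 + 0.006*elo_diff)
        
        # Points transferred from loser to winner; larger when the win was less expected.
        shift = K*mult*(1-probW)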
        
        new_df.at[i,'WElo_before'] = eloW
        new_df.at[i,'LElo_before'] = eloL
        
        new_eloW = eloW + shift
        new_eloL = eloL - shift
        
        new_df.at[i,'WElo_after'] = new_eloW
        new_df.at[i,'LElo_after'] = new_eloL
        
        new_df.at[i,'Elo_Prob'] = probW
        
        team_elo[teamW] = new_eloW
        team_elo[teamL] = new_eloL
        
        
    # Score only games from the 2003 season onward.
    df_cut = new_df[new_df['Season'] >= 2003]
    # Elo_Prob is the probability given to the actual winner, so log loss is -mean(log(p)).
    logloss = -np.log(df_cut['Elo_Prob']).mean()
    
    dt = time.time()-start
        
    if return_df:
        return new_df
    else:
        result = {'K':K,'HCA':HCA,'R':R,'logloss':logloss}
        print('K=%d,HCA=%d,R=1/%d...logloss=%0.3f...Elapsed Time: %0.1f sec' %
              (K,HCA,round(1/R),logloss,dt))
        return result

In [7]:
import time
import multiprocessing as mp

k = [10,20,30,40]
hca=[100,200,300,400]
r=[1/5,1/4,1/3]

start = time.time()
# Evaluate every (K, HCA, R) combination in parallel across 4 worker processes.
with mp.Pool(processes=4) as pool:
    results = [pool.apply_async(forecast, args=(df,K,HCA,R))
               for K in k for HCA in hca for R in r]
    output = [p.get() for p in results]

t = (time.time()-start)/60
print('Elapsed time: %0.1f min' % t)

K=10,HCA=100,R=1/3...logloss=0.565...Elapsed Time: 37.6 sec
K=10,HCA=200,R=1/5...logloss=0.582...Elapsed Time: 37.9 sec
K=10,HCA=100,R=1/4...logloss=0.559...Elapsed Time: 38.5 sec
K=10,HCA=100,R=1/5...logloss=0.555...Elapsed Time: 39.0 sec
K=10,HCA=200,R=1/4...logloss=0.585...Elapsed Time: 32.7 sec
K=10,HCA=200,R=1/3...logloss=0.591...Elapsed Time: 32.9 sec
K=10,HCA=300,R=1/5...logloss=0.652...Elapsed Time: 32.8 sec
K=10,HCA=300,R=1/4...logloss=0.656...Elapsed Time: 33.0 sec
K=10,HCA=300,R=1/3...logloss=0.662...Elapsed Time: 37.8 sec
K=10,HCA=400,R=1/4...logloss=0.758...Elapsed Time: 37.8 sec
K=10,HCA=400,R=1/5...logloss=0.754...Elapsed Time: 38.2 sec
K=10,HCA=400,R=1/3...logloss=0.765...Elapsed Time: 37.5 sec
K=20,HCA=100,R=1/5...logloss=0.535...Elapsed Time: 32.7 sec
K=20,HCA=100,R=1/4...logloss=0.538...Elapsed Time: 32.7 sec
K=20,HCA=100,R=1/3...logloss=0.542...Elapsed Time: 32.6 sec
K=20,HCA=200,R=1/5...logloss=0.560...Elapsed Time: 32.6 sec
K=20,HCA=200,R=1/4...logloss=0.563...Ela

In [8]:
elo_df = pd.DataFrame(output)

In [9]:
best = elo_df.loc[elo_df['logloss'].idxmin()]

In [10]:
best.head()

HCA        100.000000
K           40.000000
R            0.200000
logloss      0.522538
Name: 36, dtype: float64

In [12]:
df = forecast(df,best['K'],best['HCA'],best['R'],return_df=True)

In [13]:
slots = pd.read_csv('DataFiles/NCAATourneySlots.csv')
seeds = pd.read_csv('DataFiles/NCAATourneySeeds.csv')

In [26]:
submit = []
full = []
for s in range(2014,2019):
    stats = []
    # Only games up to day 133, so tournament results do not leak into the ratings.
    subset = df[(df['Season'] == s) & (df['DayNum'] <= 133)]
    for t in seeds[seeds['Season'] == s]['TeamID']:
        row = {'TeamID':t}
        # Take the team's Elo after its last game before the tournament.
        team_stats = subset[(subset['WTeamID'] == t) | (subset['LTeamID'] == t)]
        team_stats = team_stats.loc[team_stats['DayNum'].idxmax()]
        if team_stats['WTeamID'] == t:
            row['Elo'] = team_stats['WElo_after']
        else:
            row['Elo'] = team_stats['LElo_after']
        stats.append(row)
    stats = pd.DataFrame(stats).sort_values('TeamID').set_index('TeamID')
    
    pairings = []
    for i in range(len(stats)):
        teamA = stats.iloc[i]
        for j in range(i+1,len(stats)):
            teamB = stats.iloc[j]
            x = dict()
            x['TeamA'] = teamA.name
            x['TeamB'] = teamB.name
            x['EloA'] = teamA['Elo']
            x['EloB'] = teamB['Elo']
            pairings.append(x)
    
    pairings = pd.DataFrame(pairings)
    
    pairings['EloDiff'] = pairings['EloA'] - pairings['EloB']
    pairings['Pred'] = pairings['EloDiff'].apply(lambda x: 1 / (10**(-x/400) + 1))
    
    pairings['ID'] = pairings.apply(lambda row: '%d_%d_%d' % (s,row['TeamA'],row['TeamB']),axis=1)
    pairings['Season'] = s
    submit.append(pairings[['ID','Pred']])
    
    full.append(pairings)
    
    print('Done with season %d' % s)

submit = pd.concat(submit).set_index('ID')
full = pd.concat(full)

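## validation
# Score the predictions against the actual tournament games (DayNum > 133).
# .copy() avoids pandas' SettingWithCopyWarning when the 'Pred' column is added.
tourney = df[(df['DayNum'] > 133) & (df['Season'] >= 2014)].copy()
for i,game in tourney.iterrows():
    season = game['Season']
    teamW = game['WTeamID']
    teamL = game['LTeamID']
    teamA = min(teamW,teamL)
    teamB = max(teamW,teamL)
    p = full[(full['TeamA'] == teamA) & (full['TeamB'] == teamB) & (full['Season'] == season)]['Pred'].values[0]
    # Pred is the probability that TeamA (the lower ID) wins; flip it if TeamA lost.
    if teamA == teamL:
        p = 1 - p
    tourney.at[i,'Pred'] = p
logloss = -np.log(tourney['Pred']).mean()
print('LogLoss = %0.4f' % logloss)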

Done with season 2014
Done with season 2015
Done with season 2016
Done with season 2017
Done with season 2018
LogLoss = 0.5601


In [27]:
stamp = pd.Timestamp.now()
# Zero-padded timestamp without ':' so the filename is valid on every platform.
filename = stamp.strftime('submission_%Y-%m-%d_%H-%M.csv')
submit.to_csv(filename)

message = 'Elo scores only, with K=%d R=1/%d HCA=%d' % (best['K'],1/best['R'],best['HCA'])
print(message)

In [29]:
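import subprocess
# Pass the arguments as a list so the filename and message need no shell quoting.
status = subprocess.run(['kaggle','competitions','submit',
                         '-c','mens-machine-learning-competition-2019',
                         '-f',filename,'-m',message])
print('Submission Complete!' if status.returncode == 0 else 'Submission failed.')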

Submission Complete!


In [31]:
df.to_csv('elo_scores.csv')