###### Overview

In this notebook I apply the Elo and Glicko-2 rating systems to the singles match data and save the resulting ratings, rating deviations, and win probabilities to CSV files for quick access in other notebooks.

###### Imports

Packages

In [23]:
import pandas as pd
from functions import assembleDf, epochElo, epochsElo, epochG, epochsG,\
PlayerElo, winProbG, get_recent_rating_wp, get_recent_rating_rd_wp_lambda
from datetime import datetime, timedelta
import numpy as np
import glob
import sys
sys.path.append('..')
from pyglicko2.glicko2_tests import exampleCase
from pyglicko2.glicko2 import Player
import glicko2
import time

Data

In [2]:
# read in the data that will be used with the rating systems.
matches = pd.read_csv('../Data/matches_glicko2.csv',parse_dates = 
                      ['tourney_date'], infer_datetime_format = True)

In [24]:
# read in the unfiltered data
data_files = glob.glob('../Data/singles_matches_df_*.csv')
singles_matches = pd.concat([pd.read_csv(f, low_memory=False,parse_dates = 
                                            ['tourney_date'], 
                                            infer_datetime_format=True,
                                           index_col=[0]) 
                                for f in data_files]) 

##### Rating system application

Get the Elo ratings for the matches.

In [4]:
playerClasses, eloRatingsHistory = epochsElo(matches)
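The implementation of `epochsElo` lives in `functions.py` and is not shown here. As a rough sketch of what a single Elo update looks like, assuming the standard logistic formula with a 400-point scale and a hypothetical K-factor of 32 (the notebook's actual K may differ):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return (new_winner_rating, new_loser_rating) after one match.

    k=32 is an illustrative assumption, not a value taken from epochsElo.
    """
    delta = k * (1.0 - elo_expected(r_winner, r_loser))
    return r_winner + delta, r_loser - delta


# Two unrated players both start at 1500; the winner gains what the loser drops.
new_winner, new_loser = elo_update(1500.0, 1500.0)
```

The update is zero-sum: whatever the winner gains, the loser loses, and an upset moves both ratings further than an expected result does.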

In [5]:
ratingsHistory_df = assembleDf(eloRatingsHistory)

In [6]:
# forward-fill missing values with each player's most recent rating;
# players with no rating yet get the default of 1500.
ratingsHistory_df = ratingsHistory_df.ffill(axis=0).fillna(1500)
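To illustrate the fill step on a toy frame (the player IDs and ratings below are made up for the example): `ffill` carries each player's last known rating forward in time, and `fillna(1500)` then covers the dates before a player's first recorded rating.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings history: rows are dates, columns are player IDs.
toy = pd.DataFrame(
    {101: [np.nan, 1516.0, np.nan], 102: [np.nan, np.nan, 1484.0]},
    index=pd.to_datetime(["2000-01-03", "2000-01-10", "2000-01-17"]),
)

# Same two-step fill as in the cell above.
filled = toy.ffill(axis=0).fillna(1500)
```

Player 101 keeps 1516 on the last date, while player 102 sits at the 1500 default until the first real rating appears.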

In [7]:
# import glicko2 rating and rating deviation history DataFrames
g2_rh = pd.read_csv('../Data/ratings_histories_glicko2.csv', index_col = 0, 
                   parse_dates=True, dtype=np.float64)
g2_rh.columns = g2_rh.columns.astype(int)
g2_rdh = pd.read_csv('../Data/rd_histories_glicko2.csv', index_col = 0, 
                     parse_dates=True, dtype=np.float64)
g2_rdh.columns = g2_rdh.columns.astype(int)


###### Filter for matches whose winner_id and loser_id both appear in the Glicko-2 rating histories

In [8]:
g2_players = set(g2_rdh.columns).union(g2_rh.columns)

In [9]:
in_g2 = [winner in g2_players and loser in g2_players
         for winner, loser in zip(matches['winner_id'], matches['loser_id'])]

In [10]:
sum(in_g2)/matches.shape[0]

0.9528939152407472

For about 95% of the matches, both players appear in the Glicko-2 DataFrames.  A later iteration of this code should determine why epochsG does not capture 100% of the players.
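One way to start that investigation is to list which player IDs are missing and how many matches each one affects. The helper below is a hypothetical sketch, not part of `functions.py`; it assumes only the `winner_id` and `loser_id` columns used above.

```python
import pandas as pd


def missing_player_ids(matches: pd.DataFrame, known_ids) -> pd.Series:
    """Count appearances of player IDs that are absent from `known_ids`.

    Hypothetical diagnostic helper: returns a Series indexed by player ID,
    sorted from most to fewest appearances.
    """
    ids = pd.concat([matches["winner_id"], matches["loser_id"]])
    return ids[~ids.isin(list(known_ids))].value_counts()


# Toy data: players 9 and 4 have no rating history.
toy_matches = pd.DataFrame(
    {"winner_id": [1, 2, 3, 9], "loser_id": [2, 3, 9, 4]}
)
missing = missing_player_ids(toy_matches, {1, 2, 3})
```

On the real data this would be called as `missing_player_ids(matches, g2_players)`; checking whether the missing players share a date range or a small match count may hint at where epochsG drops them.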

In [11]:
%%time
apply_glicko2 = matches[in_g2].apply(lambda x: 
                     get_recent_rating_rd_wp_lambda(x['tourney_date'],
                                                   x['winner_id'],
                                                   x['loser_id'],
                                                   g2_rh, g2_rdh),axis=1)

CPU times: user 7min 34s, sys: 12.9 s, total: 7min 47s
Wall time: 7min 42s


The cell above takes nearly 8 minutes to run on my machine, so I will save the results to a CSV file.
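How `get_recent_rating_rd_wp_lambda` turns ratings and rating deviations into a win probability is not shown in this notebook. A minimal sketch of one common choice, the expected-outcome formula from Glickman's description of the Glicko system, follows. It assumes ratings and deviations are on the original Glicko scale (centered on 1500), which may not match what `winProbG` actually does.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def g(rd: float) -> float:
    """Attenuation factor: shrinks toward 0 as the rating deviation grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def glicko_win_prob(r_a: float, rd_a: float, r_b: float, rd_b: float) -> float:
    """Probability that player A beats player B, using both players' RDs."""
    combined_rd = math.sqrt(rd_a**2 + rd_b**2)
    return 1.0 / (1.0 + 10 ** (-g(combined_rd) * (r_a - r_b) / 400.0))


# Illustrative inputs: a 1400-rated player (RD 80) facing a 1500-rated one (RD 150).
p = glicko_win_prob(1400.0, 80.0, 1500.0, 150.0)
```

With both deviations at zero the formula reduces to the plain Elo expectation; larger deviations pull the prediction toward 0.5, reflecting less certainty about the true rating gap.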

In [12]:
# here I reshape the filtered glicko2 data to match the full data, filling
# the blanks with NaNs.
g2_reshaped = [apply_glicko2[m] if m in apply_glicko2.index.values else 
 (np.nan, np.nan, np.nan, np.nan, np.nan)
 for m in range(0, matches.shape[0])]

In [13]:
# convert the reshaped list of tuples to a dataframe for later concatenation 
# with elo data and the original dataset.
g2_df = pd.DataFrame.from_records(g2_reshaped, columns = 
                          ['winner_gr','winner_grd','loser_gr', 'loser_grd',
                          'wp_g'])

In [14]:
g2_df.to_csv('./g2_df.csv')

###### Elo application

In [15]:
padRow = pd.DataFrame({col: 1500.0 for col in ratingsHistory_df.columns}, index = [pd.Timestamp('1877-07-09T00')])
padRow

Unnamed: 0,131500,131584,131866,131867,131869,131873,131876,131879,131881,131884,...,130340,130355,130457,130493,130542,130550,130552,130553,130655,130767
1877-07-09,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0


In [16]:
ratingsHistory_df = pd.concat([padRow,ratingsHistory_df],axis=0)
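The pad row guarantees that every match date has at least one earlier row to look up. The sketch below shows why that matters, using a hypothetical lookup that takes the last row dated strictly before the match; whether `get_recent_rating_wp` uses a strict or inclusive comparison is an assumption here.

```python
import pandas as pd


def most_recent_rating(history: pd.DataFrame, date, player_id) -> float:
    """Rating from the last row dated strictly before `date` (hypothetical helper).

    Assumes `history` has a sorted DatetimeIndex and one column per player.
    """
    pos = history.index.searchsorted(pd.Timestamp(date), side="left") - 1
    if pos < 0:
        raise KeyError(f"no rating row before {date}")
    return float(history[player_id].iloc[pos])


# Toy history for a made-up player 101, with a pad row of 1500 at the top.
toy_history = pd.DataFrame(
    {101: [1500.0, 1516.0, 1530.0]},
    index=pd.to_datetime(["1877-07-09", "2000-01-03", "2000-01-10"]),
)
first_match = most_recent_rating(toy_history, "2000-01-03", 101)
later_match = most_recent_rating(toy_history, "2000-01-05", 101)
```

Without the pad row, the lookup for the earliest date in the history would have no prior row to return; with it, the player simply enters the first match at 1500.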

In [17]:
we_le_pw = matches.apply(lambda x:
                           get_recent_rating_wp(
                               ratingsHistory_df, 
                               x['tourney_date'], 
                               x['winner_id'], 
                               x['loser_id']),
                            axis=1)

In [18]:
we_le_pw_df = pd.DataFrame.from_records(we_le_pw, columns = ['winner_elo',
                                                             'loser_elo',
                                                             'win_prob'])

###### Glicko-2 and Elo combined with full data

In [19]:
combined_df = pd.concat([singles_matches,we_le_pw_df,g2_df],axis = 1)

###### I'll split the data into quarters so that the chunks are under the 100 MB limit for GitHub upload

In [20]:
n = combined_df.shape[0]
combined_df_0 = combined_df.iloc[0:n*4//16,:]
combined_df_1 = combined_df.iloc[n*4//16:n*8//16,:]
combined_df_2 = combined_df.iloc[n*8//16:n*12//16,:]
combined_df_3 = combined_df.iloc[n*12//16:,:]
combined_dfs = [combined_df_0, combined_df_1, combined_df_2, combined_df_3]

In [21]:
for i, df in enumerate(combined_dfs):
    df.to_csv(f'../Data/combined_df_{i}.csv')
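The same split can be written for any number of chunks, which is handy if the data grows past the size limit again. This is a sketch with a hypothetical helper name; it uses the same integer-division boundaries as the cell above, so with four chunks it produces identical slices.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, n_chunks: int) -> list:
    """Split `df` into `n_chunks` contiguous, nearly equal row slices."""
    n = len(df)
    bounds = [n * i // n_chunks for i in range(n_chunks + 1)]
    return [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]


# Toy frame with 10 rows split four ways.
toy = pd.DataFrame({"x": range(10)})
chunks = split_frame(toy, 4)

# Concatenating the chunks in order restores the original frame.
restored = pd.concat(chunks)
```

When reading the saved files back, concatenating them in chunk order (for example by sorting the file names) keeps the rows in their original sequence.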