The goal of this notebook is to measure the accuracy of 538's Elo model for MLB games. The data comes from https://github.com/fivethirtyeight/data/tree/master/mlb-elo, which provides a CSV file containing their win percentages.

One thing we could do is compare their win percentages directly with ours (they all tend to be around 50%). We will also check what prediction accuracy we get if we simply predict the winner to be the team with the higher win percentage, and we can then compare that accuracy with our model's.
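Since most of the win percentages sit close to 50%, picking the favorite ignores how confident each forecast was. One common way to score the percentages themselves is the Brier score: the mean squared difference between the forecast probability and the 0/1 outcome, where lower is better and always guessing 0.5 scores 0.25. The sketch below only illustrates the idea on made-up numbers; the helper name `brier_score` and the toy values are assumptions, not part of the 538 data or of our model.

```python
import numpy as np

def brier_score(prob_team1_wins, team1_won):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    p = np.asarray(prob_team1_wins, dtype=float)
    y = np.asarray(team1_won, dtype=float)
    return float(np.mean((p - y) ** 2))

# Made-up forecasts and results for four games (hypothetical values)
toy_probs = [0.6, 0.4, 0.55, 0.5]
toy_results = [1, 0, 0, 1]

toy_brier = brier_score(toy_probs, toy_results)
# Reference point: a forecaster that always says 50%
coin_flip_brier = brier_score([0.5] * 4, toy_results)
print(toy_brier, coin_flip_brier)
```

Two models can then be compared on the same games by computing this score for each set of win percentages.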

In [1]:
##Import of packages that we will use 
import pandas as pd
import numpy as np
import math

We start by reading the data from the CSV file into a DataFrame.

In [2]:
##Read data from 538 csv file as Elo_data
Elo_data = pd.read_csv("./mlb_elo_data/mlb_elo.csv")

Let's take a quick look at the first few rows of the data.

In [4]:
Elo_data.head()

Unnamed: 0,date,season,neutral,playoff,team1,team2,elo1_pre,elo2_pre,elo_prob1,elo_prob2,...,pitcher1_rgs,pitcher2_rgs,pitcher1_adj,pitcher2_adj,rating_prob1,rating_prob2,rating1_post,rating2_post,score1,score2
0,2023-10-01,2023,0,,STL,CIN,1499.567587,1485.123367,0.555101,0.444899,...,,,,,0.57582,0.42418,,,,
1,2023-10-01,2023,0,,SEA,TEX,1516.277991,1535.226359,0.507269,0.492731,...,,,,,0.50461,0.49539,,,,
2,2023-10-01,2023,0,,NYM,PHI,1506.248367,1523.132153,0.51024,0.48976,...,,,,,0.538668,0.461332,,,,
3,2023-10-01,2023,0,,MIL,CHC,1502.093612,1498.788921,0.539214,0.460786,...,,,,,0.557476,0.442524,,,,
4,2023-10-01,2023,0,,KCR,NYY,1423.429777,1541.893168,0.36731,0.63269,...,,,,,0.347503,0.652497,,,,


For simplicity we start with the 2010 season only, since that is the season I had just tested my own model on. Any other season would work the same way.

In [42]:
#Pick the season we are interested in
year = 2010
#Create a dataframe for that season; .copy() makes it independent of Elo_data before we add columns
year_data = Elo_data.loc[Elo_data.season == year].copy()
#Add the prediction: 1 if team1 is favored (rating_prob1 > 0.5), else 0
year_data["prediction"] = (year_data["rating_prob1"] > .5).astype(int)
#Add the actual result: 1 if team1 outscored team2, else 0
year_data["actual"] = (year_data["score1"] > year_data["score2"]).astype(int)
#Add whether or not the prediction was correct (1 or 0)
year_data["prediction_correct"] = (year_data["prediction"] == year_data["actual"]).astype(int)
#Print the prediction accuracy, i.e. the fraction of games predicted correctly
print("The prediction accuracy is given by " + str(year_data.prediction_correct.sum()/len(year_data)))

The prediction accuracy is given by 0.5536149471974005
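The head of the data shows rows with empty scores, which look like games that had not been played yet. A comparison such as `score1 > score2` evaluates to False when a score is missing, so such rows would be counted as team1 losses. For a completed season this presumably changes nothing, but a minimal sketch of a helper that drops unplayed games first might look like this. The function name `pick_accuracy` and the toy frame are hypothetical; the column names are the ones from the CSV.

```python
import numpy as np
import pandas as pd

def pick_accuracy(games, prob_col="rating_prob1"):
    """Fraction of played games in which the favored team (prob > 0.5) won."""
    # Keep only games that have a final score for both teams
    played = games.dropna(subset=["score1", "score2"])
    predicted_team1 = played[prob_col] > 0.5
    team1_won = played["score1"] > played["score2"]
    return float((predicted_team1 == team1_won).mean())

# Toy data (made up): three played games and one unplayed game
toy_games = pd.DataFrame({
    "rating_prob1": [0.58, 0.45, 0.61, 0.52],
    "score1": [5, 2, 1, np.nan],
    "score2": [3, 4, 6, np.nan],
})
toy_accuracy = pick_accuracy(toy_games)
print(toy_accuracy)
```

Passing `prob_col="elo_prob1"` would score the other probability column in the file the same way.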


Now we wrap the code above in a function that iterates over the seasons 2010-2022.

In [43]:
def get_averages(Elo_data):
    #Print the prediction accuracy for each season from 2010 through 2022
    for year in range(2010,2023):
        #Create a dataframe for that season; .copy() makes it independent of Elo_data
        year_data = Elo_data.loc[Elo_data.season == year].copy()
        #Add the prediction: 1 if team1 is favored (rating_prob1 > 0.5), else 0
        year_data["prediction"] = (year_data["rating_prob1"] > .5).astype(int)
        #Add the actual result: 1 if team1 outscored team2, else 0
        year_data["actual"] = (year_data["score1"] > year_data["score2"]).astype(int)
        #Add whether or not the prediction was correct (1 or 0)
        year_data["prediction_correct"] = (year_data["prediction"] == year_data["actual"]).astype(int)
        #Print the prediction accuracy, i.e. the fraction of games predicted correctly
        print("The prediction accuracy is given by " + str(year_data.prediction_correct.sum()/len(year_data)))

In [44]:
get_averages(Elo_data)

The prediction accuracy is given by 0.5536149471974005
The prediction accuracy is given by 0.55776246453182
The prediction accuracy is given by 0.5630320226996351
The prediction accuracy is given by 0.583232077764277
The prediction accuracy is given by 0.554021121039805
The prediction accuracy is given by 0.5525354969574037
The prediction accuracy is given by 0.5663824604141291
The prediction accuracy is given by 0.5721231766612642
The prediction accuracy is given by 0.5888798701298701
The prediction accuracy is given by 0.5977291159772912
The prediction accuracy is given by 0.5772870662460567
The prediction accuracy is given by 0.5754257907542579
The prediction accuracy is given by 0.6064777327935222
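The thirteen lines above are in season order, 2010 first and 2022 last, but the printout itself does not say so. A possible variant, sketched here on the assumption that the same columns are available, returns the accuracies as a pandas Series indexed by season instead of printing them. The function name `accuracy_by_season` and the toy data are hypothetical.

```python
import pandas as pd

def accuracy_by_season(games, first=2010, last=2022):
    """Accuracy of picking the favored team, as a Series indexed by season."""
    # Keep only the requested seasons (both endpoints included)
    in_range = games[games["season"].between(first, last)]
    # True where the favored team (rating_prob1 > 0.5) matches the actual winner
    correct = (in_range["rating_prob1"] > 0.5) == (in_range["score1"] > in_range["score2"])
    # Average the True/False values within each season
    return correct.groupby(in_range["season"]).mean()

# Toy data (made up): two games in 2010, two in 2011, one outside the range
toy = pd.DataFrame({
    "season": [2010, 2010, 2011, 2011, 2023],
    "rating_prob1": [0.6, 0.4, 0.55, 0.45, 0.7],
    "score1": [4, 5, 6, 1, 2],
    "score2": [2, 1, 3, 7, 1],
})
toy_by_season = accuracy_by_season(toy)
print(toy_by_season)
```

Calling `accuracy_by_season(Elo_data)` should give the same numbers as `get_averages`, with each value labeled by its season, and the result can be plotted or averaged directly.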
