# Sabermetrics: Computing Baseball Park Factors


Sabermetrics is the original term for baseball analytics.  It is not necessary to know anything about Sabermetrics to understand this notebook.  However, understanding the rules of baseball is helpful.

For baseball fans, an excellent introduction to Sabermetrics is: [Understanding Sabermetrics: Second Edition](https://www.amazon.com/Understanding-Sabermetrics-Introduction-Baseball-Statistics/dp/1476667667/).  The three authors are all professors of mathematics who love baseball.

Sabermetric blog posts and websites have evolved a specialized terminology that uses some of the same terms as data science but with slightly different meanings. For a data scientist learning about Sabermetrics, I recommend the above book and Wikipedia, as both of these sources use standard data science terminology.

On Wikipedia, see for example:
* https://en.wikipedia.org/wiki/Baseball_statistics
* https://en.wikipedia.org/wiki/Batting_park_factor

# Basic Park Factor

Each baseball park is different.  Some parks, such as Coors Field in Denver, are easy to score runs in, whereas others, such as Oracle Park in San Francisco, are difficult to score runs in.

In order to compare players who play in different parks, it would be helpful to account for how much their home park affects their performance.

The basic park factor formula is:

$$PF = \frac{(RS_{home} + RA_{home}) / G_{home}}{(RS_{road} + RA_{road}) / G_{road}}$$

For each team:  
$RS_{home}$ is Runs Scored at home  
$RA_{home}$ is Runs Allowed at home  
$G_{home}$ is number of Games played at home  
$RS_{road}$ is Runs Scored on the road  
$RA_{road}$ is Runs Allowed on the road  
$G_{road}$ is number of Games played on the road  

Example: 2019 LA Dodgers   
$RS_{home}$ = 441  
$RA_{home}$ = 271  
$G_{home}$ = 81  
$RS_{road}$ = 445  
$RA_{road}$ = 342  
$G_{road}$ = 81  

Plugging in the above numbers gives a basic PF for runs for Dodger Stadium in 2019 as 0.9047.  Often this value is multiplied by 100 so that 100 represents league average instead of 1.0.
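
As a quick check of the arithmetic, the formula can be written as a small function. This is only a sketch: the function name `basic_park_factor` is made up for illustration and is not used elsewhere in this notebook.

```python
def basic_park_factor(rs_home, ra_home, g_home, rs_road, ra_road, g_road):
    """Runs per game at home divided by runs per game on the road."""
    home_avg = (rs_home + ra_home) / g_home
    road_avg = (rs_road + ra_road) / g_road
    return home_avg / road_avg

# 2019 LA Dodgers values from the example above
pf_dodgers = basic_park_factor(441, 271, 81, 445, 342, 81)
print(f'{pf_dodgers:.4f}')  # 0.9047
```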

As a player plays only half of their games at home, the PF is sometimes "halved" as follows: $PF_{half} = (1+PF)/2$.  This gives a value of 0.952 for Dodger Stadium.  [Fangraphs](https://www.fangraphs.com/guts.aspx?type=pf&teamid=0&season=2019) shows the "halved" value as 0.96 (for 1yr PF by runs) for 2019, which is in close agreement.

A value less than 1.0 means that fewer runs than average are scored at Dodger Stadium.  This could be used to determine an equivalent ERA.  For example, Clayton Kershaw had an ERA of 3.02 in 2019.  Assuming half those games were at Dodger Stadium and using $PF_{half}$ means that his ERA in an average park would have been 3.02 / 0.952 = 3.17 without the pitching advantage of Dodger Stadium.
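
The halving and the ERA adjustment can be reproduced in a few lines. The variable names below are illustrative only, and the ERA is simply the figure quoted in the text.

```python
pf_dodgers = 0.9047              # basic PF for Dodger Stadium, 2019 (from above)
pf_half = (1 + pf_dodgers) / 2   # half of a player's games are at home

era = 3.02                       # Kershaw's 2019 ERA as quoted in the text
era_neutral = era / pf_half      # equivalent ERA in an average park

print(f'pf_half: {pf_half:.3f}  neutral-park ERA: {era_neutral:.2f}')
# pf_half: 0.952  neutral-park ERA: 3.17
```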

The Park Factor formula nicely factors out most of the qualities unique to the home team.  For example, an excellent scoring team would score more runs both home and away.  As RS is both in the numerator and the denominator, the team's above average run scoring ability is "canceled out", leaving just the contribution due to the park.  Similarly, an excellent pitching team will allow fewer runs both home and away.  As RA is in both the numerator and the denominator, this also "cancels out", leaving just the contribution due to the park.
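
The cancellation can be illustrated with made-up numbers. The sketch below assumes a purely multiplicative park effect (the home park scales both runs scored and runs allowed by the same factor) and that road parks are neutral on average. Both are simplifying assumptions, and all team figures are hypothetical.

```python
def basic_pf(rs_home, ra_home, rs_road, ra_road, games=81):
    """Basic PF with equal home and road game counts."""
    return ((rs_home + ra_home) / games) / ((rs_road + ra_road) / games)

park_effect = 1.10  # hypothetical park that inflates scoring by 10%

# (runs scored, runs allowed) on the road for three made-up teams
example_teams = {'strong offense': (500, 350),
                 'strong pitching': (350, 250),
                 'average': (400, 400)}

pf_by_team = {}
for label, (rs_road, ra_road) in example_teams.items():
    # at home, both RS and RA are scaled by the same park effect
    pf_by_team[label] = basic_pf(park_effect * rs_road, park_effect * ra_road,
                                 rs_road, ra_road)
    print(f'{label}: {pf_by_team[label]:.3f}')
```

Under these assumptions every team recovers the same value, 1.100, regardless of how strong its offense or pitching is.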

## Case Example: Fenway Park 2019
All park factors for all stadiums for several years will be calculated while paying particular attention to a specific example, Fenway Park in Boston in 2019.

The 2019 Fenway Park calculation is interesting because in 2019 Boston hosted the Yankees in London for two games.  See: https://en.wikipedia.org/wiki/MLB_London_Series. 

These 2 games were extremely high scoring, with the two teams scoring a total of 50 runs.  Even though Boston was the "home team", meaning they batted last, the runs were not scored at their home park of Fenway, but in London.

## Park Factor Refinements
There are several refinements that can be made.  Each will be addressed in subsequent notebooks.  Arguably the most important is the one addressed here.

**Home Team not playing in Home Park**  
The Park Factor was created to measure the effect of each park on baseball statistics.  "Home" in the PF formula doesn't mean "home team"; it means "home park".  Boston's home park is Fenway, so the values in the numerator are Runs Scored at Fenway, Runs Allowed at Fenway and Games played at Fenway.  If Boston happens to bat last at a baseball game in London, this should not affect the Fenway Park Factor.

As it is quite rare for the team batting last to not be playing in their home park, it is sufficient to remove all such games prior to computing each team's Park Factor.

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re
from scipy.stats import linregress

In [2]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [3]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

## Read in the Data
Reading in the data up front makes the code clearer, but may use more memory.  usecols helps to reduce memory as these are very wide csv files.

In [4]:
cols = ['game_id', 'year', 'bat_last', 'team_id', 'opponent_team_id', 'r']
team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz', usecols=cols)

In [5]:
cols = ['game_id', 'park_id', 'game_start']
game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz', usecols=cols)

In [6]:
parks = dh.from_csv_with_types(retrosheet_data / 'parks.csv')

In [7]:
teams = dh.from_csv_with_types(retrosheet_data / 'teams.csv')

In [8]:
lahman_teams = dh.from_csv_with_types(lahman_data / 'teams.csv')

In [9]:
# for now, focus on 2015 onward
team_game = team_game.query('year >= 2015')
game['year'] = game['game_start'].dt.year
game = game.query('year >= 2015')
game = game.drop(columns='game_start')

## Understanding the team_game CSV file
The output of the cwgame parser was made [tidy](https://en.wikipedia.org/wiki/Tidy_data) and this is one of two files created.

The team_game file has one team per game. Since each game has two teams, there are two rows per game.  The team_id field identifies the team that the statistics are for.  The opponent_team_id identifies the opponent.

The fields which uniquely identify a record are: game_id, team_id.

In [10]:
# examine 1 game
team_game_example = team_game.query('game_id == "BOS201906290"')
team_game_example

Unnamed: 0,game_id,bat_last,team_id,opponent_team_id,r,year
260466,BOS201906290,True,BOS,NYA,13,2019
260467,BOS201906290,False,NYA,BOS,17,2019


We see the Red Sox (BOS) played the Yankees (NYA).  The Red Sox batted last and scored 13 runs whereas the Yankees batted first and scored 17 runs.
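
The two-rows-per-game structure can be checked directly. The sketch below re-types the two rows shown above into a small dataframe (the name `example` is made up) and confirms that the game has exactly two rows, one of which is for the team batting last.

```python
import pandas as pd

# the two rows shown above, re-typed by hand for illustration
example = pd.DataFrame({
    'game_id': ['BOS201906290', 'BOS201906290'],
    'bat_last': [True, False],
    'team_id': ['BOS', 'NYA'],
    'opponent_team_id': ['NYA', 'BOS'],
    'r': [13, 17],
})

per_game = example.groupby('game_id').agg(
    rows=('team_id', 'count'),          # number of rows for the game
    bat_last_rows=('bat_last', 'sum'),  # number of rows flagged as batting last
    total_runs=('r', 'sum'))            # runs scored by both teams
print(per_game)

assert (per_game['rows'] == 2).all()
assert (per_game['bat_last_rows'] == 1).all()
```

The same `groupby` could, in principle, be run on the full `team_game` dataframe as a sanity check.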

## Understanding the game CSV file
The output of the cwgame parser was made tidy and this is one of two files created.

The game file has information specific to a game, such as the baseball park it was played in.

In [11]:
game_example = game.query('game_id == "BOS201906290"')
game_example

Unnamed: 0,game_id,park_id,year
130233,BOS201906290,LON01,2019


In [12]:
# join with parks file to get info about parks
game_example.merge(parks, left_on=['park_id'], right_on=['park_id'])

Unnamed: 0,game_id,park_id,year,name,aka,city,state,start,end,league,notes
0,BOS201906290,LON01,2019,London Stadium,,London,UK,2019-06-29,2019-06-30,AL,BOS


We see that the game was played in London.

In [13]:
# join with teams file to get info about teams
team_game_example.merge(teams, left_on=['team_id', 'year'], right_on=['team_id', 'year'])

Unnamed: 0,game_id,bat_last,team_id,opponent_team_id,r,year,lg_id,city,name
0,BOS201906290,True,BOS,NYA,13,2019,A,Boston,Red Sox
1,BOS201906290,False,NYA,BOS,17,2019,A,New York,Yankees


# Compute RT = (RS + RA)
Runs Total = Runs Scored + Runs Allowed: per team per year per park.

Also rank the number of games played by each team at each park.  The park in which a team plays the most games is treated as its home park, and the Park Factor is computed for that park.
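
To see how the ranking behaves, here is a toy example with made-up game counts for one hypothetical team-season. pandas `rank` uses the average rank for ties by default, which is why fractional ranks such as 2.5 can appear.

```python
import pandas as pd

# made-up game counts for one team-season
toy = pd.DataFrame({
    'team_id': ['AAA'] * 4,
    'year': [2019] * 4,
    'park_id': ['HOME1', 'ROAD1', 'ROAD2', 'ROAD3'],
    'games': [81, 10, 10, 3],
})

# rank parks by games played, most games first; ties share the average rank
toy['rank'] = toy.groupby(['team_id', 'year'])['games'].rank(ascending=False)
print(toy[['park_id', 'games', 'rank']])
```

The park with 81 games gets rank 1.0 and the two parks tied at 10 games each get rank 2.5. One consequence worth keeping in mind: if two parks tied for the most games, both would get rank 1.5 and neither would be selected by `rank == 1`.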

In [14]:
def compute_rt_per_park(team_game, game, parks):
    """Compute RS and RA per team per year per park"""
    
    # bring in the park_id field from game
    tg_park = team_game.merge(game)
    
    # compute RS per team per year per park
    rs = tg_park.groupby(['team_id', 'year', 'park_id']).agg(
        rs=('r', 'sum'), games=('r', 'count')).reset_index()
    
    # compute RA per team per year per park
    ra = tg_park.groupby(['opponent_team_id', 'year', 'park_id']).agg(
        ra=('r', 'sum'), games=('r', 'count')).reset_index()
    
    # combine the RS and RA dataframes
    ra = ra.rename(columns={'opponent_team_id':'team_id'})
    rt = rs.merge(ra, 
              left_on=['team_id', 'year', 'park_id'], 
              right_on=['team_id', 'year', 'park_id'],
              suffixes=('_rs', '_ra'))
    
    # the number of games in which runs scored must equal 
    # the number of games in which runs were allowed
    assert (rt['games_rs'] == rt['games_ra']).all()
    
    # when summed over all teams, runs scored must equal runs allowed
    assert rt['rs'].sum() == rt['ra'].sum()
    
    rt['rt'] = rt['rs'] + rt['ra']
    rt['games'] = rt['games_rs']
    rt = rt.drop(columns=['games_rs', 'games_ra'])   
    
    # per team per year, rank each park by number of games played there
    rt['rank'] = rt.groupby(['team_id', 'year'])['games'].rank(ascending=False)
    
    return rt

In [15]:
rt = compute_rt_per_park(team_game, game, parks)
rt.head()

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games,rank
0,ANA,2015,ANA01,320.0,298.0,618.0,81,1.0
1,ANA,2015,ARL02,69.0,46.0,115.0,10,2.5
2,ANA,2015,BAL12,9.0,5.0,14.0,3,13.0
3,ANA,2015,BOS07,16.0,19.0,35.0,3,13.0
4,ANA,2015,CHI12,4.0,14.0,18.0,3,13.0


In [16]:
# spot check with Boston
rt.query('team_id == "BOS" and year==2019')[['rs', 'ra']].sum()

rs    901.0
ra    828.0
dtype: float64

In [17]:
# total runs scored: home and away
team_game.query('team_id == "BOS" and year==2019')['r'].sum()

901

In [18]:
# total runs allowed: home and away
team_game.query('opponent_team_id == "BOS" and year==2019')['r'].sum()

828

In [19]:
# and check with Lahman to see its RS and RA for Boston
lahman_teams.query('team_id == "BOS" and year == 2019')[['r', 'ra']]

Unnamed: 0,r,ra
2898,901,828


The spot check for Boston looks good.

## Split RT into Home Parks and Road Parks
The RT dataframe has the total runs scored and allowed per team per park per year.

To compute the PF for each home park it is necessary to split out the home parks from the road parks, per team per year.

For example, the home park for Boston is Fenway.  The runs scored as the "home team" in London, should not count as runs scored in Fenway Park.

In [20]:
def find_home_parks(rt):
    """Find each team's home park."""
    
    # rank == 1 identifies each team's home park
    home_parks = rt.query('rank == 1').copy()
    home_parks['r_avg'] = home_parks['rt'] / home_parks['games']
    home_parks = home_parks.drop(columns=['rank'])
    
    return home_parks

In [21]:
def find_road_parks(rt):
    """Find the road parks for each team."""
    
    # rank > 1 identifies each team's road parks
    road_parks = rt.query('rank > 1').copy()
    road_parks = road_parks.drop(columns=['rank'])
    
    return road_parks

In [22]:
def compute_road_totals(road_parks):
    """Sum the totals on the road for per team per year."""
    
    road_totals = road_parks.groupby(['team_id', 'year']).agg(
        rt=('rt', 'sum'), games=('games', 'sum'))
    road_totals['r_avg'] = road_totals['rt'] / road_totals['games']
    
    return road_totals

In [23]:
home_parks = find_home_parks(rt)

In [24]:
# home park for Boston
home_parks.query('team_id == "BOS" and year==2019')

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games,r_avg
461,BOS,2019,BOS07,431.0,410.0,841.0,79,10.64557


We see that 79 games were played at Fenway and 841 runs in total were scored at Fenway.

## Team batting last but not playing at their home park
The "home park" is the park where the team plays the most games.

When Boston hosted the Yankees in London, Boston batted last, but the games were not at Boston's home park, Fenway.  The extraordinarily high-scoring games in London distort the Fenway Park PF if they are added to the Fenway Park total.

Another example is the 2015 Baltimore Orioles who hosted Tampa Bay in Tampa Bay. See: https://en.wikipedia.org/wiki/2015_White_Sox%E2%80%93Orioles_crowdless_game

In [25]:
batting_last = team_game.query('bat_last == True').merge(game)[
    ['team_id', 'year', 'park_id', 'game_id']]

tmp = batting_last.merge(home_parks[['team_id', 'year', 'park_id']],
               left_on=['team_id', 'year'],
               right_on=['team_id', 'year'],
               suffixes=['_game', '_home_team'])

diff = tmp.query('park_id_game != park_id_home_team')
diff

Unnamed: 0,team_id,year,park_id_game,game_id,park_id_home_team
253,BAL,2015,STP01,BAL201505010,BAL12
254,BAL,2015,STP01,BAL201505020,BAL12
255,BAL,2015,STP01,BAL201505030,BAL12
2637,ATL,2016,FTB01,ATL201607030,ATL02
5814,HOU,2017,STP01,HOU201708290,HOU03
5815,HOU,2017,STP01,HOU201708300,HOU03
5816,HOU,2017,STP01,HOU201708310,HOU03
6062,MIA,2017,MIL06,MIA201709150,MIA02
6063,MIA,2017,MIL06,MIA201709160,MIA02
6064,MIA,2017,MIL06,MIA201709170,MIA02


In [26]:
# remove these games and recompute the run totals
filt = team_game['game_id'].isin(diff['game_id'])
team_game = team_game[~filt]

rt = compute_rt_per_park(team_game, game, parks)

In [27]:
home_parks = find_home_parks(rt)
road_parks = find_road_parks(rt)
road_totals = compute_road_totals(road_parks)

In [28]:
# ensure the indexes are the same before dividing
home_parks = home_parks.set_index(['team_id', 'year'])
pf = home_parks['r_avg'] / road_totals['r_avg']
pf = pf.to_frame().reset_index()
pf = pf.rename(columns={'r_avg':'pf'})
pf['pf_half'] = (1+pf['pf']) / 2
home_parks = home_parks.reset_index()
pf.head()

Unnamed: 0,team_id,year,pf,pf_half
0,ANA,2015,0.860724,0.930362
1,ANA,2016,0.910053,0.955026
2,ANA,2017,0.949176,0.974588
3,ANA,2018,0.968622,0.984311
4,ANA,2019,1.006353,1.003176
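
The reason for setting the index before dividing is that pandas aligns two Series on their index labels, not on row position. A minimal sketch with made-up values (the team codes `AAA` and `BBB` are hypothetical):

```python
import pandas as pd

idx_home = pd.MultiIndex.from_tuples(
    [('AAA', 2019), ('BBB', 2019)], names=['team_id', 'year'])
idx_road = pd.MultiIndex.from_tuples(
    [('BBB', 2019), ('AAA', 2019)], names=['team_id', 'year'])

home_avg = pd.Series([10.0, 8.0], index=idx_home)  # made-up runs per game
road_avg = pd.Series([8.0, 10.0], index=idx_road)  # same teams, different row order

# division matches rows by (team_id, year), not by position
ratio = home_avg / road_avg
print(ratio)
```

Both ratios are 1.0 because each team is divided by its own road average. Dividing by position instead would have paired each team with the other team's road average.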


In [29]:
pf.query('team_id == "BOS" and year==2019').round(3)

Unnamed: 0,team_id,year,pf,pf_half
24,BOS,2019,1.029,1.014


We see that the basic PF for Fenway Park is 1.029 rounded to 3 decimal places.

In [30]:
# Spot Check by hand using ESPN home/away splits for Boston 2019
# See: http://www.espn.com/mlb/stats/team/_/stat/pitching/split/33
# home: 81 games 452 RS  439 RA -- includes London!
# away: 81 games 449 RS  389 RA
# London Series See: https://en.wikipedia.org/wiki/MLB_London_Series
# Boston in London (as home team) 2 games: 21 RS 29 RA
fenway_r_avg = (452 + 439 - 21 - 29) / 79
away_r_avg = (449 + 389) / 81
pf_fenway_2019 = fenway_r_avg / away_r_avg
print(f'pf: {pf_fenway_2019:.3f} pf_half:{(1+pf_fenway_2019)/2:.3f}')

pf: 1.029 pf_half:1.014


The value calculated by hand exactly matches the code which computes the PF for every team for every year.  This suggests that the above code is correct.

What value do we get if we mistakenly include the high scoring London games in the total for Fenway Park?

In [31]:
fenway_london_r_avg = (452+439) / 81
away_r_avg = (449 + 389) / 81
pf_fenway_london_2019 = fenway_london_r_avg / away_r_avg
print(f'pf: {pf_fenway_london_2019:.3f} pf_half:{(1+pf_fenway_london_2019)/2:.3f}')

pf: 1.063 pf_half:1.032


ESPN shows the PF for Fenway Park in 2019 as 1.06.  It appears ESPN mistakenly included the London runs as Fenway Park runs.
See: http://www.espn.com/mlb/stats/parkfactor/_/year/2019

Fangraphs, which presents pf_half values, shows the half value as 1.03.  It appears Fangraphs also mistakenly included the London runs as Fenway Park runs.
See: https://www.fangraphs.com/guts.aspx?type=pf&season=2019&teamid=0

Repeatable Research is about presenting data analysis that is repeatable.  This allows anyone to catch and fix possible errors, such as the one found here.  Scoring lots of runs in London as the home team should not increase the Fenway Park PF.

# Webscrape FanGraphs for PF
This will allow for comparing many values with those computed here.

In [32]:
import requests
from io import StringIO
from bs4 import BeautifulSoup

In [33]:
# read the parks factor table on the fangraphs website
data = []
for year in range(2015, 2020):
    url = f'https://www.fangraphs.com/guts.aspx?type=pf&season={year}&teamid=0'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    
    table = soup.find('table', class_='rgMasterTable')
    
    header = table.find('thead')
    cols = [col.text for col in header.find_all('th')]
    
    body = table.find('tbody')
    for row in body.find_all('tr'):
        data.append([col.text for col in row.find_all('td')])
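
The header/body extraction can be tried offline on a small hand-written table. The sketch below deliberately swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without a network connection or extra packages. The HTML string is a hypothetical stand-in for the scraped page, with values copied from the first two FanGraphs rows shown in this notebook.

```python
from html.parser import HTMLParser

# hypothetical HTML standing in for a scraped page
HTML = """
<table class="rgMasterTable">
  <thead><tr><th>Season</th><th>Team</th><th>1yr</th></tr></thead>
  <tbody>
    <tr><td>2015</td><td>Angels</td><td>93</td></tr>
    <tr><td>2015</td><td>Orioles</td><td>108</td></tr>
  </tbody>
</table>
"""

class TableParser(HTMLParser):
    """Collect header cells (<th>) and body rows (<td>) from one table."""

    def __init__(self):
        super().__init__()
        self.header = []
        self.rows = []
        self._cell = None  # tag of the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])   # start a new (possibly empty) row
        elif tag in ('th', 'td'):
            self._cell = tag

    def handle_data(self, data):
        if self._cell == 'th':
            self.header.append(data)
        elif self._cell == 'td':
            self.rows[-1].append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._cell = None

parser = TableParser()
parser.feed(HTML)
rows = [row for row in parser.rows if row]  # drop the empty header row
print(parser.header)
print(rows)
```

As in the scraping cell, every value comes back as a string, so numeric columns still need converting afterwards.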

In [34]:
fangraphs = pd.DataFrame(data, columns = cols)

# change datatypes from string to int
for col in fangraphs.columns:
    if col != 'Team':
        fangraphs[col] = fangraphs[col].astype('int')

fangraphs.head()

Unnamed: 0,Season,Team,Basic (5yr),3yr,1yr,1B,2B,3B,HR,SO,BB,GB,FB,LD,IFFB,FIP
0,2015,Angels,97,95,93,100,96,88,98,102,97,101,100,98,100,98
1,2015,Orioles,101,101,108,101,96,87,106,98,100,101,102,100,100,103
2,2015,Red Sox,104,107,109,103,112,103,95,99,99,102,97,103,101,98
3,2015,White Sox,99,98,95,98,95,93,105,102,103,98,101,98,105,102
4,2015,Indians,102,106,112,101,106,83,102,100,100,101,97,101,92,100


In [35]:
nb_data_path = data_dir / 'retrosheet/nb_data'
nb_data_path.mkdir(parents=True, exist_ok=True)
dh.to_csv_with_types(fangraphs, nb_data_path / 'fangraphs.csv')

In [36]:
# add the team name to the pf dataframe to compare with Fangraphs
pf = pf.merge(teams[['team_id', 'year', 'name']], left_on=['team_id', 'year'], right_on=['team_id', 'year'])
pf.head()

Unnamed: 0,team_id,year,pf,pf_half,name
0,ANA,2015,0.860724,0.930362,Angels
1,ANA,2016,0.910053,0.955026,Angels
2,ANA,2017,0.949176,0.974588,Angels
3,ANA,2018,0.968622,0.984311,Angels
4,ANA,2019,1.006353,1.003176,Angels


In [38]:
# add the 1yr PF from Fangraphs
pf_fg = pf.merge(fangraphs[['Season', 'Team', '1yr']],
         left_on=['year', 'name'],
         right_on=['Season', 'Team'],
         validate='one_to_one')
pf_fg.head()

Unnamed: 0,team_id,year,pf,pf_half,name,Season,Team,1yr
0,ANA,2015,0.860724,0.930362,Angels,2015,Angels,93
1,ANA,2016,0.910053,0.955026,Angels,2016,Angels,96
2,ANA,2017,0.949176,0.974588,Angels,2017,Angels,98
3,ANA,2018,0.968622,0.984311,Angels,2018,Angels,99
4,ANA,2019,1.006353,1.003176,Angels,2019,Angels,101
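
The `validate='one_to_one'` argument makes the merge fail loudly if a key is duplicated on either side, for example if a team name appeared twice in one season. A small sketch with a deliberately duplicated key (all values here are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'year': [2019], 'name': ['Red Sox'], 'pf_half': [1.014]})

# a deliberately duplicated key on the right-hand side
right = pd.DataFrame({'Season': [2019, 2019],
                      'Team': ['Red Sox', 'Red Sox'],
                      '1yr': [103, 103]})

try:
    left.merge(right, left_on=['year', 'name'], right_on=['Season', 'Team'],
               validate='one_to_one')
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print(f'merge rejected: {err}')
```

Without `validate`, the same merge would silently return two rows for the one Red Sox record.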


In [39]:
# compute the relative differences
pf_fg['pf_half'] *= 100  # to be on the same scale as Fangraphs
rel_diff = np.abs(1.0 - pf_fg['pf_half'] / pf_fg['1yr'])
rel_diff.max()

0.021251573577154792
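
The relative difference can be checked by hand for individual rows. The figures below are copied from the comparison table in this notebook (computed `pf_half` on the 100 scale and the FanGraphs `1yr` value). The dictionary names are made up for illustration.

```python
# (computed pf_half * 100, FanGraphs 1yr), copied from this notebook's tables
pairs = {'BAL 2015': (110.29517, 108),
         'BOS 2019': (101.449352, 103)}

rel_by_team = {}
for label, (computed, fangraphs_1yr) in pairs.items():
    rel_by_team[label] = abs(1.0 - computed / fangraphs_1yr)
    print(f'{label}: {rel_by_team[label]:.4f}')
# BAL 2015: 0.0213
# BOS 2019: 0.0151
```

The Baltimore figure matches the maximum relative difference reported above, and both values exceed 0.015.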

In [40]:
# examine the largest differences between what is computed here and Fangraphs
pf_fg.loc[rel_diff > 0.015].merge(diff,
    left_on=['team_id', 'year'],
    right_on = ['team_id', 'year'])

Unnamed: 0,team_id,year,pf,pf_half,name,Season,Team,1yr,park_id_game,game_id,park_id_home_team
0,BAL,2015,1.205903,110.29517,Orioles,2015,Orioles,108,STP01,BAL201505010,BAL12
1,BAL,2015,1.205903,110.29517,Orioles,2015,Orioles,108,STP01,BAL201505020,BAL12
2,BAL,2015,1.205903,110.29517,Orioles,2015,Orioles,108,STP01,BAL201505030,BAL12
3,BOS,2019,1.028987,101.449352,Red Sox,2019,Red Sox,103,LON01,BOS201906290,BOS07
4,BOS,2019,1.028987,101.449352,Red Sox,2019,Red Sox,103,LON01,BOS201906300,BOS07


These differences are for the games previously mentioned: Boston hosting the Yankees in London in 2019 and Baltimore hosting Tampa Bay in Tampa Bay in 2015.

In [41]:
# what does Fangraphs say about Boston?
pf_fg.query('team_id == "BOS" and year == 2019')

Unnamed: 0,team_id,year,pf,pf_half,name,Season,Team,1yr
24,BOS,2019,1.028987,101.449352,Red Sox,2019,Red Sox,103


Fangraphs reports the halved value (1yr) as 103.  This is the halved value we got when we included the London games with Fenway Park.  It appears Fangraphs accidentally included the London games with Fenway.

What does ESPN say about Boston?  
http://www.espn.com/mlb/stats/parkfactor/_/year/2019  

They have the non-halved value as 1.063 which exactly matches the value computed by adding the London games in with the Fenway Park games.  Both ESPN and Fangraphs made the same mistake.

In [42]:
# save the PF dataframe
pf = pf.round(2)
dh.to_csv_with_types(pf, data_dir / 'retrosheet/nb_data/pf.csv')

# Summary
Computing the Park Factor is relatively easy using the Retrosheet data; however, it is important to distinguish between being the "home team" (who bats last) and playing at the "home park".

In less than 1% of all games, the team batting last is not playing in its home park.  These games were removed from the data and the Park Factor was computed for each home park.

For the exceptionally high scoring games in London in 2019 it appears that both ESPN and Fangraphs added the London runs to Fenway Park, causing an incorrect Fenway Park PF.

Similarly, for Baltimore hosting Tampa Bay in Tampa Bay in 2015, both Fangraphs and ESPN added the Tampa Bay runs to Camden Yards, causing an incorrect Camden Yards PF.