# Sabermetrics: Computing Baseball Park Factors


🔴 First Draft

Sabermetrics is the original term for baseball analytics.  It is not necessary to know anything about Sabermetrics to understand this notebook.  However, an understanding of the rules of baseball will be helpful.

For baseball fans, an excellent introduction to Sabermetrics is: [Understanding Sabermetrics: Second Edition](https://www.amazon.com/Understanding-Sabermetrics-Introduction-Baseball-Statistics/dp/1476667667/).  This book is written by three professors of mathematics who love baseball.

Sabermetric blog posts and websites have evolved their own terminology, which uses some of the same terms as data science but with slightly different meanings.  This can be confusing.  For a data scientist learning about Sabermetrics, I recommend the above book and Wikipedia, as both of these sources use standard data science terminology.

On Wikipedia, see for example:
* https://en.wikipedia.org/wiki/Baseball_statistics
* https://en.wikipedia.org/wiki/Batting_park_factor

# Basic Park Factor

Each baseball park is different.  Some parks, such as Coors Field in Denver, are easy to score runs in, whereas others, such as Oracle Park in San Francisco, are difficult to score runs in.

In order to compare players who play in different parks, it would be helpful to account for how much their home park affects their performance.

The basic park factor formula is:

$$PF = {(RS_{home} + RA_{home}) / G_{home} \over (RS_{road} + RA_{road}) / G_{road}}$$

For each team:  
$RS_{home}$ is Runs Scored at home  
$RA_{home}$ is Runs Allowed at home  
$G_{home}$ is number of Games played at home  
$RS_{road}$ is Runs Scored on the road  
$RA_{road}$ is Runs Allowed on the road  
$G_{road}$ is number of Games played on the road  

Example: 2019 LA Dodgers   
$RS_{home}$ = 441  
$RA_{home}$ = 271  
$G_{home}$ = 81  
$RS_{road}$ = 445  
$RA_{road}$ = 342  
$G_{road}$ = 81  

Plugging in the above numbers gives a basic PF for runs of 0.9047 for Dodger Stadium in 2019.  Often this value is multiplied by 100 so that 100 represents league average instead of 1.0.

As a player plays only half of their games at home, the PF is sometimes "halved" as follows: $PF_{half} = (1+PF)/2$.  This gives a value of 0.952 for Dodger Stadium.  [Fangraphs](https://www.fangraphs.com/guts.aspx?type=pf&teamid=0&season=2019) shows the "halved" value as 0.96 (for 1yr PF by runs) for 2019, which is in close agreement.

A value less than 1.0 means that fewer runs than average are scored at Dodger Stadium.  This could be used to determine an equivalent ERA.  For example, Clayton Kershaw had an ERA of 3.02 in 2019.  Assuming half those games were at Dodger Stadium and using $PF_{half}$ means that his ERA in an average park would have been 3.02 / 0.952 = 3.17 without the pitching advantage of Dodger Stadium.
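The Dodger Stadium arithmetic above can be reproduced in a few lines.  This is a minimal sketch; the function name `basic_park_factor` is chosen here for illustration and is not part of the notebook's code.

```python
def basic_park_factor(rs_home, ra_home, g_home, rs_road, ra_road, g_road):
    """Basic PF: total runs per game at home divided by total runs per game on the road."""
    home_avg = (rs_home + ra_home) / g_home
    road_avg = (rs_road + ra_road) / g_road
    return home_avg / road_avg

# 2019 LA Dodgers totals from the example above
pf = basic_park_factor(441, 271, 81, 445, 342, 81)
pf_half = (1 + pf) / 2

# equivalent ERA in an average park, using the halved PF
era_adjusted = 3.02 / pf_half

print(f'pf: {pf:.4f}  pf_half: {pf_half:.3f}  adjusted ERA: {era_adjusted:.2f}')
```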

The Park Factor formula nicely factors out most of the qualities unique to the home team.  For example, an excellent scoring team would score more runs both home and away.  As RS is both in the numerator and the denominator, the team's above average run scoring ability is "canceled out", leaving just the contribution due to the park.  Similarly, an excellent pitching team will allow fewer runs both home and away, and as RA is in both the numerator and the denominator, this too "cancels out", leaving just the contribution due to the park.
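The "canceling out" can be illustrated with a toy model.  The sketch below assumes, purely for illustration, that a team's offense, its pitching, and the park each act as simple multipliers on a league-average run rate; real scoring is not this tidy, and the names and numbers are hypothetical.

```python
def runs_per_game(offense, pitching, park_effect, league_avg=4.5):
    """Toy model: total runs per game (scored + allowed) for one team in one park.

    offense and pitching are the team's multipliers relative to league average
    (offense > 1 scores more, pitching < 1 allows fewer); park_effect scales both.
    """
    runs_scored = league_avg * offense * park_effect
    runs_allowed = league_avg * pitching * park_effect
    return runs_scored + runs_allowed

park_effect = 1.10  # hypothetical hitter-friendly home park
ratios = []
for offense, pitching in [(1.0, 1.0), (1.3, 1.0), (1.0, 0.7), (1.3, 0.7)]:
    home = runs_per_game(offense, pitching, park_effect)
    road = runs_per_game(offense, pitching, park_effect=1.0)
    ratios.append(home / road)
    print(f'offense={offense} pitching={pitching} PF={home / road:.3f}')
```

Under these assumptions every team, whatever its strength, recovers the same ratio: the park effect.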

## Case Example: Fenway Park 2019
All park factors for all stadiums for all years will be calculated while paying particular attention to a specific example, Fenway Park in Boston in 2019.

The 2019 Fenway Park calculation is interesting because in 2019 Boston hosted the Yankees in London for two games.  See: https://en.wikipedia.org/wiki/MLB_London_Series. 

These 2 games were extremely high scoring, with the two teams scoring a total of 50 runs.  Even though Boston was the "home team", meaning they batted last, the runs were not scored at their home park of Fenway, but in London.

## Park Factor Refinements
There are several refinements that can be made.

**Home Team not playing in Home Park**  
The Park Factor was created to measure the effect of each park on baseball statistics.  "Home" in the PF formula doesn't mean "home team"; it means "home park".  Boston's home park is Fenway, so the values in the numerator are Runs Scored at Fenway, Runs Allowed at Fenway and Games played at Fenway.  If Boston happens to bat last at a baseball game in London, this should not affect the Fenway Park factor.

One way to account for the team batting last not playing at the home park is to remove all such games from the data.  As there are relatively few such games, this is the approach that will be taken in this notebook.
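Removing such games amounts to a simple filter.  The following is a sketch on toy data; the `home_team_id` column, the `home_park` lookup and the game ids are assumptions made for illustration (the notebook itself identifies the home park by counting games per park).

```python
import pandas as pd

# toy data: one row per game, illustrative ids only
games = pd.DataFrame({
    'game_id': ['G1', 'G2', 'G3'],
    'home_team_id': ['BOS', 'BOS', 'BOS'],
    'park_id': ['BOS07', 'LON01', 'LON01'],
})

# assumed lookup from team to its home park
home_park = {'BOS': 'BOS07'}

# keep only games where the team batting last played in its own park
at_home_park = games['park_id'] == games['home_team_id'].map(home_park)
games_at_home_park = games[at_home_park]
print(games_at_home_park)
```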

**Runs in Last Inning of Play**  
A team with a high winning percentage will often not bat in the bottom of the 9th at home.  This means that the team has only 24 outs to score runs at home, but 27 outs to score runs on the road.  This could artificially reduce the park factor for such a team.

Additionally, the strategy of the game changes in the 9th or later innings.  Roughly speaking, for the first 8 innings the strategy is to score the most runs possible while allowing the fewest runs possible.  After the 8th inning, however, the strategy is to play to end the game with a win, which is a bit different. For example, a home team in the bottom of the 9th with a runner on 3rd and fewer than two outs will see the defensive team play a very short outfield.  This is to reduce the chance of a sacrifice fly at the risk of giving up a double, but a double doesn't matter in this context because the game would be over.  This change in strategy has nothing to do with the park and yet the goal is to analyze the effect of the park on scoring.

One way to account for both the number of outs being different at home and on the road, and the strategy of the game being different in the 9th or later innings, is to just use the runs scored in the first 8 innings of each game.  Aside from rain shortened games which are not completed at a later date, which is rare, this means that each team gets exactly 24 outs to score as many runs as possible with no special strategy being employed.  This approach will be taken in a later notebook.
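As a rough sketch of how runs in the first 8 innings might be extracted, the function below sums a line-score string such as the `line_tx` field.  It assumes one character per inning, double-digit innings written in parentheses, and `x` for a half inning that was not played; these format details and the function name are assumptions and should be checked against the actual data.

```python
import re

def runs_first_n_innings(line_score, n=8):
    """Sum the runs in the first n innings of a line-score string.

    Assumes one character per inning, with double-digit innings written
    in parentheses, e.g. '01(10)2', and 'x' for a half inning not played.
    """
    runs = []
    for m in re.finditer(r'\((\d+)\)|(\d)|x', str(line_score), flags=re.IGNORECASE):
        multi, single = m.group(1), m.group(2)
        # 'x' matches neither group and counts as zero runs
        runs.append(int(multi or single or 0))
    return sum(runs[:n])

print(runs_first_n_innings('600001600'))
print(runs_first_n_innings('602630000'))
```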

**Road Games are not at Parks with PF = 1.0**  
Each team's opponent schedule is not uniform.  Teams in the same division play each other more than teams in other divisions.  Only a few interleague games are played.  The basic PF formula assumes that the average PF is 1.0 for all road games, but this is not the case.

One way to account for this is to use the basic PF formula to get a PF per park, and then adjust the runs scored on the road by using the PF appropriate for each road game.  The adjusted run total on the road can then be used to compute a new PF per park.  This process can be repeated.  The result is that each team's road schedule is taken into account when computing their home park factor.  This approach will be taken in a later notebook.
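The iteration described above might look something like the following.  This is only a sketch under stated assumptions: road runs are deflated by dividing by the full PF of the park they were scored in, the function name `iterate_park_factors` is hypothetical, and the three-team data is invented for illustration.

```python
import pandas as pd

def iterate_park_factors(home, road, n_iter=10):
    """Re-estimate PF repeatedly, deflating road runs by the PF of each road park.

    home: one row per team with columns team_id, park_id, rt, games
    road: one row per team per road park with columns team_id, park_id, rt, games
    """
    home = home.set_index('team_id')
    home_avg = home['rt'] / home['games']
    road_games = road.groupby('team_id')['games'].sum()

    # start with every park at 1.0, so the first pass reproduces the basic PF
    pf_by_park = pd.Series(1.0, index=home['park_id'].to_numpy())
    for _ in range(n_iter):
        adjusted_rt = road['rt'] / road['park_id'].map(pf_by_park)
        road_avg = adjusted_rt.groupby(road['team_id']).sum() / road_games
        pf_by_team = home_avg / road_avg
        pf_by_park = pd.Series(pf_by_team.to_numpy(),
                               index=home.loc[pf_by_team.index, 'park_id'].to_numpy())
    return pf_by_park

# invented three-team league
home = pd.DataFrame({'team_id': ['A', 'B', 'C'], 'park_id': ['PA', 'PB', 'PC'],
                     'rt': [900.0, 720.0, 810.0], 'games': [81, 81, 81]})
road = pd.DataFrame({'team_id': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'park_id': ['PB', 'PC', 'PA', 'PC', 'PA', 'PB'],
                     'rt': [360.0, 400.0, 450.0, 410.0, 460.0, 350.0],
                     'games': [40, 41, 40, 41, 41, 40]})

basic = iterate_park_factors(home, road, n_iter=1)
refined = iterate_park_factors(home, road, n_iter=20)
print(pd.DataFrame({'basic': basic, 'refined': refined}).round(3))
```

In practice the loop would be repeated until the values stop changing by more than some tolerance, rather than for a fixed number of passes.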

**DH and Interleague Games**  
As seen in an earlier notebook, the DH adds about 1/4 run per game per team.  The DH is always used in the American League Park and never used in the National League Park.  About 12\% of all games in recent years are interleague games, which means about 6\% are interleague games on the road.

For an NL team playing in an AL park, about 1/2 more runs would be scored for this road game.  For the AL team playing at home, the game is a normal game with the DH.

For an AL team playing in an NL park, about 1/2 fewer runs would be scored for this road game.  For the NL team playing at home, the game is a normal game without the DH.

One way to account for this is to ignore interleague games.  However, that is a fair amount of data to throw out given how few games there are in a season.  An alternative approach would be to keep the data but adjust the road total appropriately to account for the use of the DH.  At least one of these approaches will be taken in a later notebook.
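The second approach, adjusting road totals for the DH, could be sketched as below.  The column names `team_lg` and `park_lg` and the run values are hypothetical, and the 1/2 run per game figure is simply the estimate quoted above.

```python
import pandas as pd

DH_RUNS_PER_GAME = 0.5  # about 1/4 run per team, two teams per game

# toy road games, illustrative values only
road_games = pd.DataFrame({
    'team_lg': ['NL', 'NL', 'AL', 'AL'],
    'park_lg': ['NL', 'AL', 'AL', 'NL'],
    'rt': [9.0, 10.0, 8.0, 7.0],
})

# an NL team in an AL park plays with the DH: remove the extra runs
nl_in_al = (road_games['team_lg'] == 'NL') & (road_games['park_lg'] == 'AL')
# an AL team in an NL park plays without the DH: add the missing runs back
al_in_nl = (road_games['team_lg'] == 'AL') & (road_games['park_lg'] == 'NL')

road_games['rt_adj'] = (road_games['rt']
                        - DH_RUNS_PER_GAME * nl_in_al
                        + DH_RUNS_PER_GAME * al_in_nl)
print(road_games)
```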

**Park Factors for Singles, Doubles, Triples and Home Runs**  
The logic is the same as for computing runs scored, so this is easy to do.

**Park Factors for Righties and Lefties**  
This requires use of the play-by-play data and is more involved.

**Park Factors with 3 and 5 Year Moving Averages**  
As with any time series data that has noise, applying a moving average helps to remove the noise.  This is a simple step to apply after computing the PF for each year.
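A moving average of this kind can be sketched with a pandas rolling window.  The values below are the single-year basic PFs for the Angels computed later in this notebook; the column name `pf_3yr` is chosen here for illustration.

```python
import pandas as pd

pf_example = pd.DataFrame({
    'team_id': ['ANA'] * 5,
    'year': [2015, 2016, 2017, 2018, 2019],
    'pf': [0.860724, 0.910053, 0.949176, 0.968622, 1.006353],
})

# rolling windows depend on row order, so sort by year within team first
pf_example = pf_example.sort_values(['team_id', 'year'])
pf_example['pf_3yr'] = (
    pf_example.groupby('team_id')['pf']
    .transform(lambda s: s.rolling(window=3).mean())
)
print(pf_example)
```

The first two years have no 3-year value because the window is incomplete.  Grouping by `team_id` assumes the team did not change parks within the window.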

# Strategy to Find Park Factor
Note:  Pandas operates much faster when using vectorized (i.e. column) operations.  That is, row-by-row use of DataFrame.apply() should generally be avoided.  Unfortunately this makes the code slightly harder to read. Generous use of comments should help.
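As a small illustration of the difference, both lines below compute the same total runs column; the values are toy numbers and the variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({'rs': [320.0, 69.0, 9.0], 'ra': [298.0, 46.0, 5.0]})

# row-by-row: calls a Python function once per row
rt_apply = df.apply(lambda row: row['rs'] + row['ra'], axis=1)

# vectorized: a single operation on whole columns
rt_vectorized = df['rs'] + df['ra']

print(rt_apply.equals(rt_vectorized))
```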

The basic strategy is:
* Compute a **home_parks** DataFrame:
  * one row per team per year
  * has: home_park_id, RS, RA and number of games played at home park
* Compute a **road_parks** DataFrame:
  * one row per team per year per road park (this is more rows than in the home_parks DataFrame)
  * has: road_park_id, RS, RA and the number of road games played at that park
* Compute a **road_totals** DataFrame:
  * sum **road_parks** over all parks to get the road total per team per year
  * if accounting for each team's road schedule, then the road totals would be adjusted by the park factor for each park the road game is played in
* Compute the PF per park using the home_parks and road_totals DataFrames
* Web Scrape the Park Factors from Fangraphs and ESPN and compare.

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re
from scipy.stats import linregress

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [3]:
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100 # increase dpi, will make figures larger and clearer

In [4]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [5]:
pd.set_option("display.max_columns", 100)

In [6]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

## Understanding the team_game CSV file
The output of the cwgame parser was made [tidy](https://en.wikipedia.org/wiki/Tidy_data) and this is one of two files created.

The team_game file has one team per game. Since each game has two teams, there are two rows per game.  The team_id field identifies the team that the statistics are for.  The opponent_team_id identifies the opponent.

The fields which uniquely identify a record are: game_id, team_id.
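Whether a set of columns uniquely identifies a row can be checked with `DataFrame.duplicated`.  The sketch below uses toy rows shaped like team_game (two rows per game); the helper name `is_unique_key` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'game_id': ['G1', 'G1', 'G2', 'G2'],
    'team_id': ['BOS', 'NYA', 'BOS', 'NYA'],
    'year': [2019, 2019, 2019, 2019],
})

def is_unique_key(df, cols):
    """True if no two rows share the same values in cols."""
    return not df.duplicated(subset=cols).any()

print(is_unique_key(toy, ['game_id', 'team_id']))
print(is_unique_key(toy, ['team_id', 'year']))
```

With two rows per game, the pair (game_id, team_id) is unique, whereas (team_id, year) repeats for every game a team plays in a season.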

In [7]:
# select a subset of the available fields for team_game
cols = ['game_id', 'year', 'at_home', 'team_id', 'opponent_team_id', 'r', 'h', 'e', 'lob', 'line_tx',
        'ab', 'double', 'triple', 'hr', 'rbi', 'sh', 'sf', 'hbp', 'bb', 'ibb', 'so', 'sb', 'cs',
        'gidp', 'xi', 'er', 'ter', 'wp', 'bk', 'po', 'a', 'pb', 'dp', 'tp', 'game_start']
team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz', usecols=cols)

In [8]:
# examine 1 game
team_game_example = team_game.query('game_id == "BOS201906290"')
team_game_example

Unnamed: 0,game_id,at_home,team_id,opponent_team_id,r,h,e,lob,line_tx,ab,double,triple,hr,rbi,sh,sf,hbp,bb,ibb,so,sb,cs,gidp,xi,er,ter,wp,bk,po,a,pb,dp,tp,game_start,year
260466,BOS201906290,True,BOS,NYA,13,18,0,10,600001600,43,3,0,3,12,0,1,0,6,0,5,0,0,1,0,17,17,2,0,27,8,0,1,0,2019-06-29 18:10:00,2019
260467,BOS201906290,False,NYA,BOS,17,19,0,7,602630000,45,7,0,3,17,0,0,0,6,0,11,0,0,1,0,13,13,2,0,27,11,0,1,0,2019-06-29 18:10:00,2019


We see the Red Sox (BOS) played the Yankees (NYA).  The Red Sox were the home team (meaning they batted last) and scored 13 runs whereas the Yankees were the road team and they scored 17 runs.

## Understanding the game CSV file
The output of the cwgame parser was made tidy and this is one of two files created.

The game file has information specific to a game, such as the baseball park it was played in.

In [9]:
# park_id is an attribute of a game, so it is in the game CSV file
cols = ['game_id', 'park_id']
game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz', usecols=cols)

In [10]:
game_example = game.query('game_id == "BOS201906290"')
game_example

Unnamed: 0,game_id,park_id
130233,BOS201906290,LON01


In [11]:
# join with parks file to get info about parks
parks = dh.from_csv_with_types(retrosheet_data / 'parks.csv')

game_example.merge(parks, left_on=['park_id'], right_on=['park_id'])

Unnamed: 0,game_id,park_id,name,aka,city,state,start,end,league,notes
0,BOS201906290,LON01,London Stadium,,London,UK,2019-06-29,2019-06-30,AL,BOS


We see that the game was played in London.

In [12]:
# join with teams file to get info about teams
teams = dh.from_csv_with_types(retrosheet_data / 'teams.csv')

cols = ['game_id', 'year', 'at_home', 'team_id', 'opponent_team_id', 'r']
team_game_example[cols].merge(teams, left_on=['team_id', 'year'], right_on=['team_id', 'year'])

Unnamed: 0,game_id,year,at_home,team_id,opponent_team_id,r,lg_id,city,name
0,BOS201906290,2019,True,BOS,NYA,13,A,Boston,Red Sox
1,BOS201906290,2019,False,NYA,BOS,17,A,New York,Yankees


# Create home_parks DataFrame

In [13]:
# TEMPORARILY use from 2015 on
# Focus on just the fields needed
cols = ['game_id', 'year', 'at_home', 'team_id', 'opponent_team_id', 'r']
tg = team_game[cols].query('year >= 2015')

# bring in the park_id field from game
tg_park = tg.merge(game)
tg_park.head(2)

Unnamed: 0,game_id,year,at_home,team_id,opponent_team_id,r,park_id
0,ANA201504100,2015,True,ANA,KCA,2,ANA01
1,ANA201504100,2015,False,KCA,ANA,4,ANA01


In [14]:
# compute RS per team per year per park
rs = tg_park.groupby(['team_id', 'year', 'park_id']).agg(
    rs=('r', 'sum'), games=('r', 'count')).reset_index()
rs.head(3)

Unnamed: 0,team_id,year,park_id,rs,games
0,ANA,2015,ANA01,320.0,81
1,ANA,2015,ARL02,69.0,10
2,ANA,2015,BAL12,9.0,3


In [15]:
# Compute RA per team per year per park
ra = tg_park.groupby(['opponent_team_id', 'year', 'park_id']).agg(
    ra=('r', 'sum'), games=('r', 'count')).reset_index()
ra.head(3)

Unnamed: 0,opponent_team_id,year,park_id,ra,games
0,ANA,2015,ANA01,298.0,81
1,ANA,2015,ARL02,46.0,10
2,ANA,2015,BAL12,5.0,3


RS and RA have now been computed for each team for each park.

In [16]:
# rename the column to allow for join
ra = ra.rename(columns={'opponent_team_id':'team_id'})

# join to have RS and RA in the same dataframe
rt = rs.merge(ra, 
              left_on=['team_id', 'year', 'park_id'], 
              right_on=['team_id', 'year', 'park_id'],
              suffixes=('_rs', '_ra'))
rt.head()

Unnamed: 0,team_id,year,park_id,rs,games_rs,ra,games_ra
0,ANA,2015,ANA01,320.0,81,298.0,81
1,ANA,2015,ARL02,69.0,10,46.0,10
2,ANA,2015,BAL12,9.0,3,5.0,3
3,ANA,2015,BOS07,16.0,3,19.0,3
4,ANA,2015,CHI12,4.0,3,14.0,3


In [17]:
# spot check with Boston
rt.query('team_id == "BOS" and year==2019')[['rs', 'ra']].sum()

rs    901.0
ra    828.0
dtype: float64

In [18]:
# total runs scored: home and away
team_game.query('team_id == "BOS" and year==2019')['r'].sum()

901

In [19]:
# total runs allowed: home and away
team_game.query('opponent_team_id == "BOS" and year==2019')['r'].sum()

828

In [20]:
# and check with Lahman to see its RS and RA for Boston
lahman_teams = dh.from_csv_with_types(lahman_data / 'teams.csv')
lahman_teams.query('team_id == "BOS" and year == 2019')[['r', 'ra']]

Unnamed: 0,r,ra
2898,901,828


The spot check for Boston looks good.

In [21]:
# the number of games in which runs scored must equal the number of games in which runs were allowed
(rt['games_rs'] == rt['games_ra']).all()

True

In [22]:
# when summed over all teams, the runs scored should equal the runs allowed
rt['rs'].sum() == rt['ra'].sum()

True

In [23]:
# Add a rt column and just use games instead of games_rs and games_ra
rt['rt'] = rt['rs'] + rt['ra']
rt['games'] = rt['games_rs']
rt = rt.drop(columns=['games_rs', 'games_ra'])
rt.head(3)

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games
0,ANA,2015,ANA01,320.0,298.0,618.0,81
1,ANA,2015,ARL02,69.0,46.0,115.0,10
2,ANA,2015,BAL12,9.0,5.0,14.0,3


There is an easy way to identify which park is the home park.  Rank each park (by team_id, by year) with respect to the number of games played.  The park with the most games (rank == 1), is the home park.

In [24]:
# per team_id per year, rank by the number of games
rt['rank'] = rt.groupby(['team_id', 'year'])['games'].rank(method='first', ascending=False)
rt.head(3)

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games,rank
0,ANA,2015,ANA01,320.0,298.0,618.0,81,1.0
1,ANA,2015,ARL02,69.0,46.0,115.0,10,2.0
2,ANA,2015,BAL12,9.0,5.0,14.0,3,9.0


In [25]:
# rank == 1 identifies each team's home park
home_parks = rt.query('rank == 1').copy()

# compute the average total runs at home per game
home_parks['r_avg'] = home_parks['rt'] / home_parks['games']

# rank no longer needed
home_parks = home_parks.drop(columns=['rank'])
home_parks.head(7)

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games,r_avg
0,ANA,2015,ANA01,320.0,298.0,618.0,81,7.62963
19,ANA,2016,ANA01,337.0,351.0,688.0,81,8.493827
38,ANA,2017,ANA01,356.0,335.0,691.0,81,8.530864
57,ANA,2018,ANA01,355.0,355.0,710.0,81,8.765432
76,ANA,2019,ANA01,385.0,411.0,796.0,79,10.075949
108,ARI,2015,PHO01,366.0,372.0,738.0,81,9.111111
127,ARI,2016,PHO01,411.0,493.0,904.0,81,11.160494


In [26]:
# example home parks for Boston
home_parks.query('team_id == "BOS" and year==2019')

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games,r_avg
461,BOS,2019,BOS07,431.0,410.0,841.0,79,10.64557


We see that 79 games were played at Fenway and 841 runs in total were scored at Fenway.

# Create road_parks DataFrame

In [27]:
# rank not needed after query
road_parks = rt.query('rank != 1').copy()
road_parks = road_parks.drop(columns=['rank'])
road_parks.head(3)

Unnamed: 0,team_id,year,park_id,rs,ra,rt,games
1,ANA,2015,ARL02,69.0,46.0,115.0,10
2,ANA,2015,BAL12,9.0,5.0,14.0,3
3,ANA,2015,BOS07,16.0,19.0,35.0,3


In [28]:
# there are some road parks that are not anyone's home park
missing_parks = list(set(road_parks['park_id'].unique()) -set(home_parks['park_id'].unique()))
missing_parks

['FTB01', 'MNT01', 'WIL02', 'OMA01', 'LON01', 'SJU01']

In [29]:
# join with parks to get more info about these
missing_parks = pd.Series(missing_parks, name='park_id')
parks.merge(missing_parks)

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,FTB01,Fort Bragg Field,,Fort Bragg,NC,2016-07-03,2016-07-03,NL,ATL:1 game
1,LON01,London Stadium,,London,UK,2019-06-29,2019-06-30,AL,BOS
2,MNT01,Estadio Monterrey,,Monterrey,MX,1996-08-16,1999-04-04,NL,SDN:8/16&8/17&8/18/1996; 4/4/1999
3,OMA01,TD Ameritrade Park,,Omaha,NE,2019-06-13,2019-06-13,KC1,
4,SJU01,Estadio Hiram Bithorn,,San Juan,PR,2001-04-01,2010-06-30,NL,"TOR: 4/1/2001; MON:4/11-20,6/3-8,9/5-11/2003"
5,WIL02,BB&T Ballpark at Bowman Field,,Williamsport,PA,2017-08-20,2017-08-20,NL,PIT


## Remove Games Played in Unusual Places
The 50 runs scored in the 2 game series in London suggests that the London park is not a typical baseball park.  There is too little data to say anything about such parks.  Remove them from the data.

In [30]:
criteria = road_parks['park_id'].isin(missing_parks)

# number of games that will be removed from entire dataset
road_parks.loc[criteria, 'games'].sum()

30

In [31]:
# number of rows (one per team per year per road park) in the dataset
len(road_parks)

2718

In [32]:
# drop the road games played at a park that was not the home team's primary park
idx = road_parks.loc[criteria].index
road_parks = road_parks.drop(idx)

# Create road_totals DataFrame

In [33]:
road_totals = road_parks.groupby(['team_id', 'year']).agg(rt=('rt', 'sum'), games=('games', 'sum'))
road_totals['r_avg'] = road_totals['rt'] / road_totals['games']
road_totals.head()

team_id,year,rt,games,r_avg
ANA,2015,718.0,81,8.864198
ANA,2016,756.0,81,9.333333
ANA,2017,728.0,81,8.987654
ANA,2018,733.0,81,9.049383
ANA,2019,811.0,81,10.012346


In [34]:
# ensure the indexes are the same before dividing
home_parks = home_parks.set_index(['team_id', 'year'])
pf = home_parks['r_avg'] / road_totals['r_avg']
pf = pf.to_frame().reset_index()
pf = pf.rename(columns={'r_avg':'pf'})
pf['pf_half'] = (1+pf['pf']) / 2
home_parks = home_parks.reset_index()
pf.head()

Unnamed: 0,team_id,year,pf,pf_half
0,ANA,2015,0.860724,0.930362
1,ANA,2016,0.910053,0.955026
2,ANA,2017,0.949176,0.974588
3,ANA,2018,0.968622,0.984311
4,ANA,2019,1.006353,1.003176


In [35]:
pf.query('team_id == "BOS" and year==2019').round(3)

Unnamed: 0,team_id,year,pf,pf_half
24,BOS,2019,1.029,1.014


We see that the basic PF for Fenway Park is 1.029 rounded to 3 decimal places.

In [36]:
# Spot Check using ESPN home/away splits for Boston 2019
# http://www.espn.com/mlb/stats/team/_/stat/pitching/split/33
# home: 81 games 452 RS  439 RA -- includes London!
# away: 81 games 449 RS  389 RA
# London Series: https://en.wikipedia.org/wiki/MLB_London_Series
# London (as home team) 2 games: 21 RS 29 RA
fenway_r_avg = (452 + 439 - 21 - 29) / 79
away_r_avg = (449 + 389) / 81
pf_fenway_2019 = fenway_r_avg / away_r_avg
print(f'pf: {pf_fenway_2019:.3f} pf_half:{(1+pf_fenway_2019)/2:.3f}')

pf: 1.029 pf_half:1.014


The value calculated by hand exactly matches the code which computes the PF for every team for every year.

What value do we get if we mistakenly include the high scoring London games in the total for Fenway Park?

In [37]:
fenway_london_r_avg = (452+439) / 81
away_r_avg = (449 + 389) / 81
pf_fenway_london_2019 = fenway_london_r_avg / away_r_avg
print(f'pf: {pf_fenway_london_2019:.3f} pf_half:{(1+pf_fenway_london_2019)/2:.3f}')

pf: 1.063 pf_half:1.032


The PF is considerably higher than it should be.

# Webscrape FanGraphs for PF

In [38]:
import requests
from io import StringIO
from bs4 import BeautifulSoup

In [39]:
# read the parks factor table on the fangraphs website
data = []
for year in range(2015, 2020):
    url = f'https://www.fangraphs.com/guts.aspx?type=pf&season={year}&teamid=0'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    
    table = soup.find('table', class_='rgMasterTable')
    
    header = table.find('thead')
    cols = [col.text for col in header.find_all('th')]
    
    body = table.find('tbody')
    for row in body.find_all('tr'):
        data.append([col.text for col in row.find_all('td')])

In [40]:
fg = pd.DataFrame(data, columns = cols)

# change datatypes from string to int
for col in fg.columns:
    if col != 'Team':
        fg[col] = fg[col].astype('int16')

fg.head()

Unnamed: 0,Season,Team,Basic (5yr),3yr,1yr,1B,2B,3B,HR,SO,BB,GB,FB,LD,IFFB,FIP
0,2015,Angels,97,95,93,100,96,88,98,102,97,101,100,98,100,98
1,2015,Orioles,101,101,108,101,96,87,106,98,100,101,102,100,100,103
2,2015,Red Sox,104,107,109,103,112,103,95,99,99,102,97,103,101,98
3,2015,White Sox,99,98,95,98,95,93,105,102,103,98,101,98,105,102
4,2015,Indians,102,106,112,101,106,83,102,100,100,101,97,101,92,100


In [41]:
# add the team name to the pf dataframe to compare with Fangraphs
pf = pf.merge(teams[['team_id', 'year', 'name']], left_on=['team_id', 'year'], right_on=['team_id', 'year'])
pf.head()

Unnamed: 0,team_id,year,pf,pf_half,name
0,ANA,2015,0.860724,0.930362,Angels
1,ANA,2016,0.910053,0.955026,Angels
2,ANA,2017,0.949176,0.974588,Angels
3,ANA,2018,0.968622,0.984311,Angels
4,ANA,2019,1.006353,1.003176,Angels


In [42]:
# add the 1yr PF from Fangraphs
pf_fg = pf.merge(fg[['Season', 'Team', '1yr']],
         left_on=['year', 'name'],
         right_on=['Season', 'Team'],
         validate='one_to_one')
pf_fg.head()

Unnamed: 0,team_id,year,pf,pf_half,name,Season,Team,1yr
0,ANA,2015,0.860724,0.930362,Angels,2015,Angels,93
1,ANA,2016,0.910053,0.955026,Angels,2016,Angels,96
2,ANA,2017,0.949176,0.974588,Angels,2017,Angels,98
3,ANA,2018,0.968622,0.984311,Angels,2018,Angels,99
4,ANA,2019,1.006353,1.003176,Angels,2019,Angels,101


In [43]:
# compute the maximum relative difference
pf_fg['pf_half'] *= 100  # to be on the same scale as Fangraphs
rel_diff = np.abs(1.0 - pf_fg['pf_half'] / pf_fg['1yr'])
rel_diff.max()

0.031354208675365314

In [44]:
pf_fg.loc[rel_diff.idxmax()].to_frame().T

Unnamed: 0,team_id,year,pf,pf_half,name,Season,Team,1yr
15,BAL,2015,1.22773,111.386,Orioles,2015,Orioles,108


So the biggest difference with Fangraphs was for the Orioles in 2015.  This is likely because Fangraphs adjusts the basic PF in several ways.  These adjustments will be made in subsequent notebooks.

In [45]:
# what does Fangraphs say about Boston?
pf_fg.query('team_id == "BOS" and year == 2019')

Unnamed: 0,team_id,year,pf,pf_half,name,Season,Team,1yr
24,BOS,2019,1.028987,101.449352,Red Sox,2019,Red Sox,103


Fangraphs reports the halved value (1yr) as 103.  This matches the halved value we got when we included the London games with Fenway Park (1.032, or 103 on the Fangraphs scale).  Either Fangraphs accidentally included the London games with Fenway, or other PF adjustments account for the difference.  Later notebooks will apply other adjustments to the PF to see what happens.

What does ESPN say about Boston?  
http://www.espn.com/mlb/stats/parkfactor/_/year/2019  

They have the non-halved value as 1.063 which exactly matches the value computed by adding the London games in with the Fenway Park games.  Perhaps both ESPN and Fangraphs made this same mistake.

# Summary
Computing the Park Factor is relatively easy using the Retrosheet data; however, it is important to distinguish between being the "home team" and playing at the "home park".  For the exceptionally high-scoring games in London, for which Boston was the home team but did not play at home, it appears that both ESPN and Fangraphs computed the Fenway Park Factor incorrectly.

In [51]:
# The park factors for all 30 teams for the last 5 years are:
pd.set_option("display.max_rows", 150)
pf

Unnamed: 0,team_id,year,pf,pf_half,name
0,ANA,2015,0.860724,0.930362,Angels
1,ANA,2016,0.910053,0.955026,Angels
2,ANA,2017,0.949176,0.974588,Angels
3,ANA,2018,0.968622,0.984311,Angels
4,ANA,2019,1.006353,1.003176,Angels
5,ARI,2015,1.061871,1.030935,Diamondbacks
6,ARI,2016,1.224932,1.112466,Diamondbacks
7,ARI,2017,1.202096,1.101048,Diamondbacks
8,ARI,2018,1.056923,1.028462,Diamondbacks
9,ARI,2019,0.977128,0.988564,Diamondbacks
