# Computing Baseball Park Factors<br>Home Team Not Playing in Home Park

The original term for baseball analytics is Sabermetrics.  It is not necessary to know anything about Sabermetrics to understand this notebook.  However, understanding the rules of baseball is helpful.

For baseball fans, an excellent introduction to Sabermetrics is: [Understanding Sabermetrics: Second Edition](https://www.amazon.com/Understanding-Sabermetrics-Introduction-Baseball-Statistics/dp/1476667667/).  The three authors are all professors of mathematics who love baseball.

Sabermetric blog posts and websites have evolved a specialized terminology that uses some of the same terms as data science but with slightly different meanings. For a data scientist learning about Sabermetrics, I recommend Wikipedia and the above book written by mathematics professors, as both of these sources use standard data science terminology.

On Wikipedia, see for example:
* https://en.wikipedia.org/wiki/Baseball_statistics
* https://en.wikipedia.org/wiki/Batting_park_factor

This notebook computes the park factor (PF) for each home park and shows that ESPN and Fangraphs computed the PF for Fenway Park in 2019 incorrectly.  The error comes from counting the runs from the high-scoring London games as if they were scored at Fenway Park.  A similar error occurred in 2015, when Baltimore was forced to play "home games" on the road.  Those runs were mistakenly counted by ESPN and Fangraphs as if they were scored at Camden Yards.

# Basic Park Factor

Each baseball park is different.  Some parks, such as Coors Field in Denver, are easy to score runs in, whereas others, such as Oracle Park in San Francisco, are difficult to score runs in.

In order to compare players who play in different parks, it is helpful to account for how much their home park affects their performance.

The basic park factor formula is:

$$PF = \frac{(RS_{home} + RA_{home}) / G_{home}}{(RS_{road} + RA_{road}) / G_{road}}$$

For each team:  
$RS_{home}$ is Runs Scored at home  
$RA_{home}$ is Runs Allowed at home  
$G_{home}$ is number of Games played at home  
$RS_{road}$ is Runs Scored on the road  
$RA_{road}$ is Runs Allowed on the road  
$G_{road}$ is number of Games played on the road  

Example: 2019 LA Dodgers   
$RS_{home}$ = 441  
$RA_{home}$ = 271  
$G_{home}$ = 81  
$RS_{road}$ = 445  
$RA_{road}$ = 342  
$G_{road}$ = 81  

Plugging in the above numbers gives a basic PF for runs for Dodger Stadium in 2019 as 0.9047.  Often this value is multiplied by 100 so that 100 represents league average instead of 1.0.
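The Dodger Stadium arithmetic can be reproduced with a few lines of Python. This is a minimal sketch; the function name `basic_pf` is illustrative and is not part of this notebook's helper code.

```python
def basic_pf(rs_home, ra_home, g_home, rs_road, ra_road, g_road):
    """Basic park factor: runs per game at home divided by runs per game on the road."""
    home_avg = (rs_home + ra_home) / g_home
    road_avg = (rs_road + ra_road) / g_road
    return home_avg / road_avg


# 2019 LA Dodgers, using the values listed in the example
pf_dodgers = basic_pf(441, 271, 81, 445, 342, 81)
print(f'{pf_dodgers:.4f}')  # 0.9047
```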

As a player only plays half of their games at home, the PF is sometimes "halved" as follows: $PF_{half} = (1+PF)/2$.  This gives a value of 0.952 for Dodger Stadium.  [Fangraphs](https://www.fangraphs.com/guts.aspx?type=pf&teamid=0&season=2019) shows the "halved" value as 0.96 (for 1yr PF by runs) for 2019, which is in close agreement.

A value less than 1.0 means that fewer runs than average are scored at Dodger Stadium.  This could be used to determine an equivalent ERA.  For example, Clayton Kershaw had an ERA of 3.02 in 2019.  Assuming half those games were at Dodger Stadium and using $PF_{half}$ means that his ERA in an average park would have been 3.02 / 0.952 = 3.17 without the pitching advantage of Dodger Stadium.
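The halving step and the ERA adjustment can be checked the same way. This sketch only restates the arithmetic from the text; the variable names are illustrative.

```python
pf = 0.9047                 # basic PF for Dodger Stadium in 2019, from the text
pf_half = (1 + pf) / 2      # a player plays about half of their games at home

era = 3.02                  # Clayton Kershaw's 2019 ERA, from the text
era_neutral = era / pf_half # equivalent ERA in an average park

print(f'pf_half: {pf_half:.3f}  park-neutral ERA: {era_neutral:.2f}')
```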

The Park Factor formula nicely factors out most of the qualities unique to the home team.  For example, an excellent scoring team would score more runs both home and away.  As RS is both in the numerator and the denominator, the team's above average run scoring ability is "canceled out", leaving just the contribution due to the park.  Similarly, an excellent pitching team will allow fewer runs both home and away.  As RA is in both the numerator and the denominator, this also "cancels out", leaving just the contribution due to the park.
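The cancellation can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is not from the source: the home park multiplies all scoring by a fixed `park_effect`, and the road parks are neutral on average. The function `simulated_pf` and the scoring rates are hypothetical.

```python
def simulated_pf(park_effect, rs_rate, ra_rate, games=81):
    """PF under a hypothetical model where the home park scales all scoring."""
    rs_home = park_effect * rs_rate * games
    ra_home = park_effect * ra_rate * games
    rs_road = rs_rate * games   # assumed: road parks are neutral on average
    ra_road = ra_rate * games
    return ((rs_home + ra_home) / games) / ((rs_road + ra_road) / games)


# a strong team and a weak team playing in the same park
strong = simulated_pf(1.10, rs_rate=5.8, ra_rate=3.6)
weak = simulated_pf(1.10, rs_rate=3.9, ra_rate=5.4)
print(f'{strong:.3f} {weak:.3f}')
```

Under this model both teams produce the same PF, because the team's scoring and pitching rates appear in both the numerator and the denominator.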

## Case Example: Fenway Park 2019
All park factors for all stadiums for several years will be calculated while paying particular attention to a specific example, Fenway Park in Boston in 2019.

The 2019 Fenway Park calculation is interesting because in 2019 Boston hosted the Yankees in London for two games.  See: https://en.wikipedia.org/wiki/MLB_London_Series. 

These 2 games were extremely high scoring, with the two teams scoring a total of 50 runs.  Even though Boston was the "home team", meaning they batted last, the runs were not scored at their home park of Fenway, but in London.

## Home Team not playing in Home Park
The Park Factor was created to measure the effect of each park on baseball statistics.  

"Home" in the PF formula doesn't mean "home team"; it means "home park".  Boston's home park is Fenway, so the values in the numerator are Runs Scored at Fenway, Runs Allowed at Fenway and Games played at Fenway.  If Boston happens to bat last at a baseball game in London, this should not affect the Fenway Park PF.

As it is quite rare for the team batting last to not be playing in their home park, it is sufficient to remove all such games prior to computing each team's Park Factor.

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re

In [2]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [3]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

## Read in the Data
Reading in the data up front makes the code clearer, but may use more memory. By only selecting the columns that are needed, much less memory is used as these are very wide csv files.

In [4]:
cols = ['game_id', 'year', 'bat_last', 'team_id', 'opponent_team_id', 'r']
team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz', usecols=cols)

In [5]:
cols = ['game_id', 'park_id', 'game_start']
game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz', usecols=cols)

In [6]:
parks = dh.from_csv_with_types(retrosheet_data / 'parks.csv')

In [7]:
teams = dh.from_csv_with_types(retrosheet_data / 'teams.csv')

In [8]:
# for now, focus on 2015 onward
team_game = team_game.query('year >= 2015')
game['year'] = game['game_start'].dt.year
game = game.query('year >= 2015')
game = game.drop(columns='game_start')

## Understanding the team_game CSV file
The team_game csv file is one of two files that were created when the output of the cwgame parser was made [tidy](https://en.wikipedia.org/wiki/Tidy_data).

The team_game file has one team per game. Since each game has two teams, there are two rows per game. The team_id field identifies the team that the statistics are for. The opponent_team_id identifies the opponent.

The fields which uniquely identify a record are: team_id, game_id
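The layout described here can be sketched with a toy DataFrame. The two rows use the London game values shown in this notebook; the frame `toy` and the checks are illustrative and are not part of the wrangling code.

```python
import pandas as pd

# one game, two rows: one per team
toy = pd.DataFrame({
    'game_id': ['BOS201906290', 'BOS201906290'],
    'team_id': ['BOS', 'NYA'],
    'opponent_team_id': ['NYA', 'BOS'],
    'bat_last': [True, False],
    'r': [13, 17],
})

# (team_id, game_id) should identify each row uniquely
key_is_unique = not toy.duplicated(subset=['team_id', 'game_id']).any()

# every game should contribute exactly two rows
rows_per_game = toy.groupby('game_id').size()

print(key_is_unique, rows_per_game.tolist())
```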

In [9]:
# examine 1 game (two teams)
team_game.query('game_id == "BOS201906290"')

Unnamed: 0,game_id,bat_last,team_id,opponent_team_id,r,year
260466,BOS201906290,True,BOS,NYA,13,2019
260467,BOS201906290,False,NYA,BOS,17,2019


We see the Red Sox (BOS) played the Yankees (NYA).  The Red Sox batted last and scored 13 runs whereas the Yankees batted first and scored 17 runs.

## Understanding the game CSV file
The game csv file is one of two files that were created when the output of the cwgame parser was made tidy.

The game file has one row per game, with information such as which baseball park the game was played in.

In [10]:
game.query('game_id == "BOS201906290"')

Unnamed: 0,game_id,park_id,year
130233,BOS201906290,LON01,2019


In [11]:
# query parks to get more info about LON01
parks.query('park_id == "LON01"')

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
111,LON01,London Stadium,,London,UK,2019-06-29,2019-06-30,AL,BOS


In [12]:
# query teams to get more info about the two teams
two_teams = ['BOS', 'NYA']
teams.query('year == 2019 and team_id in @two_teams')

Unnamed: 0,team_id,year,lg_id,city,name
2494,BOS,2019,A,Boston,Red Sox
2501,NYA,2019,A,New York,Yankees


# Data Processing
The PF will be computed per team, per year, per park.

The rationale for removing games in which the team batting last is not playing at home is:
* these games shouldn't count towards the home park because they are not played in the home park
* these games are arguably not "road games", as the specified team batted last
* fewer than 1% of all games meet this criterion

DataFrame Indexes
* usually the best index is the field(s) that uniquely identify a row in that dataframe
* most of the DataFrames used here are identified by (team_id, year), or equivalently (park_id, year), or equivalently (team_id, year, park_id)

There are several data processing steps.

Each of the following functions represents a particular data transformation.  Functional composition is then used to apply multiple data transformations to the original data.

For example,  
`home_parks = create_home_parks(create_tg_parks(team_game, game))`.

Using functional composition in a single line of code can be difficult to read, so multiple lines are used:  
`tg_parks = create_tg_parks(team_game, game)`  
`home_parks = create_home_parks(tg_parks)`
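As a sketch of the alternatives, the same chain of transformations can be written as a nested call, as separate steps, or with `DataFrame.pipe`. The two step functions and the toy data are hypothetical stand-ins for the notebook's transformation functions; `pipe` is not used elsewhere in this notebook.

```python
import pandas as pd


def add_total(df):
    """Hypothetical step 1: add total runs (scored plus allowed)."""
    return df.assign(rt=df['rs'] + df['ra'])


def add_average(df):
    """Hypothetical step 2: add average total runs per game."""
    return df.assign(r_avg=df['rt'] / df['games'])


toy = pd.DataFrame({'games': [81], 'rs': [441], 'ra': [271]})

# nested composition in a single line
nested = add_average(add_total(toy))

# the same composition written as separate steps
with_total = add_total(toy)
stepwise = add_average(with_total)

# the same composition using DataFrame.pipe
piped = toy.pipe(add_total).pipe(add_average)
```

All three forms produce the same DataFrame; the stepwise form has the advantage that each intermediate result can be inspected.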

In [13]:
def create_tg_parks(team_game, game):
    """Create minimal team_game dataframe with park_id."""

    cols = ['team_id', 'year', 'park_id', 'game_id', 'bat_last', 'r', 'opponent_team_id']
    tg_parks = team_game.merge(game)[cols]

    return tg_parks.set_index(['team_id', 'year', 'park_id'])

In [14]:
tg_parks = create_tg_parks(team_game, game)
tg_parks.head(4)

team_id,year,park_id,game_id,bat_last,r,opponent_team_id
ANA,2015,ANA01,ANA201504100,True,2,KCA
KCA,2015,ANA01,ANA201504100,False,4,ANA
ANA,2015,ANA01,ANA201504110,True,4,KCA
KCA,2015,ANA01,ANA201504110,False,6,ANA


In [15]:
def create_home_parks(tg_parks):
    """Create minimal home_parks dataframe which has home park_id per team per year."""

    hp = tg_parks.groupby(['team_id', 'year', 'park_id']).agg(games=('game_id', 'count'))

    # rank number of games per team per year
    hp['rank'] = hp.groupby(['team_id', 'year'])['games'].rank(ascending=False, method='first')

    # each team's home park is the park with the most games (rank == 1)
    home_parks = hp.query('rank == 1').copy()
    home_parks = home_parks.drop(columns=['rank', 'games'])

    return home_parks

In [16]:
home_parks = create_home_parks(tg_parks)
home_parks.head(6)

team_id,year,park_id
ANA,2015,ANA01
ANA,2016,ANA01
ANA,2017,ANA01
ANA,2018,ANA01
ANA,2019,ANA01
ARI,2015,PHO01


In [17]:
def create_home_parks_bats_last(home_parks, team_game, tg_parks):
    """Create dataframe with each game's park_id and the team batting last's home park_id."""

    # get game's park_id
    bats_last = tg_parks.query('bat_last == True').reset_index().set_index(['team_id', 'year'])

    # get team's home park_id
    hp = home_parks.reset_index().set_index(['team_id', 'year'])

    return hp.join(bats_last, lsuffix='_home', rsuffix='_game')

In [18]:
home_parks_bats_last = create_home_parks_bats_last(home_parks, team_game, tg_parks)
home_parks_bats_last.head(4)

team_id,year,park_id_home,park_id_game,game_id,bat_last,r,opponent_team_id
ANA,2015,ANA01,ANA01,ANA201504100,True,2,KCA
ANA,2015,ANA01,ANA01,ANA201504110,True,4,KCA
ANA,2015,ANA01,ANA01,ANA201504120,True,2,KCA
ANA,2015,ANA01,ANA01,ANA201504200,True,3,OAK


In [19]:
def remove_games(home_parks_bats_last, team_game):
    """Remove games in which team batting last is not at home park."""

    # get the game_id where the park_ids do not match
    diff = home_parks_bats_last.query('park_id_game != park_id_home')

    # filter out those game_ids from team_game
    filt = team_game['game_id'].isin(diff['game_id'])
    return team_game[~filt]   

In [20]:
print(len(team_game))
team_game = remove_games(home_parks_bats_last, team_game)
print(len(team_game))

24294
24240


In [21]:
# recompute tg_parks using the new team_game dataframe
print(len(tg_parks))
tg_parks = create_tg_parks(team_game, game)
print(len(tg_parks))

24294
24240


In [22]:
def compute_runs_scored(tg_parks):
    """Compute Runs Scored per team per year per park."""

    cols = ['team_id', 'year', 'park_id']
    return tg_parks.groupby(cols).agg(games=('game_id', 'count'), rs=('r', 'sum'))

In [23]:
runs_scored = compute_runs_scored(tg_parks)
runs_scored.head(4)

team_id,year,park_id,games,rs
ANA,2015,ANA01,81,320.0
ANA,2015,ARL02,10,69.0
ANA,2015,BAL12,3,9.0
ANA,2015,BOS07,3,16.0


In [24]:
def compute_runs_allowed(tg_parks):
    """Compute Runs Allowed per team per year per park."""

    cols = ['opponent_team_id', 'year', 'park_id']
    tmp = tg_parks.groupby(cols).agg(games=('game_id', 'count'), ra=('r', 'sum'))
    return tmp.rename_axis(['team_id', 'year', 'park_id'])

In [25]:
runs_allowed = compute_runs_allowed(tg_parks)
runs_allowed.head(4)

team_id,year,park_id,games,ra
ANA,2015,ANA01,81,298.0
ANA,2015,ARL02,10,46.0
ANA,2015,BAL12,3,5.0
ANA,2015,BOS07,3,19.0
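Grouping by `opponent_team_id` works because the runs a team allows are the runs its opponent scores. A minimal sketch on the two London game rows shown in this notebook (the toy frame is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'team_id': ['BOS', 'NYA'],
    'opponent_team_id': ['NYA', 'BOS'],
    'r': [13, 17],
})

# runs scored: sum r by the team the row belongs to
rs = toy.groupby('team_id')['r'].sum()

# runs allowed: sum r by the opponent, then relabel the index as team_id
ra = toy.groupby('opponent_team_id')['r'].sum().rename_axis('team_id')

print(rs['BOS'], ra['BOS'])  # 13 17
```

Total runs scored must equal total runs allowed, which is what the assertion in `compute_runs_total` checks.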


In [26]:
def compute_runs_total(runs_scored, runs_allowed):
    """Join RS to RA to create a single dataframe.  Rank by games to find home_parks_runs."""

    rt = runs_scored.join(runs_allowed, lsuffix='_rs', rsuffix='_ra')

    # validate code
    assert (rt['games_rs'] == rt['games_ra']).all()
    assert rt['ra'].sum() == rt['rs'].sum()

    # rank games per team per year
    rt = rt.rename(columns={'games_rs': 'games'})
    rt = rt.drop(columns=['games_ra'])
    rt['rt'] = rt['rs'] + rt['ra']
    rt['rank'] = rt['games'].groupby(['team_id', 'year']).rank(ascending=False, method='first')
    
    rt = rt.drop(columns=['rs', 'ra'])

    return rt

In [27]:
runs_total = compute_runs_total(runs_scored, runs_allowed)
runs_total.head(3)

team_id,year,park_id,games,rt,rank
ANA,2015,ANA01,81,618.0,1.0
ANA,2015,ARL02,10,115.0,2.0
ANA,2015,BAL12,3,14.0,9.0


In [28]:
def create_home_parks_runs(runs_total):
    """Similar to create_home_parks, except it has runs total and average runs per game."""

    hp = runs_total.query('rank == 1').copy()
    hp = hp.drop(columns='rank')
    hp['r_avg'] = hp['rt'] / hp['games']

    return hp

In [29]:
home_parks_runs = create_home_parks_runs(runs_total)
home_parks_runs.head(3)

team_id,year,park_id,games,rt,r_avg
ANA,2015,ANA01,81,618.0,7.62963
ANA,2016,ANA01,81,688.0,8.493827
ANA,2017,ANA01,81,691.0,8.530864


In [30]:
def create_road_parks_runs(runs_total):
    """Create dataframe with runs per team per road-park per year"""
    rp = runs_total.query('rank > 1').copy()
    rp = rp.drop(columns='rank')
    
    return rp

In [31]:
road_parks_runs = create_road_parks_runs(runs_total)
road_parks_runs.head(3)

team_id,year,park_id,games,rt
ANA,2015,ARL02,10,115.0
ANA,2015,BAL12,3,14.0
ANA,2015,BOS07,3,35.0


In [32]:
def compute_road_totals(road_parks):
    """Sum the totals on the road per team per year."""
    
    road_totals = road_parks.groupby(['team_id', 'year']).agg(
        rt=('rt', 'sum'), games=('games', 'sum'))
    road_totals['r_avg'] = road_totals['rt'] / road_totals['games']
    
    return road_totals

In [33]:
road_totals = compute_road_totals(road_parks_runs)
road_totals.head(3)

team_id,year,rt,games,r_avg
ANA,2015,718.0,81,8.864198
ANA,2016,756.0,81,9.333333
ANA,2017,728.0,81,8.987654


In [34]:
def compute_pf(home_parks_runs, road_totals):
    """Compute Park Factor."""
    
    pf = home_parks_runs['r_avg'] / road_totals['r_avg']
    pf = pf.to_frame()
    pf.columns = ['pf']

    return pf

In [35]:
pf = compute_pf(home_parks_runs, road_totals)
pf.head()

team_id,year,park_id,pf
ANA,2015,ANA01,0.860724
ANA,2016,ANA01,0.910053
ANA,2017,ANA01,0.949176
ANA,2018,ANA01,0.968622
ANA,2019,ANA01,1.006353


# Putting it All Together
Starting over from scratch without all the explanations.

In [36]:
# find which games to remove and remove them
tg_parks = create_tg_parks(team_game, game)
home_parks = create_home_parks(tg_parks)
home_parks_bats_last = create_home_parks_bats_last(home_parks, team_game, tg_parks)
team_game = remove_games(home_parks_bats_last, team_game)

# recompute tg_parks with fewer games
tg_parks = create_tg_parks(team_game, game)

In [37]:
# compute runs scored and runs allowed
runs_scored = compute_runs_scored(tg_parks)
runs_allowed = compute_runs_allowed(tg_parks)
runs_total = compute_runs_total(runs_scored, runs_allowed)

In [38]:
# compute runs scored and allowed at each team's home park, and on the road
home_parks_runs = create_home_parks_runs(runs_total)
road_parks_runs = create_road_parks_runs(runs_total)
road_runs_totals = compute_road_totals(road_parks_runs)

# compute the basic Park Factor
pf = compute_pf(home_parks_runs, road_runs_totals)

In [39]:
pf.head()

team_id,year,park_id,pf
ANA,2015,ANA01,0.860724
ANA,2016,ANA01,0.910053
ANA,2017,ANA01,0.949176
ANA,2018,ANA01,0.968622
ANA,2019,ANA01,1.006353


In [40]:
pf.query('team_id == "BOS" and year == 2019')

team_id,year,park_id,pf
BOS,2019,BOS07,1.028987


We see that the basic PF for Fenway Park is 1.029 rounded to 3 decimal places.

Spot Check by hand using ESPN's home/away splits for Boston 2019:  
* http://www.espn.com/mlb/stats/team/_/stat/pitching/year/2019/split/33
* http://www.espn.com/mlb/stats/team/_/stat/batting/year/2019/split/33

home: 81 games 452 RS  439 RA -- includes London!  
away: 81 games 449 RS  389 RA

London Series Data: https://en.wikipedia.org/wiki/MLB_London_Series  
Boston in London (as home team): 21 RS, 29 RA, 2 Games

In [41]:
# Manually plug in the numbers, subtracting out those for London
fenway_r_avg = (452 + 439 - 21 - 29) / 79
away_r_avg = (449 + 389) / 81
pf_fenway_2019 = fenway_r_avg / away_r_avg
print(f'pf: {pf_fenway_2019:.3f} pf_half:{(1+pf_fenway_2019)/2:.3f}')

pf: 1.029 pf_half:1.014


The value calculated by hand matches, to three decimal places, the value from the code which computes the PF for every team for every year.

What value do we get if we mistakenly include the high scoring London games in the total for Fenway Park?

In [42]:
fenway_london_r_avg = (452+439) / 81
away_r_avg = (449 + 389) / 81
pf_fenway_london_2019 = fenway_london_r_avg / away_r_avg
print(f'pf: {pf_fenway_london_2019:.3f} pf_half:{(1+pf_fenway_london_2019)/2:.3f}')

pf: 1.063 pf_half:1.032


ESPN shows the PF for Fenway Park in 2019 as 1.06.  This is the value you get when the runs scored in London are included in the Fenway Park totals.  
See: http://www.espn.com/mlb/stats/parkfactor/_/year/2019

Fangraphs, which presents the pf_half values, shows the half value as 1.03.  This is the value you get when the runs scored in London are included in the Fenway Park totals.  
See: https://www.fangraphs.com/guts.aspx?type=pf&season=2019&teamid=0

Of course, scoring lots of runs in London should not cause the Fenway Park PF to increase, but it did for both ESPN and Fangraphs.
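To put a number on the effect, the two hand calculations can be compared directly. This sketch reuses only the ESPN split totals and London Series totals quoted in this notebook; the variable names are illustrative.

```python
# Fenway only: subtract the two London games from the "home" totals
pf_without_london = ((452 + 439 - 21 - 29) / 79) / ((449 + 389) / 81)

# London games mistakenly counted as Fenway games
pf_with_london = ((452 + 439) / 81) / ((449 + 389) / 81)

# relative inflation of the full PF caused by two games
inflation = pf_with_london / pf_without_london - 1
print(f'{pf_without_london:.3f} {pf_with_london:.3f} {inflation:.1%}')
```

Two games out of 81 are enough to move the full PF by roughly three percent.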

Perhaps the main reason this error has not been caught before is that the data was not wrangled to be tidy.  When data is not tidy, it is very difficult to analyze correctly.  This difficulty is a barrier to entry for those wishing to compute the Park Factor from the open-source Retrosheet data.

# Webscrape FanGraphs for PF
This will allow for comparing many values with those computed here.

In [43]:
import requests
from bs4 import BeautifulSoup

In [44]:
# read the parks factor table on the fangraphs website
data = []
for year in range(2015, 2020):
    url = f'https://www.fangraphs.com/guts.aspx?type=pf&season={year}&teamid=0'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    
    table = soup.find('table', class_='rgMasterTable')
    
    header = table.find('thead')
    cols = [col.text for col in header.find_all('th')]
    
    body = table.find('tbody')
    for row in body.find_all('tr'):
        data.append([col.text for col in row.find_all('td')])

In [45]:
fangraphs = pd.DataFrame(data, columns = cols)

# change datatypes from string to int
for col in fangraphs.columns:
    if col != 'Team':
        fangraphs[col] = fangraphs[col].astype('int')

fangraphs.head()

Unnamed: 0,Season,Team,Basic (5yr),3yr,1yr,1B,2B,3B,HR,SO,BB,GB,FB,LD,IFFB,FIP
0,2015,Angels,97,95,93,100,96,88,98,102,97,101,100,98,100,98
1,2015,Orioles,101,101,108,101,96,87,106,98,100,101,102,100,100,103
2,2015,Red Sox,104,107,109,103,112,103,95,99,99,102,97,103,101,98
3,2015,White Sox,99,98,95,98,95,93,105,102,103,98,101,98,105,102
4,2015,Indians,102,106,112,101,106,83,102,100,100,101,97,101,92,100


In [46]:
# save the web scraped Fangraphs Park Factor data
nb_data_path = data_dir / 'retrosheet/nb_data'
nb_data_path.mkdir(parents=True, exist_ok=True)
dh.to_csv_with_types(fangraphs, nb_data_path / 'fangraphs.csv')

In [47]:
# add the team name to the pf dataframe to compare with Fangraphs
pf = pf.merge(teams[['team_id', 'year', 'name']], 
              left_on=['team_id', 'year'], 
              right_on=['team_id', 'year'])
pf.head()

Unnamed: 0,team_id,year,pf,name
0,ANA,2015,0.860724,Angels
1,ANA,2016,0.910053,Angels
2,ANA,2017,0.949176,Angels
3,ANA,2018,0.968622,Angels
4,ANA,2019,1.006353,Angels


In [48]:
# compute pf_half to compare with Fangraphs
pf['pf_half'] = (1.0 + pf['pf']) / 2.0

In [49]:
# add the 1yr PF from Fangraphs
pf_fg = pf.merge(fangraphs[['Season', 'Team', '1yr']],
         left_on=['year', 'name'],
         right_on=['Season', 'Team'],
         validate='one_to_one')
pf_fg.head()

Unnamed: 0,team_id,year,pf,name,pf_half,Season,Team,1yr
0,ANA,2015,0.860724,Angels,0.930362,2015,Angels,93
1,ANA,2016,0.910053,Angels,0.955026,2016,Angels,96
2,ANA,2017,0.949176,Angels,0.974588,2017,Angels,98
3,ANA,2018,0.968622,Angels,0.984311,2018,Angels,99
4,ANA,2019,1.006353,Angels,1.003176,2019,Angels,101


In [50]:
# compute the relative differences between pf_half and 1yr
pf_fg['pf_half'] *= 100  # to be on the same scale as Fangraphs
rel_diff = np.abs(1.0 - pf_fg['pf_half'] / pf_fg['1yr'])

In [51]:
rel_diff.nlargest(2)

15    0.021252
24    0.015055
dtype: float64

In [52]:
# examine the 2 largest differences from Fangraphs
pf_fg.loc[rel_diff.nlargest(2).index]

Unnamed: 0,team_id,year,pf,name,pf_half,Season,Team,1yr
15,BAL,2015,1.205903,Orioles,110.29517,2015,Orioles,108
24,BOS,2019,1.028987,Red Sox,101.449352,2019,Red Sox,103


These differences correspond to the games previously mentioned: Boston hosting the Yankees in London in 2019, and Baltimore hosting Tampa Bay in Tampa Bay in 2015.

# Summary
Computing the Park Factor is relatively easy after the Retrosheet data has been wrangled.

The team batting last is not always playing at their home park.  Not accounting for this was seen to result in errors of up to 4% for the full PF value for teams playing in the last 5 years.  These errors were made by both ESPN and Fangraphs.

As fewer than 1% of all games fall into this category, removing these games from the data is a reasonable option.