# Computing Park Factors<br>Accounting for Road Schedule
As in the previous notebook, games in which the home team does not play in its home park are accounted for.  Additionally, the Park Factor (PF) is adjusted for each team's road schedule.

It will be shown that adjusting for the road schedule, especially for home parks with very high or very low Park Factors, can make a large difference.

The methodology is to compute the PF as before, then divide the runs at each road park by that park's PF, and then recompute the PF for each team's home park.
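As a rough illustration of this adjustment, the sketch below uses made-up numbers for a single team-season. The park IDs `AAA01` and `BBB01` and all run totals are hypothetical and do not come from the Retrosheet data. It computes the PF with and without deflating the road runs by each road park's PF.

```python
import pandas as pd

# Hypothetical numbers for one team-season (not from the Retrosheet data)
home_r_avg = 10.0  # total runs (both teams) per game at the home park

# Road games split across two parks, each with a first-pass PF
road = pd.DataFrame({
    "park_id": ["AAA01", "BBB01"],
    "games": [40, 40],
    "rt": [320.0, 400.0],  # total runs scored in those games
    "pf": [0.8, 1.0],      # first-pass PF of each road park
})

# Unadjusted PF: home runs per game over road runs per game
pf_unadjusted = home_r_avg / (road["rt"].sum() / road["games"].sum())

# Adjusted PF: divide each road park's runs by that park's PF first
road["rt_adj"] = road["rt"] / road["pf"]
pf_adjusted = home_r_avg / (road["rt_adj"].sum() / road["games"].sum())

print(round(pf_unadjusted, 3), round(pf_adjusted, 3))
```

In this toy case the road schedule leans toward a low-scoring park, so the unadjusted PF overstates the home park and the adjustment pulls it back toward 1.0.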

This notebook is identical to the previous notebook until the section: [Adjust for Road Schedule](#Adjust-for-Road-Schedule).

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re

In [2]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [3]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

## Read in the Data
Reading in the data up front makes the code clearer, but may use more memory. By only selecting the columns that are needed, much less memory is used as these are very wide csv files.
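The effect of selecting columns can be sketched with plain `pandas.read_csv` and its `usecols` parameter. This is only an illustration: the notebook's own helper `dh.from_csv_with_types` is not shown here, and the tiny CSV text and its column names are made up.

```python
import io
import pandas as pd

# Tiny stand-in for a wide CSV file; column names are made up for illustration
csv_text = "game_id,year,r,h,e\nG1,2019,5,9,0\nG2,2019,3,7,1\n"

# Read every column
wide = pd.read_csv(io.StringIO(csv_text))

# Read only the columns that are needed
narrow = pd.read_csv(io.StringIO(csv_text), usecols=["game_id", "year", "r"])

print(list(narrow.columns))
print(narrow.memory_usage(deep=True).sum(), wide.memory_usage(deep=True).sum())
```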

In [4]:
cols = ['game_id', 'year', 'bat_last', 'team_id', 'opponent_team_id', 'r']
team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz', usecols=cols)

In [5]:
cols = ['game_id', 'park_id', 'game_start']
game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz', usecols=cols)

In [6]:
parks = dh.from_csv_with_types(retrosheet_data / 'parks.csv')

In [7]:
teams = dh.from_csv_with_types(retrosheet_data / 'teams.csv')

In [8]:
# for now, focus on 2015 onward
team_game = team_game.query('year >= 2015')
game['year'] = game['game_start'].dt.year
game = game.query('year >= 2015')
game = game.drop(columns='game_start')

# Data Processing
This section is identical to the Data Processing section in the previous notebook.

In [9]:
def create_tg_parks(team_game, game):
    """Create minimal team_game dataframe with park_id."""

    cols = ['team_id', 'year', 'park_id', 'game_id', 'bat_last', 'r', 'opponent_team_id']
    tg_parks = team_game.merge(game)[cols]

    tg_parks = tg_parks.set_index(['team_id', 'year', 'park_id'])

    return tg_parks

In [10]:
def create_home_parks(tg_parks):
    """Create minimal home_parks dataframe which has home park_id per team per year."""

    # count games per team per year per park
    hp = tg_parks.groupby(['team_id', 'year', 'park_id']).agg(games=('game_id', 'count'))

    # rank number of games per team per year
    hp['rank'] = hp.groupby(['team_id', 'year'])['games'].rank(ascending=False, method='first')

    # each team's home park is the park with the most games (rank == 1)
    home_parks = hp.query('rank == 1').copy()
    home_parks = home_parks.drop(columns=['rank', 'games'])

    return home_parks

In [11]:
def create_home_parks_bats_last(home_parks, team_game, tg_parks):
    """Create dataframe with each game's park_id and the team batting last's home park_id."""

    # get game's park_id
    bats_last = tg_parks.query('bat_last == True').reset_index().set_index(['team_id', 'year'])

    # get team's home park_id
    hp = home_parks.reset_index().set_index(['team_id', 'year'])

    return hp.join(bats_last, lsuffix='_home', rsuffix='_game')

In [12]:
def remove_games(home_parks_bats_last, team_game):
    """Remove games in which team batting last is not at home park."""

    # get the game_id where the park_ids do not match
    diff = home_parks_bats_last.query('park_id_game != park_id_home')

    # filter out those game_ids from team_game
    filt = team_game['game_id'].isin(diff['game_id'])
    return team_game[~filt]   

In [13]:
def compute_runs_scored(tg_park):
    """Compute Runs Scored per team per year per park."""

    cols = ['team_id', 'year', 'park_id']
    return tg_park.groupby(cols).agg(games=('game_id', 'count'), rs=('r', 'sum'))

In [14]:
def compute_runs_allowed(tg_park):
    """Compute Runs Allowed per team per year per park."""

    cols = ['opponent_team_id', 'year', 'park_id']
    tmp = tg_park.groupby(cols).agg(games=('game_id', 'count'), ra=('r', 'sum'))
    return tmp.rename_axis(['team_id', 'year', 'park_id'])

In [15]:
def compute_runs_total(runs_scored, runs_allowed):
    """Join RS to RA to create a single dataframe.  Rank by games to find home_parks_runs."""

    rt = runs_scored.join(runs_allowed, lsuffix='_rs', rsuffix='_ra')

    # validate code
    assert (rt['games_rs'] == rt['games_ra']).all()
    assert rt['ra'].sum() == rt['rs'].sum()

    # rank games per team per year
    rt = rt.rename(columns={'games_rs': 'games'})
    rt = rt.drop(columns=['games_ra'])
    rt['rt'] = rt['rs'] + rt['ra']
    rt['rank'] = rt['games'].groupby(['team_id', 'year']).rank(ascending=False, method='first')
    
    rt = rt.drop(columns=['rs', 'ra'])

    return rt

In [16]:
def create_home_parks_runs(runs_total):
    """Similar to create_home_parks, except it has runs total and average runs per game."""

    hp = runs_total.query('rank == 1').copy()
    hp = hp.drop(columns='rank')
    hp['r_avg'] = hp['rt'] / hp['games']

    return hp

In [17]:
def create_road_parks_runs(runs_total):
    """Create dataframe with runs per team per road-park per year"""
    rp = runs_total.query('rank > 1').copy()
    rp = rp.drop(columns='rank')
    
    return rp

In [18]:
def compute_road_totals(road_parks):
    """Sum the totals on the road per team per year."""
    
    road_totals = road_parks.groupby(['team_id', 'year']).agg(
        rt=('rt', 'sum'), games=('games', 'sum'))
    road_totals['r_avg'] = road_totals['rt'] / road_totals['games']
    
    return road_totals

In [19]:
def compute_pf(home_parks_runs, road_totals):
    """Compute Park Factor."""
    
    pf = home_parks_runs['r_avg'] / road_totals['r_avg']
    pf = pf.to_frame()
    pf.columns = ['pf']

    return pf

In [20]:
# find which games to remove and remove them
tg_parks = create_tg_parks(team_game, game)
home_parks = create_home_parks(tg_parks)
home_parks_bats_last = create_home_parks_bats_last(home_parks, team_game, tg_parks)
team_game = remove_games(home_parks_bats_last, team_game)

# recompute tg_parks with fewer games
tg_parks = create_tg_parks(team_game, game)

In [21]:
# compute runs scored and runs allowed and runs total
runs_scored = compute_runs_scored(tg_parks)
runs_allowed = compute_runs_allowed(tg_parks)
runs_total = compute_runs_total(runs_scored, runs_allowed)

In [22]:
# split out the data by park
home_parks_runs = create_home_parks_runs(runs_total)
road_parks_runs = create_road_parks_runs(runs_total)
road_totals = compute_road_totals(road_parks_runs)

In [23]:
pf = compute_pf(home_parks_runs, road_totals)
pf.head(7)

team_id,year,park_id,pf
ANA,2015,ANA01,0.860724
ANA,2016,ANA01,0.910053
ANA,2017,ANA01,0.949176
ANA,2018,ANA01,0.968622
ANA,2019,ANA01,1.006353
ARI,2015,PHO01,1.061871
ARI,2016,PHO01,1.224932


# Adjust for Road Schedule

In [24]:
pf_orig = pf['pf']

In [25]:
for i in range(5):
    rp = road_parks_runs.reset_index()
    pf = pf.reset_index()

    # prior to adjusting the road total runs by PF, ensure that mean is 1.0
    # this is a small but important adjustment if several iterations are performed
    pf['pf'] /= pf['pf'].mean()

    # add PF column with values for each road park
    rp = rp.merge(pf,
                  left_on=['year', 'park_id'],
                  right_on=['year', 'park_id'],
                  suffixes=['', '_home'])

    rp = rp.drop(columns='team_id_home')

    # create adjusted road runs per park based on each road park's pf
    rp['rt_adj'] = rp['rt'] / rp['pf']
    
    # used to compute a game weighted average of each team's road PF
    rp['pf_games'] = rp['games'] * rp['pf']

    rp = rp.set_index(['team_id', 'year', 'park_id']).sort_index()

    road_totals = rp.groupby(['team_id', 'year']).agg(
        rt_adj=('rt_adj', 'sum'), games=('games', 'sum'),
        pf_adj_sum=('pf_games', 'sum'))

    # compute road PF adjusted run total average per road game
    road_totals['r_avg_adj'] = road_totals['rt_adj'] / road_totals['games']
    
    # compute game weighted PF average per road game
    road_totals['pf_avg_road'] = road_totals['pf_adj_sum'] / road_totals['games']

    # compute PF
    pf = home_parks_runs['r_avg'] / road_totals['r_avg_adj']
    pf = pf.to_frame()
    pf.columns = ['pf']
    pf['pf_orig'] = pf_orig
    pf['pf_avg_road'] = road_totals['pf_avg_road']

    display(pf.query('team_id == "COL" and year==2019').round(3))

team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.306,1.394,0.936


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.311,1.394,0.939


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.305,1.394,0.935


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.302,1.394,0.932


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.299,1.394,0.931


The Colorado Rockies (COL) play in Denver at Coors Field (DEN02).

In the above, we see that the PF for Coors Field dropped significantly (from 1.394 to 1.306) in the first iteration, when the road PF was applied to each of the Rockies' road games.  Subsequent iterations made only very slight adjustments.
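Instead of a fixed number of passes, the iteration could stop once the PFs no longer change. The following is a minimal sketch of that idea on a made-up three-park league, not the notebook's data. The "true" PFs, the balanced schedule, and the tolerance `tol` are all assumptions chosen so that the correct answer is known in advance.

```python
import pandas as pd

# Made-up league built from "true" PFs of 1.2, 1.0 and 0.8, with
# 10 total runs per game in a neutral park and 10 road games per park.
true_pf = pd.Series({"A": 1.2, "B": 1.0, "C": 0.8})
home_r_avg = 10.0 * true_pf  # runs per game at each team's home park

# One row per (team, road park): total runs in the 10 games played there
rows = [(team, park, 10, 100.0 * true_pf[park])
        for team in true_pf.index for park in true_pf.index if team != park]
road = pd.DataFrame(rows, columns=["team", "park", "games", "rt"])

def recompute_pf(pf):
    """One pass: deflate road runs by the current PFs, then recompute each PF."""
    pf = pf / pf.mean()  # keep the league mean at 1.0
    rt_adj = road["rt"] / road["park"].map(pf)
    totals = road.assign(rt_adj=rt_adj).groupby("team")[["rt_adj", "games"]].sum()
    return home_r_avg / (totals["rt_adj"] / totals["games"])

# First pass: unadjusted PF (home runs per game over road runs per game)
totals = road.groupby("team")[["rt", "games"]].sum()
pf = home_r_avg / (totals["rt"] / totals["games"])

tol, max_iter = 1e-6, 100  # hypothetical stopping rule
for n_iter in range(1, max_iter + 1):
    new_pf = recompute_pf(pf)
    change = (new_pf - pf).abs().max()
    pf = new_pf
    if change < tol:
        break

print(n_iter, pf.round(3).to_dict())
```

In a league this small each road schedule covers only two parks, so successive passes likely move more than they do with a full set of parks.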

Because the PFs are normalized to a mean of 1.0 and Coors Field has the highest PF, the average PF of the parks the Rockies visit must be less than 1.0. So the above is reasonable.

The Coors Field PF adjusted for the Rockies' road schedule is very nearly the original Coors Field PF times the Rockies' game-weighted average road PF (1.394 × 0.936 ≈ 1.305, against 1.306 from the first iteration).  The two are not exactly equal because dividing each road park's runs by that park's PF weights the parks by runs, whereas `pf_avg_road` weights them by games.
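The small gap between the two calculations can be reproduced with a toy example. The numbers below are hypothetical, chosen only so that runs per game differ between the two road parks.

```python
import pandas as pd

# Hypothetical team-season: 12 runs per game at home, two road parks
home_r_avg = 12.0
road = pd.DataFrame({
    "games": [10, 30],
    "rt": [120.0, 240.0],  # total runs at each road park
    "pf": [1.2, 0.9],      # PF of each road park
})
games = road["games"].sum()

# Original PF: no road adjustment
pf_orig = home_r_avg / (road["rt"].sum() / games)

# Exact adjustment: deflate each park's runs by its own PF
pf_adjusted = home_r_avg / ((road["rt"] / road["pf"]).sum() / games)

# Shortcut: original PF times the game-weighted average road PF
pf_avg_road = (road["games"] * road["pf"]).sum() / games
pf_approx = pf_orig * pf_avg_road

print(round(pf_adjusted, 3), round(pf_approx, 3))
```

The two results are close but not identical, which mirrors the small difference seen for Coors Field.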

What are the minimum and maximum PFs for the road?  

Only the last 5 years of data are being considered at this time.

In [26]:
pf['pf_avg_road'].agg(['min', 'max'])

min    0.930562
max    1.072243
Name: pf_avg_road, dtype: float64

Both extremes are about 7% away from 1.0, which is enough to change the PF of the home team's park considerably.

In [27]:
pf['pf_avg_road'].nsmallest(3)

team_id  year  park_id
COL      2019  DEN02      0.930562
PIT      2016  PIT08      0.945413
MIL      2016  MIL06      0.945822
Name: pf_avg_road, dtype: float64

In [28]:
pf['pf_avg_road'].nlargest(3)

team_id  year  park_id
TBA      2016  STP01      1.072243
NYA      2016  NYC21      1.065839
MIN      2016  MIN04      1.053975
Name: pf_avg_road, dtype: float64

The Tampa Bay Rays of 2016 have the highest road PF.  Tampa Bay is in the AL East, so a large share of its road games are played in AL East parks.  In 2016 the AL East parks, other than Tampa Bay's, had high PFs.

In [29]:
# the following is correct from 2013 through present
mlb_division = {
    'BOS':'AL-East',
    'BAL':'AL-East',
    'NYA':'AL-East',
    'TBA':'AL-East',
    'TOR':'AL-East',
    'CHA':'AL-Central', 
    'CLE':'AL-Central', 
    'DET':'AL-Central', 
    'KCA':'AL-Central', 
    'MIN':'AL-Central', 
    'HOU':'AL-West', 
    'ANA':'AL-West', 
    'OAK':'AL-West', 
    'SEA':'AL-West', 
    'TEX':'AL-West', 
    'ATL':'NL-East', 
    'MIA':'NL-East', 
    'NYN':'NL-East', 
    'PHI':'NL-East', 
    'WAS':'NL-East', 
    'CHN':'NL-Central', 
    'CIN':'NL-Central', 
    'MIL':'NL-Central', 
    'PIT':'NL-Central', 
    'SLN':'NL-Central', 
    'ARI':'NL-West', 
    'COL':'NL-West', 
    'LAN':'NL-West', 
    'SDN':'NL-West', 
    'SFN':'NL-West'}

al_east = ['BOS', 'BAL', 'NYA', 'TBA', 'TOR']
al_central = ['CHA', 'CLE','DET', 'KCA', 'MIN']
al_west = ['HOU', 'ANA', 'OAK', 'SEA', 'TEX']
nl_east = ['ATL', 'MIA', 'NYN', 'PHI', 'WAS']
nl_central = ['CHN', 'CIN', 'MIL', 'PIT', 'SLN']
nl_west = ['ARI', 'COL', 'LAN', 'SDN', 'SFN']

In [30]:
pf = pf.reset_index()
pf['div'] = pf['team_id'].map(mlb_division)
pf.groupby(['year','div'])[['pf']].agg('mean')

year,div,pf
2015,AL-Central,1.025025
2015,AL-East,1.058728
2015,AL-West,0.931605
2015,NL-Central,1.002328
2015,NL-East,0.9446
2015,NL-West,1.027719
2016,AL-Central,1.110906
2016,AL-East,1.092436
2016,AL-West,0.913574
2016,NL-Central,0.900196


# Compare with Previous NB

In [31]:
pf_nb01 = dh.from_csv_with_types(data_dir / 'retrosheet/nb_data/pf.csv')
pf_nb01 = pf_nb01.set_index(['team_id', 'year'])

In [32]:
# compute the relative difference between the previous notebook's PF and the road-adjusted PF
pf = pf.set_index(['team_id', 'year'])
rel_diff = np.abs(1.0 - pf_nb01['pf'] / pf['pf'])

In [33]:
rel_diff.nlargest(5)

team_id  year
WAS      2018    0.070261
COL      2019    0.069709
MIL      2016    0.064798
PIT      2016    0.063030
TBA      2016    0.059912
Name: pf, dtype: float64

In [34]:
pf.loc[rel_diff.idxmax()]

park_id          WAS11
pf             1.05582
pf_orig        1.13363
pf_avg_road    0.95219
div            NL-East
Name: (WAS, 2018), dtype: object

The Washington Nationals of 2018 had the largest change in PF when their road schedule was considered.  This is because the other NL East parks, where they play a large share of their road games, had low PFs, while the Nationals' home park (WAS11) had a high PF.
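The reasoning about a high-PF park in a low-PF division can be sketched by computing, for each team, the mean PF of the other parks in its division. The team labels and PF values below are made up for illustration and are not notebook output.

```python
import pandas as pd

# Illustrative PFs for a five-team division (made-up numbers)
div_pf = pd.Series({"T1": 1.10, "T2": 0.95, "T3": 0.95, "T4": 0.90, "T5": 0.90})

# Mean PF of the other four parks, from each team's point of view
others_mean = (div_pf.sum() - div_pf) / (len(div_pf) - 1)

print(others_mean.round(3).to_dict())
```

The team with the highest home PF sees the lowest average PF among its division rivals, so its road adjustment is the largest.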

# Summary
The Park Factor was adjusted for the road schedule for each team.

It was shown that 5 team-seasons over the last 5 years had their Park Factor adjusted by about 6% or more.  This is a large amount.  Accounting for the road schedule of each team does affect the PF of the home team's park.

The game-weighted average road PF was computed.  This new metric is useful for adjusting a player's performance both at home and on the road.  Below, Clayton Kershaw's 2019 ERA of 3.02 is adjusted using the PF both at home and on the road.

In [35]:
# Example, compute Clayton Kershaw's ERA adjusted by PF
pf.loc[('LAN', 2019)]

park_id           LOS03
pf             0.901403
pf_orig        0.904701
pf_avg_road    0.986976
div             NL-West
Name: (LAN, 2019), dtype: object

In [36]:
# Assume half of Kershaw's games were at home and half were on the road
# => use the average PF for adjustment
pf_avg_2019_dodgers = (.901 + .987)/2

# Kershaw's 2019 adjusted ERA
np.round(3.02 / pf_avg_2019_dodgers, 2)

3.2
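The 50/50 assumption above could be relaxed by weighting the two PFs by the share of innings pitched at home. The function below is a hypothetical helper, not part of the notebook, and the 60/40 split is an invented example rather than Kershaw's actual split.

```python
def park_adjusted_era(era, pf_home, pf_road, home_share=0.5):
    """Divide ERA by a PF blended by the share of innings pitched at home."""
    if not 0.0 <= home_share <= 1.0:
        raise ValueError("home_share must be between 0 and 1")
    pf_blend = home_share * pf_home + (1.0 - home_share) * pf_road
    return era / pf_blend

# Same inputs as the cell above, with the 50/50 split made explicit
even_split = park_adjusted_era(3.02, 0.901, 0.987)

# A hypothetical 60/40 home/road split, for illustration only
home_heavy = park_adjusted_era(3.02, 0.901, 0.987, home_share=0.6)

print(round(even_split, 2), round(home_heavy, 2))
```

Because the home park's PF is below the road PF here, a larger home share lowers the blended PF and raises the adjusted ERA slightly.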