# Sabermetrics: Computing Park Factors, Accounting for Road Schedule


🔴 First Draft

As in the previous notebook, games in which the home team does not play in its home park are removed before the Park Factor (PF) is computed.  In addition, the PF is adjusted for each team's road schedule.

In some cases, adjusting for the road schedule makes a significant difference.

## Road Games not at Parks with PF = 1.0   
Each team's opponent schedule is not uniform.  The basic PF formula assumes that the average PF is 1.0 for all road games, but this is not the case.
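
For reference, the basic formula, as implemented by `compute_pf` later in this notebook, can be written as follows. The symbols are introduced here for illustration only: $RS$ and $RA$ are a team's runs scored and allowed, and $G$ is games played, at home or on the road.

$$
\mathrm{PF}_{\text{basic}} = \frac{(RS_{\text{home}} + RA_{\text{home}}) / G_{\text{home}}}{(RS_{\text{road}} + RA_{\text{road}}) / G_{\text{road}}}
$$

The road average in the denominator is a fair baseline only if the road parks average out to a PF of 1.0.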

One way to account for this is to use the basic PF formula to get an initial PF per park, then divide the runs in each road park by that park's PF.  The adjusted road run total can then be used to compute a new PF per park, and the process can be repeated.  As a result, each team's road schedule is taken into account when computing its home park factor.
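
The iteration can be illustrated on a toy league. Everything in this sketch is invented for illustration: the three parks, their "true" park effects, the schedules, and the helper names `basic_pf` and `adjust_once` do not come from the notebook's data or code. The toy data has no noise, so it only shows the mechanics of the method, not how it behaves on real seasons.

```python
# Toy league: every value below is made up for illustration.
TRUE_PF = {"A": 1.2, "B": 1.0, "C": 0.8}  # hypothetical "true" park effects (mean 1.0)
BASE = 9.0                                # total runs per game in a neutral park

# Unbalanced road schedules: road games played by each team in each other park.
ROAD_GAMES = {
    "A": {"B": 12, "C": 6},
    "B": {"A": 6, "C": 12},
    "C": {"A": 9, "B": 9},
}

# Averages and totals implied by the toy model.
home_avg = {t: BASE * TRUE_PF[t] for t in TRUE_PF}
road_runs = {t: {p: BASE * TRUE_PF[p] * g for p, g in sched.items()}
             for t, sched in ROAD_GAMES.items()}


def basic_pf():
    """Home runs per game divided by unadjusted road runs per game."""
    pf = {}
    for team, sched in ROAD_GAMES.items():
        road_avg = sum(road_runs[team].values()) / sum(sched.values())
        pf[team] = home_avg[team] / road_avg
    return pf


def adjust_once(pf):
    """One pass: rescale PFs to mean 1.0, deflate road runs by park, recompute."""
    mean_pf = sum(pf.values()) / len(pf)
    pf = {t: v / mean_pf for t, v in pf.items()}
    new_pf = {}
    for team, sched in ROAD_GAMES.items():
        adj_runs = sum(runs / pf[park] for park, runs in road_runs[team].items())
        new_pf[team] = home_avg[team] / (adj_runs / sum(sched.values()))
    return new_pf


pf_basic = basic_pf()
pf_iter = dict(pf_basic)
for _ in range(50):
    pf_iter = adjust_once(pf_iter)

print({t: round(v, 3) for t, v in pf_basic.items()})
print({t: round(v, 3) for t, v in pf_iter.items()})
```

In this toy setup the basic PF for park A is about 1.286, because team A plays its road games in parks that suppress runs. Repeating the adjustment moves the estimates back toward the values in `TRUE_PF`.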

This will be the approach taken in this notebook.

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re

In [2]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [3]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

## Read in the Data
Reading in the data up front makes the code clearer, but may use more memory. Because these CSV files are very wide, selecting only the needed columns uses much less memory.
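
The effect of column selection can be seen on a small synthetic file. This sketch calls `pd.read_csv` directly and assumes, without having checked, that `dh.from_csv_with_types` forwards `usecols` to it; the column names `c0` to `c49` are hypothetical.

```python
import io

import pandas as pd

# Build a small, wide CSV in memory (hypothetical column names c0..c49).
n_rows, n_cols = 1000, 50
header = ",".join(f"c{i}" for i in range(n_cols))
rows = "\n".join(",".join(str(r * i) for i in range(n_cols)) for r in range(n_rows))
csv_text = header + "\n" + rows

# Read every column, then only the three that are needed.
wide = pd.read_csv(io.StringIO(csv_text))
narrow = pd.read_csv(io.StringIO(csv_text), usecols=["c0", "c1", "c2"])

wide_bytes = wide.memory_usage(deep=True).sum()
narrow_bytes = narrow.memory_usage(deep=True).sum()
print(wide_bytes, narrow_bytes)
```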

In [4]:
cols = ['game_id', 'year', 'bat_last', 'team_id', 'opponent_team_id', 'r']
team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz', usecols=cols)

In [5]:
cols = ['game_id', 'park_id', 'game_start']
game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz', usecols=cols)

In [6]:
parks = dh.from_csv_with_types(retrosheet_data / 'parks.csv')

In [7]:
teams = dh.from_csv_with_types(retrosheet_data / 'teams.csv')

In [8]:
# for now, focus on 2015 onward
team_game = team_game.query('year >= 2015')
game['year'] = game['game_start'].dt.year
game = game.query('year >= 2015')
game = game.drop(columns='game_start')

# Data Processing
This section is identical to the Data Processing section in the previous notebook.

In [9]:
def create_tg_parks(team_game, game):
    """Create minimal team_game dataframe with park_id."""

    cols = ['team_id', 'year', 'park_id', 'game_id', 'bat_last', 'r', 'opponent_team_id']
    tg_parks = team_game.merge(game)[cols]

    tg_parks = tg_parks.set_index(['team_id', 'year', 'park_id'])

    return tg_parks

In [10]:
def create_home_parks(tg_parks):
    """Create minimal home_parks dataframe which has home park_id per team per year."""

    # count games per team per year per park
    hp = tg_parks.groupby(['team_id', 'year', 'park_id']).agg(games=('game_id', 'count'))

    # rank number of games per team per year
    hp['rank'] = hp.groupby(['team_id', 'year'])['games'].rank(ascending=False, method='first')

    # each team's home park is the park with the most games (rank == 1)
    home_parks = hp.query('rank == 1').copy()
    home_parks = home_parks.drop(columns=['rank', 'games'])

    return home_parks
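
The ranking step in `create_home_parks` can be checked on a few rows. The teams, parks, and game counts below are hypothetical; the point is only that the park with the most games per team and year receives rank 1.

```python
import pandas as pd

# Hypothetical game counts: team AAA played 78 games in P1 and 3 in P2.
games = pd.DataFrame(
    {"team_id": ["AAA", "AAA", "BBB"],
     "year": [2019, 2019, 2019],
     "park_id": ["P1", "P2", "P3"],
     "games": [78, 3, 81]}
).set_index(["team_id", "year", "park_id"])

# Rank parks within each (team_id, year); ties are broken by row order.
games["rank"] = games.groupby(["team_id", "year"])["games"].rank(
    ascending=False, method="first")

# Keep only the top-ranked park per team per year.
home = games.query("rank == 1").drop(columns=["rank", "games"])
print(home.index.tolist())
```

Using `method='first'` guarantees exactly one park with rank 1 per team and year, even if two parks were tied on games.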

In [11]:
def create_home_parks_bats_last(home_parks, team_game, tg_parks):
    """Create dataframe with each game's park_id and the team batting last's home park_id."""

    # get game's park_id
    bats_last = tg_parks.query('bat_last == True').reset_index().set_index(['team_id', 'year'])

    # get team's home park_id
    hp = home_parks.reset_index().set_index(['team_id', 'year'])

    return hp.join(bats_last, lsuffix='_home', rsuffix='_game')

In [12]:
def remove_games(home_parks_bats_last, team_game):
    """Remove games in which team batting last is not at home park."""

    # get the game_id where the park_ids do not match
    diff = home_parks_bats_last.query('park_id_game != park_id_home')

    # filter out those game_ids from team_game
    filt = team_game['game_id'].isin(diff['game_id'])
    return team_game[~filt]   

In [13]:
def compute_runs_scored(tg_park):
    """Compute Runs Scored per team per year per park."""

    cols = ['team_id', 'year', 'park_id']
    return tg_park.groupby(cols).agg(games=('game_id', 'count'), rs=('r', 'sum'))

In [14]:
def compute_runs_allowed(tg_park):
    """Compute Runs Allowed per team per year per park."""

    cols = ['opponent_team_id', 'year', 'park_id']
    tmp = tg_park.groupby(cols).agg(games=('game_id', 'count'), ra=('r', 'sum'))
    return tmp.rename_axis(['team_id', 'year', 'park_id'])
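
Runs allowed are obtained by grouping on `opponent_team_id`: each team's runs allowed are its opponents' runs scored. A minimal sketch with one hypothetical game (teams `AAA` and `BBB`, park `P1`) shows the idea.

```python
import pandas as pd

# One hypothetical game: AAA scores 5, BBB scores 3, played in park P1.
tg = pd.DataFrame({
    "game_id": ["G1", "G1"],
    "team_id": ["AAA", "BBB"],
    "opponent_team_id": ["BBB", "AAA"],
    "year": [2019, 2019],
    "park_id": ["P1", "P1"],
    "r": [5, 3],
})

# Runs scored: group by the team itself.
rs = tg.groupby(["team_id", "year", "park_id"]).agg(rs=("r", "sum"))

# Runs allowed: group by the opponent, then relabel the index as team_id.
ra = (tg.groupby(["opponent_team_id", "year", "park_id"])
        .agg(ra=("r", "sum"))
        .rename_axis(["team_id", "year", "park_id"]))

both = rs.join(ra)
print(both)
```

Summed over all teams, runs scored must equal runs allowed, which is the check that `compute_runs_total` asserts.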

In [15]:
def compute_runs_total(runs_scored, runs_allowed):
    """Join RS to RA to create a single dataframe.  Rank by games to identify home parks."""

    rt = runs_scored.join(runs_allowed, lsuffix='_rs', rsuffix='_ra')

    # validate code
    assert (rt['games_rs'] == rt['games_ra']).all()
    assert rt['ra'].sum() == rt['rs'].sum()

    # rank games per team per year
    rt = rt.rename(columns={'games_rs': 'games'})
    rt = rt.drop(columns=['games_ra'])
    rt['rt'] = rt['rs'] + rt['ra']
    rt['rank'] = rt['games'].groupby(['team_id', 'year']).rank(ascending=False, method='first')
    
    rt = rt.drop(columns=['rs', 'ra'])

    return rt

In [16]:
def create_home_parks_runs(runs_total):
    """Similar to create_home_parks, except it has runs total and average runs per game."""

    hp = runs_total.query('rank == 1').copy()
    hp = hp.drop(columns='rank')
    hp['r_avg'] = hp['rt'] / hp['games']

    return hp

In [17]:
def create_road_parks_runs(runs_total):
    """Create dataframe with runs per team per road-park per year"""
    rp = runs_total.query('rank > 1').copy()
    rp = rp.drop(columns='rank')
    
    return rp

In [18]:
def compute_road_totals(road_parks):
    """Sum the totals on the road per team per year."""
    
    road_totals = road_parks.groupby(['team_id', 'year']).agg(
        rt=('rt', 'sum'), games=('games', 'sum'))
    road_totals['r_avg'] = road_totals['rt'] / road_totals['games']
    
    return road_totals

In [19]:
def compute_pf(home_parks_runs, road_totals):
    """Compute Park Factor."""
    
    pf = home_parks_runs['r_avg'] / road_totals['r_avg']
    pf = pf.to_frame()
    pf.columns = ['pf']

    return pf

In [20]:
# find which games to remove and remove them
tg_parks = create_tg_parks(team_game, game)
home_parks = create_home_parks(tg_parks)
home_parks_bats_last = create_home_parks_bats_last(home_parks, team_game, tg_parks)
team_game = remove_games(home_parks_bats_last, team_game)

# recompute tg_parks with fewer games
tg_parks = create_tg_parks(team_game, game)

In [21]:
# compute runs scored and runs allowed
runs_scored = compute_runs_scored(tg_parks)
runs_allowed = compute_runs_allowed(tg_parks)
runs_total = compute_runs_total(runs_scored, runs_allowed)

In [22]:
# split the run totals into home-park and road-park tables, then total the road
home_parks_runs = create_home_parks_runs(runs_total)
road_parks_runs = create_road_parks_runs(runs_total)
road_totals = compute_road_totals(road_parks_runs)

In [23]:
pf = compute_pf(home_parks_runs, road_totals)
pf.head(7)

team_id,year,park_id,pf
ANA,2015,ANA01,0.860724
ANA,2016,ANA01,0.910053
ANA,2017,ANA01,0.949176
ANA,2018,ANA01,0.968622
ANA,2019,ANA01,1.006353
ARI,2015,PHO01,1.061871
ARI,2016,PHO01,1.224932


In [24]:
home_parks_runs.head(3)

team_id,year,park_id,games,rt,r_avg
ANA,2015,ANA01,81,618.0,7.62963
ANA,2016,ANA01,81,688.0,8.493827
ANA,2017,ANA01,81,691.0,8.530864


In [25]:
road_parks_runs.head(3)

team_id,year,park_id,games,rt
ANA,2015,ARL02,10,115.0
ANA,2015,BAL12,3,14.0
ANA,2015,BOS07,3,35.0


# Adjust for Road Schedule
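
In symbols, one pass of the loop in this section computes the following for team $t$. The notation is introduced here for illustration: $p$ ranges over the road parks of team $t$ that have a PF, $R_{t,p}$ is the run total `rt`, $G_{t,p}$ is `games`, $\bar{r}^{\text{home}}_t$ is the home `r_avg`, and $\mathrm{PF}^{(k)}_p$ is the current estimate after rescaling to a mean of 1.0.

$$
\mathrm{PF}^{(k+1)}_t = \frac{\bar{r}^{\text{home}}_t}{\dfrac{1}{\sum_p G_{t,p}} \sum_p \dfrac{R_{t,p}}{\mathrm{PF}^{(k)}_p}},
\qquad
\overline{\mathrm{PF}}^{\text{road}}_t = \frac{\sum_p G_{t,p}\,\mathrm{PF}^{(k)}_p}{\sum_p G_{t,p}}
$$

The first expression corresponds to the `pf` column and the second to `pf_avg_road`.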

In [26]:
pf_orig = pf['pf']

In [27]:
for i in range(5):
    rp = road_parks_runs.reset_index()
    pf = pf.reset_index()

    # prior to adjusting by PF, ensure that mean is 1.0
    # this is a minor adjustment
    pf['pf'] /= pf['pf'].mean()

    # bring in the PF for each (year, park_id)
    rp = rp.merge(pf, on=['year', 'park_id'], suffixes=('', '_home'))

    rp = rp.drop(columns='team_id_home')

    # create adjusted road runs based on each park's pf
    rp['rt_adj'] = rp['rt'] / rp['pf']
    rp['pf_games'] = rp['pf'] * rp['games']  # to later compute weighted avg of road pf

    rp = rp.set_index(['team_id', 'year', 'park_id']).sort_index()

    road_totals = rp.groupby(['team_id', 'year']).agg(
        rt_adj=('rt_adj', 'sum'), games=('games', 'sum'),
        pf_adj_sum=('pf_games', 'sum'))

    road_totals['r_avg_adj'] = road_totals['rt_adj'] / road_totals['games']
    road_totals['pf_avg_road'] = road_totals['pf_adj_sum'] / road_totals['games']  # weighted avg of road pf

    pf = home_parks_runs['r_avg'] / road_totals['r_avg_adj']
    pf = pf.to_frame()
    pf.columns = ['pf']
    pf['pf_orig'] = pf_orig
    pf['pf_avg_road'] = road_totals['pf_avg_road']

    display(pf.query('team_id == "COL" and year==2019').round(3))

team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.306,1.394,0.936


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.311,1.394,0.939


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.305,1.394,0.935


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.302,1.394,0.932


team_id,year,park_id,pf,pf_orig,pf_avg_road
COL,2019,DEN02,1.299,1.394,0.931
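
The loop above runs a fixed five passes, and the COL 2019 value is still moving in the third decimal place. One possible alternative, sketched here under stated assumptions, is to stop when no value changes by more than a tolerance. The helper `iterate_until_stable` and the update rule `halfway_to_one` are hypothetical and serve only to exercise the stopping logic; they are not the notebook's PF update.

```python
def iterate_until_stable(step, start, tol=1e-4, max_iter=100):
    """Apply `step` repeatedly until no value changes by more than `tol`."""
    current = dict(start)
    for n in range(1, max_iter + 1):
        updated = step(current)
        change = max(abs(updated[k] - current[k]) for k in current)
        current = updated
        if change < tol:
            return current, n
    return current, max_iter


# Hypothetical update rule used only to exercise the helper:
# each value moves halfway toward 1.0 on every pass.
def halfway_to_one(values):
    return {k: (v + 1.0) / 2.0 for k, v in values.items()}


result, passes = iterate_until_stable(halfway_to_one, {"X": 1.4, "Y": 0.8})
print(passes, {k: round(v, 4) for k, v in result.items()})
```

To apply this to the PF loop, `step` would wrap one pass of the adjustment, and the tolerance would be chosen to match the precision that is reported.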


What are the minimum and maximum average road PFs?

In [28]:
pf['pf_avg_road'].agg(['min', 'max'])

min    0.930562
max    1.072243
Name: pf_avg_road, dtype: float64

These are significant departures from 1.0 (about 7% in either direction) and will change the corresponding home park PFs considerably.

In [29]:
pf['pf_avg_road'].idxmin()

('COL', 2019, 'DEN02')

This is not unexpected.  Given that Denver has the highest PF, Colorado must be playing its road games in parks with lower PFs, so its road PF average will be less than one.  Dividing the road runs by these lower PFs raises the adjusted road run average, which is the denominator of the PF, so the PF at Coors Field decreases (from 1.394 to 1.299 for 2019).
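
As a rough check on the size of this effect, the basic PF can be scaled by the average road PF. The numbers are copied from the final COL 2019 row printed by the loop above. The estimate is only approximate, because the loop divides each road park's runs by that park's own PF rather than by the games-weighted average.

```python
# Values copied from the final COL 2019 row printed by the loop.
pf_orig = 1.394      # basic PF, road parks assumed to average 1.0
pf_avg_road = 0.931  # games-weighted average PF of Colorado's road parks
pf_adjusted = 1.299  # PF after the road-schedule adjustment

# Back-of-envelope estimate: scale the basic PF by the average road PF.
estimate = pf_orig * pf_avg_road
print(round(estimate, 3), pf_adjusted)
```

The estimate, about 1.298, is close to the adjusted value of 1.299.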

In [30]:
pf['pf_avg_road'].idxmax()

('TBA', 2016, 'STP01')

In [31]:
# division alignment since 2013
mlb_division = {
    'BOS':'AL-East',
    'BAL':'AL-East',
    'NYA':'AL-East',
    'TBA':'AL-East',
    'TOR':'AL-East',
    'CHA':'AL-Central', 
    'CLE':'AL-Central', 
    'DET':'AL-Central', 
    'KCA':'AL-Central', 
    'MIN':'AL-Central', 
    'HOU':'AL-West', 
    'ANA':'AL-West', 
    'OAK':'AL-West', 
    'SEA':'AL-West', 
    'TEX':'AL-West', 
    'ATL':'NL-East', 
    'MIA':'NL-East', 
    'NYN':'NL-East', 
    'PHI':'NL-East', 
    'WAS':'NL-East', 
    'CHN':'NL-Central', 
    'CIN':'NL-Central', 
    'MIL':'NL-Central', 
    'PIT':'NL-Central', 
    'SLN':'NL-Central', 
    'ARI':'NL-West', 
    'COL':'NL-West', 
    'LAN':'NL-West', 
    'SDN':'NL-West', 
    'SFN':'NL-West'}

al_east = ['BOS', 'BAL', 'NYA', 'TBA', 'TOR']
al_central = ['CHA', 'CLE','DET', 'KCA', 'MIN']
al_west = ['HOU', 'ANA', 'OAK', 'SEA', 'TEX']
nl_east = ['ATL', 'MIA', 'NYN', 'PHI', 'WAS']
nl_central = ['CHN', 'CIN', 'MIL', 'PIT', 'SLN']
nl_west = ['ARI', 'COL', 'LAN', 'SDN', 'SFN']

In [32]:
pf.head()

team_id,year,park_id,pf,pf_orig,pf_avg_road
ANA,2015,ANA01,0.852147,0.860724,0.99578
ANA,2016,ANA01,0.893544,0.910053,0.986948
ANA,2017,ANA01,0.935991,0.949176,0.994577
ANA,2018,ANA01,0.960691,0.968622,1.016708
ANA,2019,ANA01,1.025482,1.006353,1.02192


In [33]:
pf2 = pf.reset_index()
pf2['div'] = pf2['team_id'].map(mlb_division)

In [34]:
pf2.groupby(['year','div'])[['pf']].agg('mean')

year,div,pf
2015,AL-Central,1.025025
2015,AL-East,1.058728
2015,AL-West,0.931605
2015,NL-Central,1.002328
2015,NL-East,0.9446
2015,NL-West,1.027719
2016,AL-Central,1.110906
2016,AL-East,1.092436
2016,AL-West,0.913574
2016,NL-Central,0.900196


# Compare with Previous NB

In [35]:
pf_nb01 = dh.from_csv_with_types(data_dir / 'retrosheet/nb_data/pf.csv')
pf_nb01 = pf_nb01.set_index(['team_id', 'year'])

In [36]:
pf_nb01

team_id,year,pf,pf_half,name
ANA,2015,0.86,0.93,Angels
ANA,2016,0.91,0.96,Angels
ANA,2017,0.95,0.97,Angels
ANA,2018,0.97,0.98,Angels
ANA,2019,1.01,1.00,Angels
...,...,...,...,...
WAS,2015,1.00,1.00,Nationals
WAS,2016,0.96,0.98,Nationals
WAS,2017,1.06,1.03,Nationals
WAS,2018,1.13,1.07,Nationals


In [37]:
# compute the relative difference per team per year; the maximum is located in the next cell
rel_diff = np.abs(1.0 - pf_nb01['pf'] / pf['pf'])

In [38]:
rel_diff.idxmax()

('WAS', 2018)
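
The relative difference can be worked by hand for a single row. This sketch uses the ANA 2015 values shown in the tables above; note that the previous notebook's `pf` appears there rounded to two decimals, so part of the difference is rounding.

```python
# Values taken from the tables above for ANA 2015.
pf_previous = 0.86      # pf from the previous notebook (rounded there)
pf_adjusted = 0.852147  # pf after the road-schedule adjustment

# Same expression as rel_diff in the notebook, applied to one row.
rel_diff = abs(1.0 - pf_previous / pf_adjusted)
print(f"{rel_diff:.2%}")
```

For this row the two notebooks differ by a little under 1%.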

# Summary
The Park Factor was adjusted for each team's road schedule.  In addition, the games-weighted average of the road Park Factor was computed.  For 2015 onward it ranged from about 0.93 (COL, 2019) to about 1.07 (TBA, 2016), which is significantly different from the road PF of 1.0 that the basic formula assumes.  This new metric, the average Park Factor on the road, may be useful.