# Baseball Analysis 1

**Baseball Notebooks**  
1. Download and unzip the baseball data.
2. Helper functions and the motivation for using them.
3. Wrangle and persist the Lahman data.
4. Parse the Retrosheet play-by-play data, collect it into 2 DataFrames, and persist them.
5. Wrangle the Retrosheet data in preparation for data analysis.
6. Load the wrangled Retrosheet data into Postgres.
7. This notebook.

Compute aggregates from Retrosheet and compare with appropriate values in Lahman.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

In [1]:
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

retrosheet = home.joinpath('data/retrosheet')
p_retro_wrangled = retrosheet.joinpath('wrangled')

# Verify Lahman Data Matches Retrosheet Data

Lahman has data aggregated by stint.

A stint is usually, but not always, the same as grouping by player_id, year, team_id.

For example, Preston Tucker (tuckepr01) played for ATL, was traded to CIN, and was then traded back to ATL, appearing for each in 2018.  Tucker therefore had 3 stints, but only 2 rows when grouped by player_id, year, team_id.

The Retrosheet data does not have stint information.

To compare the data between the two:
* Lahman batting/pitching will be aggregated by: player_id, year, team_id
* Retrosheet batting/pitching will be aggregated by: player_id_lahman, year, team_id_lahman
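
The effect of grouping stint-level rows can be sketched with a toy DataFrame. The at-bat numbers below are made up for illustration and are not taken from Lahman; only the shape of the data (3 stints, 2 teams) follows the example above.

```python
import pandas as pd

# Toy stint-level rows (made-up b_ab values): one player, three stints, two teams.
stints = pd.DataFrame({
    'player_id': ['tuckepr01'] * 3,
    'year': [2018] * 3,
    'stint': [1, 2, 3],
    'team_id': ['ATL', 'CIN', 'ATL'],
    'b_ab': [10, 20, 30],
})

# Summing over (player_id, year, team_id) merges the two ATL stints into one row.
by_team = stints.groupby(['player_id', 'year', 'team_id'])[['b_ab']].sum()
print(len(stints), len(by_team))
```

Three stint rows collapse to two grouped rows, which is why the stint column is dropped before comparing against Retrosheet.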

There are two reasons for performing this data comparison:
* verify that the data sources are (almost) the same
* verify that the processing of the data worked properly

In [4]:
os.chdir(p_lahman_wrangled)
lahman_batting = bb.from_csv_with_types("batting.csv")

In [5]:
# Retrosheet data was only collected from 1955 and on
lahman_batting = lahman_batting[lahman_batting['year'] >= 1955]

In [6]:
lahman_batting['year'].min(), lahman_batting['year'].max()

(1955, 2018)

In [7]:
lahman_batting.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68397 entries, 18796 to 87192
Data columns (total 22 columns):
player_id    68397 non-null object
year         68397 non-null uint16
stint        68397 non-null uint8
team_id      68397 non-null object
lg_id        68397 non-null object
b_g          68397 non-null uint8
b_ab         68397 non-null uint16
b_r          68397 non-null uint8
b_h          68397 non-null uint16
b_2b         68397 non-null uint8
b_3b         68397 non-null uint8
b_hr         68397 non-null uint8
b_rbi        68397 non-null uint8
b_sb         68397 non-null uint8
b_cs         68397 non-null float64
b_bb         68397 non-null uint8
b_so         68397 non-null uint8
b_ibb        68397 non-null float64
b_hp         68397 non-null uint8
b_sh         68397 non-null uint8
b_sf         68397 non-null float64
b_gdp        68397 non-null float64
dtypes: float64(4), object(3), uint16(3), uint8(12)
memory usage: 5.3+ MB


In [8]:
lahman_batting_cols = [col for col in lahman_batting.columns if col.startswith('b_')]

In [9]:
os.chdir(p_retro_wrangled)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [10]:
player_game['year'].min(), player_game['year'].max()

(1955, 2018)

In [11]:
retro_batting_cols = [col for col in player_game.columns if col.startswith('b_')]

In [12]:
# cols in lahman, not in retro
set(lahman_batting_cols) - set(retro_batting_cols)

set()

In [13]:
# cols in retro, not in lahman
set(retro_batting_cols) - set(lahman_batting_cols)

{'b_pa', 'b_xi'}

As per the above, Retrosheet has two additional batting columns: plate appearances (b_pa) and reached base on interference (b_xi).  Except when considering eligibility for the batting title, for which 502 plate appearances are required, these 2 stats are not very important.

Perform the analysis with all the Lahman batting columns.

In [14]:
retro_batting_cols = ['player_id_lahman', 'year', 'team_id_lahman']
retro_batting_cols.extend(lahman_batting_cols)
retro_batting_cols

['player_id_lahman',
 'year',
 'team_id_lahman',
 'b_g',
 'b_ab',
 'b_r',
 'b_h',
 'b_2b',
 'b_3b',
 'b_hr',
 'b_rbi',
 'b_sb',
 'b_cs',
 'b_bb',
 'b_so',
 'b_ibb',
 'b_hp',
 'b_sh',
 'b_sf',
 'b_gdp']

In [15]:
tmp = lahman_batting_cols.copy()
lahman_batting_cols = ['player_id', 'year', 'team_id']
lahman_batting_cols.extend(tmp)
lahman_batting_cols

['player_id',
 'year',
 'team_id',
 'b_g',
 'b_ab',
 'b_r',
 'b_h',
 'b_2b',
 'b_3b',
 'b_hr',
 'b_rbi',
 'b_sb',
 'b_cs',
 'b_bb',
 'b_so',
 'b_ibb',
 'b_hp',
 'b_sh',
 'b_sf',
 'b_gdp']

In [16]:
retro_batting = player_game[retro_batting_cols]
retro_batting.tail(3)

,player_id_lahman,year,team_id_lahman,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
3549696,mcraeal01,2018,PIT,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3549697,burdini01,2018,PIT,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3549698,burdini01,2018,PIT,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [17]:
lahman_batting = lahman_batting[lahman_batting_cols]
lahman_batting.tail(3)

,player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
87190,zimmery01,2018,WAS,85,288,33,76,21,2,13,51,1,1.0,30,55,1.0,3,0,2.0,10.0
87191,zobribe01,2018,CHN,139,455,67,139,28,3,9,58,3,4.0,55,60,1.0,2,1,7.0,8.0
87192,zuninmi01,2018,SEA,113,373,37,75,18,0,20,44,0,0.0,24,150,0.0,6,0,2.0,7.0


In [18]:
retro_grouped = retro_batting.groupby(by=['player_id_lahman', 'year', 'team_id_lahman'])

In [19]:
retro_batting_agg = retro_grouped.aggregate(np.sum)
retro_batting_agg.head(3)

player_id_lahman,year,team_id_lahman,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
aardsda01,2004,SFN,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aardsda01,2006,CHN,45.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
aardsda01,2007,CHA,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# no nulls in entire dataframe
retro_batting_agg.isna().sum().sum()

0

In [21]:
# rename index to allow for easier dataframe comparison
retro_batting_agg.index.names = ['player_id', 'year', 'team_id']
retro_batting_agg.head(3)

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
aardsda01,2004,SFN,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aardsda01,2006,CHN,45.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
aardsda01,2007,CHA,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Lahman has a row per 'stint'
For example, tuckepr01 started in ATL, went to CIN, then back to ATL, for 3 stints in 1 year with two teams.

In [22]:
lahman_grouped = lahman_batting.groupby(by=['player_id', 'year', 'team_id'])

In [23]:
lahman_batting_agg = lahman_grouped.aggregate(np.sum)
lahman_batting_agg.head(3)

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
aardsda01,2004,SFN,11,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,0,0.0,0.0
aardsda01,2006,CHN,45,2,0,0,0,0,0,0,0,0.0,0,0,0.0,0,1,0.0,0.0
aardsda01,2007,CHA,25,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,0,0.0,0.0


In [24]:
# no nulls in entire dataframe
lahman_batting_agg.isna().sum().sum()

0

In [25]:
# which rows are in retro, but not in lahman
retro_batting_agg.loc[~retro_batting_agg.index.isin(lahman_batting_agg.index)]

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp


In [26]:
# same using set notation
set(retro_batting_agg.index.to_list()) - set(lahman_batting_agg.index.to_list())

set()

In [27]:
# which rows are in lahman, not in retro
lahman_batting_agg.loc[~lahman_batting_agg.index.isin(retro_batting_agg.index)]

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
fanniji01,1956,CHN,1,4,0,1,0,0,0,0,0,0.0,0,0,0.0,0,0,0.0,1.0


In [28]:
# same using set notation
set(lahman_batting_agg.index.to_list()) - set(retro_batting_agg.index.to_list())

{('fanniji01', 1956, 'CHN')}

In [29]:
retro_batting_agg.shape, lahman_batting_agg.shape

((68370, 17), (68371, 17))

In [37]:
# drop the extra lahman batting agg row
lahman_batting_agg = lahman_batting_agg.drop(('fanniji01', 1956, 'CHN'))

In [38]:
lahman_batting_agg.shape

(68370, 17)

In [40]:
(lahman_batting_agg.index == retro_batting_agg.index).all()

True

### Data Note

Retrosheet data is known to be missing a few of the older baseball games.

Per above, only one (player_id, year, team_id) key differs between the two aggregates, and that row represents only 4 at bats.

In [53]:
retro_agg_all = retro_batting_agg.aggregate(np.sum)
lahman_agg_all = lahman_batting_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

b_g      1.001
b_ab     1.001
b_r      1.001
b_h      1.001
b_2b     1.001
b_3b     1.003
b_hr     1.001
b_rbi    1.001
b_sb     1.001
b_cs     0.997
b_bb     1.001
b_so     1.001
b_ibb    1.001
b_hp     1.001
b_sh     1.002
b_sf     1.001
b_gdp    1.000
dtype: float64

From the above, the Lahman totals are about 0.1% greater than the Retrosheet totals, which is consistent with Retrosheet being known to be missing a few games.  The two sources agree very closely.
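
The column totals show how close the sources are overall, but not which rows differ. A minimal sketch of a row-level check follows, using two toy frames that share an index. The names `lahman_toy` and `retro_toy` and all values are hypothetical stand-ins for the aggregated frames.

```python
import pandas as pd

# Toy aggregates sharing the same (player_id, year, team_id) index.
idx = pd.MultiIndex.from_tuples(
    [('aaa01', 2000, 'BOS'), ('bbb01', 2000, 'NYA'), ('ccc01', 2001, 'BOS')],
    names=['player_id', 'year', 'team_id'])
lahman_toy = pd.DataFrame({'b_ab': [100, 50, 7], 'b_h': [30, 10, 2]}, index=idx)
retro_toy = pd.DataFrame({'b_ab': [100, 48, 7], 'b_h': [30, 10, 2]}, index=idx)

# Element-wise difference; keep only rows where at least one column disagrees.
diff = lahman_toy - retro_toy
mismatched = diff[(diff != 0).any(axis=1)]
print(mismatched)
```

If both real frames held unsigned integer columns, the subtraction could wrap around instead of going negative, so casting to a signed type first may be safer.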

# Compare Pitching Aggregates

In [112]:
os.chdir(p_lahman_wrangled)
lahman_pitching = bb.from_csv_with_types("pitching.csv")

In [113]:
# Retrosheet data was only collected from 1955 and on
lahman_pitching = lahman_pitching[lahman_pitching['year'] >= 1955]

In [114]:
lahman_pitching['year'].min(), lahman_pitching['year'].max()

(1955, 2018)

In [115]:
lahman_pitching.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32121 entries, 7812 to 39932
Data columns (total 30 columns):
player_id    32121 non-null object
year         32121 non-null uint16
stint        32121 non-null uint8
team_id      32121 non-null object
lg_id        32121 non-null object
p_w          32121 non-null uint8
p_l          32121 non-null uint8
p_g          32121 non-null uint8
p_gs         32121 non-null uint8
p_cg         32121 non-null uint8
p_sho        32121 non-null uint8
p_sv         32121 non-null uint8
p_out        32121 non-null uint16
p_h          32121 non-null uint16
p_er         32121 non-null uint8
p_hr         32121 non-null uint8
p_bb         32121 non-null uint8
p_so         32121 non-null uint16
p_ba_opp     32111 non-null float64
p_era        32080 non-null float64
p_ibb        32121 non-null float64
p_wp         32121 non-null uint8
p_hp         32121 non-null uint8
p_bk         32121 non-null uint8
p_bfp        32121 non-null float64
p_gf         32121 non-

In [116]:
lahman_pitching_cols = [col for col in lahman_pitching.columns if col.startswith('p_')]

In [117]:
# os.chdir(p_retro_wrangled)
# player_game = bb.from_csv_with_types('player_game.csv.gz')

In [118]:
player_game['year'].min(), player_game['year'].max()

(1955, 2018)

In [119]:
retro_pitching_cols = [col for col in player_game.columns if col.startswith('p_')]

In [120]:
# cols in lahman, not in retro
set(lahman_pitching_cols) - set(retro_pitching_cols)

{'p_ba_opp', 'p_bfp', 'p_era'}

In [121]:
# cols in retro, not in lahman
set(retro_pitching_cols) - set(lahman_pitching_cols)

{'p_2b', 'p_3b', 'p_ab', 'p_tbf', 'p_xi'}

In [124]:
# will compare all columns that exist in both
pitching_cols = set(lahman_pitching_cols) & set(retro_pitching_cols)
pitching_cols = list(pitching_cols)
pitching_cols

['p_so',
 'p_hr',
 'p_er',
 'p_sv',
 'p_w',
 'p_wp',
 'p_gdp',
 'p_sh',
 'p_sf',
 'p_cg',
 'p_ibb',
 'p_g',
 'p_l',
 'p_hp',
 'p_r',
 'p_gf',
 'p_out',
 'p_bb',
 'p_gs',
 'p_bk',
 'p_h',
 'p_sho']

In [125]:
retro_pitching_cols = ['player_id_lahman', 'year', 'team_id_lahman']
retro_pitching_cols.extend(pitching_cols)
retro_pitching = player_game[retro_pitching_cols]
retro_pitching.columns

Index(['player_id_lahman', 'year', 'team_id_lahman', 'p_so', 'p_hr', 'p_er',
       'p_sv', 'p_w', 'p_wp', 'p_gdp', 'p_sh', 'p_sf', 'p_cg', 'p_ibb', 'p_g',
       'p_l', 'p_hp', 'p_r', 'p_gf', 'p_out', 'p_bb', 'p_gs', 'p_bk', 'p_h',
       'p_sho'],
      dtype='object')

In [126]:
lahman_pitching_cols = ['player_id', 'year', 'team_id']
lahman_pitching_cols.extend(pitching_cols.copy())
lahman_pitching = lahman_pitching[lahman_pitching_cols]
lahman_pitching.columns

Index(['player_id', 'year', 'team_id', 'p_so', 'p_hr', 'p_er', 'p_sv', 'p_w',
       'p_wp', 'p_gdp', 'p_sh', 'p_sf', 'p_cg', 'p_ibb', 'p_g', 'p_l', 'p_hp',
       'p_r', 'p_gf', 'p_out', 'p_bb', 'p_gs', 'p_bk', 'p_h', 'p_sho'],
      dtype='object')

In [148]:
retro_grouped = retro_pitching.groupby(by=['player_id_lahman', 'year', 'team_id_lahman'])
retro_pitching_agg = retro_grouped.aggregate(np.sum)
retro_pitching_agg.index.names = ['player_id', 'year', 'team_id']

In [149]:
lahman_grouped = lahman_pitching.groupby(by=['player_id', 'year', 'team_id'])
lahman_pitching_agg = lahman_grouped.aggregate(np.sum)

In [150]:
# no nulls in entire dataframe
retro_pitching_agg.isna().sum().sum()

0

In [151]:
# no nulls in entire dataframe
lahman_pitching_agg.isna().sum().sum()

0

In [152]:
retro_only = list(set(retro_pitching_agg.index) - set(lahman_pitching_agg.index))
len(retro_only)

36259

In [153]:
lahman_only = set(lahman_pitching_agg.index) - set(retro_pitching_agg.index)
len(lahman_only)

0

In [154]:
# look at some of the differences
retro_pitching_agg.loc[retro_only, 'p_out'].head()

player_id  year  team_id
hintoch01  1963  WS2        0.0
triangu01  1958  BAL        0.0
pujolal01  2008  SLN        0.0
jimenda01  2007  WAS        0.0
henribo01  1957  CIN        0.0
Name: p_out, dtype: float64

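As per the above, the sampled (player_id, year, team_id) rows that are in Retrosheet but not in Lahman recorded no outs over the entire year (for example, position players who never pitched).  Remove all rows with zero outs.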

In [157]:
criteria = (retro_pitching_agg['p_out'] == 0)
retro_pitching_agg = retro_pitching_agg.drop(retro_pitching_agg[criteria].index)

In [158]:
retro_only = list(set(retro_pitching_agg.index) - set(lahman_pitching_agg.index))
len(retro_only)

0

As per the above, every (player_id, year, team_id) in both aggregates is the same.
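
Both directions of the key comparison can also be checked in one step. The sketch below uses `symmetric_difference` on two toy MultiIndex objects; the keys are made up, and `a` and `b` are hypothetical stand-ins for the two aggregate indexes.

```python
import pandas as pd

names = ['player_id', 'year', 'team_id']
a = pd.MultiIndex.from_tuples(
    [('aaa01', 2000, 'BOS'), ('bbb01', 2000, 'NYA')], names=names)
b = pd.MultiIndex.from_tuples(
    [('aaa01', 2000, 'BOS'), ('bbb01', 2000, 'NYA'), ('ccc01', 2001, 'BOS')],
    names=names)

# Keys present in exactly one of the two indexes, regardless of which one.
only_one_side = a.symmetric_difference(b)
print(list(only_one_side))
```

An empty result would mean the two aggregates have exactly the same keys.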

In [159]:
# with the indexes and columns the same, the data can be compared
retro_agg_all = retro_pitching_agg.aggregate(np.sum)
lahman_agg_all = lahman_pitching_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

p_so     1.001
p_hr     1.001
p_er     1.001
p_sv     1.001
p_w      1.001
p_wp     1.001
p_gdp    0.788
p_sh     0.794
p_sf     0.848
p_cg     1.003
p_ibb    1.002
p_g      1.001
p_l      1.002
p_hp     1.001
p_r      1.001
p_gf     1.001
p_out    1.001
p_bb     1.001
p_gs     1.001
p_bk     1.000
p_h      1.001
p_sho    1.002
dtype: float64

In [None]:
# p_gdp, p_sh and p_sf are well below 1.0 above, unlike every other column
# compare the data from 1975 on to see if the gap is limited to earlier years

In [166]:
retro_agg_1975 = retro_pitching_agg.loc[(slice(None), slice(1975,2018), slice(None)), :]
lahman_agg_1975 = lahman_pitching_agg.loc[(slice(None), slice(1975,2018), slice(None)), :]

In [167]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

p_so     1.000
p_hr     1.000
p_er     1.000
p_sv     1.000
p_w      1.000
p_wp     0.999
p_gdp    0.946
p_sh     1.000
p_sf     1.000
p_cg     1.001
p_ibb    1.000
p_g      1.000
p_l      1.000
p_hp     1.000
p_r      1.000
p_gf     1.000
p_out    1.000
p_bb     1.000
p_gs     1.000
p_bk     1.000
p_h      1.000
p_sho    1.001
dtype: float64
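
To see in which years a column such as p_gdp diverges, the same ratio can be computed per year rather than over a fixed range. This is a sketch on toy data: the frames `lahman_toy` and `retro_toy` and their values are hypothetical, chosen so that only the earlier year disagrees.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('aaa01', 1960, 'BOS'), ('bbb01', 1960, 'NYA'),
     ('aaa01', 1980, 'BOS'), ('bbb01', 1980, 'NYA')],
    names=['player_id', 'year', 'team_id'])
lahman_toy = pd.DataFrame({'p_gdp': [0, 4, 5, 5]}, index=idx)
retro_toy = pd.DataFrame({'p_gdp': [4, 4, 5, 5]}, index=idx)

# Sum each source by the 'year' index level, then divide year by year.
ratio_by_year = (lahman_toy.groupby(level='year').sum()
                 / retro_toy.groupby(level='year').sum())
print(ratio_by_year)
```

Applied to the real aggregates, a table like this would show whether the low ratios come from a specific span of seasons.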

# Perform above using Postgres

In [None]:
# Get the user and password from the environment (rather than hardcoding it)
import os
from sqlalchemy.engine import create_engine
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# treat the SQLAlchemy engine as a connection to the database
conn = create_engine(connect_str)
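
The pandas groupby-and-sum used earlier maps directly onto a SQL `GROUP BY`. The sketch below uses an in-memory SQLite database as a stand-in for Postgres so that it runs without a server. The table name `player_game_toy`, the connection name `conn_toy`, and the rows are assumptions for illustration only, not the actual schema.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the Postgres database (assumption for this sketch).
conn_toy = sqlite3.connect(':memory:')
pd.DataFrame({
    'player_id': ['aaa01', 'aaa01', 'bbb01'],
    'year': [2018, 2018, 2018],
    'team_id': ['BOS', 'BOS', 'NYA'],
    'b_ab': [4, 3, 5],
}).to_sql('player_game_toy', conn_toy, index=False)

# Same aggregation as groupby(['player_id', 'year', 'team_id']).sum()
sql = """
SELECT player_id, year, team_id, SUM(b_ab) AS b_ab
FROM player_game_toy
GROUP BY player_id, year, team_id
ORDER BY player_id, year, team_id
"""
agg = pd.read_sql(sql, conn_toy)
print(agg)
```

Against the real database, the same query shape would be passed to `pd.read_sql` with the SQLAlchemy engine in place of `conn_toy`.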

In [None]:
%%timeit
# same but use SQL
sql = """
SELECT player_id
FROM batting
WHERE year_id = '2018'
"""
df = pd.read_sql(sql, conn)

In [None]:
df.equals(result)

In [None]:
result.columns

In [None]:
%%timeit
# convert to retro_id
retro_id = people[people['player_id'].isin(result['player_id'])]['retro_id']