# Baseball Analysis 1

**Baseball Notebooks**  
1. Download and unzip the baseball data.
2. Helper functions and their motivation for use.
3. Wrangle and persist the Lahman data.
4. Parse the Retrosheet play-by-play data, collect it into 2 DataFrames, and persist it.
5. Wrangle the Retrosheet data in preparation for data analysis.
6. Load the wrangled Retrosheet data into Postgres.
7. This notebook.

Compute aggregates from Retrosheet to compare with appropriate values in Lahman.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

In [1]:
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

retrosheet = home.joinpath('data/retrosheet')
p_retro_wrangled = retrosheet.joinpath('wrangled')

# Verify Lahman Data Matches Retrosheet Data

Lahman has data aggregated by stint.

A stint is usually, but not always, the same as grouping by (player_id, year, team_id).

An example where they differ is Preston Tucker in 2018.  He played for ATL, was traded to CIN, then was traded back to ATL, and played for each.  Tucker had 3 stints but only two rows when grouped by (player_id, year, team_id).

The Retrosheet data does not have stint information.

To compare the data between the two:
* Lahman batting/pitching will be aggregated by: player_id, year, team_id
* Retrosheet batting/pitching will be aggregated by: player_id_lahman, year, team_id_lahman
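
The difference between stints and the (player_id, year_id, team_id) grouping can be illustrated with a minimal sketch. The DataFrame below is a toy stand-in for the Lahman batting table; the `b_ab` values are made up for illustration and are not the player's real statistics.

```python
import pandas as pd

# Toy Lahman-style rows for one player; the b_ab values are invented.
toy_batting = pd.DataFrame({
    'player_id': ['tuckepr01', 'tuckepr01', 'tuckepr01'],
    'year_id': [2018, 2018, 2018],
    'stint': [1, 2, 3],
    'team_id': ['ATL', 'CIN', 'ATL'],
    'b_ab': [100, 50, 10],
})

# Summing by (player_id, year_id, team_id) merges the two ATL stints,
# so three stint rows collapse into two grouped rows.
toy_agg = (toy_batting
           .drop(columns='stint')
           .groupby(['player_id', 'year_id', 'team_id'])
           .sum())

print(len(toy_batting), len(toy_agg))
```

Grouping this way discards the stint boundaries, which is acceptable here because Retrosheet has no stint column to compare against.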

Some reasons for performing this data comparison:
* verify that the data sources have (almost completely) consistent data
* verify that the processing of the data was performed properly

# Compare Batting Data

In [4]:
os.chdir(p_lahman_wrangled)
lahman_batting = bb.from_csv_with_types("batting.csv")

In [6]:
# Retrosheet data was only collected from 1955 and on
lahman_batting = lahman_batting[lahman_batting['year_id'] >= 1955]

In [7]:
lahman_batting['year_id'].min(), lahman_batting['year_id'].max()

(1955, 2018)

In [8]:
lahman_batting.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68397 entries, 18796 to 87192
Data columns (total 22 columns):
player_id    68397 non-null object
year_id      68397 non-null uint16
stint        68397 non-null uint8
team_id      68397 non-null object
lg_id        68397 non-null object
b_g          68397 non-null uint8
b_ab         68397 non-null uint16
b_r          68397 non-null uint8
b_h          68397 non-null uint16
b_2b         68397 non-null uint8
b_3b         68397 non-null uint8
b_hr         68397 non-null uint8
b_rbi        68397 non-null uint8
b_sb         68397 non-null uint8
b_cs         68397 non-null float64
b_bb         68397 non-null uint8
b_so         68397 non-null uint8
b_ibb        68397 non-null float64
b_hp         68397 non-null uint8
b_sh         68397 non-null uint8
b_sf         68397 non-null float64
b_gdp        68397 non-null float64
dtypes: float64(4), object(3), uint16(3), uint8(12)
memory usage: 5.3+ MB


In [9]:
lahman_batting_cols = [col for col in lahman_batting.columns if col.startswith('b_')]

In [10]:
os.chdir(p_retro_wrangled)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [11]:
player_game['year_id'].min(), player_game['year_id'].max()

(1955, 2018)

In [12]:
retro_batting_cols = [col for col in player_game.columns if col.startswith('b_')]

In [13]:
# cols in lahman, not in retro
set(lahman_batting_cols) - set(retro_batting_cols)

set()

In [14]:
# cols in retro, not in lahman
set(retro_batting_cols) - set(lahman_batting_cols)

{'b_pa', 'b_xi'}

As per the above, Retrosheet has two additional columns:
* b_pa = plate appearances
* b_xi = safe on interference (usually by the catcher)

b_pa only matters for batting title eligibility, for which 502 plate appearances are required.  
b_xi is a rare event that probably isn't helpful for this comparison.

Perform the analysis with all the Lahman batting columns.

In [15]:
retro_batting_cols = ['player_id_lahman', 'year_id', 'team_id_lahman']
retro_batting_cols.extend(lahman_batting_cols)
retro_batting_cols

['player_id_lahman',
 'year_id',
 'team_id_lahman',
 'b_g',
 'b_ab',
 'b_r',
 'b_h',
 'b_2b',
 'b_3b',
 'b_hr',
 'b_rbi',
 'b_sb',
 'b_cs',
 'b_bb',
 'b_so',
 'b_ibb',
 'b_hp',
 'b_sh',
 'b_sf',
 'b_gdp']

In [16]:
tmp = lahman_batting_cols.copy()
lahman_batting_cols = ['player_id', 'year_id', 'team_id']
lahman_batting_cols.extend(tmp)
lahman_batting_cols

['player_id',
 'year_id',
 'team_id',
 'b_g',
 'b_ab',
 'b_r',
 'b_h',
 'b_2b',
 'b_3b',
 'b_hr',
 'b_rbi',
 'b_sb',
 'b_cs',
 'b_bb',
 'b_so',
 'b_ibb',
 'b_hp',
 'b_sh',
 'b_sf',
 'b_gdp']

In [17]:
retro_batting = player_game[retro_batting_cols]
retro_batting.tail(3)

,player_id_lahman,year_id,team_id_lahman,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
3549696,mcraeal01,2018,PIT,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3549697,burdini01,2018,PIT,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3549698,burdini01,2018,PIT,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
lahman_batting = lahman_batting[lahman_batting_cols]
lahman_batting.tail(3)

,player_id,year_id,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
87190,zimmery01,2018,WAS,85,288,33,76,21,2,13,51,1,1.0,30,55,1.0,3,0,2.0,10.0
87191,zobribe01,2018,CHN,139,455,67,139,28,3,9,58,3,4.0,55,60,1.0,2,1,7.0,8.0
87192,zuninmi01,2018,SEA,113,373,37,75,18,0,20,44,0,0.0,24,150,0.0,6,0,2.0,7.0


In [19]:
retro_grouped = retro_batting.groupby(by=['player_id_lahman',
                                          'year_id', 'team_id_lahman'])

In [20]:
retro_batting_agg = retro_grouped.aggregate(np.sum)
retro_batting_agg.head(3)

player_id_lahman,year_id,team_id_lahman,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
aardsda01,2004,SFN,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aardsda01,2006,CHN,45.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
aardsda01,2007,CHA,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
# no nulls in entire dataframe
retro_batting_agg.isna().sum().sum()

0

In [22]:
# rename index to allow for easier dataframe comparison
retro_batting_agg.index.names = ['player_id', 'year_id', 'team_id']
retro_batting_agg.head(3)

player_id,year_id,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
aardsda01,2004,SFN,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aardsda01,2006,CHN,45.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
aardsda01,2007,CHA,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Lahman has a row per 'stint'
For example, tuckepr01 started in ATL, went to CIN, then back to ATL, for 3 stints in 1 year with two teams.

In [23]:
lahman_grouped = lahman_batting.groupby(by=['player_id', 'year_id', 'team_id'])

In [24]:
lahman_batting_agg = lahman_grouped.aggregate(np.sum)
lahman_batting_agg.head(3)

player_id,year_id,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
aardsda01,2004,SFN,11,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,0,0.0,0.0
aardsda01,2006,CHN,45,2,0,0,0,0,0,0,0,0.0,0,0,0.0,0,1,0.0,0.0
aardsda01,2007,CHA,25,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,0,0.0,0.0


In [25]:
# no nulls in entire dataframe
lahman_batting_agg.isna().sum().sum()

0

In [26]:
# which rows are in retro, but not in lahman
retro_batting_agg.loc[~retro_batting_agg.index.isin(lahman_batting_agg.index)]

player_id,year_id,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp


In [27]:
# same using set notation
set(retro_batting_agg.index.to_list()) - set(lahman_batting_agg.index.to_list())

set()

In [28]:
# which rows are in lahman, not in retro
lahman_batting_agg.loc[~lahman_batting_agg.index.isin(retro_batting_agg.index)]

player_id,year_id,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
fanniji01,1956,CHN,1,4,0,1,0,0,0,0,0,0.0,0,0,0.0,0,0,0.0,1.0


In [29]:
# same using set notation
set(lahman_batting_agg.index.to_list()) - set(retro_batting_agg.index.to_list())

{('fanniji01', 1956, 'CHN')}

In [30]:
retro_batting_agg.shape, lahman_batting_agg.shape

((68370, 17), (68371, 17))

In [31]:
# drop the extra lahman batting agg row
lahman_batting_agg = lahman_batting_agg.drop(('fanniji01', 1956, 'CHN'))

In [32]:
lahman_batting_agg.shape

(68370, 17)

In [33]:
(lahman_batting_agg.index == retro_batting_agg.index).all()

True

### Data Note

Retrosheet data is known to be missing a few of the older baseball games.

Per above, only one (player_id, year_id, team_id) key is not present in both aggregates, and that row represents only 4 at bats.

In [34]:
retro_agg_all = retro_batting_agg.aggregate(np.sum)
lahman_agg_all = lahman_batting_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

b_g      1.001
b_ab     1.001
b_r      1.001
b_h      1.001
b_2b     1.001
b_3b     1.003
b_hr     1.001
b_rbi    1.001
b_sb     1.001
b_cs     0.997
b_bb     1.001
b_so     1.001
b_ibb    1.001
b_hp     1.001
b_sh     1.002
b_sf     1.001
b_gdp    1.000
dtype: float64

From the above, the Lahman totals are about 0.1% greater than the Retrosheet totals for most columns (b_cs is about 0.3% lower).  Retrosheet is known to be missing a few games, so this is very close.
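
The ratio of column totals can hide offsetting errors, so a row-level check may be a useful complement. The following is a minimal sketch on toy data, assuming both aggregates share the same index and columns (as `lahman_batting_agg` and `retro_batting_agg` do after the extra row is dropped). The names `toy_retro` and `toy_lahman` and all of their values are hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('aaa01', 2000, 'AAA'), ('bbb01', 2000, 'BBB'), ('ccc01', 2001, 'AAA')],
    names=['player_id', 'year_id', 'team_id'])

# Toy aggregates with identical keys; one b_ab value differs on purpose.
toy_retro = pd.DataFrame({'b_ab': [10.0, 20.0, 30.0],
                          'b_h': [3.0, 5.0, 9.0]}, index=idx)
toy_lahman = pd.DataFrame({'b_ab': [10, 20, 31],
                           'b_h': [3, 5, 9]}, index=idx)

# True wherever a cell differs between the two sources.
diff_mask = toy_lahman.ne(toy_retro)

# Number of differing cells per column, and number of rows with any difference.
print(diff_mask.sum())
rows_with_diff = diff_mask.any(axis=1)
print(rows_with_diff.sum())
```

Rows flagged in `rows_with_diff` could then be inspected individually to see whether the differences cluster in particular seasons.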

# Compare Pitching Data

In [35]:
os.chdir(p_lahman_wrangled)
lahman_pitching = bb.from_csv_with_types("pitching.csv")

In [36]:
# Retrosheet data was only collected from 1955 and on
lahman_pitching = lahman_pitching[lahman_pitching['year_id'] >= 1955]

In [37]:
lahman_pitching['year_id'].min(), lahman_pitching['year_id'].max()

(1955, 2018)

In [38]:
lahman_pitching.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32121 entries, 7812 to 39932
Data columns (total 30 columns):
player_id    32121 non-null object
year_id      32121 non-null uint16
stint        32121 non-null uint8
team_id      32121 non-null object
lg_id        32121 non-null object
p_w          32121 non-null uint8
p_l          32121 non-null uint8
p_g          32121 non-null uint8
p_gs         32121 non-null uint8
p_cg         32121 non-null uint8
p_sho        32121 non-null uint8
p_sv         32121 non-null uint8
p_out        32121 non-null uint16
p_h          32121 non-null uint16
p_er         32121 non-null uint8
p_hr         32121 non-null uint8
p_bb         32121 non-null uint8
p_so         32121 non-null uint16
p_ba_opp     32111 non-null float64
p_era        32080 non-null float64
p_ibb        32121 non-null float64
p_wp         32121 non-null uint8
p_hp         32121 non-null uint8
p_bk         32121 non-null uint8
p_bfp        32121 non-null float64
p_gf         32121 non-

In [39]:
lahman_pitching_cols = [col for col in lahman_pitching.columns if col.startswith('p_')]

In [40]:
player_game['year_id'].min(), player_game['year_id'].max()

(1955, 2018)

In [41]:
retro_pitching_cols = [col for col in player_game.columns if col.startswith('p_')]

In [42]:
# cols in lahman, not in retro
set(lahman_pitching_cols) - set(retro_pitching_cols)

{'p_ba_opp', 'p_bfp', 'p_era'}

In [43]:
# cols in retro, not in lahman
set(retro_pitching_cols) - set(lahman_pitching_cols)

{'p_2b', 'p_3b', 'p_ab', 'p_tbf', 'p_xi'}

In [44]:
# will compare all columns that exist in both
pitching_cols = set(lahman_pitching_cols) & set(retro_pitching_cols)
pitching_cols = list(pitching_cols)
pitching_cols

['p_sh',
 'p_ibb',
 'p_gs',
 'p_h',
 'p_bk',
 'p_hr',
 'p_cg',
 'p_w',
 'p_g',
 'p_gf',
 'p_gdp',
 'p_hp',
 'p_sv',
 'p_l',
 'p_sho',
 'p_so',
 'p_r',
 'p_bb',
 'p_er',
 'p_out',
 'p_wp',
 'p_sf']

In [45]:
retro_pitching_cols = ['player_id_lahman', 'year_id', 'team_id_lahman']
retro_pitching_cols.extend(pitching_cols)
retro_pitching = player_game[retro_pitching_cols]
retro_pitching.columns

Index(['player_id_lahman', 'year_id', 'team_id_lahman', 'p_sh', 'p_ibb',
       'p_gs', 'p_h', 'p_bk', 'p_hr', 'p_cg', 'p_w', 'p_g', 'p_gf', 'p_gdp',
       'p_hp', 'p_sv', 'p_l', 'p_sho', 'p_so', 'p_r', 'p_bb', 'p_er', 'p_out',
       'p_wp', 'p_sf'],
      dtype='object')

In [46]:
lahman_pitching_cols = ['player_id', 'year_id', 'team_id']
lahman_pitching_cols.extend(pitching_cols.copy())
lahman_pitching = lahman_pitching[lahman_pitching_cols]
lahman_pitching.columns

Index(['player_id', 'year_id', 'team_id', 'p_sh', 'p_ibb', 'p_gs', 'p_h',
       'p_bk', 'p_hr', 'p_cg', 'p_w', 'p_g', 'p_gf', 'p_gdp', 'p_hp', 'p_sv',
       'p_l', 'p_sho', 'p_so', 'p_r', 'p_bb', 'p_er', 'p_out', 'p_wp', 'p_sf'],
      dtype='object')

In [47]:
retro_grouped = retro_pitching.groupby(by=['player_id_lahman',
                                           'year_id','team_id_lahman'])
retro_pitching_agg = retro_grouped.aggregate(np.sum)
retro_pitching_agg.index.names = ['player_id', 'year_id', 'team_id']

In [48]:
lahman_grouped = lahman_pitching.groupby(by=['player_id', 'year_id', 'team_id'])
lahman_pitching_agg = lahman_grouped.aggregate(np.sum)

In [49]:
# no nulls in entire dataframe
retro_pitching_agg.isna().sum().sum()

0

In [50]:
# no nulls in entire dataframe
lahman_pitching_agg.isna().sum().sum()

0

In [51]:
retro_only = list(set(retro_pitching_agg.index) - set(lahman_pitching_agg.index))
len(retro_only)

36259

In [52]:
lahman_only = set(lahman_pitching_agg.index) - set(retro_pitching_agg.index)
len(lahman_only)

0

In [53]:
# look at some of the differences
retro_pitching_agg.loc[retro_only, 'p_out'].head()

player_id  year_id  team_id
thompmi02  1994     PHI        0.0
pasquda01  1985     NYA        0.0
robinfr02  1959     CIN        0.0
lummi01    1981     ATL        0.0
sullico01  2005     COL        0.0
Name: p_out, dtype: float64

As per the above, the sampled Retrosheet rows that are not in Lahman recorded no outs over the entire year; they appear to be position players who did not pitch.  Remove the rows with no outs.

In [54]:
criteria = (retro_pitching_agg['p_out'] == 0)
retro_pitching_agg = retro_pitching_agg.drop(retro_pitching_agg[criteria].index)

In [55]:
retro_only = list(set(retro_pitching_agg.index) - set(lahman_pitching_agg.index))
len(retro_only)

0

As per the above, every remaining (player_id, year_id, team_id) key in the Retrosheet aggregate is also in the Lahman aggregate.
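
Filtering on `p_out == 0` also removes any pitcher who appeared in a game but recorded no outs all year. A possible alternative, sketched below on toy data, is to filter on games pitched instead. This assumes `p_g` is 0 in Retrosheet for players who never pitched, which is not verified in this notebook; the player ids and values are hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('pitch01', 2000, 'AAA'),   # pitched and recorded outs
     ('pitch02', 2000, 'AAA'),   # pitched but recorded no outs
     ('field01', 2000, 'AAA')],  # never pitched
    names=['player_id', 'year_id', 'team_id'])

toy_retro_pitching = pd.DataFrame({'p_g': [30, 1, 0],
                                   'p_out': [270, 0, 0]}, index=idx)

# Filter used in this notebook: keeps only rows with at least one out.
by_outs = toy_retro_pitching[toy_retro_pitching['p_out'] > 0]

# Alternative: keep every row with at least one pitching appearance.
by_games = toy_retro_pitching[toy_retro_pitching['p_g'] > 0]

print(len(by_outs), len(by_games))
```

With the games-based filter, the zero-out pitcher stays in the Retrosheet aggregate and can still be matched against Lahman.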

In [56]:
retro_agg_all = retro_pitching_agg.aggregate(np.sum)
lahman_agg_all = lahman_pitching_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

p_sh     0.794
p_ibb    1.002
p_gs     1.001
p_h      1.001
p_bk     1.000
p_hr     1.001
p_cg     1.003
p_w      1.001
p_g      1.001
p_gf     1.001
p_gdp    0.788
p_hp     1.001
p_sv     1.001
p_l      1.002
p_sho    1.002
p_so     1.001
p_r      1.001
p_bb     1.001
p_er     1.001
p_out    1.001
p_wp     1.001
p_sf     0.848
dtype: float64

In [57]:
# same as above, but from 1975 on
retro_agg_1975 = retro_pitching_agg.loc[(slice(None), 
                                         slice(1975,2018), 
                                         slice(None)), :]
lahman_agg_1975 = lahman_pitching_agg.loc[(slice(None), 
                                           slice(1975,2018), 
                                           slice(None)), :]

In [58]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

p_sh     1.000
p_ibb    1.000
p_gs     1.000
p_h      1.000
p_bk     1.000
p_hr     1.000
p_cg     1.000
p_w      1.000
p_g      1.000
p_gf     1.000
p_gdp    1.000
p_hp     1.000
p_sv     1.000
p_l      1.000
p_sho    1.001
p_so     1.000
p_r      1.000
p_bb     1.000
p_er     1.000
p_out    1.000
p_wp     1.000
p_sf     1.000
dtype: float64

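As per the above, the pitcher data aggregated from Retrosheet, from 1975 on, matches the Lahman pitcher data.  This is good.  Since p_sh, p_gdp and p_sf match from 1975 on, their low ratios over the full range (0.794, 0.788 and 0.848) must come from seasons before 1975.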
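
To see in which seasons a column starts to agree, the same ratio can be computed per year rather than over a fixed range. The sketch below uses toy data; `yearly_ratio` is a hypothetical helper, not part of `helper_functions`, and it assumes both aggregates carry a `year_id` index level.

```python
import pandas as pd

def yearly_ratio(lahman_agg, retro_agg, col):
    """Return the Lahman / Retrosheet ratio of one column's total, per year."""
    lahman_by_year = lahman_agg.groupby(level='year_id')[col].sum()
    retro_by_year = retro_agg.groupby(level='year_id')[col].sum()
    return (lahman_by_year / retro_by_year).round(3)

idx = pd.MultiIndex.from_tuples(
    [('aaa01', 1970, 'AAA'), ('bbb01', 1970, 'AAA'), ('aaa01', 1980, 'AAA')],
    names=['player_id', 'year_id', 'team_id'])

# Toy values: the Lahman side is incomplete in 1970 and complete in 1980.
toy_lahman = pd.DataFrame({'p_sf': [0, 3, 5]}, index=idx)
toy_retro = pd.DataFrame({'p_sf': [2.0, 4.0, 5.0]}, index=idx)

ratios = yearly_ratio(toy_lahman, toy_retro, 'p_sf')
print(ratios)
```

Applied to the real aggregates, a call such as `yearly_ratio(lahman_pitching_agg, retro_pitching_agg, 'p_sf')` would show the first season in which the ratio reaches roughly 1.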

# Compare Fielding Data

In [59]:
os.chdir(p_lahman_wrangled)
lahman_fielding = bb.from_csv_with_types("fielding.csv")

In [60]:
# Retrosheet data was only collected from 1955 and on
lahman_fielding = lahman_fielding[lahman_fielding['year_id'] >= 1955]

In [61]:
lahman_fielding['year_id'].min(), lahman_fielding['year_id'].max()

(1955, 2018)

In [62]:
lahman_fielding.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88751 entries, 22748 to 111498
Data columns (total 16 columns):
player_id    88751 non-null object
year_id      88751 non-null uint16
stint        88751 non-null uint8
team_id      88751 non-null object
lg_id        88751 non-null object
pos          88751 non-null object
g            88751 non-null uint8
gs           88734 non-null float64
inn_outs     88734 non-null float64
po           88751 non-null uint16
a            88751 non-null uint16
e            88751 non-null float64
dp           88751 non-null uint8
pb           6332 non-null float64
sb           6329 non-null float64
cs           6329 non-null float64
dtypes: float64(6), object(4), uint16(3), uint8(3)
memory usage: 8.2+ MB


In [63]:
retro_fielding_cols = ['player_id_lahman', 'year_id', 'team_id_lahman', 
                       'f_po', 'f_a', 'f_e']
lahman_fielding_cols = ['player_id', 'year_id', 'team_id', 
                        'po', 'a', 'e']

In [64]:
retro_fielding = player_game[retro_fielding_cols]
lahman_fielding = lahman_fielding[lahman_fielding_cols]

In [65]:
retro_grouped = retro_fielding.groupby(by=['player_id_lahman', 
                                           'year_id', 'team_id_lahman'])
retro_fielding_agg = retro_grouped.aggregate(np.sum)
retro_fielding_agg.index.names = ['player_id', 'year_id', 'team_id']
retro_fielding_agg = retro_fielding_agg.rename(columns={'f_po':'po',
                                                        'f_a':'a', 'f_e':'e'})

In [66]:
lahman_grouped = lahman_fielding.groupby(by=['player_id', 'year_id', 'team_id'])
lahman_fielding_agg = lahman_grouped.aggregate(np.sum)

In [67]:
# no nulls in entire dataframe
retro_fielding_agg.isna().sum().sum()

0

In [68]:
# no nulls in entire dataframe
lahman_fielding_agg.isna().sum().sum()

0

In [73]:
retro_only = set(retro_pitching_agg.index) - set(lahman_pitching_agg.index)
retro_only

set()

In [74]:
lahman_only = set(lahman_pitching_agg.index) - set(retro_batting_agg.index)
lahman_only

set()

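From the above, the pitching group-by keys still match.  Note that these two cells reuse the pitching and batting aggregates, so the fielding keys in `retro_fielding_agg` and `lahman_fielding_agg` are not compared here.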
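
A key check for the fielding aggregates themselves could be sketched as follows. `key_differences` is a hypothetical helper and the toy frames stand in for `retro_fielding_agg` and `lahman_fielding_agg`; whether the real fielding keys match is not shown in this notebook.

```python
import pandas as pd

def key_differences(left, right):
    """Return (keys only in left, keys only in right) as sorted lists."""
    left_keys, right_keys = set(left.index), set(right.index)
    return sorted(left_keys - right_keys), sorted(right_keys - left_keys)

names = ['player_id', 'year_id', 'team_id']

# Toy data: 'dh01' appears in the Retrosheet-style frame only.
toy_retro_fielding = pd.DataFrame(
    {'po': [5.0, 0.0]},
    index=pd.MultiIndex.from_tuples(
        [('field01', 2000, 'AAA'), ('dh01', 2000, 'AAA')], names=names))
toy_lahman_fielding = pd.DataFrame(
    {'po': [5]},
    index=pd.MultiIndex.from_tuples([('field01', 2000, 'AAA')], names=names))

retro_only, lahman_only = key_differences(toy_retro_fielding,
                                          toy_lahman_fielding)
print(retro_only, lahman_only)
```

Passing the two aggregates explicitly makes it harder to compare the wrong pair of DataFrames by accident.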

In [75]:
retro_agg_all = retro_fielding_agg.aggregate(np.sum)
lahman_agg_all = lahman_fielding_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

po    1.002
a     1.003
e     1.002
dtype: float64

In [76]:
# same as above, but from 1975 on
retro_agg_1975 = retro_fielding_agg.loc[(slice(None), 
                                         slice(1975,2018), 
                                         slice(None)), :]
lahman_agg_1975 = lahman_fielding_agg.loc[(slice(None), 
                                           slice(1975,2018), 
                                           slice(None)), :]

In [77]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

po    1.000
a     1.002
e     1.000
dtype: float64

# Compare Game Data

This will compare the data in Retrosheet team_game with the data in Lahman teams.

In [151]:
os.chdir(p_lahman_wrangled)
lahman_teams = bb.from_csv_with_types("teams.csv")

In [152]:
os.chdir(p_retro_wrangled)
team_game = bb.from_csv_with_types("team_game.csv.gz")

In [153]:
# Retrosheet data was only collected from 1955 and on
lahman_teams = lahman_teams[lahman_teams['year_id'] >= 1955]

In [154]:
lahman_teams['year_id'].min(), lahman_teams['year_id'].max()

(1955, 2018)

In [155]:
lahman_teams.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1632 entries, 544 to 2175
Data columns (total 48 columns):
year_id             1632 non-null uint16
lg_id               1632 non-null object
team_id             1632 non-null object
franch_id           1632 non-null object
div_id              1378 non-null object
team_rank           1632 non-null uint8
g                   1632 non-null uint8
ghome               1632 non-null float64
w                   1632 non-null uint8
l                   1632 non-null uint8
div_win             1350 non-null object
wc_win              714 non-null object
lg_win              1604 non-null object
ws_win              1604 non-null object
b_r                 1632 non-null uint16
b_ab                1632 non-null uint16
b_h                 1632 non-null uint16
b_2b                1632 non-null uint16
b_3b                1632 non-null uint8
b_hr                1632 non-null uint16
b_bb                1632 non-null float64
b_so                1632 non-null 

In [157]:
team_game['year_id'].min(), team_game['year_id'].max()

(1955, 2018)

In [159]:
# cols in lahman, not in retro
set(lahman_teams.columns) - set(team_game.columns)

{'attendance',
 'bba',
 'bpf',
 'cg',
 'div_id',
 'div_win',
 'fp',
 'franch_id',
 'g',
 'ghome',
 'ha',
 'hra',
 'ip_outs',
 'l',
 'lg_id',
 'lg_win',
 'p_era',
 'p_ra',
 'park',
 'ppf',
 'sho',
 'soa',
 'sv',
 'team_id_br',
 'team_id_lahman45',
 'team_id_retro',
 'team_name',
 'team_rank',
 'w',
 'wc_win',
 'ws_win'}

In [160]:
# cols in retro, not in lahman
set(team_game.columns) - set(lahman_teams.columns)

{'b_dpg',
 'b_ibb',
 'b_sh',
 'b_xi',
 'bi',
 'f_a',
 'f_po',
 'finish_pit_id',
 'game_date',
 'game_id',
 'home',
 'line_tx',
 'lob',
 'p_bk',
 'p_ter',
 'p_wp',
 'pb',
 'pitcher',
 'start_pit_id',
 'team_id_lahman',
 'team_league_id',
 'tp'}

In [161]:
# will compare all columns that exist in both
cols = set(team_game.columns) & set(lahman_teams.columns)
cols = list(cols)
cols

['team_id',
 'b_3b',
 'b_so',
 'b_h',
 'b_ab',
 'b_hp',
 'b_r',
 'b_sb',
 'b_2b',
 'b_cs',
 'p_er',
 'b_bb',
 'dp',
 'b_sf',
 'f_e',
 'b_hr',
 'year_id']

In [162]:
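# exclude Retrosheet's own 'team_id' (a string) so only numeric columns are summed
sum_cols = [col for col in cols if col != 'team_id']
grouped = team_game[['team_id_lahman'] + sum_cols].groupby(['team_id_lahman', 'year_id'])
team_game_agg = grouped.aggregate(np.sum)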

In [173]:
team_game_agg.index.names = ['team_id', 'year_id']

In [174]:
team_game_agg.head()

team_id,year_id,b_3b,b_so,b_h,b_ab,b_hp,b_r,b_sb,b_2b,b_cs,p_er,b_bb,dp,b_sf,f_e,b_hr
ANA,1997,25.0,953.0,1531.0,5628.0,45.0,829.0,126.0,279.0,72.0,730.0,617.0,140.0,57.0,123.0,161.0
ANA,1998,27.0,1028.0,1530.0,5630.0,48.0,787.0,93.0,314.0,45.0,720.0,510.0,146.0,41.0,106.0,147.0
ANA,1999,22.0,1022.0,1404.0,5494.0,43.0,711.0,71.0,248.0,45.0,762.0,511.0,156.0,42.0,106.0,158.0
ANA,2000,34.0,1024.0,1574.0,5628.0,47.0,864.0,93.0,309.0,52.0,807.0,608.0,182.0,43.0,134.0,236.0
ANA,2001,26.0,1001.0,1447.0,5551.0,77.0,691.0,116.0,275.0,52.0,671.0,494.0,142.0,53.0,103.0,158.0


In [181]:
lahman_teams = lahman_teams[cols].set_index(['team_id', 'year_id']).sort_index()
lahman_teams.head()

team_id,year_id,b_3b,b_so,b_h,b_ab,b_hp,b_r,b_sb,b_2b,b_cs,p_er,b_bb,dp,b_sf,f_e,b_hr
ANA,1997,25,953.0,1531,5628,45.0,829,126.0,279,72.0,730,617.0,140,57.0,123,161
ANA,1998,27,1028.0,1530,5630,48.0,787,93.0,314,45.0,720,510.0,146,41.0,106,147
ANA,1999,22,1022.0,1404,5494,43.0,711,71.0,248,45.0,762,511.0,156,42.0,106,158
ANA,2000,34,1024.0,1574,5628,47.0,864,93.0,309,52.0,805,608.0,182,43.0,134,236
ANA,2001,26,1001.0,1447,5551,77.0,691,116.0,275,52.0,671,494.0,142,53.0,103,158


In [184]:
retro_only = set(team_game_agg.index) - set(lahman_teams.index)
retro_only

set()

In [185]:
lahman_only = set(lahman_teams.index) - set(team_game_agg.index)
lahman_only

set()

From the above, every (team_id, year_id) key appears in both data sets.

In [187]:
retro_agg_all = team_game_agg.aggregate(np.sum)
lahman_agg_all = lahman_teams.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

b_3b    1.003
b_so    1.001
b_h     1.001
b_ab    1.001
b_hp    0.863
b_r     1.001
b_sb    1.001
b_2b    1.001
b_cs    0.997
p_er    1.000
b_bb    1.001
dp      1.001
b_sf    0.848
f_e     1.001
b_hr    1.001
dtype: float64

# Perform above using Postgres

In [None]:
# Get the user and password from the environment (rather than hardcoding it)
import os
from sqlalchemy.engine import create_engine
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# treat the SQLAlchemy engine as a connection to the database
conn = create_engine(connect_str)
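
One caveat with building the connection string from an f-string is that a password containing characters such as `@` or `:` breaks the URL. A minimal sketch of percent-encoding the credentials with the standard library is shown below; `build_connect_str` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import quote_plus

def build_connect_str(user, password, host='localhost', port=5432,
                      db='baseball'):
    """Build a postgresql:// URL with percent-encoded credentials."""
    # quote_plus encodes '@', ':' and similar characters that would
    # otherwise be read as URL delimiters.
    return (f'postgresql://{quote_plus(user)}:{quote_plus(password)}'
            f'@{host}:{port}/{db}')

# Placeholder credentials for illustration only.
example = build_connect_str('bb_user', 'p@ss:word')
print(example)
```

The resulting string can be passed to `create_engine` in the same way as `connect_str` above.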

In [None]:
%%time
# same but use SQL; %%time (unlike %%timeit) keeps df available to later cells
sql = """
SELECT player_id
FROM batting
WHERE year_id = '2018'
"""
df = pd.read_sql(sql, conn)

In [None]:
df.equals(result)

In [None]:
result.columns

In [None]:
%%timeit
# convert to retro_id
retro_id = people[people['player_id'].isin(result['player_id'])]['retro_id']
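
The group-by comparisons earlier in this notebook map directly onto SQL `GROUP BY`. The sketch below shows the idea, using an in-memory SQLite database in place of Postgres so that it runs without a server. The table name `batting_game` and the toy rows are hypothetical and do not reflect the actual Postgres schema.

```python
import sqlite3
import pandas as pd

# Toy per-game rows; values are invented for illustration.
toy = pd.DataFrame({
    'player_id': ['aaa01', 'aaa01', 'bbb01'],
    'year_id': [2018, 2018, 2018],
    'team_id': ['AAA', 'AAA', 'BBB'],
    'b_ab': [4, 3, 5],
})

# SQLite stands in for Postgres here.
con = sqlite3.connect(':memory:')
toy.to_sql('batting_game', con, index=False)

sql = """
SELECT player_id, year_id, team_id, SUM(b_ab) AS b_ab
FROM batting_game
GROUP BY player_id, year_id, team_id
ORDER BY player_id, year_id, team_id
"""
sql_agg = pd.read_sql(sql, con)
con.close()

# The equivalent aggregation in pandas.
pandas_agg = toy.groupby(['player_id', 'year_id', 'team_id'],
                         as_index=False)['b_ab'].sum()

print(sql_agg['b_ab'].tolist(), pandas_agg['b_ab'].tolist())
```

Against the real database, the same query shape would run through the `conn` engine created above, with the table and column names adjusted to the actual schema.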