# Baseball Analysis 1

**Baseball Notebooks**  
1. Downloaded and unzipped the Lahman and Retrosheet data.
2. Described helper functions used by several notebooks.
3. Baseball Data Organization and Data Dictionary
4. Lahman data was wrangled and persisted.
5. Retrosheet Play by Play data was parsed, collected into 2 DataFrames, and persisted.
6. Retrosheet data was wrangled and persisted to csv files.
7. Retrosheet data was loaded to Postgres.
8. This notebook.

This notebook computes aggregates from the Retrosheet data and compares them with the corresponding values in Lahman.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

In [1]:
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

retrosheet = home.joinpath('data/retrosheet')
p_retro_wrangled = retrosheet.joinpath('wrangled')

# Verify Lahman Data Matches Retrosheet Data

Lahman has data aggregated by stint.

A stint is usually, but not always, the same as grouping by (player_id, year, team_id).

An example where they differ is Preston Tucker in 2018.  He played for ATL, was traded to CIN, then was traded back to ATL, and played for each.  Tucker had 3 stints but only 2 rows when grouped by (player_id, year, team_id).

The Retrosheet data does not have stint information.

To compare the data between the two:
* Lahman batting/pitching/fielding will be aggregated by: player_id, year_id, team_id
* Retrosheet batting/pitching/fielding will be aggregated by: player_id_lahman, year_id, team_id_lahman
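The difference between stints and the grouping key can be illustrated with a small sketch. The rows below are made-up toy values (the player id `player01` and the at-bat counts are hypothetical, not taken from Lahman); they only mirror the ATL, CIN, ATL pattern described above.

```python
import pandas as pd

# Hypothetical toy data: one player, three stints, two distinct teams.
toy_batting = pd.DataFrame({
    'player_id': ['player01'] * 3,
    'year_id': [2018] * 3,
    'stint': [1, 2, 3],
    'team_id': ['ATL', 'CIN', 'ATL'],
    'ab': [10, 20, 5],  # made-up at-bat counts
})

# Grouping by (player_id, year_id, team_id) merges the two ATL stints.
toy_by_team = (toy_batting
               .groupby(['player_id', 'year_id', 'team_id'], as_index=False)['ab']
               .sum())

print(len(toy_batting), 'stint rows ->', len(toy_by_team), 'grouped rows')
```

Because the grouping sums both ATL stints into one row, totals per player and year are unchanged; only the row count differs.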

Some reasons for performing this data comparison:
* verify that the data sources have (almost completely) consistent data
* verify that the processing of the data was performed properly

# Compare Batting Data

## Lahman and Retrosheet Batting Data from 1955 on

In [4]:
os.chdir(p_lahman_wrangled)
lahman_batting = bb.from_csv_with_types("batting.csv")

In [5]:
# Retrosheet data was only collected from 1955 and on
lahman_batting = lahman_batting[lahman_batting['year_id'] >= 1955]
lahman_batting['year_id'].min(), lahman_batting['year_id'].max()

(1955, 2018)

In [6]:
os.chdir(p_retro_wrangled)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [7]:
player_game['year_id'].min(), player_game['year_id'].max()

(1955, 2018)

## Batting Columns from Player Game

In [8]:
batting_cols = [col for col in player_game.columns if col.startswith('b_')]
retro_batting = player_game[['player_id_lahman', 'year_id', 'team_id_lahman'] + batting_cols]
retro_batting.columns

Index(['player_id_lahman', 'year_id', 'team_id_lahman', 'b_g', 'b_pa', 'b_ab',
       'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_bb', 'b_ibb', 'b_so',
       'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb', 'b_cs', 'b_xi'],
      dtype='object')

In [9]:
# rename retro columns to match lahman columns
names = {col:col[2:] for col in retro_batting.columns 
         if col.startswith('b_') and not col[2] == '2' and not col[2] == '3'}
names

{'b_g': 'g',
 'b_pa': 'pa',
 'b_ab': 'ab',
 'b_r': 'r',
 'b_h': 'h',
 'b_hr': 'hr',
 'b_rbi': 'rbi',
 'b_bb': 'bb',
 'b_ibb': 'ibb',
 'b_so': 'so',
 'b_gdp': 'gdp',
 'b_hp': 'hp',
 'b_sh': 'sh',
 'b_sf': 'sf',
 'b_sb': 'sb',
 'b_cs': 'cs',
 'b_xi': 'xi'}

In [10]:
# add 2 more fields to the renaming dictionary
names['player_id_lahman'] = 'player_id'
names['team_id_lahman'] = 'team_id'

In [11]:
retro_batting = retro_batting.rename(columns=names)
retro_batting.columns

Index(['player_id', 'year_id', 'team_id', 'g', 'pa', 'ab', 'r', 'h', 'b_2b',
       'b_3b', 'hr', 'rbi', 'bb', 'ibb', 'so', 'gdp', 'hp', 'sh', 'sf', 'sb',
       'cs', 'xi'],
      dtype='object')

## Compare Columns

In [12]:
# columns in lahman, not in retrosheet
set(lahman_batting.columns) - set(retro_batting.columns)

{'lg_id', 'stint'}

In [13]:
# columns in retrosheet, not in lahman
set(retro_batting.columns) - set(lahman_batting.columns)

{'pa', 'xi'}

In [89]:
# columns in common -- these are the columns to compare
b_cols = set(retro_batting.columns) & set(lahman_batting.columns)
b_cols

{'ab',
 'b_2b',
 'b_3b',
 'bb',
 'cs',
 'g',
 'gdp',
 'h',
 'hp',
 'hr',
 'ibb',
 'player_id',
 'r',
 'rbi',
 'sb',
 'sf',
 'sh',
 'so',
 'team_id',
 'year_id'}

In [90]:
# recent pandas versions do not accept a set as an indexer, so convert to a list
retro_batting = retro_batting[list(b_cols)]
lahman_batting = lahman_batting[list(b_cols)]

## Aggregate the Common Columns

In [16]:
retro_grouped = retro_batting.groupby(by=['player_id','year_id', 'team_id'])
retro_batting_agg = retro_grouped.aggregate(np.sum)

In [17]:
lahman_grouped = lahman_batting.groupby(by=['player_id','year_id', 'team_id'])
lahman_batting_agg = lahman_grouped.aggregate(np.sum)

In [18]:
# rows in lahman not in retro
lahman_only = set(lahman_batting_agg.index.values) - set(retro_batting_agg.index.values)
lahman_batting_agg.loc[list(lahman_only)]

player_id,year_id,team_id,r,so,h,rbi,bb,gdp,b_3b,b_2b,ibb,sb,hp,g,hr,sf,sh,ab,cs
fanniji01,1956,CHN,0,0,1,0,0,1.0,0,0,0.0,0,0,1,0,0.0,0,4,0.0


In [19]:
# rows in retro, not in lahman
retro_only = set(retro_batting_agg.index.values) - set(lahman_batting_agg.index.values)
retro_batting_agg.loc[list(retro_only)]

player_id,year_id,team_id,r,so,h,rbi,bb,gdp,b_3b,b_2b,ibb,sb,hp,g,hr,sf,sh,ab,cs


In [20]:
lahman_batting_agg.shape

(68371, 17)

In [21]:
# there is only a single row different, and this player only had 4 at bats the entire season
# drop this row to make comparing the rest of the data easy
lahman_batting_agg = lahman_batting_agg.drop(list(lahman_only), axis=0)
lahman_batting_agg.shape

(68370, 17)
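The set arithmetic on `index.values` used above can also be written with pandas' own index operations. The following is a minimal sketch on toy data; the ids `p1`, `p2` and the team codes are hypothetical placeholders, not rows from either data source.

```python
import pandas as pd

key_names = ['player_id', 'year_id', 'team_id']

# Hypothetical toy indexes standing in for the two aggregated frames.
lahman_idx = pd.MultiIndex.from_tuples(
    [('p1', 1956, 'AAA'), ('p2', 1956, 'BBB')], names=key_names)
retro_idx = pd.MultiIndex.from_tuples(
    [('p2', 1956, 'BBB')], names=key_names)

lahman_toy = pd.DataFrame({'ab': [4, 300]}, index=lahman_idx)

# Keys present in the Lahman toy frame but absent from the Retrosheet toy index.
lahman_only_idx = lahman_idx.difference(retro_idx)

# Drop those keys so both frames cover exactly the same rows.
lahman_trimmed = lahman_toy.drop(list(lahman_only_idx))
print(list(lahman_only_idx), lahman_trimmed.shape)
```

`MultiIndex.difference` returns a sorted index, which makes the result reproducible, whereas a Python set has no defined order.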

## Compare Sums of all Batting Columns

In [22]:
retro_agg_all = retro_batting_agg.aggregate(np.sum)
lahman_agg_all = lahman_batting_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

r       1.001
so      1.001
h       1.001
rbi     1.001
bb      1.001
gdp     1.000
b_3b    1.003
b_2b    1.001
ibb     1.001
sb      1.001
hp      1.001
g       1.001
hr      1.001
sf      1.001
sh      1.002
ab      1.001
cs      0.997
dtype: float64

### Data Note

Retrosheet data is known to be missing a few of the older baseball games, so Lahman totals should be slightly higher than Retrosheet totals.
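To see which seasons account for the difference, the same ratio can be computed per year rather than over all years at once. This is only a sketch: `ratio_by_year` is a hypothetical helper, and the toy frames below use made-up numbers with the same three-level index as the aggregated frames.

```python
import pandas as pd

def ratio_by_year(lahman_agg, retro_agg):
    """Return Lahman / Retrosheet column totals for each year_id."""
    lahman_year = lahman_agg.groupby(level='year_id').sum()
    retro_year = retro_agg.groupby(level='year_id').sum()
    return (lahman_year / retro_year).round(3)

# Hypothetical toy data indexed like the aggregated frames.
toy_idx = pd.MultiIndex.from_tuples(
    [('a', 1960, 'X'), ('b', 1960, 'X'), ('a', 1980, 'X')],
    names=['player_id', 'year_id', 'team_id'])
lahman_toy = pd.DataFrame({'ab': [101, 100, 50], 'h': [30, 20, 10]}, index=toy_idx)
retro_toy = pd.DataFrame({'ab': [100, 100, 50], 'h': [30, 20, 10]}, index=toy_idx)

toy_ratios = ratio_by_year(lahman_toy, retro_toy)
print(toy_ratios)
```

With the real frames, the call would be `ratio_by_year(lahman_batting_agg, retro_batting_agg)`; years whose ratio is above 1 are the ones in which Lahman has more recorded events than Retrosheet.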

In [23]:
# same as above, but from 1975 on
retro_agg_1975 = retro_batting_agg.loc[(slice(None), 
                                         slice(1975,2018), 
                                         slice(None)), :]
lahman_agg_1975 = lahman_batting_agg.loc[(slice(None), 
                                           slice(1975,2018), 
                                           slice(None)), :]

In [24]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

r       1.0
so      1.0
h       1.0
rbi     1.0
bb      1.0
gdp     1.0
b_3b    1.0
b_2b    1.0
ibb     1.0
sb      1.0
hp      1.0
g       1.0
hr      1.0
sf      1.0
sh      1.0
ab      1.0
cs      1.0
dtype: float64

In [25]:
retro_batting_agg.shape, lahman_batting_agg.shape

((68370, 17), (68370, 17))

# Compare Pitching Data

## Lahman and Retrosheet Pitching Data from 1955 on

In [26]:
os.chdir(p_lahman_wrangled)
lahman_pitching = bb.from_csv_with_types("pitching.csv")

In [27]:
# Retrosheet data was only collected from 1955 and on
lahman_pitching = lahman_pitching[lahman_pitching['year_id'] >= 1955]
lahman_pitching['year_id'].min(), lahman_pitching['year_id'].max()

(1955, 2018)

In [28]:
player_game['year_id'].min(), player_game['year_id'].max()

(1955, 2018)

## Pitching Columns from Player Game

In [29]:
pitching_cols = [col for col in player_game.columns if col.startswith('p_')]
retro_pitching = player_game[['player_id_lahman', 'year_id', 'team_id_lahman'] + pitching_cols]
retro_pitching.columns

Index(['player_id_lahman', 'year_id', 'team_id_lahman', 'p_g', 'p_gs', 'p_cg',
       'p_sho', 'p_gf', 'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r',
       'p_er', 'p_h', 'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp',
       'p_hp', 'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk'],
      dtype='object')

In [30]:
# keep only the rows in which the player pitched (p_g == 1)
retro_pitching = retro_pitching[retro_pitching['p_g'] != 0]

In [31]:
# rename retro columns to match lahman columns
names = {col:col[2:] for col in retro_pitching.columns 
         if col.startswith('p_') and not col[2] == '2' and not col[2] == '3'}
names

{'p_g': 'g',
 'p_gs': 'gs',
 'p_cg': 'cg',
 'p_sho': 'sho',
 'p_gf': 'gf',
 'p_w': 'w',
 'p_l': 'l',
 'p_sv': 'sv',
 'p_out': 'out',
 'p_tbf': 'tbf',
 'p_ab': 'ab',
 'p_r': 'r',
 'p_er': 'er',
 'p_h': 'h',
 'p_hr': 'hr',
 'p_bb': 'bb',
 'p_ibb': 'ibb',
 'p_so': 'so',
 'p_gdp': 'gdp',
 'p_hp': 'hp',
 'p_sh': 'sh',
 'p_sf': 'sf',
 'p_xi': 'xi',
 'p_wp': 'wp',
 'p_bk': 'bk'}

In [32]:
# add 2 more fields to the renaming dictionary
names['player_id_lahman'] = 'player_id'
names['team_id_lahman'] = 'team_id'

In [33]:
retro_pitching = retro_pitching.rename(columns=names)
retro_pitching.columns

Index(['player_id', 'year_id', 'team_id', 'g', 'gs', 'cg', 'sho', 'gf', 'w',
       'l', 'sv', 'out', 'tbf', 'ab', 'r', 'er', 'h', 'p_2b', 'p_3b', 'hr',
       'bb', 'ibb', 'so', 'gdp', 'hp', 'sh', 'sf', 'xi', 'wp', 'bk'],
      dtype='object')

## Compare Columns

In [34]:
# columns in lahman, not in retrosheet
set(lahman_pitching.columns) - set(retro_pitching.columns)

{'ba_opp', 'bfp', 'e', 'era', 'ip_outs', 'lg_id', 'stint'}

In [35]:
# columns in retrosheet, not in lahman
set(retro_pitching.columns) - set(lahman_pitching.columns)

{'ab', 'er', 'out', 'p_2b', 'p_3b', 'tbf', 'xi'}

In [36]:
# columns in common -- these are the columns to compare
cols = set(retro_pitching.columns) & set(lahman_pitching.columns)
cols

{'bb',
 'bk',
 'cg',
 'g',
 'gdp',
 'gf',
 'gs',
 'h',
 'hp',
 'hr',
 'ibb',
 'l',
 'player_id',
 'r',
 'sf',
 'sh',
 'sho',
 'so',
 'sv',
 'team_id',
 'w',
 'wp',
 'year_id'}

In [37]:
# recent pandas versions do not accept a set as an indexer, so convert to a list
retro_pitching = retro_pitching[list(cols)]
lahman_pitching = lahman_pitching[list(cols)]

## Aggregate the Common Columns

In [38]:
retro_grouped = retro_pitching.groupby(by=['player_id','year_id', 'team_id'])
retro_pitching_agg = retro_grouped.aggregate(np.sum)

In [39]:
lahman_grouped = lahman_pitching.groupby(by=['player_id','year_id', 'team_id'])
lahman_pitching_agg = lahman_grouped.aggregate(np.sum)

In [40]:
# rows in lahman not in retro
lahman_only = set(lahman_pitching_agg.index.values) - set(retro_pitching_agg.index.values)
lahman_pitching_agg.loc[list(lahman_only)]

player_id,year_id,team_id,r,so,h,bb,w,bk,gdp,gs,ibb,sv,hp,sho,wp,gf,l,g,hr,sf,sh,cg


In [41]:
# rows in retro, not in lahman
retro_only = set(retro_pitching_agg.index.values) - set(lahman_pitching_agg.index.values)
retro_pitching_agg.loc[list(retro_only)].head()

player_id,year_id,team_id,r,so,h,bb,w,bk,gdp,gs,ibb,sv,hp,sho,wp,gf,l,g,hr,sf,sh,cg


## Compare Sums of all Pitching Columns

In [42]:
retro_agg_all = retro_pitching_agg.aggregate(np.sum)
lahman_agg_all = lahman_pitching_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

r      1.001
so     1.001
h      1.001
bb     1.001
w      1.001
bk     1.000
gdp    0.788
gs     1.001
ibb    1.002
sv     1.001
hp     1.001
sho    1.002
wp     1.001
gf     1.001
l      1.001
g      1.001
hr     1.001
sf     0.848
sh     0.794
cg     1.003
dtype: float64

In [43]:
# same as above, but from 1975 on
retro_agg_1975 = retro_pitching_agg.loc[(slice(None), 
                                         slice(1975,2018), 
                                         slice(None)), :]
lahman_agg_1975 = lahman_pitching_agg.loc[(slice(None), 
                                           slice(1975,2018), 
                                           slice(None)), :]

In [44]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

r      1.000
so     1.000
h      1.000
bb     1.000
w      1.000
bk     1.000
gdp    1.000
gs     1.000
ibb    1.000
sv     1.000
hp     1.000
sho    1.001
wp     1.000
gf     1.000
l      1.000
g      1.000
hr     1.000
sf     1.000
sh     1.000
cg     1.000
dtype: float64

As the ratios above show, the pitching data aggregated from Retrosheet matches the Lahman pitching data from 1975 on.  The low all-years ratios for gdp, sf, and sh therefore come from the seasons before 1975.
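When many columns are compared, it can help to list only those whose ratio is further from 1 than some tolerance. The sketch below assumes a hypothetical helper `flag_mismatches` and an arbitrary tolerance of 1%; the totals are made-up numbers chosen only to resemble the ratios printed earlier.

```python
import pandas as pd

def flag_mismatches(lahman_totals, retro_totals, tol=0.01):
    """Return the ratios that differ from 1 by more than tol."""
    ratio = lahman_totals / retro_totals
    return ratio[(ratio - 1).abs() > tol].round(3)

# Hypothetical column totals, not the real sums.
lahman_toy_totals = pd.Series({'so': 1001, 'gdp': 788, 'sf': 848})
retro_toy_totals = pd.Series({'so': 1000, 'gdp': 1000, 'sf': 1000})

flagged = flag_mismatches(lahman_toy_totals, retro_toy_totals)
print(flagged)
```

With the real data, the call would be `flag_mismatches(lahman_agg_all, retro_agg_all)`; the tolerance is a judgment call and should be set to whatever difference is considered acceptable.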

# Compare Fielding Data

In [45]:
os.chdir(p_lahman_wrangled)
lahman_fielding = bb.from_csv_with_types("fielding.csv")
lahman_fielding.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'pos', 'g', 'gs',
       'inn_outs', 'po', 'a', 'e', 'dp', 'pb', 'sb', 'cs'],
      dtype='object')

In [46]:
# Retrosheet data was only collected from 1955 and on
lahman_fielding = lahman_fielding[lahman_fielding['year_id'] >= 1955]

In [47]:
lahman_fielding['year_id'].min(), lahman_fielding['year_id'].max()

(1955, 2018)

In [48]:
fielding_cols = [col for col in player_game.columns if col.startswith('f_')]
retro_fielding = player_game[['player_id_lahman', 'year_id', 'team_id_lahman'] + fielding_cols]
retro_fielding.columns

Index(['player_id_lahman', 'year_id', 'team_id_lahman', 'f_po', 'f_a', 'f_e',
       'f_o', 'f_pb', 'f_xi'],
      dtype='object')

In [49]:
names = {col:col[2:] for col in retro_fielding.columns if col.startswith('f_')}
retro_fielding = retro_fielding.rename(columns=names)
retro_fielding.columns

Index(['player_id_lahman', 'year_id', 'team_id_lahman', 'po', 'a', 'e', 'o',
       'pb', 'xi'],
      dtype='object')

In [50]:
# further renaming for comparison
names = {'player_id_lahman':'player_id',
        'team_id_lahman':'team_id',
        'o':'inn_outs'}
retro_fielding = retro_fielding.rename(columns=names)
retro_fielding.columns

Index(['player_id', 'year_id', 'team_id', 'po', 'a', 'e', 'inn_outs', 'pb',
       'xi'],
      dtype='object')

## Compare Columns

In [51]:
# columns in lahman, not in retrosheet
set(lahman_fielding.columns) - set(retro_fielding.columns)

{'cs', 'dp', 'g', 'gs', 'lg_id', 'pos', 'sb', 'stint'}

In [52]:
# columns in retrosheet, not in lahman
set(retro_fielding.columns) - set(lahman_fielding.columns)

{'xi'}

In [53]:
# columns in common -- these are the columns to compare
cols = set(retro_fielding.columns) & set(lahman_fielding.columns)
cols

{'a', 'e', 'inn_outs', 'pb', 'player_id', 'po', 'team_id', 'year_id'}

In [54]:
# recent pandas versions do not accept a set as an indexer, so convert to a list
retro_fielding = retro_fielding[list(cols)]
lahman_fielding = lahman_fielding[list(cols)]

## Aggregate the Common Columns

In [55]:
retro_grouped = retro_fielding.groupby(by=['player_id','year_id', 'team_id'])
retro_fielding_agg = retro_grouped.aggregate(np.sum)

In [56]:
lahman_grouped = lahman_fielding.groupby(by=['player_id','year_id', 'team_id'])
lahman_fielding_agg = lahman_grouped.aggregate(np.sum)

In [57]:
# rows in lahman not in retro
lahman_only = set(lahman_fielding_agg.index.values) - set(retro_fielding_agg.index.values)
lahman_fielding_agg.loc[list(lahman_only)]

player_id,year_id,team_id,po,pb,e,a,inn_outs
fanniji01,1956,CHN,5,0.0,2.0,3,24.0


In [58]:
lahman_fielding_agg.shape

(67531, 5)

In [59]:
# there is only a single row
# this player was only on the field for 24 outs for the year
# drop this row to make comparing the rest of the data easy
lahman_fielding_agg = lahman_fielding_agg.drop(list(lahman_only), axis=0)
lahman_fielding_agg.shape

(67530, 5)

In [60]:
set(lahman_fielding_agg.index.values) - set(retro_fielding_agg.index.values)

set()

In [61]:
# rows in retro, not in lahman
retro_only = set(retro_fielding_agg.index.values) - set(lahman_fielding_agg.index.values)
retro_fielding_agg.loc[list(retro_only)].head(3)

player_id,year_id,team_id,po,pb,e,a,inn_outs
guerrwi01,1996,LAN,0.0,0.0,0.0,0.0,0.0
lauch01,1967,ATL,0.0,0.0,0.0,0.0,0.0
mossle01,1958,CHA,0.0,0.0,0.0,0.0,0.0


In [62]:
# sum the outs fielded by players who are in Retrosheet fielding but not in Lahman fielding
retro_fielding_agg.loc[list(retro_only), 'inn_outs'].sum()

9.0

In [63]:
# there were only 9 outs for all players for all years
# remove these rows to make comparing the rest of the data easier
retro_fielding_agg = retro_fielding_agg.drop(list(retro_only), axis=0)

In [64]:
set(retro_fielding_agg.index.values) - set(lahman_fielding_agg.index.values)

set()

## Compare Sums of all Fielding Columns

In [65]:
retro_agg_all = retro_fielding_agg.aggregate(np.sum)
lahman_agg_all = lahman_fielding_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

po          1.002
pb          0.999
e           1.002
a           1.003
inn_outs    1.001
dtype: float64

In [66]:
# same as above, but from 1975 on
retro_agg_1975 = retro_fielding_agg.loc[(slice(None), 
                                         slice(1975,2018), 
                                         slice(None)), :]
lahman_agg_1975 = lahman_fielding_agg.loc[(slice(None), 
                                           slice(1975,2018), 
                                           slice(None)), :]

In [67]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

po          1.000
pb          0.997
e           1.000
a           1.002
inn_outs    1.000
dtype: float64

# Compare Game Data

This will compare the data in Retrosheet team_game with the data in Lahman teams.

In [68]:
os.chdir(p_lahman_wrangled)
lahman_teams = bb.from_csv_with_types("teams.csv")

In [69]:
os.chdir(p_retro_wrangled)
team_game = bb.from_csv_with_types("team_game.csv.gz")

In [70]:
# Retrosheet data was only collected from 1955 and on
lahman_teams = lahman_teams[lahman_teams['year_id'] >= 1955]
lahman_teams['year_id'].min(), lahman_teams['year_id'].max()

(1955, 2018)

In [71]:
lahman_teams.columns

Index(['year_id', 'lg_id', 'team_id', 'franch_id', 'div_id', 'team_rank', 'g',
       'ghome', 'w', 'l', 'div_win', 'wc_win', 'lg_win', 'ws_win', 'r', 'ab',
       'h', 'b_2b', 'b_3b', 'hr', 'bb', 'so', 'sb', 'cs', 'hbp', 'sf', 'ra',
       'er', 'era', 'cg', 'sho', 'sv', 'ip_outs', 'ha', 'hra', 'bba', 'soa',
       'e', 'dp', 'fp', 'team_name', 'park', 'attendance', 'bpf', 'ppf',
       'team_id_br', 'team_id_lahman45', 'team_id_retro'],
      dtype='object')

In [72]:
team_game['year_id'].min(), team_game['year_id'].max()

(1955, 2018)

In [73]:
team_game.columns

Index(['game_id', 'game_date', 'year_id', 'team_id', 'team_id_lahman', 'home',
       'start_pit_id', 'score', 'hits', 'err', 'lob', 'finish_pit_id',
       'team_league_id', 'line_tx', 'ab', 'b_2b', 'b_3b', 'hr', 'bi', 'sh',
       'sf', 'hp', 'bb', 'ibb', 'so', 'sb', 'cs', 'gdp', 'xi', 'pitcher_ct',
       'er', 'ter', 'wp', 'bk', 'po', 'a', 'pb', 'dp', 'tp'],
      dtype='object')

In [74]:
# work with lahman team ids only
retro_teams = team_game.drop('team_id', axis=1).copy()
retro_teams.columns

Index(['game_id', 'game_date', 'year_id', 'team_id_lahman', 'home',
       'start_pit_id', 'score', 'hits', 'err', 'lob', 'finish_pit_id',
       'team_league_id', 'line_tx', 'ab', 'b_2b', 'b_3b', 'hr', 'bi', 'sh',
       'sf', 'hp', 'bb', 'ibb', 'so', 'sb', 'cs', 'gdp', 'xi', 'pitcher_ct',
       'er', 'ter', 'wp', 'bk', 'po', 'a', 'pb', 'dp', 'tp'],
      dtype='object')

In [75]:
# rename columns that correspond to each other to have the same name
names = {'team_id_lahman':'team_id',
        'hits':'h',
        'score':'r',
        'err':'e'}
retro_teams = retro_teams.rename(columns=names)
retro_teams.columns

Index(['game_id', 'game_date', 'year_id', 'team_id', 'home', 'start_pit_id',
       'r', 'h', 'e', 'lob', 'finish_pit_id', 'team_league_id', 'line_tx',
       'ab', 'b_2b', 'b_3b', 'hr', 'bi', 'sh', 'sf', 'hp', 'bb', 'ibb', 'so',
       'sb', 'cs', 'gdp', 'xi', 'pitcher_ct', 'er', 'ter', 'wp', 'bk', 'po',
       'a', 'pb', 'dp', 'tp'],
      dtype='object')

In [76]:
# cols in lahman, not in retro
set(lahman_teams.columns) - set(retro_teams.columns)

{'attendance',
 'bba',
 'bpf',
 'cg',
 'div_id',
 'div_win',
 'era',
 'fp',
 'franch_id',
 'g',
 'ghome',
 'ha',
 'hbp',
 'hra',
 'ip_outs',
 'l',
 'lg_id',
 'lg_win',
 'park',
 'ppf',
 'ra',
 'sho',
 'soa',
 'sv',
 'team_id_br',
 'team_id_lahman45',
 'team_id_retro',
 'team_name',
 'team_rank',
 'w',
 'wc_win',
 'ws_win'}

In [77]:
# cols in retro, not in lahman
set(retro_teams.columns) - set(lahman_teams.columns)

{'a',
 'bi',
 'bk',
 'finish_pit_id',
 'game_date',
 'game_id',
 'gdp',
 'home',
 'hp',
 'ibb',
 'line_tx',
 'lob',
 'pb',
 'pitcher_ct',
 'po',
 'sh',
 'start_pit_id',
 'team_league_id',
 'ter',
 'tp',
 'wp',
 'xi'}

In [78]:
# will compare all columns that exist in both
cols = set(retro_teams.columns) & set(lahman_teams.columns)
cols = list(cols)
cols

['dp',
 'er',
 'b_2b',
 'hr',
 'team_id',
 'r',
 'b_3b',
 'year_id',
 'sb',
 'sf',
 'so',
 'e',
 'h',
 'cs',
 'bb',
 'ab']

In [79]:
retro_teams = retro_teams[cols]
lahman_teams = lahman_teams[cols]

## Aggregate the Common Columns

In [80]:
retro_grouped = retro_teams.groupby(by=['team_id','year_id'])
retro_teams_agg = retro_grouped.aggregate(np.sum)

In [81]:
lahman_grouped = lahman_teams.groupby(by=['team_id','year_id'])
lahman_teams_agg = lahman_grouped.aggregate(np.sum)

In [82]:
# rows in lahman not in retro
lahman_only = set(lahman_teams_agg.index.values) - set(retro_teams_agg.index.values)
lahman_teams_agg.loc[list(lahman_only)]

team_id,year_id,dp,er,b_2b,hr,r,b_3b,sb,sf,so,e,h,cs,bb,ab


In [83]:
# rows in retro, not in lahman
retro_only = set(retro_teams_agg.index.values) - set(lahman_teams_agg.index.values)
retro_teams_agg.loc[list(retro_only)].head(3)

team_id,year_id,dp,er,b_2b,hr,r,b_3b,sb,sf,so,e,h,cs,bb,ab


In [84]:
retro_agg_all = retro_teams_agg.aggregate(np.sum)
lahman_agg_all = lahman_teams_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

dp      1.001
er      1.000
b_2b    1.001
hr      1.001
r       1.001
b_3b    1.003
sb      1.001
sf      0.848
so      1.001
e       1.001
h       1.001
cs      0.997
bb      1.001
ab      1.001
dtype: float64

# Perform Some of Above using Postgres

This shows how to use SQL to perform the above entirely on the server.

Only the batting aggregations above will be duplicated here in SQL.

In [85]:
# Get the user and password from the environment (rather than hardcoding it)
import os
from sqlalchemy.engine import create_engine
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'
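If the user name or password contains characters such as `@` or `/`, the connection URL above would be parsed incorrectly. A minimal sketch of one way to guard against that uses the standard library's `quote_plus`; the helper name `build_connect_str` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_connect_str(user, password, host='localhost', port=5432, database='baseball'):
    """Build a postgresql:// URL with the credentials percent-encoded."""
    return (f'postgresql://{quote_plus(user)}:{quote_plus(password)}'
            f'@{host}:{port}/{database}')

# Hypothetical credentials, for illustration only.
example_str = build_connect_str('postgres', 'p@ss/word')
print(example_str)
```

In the notebook, the arguments would be the `db_user` and `db_pass` values read from the environment.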

In [86]:
%load_ext sql

In [87]:
%sql {connect_str}

'Connected: postgres@baseball'

In [93]:
%%sql
drop table if exists retro_batting;
drop table if exists lahman_batting;
drop table if exists retro_batting_agg;
drop table if exists lahman_batting_agg;

 * postgresql://postgres:***@localhost:5432/baseball
Done.
Done.
Done.
Done.


[]

In [94]:
%%sql
SELECT 
player_id_lahman as player_id, year_id, team_id_lahman as team_id, 
b_ab as ab, b_2b, b_3b, b_bb as bb, b_cs as cs, b_g as g, b_gdp as gdp,
b_h as h, b_hp as hp, b_hr as hr, b_ibb as ibb, b_r as r, b_rbi as rbi,
b_sb as sb, b_sf as sf, b_sh as sh, b_so as so
INTO TEMP retro_batting
FROM player_game_retro;

 * postgresql://postgres:***@localhost:5432/baseball
3549699 rows affected.


[]

In [95]:
%%sql
SELECT
player_id, year_id, team_id,
ab, b_2b, b_3b, bb, cs, g, gdp, h, hp, hr,
ibb, r, rbi, sb, sf, sh, so
INTO TEMP lahman_batting
FROM batting
WHERE year_id >= 1955;

 * postgresql://postgres:***@localhost:5432/baseball
68397 rows affected.


[]

In [96]:
%%sql
SELECT player_id, year_id, team_id,
sum(ab) as ab, sum(b_2b) as b_2b, sum(b_3b) as b_3b, sum(bb) as bb,
sum(cs) as cs, sum(g) as g, sum(gdp) as gdp, sum(h) as h, sum(hp) as hp,
sum(hr) as hr, sum(ibb) as ibb, sum(r) as r, sum(rbi) as rbi, 
sum(sb) as sb, sum(sf) as sf, sum(sh) as sh, sum(so) as so
INTO TEMP retro_batting_agg
FROM retro_batting
GROUP BY player_id, year_id, team_id;

 * postgresql://postgres:***@localhost:5432/baseball
68370 rows affected.


[]

In [97]:
%%sql
SELECT player_id, year_id, team_id,
sum(ab) as ab, sum(b_2b) as b_2b, sum(b_3b) as b_3b, sum(bb) as bb,
sum(cs) as cs, sum(g) as g, sum(gdp) as gdp, sum(h) as h, sum(hp) as hp,
sum(hr) as hr, sum(ibb) as ibb, sum(r) as r, sum(rbi) as rbi, 
sum(sb) as sb, sum(sf) as sf, sum(sh) as sh, sum(so) as so
INTO TEMP lahman_batting_agg
from lahman_batting
GROUP BY player_id, year_id, team_id;

 * postgresql://postgres:***@localhost:5432/baseball
68371 rows affected.


[]

In [100]:
sql = """
SELECT
    player_id, year_id, team_id
FROM
    retro_batting_agg r
WHERE
    NOT EXISTS
    (SELECT 1 FROM lahman_batting_agg l
    WHERE l.player_id = r.player_id and l.year_id = r.year_id and l.team_id = r.team_id);
"""

In [101]:
rs = %sql {sql}

 * postgresql://postgres:***@localhost:5432/baseball
0 rows affected.


In [103]:
sql = """
SELECT
    player_id, year_id, team_id
FROM
    lahman_batting_agg l
WHERE
    NOT EXISTS
    (SELECT 1 FROM retro_batting_agg r
    WHERE l.player_id = r.player_id and l.year_id = r.year_id and l.team_id = r.team_id);
"""

In [104]:
rs = %sql {sql}

 * postgresql://postgres:***@localhost:5432/baseball
1 rows affected.


In [105]:
rs

player_id,year_id,team_id
fanniji01,1956,CHN
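For comparison, the `NOT EXISTS` query is an anti-join, and the same result can be obtained in pandas with a left merge and the `indicator` flag. The frames below are hypothetical toy data, not the tables on the server.

```python
import pandas as pd

keys = ['player_id', 'year_id', 'team_id']

# Hypothetical toy frames standing in for the two aggregated tables.
lahman_toy = pd.DataFrame({
    'player_id': ['p1', 'p2'],
    'year_id': [1956, 1956],
    'team_id': ['AAA', 'BBB'],
    'ab': [4, 300],
})
retro_toy = pd.DataFrame({
    'player_id': ['p2'],
    'year_id': [1956],
    'team_id': ['BBB'],
    'ab': [300],
})

# A left merge with indicator=True marks rows that have no match on the right.
merged = lahman_toy.merge(retro_toy[keys], on=keys, how='left', indicator=True)
lahman_only_rows = merged.loc[merged['_merge'] == 'left_only', keys]
print(lahman_only_rows)
```

The SQL version keeps the work on the server, which matters when the tables are large; the pandas version is convenient once the aggregated data is already in memory.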


In [106]:
%%sql
delete from lahman_batting_agg
where player_id = 'fanniji01' and year_id = 1956 and team_id = 'CHN';

 * postgresql://postgres:***@localhost:5432/baseball
1 rows affected.


[]

In [107]:
%%sql
SELECT
sum(ab) as ab, sum(b_2b) as b_2b, sum(b_3b) as b_3b, sum(bb) as bb,
sum(cs) as cs, sum(g) as g, sum(gdp) as gdp, sum(h) as h, sum(hp) as hp,
sum(hr) as hr, sum(ibb) as ibb, sum(r) as r, sum(rbi) as rbi, 
sum(sb) as sb, sum(sf) as sf, sum(sh) as sh, sum(so) as so
INTO TEMP lahman_batting_agg_all
FROM lahman_batting_agg;

 * postgresql://postgres:***@localhost:5432/baseball
1 rows affected.


[]

In [108]:
%%sql
select
sum(ab) as ab, sum(b_2b) as b_2b, sum(b_3b) as b_3b, sum(bb) as bb,
sum(cs) as cs, sum(g) as g, sum(gdp) as gdp, sum(h) as h, sum(hp) as hp,
sum(hr) as hr, sum(ibb) as ibb, sum(r) as r, sum(rbi) as rbi, 
sum(sb) as sb, sum(sf) as sf, sum(sh) as sh, sum(so) as so
into temp retro_batting_agg_all
from retro_batting_agg;

 * postgresql://postgres:***@localhost:5432/baseball
1 rows affected.


[]

In [123]:
lahman_grand_totals = %sql SELECT * from lahman_batting_agg_all

 * postgresql://postgres:***@localhost:5432/baseball
1 rows affected.


In [127]:
lahman_grand_totals = lahman_grand_totals.DataFrame()
lahman_grand_totals

,ab,b_2b,b_3b,bb,cs,g,gdp,h,hp,hr,ibb,r,rbi,sb,sf,sh,so
0,8866286,411753,55122,849926,74939.0,3554459,198910.0,2292363,68749,238516,75168.0,1145531,1079682,157293,72003.0,96898,1584799


In [124]:
retro_grand_totals = %sql SELECT * from retro_batting_agg_all

 * postgresql://postgres:***@localhost:5432/baseball
1 rows affected.


In [128]:
retro_grand_totals = retro_grand_totals.DataFrame()
retro_grand_totals

,ab,b_2b,b_3b,bb,cs,g,gdp,h,hp,hr,ibb,r,rbi,sb,sf,sh,so
0,8853338,411244,54980,848840,75138,3549699,198972,2289006,68683,238215,75068,1143916,1078157,157160,71904,96715,1583087


In [132]:
(lahman_grand_totals.astype('int') / retro_grand_totals.astype('int')).T

,0
ab,1.001462
b_2b,1.001238
b_3b,1.002583
bb,1.001279
cs,0.997352
g,1.001341
gdp,0.999688
h,1.001467
hp,1.000961
hr,1.001264
