# Baseball Analysis 1

**Baseball Notebooks**  
1. Download and unzip the baseball data.
2. Define helper functions and explain the motivation for using them.
3. Wrangle and persist the Lahman data.
4. Parse the Retrosheet play-by-play data, collect it into 2 DataFrames, and persist them.
5. Wrangle the Retrosheet data in preparation for data analysis.
6. Load the wrangled Retrosheet data into Postgres.
7. This notebook.

Begin analyzing baseball data.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

In [1]:
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

retrosheet = home.joinpath('data/retrosheet')
p_retro_wrangled = retrosheet.joinpath('wrangled')

# Verify That Lahman Data Matches Aggregated Retrosheet Data

Lahman has yearly aggregates for batting and pitching.  Compute yearly aggregates from Retrosheet for batting and pitching and compare to Lahman.

This verifies that the data sources are in sync and that the data preparation work was correct.

The full group by, used for both Lahman and Retrosheet, is (player_id, year, team_id).

In other words, the summed values for a given player, for a given year, for a given team, will be compared.

Note that Lahman data has one row per stint. For example, player_id == tuckepr01, who started in ATL, was traded to CIN, and was then traded back to ATL, has 3 rows for 2018 in Lahman. Grouping the Lahman data by (player_id, year, team_id) reduces these to 2 rows, which allows comparison with the Retrosheet data (which is not organized by stint).
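
The effect of grouping stint-level rows can be sketched on a tiny made-up frame. The player id, team ids, and numbers below are hypothetical and only illustrate the shape of the Lahman batting data; they are not taken from the real files.

```python
import pandas as pd

# Hypothetical stint-level rows shaped like Lahman batting data.
# All values are invented for illustration.
stints = pd.DataFrame(
    {
        "player_id": ["playera01", "playera01", "playera01"],
        "year": [2018, 2018, 2018],
        "stint": [1, 2, 3],
        "team_id": ["AAA", "BBB", "AAA"],
        "b_g": [10, 5, 7],
        "b_ab": [30, 12, 20],
    }
)

# Summing over (player_id, year, team_id) merges the two AAA stints,
# so 3 stint rows become 2 player-year-team rows.
per_team = (
    stints.drop(columns="stint")
    .groupby(["player_id", "year", "team_id"])
    .sum()
)
print(per_team)
```

The `stint` column is dropped before summing because adding stint numbers together has no meaning.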

In [4]:
os.chdir(p_lahman_wrangled)
lahman_batting = bb.from_csv_with_types("batting.csv")

In [5]:
lahman_batting['year'].min(), lahman_batting['year'].max()

(1955, 2018)

In [6]:
lahman_batting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53230 entries, 0 to 53229
Data columns (total 22 columns):
player_id    53230 non-null object
year         53230 non-null uint16
stint        53230 non-null uint8
team_id      53230 non-null object
lg_id        53230 non-null object
b_g          53230 non-null uint8
b_ab         53230 non-null uint16
b_r          53230 non-null uint8
b_h          53230 non-null uint16
b_2b         53230 non-null uint8
b_3b         53230 non-null uint8
b_hr         53230 non-null uint8
b_rbi        53230 non-null uint8
b_sb         53230 non-null uint8
b_cs         53230 non-null uint8
b_bb         53230 non-null uint8
b_so         53230 non-null uint8
b_ibb        53230 non-null uint8
b_hp         53230 non-null uint8
b_sh         53230 non-null uint8
b_sf         53230 non-null uint8
b_gdp        53230 non-null uint8
dtypes: object(3), uint16(3), uint8(16)
memory usage: 2.3+ MB


In [7]:
lahman_batting_cols = [col for col in lahman_batting.columns if col.startswith('b_')]

In [8]:
os.chdir(p_retro_wrangled)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [9]:
player_game['year'].min(), player_game['year'].max()

(1955, 2018)

In [10]:
retro_batting_cols = [col for col in player_game.columns if col.startswith('b_')]

In [11]:
# cols in lahman, not in retro
set(lahman_batting_cols) - set(retro_batting_cols)

set()

In [12]:
# cols in retro, not in lahman
set(retro_batting_cols) - set(lahman_batting_cols)

{'b_pa', 'b_xi'}

In [13]:
# so retro has all of the lahman batting columns (plus b_pa and b_xi)
# do the analysis with all lahman batting columns

In [15]:
retro_batting_cols = ['player_id_lahman', 'year', 'team_id_lahman']
retro_batting_cols.extend(lahman_batting_cols)
retro_batting_cols

['player_id_lahman',
 'year',
 'team_id_lahman',
 'b_g',
 'b_ab',
 'b_r',
 'b_h',
 'b_2b',
 'b_3b',
 'b_hr',
 'b_rbi',
 'b_sb',
 'b_cs',
 'b_bb',
 'b_so',
 'b_ibb',
 'b_hp',
 'b_sh',
 'b_sf',
 'b_gdp']

In [16]:
lahman_batting_cols_used = ['player_id', 'year', 'team_id']
lahman_batting_cols_used.extend(lahman_batting_cols)
lahman_batting_cols_used

['player_id',
 'year',
 'team_id',
 'b_g',
 'b_ab',
 'b_r',
 'b_h',
 'b_2b',
 'b_3b',
 'b_hr',
 'b_rbi',
 'b_sb',
 'b_cs',
 'b_bb',
 'b_so',
 'b_ibb',
 'b_hp',
 'b_sh',
 'b_sf',
 'b_gdp']

In [17]:
# test with 2018 only, for faster computations
retro_batting = player_game[retro_batting_cols]
retro_batting_2018 = retro_batting[retro_batting['year'] == 2018]
retro_batting_2018.head(3)

Unnamed: 0,player_id_lahman,year,team_id_lahman,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
3098476,colonba01,2018,TEX,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3098477,colonba01,2018,TEX,1,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0
3098478,colonba01,2018,TEX,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
lahman_batting_2018 = lahman_batting[lahman_batting['year'] == 2018]
lahman_batting_2018 = lahman_batting_2018[lahman_batting_cols_used]
lahman_batting_2018 = lahman_batting_2018.set_index(['player_id', 'year', 'team_id']).sort_index()
lahman_batting_2018.head(3)

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
abreujo02,2018,CHA,128,499,68,132,36,1,22,78,2,0,37,109,7,11,0,6,14
acunaro01,2018,ATL,111,433,78,127,26,4,26,64,16,5,45,123,2,6,0,3,4
adamewi01,2018,TBA,85,288,43,80,7,0,10,34,6,5,31,95,3,1,1,2,6


In [19]:
grouped = retro_batting_2018.groupby(by=['player_id_lahman', 'year', 'team_id_lahman'])

In [20]:
retro_batting_2018 = grouped.aggregate(np.sum)
retro_batting_2018 = retro_batting_2018.dropna().sort_index()
retro_batting_2018.head(3)

player_id_lahman,year,team_id_lahman,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
abreujo02,2018,CHA,128.0,499.0,68.0,132.0,36.0,1.0,22.0,78.0,2.0,0.0,37.0,109.0,7.0,11.0,0.0,6.0,14.0
acunaro01,2018,ATL,111.0,433.0,78.0,127.0,26.0,4.0,26.0,64.0,16.0,5.0,45.0,123.0,2.0,6.0,0.0,3.0,4.0
adamewi01,2018,TBA,85.0,288.0,43.0,80.0,7.0,0.0,10.0,34.0,6.0,5.0,31.0,95.0,3.0,1.0,1.0,2.0,6.0


In [21]:
# no nulls in entire dataframe
retro_batting_2018.isna().sum().sum()

0

In [22]:
# some players had no at bats for their (player, year, team) group
zero_ab = (retro_batting_2018['b_ab']==0)
zero_ab.sum()

474

In [23]:
# batting = batting.drop(batting[batting['b_ab'] == 0].index)
retro_batting_2018 = retro_batting_2018.drop(retro_batting_2018[zero_ab].index)
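
As a small aside, dropping rows by the index of a boolean selection is equivalent to keeping the rows where the mask is False, as long as the index labels are unique. A minimal sketch on a made-up frame (the labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical aggregated totals with unique index labels.
totals = pd.DataFrame(
    {"b_ab": [4, 0, 2], "b_h": [1, 0, 0]},
    index=pd.Index(["a", "b", "c"], name="player_id"),
)

zero_ab = totals["b_ab"] == 0

# Form used in the cell above: drop the labels selected by the mask.
dropped = totals.drop(totals[zero_ab].index)

# Equivalent form: keep the rows where the mask is False.
filtered = totals.loc[~zero_ab]

print(dropped.equals(filtered))
```

The second form avoids building an intermediate frame just to read its index.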

In [24]:
# rename to allow for dataframe comparison
retro_batting_2018.index.names = ['player_id', 'year', 'team_id']
retro_batting_2018.head(3)

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
abreujo02,2018,CHA,128.0,499.0,68.0,132.0,36.0,1.0,22.0,78.0,2.0,0.0,37.0,109.0,7.0,11.0,0.0,6.0,14.0
acunaro01,2018,ATL,111.0,433.0,78.0,127.0,26.0,4.0,26.0,64.0,16.0,5.0,45.0,123.0,2.0,6.0,0.0,3.0,4.0
adamewi01,2018,TBA,85.0,288.0,43.0,80.0,7.0,0.0,10.0,34.0,6.0,5.0,31.0,95.0,3.0,1.0,1.0,2.0,6.0


### Lahman has a row per 'stint'
For example, tuckepr01 started in ATL, went to CIN, then back to ATL, for 3 stints in 1 year with two teams.

In [25]:
grouped = lahman_batting_2018.groupby(by=['player_id', 'year', 'team_id'])

In [26]:
lahman_batting_2018 = grouped.aggregate(np.sum)
lahman_batting_2018 = lahman_batting_2018.dropna().sort_index()
lahman_batting_2018.head(3)

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
abreujo02,2018,CHA,128,499,68,132,36,1,22,78,2,0,37,109,7,11,0,6,14
acunaro01,2018,ATL,111,433,78,127,26,4,26,64,16,5,45,123,2,6,0,3,4
adamewi01,2018,TBA,85,288,43,80,7,0,10,34,6,5,31,95,3,1,1,2,6


In [27]:
# which rows are in retro, not in lahman
retro_batting_2018.loc[~retro_batting_2018.index.isin(lahman_batting_2018.index)]

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp


In [28]:
# which rows are in lahman, not in retro
lahman_batting_2018.loc[~lahman_batting_2018.index.isin(retro_batting_2018.index)]

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp


In [29]:
(retro_batting_2018.columns == lahman_batting_2018.columns).all()

True

In [31]:
(retro_batting_2018.index == lahman_batting_2018.index).all()

True

In [34]:
(retro_batting_2018 == lahman_batting_2018).all()

b_g       True
b_ab      True
b_r       True
b_h      False
b_2b      True
b_3b      True
b_hr      True
b_rbi     True
b_sb      True
b_cs      True
b_bb      True
b_so      True
b_ibb     True
b_hp      True
b_sh      True
b_sf      True
b_gdp     True
dtype: bool

In [36]:
retro_batting_2018[retro_batting_2018['b_h'] != lahman_batting_2018['b_h']]

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
quinnro01,2018,PHI,50.0,131.0,13.0,34.0,6.0,4.0,2.0,12.0,10.0,4.0,10.0,35.0,0.0,1.0,1.0,0.0,1.0


In [37]:
lahman_batting_2018[retro_batting_2018['b_h'] != lahman_batting_2018['b_h']]

player_id,year,team_id,b_g,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_sb,b_cs,b_bb,b_so,b_ibb,b_hp,b_sh,b_sf,b_gdp
quinnro01,2018,PHI,50,131,13,35,6,4,2,12,10,4,10,35,0,1,1,0,1


### MLB Stats agree with Retrosheet, not Lahman
The difference is only 1 hit, but it is interesting to check this with another source.  
https://www.mlb.com/player/roman-quinn-596451
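
Instead of checking one column at a time, every mismatching cell can be listed in a single pass. The helper below is a sketch, not part of the notebook: `differing_cells` is a hypothetical name, it assumes both frames carry the same row and column labels, and the sample values are invented. Note that `ne` treats NaN as different from NaN, so frames containing nulls would report those cells as well.

```python
import numpy as np
import pandas as pd


def differing_cells(left, right):
    """List each cell where two identically labeled frames disagree."""
    # Align right to left so positions refer to the same labels.
    right = right.reindex(index=left.index, columns=left.columns)
    mask = left.ne(right).to_numpy()
    rows, cols = np.nonzero(mask)
    return pd.DataFrame(
        {
            "row": [left.index[r] for r in rows],
            "column": [left.columns[c] for c in cols],
            "left": [left.iat[r, c] for r, c in zip(rows, cols)],
            "right": [right.iat[r, c] for r, c in zip(rows, cols)],
        }
    )


# Hypothetical aggregated frames sharing one MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [("playera01", 2018, "AAA"), ("playerb01", 2018, "BBB")],
    names=["player_id", "year", "team_id"],
)
retro = pd.DataFrame({"b_ab": [131.0, 40.0], "b_h": [34.0, 10.0]}, index=idx)
lahman = pd.DataFrame({"b_ab": [131, 40], "b_h": [35, 10]}, index=idx)

diffs = differing_cells(retro, lahman)
print(diffs)
```

The float-versus-integer dtypes do not matter here because the comparison is by value.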

# ********************************

In [49]:
os.chdir(p_lahman_wrangled)
lahman_pitching = bb.from_csv_with_types("pitching.csv")

In [50]:
lahman_pitching['year'].min(), lahman_pitching['year'].max()

(1955, 2018)

In [51]:
lahman_pitching.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31879 entries, 0 to 31878
Data columns (total 30 columns):
player_id    31879 non-null object
year         31879 non-null uint16
stint        31879 non-null uint8
team_id      31879 non-null object
lg_id        31879 non-null object
p_w          31879 non-null uint8
p_l          31879 non-null uint8
p_g          31879 non-null uint8
p_gs         31879 non-null uint8
p_cg         31879 non-null uint8
p_sho        31879 non-null uint8
p_sv         31879 non-null uint8
p_out        31879 non-null uint16
p_h          31879 non-null uint16
p_er         31879 non-null uint8
p_hr         31879 non-null uint8
p_bb         31879 non-null uint8
p_so         31879 non-null uint16
p_ba_opp     31879 non-null float64
p_era        31879 non-null float64
p_ibb        31879 non-null uint8
p_wp         31879 non-null uint8
p_hp         31879 non-null uint8
p_bk         31879 non-null uint8
p_bfp        31879 non-null uint16
p_gf         31879 non-null u

In [52]:
lahman_pitching_cols = [col for col in lahman_pitching.columns if col.startswith('p_')]

In [53]:
os.chdir(p_retro_wrangled)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [54]:
player_game['year'].min(), player_game['year'].max()

(1955, 2018)

In [55]:
retro_pitching_cols = [col for col in player_game.columns if col.startswith('p_')]

In [56]:
# cols in lahman, not in retro
set(lahman_pitching_cols) - set(retro_pitching_cols)

{'p_ba_opp', 'p_bfp', 'p_era'}

In [57]:
# cols in retro, not in lahman
set(retro_pitching_cols) - set(lahman_pitching_cols)

{'p_2b', 'p_3b', 'p_ab', 'p_tbf', 'p_xi'}

In [58]:
# compare all columns that exist in both
pitching_cols = set(lahman_pitching_cols) & set(retro_pitching_cols)
pitching_cols

{'p_bb',
 'p_bk',
 'p_cg',
 'p_er',
 'p_g',
 'p_gdp',
 'p_gf',
 'p_gs',
 'p_h',
 'p_hp',
 'p_hr',
 'p_ibb',
 'p_l',
 'p_out',
 'p_r',
 'p_sf',
 'p_sh',
 'p_sho',
 'p_so',
 'p_sv',
 'p_w',
 'p_wp'}

In [60]:
retro_pitching_cols = ['player_id_lahman', 'year', 'team_id_lahman']
retro_pitching_cols.extend(pitching_cols)

In [64]:
lahman_pitching_cols = ['player_id', 'year', 'team_id']
lahman_pitching_cols.extend(pitching_cols)
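
One detail worth noting: `pitching_cols` is a set, so the column order produced by `extend` is not something to rely on between interpreter runs. Both lists above are extended from the same set object in the same session, so they agree with each other, but sorting the intersection makes the order reproducible. A minimal sketch with made-up column sets:

```python
# Hypothetical column-name sets; only the intersection matters here.
lahman_cols = {"p_w", "p_l", "p_era", "p_bfp"}
retro_cols = {"p_w", "p_l", "p_tbf", "p_xi"}

# sorted() returns a list in a fixed, alphabetical order.
shared = sorted(lahman_cols & retro_cols)

key = ["player_id", "year", "team_id"]
lahman_selected = key + shared
retro_selected = ["player_id_lahman", "year", "team_id_lahman"] + shared

print(lahman_selected)
print(retro_selected)
```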

In [65]:
# test with 2018 only, for faster computations
retro_pitching = player_game[retro_pitching_cols]
retro_pitching_2018 = retro_pitching[retro_pitching['year'] == 2018]
retro_pitching_2018.head(3)

Unnamed: 0,player_id_lahman,year,team_id_lahman,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg
3098476,colonba01,2018,TEX,0,9,8,0,0,1,6,...,1,0,0,1,0,2,0,1,6,0
3098477,colonba01,2018,TEX,0,15,7,0,0,1,5,...,1,0,0,1,0,1,0,2,5,0
3098478,colonba01,2018,TEX,0,18,9,0,0,1,3,...,1,0,1,1,0,0,0,2,4,0


In [66]:
lahman_pitching_2018 = lahman_pitching[lahman_pitching['year'] == 2018]
lahman_pitching_2018 = lahman_pitching_2018[lahman_pitching_cols]
lahman_pitching_2018 = lahman_pitching_2018.set_index(['player_id', 'year', 'team_id']).sort_index()
lahman_pitching_2018.head(3)

player_id,year,team_id,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,p_so,p_sho,p_gf,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg
adamja01,2018,KCA,3,97,30,4,2.0,3,22,37,0,14,...,31,1,1.0,0,0,9,0.0,15,22,0
adamsau02,2018,WAS,0,3,1,0,0.0,0,0,0,0,0,...,2,0,0.0,0,0,0,0.0,3,0,0
adamsch01,2018,NYA,0,23,8,0,0.0,1,6,4,0,1,...,3,0,2.0,1,0,3,0.0,4,7,0


In [68]:
grouped = retro_pitching_2018.groupby(by=['player_id_lahman', 'year', 'team_id_lahman'])

In [73]:
retro_pitching_2018 = grouped.aggregate(np.sum)
retro_pitching_2018 = retro_pitching_2018.dropna().sort_index()

# if the pitcher recorded fewer than 3 outs for the entire year, drop the record
retro_pitching_2018 = retro_pitching_2018.drop(retro_pitching_2018[retro_pitching_2018['p_out'] < 3].index)
retro_pitching_2018.head(3)

player_id,year,team_id,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,p_so,p_sho,p_gf,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg
adamja01,2018,KCA,3,97,30,4,2.0,3,22,37,0,14,...,31,1,1.0,0,0,9,0.0,15,22,0
adamsau02,2018,WAS,0,3,1,0,0.0,0,0,0,0,0,...,2,0,0.0,0,0,0,0.0,3,0,0
adamsch01,2018,NYA,0,23,8,0,0.0,1,6,4,0,1,...,3,0,2.0,1,0,3,0.0,4,7,0


In [74]:
# no nulls in entire dataframe
retro_pitching_2018.isna().sum().sum()

0

In [75]:
# rename to allow for dataframe comparison
retro_pitching_2018.index.names = ['player_id', 'year', 'team_id']
retro_pitching_2018.head(3)

player_id,year,team_id,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,p_so,p_sho,p_gf,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg
adamja01,2018,KCA,3,97,30,4,2.0,3,22,37,0,14,...,31,1,1.0,0,0,9,0.0,15,22,0
adamsau02,2018,WAS,0,3,1,0,0.0,0,0,0,0,0,...,2,0,0.0,0,0,0,0.0,3,0,0
adamsch01,2018,NYA,0,23,8,0,0.0,1,6,4,0,1,...,3,0,2.0,1,0,3,0.0,4,7,0


### Lahman has a row per 'stint'
For example, tuckepr01 started in ATL, went to CIN, then back to ATL, for 3 stints in 1 year with two teams.

In [76]:
grouped = lahman_pitching_2018.groupby(by=['player_id', 'year', 'team_id'])

In [77]:
lahman_pitching_2018 = grouped.aggregate(np.sum)
lahman_pitching_2018 = lahman_pitching_2018.dropna().sort_index()
lahman_pitching_2018.head(3)

player_id,year,team_id,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,p_so,p_sho,p_gf,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg
adamja01,2018,KCA,3,97,30,4,2.0,3,22,37,0,14,...,31,1,1.0,0,0,9,0.0,15,22,0
adamsau02,2018,WAS,0,3,1,0,0.0,0,0,0,0,0,...,2,0,0.0,0,0,0,0.0,3,0,0
adamsch01,2018,NYA,0,23,8,0,0.0,1,6,4,0,1,...,3,0,2.0,1,0,3,0.0,4,7,0


In [78]:
# which rows are in retro, not in lahman
retro_pitching_2018.loc[~retro_pitching_2018.index.isin(lahman_pitching_2018.index)]

player_id,year,team_id,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,p_so,p_sho,p_gf,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg


In [79]:
# which rows are in lahman, not in retro
lahman_pitching_2018.loc[~lahman_pitching_2018.index.isin(retro_pitching_2018.index)]

player_id,year,team_id,p_hp,p_out,p_h,p_wp,p_sf,p_l,p_er,p_so,p_sho,p_gf,...,p_g,p_ibb,p_gdp,p_gs,p_w,p_hr,p_sh,p_bb,p_r,p_cg


In [80]:
(retro_pitching_2018.columns == lahman_pitching_2018.columns).all()

True

In [81]:
(retro_pitching_2018.index == lahman_pitching_2018.index).all()

True

In [83]:
(retro_pitching_2018 == lahman_pitching_2018).all().all()

True
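
An alternative to reducing the element-wise comparison with `.all().all()` is `pd.testing.assert_frame_equal`, which raises an `AssertionError` describing the first mismatch. The sketch below uses invented values; `check_dtype=False` is passed because the two sources may store the same numbers as integers in one frame and floats in the other.

```python
import pandas as pd

# Hypothetical aggregated pitching frames sharing one MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [("playera01", 2018, "AAA"), ("playerb01", 2018, "BBB")],
    names=["player_id", "year", "team_id"],
)
retro = pd.DataFrame({"p_out": [97, 3], "p_sf": [2.0, 0.0]}, index=idx)
lahman = pd.DataFrame({"p_out": [97, 3], "p_sf": [2, 0]}, index=idx)

# Element-wise check, as in the cell above: True only if every cell matches.
all_equal = bool((retro == lahman).all().all())
print(all_equal)

# Raises AssertionError with details if labels or values differ;
# dtype differences (int vs float) are ignored here.
pd.testing.assert_frame_equal(retro, lahman, check_dtype=False)
```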

# ********************************

In [None]:
# Get the user and password from the environment (rather than hardcoding it)
import os
from sqlalchemy.engine import create_engine
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# treat the SQLAlchemy engine as a connection to the database
conn = create_engine(connect_str)
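
Two things can go wrong when the connection string is built with a plain f-string: `os.environ.get` returns `None` for an unset variable, and characters such as `@` or `/` in a password break the URL. The sketch below shows one way to guard against both using only the standard library. `build_connect_str` is a hypothetical helper, and the credentials are invented.

```python
from urllib.parse import quote_plus


def build_connect_str(user, password, host="localhost", port=5432, database="baseball"):
    """Build a Postgres URL, escaping characters that are special in URLs."""
    if not user or not password:
        # os.environ.get() returns None when the variable is not set
        raise ValueError("DB_USER and DB_PASS must both be set")
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials; in the notebook they would come from os.environ.
example = build_connect_str("analyst", "p@ss/word")
print(example)
```

The resulting string can be passed to `create_engine` in the same way as `connect_str` above.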

In [None]:
%%timeit
# same but use SQL
sql = """
SELECT player_id
FROM batting
WHERE year_id = '2018'
"""
df = pd.read_sql(sql, conn)
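
The year can also be passed as a bound parameter instead of being quoted inside the SQL text. The sketch below uses an in-memory SQLite database as a stand-in for Postgres so that it runs anywhere; the table contents are invented, and the placeholder syntax (`?` here) depends on the database driver.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Postgres database (an assumption
# made only so the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player_id TEXT, year_id INTEGER)")
conn.executemany(
    "INSERT INTO batting VALUES (?, ?)",
    [("playera01", 2017), ("playerb01", 2018), ("playerc01", 2018)],
)

# The driver substitutes the value for the placeholder.
sql = "SELECT player_id FROM batting WHERE year_id = ? ORDER BY player_id"
df = pd.read_sql(sql, conn, params=(2018,))
conn.close()

print(df)
```

Note also that names assigned inside a `%%timeit` cell are not kept after the cell finishes, so a query result that is needed later has to be computed once outside the timing cell.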

In [None]:
df.equals(result)

In [None]:
result.columns

In [None]:
%%timeit
# convert to retro_id
retro_id = people[people['player_id'].isin(result['player_id'])]['retro_id']