# Baseball Analysis 1

**Baseball Notebooks**  
1. Download and unzip the baseball data.
2. Helper functions and the motivation for using them.
3. Wrangle and persist the Lahman data.
4. Parse the Retrosheet play-by-play data, collect it into 2 DataFrames, and persist it.
5. Wrangle the Retrosheet data in preparation for data analysis.
6. Load the wrangled Retrosheet data into Postgres.
7. This notebook.

This notebook computes aggregates from Retrosheet and compares them with the corresponding values in Lahman.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

In [1]:
import pandas as pd
import numpy as np

import os
from pathlib import Path

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

retrosheet = home.joinpath('data/retrosheet')
p_retro_wrangled = retrosheet.joinpath('wrangled')

# Verify Lahman Data Matches Retrosheet Data

Lahman has data aggregated by stint.

A stint is usually, but not always, the same as grouping by (player_id, year, team_id).

An example where they differ is Preston Tucker (tuckepr01) in 2018.  He played for ATL, was traded to CIN, then was traded back to ATL, and played for each.  Tucker had 3 stints but only two rows when grouped by (player_id, year, team_id).
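The difference between stints and grouped rows can be sketched with a toy DataFrame. The column names follow the Lahman batting table used in this notebook, but the `b_ab` values are made up for illustration and are not the player's real numbers.

```python
import pandas as pd

# Toy rows shaped like Lahman batting; b_ab values are invented for illustration.
stints = pd.DataFrame({
    "player_id": ["tuckepr01"] * 3,
    "year": [2018, 2018, 2018],
    "stint": [1, 2, 3],
    "team_id": ["ATL", "CIN", "ATL"],
    "b_ab": [10, 20, 30],
})

# One row per stint: three rows.
n_stints = len(stints)

# Grouping by (player_id, year, team_id) merges the two ATL stints into one row.
by_team = stints.groupby(["player_id", "year", "team_id"])["b_ab"].sum()

print(n_stints)                                 # 3
print(len(by_team))                             # 2
print(by_team.loc[("tuckepr01", 2018, "ATL")])  # 40
```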

The Retrosheet data does not have stint information.

To compare the data between the two:
* Lahman batting/pitching will be aggregated by: player_id, year, team_id
* Retrosheet batting/pitching will be aggregated by: player_id_lahman, year, team_id_lahman

Some reasons for performing this data comparison:
* verify that the data sources have (almost completely) consistent data
* verify that the processing of the data was performed properly
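The comparison described above can be sketched on miniature data: aggregate each source on the common key, then subtract so that pandas aligns the rows by index. The two small DataFrames and their values are hypothetical; only the column naming follows this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two sources; all values are made up.
lahman = pd.DataFrame({
    "player_id": ["a01", "a01", "b02"],
    "year": [2018, 2018, 2018],
    "stint": [1, 2, 1],
    "team_id": ["ATL", "ATL", "CIN"],
    "b_h": [5, 7, 3],
})
retro = pd.DataFrame({
    "player_id_lahman": ["a01", "a01", "a01", "b02"],
    "year": [2018, 2018, 2018, 2018],
    "team_id_lahman": ["ATL", "ATL", "ATL", "CIN"],
    "b_h": [4, 4, 4, 2],
})

keys = ["player_id", "year", "team_id"]

# Lahman: sum over stints for each (player, year, team).
lahman_agg = lahman.groupby(keys)[["b_h"]].sum()

# Retrosheet: rename the key columns to match, then sum over games.
retro_agg = (
    retro.rename(columns={"player_id_lahman": "player_id",
                          "team_id_lahman": "team_id"})
    .groupby(keys)[["b_h"]]
    .sum()
)

# Subtraction aligns on the shared index; nonzero rows are discrepancies.
diff = lahman_agg - retro_agg
mismatches = diff[diff["b_h"] != 0]
print(mismatches)
```

Here the only mismatch is the made-up `b02` row, where the two toy sources differ by one hit.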

In [4]:
os.chdir(p_lahman_wrangled)
lahman_batting = bb.from_csv_with_types("batting.csv")

In [5]:
# Retrosheet data was only collected from 1955 and on
lahman_batting = lahman_batting[lahman_batting['year'] >= 1955]

In [6]:
lahman_batting['year'].min(), lahman_batting['year'].max()

(1955, 2018)

In [7]:
lahman_batting.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68397 entries, 18796 to 87192
Data columns (total 22 columns):
player_id    68397 non-null object
year         68397 non-null uint16
stint        68397 non-null uint8
team_id      68397 non-null object
lg_id        68397 non-null object
b_g          68397 non-null uint8
b_ab         68397 non-null uint16
b_r          68397 non-null uint8
b_h          68397 non-null uint16
b_2b         68397 non-null uint8
b_3b         68397 non-null uint8
b_hr         68397 non-null uint8
b_rbi        68397 non-null uint8
b_sb         68397 non-null uint8
b_cs         68397 non-null float64
b_bb         68397 non-null uint8
b_so         68397 non-null uint8
b_ibb        68397 non-null float64
b_hp         68397 non-null uint8
b_sh         68397 non-null uint8
b_sf         68397 non-null float64
b_gdp        68397 non-null float64
dtypes: float64(4), object(3), uint16(3), uint8(12)
memory usage: 5.3+ MB


In [8]:
lahman_batting_cols = [col for col in lahman_batting.columns if col.startswith('b_')]

In [9]:
os.chdir(p_retro_wrangled)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [10]:
player_game['year'].min(), player_game['year'].max()

(1955, 2018)

In [11]:
retro_batting_cols = [col for col in player_game.columns if col.startswith('b_')]

In [12]:
# cols in lahman, not in retro
set(lahman_batting_cols) - set(retro_batting_cols)

set()

In [13]:
# cols in retro, not in lahman
set(retro_batting_cols) - set(lahman_batting_cols)

{'b_pa', 'b_xi'}

As per the above, Retrosheet has two additional columns:
* b_pa = plate appearances
* b_xi = safe on interference (usually by the catcher)

b_pa only matters for batting-title eligibility, which requires 502 plate appearances in a 162-game season.  
b_xi is a rare event that probably isn't helpful.
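As a rough illustration of how the two extra columns relate to the shared ones, the sketch below uses the standard definition of a plate appearance (at bats, walks, hit by pitch, sacrifice hits, sacrifice flies, and times reaching on interference). That Retrosheet's `b_pa` is built exactly this way is an assumption, and the function name and sample numbers are hypothetical.

```python
QUALIFYING_PA = 502  # threshold quoted above for a 162-game season


def plate_appearances(b_ab, b_bb, b_hp, b_sh, b_sf, b_xi):
    """Standard plate-appearance count built from its components."""
    return b_ab + b_bb + b_hp + b_sh + b_sf + b_xi


# Made-up season line for illustration only.
pa = plate_appearances(b_ab=450, b_bb=40, b_hp=5, b_sh=2, b_sf=4, b_xi=1)
qualifies = pa >= QUALIFYING_PA

print(pa)         # 502
print(qualifies)  # True
```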

Perform the analysis with all the Lahman batting columns.

In [None]:
retro_batting_cols = ['player_id_lahman', 'year', 'team_id_lahman']
retro_batting_cols.extend(lahman_batting_cols)
retro_batting_cols

In [None]:
tmp = lahman_batting_cols.copy()
lahman_batting_cols = ['player_id', 'year', 'team_id']
lahman_batting_cols.extend(tmp)
lahman_batting_cols

In [None]:
retro_batting = player_game[retro_batting_cols]
retro_batting.tail(3)

In [None]:
lahman_batting = lahman_batting[lahman_batting_cols]
lahman_batting.tail(3)

In [None]:
retro_grouped = retro_batting.groupby(by=['player_id_lahman', 'year', 'team_id_lahman'])

In [None]:
retro_batting_agg = retro_grouped.aggregate(np.sum)
retro_batting_agg.head(3)

In [None]:
# no nulls in entire dataframe
retro_batting_agg.isna().sum().sum()

In [None]:
# rename index to allow for easier dataframe comparison
retro_batting_agg.index.names = ['player_id', 'year', 'team_id']
retro_batting_agg.head(3)

### Lahman has a row per 'stint'
For example, tuckepr01 started in ATL, went to CIN, then back to ATL, for 3 stints in 1 year with two teams.

In [None]:
lahman_grouped = lahman_batting.groupby(by=['player_id', 'year', 'team_id'])

In [None]:
lahman_batting_agg = lahman_grouped.aggregate(np.sum)
lahman_batting_agg.head(3)

In [None]:
# no nulls in entire dataframe
lahman_batting_agg.isna().sum().sum()

In [None]:
# which rows are in retro, but not in lahman
retro_batting_agg.loc[~retro_batting_agg.index.isin(lahman_batting_agg.index)]

In [None]:
# same using set notation
set(retro_batting_agg.index.to_list()) - set(lahman_batting_agg.index.to_list())

In [None]:
# which rows are in lahman, not in retro
lahman_batting_agg.loc[~lahman_batting_agg.index.isin(retro_batting_agg.index)]

In [None]:
# same using set notation
set(lahman_batting_agg.index.to_list()) - set(retro_batting_agg.index.to_list())

In [None]:
retro_batting_agg.shape, lahman_batting_agg.shape

In [None]:
# drop the extra lahman batting agg row
lahman_batting_agg = lahman_batting_agg.drop(('fanniji01', 1956, 'CHN'))

In [None]:
lahman_batting_agg.shape

In [None]:
(lahman_batting_agg.index == retro_batting_agg.index).all()

### Data Note

Retrosheet data is known to be missing a few of the older baseball games.

Per the above, only one (player_id, year, team_id) key appears in one aggregate but not the other, and that row represents only 4 at bats.

In [None]:
retro_agg_all = retro_batting_agg.aggregate(np.sum)
lahman_agg_all = lahman_batting_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

From the above, the Lahman totals are about 0.1% greater than the Retrosheet totals, and Retrosheet is known to be missing a few games.  This is a very close match.
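To make the ratio concrete, here is a minimal sketch with made-up column totals. A ratio of 1.001 corresponds to the first source being 0.1% larger than the second.

```python
import pandas as pd

# Hypothetical column totals; the numbers are invented for illustration.
lahman_totals = pd.Series({"b_ab": 1001, "b_h": 250})
retro_totals = pd.Series({"b_ab": 1000, "b_h": 250})

# Element-wise ratio, aligned on the column names.
ratio = (lahman_totals / retro_totals).round(3)

# Express the gap as a percentage difference.
pct_diff = (ratio - 1) * 100

print(ratio)
print(pct_diff)
```

A ratio above 1 for a column means Lahman has the larger total for that column; a ratio of exactly 1 means the totals agree.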

# Compare Pitching Aggregates

In [None]:
os.chdir(p_lahman_wrangled)
lahman_pitching = bb.from_csv_with_types("pitching.csv")

In [None]:
# Retrosheet data was only collected from 1955 and on
lahman_pitching = lahman_pitching[lahman_pitching['year'] >= 1955]

In [None]:
lahman_pitching['year'].min(), lahman_pitching['year'].max()

In [None]:
lahman_pitching.info()

In [None]:
lahman_pitching_cols = [col for col in lahman_pitching.columns if col.startswith('p_')]

In [None]:
# os.chdir(p_retro_wrangled)
# player_game = bb.from_csv_with_types('player_game.csv.gz')

In [None]:
player_game['year'].min(), player_game['year'].max()

In [None]:
retro_pitching_cols = [col for col in player_game.columns if col.startswith('p_')]

In [None]:
# cols in lahman, not in retro
set(lahman_pitching_cols) - set(retro_pitching_cols)

In [None]:
# cols in retro, not in lahman
set(retro_pitching_cols) - set(lahman_pitching_cols)

In [None]:
# will compare all columns that exist in both
pitching_cols = set(lahman_pitching_cols) & set(retro_pitching_cols)
pitching_cols = list(pitching_cols)
pitching_cols

In [None]:
retro_pitching_cols = ['player_id_lahman', 'year', 'team_id_lahman']
retro_pitching_cols.extend(pitching_cols)
retro_pitching = player_game[retro_pitching_cols]
retro_pitching.columns

In [None]:
lahman_pitching_cols = ['player_id', 'year', 'team_id']
lahman_pitching_cols.extend(pitching_cols.copy())
lahman_pitching = lahman_pitching[lahman_pitching_cols]
lahman_pitching.columns

In [None]:
retro_grouped = retro_pitching.groupby(by=['player_id_lahman', 'year', 'team_id_lahman'])
retro_pitching_agg = retro_grouped.aggregate(np.sum)
retro_pitching_agg.index.names = ['player_id', 'year', 'team_id']

In [None]:
lahman_grouped = lahman_pitching.groupby(by=['player_id', 'year', 'team_id'])
lahman_pitching_agg = lahman_grouped.aggregate(np.sum)

In [None]:
# no nulls in entire dataframe
retro_pitching_agg.isna().sum().sum()

In [None]:
# no nulls in entire dataframe
lahman_pitching_agg.isna().sum().sum()

In [None]:
retro_only = list(set(retro_pitching_agg.index) - set(lahman_pitching_agg.index))
len(retro_only)

In [None]:
lahman_only = set(lahman_pitching_agg.index) - set(retro_pitching_agg.index)
len(lahman_only)

In [None]:
# look at some of the differences
retro_pitching_agg.loc[retro_only, 'p_out'].head()

As per the above, the pitchers who are in Retrosheet but not in Lahman recorded no outs over the entire year.  Remove these rows.

In [None]:
criteria = (retro_pitching_agg['p_out'] == 0)
retro_pitching_agg = retro_pitching_agg.drop(retro_pitching_agg[criteria].index)

In [None]:
retro_only = list(set(retro_pitching_agg.index) - set(lahman_pitching_agg.index))
len(retro_only)

As per the above, after dropping the zero-out rows, no (player_id, year, team_id) key remains in Retrosheet that is missing from Lahman.
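Two aggregates share exactly the same keys only when the difference is empty in both directions. A minimal sketch of that two-way check, using hypothetical keys rather than the notebook's real indexes:

```python
import pandas as pd

names = ["player_id", "year", "team_id"]

# Hypothetical keys standing in for the two aggregate indexes.
retro_idx = pd.MultiIndex.from_tuples(
    [("a01", 2018, "ATL"), ("b02", 2018, "CIN")], names=names
)
lahman_idx = pd.MultiIndex.from_tuples(
    [("a01", 2018, "ATL"), ("c03", 2018, "NYA")], names=names
)

# Keys present on one side only, checked in each direction.
retro_only = retro_idx.difference(lahman_idx)
lahman_only = lahman_idx.difference(retro_idx)

# The key sets match only if both differences are empty.
same_keys = retro_only.empty and lahman_only.empty

print(list(retro_only))   # [('b02', 2018, 'CIN')]
print(list(lahman_only))  # [('c03', 2018, 'NYA')]
print(same_keys)          # False
```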

In [None]:
# with the indexes and columns the same, the data can be compared
retro_agg_all = retro_pitching_agg.aggregate(np.sum)
lahman_agg_all = lahman_pitching_agg.aggregate(np.sum)
np.round(lahman_agg_all / retro_agg_all, 3)

In [None]:
retro_agg_1975 = retro_pitching_agg.loc[(slice(None), slice(1975,2018), slice(None)), :]
lahman_agg_1975 = lahman_pitching_agg.loc[(slice(None), slice(1975,2018), slice(None)), :]

In [None]:
# compare the data from 1975 on
retro_agg_1975_all = retro_agg_1975.aggregate(np.sum)
lahman_agg_1975_all = lahman_agg_1975.aggregate(np.sum)

np.round(lahman_agg_1975_all / retro_agg_1975_all, 3)

# Perform above using Postgres

In [None]:
# Get the user and password from the environment (rather than hardcoding it)
import os
from sqlalchemy.engine import create_engine
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# treat the SQLAlchemy engine as a connection to the database
conn = create_engine(connect_str)
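One thing to watch with a connection string built by an f-string: a password containing characters such as `@` or `/` will break the URL unless it is escaped. The sketch below uses only the standard library; the helper name `build_connect_str` and the sample credentials are hypothetical, and the host, port, and database defaults simply mirror the cell above.

```python
from urllib.parse import quote_plus


def build_connect_str(user, password, host="localhost", port=5432, db="baseball"):
    """Build a postgresql:// URL, escaping special characters in the credentials."""
    if not user or not password:
        # Fail early instead of producing a URL containing the string 'None'.
        raise ValueError("DB_USER and DB_PASS must both be set")
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


# Made-up credentials showing how '@' and '/' are escaped.
example = build_connect_str("bb_user", "p@ss/word")
print(example)  # postgresql://bb_user:p%40ss%2Fword@localhost:5432/baseball
```

In the notebook, the values read from `DB_USER` and `DB_PASS` would be passed in, and the resulting string handed to `create_engine` as before.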

In [None]:
%%timeit
# same but use SQL
sql = """
SELECT player_id
FROM batting
WHERE year_id = '2018'
"""
df = pd.read_sql(sql, conn)

In [None]:
df.equals(result)

In [None]:
result.columns

In [None]:
%%timeit
# convert to retro_id
retro_id = people[people['player_id'].isin(result['player_id'])]['retro_id']