# Lahman Baseball Data

This is the second in a series of notebooks.

In the first notebook, the baseball data was downloaded and unzipped.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

## Lahman Data Dictionary
Depending upon context, a "Data Dictionary" is also called a "Codebook" or "Schema".

http://www.seanlahman.com/files/database/readme2016.txt  

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## Path Objects for Lahman Baseball Data

In [1]:
import pandas as pd
import numpy as np

import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_raw = lahman.joinpath('raw')
p_wrangled = lahman.joinpath('wrangled')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)
os.chdir(p_raw)

# Database

This section is preparation for interacting with Postgres.  Using Postgres is optional for the data analysis.

Prerequisites
1. PostgreSQL server is installed, configured and running.
2. baseball database has been created.

In [3]:
from sqlalchemy.engine import create_engine
from IPython.display import HTML, display

### Connect to DB

In [4]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# connect
conn = create_engine(connect_str)
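
If the password contains characters that are reserved in URLs (such as `@`, `:` or `/`), a connection string built with a plain f-string may not parse as intended. The following is a minimal sketch of one way to guard against that using only the standard library. The credentials shown are made-up placeholders, not values used by this notebook.

```python
from urllib.parse import quote_plus

# hypothetical credentials, for illustration only
db_user = 'postgres'
db_pass = 'p@ss:w/rd'

# percent-encode user and password so reserved URL characters survive
connect_str = (
    f'postgresql://{quote_plus(db_user)}:{quote_plus(db_pass)}'
    '@localhost:5432/baseball'
)
print(connect_str)
```

The encoded string can then be passed to `create_engine` in the same way as above.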

### SQL Magic

SQL magic is not used here because it does not release its connection until the notebook is closed.  A connection that stays open can hold a lock on a table, which prevents `conn` from performing database updates.

### **psql**

Note, there must be a ~/.pgpass file similar to the following to connect without a password:  
```localhost:5432:*:<user>:<passwd>```

In [7]:
# -H for html output
# this connects, executes, and disconnects
def psql(cmd, user='postgres', schema='baseball'):
    psql_out = !psql -H -U {user} {schema} -c "{cmd}"
    display(HTML(''.join(psql_out)))

In [8]:
psql(r'\d')

Schema,Name,Type,Owner
public,batting,table,postgres
public,fielding,table,postgres
public,people,table,postgres
public,pitching,table,postgres
public,r_game,table,postgres
public,r_game_fields,table,postgres
public,r_parks,table,postgres
public,r_player_game,table,postgres
public,r_player_game_fields,table,postgres
public,r_players,table,postgres


### CamelCase to snake_case

Postgres is easier to use without caps in the column names.

https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

In [9]:
# CamelCase to snake_case
def convert_camel_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [10]:
# example
convert_camel_case('playerID')

'player_id'
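
The two regular expressions handle typical Lahman names well, but not every name converts cleanly. The sketch below repeats the function so it runs on its own and applies it to a few column names from the Lahman files. Names that mix upper-case runs with lower-case letters, or that start with a digit, come out in a less readable form, so an explicit rename dictionary is probably the safer choice for such columns.

```python
import re

def convert_camel_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

# names taken from the People, Batting, Pitching and Fielding files
for name in ['birthYear', 'retroID', 'InnOuts', 'GIDP', 'IPouts', '2B']:
    print(f'{name:10} -> {convert_camel_case(name)}')

# birthYear  -> birth_year
# retroID    -> retro_id
# InnOuts    -> inn_outs
# GIDP       -> gidp
# IPouts     -> i_pouts
# 2B         -> 2_b
```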

# Useful Methods

### Persisting DataFrame Column Data Types with CSV

In [12]:
def to_csv_with_types(df, file_prefix):
    """Save df to csv and save df.dtypes to csv"""
    dtypes = df.dtypes.to_frame('dtypes').reset_index()
    dtypes.to_csv(file_prefix + '_types.csv', index=False)
    df.to_csv(file_prefix + '.csv', index=False)

In [13]:
def from_csv_with_types(file_prefix):
    """Read df.dtypes from csv and read df from csv"""
    types = pd.read_csv(file_prefix+'_types.csv').set_index('index').to_dict()
    dtypes = types['dtypes']
    
    dates = [key for key,value in dtypes.items() if value.startswith('datetime')]
    for field in dates:
        dtypes.pop(field)
        
    df = pd.read_csv(file_prefix+'.csv', parse_dates = dates, dtype=dtypes)
    return df
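
A small round trip may make the purpose of these two functions clearer. The sketch below repeats both functions so it is self-contained, writes a toy DataFrame (made-up values) to a temporary directory, and reads it back. With a plain `pd.read_csv`, the category and datetime columns would come back as generic object/string columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

def to_csv_with_types(df, file_prefix):
    """Save df to csv and save df.dtypes to csv"""
    dtypes = df.dtypes.to_frame('dtypes').reset_index()
    dtypes.to_csv(file_prefix + '_types.csv', index=False)
    df.to_csv(file_prefix + '.csv', index=False)

def from_csv_with_types(file_prefix):
    """Read df.dtypes from csv and read df from csv"""
    types = pd.read_csv(file_prefix + '_types.csv').set_index('index').to_dict()
    dtypes = types['dtypes']
    # datetime columns must go through parse_dates rather than dtype
    dates = [key for key, value in dtypes.items() if value.startswith('datetime')]
    for field in dates:
        dtypes.pop(field)
    return pd.read_csv(file_prefix + '.csv', parse_dates=dates, dtype=dtypes)

# toy data, for illustration only
toy = pd.DataFrame({
    'player_id': ['a01', 'b01', 'c01'],
    'bats': pd.Categorical(['R', 'L', 'R']),
    'debut': pd.to_datetime(['2004-04-06', '1954-04-13', '1999-05-01']),
    'weight': [215.0, 180.0, 190.0],
})

with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / 'toy')
    to_csv_with_types(toy, prefix)
    restored = from_csv_with_types(prefix)

print(restored.dtypes)
```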

### Is Unique over Multiple Columns

In [15]:
# this is faster than using groupby
def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()
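
As a quick illustration on a toy DataFrame (made-up values), the same rows can fail the check on two columns and pass it on three:

```python
import pandas as pd

def is_unique(df, cols):
    return not df.duplicated(subset=cols).any()

toy = pd.DataFrame({
    'player_id': ['a01', 'a01', 'b01'],
    'year_id':   [2001, 2001, 2001],
    'stint':     [1, 2, 1],
})

print(is_unique(toy, ['player_id', 'year_id']))           # False
print(is_unique(toy, ['player_id', 'year_id', 'stint']))  # True
```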

### Optimize Pandas Data Types
There is no way to do this "perfectly" in general, however the following is a good start.

In [48]:
def optimize_data_types(df, cutoff=0.05):
    df = df.copy()
    
    # number -> smallest unsigned int allowed by the data
    # (columns containing NaN or negative values are left unchanged)
    df_number = df.select_dtypes(include=[np.number])
    df_number = df_number.apply(pd.to_numeric, downcast='unsigned')
    df[df_number.columns] = df_number

    # object -> category, if the fraction of unique values is at most cutoff
    df_obj = df.select_dtypes(include=['object'])
    s = df_obj.nunique() / df.shape[0]
    columns = s.index[s <= cutoff].values
    df_cat = df[columns].astype('category')
    df[columns] = df_cat    
    
    return df
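
The downcasting step has limits worth knowing. The sketch below, on a toy DataFrame with made-up values, applies the same `pd.to_numeric(..., downcast='unsigned')` call to four kinds of columns: whole-number floats, larger non-negative integers, a column with a missing value, and a column with a negative value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'small':    [1.0, 2.0, 3.0],     # whole-number floats
    'medium':   [0, 300, 1000],      # too large for 8 bits
    'has_nan':  [1.0, np.nan, 3.0],  # NaN cannot be stored in an integer column
    'negative': [-1, 0, 1],          # cannot be stored as unsigned
})

result = toy.apply(pd.to_numeric, downcast='unsigned')
print(result.dtypes)
```

Only the first two columns become unsigned integers. This matches the `people` output, where `weight` and `height` remain `float64` because they contain missing values.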

In [42]:
def mem_usage(df):
    mem = df.memory_usage(deep=True).sum()
    mem = mem / 2 ** 20 # convert to megabytes
    return f'{mem:03.2f} MB'

# Main Files
As per:  
http://www.seanlahman.com/files/database/readme2016.txt

After readme2016.txt was written, master was renamed to People.

The 4 main files are:
*  People   - Player names, DOB, and biographical info
*  Batting  - batting statistics
*  Pitching - pitching statistics
*  Fielding - fielding statistics

# People

In [50]:
os.chdir(p_raw)
people = pd.read_csv('People.csv', parse_dates=['debut', 'finalGame'])

In [53]:
people.columns

Index(['playerID', 'birthYear', 'birthMonth', 'birthDay', 'birthCountry',
       'birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay',
       'deathCountry', 'deathState', 'deathCity', 'nameFirst', 'nameLast',
       'nameGiven', 'weight', 'height', 'bats', 'throws', 'debut', 'finalGame',
       'retroID', 'bbrefID'],
      dtype='object')

In [54]:
people.columns = [convert_camel_case(name) for name in people.columns]
people.columns

Index(['player_id', 'birth_year', 'birth_month', 'birth_day', 'birth_country',
       'birth_state', 'birth_city', 'death_year', 'death_month', 'death_day',
       'death_country', 'death_state', 'death_city', 'name_first', 'name_last',
       'name_given', 'weight', 'height', 'bats', 'throws', 'debut',
       'final_game', 'retro_id', 'bbref_id'],
      dtype='object')

In [55]:
# custom parsing of birth/death dates
def to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    
    # NaT if year is missing
    if pd.isna(y):
        return pd.NaT
    
    # fillna if year present but month missing
    if pd.isna(m):
        m = 1
        
    # fillna if year present but day missing
    if pd.isna(d):
        d = 1
        
    # pd.datetime was removed in pandas 1.0; pd.Timestamp builds the same date
    return pd.Timestamp(int(y), int(m), int(d))

In [56]:
people['birth_date'] = people.apply(lambda x: to_date(x, 'birth'), axis=1)
people['death_date'] = people.apply(lambda x: to_date(x, 'death'), axis=1)
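
The fallback rules in `to_date` can be seen on a toy DataFrame with made-up values. This sketch restates the function compactly and builds the date with `pd.Timestamp`, a choice made here so that it runs on current pandas versions.

```python
import numpy as np
import pandas as pd

def to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    if pd.isna(y):
        return pd.NaT              # no year, no date
    m = 1 if pd.isna(m) else m     # missing month -> January
    d = 1 if pd.isna(d) else d     # missing day -> the 1st
    return pd.Timestamp(int(y), int(m), int(d))

toy = pd.DataFrame({
    'birth_year':  [1981, 1900, np.nan],
    'birth_month': [12, np.nan, np.nan],
    'birth_day':   [27, np.nan, np.nan],
})
toy['birth_date'] = toy.apply(lambda row: to_date(row, 'birth'), axis=1)
print(toy['birth_date'])
```

The three rows give 1981-12-27, 1900-01-01 and `NaT` respectively. Note that a date filled in this way is indistinguishable from a true January 1st date.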

In [57]:
people = people.drop(
    ['birth_year', 'birth_month', 'birth_day', 
     'death_year', 'death_month', 'death_day'], axis=1)

In [58]:
# retro_id will be used with Retrosheet Data
# drop (the very few) rows with missing retro_id
people = people.dropna(subset=['retro_id'], axis=0)

In [59]:
# verify uniqueness
print(people['player_id'].is_unique)
print(people['retro_id'].is_unique)

True
True


In [60]:
mem_usage(people)

'16.78 MB'

In [61]:
people = optimize_data_types(people)

In [62]:
mem_usage(people)

'10.64 MB'

In [64]:
people.dtypes.value_counts()

object            8
datetime64[ns]    4
float64           2
category          1
category          1
category          1
category          1
category          1
category          1
dtype: int64

In [65]:
os.chdir(p_wrangled)
to_csv_with_types(people, 'people')

In [66]:
df = from_csv_with_types('people')
df.head(2)

Unnamed: 0,player_id,birth_country,birth_state,birth_city,death_country,death_state,death_city,name_first,name_last,name_given,weight,height,bats,throws,debut,final_game,retro_id,bbref_id,birth_date,death_date
0,aardsda01,USA,CO,Denver,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,1981-12-27,NaT
1,aaronha01,USA,AL,Mobile,,,,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01,1934-02-05,NaT


In [67]:
df.dtypes

player_id                object
birth_country          category
birth_state            category
birth_city               object
death_country          category
death_state            category
death_city               object
name_first               object
name_last                object
name_given               object
weight                  float64
height                  float64
bats                   category
throws                 category
debut            datetime64[ns]
final_game       datetime64[ns]
retro_id                 object
bbref_id                 object
birth_date       datetime64[ns]
death_date       datetime64[ns]
dtype: object

In [None]:
os.chdir(p_wrangled)
people.to_csv('people.csv', index=False)

In [None]:
people_ids = people[['player_id', 'retro_id']].astype('category')

In [None]:
people_ids.dtypes.to_frame('dtypes').reset_index()

In [None]:
people_ids.to_csv('people_ids.csv', index=False)

In [None]:
people_ids2 = pd.read_csv('people_ids.csv')

In [None]:
from pandas.api.types import CategoricalDtype
player_id_cat = CategoricalDtype(people_ids['player_id'], ordered=True)
retro_id_cat = CategoricalDtype(people_ids['retro_id'], ordered=True)

In [None]:
people['player_id'].astype(retro_id_cat)

In [None]:
people['player_id']

In [None]:
# replace the table if it exists
people.to_sql('people', conn, if_exists='replace', index=False)

In [None]:
rs = conn.execute("SELECT COUNT(*) from people")
rs.fetchall()

In [None]:
# add primary key, unique and not null constraints
sql   = 'ALTER TABLE people ADD PRIMARY KEY (player_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ADD CONSTRAINT retro_unique UNIQUE (retro_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ALTER COLUMN retro_id SET NOT NULL'
conn.execute(sql);

In [None]:
# describe the table
psql(r'\d people')

# Batting

In [None]:
# the raw files are in p_raw; the working directory is p_wrangled at this point
batting = pd.read_csv(p_raw / 'Batting.csv')

In [None]:
batting.columns

## Rename to use retrosheet names for the corresponding fields

The following is from the RetrosheetBaseball Jupyter notebook.
```
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference', 
```

In [None]:
names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'G':'b_g',
    'AB':'b_ab',
    'R':'b_r',
    'H':'b_h',
    '2B':'b_2b',
    '3B':'b_3b',
    'HR':'b_hr',
    'RBI':'b_rbi',
    'SB':'b_sb',
    'CS':'b_cs',
    'BB':'b_bb',
    'SO':'b_so',
    'IBB':'b_ibb',
    'HBP':'b_hp',
    'SH':'b_sh',
    'SF':'b_sf',
    'GIDP':'b_gdp'
}

In [None]:
batting.rename(columns=names, inplace=True)
batting.columns

In [None]:
# certain stats are null only for old games
# this study will be from 1955 onward
batting = batting.drop(batting[batting['year_id'] < 1955].index)
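
Dropping by index in this way selects the same rows as keeping the complement with a boolean mask, provided the index labels are unique. A minimal sketch on toy data (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({'year_id': [1950, 1955, 1960], 'b_ab': [10, 0, 25]})

# drop the rows that match the condition ...
dropped = toy.drop(toy[toy['year_id'] < 1955].index)

# ... or keep the rows that do not match it
masked = toy[toy['year_id'] >= 1955]

print(dropped.equals(masked))  # True
```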

In [None]:
batting['year_id'].min()

In [None]:
# batting stats for a given year with 0 at-bats are meaningless
# a player could be a pinch-runner and have no at-bats
batting = batting.drop(batting[batting['b_ab'] == 0].index)

In [None]:
# these are integers, but had NA, so were converted to float
batting_float = batting.select_dtypes(include=['float']).copy()
batting_float.columns

In [None]:
# after removing years < 1955, there are no longer any null values
batting_float.isna().sum()

In [None]:
batting_numeric = batting.select_dtypes(include=[np.number])

In [None]:
# pandas will downcast as far as the data allows
batting_numeric = batting_numeric.apply(pd.to_numeric,downcast='unsigned')
batting_numeric.dtypes.value_counts()

In [None]:
batting[batting_numeric.columns] = batting_numeric

In [None]:
batting.dtypes.value_counts()

In [None]:
batting_obj = batting.select_dtypes(include='object')
batting_obj.columns

In [None]:
batting_obj.nunique()

In [None]:
batting[['team_id', 'lg_id']] = batting_obj[['team_id', 'lg_id']].astype('category')

In [None]:
batting.dtypes.value_counts()

In [None]:
from sqlalchemy.types import SmallInteger

In [None]:
# SmallInteger is not deduced from uint8 or uint16 dataframe column type
dtype = {c:SmallInteger for c in batting.select_dtypes(include=np.integer).columns}
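
PostgreSQL's `smallint` is a signed 2-byte integer with a range of -32768 to 32767, so a `uint16` column could in principle hold values that do not fit. The sketch below shows one way to check this before forcing the type. `fits_smallint` is a hypothetical helper, not part of this notebook, and the data is made up.

```python
import numpy as np
import pandas as pd

# range of PostgreSQL smallint (signed 2-byte integer)
SMALLINT_MIN, SMALLINT_MAX = -32768, 32767

def fits_smallint(df):
    """Hypothetical helper: per integer column, True if all values fit in smallint."""
    # widen to int64 first so the comparison itself cannot overflow
    ints = df.select_dtypes(include=np.integer).astype('int64')
    return (ints.min() >= SMALLINT_MIN) & (ints.max() <= SMALLINT_MAX)

toy = pd.DataFrame({
    'b_ab': np.array([500, 612, 0], dtype='uint16'),
    'big':  np.array([40000, 1, 2], dtype='uint16'),  # exceeds 32767
})
print(fits_smallint(toy))
```

Season-level counting stats are far below this limit, so the check is mainly a safeguard when the same approach is reused on other tables.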

In [None]:
dtype

In [None]:
batting.to_sql('batting', conn, if_exists='replace', index=False, dtype=dtype)

In [None]:
# verify unique
is_unique(batting, ['player_id', 'year_id', 'stint'])

In [None]:
sql = 'ALTER TABLE batting ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [None]:
psql(r'\d batting')

# Pitching

In [None]:
# the raw files are in p_raw; the working directory is p_wrangled at this point
pitching = pd.read_csv(p_raw / 'Pitching.csv')

In [None]:
pitching.columns

## Rename to Match Retrosheet
```
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 'p_3b': 'triples allowed',
 'p_hr': 'home runs allowed',
 'p_bb': 'walks allowed',
 'p_ibb': 'intentional walks allowed',
 'p_so': 'strikeouts',
 'p_gdp': 'grounded into double play',
 'p_hp': 'hit batsmen',
 'p_sh': 'sacrifice hits against',
 'p_sf': 'sacrifice flies against',
 'p_xi': 'reached on interference',
 'p_wp': 'wild pitches',
 'p_bk': 'balks'
``` 

In [None]:
names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'W':'p_w',
    'L':'p_l',
    'G':'p_g',
    'GS':'p_gs',
    'CG':'p_cg',
    'SHO':'p_sho',
    'SV':'p_sv',
    'IPouts':'p_outs',
    'H':'p_h',
    'ER':'p_er',
    'HR':'p_hr',
    'BB':'p_bb',
    'SO':'p_so',
    'BAOpp':'p_ba_opp', # not in retrosheet player_game
    'ERA':'p_era', # not in retrosheet player_game
    'IBB':'p_ibb',
    'WP':'p_wp',
    'HBP':'p_hp',
    'BK':'p_bk',
    'BFP':'p_bfp', # not in retrosheet player_game
    'GF':'p_gf', # not in retrosheet
    'R':'p_r',
    'SH':'p_sh',
    'SF':'p_sf',
    'GIDP':'p_gdp'
        }

In [None]:
pitching.rename(columns=names, inplace=True)

In [None]:
pitching.dtypes.value_counts()

In [None]:
# certain stats are null only for old games
# this study will be from 1955 onward
pitching = pitching.drop(pitching[pitching['year_id'] < 1955].index)

In [None]:
pitching['year_id'].min(), pitching['year_id'].max()

In [None]:
# if the pitcher recorded fewer than 3 outs for the entire year, drop the record
pitching = pitching.drop(pitching[pitching['p_outs'] < 3].index)

In [None]:
# np.float was removed from NumPy; the string 'float' selects the same columns
pitching_float = pitching.select_dtypes(include=['float'])

In [None]:
pitching_float.isna().sum()

In [None]:
# find highest year that has a null value
for col in ['p_sh','p_sf','p_gdp']:
    print(col, pitching[pitching[col].isna()]['year_id'].max())

In [None]:
# smallint works for all numeric values for the database
pitching.describe().T

In [None]:
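# p_era and p_ba_opp are rates rather than counts, so they are not forced to smallint
rates = ['p_era', 'p_ba_opp']
dtype = {col:SmallInteger for col in pitching.select_dtypes(include=np.number).columns
         if col not in rates}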

In [None]:
dtype

In [None]:
pitching.to_sql('pitching', conn, if_exists='replace', index=False, dtype=dtype)

In [None]:
# verify unique
is_unique(pitching, ['player_id', 'year_id', 'stint'])

In [None]:
sql = 'ALTER TABLE pitching ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [None]:
psql(r'\d pitching')

# Fielding

In [None]:
# the raw files are in p_raw; the working directory is p_wrangled at this point
fielding = pd.read_csv(p_raw / 'Fielding.csv')

In [None]:
fielding.columns

In [None]:
fielding.columns = [convert_camel_case(name) for name in fielding.columns]
fielding.columns

In [None]:
# as above, drop records before 1955
fielding = fielding.drop(fielding[fielding['year_id'] < 1955].index)

In [None]:
# inn_outs is the time played in the field, expressed as outs
# drop records in which the fielder played no outs (stats are meaningless)
fielding = fielding.drop(fielding[fielding['inn_outs'] == 0].index)

# drop records in which the number of outs played is unknown
fielding = fielding.dropna(subset=['inn_outs'])

In [None]:
fielding.isna().sum()

In [None]:
# pb, wp, sb, cs only apply to catchers, hence the large number of nulls
# it is reasonable to use fillna with 0
fielding[['pb','wp','sb','cs']] = fielding[['pb','wp','sb','cs']].fillna(0)

In [None]:
# zr applies to all fielders, but is 99% null, drop this column
fielding['zr'].isna().sum() / fielding.shape[0]

In [None]:
fielding = fielding.drop('zr', axis=1)

In [None]:
fielding.isna().sum()

In [None]:
# smallint works for all numeric values
fielding.describe().T

In [None]:
dtype = {col:SmallInteger for col in fielding.select_dtypes(include=np.number).columns}

In [None]:
fielding.to_sql('fielding', conn, if_exists='replace', index=False, dtype=dtype)

In [None]:
is_unique(fielding, ['player_id', 'year_id', 'stint', 'pos'])

### Note on Position

This is based on my MLB domain knowledge.

Players in recent years are increasingly playing more than one position in a single game, let alone in a single stint.

Catchers and Pitchers rarely play a position other than catcher or pitcher (except in exceedingly long extra inning games).

Usually, but not always, infielders play one of the infield positions.

Usually, but not always, outfielders play one of the outfield positions.

So although every player is listed as having a specific position, this position is not fixed.  It is likely that the position represents the position most often played by that player.

A 'stint' means playing for 1 team. If a player plays for 5 different teams in the same year, then the player has 5 stints.
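
The relationship between stints and positions can be sketched on toy data. The rows below are made up and assume, as the uniqueness check above suggests, one row per player, year, stint and position.

```python
import pandas as pd

# toy fielding rows, for illustration only
toy = pd.DataFrame({
    'player_id': ['a01', 'a01', 'a01', 'b01'],
    'year_id':   [2001, 2001, 2001, 2001],
    'stint':     [1, 1, 2, 1],
    'team_id':   ['BOS', 'BOS', 'NYA', 'CHN'],
    'pos':       ['2B', 'SS', '2B', 'C'],
})

# number of stints (teams played for) per player per year
stints = toy.groupby(['player_id', 'year_id'])['stint'].nunique()
print(stints)

# positions recorded within each stint
positions = toy.groupby(['player_id', 'year_id', 'stint'])['pos'].apply(list)
print(positions)
```

In this toy data, player `a01` has two stints in 2001 and two positions within the first stint, which is why `pos` has to be part of the key.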

In [None]:
fielding['pos'].value_counts()

In [None]:
fielding['stint'].value_counts(normalize=True)

In [None]:
sql = 'ALTER TABLE fielding ADD PRIMARY KEY (player_id, year_id, stint, pos)'
conn.execute(sql);

In [None]:
psql(r'\d fielding')