# Lahman Baseball Data

**Baseball Notebooks**  
1. Downloaded and unzipped baseball data.
2. Helper functions and their motivation for use.
3. This notebook. 

The Lahman data will be wrangled and persisted.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

## Lahman Data Dictionary
A "Data Dictionary" is also called a "Codebook".

http://www.seanlahman.com/files/database/readme2016.txt  

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## Path Objects for Lahman Baseball Data

In [1]:
import pandas as pd
import numpy as np

import os
import re
import wget
from pathlib import Path
import zipfile

from IPython.display import HTML, display
from sqlalchemy import create_engine
from sqlalchemy.types import SmallInteger, Integer, BigInteger

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_raw = lahman.joinpath('raw')
p_wrangled = lahman.joinpath('wrangled')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)
os.chdir(p_raw)

# Database

Using Postgres, or any database, is optional for the baseball data analysis.  However, in a business environment data often comes from a database, so this notebook also shows how to work with one.

This section is preparation for interacting with Postgres.

Prerequisites
1. PostgreSQL server is installed, configured and running.
2. baseball database has been created.

### Connect to DB

In [4]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# connect
conn = create_engine(connect_str)

type(conn)

sqlalchemy.engine.base.Engine

In [5]:
type(conn.connect())

sqlalchemy.engine.base.Connection
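
If `DB_USER` or `DB_PASS` is unset, `os.environ.get` returns `None` and the f-string quietly produces a URL containing the text "None". A password containing `@`, `:` or `/` would also break the URL. The following is a minimal sketch of one way to guard against both problems. `build_connect_str` is a hypothetical helper, not part of this notebook or of SQLAlchemy.

```python
import os
from urllib.parse import quote

def build_connect_str(env=os.environ, host='localhost', port=5432, dbname='baseball'):
    """Build a postgresql:// URL from DB_USER / DB_PASS, failing early if either is unset."""
    unset = [name for name in ('DB_USER', 'DB_PASS') if not env.get(name)]
    if unset:
        raise KeyError(f"unset environment variables: {', '.join(unset)}")
    # percent-encode characters such as '@', ':' and '/' so they cannot break the URL
    user = quote(env['DB_USER'], safe='')
    password = quote(env['DB_PASS'], safe='')
    return f'postgresql://{user}:{password}@{host}:{port}/{dbname}'

# example with made-up credentials
example_url = build_connect_str(env={'DB_USER': 'bb', 'DB_PASS': 'p@ss:w/d'})
```

The resulting string could be passed to `create_engine` in place of `connect_str`.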

### SQL Magic

SQL Magic is not used here because it does not release its connection until the notebook is closed.  That open connection can hold a lock on a table, which blocks database work done through conn (above) in df.to_sql() and pd.read_sql().

A connection from SQLAlchemy can be used almost identically to the [Python DB API](https://www.python.org/dev/peps/pep-0249/).

When an SQLAlchemy Engine is used to run SQL, a connection is allocated and used, any changes are committed, and the connection is released.

When an SQLAlchemy Connection is used instead (not done here), transaction processing can be performed.
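
To make the idea of transaction processing concrete, here is a minimal sketch using the standard-library `sqlite3` module, which follows the Python DB API. It is neither SQLAlchemy nor Postgres. The in-memory database and the `demo` table are assumptions made purely for illustration.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT)')

# a successful unit of work: commit makes the insert permanent
db.execute("INSERT INTO demo (name) VALUES ('kept')")
db.commit()

# a failed unit of work: rollback discards the uncommitted insert
db.execute("INSERT INTO demo (name) VALUES ('discarded')")
db.rollback()

rows = db.execute('SELECT name FROM demo').fetchall()
db.close()
```

Only the committed row survives, which is the behavior a Connection-level transaction gives control over.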

## psql

Use the following to run psql commands from a Jupyter Code cell.

This will connect, execute, and disconnect from the database.

For this to work without a password, a .pgpass file is necessary.  
See: https://www.postgresql.org/docs/11/libpq-pgpass.html    

The .pgpass file should look like:  
```localhost:5432:*:<user>:<passwd>```
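
On Unix systems libpq ignores a .pgpass file that group or other users can access, so the file needs owner-only permissions. The sketch below writes such a file to a temporary directory rather than the real home directory. `write_pgpass` and the credentials are hypothetical, and fields containing `:` or `\` would additionally need escaping, which this sketch does not do.

```python
import stat
import tempfile
from pathlib import Path

def write_pgpass(path, user, password, host='localhost', port=5432, database='*'):
    """Write a one-line .pgpass file that only its owner can read or write."""
    path = Path(path)
    path.write_text(f'{host}:{port}:{database}:{user}:{password}\n')
    # equivalent to: chmod 0600 <path>
    path.chmod(0o600)
    return path

# demonstrate in a scratch directory so no real file is overwritten
scratch = Path(tempfile.mkdtemp())
pgpass = write_pgpass(scratch / '.pgpass', 'someuser', 'secret')
mode = stat.S_IMODE(pgpass.stat().st_mode)
```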

In [6]:
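def psql(cmd, user='postgres', schema='baseball'):
    # note: `schema` is passed to psql as the database name
    # -H asks psql for HTML output, which is then rendered in the notebook
    psql_out = !psql -H -U {user} {schema} -c "{cmd}"
    display(HTML(''.join(psql_out)))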
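
The `!psql` syntax only works inside IPython. Outside a notebook, roughly the same call can be sketched with `subprocess`. The function names below are hypothetical, and `run_psql` assumes `psql` is on the PATH and that a .pgpass file supplies the password.

```python
import subprocess

def build_psql_argv(cmd, user='postgres', dbname='baseball'):
    """Return the argument list equivalent to: psql -H -U user dbname -c "cmd"."""
    return ['psql', '-H', '-U', user, dbname, '-c', cmd]

def run_psql(cmd, user='postgres', dbname='baseball'):
    """Run psql once and return its HTML output as a string."""
    result = subprocess.run(build_psql_argv(cmd, user, dbname),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the command as a list avoids shell quoting problems when `cmd` itself contains quotes.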

In [7]:
!psql --version

psql (PostgreSQL) 11.2 (Ubuntu 11.2-1.pgdg18.04+1)


# Main Lahman Baseball Files
As per:  
http://www.seanlahman.com/files/database/readme2016.txt

After readme2016.txt was written, master was renamed to People.

The 4 main files are:
*  People   - Player names, DOB, and biographical info
*  Batting  - batting statistics
*  Pitching - pitching statistics
*  Fielding - fielding statistics

# People

In [8]:
os.chdir(p_raw)
people = pd.read_csv('People.csv', parse_dates=['debut', 'finalGame'])

In [9]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19617 entries, 0 to 19616
Data columns (total 24 columns):
playerID        19617 non-null object
birthYear       19497 non-null float64
birthMonth      19332 non-null float64
birthDay        19188 non-null float64
birthCountry    19553 non-null object
birthState      19090 non-null object
birthCity       19441 non-null object
deathYear       9649 non-null float64
deathMonth      9648 non-null float64
deathDay        9647 non-null float64
deathCountry    9646 non-null object
deathState      9598 non-null object
deathCity       9641 non-null object
nameFirst       19580 non-null object
nameLast        19617 non-null object
nameGiven       19580 non-null object
weight          18792 non-null float64
height          18875 non-null float64
bats            18433 non-null object
throws          18638 non-null object
debut           19420 non-null datetime64[ns]
finalGame       19420 non-null datetime64[ns]
retroID         19561 non-null object

In [10]:
people.columns = [bb.convert_camel_case(name) for name in people.columns]
people.columns

Index(['player_id', 'birth_year', 'birth_month', 'birth_day', 'birth_country',
       'birth_state', 'birth_city', 'death_year', 'death_month', 'death_day',
       'death_country', 'death_state', 'death_city', 'name_first', 'name_last',
       'name_given', 'weight', 'height', 'bats', 'throws', 'debut',
       'final_game', 'retro_id', 'bbref_id'],
      dtype='object')
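
`bb.convert_camel_case` is defined in the helper notebook and is not shown here. As a rough idea of what such a conversion involves, the sketch below inserts an underscore wherever a lowercase letter or digit is followed by an uppercase letter, then lowercases the result. `camel_to_snake` is a hypothetical stand-in and may differ from the real helper.

```python
import re

def camel_to_snake(name):
    """Convert names such as 'birthYear' or 'playerID' to 'birth_year' / 'player_id'."""
    # put '_' between a lowercase letter (or digit) and the uppercase letter after it
    with_breaks = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name)
    return with_breaks.lower()

converted = [camel_to_snake(n) for n in ['playerID', 'birthYear', 'finalGame', 'bbrefID']]
```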

In [11]:
# custom parsing of birth/death dates
def to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    
    # NaT if year is missing
    if pd.isna(y):
        return pd.NaT
    
    # fillna if year present but month missing
    if pd.isna(m):
        m = 1
        
    # fillna if year present but day missing
    if pd.isna(d):
        d = 1
        
    # pd.datetime was removed from pandas; pd.Timestamp is the supported equivalent
    return pd.Timestamp(year=int(y), month=int(m), day=int(d))

In [12]:
people['birth_date'] = people.apply(lambda x: to_date(x, 'birth'), axis=1)
people['death_date'] = people.apply(lambda x: to_date(x, 'death'), axis=1)

In [13]:
people = people.drop(
    ['birth_year', 'birth_month', 'birth_day', 
     'death_year', 'death_month', 'death_day'], axis=1)

In [14]:
# retro_id is required to work with Retrosheet Data
# get list of players without a Retrosheet player_id
missing = people.loc[people['retro_id'].isna(), 'player_id']
missing.head()

1127     bellco99
2123    brownra99
2238    bulkemo99
2769    cartwal99
2923    chadwhe99
Name: player_id, dtype: object

In [15]:
# drop players without a retro_id
people = people.dropna(subset=['retro_id'], axis=0)

In [16]:
# verify num unique is num records for both fields
# this implies the mapping of player_id to retro_id is 1 to 1 and onto
print(people['player_id'].nunique() == people.shape[0])
print(people['retro_id'].nunique() == people.shape[0])

True
True
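
The reasoning behind this check can be sketched in plain Python: if each column has as many distinct values as there are rows, no ID on either side is repeated, so the pairing is one to one. `is_one_to_one` and the IDs below are made up for illustration.

```python
def is_one_to_one(pairs):
    """True if no left value and no right value occurs more than once in pairs."""
    pairs = list(pairs)
    lefts = {left for left, _ in pairs}
    rights = {right for _, right in pairs}
    return len(lefts) == len(pairs) and len(rights) == len(pairs)

good = is_one_to_one([('a01', 'x001'), ('b01', 'y001')])
shared_right = is_one_to_one([('a01', 'x001'), ('b01', 'x001')])
```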


### Persist as CSV with Column Types
Use the helper function described in the previous notebook to save the data types to a separate CSV file.

In [17]:
os.chdir(p_wrangled)
bb.to_csv_with_types(people, 'people.csv')

In [18]:
# verify that data type information was not lost
df2 = bb.from_csv_with_types('people.csv')
(df2.dtypes == people.dtypes).all()

True

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists, therefore drop the table first.

In [19]:
conn.execute("DROP TABLE IF EXISTS people");

In [20]:
# create Postgres people table
people.to_sql('people', conn, index=False)

In [21]:
# check that it worked by selecting number of people records
rs = conn.execute("SELECT COUNT(*) from people")
rs.fetchall()

[(19561,)]

In [22]:
# add primary key, unique and not null constraints
sql   = 'ALTER TABLE people ADD PRIMARY KEY (player_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ADD CONSTRAINT retro_unique UNIQUE (retro_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ALTER COLUMN retro_id SET NOT NULL'
conn.execute(sql);

In [23]:
# describe the table
# raw string: '\d' is not a valid Python escape sequence
psql(r'\d people')

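| Column | Type | Collation | Nullable | Default |
| --- | --- | --- | --- | --- |
| player_id | text | | not null | |
| birth_country | text | | | |
| birth_state | text | | | |
| birth_city | text | | | |
| death_country | text | | | |
| death_state | text | | | |
| death_city | text | | | |
| name_first | text | | | |
| name_last | text | | | |
| name_given | text | | | |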


# Batting

In [24]:
os.chdir(p_raw)

# yearID is read as an integer (see batting.info() below)
batting = pd.read_csv('Batting.csv')

In [25]:
batting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105861 entries, 0 to 105860
Data columns (total 22 columns):
playerID    105861 non-null object
yearID      105861 non-null int64
stint       105861 non-null int64
teamID      105861 non-null object
lgID        105123 non-null object
G           105861 non-null int64
AB          105861 non-null int64
R           105861 non-null int64
H           105861 non-null int64
2B          105861 non-null int64
3B          105861 non-null int64
HR          105861 non-null int64
RBI         105105 non-null float64
SB          103493 non-null float64
CS          82320 non-null float64
BB          105861 non-null int64
SO          103761 non-null float64
IBB         69210 non-null float64
HBP         103044 non-null float64
SH          99792 non-null float64
SF          69757 non-null float64
GIDP        80420 non-null float64
dtypes: float64(9), int64(10), object(3)
memory usage: 17.8+ MB


## Rename to use Retrosheet Names for the Same Fields

The following is from the RetrosheetBaseball Jupyter notebook.
```
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference', 
```

In [26]:
retro_names = {
    'playerID':'player_id',
    'yearID':'year',
    'teamID':'team_id',
    'lgID':'lg_id',
    'G':'b_g',
    'AB':'b_ab',
    'R':'b_r',
    'H':'b_h',
    '2B':'b_2b',
    '3B':'b_3b',
    'HR':'b_hr',
    'RBI':'b_rbi',
    'SB':'b_sb',
    'CS':'b_cs',
    'BB':'b_bb',
    'SO':'b_so',
    'IBB':'b_ibb',
    'HBP':'b_hp',
    'SH':'b_sh',
    'SF':'b_sf',
    'GIDP':'b_gdp'
}

In [29]:
batting.rename(columns=retro_names, inplace=True)
batting.columns

Index(['player_id', 'year', 'stint', 'team_id', 'lg_id', 'b_g', 'b_ab', 'b_r',
       'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb', 'b_so',
       'b_ibb', 'b_hp', 'b_sh', 'b_sf', 'b_gdp'],
      dtype='object')

In [30]:
# Retrosheet only has data from 1921 onward, keep the same from Lahman
batting = batting.drop(batting[batting['year'] < 1921].index)

In [31]:
(batting['year'].min(), batting['year'].max())

(1921, 2018)

In [32]:
# are any of the players in batting missing a retro_id?
# .any() answers this; .all() would only test whether every player is missing one
(batting['player_id'].isin(missing)).any()

False

As per above, no player with a missing retro_id is in the batting dataframe.

In [33]:
batting_float = batting.select_dtypes(include=['float']).copy()
batting_float.columns

Index(['b_rbi', 'b_sb', 'b_cs', 'b_so', 'b_ibb', 'b_hp', 'b_sh', 'b_sf',
       'b_gdp'],
      dtype='object')

In [34]:
# these are integers, but had NA, so were converted to float
batting_float.apply(bb.is_int)

b_rbi    True
b_sb     True
b_cs     True
b_so     True
b_ibb    True
b_hp     True
b_sh     True
b_sf     True
b_gdp    True
dtype: bool

In [35]:
# after removing rows before 1921, some fields no longer have null values
batting_float.isna().sum()

b_rbi        0
b_sb         0
b_cs      6736
b_so         0
b_ibb    18137
b_hp         0
b_sh         0
b_sf     17588
b_gdp     7661
dtype: int64

In [66]:
criteria = (batting_float.apply(bb.is_int)) & (batting_float.isna().sum() == 0)
cols = criteria[criteria].index.to_list()
cols

['b_rbi', 'b_sb', 'b_so', 'b_hp', 'b_sh']
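
The selection rule used here is: a float column can be cast back to int only if every value is a whole number and no value is missing. `bb.is_int` is not shown in this notebook, so the plain-Python sketch below is an assumption about the idea, not the helper's actual code.

```python
import math

def is_integral(values):
    """True if every non-missing float in values has no fractional part."""
    return all(float(v).is_integer() for v in values if not math.isnan(v))

def can_cast_to_int(values):
    """True if values are all whole numbers and none are missing (NaN)."""
    has_missing = any(math.isnan(v) for v in values)
    return is_integral(values) and not has_missing

whole_with_gap = [3.0, 12.0, float('nan')]
whole_no_gap = [3.0, 12.0, 0.0]
```

A column like `whole_with_gap` passes the first test but must stay float, which matches the four columns that remain `float64` after the cast.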

In [44]:
# cast these back to int
batting[cols] = batting_float[cols].astype('int')

In [45]:
batting = bb.optimize_df_dtypes(batting)

In [46]:
batting.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87193 entries, 18668 to 105860
Data columns (total 22 columns):
player_id    87193 non-null object
year         87193 non-null uint16
stint        87193 non-null uint8
team_id      87193 non-null object
lg_id        87193 non-null object
b_g          87193 non-null uint8
b_ab         87193 non-null uint16
b_r          87193 non-null uint8
b_h          87193 non-null uint16
b_2b         87193 non-null uint8
b_3b         87193 non-null uint8
b_hr         87193 non-null uint8
b_rbi        87193 non-null uint8
b_sb         87193 non-null uint8
b_cs         80457 non-null float64
b_bb         87193 non-null uint8
b_so         87193 non-null uint8
b_ibb        69056 non-null float64
b_hp         87193 non-null uint8
b_sh         87193 non-null uint8
b_sf         69605 non-null float64
b_gdp        79532 non-null float64
dtypes: float64(4), object(3), uint16(3), uint8(12)
memory usage: 6.8+ MB
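
`bb.optimize_df_dtypes` is defined in the helper notebook. One plausible rule for the integer columns, sketched below in plain Python, is to pick the narrowest unsigned type whose range covers the column's values. `smallest_unsigned_dtype` is a hypothetical name and the real helper may work differently.

```python
UNSIGNED_LIMITS = [
    ('uint8', 2**8 - 1),
    ('uint16', 2**16 - 1),
    ('uint32', 2**32 - 1),
    ('uint64', 2**64 - 1),
]

def smallest_unsigned_dtype(values):
    """Return the name of the narrowest unsigned dtype that can hold all values."""
    if min(values) < 0:
        raise ValueError('negative values need a signed dtype')
    largest = max(values)
    for name, limit in UNSIGNED_LIMITS:
        if largest <= limit:
            return name
    raise ValueError('value too large for uint64')

year_dtype = smallest_unsigned_dtype([1921, 2018])
```

This agrees with the output above, where `year` is stored as `uint16`.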


### Persist as CSV with Column Types

In [47]:
os.chdir(p_wrangled)
bb.to_csv_with_types(batting, 'batting.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists, therefore drop the table first.

In [48]:
dtypes = bb.optimize_db_dtypes(batting)
dtypes

{'year': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_ab': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_r': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_h': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_2b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_3b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_rbi': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_so': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sh': sqlalchemy.sql.sqltypes.SmallInteger}
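
Postgres has no unsigned integer types, so the database type has to be chosen from the signed ranges of smallint, integer and bigint. The sketch below picks a type from a column's actual minimum and maximum. `postgres_int_type` is a hypothetical name, and `bb.optimize_db_dtypes` may use a different rule.

```python
def postgres_int_type(min_value, max_value):
    """Return the narrowest Postgres integer type that covers [min_value, max_value]."""
    if -2**15 <= min_value and max_value <= 2**15 - 1:
        return 'smallint'
    if -2**31 <= min_value and max_value <= 2**31 - 1:
        return 'integer'
    if -2**63 <= min_value and max_value <= 2**63 - 1:
        return 'bigint'
    raise ValueError('range does not fit in a Postgres integer type')

year_type = postgres_int_type(1921, 2018)
```

Checking values rather than the pandas dtype matters because a `uint16` column can hold numbers up to 65535, which is beyond the smallint maximum of 32767.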

In [49]:
conn.execute("DROP TABLE IF EXISTS batting");

In [50]:
batting.to_sql('batting', conn, index=False, dtype=dtypes)

In [51]:
# verify unique
bb.is_unique(batting, ['player_id', 'year', 'stint'])

True
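
When a uniqueness check like this fails, it helps to see which keys are repeated before adding a primary key. The following plain-Python sketch counts composite keys. `duplicate_keys` and the sample rows are made up for illustration and are not part of `helper_functions`.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return the composite keys that appear in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

sample = [
    {'player_id': 'p1', 'year': 2001, 'stint': 1},
    {'player_id': 'p1', 'year': 2001, 'stint': 2},
    {'player_id': 'p2', 'year': 2001, 'stint': 1},
]
no_dups = duplicate_keys(sample, ['player_id', 'year', 'stint'])
with_dups = duplicate_keys(sample + [sample[0]], ['player_id', 'year', 'stint'])
```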

In [52]:
sql = 'ALTER TABLE batting ADD PRIMARY KEY (player_id, year, stint)'
conn.execute(sql);

In [53]:
# raw string: '\d' is not a valid Python escape sequence
psql(r'\d batting')

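| Column | Type | Collation | Nullable | Default |
| --- | --- | --- | --- | --- |
| player_id | text | | not null | |
| year | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| b_g | smallint | | | |
| b_ab | smallint | | | |
| b_r | smallint | | | |
| b_h | smallint | | | |
| b_2b | smallint | | | |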


# Pitching

In [54]:
os.chdir(p_raw)

# yearID is read as an integer (see pitching.info() below)
pitching = pd.read_csv('Pitching.csv')

In [55]:
pitching.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46699 entries, 0 to 46698
Data columns (total 30 columns):
playerID    46699 non-null object
yearID      46699 non-null int64
stint       46699 non-null int64
teamID      46699 non-null object
lgID        46567 non-null object
W           46699 non-null int64
L           46699 non-null int64
G           46699 non-null int64
GS          46699 non-null int64
CG          46699 non-null int64
SHO         46699 non-null int64
SV          46699 non-null int64
IPouts      46699 non-null int64
H           46699 non-null int64
ER          46699 non-null int64
HR          46699 non-null int64
BB          46699 non-null int64
SO          46699 non-null int64
BAOpp       42259 non-null float64
ERA         46607 non-null float64
IBB         32121 non-null float64
WP          46699 non-null int64
HBP         45965 non-null float64
BK          46699 non-null int64
BFP         46696 non-null float64
GF          46699 non-null int64
R           46699 no

## Rename to Match Retrosheet
The following is from the RetrosheetBaseball Jupyter notebook.
```
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 'p_3b': 'triples allowed',
 'p_hr': 'home runs allowed',
 'p_bb': 'walks allowed',
 'p_ibb': 'intentional walks allowed',
 'p_so': 'strikeouts',
 'p_gdp': 'grounded into double play',
 'p_hp': 'hit batsmen',
 'p_sh': 'sacrifice hits against',
 'p_sf': 'sacrifice flies against',
 'p_xi': 'reached on interference',
 'p_wp': 'wild pitches',
 'p_bk': 'balks'
``` 

In [56]:
retro_names = {
    'playerID':'player_id',
    'yearID':'year',
    'teamID':'team_id',
    'lgID':'lg_id',
    'W':'p_w',
    'L':'p_l',
    'G':'p_g',
    'GS':'p_gs',
    'CG':'p_cg',
    'SHO':'p_sho',
    'SV':'p_sv',
    'IPouts':'p_out',
    'H':'p_h',
    'ER':'p_er',
    'HR':'p_hr',
    'BB':'p_bb',
    'SO':'p_so',
    'BAOpp':'p_ba_opp', # not in retrosheet player_game
    'ERA':'p_era', # not in retrosheet player_game
    'IBB':'p_ibb',
    'WP':'p_wp',
    'HBP':'p_hp',
    'BK':'p_bk',
    'BFP':'p_bfp', # not in retrosheet player_game
    'GF':'p_gf', # not in retrosheet player_game
    'R':'p_r',
    'SH':'p_sh',
    'SF':'p_sf',
    'GIDP':'p_gdp'
}

In [57]:
pitching.rename(columns=retro_names, inplace=True)

In [58]:
# Retrosheet only has data from 1921 onward
pitching = pitching.drop(pitching[pitching['year'] < 1921].index)

In [59]:
(pitching['year'].min(), pitching['year'].max())

(1921, 2018)

In [60]:
# are any of the pitchers missing a retro_id?
# .any() answers this; .all() would only test whether every pitcher is missing one
(pitching['player_id'].isin(missing)).any()

False

As per above, no pitchers are missing a retro_id.

In [61]:
# np.float was removed from NumPy; the string 'float' selects the same columns
pitching_float = pitching.select_dtypes(include=['float'])

In [62]:
pitching_float.apply(bb.is_int)

p_ba_opp    False
p_era       False
p_ibb        True
p_hp         True
p_bfp        True
p_sh         True
p_sf         True
p_gdp        True
dtype: bool

In [67]:
# after dropping records < 1921, some fields no longer have nulls
pitching_float.isna().sum()

p_ba_opp       11
p_era          69
p_ibb        7812
p_hp            0
p_bfp           3
p_sh        12421
p_sf        12421
p_gdp       13552
dtype: int64

In [68]:
# integer fields with no nulls
criteria = (pitching_float.apply(bb.is_int) & (pitching_float.isna().sum() == 0))
criteria

p_ba_opp    False
p_era       False
p_ibb       False
p_hp         True
p_bfp       False
p_sh        False
p_sf        False
p_gdp       False
dtype: bool

In [69]:
cols = criteria[criteria].index.to_list()
cols

['p_hp']

In [70]:
# convert these floats to integers
# np.int was removed from NumPy; use the string 'int' as in the batting section
pitching[cols] = pitching[cols].astype('int')

In [71]:
pitching = bb.optimize_df_dtypes(pitching)

In [72]:
pitching.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39933 entries, 6766 to 46698
Data columns (total 30 columns):
player_id    39933 non-null object
year         39933 non-null uint16
stint        39933 non-null uint8
team_id      39933 non-null object
lg_id        39933 non-null object
p_w          39933 non-null uint8
p_l          39933 non-null uint8
p_g          39933 non-null uint8
p_gs         39933 non-null uint8
p_cg         39933 non-null uint8
p_sho        39933 non-null uint8
p_sv         39933 non-null uint8
p_out        39933 non-null uint16
p_h          39933 non-null uint16
p_er         39933 non-null uint8
p_hr         39933 non-null uint8
p_bb         39933 non-null uint8
p_so         39933 non-null uint16
p_ba_opp     39922 non-null float64
p_era        39864 non-null float64
p_ibb        32121 non-null float64
p_wp         39933 non-null uint8
p_hp         39933 non-null uint8
p_bk         39933 non-null uint8
p_bfp        39930 non-null float64
p_gf         39933 non-

### Persist as CSV with Column Types

In [73]:
os.chdir(p_wrangled)
bb.to_csv_with_types(pitching, 'pitching.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists, therefore drop the table first.

In [74]:
dtype = bb.optimize_db_dtypes(pitching)
dtype

{'year': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_w': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_l': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_gs': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_cg': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_sho': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_sv': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_out': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_h': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_er': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_so': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_wp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_bk': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_gf': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_r': sqlalchemy.sql.sqltypes.SmallInteger}

In [75]:
conn.execute("DROP TABLE IF EXISTS pitching");

In [76]:
pitching.to_sql('pitching', conn, index=False, dtype=dtype)

In [77]:
# verify unique
bb.is_unique(pitching, ['player_id', 'year', 'stint'])

True

In [78]:
sql = 'ALTER TABLE pitching ADD PRIMARY KEY (player_id, year, stint)'
conn.execute(sql);

In [79]:
# raw string: '\d' is not a valid Python escape sequence
psql(r'\d pitching')

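| Column | Type | Collation | Nullable | Default |
| --- | --- | --- | --- | --- |
| player_id | text | | not null | |
| year | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| p_w | smallint | | | |
| p_l | smallint | | | |
| p_g | smallint | | | |
| p_gs | smallint | | | |
| p_cg | smallint | | | |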


# Fielding

In [118]:
os.chdir(p_raw)

fielding = pd.read_csv('Fielding.csv')

In [119]:
fielding.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140921 entries, 0 to 140920
Data columns (total 18 columns):
playerID    140921 non-null object
yearID      140921 non-null int64
stint       140921 non-null int64
teamID      140921 non-null object
lgID        139408 non-null object
POS         140921 non-null object
G           140921 non-null int64
GS          93544 non-null float64
InnOuts     110992 non-null float64
PO          140921 non-null int64
A           140921 non-null int64
E           140920 non-null float64
DP          140921 non-null int64
PB          11478 non-null float64
WP          1169 non-null float64
SB          8691 non-null float64
CS          8691 non-null float64
ZR          1169 non-null float64
dtypes: float64(8), int64(6), object(4)
memory usage: 19.4+ MB


In [120]:
fielding.columns = [bb.convert_camel_case(name) for name in fielding.columns]
fielding = fielding.rename(columns = {'year_id':'year'})
fielding.columns

Index(['player_id', 'year', 'stint', 'team_id', 'lg_id', 'pos', 'g', 'gs',
       'inn_outs', 'po', 'a', 'e', 'dp', 'pb', 'wp', 'sb', 'cs', 'zr'],
      dtype='object')

In [121]:
# Retrosheet only has data from 1921 onward
fielding = fielding.drop(fielding[fielding['year'] < 1921].index)

In [122]:
(fielding['year'].min(), fielding['year'].max())

(1921, 2018)

In [123]:
# are any of the players in fielding missing a retro_id?
# .any() answers this; .all() would only test whether every player is missing one
(fielding['player_id'].isin(missing)).any()

False

As per above, no players in fielding are missing a retro_id.

In [124]:
fielding['pos'].value_counts()

P     39933
OF    21705
3B    11350
1B    11035
2B    10135
SS     8918
C      8423
Name: pos, dtype: int64

#### Catcher Only Fields: pb, wp, sb, cs
See:  [Lahman Data Dictionary](http://www.seanlahman.com/files/database/readme2016.txt)

pb -- passed balls allowed by catcher  
wp -- wild pitches for catcher (wp stat for pitcher is in pitching dataframe as p_wp)  
sb -- stolen bases given up by this catcher  
cs -- base runners caught stealing by this catcher  

In [125]:
# check frequency of missing values since 1975 for catcher fields
tmp = fielding[fielding['year'] > 1975]
tmp = tmp[tmp['pos'] == 'C']
tmp = tmp[['pb', 'wp', 'sb', 'cs']]
tmp.isna().sum() / tmp.shape[0]

pb    0.0
wp    1.0
sb    0.0
cs    0.0
dtype: float64

In [126]:
# this wp shouldn't apply to pitcher, but check just to be sure
tmp = fielding[fielding['year'] > 1975]
tmp = tmp[tmp['pos'] == 'P']
tmp = tmp['wp']
tmp.isna().sum() / tmp.shape[0]

1.0

In [127]:
# 100% (rounded) are null for both pitchers and catchers, drop this field
fielding = fielding.drop('wp', axis=1)

In [128]:
# zr applies to all fielders
fielding['zr'].isna().sum() / fielding.shape[0]

0.9895156010367806

In [129]:
# 99% null, so drop this column
fielding = fielding.drop('zr', axis=1)

In [131]:
fielding = bb.optimize_df_dtypes(fielding)

In [132]:
fielding.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111499 entries, 29422 to 140920
Data columns (total 16 columns):
player_id    111499 non-null object
year         111499 non-null uint16
stint        111499 non-null uint8
team_id      111499 non-null object
lg_id        111499 non-null object
pos          111499 non-null object
g            111499 non-null uint8
gs           89431 non-null float64
inn_outs     89431 non-null float64
po           111499 non-null uint16
a            111499 non-null uint16
e            111498 non-null float64
dp           111499 non-null uint8
pb           8423 non-null float64
sb           6389 non-null float64
cs           6389 non-null float64
dtypes: float64(6), object(4), uint16(3), uint8(3)
memory usage: 10.3+ MB


### Persist as CSV with Column Types

In [133]:
os.chdir(p_wrangled)
bb.to_csv_with_types(fielding, 'fielding.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists, therefore drop the table first.

In [134]:
dtype = bb.optimize_db_dtypes(fielding)
dtype

{'year': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'po': sqlalchemy.sql.sqltypes.SmallInteger,
 'a': sqlalchemy.sql.sqltypes.SmallInteger,
 'dp': sqlalchemy.sql.sqltypes.SmallInteger}

In [135]:
conn.execute("DROP TABLE IF EXISTS fielding");

In [136]:
fielding.to_sql('fielding', conn, index=False, dtype=dtype)

In [137]:
bb.is_unique(fielding, ['player_id', 'year', 'stint', 'pos'])

True

In [138]:
sql = 'ALTER TABLE fielding ADD PRIMARY KEY (player_id, year, stint, pos)'
conn.execute(sql);

In [139]:
# raw string: '\d' is not a valid Python escape sequence
psql(r'\d fielding')

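| Column | Type | Collation | Nullable | Default |
| --- | --- | --- | --- | --- |
| player_id | text | | not null | |
| year | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| pos | text | | not null | |
| g | smallint | | | |
| gs | double precision | | | |
| inn_outs | double precision | | | |
| po | smallint | | | |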


### Note on Position

This is based on my MLB domain knowledge.

Players in recent years are increasingly playing more than one position in a single game, let alone in a single stint.

Note: a player who plays for 3 teams in 1 year would have 3 "stints".

Catchers and Pitchers rarely play a position other than catcher or pitcher (except in exceedingly long extra inning games).

Usually, but not always, infielders play one of the infield positions.

Usually, but not always, outfielders play one of the outfield positions.

So although every player is listed as having a specific position, this position is not fixed.  It is likely that the position represents the position most often played by that player.

The Lahman csv file "Appearances" lists how often each player played at a particular position for a given year.

# Teams

The team_id used by Lahman is not the same as the team_id as used by Retrosheet.  When comparing data between the two data sources, it will be necessary to map one team_id to the other.

In [140]:
os.chdir(p_raw)
teams = pd.read_csv('Teams.csv', usecols=['yearID', 'teamID', 'teamIDretro'])

In [141]:
teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2895 entries, 0 to 2894
Data columns (total 3 columns):
yearID         2895 non-null int64
teamID         2895 non-null object
teamIDretro    2895 non-null object
dtypes: int64(1), object(2)
memory usage: 67.9+ KB


In [142]:
teams = teams.rename(columns={'yearID':'year', 'teamID':'team_id', 'teamIDretro':'team_id_retro'})
teams.columns

Index(['year', 'team_id', 'team_id_retro'], dtype='object')

In [144]:
# Retrosheet only has data from 1921 onward
teams = teams.drop(teams[teams['year'] < 1921].index)

In [146]:
# Note: 97% of the time, the Lahman team_id = the Retrosheet team_id
(teams['team_id'] == teams['team_id_retro']).mean()

0.9747242647058824

In [147]:
teams = bb.optimize_df_dtypes(teams)
teams.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2176 entries, 719 to 2894
Data columns (total 3 columns):
year             2176 non-null uint16
team_id          2176 non-null object
team_id_retro    2176 non-null object
dtypes: object(2), uint16(1)
memory usage: 55.2+ KB


### Persist as CSV with Column Types

In [148]:
os.chdir(p_wrangled)
bb.to_csv_with_types(teams, 'teams.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists, therefore drop the table first.

In [149]:
dtypes = bb.optimize_db_dtypes(teams)
dtypes

{'year': sqlalchemy.sql.sqltypes.SmallInteger}

In [150]:
conn.execute("DROP TABLE IF EXISTS teams");

In [151]:
teams.to_sql('teams', conn, index=False, dtype=dtypes)

In [152]:
bb.is_unique(teams, ['year', 'team_id'])

True

In [153]:
bb.is_unique(teams, ['year', 'team_id_retro'])

True
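
Because both (year, team_id) and (year, team_id_retro) are unique, translating a Lahman team_id to its Retrosheet counterpart can be sketched as a dictionary lookup keyed on year and team. The function names and the team codes below are made up for illustration and are not real Lahman or Retrosheet values.

```python
def build_team_lookup(team_rows):
    """Map (year, Lahman team_id) to the Retrosheet team_id for that season."""
    return {(row['year'], row['team_id']): row['team_id_retro'] for row in team_rows}

def to_retro_team_id(lookup, year, team_id):
    """Return the Retrosheet team_id, or None if the pair is unknown."""
    return lookup.get((year, team_id))

# placeholder rows: one team with matching IDs, one where the two sources differ
team_rows = [
    {'year': 2000, 'team_id': 'AAA', 'team_id_retro': 'AAA'},
    {'year': 2000, 'team_id': 'BBB', 'team_id_retro': 'BBX'},
]
lookup = build_team_lookup(team_rows)
translated = to_retro_team_id(lookup, 2000, 'BBB')
```

With the dataframes in this notebook, a merge on year and team_id would achieve the same translation.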

In [154]:
sql = 'ALTER TABLE teams ADD PRIMARY KEY (year, team_id)'
conn.execute(sql);

sql = 'ALTER TABLE teams ADD CONSTRAINT team_retro_unique UNIQUE (year, team_id_retro)'
conn.execute(sql)

sql = 'ALTER TABLE teams ALTER COLUMN team_id_retro SET NOT NULL'
conn.execute(sql);

In [155]:
# raw string: '\d' is not a valid Python escape sequence
psql(r'\d teams')

| Column | Type | Collation | Nullable | Default |
| --- | --- | --- | --- | --- |
| year | smallint | | not null | |
| team_id | text | | not null | |
| team_id_retro | text | | not null | |
