# Lahman Baseball Data -- DRAFT Notebook

There are two sources for the Lahman data.

**Sean Lahman**  
http://www.seanlahman.com/baseball-archive/statistics  
This site appears to host a snapshot of the data taken the day before last season's opening day.

**Baseball Databank**  
https://github.com/chadwickbureau/baseballdatabank  
This repository has the latest data. At the time of writing, it includes the 2018 season, whereas the Sean Lahman site does not.

To include the 2018 data, this notebook uses the Baseball Databank.

## Schema

The data will be saved to a Postgres database.

**Data Dictionary**  
http://www.seanlahman.com/files/database/readme2016.txt  
The schema changes over time, so a fixed, hand-written schema for this dataset would have to be updated with each release.

Rather than fixing the schema by creating the Postgres tables first, each CSV file is read into pandas, analyzed to find its unique key, and then written directly to Postgres using df.to_sql().

Pandas uses SQLAlchemy behind the scenes, and it chooses good (but not necessarily optimal) datatypes for the database tables. These datatypes are sufficient for this analysis.
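As a small illustration of letting pandas pick the column types, the sketch below writes a tiny hand-made DataFrame to an in-memory SQLite database and reads back the types that were created. SQLite stands in for Postgres here only so the example runs without a server. The sample rows and the table name `demo_people` are made up, and the type names Postgres receives through SQLAlchemy will differ.

```python
import sqlite3

import pandas as pd

# made-up sample rows, not taken from the Lahman files
df = pd.DataFrame({
    'player_id': ['aaa01', 'bbb02'],
    'weight': [180, 215],
    'height': [72.0, 75.5],
})

# assumption: SQLite is used for portability; the notebook itself targets Postgres
con = sqlite3.connect(':memory:')
df.to_sql('demo_people', con, if_exists='replace', index=False)

# PRAGMA table_info yields one row per column: (cid, name, type, notnull, default, pk)
inferred = {row[1]: row[2] for row in con.execute('PRAGMA table_info(demo_people)')}
print(inferred)
con.close()
```

No table was declared up front; the column types come entirely from the DataFrame dtypes.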

### Create Directories
* ~/data/lahman/raw  
* ~/data/lahman/processed  

In [1]:
import pandas as pd
import numpy as np

In [2]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [3]:
# create path objects
home = Path.home()
lahman_dir = home.joinpath('data/lahman')
p_raw = lahman_dir.joinpath('raw')
p_processed = lahman_dir.joinpath('processed')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_processed.mkdir(parents=True, exist_ok=True)

In [4]:
# change to raw file directory
os.chdir(p_raw)

# download zip file from github
url = 'https://github.com/chadwickbureau/baseballdatabank/archive/master.zip'
wget.download(url)

# unzip it
with zipfile.ZipFile('baseballdatabank-master.zip', "r") as zip_ref:
    zip_ref.extractall()

In [5]:
import shutil

unzip_dir = p_raw.joinpath('baseballdatabank-master/core')

# move the unzipped csv files to the current working directory
os.chdir(p_raw)
for root, dirs, files in os.walk(unzip_dir):
    for file in files:
        shutil.move(os.path.join(root, file), '.')
        
# rm the extract directory
shutil.rmtree('baseballdatabank-master')

# rm the zip file
os.remove('baseballdatabank-master.zip')
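The extract, move, and clean-up steps above could also be done in one pass by copying only the members under `core/` straight out of the archive. This is a sketch, not the notebook's method: the helper name `extract_core_files` is hypothetical, the prefix assumes the archive layout used above, and the demo runs against a throwaway zip built on the spot rather than the real download.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_core_files(zip_path, dest_dir, prefix='baseballdatabank-master/core/'):
    """Copy every file stored under `prefix` into dest_dir, dropping the folder part."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # skip directory entries and anything outside the core folder
            if info.is_dir() or not info.filename.startswith(prefix):
                continue
            target = dest_dir / Path(info.filename).name
            with zf.open(info) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            written.append(target.name)
    return sorted(written)


# demo on a throwaway archive that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo_zip = tmp / 'demo.zip'
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('baseballdatabank-master/core/People.csv', 'playerID\n')
        zf.writestr('baseballdatabank-master/core/Batting.csv', 'playerID\n')
        zf.writestr('baseballdatabank-master/README.txt', 'not a core file\n')
    extracted = extract_core_files(demo_zip, tmp / 'raw')

print(extracted)
```

Because nothing outside `core/` is ever written to disk, there is no extract directory to remove afterwards.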

In [6]:
# verify the current directory (p_raw) has the csv files
sorted(os.listdir())

['AllstarFull.csv',
 'Appearances.csv',
 'AwardsManagers.csv',
 'AwardsPlayers.csv',
 'AwardsShareManagers.csv',
 'AwardsSharePlayers.csv',
 'Batting.csv',
 'BattingPost.csv',
 'CollegePlaying.csv',
 'Fielding.csv',
 'FieldingOF.csv',
 'FieldingOFsplit.csv',
 'FieldingPost.csv',
 'HallOfFame.csv',
 'HomeGames.csv',
 'Managers.csv',
 'ManagersHalf.csv',
 'Parks.csv',
 'People.csv',
 'Pitching.csv',
 'PitchingPost.csv',
 'Salaries.csv',
 'Schools.csv',
 'SeriesPost.csv',
 'Teams.csv',
 'TeamsFranchises.csv',
 'TeamsHalf.csv',
 'readme2014.txt']

## Helpful Methods for Working with DB

In [7]:
from sqlalchemy.engine import create_engine
from IPython.display import HTML, display

%reload_ext sql

### Connect to DB

In [8]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/lahman'

# create the engine; a connection is only opened when it is first used
conn = create_engine(connect_str)
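One thing to watch with this pattern: `os.environ.get` returns `None` for an unset variable, so a missing `DB_USER` silently produces a URL containing the text `None`. A password containing characters such as `@` or `/` also needs URL-encoding before it goes into the connection string. The sketch below handles both cases; the helper name `build_connect_str` and the sample credentials are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_connect_str(env=os.environ, host='localhost', port=5432, db='lahman'):
    """Return a postgresql:// URL, failing early if a credential is missing."""
    missing = [key for key in ('DB_USER', 'DB_PASS') if not env.get(key)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    # percent-encode so characters like '@' and '/' cannot break the URL
    user = quote_plus(env['DB_USER'])
    password = quote_plus(env['DB_PASS'])
    return f'postgresql://{user}:{password}@{host}:{port}/{db}'


# made-up credentials passed as a plain dict instead of the real environment
example = build_connect_str({'DB_USER': 'analyst', 'DB_PASS': 'p@ss/word'})
print(example)
```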

### **psql**

Note: a ~/.pgpass file similar to the following is needed for psql to connect without prompting for a password:  
```localhost:5432:*:<user>:<passwd>```
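For reference, the fields of a `.pgpass` line are host, port, database, user, and password. A colon or backslash inside a field must be escaped with a backslash, and on Unix the file is ignored unless group and world access are disabled (mode 0600). The sketch below formats one entry and writes it with those permissions. The helper names are hypothetical, the credentials are made up, and the demo writes to a temporary file rather than the real `~/.pgpass`.

```python
import stat
import tempfile
from pathlib import Path


def pgpass_line(host, port, database, user, password):
    """Format one .pgpass entry, escaping backslashes and colons inside fields."""
    def esc(value):
        return str(value).replace('\\', '\\\\').replace(':', '\\:')
    return ':'.join(esc(v) for v in (host, port, database, user, password))


def write_pgpass(path, line):
    """Write the entry and restrict the file to its owner (mode 0600)."""
    path = Path(path)
    path.write_text(line + '\n')
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)


# demo against a temporary file; the password is made up
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'pgpass_demo'
    entry = pgpass_line('localhost', 5432, '*', 'postgres', 's3cret:pw')
    write_pgpass(demo_path, entry)
    mode = stat.S_IMODE(demo_path.stat().st_mode)

print(entry)
print(oct(mode))
```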

In [9]:
# -H for html output
# this connects, executes, and disconnects
# note: `schema` is passed to psql as the database name
def psql(cmd, user='postgres', schema='lahman'):
    psql_out = !psql -H -U {user} {schema} -c "{cmd}"
    display(HTML(''.join(psql_out)))

### CamelCase to snake_case

Postgres is easier to use without caps in the column names.

Also, column names should not start with a digit.

https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

In [10]:
# CamelCase to snake_case
def convert_camel_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [11]:
# 3B to h3b, 2B to h2b
def prepend_h(name):
    match = re.search(r'(^\d\w)', name)
    if match:
        return 'h'+name.lower()
    else:
        return name

In [12]:
convert_camel_case(prepend_h('3B'))

'h3b'

In [13]:
convert_camel_case(prepend_h('playerID'))

'player_id'
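To see why two regex passes are needed, the sketch below returns the string after each pass for a few column names from these files. It uses the same two patterns as `convert_camel_case`; the helper name `snake_case_steps` is hypothetical and exists only for this trace.

```python
import re


def snake_case_steps(name):
    """Return the string after pass 1, after pass 2, and after lowercasing."""
    # pass 1: split before a capital that starts a lowercase run ("Year" in birthYear)
    step1 = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', name)
    # pass 2: split between a lowercase letter or digit and a following capital
    step2 = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', step1)
    return step1, step2, step2.lower()


trace = {name: snake_case_steps(name)
         for name in ['birthYear', 'playerID', 'IPouts', 'BAOpp']}
for name, steps in trace.items():
    print(name, steps)
```

`playerID` is untouched by the first pass (no capital is followed by a lowercase letter) and is split only by the second. Runs of capitals are split before their last letter when lowercase letters follow, which is why `IPouts` ends up as `i_pouts` rather than `ip_outs`.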

### Is Unique over Multiple Columns

In [14]:
def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()
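A quick check of the helper on made-up rows shows how a candidate key can fail on two columns and succeed once a third is added. The data below is invented for illustration; it mimics a player with two stints in one season.

```python
import pandas as pd


def is_unique(df, cols):
    # True when no two rows share the same values in `cols`
    return not df.duplicated(subset=cols).any()


# invented rows: player aaa01 has two stints in 2018
demo = pd.DataFrame({
    'player_id': ['aaa01', 'aaa01', 'bbb02'],
    'year_id': [2018, 2018, 2018],
    'stint': [1, 2, 1],
})

two_col = is_unique(demo, ['player_id', 'year_id'])
three_col = is_unique(demo, ['player_id', 'year_id', 'stint'])
print(two_col, three_col)
```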

# Main Files
As per:  
http://www.seanlahman.com/files/database/readme2016.txt

After readme2016.txt was written, the Master file was renamed to People.

The 4 main files are:
*  People   - Player names, DOB, and biographical info
*  Batting  - batting statistics
*  Pitching - pitching statistics
*  Fielding - fielding statistics

# People

In [15]:
people = pd.read_csv('People.csv', parse_dates=['debut', 'finalGame'])

In [16]:
people.columns

Index(['playerID', 'birthYear', 'birthMonth', 'birthDay', 'birthCountry',
       'birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay',
       'deathCountry', 'deathState', 'deathCity', 'nameFirst', 'nameLast',
       'nameGiven', 'weight', 'height', 'bats', 'throws', 'debut', 'finalGame',
       'retroID', 'bbrefID'],
      dtype='object')

In [17]:
people.columns = [convert_camel_case(name) for name in people.columns]
people.columns

Index(['player_id', 'birth_year', 'birth_month', 'birth_day', 'birth_country',
       'birth_state', 'birth_city', 'death_year', 'death_month', 'death_day',
       'death_country', 'death_state', 'death_city', 'name_first', 'name_last',
       'name_given', 'weight', 'height', 'bats', 'throws', 'debut',
       'final_game', 'retro_id', 'bbref_id'],
      dtype='object')

In [18]:
# custom parsing of dates
def lahman_to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    
    # NaT if year is missing
    if pd.isna(y):
        return pd.NaT
    
    # fillna if year present but month missing
    if pd.isna(m):
        m = 1
        
    # fillna if year present but day missing
    if pd.isna(d):
        d = 1
        
    return pd.Timestamp(year=int(y), month=int(m), day=int(d))

In [19]:
people['birth_date'] = people.apply(lambda x: lahman_to_date(x, 'birth'), axis=1)
people['death_date'] = people.apply(lambda x: lahman_to_date(x, 'death'), axis=1)

In [20]:
people = people.drop(
    ['birth_year', 'birth_month', 'birth_day', 
     'death_year', 'death_month', 'death_day'], axis=1)

In [21]:
# will be used with Retrosheet data, so retroID cannot be null
# retroID appears to be null only for some players who played long ago
people = people.dropna(subset=['retro_id'], axis=0)

In [23]:
# verify uniqueness
print(people['player_id'].is_unique)
print(people['retro_id'].is_unique)

True
True


#### Note: Datetime Null

In [24]:
a = pd.NaT
b = np.nan
print(pd.isna(a))
print(pd.isna(b))

True
True


In [25]:
# replace the table if it exists
people.to_sql('people', conn, if_exists='replace', index=False)

In [26]:
# add primary key, unique and not null constraints

sql = 'ALTER TABLE lahman.public.people ADD PRIMARY KEY (player_id)'
conn.execute(sql)

sql = 'ALTER TABLE lahman.public.people ADD CONSTRAINT retro_unique UNIQUE (retro_id)'
conn.execute(sql)

sql = 'ALTER TABLE lahman.public.people ALTER COLUMN retro_id SET NOT NULL'
conn.execute(sql);

In [27]:
# describe the table
psql(r'\d people')

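| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | not null | |
| birth_country | text | | | |
| birth_state | text | | | |
| birth_city | text | | | |
| death_country | text | | | |
| death_state | text | | | |
| death_city | text | | | |
| name_first | text | | | |
| name_last | text | | | |
| name_given | text | | | |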


# Batting

In [28]:
batting = pd.read_csv('Batting.csv')

In [29]:
batting.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP'],
      dtype='object')

In [30]:
batting.columns = [convert_camel_case(prepend_h(name)) for name in batting.columns]
batting.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'g', 'ab', 'r',
       'h', 'h2b', 'h3b', 'hr', 'rbi', 'sb', 'cs', 'bb', 'so', 'ibb', 'hbp',
       'sh', 'sf', 'gidp'],
      dtype='object')

In [31]:
# certain stats are null only for old games
# this study will be from 1955 onward
batting = batting.drop(batting[batting['year_id'] < 1955].index)

In [32]:
batting['year_id'].min()

1955

In [33]:
# these are integers, but had NA, so were converted to float
batting_float = batting.select_dtypes(include=['float']).copy()
batting_float.columns

Index(['rbi', 'sb', 'cs', 'so', 'ibb', 'hbp', 'sh', 'sf', 'gidp'], dtype='object')

In [34]:
# after removing years < 1955, there are no longer any null values
batting_float.isna().sum()

rbi     0
sb      0
cs      0
so      0
ibb     0
hbp     0
sh      0
sf      0
gidp    0
dtype: int64

In [35]:
batting_numeric = batting.select_dtypes(include=[np.number])

In [36]:
# pandas will downcast as far as the data allows
batting_numeric = batting_numeric.apply(pd.to_numeric, downcast='unsigned')
batting_numeric.dtypes.value_counts()

uint8     16
uint16     3
dtype: int64
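The effect of `downcast='unsigned'` is easiest to see on a tiny frame. In the sketch below, two made-up float columns that hold only whole, non-negative numbers are reduced to the smallest unsigned integer type that fits each one.

```python
import pandas as pd

# made-up columns: whole numbers stored as float, as happens when a CSV column held NA
demo = pd.DataFrame({
    'hr': [0.0, 12.0, 73.0],      # every value fits in 8 bits
    'ab': [10.0, 300.0, 716.0],   # 300 and 716 need more than 8 bits
})

downcast = demo.apply(pd.to_numeric, downcast='unsigned')
print(downcast.dtypes)
```

Each column is sized independently, which is why the batting data ends up with a mix of `uint8` and `uint16`.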

In [37]:
batting[batting_numeric.columns] = batting_numeric

In [38]:
batting.dtypes.value_counts()

uint8     16
uint16     3
object     3
dtype: int64

In [39]:
batting_obj = batting.select_dtypes(include='object')
batting_obj.columns

Index(['player_id', 'team_id', 'lg_id'], dtype='object')

In [40]:
batting_obj.nunique()

player_id    11165
team_id         42
lg_id            2
dtype: int64

In [41]:
batting[['team_id', 'lg_id']] = batting_obj[['team_id', 'lg_id']].astype('category')

In [42]:
batting.dtypes.value_counts()

uint8       16
uint16       3
category     1
category     1
object       1
dtype: int64

In [43]:
from sqlalchemy.types import SmallInteger

In [44]:
# SmallInteger is not deduced from dataframe column type
dtype = {c:SmallInteger for c in batting.select_dtypes(include=np.integer).columns}

In [45]:
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'ab': sqlalchemy.sql.sqltypes.SmallInteger,
 'r': sqlalchemy.sql.sqltypes.SmallInteger,
 'h': sqlalchemy.sql.sqltypes.SmallInteger,
 'h2b': sqlalchemy.sql.sqltypes.SmallInteger,
 'h3b': sqlalchemy.sql.sqltypes.SmallInteger,
 'hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'rbi': sqlalchemy.sql.sqltypes.SmallInteger,
 'sb': sqlalchemy.sql.sqltypes.SmallInteger,
 'cs': sqlalchemy.sql.sqltypes.SmallInteger,
 'bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'so': sqlalchemy.sql.sqltypes.SmallInteger,
 'ibb': sqlalchemy.sql.sqltypes.SmallInteger,
 'hbp': sqlalchemy.sql.sqltypes.SmallInteger,
 'sh': sqlalchemy.sql.sqltypes.SmallInteger,
 'sf': sqlalchemy.sql.sqltypes.SmallInteger,
 'gidp': sqlalchemy.sql.sqltypes.SmallInteger}
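One caveat with mapping every integer column to `SmallInteger`: Postgres `smallint` is a signed 2-byte type with a maximum of 32767, while a `uint16` column can hold values up to 65535. A range check before writing guards against that mismatch. The sketch below uses a hypothetical helper, `columns_outside_smallint`, on made-up data that includes a deliberately oversized column.

```python
import numpy as np
import pandas as pd

SMALLINT_MIN, SMALLINT_MAX = -32768, 32767  # range of a signed 2-byte integer


def columns_outside_smallint(df):
    """Return the integer columns holding a value that does not fit in smallint."""
    ints = df.select_dtypes(include=np.integer)
    bad = []
    for col in ints.columns:
        if ints[col].empty:
            continue
        # convert to Python ints so the comparison cannot overflow
        lo, hi = int(ints[col].min()), int(ints[col].max())
        if lo < SMALLINT_MIN or hi > SMALLINT_MAX:
            bad.append(col)
    return bad


# made-up frame: 'big' is valid uint16 but too large for smallint
demo = pd.DataFrame({
    'year_id': pd.Series([1955, 2018], dtype='uint16'),
    'big': pd.Series([40000, 65535], dtype='uint16'),
    'g': pd.Series([1, 162], dtype='uint8'),
})

outside = columns_outside_smallint(demo)
print(outside)
```

An empty result means every integer column can safely be declared as `smallint`.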

In [48]:
batting.to_sql('batting', conn, if_exists='replace', index=False, dtype=dtype)

In [49]:
psql(r'\d batting')

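| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | | |
| year_id | smallint | | | |
| stint | smallint | | | |
| team_id | text | | | |
| lg_id | text | | | |
| g | smallint | | | |
| ab | smallint | | | |
| r | smallint | | | |
| h | smallint | | | |
| h2b | smallint | | | |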


In [50]:
# verify unique
is_unique(batting, ['player_id', 'year_id', 'stint'])

True

In [51]:
sql = 'ALTER TABLE lahman.public.batting ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);
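Since the same `ALTER TABLE ... ADD PRIMARY KEY` pattern recurs for each table, the statement could be generated from the list of key columns already passed to `is_unique`. The sketch below is one way to do that. The helper name `add_primary_key_sql` is hypothetical, it omits the database name used in the statements above, and it accepts only lowercase snake_case identifiers so that nothing unexpected is interpolated into the SQL.

```python
import re

# lowercase snake_case identifiers only, matching the renamed columns
_IDENT = re.compile(r'[a-z_][a-z0-9_]*')


def add_primary_key_sql(table, key_cols, schema='public'):
    """Build an ALTER TABLE ... ADD PRIMARY KEY statement from plain identifiers."""
    for ident in (schema, table, *key_cols):
        if not _IDENT.fullmatch(ident):
            raise ValueError(f'unexpected identifier: {ident!r}')
    cols = ', '.join(key_cols)
    return f'ALTER TABLE {schema}.{table} ADD PRIMARY KEY ({cols})'


statement = add_primary_key_sql('batting', ['player_id', 'year_id', 'stint'])
print(statement)
```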

In [52]:
psql(r'\d batting')

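| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | not null | |
| year_id | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| g | smallint | | | |
| ab | smallint | | | |
| r | smallint | | | |
| h | smallint | | | |
| h2b | smallint | | | |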


# Pitching

In [53]:
pitching = pd.read_csv('Pitching.csv')

In [54]:
pitching.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'W', 'L', 'G', 'GS',
       'CG', 'SHO', 'SV', 'IPouts', 'H', 'ER', 'HR', 'BB', 'SO', 'BAOpp',
       'ERA', 'IBB', 'WP', 'HBP', 'BK', 'BFP', 'GF', 'R', 'SH', 'SF', 'GIDP'],
      dtype='object')

In [55]:
pitching.columns = [convert_camel_case(name) for name in pitching.columns]
pitching.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'w', 'l', 'g',
       'gs', 'cg', 'sho', 'sv', 'i_pouts', 'h', 'er', 'hr', 'bb', 'so',
       'ba_opp', 'era', 'ibb', 'wp', 'hbp', 'bk', 'bfp', 'gf', 'r', 'sh', 'sf',
       'gidp'],
      dtype='object')