# Lahman: Wrangle and Persist to CSV and Postgres

**Baseball Notebooks**  
1. Downloaded and unzipped the Lahman and Retrosheet data.
2. Described helper functions used by several notebooks.
3. Baseball Data Organization and Data Dictionary
4. This notebook.

This notebook wrangles the Lahman data and persists it to CSV files and Postgres tables.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

## Lahman Data Dictionary
http://www.seanlahman.com/files/database/readme2016.txt  

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## Path Objects for Lahman Baseball Data

In [1]:
import pandas as pd
import numpy as np

import os
import re
import wget
from pathlib import Path
import zipfile

from IPython.display import HTML, display
from sqlalchemy import create_engine
from sqlalchemy.types import SmallInteger, Integer, BigInteger

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_raw = lahman.joinpath('raw')
p_wrangled = lahman.joinpath('wrangled')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)
os.chdir(p_raw)

# Database

Using Postgres, or any database, is optional for the baseball data analysis.  However, because data in a business environment often comes from databases, this notebook also shows how to use one.

This section is preparation for interacting with Postgres.

Prerequisites
1. PostgreSQL server is installed, configured and running.
2. baseball database has been created.

### Connect to DB

In [4]:
# Read the user and password from the environment rather than hardcoding them
# (os is already imported above)
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# connect
conn = create_engine(connect_str)

type(conn)

sqlalchemy.engine.base.Engine

In [5]:
type(conn.connect())

sqlalchemy.engine.base.Connection

### SQL Magic

SQL Magic is not used here because it does not release its connection until the notebook is closed.  An open connection can hold a lock on a table, which can block conn (above) when it is used in df.to_sql() and pd.read_sql().

A connection from SQLAlchemy can be used almost identically to the [Python DB API](https://www.python.org/dev/peps/pep-0249/).

When the connection object is a SQLAlchemy Engine and it is used to execute SQL, a connection is allocated, used, its changes are committed, and the connection is released.

When the connection object is a SQLAlchemy Connection (not used here), transaction processing can be performed.
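The commit and rollback behavior mentioned above can be sketched with the standard-library `sqlite3` module, which follows the same DB API (PEP 249). This is only a stand-in for Postgres: the in-memory database, the two-column table, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption)
db = sqlite3.connect(":memory:")
cur = db.cursor()

cur.execute("CREATE TABLE people (player_id TEXT PRIMARY KEY, retro_id TEXT)")
db.commit()

# This insert is committed, so it persists
cur.execute("INSERT INTO people VALUES ('aaronha01', 'aaroh101')")
db.commit()

# This insert is rolled back, so it is discarded
cur.execute("INSERT INTO people VALUES ('ruthba01', 'ruthb101')")
db.rollback()

cur.execute("SELECT COUNT(*) FROM people")
row_count = cur.fetchone()[0]
print(row_count)  # 1
```

Only the committed row remains, which is the kind of control a Connection-level transaction provides.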

## psql

Use the following to run psql commands from a Jupyter Code cell.

This will connect, execute, and disconnect from the database.

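For this to work without a password prompt, a .pgpass file is necessary.  
See: https://www.postgresql.org/docs/11/libpq-pgpass.html  

Each line of the .pgpass file has the form hostname:port:database:username:password, for example:  
```localhost:5432:*:<user>:<passwd>```

On Unix systems, the file must not be readable by group or others (for example, chmod 0600 ~/.pgpass); otherwise it is ignored.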

In [6]:
def psql(cmd, user='postgres', database='baseball'):
    # -H asks psql for HTML output, which is then rendered as a table
    psql_out = !psql -H -U {user} {database} -c "{cmd}"
    display(HTML(''.join(psql_out)))

In [7]:
!psql --version

psql (PostgreSQL) 11.2 (Ubuntu 11.2-1.pgdg18.04+1)


# Main Lahman Baseball Files
As per:  
http://www.seanlahman.com/files/database/readme2016.txt

After readme2016.txt was written, master was renamed to People.

The 4 main files are:
*  People   - Player names, DOB, and biographical info
*  Batting  - batting statistics
*  Pitching - pitching statistics
*  Fielding - fielding statistics

# People

In [8]:
os.chdir(p_raw)
people = pd.read_csv('People.csv', parse_dates=['debut', 'finalGame'])

In [9]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19617 entries, 0 to 19616
Data columns (total 24 columns):
playerID        19617 non-null object
birthYear       19497 non-null float64
birthMonth      19332 non-null float64
birthDay        19188 non-null float64
birthCountry    19553 non-null object
birthState      19090 non-null object
birthCity       19441 non-null object
deathYear       9649 non-null float64
deathMonth      9648 non-null float64
deathDay        9647 non-null float64
deathCountry    9646 non-null object
deathState      9598 non-null object
deathCity       9641 non-null object
nameFirst       19580 non-null object
nameLast        19617 non-null object
nameGiven       19580 non-null object
weight          18792 non-null float64
height          18875 non-null float64
bats            18433 non-null object
throws          18638 non-null object
debut           19420 non-null datetime64[ns]
finalGame       19420 non-null datetime64[ns]
retroID         19561 non-null object

In [10]:
people.columns = [bb.convert_camel_case(name) for name in people.columns]
people.columns

Index(['player_id', 'birth_year', 'birth_month', 'birth_day', 'birth_country',
       'birth_state', 'birth_city', 'death_year', 'death_month', 'death_day',
       'death_country', 'death_state', 'death_city', 'name_first', 'name_last',
       'name_given', 'weight', 'height', 'bats', 'throws', 'debut',
       'final_game', 'retro_id', 'bbref_id'],
      dtype='object')
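The helper `bb.convert_camel_case` is defined in another notebook and is not shown here. As a rough sketch of how such a conversion can work, the following uses a common two-pass regex recipe; the function name `camel_to_snake` is hypothetical and the actual helper may be implemented differently.

```python
import re

def camel_to_snake(name):
    """Hypothetical stand-in for bb.convert_camel_case (not the author's code)."""
    # Pass 1: put "_" before a capitalized word, e.g. "birthYear" -> "birth_Year"
    partial = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', name)
    # Pass 2: put "_" between a lowercase letter or digit and an uppercase letter,
    # e.g. "playerID" -> "player_ID", then lowercase everything
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', partial).lower()

converted = [camel_to_snake(n) for n in ['playerID', 'birthYear', 'finalGame', 'retroID']]
print(converted)  # ['player_id', 'birth_year', 'final_game', 'retro_id']
```

A recipe like this treats an acronym followed by lowercase letters as a word boundary, so `IPouts` becomes `i_pouts`. Names of that shape may need to be renamed by hand.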

In [11]:
# custom parsing of birth/death dates
def to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    
    # NaT if year is missing
    if pd.isna(y):
        return pd.NaT
    
    # fillna if year present but month missing
    if pd.isna(m):
        m = 1
        
    # fillna if year present but day missing
    if pd.isna(d):
        d = 1
        
    # pd.datetime was removed from pandas; pd.Timestamp is the supported equivalent
    return pd.Timestamp(year=int(y), month=int(m), day=int(d))

In [12]:
people['birth_date'] = people.apply(lambda x: to_date(x, 'birth'), axis=1)
people['death_date'] = people.apply(lambda x: to_date(x, 'death'), axis=1)

In [13]:
people = people.drop(
    ['birth_year', 'birth_month', 'birth_day', 
     'death_year', 'death_month', 'death_day'], axis=1)

In [14]:
# retro_id is required to work with Retrosheet Data
# get list of players without a Retrosheet player_id
missing = people.loc[people['retro_id'].isna(), 'player_id']
missing.head()

1127     bellco99
2123    brownra99
2238    bulkemo99
2769    cartwal99
2923    chadwhe99
Name: player_id, dtype: object

In [15]:
# drop players without a retro_id
people = people.dropna(subset=['retro_id'], axis=0)

In [16]:
# verify num unique is num records for both fields
# this implies the mapping of player_id to retro_id is 1 to 1 and onto
print(people['player_id'].nunique() == people.shape[0])
print(people['retro_id'].nunique() == people.shape[0])

True
True
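The reasoning behind this check can be shown in plain Python: if neither column repeats a value across the rows, each player_id pairs with exactly one retro_id and vice versa. The function name and the sample pairs below are illustrative assumptions, not data taken from the notebook.

```python
def is_one_to_one(pairs):
    """True if neither the left nor the right values repeat across the pairs."""
    left = {a for a, _ in pairs}
    right = {b for _, b in pairs}
    return len(left) == len(pairs) and len(right) == len(pairs)

# Illustrative (player_id, retro_id) pairs
good = [('aaronha01', 'aaroh101'), ('ruthba01', 'ruthb101'), ('mayswi01', 'maysw101')]
# Two players sharing one retro_id breaks the one-to-one property
bad = [('aaronha01', 'aaroh101'), ('ruthba01', 'aaroh101')]

print(is_one_to_one(good))  # True
print(is_one_to_one(bad))   # False
```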


### Persist as CSV with Column Types
Use the helper function described in a previous notebook (Baseball Notebook #2) to save the data types to a separate CSV file.

In [17]:
os.chdir(p_wrangled)
bb.to_csv_with_types(people, 'people.csv')

In [18]:
# verify that data type information was not lost
df2 = bb.from_csv_with_types('people.csv')
(df2.dtypes == people.dtypes).all()

True

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists.  Therefore, drop the table first.

In [19]:
conn.execute("DROP TABLE IF EXISTS people");

In [20]:
# create Postgres people table
people.to_sql('people', conn, index=False)

In [21]:
# check that it worked by selecting number of people records
rs = conn.execute("SELECT COUNT(*) from people")
rs.fetchall()

[(19561,)]

In [22]:
# add primary key, unique and not null constraints
sql = 'ALTER TABLE people ADD PRIMARY KEY (player_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ADD CONSTRAINT retro_unique UNIQUE (retro_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ALTER COLUMN retro_id SET NOT NULL'
conn.execute(sql);

In [23]:
# describe the table
# raw string so that the backslash in \d is passed to psql unchanged
psql(r'\d people')

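| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | not null | |
| birth_country | text | | | |
| birth_state | text | | | |
| birth_city | text | | | |
| death_country | text | | | |
| death_state | text | | | |
| death_city | text | | | |
| name_first | text | | | |
| name_last | text | | | |
| name_given | text | | | |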


# Batting

In [24]:
os.chdir(p_raw)

# yearID is inferred as int64 by read_csv (see info() below)
batting = pd.read_csv('Batting.csv')

In [25]:
batting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105861 entries, 0 to 105860
Data columns (total 22 columns):
playerID    105861 non-null object
yearID      105861 non-null int64
stint       105861 non-null int64
teamID      105861 non-null object
lgID        105123 non-null object
G           105861 non-null int64
AB          105861 non-null int64
R           105861 non-null int64
H           105861 non-null int64
2B          105861 non-null int64
3B          105861 non-null int64
HR          105861 non-null int64
RBI         105105 non-null float64
SB          103493 non-null float64
CS          82320 non-null float64
BB          105861 non-null int64
SO          103761 non-null float64
IBB         69210 non-null float64
HBP         103044 non-null float64
SH          99792 non-null float64
SF          69757 non-null float64
GIDP        80420 non-null float64
dtypes: float64(9), int64(10), object(3)
memory usage: 17.8+ MB


## Rename

In [26]:
retro_names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'G':'g',
    'AB':'ab',
    'R':'r',
    'H':'h',
    '2B':'b_2b',
    '3B':'b_3b',
    'HR':'hr',
    'RBI':'rbi',
    'SB':'sb',
    'CS':'cs',
    'BB':'bb',
    'SO':'so',
    'IBB':'ibb',
    'HBP':'hp',
    'SH':'sh',
    'SF':'sf',
    'GIDP':'gdp'
}

In [27]:
batting.rename(columns=retro_names, inplace=True)
batting.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'g', 'ab', 'r',
       'h', 'b_2b', 'b_3b', 'hr', 'rbi', 'sb', 'cs', 'bb', 'so', 'ibb', 'hp',
       'sh', 'sf', 'gdp'],
      dtype='object')

In [28]:
# Retrosheet only has data from 1921 onward, keep the same from Lahman
batting = batting.drop(batting[batting['year_id'] < 1921].index)

In [29]:
(batting['year_id'].min(), batting['year_id'].max())

(1921, 2018)

In [30]:
# are any of the players in batting missing a retro_id?
# use any(): all() being False would only show that not every row matches
(batting['player_id'].isin(missing)).any()

False

As per above, no player with a missing retro_id is in the batting dataframe.
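A note on the logic of this kind of membership test, sketched in plain Python with made-up IDs: `any` answers whether at least one row belongs to a player without a retro_id, whereas `all` being False only rules out that every row does. The lists below are illustrative assumptions.

```python
missing_ids = {'bellco99', 'brownra99'}

clean_rows = ['aaronha01', 'ruthba01', 'aaronha01']
mixed_rows = ['aaronha01', 'bellco99']

def flags(rows):
    # One boolean per row: does this row belong to a player with no retro_id?
    return [pid in missing_ids for pid in rows]

print(any(flags(clean_rows)), all(flags(clean_rows)))  # False False
print(any(flags(mixed_rows)), all(flags(mixed_rows)))  # True False
```

In both cases `all` is False, so only `any` distinguishes the clean data from the data that contains a player without a retro_id.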

In [31]:
batting_float = batting.select_dtypes(include=['float']).copy()
batting_float.columns

Index(['rbi', 'sb', 'cs', 'so', 'ibb', 'hp', 'sh', 'sf', 'gdp'], dtype='object')

In [32]:
# these are integers, but had NA, so were converted to float
batting_float.apply(bb.is_int)

rbi    True
sb     True
cs     True
so     True
ibb    True
hp     True
sh     True
sf     True
gdp    True
dtype: bool
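The helper `bb.is_int` is defined in another notebook. A minimal sketch of the idea, assuming it reports whether every non-missing float in a column is a whole number, might look like this; the function name `holds_only_integers` is hypothetical.

```python
import math

def holds_only_integers(values):
    """Hypothetical stand-in for bb.is_int: ignore NaN, require whole numbers."""
    return all(float(v).is_integer() for v in values if not math.isnan(v))

rbi_like = [100.0, float('nan'), 37.0]   # counts stored as float because of the NaN
era_like = [3.25, 4.0]                   # a genuinely fractional statistic

print(holds_only_integers(rbi_like))  # True
print(holds_only_integers(era_like))  # False
```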

In [33]:
# after removing rows before 1921, some fields no longer have null values
batting_float.isna().sum()

rbi        0
sb         0
cs      6736
so         0
ibb    18137
hp         0
sh         0
sf     17588
gdp     7661
dtype: int64

In [34]:
criteria = (batting_float.apply(bb.is_int)) & (batting_float.isna().sum() == 0)
cols = criteria[criteria].index.to_list()
cols

['rbi', 'sb', 'so', 'hp', 'sh']

In [35]:
# cast these back to int
batting[cols] = batting_float[cols].astype('int')
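Columns that still contain missing values (such as cs or ibb) stay float64 with this approach. One alternative, which this notebook does not use, is the pandas nullable integer dtype `'Int64'`. The sketch below uses made-up values; whether the helper functions and `to_sql` handle this dtype as desired is untested here.

```python
import numpy as np
import pandas as pd

# Illustrative caught-stealing counts with one missing value
cs = pd.Series([2.0, np.nan, 5.0], name='cs')

# 'Int64' (capital I) keeps integers and represents the gap as <NA>
cs_int = cs.astype('Int64')

print(cs_int.dtype)         # Int64
print(cs_int.isna().sum())  # 1
```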

In [36]:
batting = bb.optimize_df_dtypes(batting)

In [37]:
batting.dtypes

player_id     object
year_id       uint16
stint          uint8
team_id       object
lg_id         object
g              uint8
ab            uint16
r              uint8
h             uint16
b_2b           uint8
b_3b           uint8
hr             uint8
rbi            uint8
sb             uint8
cs           float64
bb             uint8
so             uint8
ibb          float64
hp             uint8
sh             uint8
sf           float64
gdp          float64
dtype: object
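The unsigned types above suggest that integer columns were downcast to the smallest type that fits. A minimal sketch of that idea, assuming `pd.to_numeric(..., downcast='unsigned')` is an acceptable approach, follows; `downcast_unsigned` is a hypothetical name and the actual `bb.optimize_df_dtypes` may work differently.

```python
import pandas as pd

def downcast_unsigned(df):
    """Hypothetical stand-in for bb.optimize_df_dtypes; touches integer columns only."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # picks the smallest unsigned type that holds every value in the column
            out[col] = pd.to_numeric(out[col], downcast='unsigned')
    return out

# Illustrative rows, not taken from Batting.csv
sample = pd.DataFrame({'year_id': [1921, 2018], 'g': [152, 162], 'team_id': ['NYA', 'BOS']})
small = downcast_unsigned(sample)
print(small.dtypes)
```

Here year_id needs 16 bits while g fits in 8, and the text column is left alone.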

### Persist as CSV with Column Types

In [38]:
os.chdir(p_wrangled)
bb.to_csv_with_types(batting, 'batting.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists.  Therefore, drop the table first.

In [39]:
dtypes = bb.optimize_db_dtypes(batting)
dtypes

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'ab': sqlalchemy.sql.sqltypes.SmallInteger,
 'r': sqlalchemy.sql.sqltypes.SmallInteger,
 'h': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_2b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_3b': sqlalchemy.sql.sqltypes.SmallInteger,
 'hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'rbi': sqlalchemy.sql.sqltypes.SmallInteger,
 'sb': sqlalchemy.sql.sqltypes.SmallInteger,
 'bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'so': sqlalchemy.sql.sqltypes.SmallInteger,
 'hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'sh': sqlalchemy.sql.sqltypes.SmallInteger}
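How `bb.optimize_db_dtypes` chooses a type is not shown in this notebook. One way to make such a choice, sketched below in plain Python, is to compare a column's observed minimum and maximum against the PostgreSQL integer ranges. The function name and the string return values are assumptions for illustration; the real helper returns SQLAlchemy types.

```python
# Signed ranges of the PostgreSQL integer types
PG_INT_RANGES = [
    ('smallint', -2**15, 2**15 - 1),
    ('integer', -2**31, 2**31 - 1),
    ('bigint', -2**63, 2**63 - 1),
]

def smallest_pg_int(lo, hi):
    """Return the name of the smallest PostgreSQL integer type holding [lo, hi]."""
    for name, type_min, type_max in PG_INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError('range does not fit in a PostgreSQL integer type')

print(smallest_pg_int(1921, 2018))  # smallint
print(smallest_pg_int(0, 40000))    # integer
```

A uint16 column can hold values up to 65535, which exceeds the smallint maximum of 32767, so the choice is safer when based on the observed values rather than on the pandas dtype alone.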

In [40]:
conn.execute("DROP TABLE IF EXISTS batting");

In [41]:
batting.to_sql('batting', conn, index=False, dtype=dtypes)

In [42]:
# verify unique
bb.is_unique(batting, ['player_id', 'year_id', 'stint'])

True
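The helper `bb.is_unique` is defined in another notebook. A minimal sketch of such a check, assuming it tests whether a combination of columns identifies each row, could use `DataFrame.duplicated`; the name `is_unique_key` and the sample rows are hypothetical.

```python
import pandas as pd

def is_unique_key(df, cols):
    """Hypothetical stand-in for bb.is_unique: no two rows share the same values in cols."""
    return not df.duplicated(subset=cols).any()

# Illustrative rows: one player with two stints in the same year
sample = pd.DataFrame({
    'player_id': ['doejo01', 'doejo01', 'roeja01'],
    'year_id': [2018, 2018, 2018],
    'stint': [1, 2, 1],
})

print(is_unique_key(sample, ['player_id', 'year_id', 'stint']))  # True
print(is_unique_key(sample, ['player_id', 'year_id']))           # False
```

The second call shows why stint has to be part of the key: a player who changes teams mid-season has more than one row per year.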

In [43]:
sql = 'ALTER TABLE batting ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [44]:
psql(r'\d batting')

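| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | not null | |
| year_id | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| g | smallint | | | |
| ab | smallint | | | |
| r | smallint | | | |
| h | smallint | | | |
| b_2b | smallint | | | |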


# Pitching

In [45]:
os.chdir(p_raw)

# yearID is inferred as int64 by read_csv (see info() below)
pitching = pd.read_csv('Pitching.csv')

In [46]:
pitching.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46699 entries, 0 to 46698
Data columns (total 30 columns):
playerID    46699 non-null object
yearID      46699 non-null int64
stint       46699 non-null int64
teamID      46699 non-null object
lgID        46567 non-null object
W           46699 non-null int64
L           46699 non-null int64
G           46699 non-null int64
GS          46699 non-null int64
CG          46699 non-null int64
SHO         46699 non-null int64
SV          46699 non-null int64
IPouts      46699 non-null int64
H           46699 non-null int64
ER          46699 non-null int64
HR          46699 non-null int64
BB          46699 non-null int64
SO          46699 non-null int64
BAOpp       42259 non-null float64
ERA         46607 non-null float64
IBB         32121 non-null float64
WP          46699 non-null int64
HBP         45965 non-null float64
BK          46699 non-null int64
BFP         46696 non-null float64
GF          46699 non-null int64
R           46699 no

## Rename

In [47]:
retro_names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'W':'w',
    'L':'l',
    'G':'g',
    'GS':'gs',
    'CG':'cg',
    'SHO':'sho',
    'SV':'sv',
    'IPouts':'ip_outs',
    'H':'h',
    'ER':'e',
    'HR':'hr',
    'BB':'bb',
    'SO':'so',
    'BAOpp':'ba_opp', 
    'ERA':'era', 
    'IBB':'ibb',
    'WP':'wp',
    'HBP':'hp',
    'BK':'bk',
    'BFP':'bfp', 
    'GF':'gf', 
    'R':'r',
    'SH':'sh',
    'SF':'sf',
    'GIDP':'gdp'
}

In [48]:
pitching.rename(columns=retro_names, inplace=True)

In [49]:
# Retrosheet only has data from 1921 onward
pitching = pitching.drop(pitching[pitching['year_id'] < 1921].index)

In [50]:
(pitching['year_id'].min(), pitching['year_id'].max())

(1921, 2018)

In [51]:
# are any of the pitchers missing a retro_id?
# use any(): all() being False would only show that not every row matches
(pitching['player_id'].isin(missing)).any()

False

As per above, no pitchers are missing a retro_id.

In [52]:
# np.float was removed from NumPy; the string 'float' selects the same columns
pitching_float = pitching.select_dtypes(include=['float'])

In [53]:
pitching_float.apply(bb.is_int)

ba_opp    False
era       False
ibb        True
hp         True
bfp        True
sh         True
sf         True
gdp        True
dtype: bool

In [54]:
# after dropping records < 1921, some fields no longer have nulls
pitching_float.isna().sum()

ba_opp       11
era          69
ibb        7812
hp            0
bfp           3
sh        12421
sf        12421
gdp       13552
dtype: int64

In [55]:
# integer fields with no nulls
criteria = (pitching_float.apply(bb.is_int) & (pitching_float.isna().sum() == 0))
criteria

ba_opp    False
era       False
ibb       False
hp         True
bfp       False
sh        False
sf        False
gdp       False
dtype: bool

In [56]:
cols = criteria[criteria].index.to_list()
cols

['hp']

In [57]:
# convert these floats to integers
# np.int was removed from NumPy; 'int' requests the default integer type
pitching[cols] = pitching[cols].astype('int')

In [58]:
pitching = bb.optimize_df_dtypes(pitching)

In [59]:
pitching.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39933 entries, 6766 to 46698
Data columns (total 30 columns):
player_id    39933 non-null object
year_id      39933 non-null uint16
stint        39933 non-null uint8
team_id      39933 non-null object
lg_id        39933 non-null object
w            39933 non-null uint8
l            39933 non-null uint8
g            39933 non-null uint8
gs           39933 non-null uint8
cg           39933 non-null uint8
sho          39933 non-null uint8
sv           39933 non-null uint8
ip_outs      39933 non-null uint16
h            39933 non-null uint16
e            39933 non-null uint8
hr           39933 non-null uint8
bb           39933 non-null uint8
so           39933 non-null uint16
ba_opp       39922 non-null float64
era          39864 non-null float64
ibb          32121 non-null float64
wp           39933 non-null uint8
hp           39933 non-null uint8
bk           39933 non-null uint8
bfp          39930 non-null float64
gf           39933 non-

### Persist as CSV with Column Types

In [60]:
os.chdir(p_wrangled)
bb.to_csv_with_types(pitching, 'pitching.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists.  Therefore, drop the table first.

In [61]:
dtype = bb.optimize_db_dtypes(pitching)
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'w': sqlalchemy.sql.sqltypes.SmallInteger,
 'l': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'gs': sqlalchemy.sql.sqltypes.SmallInteger,
 'cg': sqlalchemy.sql.sqltypes.SmallInteger,
 'sho': sqlalchemy.sql.sqltypes.SmallInteger,
 'sv': sqlalchemy.sql.sqltypes.SmallInteger,
 'ip_outs': sqlalchemy.sql.sqltypes.SmallInteger,
 'h': sqlalchemy.sql.sqltypes.SmallInteger,
 'e': sqlalchemy.sql.sqltypes.SmallInteger,
 'hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'so': sqlalchemy.sql.sqltypes.SmallInteger,
 'wp': sqlalchemy.sql.sqltypes.SmallInteger,
 'hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'bk': sqlalchemy.sql.sqltypes.SmallInteger,
 'gf': sqlalchemy.sql.sqltypes.SmallInteger,
 'r': sqlalchemy.sql.sqltypes.SmallInteger}

In [62]:
conn.execute("DROP TABLE IF EXISTS pitching");

In [63]:
pitching.to_sql('pitching', conn, index=False, dtype=dtype)

In [64]:
# verify unique
bb.is_unique(pitching, ['player_id', 'year_id', 'stint'])

True

In [65]:
sql = 'ALTER TABLE pitching ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [66]:
psql(r'\d pitching')

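| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | not null | |
| year_id | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| w | smallint | | | |
| l | smallint | | | |
| g | smallint | | | |
| gs | smallint | | | |
| cg | smallint | | | |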


# Fielding

In [67]:
os.chdir(p_raw)

fielding = pd.read_csv('Fielding.csv')

In [68]:
fielding.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140921 entries, 0 to 140920
Data columns (total 18 columns):
playerID    140921 non-null object
yearID      140921 non-null int64
stint       140921 non-null int64
teamID      140921 non-null object
lgID        139408 non-null object
POS         140921 non-null object
G           140921 non-null int64
GS          93544 non-null float64
InnOuts     110992 non-null float64
PO          140921 non-null int64
A           140921 non-null int64
E           140920 non-null float64
DP          140921 non-null int64
PB          11478 non-null float64
WP          1169 non-null float64
SB          8691 non-null float64
CS          8691 non-null float64
ZR          1169 non-null float64
dtypes: float64(8), int64(6), object(4)
memory usage: 19.4+ MB


In [69]:
fielding.columns = [bb.convert_camel_case(name) for name in fielding.columns]
fielding.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'pos', 'g', 'gs',
       'inn_outs', 'po', 'a', 'e', 'dp', 'pb', 'wp', 'sb', 'cs', 'zr'],
      dtype='object')

In [70]:
# Retrosheet only has data from 1921 onward
fielding = fielding.drop(fielding[fielding['year_id'] < 1921].index)

In [71]:
(fielding['year_id'].min(), fielding['year_id'].max())

(1921, 2018)

In [72]:
# are any of the players in fielding missing a retro_id?
# use any(): all() being False would only show that not every row matches
(fielding['player_id'].isin(missing)).any()

False

As per above, no players in fielding are missing a retro_id.

In [73]:
# these are null 99% of the time, drop them
fielding[['wp', 'zr']].isna().mean()

wp    0.989516
zr    0.989516
dtype: float64

In [74]:
fielding = fielding.drop(['wp','zr'], axis=1)

In [75]:
fielding = bb.optimize_df_dtypes(fielding)

In [76]:
fielding.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111499 entries, 29422 to 140920
Data columns (total 16 columns):
player_id    111499 non-null object
year_id      111499 non-null uint16
stint        111499 non-null uint8
team_id      111499 non-null object
lg_id        111499 non-null object
pos          111499 non-null object
g            111499 non-null uint8
gs           89431 non-null float64
inn_outs     89431 non-null float64
po           111499 non-null uint16
a            111499 non-null uint16
e            111498 non-null float64
dp           111499 non-null uint8
pb           8423 non-null float64
sb           6389 non-null float64
cs           6389 non-null float64
dtypes: float64(6), object(4), uint16(3), uint8(3)
memory usage: 10.3+ MB


### Persist as CSV with Column Types

In [77]:
os.chdir(p_wrangled)
bb.to_csv_with_types(fielding, 'fielding.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists.  Therefore, drop the table first.

In [78]:
dtype = bb.optimize_db_dtypes(fielding)
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'po': sqlalchemy.sql.sqltypes.SmallInteger,
 'a': sqlalchemy.sql.sqltypes.SmallInteger,
 'dp': sqlalchemy.sql.sqltypes.SmallInteger}

In [79]:
conn.execute("DROP TABLE IF EXISTS fielding");

In [80]:
fielding.to_sql('fielding', conn, index=False, dtype=dtype)

In [81]:
bb.is_unique(fielding, ['player_id', 'year_id', 'stint', 'pos'])

True

In [82]:
sql = 'ALTER TABLE fielding ADD PRIMARY KEY (player_id, year_id, stint, pos)'
conn.execute(sql);

In [83]:
psql(r'\d fielding')

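| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| player_id | text | | not null | |
| year_id | smallint | | not null | |
| stint | smallint | | not null | |
| team_id | text | | | |
| lg_id | text | | | |
| pos | text | | not null | |
| g | smallint | | | |
| gs | double precision | | | |
| inn_outs | double precision | | | |
| po | smallint | | | |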


### Note on Position

This is based on my MLB domain knowledge.

Players in recent years are increasingly playing more than one position in a single game, let alone in a single stint.

Note: a player who plays for 3 teams in 1 year would have 3 "stints".

Catchers and pitchers rarely play a position other than catcher or pitcher (except in exceedingly long extra-inning games).

Usually, but not always, infielders play one of the infield positions.

Usually, but not always, outfielders play one of the outfield positions.

So although every player is listed as having a specific position, this position is not fixed.  It is likely that the position represents the position most often played by that player.

The Lahman csv file "Appearances" lists how often each player played at a particular position for a given year.

# Teams

The team_id used by Lahman is not the same as the team_id used by Retrosheet.  When comparing data between the two sources, it will be necessary to map one team_id to the other.

In [84]:
os.chdir(p_raw)
teams = pd.read_csv('Teams.csv')

In [85]:
teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2895 entries, 0 to 2894
Data columns (total 48 columns):
yearID            2895 non-null int64
lgID              2845 non-null object
teamID            2895 non-null object
franchID          2895 non-null object
divID             1378 non-null object
Rank              2895 non-null int64
G                 2895 non-null int64
Ghome             2496 non-null float64
W                 2895 non-null int64
L                 2895 non-null int64
DivWin            1350 non-null object
WCWin             714 non-null object
LgWin             2867 non-null object
WSWin             2538 non-null object
R                 2895 non-null int64
AB                2895 non-null int64
H                 2895 non-null int64
2B                2895 non-null int64
3B                2895 non-null int64
HR                2895 non-null int64
BB                2894 non-null float64
SO                2879 non-null float64
SB                2769 non-null float64
CS  

In [86]:
teams.columns = [bb.convert_camel_case(col) for col in teams.columns]

In [87]:
teams.columns

Index(['year_id', 'lg_id', 'team_id', 'franch_id', 'div_id', 'rank', 'g',
       'ghome', 'w', 'l', 'div_win', 'wc_win', 'lg_win', 'ws_win', 'r', 'ab',
       'h', '2_b', '3_b', 'hr', 'bb', 'so', 'sb', 'cs', 'hbp', 'sf', 'ra',
       'er', 'era', 'cg', 'sho', 'sv', 'i_pouts', 'ha', 'hra', 'bba', 'soa',
       'e', 'dp', 'fp', 'name', 'park', 'attendance', 'bpf', 'ppf',
       'team_idbr', 'team_i_dlahman45', 'team_i_dretro'],
      dtype='object')

In [88]:
# convert_camel_case was not perfect: fix those names and rename Postgres keyword names
names = {'2_b':'b_2b', '3_b':'b_3b', 'i_pouts':'ip_outs', 
         'team_idbr':'team_id_br','team_i_dlahman45':'team_id_lahman45',
         'team_i_dretro':'team_id_retro', 'rank':'team_rank', 'name':'team_name'}

In [89]:
teams = teams.rename(columns=names)
teams.columns

Index(['year_id', 'lg_id', 'team_id', 'franch_id', 'div_id', 'team_rank', 'g',
       'ghome', 'w', 'l', 'div_win', 'wc_win', 'lg_win', 'ws_win', 'r', 'ab',
       'h', 'b_2b', 'b_3b', 'hr', 'bb', 'so', 'sb', 'cs', 'hbp', 'sf', 'ra',
       'er', 'era', 'cg', 'sho', 'sv', 'ip_outs', 'ha', 'hra', 'bba', 'soa',
       'e', 'dp', 'fp', 'team_name', 'park', 'attendance', 'bpf', 'ppf',
       'team_id_br', 'team_id_lahman45', 'team_id_retro'],
      dtype='object')

In [90]:
# Retrosheet only has data from 1921 onward
teams = teams.drop(teams[teams['year_id'] < 1921].index)

In [91]:
# Note: 97% of the time, the Lahman team_id = the Retrosheet team_id
(teams['team_id'] == teams['team_id_retro']).mean()

0.9747242647058824
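For the rows where the two IDs differ, a lookup from the Lahman key to the Retrosheet ID is one way to do the mapping. The sketch below uses plain Python with made-up team codes (they are not real Lahman or Retrosheet values), and keying by season is an assumption made so that an ID that changes over time is handled.

```python
# Illustrative rows; the team codes are invented for this example
rows = [
    {'year_id': 2018, 'team_id': 'AAA', 'team_id_retro': 'AAA'},
    {'year_id': 2018, 'team_id': 'BBL', 'team_id_retro': 'BBR'},
    {'year_id': 2017, 'team_id': 'BBL', 'team_id_retro': 'BBR'},
]

# Map (year_id, Lahman team_id) to the Retrosheet team_id
lahman_to_retro = {(r['year_id'], r['team_id']): r['team_id_retro'] for r in rows}

# Keep only the entries where the two sources disagree
differing = {key: retro for key, retro in lahman_to_retro.items() if key[1] != retro}

print(lahman_to_retro[(2018, 'BBL')])  # BBR
print(len(differing))                  # 2
```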

In [92]:
teams = bb.optimize_df_dtypes(teams)
teams.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2176 entries, 719 to 2894
Data columns (total 48 columns):
year_id             2176 non-null uint16
lg_id               2176 non-null object
team_id             2176 non-null object
franch_id           2176 non-null object
div_id              1378 non-null object
team_rank           2176 non-null uint8
g                   2176 non-null uint8
ghome               2176 non-null float64
w                   2176 non-null uint8
l                   2176 non-null uint8
div_win             1350 non-null object
wc_win              714 non-null object
lg_win              2148 non-null object
ws_win              2148 non-null object
r                   2176 non-null uint16
ab                  2176 non-null uint16
h                   2176 non-null uint16
b_2b                2176 non-null uint16
b_3b                2176 non-null uint8
hr                  2176 non-null uint16
bb                  2176 non-null float64
so                  2176 non-null 

### Persist as CSV with Column Types

In [93]:
os.chdir(p_wrangled)
bb.to_csv_with_types(teams, 'teams.csv')

### Persist as Postgres Table

df.to_sql(if_exists='replace') will replace data if it exists, but it will *not* replace column types if the Postgres table exists.  Therefore, drop the table first.

In [94]:
dtypes = bb.optimize_db_dtypes(teams)
dtypes

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'team_rank': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'w': sqlalchemy.sql.sqltypes.SmallInteger,
 'l': sqlalchemy.sql.sqltypes.SmallInteger,
 'r': sqlalchemy.sql.sqltypes.SmallInteger,
 'ab': sqlalchemy.sql.sqltypes.SmallInteger,
 'h': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_2b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_3b': sqlalchemy.sql.sqltypes.SmallInteger,
 'hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'ra': sqlalchemy.sql.sqltypes.SmallInteger,
 'er': sqlalchemy.sql.sqltypes.SmallInteger,
 'cg': sqlalchemy.sql.sqltypes.SmallInteger,
 'sho': sqlalchemy.sql.sqltypes.SmallInteger,
 'sv': sqlalchemy.sql.sqltypes.SmallInteger,
 'ip_outs': sqlalchemy.sql.sqltypes.SmallInteger,
 'ha': sqlalchemy.sql.sqltypes.SmallInteger,
 'hra': sqlalchemy.sql.sqltypes.SmallInteger,
 'bba': sqlalchemy.sql.sqltypes.SmallInteger,
 'soa': sqlalchemy.sql.sqltypes.SmallInteger,
 'e': sqlalchemy.sql.sqltypes.Small

In [95]:
conn.execute("DROP TABLE IF EXISTS teams");

In [96]:
teams.to_sql('teams', conn, index=False, dtype=dtypes)

In [97]:
bb.is_unique(teams, ['year_id', 'team_id'])

True

In [98]:
bb.is_unique(teams, ['year_id', 'team_id_retro'])

True

In [99]:
sql = 'ALTER TABLE teams ADD PRIMARY KEY (year_id, team_id)'
conn.execute(sql);

sql = 'ALTER TABLE teams ADD CONSTRAINT team_retro_unique UNIQUE (year_id, team_id_retro)'
conn.execute(sql)

sql = 'ALTER TABLE teams ALTER COLUMN team_id_retro SET NOT NULL'
conn.execute(sql);

In [100]:
psql(r'\d teams')

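| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| year_id | smallint | | not null | |
| lg_id | text | | | |
| team_id | text | | not null | |
| franch_id | text | | | |
| div_id | text | | | |
| team_rank | smallint | | | |
| g | smallint | | | |
| ghome | double precision | | | |
| w | smallint | | | |
| l | smallint | | | |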
