# Lahman Baseball Data

**Baseball Notebooks**  
1. In the previous notebook, the data was downloaded and unzipped, and techniques for persisting DataFrames were discussed.
2. In this notebook, the Lahman data will be wrangled and persisted.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

## Lahman Data Dictionary
A "Data Dictionary" is also called a "Codebook".

http://www.seanlahman.com/files/database/readme2016.txt  

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## Path Objects for Lahman Baseball Data

In [1]:
import pandas as pd
import numpy as np

import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_raw = lahman.joinpath('raw')
p_wrangled = lahman.joinpath('wrangled')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)
os.chdir(p_raw)
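
The directory layout above can be tried out without touching the home directory. The following is a minimal sketch that builds the same `raw` and `wrangled` layout under a temporary directory. The use of `tempfile` and the `*_demo` names are assumptions made only for illustration.

```python
import tempfile
from pathlib import Path

# Stand-in for Path.home(), so nothing is created in the real home directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    lahman_demo = base.joinpath('data/lahman')
    raw_demo = lahman_demo.joinpath('raw')
    wrangled_demo = lahman_demo / 'wrangled'  # "/" is equivalent to joinpath

    # parents=True creates data/ and data/lahman/ as needed;
    # exist_ok=True makes the call safe to repeat.
    raw_demo.mkdir(parents=True, exist_ok=True)
    wrangled_demo.mkdir(parents=True, exist_ok=True)
    raw_demo.mkdir(parents=True, exist_ok=True)  # no error on the second call

    created = sorted(p.name for p in lahman_demo.iterdir())
    print(created)
```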

# Database

This section is preparation for interacting with Postgres.  Using Postgres is optional for the data analysis.

Prerequisites
1. PostgreSQL server is installed, configured and running.
2. baseball database has been created.

In [3]:
from sqlalchemy.engine import create_engine
from sqlalchemy.types import SmallInteger, Integer, BigInteger
from IPython.display import HTML, display

### Connect to DB

In [4]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# connect
conn = create_engine(connect_str)

type(conn)

sqlalchemy.engine.base.Engine
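
If the password contains characters such as `@` or `/`, a URL built by plain string formatting can be parsed incorrectly. The following is a minimal sketch of escaping the credentials with the standard library before building the string. The helper name `build_connect_str` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_connect_str(user, password, host='localhost', port=5432, db='baseball'):
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    return f'postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}'

# hypothetical credentials, for illustration only
demo_url = build_connect_str('postgres', 'p@ss/word')
print(demo_url)
```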

In [5]:
type(conn.connect())

sqlalchemy.engine.base.Connection

### SQL Magic

SQL Magic is not used here because it does not release its connection until the notebook is closed.  This can leave a lock on a table, which prevents conn (above) from performing database updates.

A connection from SQLAlchemy can be used almost identically to the [Python DB API](https://www.python.org/dev/peps/pep-0249/).

When the connection object is a SQLAlchemy Engine and it is used to execute SQL, a connection is allocated, used, any changes are committed, and the connection is released.

When the connection object is a SQLAlchemy Connection (not used here), explicit transaction processing can be done.

### **psql**

There must be a ~/.pgpass file similar to the following for psql to connect without a password:  

```localhost:5432:*:<user>:<passwd>```

See [Postgres pgpass Doc](https://www.postgresql.org/docs/11/libpq-pgpass.html)

In [6]:
# -H for html output
# this connects, executes, and disconnects
def psql(cmd, user='postgres', schema='baseball'):
    psql_out = !psql -H -U {user} {schema} -c "{cmd}"
    display(HTML(''.join(psql_out)))
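
The `!psql` syntax only works inside IPython. Outside a notebook, roughly the same command could be assembled for `subprocess`. This is only a sketch: `psql_command` is a hypothetical helper, and the call that would actually run it is left commented out because it needs `psql` and a running server.

```python
import subprocess

def psql_command(cmd, user='postgres', dbname='baseball'):
    """Return the argument list for: psql -H -U <user> <dbname> -c <cmd>."""
    # Passing a list avoids shell quoting problems with the SQL text.
    return ['psql', '-H', '-U', user, dbname, '-c', cmd]

args = psql_command(r'\d people')
print(args)

# To run it (requires psql and a reachable server):
# result = subprocess.run(args, capture_output=True, text=True, check=True)
# html = result.stdout
```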

In [7]:
psql(r'\d')

Schema,Name,Type,Owner
public,batting,table,postgres
public,fielding,table,postgres
public,people,table,postgres
public,pitching,table,postgres
public,r_game,table,postgres
public,r_game_fields,table,postgres
public,r_parks,table,postgres
public,r_player_game,table,postgres
public,r_player_game_fields,table,postgres
public,r_players,table,postgres


### CamelCase to snake_case

Postgres is easier to use without caps in the column names.

The following function is from:  
https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

In [8]:
# CamelCase to snake_case
def convert_camel_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [9]:
# example
convert_camel_case('playerID')

'player_id'
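
To see how the two regular expressions behave on actual Lahman column names, the function can be applied to a few more cases. The function is restated here so the block runs on its own. Note that names such as `IPouts` and `2B` do not convert cleanly.

```python
import re

def convert_camel_case(name):
    # first pass: split before a capital that starts a lowercase run (finalGame -> final_Game)
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    # second pass: split between a lowercase letter or digit and a capital (playerID -> player_ID)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

names = ['playerID', 'finalGame', 'InnOuts', 'POS', 'IPouts', '2B']
converted = {name: convert_camel_case(name) for name in names}
for name, snake in converted.items():
    print(f'{name:10} -> {snake}')
```

The last two results are `i_pouts` and `2_b`, so columns like these are better renamed explicitly.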

# Useful Methods

### Persisting DataFrame Column Data Types with CSV
See the discussion in the previous notebook on persisting DataFrames.

In [23]:
def to_csv_with_types(df, file_prefix, compression=False):
    """Save df to csv and save df.dtypes to csv"""
    
    dtypes = df.dtypes.to_frame('dtypes').reset_index()
    filename_types = file_prefix + '_types.csv'
    dtypes.to_csv(filename_types, index=False)    
    
    if compression:
        filename = file_prefix + '.csv.gz'
        df.to_csv(filename, compression='gzip', index=False)        
    else:
        filename = file_prefix + '.csv'        
        df.to_csv(filename, index=False)

In [24]:
def from_csv_with_types(file_prefix, compression=False):
    """Read df.dtypes from csv and read df from csv"""
    
    filename_types = file_prefix + '_types.csv'
    types = pd.read_csv(filename_types).set_index('index').to_dict()
    dtypes = types['dtypes']
    
    dates = [key for key,value in dtypes.items() if value.startswith('datetime')]
    for field in dates:
        dtypes.pop(field)
        
    if compression:
        filename = file_prefix + '.csv.gz'
        df = pd.read_csv(filename, compression='gzip', parse_dates = dates, dtype=dtypes)
    else:
        df = pd.read_csv(file_prefix+'.csv', parse_dates = dates, dtype=dtypes)
    return df
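
The reason for the separate types file can be shown with a small in-memory round trip. This is a sketch with made-up values that uses `io.StringIO` instead of files on disk. It shows that a plain `read_csv` loses the `uint8` and `category` dtypes, while passing the saved dtype names back restores them.

```python
import io
import pandas as pd

# made-up rows, for illustration only
df = pd.DataFrame({
    'player_id': ['aaa01', 'bbb01', 'ccc01'],
    'lg_id': pd.Categorical(['NL', 'NL', 'AL']),
    'b_hr': pd.Series([44, 51, 37], dtype='uint8'),
})
csv_text = df.to_csv(index=False)

# A plain read infers dtypes from the text, so uint8 and category are lost.
plain = pd.read_csv(io.StringIO(csv_text))

# Saving the dtype names and passing them back restores the columns.
dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}
typed = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(plain.dtypes)
print(typed.dtypes)
```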

### Is Unique over Multiple Columns

In [3]:
# this is faster than using groupby
def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()
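
A short usage sketch with made-up rows: a player who changes teams during a season has two rows for the same year, so `player_id` and `year_id` alone are not unique, but adding `stint` makes the key unique.

```python
import pandas as pd

def is_unique(df, cols):
    return not df.duplicated(subset=cols).any()

# made-up rows: player 'aaa01' changed teams during 1960, giving two stints
demo = pd.DataFrame({
    'player_id': ['aaa01', 'aaa01', 'bbb01'],
    'year_id':   [1960, 1960, 1960],
    'stint':     [1, 2, 1],
})

print(is_unique(demo, ['player_id', 'year_id']))           # False
print(is_unique(demo, ['player_id', 'year_id', 'stint']))  # True
```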

### Optimize Pandas Data Types
There is no way to do this perfectly in general, but the following is a good start.

In [13]:
def optimize_pandas_dtypes(df, cutoff=0.05):
    df = df.copy()
    
    # int64 -> smallest uint allowed by data
    # (columns containing negative values are left unchanged)
    df_int = df.select_dtypes(include=['int64'])
    df_int = df_int.apply(pd.to_numeric, downcast='unsigned')
    df[df_int.columns] = df_int

    # object -> category, if at most `cutoff` (default 5%) of the values are unique
    df_obj = df.select_dtypes(include=['object'])
    s = df_obj.nunique() / df.shape[0]
    columns = s.index[s <= cutoff].values
    if len(columns) > 0:
        df_cat = df[columns].astype('category')
        df[columns] = df_cat    
    
    return df
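
The two conversions used above can be seen in isolation on small made-up Series. This is a sketch of the underlying pandas calls, not of the full function.

```python
import pandas as pd

small = pd.to_numeric(pd.Series([0, 200, 255]), downcast='unsigned')
medium = pd.to_numeric(pd.Series([0, 300]), downcast='unsigned')
negative = pd.to_numeric(pd.Series([-1, 5]), downcast='unsigned')

print(small.dtype)     # uint8
print(medium.dtype)    # uint16
print(negative.dtype)  # int64: a negative value prevents the unsigned downcast

# category pays off when a few distinct strings repeat many times
teams = pd.Series(['NYA', 'BOS'] * 1000)
print(teams.nunique() / len(teams))  # 0.001, well under the 0.05 cutoff
as_category = teams.astype('category')
print(as_category.memory_usage(deep=True) < teams.memory_usage(deep=True))
```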

In [14]:
def mem_usage(df):
    mem = df.memory_usage(deep=True).sum()
    mem = mem / 2 ** 20 # convert to megabytes
    return f'{mem:03.2f} MB'

### Optimize Postgres Data Types
Postgres names its integer types by size in bytes (1 byte = 8 bits), while NumPy names them by size in bits.  
So np.int32 = Postgres int4 = ANSI SQL INTEGER

In [15]:
def optimize_database_dtypes(df):
    """Choose the best DB column type for each DataFrame integer column"""
    # note: Postgres integers are signed, so unsigned columns must hold
    # values that fit in the signed type of the same width
    small_int = {col:SmallInteger for col in df.select_dtypes(
        include=[np.int16, np.uint16, np.int8, np.uint8]).columns}

    integer = {col:Integer for col in df.select_dtypes(
        include=[np.int32, np.uint32]).columns}

    big_int = {col:BigInteger for col in df.select_dtypes(
        include=[np.int64, np.uint64]).columns}

    dtype = {**small_int, **integer, **big_int}
    
    return dtype
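
The bits-versus-bytes naming can be checked directly with NumPy. The sketch below also illustrates one caveat of mapping unsigned NumPy types to Postgres: Postgres integers are signed, so a `uint16` column can hold values that do not fit in `smallint`. The helper `fits_smallint` is hypothetical and not part of this notebook.

```python
import numpy as np

# NumPy names integer types by bits; the Postgres aliases (int2, int4, int8) use bytes.
for np_type in (np.int16, np.int32, np.int64):
    n_bytes = np.dtype(np_type).itemsize
    print(f'{np.dtype(np_type).name:6} -> {n_bytes} bytes -> Postgres int{n_bytes}')

smallint_min = int(np.iinfo(np.int16).min)  # -32768, the Postgres smallint minimum
smallint_max = int(np.iinfo(np.int16).max)  # 32767, the Postgres smallint maximum
uint16_max = int(np.iinfo(np.uint16).max)   # 65535

def fits_smallint(values):
    """Hypothetical check: True if every value fits in a signed 2-byte integer."""
    lo, hi = int(np.min(values)), int(np.max(values))
    return smallint_min <= lo and hi <= smallint_max

print(uint16_max > smallint_max)                               # True
print(fits_smallint(np.array([1955, 2018], dtype=np.uint16)))  # True
print(fits_smallint(np.array([40000], dtype=np.uint16)))       # False
```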

# Main Lahman Baseball Files
As per:  
http://www.seanlahman.com/files/database/readme2016.txt

After readme2016.txt was written, master was renamed to People.

The 4 main files are:
*  People   - Player names, DOB, and biographical info
*  Batting  - batting statistics
*  Pitching - pitching statistics
*  Fielding - fielding statistics

# People

In [16]:
os.chdir(p_raw)
people = pd.read_csv('People.csv', parse_dates=['debut', 'finalGame'])

In [17]:
people.columns

Index(['playerID', 'birthYear', 'birthMonth', 'birthDay', 'birthCountry',
       'birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay',
       'deathCountry', 'deathState', 'deathCity', 'nameFirst', 'nameLast',
       'nameGiven', 'weight', 'height', 'bats', 'throws', 'debut', 'finalGame',
       'retroID', 'bbrefID'],
      dtype='object')

In [18]:
people.columns = [convert_camel_case(name) for name in people.columns]
people.columns

Index(['player_id', 'birth_year', 'birth_month', 'birth_day', 'birth_country',
       'birth_state', 'birth_city', 'death_year', 'death_month', 'death_day',
       'death_country', 'death_state', 'death_city', 'name_first', 'name_last',
       'name_given', 'weight', 'height', 'bats', 'throws', 'debut',
       'final_game', 'retro_id', 'bbref_id'],
      dtype='object')

In [19]:
# custom parsing of birth/death dates
def to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    
    # NaT if year is missing
    if pd.isna(y):
        return pd.NaT
    
    # fillna if year present but month missing
    if pd.isna(m):
        m = 1
        
    # fillna if year present but day missing
    if pd.isna(d):
        d = 1
        
    return pd.Timestamp(year=int(y), month=int(m), day=int(d))

In [20]:
people['birth_date'] = people.apply(lambda x: to_date(x, 'birth'), axis=1)
people['death_date'] = people.apply(lambda x: to_date(x, 'death'), axis=1)
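
The three cases handled by the date parsing (complete date, year only, no year) can be checked on made-up rows. This sketch restates the function in condensed form and assumes `pd.Timestamp` for building the date, so that it runs on current pandas versions.

```python
import pandas as pd

def to_date(row, prefix):
    y, m, d = (row[f'{prefix}_{part}'] for part in ('year', 'month', 'day'))
    if pd.isna(y):
        return pd.NaT
    m = 1 if pd.isna(m) else m   # missing month defaults to January
    d = 1 if pd.isna(d) else d   # missing day defaults to the 1st
    return pd.Timestamp(year=int(y), month=int(m), day=int(d))

# made-up rows covering the three cases
demo = pd.DataFrame({
    'birth_year':  [1934.0, 1900.0, None],
    'birth_month': [2.0, None, None],
    'birth_day':   [5.0, None, None],
})
dates = demo.apply(lambda row: to_date(row, 'birth'), axis=1)
print(dates)
```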

In [21]:
people = people.drop(
    ['birth_year', 'birth_month', 'birth_day', 
     'death_year', 'death_month', 'death_day'], axis=1)

In [22]:
# retro_id will be used with Retrosheet Data
# drop (the very few) rows with missing retro_id
people = people.dropna(subset=['retro_id'], axis=0)

In [23]:
# verify uniqueness
print(people['player_id'].is_unique)
print(people['retro_id'].is_unique)

True
True


In [24]:
mem_usage(people)

'16.78 MB'

In [25]:
people.dtypes.astype(str).value_counts()

object            14
datetime64[ns]     4
float64            2
dtype: int64

In [26]:
people = optimize_pandas_dtypes(people)

In [27]:
mem_usage(people)

'10.64 MB'

In [28]:
people.dtypes.astype(str).value_counts()

object            8
category          6
datetime64[ns]    4
float64           2
dtype: int64

### Persist as CSV with Column Types
Use custom function above to save the data types to a separate csv file.

In [29]:
os.chdir(p_wrangled)
to_csv_with_types(people, 'people')

In [30]:
# verify data type information was not lost
df2 = from_csv_with_types('people')
(df2.dtypes == people.dtypes).all()

True

### Persist as Postgres Table

In [31]:
# replace the table if it exists
people.to_sql('people', conn, if_exists='replace', index=False)

In [32]:
rs = conn.execute("SELECT COUNT(*) from people")
rs.fetchall()

[(19561,)]

In [33]:
# add primary key, unique and not null constraints
sql   = 'ALTER TABLE people ADD PRIMARY KEY (player_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ADD CONSTRAINT retro_unique UNIQUE (retro_id)'
conn.execute(sql)

sql = 'ALTER TABLE people ALTER COLUMN retro_id SET NOT NULL'
conn.execute(sql);

In [34]:
# describe the table
psql(r'\d people')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
birth_country,text,,,
birth_state,text,,,
birth_city,text,,,
death_country,text,,,
death_state,text,,,
death_city,text,,,
name_first,text,,,
name_last,text,,,
name_given,text,,,


# Batting

In [35]:
os.chdir(p_raw)
batting = pd.read_csv('Batting.csv')

In [36]:
batting.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP'],
      dtype='object')

## Rename to use Retrosheet Names for the Same Fields

The following is from the RetrosheetBaseball Jupyter notebook.
```
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference', 
```

In [37]:
retro_names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'G':'b_g',
    'AB':'b_ab',
    'R':'b_r',
    'H':'b_h',
    '2B':'b_2b',
    '3B':'b_3b',
    'HR':'b_hr',
    'RBI':'b_rbi',
    'SB':'b_sb',
    'CS':'b_cs',
    'BB':'b_bb',
    'SO':'b_so',
    'IBB':'b_ibb',
    'HBP':'b_hp',
    'SH':'b_sh',
    'SF':'b_sf',
    'GIDP':'b_gdp'
}

In [38]:
batting.rename(columns=retro_names, inplace=True)
batting.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'b_g', 'b_ab',
       'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb',
       'b_so', 'b_ibb', 'b_hp', 'b_sh', 'b_sf', 'b_gdp'],
      dtype='object')

In [39]:
# certain stats are null only for old games
# this study will be from 1955 onward
batting = batting.drop(batting[batting['year_id'] < 1955].index)

In [40]:
batting['year_id'].min()

1955
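
Dropping rows by index label, as above, gives the same result as keeping the rows that satisfy the opposite condition, provided the index is unique and the column has no missing values. A minimal sketch on made-up rows:

```python
import pandas as pd

# made-up rows, for illustration only
demo = pd.DataFrame({'player_id': ['aaa01', 'bbb01', 'ccc01'],
                     'year_id': [1954, 1955, 2018]})

# form used in this notebook: find the index labels of the unwanted rows and drop them
dropped = demo.drop(demo[demo['year_id'] < 1955].index)

# equivalent here: keep the rows that satisfy the opposite condition
kept = demo[demo['year_id'] >= 1955]

print(dropped.equals(kept))
```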

### Remove Rows with 0 At Bats
This will also remove rows with missing stat fields as most stats require at least 1 at bat.

A batter can have 0 at bats for a season, if his only appearances were as a pinch runner, or if he walked every time up.

In [41]:
batting = batting.drop(batting[batting['b_ab'] == 0].index)

In [42]:
# these are integers, but had NA, so were converted to float
batting_float = batting.select_dtypes(include=['float']).copy()
batting_float.columns

Index(['b_rbi', 'b_sb', 'b_cs', 'b_so', 'b_ibb', 'b_hp', 'b_sh', 'b_sf',
       'b_gdp'],
      dtype='object')

In [43]:
# after removing years < 1955 and rows with 0 at bats, there are no longer any null values
batting_float.isna().sum()

b_rbi    0
b_sb     0
b_cs     0
b_so     0
b_ibb    0
b_hp     0
b_sh     0
b_sf     0
b_gdp    0
dtype: int64

In [44]:
# cast these back to int
batting[batting_float.columns] = batting_float.astype('int')
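
Why integer counts end up as float, and why the cast back is safe once the missing values are gone, can be seen on a small made-up Series. The nullable `Int64` dtype shown at the end is an alternative that this notebook does not use.

```python
import numpy as np
import pandas as pd

# made-up counts with one missing value
sb = pd.Series([3, np.nan, 12])
print(sb.dtype)                      # float64: NaN cannot be stored in an int64 column

# once the rows with missing values are gone, the cast back is lossless
sb_int = sb.dropna().astype('int64')
print(sb_int.dtype)                  # int64

# alternative (not used in this notebook): pandas' nullable integer dtype
sb_nullable = sb.astype('Int64')
print(sb_nullable.isna().sum())      # 1
```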

In [45]:
mem_usage(batting)

'17.51 MB'

In [46]:
batting.dtypes.astype(str).value_counts().sort_index()

int64     19
object     3
dtype: int64

In [47]:
batting = optimize_pandas_dtypes(batting)

In [48]:
mem_usage(batting)

'4.97 MB'

In [49]:
batting.dtypes.astype(str).value_counts().sort_index()

category     2
object       1
uint16       3
uint8       16
dtype: int64

In [50]:
dtype = optimize_database_dtypes(batting)

### Persist as CSV with Column Types

In [51]:
os.chdir(p_wrangled)
to_csv_with_types(batting, 'batting')

### Persist as Postgres Table

In [52]:
dtype = optimize_database_dtypes(batting)

In [53]:
# display the dtypes
[(key,value) for key,value in dtype.items()]

[('year_id', sqlalchemy.sql.sqltypes.SmallInteger),
 ('stint', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_g', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_ab', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_r', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_h', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_2b', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_3b', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_hr', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_rbi', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_sb', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_cs', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_bb', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_so', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_ibb', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_hp', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_sh', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_sf', sqlalchemy.sql.sqltypes.SmallInteger),
 ('b_gdp', sqlalchemy.sql.sqltypes.SmallInteger)]

In [54]:
batting.to_sql('batting', conn, if_exists='replace', index=False, dtype=dtype)

In [55]:
# verify unique
is_unique(batting, ['player_id', 'year_id', 'stint'])

True

In [56]:
sql = 'ALTER TABLE batting ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [57]:
psql(r'\d batting')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
year_id,smallint,,not null,
stint,smallint,,not null,
team_id,text,,,
lg_id,text,,,
b_g,smallint,,,
b_ab,smallint,,,
b_r,smallint,,,
b_h,smallint,,,
b_2b,smallint,,,


# Pitching

In [58]:
os.chdir(p_raw)
pitching = pd.read_csv('Pitching.csv')

In [59]:
pitching.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'W', 'L', 'G', 'GS',
       'CG', 'SHO', 'SV', 'IPouts', 'H', 'ER', 'HR', 'BB', 'SO', 'BAOpp',
       'ERA', 'IBB', 'WP', 'HBP', 'BK', 'BFP', 'GF', 'R', 'SH', 'SF', 'GIDP'],
      dtype='object')

## Rename to Match Retrosheet
The following is from the RetrosheetBaseball Jupyter notebook.
```
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 'p_3b': 'triples allowed',
 'p_hr': 'home runs allowed',
 'p_bb': 'walks allowed',
 'p_ibb': 'intentional walks allowed',
 'p_so': 'strikeouts',
 'p_gdp': 'grounded into double play',
 'p_hp': 'hit batsmen',
 'p_sh': 'sacrifice hits against',
 'p_sf': 'sacrifice flies against',
 'p_xi': 'reached on interference',
 'p_wp': 'wild pitches',
 'p_bk': 'balks'
``` 

In [60]:
retro_names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'W':'p_w',
    'L':'p_l',
    'G':'p_g',
    'GS':'p_gs',
    'CG':'p_cg',
    'SHO':'p_sho',
    'SV':'p_sv',
    'IPouts':'p_outs',
    'H':'p_h',
    'ER':'p_er',
    'HR':'p_hr',
    'BB':'p_bb',
    'SO':'p_so',
    'BAOpp':'p_ba_opp', # not in retrosheet player_game
    'ERA':'p_era', # not in retrosheet player_game
    'IBB':'p_ibb',
    'WP':'p_wp',
    'HBP':'p_hp',
    'BK':'p_bk',
    'BFP':'p_bfp', # not in retrosheet player_game
    'GF':'p_gf', # not in retrosheet player_game
    'R':'p_r',
    'SH':'p_sh',
    'SF':'p_sf',
    'GIDP':'p_gdp'
        }

In [61]:
pitching.rename(columns=retro_names, inplace=True)

In [62]:
pitching.dtypes.astype(str).value_counts().sort_index()

float64     8
int64      19
object      3
dtype: int64

In [63]:
# certain stats are null only for old games
# this study will be from 1955 onward
pitching = pitching.drop(pitching[pitching['year_id'] < 1955].index)

In [64]:
pitching['year_id'].min(), pitching['year_id'].max()

(1955, 2018)

In [65]:
# if the pitcher recorded fewer than 3 outs for the entire year, drop the record
pitching = pitching.drop(pitching[pitching['p_outs'] < 3].index)

In [66]:
pitching_float = pitching.select_dtypes(include=['float'])

In [67]:
# after dropping old records and pitchers with fewer than 3 outs for the year
# several fields no longer contain nulls
pitching_float.isna().sum()

p_ba_opp       0
p_era          0
p_ibb          0
p_hp           0
p_bfp          0
p_sh        4579
p_sf        4579
p_gdp       5703
dtype: int64

In [68]:
# fields without null can be converted back to integer
s = pitching_float.isna().sum()
cols = s[s == 0].index.to_list()
cols

['p_ba_opp', 'p_era', 'p_ibb', 'p_hp', 'p_bfp']

In [69]:
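# NOTE: p_ba_opp and p_era are rates, not counts, so this cast truncates
# their fractional part; remove them from cols to keep the decimals
pitching[cols] = pitching[cols].astype('int64')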
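
Casting a float column to integer truncates any fractional part, which matters for rate statistics such as `p_era` and `p_ba_opp`. The sketch below uses made-up values to show the effect, along with one possible way (an assumption, not what this notebook does) to select only the columns whose values are all whole numbers.

```python
import pandas as pd

# made-up values: two rate statistics and one count stored as float
demo = pd.DataFrame({
    'p_era': [3.86, 2.05],
    'p_ba_opp': [0.251, 0.198],
    'p_ibb': [4.0, 0.0],
})

as_int = demo.astype('int64')
print(as_int)

# a column is safe to cast only if every value is already a whole number
safe = [col for col in demo.columns if (demo[col] % 1 == 0).all()]
print(safe)
```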

In [70]:
pitching.dtypes.astype(str).value_counts().sort_index()

float64     3
int64      24
object      3
dtype: int64

In [71]:
pitching = optimize_pandas_dtypes(pitching)

In [72]:
pitching.dtypes.astype(str).value_counts().sort_index()

category     2
float64      3
object       1
uint16       5
uint8       19
dtype: int64

### Persist as CSV with Column Types

In [73]:
os.chdir(p_wrangled)
to_csv_with_types(pitching, 'pitching')

### Persist as Postgres Table

In [74]:
dtype = optimize_database_dtypes(pitching)
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_w': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_l': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_gs': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_cg': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_sho': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_sv': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_outs': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_h': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_er': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_so': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_ba_opp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_era': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_ibb': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_wp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_bk': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_

In [75]:
pitching.to_sql('pitching', conn, if_exists='replace', index=False, dtype=dtype)

In [76]:
# verify unique
is_unique(pitching, ['player_id', 'year_id', 'stint'])

True

In [77]:
sql = 'ALTER TABLE pitching ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [78]:
psql(r'\d pitching')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
year_id,smallint,,not null,
stint,smallint,,not null,
team_id,text,,,
lg_id,text,,,
p_w,smallint,,,
p_l,smallint,,,
p_g,smallint,,,
p_gs,smallint,,,
p_cg,smallint,,,


# Fielding

In [79]:
os.chdir(p_raw)
fielding = pd.read_csv('Fielding.csv')

In [80]:
fielding.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'POS', 'G', 'GS',
       'InnOuts', 'PO', 'A', 'E', 'DP', 'PB', 'WP', 'SB', 'CS', 'ZR'],
      dtype='object')

In [81]:
fielding.columns = [convert_camel_case(name) for name in fielding.columns]
fielding.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'pos', 'g', 'gs',
       'inn_outs', 'po', 'a', 'e', 'dp', 'pb', 'wp', 'sb', 'cs', 'zr'],
      dtype='object')

In [82]:
# as above, drop records before 1955
fielding = fielding.drop(fielding[fielding['year_id'] < 1955].index)

In [83]:
# drop records in which the fielder was in the field for 0 outs (stats are meaningless)
fielding = fielding.drop(fielding[fielding['inn_outs'] == 0].index)

In [84]:
# drop records in which the time played in the field (inn_outs) is unknown
fielding = fielding.dropna(subset=['inn_outs'])

In [85]:
fielding.isna().sum()

player_id        0
year_id          0
stint            0
team_id          0
lg_id            0
pos              0
g                0
gs               0
inn_outs         0
po               0
a                0
e                0
dp               0
pb           82223
wp           87441
sb           82226
cs           82226
zr           87441
dtype: int64

#### Catcher Only Fields: pb, wp, sb, cs
These fields are only meaningful for catchers, so it is reasonable to fill the missing values with 0 using fillna(0).

In [86]:
fielding[['pb','wp','sb','cs']] = fielding[['pb','wp','sb','cs']].fillna(0)
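
A minimal sketch of the same fill on made-up rows, showing that only the missing values change and that the column can then hold integers again:

```python
import numpy as np
import pandas as pd

# made-up rows: passed balls (pb) are only recorded for the catcher
demo = pd.DataFrame({
    'pos': ['C', '1B', 'SS'],
    'pb':  [7.0, np.nan, np.nan],
})

catcher_cols = ['pb']
demo[catcher_cols] = demo[catcher_cols].fillna(0)
print(demo)

# with no missing values left, the column can be stored as integer
demo['pb'] = demo['pb'].astype('int64')
print(demo['pb'].dtype)
```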

In [87]:
# zr applies to all fielders, but is 99% null, so drop this column
fielding['zr'].isna().sum() / fielding.shape[0]

0.9874871540051271

In [88]:
fielding = fielding.drop('zr', axis=1)

In [89]:
# these float columns no longer have missing values
fielding_float = fielding.select_dtypes(['float'])
fielding_float.isna().sum()

gs          0
inn_outs    0
e           0
pb          0
wp          0
sb          0
cs          0
dtype: int64

In [90]:
# convert these columns back to integer
s = fielding_float.isna().sum()
cols = s[s == 0].index.to_list()
cols

['gs', 'inn_outs', 'e', 'pb', 'wp', 'sb', 'cs']

In [91]:
fielding[cols] = fielding[cols].astype('int64')

In [92]:
fielding.dtypes.astype(str).value_counts().sort_index()

int64     13
object     4
dtype: int64

In [93]:
fielding = optimize_pandas_dtypes(fielding)

In [94]:
fielding.dtypes.astype(str).value_counts().sort_index()

category    3
object      1
uint16      4
uint8       9
dtype: int64

### Persist as CSV with Column Types

In [95]:
os.chdir(p_wrangled)
to_csv_with_types(fielding, 'fielding')

### Persist as Postgres Table

In [96]:
dtype = optimize_database_dtypes(fielding)
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'g': sqlalchemy.sql.sqltypes.SmallInteger,
 'gs': sqlalchemy.sql.sqltypes.SmallInteger,
 'inn_outs': sqlalchemy.sql.sqltypes.SmallInteger,
 'po': sqlalchemy.sql.sqltypes.SmallInteger,
 'a': sqlalchemy.sql.sqltypes.SmallInteger,
 'e': sqlalchemy.sql.sqltypes.SmallInteger,
 'dp': sqlalchemy.sql.sqltypes.SmallInteger,
 'pb': sqlalchemy.sql.sqltypes.SmallInteger,
 'wp': sqlalchemy.sql.sqltypes.SmallInteger,
 'sb': sqlalchemy.sql.sqltypes.SmallInteger,
 'cs': sqlalchemy.sql.sqltypes.SmallInteger}

In [97]:
fielding.to_sql('fielding', conn, if_exists='replace', index=False, dtype=dtype)

In [98]:
is_unique(fielding, ['player_id', 'year_id', 'stint', 'pos'])

True

In [99]:
sql = 'ALTER TABLE fielding ADD PRIMARY KEY (player_id, year_id, stint, pos)'
conn.execute(sql);

In [100]:
psql(r'\d fielding')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
year_id,smallint,,not null,
stint,smallint,,not null,
team_id,text,,,
lg_id,text,,,
pos,text,,not null,
g,smallint,,,
gs,smallint,,,
inn_outs,smallint,,,
po,smallint,,,


### Note on Position

This is based on my MLB domain knowledge.

Players in recent years are increasingly playing more than one position in a single game, let alone in a single stint.

Catchers and Pitchers rarely play a position other than catcher or pitcher (except in exceedingly long extra inning games).

Usually, but not always, infielders play one of the infield positions.

Usually, but not always, outfielders play one of the outfield positions.

So although every player is listed as having a specific position, this position is not fixed.  It is likely that the position represents the position most often played by that player.

The Lahman csv file "Appearances" lists how often each player played at a particular position for a given year.

A 'stint' is a period of playing for one team within a season. If a player plays for 5 different teams in the same year, then the player has 5 stints.
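
Because each stint is a separate row, season totals for a player require adding up the stints. The following sketch uses made-up rows and the column names from the batting table above. The aggregation shown is one possible approach, not something this notebook performs.

```python
import pandas as pd

# made-up rows: player 'aaa01' played for two teams in 1960 (two stints)
demo = pd.DataFrame({
    'player_id': ['aaa01', 'aaa01', 'bbb01'],
    'year_id':   [1960, 1960, 1960],
    'stint':     [1, 2, 1],
    'team_id':   ['BOS', 'NYA', 'CHN'],
    'b_ab':      [200, 150, 500],
    'b_h':       [50, 45, 140],
})

# one row per player per season: count the stints and add up the counting stats
season = (demo.groupby(['player_id', 'year_id'])
              .agg(stints=('stint', 'count'), b_ab=('b_ab', 'sum'), b_h=('b_h', 'sum'))
              .reset_index())
print(season)
```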