# Lahman Baseball Data

There are two sources for the Lahman data.

**Sean Lahman**  
http://www.seanlahman.com/baseball-archive/statistics  
This download appears to be a snapshot taken the day before last season's opening day.

**Baseball Databank**  
https://github.com/chadwickbureau/baseballdatabank  
This is the latest data.  As of the time of this writing, it includes the 2018 season whereas the previous link does not.

Because the 2018 season is needed, the Baseball Databank is used here.

## Schema

The data will be saved to a Postgres database.

**Data Dictionary**  
http://www.seanlahman.com/files/database/readme2016.txt  
The schema is updated over time.  This makes having a fixed schema for this dataset somewhat of a problem as it has to be continually updated.

Rather than fix the schema by creating the Postgres tables first, each csv file will be read into Pandas, analyzed to find its unique key, then written directly to Postgres using df.to_sql().

Pandas uses SQLAlchemy behind the scenes, which chooses good (but not necessarily optimal) datatypes for the database tables.  These datatypes will be sufficient for this analysis.
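
As a rough illustration of letting `to_sql()` pick the column types, the sketch below writes a tiny, made-up DataFrame and then inspects the types that were chosen. It uses an in-memory SQLite database through the standard-library `sqlite3` module purely so the example is self-contained; the type names Postgres receives are different, and the table name and values here are illustrative only.

```python
import sqlite3
import pandas as pd

# made-up rows that mimic a few Lahman-style columns
df = pd.DataFrame({
    'player_id': ['playera01', 'playerb01'],
    'year_id': [1955, 2018],
    'avg': [0.314, 0.256],
})

con = sqlite3.connect(':memory:')
df.to_sql('demo', con, if_exists='replace', index=False)

# PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column
inferred = {row[1]: row[2] for row in con.execute('PRAGMA table_info(demo)')}
print(inferred)
con.close()
```

No table was created beforehand: the schema came entirely from the DataFrame's dtypes.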

### Create Directories
* ~/data/lahman/raw  
* ~/data/lahman/processed  

In [1]:
import pandas as pd
import numpy as np

In [2]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [3]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_raw = lahman.joinpath('raw')
p_processed = lahman.joinpath('processed')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_processed.mkdir(parents=True, exist_ok=True)

In [4]:
# change to raw file directory
os.chdir(p_raw)

# download zip file from github
url = 'https://github.com/chadwickbureau/baseballdatabank/archive/master.zip'
wget.download(url)

# unzip it
with zipfile.ZipFile('baseballdatabank-master.zip', "r") as zip_ref:
    zip_ref.extractall()

In [5]:
import shutil

unzip_dir = p_raw.joinpath('baseballdatabank-master/core')

# move the unzipped csv files to the current working directory
os.chdir(p_raw)
for root, dirs, files in os.walk(unzip_dir):
    for file in files:
        shutil.move(root+'/'+file, '.')
        
# rm the extract directory
shutil.rmtree('baseballdatabank-master')

# rm the zip file
os.remove('baseballdatabank-master.zip')
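
The extract, move, and delete sequence above can also be done in one pass by reading members straight out of the archive. The following is a minimal sketch, not the notebook's method: it builds a small stand-in archive in a temporary directory (so it runs without network access) and copies only the files under `core/` into a flat target directory. The helper name `extract_flat` is hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def extract_flat(zip_path, prefix, dest):
    """Copy every file under `prefix` in the archive into `dest`, dropping subdirectories."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.startswith(prefix):
                continue
            target = dest / Path(info.filename).name
            with zf.open(info) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            written.append(target.name)
    return sorted(written)

# stand-in archive with the same layout as the GitHub download
tmp = Path(tempfile.mkdtemp())
archive = tmp / 'baseballdatabank-master.zip'
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('baseballdatabank-master/core/People.csv', 'playerID\n')
    zf.writestr('baseballdatabank-master/core/Batting.csv', 'playerID\n')
    zf.writestr('baseballdatabank-master/README.md', 'not wanted\n')

names = extract_flat(archive, 'baseballdatabank-master/core/', tmp / 'raw')
print(names)
```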

In [6]:
# verify the current directory (p_raw) has the csv files
sorted(os.listdir())

['AllstarFull.csv',
 'Appearances.csv',
 'AwardsManagers.csv',
 'AwardsPlayers.csv',
 'AwardsShareManagers.csv',
 'AwardsSharePlayers.csv',
 'Batting.csv',
 'BattingPost.csv',
 'CollegePlaying.csv',
 'Fielding.csv',
 'FieldingOF.csv',
 'FieldingOFsplit.csv',
 'FieldingPost.csv',
 'HallOfFame.csv',
 'HomeGames.csv',
 'Managers.csv',
 'ManagersHalf.csv',
 'Parks.csv',
 'People.csv',
 'Pitching.csv',
 'PitchingPost.csv',
 'Salaries.csv',
 'Schools.csv',
 'SeriesPost.csv',
 'Teams.csv',
 'TeamsFranchises.csv',
 'TeamsHalf.csv',
 'readme2014.txt']

## Helpful Methods for Working with DB

In [7]:
from sqlalchemy.engine import create_engine
from IPython.display import HTML, display

%reload_ext sql

### Connect to DB

In [8]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/lahman'

# connect
conn = create_engine(connect_str)
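
One caveat with building the URL by string formatting: a password containing characters such as `@` or `/` has to be URL-escaped, or the URL will be parsed incorrectly. A small sketch using only the standard library follows; the helper name `make_connect_str` and the sample credentials are made up for illustration.

```python
from urllib.parse import quote_plus

def make_connect_str(user, password, host='localhost', port=5432, db='lahman'):
    """Build a postgresql:// URL with the user and password URL-escaped."""
    return f'postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}'

# sample values only; real credentials should still come from the environment
url = make_connect_str('postgres', 'p@ss/word')
print(url)
```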

### **psql**

Note, there must be a ~/.pgpass file similar to the following to connect without a password:  
```localhost:5432:*:<user>:<passwd>```
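
libpq ignores a password file whose permissions allow group or world access, so on Unix-like systems the file is normally created with mode 0600. The sketch below writes such a file to a temporary location rather than to the real `~/.pgpass`; the function name and the placeholder credentials are assumptions for illustration.

```python
import os
import stat
import tempfile
from pathlib import Path

def write_pgpass(path, host, port, database, user, password):
    """Write one pgpass entry and restrict the file to its owner (mode 0600)."""
    path = Path(path)
    path.write_text(f'{host}:{port}:{database}:{user}:{password}\n')
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path

# demo target in a temp directory; the real file would be ~/.pgpass
demo = write_pgpass(Path(tempfile.mkdtemp()) / '.pgpass',
                    'localhost', 5432, '*', 'some_user', 'some_password')
print(demo.read_text().strip())
print(oct(stat.S_IMODE(demo.stat().st_mode)))
```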

In [9]:
# -H for html output
# this connects, executes, and disconnects
def psql(cmd, user='postgres', schema='lahman'):
    psql_out = !psql -H -U {user} {schema} -c "{cmd}"
    display(HTML(''.join(psql_out)))
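
The `!psql` shell escape only works inside IPython. Outside a notebook, roughly the same call can be made with `subprocess`. The sketch below only builds the argument list (the `psql_args` helper is hypothetical) and shows, commented out, how it might be run. Passing a list avoids shell quoting problems with commands such as `\d people`.

```python
import subprocess

def psql_args(cmd, user='postgres', database='lahman'):
    """Argument list equivalent to: psql -H -U <user> <database> -c "<cmd>"."""
    return ['psql', '-H', '-U', user, database, '-c', cmd]

args = psql_args(r'\d people')
print(args)

# To execute (requires psql on the PATH and a reachable server):
# html = subprocess.run(args, capture_output=True, text=True, check=True).stdout
```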

### CamelCase to snake_case

Postgres is easier to use without caps in the column names.

Also, column names should not start with a number.

https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

In [10]:
# CamelCase to snake_case
def convert_camel_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [13]:
convert_camel_case('playerID')

'player_id'
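
A few more conversions show how the two regular expressions behave on actual Lahman column names, including the all-caps and digit-leading cases. The function is repeated so the block runs on its own; `safe_column_name` and its `prefix` argument are hypothetical additions that illustrate one way to keep a name from starting with a digit.

```python
import re

def convert_camel_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

def safe_column_name(name, prefix='x_'):
    """snake_case the name, then prefix it if it would start with a digit."""
    snake = convert_camel_case(name)
    return prefix + snake if snake[0].isdigit() else snake

for col in ['birthYear', 'lgID', 'IPouts', 'BAOpp', 'GIDP', '2B']:
    print(col, '->', convert_camel_case(col), '|', safe_column_name(col))
```

Names such as `IPouts` and `2B` do not convert cleanly, which is one reason to rename some tables with an explicit dictionary instead.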

### Is Unique over Multiple Columns

In [15]:
def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()
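
As a quick illustration with made-up rows, a player traded mid-season appears twice for the same year, so `(player_id, year_id)` is not a key until `stint` is included. The last lines show one way to list the offending rows when a check fails.

```python
import pandas as pd

def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()

# made-up rows: the first player has two stints in 2018
df = pd.DataFrame({
    'player_id': ['playera01', 'playera01', 'playerb01'],
    'year_id': [2018, 2018, 2018],
    'stint': [1, 2, 1],
})

print(is_unique(df, ['player_id', 'year_id']))           # False
print(is_unique(df, ['player_id', 'year_id', 'stint']))  # True

# keep=False marks every member of a duplicated group, not just the repeats
dupes = df[df.duplicated(subset=['player_id', 'year_id'], keep=False)]
print(dupes)
```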

### Note on Datetime Null Values

In [16]:
a = pd.NaT
b = np.nan
print(pd.isna(a))
print(pd.isna(b))

True
True


## Removing Unwanted Connections
When experimenting, it is easy to leave database connections open.

Assuming you are the only one using the database, it can be helpful to close all connections except the current connection.

The following is from:  
https://stackoverflow.com/questions/5108876/kill-a-postgresql-session-connection

In [18]:
%sql {connect_str}

'Connected: postgres@lahman'

In [19]:
%%sql
-- kill all pids except for the current connection
SELECT 
    pg_terminate_backend(pid) 
FROM 
    pg_stat_activity 
WHERE 
    -- don't kill my own connection!
    pid <> pg_backend_pid()
    -- don't kill the connections to other databases
    AND datname = current_database()
;

 * postgresql://postgres:***@localhost:5432/lahman
0 rows affected.


pg_terminate_backend


# Main Files
As per:  
http://www.seanlahman.com/files/database/readme2016.txt

After readme2016.txt was written, the Master table was renamed to People.

The 4 main files are:
*  People   - Player names, DOB, and biographical info
*  Batting  - batting statistics
*  Pitching - pitching statistics
*  Fielding - fielding statistics

# People

In [20]:
people = pd.read_csv('People.csv', parse_dates=['debut', 'finalGame'])

In [21]:
people.columns

Index(['playerID', 'birthYear', 'birthMonth', 'birthDay', 'birthCountry',
       'birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay',
       'deathCountry', 'deathState', 'deathCity', 'nameFirst', 'nameLast',
       'nameGiven', 'weight', 'height', 'bats', 'throws', 'debut', 'finalGame',
       'retroID', 'bbrefID'],
      dtype='object')

In [22]:
people.columns = [convert_camel_case(name) for name in people.columns]
people.columns

Index(['player_id', 'birth_year', 'birth_month', 'birth_day', 'birth_country',
       'birth_state', 'birth_city', 'death_year', 'death_month', 'death_day',
       'death_country', 'death_state', 'death_city', 'name_first', 'name_last',
       'name_given', 'weight', 'height', 'bats', 'throws', 'debut',
       'final_game', 'retro_id', 'bbref_id'],
      dtype='object')

In [23]:
# custom parsing of dates
def lahman_to_date(row, prefix):
    y = row[prefix + '_year']
    m = row[prefix + '_month']
    d = row[prefix + '_day']
    
    # NaT if year is missing
    if pd.isna(y):
        return pd.NaT
    
    # fillna if year present but month missing
    if pd.isna(m):
        m = 1
        
    # fillna if year present but day missing
    if pd.isna(d):
        d = 1
        
    # pd.datetime was removed from pandas; pd.Timestamp takes the same arguments
    return pd.Timestamp(int(y), int(m), int(d))
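
To make the fill-in rules concrete, the sketch below runs the same logic on four made-up rows: a complete date, a missing day, a missing month and day, and a missing year. The function is restated compactly with `pd.Timestamp` so the block runs on its own with current pandas versions.

```python
import numpy as np
import pandas as pd

def lahman_to_date(row, prefix):
    y, m, d = (row[f'{prefix}_{part}'] for part in ('year', 'month', 'day'))
    if pd.isna(y):
        return pd.NaT            # no year: the date is unknown
    m = 1 if pd.isna(m) else m   # missing month defaults to January
    d = 1 if pd.isna(d) else d   # missing day defaults to the 1st
    return pd.Timestamp(int(y), int(m), int(d))

demo = pd.DataFrame({
    'birth_year':  [1934, 1895, 1860, np.nan],
    'birth_month': [2, 2, np.nan, 5],
    'birth_day':   [5, np.nan, np.nan, 17],
})
dates = demo.apply(lambda r: lahman_to_date(r, 'birth'), axis=1)
print(dates)
```

Note that a filled-in date is indistinguishable from a genuine first-of-the-month date once it is stored.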

In [24]:
people['birth_date'] = people.apply(lambda x: lahman_to_date(x, 'birth'), axis=1)
people['death_date'] = people.apply(lambda x: lahman_to_date(x, 'death'), axis=1)

In [25]:
people = people.drop(
    ['birth_year', 'birth_month', 'birth_day', 
     'death_year', 'death_month', 'death_day'], axis=1)

In [26]:
# will be used with Retrosheet data, so retroID cannot be null
# retroID appears to be null only for some players who played long ago
people = people.dropna(subset=['retro_id'], axis=0)

In [27]:
# verify uniqueness
print(people['player_id'].is_unique)
print(people['retro_id'].is_unique)

True
True


In [28]:
# replace the table if it exists
people.to_sql('people', conn, if_exists='replace', index=False)

In [29]:
# add primary key, unique and not null constraints
sql   = 'ALTER TABLE lahman.public.people ADD PRIMARY KEY (player_id)'
conn.execute(sql)

sql = 'ALTER TABLE lahman.public.people ADD CONSTRAINT retro_unique UNIQUE (retro_id)'
conn.execute(sql)

sql = 'ALTER TABLE lahman.public.people ALTER COLUMN retro_id SET NOT NULL'
conn.execute(sql);
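
Once such constraints exist, the database rejects rows that violate them. The sketch below demonstrates the effect with an in-memory SQLite database so that it runs anywhere. SQLite has no `ALTER TABLE ... ADD PRIMARY KEY`, so a unique index stands in for the key here, and the rows are made up.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE people (player_id TEXT NOT NULL, retro_id TEXT NOT NULL)')
# stand-in for the primary key added with ALTER TABLE in Postgres
con.execute('CREATE UNIQUE INDEX people_pk ON people (player_id)')
con.execute("INSERT INTO people VALUES ('playera01', 'playa001')")

try:
    # same player_id again: rejected by the unique index
    con.execute("INSERT INTO people VALUES ('playera01', 'playa002')")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print('rejected:', exc)

row_count = con.execute('SELECT COUNT(*) FROM people').fetchone()[0]
print(row_count)
con.close()
```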

In [30]:
# describe the table
psql(r'\d people')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
birth_country,text,,,
birth_state,text,,,
birth_city,text,,,
death_country,text,,,
death_state,text,,,
death_city,text,,,
name_first,text,,,
name_last,text,,,
name_given,text,,,


# Batting

In [106]:
batting = pd.read_csv('Batting.csv')

In [114]:
batting.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP'],
      dtype='object')

## Rename to use retrosheet names for the corresponding fields

The following is from the RetrosheetBaseball Jupyter notebook.
```
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference', 
```

In [115]:
names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'G':'b_g',
    'AB':'b_ab',
    'R':'b_r',
    'H':'b_h',
    '2B':'b_2b',
    '3B':'b_3b',
    'HR':'b_hr',
    'RBI':'b_rbi',
    'SB':'b_sb',
    'CS':'b_cs',
    'BB':'b_bb',
    'SO':'b_so',
    'IBB':'b_ibb',
    'HBP':'b_hp',
    'SH':'b_sh',
    'SF':'b_sf',
    'GIDP':'b_gdp'
}

In [117]:
batting.rename(columns=names, inplace=True)
batting.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'b_g', 'b_ab',
       'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb',
       'b_so', 'b_ibb', 'b_hp', 'b_sh', 'b_sf', 'b_gdp'],
      dtype='object')

In [118]:
# certain stats are null only for old games
# this study will be from 1955 onward
batting = batting.drop(batting[batting['year_id'] < 1955].index)

In [119]:
batting['year_id'].min()

1955

In [121]:
# batting stats for a year with 0 at-bats are meaningless
# (for example, a player used only as a pinch-runner has no at-bats)
batting = batting.drop(batting[batting['b_ab'] == 0].index)

In [122]:
# these are integers, but had NA, so were converted to float
batting_float = batting.select_dtypes(include=['float']).copy()
batting_float.columns

Index(['b_rbi', 'b_sb', 'b_cs', 'b_so', 'b_ibb', 'b_hp', 'b_sh', 'b_sf',
       'b_gdp'],
      dtype='object')

In [123]:
# after removing years < 1955, there are no longer any null values
batting_float.isna().sum()

b_rbi    0
b_sb     0
b_cs     0
b_so     0
b_ibb    0
b_hp     0
b_sh     0
b_sf     0
b_gdp    0
dtype: int64

In [124]:
batting_numeric = batting.select_dtypes(include=[np.number])

In [125]:
# pandas will downcast as far as the data allows
batting_numeric = batting_numeric.apply(pd.to_numeric,downcast='unsigned')
batting_numeric.dtypes.value_counts()

uint8     16
uint16     3
dtype: int64
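
To see what `downcast='unsigned'` does in isolation, the sketch below applies it to three made-up columns. Whole-number floats are converted to the smallest unsigned integer type that holds them, while a column that still contains NaN is left as float64, which is why the null check above matters.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'small': [0.0, 12.0, 255.0],     # fits in uint8
    'large': [0.0, 300.0, 1955.0],   # needs uint16
    'holes': [1.0, np.nan, 3.0],     # NaN cannot be stored in an integer dtype
})

down = demo.apply(pd.to_numeric, downcast='unsigned')
print(down.dtypes)
```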

In [126]:
batting[batting_numeric.columns] = batting_numeric

In [127]:
batting.dtypes.value_counts()

uint8     16
object     3
uint16     3
dtype: int64

In [128]:
batting_obj = batting.select_dtypes(include='object')
batting_obj.columns

Index(['player_id', 'team_id', 'lg_id'], dtype='object')

In [129]:
batting_obj.nunique()

player_id    9410
team_id        42
lg_id           2
dtype: int64

In [130]:
batting[['team_id', 'lg_id']] = batting_obj[['team_id', 'lg_id']].astype('category')

In [131]:
batting.dtypes.value_counts()

uint8       16
uint16       3
object       1
category     1
category     1
dtype: int64

In [132]:
from sqlalchemy.types import SmallInteger

In [133]:
# SmallInteger is not deduced from dataframe column type
dtype = {c:SmallInteger for c in batting.select_dtypes(include=np.integer).columns}

In [134]:
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_ab': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_r': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_h': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_2b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_3b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_rbi': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_cs': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_so': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_ibb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sh': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sf': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_gdp': sqlalchemy.sql.sqltypes.SmallInteger}
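
Postgres `smallint` is a signed 16-bit type (-32768 to 32767), whereas `uint16` can hold values up to 65535, so the mapping is only safe if the data stays within the signed range. A minimal check, assuming a hypothetical helper `fits_smallint` and made-up values:

```python
import numpy as np
import pandas as pd

SMALLINT_MIN, SMALLINT_MAX = -32768, 32767  # Postgres smallint bounds

def fits_smallint(df):
    """Per numeric column: True if the column's min and max lie in the smallint range."""
    result = {}
    for col in df.select_dtypes(include=np.number).columns:
        # convert to plain floats so the comparison does not depend on the column dtype
        lo, hi = float(df[col].min()), float(df[col].max())
        result[col] = bool(SMALLINT_MIN <= lo and hi <= SMALLINT_MAX)
    return result

demo = pd.DataFrame({
    'year_id': np.array([1955, 2018], dtype='uint16'),
    'b_ab': np.array([0, 716], dtype='uint16'),
    'too_big': np.array([10, 40000], dtype='uint16'),
})
print(fits_smallint(demo))
```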

In [135]:
batting.to_sql('batting', conn, if_exists='replace', index=False, dtype=dtype)

In [136]:
psql(r'\d batting')

Column,Type,Collation,Nullable,Default
player_id,text,,,
year_id,smallint,,,
stint,smallint,,,
team_id,text,,,
lg_id,text,,,
b_g,smallint,,,
b_ab,smallint,,,
b_r,smallint,,,
b_h,smallint,,,
b_2b,smallint,,,


In [137]:
# verify unique
is_unique(batting, ['player_id', 'year_id', 'stint'])

True

In [138]:
sql = 'ALTER TABLE lahman.public.batting ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [139]:
psql(r'\d batting')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
year_id,smallint,,not null,
stint,smallint,,not null,
team_id,text,,,
lg_id,text,,,
b_g,smallint,,,
b_ab,smallint,,,
b_r,smallint,,,
b_h,smallint,,,
b_2b,smallint,,,


# Pitching

In [144]:
pitching = pd.read_csv('Pitching.csv')

In [145]:
pitching.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'W', 'L', 'G', 'GS',
       'CG', 'SHO', 'SV', 'IPouts', 'H', 'ER', 'HR', 'BB', 'SO', 'BAOpp',
       'ERA', 'IBB', 'WP', 'HBP', 'BK', 'BFP', 'GF', 'R', 'SH', 'SF', 'GIDP'],
      dtype='object')

## Rename to Match Retrosheet
```
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 'p_3b': 'triples allowed',
 'p_hr': 'home runs allowed',
 'p_bb': 'walks allowed',
 'p_ibb': 'intentional walks allowed',
 'p_so': 'strikeouts',
 'p_gdp': 'grounded into double play',
 'p_hp': 'hit batsmen',
 'p_sh': 'sacrifice hits against',
 'p_sf': 'sacrifice flies against',
 'p_xi': 'reached on interference',
 'p_wp': 'wild pitches',
 'p_bk': 'balks'
``` 

In [146]:
names = {
    'playerID':'player_id',
    'yearID':'year_id',
    'teamID':'team_id',
    'lgID':'lg_id',
    'W':'p_w',
    'L':'p_l',
    'G':'p_g',
    'GS':'p_gs',
    'CG':'p_cg',
    'SHO':'p_sho',
    'SV':'p_sv',
    'IPouts':'p_outs',
    'H':'p_h',
    'ER':'p_er',
    'HR':'p_hr',
    'BB':'p_bb',
    'SO':'p_so',
    'BAOpp':'p_ba_opp', # not in retrosheet player_game
    'ERA':'p_era', # not in retrosheet player_game
    'IBB':'p_ibb',
    'WP':'p_wp',
    'HBP':'p_hp',
    'BK':'p_bk',
    'BFP':'p_bfp', # not in retrosheet player_game
    'GF':'p_gf', # not in retrosheet
    'R':'p_r',
    'SH':'p_sh',
    'SF':'p_sf',
    'GIDP':'p_gdp'
        }

In [150]:
pitching.rename(columns=names, inplace=True)

In [151]:
pitching.dtypes.value_counts()

int64      19
float64     8
object      3
dtype: int64

In [152]:
# certain stats are null only for old games
# this study will be from 1955 onward
pitching = pitching.drop(pitching[pitching['year_id'] < 1955].index)

In [153]:
pitching['year_id'].min(), pitching['year_id'].max()

(1955, 2018)

In [157]:
# if the pitcher recorded fewer than 3 outs for the entire year, drop the record
pitching = pitching.drop(pitching[pitching['p_outs'] < 3].index)

In [158]:
# np.float was removed from NumPy; the string 'float' selects the same columns
pitching_float = pitching.select_dtypes(include=['float'])

In [159]:
pitching_float.isna().sum()

p_ba_opp       0
p_era          0
p_ibb          0
p_hp           0
p_bfp          0
p_sh        4579
p_sf        4579
p_gdp       5703
dtype: int64

In [160]:
# find highest year that has a null value
for col in ['p_sh','p_sf','p_gdp']:
    print(col, pitching[pitching[col].isna()]['year_id'].max())

p_sh 1969
p_sf 1969
p_gdp 1972


In [161]:
# smallint is wide enough for every count column (p_ba_opp and p_era are fractional rates)
pitching.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year_id,31879.0,1992.65413,17.856196,1955.0,1979.0,1996.0,2008.0,2018.0
stint,31879.0,1.084476,0.292722,1.0,1.0,1.0,1.0,5.0
p_w,31879.0,4.075567,4.788235,0.0,0.0,2.0,6.0,31.0
p_l,31879.0,4.074971,4.174239,0.0,1.0,3.0,6.0,24.0
p_g,31879.0,25.819881,19.413978,1.0,9.0,23.0,36.0,106.0
p_gs,31879.0,8.157972,11.582567,0.0,0.0,1.0,14.0,49.0
p_cg,31879.0,1.069231,2.92353,0.0,0.0,0.0,0.0,30.0
p_sho,31879.0,0.274256,0.829932,0.0,0.0,0.0,0.0,13.0
p_sv,31879.0,1.862198,5.917081,0.0,0.0,0.0,1.0,62.0
p_outs,31879.0,218.957401,208.802239,3.0,52.0,154.0,317.0,1130.0


In [162]:
# p_ba_opp and p_era are fractional rates, so they keep the default float column type
rates = ['p_ba_opp', 'p_era']
dtype = {col:SmallInteger for col in pitching.select_dtypes(include=np.number).columns
         if col not in rates}

In [163]:
dtype

{'year_id': sqlalchemy.sql.sqltypes.SmallInteger,
 'stint': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_w': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_l': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_gs': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_cg': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_sho': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_sv': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_outs': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_h': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_er': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_so': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_ibb': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_wp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_bk': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_

In [164]:
pitching.to_sql('pitching', conn, if_exists='replace', index=False, dtype=dtype)

In [165]:
# verify unique
is_unique(pitching, ['player_id', 'year_id', 'stint'])

True

In [166]:
sql = 'ALTER TABLE lahman.public.pitching ADD PRIMARY KEY (player_id, year_id, stint)'
conn.execute(sql);

In [167]:
psql(r'\d pitching')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
year_id,smallint,,not null,
stint,smallint,,not null,
team_id,text,,,
lg_id,text,,,
p_w,smallint,,,
p_l,smallint,,,
p_g,smallint,,,
p_gs,smallint,,,
p_cg,smallint,,,


In [168]:
pitching.shape

(31879, 30)

# Fielding

In [74]:
fielding = pd.read_csv('Fielding.csv')

In [75]:
fielding.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'POS', 'G', 'GS',
       'InnOuts', 'PO', 'A', 'E', 'DP', 'PB', 'WP', 'SB', 'CS', 'ZR'],
      dtype='object')

In [76]:
fielding.columns = [convert_camel_case(name) for name in fielding.columns]
fielding.columns

Index(['player_id', 'year_id', 'stint', 'team_id', 'lg_id', 'pos', 'g', 'gs',
       'inn_outs', 'po', 'a', 'e', 'dp', 'pb', 'wp', 'sb', 'cs', 'zr'],
      dtype='object')

In [77]:
# as above, drop records before 1955
fielding = fielding.drop(fielding[fielding['year_id'] < 1955].index)

In [78]:
# drop records in which the fielder spent no time in the field (stats are meaningless)
fielding = fielding.drop(fielding[fielding['inn_outs'] == 0].index)

# drop records in which inn_outs (time in the field, expressed as outs) is unknown
fielding = fielding.dropna(subset=['inn_outs'])

In [79]:
fielding.isna().sum()

player_id        0
year_id          0
stint            0
team_id          0
lg_id            0
pos              0
g                0
gs               0
inn_outs         0
po               0
a                0
e                0
dp               0
pb           82223
wp           87441
sb           82226
cs           82226
zr           87441
dtype: int64

In [80]:
# pb, wp, sb, cs only apply to catchers, hence the large number of nulls
# it is reasonable to use fillna with 0
fielding[['pb','wp','sb','cs']] = fielding[['pb','wp','sb','cs']].fillna(0)

In [81]:
# zr applies to all fielders, but is about 99% null, so drop this column
fielding['zr'].isna().sum() / fielding.shape[0]

0.9874871540051271

In [82]:
fielding = fielding.drop('zr', axis=1)

In [83]:
fielding.isna().sum()

player_id    0
year_id      0
stint        0
team_id      0
lg_id        0
pos          0
g            0
gs           0
inn_outs     0
po           0
a            0
e            0
dp           0
pb           0
wp           0
sb           0
cs           0
dtype: int64

In [84]:
# smallint works for all numeric values
fielding.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year_id,88549.0,1991.171781,17.84377,1955.0,1977.0,1993.0,2007.0,2018.0
stint,88549.0,1.079493,0.285141,1.0,1.0,1.0,1.0,5.0
g,88549.0,36.264791,40.943583,1.0,6.0,21.0,50.0,165.0
gs,88549.0,26.432585,40.521686,0.0,0.0,7.0,32.0,164.0
inn_outs,88549.0,709.476437,1061.538035,1.0,57.0,222.0,742.0,4469.0
po,88549.0,78.821918,178.10899,0.0,2.0,9.0,65.0,1682.0
a,88549.0,31.018363,75.266838,0.0,1.0,6.0,21.0,621.0
e,88549.0,2.213407,4.100843,0.0,0.0,1.0,2.0,44.0
dp,88549.0,7.344984,20.106666,0.0,0.0,1.0,3.0,182.0
pb,88549.0,0.236152,1.347472,0.0,0.0,0.0,0.0,35.0


In [85]:
dtype = {col:SmallInteger for col in fielding.select_dtypes(include=np.number).columns}

In [86]:
fielding.to_sql('fielding', conn, if_exists='replace', index=False, dtype=dtype)

In [87]:
is_unique(fielding, ['player_id', 'year_id', 'stint', 'pos'])

True

### Note on Position

This is based on my MLB domain knowledge.

Players in recent years are increasingly playing more than one position in a single game, let alone in a single stint.

Catchers and pitchers rarely play any other position (except in exceedingly long extra-inning games).

Usually, but not always, infielders play one of the infield positions.

Usually, but not always, outfielders play one of the outfield positions.

So although every player is listed as having a specific position, this position is not fixed.  It is likely that the position represents the position most often played by that player.

A 'stint' means playing for 1 team. If a player plays for 5 different teams in the same year, then the player has 5 stints.
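
Because each stint is its own row, season totals for a traded player have to be summed across stints. A small sketch with made-up rows (the column names follow the renamed fielding table above):

```python
import pandas as pd

# made-up rows: one player with two teams in the same season, one single-team player
fielding_demo = pd.DataFrame({
    'player_id': ['playera01', 'playera01', 'playerb01'],
    'year_id': [2018, 2018, 2018],
    'stint': [1, 2, 1],
    'pos': ['SS', 'SS', 'C'],
    'g': [60, 45, 110],
    'po': [80, 70, 700],
})

# one row per player, year and position, with the number of stints combined
season = (fielding_demo
          .groupby(['player_id', 'year_id', 'pos'], as_index=False)
          .agg(stints=('stint', 'nunique'), g=('g', 'sum'), po=('po', 'sum')))
print(season)
```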

In [88]:
fielding['pos'].value_counts()

P     32075
OF    17211
1B     9156
3B     8922
2B     7959
SS     6900
C      6326
Name: pos, dtype: int64

In [89]:
fielding['stint'].value_counts(normalize=True)

1    0.924403
2    0.071859
3    0.003591
4    0.000136
5    0.000011
Name: stint, dtype: float64

In [90]:
sql = 'ALTER TABLE lahman.public.fielding ADD PRIMARY KEY (player_id, year_id, stint, pos)'
conn.execute(sql);

In [91]:
psql(r'\d fielding')

Column,Type,Collation,Nullable,Default
player_id,text,,not null,
year_id,smallint,,not null,
stint,smallint,,not null,
team_id,text,,,
lg_id,text,,,
pos,text,,not null,
g,smallint,,,
gs,smallint,,,
inn_outs,smallint,,,
po,smallint,,,
