# Retrosheet Baseball Data -- Persist to Postgres

**Baseball Notebooks**  
1. Download and unzip the baseball data.
2. Helper functions and the motivation for using them.
3. Wrangle and persist the Lahman data.
4. Parse the Retrosheet play-by-play data, collect it into 2 DataFrames, and persist it.
5. Wrangle the Retrosheet data in preparation for data analysis.
6. This notebook.

Load the wrangled data into Postgres.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

In [1]:
import pandas as pd
import numpy as np
import os
import re
from pathlib import Path

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects -- these directories were created in the previous notebook
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_wrangled = retrosheet.joinpath('wrangled')

## Working with Postgres

1. The Postgres server should be installed and running.
2. The retrosheet schema should exist.
3. The ipython/Jupyter Lab SQL magic extension should be installed. See:  
https://github.com/catherinedevlin/ipython-sql

In [4]:
from sqlalchemy.engine import create_engine
from sqlalchemy.types import SmallInteger, Integer, BigInteger
from IPython.display import HTML, display

### Connect to DB

conn = create_engine(connect_str)

Calling conn.execute(query), with conn created as above (i.e. as a SQLAlchemy engine), will:
1. allocate a DB connection object from the connection pool
2. use that connection object to execute the query
3. commit any data changes
4. release that connection object back to the connection pool

This describes SQLAlchemy 1.x; Engine.execute() was deprecated in version 1.4 and removed in version 2.0.

For transaction processing with SQLAlchemy, which sits on top of the Python DB API, use an explicit connection:  
```connection = create_engine(connect_str).connect()```
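
The difference between implicit commits and an explicit transaction can be sketched as follows. This is a minimal sketch rather than code from this notebook: it assumes SQLAlchemy 1.4 or later, uses an in-memory SQLite database so that it runs without a Postgres server, and the helper name `run_in_transaction` and the table `t` are hypothetical. `engine.begin()` opens a connection, commits if the block succeeds, and rolls back if it raises.

```python
from sqlalchemy import create_engine, text


def run_in_transaction(engine, statements):
    """Execute all statements in one transaction: commit all or roll back all."""
    # engine.begin() yields a connection inside a transaction; it commits on
    # normal exit and rolls back if an exception propagates out of the block
    with engine.begin() as connection:
        for statement in statements:
            connection.execute(text(statement))


# in-memory SQLite stands in for Postgres here (assumption, for illustration only)
engine = create_engine("sqlite://")
run_in_transaction(engine, ["CREATE TABLE t (x INTEGER)",
                            "INSERT INTO t VALUES (1)"])

try:
    # the second statement fails, so the first insert is rolled back as well
    run_in_transaction(engine, ["INSERT INTO t VALUES (2)",
                                "INSERT INTO missing_table VALUES (1)"])
except Exception as exc:
    print("rolled back:", type(exc).__name__)

with engine.connect() as connection:
    row_count = connection.execute(text("SELECT COUNT(*) FROM t")).scalar()
print(row_count)
```

Against Postgres, the same helper would be called with the engine created from `connect_str`.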

In [5]:
# get the user and password from the environment rather than hardcoding them
# (os was already imported in the first cell)
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/baseball'

# treat the SQLAlchemy engine as a connection to the database
conn = create_engine(connect_str)
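
If the password contains characters such as `@`, `:` or `/`, the URL above would be parsed incorrectly, and a missing environment variable would silently become the string `None`. A minimal sketch of a more defensive builder follows. The function name `build_connect_str` and its parameters are hypothetical, and `quote_plus` from the standard library is used to escape the credentials.

```python
from urllib.parse import quote_plus


def build_connect_str(user, password, host="localhost", port=5432, database="baseball"):
    """Return a postgresql:// URL with the credentials percent-encoded."""
    if not user or not password:
        # os.environ.get() returns None when the variable is not set
        raise ValueError("DB_USER and DB_PASS must both be set")
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


# example with a made-up password that contains URL-special characters
example_url = build_connect_str("postgres", "p@ss:word/1")
print(example_url)
```

In the cell above, this could be called as `build_connect_str(os.environ.get('DB_USER'), os.environ.get('DB_PASS'))`.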

## psql

Use the following to run psql commands from a Jupyter Code cell.

This will connect, execute, and disconnect from the database.

For this to work without a password, a .pgpass file is necessary.  
See: https://www.postgresql.org/docs/11/libpq-pgpass.html    

The .pgpass file should look like:  
```localhost:5432:*:<user>:<passwd>```
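
The general format of each line is `hostname:port:database:username:password`, and on Unix-like systems libpq ignores the file unless its permissions are `0600`. A minimal sketch for creating such a file from Python follows. The function names `pgpass_line` and `write_pgpass` and the example credentials are hypothetical, and the demo writes to a temporary directory rather than to the real `~/.pgpass`.

```python
import stat
import tempfile
from pathlib import Path


def pgpass_line(host, port, database, user, password):
    """Build one .pgpass entry, escaping colons and backslashes inside each field."""
    def escape(field):
        return str(field).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(escape(f) for f in (host, port, database, user, password))


def write_pgpass(path, line):
    """Write a single entry and restrict the file to owner read/write (0600)."""
    path = Path(path)
    path.write_text(line + "\n")
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return path


# written to a temporary directory here; the real file is ~/.pgpass
demo_dir = tempfile.mkdtemp()
demo_file = write_pgpass(Path(demo_dir) / ".pgpass",
                         pgpass_line("localhost", 5432, "*", "postgres", "s3cret:pw"))
print(demo_file.read_text())
```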

In [6]:
def psql(cmd, user='postgres', dbname='baseball'):
    # run a single psql command against database dbname and render its HTML (-H) output
    psql_out = !psql -H -U {user} {dbname} -c "{cmd}"
    display(HTML(''.join(psql_out)))
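
The `!psql` syntax only works inside IPython/Jupyter. A minimal sketch of an equivalent that also works in a plain Python script follows; it is an alternative, not this notebook's method, and the names `build_psql_args` and `run_psql` are hypothetical. Passing the arguments as a list avoids shell quoting problems when the command itself contains double quotes.

```python
import subprocess


def build_psql_args(cmd, user="postgres", dbname="baseball"):
    """Return the argument list for running one psql command with HTML output."""
    # -H: HTML table output, -U: database user, -d: database name, -c: command to run
    return ["psql", "-H", "-U", user, "-d", dbname, "-c", cmd]


def run_psql(cmd, user="postgres", dbname="baseball"):
    """Run the command and return psql's output as a string (requires psql on PATH)."""
    completed = subprocess.run(build_psql_args(cmd, user, dbname),
                               capture_output=True, text=True, check=True)
    return completed.stdout


print(build_psql_args(r"\d r_player_game"))
```

In a notebook, the result can be rendered with `display(HTML(run_psql(cmd)))`.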

In [7]:
!psql --version

psql (PostgreSQL) 11.2 (Ubuntu 11.2-1.pgdg18.04+1)


## 1. Player Game to Postgres

### Load into Postgres

df.to_sql() is a convenient method to load data into a database table, as well as create the schema for that table, if it does not exist.

However, for large amounts of data, most database systems offer a "bulk copy" facility that is much faster than any other method of loading data.

The "create table" feature of df.to_sql() will be used here. 

For Postgres, the "bulk copy" is called COPY and can be run from the psql command line.

In [8]:
os.chdir(p_wrangled)

# just read in 10 rows
player_game = bb.from_csv_with_types('player_game.csv.gz', nrows=10)
dtypes = bb.optimize_db_dtypes(player_game)

In [9]:
# drop the table to ensure that r_player_game data types are updated as necessary
conn.execute('DROP TABLE IF EXISTS r_player_game');

In [10]:
# copy over a few rows to create/replace the table with the proper datatypes
player_game.to_sql('r_player_game', conn, if_exists='replace', index=False, dtype=dtypes)

In [11]:
type(conn)

sqlalchemy.engine.base.Engine

In [12]:
# when conn is a SQLAlchemy engine, the changes are committed automatically
conn.execute('DELETE FROM r_player_game');

In [13]:
rs = conn.execute('SELECT COUNT(*) FROM r_player_game')
rs.fetchall()

[(0,)]

In [14]:
# verify we are in the correct directory and have zcat
# should see about 3.5 million records for 1955 to 2018
!zcat player_game.csv.gz | wc -l

3549700
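
Note that `wc -l` counts the header line as well, so the number of data rows is one less than the number shown. Where `zcat` is not available, a rough equivalent in Python might look like the sketch below. The function name `count_csv_rows` is hypothetical, and the demo file is a made-up stand-in for `player_game.csv.gz`.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def count_csv_rows(path):
    """Count the data rows (header excluded) of a gzipped CSV file."""
    with gzip.open(path, mode="rt", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# tiny made-up file for demonstration
demo_path = Path(tempfile.mkdtemp()) / "demo.csv.gz"
with gzip.open(demo_path, mode="wt", newline="") as f:
    csv.writer(f).writerows([["game_id", "player_id"],
                             ["g1", "p1"],
                             ["g1", "p2"]])

print(count_csv_rows(demo_path))
```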


In [15]:
# psql command to copy gzipped csv file to postgres table
cmd = r"\copy r_player_game from program 'zcat player_game.csv.gz' CSV HEADER"

In [16]:
# this is MUCH faster than using df.to_sql()
%time psql(cmd)

CPU times: user 4.87 ms, sys: 4.33 ms, total: 9.19 ms
Wall time: 10.2 s


In [17]:
# add primary key constraint
sql = 'ALTER TABLE r_player_game ADD PRIMARY KEY (player_id, game_id)'
conn.execute(sql);
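
`ADD PRIMARY KEY` fails if any `(player_id, game_id)` pair occurs more than once or contains a NULL. A minimal sketch of checking for duplicate keys before adding the constraint follows. The function name `find_duplicate_keys` and the sample rows are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows, key_fields):
    """Return the key tuples that occur more than once in an iterable of dicts."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


sample_rows = [
    {"player_id": "p1", "game_id": "g1"},
    {"player_id": "p2", "game_id": "g1"},
    {"player_id": "p1", "game_id": "g1"},  # duplicate of the first row
]
print(find_duplicate_keys(sample_rows, ["player_id", "game_id"]))
```

For the real data, the rows could come from `csv.DictReader` over the gzipped CSV file.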

In [18]:
# describe the r_player_game table
psql(r'\d r_player_game')

Column,Type,Collation,Nullable,Default
game_id,text,,not null,
game_dt,text,,,
game_ct,smallint,,,
team_id,text,,,
player_id,text,,not null,
b_g,smallint,,,
b_pa,smallint,,,
b_ab,smallint,,,
b_r,smallint,,,
b_h,smallint,,,


## 2. Player Game Data Dictionary to Postgres

In [19]:
os.chdir(p_wrangled)
player_game_fields = pd.read_csv('player_game_fields.csv')
player_game_fields.to_sql('r_player_game_fields', conn, if_exists='replace', index=False)

In [20]:
# verify df.to_sql worked
rs = conn.execute("SELECT * FROM r_player_game_fields")
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
df

Unnamed: 0,game_id,game_dt,game_ct,team_id,player_id,b_g,b_pa,b_ab,b_r,b_h,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
0,game id,date,game number (0 = no double header),team id,player id,games played,plate appearances,at bats,runs,hits,...,walks allowed,intentional walks allowed,strikeouts,grounded into double play,hit batsmen,sacrifice hits against,sacrifice flies against,reached on interference,wild pitches,balks


## 3. Game to Postgres

In [21]:
os.chdir(p_wrangled)

# just read in 10 rows
game = bb.from_csv_with_types('game.csv.gz', nrows=10)
dtypes = bb.optimize_db_dtypes(game)

In [22]:
conn.execute('DROP TABLE IF EXISTS r_game');
game.to_sql('r_game', conn, dtype=dtypes, index=False)

In [23]:
# delete the few rows as \copy will be used instead
conn.execute('DELETE FROM r_game');

In [24]:
rs = conn.execute('SELECT COUNT(*) FROM r_game')
rs.fetchall()

[(0,)]

In [25]:
# psql command to copy gzipped csv file to postgres table
cmd = r"\copy r_game from program 'zcat game.csv.gz' CSV HEADER"

In [26]:
# this is MUCH faster than using df.to_sql()
%time psql(cmd)

CPU times: user 1.41 ms, sys: 8.6 ms, total: 10 ms
Wall time: 497 ms


In [27]:
# add primary key constraint
sql = 'ALTER TABLE r_game ADD PRIMARY KEY (game_id)'
conn.execute(sql);

In [28]:
# describe the r_game table
psql(r'\d r_game')

Column,Type,Collation,Nullable,Default
game_id,text,,not null,
game_dt,text,,,
game_ct,smallint,,,
game_dy,text,,,
start_game_tm,text,,,
dh_fl,text,,,
daynight_park_cd,text,,,
away_team_id,text,,,
home_team_id,text,,,
park_id,text,,,


## 4. Game Data Dictionary to Postgres

In [29]:
os.chdir(p_wrangled)
game_fields = pd.read_csv('game_fields.csv')

In [30]:
conn.execute('DROP TABLE IF EXISTS r_game_fields');
game_fields.to_sql('r_game_fields', conn, index=False)

In [31]:
# verify df.to_sql worked
rs = conn.execute("SELECT * FROM r_game_fields")
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
df

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,away_hits_ct,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id
0,game id,date,game number (0 = no double header),day of week,start time,DH used flag,day/night flag,visiting team,home team,game site,...,visitor hits,home hits,visitor errors,home errors,visitor left on base,home left on base,winning pitcher,losing pitcher,save for,GW RBI


## 5. Players to Postgres

In [32]:
os.chdir(p_wrangled)
players = pd.read_csv('players.csv', 
                      parse_dates=['player_debut', 'mgr_debut', 'coach_debut', 'ump_debut'])

In [33]:
conn.execute('DROP TABLE IF EXISTS r_players');
players.to_sql('r_players', conn, index=False)

In [34]:
# verify df.to_sql worked
rs = conn.execute("SELECT * FROM r_players LIMIT 3")
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
df

Unnamed: 0,player_id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,2004-04-06,,NaT,
1,aaroh101,Aaron,Hank,1954-04-13,,NaT,
2,aarot101,Aaron,Tommie,1962-04-10,,1979-04-06,


In [35]:
# describe the r_players table
psql(r'\d r_players')

Column,Type,Collation,Nullable,Default
player_id,text,,,
last,text,,,
first,text,,,
player_debut,timestamp without time zone,,,
mgr_debut,timestamp without time zone,,,
coach_debut,timestamp without time zone,,,
ump_debut,timestamp without time zone,,,


## 6. Stadium (Parks) to Postgres

In [36]:
os.chdir(p_wrangled)
parks = pd.read_csv('parks.csv',
                    parse_dates=['start', 'end'])

In [37]:
# as before, drop the table to ensure data types are chosen by Pandas
conn.execute('DROP TABLE IF EXISTS r_parks');
parks.to_sql('r_parks', conn, index=False)

In [38]:
# verify df.to_sql worked
rs = conn.execute("SELECT * FROM r_parks LIMIT 3")
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
df

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,1880-09-11,1882-05-30,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,1884-04-30,1884-05-31,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,1966-04-19,NaT,AL,


In [39]:
# describe the r_parks table
psql(r'\d r_parks')

Column,Type,Collation,Nullable,Default
park_id,text,,,
name,text,,,
aka,text,,,
city,text,,,
state,text,,,
start,timestamp without time zone,,,
end,timestamp without time zone,,,
league,text,,,
notes,text,,,
