# Retrosheet Baseball Data

Retrosheet data will be downloaded, parsed, and saved to both gzipped csv and PostgreSQL.

TODO: This notebook presents a detailed explanation of each step. A Python script which performs everything in this notebook, without explanation, will also be provided in my github repo.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

The two most popular open source Baseball Data Sources are:  
**Lahman**  
http://www.seanlahman.com/baseball-archive/statistics/  
This database is copyright 1996-2018 by Sean Lahman.  

**Retrosheet**  
https://www.retrosheet.org/game.htm  
https://www.retrosheet.org/game.htm#Notice  
This database is copyright 1996-2018 by Retrosheet.


Lahman has data about each player summarized by year.  Retrosheet has data at the play-by-play level (called "event data"). This notebook is for Retrosheet.  Another notebook is for Lahman.  Subsequent notebooks will use data from both sources.

The only open-source parsers available for Retrosheet are by Dr. T. L. Turocy:  
Description: http://chadwick.sourceforge.net/doc/index.html  
Source: https://sourceforge.net/projects/chadwick/

## Data Wrangling

The Retrosheet event data includes every play for every major league game since 1921. 
A subset of that data will be used here.

Retrosheet Data Wrangling will include:
1. Parsing and manipulating player per game data from event files.
2. Parsing and manipulating game data from event files.
3. Creating "lookup tables" by web scraping.
4. Creating data dictionaries (aka codebooks) from scraping Dr. Turocy's C source code.

At the end of the data wrangling, 6 DataFrames will exist.
1. **player_game:** player per game stats 
2. **player_game_fields:** player per game field descriptions
3. **game:** game stats
4. **game_fields:** game field descriptions
5. **players:** player info
6. **stadiums:** stadium info

The above 6 dataframes will be persisted to both gzipped csv and Postgres.

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## Download and Unzip Retrosheet Data

The raw event data will be downloaded. The url is of the form:
http://www.retrosheet.org/events/{year}eve.zip'

There are many ways to download files in Python.  For a simple binary file download, wget may be the easiest.

### Create Directories for Raw and Processed Data
* ~/data/retrosheet/raw  
* ~/data/retrosheet/processed  

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_processed = retrosheet.joinpath('processed')

# create directories (if they don't already exist) from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_processed.mkdir(parents=True, exist_ok=True)

### Retrosheet Event Data
Data is available from 1921 to present.

Here, data from 1955 through 2018 will be downloaded and unzipped.  The start year of 1955 was chosen in part because there are fewer missing values for baseball attributes from 1955 on.

Using 1955 to present will result in (at least one temporary) 2+ Gig Pandas DataFrame, so chose more or less years as appropriate for your computer's resources.

In [3]:
# change to raw file directory
os.chdir(p_raw)

for year in range(1955,2019):   
    # download each event file, if it doesn't exist locally
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist locally
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")

### Unzipped Data File Types
The unzipped data consists of 3 types of files:
1. *.EVA and *.EVN -- these are American League and National League event files per team per year
2. *.ROS -- these are the rosters per team per year
3. TEAM* -- these are the MBL teams in existence per year

## Working with Postgres

1. The Postgres server should be installed and running.
2. The retrosheet schema should be created, but empty.
3. The ipython/Jupyter Lab SQL magic extension should be installed. See:  
https://github.com/catherinedevlin/ipython-sql

In [4]:
from sqlalchemy.engine import create_engine
from sqlalchemy.types import SmallInteger
from IPython.display import HTML, display

### Connect to DB

conn = create_engine(connect_str)

Using conn.execute(query), with conn created as above (i.e. a SQL Alchemy engine), will:
1. cause a DB connection object to be allocated for use
2. will use that connection object to execute the query
3. will commit any data changes
4. will release that connection object back to the open connection pool

For transaction processing, using the Python DB API, with SQL Alchemy, use:  
```connection = create_engine(connect_str).connect()```

In [5]:
# Get the user and password from the environment (rather than hardcoding it)
import os
db_user = os.environ.get('DB_USER')
db_pass = os.environ.get('DB_PASS')

# avoid putting passwords directly in code
connect_str = f'postgresql://{db_user}:{db_pass}@localhost:5432/retrosheet'

# treat sql alchmey engine as a connection to the database
conn = create_engine(connect_str)

### **psql**

psql is a very helpful command line tool to use with Postgres.  

In order to use it without a password, a .pgpass file in the user's home directory must be created.  See:  https://www.postgresql.org/docs/11/libpq-pgpass.html  

The .pgpass file should look like:  
```localhost:5432:*:<user>:<passwd>```

In [6]:
# -H for html output
# this connects, executes, and disconnects
def psql(cmd, user='postgres', schema='retrosheet'):
    psql_out = !psql -H -U {user} {schema} -c "{cmd}"
    display(HTML(''.join(psql_out)))

In [7]:
psql('\d')

Schema,Name,Type,Owner
public,game,table,postgres
public,game_fields,table,postgres
public,parks,table,postgres
public,player_game,table,postgres
public,player_game_fields,table,postgres
public,players,table,postgres


## Retrosheet Data Dictionary Overview
A "data dictionary" is also called a "codebook".

The following is a highlevel overview of the meaning of the field names created by the Retrosheet parsers.

```
Suffix Meaning
CT     count (integer)
ID     identifier
FL     boolean flag
CD     code (enumerated data type)
DT     date
DY     day of week
TM     time

Prefix Meaning
B      batter
P      pitcher
```

In most cases, the abbreviation between the prefix and the suffix is a common baseball abbreviation.  For common baseball abbreviations see:  
http://www.espn.com/gen/editors/mlb/glossary.html

From the glossary above, "SF" stands from sacrifice flies.  This statistic has been recorded since 1955.  The full field names created by the parsers are "B_SF" for how many sacrifice flies by the batter, and "P_SF" for how many sacrifice flies given up by the pitcher.

## Data Verification

For odd data, such as whether or not a the first game of a double header was in one stadium, and the second game was in a different stadium; [Baseball-Reference](https://www.baseball-reference.com) is helpful.

Baseball-reference uses the data from Retrosheet, and presents it in an easy to read form for people. Although baseball-reference on rare occasion will incorrectly interpret the event data, it is nevertheless a useful tool to verify the data processing used here.

Baseball-reference does not offer already parsed data for data analysis.

The following method takes a game_id and converts it to a baseball-reference url for researching more about a particular game.

In [8]:
def game_id_to_url(game_id):
    home = game_id[:3]
    url = 'https://www.baseball-reference.com/boxes/' + home + '/' + game_id + '.shtml'
    display(HTML(f'<a href="{url}">{game_id}</a>'))

In [9]:
# Click on the generated link to get a url for detailed game information.
game_id_to_url('NYA200806271')

As per the above, the first game of the double header was in Yankee Stadium and the second game, on the same day, was in Shea Stadium.

## 1. Parse Event Data for Player per Game Statistics

The event data is in a format that is very difficult to work with.  There is an open-source project which has parsers for the Retrosheet event data.  This project has 6 parsers.  Each of these parsers is fed event data and produces csv or XML or text output.

The two parsers that are of interest for this study are:
1. cwdaily
2. cwgame

The cwbox parser produces a box score in the form MLB fans are accustomed to seeing (or it can produce XML with appropriate tags).  This appears to have the same information as is produced by cwdaily, however cwdaily formats the data as one line per player per game, which is much easier to work with.

The Retrosheet data parser tools are described at:  
http://chadwick.sourceforge.net/doc/index.html  
  
They are distributed under the GPL:  
https://www.gnu.org/licenses/gpl.html  

Note: as of February 2019, the cwdaily parser, published in July 2018, is not described on the above webpage.

### Build Chadwick Parsers on Linux (or use prebuilt Windows binaries)
This section describes how to download the source, compile and install it.

This procedure is how most open-source projects are compiled and installed on Linux.

Go To:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

Result
1. The cw command line tools will be installed in /usr/local/bin.  
2. The cw library will be installed in /usr/local/lib.  

To allow the command line tools to find the shared libraries, add the following to your .bashrc and then: source .bashrc  
```export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib```

### Using Prebuilt Windows Binaries
Go To:  
https://sourceforge.net/projects/chadwick/  
Download the Windows binaries for version 0.7.1 or later.

**Linux Wine**  
Install wine: https://wiki.winehq.org/Ubuntu  
Before first use of wine: run winecfg in a terminal

**Windows**  
You could also run the windows binaries on Windows or a Windows VM.

### Data Wrangling Scripting
As part of the initial data processing pipeline, Data Wrangling is often performed using shell scripts or Python scripts.

Here each preprocessing step will be documented using a Jupyter Notebook Cell.

In [10]:
# normally os.listdir() is used to list a directory
# here, for demonstating the subprocess module, subprocess will be used
# invoke bash directly with shell=False in subprocess
import subprocess

cmd = 'ls /usr/local/bin/cw*'
args = ['/bin/bash', '-c', cmd]
result = subprocess.run(args, shell=False, text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [11]:
# check the environment variable for LD_LIBRARY_PATH
os.environ['LD_LIBRARY_PATH']

'/usr/local/lib'

In [12]:
# if you are running windows binaries under Linux, 
# prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse event data into 52 fields of player stats per game.
    
    There are a total of 117 fields to chose from, the first 52 are selected.
    """
    cmd = f'cwdaily -f 0-51 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [13]:
# change to raw file directory
os.chdir(p_raw)

In [14]:
# parse each year of event data
for year in range(1955, 2019):
    file = p_processed.joinpath(f'daily{year}.csv')
    
    # if the output is not already there
    if not file.is_file():
        process_cwdaily(year)

In [15]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
import numpy as np

os.chdir(p_processed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file, parse_dates=['GAME_DT', 'APPEAR_DT']))
player_game = pd.concat(dfs)

In [16]:
# reset index after concatenation
player_game = player_game.reset_index(drop=True)

In [17]:
# use lower case column names
player_game.columns = player_game.columns.str.lower()

In [18]:
player_game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
0,BAL195504120,1955-04-12,0,1955-04-12,BOS,goodb101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0
1,BAL195504120,1955-04-12,0,1955-04-12,BOS,joose101,1,5,4,0,...,0,0,0,0,0,0,0,0,0,0
2,BAL195504120,1955-04-12,0,1955-04-12,BOS,throf101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# the primary key is (game_id, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['game_id', 'player_id'], keep=False)
player_game[dups]

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
3418636,BOS201708250,2017-08-25,0,2017-08-25,BOS,younc004,1,3,3,0,...,0,0,0,0,0,0,0,0,0,0
3418638,BOS201708250,2017-08-25,0,2017-08-25,BOS,younc004,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# check this game manually by clicking on the generated link
game_id_to_url('BOS201708250')

### Data Correction for Duplicates

Checking the box score via the above link, shows 2 entries for Young for the same game, one as a pinch-hitter and one as the designated-hitter.  It would appear that both entries are correct and that the data should be summed.

In [21]:
# get the index labels of the duplicated rows
idx1, idx2 = player_game[dups].index.values
idx1, idx2

(3418636, 3418638)

In [22]:
# identifier columns
id_columns = player_game.columns[:5]
id_columns

Index(['game_id', 'game_dt', 'game_ct', 'appear_dt', 'team_id'], dtype='object')

In [23]:
# stat columns
stat_columns = player_game.columns[5:]
stat_columns

Index(['player_id', 'b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b',
       'b_hr', 'b_rbi', 'b_bb', 'b_ibb', 'b_so', 'b_gdp', 'b_hp', 'b_sh',
       'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g', 'p_gs', 'p_cg', 'p_sho', 'p_gf',
       'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r', 'p_er', 'p_h',
       'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp', 'p_hp',
       'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk'],
      dtype='object')

In [24]:
# id columns match (as per df.duplicated() above)
player_game.loc[[idx1,idx2], id_columns]

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id
3418636,BOS201708250,2017-08-25,0,2017-08-25,BOS
3418638,BOS201708250,2017-08-25,0,2017-08-25,BOS


In [25]:
# game data
player_game.loc[[idx1,idx2], stat_columns]

Unnamed: 0,player_id,b_g,b_pa,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
3418636,younc004,1,3,3,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3418638,younc004,1,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# add stats for the two rows
player_game.loc[idx1, stat_columns] += player_game.loc[idx2, stat_columns]

# remove duplicate row
player_game = player_game.drop(idx2)

In [27]:
# is_unique method for multiple columns
# faster than using groupby
def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()

In [28]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
is_unique(player_game, ['game_id', 'player_id'])

True

### Optimizing Pandas Data Types for each Variable

Using the smallest data type that can represent the data offers several advantages:
1. Reduced memory
2. Possibly increased performance
3. Provides information to both the data analyst and to other software libraries, about the variable.

This means using small integers, as appropriate.  
This means using categories, as appropriate.

A category should be used if there is a relatively small number of unique string values.  In other languages, a "category" is called a "factor" or an "enumerated data type".

The above is true for the analytical processing of data.

For datasets being constantly updated, unless the range of each variable is known in advance, using the smallest data type could create problems when new data is added.

In [29]:
def mem_usage(obj):
    if isinstance(obj, pd.DataFrame):
        mem = obj.memory_usage(deep=True).sum()
    else:
        mem = obj.memory_usage(deep=True)
        
    mem = mem / 2 ** 20 # covert to megabytes
    return f'{mem:03.2f} MB'

In [30]:
# About 2GB
mem_usage(player_game)

'2010.84 MB'

In [31]:
# data types by count
player_game.dtypes.value_counts()

int64             47
object             3
datetime64[ns]     2
dtype: int64

In [32]:
# Fraction of string values that are unique
player_game_obj = player_game.select_dtypes(include=['object'])
player_game_obj.nunique() / player_game_obj.shape[0]

game_id      0.036579
team_id      0.000012
player_id    0.003146
dtype: float64

In [33]:
# this optimization is good for player_game and game
def optimize_data_types(df):
    df = df.copy()
    
    # int64 -> smallest uint allowed by data
    df_int = df.select_dtypes(include=['int'])
    df_int = df_int.apply(pd.to_numeric,downcast='unsigned')
    df[df_int.columns] = df_int

    # object -> category
    df_obj = df.select_dtypes(include=['object'])
    df_obj = df_obj.astype('category')
    df[df_obj.columns] = df_obj
    
    return df

In [34]:
player_game = optimize_data_types(player_game)

In [35]:
# data types by count
player_game.dtypes.value_counts()

uint8             47
datetime64[ns]     2
category           1
category           1
category           1
dtype: int64

In [36]:
# about 8 times less memory is now being used
mem_usage(player_game)

'278.60 MB'

### Techniques to Persist Dataframes

1. pickle
2. hdf5
3. csv or compressed csv
4. write directly to database

Some of the pros and cons to the above: 
1. Pickle is easiest, but format may change over time and loading a pickled file from an untrusted source could execute malicious code.
2. hdf5 has too much overhead for storing less than about 2 GB of data
3. csv (optionally compressed) loses optimized Pandas data types
4. writing to a database also loses Pandas optimized datatypes

This notebook will both:
1. save each dataframe as a (optionally compressed) csv file
2. save each dataframe to a postgres table

Data type optimizations will have to be reapplied upon reading the data back in.

Due to the sparsity of the player_game dataframe, using gzip will reduce the file size by a factor of 10+.

In [37]:
player_game.dtypes.value_counts()

uint8             47
datetime64[ns]     2
category           1
category           1
category           1
dtype: int64

In [38]:
# create path objects
p_persisted = retrosheet.joinpath('persisted')

# create directories from these path objects
p_persisted.mkdir(parents=True, exist_ok=True)

# change working dir
os.chdir(p_persisted)

In [39]:
# persist as compressed csv file
%time player_game.to_csv('player_game.csv.gz', compression='gzip', index=False)

CPU times: user 5min 20s, sys: 182 ms, total: 5min 20s
Wall time: 5min 20s


#### To Read Back Use:
```
player_game = pd.read_csv('player_game.csv.gz', parse_dates=['game_dt'])
player_game = optimize_data_types(player_game)
```

### Load into Postgres

df.to_sql() is a very convenient method.  It will be used to load the data into Postgres for all dataframes except this one.

player_game is a large dataframe.  The data can be loaded to Postgres much faster using the COPY command.

In [40]:
# to_sql() will have postgres use bigint even though smallint is sufficient
dtype = {col:SmallInteger for col in player_game.select_dtypes(include=np.number).columns}
dtype

{'game_ct': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_pa': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_ab': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_r': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_h': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_2b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_3b': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_hr': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_rbi': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_bb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_ibb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_so': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_gdp': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_hp': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sh': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sf': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_sb': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_cs': sqlalchemy.sql.sqltypes.SmallInteger,
 'b_xi': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_g': sqlalchemy.sql.sqltypes.SmallInteger,
 'p_gs': sq

In [41]:
# COPY requires the table to already exist
# use to_sql() to create the table on a small subset of the data
# then delete that data to have an empty table
player_game[:10].to_sql('player_game', conn, if_exists='replace', index=False, dtype=dtype)

In [42]:
type(conn)

sqlalchemy.engine.base.Engine

In [43]:
# when conn is a SQL Alchemy engine, the changes will be committed automatically
conn.execute('DELETE FROM player_game');

In [44]:
rs = conn.execute('SELECT COUNT(*) FROM player_game')
rs.fetchall()

[(0,)]

In [45]:
# verify we are in correct directory and have zcat
# Shoule see about 3.5 million records for 1955 to 2018
!zcat player_game.csv.gz | wc -l

3549700


In [46]:
# psql command to copy gzipped csv file to postgres table
cmd="\copy player_game from program 'zcat player_game.csv.gz' CSV HEADER"

In [47]:
# this is MUCH faster than using df.to_sql() for this many rows
%time psql(cmd)

CPU times: user 2.17 ms, sys: 64.1 ms, total: 66.2 ms
Wall time: 11.8 s


In [48]:
is_unique(player_game, ['game_id', 'player_id'])

True

In [49]:
# add primary key constraint
sql = 'ALTER TABLE retrosheet.public.player_game ADD PRIMARY KEY (player_id, game_id)'
conn.execute(sql);

In [50]:
# describe player_game table
psql('\d player_game')

Column,Type,Collation,Nullable,Default
game_id,text,,not null,
game_dt,timestamp without time zone,,,
game_ct,smallint,,,
appear_dt,timestamp without time zone,,,
team_id,text,,,
player_id,text,,not null,
b_g,smallint,,,
b_pa,smallint,,,
b_ab,smallint,,,
b_r,smallint,,,


## 2. Scrape Data for Player_Game Data Dictionary
As of February 2019, I could find no published information on cwdaily.

cwdaily can be run with the '-n' flag to have it output field names, but it is not clear what some of the field names mean.

Luckily, the source code itself has a text description of each field name.  This description takes place within a single, very long, C statement.

The source code will be scraped to retrieve a field name to field description mapping.

In [51]:
# cd to dir with cwdaily.c
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [52]:
def parse_c_source(filename, struct='field_data'):
    """Extract field name to field description from parser's C source code"""
    dd = {}
    with open(filename, 'r') as cwdaily:
        # to account for patterns across lines, read entire source code
        source = cwdaily.read()
    
        # get the single (multiline) C statement that has field descriptions
        pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
        match = re.search(pattern, source, flags=re.DOTALL | re.MULTILINE)
    
        if match:
            pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
            for m in re.finditer(pattern, match.group(1), 
                                 flags=re.DOTALL | re.MULTILINE):
                if m:
                    if len(m.group(2).split(':')) == 2:
                        desc = m.group(2).split(':')[1].strip()
                    else:
                        desc = m.group(2).strip()
                    dd[m.group(1).lower()] = desc   
    return dd

In [53]:
player_game_fields_all = parse_c_source('cwdaily.c')        

In [54]:
# As of Python 3.6, dictionaries maintain insertion order
# Only the first 52 fields were selected, so that's all that needed here
player_game_fields = {key:value for num, 
        (key, value) in enumerate(player_game_fields_all.items()) if num < 52}

# appear_dt was removed from player_game above
del player_game_fields['appear_dt']

In [55]:
# here is the explanation of each field, as scraped from the C source code
player_game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'team_id': 'team id',
 'player_id': 'player id',
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference',
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 'p

### Data Dictionary Notes
In the above, team_id is the team_id of the player.

game_id is:  
```
0:4 Home TEAM_ID  
4:8 YYYYMMDD  
9   Game Count
```

Game Count is:
* 0 for single game
* 1 for 1st game of double header
* 2 for 2nd game of double header

### Persist player_game_fields

In [56]:
os.chdir(p_persisted)

# index=[0] is required for dictionary of scalar values
# no need to compress something this small
player_game_fields_df = pd.DataFrame(player_game_fields, index=[0])
player_game_fields_df.to_csv('player_game_fields.csv', index=False)

In [57]:
# replace the table if it exists
player_game_fields_df.to_sql('player_game_fields', conn, if_exists='replace', index=False)

In [58]:
# verify df.to_sql worked
rs = conn.execute("SELECT * FROM player_game_fields")

In [59]:
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
df

Unnamed: 0,game_id,game_dt,game_ct,team_id,player_id,b_g,b_pa,b_ab,b_r,b_h,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
0,game id,date,game number (0 = no double header),team id,player id,games played,plate appearances,at bats,runs,hits,...,walks allowed,intentional walks allowed,strikeouts,grounded into double play,hit batsmen,sacrifice hits against,sacrifice flies against,reached on interference,wild pitches,balks


## 3. Parse Event Data for Game Statistics
Additional information about the game is available, such as the attendance, the temperature at game start time, etc.

In [60]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into 45 fields of game data per year.
    
    For each game, there are 84 standard fields and 95 extended fields to chose from.  
    Only the first 46 standard fields are chosen.
    """
    cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [61]:
# change to raw file directory
os.chdir(p_raw)

In [62]:
# parse each year of event data
for year in range(1955, 2019):
    file = p_processed.joinpath(f'game{year}.csv')
    
    # if the output is not already there
    if not file.is_file():
        process_cwgame(year)

In [63]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(p_processed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    dfs.append(
        pd.read_csv(file, 
            keep_default_na=False,
            na_values={'ATTEND_PARK_CT':[-1,0],
                       'TEMP_PARK_CT':[-1,0]}))
game = pd.concat(dfs)

In [64]:
game.reset_index(drop=True)
game.columns = game.columns.str.lower()
game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,away_hits_ct,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id
0,BAL195504120,19550412,0,Tuesday,0,F,D,BOS,BAL,BAL11,...,13,5,0,2,8,9,sullf101,colej101,,
1,BAL195504180,19550418,0,Monday,0,F,N,NYA,BAL,BAL11,...,8,3,0,1,5,4,fordw101,moorr101,,
2,BAL195504220,19550422,0,Friday,0,F,N,WS1,BAL,BAL11,...,4,8,2,1,6,11,mcdem102,wilsj104,schmj101,


In [65]:
# the primary key is (game_id), verify no dups
is_unique(game, ['game_id'])

True

In [66]:
# these columns will not be used in the analysis
drop_columns = ['edit_record_ts',
                'wind_direction_park_cd',
                'wind_speed_park_ct',
                'field_park_cd',
                'precip_park_cd',
                'sky_park_cd',                
                'base1_ump_id', 
                'base2_ump_id', 
                'base3_ump_id', 
                'base4_ump_id',
                'scorer_record_id', 
                'inputter_record_id', 
                'lf_ump_id', 
                'rf_ump_id',
                'translator_record_id', 
                'input_record_ts', 
                'method_record_cd',
                'pitches_record_cd']

In [67]:
game = game.drop(drop_columns, axis=1)

In [68]:
game.dtypes.value_counts()

object     13
int64      13
float64     2
dtype: int64

### Reverse Engineer am/pm for start_game_tm

1. am/pm is not specified.
2. The time is not in 24-hour format
3. The time is an integer, not a string.  For example, 1259 means 12:59.
4. A value of zero means the game start time is unknown.
5. The daynight_park_cd is never missing.  This specifies whether the game took play in the "day" or at "night".
6. MLB domain knowledge: Some games may start late, due to a rain delay for example.  But games never start after midnight.
7. MLB domain knowledge: Some games may start early, to allow for travel to the next city.  But games never start before 9 am.

Given the above, am/pm can be deduced as follows:
* start_game_tm == 0 => use midnight (to represent unknown time)
* start_game_tm >= 1200 => pm
* start_game_tm < 900 => pm
* 900 <= start_game_tm < 1200, and day/night = day, => am
* 900 <= start_game_tm < 1200, and day/night = night, => pm

In [69]:
def parse_datetime(row):
    date = row['game_dt']
    time = row['start_game_tm']
    day_night = row['daynight_park_cd']
    
    if time > 0 and time < 900:
        time += 1200
    elif (900 <= time < 1200) and day_night == 'N':
        time += 1200

    time_str = f'{time//100:02d}:{time%100:02d}'
    datetime_str = str(date) + ' ' + time_str
    return pd.to_datetime(datetime_str, format='%Y%m%d %H:%M')

In [70]:
# create new datetime column
game['game_date'] = game.apply(parse_datetime, axis=1)

### Optimize Data Types

Normally, if the percentage of unique string values is large, there is no advantage in converting 'object' to 'category'.  (A join might work faster between two category variables than two string variables though.)

Here, optimize_data_types() will be called here and it will convert all object data types to categories.

In [71]:
df_obj = game.select_dtypes(include=['object'])
df_obj.nunique() / df_obj.shape[0]

game_id              1.000000
game_dy              0.000054
dh_fl                0.000015
daynight_park_cd     0.000015
away_team_id         0.000316
home_team_id         0.000316
park_id              0.000608
away_start_pit_id    0.026401
home_start_pit_id    0.026169
win_pit_id           0.034487
lose_pit_id          0.037183
save_pit_id          0.020517
gwrbi_bat_id         0.011182
dtype: float64

In [72]:
mem_usage(game)

'119.80 MB'

In [73]:
# optimize_data_types will
#  use smallest uint that can hold value
#  convert all objects to category
game = optimize_data_types(game)

In [74]:
# use lower case names
game.columns = game.columns.str.lower()

In [75]:
game.dtypes

game_id                    category
game_dt                      uint32
game_ct                       uint8
game_dy                    category
start_game_tm                uint16
dh_fl                      category
daynight_park_cd           category
away_team_id               category
home_team_id               category
park_id                    category
away_start_pit_id          category
home_start_pit_id          category
attend_park_ct              float64
temp_park_ct                float64
minutes_game_ct              uint16
inn_ct                        uint8
away_score_ct                 uint8
home_score_ct                 uint8
away_hits_ct                  uint8
home_hits_ct                  uint8
away_err_ct                   uint8
home_err_ct                   uint8
away_lob_ct                   uint8
home_lob_ct                   uint8
win_pit_id                 category
lose_pit_id                category
save_pit_id                category
gwrbi_bat_id               c

In [76]:
# a unique key is: (date, home_team, game_count)
is_unique(game, ['game_dt', 'home_team_id', 'game_ct'])

True

In [77]:
# game_id is a string concatenation of the above 3 fields, so it is also unique
game['game_id'].is_unique

True

In [78]:
# about 5 times less memory after optimizing the data types
mem_usage(game)

'24.47 MB'

In [79]:
# no need to compress this
os.chdir(p_persisted)
game.to_csv('game.csv', index=False)

In [80]:
game_float = game.select_dtypes(include=[np.float])
game_float.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
attend_park_ct,124836.0,24810.611258,12761.524179,306.0,14327.75,23709.0,34748.0,80227.0
temp_park_ct,80653.0,72.922074,10.920313,12.0,67.0,73.0,80.0,109.0


In [81]:
def is_all_int(s):
    """Returns True if all non-null values are integers"""
    notnull = s.notnull()
    is_integer = s.apply(lambda x: (x%1 == 0.0))
    return (notnull == is_integer).all()

In [82]:
# attendance and temperatre are always recorded as integers
is_all_int(game['attend_park_ct'])

True

In [83]:
is_all_int(game['temp_park_ct'])

True

### Attendance and Temperature: Use fillna(), Leave as Float, Interpolate, Other

There are several ways to deal with missing values.

**fillna() with impossible integer value**  
Pro: allows column to be represented as an integer in both Pandas and the database.  
Con: mean() and other operations may inadvertently use the impossible value.

If this technique is chosen, an additional (boolean) column such as 'is_attendance_null', could be created to make analysis easier.

**Leave as float**  
Pro: mean() and other operations skip na values by default.  This is the expected behavior.  
Con: requires more storage in Pandas and the database.  
Con: data analyst or software library using this column, may think the variable can have non-integer values.

**Interpolate**  
Values could be interpolated (or predicted using machine learning) from values on either side of the missing value.

**Semantics**  
Attendance must be an integer.

Temperature is not an integer.  Rather, to date, it has been recorded to the nearest integer value. This could change in the future.

**Use Database Representation Different from Pandas**  
A database can have an integer column with null values, Pandas cannot.  One way around this is  to write the values to a float column in the database, then convert that column type to integer.  However this makes it difficult for Pandas to append new null values to that column.

**Decision**  
There is no obvious best answer.  For this notebook, the fields will be left as Float for easy use with Pandas.

In [84]:
game.to_sql('game', conn, if_exists='replace', index=False, dtype=dtype)

In [85]:
game['game_id'].is_unique

True

In [86]:
# add primary key constraint
sql = 'ALTER TABLE retrosheet.public.game ADD PRIMARY KEY (game_id)'
conn.execute(sql);

In [87]:
# describe player_game table
psql('\d game')

Column,Type,Collation,Nullable,Default
game_id,text,,not null,
game_dt,bigint,,,
game_ct,smallint,,,
game_dy,text,,,
start_game_tm,bigint,,,
dh_fl,text,,,
daynight_park_cd,text,,,
away_team_id,text,,,
home_team_id,text,,,
park_id,text,,,


## 4. Scrape Data for Game Data Dictionary

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

This data could be scraped from the webpage, but as a parser to read C source code to get this mapping was written above, it's simpler just to use it.

Note: the codes for some of the \_CD fields are only specified on the above web page, but the \_CD fields are not being used in this study.

In [88]:
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [89]:
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')           

In [90]:
# there are 84 regular fields and 95 extended fields
len(game_reg_fields), len(game_ext_fields)

(84, 95)

#### Data Dictionary Note
dh_fl: Designated Hitter Flag, 'T' if DH in use, else 'F'  
daynight_park_cd: 'N' for night, 'D' for day  
gw_rbi_bat_id: Player ID for batter who got Game Winning RBI  

In [91]:
# As of Python 3.6, dictionaries maintain insertion order
game_fields = {key:value for num, 
    (key, value) in enumerate(game_reg_fields.items()) if num < 46}

for key in drop_columns:
    del game_fields[key]

game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'game_dy': 'day of week',
 'start_game_tm': 'start time',
 'dh_fl': 'DH used flag',
 'daynight_park_cd': 'day/night flag',
 'away_team_id': 'visiting team',
 'home_team_id': 'home team',
 'park_id': 'game site',
 'away_start_pit_id': 'vis. starting pitcher',
 'home_start_pit_id': 'home starting pitcher',
 'attend_park_ct': 'attendance',
 'temp_park_ct': 'temperature',
 'minutes_game_ct': 'time of game',
 'inn_ct': 'number of innings',
 'away_score_ct': 'visitor final score',
 'home_score_ct': 'home final score',
 'away_hits_ct': 'visitor hits',
 'home_hits_ct': 'home hits',
 'away_err_ct': 'visitor errors',
 'home_err_ct': 'home errors',
 'away_lob_ct': 'visitor left on base',
 'home_lob_ct': 'home left on base',
 'win_pit_id': 'winning pitcher',
 'lose_pit_id': 'losing pitcher',
 'save_pit_id': 'save for',
 'gwrbi_bat_id': 'GW RBI'}

#### Persist game_fields

In [92]:
os.chdir(p_persisted)

# index=[0] is required for dictionary of scalar values
game_fields_df = pd.DataFrame(game_fields, index=[0])
game_fields_df.to_csv('game_fields.csv', index=False)

In [93]:
game_fields_df.to_sql('game_fields', conn, if_exists='replace', index=False)

In [94]:
# verify df.to_sql worked
rs = conn.execute("SELECT * FROM game_fields")

In [95]:
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
df

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,away_hits_ct,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id
0,game id,date,game number (0 = no double header),day of week,start time,DH used flag,day/night flag,visiting team,home team,game site,...,visitor hits,home hits,visitor errors,home errors,visitor left on base,home left on base,winning pitcher,losing pitcher,save for,GW RBI


## 5. Scrape Data for Player Lookup Table

There is no separate file for this.  It will be scraped from a web page.

In [96]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [97]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

In [98]:
# read from this string instead of file
players = pd.read_csv(StringIO(table_txt), 
    parse_dates=['Play debut', 'Mgr debut', 'Ump debut'])

In [100]:
# Coach debut has some bad values
def parse_date(value):
    if pd.isna(value) or value == '43188' or int(value[-4:]) < 1800:
        return pd.NaT
    else:
        return pd.datetime.strptime(value, '%m/%d/%Y')
players['Coach debut'] = players['Coach debut'].apply(parse_date)

In [101]:
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


In [102]:
name_chg = {'ID':'retro_id',
         'Last':'last',
         'First':'first',
         'Play debut':'player_debut',
         'Mgr debut':'mgr_debut',
         'Coach debut': 'coach_debut',
         'Ump debut':'ump_debut'}
players = players.rename(columns=name_chg)
players.head()

Unnamed: 0,retro_id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


#### Persist Players

In [103]:
os.chdir(p_persisted)
players.to_csv('players.csv', index=False)

In [104]:
players.to_sql('players', conn, if_exists='replace', index=False)

In [105]:
# add primary key constraint
sql = 'ALTER TABLE retrosheet.public.players ADD PRIMARY KEY (retro_id)'
conn.execute(sql);

In [106]:
# describe player_game table
psql('\d players')

Column,Type,Collation,Nullable,Default
retro_id,text,,not null,
last,text,,,
first,text,,,
player_debut,timestamp without time zone,,,
mgr_debut,timestamp without time zone,,,
coach_debut,timestamp without time zone,,,
ump_debut,timestamp without time zone,,,


## 6. Scrape Data for Stadium Lookup Table
There is no separate file for this, it will be scraped from a webpage.

In [107]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt), parse_dates=['START', 'END'])

In [108]:
parks.columns = parks.columns.str.lower()
parks.head()

Unnamed: 0,parkid,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,1880-09-11,1882-05-30,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,1884-04-30,1884-05-31,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,1966-04-19,NaT,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,1972-04-21,1993-10-03,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,1994-04-11,NaT,AL,


#### Persist Stadiums

In [109]:
os.chdir(p_persisted)
parks.to_csv('parks.csv', index=False)

In [110]:
parks.to_sql('parks', conn, if_exists='replace', index=False)

In [111]:
# add primary key constraint
sql = 'ALTER TABLE retrosheet.public.parks ADD PRIMARY KEY (parkid)'
conn.execute(sql);

In [112]:
# describe player_game table
psql('\d parks')

Column,Type,Collation,Nullable,Default
parkid,text,,not null,
name,text,,,
aka,text,,,
city,text,,,
state,text,,,
start,timestamp without time zone,,,
end,timestamp without time zone,,,
league,text,,,
notes,text,,,
