# Retrosheet Baseball Data -- Part 1

## Overview
In this Jupyter notebook, data will be prepared for easy analysis with Pandas or with SQL against a database.

The two most popular open source Baseball Data Sources are:  
Lahman:  http://www.seanlahman.com/resources/  
Retrosheet:  https://www.retrosheet.org/site.htm

Lahman has data about each player summarized by year.  Retrosheet has data at the play-by-play level.

The play-by-play data: https://www.retrosheet.org/game.htm  will be downloaded and parsed into daily game data.

The only open-source parsers available for Retrosheet are by Dr. T. L. Turocy: 
http://chadwick.sourceforge.net/doc/index.html

The most up-to-date description of the parsers is:
https://github.com/chadwickbureau/chadwick/blob/master/doc/cwtools.rst

## Data Wrangling (Preprocessing)

The Retrosheet event data includes every play for every major league game since 1921.

Only a subset of that data will be used here.

At the end of this preprocessing, the following Pandas DataFrames will exist:
1. player_game:  stats per player per game
2. game: stats per game
3. players: player_id -> player info
4. parks: park_id -> stadium info

In addition, there will be two dictionaries used as codebooks:
1. player_game_fields: player_game_fieldname -> field description
2. game_fields: game_fieldname -> field description

The above 6 objects will be persisted for use with other notebooks.

## Repeatable Research
All data processing should be documented so that others can repeat the results.

This notebook documents all preprocessing steps taken with the data available from Retrosheet.

Retrosheet licenses their data using the GPL:  
https://www.gnu.org/licenses/gpl.html

## Download and Unpack Retrosheet Data

The event data will be downloaded.  The data is zipped ASCII text, with one file per year at:
http://www.retrosheet.org/events/{year}eve.zip

There are many ways to download files in Python.  For a simple binary file download, wget may be the easiest.
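
As an alternative to the third-party wget package, the same "download only if missing" step can be sketched with the standard library. The helper names `event_zip_url` and `download_if_missing` are made up for this illustration, and the `fetch` parameter exists only so the logic can be tried without network access.

```python
from pathlib import Path
from urllib.request import urlretrieve


def event_zip_url(year):
    """Build the per-year archive URL using the pattern given above."""
    return f'http://www.retrosheet.org/events/{year}eve.zip'


def download_if_missing(year, dest_dir, fetch=urlretrieve):
    """Download one season's zip into dest_dir unless it already exists.

    `fetch` defaults to urllib's urlretrieve(url, filename); it is a
    parameter only so that a stand-in can be supplied for testing.
    """
    target = Path(dest_dir) / f'{year}eve.zip'
    if not target.exists():
        fetch(event_zip_url(year), str(target))
    return target
```

Calling `download_if_missing(1950, p_raw)` twice should only hit the server once, which keeps reruns of the notebook cheap.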

### Create Directories
* ~/data/retrosheet/raw  
* ~/data/retrosheet/processed  

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_processed = retrosheet.joinpath('processed')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_processed.mkdir(parents=True, exist_ok=True)

### Download and Unzip the Event Data
Data is available from 1921 to present.

Here, data from 1950 through 2018 will be downloaded and unzipped.

This will result in a (temporary) 2+ GB Pandas DataFrame, so choose more or fewer years as appropriate for your computer's resources.

In [3]:
# change to raw file directory
os.chdir(p_raw)

for year in range(1950,2019):   
    # download each file, if it doesn't exist
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")

### Unzipped Data File Types
The unzipped data consists of 3 types of files:
1. *.EVA and *.EVN -- these are American League and National League event files per team per year
2. *.ROS -- these are the rosters per team per year
3. TEAM* -- the MLB teams in existence per year
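
The three file types listed above can be told apart by name alone. The following is a small sketch; the function name `classify_retrosheet_file` and the labels it returns are chosen here for illustration and are not part of Retrosheet or Chadwick.

```python
from pathlib import Path


def classify_retrosheet_file(filename):
    """Label an unzipped file as 'event', 'roster', 'teams', or 'other'."""
    name = Path(filename).name.upper()
    suffix = Path(name).suffix
    if suffix in ('.EVA', '.EVN'):
        return 'event'      # American / National League event file
    if suffix == '.ROS':
        return 'roster'     # roster per team per year
    if name.startswith('TEAM'):
        return 'teams'      # teams in existence for a year
    return 'other'          # e.g. the zip archives themselves
```

For example, `classify_retrosheet_file('1950BOS.EVA')` is labelled as an event file, while the downloaded `1950eve.zip` falls through to `'other'`.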

## Parse Event Data for Player Statistics

The event data is in a format that is very difficult to work with.  There is one (and only one) open-source project which has parsers for the Retrosheet event data.  This project has 6 parsers.  Each of these parsers is fed event data and produces csv or XML or text output.

The two parsers that are of interest for this study are:
1. cwdaily
2. cwgame

The cwbox parser produces a box score in the form MLB fans are accustomed to seeing (or it can produce XML with appropriate tags).  cwdaily produces exactly the same information, but formats it as one line per player per game, which is much easier to work with.

The Retrosheet data parser tools are described at:  
http://chadwick.sourceforge.net/doc/index.html  
  
They are distributed under the GPL:  
https://www.gnu.org/licenses/gpl.html  

Note: as of February 2019, the cwdaily parser, published in July 2018, is not described on the above webpage.

#### Build Chadwick Parsers on Linux
Go To:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later, and optionally download the Windows binaries.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

The cw command line tools will be installed in /usr/local/bin.  
The cw library will be installed in /usr/local/lib.  
To allow the command line tools to find the library, add the following to your .bashrc and source .bashrc  
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib  
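
Before running the parsers from Python, it may help to confirm that the command line tools are actually on the PATH. This is a minimal sketch using `shutil.which` from the standard library; the helper name `missing_tools` is an assumption of this example, and the tool names are the six that this notebook lists after installation. Note that it only checks the PATH, not whether the shared library can be found via LD_LIBRARY_PATH.

```python
import shutil

# the six Chadwick tools this notebook expects in /usr/local/bin
CW_TOOLS = ['cwbox', 'cwcomment', 'cwdaily', 'cwevent', 'cwgame', 'cwsub']


def missing_tools(tools=CW_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` suggests the install step worked; any names returned point at tools that still need to be built or added to the PATH.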

#### Or Run Windows Binaries
If you prefer to use the prebuilt windows binaries:  
Install wine: https://wiki.winehq.org/Ubuntu  
Before first use of wine: run winecfg in a terminal

You could also run the windows binaries on a Windows VM (if you own a Windows license).

### Preprocess Scripting
Preprocessing is usually performed with shell scripts or Python scripts.

Here each preprocessing step will be documented as a Jupyter Notebook Cell using Python.

In [4]:
# subprocess example
# prefer to invoke bash directly with shell=False
import subprocess

# List the 6 parsers that were built and installed from source
result = subprocess.run(["/bin/bash", "-c", "ls /usr/local/bin/cw*"], shell=False, 
                        text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [5]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse yearly event data into 52 fields of player-game data per year.
    
    There are a total of 117 fields to choose from; the first 52 are selected.
    """
    cmd = f'cwdaily -f 0-51 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)
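
The function above relies on bash to expand `{year}*.EV*`. One possible alternative, sketched here, is to expand the file list in Python and pass an argument list straight to `subprocess.run`, so no shell is involved. The helper name `cwdaily_args` is hypothetical; the flags are the same ones used above, and it is an assumption that cwdaily accepts the resulting file paths unchanged.

```python
from pathlib import Path


def cwdaily_args(year, raw_dir='.'):
    """Build the cwdaily argument list for one season without a shell."""
    # expand the {year}*.EV* pattern ourselves, in a stable order
    event_files = sorted(str(p) for p in Path(raw_dir).glob(f'{year}*.EV*'))
    return ['cwdaily', '-f', '0-51', '-n', '-y', str(year), *event_files]
```

The list could then be used as `subprocess.run(cwdaily_args(year), stdout=outfile, check=True)`, where `check=True` raises an error if the parser exits with a non-zero status instead of silently writing a partial file.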

In [6]:
# change to raw file directory
os.chdir(p_raw)

In [7]:
# parse each year of event data
for year in range(1950, 2019):
    process_cwdaily(year)

In [8]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
os.chdir(p_processed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file, parse_dates=['GAME_DT', 'APPEAR_DT']))
player_game = pd.concat(dfs)

In [9]:
# after concatenation, reset the index
player_game = player_game.reset_index(drop=True)
player_game.index

RangeIndex(start=0, stop=3688067, step=1)

In [10]:
player_game.head()

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,APPEAR_DT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK
0,BOS195004180,1950-04-18,0,1950-04-18,NYA,rizzp101,1,6,4,1,...,0,0,0,0,0,0,0,0,0,0
1,BOS195004180,1950-04-18,0,1950-04-18,NYA,henrt101,1,6,6,2,...,0,0,0,0,0,0,0,0,0,0
2,BOS195004180,1950-04-18,0,1950-04-18,NYA,baueh101,1,4,4,1,...,0,0,0,0,0,0,0,0,0,0
3,BOS195004180,1950-04-18,0,1950-04-18,NYA,woodg101,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,BOS195004180,1950-04-18,0,1950-04-18,NYA,mapec101,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# a player always appears on the same date as the game is played
(player_game['GAME_DT'] == player_game['APPEAR_DT']).all()

True

In [12]:
player_game.drop('APPEAR_DT', axis=1, inplace=True)

In [13]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['GAME_ID', 'PLAYER_ID'], keep=False)
player_game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,B_H,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK
3557003,BOS201708250,2017-08-25,0,BOS,younc004,1,3,3,0,1,...,0,0,0,0,0,0,0,0,0,0
3557005,BOS201708250,2017-08-25,0,BOS,younc004,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [14]:
ix1, ix2 = player_game[dups].index.values
ix1, ix2

(3557003, 3557005)

#### Data Correction

The box score for this game:
https://www.baseball-reference.com/boxes/BOS/BOS201708250.shtml

shows two entries for Young, one as a pinch hitter and one as the designated hitter.  Both entries appear to be correct, so the two rows should be summed.
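
A more general way to sum duplicate rows, should more than one pair ever turn up, is a groupby on the primary key. This is a sketch on a toy frame: the two `younc004` rows use the B_PA and B_H values shown above, while the `other001` row is invented for the example.

```python
import pandas as pd

rows = pd.DataFrame({
    'GAME_ID':   ['BOS201708250', 'BOS201708250', 'BOS201708250'],
    'PLAYER_ID': ['younc004', 'younc004', 'other001'],  # other001 is made up
    'B_PA':      [3, 1, 4],
    'B_H':       [1, 1, 2],
})

key = ['GAME_ID', 'PLAYER_ID']

# one row per (GAME_ID, PLAYER_ID), with the numeric stats added together
merged = rows.groupby(key, as_index=False, sort=False).sum()
print(merged)
```

On the full DataFrame this touches every row, so for a single known duplicate the targeted fix is cheaper; the groupby form is mainly useful as a safety net.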

In [15]:
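# columns from position 5 (B_G) onward are the numeric stats
# add the second row's stats into the first row, then drop the second row
player_game.iloc[ix1, 5:] += player_game.iloc[ix2, 5:]
player_game = player_game.drop(ix2)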

In [16]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['GAME_ID', 'PLAYER_ID'], keep=False)
player_game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,B_H,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK


In [17]:
def mem_usage(obj):
    if isinstance(obj, pd.DataFrame):
        mem = obj.memory_usage(deep=True).sum()
    else:
        mem = obj.memory_usage(deep=True)
        
    mem = mem / 2 ** 20 # convert to megabytes
    return f'{mem:03.2f} MB'

In [18]:
mem_usage(player_game)

'2061.09 MB'

#### Pandas can find smallest datatype for given data

In [19]:
# attempt to convert int64 to uint8
player_game_int = player_game.select_dtypes(include=['int'])
converted_int = player_game_int.apply(pd.to_numeric,downcast='unsigned')

player_game[converted_int.columns] = converted_int
mem_usage(player_game)

'903.92 MB'
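
To see what `downcast='unsigned'` does on its own, here is a small sketch with made-up values. It shows that the resulting dtype depends on the largest value present, and that a column with negative values is not converted to an unsigned type.

```python
import pandas as pd

counts = pd.Series([0, 1, 4, 255])      # largest value fits in 8 bits
bigger = pd.Series([0, 1, 4, 256])      # 256 needs more than 8 bits
signed = pd.Series([-1, 0, 1])          # negative values cannot be unsigned

small = pd.to_numeric(counts, downcast='unsigned')
wider = pd.to_numeric(bigger, downcast='unsigned')
kept = pd.to_numeric(signed, downcast='unsigned')

print(small.dtype, wider.dtype, kept.dtype)
```

This is why every integer column in player_game could end up as uint8: per-game counts are small and never negative.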

In [20]:
player_game.dtypes.value_counts()

uint8             47
object             3
datetime64[ns]     1
dtype: int64

In [21]:
player_game_obj = player_game.select_dtypes(include=['object'])
player_game_obj.columns

Index(['GAME_ID', 'TEAM_ID', 'PLAYER_ID'], dtype='object')

In [22]:
player_game_obj.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3688066 entries, 0 to 3688066
Data columns (total 3 columns):
GAME_ID      object
TEAM_ID      object
PLAYER_ID    object
dtypes: object(3)
memory usage: 112.6+ MB


##### Leave GAME_ID as is, so it can easily be joined with GAME_ID in the games dataframe

In [23]:
player_game['TEAM_ID'].nunique()

44

In [24]:
player_game['PLAYER_ID'].nunique()

11670

In [25]:
player_game.shape

(3688066, 51)

In [26]:
# TEAM_ID and PLAYER_ID only have a small number of distinct values
# relative to the size of the dataframe
# convert these to categorical
player_game['TEAM_ID'] = player_game['TEAM_ID'].astype('category')
player_game['PLAYER_ID'] = player_game['PLAYER_ID'].astype('category')
player_game.dtypes.value_counts()

uint8             47
datetime64[ns]     1
category           1
category           1
object             1
dtype: int64

In [27]:
# memory usage is now less than 1/4 of what it started out as
mem_usage(player_game)

'475.86 MB'
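
The saving from the categorical conversion can be reproduced on a toy column. This sketch repeats four team codes many times (the row count is arbitrary) and compares the deep memory usage of the object and category versions.

```python
import pandas as pd

# a long column with only a handful of distinct values
team_ids = pd.Series(['BOS', 'NYA', 'PHA', 'CLE'] * 25_000)

as_object = team_ids.astype('object')
as_category = team_ids.astype('category')

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

print(f'object:   {obj_bytes / 2 ** 20:.2f} MB')
print(f'category: {cat_bytes / 2 ** 20:.2f} MB')
```

A categorical column stores each distinct string once plus a small integer code per row, which is why it pays off when the number of distinct values is tiny relative to the number of rows.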

#### Data Note

It is possible for a player to be traded and end up playing in two different games, for two different teams, on the same day.

Here is a specific example.

In [28]:
player_game[(player_game['PLAYER_ID'] == 'morgn001') & (player_game['GAME_DT'] == '2009-05-05')]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,B_H,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK
3038597,PIT200905050,2009-05-05,0,PIT,morgn001,1,5,5,0,1,...,0,0,0,0,0,0,0,0,0,0
3056716,WAS200905050,2009-05-05,0,WAS,morgn001,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# use GAME_DT as the index
player_game = player_game.set_index('GAME_DT')

In [30]:
type(player_game.index)

pandas.core.indexes.datetimes.DatetimeIndex

#### Persist player_game

There are at least 3 good ways to persist the DataFrame:
1. Fastest and easiest: pickle
2. Reduce disk space with fast compression: csv with gzip
3. For use with other apps: csv

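This notebook writes player_game to HDF5 with `to_hdf` (which requires the PyTables package) and uses pickle for the smaller objects.

Because so many of the values are zero, gzip can reduce csv file size by a factor of 10 or more.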
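
The effect of gzip on zero-heavy csv text can be tried with the standard library alone. The rows below are synthetic (the `STAT_i` column names and `player0000` style ids are invented), so the ratio printed is only indicative and will differ for the real data.

```python
import gzip

# synthetic csv: two id columns followed by 45 zero-valued stat columns
header = 'GAME_ID,PLAYER_ID,' + ','.join(f'STAT_{i}' for i in range(45))
rows = [f'BOS195004180,player{i:04d},' + ','.join(['0'] * 45)
        for i in range(5000)]
csv_text = '\n'.join([header] + rows).encode('ascii')

compressed = gzip.compress(csv_text)
ratio = len(csv_text) / len(compressed)
print(f'{len(csv_text)} bytes -> {len(compressed)} bytes (ratio {ratio:.1f})')
```

With pandas, the same compression is available by passing `compression='gzip'` to `to_csv`.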

In [31]:
# create path objects
p_persisted = retrosheet.joinpath('persisted')

# create directories from these path objects
p_persisted.mkdir(parents=True, exist_ok=True)

In [32]:
os.chdir(p_persisted)

In [33]:
player_game.to_hdf('player_game.h5', key='player_game', mode='w', format='table')

## Retrosheet Data Dictionary (Codebook)

```
Suffix Meaning
CT     count (integer)
ID     identifier
FL     boolean flag
CD     code (enumerated data type)
DT     date
DY     day of week
TM     time

Prefix Meaning
B      batter
P      pitcher
```

In most cases, the abbreviation between the prefix and the suffix is a common baseball abbreviation.  For common baseball abbreviations see:  
http://www.espn.com/gen/editors/mlb/glossary.html
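
The prefix and suffix tables above can be turned into a small lookup helper. This is a sketch only: the function name `decode_field_name` is made up, and it simply returns None for any part that is not in the tables.

```python
SUFFIXES = {
    'CT': 'count (integer)',
    'ID': 'identifier',
    'FL': 'boolean flag',
    'CD': 'code (enumerated data type)',
    'DT': 'date',
    'DY': 'day of week',
    'TM': 'time',
}
PREFIXES = {'B': 'batter', 'P': 'pitcher'}


def decode_field_name(name):
    """Return (prefix meaning, suffix meaning) for a field name.

    Either element is None when that part is not in the codebook tables.
    """
    parts = name.split('_')
    if len(parts) < 2:
        return None, None
    return PREFIXES.get(parts[0]), SUFFIXES.get(parts[-1])
```

For instance, `B_PA` decodes to a batter field with no listed suffix, while `AWAY_HITS_CT` has no listed prefix but is a count.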

## Player-Game Data Dictionary (Codebook)
As of February 2019, I could find no published information on cwdaily.

cwdaily can be run with the '-n' flag to have it output fieldnames, but it is not clear what some of the fieldnames mean.

Luckily, the source code itself has a text description of each output field.  This description takes place within a single, very long, C statement.

The source code will be scraped to retrieve a field-name to field-description mapping.

In [34]:
# cd to dir with cwdaily.c
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [35]:
def parse_c_source(filename, struct='field_data'):
    dd = {}
    with open(filename, 'r') as cwdaily:
        # to account for patterns across lines, read the entire source code into a text string
        source = cwdaily.read()
    
        # get the single (multiline) C statement that has the field-name, field-description
        pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
        match = re.search(pattern, source, flags=re.DOTALL | re.MULTILINE)
    
        if match:
            # within this statement there are many {...} and inside each is the mapping
            pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
            for m in re.finditer(pattern, match.group(1), flags=re.DOTALL | re.MULTILINE):
                if m:
                    if len(m.group(2).split(':')) == 2:
                        desc = m.group(2).split(':')[1].strip()
                    else:
                        desc = m.group(2).strip()
                    dd[m.group(1)] = desc   
    return dd
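
To make the two-stage regex approach easier to follow, here is a self-contained sketch that runs on a short string instead of the real file. The C excerpt is hypothetical: it is shaped like the statement described above, but the identifiers and exact layout in cwdaily.c may differ. The function name `scrape_fields` is also made up for this example.

```python
import re

# Hypothetical excerpt; the real cwdaily.c may be laid out differently.
SAMPLE_SOURCE = '''
static field_struct field_data[] = {
  { cwdaily_game_id, "GAME_ID", "0: game id" },
  { cwdaily_date, "GAME_DT", "1: date" },
};
'''


def scrape_fields(source, struct='field_data'):
    """Return {field name: description} from one C array initializer."""
    # stage 1: isolate the single multi-line statement, up to its first ';'
    statement = re.search(
        r'static\s+field_struct\s+' + re.escape(struct) + r'\b.*?;',
        source, flags=re.DOTALL)
    if statement is None:
        return {}

    fields = {}
    # stage 2: each innermost {...} holds the name, then the description
    entry = r'\{[^{}]*?"(.*?)"[^{}]*?"(.*?)"[^{}]*?\}'
    for m in re.finditer(entry, statement.group(0), flags=re.DOTALL):
        # drop a leading "N:" label from the description, if present
        fields[m.group(1)] = m.group(2).split(':', 1)[-1].strip()
    return fields


print(scrape_fields(SAMPLE_SOURCE))
```

One limitation of this approach is that a description containing a semicolon would end the stage 1 match early, so the result is worth spot-checking against the source.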

In [36]:
player_game_fields_all = parse_c_source('cwdaily.c')        

In [37]:
# As of Python 3.6, dictionaries maintain insertion order
# Only the first 52 fields were selected, so that's all that is needed here
player_game_fields = {key:value for num, 
                      (key, value) in enumerate(player_game_fields_all.items()) if num < 52}
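
Taking the first N items of a dictionary can also be written with `itertools.islice`, which stops as soon as N items have been read. This is an equivalent sketch, with `first_n_items` being a name chosen for the example.

```python
from itertools import islice


def first_n_items(mapping, n):
    """Return a new dict holding the first n items in insertion order."""
    return dict(islice(mapping.items(), n))
```

With this helper, the cell above could be written as `first_n_items(player_game_fields_all, 52)`.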

### Data Dictionary Notes
In the following, TEAM_ID is the TEAM_ID of the player. 

GAME_ID is:  
```
0:3  Home TEAM_ID  
3:11 YYYYMMDD  
11   Game Count (0 for single game, 1 for 1st of double header, 2 for 2nd of double header)
```
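
A GAME_ID such as `BOS195004180` is 12 characters long: a 3-character home team code, an 8-digit date, and a 1-digit game number. The sketch below splits one apart; the function name `parse_game_id` is made up for this example.

```python
from datetime import datetime


def parse_game_id(game_id):
    """Split a 12-character GAME_ID into (home team, date, game number)."""
    if len(game_id) != 12:
        raise ValueError(f'expected 12 characters, got {game_id!r}')
    home_team = game_id[0:3]
    game_date = datetime.strptime(game_id[3:11], '%Y%m%d').date()
    game_number = int(game_id[11])
    return home_team, game_date, game_number


team, day, number = parse_game_id('BOS195004192')
print(team, day.isoformat(), number)
```

Here the trailing `2` marks the second game of a double header played in Boston on 1950-04-19.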

In [38]:
# here is the explanation of each field, as scraped from the C source code
player_game_fields

{'GAME_ID': 'game id',
 'GAME_DT': 'date',
 'GAME_CT': 'game number (0 = no double header)',
 'APPEAR_DT': 'apperance date',
 'TEAM_ID': 'team id',
 'PLAYER_ID': 'player id',
 'B_G': 'games played',
 'B_PA': 'plate appearances',
 'B_AB': 'at bats',
 'B_R': 'runs',
 'B_H': 'hits',
 'B_2B': 'doubles',
 'B_3B': 'triples',
 'B_HR': 'home runs',
 'B_RBI': 'runs batted in',
 'B_BB': 'walks',
 'B_IBB': 'intentional walks',
 'B_SO': 'strikeouts',
 'B_GDP': 'grounded into DP',
 'B_HP': 'hit by pitch',
 'B_SH': 'sacrifice hits',
 'B_SF': 'sacrifice flies',
 'B_SB': 'stolen bases',
 'B_CS': 'caught stealing',
 'B_XI': 'reached on interference',
 'P_G': 'games pitched',
 'P_GS': 'games started',
 'P_CG': 'complete games',
 'P_SHO': 'shutouts',
 'P_GF': 'games finished',
 'P_W': 'wins',
 'P_L': 'losses',
 'P_SV': 'saves',
 'P_OUT': 'outs recorded (innings pitched times 3)',
 'P_TBF': 'batters faced',
 'P_AB': 'at bats',
 'P_R': 'runs allowed',
 'P_ER': 'earned runs allowed',
 'P_H': 'hits allowed',

#### Persist player_game_fields

In [39]:
import csv
import pickle
os.chdir(p_persisted)

with open('player_game_fields.pickle','wb') as p:
    pickle.dump(player_game_fields, p)

## Parse Event Data for Game Statistics
Additional information about the game itself is available.

In [40]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into 46 fields of game data per year.
    
    For each game, there are 84 standard fields and 95 extended fields to choose from.  
    Only the first 46 standard fields are chosen.
    """
    cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [41]:
# change to raw file directory
os.chdir(p_raw)

In [42]:
# parse each year of event data
for year in range(1950, 2019):
    process_cwgame(year)

In [43]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(p_processed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    dfs.append(pd.read_csv(file, parse_dates=['GAME_DT']))
game = pd.concat(dfs)

In [44]:
# after concatenation, reset the index
game = game.reset_index(drop=True)
game.head()

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,GAME_DY,START_GAME_TM,DH_FL,DAYNIGHT_PARK_CD,AWAY_TEAM_ID,HOME_TEAM_ID,PARK_ID,...,AWAY_HITS_CT,HOME_HITS_CT,AWAY_ERR_CT,HOME_ERR_CT,AWAY_LOB_CT,HOME_LOB_CT,WIN_PIT_ID,LOSE_PIT_ID,SAVE_PIT_ID,GWRBI_BAT_ID
0,BOS195004180,1950-04-18,0,Tuesday,0,F,D,NYA,BOS,BOS07,...,15,15,0,0,9,13,johnd102,mastw101,pagej101,
1,BOS195004192,1950-04-19,2,Wednesday,0,F,D,NYA,BOS,BOS07,...,15,10,0,1,10,7,lopae101,kinde101,pagej101,
2,BOS195004280,1950-04-28,0,Friday,0,F,D,PHA,BOS,BOS07,...,8,8,0,0,7,7,parnm101,kella103,,
3,BOS195004301,1950-04-30,1,Sunday,0,F,D,PHA,BOS,BOS07,...,5,17,2,0,5,7,dobsj101,fowld101,,
4,BOS195004302,1950-04-30,2,Sunday,0,F,D,PHA,BOS,BOS07,...,10,12,2,0,5,11,stobc101,wyseh101,papaa101,


In [45]:
# the primary key is (GAME_ID), verify no dups
dups = game.duplicated(subset=['GAME_ID'], keep='last')
game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,GAME_DY,START_GAME_TM,DH_FL,DAYNIGHT_PARK_CD,AWAY_TEAM_ID,HOME_TEAM_ID,PARK_ID,...,AWAY_HITS_CT,HOME_HITS_CT,AWAY_ERR_CT,HOME_ERR_CT,AWAY_LOB_CT,HOME_LOB_CT,WIN_PIT_ID,LOSE_PIT_ID,SAVE_PIT_ID,GWRBI_BAT_ID


In [46]:
game.dtypes.value_counts()

object            23
int64             21
datetime64[ns]     1
float64            1
dtype: int64

In [47]:
game_float = game.select_dtypes(['float'])
game_float.columns

Index(['EDIT_RECORD_TS'], dtype='object')

In [48]:
game_float['EDIT_RECORD_TS'].nunique()

0

In [49]:
game.drop('EDIT_RECORD_TS', axis=1, inplace=True)

In [50]:
mem_usage(game)

'191.66 MB'

In [51]:
# convert int64 to uint8
game_int = game.select_dtypes(include=['int'])
converted_int = game_int.apply(pd.to_numeric,downcast='unsigned')

game[converted_int.columns] = converted_int
mem_usage(game)

'175.63 MB'

In [52]:
game_obj = game.select_dtypes(include=['object'])

In [53]:
game_obj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135559 entries, 0 to 135558
Data columns (total 23 columns):
GAME_ID                 135559 non-null object
GAME_DY                 135559 non-null object
DH_FL                   135559 non-null object
DAYNIGHT_PARK_CD        135559 non-null object
AWAY_TEAM_ID            135559 non-null object
HOME_TEAM_ID            135559 non-null object
PARK_ID                 135559 non-null object
AWAY_START_PIT_ID       135559 non-null object
HOME_START_PIT_ID       135559 non-null object
BASE4_UMP_ID            135559 non-null object
BASE1_UMP_ID            135559 non-null object
BASE2_UMP_ID            135559 non-null object
BASE3_UMP_ID            135559 non-null object
LF_UMP_ID               206 non-null object
RF_UMP_ID               11 non-null object
SCORER_RECORD_ID        66794 non-null object
TRANSLATOR_RECORD_ID    49352 non-null object
INPUTTER_RECORD_ID      71612 non-null object
INPUT_RECORD_TS         54459 non-null object
WIN_PIT

In [54]:
game_obj.nunique()

GAME_ID                 135559
GAME_DY                      7
DH_FL                        2
DAYNIGHT_PARK_CD             2
AWAY_TEAM_ID                44
HOME_TEAM_ID                44
PARK_ID                     80
AWAY_START_PIT_ID         3598
HOME_START_PIT_ID         3565
BASE4_UMP_ID               457
BASE1_UMP_ID               478
BASE2_UMP_ID               494
BASE3_UMP_ID               510
LF_UMP_ID                   34
RF_UMP_ID                   10
SCORER_RECORD_ID          5148
TRANSLATOR_RECORD_ID       335
INPUTTER_RECORD_ID         495
INPUT_RECORD_TS          53839
WIN_PIT_ID                4638
LOSE_PIT_ID               5009
SAVE_PIT_ID               2792
GWRBI_BAT_ID              1502
dtype: int64

In [55]:
game['GAME_DY'] = game['GAME_DY'].astype('category')
game['DH_FL'] = game['DH_FL'].astype('category')
game['DAYNIGHT_PARK_CD'] = game['DAYNIGHT_PARK_CD'].astype('category')
game['HOME_TEAM_ID'] = game['HOME_TEAM_ID'].astype('category')
game['AWAY_TEAM_ID'] = game['AWAY_TEAM_ID'].astype('category')
game['PARK_ID'] = game['PARK_ID'].astype('category')

In [56]:
mem_usage(game)

'127.53 MB'

In [57]:
# remove fields that won't be used in subsequent analysis
drop_columns = ['BASE4_UMP_ID', 
                'BASE1_UMP_ID', 
                'BASE2_UMP_ID', 
                'BASE3_UMP_ID',
                'SCORER_RECORD_ID', 
                'INPUTTER_RECORD_ID', 
                'LF_UMP_ID', 
                'RF_UMP_ID',
                'TRANSLATOR_RECORD_ID', 
                'INPUT_RECORD_TS', 
                'METHOD_RECORD_CD',
                'PITCHES_RECORD_CD']
game.drop(drop_columns, axis=1, inplace=True)

In [58]:
mem_usage(game)

'60.50 MB'

In [59]:
os.chdir(p_persisted)

In [60]:
with open('game.pickle','wb') as p:
    pickle.dump(game, p)

## Game Data Dictionary (Codebook)

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

This data could be scraped from the webpage, but since a parser that reads this mapping from the C source code was already written above, it is simpler to reuse it.

In [61]:
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [62]:
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')           

In [63]:
len(game_reg_fields), len(game_ext_fields)

(84, 95)

#### Data Dictionary Note
DH_FL: Designated Hitter Flag, True if DH in use  
GWRBI_BAT_ID: Batter ID for batter who got Game Winning RBI  

In [64]:
# As of Python 3.6, dictionaries maintain insertion order
game_fields = {key:value for num, (key, value) in enumerate(game_reg_fields.items()) if num < 46}

# as per above, edit_record_ts has no data
del game_fields['EDIT_RECORD_TS']
game_fields

{'GAME_ID': 'game id',
 'GAME_DT': 'date',
 'GAME_CT': 'game number (0 = no double header)',
 'GAME_DY': 'day of week',
 'START_GAME_TM': 'start time',
 'DH_FL': 'DH used flag',
 'DAYNIGHT_PARK_CD': 'day/night flag',
 'AWAY_TEAM_ID': 'visiting team',
 'HOME_TEAM_ID': 'home team',
 'PARK_ID': 'game site',
 'AWAY_START_PIT_ID': 'vis. starting pitcher',
 'HOME_START_PIT_ID': 'home starting pitcher',
 'BASE4_UMP_ID': 'home plate umpire',
 'BASE1_UMP_ID': 'first base umpire',
 'BASE2_UMP_ID': 'second base umpire',
 'BASE3_UMP_ID': 'third base umpire',
 'LF_UMP_ID': 'left field umpire',
 'RF_UMP_ID': 'right field umpire',
 'ATTEND_PARK_CT': 'attendance',
 'SCORER_RECORD_ID': 'PS scorer',
 'TRANSLATOR_RECORD_ID': 'translator',
 'INPUTTER_RECORD_ID': 'inputter',
 'INPUT_RECORD_TS': 'input time',
 'METHOD_RECORD_CD': 'how scored',
 'PITCHES_RECORD_CD': 'pitches entered?',
 'TEMP_PARK_CT': 'temperature',
 'WIND_DIRECTION_PARK_CD': 'wind direction',
 'WIND_SPEED_PARK_CT': 'wind speed',
 'FIELD_PA

#### Persist game_fields

In [65]:
os.chdir(p_persisted)

with open('game_fields.pickle','wb') as p:
    pickle.dump(game_fields, p)

## Player Lookup Table

There is no separate file for this.  It will be scraped from a web page.

In [66]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [67]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

# read from this string instead of file
players = pd.read_csv(StringIO(table_txt))

In [68]:
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,04/06/2004,,,
1,aaroh101,Aaron,Hank,04/13/1954,,,
2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,aased001,Aase,Don,07/26/1977,,,
4,abada001,Abad,Andy,09/10/2001,,,


#### Persist Players

In [69]:
os.chdir(p_persisted)

with open('players.pickle','wb') as p:
    pickle.dump(players, p)

## Stadium Lookup Table
There is no separate file for this in the event data.  It will be read from a text file on the Retrosheet website.

In [70]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt))

In [71]:
parks.head()

Unnamed: 0,PARKID,NAME,AKA,CITY,STATE,START,END,LEAGUE,NOTES
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,


#### Persist Stadiums

In [72]:
os.chdir(p_persisted)

with open('parks.pickle','wb') as p:
    pickle.dump(parks, p)