# Retrosheet Baseball Data: Download & Parse

This Jupyter notebook prepares Retrosheet data for easy analysis with Pandas or with SQL against a database.

The two most popular open source Baseball Data Sources are:  
Lahman:  http://www.seanlahman.com/resources/  
Retrosheet:  https://www.retrosheet.org/site.htm

Lahman has data about each player summarized by year.  Retrosheet has data at the play-by-play level (called "event data").

The Retrosheet play-by-play data will be downloaded and parsed into daily game data:  
https://www.retrosheet.org/game.htm  

The only open-source parsers available for Retrosheet are by Dr. T. L. Turocy:  
Description: http://chadwick.sourceforge.net/doc/index.html  
Source:  https://sourceforge.net/projects/chadwick/

## Data Wrangling

Preparing data for data analysis or machine learning is called "Data Wrangling".  An older term is "Data Preprocessing".

The Retrosheet event data includes every play for every major league game since 1921. 
A subset of that data will be used here.

Data Wrangling will include:
1. Parsing and manipulating player-per-game data from event files.
2. Parsing and manipulating game data from event files.
3. Creating "lookup tables" by web scraping.
4. Creating data dictionaries (aka codebooks) from scraping Dr. Turocy's C source code.

At the end of the data wrangling, 6 DataFrames will exist:
1. **player_game:** player per game stats 
2. **player_game_fields:** player_game field descriptions
3. **game:** game stats
4. **game_fields:** game field descriptions
5. **players:** player info
6. **stadiums:** stadium info

The above 6 dataframes will be persisted for use with other notebooks.

## Repeatable Research
All data processing should be documented so that others can repeat the results.

This notebook documents all preprocessing steps taken with the data available from Retrosheet.

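Retrosheet makes its data freely available but retains copyright, and asks that its usage notice accompany any use of the data; see the Retrosheet website for the exact wording.  
The GPL applies to the Chadwick parsers used below, not to the Retrosheet data:  
https://www.gnu.org/licenses/gpl.html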

## Download and Unpack Retrosheet Data

The event data will be downloaded.  The data is zipped ASCII text, one file per year, at URLs of the form:  
http://www.retrosheet.org/events/{year}eve.zip

There are many ways to download files in Python.  For a simple binary file download, wget may be the easiest.
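
If the third-party wget package is not installed, the standard library can do the same job. The following is a minimal sketch of that alternative, not the code this notebook runs; `download_if_missing` and its `fetch` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest_dir, fetch=urlretrieve):
    """Download url into dest_dir unless the file is already there.

    fetch is any callable taking (url, filename). It defaults to
    urllib.request.urlretrieve and can be swapped out for testing.
    """
    dest = Path(dest_dir) / url.rsplit('/', 1)[-1]
    if not dest.exists():
        fetch(url, str(dest))
    return dest
```

Skipping the download when the file already exists keeps the cell safe to re-run.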

### Create Directories
* ~/data/retrosheet/raw  
* ~/data/retrosheet/processed  

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_processed = retrosheet.joinpath('processed')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_processed.mkdir(parents=True, exist_ok=True)

### Download and Unzip the Event Data
Data is available from 1921 to present.

Here, data from 1950 through 2018 will be downloaded and unzipped.

This will result in a (temporary) 2+ GB Pandas DataFrame, so choose more or fewer years as appropriate for your computer's resources.

In [3]:
# change to raw file directory
os.chdir(p_raw)

for year in range(1950,2019):   
    # download each file, if it doesn't exist
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")

### Unzipped Data File Types
The unzipped data consists of 3 types of files:
1. *.EVA and *.EVN -- these are American League and National League event files per team per year
2. *.ROS -- these are the rosters per team per year
3. TEAM* -- the MLB teams in existence per year
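
The three file types can be told apart by name alone. A small sketch, assuming only the naming patterns listed above; `classify_retrosheet_file` is a hypothetical helper, not part of Retrosheet or Chadwick.

```python
from pathlib import Path


def classify_retrosheet_file(name):
    """Return 'event', 'roster', 'team', or 'other' for an unzipped file name."""
    path = Path(name)
    suffix = path.suffix.upper()
    if suffix in ('.EVA', '.EVN'):
        return 'event'
    if suffix == '.ROS':
        return 'roster'
    if path.name.upper().startswith('TEAM'):
        return 'team'
    return 'other'
```

For example, `classify_retrosheet_file('1950BOS.EVA')` falls in the event group.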

## 1. Parse Event Data for Player per Game Statistics

The event data is in a format that is very difficult to work with.  There is one open-source project which has parsers for the Retrosheet event data.  This project has 6 parsers.  Each of these parsers is fed event data and produces csv or XML or text output.

The two parsers that are of interest for this study are:
1. cwdaily
2. cwgame

The cwbox parser produces a box score in the form MLB fans are accustomed to seeing (or it can produce XML with appropriate tags).  cwdaily produces the same information, but formats it as one line per player per game, which is much easier to work with.

The Retrosheet data parser tools are described at:  
http://chadwick.sourceforge.net/doc/index.html  
  
They are distributed under the GPL:  
https://www.gnu.org/licenses/gpl.html  

Note: as of February 2019, the cwdaily parser, published in July 2018, is not described on the above webpage.

### Build Chadwick Parsers on Linux
Go To:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later, and optionally download the Windows binaries.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

The cw command line tools will be installed in /usr/local/bin.  
The cw library will be installed in /usr/local/lib.  
To allow the command line tools to find the library, add the following to your .bashrc and source .bashrc  
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib  
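
Before running the cells that call the parsers, it can help to confirm that the tools are actually on the PATH. A minimal sketch using the standard library; `REQUIRED_TOOLS` and `missing_tools` are hypothetical names introduced here.

```python
import shutil

# The two Chadwick tools this notebook relies on.
REQUIRED_TOOLS = ('cwdaily', 'cwgame')


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` means both parsers were found.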

### Or Run Windows Binaries
If you prefer to use the prebuilt windows binaries:  
Install wine: https://wiki.winehq.org/Ubuntu  
Before first use of wine: run winecfg in a terminal

You could also run the windows binaries on a Windows VM (if you own a Windows license).

### Data Wrangling Scripting
As part of the initial data processing pipeline, Data Wrangling is often performed using shell scripts or Python scripts.

Here each preprocessing step will be documented as a Jupyter Notebook Cell using Python.

In [4]:
# subprocess example
# prefer to invoke bash directly with shell=False
import subprocess

# List the 6 parsers that were built and installed from source
result = subprocess.run(["/bin/bash", "-c", "ls /usr/local/bin/cw*"], shell=False, 
                        text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [5]:
# if you are running windows binaries under Linux, 
# prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse yearly event data into 52 fields of player-game data per year.
    
    There are a total of 117 fields to choose from; the first 52 are selected.
    """
    cmd = f'cwdaily -f 0-51 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)
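
The cell above relies on bash to expand the `{year}*.EV*` wildcard. An alternative sketch expands the wildcard in Python, passes an argument list directly, and uses `check=True` so that a failing parser raises an exception instead of silently writing an incomplete file. It assumes cwdaily is on the PATH; the flags are copied from the cell above, and `build_cwdaily_args` and `run_cwdaily` are hypothetical names.

```python
import subprocess
from pathlib import Path


def build_cwdaily_args(year, raw_dir='.'):
    """Build the cwdaily argument list, expanding the file wildcard in Python."""
    event_files = sorted(p.name for p in Path(raw_dir).glob(f'{year}*.EV*'))
    return ['cwdaily', '-f', '0-51', '-n', '-y', str(year), *event_files]


def run_cwdaily(year, out_path, raw_dir='.'):
    """Run cwdaily for one year from inside raw_dir; raise on failure."""
    args = build_cwdaily_args(year, raw_dir)
    with open(out_path, 'w') as outfile:
        # cwd=raw_dir keeps the relative event file names valid
        subprocess.run(args, stdout=outfile, check=True, cwd=raw_dir)
```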

In [6]:
# change to raw file directory
os.chdir(p_raw)

In [None]:
# parse each year of event data
for year in range(1950, 2019):
    process_cwdaily(year)

In [7]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
import numpy as np
os.chdir(p_processed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file, parse_dates=['GAME_DT', 'APPEAR_DT']))
player_game = pd.concat(dfs)

In [8]:
# after concatenation, reset the index
player_game = player_game.reset_index(drop=True)
player_game.index

RangeIndex(start=0, stop=3688067, step=1)

In [9]:
player_game.head()

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,APPEAR_DT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK
0,BOS195004180,1950-04-18,0,1950-04-18,NYA,rizzp101,1,6,4,1,...,0,0,0,0,0,0,0,0,0,0
1,BOS195004180,1950-04-18,0,1950-04-18,NYA,henrt101,1,6,6,2,...,0,0,0,0,0,0,0,0,0,0
2,BOS195004180,1950-04-18,0,1950-04-18,NYA,baueh101,1,4,4,1,...,0,0,0,0,0,0,0,0,0,0
3,BOS195004180,1950-04-18,0,1950-04-18,NYA,woodg101,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,BOS195004180,1950-04-18,0,1950-04-18,NYA,mapec101,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# a player always appears on the same date as the game is played
(player_game['GAME_DT'] == player_game['APPEAR_DT']).all()

True

In [11]:
player_game.drop('APPEAR_DT', axis=1, inplace=True)

In [12]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['GAME_ID', 'PLAYER_ID'], keep=False)
player_game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,B_H,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK
3557003,BOS201708250,2017-08-25,0,BOS,younc004,1,3,3,0,1,...,0,0,0,0,0,0,0,0,0,0
3557005,BOS201708250,2017-08-25,0,BOS,younc004,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [13]:
ix1, ix2 = player_game[dups].index.values
ix1, ix2

(3557003, 3557005)

### Data Correction

Checking the box score for this game:  
https://www.baseball-reference.com/boxes/BOS/BOS201708250.shtml

Shows 2 entries for Young for the same game, one as a pinch-hitter and one as the designated-hitter.  It would appear that both entries are correct and that the data should be summed.
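
One way to express "sum the duplicate rows" without relying on row positions is to group on the primary key. This is a sketch on invented toy data, not the cell the notebook runs; note that a grouped sum would also add up B_G (games played) if that column were included, which may or may not be desired.

```python
import pandas as pd

# Toy rows (values invented for illustration): player p1 appears twice in game G1.
toy = pd.DataFrame({
    'GAME_ID':   ['G1', 'G1', 'G1'],
    'PLAYER_ID': ['p1', 'p1', 'p2'],
    'TEAM_ID':   ['BOS', 'BOS', 'BOS'],
    'B_AB':      [3, 1, 4],
    'B_H':       [1, 1, 0],
})

key = ['GAME_ID', 'PLAYER_ID']
stat_cols = ['B_AB', 'B_H']

# Sum the numeric stats per key; keep the first value of the non-stat column.
agg = {col: 'sum' for col in stat_cols}
agg['TEAM_ID'] = 'first'
merged = toy.groupby(key, as_index=False, sort=False).agg(agg)
```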

In [14]:
player_game.iloc[ix1, 5:] += player_game.iloc[ix2, 5:]
player_game = player_game.drop(ix2)

In [15]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['GAME_ID', 'PLAYER_ID'], keep=False)
player_game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,B_H,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK


### Optimizing Pandas Data Types for each Variable

Reasons for using the data type with the least memory include:
1. Reduce DataFrame memory.
2. Increase performance.
3. Provide information to both the data analyst and to other software libraries about that variable.

No attribute, per player per game, can exceed 255.  For example, no batter can get more than 255 hits in one game, no pitcher can allow more than 255 hits in one game, etc.  An unsigned 8-bit integer (uint8) is therefore large enough for every count.

Converting an object column that holds a relatively small number of unique string values to a category also tells the reader that the variable is categorical.

In other languages, a "category" is called a "factor" or "enumerated data type".
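
The effect of both conversions can be seen on a small invented DataFrame. This is only an illustrative sketch; the column names mirror the ones used in this notebook, but the values are made up.

```python
import numpy as np
import pandas as pd

n = 100_000
toy = pd.DataFrame({
    'B_H': np.zeros(n, dtype='int64'),                       # small counts stored as int64
    'TEAM_ID': pd.Series(['BOS', 'NYA'] * (n // 2), dtype='object'),
})
before = toy.memory_usage(deep=True)

small = toy.copy()
small['B_H'] = pd.to_numeric(small['B_H'], downcast='unsigned')   # int64 -> uint8
small['TEAM_ID'] = small['TEAM_ID'].astype('category')            # object -> category
after = small.memory_usage(deep=True)

print(before, after, sep='\n')
```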

In [16]:
def mem_usage(obj):
    if isinstance(obj, pd.DataFrame):
        mem = obj.memory_usage(deep=True).sum()
    else:
        mem = obj.memory_usage(deep=True)
        
    mem = mem / 2 ** 20 # convert to megabytes
    return f'{mem:03.2f} MB'

In [17]:
mem_usage(player_game)

'2061.09 MB'

In [18]:
# data types by count
player_game.dtypes.value_counts()

int64             47
object             3
datetime64[ns]     1
dtype: int64

In [19]:
# Fraction of values that are unique
player_game_obj = player_game.select_dtypes(include=['object'])
player_game_obj.nunique() / player_game_obj.shape[0]

GAME_ID      0.036756
TEAM_ID      0.000012
PLAYER_ID    0.003164
dtype: float64

In [20]:
# this optimization is good for player-game and game
def optimize_data_types(df):
    df = df.copy()
    
    # int64 -> smallest uint allowed by data
    df_int = df.select_dtypes(include=['int'])
    df_int = df_int.apply(pd.to_numeric,downcast='unsigned')
    df[df_int.columns] = df_int

    # object -> category
    df_obj = df.select_dtypes(include=['object'])
    df_obj = df_obj.astype('category')
    df[df_obj.columns] = df_obj
    
    return df

In [21]:
player_game = optimize_data_types(player_game)

In [22]:
# data types by count
player_game.dtypes.value_counts()

uint8             47
category           1
category           1
datetime64[ns]     1
category           1
dtype: int64

In [23]:
mem_usage(player_game)

'261.16 MB'

### Data Note

It is possible for a player to be traded and end up playing in two different games, for two different teams, on the same day.

Here is a specific example.

In [24]:
player_game[(player_game['PLAYER_ID'] == 'morgn001')
          & (player_game['GAME_DT'] == '2009-05-05')]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,B_H,...,P_BB,P_IBB,P_SO,P_GDP,P_HP,P_SH,P_SF,P_XI,P_WP,P_BK
3038597,PIT200905050,2009-05-05,0,PIT,morgn001,1,5,5,0,1,...,0,0,0,0,0,0,0,0,0,0
3056716,WAS200905050,2009-05-05,0,WAS,morgn001,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### Persist DataFrame

Techniques to persist a DataFrame include:
1. Pickle
2. hdf5
3. csv or compressed csv
4. write directly to database

Some of the pros and cons to the above: 
1. Pickle is easiest, but format may change over time and loading a pickled file from an untrusted source could execute malicious code.
2. hdf5 has too much overhead for dataframes less than about 1 GB
3. csv (optionally compressed) loses optimized Pandas data types
4. Although Pandas can read/write directly to a database, a data analyst may not have necessary DB Admin privileges

This notebook will save the dataframes as csv files.  Any datatype optimizations will have to be reapplied upon reading the data back in.

The largest dataframe, player_game, will be compressed with gzip.  Due to the sparsity of this dataframe, the csv file can be reduced by a factor of 10+ with gzip.
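
The loss of optimized data types through a CSV round trip can be demonstrated on a tiny invented DataFrame. This is a sketch for illustration only; it writes to a temporary directory rather than the notebook's data directories.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    'TEAM_ID': pd.Series(['BOS', 'NYA', 'BOS'], dtype='category'),
    'B_H': pd.Series([0, 2, 1], dtype='uint8'),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy.csv.gz'
    toy.to_csv(path, index=False)      # compression is inferred from the .gz suffix
    restored = pd.read_csv(path)

print(toy.dtypes)
print(restored.dtypes)   # the uint8 and category dtypes are not preserved
```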

In [25]:
# create path objects
p_persisted = retrosheet.joinpath('persisted')

# create directories from these path objects
p_persisted.mkdir(parents=True, exist_ok=True)

# change working dir
os.chdir(p_persisted)

# persist as compressed csv file
player_game.to_csv('player_game.csv.gz', compression='infer')

#### To Read Back Use:
```
player_game = pd.read_csv('player_game.csv.gz', index_col=0, parse_dates=['GAME_DT'])
player_game = optimize_data_types(player_game)
```

## Retrosheet Data Dictionary General
A "data dictionary" is also called a "codebook".

```
Suffix Meaning
CT     count (integer)
ID     identifier
FL     boolean flag
CD     code (enumerated data type)
DT     date
DY     day of week
TM     time

Prefix Meaning
B      batter
P      pitcher
```

In most cases, the abbreviation between the prefix and the suffix is a common baseball abbreviation.  For common baseball abbreviations see:  
http://www.espn.com/gen/editors/mlb/glossary.html
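
The naming convention can be applied mechanically. A small sketch that looks up the prefix and suffix of a field name in the tables above; `describe_field`, `PREFIXES`, and `SUFFIXES` are hypothetical names introduced here.

```python
SUFFIXES = {
    'CT': 'count (integer)',
    'ID': 'identifier',
    'FL': 'boolean flag',
    'CD': 'code (enumerated data type)',
    'DT': 'date',
    'DY': 'day of week',
    'TM': 'time',
}
PREFIXES = {'B': 'batter', 'P': 'pitcher'}


def describe_field(name):
    """Split a field name on '_' and look up its prefix and suffix meanings.

    Either element of the returned pair is None when no rule applies.
    """
    parts = name.split('_')
    if len(parts) < 2:
        return None, None
    return PREFIXES.get(parts[0]), SUFFIXES.get(parts[-1])
```

For instance, `describe_field('B_HR')` identifies a batter field, while `describe_field('GAME_DT')` identifies a date.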

## 2. Scrape Data for Player-Game Data Dictionary
As of February 2019, I could find no published information on cwdaily.

cwdaily can be run with the '-n' flag to have it output field names, but it is not clear what some of the field names mean.

Luckily, the source code itself has a text description of each field name.  These descriptions appear within a single, very long, C statement.

The source code will be scraped to retrieve a field name to field description mapping.

In [26]:
# cd to dir with cwdaily.c
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [27]:
def parse_c_source(filename, struct='field_data'):
    """Return {field name: description} scraped from a Chadwick C source file."""
    dd = {}
    # to account for patterns across lines, read entire source code
    with open(filename, 'r') as f:
        source = f.read()

    # get the single (multiline) C statement that has field descriptions
    pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
    match = re.search(pattern, source, flags=re.DOTALL)
    if not match:
        return dd

    # each {...} entry holds a quoted field name and a quoted description
    pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
    for m in re.finditer(pattern, match.group(1), flags=re.DOTALL):
        name, desc = m.group(1), m.group(2)
        # drop a leading "category:" prefix from the description, if present
        if len(desc.split(':')) == 2:
            desc = desc.split(':')[1]
        dd[name] = desc.strip()
    return dd
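
To see how the two regular expressions cooperate, here is the same two-stage idea applied to a made-up C fragment. The fragment only imitates the general shape the parser expects; the real cwdaily.c may differ in detail, and the identifiers inside it are invented.

```python
import re

# Made-up C fragment (an assumption for illustration, not copied from Chadwick).
source = '''
static field_struct field_data[] = {
  { cwdaily_game_id, "GAME_ID", "game id" },
  { cwdaily_b_h, "B_H", "batting: hits" },
};
'''

# Stage 1: isolate the single multi-line statement, up to the first ';'.
statement = re.search(r'static\s+field_struct\s+field_data.*?;', source,
                      flags=re.DOTALL).group(0)

# Stage 2: in each {...} entry, capture the first two quoted strings.
fields = {}
for m in re.finditer(r'{.*?"(.*?)".*?"(.*?)".*?}', statement, flags=re.DOTALL):
    name, desc = m.group(1), m.group(2)
    # Keep only the text after the last ':' (the whole string if there is none).
    fields[name] = desc.split(':')[-1].strip()

print(fields)
```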

In [28]:
player_game_fields_all = parse_c_source('cwdaily.c')        

In [29]:
# As of Python 3.6, dictionaries maintain insertion order
# Only the first 52 fields were selected, so that is all that is needed here
player_game_fields = {key:value for num, 
        (key, value) in enumerate(player_game_fields_all.items()) if num < 52}

### Data Dictionary Notes
In the following, TEAM_ID is the TEAM_ID of the player. 

GAME_ID is:  
```
0:3  Home TEAM_ID  
3:11 YYYYMMDD  
11   Game Count
```

Game Count is:
* 0 for single game
* 1 for 1st game of double header
* 2 for 2nd game of double header
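
A GAME_ID such as `BOS195004180` can be split into its three parts with plain string slicing. A minimal sketch; `parse_game_id` is a hypothetical helper introduced here, and it assumes every GAME_ID is exactly 12 characters long.

```python
from datetime import datetime


def parse_game_id(game_id):
    """Split a 12-character GAME_ID into (home team, date, game count)."""
    home_team = game_id[0:3]
    game_date = datetime.strptime(game_id[3:11], '%Y%m%d').date()
    game_count = int(game_id[11])
    return home_team, game_date, game_count
```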

In [30]:
# here is the explanation of each field, as scraped from the C source code
player_game_fields

{'GAME_ID': 'game id',
 'GAME_DT': 'date',
 'GAME_CT': 'game number (0 = no double header)',
 'APPEAR_DT': 'apperance date',
 'TEAM_ID': 'team id',
 'PLAYER_ID': 'player id',
 'B_G': 'games played',
 'B_PA': 'plate appearances',
 'B_AB': 'at bats',
 'B_R': 'runs',
 'B_H': 'hits',
 'B_2B': 'doubles',
 'B_3B': 'triples',
 'B_HR': 'home runs',
 'B_RBI': 'runs batted in',
 'B_BB': 'walks',
 'B_IBB': 'intentional walks',
 'B_SO': 'strikeouts',
 'B_GDP': 'grounded into DP',
 'B_HP': 'hit by pitch',
 'B_SH': 'sacrifice hits',
 'B_SF': 'sacrifice flies',
 'B_SB': 'stolen bases',
 'B_CS': 'caught stealing',
 'B_XI': 'reached on interference',
 'P_G': 'games pitched',
 'P_GS': 'games started',
 'P_CG': 'complete games',
 'P_SHO': 'shutouts',
 'P_GF': 'games finished',
 'P_W': 'wins',
 'P_L': 'losses',
 'P_SV': 'saves',
 'P_OUT': 'outs recorded (innings pitched times 3)',
 'P_TBF': 'batters faced',
 'P_AB': 'at bats',
 'P_R': 'runs allowed',
 'P_ER': 'earned runs allowed',
 'P_H': 'hits allowed',

### Persist player_game_fields

In [31]:
os.chdir(p_persisted)

# index=[0] is required for dictionary of scalar values
player_game_fields_df = pd.DataFrame(player_game_fields, index=[0])
player_game_fields_df.to_csv('player_game_fields.csv')

## 3. Parse Event Data for Game Statistics
Additional information about the game itself is available.

In [32]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into 46 fields of game data per year.
    
    For each game, there are 84 standard fields and 95 extended fields to choose from.  
    Only the first 46 standard fields are chosen.
    """
    cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [33]:
# change to raw file directory
os.chdir(p_raw)

In [None]:
# parse each year of event data
for year in range(1950, 2019):
    process_cwgame(year)

In [34]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(p_processed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    dfs.append(
        pd.read_csv(file, 
            keep_default_na=False,
            na_values={'ATTEND_PARK_CT':[-1,0],
                       'TEMP_PARK_CT':[-1,0]}))
game = pd.concat(dfs)
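
The combination of `keep_default_na=False` and a per-column `na_values` dictionary is easier to see on a tiny invented CSV. This sketch is for illustration only and uses made-up rows.

```python
from io import StringIO

import pandas as pd

# Toy CSV (values invented for illustration).
csv_text = (
    'GAME_ID,ATTEND_PARK_CT,HOME_ERR_CT,SAVE_PIT_ID\n'
    'G1,0,0,\n'
    'G2,25000,1,pagej101\n'
)

toy = pd.read_csv(
    StringIO(csv_text),
    keep_default_na=False,                   # empty ID fields stay '' rather than NaN
    na_values={'ATTEND_PARK_CT': [-1, 0]},   # only in this column do -1 and 0 mean "missing"
)
print(toy)
```

A zero in HOME_ERR_CT remains a genuine zero, while a zero attendance is treated as missing.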

In [35]:
game.head(3)

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,GAME_DY,START_GAME_TM,DH_FL,DAYNIGHT_PARK_CD,AWAY_TEAM_ID,HOME_TEAM_ID,PARK_ID,...,AWAY_HITS_CT,HOME_HITS_CT,AWAY_ERR_CT,HOME_ERR_CT,AWAY_LOB_CT,HOME_LOB_CT,WIN_PIT_ID,LOSE_PIT_ID,SAVE_PIT_ID,GWRBI_BAT_ID
0,BOS195004180,19500418,0,Tuesday,0,F,D,NYA,BOS,BOS07,...,15,15,0,0,9,13,johnd102,mastw101,pagej101,
1,BOS195004192,19500419,2,Wednesday,0,F,D,NYA,BOS,BOS07,...,15,10,0,1,10,7,lopae101,kinde101,pagej101,
2,BOS195004280,19500428,0,Friday,0,F,D,PHA,BOS,BOS07,...,8,8,0,0,7,7,parnm101,kella103,,


In [36]:
# the primary key is (GAME_ID), verify no dups
dups = game.duplicated(subset=['GAME_ID'], keep=False)
game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,GAME_DY,START_GAME_TM,DH_FL,DAYNIGHT_PARK_CD,AWAY_TEAM_ID,HOME_TEAM_ID,PARK_ID,...,AWAY_HITS_CT,HOME_HITS_CT,AWAY_ERR_CT,HOME_ERR_CT,AWAY_LOB_CT,HOME_LOB_CT,WIN_PIT_ID,LOSE_PIT_ID,SAVE_PIT_ID,GWRBI_BAT_ID


In [37]:
# these columns will not be used in the analysis
drop_columns = ['EDIT_RECORD_TS',
                'WIND_DIRECTION_PARK_CD',
                'WIND_SPEED_PARK_CT',
                'FIELD_PARK_CD',
                'PRECIP_PARK_CD',
                'SKY_PARK_CD',                
                'BASE4_UMP_ID', 
                'BASE1_UMP_ID', 
                'BASE2_UMP_ID', 
                'BASE3_UMP_ID',
                'SCORER_RECORD_ID', 
                'INPUTTER_RECORD_ID', 
                'LF_UMP_ID', 
                'RF_UMP_ID',
                'TRANSLATOR_RECORD_ID', 
                'INPUT_RECORD_TS', 
                'METHOD_RECORD_CD',
                'PITCHES_RECORD_CD']

In [38]:
game = game.drop(drop_columns, axis=1)

In [39]:
game.dtypes.value_counts()

object     13
int64      13
float64     2
dtype: int64

### START_GAME_TM is Integer between 0 and 1259, no AM/PM

A value of zero means the start time is unknown.

Game time is in the local time zone of the home team's stadium.

The DAYNIGHT_PARK_CD is never missing. 'N' for night. 'D' for day.

MLB domain knowledge suggests that games do not start before 9 am or after 11 pm.

A rain delay, or a long first game of a double header, can cause a start time after 9 pm, but not after 11 pm.

When the above is combined with the day/night code
* start_game_tm == 0 => use midnight (to represent unknown time)
* start_game_tm >= 1200 => pm
* start_game_tm < 900 => pm
* 900 <= start_game_tm < 1200, day/night = day, => am
* 900 <= start_game_tm < 1200, day/night = night, => pm

In [40]:
# compute fraction of unknown start times on or after 1990-10-10 (GAME_DT is still an integer YYYYMMDD)
unknown = (game['START_GAME_TM'] == 0)
recent = (game['GAME_DT'] >= 19901010)
(unknown & recent).sum() / recent.sum()

0.10122166707287351

In [41]:
# compute fraction of unknown start times on or before 1960-10-10 (GAME_DT is still an integer YYYYMMDD)
unknown = (game['START_GAME_TM'] == 0)
old = (game['GAME_DT'] <= 19601010)
(unknown & old).sum() / old.sum()

0.931405721316518

#### Summary of Above
Through 1960-10-10, about 93% of games have no recorded start time.  
From 1990-10-10 onward, only about 10% of games are missing a start time.  

### Parse GAME_DT, START_GAME_TM

GAME_DT was read by Pandas as an integer of YYYYMMDD  
START_GAME_TM was read by Pandas as an integer HHMM  

As per above:
* start_game_tm == 0 => use midnight (to represent unknown time)
* start_game_tm >= 1200 => pm
* start_game_tm < 900 => pm
* 900 <= start_game_tm < 1200, day/night = day, => am
* 900 <= start_game_tm < 1200, day/night = night, => pm

Convert GAME_DT, START_GAME_TM, DAYNIGHT_PARK_CD to a Pandas datetime.

In [42]:
def parse_datetime(row):
    date = row['GAME_DT']
    time = row['START_GAME_TM']
    day_night = row['DAYNIGHT_PARK_CD']
    
    if time > 0 and time < 900:
        time += 1200
    elif (900 <= time < 1200) and day_night == 'N':
        time += 1200

    time_str = f'{time//100:02d}:{time%100:02d}'
    datetime_str = str(date) + ' ' + time_str
    return pd.to_datetime(datetime_str, format='%Y%m%d %H:%M')
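
Calling a function through `DataFrame.apply` with `axis=1` runs Python code once per row. The same rules can be written with column-wise operations, which is usually faster on large frames. The following is a sketch of that alternative under the same am/pm rules; `parse_datetime_vectorized` is a hypothetical name, not part of this notebook.

```python
import pandas as pd


def parse_datetime_vectorized(df):
    """Combine GAME_DT, START_GAME_TM and DAYNIGHT_PARK_CD into datetimes."""
    time = df['START_GAME_TM'].astype('int64')
    night = df['DAYNIGHT_PARK_CD'] == 'N'

    # 1-859 is always pm; 900-1159 is pm only for night games.
    is_pm = ((time > 0) & (time < 900)) | ((time >= 900) & (time < 1200) & night)
    time24 = time.where(~is_pm, time + 1200)

    # Convert HHMM to minutes after midnight and add to the date.
    minutes = (time24 // 100) * 60 + time24 % 100
    dates = pd.to_datetime(df['GAME_DT'].astype(str), format='%Y%m%d')
    return dates + pd.to_timedelta(minutes, unit='min')
```

An unknown start time (0) still maps to midnight, as in the row-wise version.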

In [43]:
game['GAME_DT'] = game.apply(parse_datetime, axis=1)

In [44]:
game = game.drop('START_GAME_TM', axis=1)

In [45]:
# spot check: verify 2nd game starts after 1st game
not_null = (game['GAME_DT'].dt.hour != 0) 
double_header = (game['GAME_CT'] != 0)
recent = (game['GAME_DT'].dt.year >= 2015)
criteria = not_null & double_header & recent
(game[criteria][['HOME_TEAM_ID','GAME_DT','GAME_CT']]
 .sort_values(by=['HOME_TEAM_ID','GAME_DT','GAME_CT'])).head(20)

Unnamed: 0,HOME_TEAM_ID,GAME_DT,GAME_CT
45,ANA,2015-07-20 14:07:00,1
46,ANA,2015-07-20 19:07:00,2
241,ATL,2015-10-04 13:10:00,1
242,ATL,2015-10-04 16:05:00,2
189,ATL,2017-06-10 13:07:00,1
190,ATL,2017-06-10 18:05:00,2
228,ATL,2017-09-06 13:36:00,1
229,ATL,2017-09-06 19:36:00,2
182,ATL,2018-05-28 13:13:00,1
183,ATL,2018-05-28 22:07:00,2


### Optimize Data Types

Although GAME_ID is a unique string, it will be joined with GAME_ID in player_game, where it is represented as a category, so represent it as a category here as well.

In [46]:
df_obj = game.select_dtypes(include=['object'])
df_obj.nunique() / df_obj.shape[0]

GAME_ID              1.000000
GAME_DY              0.000052
DH_FL                0.000015
DAYNIGHT_PARK_CD     0.000015
AWAY_TEAM_ID         0.000325
HOME_TEAM_ID         0.000325
PARK_ID              0.000590
AWAY_START_PIT_ID    0.026542
HOME_START_PIT_ID    0.026299
WIN_PIT_ID           0.034221
LOSE_PIT_ID          0.036958
SAVE_PIT_ID          0.020604
GWRBI_BAT_ID         0.011087
dtype: float64

In [47]:
# optimize_data_types will
#  use smallest uint that can hold value
#  convert all objects to category
game = optimize_data_types(game)

In [48]:
game.dtypes

GAME_ID                    category
GAME_DT              datetime64[ns]
GAME_CT                       uint8
GAME_DY                    category
DH_FL                      category
DAYNIGHT_PARK_CD           category
AWAY_TEAM_ID               category
HOME_TEAM_ID               category
PARK_ID                    category
AWAY_START_PIT_ID          category
HOME_START_PIT_ID          category
ATTEND_PARK_CT              float64
TEMP_PARK_CT                float64
MINUTES_GAME_CT              uint16
INN_CT                        uint8
AWAY_SCORE_CT                 uint8
HOME_SCORE_CT                 uint8
AWAY_HITS_CT                  uint8
HOME_HITS_CT                  uint8
AWAY_ERR_CT                   uint8
HOME_ERR_CT                   uint8
AWAY_LOB_CT                   uint8
HOME_LOB_CT                   uint8
WIN_PIT_ID                 category
LOSE_PIT_ID                category
SAVE_PIT_ID                category
GWRBI_BAT_ID               category
dtype: object

In [49]:
# unique key is: (date, home team, game count)
(game.duplicated(subset=['GAME_DT', 'HOME_TEAM_ID', 'GAME_CT'], keep=False)).any()

False

In [50]:
# unique key is also: game_id 
(game.duplicated(subset=['GAME_ID'], keep=False)).any()

False

In [51]:
mem_usage(game)

'24.52 MB'

In [52]:
os.chdir(p_persisted)
game.to_csv('game.csv')

## 4. Scrape Data for Game Data Dictionary

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

This data could be scraped from the webpage, but since the parser written above already extracts this mapping from the C source code, it is simpler to reuse it.

Note: the codes for some of the \_CD fields are only specified on the above web page, but the \_CD fields are not being used in this study.

In [53]:
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [54]:
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')           

In [55]:
# there are 84 regular fields and 95 extended fields
len(game_reg_fields), len(game_ext_fields)

(84, 95)

#### Data Dictionary Note
DH_FL: Designated Hitter Flag, 'T' if DH in use, else 'F'  
DAYNIGHT_PARK_CD: 'N' for night, 'D' for day  
GWRBI_BAT_ID: Batter ID for batter who got Game Winning RBI  

In [56]:
# As of Python 3.6, dictionaries maintain insertion order
game_fields = {key:value for num, 
    (key, value) in enumerate(game_reg_fields.items()) if num < 46}

for key in drop_columns:
    del game_fields[key]

game_fields

{'GAME_ID': 'game id',
 'GAME_DT': 'date',
 'GAME_CT': 'game number (0 = no double header)',
 'GAME_DY': 'day of week',
 'START_GAME_TM': 'start time',
 'DH_FL': 'DH used flag',
 'DAYNIGHT_PARK_CD': 'day/night flag',
 'AWAY_TEAM_ID': 'visiting team',
 'HOME_TEAM_ID': 'home team',
 'PARK_ID': 'game site',
 'AWAY_START_PIT_ID': 'vis. starting pitcher',
 'HOME_START_PIT_ID': 'home starting pitcher',
 'ATTEND_PARK_CT': 'attendance',
 'TEMP_PARK_CT': 'temperature',
 'MINUTES_GAME_CT': 'time of game',
 'INN_CT': 'number of innings',
 'AWAY_SCORE_CT': 'visitor final score',
 'HOME_SCORE_CT': 'home final score',
 'AWAY_HITS_CT': 'visitor hits',
 'HOME_HITS_CT': 'home hits',
 'AWAY_ERR_CT': 'visitor errors',
 'HOME_ERR_CT': 'home errors',
 'AWAY_LOB_CT': 'visitor left on base',
 'HOME_LOB_CT': 'home left on base',
 'WIN_PIT_ID': 'winning pitcher',
 'LOSE_PIT_ID': 'losing pitcher',
 'SAVE_PIT_ID': 'save for',
 'GWRBI_BAT_ID': 'GW RBI'}

#### Persist game_fields

In [57]:
os.chdir(p_persisted)

# index=[0] is required for dictionary of scalar values
game_fields_df = pd.DataFrame(game_fields, index=[0])
game_fields_df.to_csv('game_fields.csv')

## 5. Scrape Data for Player Lookup Table

There is no separate file for this.  It will be scraped from a web page.

In [58]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [59]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

# read from this string instead of file
players = pd.read_csv(StringIO(table_txt), parse_dates=['Play debut'])
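
Reading CSV text from memory rather than from a file can be tried on a small invented table. This sketch uses made-up rows in the same general shape as the scraped text; for well-formed input like this toy text, `read_csv` strips the double quotes itself because `"` is its default quote character.

```python
from io import StringIO

import pandas as pd

# Toy text (rows invented for illustration); every field is double-quoted.
table_txt = (
    '"ID","Last","First","Play debut"\n'
    '"doej001","Doe","John","04/13/1954"\n'
    '"roer001","Roe","Richard","07/26/1977"\n'
)

# StringIO makes the string look like an open file to read_csv.
toy_players = pd.read_csv(StringIO(table_txt), parse_dates=['Play debut'])
print(toy_players.dtypes)
```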

In [60]:
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,2004-04-06,,,
1,aaroh101,Aaron,Hank,1954-04-13,,,
2,aarot101,Aaron,Tommie,1962-04-10,,04/06/1979,
3,aased001,Aase,Don,1977-07-26,,,
4,abada001,Abad,Andy,2001-09-10,,,


#### Persist Players

In [61]:
os.chdir(p_persisted)
players.to_csv('players.csv')

## 6. Scrape Data for Stadium Lookup Table
There is no separate file for this; it will be scraped from a web page.

In [62]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt), parse_dates=['START', 'END'])

In [63]:
parks.head()

Unnamed: 0,PARKID,NAME,AKA,CITY,STATE,START,END,LEAGUE,NOTES
0,ALB01,Riverside Park,,Albany,NY,1880-09-11,1882-05-30,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,1884-04-30,1884-05-31,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,1966-04-19,NaT,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,1972-04-21,1993-10-03,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,1994-04-11,NaT,AL,


#### Persist Stadiums

In [64]:
os.chdir(p_persisted)
parks.to_csv('parks.csv')