# Retrosheet Baseball Data -- Part 1

The Retrosheet event data includes every play for every major league game since 1921.

Only a subset of that data will be used here.

At the end of this preprocessing, the following Pandas DataFrames will exist:
1. player-game:  stats per player per game
2. game: stats per game
3. players: player_id -> player lookup
4. stadiums: stadium_id -> stadium lookup

## Repeatable Research
All data processing should be documented so that others can repeat the results.

This notebook documents all preprocessing steps taken with the data available from Retrosheet.

Retrosheet licenses their data on the GPL:  
https://www.gnu.org/licenses/gpl.html

## Download and Unpack Retrosheet Data

The Retrosheet website is:  
https://www.retrosheet.org/game.htm  

The event data will be downloaded.  The data is zipped ASCII text, one file per year, at URLs of the form:  
http://www.retrosheet.org/events/{year}eve.zip

There are many ways to download files in Python.  For a simple binary file download, wget may be the easiest.
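As one alternative to the wget package, here is a minimal sketch that uses only the standard library.  The function names `event_zip_url` and `download_if_missing` are made up for illustration, and the `fetch` parameter is an assumption added so the download step can be swapped out or tested without network access.

```python
from pathlib import Path
from urllib.request import urlretrieve


def event_zip_url(year):
    """Build the event-file URL for one season, using the pattern given above."""
    return f'http://www.retrosheet.org/events/{year}eve.zip'


def download_if_missing(year, dest_dir, fetch=urlretrieve):
    """Download {year}eve.zip into dest_dir unless the file is already there.

    fetch is any callable taking (url, filename); it defaults to urlretrieve.
    """
    dest = Path(dest_dir) / f'{year}eve.zip'
    if not dest.exists():
        fetch(event_zip_url(year), str(dest))
    return dest
```

Calling `download_if_missing(1950, p_raw)` would fetch the 1950 archive once and skip the download on later runs.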

### Create Directories
* ~/data/retrosheet/raw  
* ~/data/retrosheet/processed  

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_processed = retrosheet.joinpath('processed')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_processed.mkdir(parents=True, exist_ok=True)

### Download and Unzip the Event Data
Data is available from 1921 to present.

Here, data from 1950 through 2018 will be downloaded and unzipped.

This will result in a pandas DataFrame of more than 3.2 GB, so choose more or fewer years as appropriate for your computer.

In [3]:
# change to raw file directory
os.chdir(p_raw)

for year in range(1950,2019):   
    # download each file, if it doesn't exist
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")
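The loop above uses {year}BOS.EVA as a marker for "already unzipped".  A slightly more defensive sketch is to ask the archive for its member names and extract only the ones that are missing.  The helper name `extract_missing` is made up for illustration.

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, dest_dir='.'):
    """Extract only the archive members that do not yet exist in dest_dir.

    Returns the list of member names that were extracted.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path, 'r') as zf:
        missing = [name for name in zf.namelist() if not (dest / name).exists()]
        # an empty members list extracts nothing
        zf.extractall(dest, members=missing)
    return missing
```

This costs one directory lookup per member, but it does not depend on any particular team file being present in every archive.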

### Unzipped Data Files
The unzipped data consists of 3 types of files:
1. *.EVA and *.EVN -- these are American League and National League event files per team per year
2. *.ROS -- these are the rosters per team per year
3. TEAM* -- the MLB teams in existence per year
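The three file types can be told apart by name alone.  The sketch below is one way to do that; the function name and the category labels ('event', 'roster', 'team', 'other') are made up for illustration.

```python
from fnmatch import fnmatchcase


def classify_retrosheet_file(name):
    """Classify an unzipped file by the naming patterns listed above."""
    upper = name.upper()
    if fnmatchcase(upper, '*.EVA') or fnmatchcase(upper, '*.EVN'):
        return 'event'    # American or National League event file
    if fnmatchcase(upper, '*.ROS'):
        return 'roster'   # roster per team per year
    if upper.startswith('TEAM'):
        return 'team'     # teams in existence for the year
    return 'other'
```

For example, `classify_retrosheet_file('1950BOS.EVA')` gives `'event'`, while the downloaded `1950eve.zip` itself falls into `'other'`.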

## Parse Event Data for Player Statistics

The event data is in a format that is very difficult to work with.  There is one open-source project, Chadwick, that provides parsers for the Retrosheet event data.  The project has 6 parsers; each takes event data as input and writes a csv (or text) file of related fields as output.

The two parsers that are of interest for player-game data are:
1. cwdaily
2. cwgame

The Retrosheet data parser tools are described at:  
http://chadwick.sourceforge.net/doc/index.html  
They are distributed under the GPL:  
https://www.gnu.org/licenses/gpl.html  

Note: as of January 2019, the cwdaily parser, written in 2018, is not described on the above webpage.

#### Build Parsers on Linux
Go to:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later, and optionally download the Windows binaries.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

The cw command line tools will be installed in /usr/local/bin.  
The cw library will be installed in /usr/local/lib.  
To allow the command line tools to find the library, add the following to your .bashrc and source .bashrc  
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib  
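Before calling the parsers from Python, it may help to confirm that they can be found on PATH.  This is a minimal sketch using `shutil.which`; the helper name `missing_tools` is made up for illustration.  It only checks PATH, so it will not detect a missing LD_LIBRARY_PATH entry.

```python
import shutil

# the two parsers used in this notebook
CW_TOOLS = ['cwdaily', 'cwgame']


def missing_tools(tools=CW_TOOLS):
    """Return the names in tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` suggests both parsers are installed where the shell can find them.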

#### Or Run Windows Binaries
If you prefer to use the prebuilt Windows binaries:  
Install Wine: https://wiki.winehq.org/Ubuntu  
Before first use of Wine: run winecfg in a terminal

You could also run the Windows binaries on a Windows VM (if you own a Windows license).

### Preprocess Scripting
Preprocessing is usually performed with shell scripts or Python scripts.

Here each preprocessing step will be documented as a Jupyter Notebook Cell using Python.

In [4]:
# subprocess example
# prefer to invoke bash directly with shell=False
import subprocess

# List the 6 parsers that were just built
result = subprocess.run(["/bin/bash", "-c", "ls /usr/local/bin/cw*"], shell=False, 
                        text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [5]:
# if you are running Windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse yearly event data into 117 fields of player-game data per year.
    
    There are a total of 117 fields to choose from; all are chosen.
    """
    cmd = f'cwdaily -f 0-116 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)
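As an alternative to passing a command string to bash, the argument list can be built in Python and the wildcard expanded with `glob`.  This is a sketch only: the flags are copied from the command above, the helper names `cwdaily_args` and `run_cwdaily` are made up, and `check=True` is an added assumption so that a failing parser raises an error instead of silently leaving a partial csv file.

```python
import glob
import subprocess


def cwdaily_args(year, event_files):
    """Build the cwdaily argument list, using the same flags as the cell above."""
    return ['cwdaily', '-f', '0-116', '-n', '-y', str(year), *event_files]


def run_cwdaily(year, out_path):
    """Run cwdaily on the year's event files in the current directory."""
    # expand the wildcard in Python rather than in the shell
    event_files = sorted(glob.glob(f'{year}*.EV*'))
    with open(out_path, 'w') as outfile:
        subprocess.run(cwdaily_args(year, event_files), stdout=outfile, check=True)
```

To run the Windows binaries under Wine, `'wine'` would be inserted as the first element of the list.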

In [6]:
# change to raw file directory
os.chdir(p_raw)

In [7]:
# parse each year of event data
for year in range(1950, 2019):
    process_cwdaily(year)

In [8]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
os.chdir(p_processed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file))
player_game = pd.concat(dfs)

In [9]:
# after concatenation, reset the index
player_game = player_game.reset_index(drop=True)
player_game.index

RangeIndex(start=0, stop=3688067, step=1)

In [10]:
player_game.head()

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,APPEAR_DT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,...,F_CF_E,F_CF_DP,F_CF_TP,F_RF_G,F_RF_OUT,F_RF_PO,F_RF_A,F_RF_E,F_RF_DP,F_RF_TP
0,BOS195004180,19500418,0,19500418,NYA,rizzp101,1,6,4,1,...,0,0,0,0,0,0,0,0,0,0
1,BOS195004180,19500418,0,19500418,NYA,henrt101,1,6,6,2,...,0,0,0,0,0,0,0,0,0,0
2,BOS195004180,19500418,0,19500418,NYA,baueh101,1,4,4,1,...,0,0,0,1,21,2,0,0,0,0
3,BOS195004180,19500418,0,19500418,NYA,woodg101,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,BOS195004180,19500418,0,19500418,NYA,mapec101,1,1,0,0,...,0,0,0,1,6,0,0,0,0,0
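In the rows above, GAME_ID appears to combine the home team code, the game date, and the game number, matching the GAME_DT and GAME_CT columns.  Assuming that 12-character layout holds, the key can be split as sketched below.  The names `GameKey` and `parse_game_id` are made up for illustration.

```python
from datetime import date
from typing import NamedTuple


class GameKey(NamedTuple):
    home_team: str
    game_date: date
    game_number: int   # 0 = no double header, else 1 or 2


def parse_game_id(game_id):
    """Split a GAME_ID such as 'BOS195004180' into its three parts.

    Assumes the layout: 3-char team code + YYYYMMDD + 1-digit game number.
    """
    if len(game_id) != 12:
        raise ValueError(f'unexpected GAME_ID length: {game_id!r}')
    game_date = date(int(game_id[3:7]), int(game_id[7:9]), int(game_id[9:11]))
    return GameKey(game_id[:3], game_date, int(game_id[11]))
```

A check such as comparing the parsed date with GAME_DT for every row would confirm whether the assumed layout holds across all years.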


In [11]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['GAME_ID', 'PLAYER_ID'], keep='last')
player_game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,APPEAR_DT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,...,F_CF_E,F_CF_DP,F_CF_TP,F_RF_G,F_RF_OUT,F_RF_PO,F_RF_A,F_RF_E,F_RF_DP,F_RF_TP
3557003,BOS201708250,20170825,0,20170825,BOS,younc004,1,3,3,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# Out of the over 3 million records, there was a dup
# Arbitrarily remove one of these records
player_game = player_game.drop_duplicates(subset=['GAME_ID', 'PLAYER_ID'], keep='last')
dups = player_game.duplicated(subset=['GAME_ID', 'PLAYER_ID'], keep='last')
player_game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,APPEAR_DT,TEAM_ID,PLAYER_ID,B_G,B_PA,B_AB,B_R,...,F_CF_E,F_CF_DP,F_CF_TP,F_RF_G,F_RF_OUT,F_RF_PO,F_RF_A,F_RF_E,F_RF_DP,F_RF_TP
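Before dropping a duplicate arbitrarily, it can be worth checking whether the copies are identical in every column.  The sketch below uses `keep=False` so that all copies are shown, not just the extras.  The helper name `duplicate_report` and the toy data are made up for illustration.

```python
import pandas as pd


def duplicate_report(df, key):
    """Return (rows sharing a key, whether every such row has an exact full-row copy)."""
    key_dups = df[df.duplicated(subset=key, keep=False)]
    # with no subset, duplicated() compares every column
    all_exact = bool(key_dups.duplicated(keep=False).all())
    return key_dups, all_exact


# toy data: the first and third rows share a key and are identical
toy = pd.DataFrame({
    'GAME_ID': ['G1', 'G1', 'G1', 'G2'],
    'PLAYER_ID': ['p1', 'p2', 'p1', 'p1'],
    'B_AB': [3, 4, 3, 5],
})
rows, all_exact = duplicate_report(toy, ['GAME_ID', 'PLAYER_ID'])
```

If `all_exact` is True, which copy is kept makes no difference.  If it is False, the differing columns deserve a look before a row is removed.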


In [13]:
player_game.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3688066 entries, 0 to 3688066
Columns: 117 entries, GAME_ID to F_RF_TP
dtypes: int64(114), object(3)
memory usage: 3.2+ GB
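Most of the memory goes to the 114 int64 columns, many of which hold small counts.  One possible way to reduce memory use, sketched here on toy data, is to downcast each integer column to the smallest integer dtype that fits its values.  The helper name `downcast_int_columns` is made up, and this step is not part of the original preprocessing.

```python
import pandas as pd


def downcast_int_columns(df):
    """Return a copy of df with each int64 column stored in the smallest fitting integer dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=['int64']).columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    return out


# toy data standing in for a few player-game columns
toy = pd.DataFrame({
    'B_AB': [4, 6, 0],
    'F_RF_OUT': [21, 0, 300],
    'TEAM_ID': ['NYA', 'NYA', 'BOS'],
})
small = downcast_int_columns(toy)
```

Comparing `df.memory_usage(deep=True).sum()` before and after would show how much is saved on the real data.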


In [14]:
player_game.dtypes.value_counts()

int64     114
object      3
dtype: int64

In [15]:
text_flds = player_game.select_dtypes(['object'])
text_flds.columns

Index(['GAME_ID', 'TEAM_ID', 'PLAYER_ID'], dtype='object')

## Player-Game Data Dictionary (Codebook)
As of January 2019, I could find no published information on cwdaily.

cwdaily can be run with the '-n' flag to have it output fieldnames, but it is not clear what some of the fieldnames mean.

Luckily, the source code itself has a text description of each output field.  These descriptions appear within a single, very long C statement.

The source code will be scraped to retrieve a field-name to field-description mapping.

In [16]:
# cd to dir with cwdaily.c
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [17]:
def parse_c_source(filename, struct='field_data'):
    dd = {}
    with open(filename, 'r') as cwdaily:
        # to account for patterns across lines, read the entire source code into a text string
        source = cwdaily.read()
    
        # get the single (multiline) C statement that has the field-name, field-description
        pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
        match = re.search(pattern, source, flags=re.DOTALL | re.MULTILINE)
    
        if match:
            # within this statement there are many {...} and inside each is the mapping
            pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
            for m in re.finditer(pattern, match.group(1), flags=re.DOTALL | re.MULTILINE):
                if m:
                    if len(m.group(2).split(':')) == 2:
                        desc = m.group(2).split(':')[1].strip()
                    else:
                        desc = m.group(2).strip()
                    dd[m.group(1)] = desc   
    return dd
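To see how the two regular expressions work together, here is a self-contained sketch run against a small C fragment.  The fragment is a made-up stand-in shaped like the statement the regex expects, not the actual contents of cwdaily.c, and `parse_field_table` is a hypothetical variant that takes the source text instead of a filename.

```python
import re

# hypothetical C fragment, for illustration only
SAMPLE_C = '''
static field_struct field_data[] = {
  /*  0 */ { example_game_id, "GAME_ID", "game id" },
  /*  1 */ { example_b_g, "B_G", "batting: games played" },
};
'''


def parse_field_table(source, struct='field_data'):
    """Map field-name -> field-description from a C initializer in source."""
    dd = {}
    # step 1: isolate the single multiline statement, up to its first ';'
    statement = re.search(r'static\s+field_struct\s+' + re.escape(struct) + r'.*?;',
                          source, flags=re.DOTALL)
    if statement is None:
        return dd
    # step 2: each {...} entry holds two quoted strings: name, then description
    for m in re.finditer(r'\{.*?"(.*?)".*?"(.*?)".*?\}', statement.group(0), flags=re.DOTALL):
        name, desc = m.group(1), m.group(2)
        parts = desc.split(':')
        # keep only the text after a single 'prefix:' if present
        dd[name] = parts[1].strip() if len(parts) == 2 else desc.strip()
    return dd


fields = parse_field_table(SAMPLE_C)
```

Both patterns are non-greedy, so the first stops at the first semicolon and the second at the first closing brace.  A description containing either character would therefore be cut short.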

In [18]:
player_game_fields = parse_c_source('cwdaily.c')        

In [19]:
# dictionary of field-name -> field-description
player_game_fields

{'GAME_ID': 'game id',
 'GAME_DT': 'date',
 'GAME_CT': 'game number (0 = no double header)',
 'APPEAR_DT': 'apperance date',
 'TEAM_ID': 'team id',
 'PLAYER_ID': 'player id',
 'B_G': 'games played',
 'B_PA': 'plate appearances',
 'B_AB': 'at bats',
 'B_R': 'runs',
 'B_H': 'hits',
 'B_2B': 'doubles',
 'B_3B': 'triples',
 'B_HR': 'home runs',
 'B_RBI': 'runs batted in',
 'B_BB': 'walks',
 'B_IBB': 'intentional walks',
 'B_SO': 'strikeouts',
 'B_GDP': 'grounded into DP',
 'B_HP': 'hit by pitch',
 'B_SH': 'sacrifice hits',
 'B_SF': 'sacrifice flies',
 'B_SB': 'stolen bases',
 'B_CS': 'caught stealing',
 'B_XI': 'reached on interference',
 'P_G': 'games pitched',
 'P_GS': 'games started',
 'P_CG': 'complete games',
 'P_SHO': 'shutouts',
 'P_GF': 'games finished',
 'P_W': 'wins',
 'P_L': 'losses',
 'P_SV': 'saves',
 'P_OUT': 'outs recorded (innings pitched times 3)',
 'P_TBF': 'batters faced',
 'P_AB': 'at bats',
 'P_R': 'runs allowed',
 'P_ER': 'earned runs allowed',
 'P_H': 'hits allowed',

## Parse Event Data for Game Statistics
Additional information about the game itself is available.

In [20]:
# if you are running Windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into 46 fields of game data per year.
    
    For each game, there are 84 standard fields and 95 extended fields to choose from.  
    The first 46 standard fields (0-45) are chosen.
    """
    cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../processed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [21]:
# change to raw file directory
os.chdir(p_raw)

In [22]:
# parse each year of event data
for year in range(1950, 2019):
    process_cwgame(year)

In [23]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(p_processed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    dfs.append(pd.read_csv(file))
game = pd.concat(dfs)

In [24]:
# after concatenation, reset the index
game = game.reset_index(drop=True)
game.head()

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,GAME_DY,START_GAME_TM,DH_FL,DAYNIGHT_PARK_CD,AWAY_TEAM_ID,HOME_TEAM_ID,PARK_ID,...,AWAY_HITS_CT,HOME_HITS_CT,AWAY_ERR_CT,HOME_ERR_CT,AWAY_LOB_CT,HOME_LOB_CT,WIN_PIT_ID,LOSE_PIT_ID,SAVE_PIT_ID,GWRBI_BAT_ID
0,BOS195004180,19500418,0,Tuesday,0,F,D,NYA,BOS,BOS07,...,15,15,0,0,9,13,johnd102,mastw101,pagej101,
1,BOS195004192,19500419,2,Wednesday,0,F,D,NYA,BOS,BOS07,...,15,10,0,1,10,7,lopae101,kinde101,pagej101,
2,BOS195004280,19500428,0,Friday,0,F,D,PHA,BOS,BOS07,...,8,8,0,0,7,7,parnm101,kella103,,
3,BOS195004301,19500430,1,Sunday,0,F,D,PHA,BOS,BOS07,...,5,17,2,0,5,7,dobsj101,fowld101,,
4,BOS195004302,19500430,2,Sunday,0,F,D,PHA,BOS,BOS07,...,10,12,2,0,5,11,stobc101,wyseh101,papaa101,


In [25]:
# the primary key is (GAME_ID), verify no dups
dups = game.duplicated(subset=['GAME_ID'], keep='last')
game[dups]

Unnamed: 0,GAME_ID,GAME_DT,GAME_CT,GAME_DY,START_GAME_TM,DH_FL,DAYNIGHT_PARK_CD,AWAY_TEAM_ID,HOME_TEAM_ID,PARK_ID,...,AWAY_HITS_CT,HOME_HITS_CT,AWAY_ERR_CT,HOME_ERR_CT,AWAY_LOB_CT,HOME_LOB_CT,WIN_PIT_ID,LOSE_PIT_ID,SAVE_PIT_ID,GWRBI_BAT_ID


In [26]:
game.dtypes.value_counts()

object     23
int64      22
float64     1
dtype: int64

In [27]:
game_txt = game.select_dtypes(['object'])
game_txt.columns

Index(['GAME_ID', 'GAME_DY', 'DH_FL', 'DAYNIGHT_PARK_CD', 'AWAY_TEAM_ID',
       'HOME_TEAM_ID', 'PARK_ID', 'AWAY_START_PIT_ID', 'HOME_START_PIT_ID',
       'BASE4_UMP_ID', 'BASE1_UMP_ID', 'BASE2_UMP_ID', 'BASE3_UMP_ID',
       'LF_UMP_ID', 'RF_UMP_ID', 'SCORER_RECORD_ID', 'TRANSLATOR_RECORD_ID',
       'INPUTTER_RECORD_ID', 'INPUT_RECORD_TS', 'WIN_PIT_ID', 'LOSE_PIT_ID',
       'SAVE_PIT_ID', 'GWRBI_BAT_ID'],
      dtype='object')

In [28]:
game_float = game.select_dtypes(['float'])
game_float.columns

Index(['EDIT_RECORD_TS'], dtype='object')

In [29]:
game_float['EDIT_RECORD_TS'].nunique()

0

In [30]:
game = game.drop(['EDIT_RECORD_TS'], axis=1)
game.dtypes.value_counts()

object    23
int64     22
dtype: int64

## Game Data Dictionary (Codebook)

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

This data could be scraped from the webpage, but since a parser that reads this mapping from C source code was already written above, it is simpler to reuse it.

In [31]:
p_src = retrosheet.joinpath('src')
os.chdir(p_src)

In [32]:
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')           

In [33]:
len(game_reg_fields), len(game_ext_fields)

(84, 95)

In [34]:
# Dictionaries maintain insertion order (guaranteed as of Python 3.7; a CPython implementation detail in 3.6)
game_fields = {key:value for num, (key, value) in enumerate(game_reg_fields.items()) if num < 46}

# as per above, edit_record_ts has no data
del game_fields['EDIT_RECORD_TS']
game_fields

{'GAME_ID': 'game id',
 'GAME_DT': 'date',
 'GAME_CT': 'game number (0 = no double header)',
 'GAME_DY': 'day of week',
 'START_GAME_TM': 'start time',
 'DH_FL': 'DH used flag',
 'DAYNIGHT_PARK_CD': 'day/night flag',
 'AWAY_TEAM_ID': 'visiting team',
 'HOME_TEAM_ID': 'home team',
 'PARK_ID': 'game site',
 'AWAY_START_PIT_ID': 'vis. starting pitcher',
 'HOME_START_PIT_ID': 'home starting pitcher',
 'BASE4_UMP_ID': 'home plate umpire',
 'BASE1_UMP_ID': 'first base umpire',
 'BASE2_UMP_ID': 'second base umpire',
 'BASE3_UMP_ID': 'third base umpire',
 'LF_UMP_ID': 'left field umpire',
 'RF_UMP_ID': 'right field umpire',
 'ATTEND_PARK_CT': 'attendance',
 'SCORER_RECORD_ID': 'PS scorer',
 'TRANSLATOR_RECORD_ID': 'translator',
 'INPUTTER_RECORD_ID': 'inputter',
 'INPUT_RECORD_TS': 'input time',
 'METHOD_RECORD_CD': 'how scored',
 'PITCHES_RECORD_CD': 'pitches entered?',
 'TEMP_PARK_CT': 'temperature',
 'WIND_DIRECTION_PARK_CD': 'wind direction',
 'WIND_SPEED_PARK_CT': 'wind speed',
 'FIELD_PA
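Taking the first n entries of a dictionary can also be written with `itertools.islice`, which stops after n items instead of enumerating the whole dictionary.  This is a small sketch; the helper name `first_n_items` is made up for illustration.

```python
from itertools import islice


def first_n_items(mapping, n):
    """Return a new dict holding the first n items of mapping, in insertion order."""
    return dict(islice(mapping.items(), n))
```

With this helper, `first_n_items(game_reg_fields, 46)` would select the same 46 fields as the comprehension above.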

## Player Lookup Table

There is no separate file for this.  It will be scraped from a web page.

In [35]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [36]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

# read from this string instead of file
players = pd.read_csv(StringIO(table_txt))

In [37]:
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,04/06/2004,,,
1,aaroh101,Aaron,Hank,04/13/1954,,,
2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,aased001,Aase,Don,07/26/1977,,,
4,abada001,Abad,Andy,09/10/2001,,,
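Since the goal is a player_id -> player lookup, here is a minimal sketch of building one as a plain dictionary.  It uses the standard library csv module on two of the rows shown above; the helper name `build_player_lookup` and the "First Last" name format are assumptions made for illustration.

```python
import csv
from io import StringIO

# two rows copied from the output above
SAMPLE = """ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
aardd001,Aardsma,David,04/06/2004,,,
aaroh101,Aaron,Hank,04/13/1954,,,
"""


def build_player_lookup(csv_text):
    """Map each player ID to a 'First Last' display name."""
    reader = csv.DictReader(StringIO(csv_text))
    return {row['ID']: f"{row['First']} {row['Last']}" for row in reader}


lookup = build_player_lookup(SAMPLE)
```

With the pandas DataFrame, a similar lookup could be obtained by setting ID as the index.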


## Stadium Lookup Table
There is no separate file for this.  It will be read from a text file on the Retrosheet website.

In [38]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt))

In [39]:
parks.head()

Unnamed: 0,PARKID,NAME,AKA,CITY,STATE,START,END,LEAGUE,NOTES
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,
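The START and END columns are text in MM/DD/YYYY form, with an empty END for parks still in use.  The sketch below parses them and checks whether a park was in use on a given date.  The helper names are made up, the sample rows are copied from the output above, and treating a missing START as "not in use" is an assumption.

```python
import csv
from datetime import date, datetime
from io import StringIO

# two rows copied from the output above
SAMPLE = """PARKID,NAME,AKA,CITY,STATE,START,END,LEAGUE,NOTES
ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
"""


def parse_park_date(text):
    """Parse MM/DD/YYYY text into a date, or return None for an empty field."""
    return datetime.strptime(text, '%m/%d/%Y').date() if text else None


def park_in_use(row, on_date):
    """Return True if on_date falls between the park's START and END (inclusive)."""
    start = parse_park_date(row['START'])
    end = parse_park_date(row['END'])
    if start is None or on_date < start:
        return False
    return end is None or on_date <= end


parks_by_id = {row['PARKID']: row for row in csv.DictReader(StringIO(SAMPLE))}
```

A check like this could be used to flag games whose PARK_ID and GAME_DT do not agree with the park's recorded dates.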
