# Retrosheet Baseball Data -- Parse Play by Play Data

**Baseball Notebooks**  
1. Download and unzip the baseball data.
2. Helper functions and the motivation for using them.
3. Wrangle and persist the Lahman data.
4. This notebook.

Parse the Retrosheet Play by Play data.

The parsers used are the open-source Chadwick parsers by Dr. T. L. Turocy.  
Parser Description: http://chadwick.sourceforge.net/doc/index.html  
Parser Source: https://sourceforge.net/projects/chadwick/

As of March 2019, the cwdaily parser, published in July 2018, is not described on the above web site.  It is similar to the other parsers described there.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

### Create Directories for Data Processing

* ~/data/retrosheet/raw -- event files downloaded and unzipped
* ~/data/retrosheet/parsed -- results of running 2 parsers on the event files
* ~/data/retrosheet/collected -- collect the parsed files into dataframes
* ~/data/retrosheet/wrangled -- wrangle the data for analysis
* ~/data/retrosheet/src -- optional directory to hold parser source code

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_wrangled = retrosheet.joinpath('wrangled')

p_parsed = retrosheet.joinpath('parsed')
p_collected = retrosheet.joinpath('collected')
p_src = retrosheet.joinpath('src')

# create directories (if they don't already exist) from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)
p_parsed.mkdir(parents=True, exist_ok=True)
p_collected.mkdir(parents=True, exist_ok=True)
p_src.mkdir(parents=True, exist_ok=True)

## Parse Event Data for Stats per Player per Game

The raw event data is in a format that is very difficult to work with.  The open-source Chadwick project provides 6 parsers for the Retrosheet event data.  Each parser is fed event data and produces CSV, XML, or text output.

The two parsers that are of interest for this study are:
1. cwdaily -- player per game stats
2. cwgame -- game stats

The cwbox parser produces a box score in the form MLB fans are accustomed to seeing (or it can produce XML with appropriate tags).  It appears to carry the same information as cwdaily.  However, cwdaily formats the data as one line per player per game, which is much easier to work with.

### Build Chadwick Parsers on Linux (or use prebuilt Windows binaries)
This section describes how to download the source, compile and install it.

The compile and install procedure here is the standard procedure for compiling and installing open-source code on Linux.

Go To:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

Result
1. The cw command line tools will be installed in /usr/local/bin.  
2. The cw library will be installed in /usr/local/lib.  
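
After installation, it can be useful to confirm that the tools are actually on the PATH before running anything. The following is a minimal sketch using only the standard library; the helper name `find_tools` is hypothetical, and the tool names are the six that appear in the `/usr/local/bin` listing in this notebook.

```python
import shutil

# The six Chadwick command line tools listed in /usr/local/bin in this notebook
CW_TOOLS = ["cwbox", "cwcomment", "cwdaily", "cwevent", "cwgame", "cwsub"]


def find_tools(names):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


found = find_tools(CW_TOOLS)
missing = [name for name, path in found.items() if path is None]
print("missing tools:", missing)
```

An empty list means every tool was found; any name in the list points to an install or PATH problem.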

To allow the command line tools to find the shared libraries, add the following to your .bashrc and then: source .bashrc  
```export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib```
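
As an alternative to editing .bashrc, the library path can be supplied only to the child process that runs the parser. This is a sketch, not part of the original workflow; `env_with_lib_path` is a hypothetical helper, and `/usr/local/lib` is the install location given above.

```python
import os


def env_with_lib_path(lib_dir="/usr/local/lib", base_env=None):
    """Return a copy of the environment with lib_dir appended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    # drop empty entries so an unset variable does not produce a leading ':'
    parts = [p for p in current.split(":") if p]
    if lib_dir not in parts:
        parts.append(lib_dir)
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


print(env_with_lib_path()["LD_LIBRARY_PATH"])
```

The returned dictionary can be passed as `env=` to `subprocess.run`, which leaves the environment of the notebook itself unchanged.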

Optionally copy cwdaily.c and cwgame.c to the src directory.  These C source code files will be parsed to get data dictionary information.  These files, and the parsing of them, are only useful for understanding the data and are not required for the later baseball analysis notebooks.

### Using Prebuilt Windows Binaries
Go To:  
https://sourceforge.net/projects/chadwick/  
Download the Windows binaries for version 0.7.1 or later.

**Linux Wine**  
Install wine: https://wiki.winehq.org/Ubuntu  
Before first use of wine: run winecfg in a terminal

**Windows**  
You could also run the Windows binaries on Windows or in a Windows VM.
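
Whether the native tools or the Windows binaries under Wine are used, the only difference on the command line is the `wine ` prefix. A small sketch of building the command string in one place follows; `build_cw_command` is a hypothetical helper, and the flags are the ones used by the parser cells in this notebook.

```python
def build_cw_command(tool, fields, year, use_wine=False):
    """Build the shell command string for one Chadwick parser run.

    tool     -- parser name, e.g. 'cwdaily' or 'cwgame'
    fields   -- field range passed to -f, e.g. '0-51'
    year     -- season to parse; also used in the event file glob
    use_wine -- prepend 'wine ' when running Windows binaries under Linux
    """
    cmd = f"{tool} -f {fields} -n -y {year} {year}*.EV*"
    return f"wine {cmd}" if use_wine else cmd


print(build_cw_command("cwdaily", "0-51", 1955))
print(build_cw_command("cwgame", "0-45", 1955, use_wine=True))
```

The resulting string is meant to be passed to bash, which expands the `*.EV*` glob in the directory holding the raw event files.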

### Run the cwdaily Parser

In [4]:
# normally os.listdir() is used to list a directory
# here, to demonstrate the subprocess module, subprocess is used instead
# bash is invoked explicitly, so shell=False is passed to subprocess.run
import subprocess

cmd = 'ls /usr/local/bin/cw*'
args = ['/bin/bash', '-c', cmd]
result = subprocess.run(args, shell=False, text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [5]:
# the optionally downloaded C source code for the two parsers
os.listdir(p_src)

['cwgame.c', 'cwdaily.c']

In [6]:
import os
# check the environment variable for LD_LIBRARY_PATH
os.environ['LD_LIBRARY_PATH']

'/usr/local/lib'

In [7]:
# if you are running windows binaries under Linux, 
# prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse event data into 52 fields of player stats per game.
    
    There are a total of 117 fields to choose from; the first 52 are selected.
    """
    cmd = f'cwdaily -f 0-51 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../parsed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [8]:
# change to raw file directory
os.chdir(p_raw)

In [9]:
# parse each year of event data
for year in range(1955, 2019):
    file = p_parsed.joinpath(f'daily{year}.csv')
    
    # if the output file is not already there
    if not file.is_file():
        process_cwdaily(year)
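
One caveat with the existence check above: the output file is created as soon as it is opened for writing, so a parser run that fails still leaves a file behind, and that year is skipped on the next pass. A minimal sketch of one way to guard against this is shown here, assuming the parser exits with a nonzero status on failure; `run_to_file` is a hypothetical helper, not part of the original notebook.

```python
import subprocess
from pathlib import Path


def run_to_file(args, out_path, cwd=None):
    """Run a command, keeping its stdout in out_path only if it exits with status 0."""
    out_path = Path(out_path)
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    with open(tmp_path, "w") as outfile:
        result = subprocess.run(args, stdout=outfile, cwd=cwd)
    if result.returncode == 0:
        tmp_path.replace(out_path)  # rename into place only on success
        return True
    tmp_path.unlink()  # discard partial output from a failed run
    return False
```

Passing `cwd=` also removes the need to call `os.chdir` before running the parser.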

In [11]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
import numpy as np

os.chdir(p_parsed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file))
player_game = pd.concat(dfs)

In [12]:
player_game = player_game.reset_index(drop=True)
player_game.columns = player_game.columns.str.lower()
player_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3549700 entries, 0 to 3549699
Data columns (total 52 columns):
game_id      object
game_dt      int64
game_ct      int64
appear_dt    int64
team_id      object
player_id    object
b_g          int64
b_pa         int64
b_ab         int64
b_r          int64
b_h          int64
b_2b         int64
b_3b         int64
b_hr         int64
b_rbi        int64
b_bb         int64
b_ibb        int64
b_so         int64
b_gdp        int64
b_hp         int64
b_sh         int64
b_sf         int64
b_sb         int64
b_cs         int64
b_xi         int64
p_g          int64
p_gs         int64
p_cg         int64
p_sho        int64
p_gf         int64
p_w          int64
p_l          int64
p_sv         int64
p_out        int64
p_tbf        int64
p_ab         int64
p_r          int64
p_er         int64
p_h          int64
p_2b         int64
p_3b         int64
p_hr         int64
p_bb         int64
p_ibb        int64
p_so         int64
p_gdp        int64
p_hp      

### Persist Dataframe

Parsing dates and other data wrangling is performed in the next notebook.  This notebook only parses the event data, collects it, and saves it to a compressed csv file.

Because the player_game dataframe is sparse (most values are zero), gzip reduces the file size by more than a factor of 10.
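
The effect of sparsity on gzip can be illustrated with synthetic data. The sketch below builds CSV-like text in which about 5% of the values are nonzero and compares the raw and compressed sizes; the data is made up for illustration and the exact ratio will differ from that of the real player_game file.

```python
import gzip
import random


def gzip_ratio(text):
    """Return the raw size divided by the gzip-compressed size for a string."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


# synthetic sparse table: 5000 rows of 40 values, roughly 5% of them nonzero
random.seed(0)
rows = []
for _ in range(5000):
    rows.append(",".join("1" if random.random() < 0.05 else "0" for _ in range(40)))
sparse_csv = "\n".join(rows)

print(f"compression ratio: {gzip_ratio(sparse_csv):.1f}")
```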

In [13]:
bb.mem_usage(player_game)

'1983.76 MB'

In [14]:
player_game = bb.optimize_df_dtypes(player_game)

In [15]:
bb.mem_usage(player_game)

'842.93 MB'
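
The memory reduction comes from storing small non-negative counts in narrower integer types. The helper `bb.optimize_df_dtypes` is defined in an earlier notebook and its implementation is not shown here; the following is only a sketch of the general idea using `pd.to_numeric` with `downcast='unsigned'`, with a hypothetical function name and made-up column names.

```python
import pandas as pd


def downcast_unsigned(df):
    """Return a copy with non-negative integer columns stored in the smallest unsigned dtype."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]) and (out[col] >= 0).all():
            out[col] = pd.to_numeric(out[col], downcast="unsigned")
    return out


# illustrative data only: one column of small counts, one of larger numbers
demo = pd.DataFrame({"small_count": [4, 3, 5], "big_number": [70000, 80000, 90000]})
print(downcast_unsigned(demo).dtypes)
```

Columns holding strings or negative values are left unchanged by this sketch.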

In [16]:
player_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3549700 entries, 0 to 3549699
Data columns (total 52 columns):
game_id      object
game_dt      uint32
game_ct      uint8
appear_dt    uint32
team_id      object
player_id    object
b_g          uint8
b_pa         uint8
b_ab         uint8
b_r          uint8
b_h          uint8
b_2b         uint8
b_3b         uint8
b_hr         uint8
b_rbi        uint8
b_bb         uint8
b_ibb        uint8
b_so         uint8
b_gdp        uint8
b_hp         uint8
b_sh         uint8
b_sf         uint8
b_sb         uint8
b_cs         uint8
b_xi         uint8
p_g          uint8
p_gs         uint8
p_cg         uint8
p_sho        uint8
p_gf         uint8
p_w          uint8
p_l          uint8
p_sv         uint8
p_out        uint8
p_tbf        uint8
p_ab         uint8
p_r          uint8
p_er         uint8
p_h          uint8
p_2b         uint8
p_3b         uint8
p_hr         uint8
p_bb         uint8
p_ibb        uint8
p_so         uint8
p_gdp        uint8
p_hp    

In [17]:
# change working dir
os.chdir(p_collected)

# persist as compressed csv file
%time bb.to_csv_with_types(player_game, 'player_game.csv.gz')

CPU times: user 2min 38s, sys: 39.6 ms, total: 2min 38s
Wall time: 2min 38s


## Parse Event Data for Stats per Game
Additional information about each game is available, such as the attendance, temperature at game start time, game start time, etc.

In [18]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into 46 fields of game data per year.
    
    For each game, there are 84 standard fields and 95 extended fields to choose from.  
    Only the first 46 standard fields (0-45) are chosen.
    """
    cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../parsed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [19]:
# change to raw file directory
os.chdir(p_raw)

In [20]:
# parse each year of event data
for year in range(1955, 2019):
    file = p_parsed.joinpath(f'game{year}.csv')
    
    # if the output file is not already there
    if not file.is_file():
        process_cwgame(year)

In [21]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(p_parsed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    # GAME_DT and START_GAME_TM are read as integers for now; they are parsed in the next notebook
    dfs.append(pd.read_csv(file))
game = pd.concat(dfs)

In [22]:
game = game.reset_index(drop=True)
game.columns = game.columns.str.lower()
game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129846 entries, 0 to 129845
Data columns (total 46 columns):
game_id                   129846 non-null object
game_dt                   129846 non-null int64
game_ct                   129846 non-null int64
game_dy                   129846 non-null object
start_game_tm             129846 non-null int64
dh_fl                     129846 non-null object
daynight_park_cd          129846 non-null object
away_team_id              129846 non-null object
home_team_id              129846 non-null object
park_id                   129846 non-null object
away_start_pit_id         129846 non-null object
home_start_pit_id         129846 non-null object
base4_ump_id              129846 non-null object
base1_ump_id              129846 non-null object
base2_ump_id              129846 non-null object
base3_ump_id              129846 non-null object
lf_ump_id                 206 non-null object
rf_ump_id                 11 non-null object
attend_park_ct     

In [23]:
game = bb.optimize_df_dtypes(game)

In [24]:
game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129846 entries, 0 to 129845
Data columns (total 46 columns):
game_id                   129846 non-null object
game_dt                   129846 non-null uint32
game_ct                   129846 non-null uint8
game_dy                   129846 non-null object
start_game_tm             129846 non-null uint16
dh_fl                     129846 non-null object
daynight_park_cd          129846 non-null object
away_team_id              129846 non-null object
home_team_id              129846 non-null object
park_id                   129846 non-null object
away_start_pit_id         129846 non-null object
home_start_pit_id         129846 non-null object
base4_ump_id              129846 non-null object
base1_ump_id              129846 non-null object
base2_ump_id              129846 non-null object
base3_ump_id              129846 non-null object
lf_ump_id                 206 non-null object
rf_ump_id                 11 non-null object
attend_park_ct   

In [25]:
os.chdir(p_collected)
%time bb.to_csv_with_types(game, 'game.csv.gz')

CPU times: user 4.5 s, sys: 7.99 ms, total: 4.51 s
Wall time: 4.51 s
