# Retrosheet Baseball Data -- Parse Play by Play Data

**Baseball Notebooks**  
1. Downloaded and unzipped baseball data.
2. Helper functions and their motivation for use.
3. Lahman data was wrangled and persisted.
4. This notebook.

Parse the Retrosheet Play by Play data.

The parsers used are the open-source Chadwick parsers by Dr. T. L. Turocy.  
Parser Description: http://chadwick.sourceforge.net/doc/index.html  
Parser Source: https://sourceforge.net/projects/chadwick/

As of March 2019, the cwdaily parser, published in July 2018, is not described on the above web site.  It is similar to the other parsers described there.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

### Create Directories for Data Processing

* ~/data/retrosheet/raw -- event files downloaded and unzipped
* ~/data/retrosheet/parsed -- results of running 2 parsers on the event files
* ~/data/retrosheet/collected -- collect the parsed files into dataframes
* ~/data/retrosheet/wrangled -- wrangle the data for analysis
* ~/data/retrosheet/src -- optional directory to hold parser source code

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_wrangled = retrosheet.joinpath('wrangled')

p_parsed = retrosheet.joinpath('parsed')
p_collected = retrosheet.joinpath('collected')
p_src = retrosheet.joinpath('src')

# create directories (if they don't already exist) from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)
p_parsed.mkdir(parents=True, exist_ok=True)
p_collected.mkdir(parents=True, exist_ok=True)
p_src.mkdir(parents=True, exist_ok=True)

## Parse Event Data for Stats per Player per Game

The raw event data is in a format that is very difficult to work with directly.  The open-source Chadwick project provides 6 command-line parsers for the Retrosheet event data.  Each parser reads event files and writes csv, XML, or text output.

The two parsers that are of interest for this study are:
1. cwdaily -- player per game stats
2. cwgame -- game stats

The cwbox parser produces a box score in the form MLB fans are accustomed to seeing (or it can produce XML with appropriate tags).  It appears to contain the same information as cwdaily.  However, cwdaily formats the data as one line per player per game, which is much easier to work with.

### Computer Resource Usage

This analysis was run on a 2015 workstation with:
* Ubuntu 18.04 LTS
* 4 core Xeon CPU with hyperthreading running at 3.7 GHz
* 96 GB of RAM
* 1 TB PCIe SSD

8 or 16 GB of RAM is probably sufficient, but if you run out of RAM, try the following (in order of preference):
1. use 1975 (or later) to present instead of 1955 to present
2. cwdaily: use '-f 0-51' instead of '-f 0-116' -- the rest of the fields are not as useful
3. cwgame: use '-f 0-45' instead of '-f 0-45,82,83 -x 0-59' -- the rest of the fields are not as useful

All the following notebooks will work as-is with fewer years, so the preferred way to reduce RAM usage is to process fewer years of Retrosheet data.

cwdaily creates about 25 times as many records as cwgame, so reducing the number of fields for cwdaily will make far more of a difference than doing so for cwgame.

### Build Chadwick Parsers on Linux (or use prebuilt Windows binaries)
This section describes how to download, compile, and install the source.

The compile and install procedure here is the standard procedure for compiling and installing open-source code on Linux.

Go To:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

Result
1. The cw command line tools will be installed in /usr/local/bin.  
2. The cw library will be installed in /usr/local/lib.  

To allow the command line tools to find the shared libraries, add the following line to your .bashrc and then run: source .bashrc  
```export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib```

Optionally copy cwdaily.c and cwgame.c to the src directory.  These C source code files will be parsed to get data dictionary information.  The files, and the parsing of them, are only useful for understanding the data and are not required for the later baseball analysis notebooks.

### Using Prebuilt Windows Binaries
Go To:  
https://sourceforge.net/projects/chadwick/  
Download the Windows binaries for version 0.7.1 or later.

**Linux Wine**  
Install wine: https://wiki.winehq.org/Ubuntu  
Before first use of wine: run winecfg in a terminal

**Windows**  
You could also run the Windows binaries on Windows or in a Windows VM.

### Run the cwdaily Parser

In [4]:
# normally os.listdir() is used to list a directory
# here, to demonstrate the subprocess module, subprocess will be used
# invoke bash directly with shell=False in subprocess
import subprocess

cmd = 'ls /usr/local/bin/cw*'
args = ['/bin/bash', '-c', cmd]
result = subprocess.run(args, shell=False, text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [5]:
# the optionally downloaded C source code for the two parsers
os.listdir(p_src)

['cwgame.c', 'cwdaily.c']

In [6]:
import os
# check the environment variable for LD_LIBRARY_PATH
os.environ['LD_LIBRARY_PATH']

'/usr/local/lib'
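
The tool listing and the environment check above can be combined into one small helper. This is only a sketch: `check_chadwick_setup` is a hypothetical name, not part of helper_functions, and it assumes the default install locations described earlier. It uses `.get` so that a missing LD_LIBRARY_PATH does not raise a KeyError.

```python
import os
import shutil


def check_chadwick_setup(tools=("cwdaily", "cwgame"),
                         lib_dir="/usr/local/lib", env=None):
    """Report where each tool is found and whether lib_dir is on LD_LIBRARY_PATH.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    # shutil.which returns the full path of an executable, or None if not found
    found = {tool: shutil.which(tool, path=env.get("PATH")) for tool in tools}
    # .get avoids a KeyError when LD_LIBRARY_PATH is not set at all
    lib_dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    return {"tools": found, "lib_dir_on_path": lib_dir in lib_dirs}


report = check_chadwick_setup()
print(report)
```

A tool mapped to None, or lib_dir_on_path being False, would point back to the build and .bashrc steps above.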

If you run out of RAM for any part of the analysis, try the first (commented-out) cmd below.  52 fields require much less RAM than 117, and the first 52 are the most useful.

In [7]:
# if you are running windows binaries under Linux, 
# prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse event data into player stats per game.
    
    There are a total of 117 fields to choose from.
    """
    # cmd = f'cwdaily -f 0-51 -n -y {year} {year}*.EV*'
    cmd = f'cwdaily -f 0-116 -n -y {year} {year}*.EV*'    
    args = ["/bin/bash", "-c", cmd]
    out = f'../parsed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)
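
As a possible variation (not the approach used in this notebook), the same command can be built as an argument list so that no shell is needed and a failed run raises an error instead of silently leaving an empty or partial csv file. The names `build_cwdaily_args` and `run_cwdaily` are hypothetical; the cwdaily flags are exactly the ones used above.

```python
import subprocess
from pathlib import Path


def build_cwdaily_args(year, raw_dir, fields="0-116"):
    """Build an argument list for cwdaily without going through a shell."""
    # expand the {year}*.EV* pattern in Python instead of relying on bash
    event_files = sorted(p.name for p in Path(raw_dir).glob(f"{year}*.EV*"))
    if not event_files:
        raise FileNotFoundError(f"no event files for {year} in {raw_dir}")
    return ["cwdaily", "-f", fields, "-n", "-y", str(year), *event_files]


def run_cwdaily(year, raw_dir, out_path):
    """Run cwdaily from raw_dir and write its csv output to out_path."""
    args = build_cwdaily_args(year, raw_dir)
    with open(out_path, "w") as outfile:
        # cwd=raw_dir mirrors the os.chdir(p_raw) used in this notebook;
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(args, cwd=raw_dir, stdout=outfile, check=True)
```

For example, `run_cwdaily(1955, p_raw, p_parsed / 'daily1955.csv')` would correspond to `process_cwdaily(1955)`. If a run fails, the partial output file should be deleted, because the loop below skips any year whose output file already exists.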

In [8]:
# change to raw file directory
os.chdir(p_raw)

If you run out of RAM at any point in the analysis, try years in the range of 1975 to present.

In [9]:
# parse each year of event data
for year in range(1955, 2019):
    file = p_parsed.joinpath(f'daily{year}.csv')
    
    # if the output file is not already there
    if not file.is_file():
        process_cwdaily(year)

In [10]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
import numpy as np

os.chdir(p_parsed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file))
player_game = pd.concat(dfs)

In [11]:
player_game = player_game.reset_index(drop=True)
player_game.columns = player_game.columns.str.lower()
player_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3549700 entries, 0 to 3549699
Columns: 117 entries, game_id to f_rf_tp
dtypes: int64(114), object(3)
memory usage: 3.1+ GB
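
The figure reported by info() can be roughly reproduced by hand, which also shows how much the 52-field option would save. The sketch below assumes every column costs 8 bytes per row before downcasting (int64 values, or 8-byte pointers for object columns) and that pandas divides by powers of 1024 when it prints "GB". `shallow_gib` is a hypothetical helper name.

```python
def shallow_gib(n_rows, n_cols, bytes_per_value=8):
    """Estimate dataframe size in GiB, ignoring the contents of strings."""
    return n_rows * n_cols * bytes_per_value / 2**30


n_rows = 3_549_700  # row count reported by player_game.info()

print(f"117 fields: {shallow_gib(n_rows, 117):.1f} GiB")  # 3.1 GiB
print(f" 52 fields: {shallow_gib(n_rows, 52):.1f} GiB")   # 1.4 GiB
```

The estimate ignores the memory held by the strings themselves, which is why info() prints a "+" after its total.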


### Persist Dataframe

Parsing dates and other data wrangling is performed in the next notebook.  This notebook only parses the event files, collects the results, downcasts data types where appropriate, and saves the data to a compressed csv file.

Due to the sparsity of the player_game dataframe, using gzip will reduce the file size by a factor of 10+.
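
To illustrate why sparse csv data compresses so well, the sketch below gzips synthetic text only. It does not use the real player_game data, and the 2% share of non-zero values is an assumption chosen for the example, so the ratio it prints is not a measurement of the real file.

```python
import gzip
import random

random.seed(0)  # fixed seed so the synthetic data is repeatable

# Synthetic stand-in for player_game: 117 integer fields per row,
# about 2% of them non-zero (an assumption made for this example).
rows = []
for _ in range(5_000):
    fields = [str(random.randint(1, 9)) if random.random() < 0.02 else "0"
              for _ in range(117)]
    rows.append(",".join(fields))
raw = ("\n".join(rows) + "\n").encode("ascii")

packed = gzip.compress(raw)
print(f"raw: {len(raw):,} bytes, gzip: {len(packed):,} bytes, "
      f"ratio: {len(raw) / len(packed):.1f}x")
```

Long runs of "0," are replaced by short back-references, so the sparser the data, the higher the ratio.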

In [12]:
bb.mem_usage(player_game)

'3744.10 MB'

In [13]:
player_game = bb.optimize_df_dtypes(player_game)

In [14]:
bb.mem_usage(player_game)

'1062.97 MB'

In [15]:
player_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3549700 entries, 0 to 3549699
Columns: 117 entries, game_id to f_rf_tp
dtypes: object(3), uint32(2), uint8(112)
memory usage: 487.5+ MB
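
The total reported by info() after downcasting follows directly from the dtype counts it lists. This is a minimal sketch that assumes an object column costs one 8-byte pointer per row and that pandas divides by powers of 1024 when it prints "MB". `shallow_mib` is a hypothetical helper name.

```python
# Bytes per value for the dtypes listed by player_game.info().
# object columns are counted as 8-byte pointers; string contents are excluded.
DTYPE_BYTES = {"uint8": 1, "uint32": 4, "object": 8}


def shallow_mib(n_rows, dtype_counts):
    """Estimate dataframe size in MiB from a {dtype: column count} mapping."""
    bytes_per_row = sum(DTYPE_BYTES[d] * n for d, n in dtype_counts.items())
    return n_rows * bytes_per_row / 2**20


after = shallow_mib(3_549_700, {"object": 3, "uint32": 2, "uint8": 112})
print(f"{after:.1f} MiB")  # 487.5 MiB
```

Each row drops from 117 x 8 = 936 bytes to 144 bytes, because 112 of the columns fit in a single byte.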


In [16]:
# change working dir
os.chdir(p_collected)

# persist as compressed csv file
%time bb.to_csv_with_types(player_game, 'player_game.csv.gz')

CPU times: user 5min 31s, sys: 905 ms, total: 5min 32s
Wall time: 5min 32s


## Parse Event Data for Stats per Game
Additional information about each game is available, such as the attendance, temperature at game start time, game start time, etc.

If you run out of RAM at any point in the analysis, try the first (commented-out) cmd below.  46 fields require less RAM than 108 fields.
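
The field counts quoted here come from the field specifications passed to the parser. The sketch below counts them, assuming each specification is a comma-separated list of single field numbers and inclusive ranges, which is consistent with '-f 0-116' giving 117 fields for cwdaily. `count_fields` is a hypothetical helper, not part of Chadwick.

```python
def count_fields(spec):
    """Count the fields selected by a spec such as '0-45,82,83'."""
    total = 0
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            total += last - first + 1  # ranges are inclusive
        else:
            total += 1
    return total


print(count_fields("0-45"))                               # 46
print(count_fields("0-45,82,83") + count_fields("0-59"))  # 108
```

The second line corresponds to '-f 0-45,82,83 -x 0-59': 48 standard fields plus 60 extended fields.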

In [17]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into stats per game.
    
    There are 84 standard fields and 95 extended fields to choose from.  
    """
    # cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    cmd = f'cwgame -f 0-45,82,83 -x 0-59 -n -y {year} {year}*.EV*'    
    args = ["/bin/bash", "-c", cmd]
    out = f'../parsed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [18]:
# change to raw file directory
os.chdir(p_raw)

If you run out of RAM at any point in the analysis, try years in the range of 1975 to present.

In [19]:
# parse each year of event data
for year in range(1955, 2019):
    file = p_parsed.joinpath(f'game{year}.csv')
    
    # if the output file is not already there
    if not file.is_file():
        process_cwgame(year)

In [20]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(p_parsed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    dfs.append(pd.read_csv(file))
game = pd.concat(dfs)

In [21]:
game.columns = game.columns.str.lower()

In [22]:
game = bb.optimize_df_dtypes(game)

In [23]:
game.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129846 entries, 0 to 2430
Columns: 108 entries, game_id to home_tp_ct
dtypes: float64(6), int64(3), object(29), uint16(2), uint32(1), uint8(67)
memory usage: 47.9+ MB
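
The index shown above runs from 0 to 2430 for 129846 entries, which is what pd.concat produces when each yearly file keeps its own row labels. Unlike player_game, game is not renumbered with reset_index(drop=True) in this notebook. Whether the repeated labels matter depends on how the later notebooks use the index. A minimal sketch with two tiny made-up frames shows the difference:

```python
import pandas as pd

# two tiny stand-ins for per-year files (hypothetical values)
df_a = pd.DataFrame({"game_id": ["A1", "A2"]})
df_b = pd.DataFrame({"game_id": ["B1", "B2"]})

# keeps each file's own row labels, so labels repeat
stacked = pd.concat([df_a, df_b])
# builds one continuous index instead
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```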


In [24]:
os.chdir(p_collected)
%time bb.to_csv_with_types(game, 'game.csv.gz')

CPU times: user 16.8 s, sys: 15 ms, total: 16.8 s
Wall time: 16.9 s
