# Retrosheet Baseball Data -- Part 1

Baseball Notebooks
1. **LahmanBaseball**: the Lahman data will be downloaded and parsed.
2. **RetroBaseball-1**: the Retrosheet Play-by-Play data will be downloaded, parsed and saved to compressed csv files.
3. **RetroBaseball-2**: the data will be prepared (wrangled) for analysis and saved to compressed csv files.
4. **RetroBaseball-3**: the data from the preceding notebook will be saved to Postgres.
5. **RetroAnalysis 1**: the baseball data will be analyzed.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

The two most popular open source Baseball Data Sources are:  
**Lahman**  
http://www.seanlahman.com/baseball-archive/statistics/  
This database is copyright 1996-2018 by Sean Lahman.  

**Retrosheet**  
https://www.retrosheet.org/game.htm  
https://www.retrosheet.org/game.htm#Notice  
This database is copyright 1996-2018 by Retrosheet.

Lahman has data about each player summarized by year.  Retrosheet has data at the play-by-play level (called "event data"). This notebook is for Retrosheet.  Another notebook is for Lahman.  Subsequent notebooks will use data from both sources.

The only open-source parsers available for Retrosheet are by Dr. T. L. Turocy:  
Description: http://chadwick.sourceforge.net/doc/index.html  
Source: https://sourceforge.net/projects/chadwick/

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## Download and Unzip Retrosheet Data

The raw event data will be downloaded. The url is of the form:
http://www.retrosheet.org/events/{year}eve.zip

There are many ways to download files in Python.  For a simple binary file download, wget may be the easiest.
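
As an alternative to the third-party wget package, the same download can be sketched with only the standard library. The helper name `download_if_missing` and its `fetch` parameter are hypothetical, introduced here for illustration; the actual work is done by `urllib.request.urlretrieve`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest_dir, fetch=urlretrieve):
    """Download url into dest_dir unless the file is already there.

    `fetch` is any callable taking (url, filename). It defaults to
    urllib.request.urlretrieve and can be swapped out for testing.
    """
    # use the last component of the url as the local file name
    dest = Path(dest_dir) / url.rsplit('/', 1)[-1]
    if not dest.exists():
        fetch(url, str(dest))
    return dest
```

A call would look like `download_if_missing(f'http://www.retrosheet.org/events/{year}eve.zip', raw)`, with `raw` being the directory for downloaded files.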

### Create Directories for Data Processing

* ~/data/retrosheet/raw -- event files downloaded from Retrosheet
* ~/data/retrosheet/parsed -- results of running 2 parsers on the event files
* ~/data/retrosheet/df_csv -- collect the parsed files into dataframes and save these to csv
* ~/data/retrosheet/wrangled -- prepare the data for analysis and save to csv

In [1]:
import os
import re
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
raw = retrosheet.joinpath('raw')
parsed = retrosheet.joinpath('parsed')
df_csv = retrosheet.joinpath('df_csv')
wrangled = retrosheet.joinpath('wrangled')

# create directories (if they don't already exist) from these path objects
raw.mkdir(parents=True, exist_ok=True)
parsed.mkdir(parents=True, exist_ok=True)
df_csv.mkdir(parents=True, exist_ok=True)
wrangled.mkdir(parents=True, exist_ok=True)

### Retrosheet Event Data
Data is available from 1921 to present.

Here, data from 1955 through 2018 will be downloaded and unzipped.  The start year of 1955 was chosen in part because there are fewer missing values for baseball attributes from 1955 on.

Using 1955 to present will result in (at least one temporary) 2+ GB pandas DataFrame, so choose more or fewer years as appropriate for your computer's resources.

In [3]:
# change to raw file directory
os.chdir(raw)

for year in range(1955,2019):   
    # download each event file, if it doesn't exist locally
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist locally
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")
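
The loop above uses the presence of `{year}BOS.EVA` as a proxy for "this archive has already been unzipped". A slightly more general sketch compares the archive's own member list against the files on disk and extracts only what is missing. The helper name `extract_missing` is hypothetical; `ZipFile.namelist` and `ZipFile.extractall` are standard-library calls.

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, dest_dir='.'):
    """Extract only those archive members that are not yet in dest_dir.

    Returns the list of member names that were extracted.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        missing = [name for name in zf.namelist()
                   if not (dest / name).exists()]
        # an empty list extracts nothing; None would extract everything
        zf.extractall(dest, members=missing)
    return missing
```

This checks only that each file exists, not that its contents are complete, so a partially written file from an interrupted run would still be skipped.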

### Unzipped Data File Types
The unzipped data consists of 3 types of files:
1. *.EVA and *.EVN -- these are American League and National League event files per team per year
2. *.ROS -- these are the rosters per team per year
3. TEAM* -- these are the MLB teams in existence per year
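
To confirm what was unzipped, the file names can be grouped by the three types listed above. This is a minimal sketch; the function names `file_kind` and `count_kinds` and the labels they return are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def file_kind(name):
    """Return a label for a Retrosheet file name, based on the list above."""
    p = Path(name)
    suffix = p.suffix.upper()
    if suffix in ('.EVA', '.EVN'):
        return 'event'
    if suffix == '.ROS':
        return 'roster'
    if p.name.upper().startswith('TEAM'):
        return 'team'
    # anything else, such as the downloaded zip files
    return 'other'


def count_kinds(names):
    """Count how many file names fall into each kind."""
    return Counter(file_kind(n) for n in names)
```

For example, `count_kinds(os.listdir(raw))` would summarize the contents of the raw directory.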

## 1. Parse Event Data for Player per Game Statistics

The event data is in a format that is very difficult to work with.  There is an open-source project which has parsers for the Retrosheet event data.  This project has 6 parsers.  Each of these parsers is fed event data and produces csv or XML or text output.

The two parsers that are of interest for this study are:
1. cwdaily
2. cwgame

The cwbox parser produces a box score in the form MLB fans are accustomed to seeing (or it can produce XML with appropriate tags).  This appears to contain the same information as cwdaily produces; however, cwdaily formats the data as one line per player per game, which is much easier to work with.

The Retrosheet data parser tools are described at:  
http://chadwick.sourceforge.net/doc/index.html  
  
They are distributed under the GPL:  
https://www.gnu.org/licenses/gpl.html  

Note: as of February 2019, the cwdaily parser, published in July 2018, is not described on the above webpage.

### Build Chadwick Parsers on Linux (or use prebuilt Windows binaries)
This section describes how to download the source, compile and install it.

The compile and install procedure here is the standard procedure for compiling and installing open-source code on Linux.

Go To:  
https://sourceforge.net/projects/chadwick/  
Download the source code for version 0.7.1 or later.

If you do not already have a build environment:
1. sudo apt install gcc
2. sudo apt install build-essential

cd to the source directory:
1. ./configure
2. make
3. make install  # or: sudo make install  

Result
1. The cw command line tools will be installed in /usr/local/bin.  
2. The cw library will be installed in /usr/local/lib.  

To allow the command line tools to find the shared libraries, add the following to your .bashrc and then: source .bashrc  
```export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib```
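
Before running the parsers, it can help to confirm that the two tools used in this notebook are on the PATH. The sketch below relies on `shutil.which` from the standard library; the names `CW_TOOLS`, `find_tools` and `missing_tools` are hypothetical.

```python
import shutil

# the two parsers used in this notebook
CW_TOOLS = ('cwdaily', 'cwgame')


def find_tools(names=CW_TOOLS):
    """Map each tool name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=CW_TOOLS):
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

This only checks that the executables can be located. It does not verify that the shared library is found at run time, which is what the LD_LIBRARY_PATH setting addresses.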

### Using Prebuilt Windows Binaries
Go To:  
https://sourceforge.net/projects/chadwick/  
Download the Windows binaries for version 0.7.1 or later.

**Linux Wine**  
Install wine: https://wiki.winehq.org/Ubuntu  
Before first use of wine: run winecfg in a terminal

**Windows**  
You could also run the windows binaries on Windows or a Windows VM.

### Run the cwdaily Parser

In [4]:
# normally os.listdir() is used to list a directory
# here, to demonstrate the subprocess module, subprocess is used instead
# bash is invoked explicitly, so subprocess itself runs with shell=False
import subprocess

cmd = 'ls /usr/local/bin/cw*'
args = ['/bin/bash', '-c', cmd]
result = subprocess.run(args, shell=False, text=True, capture_output=True)
result.stdout.splitlines()

['/usr/local/bin/cwbox',
 '/usr/local/bin/cwcomment',
 '/usr/local/bin/cwdaily',
 '/usr/local/bin/cwevent',
 '/usr/local/bin/cwgame',
 '/usr/local/bin/cwsub']

In [5]:
import os
# check the environment variable for LD_LIBRARY_PATH
os.environ['LD_LIBRARY_PATH']

'/usr/local/lib'

In [6]:
# if you are running windows binaries under Linux, 
# prepend 'wine ' to the cmd string below
def process_cwdaily(year):
    """Parse event data into 52 fields of player stats per game.
    
    There are a total of 117 fields to choose from; the first 52 are selected.
    """
    cmd = f'cwdaily -f 0-51 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../parsed/daily{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)
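
The function above lets bash expand the `{year}*.EV*` wildcard. As a hedged alternative, the wildcard can be expanded in Python and the command passed as a list, which avoids the shell and makes the optional `wine` prefix explicit. The names `build_cw_command` and `run_parser` are hypothetical; the `-f`, `-n` and `-y` flags are the ones already used above.

```python
import glob
import subprocess


def build_cw_command(tool, fields, year, event_files, use_wine=False):
    """Build the argument list for a Chadwick parser such as cwdaily."""
    cmd = [tool, '-f', fields, '-n', '-y', str(year), *event_files]
    # Windows binaries under Linux are run through wine
    return ['wine', *cmd] if use_wine else cmd


def run_parser(tool, fields, year, out_path, use_wine=False):
    """Run the parser on the current directory's event files for one year."""
    event_files = sorted(glob.glob(f'{year}*.EV*'))
    cmd = build_cw_command(tool, fields, year, event_files, use_wine)
    with open(out_path, 'w') as outfile:
        # check=True raises CalledProcessError if the parser fails
        subprocess.run(cmd, stdout=outfile, check=True)
```

One difference to be aware of: Python's `sorted` orders file names by code point, while bash orders wildcard matches according to the locale, so the row order of the output could differ between the two approaches.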

In [7]:
# change to raw file directory
os.chdir(raw)

In [8]:
# parse each year of event data
for year in range(1955, 2019):
    file = parsed.joinpath(f'daily{year}.csv')
    
    # if the output is not already there
    if not file.is_file():
        process_cwdaily(year)

In [9]:
# collect all the parsed files into a single pandas dataframe
import glob
import pandas as pd
import numpy as np

os.chdir(parsed)
dailyfiles = glob.glob('daily*.csv')
dailyfiles.sort()

dfs = []
for file in dailyfiles:
    dfs.append(pd.read_csv(file))
player_game = pd.concat(dfs)

In [11]:
player_game = player_game.reset_index(drop=True)
player_game.columns = player_game.columns.str.lower()
player_game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
0,BAL195504120,19550412,0,19550412,BOS,goodb101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0
1,BAL195504120,19550412,0,19550412,BOS,joose101,1,5,4,0,...,0,0,0,0,0,0,0,0,0,0
2,BAL195504120,19550412,0,19550412,BOS,throf101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0


### Persist Dataframe

Due to the sparsity of the player_game dataframe, using gzip will reduce the file size by a factor of 10+.
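
The effect of sparsity on compression can be illustrated with synthetic data. The sketch below builds a CSV that is almost entirely zeros and compresses it with the standard-library `gzip` module. The data and the helper name `sparse_csv_bytes` are made up for illustration and are not the player_game data, so the ratio it prints says nothing precise about the real file.

```python
import gzip


def sparse_csv_bytes(n_rows=2000, n_cols=40):
    """Build CSV bytes in which nearly every value is 0."""
    lines = []
    for i in range(n_rows):
        # a 1 appears only occasionally; everything else is 0
        row = ['1' if (i + j) % 97 == 0 else '0' for j in range(n_cols)]
        lines.append(','.join(row))
    return ('\n'.join(lines) + '\n').encode('ascii')


raw_bytes = sparse_csv_bytes()
packed = gzip.compress(raw_bytes)
ratio = len(raw_bytes) / len(packed)
print(f'{len(raw_bytes)} bytes -> {len(packed)} bytes (ratio {ratio:.1f})')
```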

In [12]:
# change working dir
os.chdir(df_csv)

# persist as compressed csv file
%time player_game.to_csv('player_game.csv.gz', compression='gzip', index=False)

CPU times: user 2min 57s, sys: 127 ms, total: 2min 58s
Wall time: 2min 58s


## 2. Parse Event Data for Game Statistics
Additional information about the game is available, such as the attendance, the temperature at game start time, etc.

In [13]:
# if you are running windows binaries under Linux, prepend 'wine ' to the cmd string below
def process_cwgame(year):
    """Parse yearly event data into 46 fields of game data per game.
    
    For each game, there are 84 standard fields and 95 extended fields to choose from.  
    Only the first 46 standard fields are chosen.
    """
    cmd = f'cwgame -f 0-45 -n -y {year} {year}*.EV*'
    args = ["/bin/bash", "-c", cmd]
    out = f'../parsed/game{year}.csv'
    with open(out, "w") as outfile:
        result = subprocess.run(args, stdout=outfile)

In [14]:
# change to raw file directory
os.chdir(raw)

In [15]:
# parse each year of event data
for year in range(1955, 2019):
    file = parsed.joinpath(f'game{year}.csv')
    
    # if the output is not already there
    if not file.is_file():
        process_cwgame(year)

In [16]:
# collect all the parsed files into a single pandas dataframe
import glob
os.chdir(parsed)
gamefiles = glob.glob('game*.csv')
gamefiles.sort()

dfs = []
for file in gamefiles:
    dfs.append(pd.read_csv(file))
game = pd.concat(dfs)

In [17]:
# reset_index returns a new dataframe, so assign the result
game = game.reset_index(drop=True)
game.columns = game.columns.str.lower()
game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,away_hits_ct,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id
0,BAL195504120,19550412,0,Tuesday,0,F,D,BOS,BAL,BAL11,...,13,5,0,2,8,9,sullf101,colej101,,
1,BAL195504180,19550418,0,Monday,0,F,N,NYA,BAL,BAL11,...,8,3,0,1,5,4,fordw101,moorr101,,
2,BAL195504220,19550422,0,Friday,0,F,N,WS1,BAL,BAL11,...,4,8,2,1,6,11,mcdem102,wilsj104,schmj101,


In [18]:
os.chdir(df_csv)
%time game.to_csv('game.csv.gz', compression='gzip', index=False)

CPU times: user 5.05 s, sys: 12 ms, total: 5.06 s
Wall time: 4.7 s
