# Retrosheet Baseball Data -- Wrangle Data

**Baseball Notebooks**  
1. Downloaded and unzipped baseball data.
2. Helper functions and their motivation for use.
3. Lahman data was wrangled and persisted.
4. Retrosheet Play by Play data was parsed, collected into 2 DataFrames, and persisted.
5. This notebook.

Wrangle the Retrosheet data in preparation for data analysis.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

## Data Wrangling

Retrosheet Data Wrangling will include:
1. Manipulating player per game data.
2. Manipulating game data.
3. Creating "lookup tables" by web scraping.
4. Creating data dictionaries (aka codebooks) by "scraping" Dr. Turocy's C source code.

Six DataFrames will be created.

Two DataFrames for Data Dictionary Information
1. **player_game_fields:** stats per player per game field descriptions
2. **game_fields:** stats per game field descriptions

Four DataFrames for Data
1. **player_game:** stats per player per game
2. **game:** stats per game
3. **players:** player info
4. **parks:** stadium info

The above dataframes will be persisted as CSV files with Column Types.

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

### Directories for Data Processing

* ~/data/retrosheet/raw -- event files downloaded from Retrosheet
* ~/data/retrosheet/parsed -- results of running 2 parsers on the event files
* ~/data/retrosheet/collected -- collect the parsed files into dataframes and save these to csv
* ~/data/retrosheet/wrangled -- prepare the data for analysis and save to csv
* ~/data/retrosheet/src -- optional directory to hold parser source code

In [1]:
import pandas as pd
import numpy as np
import os
import re
from pathlib import Path

from IPython.display import HTML, display

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [4]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_wrangled = retrosheet.joinpath('wrangled')

p_parsed = retrosheet.joinpath('parsed')
p_collected = retrosheet.joinpath('collected')
p_src = retrosheet.joinpath('src')

## Retrosheet Data Dictionary Overview
A "data dictionary" is also called a "codebook".

The following is a high-level overview of the meaning of the field names created by the Retrosheet parsers.

```
Suffix Meaning
CT     count (integer)
ID     identifier
FL     boolean flag
CD     code (enumerated data type)
DT     date
DY     day of week
TM     time

Prefix Meaning
B      batter
P      pitcher
```

In most cases, the abbreviation between the prefix and the suffix is a common baseball abbreviation.  For common baseball abbreviations see:  
http://www.espn.com/gen/editors/mlb/glossary.html

For example, from the above glossary, "SF" stands for sacrifice flies.  This statistic has been recorded since 1955.  The full field names created by the parsers are "B_SF" for the number of sacrifice flies hit by the batter, and "P_SF" for the number of sacrifice flies given up by the pitcher.
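
The naming convention above can be sketched as a small lookup. This is only an illustration: `PREFIXES`, `SUFFIXES`, and `split_field_name` are hypothetical names that are not part of the parsers or of the helper module, and the tables contain only the entries listed above.

```python
# Illustrative lookup tables copied from the prefix/suffix list above.
PREFIXES = {'b': 'batter', 'p': 'pitcher'}
SUFFIXES = {
    'ct': 'count (integer)',
    'id': 'identifier',
    'fl': 'boolean flag',
    'cd': 'code (enumerated data type)',
    'dt': 'date',
    'dy': 'day of week',
    'tm': 'time',
}


def split_field_name(name):
    """Split a field name into (prefix meaning, middle part, suffix meaning).

    Hypothetical helper: a part that is not in the tables comes back as None.
    """
    parts = name.lower().split('_')
    prefix = PREFIXES.get(parts[0]) if len(parts) > 1 else None
    suffix = SUFFIXES.get(parts[-1]) if len(parts) > 1 else None
    # drop the recognized prefix and suffix, keep whatever is left
    middle = parts[1:] if prefix else parts[:]
    if suffix:
        middle = middle[:-1]
    return prefix, '_'.join(middle), suffix


print(split_field_name('B_SF'))            # batter field, middle part 'sf'
print(split_field_name('attend_park_ct'))  # count field, middle part 'attend_park'
```

The middle part (for example `sf`) still has to be looked up in a baseball glossary; the sketch only separates the pieces.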

## Stats per Player per Game: Data Dictionary
This section is optional.  It is helpful for understanding the data.

As of March 2019, I could find no published information on cwdaily.

cwdaily can be run with the '-n' flag to have it output field names, but it is not clear what some of the field names mean.

Luckily, the source code itself has a text description of each field name.  These descriptions appear within a single, very long, C struct statement.

The C source code will be parsed to retrieve a field name to field description mapping.  It is not necessary to understand the RegEx code for parsing the C struct.

In [5]:
# cd to dir with cwdaily.c
src = retrosheet.joinpath('src')
os.chdir(src)

In [6]:
def parse_c_source(filename, struct='field_data'):
    """Extract field name to field description from parser's C source code"""
    dd = {}
    with open(filename, 'r') as cwdaily:
        # to account for patterns across lines, read entire source code
        source = cwdaily.read()
    
        # get the single (multiline) C statement that has field descriptions
        pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
        match = re.search(pattern, source, flags=re.DOTALL | re.MULTILINE)
    
        if match:
            pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
            for m in re.finditer(pattern, match.group(1), 
                                 flags=re.DOTALL | re.MULTILINE):
                if m:
                    if len(m.group(2).split(':')) == 2:
                        desc = m.group(2).split(':')[1].strip()
                    else:
                        desc = m.group(2).strip()
                    dd[m.group(1).lower()] = desc   
    return dd
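
For readers who do want to see what the two regular expressions in `parse_c_source` do, here is a minimal sketch that applies the same patterns to a tiny struct. The `sample` text is invented for illustration; the real layout of cwdaily.c may differ.

```python
import re

# Invented stand-in for the C struct; the real cwdaily.c layout may differ.
sample = '''
static field_struct field_data[] = {
  { cwdaily_game_id, "GAME_ID", "0: game id" },
  { cwdaily_b_sf, "B_SF", "sacrifice flies" },
};
'''

# Step 1: isolate the single (multiline) statement, up to the first ';'
stmt = re.search(r'(static\s+field_struct\s+field_data.*?;)',
                 sample, flags=re.DOTALL).group(1)

# Step 2: within each {...} entry, capture the first two quoted strings
pairs = re.findall(r'{.*?"(.*?)".*?"(.*?)".*?}', stmt, flags=re.DOTALL)

# Step 3: lower-case the name and drop an optional "N:" prefix from the text
fields = {}
for name, text in pairs:
    pieces = text.split(':')
    desc = pieces[1].strip() if len(pieces) == 2 else text.strip()
    fields[name.lower()] = desc

print(fields)
```

The non-greedy `.*?` together with `re.DOTALL` is what allows each pattern to span line breaks without running past the end of the current entry.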

In [7]:
player_game_fields_all = parse_c_source('cwdaily.c')        

In [8]:
# As of Python 3.7, dictionaries are guaranteed to maintain insertion order
# (CPython 3.6 already did so as an implementation detail)
# Only the first 52 fields were selected, so that's all that is needed here
player_game_fields = {key:value for num, 
        (key, value) in enumerate(player_game_fields_all.items()) if num < 52}

# below it is shown appearance date == game_dt, so delete it
del player_game_fields['appear_dt']

In [9]:
# here is the explanation of each field, as scraped from the C source code
player_game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'team_id': 'team id',
 'player_id': 'player id',
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference',
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 'p

### Data Dictionary Notes
In the above, team_id is the team_id of the player.

game_id is a 12-character string.  Using Python slice notation:  
```
0:3   Home TEAM_ID
3:11  YYYYMMDD
11    Game Count
```

Game Count is:
* 0 for single game
* 1 for 1st game of double header
* 2 for 2nd game of double header
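
As a minimal sketch, a game_id such as 'BAL195504120' (three-character team, eight-digit date, one-digit game count) can be taken apart as follows. `split_game_id` is a hypothetical helper, not part of the helper module.

```python
from datetime import date


def split_game_id(game_id):
    """Return (home_team_id, game_date, game_count) from a 12-character game_id."""
    home_team = game_id[:3]       # e.g. 'BAL'
    ymd = game_id[3:11]           # e.g. '19550412'
    game_date = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:8]))
    game_ct = int(game_id[11])    # 0, 1 or 2
    return home_team, game_date, game_ct


print(split_game_id('BAL195504120'))
print(split_game_id('NYA200806271'))
```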

### Persist Stats per Player per Game Data Dictionary Fields

In [10]:
os.chdir(p_wrangled)

# index=[0] is required for dictionary of scalar values
player_game_fields_df = pd.DataFrame(player_game_fields, index=[0])
player_game_fields_df.to_csv('player_game_fields.csv', index=False)

## Stats per Game: Data Dictionary
This section is optional.  It is helpful for understanding the data.

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

This data could be scraped from the web page, but since a parser that extracts this mapping from the C source code was written above, it is simpler to reuse it.

Note: the codes for some of the \_CD fields are only specified on the above web page, but the \_CD fields are not being used in this study.

In [11]:
os.chdir(src)
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')           

In [12]:
# there are 84 regular fields and 95 extended fields
len(game_reg_fields), len(game_ext_fields)

(84, 95)

### Data Dictionary Note
dh_fl: Designated Hitter Flag, 'T' if DH in use, else 'F'  
daynight_park_cd: 'N' for night, 'D' for day  
gwrbi_bat_id: Player ID for batter who got Game Winning RBI  

In [13]:
# As of Python 3.7, dictionaries are guaranteed to maintain insertion order
game_fields = {key:value for num, 
    (key, value) in enumerate(game_reg_fields.items()) if num < 46}

game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'game_dy': 'day of week',
 'start_game_tm': 'start time',
 'dh_fl': 'DH used flag',
 'daynight_park_cd': 'day/night flag',
 'away_team_id': 'visiting team',
 'home_team_id': 'home team',
 'park_id': 'game site',
 'away_start_pit_id': 'vis. starting pitcher',
 'home_start_pit_id': 'home starting pitcher',
 'base4_ump_id': 'home plate umpire',
 'base1_ump_id': 'first base umpire',
 'base2_ump_id': 'second base umpire',
 'base3_ump_id': 'third base umpire',
 'lf_ump_id': 'left field umpire',
 'rf_ump_id': 'right field umpire',
 'attend_park_ct': 'attendance',
 'scorer_record_id': 'PS scorer',
 'translator_record_id': 'translator',
 'inputter_record_id': 'inputter',
 'input_record_ts': 'input time',
 'edit_record_ts': 'edit time',
 'method_record_cd': 'how scored',
 'pitches_record_cd': 'pitches entered?',
 'temp_park_ct': 'temperature',
 'wind_direction_park_cd': 'wind direction',
 'wind_speed_pa

### Persist Stats per Game Data Dictionary Fields

In [14]:
os.chdir(p_wrangled)

# index=[0] is required for dictionary of scalar values
game_fields_df = pd.DataFrame(game_fields, index=[0])
game_fields_df.to_csv('game_fields.csv', index=False)

## 1. Wrangle Stats per Player Per Game Data

### Manual Data Verification

For odd data, such as the first game of a double header being played in one stadium and the second game in a different stadium, [Baseball-Reference](https://www.baseball-reference.com) is helpful.

Baseball-reference uses the data from Retrosheet, and presents it in an easy to read form for people. On rare occasions it may incorrectly present the event data, but it is a useful tool.

Baseball-reference does not offer already parsed data for data analysis.

The following method takes a game_id and converts it to a baseball-reference url for researching more about a particular game.

In [15]:
from IPython.display import HTML, display
def game_id_to_url(game_id):
    home = game_id[:3]
    url = 'https://www.baseball-reference.com/boxes/' + home + '/' + game_id + '.shtml'
    display(HTML(f'<a href="{url}">{game_id}</a>'))

In [16]:
# Click on the generated link to get a url for detailed game information.
game_id_to_url('NYA200806271')

As per the above link, the first game of the double header was in Yankee Stadium and the second game, on the same day, was in Shea Stadium.

In [17]:
# another oddity, the public was not allowed to attend
game_id_to_url('BAL201504290')

### Read in Stats per Player per Game Data 
This is the data created by running the cwdaily parser in the previous notebook.

Above, in the data dictionary section, it was seen that there are two date fields, game_dt and appear_dt.  These are strings of the form YYYYMMDD with no game time information.

In [18]:
# read in the parsed player_game data
os.chdir(p_collected)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [19]:
player_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3549700 entries, 0 to 3549699
Data columns (total 52 columns):
game_id      category
game_dt      object
game_ct      uint8
appear_dt    object
team_id      category
player_id    category
b_g          uint8
b_pa         uint8
b_ab         uint8
b_r          uint8
b_h          uint8
b_2b         uint8
b_3b         uint8
b_hr         uint8
b_rbi        uint8
b_bb         uint8
b_ibb        uint8
b_so         uint8
b_gdp        uint8
b_hp         uint8
b_sh         uint8
b_sf         uint8
b_sb         uint8
b_cs         uint8
b_xi         uint8
p_g          uint8
p_gs         uint8
p_cg         uint8
p_sho        uint8
p_gf         uint8
p_w          uint8
p_l          uint8
p_sv         uint8
p_out        uint8
p_tbf        uint8
p_ab         uint8
p_r          uint8
p_er         uint8
p_h          uint8
p_2b         uint8
p_3b         uint8
p_hr         uint8
p_bb         uint8
p_ibb        uint8
p_so         uint8
p_gdp        uint8
p_

In [20]:
player_game.shape

(3549700, 52)

In [21]:
player_game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
0,BAL195504120,19550412,0,19550412,BOS,goodb101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0
1,BAL195504120,19550412,0,19550412,BOS,joose101,1,5,4,0,...,0,0,0,0,0,0,0,0,0,0
2,BAL195504120,19550412,0,19550412,BOS,throf101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0


### Remove Duplicate Column

In [22]:
(player_game['game_dt'] == player_game['appear_dt']).all()

True

In [23]:
player_game = player_game.drop('appear_dt', axis=1)

In [24]:
# the primary key is (game_id, player_id), verify no dups
dups = player_game.duplicated(subset=['game_id', 'player_id'], keep=False)
player_game[dups]

Unnamed: 0,game_id,game_dt,game_ct,team_id,player_id,b_g,b_pa,b_ab,b_r,b_h,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
3418636,BOS201708250,20170825,0,BOS,younc004,1,3,3,0,1,...,0,0,0,0,0,0,0,0,0,0
3418638,BOS201708250,20170825,0,BOS,younc004,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# check this game manually by clicking on the generated link
dup_id = player_game.loc[dups, 'game_id'].values[0]
game_id_to_url(dup_id)

### Data Correction

Checking the box score via the above link shows 2 entries for Young for the same game, one as a pinch-hitter and one as the designated-hitter.  It would appear that both entries are correct and that the data should be summed.

In [26]:
# get the index labels of the duplicated rows
idx1, idx2 = player_game[dups].index.values
idx1, idx2

(3418636, 3418638)

In [27]:
# identifier columns
id_columns = player_game.columns[:5]
id_columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id'], dtype='object')

In [28]:
# stat columns
stat_columns = player_game.columns[5:]
stat_columns

Index(['b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi',
       'b_bb', 'b_ibb', 'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb',
       'b_cs', 'b_xi', 'p_g', 'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l',
       'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b',
       'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf',
       'p_xi', 'p_wp', 'p_bk'],
      dtype='object')

In [29]:
# id columns match (as per df.duplicated() above)
player_game.loc[[idx1,idx2], id_columns]

Unnamed: 0,game_id,game_dt,game_ct,team_id,player_id
3418636,BOS201708250,20170825,0,BOS,younc004
3418638,BOS201708250,20170825,0,BOS,younc004


In [30]:
# game data
player_game.loc[[idx1,idx2], stat_columns]

Unnamed: 0,b_g,b_pa,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_bb,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
3418636,1,3,3,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3418638,1,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
# add stats for the two rows
player_game.loc[idx1, stat_columns] += player_game.loc[idx2, stat_columns]

# remove duplicate row
player_game = player_game.drop(idx2)

In [32]:
# the primary key is (game_id, player_id), verify no dups
bb.is_unique(player_game, ['game_id', 'player_id'])

True
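
If more than one duplicated (game_id, player_id) pair ever appeared, the same correction could be applied to all of them at once with a group-by. The following is only a sketch on a toy DataFrame: the two `younc004` rows mirror the values shown above, while `xxxxx001` and the reduced column set are invented for illustration.

```python
import pandas as pd

# Toy data: two rows for younc004 (as above) plus one invented player.
toy = pd.DataFrame({
    'game_id':   ['BOS201708250', 'BOS201708250', 'BOS201708250'],
    'player_id': ['younc004', 'xxxxx001', 'younc004'],
    'team_id':   ['BOS', 'BOS', 'BOS'],
    'b_pa':      [3, 4, 1],
    'b_h':       [1, 2, 1],
})

key = ['game_id', 'player_id']
stat_cols = ['b_pa', 'b_h']

# sum the stat columns, keep the first value of the other identifier columns
agg = {col: 'sum' for col in stat_cols}
agg['team_id'] = 'first'

# observed=True avoids creating rows for unused combinations of categorical keys
deduped = toy.groupby(key, as_index=False, sort=False, observed=True).agg(agg)
print(deduped)
```

On the full data, where game_id and player_id are categorical, this would be slower than fixing the single known pair by hand, so it is mainly useful as a general-purpose fallback.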

### Add Lahman player_id
This will make joins to the Lahman data much easier.

In [33]:
# get the People data for Lahman player_id
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

os.chdir(p_lahman_wrangled)
people = bb.from_csv_with_types('people.csv')

In [34]:
# add player_id from Lahman people
player_game = pd.merge(player_game, people[['player_id','retro_id']], 
                        left_on = 'player_id', right_on = 'retro_id',
                        suffixes=['_retrosheet', '_lahman'])
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id_retrosheet',
       'b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi',
       'b_bb', 'b_ibb', 'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb',
       'b_cs', 'b_xi', 'p_g', 'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l',
       'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b',
       'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf',
       'p_xi', 'p_wp', 'p_bk', 'player_id_lahman', 'retro_id'],
      dtype='object')

In [35]:
col_names = {'player_id_retrosheet':'player_id'}
player_game = player_game.rename(columns=col_names)
player_game = player_game.drop('retro_id', axis= 1)
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id', 'b_g', 'b_pa',
       'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_bb', 'b_ibb',
       'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g',
       'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf',
       'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb',
       'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk',
       'player_id_lahman'],
      dtype='object')
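
Note that `pd.merge` performs an inner join by default, so any Retrosheet player_id without a matching retro_id in the Lahman people data is dropped from the result. A minimal sketch of how such rows could be detected is shown below. The rows are toy data, and `zzzzz999` is an invented id that deliberately has no match.

```python
import pandas as pd

# Toy stand-ins for player_game and people; 'zzzzz999' is invented.
pg = pd.DataFrame({'player_id': ['aardd001', 'aaroh101', 'zzzzz999'],
                   'b_h': [0, 2, 1]})
people_toy = pd.DataFrame({'player_id': ['aardsda01', 'aaronha01'],
                           'retro_id': ['aardd001', 'aaroh101']})

merged = pd.merge(pg, people_toy[['player_id', 'retro_id']],
                  how='left',                 # keep every Retrosheet row
                  left_on='player_id', right_on='retro_id',
                  suffixes=('_retrosheet', '_lahman'),
                  indicator=True,             # adds the '_merge' column
                  validate='many_to_one')     # each retro_id at most once

# rows that found no Lahman match
unmatched = merged.loc[merged['_merge'] == 'left_only', 'player_id_retrosheet']
print(unmatched.tolist())
```

Comparing the row count before and after the merge is a cheaper check for the same problem.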

In [36]:
os.chdir(p_wrangled)
%time bb.to_csv_with_types(player_game, 'player_game.csv.gz')

CPU times: user 2min 51s, sys: 105 ms, total: 2min 52s
Wall time: 2min 51s


## 2. Wrangle Stats per Game Data

In [37]:
os.chdir(p_collected)
game = bb.from_csv_with_types('game.csv.gz')

In [38]:
# attendance can never be less than 1
na_data = (game['attend_park_ct'].astype('int') < 1)
game.loc[na_data, 'attend_park_ct'].value_counts()

 0    5009
-1       1
Name: attend_park_ct, dtype: int64

In [41]:
# arguably, if the public is not permitted to attend, the attendance is null, not zero
game_id_no_public = 'BAL201504290'
bb.game_id_to_url(game_id_no_public)

In [42]:
game[game['game_id'] == game_id_no_public]['attend_park_ct']

120380    0
Name: attend_park_ct, dtype: int64

In [43]:
# replace 0, -1 values with nan
game['attend_park_ct'] = game['attend_park_ct'].replace([0, -1], np.nan)

In [44]:
# MLB games never have a temp, in Fahrenheit, less than 1F
na_data = (game['temp_park_ct'].astype('int') < 1)
game.loc[na_data, 'temp_park_ct'].value_counts()

 0    49108
-1       85
Name: temp_park_ct, dtype: int64

In [45]:
# replace these values with nan
game['temp_park_ct'] = game['temp_park_ct'].replace([0, -1], np.nan)
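
Replacing values with `np.nan` converts the column to float64. As an alternative sketch, pandas' nullable `Int64` dtype can keep the counts as integers while still marking missing values. `non_positive_to_missing` is a hypothetical helper, and whether the helper module's CSV functions handle `Int64` columns has not been checked here.

```python
import pandas as pd


def non_positive_to_missing(counts):
    """Hypothetical helper: keep values >= 1, turn everything else into <NA>."""
    counts = pd.Series(counts)
    # where() keeps values that satisfy the condition and blanks out the rest
    return counts.where(counts >= 1).astype('Int64')


attendance = pd.Series([41927, 0, -1, 36000])   # toy values
cleaned = non_positive_to_missing(attendance)
print(cleaned)
```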

In [46]:
game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,away_hits_ct,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id
0,BAL195504120,19550412,0,Tuesday,0,F,D,BOS,BAL,BAL11,...,13,5,0,2,8,9,sullf101,colej101,,
1,BAL195504180,19550418,0,Monday,0,F,N,NYA,BAL,BAL11,...,8,3,0,1,5,4,fordw101,moorr101,,
2,BAL195504220,19550422,0,Friday,0,F,N,WS1,BAL,BAL11,...,4,8,2,1,6,11,mcdem102,wilsj104,schmj101,


In [47]:
# the primary key is (game_id), verify no dups
game['game_id'].is_unique

True

In [48]:
# these columns will not be used in the analysis
drop_columns = ['edit_record_ts',
                'wind_direction_park_cd',
                'wind_speed_park_ct',
                'field_park_cd',
                'precip_park_cd',
                'sky_park_cd',                
                'base1_ump_id', 
                'base2_ump_id', 
                'base3_ump_id', 
                'base4_ump_id',
                'scorer_record_id', 
                'inputter_record_id', 
                'lf_ump_id', 
                'rf_ump_id',
                'translator_record_id', 
                'input_record_ts', 
                'method_record_cd',
                'pitches_record_cd']

In [49]:
game = game.drop(drop_columns, axis=1)

In [50]:
game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129846 entries, 0 to 129845
Data columns (total 28 columns):
game_id              129846 non-null object
game_dt              129846 non-null object
game_ct              129846 non-null uint8
game_dy              129846 non-null category
start_game_tm        129846 non-null object
dh_fl                129846 non-null category
daynight_park_cd     129846 non-null category
away_team_id         129846 non-null category
home_team_id         129846 non-null category
park_id              129846 non-null category
away_start_pit_id    129846 non-null category
home_start_pit_id    129846 non-null category
attend_park_ct       124836 non-null float64
temp_park_ct         80653 non-null float64
minutes_game_ct      129846 non-null uint16
inn_ct               129846 non-null uint8
away_score_ct        129846 non-null uint8
home_score_ct        129846 non-null uint8
away_hits_ct         129846 non-null uint8
home_hits_ct         129846 non-null uint

### Reverse Engineer am/pm for start_game_tm

1. am/pm is not specified.
2. The time is not in 24-hour format
3. The time is an integer, not a string.  For example, 1259 means 12:59.
4. A value of zero means the game start time is unknown.
5. The daynight_park_cd is never missing.  This specifies whether the game started in "day" or at "night".
6. MLB domain knowledge: Some games may start late, due to a rain delay for example.  But games never start after midnight.
7. MLB domain knowledge: Some games may start early, to allow for travel to the next city.  But games never start before 9 am.

Given the above, am/pm can be deduced as follows:
* start_game_tm == 0 => use midnight (to represent unknown time)
* start_game_tm >= 1200 => pm
* 0 < start_game_tm < 900 => pm
* 900 <= start_game_tm < 1200, and day/night = day, => am
* 900 <= start_game_tm < 1200, and day/night = night, => pm
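
The rules above can be sketched as a small standalone function. `to_24h` is a hypothetical name used only for this illustration; it takes the integer start time and the day/night code and returns the hour and minute on a 24-hour clock.

```python
from datetime import datetime


def to_24h(start_game_tm, day_night):
    """Return (hour, minute) on a 24-hour clock; unknown time (0) maps to midnight."""
    t = int(start_game_tm)
    if 0 < t < 900:
        # games never start before 9 am, so this must be pm
        t += 1200
    elif 900 <= t < 1200 and day_night == 'N':
        # ambiguous range, resolved by the day/night code
        t += 1200
    # values >= 1200 are already pm, and 0 stays 00:00
    return divmod(t, 100)


print(to_24h(706, 'N'))    # evening start
print(to_24h(1005, 'D'))   # morning start
print(datetime(2018, 9, 24, *to_24h(706, 'N')))
```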

In [51]:
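def parse_datetime(row):
    """Combine game_dt and start_game_tm into a single Timestamp."""
    date = pd.to_numeric(row['game_dt'])
    time = pd.to_numeric(row['start_game_tm'])
    day_night = row['daynight_park_cd']
    
    # times before 9:00 must be pm
    if time > 0 and time < 900:
        time += 1200
    # 9:00 to 11:59 is ambiguous, so rely on the day/night code
    elif (900 <= time < 1200) and day_night == 'N':
        time += 1200

    # an unknown time (0) is left as 00:00
    time_str = f'{time//100:02d}:{time%100:02d}'
    datetime_str = str(date) + ' ' + time_str
    return pd.to_datetime(datetime_str, format='%Y%m%d %H:%M')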

In [52]:
# create new datetime column
game['game_date'] = game.apply(parse_datetime, axis=1)

In [53]:
game.tail(3)

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id,game_date
129843,WAS201809240,20180924,0,Monday,706,F,N,MIA,WAS,WAS11,...,9,0,2,10,8,millj006,alcas001,,,2018-09-24 19:06:00
129844,WAS201809250,20180925,0,Tuesday,705,F,N,MIA,WAS,WAS11,...,11,1,2,6,9,schem001,brigj002,,,2018-09-25 19:05:00
129845,WAS201809260,20180926,0,Wednesday,405,F,D,MIA,WAS,WAS11,...,12,1,0,6,8,suerw002,chenw001,,,2018-09-26 16:05:00


In [54]:
os.chdir(p_wrangled)
%time bb.to_csv_with_types(game, 'game.csv.gz')

CPU times: user 3.41 s, sys: 0 ns, total: 3.41 s
Wall time: 3.42 s


## 3. Scrape Data for Players Lookup Table

A mapping from player_id to player information is needed.  

Lahman's People.csv may have everything that is needed in this regard; however, the Retrosheet version is retrieved as well, just to be sure.

There is no separate file for this.  It will be scraped from a web page.

In [55]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [56]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

In [57]:
# read from this string instead of file
players = pd.read_csv(StringIO(table_txt), 
    parse_dates=['Play debut', 'Mgr debut', 'Ump debut'])

In [58]:
# Coach debut has some bad values
def parse_date(value):
    # perhaps 43188 means 04/31/1988, but use null as unsure
    # no coach debuted prior to the year 1800
    if pd.isna(value) or value == '43188' or int(value[-4:]) < 1800:
        return pd.NaT
    else:
        # pd.datetime was removed in pandas 2.0, so use pd.to_datetime
        return pd.to_datetime(value, format='%m/%d/%Y')

players['Coach debut'] = players['Coach debut'].apply(parse_date)

In [59]:
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


In [60]:
name_chg = {'ID':'player_id',
         'Last':'last',
         'First':'first',
         'Play debut':'player_debut',
         'Mgr debut':'mgr_debut',
         'Coach debut': 'coach_debut',
         'Ump debut':'ump_debut'}
players = players.rename(columns=name_chg)
players.head()

Unnamed: 0,player_id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


### Persist Players

In [61]:
os.chdir(p_wrangled)
bb.to_csv_with_types(players, 'players')

## 4. Scrape Data for Stadium (Park) Lookup Table

A mapping from park_id to park information is needed.  

Lahman's Parks.csv may have everything that is needed in this regard; however, the Retrosheet version is retrieved as well, just to be sure.

There is no separate file for this.  It will be scraped from a web page.

In [62]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt), parse_dates=['START', 'END'])

In [63]:
parks.columns = parks.columns.str.lower()
parks = parks.rename(columns={'parkid':'park_id'})
parks.head()

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,1880-09-11,1882-05-30,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,1884-04-30,1884-05-31,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,1966-04-19,NaT,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,1972-04-21,1993-10-03,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,1994-04-11,NaT,AL,


### Persist Stadiums

In [64]:
os.chdir(p_wrangled)
bb.to_csv_with_types(parks, 'parks')