# Retrosheet Baseball Data -- Wrangle Data

**Baseball Notebooks**  
1. Downloaded and unzipped baseball data.
2. Helper functions and their motivation for use.
3. Lahman data was wrangled and persisted.
4. Retrosheet Play by Play data was parsed, collected into 2 DataFrames, and persisted.
5. This notebook.

Wrangle the Retrosheet data in preparation for data analysis.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

## Data Wrangling

Retrosheet Data Wrangling will include:
1. Manipulating player per game data.
2. Manipulating game data.
3. Creating "lookup tables" by web scraping.
4. Creating data dictionaries (aka codebooks) by "scraping" Dr. Turocy's C source code.

Six DataFrames will be created.

Two DataFrames for Data Dictionary Information
1. **player_game_fields:** stats per player per game field descriptions
2. **game_fields:** stats per game field descriptions

Four DataFrames for Data
1. **player_game:** stats per player per game
2. **game:** stats per game
3. **players:** player info
4. **parks:** stadium info

The above dataframes will be persisted as CSV files with Column Types.

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

### Directories for Data Processing

* ~/data/retrosheet/raw -- event files downloaded from Retrosheet
* ~/data/retrosheet/parsed -- results of running 2 parsers on the event files
* ~/data/retrosheet/collected -- collect the parsed files into dataframes and save these to csv
* ~/data/retrosheet/wrangled -- prepare the data for analysis and save to csv
* ~/data/retrosheet/src -- optional directory to hold parser source code

In [1]:
import pandas as pd
import numpy as np
import os
import re
from pathlib import Path

from IPython.display import HTML, display

In [2]:
# see Baseball Notebook #2
import helper_functions as bb

In [3]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_wrangled = retrosheet.joinpath('wrangled')

p_parsed = retrosheet.joinpath('parsed')
p_collected = retrosheet.joinpath('collected')
p_src = retrosheet.joinpath('src')

## Retrosheet Data Dictionary Overview
A "data dictionary" is also called a "codebook".

The following is a high-level overview of the meaning of the field names created by the Retrosheet parsers.

```
Suffix Meaning
CT     count (integer)
ID     identifier
FL     boolean flag
CD     code (enumerated data type)
DT     date
DY     day of week
TM     time

Prefix Meaning
B      batter
P      pitcher
```

In most cases, the abbreviation between the prefix and the suffix is a common baseball abbreviation.  For common baseball abbreviations see:  
http://www.espn.com/gen/editors/mlb/glossary.html

For example, from the above glossary, "SF" stands for sacrifice flies.  This statistic has been recorded since 1955.  The full field names created by the parsers are "B_SF" for the number of sacrifice flies hit by the batter, and "P_SF" for the number of sacrifice flies given up by the pitcher.
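
As a rough illustration of the naming convention, the sketch below splits a field name into its prefix, its middle abbreviation, and its suffix.  The `describe_field` helper and its two lookup tables are hypothetical.  They are built only from the table above and are not part of the Retrosheet parsers.

```python
# Hypothetical lookup tables, copied from the suffix/prefix table above.
SUFFIXES = {'ct': 'count (integer)', 'id': 'identifier', 'fl': 'boolean flag',
            'cd': 'code (enumerated data type)', 'dt': 'date',
            'dy': 'day of week', 'tm': 'time'}
PREFIXES = {'b': 'batter', 'p': 'pitcher'}


def describe_field(name):
    """Split a field name into (prefix meaning, middle part, suffix meaning)."""
    parts = name.lower().split('_')
    # only look for a prefix/suffix when there is more than one part
    prefix = PREFIXES.get(parts[0]) if len(parts) > 1 else None
    suffix = SUFFIXES.get(parts[-1]) if len(parts) > 1 else None
    start = 1 if prefix else 0
    end = len(parts) - 1 if suffix else len(parts)
    return prefix, '_'.join(parts[start:end]), suffix


print(describe_field('B_SF'))     # ('batter', 'sf', None)
print(describe_field('game_dt'))  # (None, 'game', 'date')
```

Field names with a prefix or suffix that is not in the table, such as the fielding columns, come back with `None` in that position.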

## Stats per Player per Game: Data Dictionary
This section is optional.  It is helpful for understanding the data.

cwdaily can be run with the '-n' flag to have it output field names, and separately it can be run with the '-d' flag to have it output field descriptions.  The field names and field descriptions could then be put together to form a data dictionary.

Partly as an exercise in RegEx, a different approach was used here.  The C source code was parsed to produce the data dictionary.  It is not necessary to understand the RegEx code for parsing the C struct that has both the field names, and the field descriptions.

In [4]:
# cd to dir with cwdaily.c
src = retrosheet.joinpath('src')
os.chdir(src)

In [5]:
def parse_c_source(filename, struct='field_data'):
    """Extract field name to field description from parser's C source code"""
    dd = {}
    with open(filename, 'r') as cwdaily:
        # to account for patterns across lines, read entire source code
        source = cwdaily.read()
    
        # get the single (multiline) C statement that has field descriptions
        pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
        match = re.search(pattern, source, flags=re.DOTALL | re.MULTILINE)
    
        if match:
            pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
            for m in re.finditer(pattern, match.group(1), 
                                 flags=re.DOTALL | re.MULTILINE):
                if m:
                    if len(m.group(2).split(':')) == 2:
                        desc = m.group(2).split(':')[1].strip()
                    else:
                        desc = m.group(2).strip()
                    dd[m.group(1).lower()] = desc   
    return dd
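
To see what the two regular expressions do, here is a small sketch that applies the same two steps to a made-up C snippet.  The struct contents and the `cwdaily_print_*` names are invented for illustration.  The real layout of cwdaily.c may differ.

```python
import re

# Hypothetical stand-in for the struct in cwdaily.c (invented for illustration).
sample = '''
static field_struct field_data[] = {
  /*  0 */ { cwdaily_print_game_id, "GAME_ID", "game id" },
  /*  1 */ { cwdaily_print_b_sf, "B_SF", "batting: sacrifice flies" },
};
'''

# step 1: isolate the one multiline statement, from 'static' to the first ';'
stmt = re.search(r'static\s+field_struct\s+field_data.*?;', sample,
                 flags=re.DOTALL)

# step 2: each {...} entry holds two quoted strings: field name, description
entries = re.findall(r'{.*?"(.*?)".*?"(.*?)".*?}', stmt.group(0),
                     flags=re.DOTALL)

dd = {}
for name, desc in entries:
    # keep only the text after a single 'category:' prefix, if present
    parts = desc.split(':')
    dd[name.lower()] = parts[1].strip() if len(parts) == 2 else desc.strip()

print(dd)  # {'game_id': 'game id', 'b_sf': 'sacrifice flies'}
```

The non-greedy `.*?` is what keeps each match confined to a single entry, and `re.DOTALL` lets `.` match the newlines between entries.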

In [6]:
player_game_fields = parse_c_source('cwdaily.c')        

In [7]:
# here is the explanation of each field, as scraped from the C source code
player_game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'appear_dt': 'apperance date',
 'team_id': 'team id',
 'player_id': 'player id',
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference',
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',

### Data Dictionary Notes
In the above, team_id identifies the team the player appeared for in that game.

game_id is:  
```
0:3   Home TEAM_ID
3:11  YYYYMMDD
11    Game Count
```

Game Count is:
* 0 for single game
* 1 for 1st game of double header
* 2 for 2nd game of double header
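
A minimal sketch of splitting a game_id with Python slices, assuming the 12-character layout seen in ids such as 'NYA200806271' (3-character home team, 8-digit date, 1-digit game count).  The `split_game_id` helper is hypothetical and is not used elsewhere in this notebook.

```python
def split_game_id(game_id):
    """Split a 12-character game_id into (home team_id, YYYYMMDD, game count)."""
    return game_id[:3], game_id[3:11], int(game_id[11])


print(split_game_id('NYA200806271'))  # ('NYA', '20080627', 1)
```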

### Persist Stats per Player per Game Data Dictionary Fields

In [8]:
os.chdir(p_wrangled)

# index=[0] is required for dictionary of scalar values
player_game_fields_df = pd.DataFrame(player_game_fields, index=[0])
player_game_fields_df.to_csv('player_game_fields.csv', index=False)

## Stats per Game: Data Dictionary
This section is optional.  It is helpful for understanding the data.

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

Alternatively, as with cwdaily, cwgame can be run with the '-n' flag to have it output field names, and separately it can be run with the '-d' flag to have it output field descriptions. The field names and field descriptions could then be put together to form a data dictionary.

Alternatively, as RegEx was used to parse the cwdaily parser, the same RegEx can be used to parse the cwgame parser.  That is the approach taken here.

Note: the codes for some of the \_CD fields are only specified on the above web page, but the \_CD fields are not being used in this study.

In [9]:
os.chdir(src)
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')
game_fields = {**game_reg_fields, **game_ext_fields}

In [10]:
# there are 84 regular fields and 95 extended fields
len(game_reg_fields), len(game_ext_fields), len(game_fields)

(84, 95, 179)

In [11]:
game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'game_dy': 'day of week',
 'start_game_tm': 'start time',
 'dh_fl': 'DH used flag',
 'daynight_park_cd': 'day/night flag',
 'away_team_id': 'visiting team',
 'home_team_id': 'home team',
 'park_id': 'game site',
 'away_start_pit_id': 'vis. starting pitcher',
 'home_start_pit_id': 'home starting pitcher',
 'base4_ump_id': 'home plate umpire',
 'base1_ump_id': 'first base umpire',
 'base2_ump_id': 'second base umpire',
 'base3_ump_id': 'third base umpire',
 'lf_ump_id': 'left field umpire',
 'rf_ump_id': 'right field umpire',
 'attend_park_ct': 'attendance',
 'scorer_record_id': 'PS scorer',
 'translator_record_id': 'translator',
 'inputter_record_id': 'inputter',
 'input_record_ts': 'input time',
 'edit_record_ts': 'edit time',
 'method_record_cd': 'how scored',
 'pitches_record_cd': 'pitches entered?',
 'temp_park_ct': 'temperature',
 'wind_direction_park_cd': 'wind direction',
 'wind_speed_pa

### Data Dictionary Note
dh_fl: Designated Hitter Flag, 'T' if DH in use, else 'F'  
daynight_park_cd: 'N' for night, 'D' for day  
gwrbi_bat_id: Player ID for batter who got Game Winning RBI  

### Persist Stats per Game Data Dictionary Fields

In [12]:
os.chdir(p_wrangled)

# index=[0] is required for dictionary of scalar values
game_fields_df = pd.DataFrame(game_fields, index=[0])
game_fields_df.to_csv('game_fields.csv', index=False)

## 1. Wrangle Stats per Player Per Game Data

### Manual Data Verification

For odd data, such as whether the first game of a double header was in one stadium and the second game was in a different stadium, [Baseball-Reference](https://www.baseball-reference.com) is helpful.

Baseball-reference uses the data from Retrosheet, and presents it in an easy to read form for people. On rare occasions it may incorrectly present the event data, but it is a useful tool.

Baseball-reference does not offer already parsed data for data analysis.

The following method takes a game_id and converts it to a baseball-reference url for researching more about a particular game.

In [13]:
from IPython.display import HTML, display
def game_id_to_url(game_id):
    home = game_id[:3]
    url = 'https://www.baseball-reference.com/boxes/' + home + '/' + game_id + '.shtml'
    display(HTML(f'<a href="{url}">{game_id}</a>'))

In [14]:
# Click on the generated link to get a url for detailed game information.
game_id_to_url('NYA200806271')

As per the above link, the first game of the double header was in Yankee Stadium and the second game, on the same day, was in Shea Stadium.

In [15]:
# another oddity, the public was not allowed to attend
game_id_to_url('BAL201504290')

### Read in Stats per Player per Game Data 
This is the data created by running the cwdaily parser in the previous notebook.

Above, in the data dictionary section, it was seen that there are two date fields, game_dt and appear_dt.  These are integers of the form YYYYMMDD with no game time information.

In [16]:
# read in the parsed player_game data
os.chdir(p_collected)
player_game = bb.from_csv_with_types('player_game.csv.gz')

In [17]:
player_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3549700 entries, 0 to 3549699
Columns: 117 entries, game_id to f_rf_tp
dtypes: object(3), uint32(2), uint8(112)
memory usage: 487.5+ MB


In [18]:
player_game.shape

(3549700, 117)

In [19]:
player_game.tail(3)

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,f_cf_e,f_cf_dp,f_cf_tp,f_rf_g,f_rf_out,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp
3549697,WAS201809260,20180926,0,20180926,WAS,rodrj005,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3549698,WAS201809260,20180926,0,20180926,WAS,eatoa002,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3549699,WAS201809260,20180926,0,20180926,WAS,glovk001,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Remove Duplicate Column

In [20]:
(player_game['game_dt'] == player_game['appear_dt']).all()

True

In [21]:
player_game = player_game.drop('appear_dt', axis=1)

In [22]:
# the primary key is (game_id, player_id), verify no dups
dups = player_game.duplicated(subset=['game_id', 'player_id'], keep=False)
player_game[dups]

Unnamed: 0,game_id,game_dt,game_ct,team_id,player_id,b_g,b_pa,b_ab,b_r,b_h,...,f_cf_e,f_cf_dp,f_cf_tp,f_rf_g,f_rf_out,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp
3418636,BOS201708250,20170825,0,BOS,younc004,1,3,3,0,1,...,0,0,0,0,0,0,0,0,0,0
3418638,BOS201708250,20170825,0,BOS,younc004,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# check this game manually by clicking on the generated link
dup_id = player_game.loc[dups, 'game_id'].values[0]
game_id_to_url(dup_id)

### Data Correction

The box score at the above link shows 2 entries for Young in the same game, one as a pinch-hitter and one as the designated hitter.  Both entries appear to be correct, so the stats in the two rows should be summed.

In [24]:
# get the index labels of the duplicated rows
idx1, idx2 = player_game[dups].index.values
idx1, idx2

(3418636, 3418638)

In [25]:
# identifier columns
id_columns = player_game.columns[:5]
id_columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id'], dtype='object')

In [26]:
# stat columns
stat_columns = player_game.columns[5:]
stat_columns

Index(['b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi',
       'b_bb',
       ...
       'f_cf_e', 'f_cf_dp', 'f_cf_tp', 'f_rf_g', 'f_rf_out', 'f_rf_po',
       'f_rf_a', 'f_rf_e', 'f_rf_dp', 'f_rf_tp'],
      dtype='object', length=111)

In [27]:
# id columns match (as per df.duplicated() above)
player_game.loc[[idx1,idx2], id_columns]

Unnamed: 0,game_id,game_dt,game_ct,team_id,player_id
3418636,BOS201708250,20170825,0,BOS,younc004
3418638,BOS201708250,20170825,0,BOS,younc004


In [28]:
# game data
player_game.loc[[idx1,idx2], stat_columns]

Unnamed: 0,b_g,b_pa,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,b_bb,...,f_cf_e,f_cf_dp,f_cf_tp,f_rf_g,f_rf_out,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp
3418636,1,3,3,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3418638,1,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# add stats for the two rows
player_game.loc[idx1, stat_columns] += player_game.loc[idx2, stat_columns]

# remove duplicate row
player_game = player_game.drop(idx2)

In [30]:
# the primary key is (game_id, player_id), verify no dups
bb.is_unique(player_game, ['game_id', 'player_id'])

True
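
The correction above handles a single duplicated pair by hand.  If more duplicates showed up, a more general approach would be to group on the key and sum the stat columns.  The sketch below uses a toy DataFrame with only two stat columns, and 'playr001' is a made-up player id.  With the real data, every identifier column would need to be part of the grouping key so that it is not summed.

```python
import pandas as pd

# toy stand-in for player_game with one duplicated (game_id, player_id) key
df = pd.DataFrame({
    'game_id':   ['BOS201708250', 'BOS201708250', 'BOS201708250'],
    'player_id': ['younc004', 'playr001', 'younc004'],  # 'playr001' is made up
    'b_ab':      [3, 4, 1],
    'b_h':       [1, 2, 1],
})

key = ['game_id', 'player_id']
# sum the stat columns within each key; sort=False keeps first-seen key order
deduped = df.groupby(key, as_index=False, sort=False).sum()
print(deduped)
```

The two rows for younc004 collapse into one row with 4 at bats and 2 hits, and the other row is left unchanged.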

### Fielding Data

The fielding stats are broken out by position, which is more detail than this study needs.  All that is necessary is the total number of putouts, assists, and errors per game, per player.

In [31]:
import re
cols = player_game.columns.copy()

# get all the field columns
f_cols = [col for col in cols if col.startswith('f_')]
f_cols

['f_p_g',
 'f_p_out',
 'f_p_po',
 'f_p_a',
 'f_p_e',
 'f_p_dp',
 'f_p_tp',
 'f_c_g',
 'f_c_out',
 'f_c_po',
 'f_c_a',
 'f_c_e',
 'f_c_dp',
 'f_c_tp',
 'f_c_pb',
 'f_c_xi',
 'f_1b_g',
 'f_1b_out',
 'f_1b_po',
 'f_1b_a',
 'f_1b_e',
 'f_1b_dp',
 'f_1b_tp',
 'f_2b_g',
 'f_2b_out',
 'f_2b_po',
 'f_2b_a',
 'f_2b_e',
 'f_2b_dp',
 'f_2b_tp',
 'f_3b_g',
 'f_3b_out',
 'f_3b_po',
 'f_3b_a',
 'f_3b_e',
 'f_3b_dp',
 'f_3b_tp',
 'f_ss_g',
 'f_ss_out',
 'f_ss_po',
 'f_ss_a',
 'f_ss_e',
 'f_ss_dp',
 'f_ss_tp',
 'f_lf_g',
 'f_lf_out',
 'f_lf_po',
 'f_lf_a',
 'f_lf_e',
 'f_lf_dp',
 'f_lf_tp',
 'f_cf_g',
 'f_cf_out',
 'f_cf_po',
 'f_cf_a',
 'f_cf_e',
 'f_cf_dp',
 'f_cf_tp',
 'f_rf_g',
 'f_rf_out',
 'f_rf_po',
 'f_rf_a',
 'f_rf_e',
 'f_rf_dp',
 'f_rf_tp']

In [32]:
cols = player_game.columns.copy()

# get all the put out columns
po_cols = [col for col in cols if re.match(r'^f_.*po$', col)]
po_cols

['f_p_po',
 'f_c_po',
 'f_1b_po',
 'f_2b_po',
 'f_3b_po',
 'f_ss_po',
 'f_lf_po',
 'f_cf_po',
 'f_rf_po']

In [33]:
# sum the 9 put out columns, column by column (much faster than using apply on each row)
cols = po_cols.copy()

po = player_game[cols.pop()].copy()
while len(cols):
    po += player_game[cols.pop()]

In [34]:
cols = player_game.columns.copy()

# get all the assist columns
a_cols = [col for col in cols if re.match(r'^f_.*a$', col)]
a_cols

['f_p_a',
 'f_c_a',
 'f_1b_a',
 'f_2b_a',
 'f_3b_a',
 'f_ss_a',
 'f_lf_a',
 'f_cf_a',
 'f_rf_a']

In [35]:
# sum the 9 assist columns, column by column (much faster than using apply on each row)
cols = a_cols.copy()

a = player_game[cols.pop()].copy()
while len(cols):
    a += player_game[cols.pop()]

In [36]:
cols = player_game.columns.copy()

# get all the error columns
e_cols = [col for col in cols if re.match(r'^f_.*e$', col)]
e_cols

['f_p_e',
 'f_c_e',
 'f_1b_e',
 'f_2b_e',
 'f_3b_e',
 'f_ss_e',
 'f_lf_e',
 'f_cf_e',
 'f_rf_e']

In [37]:
# sum the 9 error columns, column by column (much faster than using apply on each row)
cols = e_cols.copy()

e = player_game[cols.pop()].copy()
while len(cols):
    e += player_game[cols.pop()]

In [38]:
# add the columns to player_game
player_game = player_game.assign(f_po=po)
player_game = player_game.assign(f_a=a)
player_game = player_game.assign(f_e=e)

# remove all the other field columns
player_game = player_game.drop(f_cols, axis = 1)
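
As an alternative to the explicit loops above, the per-position columns can also be totaled with one vectorized call.  This is only a sketch on a toy DataFrame with three of the nine putout columns.  Note that the summed result is generally a wider integer type than uint8, so it may need to be downcast afterwards.

```python
import pandas as pd

# toy stand-in for three of the per-position putout columns
df = pd.DataFrame({'f_p_po': [0, 1], 'f_c_po': [7, 0], 'f_1b_po': [0, 9]},
                  dtype='uint8')

po_cols = [col for col in df.columns
           if col.startswith('f_') and col.endswith('_po')]

# row-wise total across the position columns in one vectorized call
f_po = df[po_cols].sum(axis=1)
print(f_po.tolist())  # [7, 10]
```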

In [39]:
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id', 'b_g', 'b_pa',
       'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_bb', 'b_ibb',
       'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g',
       'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf',
       'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb',
       'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk', 'f_po',
       'f_a', 'f_e'],
      dtype='object')

### Add Lahman player_id
This will make joins to the Lahman data easier.

In [40]:
# get the People data for Lahman player_id
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_wrangled = lahman.joinpath('wrangled')

os.chdir(p_lahman_wrangled)
people = bb.from_csv_with_types('people.csv')

In [41]:
# add player_id from Lahman people
player_game = pd.merge(player_game, people[['player_id','retro_id']], 
                        left_on = 'player_id', right_on = 'retro_id',
                        suffixes=['_retrosheet', '_lahman'])
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id_retrosheet',
       'b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi',
       'b_bb', 'b_ibb', 'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb',
       'b_cs', 'b_xi', 'p_g', 'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l',
       'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b',
       'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf',
       'p_xi', 'p_wp', 'p_bk', 'f_po', 'f_a', 'f_e', 'player_id_lahman',
       'retro_id'],
      dtype='object')

In [42]:
col_names = {'player_id_retrosheet':'player_id'}
player_game = player_game.rename(columns=col_names)
player_game = player_game.drop('retro_id', axis= 1)
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id', 'b_g', 'b_pa',
       'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_bb', 'b_ibb',
       'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g',
       'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf',
       'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb',
       'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk', 'f_po',
       'f_a', 'f_e', 'player_id_lahman'],
      dtype='object')
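
`pd.merge` performs an inner join by default, so any Retrosheet player without a matching retro_id in the Lahman data would be silently dropped.  The sketch below shows one way to check for that, using a left join with the `validate` and `indicator` options of `pd.merge`.  The DataFrames are toy stand-ins and all ids in them are made up.

```python
import pandas as pd

# toy stand-ins; column names follow the notebook, the ids are made up
player_game = pd.DataFrame({'player_id': ['aaaaa001', 'bbbbb001', 'aaaaa001'],
                            'b_h': [1, 0, 2]})
people = pd.DataFrame({'player_id': ['lahman_a', 'lahman_b'],
                       'retro_id': ['aaaaa001', 'bbbbb001']})

merged = pd.merge(player_game, people[['player_id', 'retro_id']],
                  how='left', left_on='player_id', right_on='retro_id',
                  suffixes=('_retrosheet', '_lahman'),
                  validate='many_to_one',  # raises if retro_id is not unique
                  indicator=True)          # adds a '_merge' column

# rows marked 'left_only' had no Lahman match
unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged) == len(player_game), unmatched)  # True 0
```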

### Add Lahman team_id and year
This will make joins to the Lahman data easier.

The Lahman team_id can be found from the year and retrosheet team_id.

In [43]:
# get the teams data
os.chdir(p_lahman_wrangled)
teams = bb.from_csv_with_types('teams.csv')

In [44]:
# add year to player_game
player_game['year_id'] = player_game['game_dt'] // 10000

In [45]:
# add team_id from teams
player_game = pd.merge(player_game, teams[['team_id', 'year_id', 'team_id_retro']],
                       left_on = ['year_id', 'team_id'], 
                       right_on = ['year_id', 'team_id_retro'],
                       suffixes=['_retrosheet', '_lahman'])

In [47]:
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id_retrosheet', 'player_id',
       'b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi',
       'b_bb', 'b_ibb', 'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb',
       'b_cs', 'b_xi', 'p_g', 'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l',
       'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b',
       'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf',
       'p_xi', 'p_wp', 'p_bk', 'f_po', 'f_a', 'f_e', 'player_id_lahman',
       'year_id', 'team_id_lahman', 'team_id_retro'],
      dtype='object')

In [48]:
(player_game['team_id_retrosheet'] == player_game['team_id_retro']).all()

True

In [49]:
col_names = {'team_id_retrosheet':'team_id'}
player_game = player_game.rename(columns=col_names)
player_game = player_game.drop('team_id_retro', axis= 1)
player_game.columns

Index(['game_id', 'game_dt', 'game_ct', 'team_id', 'player_id', 'b_g', 'b_pa',
       'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_bb', 'b_ibb',
       'b_so', 'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g',
       'p_gs', 'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf',
       'p_ab', 'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb',
       'p_so', 'p_gdp', 'p_hp', 'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk', 'f_po',
       'f_a', 'f_e', 'player_id_lahman', 'year_id', 'team_id_lahman'],
      dtype='object')

In [52]:
# put the key columns up front
player_game = bb.order_cols(player_game, 
                            ['game_id', 'player_id', 'team_id', 
                             'year_id', 'player_id_lahman', 
                             'team_id_lahman'])
player_game.columns

Index(['game_id', 'player_id', 'team_id', 'year_id', 'player_id_lahman',
       'team_id_lahman', 'game_dt', 'game_ct', 'b_g', 'b_pa', 'b_ab', 'b_r',
       'b_h', 'b_2b', 'b_3b', 'b_hr', 'b_rbi', 'b_bb', 'b_ibb', 'b_so',
       'b_gdp', 'b_hp', 'b_sh', 'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g', 'p_gs',
       'p_cg', 'p_sho', 'p_gf', 'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf', 'p_ab',
       'p_r', 'p_er', 'p_h', 'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb', 'p_so',
       'p_gdp', 'p_hp', 'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk', 'f_po', 'f_a',
       'f_e'],
      dtype='object')

In [53]:
# downcast the new columns
player_game = bb.optimize_df_dtypes(player_game)
player_game.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3549699 entries, 0 to 3549698
Data columns (total 57 columns):
game_id             object
player_id           object
team_id             object
year_id             uint16
player_id_lahman    object
team_id_lahman      object
game_dt             uint32
game_ct             uint8
b_g                 uint8
b_pa                uint8
b_ab                uint8
b_r                 uint8
b_h                 uint8
b_2b                uint8
b_3b                uint8
b_hr                uint8
b_rbi               uint8
b_bb                uint8
b_ibb               uint8
b_so                uint8
b_gdp               uint8
b_hp                uint8
b_sh                uint8
b_sf                uint8
b_sb                uint8
b_cs                uint8
b_xi                uint8
p_g                 uint8
p_gs                uint8
p_cg                uint8
p_sho               uint8
p_gf                uint8
p_w                 uint8
p_l                 ui

In [54]:
os.chdir(p_wrangled)
%time bb.to_csv_with_types(player_game, 'player_game.csv.gz')

CPU times: user 3min 18s, sys: 188 ms, total: 3min 18s
Wall time: 3min 18s


## 2. Wrangle Stats per Game Data

### Game Data Organization

The game data is not tidy.  The analysis will be easier if it is first made tidy.

There are 3 types of data:
1. data specific to a game -- the 'rest' columns below
2. data specific to the home team for that game -- the 'home' columns below
3. data specific to the away team for that game -- the 'away' columns below

The attributes for the home team are identical to the attributes for the away team.

This suggests breaking this out into 2 tables.
1. key: (game_id) -- information specific to this particular game
2. key: (game_id, team_id) -- information for the specified team for the specified game
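
The split can be sketched on a single-row toy table.  The values are taken from the WAS201809260 game, but the table has only a few of the real columns, and the `side` helper is hypothetical.

```python
import pandas as pd

# one game, with a game-level column plus home_* and away_* columns
game = pd.DataFrame({'game_id': ['WAS201809260'],
                     'park_id': ['WAS11'],
                     'home_team_id': ['WAS'], 'home_score_ct': [9],
                     'away_team_id': ['MIA'], 'away_score_ct': [3]})


def side(df, prefix, is_home):
    """Select one team's columns, strip the prefix, and flag home/away."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    out = df[['game_id'] + cols]
    out = out.rename(columns={c: c[len(prefix):] for c in cols})
    return out.assign(home=is_home)


# table 2: one row per (game_id, team_id)
team_game = pd.concat([side(game, 'home_', True), side(game, 'away_', False)],
                      ignore_index=True)

# table 1: one row per game_id, with only the game-level columns
game_only = game[[c for c in game.columns
                  if not c.startswith(('home_', 'away_'))]]

print(team_game)
print(game_only)
```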

In [55]:
os.chdir(p_collected)
game = bb.from_csv_with_types('game.csv.gz')

In [151]:
home = [col for col in game.columns if col.startswith('home')]
away = [col for col in game.columns if col.startswith('away')]
rest = [col for col in game.columns 
        if not col.startswith('home') and not col.startswith('away')]

In [162]:
home

['home_team_id',
 'home_start_pit_id',
 'home_score_ct',
 'home_hits_ct',
 'home_err_ct',
 'home_lob_ct',
 'home_finish_pit_id',
 'home_team_league_id',
 'home_team_game_ct',
 'home_line_tx',
 'home_ab_ct',
 'home_2b_ct',
 'home_3b_ct',
 'home_hr_ct',
 'home_bi_ct',
 'home_sh_ct',
 'home_sf_ct',
 'home_hp_ct',
 'home_bb_ct',
 'home_ibb_ct',
 'home_so_ct',
 'home_sb_ct',
 'home_cs_ct',
 'home_gdp_ct',
 'home_xi_ct',
 'home_pitcher_ct',
 'home_er_ct',
 'home_ter_ct',
 'home_wp_ct',
 'home_bk_ct',
 'home_po_ct',
 'home_a_ct',
 'home_pb_ct',
 'home_dp_ct',
 'home_tp_ct']

In [57]:
home[:4], away[:4]

(['home_team_id', 'home_start_pit_id', 'home_score_ct', 'home_hits_ct'],
 ['away_team_id', 'away_start_pit_id', 'away_score_ct', 'away_hits_ct'])

In [58]:
# home columns are same as away columns
[col[5:] for col in home] == [col[5:] for col in away]

True

In [59]:
game_tidy = game[rest].copy()
game_tidy.tail(3)

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,park_id,base4_ump_id,base1_ump_id,...,minutes_game_ct,inn_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id,outs_ct,completion_tx,forfeit_tx,protest_tx
129843,WAS201809240,20180924,0,Monday,706,F,N,WAS11,herna901,porta901,...,189,9,millj006,alcas001,,,51,,,
129844,WAS201809250,20180925,0,Tuesday,705,F,N,WAS11,porta901,whitc901,...,192,9,schem001,brigj002,,,51,,,
129845,WAS201809260,20180926,0,Wednesday,405,F,D,WAS11,whitc901,millb901,...,165,8,suerw002,chenw001,,,42,,,


In [60]:
home_team_game = game[['game_id'] + home]
away_team_game = game[['game_id'] + away]

In [61]:
home_team_game.shape, away_team_game.shape

((129846, 36), (129846, 36))

In [62]:
# create a column of True and a column of False that share game's index
s_true = pd.Series(True, index=game.index)
s_false = pd.Series(False, index=game.index)

In [63]:
# add the home team column
home_team_game = home_team_game.assign(home = s_true)
away_team_game = away_team_game.assign(home = s_false)

In [64]:
# rename the columns by removing 'home_'
home_column_mapping = {col:col[5:] for col in home}
home_team_game = home_team_game.rename(columns = home_column_mapping)
home_team_game = bb.order_cols(home_team_game, ['game_id', 'home'])

# rename the columns by removing 'away_'
away_column_mapping = {col:col[5:] for col in away}
away_team_game = away_team_game.rename(columns = away_column_mapping)
away_team_game = bb.order_cols(away_team_game, ['game_id', 'home'])

In [65]:
home_team_game.tail(3)

Unnamed: 0,game_id,home,team_id,start_pit_id,score_ct,hits_ct,err_ct,lob_ct,finish_pit_id,team_league_id,...,pitcher_ct,er_ct,ter_ct,wp_ct,bk_ct,po_ct,a_ct,pb_ct,dp_ct,tp_ct
129843,WAS201809240,True,WAS,stras001,7,9,2,8,dools001,N,...,5,2,2,1,0,27,10,0,0,0
129844,WAS201809250,True,WAS,schem001,9,11,2,9,cordj001,N,...,4,3,3,0,0,27,3,0,0,0
129845,WAS201809260,True,WAS,mcgok002,9,12,0,8,glovk001,N,...,5,3,3,0,0,21,6,0,0,0


In [66]:
away_team_game.tail(3)

Unnamed: 0,game_id,home,team_id,start_pit_id,score_ct,hits_ct,err_ct,lob_ct,finish_pit_id,team_league_id,...,pitcher_ct,er_ct,ter_ct,wp_ct,bk_ct,po_ct,a_ct,pb_ct,dp_ct,tp_ct
129843,WAS201809240,False,MIA,alcas001,3,6,0,10,gravb001,N,...,3,7,7,0,0,24,6,0,0,0
129844,WAS201809250,False,MIA,brigj002,4,9,1,6,meyeb002,N,...,5,9,9,1,0,24,13,0,3,0
129845,WAS201809260,False,MIA,chenw001,3,5,1,6,herne002,N,...,4,8,8,1,0,21,7,0,1,0


In [67]:
# concatenate these dataframes
team_game = pd.concat([home_team_game, away_team_game])

In [68]:
# view a game
team_game[team_game['game_id'] == 'WAS201809260']

Unnamed: 0,game_id,home,team_id,start_pit_id,score_ct,hits_ct,err_ct,lob_ct,finish_pit_id,team_league_id,...,pitcher_ct,er_ct,ter_ct,wp_ct,bk_ct,po_ct,a_ct,pb_ct,dp_ct,tp_ct
129845,WAS201809260,True,WAS,mcgok002,9,12,0,8,glovk001,N,...,5,3,3,0,0,21,6,0,0,0
129845,WAS201809260,False,MIA,chenw001,3,5,1,6,herne002,N,...,4,8,8,1,0,21,7,0,1,0


In [69]:
# improve column names
names = {col:col.replace('_ct','') for col in team_game.columns if col.endswith('_ct')}
names

{'score_ct': 'score',
 'hits_ct': 'hits',
 'err_ct': 'err',
 'lob_ct': 'lob',
 'team_game_ct': 'team_game',
 'ab_ct': 'ab',
 '2b_ct': '2b',
 '3b_ct': '3b',
 'hr_ct': 'hr',
 'bi_ct': 'bi',
 'sh_ct': 'sh',
 'sf_ct': 'sf',
 'hp_ct': 'hp',
 'bb_ct': 'bb',
 'ibb_ct': 'ibb',
 'so_ct': 'so',
 'sb_ct': 'sb',
 'cs_ct': 'cs',
 'gdp_ct': 'gdp',
 'xi_ct': 'xi',
 'pitcher_ct': 'pitcher',
 'er_ct': 'er',
 'ter_ct': 'ter',
 'wp_ct': 'wp',
 'bk_ct': 'bk',
 'po_ct': 'po',
 'a_ct': 'a',
 'pb_ct': 'pb',
 'dp_ct': 'dp',
 'tp_ct': 'tp'}

In [154]:
team_game = team_game.rename(columns=names)
team_game.columns

Index(['game_id', 'game_date', 'year_id', 'team_id', 'team_id_lahman', 'home',
       'start_pit_id', 'score', 'hits', 'f_e', 'lob', 'finish_pit_id',
       'team_league_id', 'line_tx', 'b_ab', 'b_2b', 'b_3b', 'b_hr', 'bi',
       'b_sh', 'b_sf', 'b_hp', 'b_bb', 'b_ibb', 'b_so', 'b_sb', 'b_cs',
       'b_dpg', 'b_xi', 'pitcher', 'p_er', 'p_ter', 'p_wp', 'p_bk', 'f_po',
       'f_a', 'pb', 'dp', 'tp'],
      dtype='object')

In [155]:
names2 = {'score':'b_r', 'hits':'b_h', 'ab':'b_ab', '2b':'b_2b', '3b':'b_3b', 'hr':'b_hr', 'sh':'b_sh',
         'sf':'b_sf', 'hp':'b_hp', 'bb':'b_bb', 'ibb':'b_ibb', 'so':'b_so',
         'sb':'b_sb', 'cs':'b_cs', 'gdp':'b_dpg', 'xi':'b_xi', 'er':'p_er',
         'ter':'p_ter','wp':'p_wp', 'bk':'p_bk','err':'f_e', 'po':'f_po',
         'a':'f_a'}

In [156]:
team_game = team_game.rename(columns=names2)
team_game.columns

Index(['game_id', 'game_date', 'year_id', 'team_id', 'team_id_lahman', 'home',
       'start_pit_id', 'b_r', 'b_h', 'f_e', 'lob', 'finish_pit_id',
       'team_league_id', 'line_tx', 'b_ab', 'b_2b', 'b_3b', 'b_hr', 'bi',
       'b_sh', 'b_sf', 'b_hp', 'b_bb', 'b_ibb', 'b_so', 'b_sb', 'b_cs',
       'b_dpg', 'b_xi', 'pitcher', 'p_er', 'p_ter', 'p_wp', 'p_bk', 'f_po',
       'f_a', 'pb', 'dp', 'tp'],
      dtype='object')

In [157]:
team_game.isna().mean()

game_id           0.000000
game_date         0.000000
year_id           0.000000
team_id           0.000000
team_id_lahman    0.000000
home              0.000000
start_pit_id      0.000000
b_r               0.000000
b_h               0.000000
f_e               0.000000
lob               0.000000
finish_pit_id     0.130836
team_league_id    0.000000
line_tx           0.000000
b_ab              0.000000
b_2b              0.000000
b_3b              0.000000
b_hr              0.000000
bi                0.000000
b_sh              0.000000
b_sf              0.000000
b_hp              0.000000
b_bb              0.000000
b_ibb             0.000000
b_so              0.000000
b_sb              0.000000
b_cs              0.000000
b_dpg             0.000000
b_xi              0.000000
pitcher           0.000000
p_er              0.000000
p_ter             0.000000
p_wp              0.000000
p_bk              0.000000
f_po              0.000000
f_a               0.000000
pb                0.000000
d

In [158]:
# the team_game column is 100% null, drop it
# errors='ignore' keeps this cell safe to re-run after the column is gone
team_game = team_game.drop(columns='team_game', errors='ignore')

In [159]:
# game_id, home is unique
bb.is_unique(team_game, ['game_id', 'home'])

True

In [160]:
# game_id, team_id is unique
bb.is_unique(team_game, ['game_id', 'team_id'])

True

### Create pd.datetime from game_dt and start_game_tm

Here is the relevant information.
1. am/pm is not specified.
2. The time is not in 24-hour format.
3. The game_dt and start_game_tm are represented as integers.
4. A start_game_tm value of zero means the start time is unknown.
5. The daynight_park_cd is never missing.  It is usually, but not always, correct.  This specifies whether the game is a "day game" or a "night game".
6. MLB domain knowledge: Some games may start late, due to a rain delay for example, but games never start after midnight.
7. MLB domain knowledge: Some games may start early, to allow for travel to the next city, but games never start before 9 am.

Given the above, am/pm can be deduced as follows:
* start_game_tm == 0 => use midnight (to represent unknown time)
* start_game_tm >= 1200 => pm
* start_game_tm < 900 => pm
* 900 <= start_game_tm < 1200, and day/night = day, => am
* 900 <= start_game_tm < 1200, and day/night = night, => pm

In [75]:
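def parse_datetime(row):
    """Combine integer game_dt (YYYYMMDD) and 12-hour start_game_tm (HHMM) into a Timestamp."""
    date = row['game_dt']
    time = row['start_game_tm']
    day_night = row['daynight_park_cd']

    # times before 9:00 must be pm; 9:00-11:59 is pm only for night games
    # time == 0 (unknown start time) is left as midnight
    if 0 < time < 900:
        time += 1200
    elif (900 <= time < 1200) and day_night == 'N':
        time += 1200

    time_str = f'{time//100:02d}:{time%100:02d}'
    datetime_str = f'{date} {time_str}'
    return pd.to_datetime(datetime_str, format='%Y%m%d %H:%M')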

In [76]:
# create new datetime column
game_tidy['game_date'] = game_tidy.apply(parse_datetime, axis=1)
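
Applying a Python function row by row can be slow on a table of this size. As a possible alternative, the same am/pm rules can be expressed with column-wise operations. This is a minimal sketch, assuming the same three columns (game_dt, start_game_tm, daynight_park_cd); the function name `parse_datetime_vectorized` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def parse_datetime_vectorized(df):
    """Column-wise version of the am/pm rules (illustrative sketch)."""
    time = df['start_game_tm']
    night = df['daynight_park_cd'] == 'N'

    # add 12 hours for 1-859, and for 900-1159 when the game is a night game
    to_pm = ((time > 0) & (time < 900)) | ((time >= 900) & (time < 1200) & night)
    time24 = time.where(~to_pm, time + 1200)

    # convert HHMM to minutes after midnight; 0 (unknown) stays at midnight
    minutes = (time24 // 100) * 60 + time24 % 100
    date = pd.to_datetime(df['game_dt'].astype(str), format='%Y%m%d')
    return date + pd.to_timedelta(minutes, unit='min')

# made-up sample rows covering each rule
sample = pd.DataFrame({
    'game_dt': [20180926, 20180925, 20180401, 20180402, 20180403],
    'start_game_tm': [405, 705, 1105, 1105, 0],
    'daynight_park_cd': ['D', 'N', 'D', 'N', 'N'],
})
sample_dates = parse_datetime_vectorized(sample)
```

Comparing the result against the row-wise function on the full table would be a reasonable way to confirm the two agree before switching.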

In [77]:
# these are no longer necessary
game_tidy = game_tidy.drop(['game_dt', 'game_dy', 'start_game_tm'], axis=1)

In [78]:
game_tidy.columns

Index(['game_id', 'game_ct', 'dh_fl', 'daynight_park_cd', 'park_id',
       'base4_ump_id', 'base1_ump_id', 'base2_ump_id', 'base3_ump_id',
       'lf_ump_id', 'rf_ump_id', 'attend_park_ct', 'scorer_record_id',
       'translator_record_id', 'inputter_record_id', 'input_record_ts',
       'edit_record_ts', 'method_record_cd', 'pitches_record_cd',
       'temp_park_ct', 'wind_direction_park_cd', 'wind_speed_park_ct',
       'field_park_cd', 'precip_park_cd', 'sky_park_cd', 'minutes_game_ct',
       'inn_ct', 'win_pit_id', 'lose_pit_id', 'save_pit_id', 'gwrbi_bat_id',
       'outs_ct', 'completion_tx', 'forfeit_tx', 'protest_tx', 'game_date'],
      dtype='object')

### Remove Fields Not Used in (Later) Analysis

In [79]:
ump_cols = [col for col in game_tidy.columns if 'ump' in col]
ump_cols

['base4_ump_id',
 'base1_ump_id',
 'base2_ump_id',
 'base3_ump_id',
 'lf_ump_id',
 'rf_ump_id']

In [80]:
record_cols = [col for col in game_tidy.columns if 'record' in col]
record_cols

['scorer_record_id',
 'translator_record_id',
 'inputter_record_id',
 'input_record_ts',
 'edit_record_ts',
 'method_record_cd',
 'pitches_record_cd']

In [81]:
tx_cols = [col for col in game_tidy.columns if 'tx' in col]
tx_cols

['completion_tx', 'forfeit_tx', 'protest_tx']

In [82]:
# these columns will not be used in the analysis
drop_cols = ump_cols + record_cols + tx_cols
game_tidy = game_tidy.drop(drop_cols, axis=1)

In [83]:
game_tidy.columns

Index(['game_id', 'game_ct', 'dh_fl', 'daynight_park_cd', 'park_id',
       'attend_park_ct', 'temp_park_ct', 'wind_direction_park_cd',
       'wind_speed_park_ct', 'field_park_cd', 'precip_park_cd', 'sky_park_cd',
       'minutes_game_ct', 'inn_ct', 'win_pit_id', 'lose_pit_id', 'save_pit_id',
       'gwrbi_bat_id', 'outs_ct', 'game_date'],
      dtype='object')

### Examine Park Fields
See: http://chadwick.sourceforge.net/doc/cwgame.html

In [84]:
park_fields = [col for col in game_tidy.columns if 'park' in col]
park_fields

['daynight_park_cd',
 'park_id',
 'attend_park_ct',
 'temp_park_ct',
 'wind_direction_park_cd',
 'wind_speed_park_ct',
 'field_park_cd',
 'precip_park_cd',
 'sky_park_cd']

In [85]:
park = game_tidy[park_fields].copy()
park.nunique()

daynight_park_cd              2
park_id                      79
attend_park_ct            44892
temp_park_ct                 90
wind_direction_park_cd        9
wind_speed_park_ct           51
field_park_cd                 5
precip_park_cd                6
sky_park_cd                   6
dtype: int64

In [86]:
# no nulls are represented as np.nan yet; missing values use sentinel codes such as 0 or -1
park.isna().sum()

daynight_park_cd          0
park_id                   0
attend_park_ct            0
temp_park_ct              0
wind_direction_park_cd    0
wind_speed_park_ct        0
field_park_cd             0
precip_park_cd            0
sky_park_cd               0
dtype: int64

In [87]:
# Day or Night Game
park['daynight_park_cd'].value_counts()

N    83738
D    46108
Name: daynight_park_cd, dtype: int64

In [88]:
# park id for Park table below
park['park_id'].value_counts().head(3)

CHI11    5101
BOS07    5097
LOS03    4877
Name: park_id, dtype: int64

In [89]:
# attendance
park['attend_park_ct'].value_counts().sort_index().head()

-1         1
 0      5009
 306       1
 365       1
 409       1
Name: attend_park_ct, dtype: int64

In [90]:
# 0 and -1 are used as missing-value codes for attendance, make these na
park['attend_park_ct'] = park['attend_park_ct'].replace([0, -1], np.nan)

In [91]:
# attendance
park['attend_park_ct'].value_counts().sort_index().head()

306.0    1
365.0    1
409.0    1
413.0    1
461.0    1
Name: attend_park_ct, dtype: int64
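
Replacing the sentinel values with np.nan converts the column to float, which is why attendance now prints as 306.0. If integer values are preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses a made-up Series rather than the real column, and it is an assumption that downstream code (including `bb.to_csv_with_types`) handles this dtype.

```python
import pandas as pd

# made-up attendance values, including the two sentinel codes
raw_attend = pd.Series([-1, 0, 306, 365])

# cast to the nullable integer dtype, then blank out the sentinel codes
attend = raw_attend.astype('Int64').mask(raw_attend.isin([0, -1]))

n_missing = attend.isna().sum()
```

The missing entries are stored as `pd.NA` and the remaining values stay integers.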

In [92]:
# note: there was one baseball game in which the public was not allowed to attend
# this is arguably null, as people wanted to attend, but could not
# click on the following generated link to read about it
bb.game_id_to_url('BAL201504290')

In [93]:
park['temp_park_ct'].value_counts().sort_index().head()

-1        85
 0     49108
 12        1
 14        1
 19        1
Name: temp_park_ct, dtype: int64

In [94]:
# a temperature of 0 or -1 F is not plausible for an MLB game, treat these as missing
park['temp_park_ct'] = park['temp_park_ct'].replace([0, -1], np.nan)

In [95]:
park['temp_park_ct'].value_counts().sort_index().head()

12.0    1
14.0    1
19.0    1
20.0    1
23.0    1
Name: temp_park_ct, dtype: int64

In [96]:
park['wind_direction_park_cd'].value_counts()

0    70961
4    12609
8    10184
2     8586
3     7509
1     6816
5     4907
6     4637
7     3637
Name: wind_direction_park_cd, dtype: int64

In [97]:
# http://chadwick.sourceforge.net/doc/cwgame.html#cwtools-cwgame-winddirection
direction = {
    0:'unknown',
    1:'to_lf',
    2:'to_cf',
    3:'to_rf',
    4:'l_to_r',
    5:'from_lf',
    6:'from_cf',
    7:'from_rf',
    8:'r_to_l'}

In [98]:
park['wind_direction_park_cd'] = park['wind_direction_park_cd'].map(direction)
park['wind_direction_park_cd'].value_counts()

unknown    70961
l_to_r     12609
r_to_l     10184
to_cf       8586
to_rf       7509
to_lf       6816
from_lf     4907
from_cf     4637
from_rf     3637
Name: wind_direction_park_cd, dtype: int64

In [99]:
park['wind_direction_park_cd'] = park['wind_direction_park_cd'].replace('unknown', np.nan)
park['wind_direction_park_cd'].value_counts(dropna=False)

NaN        70961
l_to_r     12609
r_to_l     10184
to_cf       8586
to_rf       7509
to_lf       6816
from_lf     4907
from_cf     4637
from_rf     3637
Name: wind_direction_park_cd, dtype: int64

In [100]:
park['wind_speed_park_ct'].value_counts().sort_index().head()

-1    59819
 0     9835
 1      816
 2     1690
 3     2563
Name: wind_speed_park_ct, dtype: int64

In [101]:
park['wind_speed_park_ct'] = park['wind_speed_park_ct'].replace(-1, np.nan)
park['wind_speed_park_ct'].isna().sum()

59819

In [102]:
park['wind_speed_park_ct'].value_counts().sort_index().head()

0.0    9835
1.0     816
2.0    1690
3.0    2563
4.0    2293
Name: wind_speed_park_ct, dtype: int64

In [103]:
# http://chadwick.sourceforge.net/doc/cwgame.html#cwtools-cwgame-fieldcondition
condition = {
    0:'unknown',
    1:'soaked',
    2:'wet',
    3:'damp',
    4:'dry'}

In [104]:
park['field_park_cd'] = park['field_park_cd'].map(condition)
park['field_park_cd'].value_counts()

unknown    103217
dry         23367
wet          2470
damp          529
soaked        263
Name: field_park_cd, dtype: int64

In [105]:
park['field_park_cd'] = park['field_park_cd'].replace('unknown', np.nan)
park['field_park_cd'].value_counts(dropna=False)

NaN       103217
dry        23367
wet         2470
damp         529
soaked       263
Name: field_park_cd, dtype: int64

In [106]:
# http://chadwick.sourceforge.net/doc/cwgame.html#cwtools-cwgame-precipitation
precip = {
    0:'unknown',
    1:'none',
    2:'drizzle',
    3:'showers',
    4:'rain',
    5:'snow'}

In [107]:
park['precip_park_cd'] = park['precip_park_cd'].map(precip)
park['precip_park_cd'].value_counts()

unknown    94704
none       32221
rain        1590
drizzle      868
showers      439
snow          24
Name: precip_park_cd, dtype: int64

In [108]:
park['precip_park_cd'] = park['precip_park_cd'].replace('unknown', np.nan)
park['precip_park_cd'].value_counts(dropna=False)

NaN        94704
none       32221
rain        1590
drizzle      868
showers      439
snow          24
Name: precip_park_cd, dtype: int64

In [109]:
# http://chadwick.sourceforge.net/doc/cwgame.html#cwtools-cwgame-sky
sky = {
    0:'unknown',
    1:'sunny',
    2:'cloudy',
    3:'overcast',
    4:'night',
    5:'dome'}

In [110]:
park['sky_park_cd'] = park['sky_park_cd'].map(sky)
park['sky_park_cd'].value_counts()

unknown     50417
sunny       21776
night       20833
cloudy      20154
dome        11584
overcast     5082
Name: sky_park_cd, dtype: int64

In [111]:
park['sky_park_cd'] = park['sky_park_cd'].replace('unknown', np.nan)
park['sky_park_cd'].value_counts(dropna=False)

NaN         50417
sunny       21776
night       20833
cloudy      20154
dome        11584
overcast     5082
Name: sky_park_cd, dtype: int64
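
The same two steps (map the integer code to a label, then turn 'unknown' into np.nan) were repeated for wind direction, field condition, precipitation and sky. As a sketch, they could be folded into one helper. The name `decode_codes` is hypothetical; it relies on `Series.map` returning NaN for any key that is absent from the dictionary.

```python
import pandas as pd

def decode_codes(series, codes, unknown='unknown'):
    """Map integer codes to labels; codes labelled `unknown` become NaN."""
    # drop the 'unknown' entry so that Series.map leaves it as NaN
    labels = {code: label for code, label in codes.items() if label != unknown}
    return series.map(labels)

# field condition codes as used above
condition = {0: 'unknown', 1: 'soaked', 2: 'wet', 3: 'damp', 4: 'dry'}

# made-up sample of raw codes
raw_codes = pd.Series([0, 4, 2, 0, 1])
decoded = decode_codes(raw_codes, condition)
```

One side effect of this approach is that any code missing from the dictionary also becomes NaN, so it is worth checking the raw value counts first, as was done above.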

In [112]:
# null is now represented as np.nan
park.isna().sum()

daynight_park_cd               0
park_id                        0
attend_park_ct              5010
temp_park_ct               49193
wind_direction_park_cd     70961
wind_speed_park_ct         59819
field_park_cd             103217
precip_park_cd             94704
sky_park_cd                50417
dtype: int64

In [113]:
# copy the wrangled park data back to game_tidy
game_tidy[park.columns] = park

In [114]:
game_tidy.columns

Index(['game_id', 'game_ct', 'dh_fl', 'daynight_park_cd', 'park_id',
       'attend_park_ct', 'temp_park_ct', 'wind_direction_park_cd',
       'wind_speed_park_ct', 'field_park_cd', 'precip_park_cd', 'sky_park_cd',
       'minutes_game_ct', 'inn_ct', 'win_pit_id', 'lose_pit_id', 'save_pit_id',
       'gwrbi_bat_id', 'outs_ct', 'game_date'],
      dtype='object')

In [115]:
game_tidy = bb.order_cols(game_tidy, ['game_id', 'game_date'])
game_tidy.tail(3)

Unnamed: 0,game_id,game_date,game_ct,dh_fl,daynight_park_cd,park_id,attend_park_ct,temp_park_ct,wind_direction_park_cd,wind_speed_park_ct,field_park_cd,precip_park_cd,sky_park_cd,minutes_game_ct,inn_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id,outs_ct
129843,WAS201809240,2018-09-24 19:06:00,0,F,N,WAS11,22428.0,63.0,from_rf,10.0,wet,rain,,189,9,millj006,alcas001,,,51
129844,WAS201809250,2018-09-25 19:05:00,0,F,N,WAS11,26483.0,78.0,to_lf,4.0,,,cloudy,192,9,schem001,brigj002,,,51
129845,WAS201809260,2018-09-26 16:05:00,0,F,D,WAS11,28680.0,88.0,r_to_l,6.0,,,,165,8,suerw002,chenw001,,,42


In [116]:
# the primary key is (game_id), verify no dups
game_tidy['game_id'].is_unique

True

### Add Year, Lahman Team Id and Game Date to Team Game
This will make later queries easier.

In [117]:
team_game.shape

(259692, 37)

In [118]:
team_game = pd.merge(team_game, game_tidy[['game_id', 'game_date']], on='game_id')

In [119]:
team_game['year_id'] = team_game['game_date'].dt.year

In [120]:
team_game.shape

(259692, 39)
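
Comparing the shape before and after the merge shows that no rows were lost or duplicated. A complementary guard is the `validate` argument of `pd.merge`, which raises an error if the key relationship is not the expected one. The sketch below uses small made-up frames (`toy_team_game`, `toy_game`) rather than the real data.

```python
import pandas as pd

# two rows per game on the left, one row per game on the right
toy_team_game = pd.DataFrame({
    'game_id': ['A', 'A', 'B', 'B'],
    'home': [False, True, False, True],
})
toy_game = pd.DataFrame({
    'game_id': ['A', 'B'],
    'game_date': pd.to_datetime(['2018-09-24 19:06', '2018-09-25 19:05']),
})

# many_to_one: raises pandas.errors.MergeError if game_id repeats in toy_game
merged = pd.merge(toy_team_game, toy_game, on='game_id',
                  how='left', validate='many_to_one')
```

With `how='left'`, unmatched rows would be kept with a null game_date instead of being dropped, so they would show up in a later `isna()` check.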

In [121]:
# add lahman team_id from teams
team_game = pd.merge(team_game, teams[['team_id', 'year_id', 'team_id_retro']],
                     left_on = ['year_id', 'team_id'], 
                     right_on = ['year_id', 'team_id_retro'],
                     suffixes=['_retrosheet', '_lahman'])
team_game.columns

Index(['game_id', 'home', 'team_id_retrosheet', 'start_pit_id', 'score',
       'hits', 'f_e', 'lob', 'finish_pit_id', 'team_league_id', 'team_game',
       'line_tx', 'b_ab', 'b_2b', 'b_3b', 'b_hr', 'bi', 'b_sh', 'b_sf', 'b_hp',
       'b_bb', 'b_ibb', 'b_so', 'b_sb', 'b_cs', 'b_dpg', 'b_xi', 'pitcher',
       'p_er', 'p_ter', 'p_wp', 'p_bk', 'f_po', 'f_a', 'pb', 'dp', 'tp',
       'game_date', 'year_id', 'team_id_lahman', 'team_id_retro'],
      dtype='object')

In [122]:
(team_game['team_id_retrosheet'] == team_game['team_id_retro']).all()

True

In [123]:
col_names = {'team_id_retrosheet':'team_id'}
team_game = team_game.rename(columns=col_names)
team_game = team_game.drop('team_id_retro', axis= 1)
team_game.columns

Index(['game_id', 'home', 'team_id', 'start_pit_id', 'score', 'hits', 'f_e',
       'lob', 'finish_pit_id', 'team_league_id', 'team_game', 'line_tx',
       'b_ab', 'b_2b', 'b_3b', 'b_hr', 'bi', 'b_sh', 'b_sf', 'b_hp', 'b_bb',
       'b_ibb', 'b_so', 'b_sb', 'b_cs', 'b_dpg', 'b_xi', 'pitcher', 'p_er',
       'p_ter', 'p_wp', 'p_bk', 'f_po', 'f_a', 'pb', 'dp', 'tp', 'game_date',
       'year_id', 'team_id_lahman'],
      dtype='object')

In [124]:
team_game = bb.order_cols(team_game, ['game_id', 'game_date', 
                                      'year_id', 'team_id', 
                                      'team_id_lahman'])
team_game.tail(3)

Unnamed: 0,game_id,game_date,year_id,team_id,team_id_lahman,home,start_pit_id,score,hits,f_e,...,pitcher,p_er,p_ter,p_wp,p_bk,f_po,f_a,pb,dp,tp
259689,WAS201809081,2018-09-08 17:15:00,2018,CHN,CHN,False,garcj004,3,9,3,...,6,7,7,2,0,24,17,0,2,0
259690,WAS201809082,2018-09-08 21:00:00,2018,CHN,CHN,False,hamec001,5,6,0,...,5,6,6,0,0,24,4,0,1,0
259691,WAS201809130,2018-09-13 16:05:00,2018,CHN,CHN,False,montm002,4,10,0,...,9,3,3,2,0,30,6,0,0,0


In [125]:
# the Retrosheet and Lahman team ids are the same about 97% of the time
(team_game.team_id == team_game.team_id_lahman).mean()

0.9677849144371025
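
To see which ids account for the remaining rows, the distinct pairs that differ can be listed. This is a minimal sketch on a made-up frame with the same two column names; the id values are illustrative only.

```python
import pandas as pd

# made-up rows: one team whose ids differ, two whose ids match
toy_ids = pd.DataFrame({
    'team_id':        ['CHN', 'ANA', 'ANA', 'BOS'],
    'team_id_lahman': ['CHN', 'LAA', 'LAA', 'BOS'],
})

differs = toy_ids['team_id'] != toy_ids['team_id_lahman']
mismatch_pairs = (toy_ids.loc[differs, ['team_id', 'team_id_lahman']]
                  .drop_duplicates())
```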

In [161]:
os.chdir(p_wrangled)
%time bb.to_csv_with_types(team_game, 'team_game.csv.gz')

CPU times: user 15.2 s, sys: 11.7 ms, total: 15.3 s
Wall time: 15.3 s


In [127]:
os.chdir(p_wrangled)
%time bb.to_csv_with_types(game_tidy, 'game.csv.gz')

CPU times: user 2.27 s, sys: 20 ms, total: 2.29 s
Wall time: 2.28 s


## 3. Scrape Data for Players Lookup Table

A lookup table from player_id to player information is needed.

This data is similar to Lahman's People.csv data, but it has different fields.

There is no separate file for this.  It will be scraped from a web page.

In [140]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [141]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

In [142]:
# read from this string instead of file
players = pd.read_csv(StringIO(table_txt), 
    parse_dates=['Play debut', 'Mgr debut', 'Ump debut'])
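
The quote removal and the `StringIO` step can be tried without network access. The sketch below uses a small hand-written sample instead of the real page; its layout (header names, quoted fields, month/day/year dates) is an assumption based on the output shown in this notebook.

```python
from io import StringIO

import pandas as pd

# hand-written sample in the assumed layout of the <pre> block
sample_txt = '''ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
"aardd001","Aardsma","David","04/06/2004",,,
"aarot101","Aaron","Tommie","04/10/1962",,"04/06/1979",
'''

# remove unnecessary double quotes, as above
sample_txt = sample_txt.replace('"', '')

# StringIO lets read_csv treat the string as if it were a file
sample_players = pd.read_csv(StringIO(sample_txt), parse_dates=['Play debut'])
```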

In [143]:
# Coach debut has some bad values
def parse_date(value):
    # perhaps 43188 means 04/31/1988 (not a valid date, April has 30 days), so use null as unsure
    # no coach debuted prior to the year 1800
    if pd.isna(value) or value == '43188' or int(value[-4:]) < 1800:
        return pd.NaT
    else:
        # pd.datetime was deprecated and later removed from pandas; use pd.to_datetime
        return pd.to_datetime(value, format='%m/%d/%Y')

In [144]:
players['Coach debut'] = players['Coach debut'].apply(parse_date)
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT
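
A column-wise alternative to the element-wise `parse_date` function is `pd.to_datetime` with `errors='coerce'`, followed by a mask on the year. This is a sketch on made-up values, not the real column: strings that do not match the format, or dates outside the supported range, become NaT, and the `where` step enforces the 1800 cutoff.

```python
import pandas as pd

# made-up sample: a good date, the malformed value, a missing value, an implausible year
raw_coach = pd.Series(['04/06/1979', '43188', None, '01/01/1600'])

# unparseable or out-of-range values become NaT instead of raising
coach_debut = pd.to_datetime(raw_coach, format='%m/%d/%Y', errors='coerce')

# no coach debuted prior to the year 1800
coach_debut = coach_debut.where(coach_debut.dt.year >= 1800)
```

Unlike `parse_date`, this coerces every malformed value to NaT rather than only '43188', so counting the nulls before and after is a useful check.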


In [145]:
name_chg = {'ID':'player_id',
         'Last':'last_name',
         'First':'first_name',
         'Play debut':'player_debut',
         'Mgr debut':'mgr_debut',
         'Coach debut': 'coach_debut',
         'Ump debut':'ump_debut'}
players = players.rename(columns=name_chg)
players.head()

Unnamed: 0,player_id,last_name,first_name,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


#### Persist Players

In [146]:
os.chdir(p_wrangled)
bb.to_csv_with_types(players, 'players.csv')

## 4. Scrape Data for Stadium (Park) Lookup Table

A lookup table from park_id to park information is needed.  

This data is similar to Lahman's Parks.csv, but has different fields.

There is no separate file for this.  It will be scraped from a web page.

In [147]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt), parse_dates=['START', 'END'])

In [149]:
parks.columns = parks.columns.str.lower()
parks = parks.rename(columns={'parkid':'park_id','name':'park_name',
                              'start':'park_start','end':'park_end'})
parks.head()

Unnamed: 0,park_id,park_name,aka,city,state,park_start,park_end,league,notes
0,ALB01,Riverside Park,,Albany,NY,1880-09-11,1882-05-30,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,1884-04-30,1884-05-31,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,1966-04-19,NaT,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,1972-04-21,1993-10-03,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,1994-04-11,NaT,AL,


#### Persist Stadiums

In [150]:
os.chdir(p_wrangled)
bb.to_csv_with_types(parks, 'parks.csv')