# Retrosheet Baseball Data -- Part 2

Baseball Notebooks
1. **LahmanBaseball**: the Lahman data will be downloaded and parsed.
2. **RetroBaseball-1**: the Retrosheet Play-by-Play data will be downloaded, parsed and saved to compressed csv files.
3. **RetroBaseball-2**: the data will be prepared (wrangled) for analysis and saved to compressed csv files.
4. **RetroBaseball-3**: the data from the preceding notebook will be saved to Postgres.
5. **RetroAnalysis 1**: the baseball data will be analyzed.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension.  
https://github.com/jupyterlab/jupyterlab-toc

## Data Wrangling

The Retrosheet event data includes every play for every major league game since 1921. 
A subset of that data will be used here.

Retrosheet Data Wrangling will include:
1. Manipulating player per game data.
2. Manipulating game data.
3. Creating "lookup tables" by web scraping.
4. Creating data dictionaries (aka codebooks) by "scraping" Dr. Turocy's C source code.

At the end of the data wrangling, 6 DataFrames will exist.
1. **player_game:** player per game stats 
2. **player_game_fields:** player per game field descriptions
3. **game:** game stats
4. **game_fields:** game field descriptions
5. **players:** player info
6. **parks:** stadium info

The above 6 dataframes will be persisted to csv files; the two large ones (player_game and game) are gzipped.

## Repeatable Research
All data processing should be documented so that others can repeat the results.  This includes every step from downloading the data through analysis.

## 1. Read Parsed Player_Game Data

### Directories for Data Processing
~/data/retrosheet/raw -- event files downloaded from Retrosheet  
~/data/retrosheet/parsed -- results of running 2 parsers on the event files  
~/data/retrosheet/df_csv -- collect the parsed files into dataframes and save these to csv  
~/data/retrosheet/wrangled -- prepare the data for analysis and save to csv  

In [1]:
import pandas as pd
import numpy as np
import os
import re
from pathlib import Path

In [2]:
# create path objects -- these directories were created in previous notebook
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
raw = retrosheet.joinpath('raw')
parsed = retrosheet.joinpath('parsed')
df_csv = retrosheet.joinpath('df_csv')
wrangled = retrosheet.joinpath('wrangled')
src = retrosheet.joinpath('src')

## Retrosheet Data Dictionary Overview
A "data dictionary" is also called a "codebook".

The following is a high-level overview of the meaning of the field names created by the Retrosheet parsers.

```
Suffix Meaning
CT     count (integer)
ID     identifier
FL     boolean flag
CD     code (enumerated data type)
DT     date
DY     day of week
TM     time

Prefix Meaning
B      batter
P      pitcher
```

In most cases, the abbreviation between the prefix and the suffix is a common baseball abbreviation.  For common baseball abbreviations see:  
http://www.espn.com/gen/editors/mlb/glossary.html

From the glossary above, "SF" stands for sacrifice flies.  This statistic has been recorded since 1955.  The full field names created by the parsers are "B_SF" for the number of sacrifice flies hit by the batter, and "P_SF" for the number of sacrifice flies given up by the pitcher.
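
The naming convention above can be applied mechanically. The following is a minimal sketch, not part of the parsers: `describe_field` is a hypothetical helper that looks up the prefix and suffix of a field name in the two tables above and returns `None` for any part that is not listed.

```python
# Prefix and suffix tables copied from the overview above.
PREFIXES = {'B': 'batter', 'P': 'pitcher'}
SUFFIXES = {
    'CT': 'count (integer)',
    'ID': 'identifier',
    'FL': 'boolean flag',
    'CD': 'code (enumerated data type)',
    'DT': 'date',
    'DY': 'day of week',
    'TM': 'time',
}

def describe_field(name):
    """Return (prefix meaning, suffix meaning) for a parser field name.

    Hypothetical helper for illustration; either element is None
    when the field name has no recognized prefix or suffix.
    """
    parts = name.upper().split('_')
    if len(parts) < 2:
        return None, None
    return PREFIXES.get(parts[0]), SUFFIXES.get(parts[-1])

print(describe_field('B_SF'))            # batter stat, no typed suffix
print(describe_field('attend_park_ct'))  # no prefix, count suffix
```

Fields such as "B_SF" carry only a prefix, because the middle part (the baseball abbreviation) is also the last part.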

## Data Verification

For odd data, such as the first game of a double header being played in one stadium and the second game in a different stadium, [Baseball-Reference](https://www.baseball-reference.com) is helpful.

Baseball-reference uses the data from Retrosheet, and presents it in an easy to read form for people. Although baseball-reference on rare occasion will incorrectly interpret the event data, it is nevertheless a useful tool to verify the data processing used here.

Baseball-reference does not offer already parsed data for data analysis.

The following method takes a game_id and converts it to a baseball-reference url for researching more about a particular game.

In [3]:
from IPython.display import HTML, display
def game_id_to_url(game_id):
    home = game_id[:3]
    url = 'https://www.baseball-reference.com/boxes/' + home + '/' + game_id + '.shtml'
    display(HTML(f'<a href="{url}">{game_id}</a>'))

In [4]:
# Click on the generated link to get a url for detailed game information.
game_id_to_url('NYA200806271')

As per the above link, the first game of the double header was in Yankee Stadium and the second game, on the same day, was in Shea Stadium.

In [5]:
# read in the parsed player_game data
os.chdir(df_csv)
player_game = pd.read_csv('player_game.csv.gz', parse_dates=['game_dt', 'appear_dt'])

In [6]:
player_game.shape

(3549700, 52)

In [7]:
player_game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
0,BAL195504120,1955-04-12,0,1955-04-12,BOS,goodb101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0
1,BAL195504120,1955-04-12,0,1955-04-12,BOS,joose101,1,5,4,0,...,0,0,0,0,0,0,0,0,0,0
2,BAL195504120,1955-04-12,0,1955-04-12,BOS,throf101,1,5,5,1,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# the primary key is (game_id, PLAYER_ID), verify no dups
dups = player_game.duplicated(subset=['game_id', 'player_id'], keep=False)
player_game[dups]

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id,player_id,b_g,b_pa,b_ab,b_r,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
3418636,BOS201708250,2017-08-25,0,2017-08-25,BOS,younc004,1,3,3,0,...,0,0,0,0,0,0,0,0,0,0
3418638,BOS201708250,2017-08-25,0,2017-08-25,BOS,younc004,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# there are dups!
dup_id = player_game.loc[dups, 'game_id'].values[0]
dup_id

'BOS201708250'

In [10]:
# check this game manually by clicking on the generated link
game_id_to_url(dup_id)

### Data Correction for Duplicates

The box score at the above link shows 2 entries for Young for the same game, one as a pinch-hitter and one as the designated-hitter.  It would appear that both entries are correct and that the data should be summed.

In [11]:
# get the index labels of the duplicated rows
idx1, idx2 = player_game[dups].index.values
idx1, idx2

(3418636, 3418638)

In [12]:
# identifier columns
id_columns = player_game.columns[:5]
id_columns

Index(['game_id', 'game_dt', 'game_ct', 'appear_dt', 'team_id'], dtype='object')

In [13]:
# stat columns
stat_columns = player_game.columns[5:]
stat_columns

Index(['player_id', 'b_g', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_2b', 'b_3b',
       'b_hr', 'b_rbi', 'b_bb', 'b_ibb', 'b_so', 'b_gdp', 'b_hp', 'b_sh',
       'b_sf', 'b_sb', 'b_cs', 'b_xi', 'p_g', 'p_gs', 'p_cg', 'p_sho', 'p_gf',
       'p_w', 'p_l', 'p_sv', 'p_out', 'p_tbf', 'p_ab', 'p_r', 'p_er', 'p_h',
       'p_2b', 'p_3b', 'p_hr', 'p_bb', 'p_ibb', 'p_so', 'p_gdp', 'p_hp',
       'p_sh', 'p_sf', 'p_xi', 'p_wp', 'p_bk'],
      dtype='object')

In [14]:
# id columns match (as per df.duplicated() above)
player_game.loc[[idx1,idx2], id_columns]

Unnamed: 0,game_id,game_dt,game_ct,appear_dt,team_id
3418636,BOS201708250,2017-08-25,0,2017-08-25,BOS
3418638,BOS201708250,2017-08-25,0,2017-08-25,BOS


In [15]:
# game data
player_game.loc[[idx1,idx2], stat_columns]

Unnamed: 0,player_id,b_g,b_pa,b_ab,b_r,b_h,b_2b,b_3b,b_hr,b_rbi,...,p_bb,p_ibb,p_so,p_gdp,p_hp,p_sh,p_sf,p_xi,p_wp,p_bk
3418636,younc004,1,3,3,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3418638,younc004,1,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
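# add stats for the two rows; player_id is a string, so exclude it from the sum
# (note: b_g, games played, is summed too and becomes 2 for this game)
sum_columns = stat_columns.drop('player_id')
player_game.loc[idx1, sum_columns] += player_game.loc[idx2, sum_columns]

# remove duplicate row
player_game = player_game.drop(idx2)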
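
If more than one duplicated key ever showed up, fixing each pair by index label would get tedious. A possible generalization, sketched here on a small made-up frame (the `toy` data and column subset are assumptions for illustration), is to group on the identifying columns and sum the remaining numeric columns.

```python
import pandas as pd

# Made-up rows: player b001 appears twice in the same game.
toy = pd.DataFrame({
    'game_id':   ['G1', 'G1', 'G1'],
    'player_id': ['a001', 'b001', 'b001'],
    'team_id':   ['BOS', 'BOS', 'BOS'],
    'b_ab':      [4, 3, 1],
    'b_h':       [1, 1, 1],
})

# Group on every identifying column so only numeric stats are summed.
key = ['game_id', 'player_id', 'team_id']
merged = toy.groupby(key, as_index=False, sort=False).sum()

print(merged)
```

This sums every non-key column, so per-game counters such as games played would also be added together and may need to be reset afterwards. On the full 3.5 million row table a groupby is also much slower than correcting the single known pair.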

In [17]:
# is_unique method for multiple columns
# faster than using groupby
def is_unique(df, cols):
    return not (df.duplicated(subset=cols)).any()

In [18]:
# the primary key is (GAME_ID, PLAYER_ID), verify no dups
is_unique(player_game, ['game_id', 'player_id'])

True

### Remove Duplicate Column

In [19]:
(player_game['game_dt'] == player_game['appear_dt']).all()

True

In [20]:
player_game = player_game.drop('appear_dt', axis=1)

### Optimizing Pandas Data Types for each Variable

Using the smallest data type that can represent the data offers several advantages:
1. Reduced memory
2. Possibly increased performance
3. Provides information to both the data analyst and to other software libraries, about the variable.

This means using small integers, as appropriate.  
This means using categories, as appropriate.

A category should be used if there is a relatively small number of unique string values.  In other languages, a "category" is called a "factor" or an "enumerated data type".

The above is true for the analytical processing of data.

For datasets being constantly updated, unless the range of each variable is known in advance, using the smallest data type could create problems when new data is added.
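
The effect of these choices can be seen on a small synthetic frame. This is only a sketch: the column names mimic the real data, but the values are random and the exact savings will differ from the real table.

```python
import numpy as np
import pandas as pd

# Synthetic data: a low-cardinality string column and a small-integer column.
n = 100_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'team_id': rng.choice(['BOS', 'NYA', 'BAL'], size=n),
    'b_h': rng.integers(0, 6, size=n),
})
before = toy.memory_usage(deep=True).sum()

small = toy.copy()
# few unique strings -> category
small['team_id'] = small['team_id'].astype('category')
# non-negative small integers -> smallest unsigned integer type
small['b_h'] = pd.to_numeric(small['b_h'], downcast='unsigned')
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f'{before / after:.1f}x smaller')
```

The values themselves are unchanged; only their in-memory representation is.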

In [21]:
def mem_usage(df):
    mem = df.memory_usage(deep=True).sum()
    mem = mem / 2 ** 20 # convert to megabytes
    return f'{mem:03.2f} MB'

In [22]:
# About 2GB
mem_usage(player_game)

'1983.76 MB'

In [23]:
# data types by count
player_game.dtypes.value_counts()

int64             47
object             3
datetime64[ns]     1
dtype: int64

In [24]:
# Fraction of string values that are unique
player_game_obj = player_game.select_dtypes(include=['object'])
player_game_obj.nunique() / player_game_obj.shape[0]

game_id      0.036579
team_id      0.000012
player_id    0.003146
dtype: float64

In [25]:
# this optimization is good for player_game and game
def optimize_data_types(df):
    df = df.copy()
    
    # int64 -> smallest uint allowed by data
    df_int = df.select_dtypes(include=['int'])
    df_int = df_int.apply(pd.to_numeric,downcast='unsigned')
    df[df_int.columns] = df_int

    # object -> category
    df_obj = df.select_dtypes(include=['object'])
    df_obj = df_obj.astype('category')
    df[df_obj.columns] = df_obj
    
    return df

In [26]:
player_game = optimize_data_types(player_game)

In [27]:
# data types by count
player_game.dtypes.value_counts()

uint8             47
category           1
category           1
category           1
datetime64[ns]     1
dtype: int64

In [28]:
# about 8 times less memory is now being used
mem_usage(player_game)

'251.52 MB'

In [30]:
os.chdir(wrangled)
%time player_game.to_csv('player_game.csv.gz', compression='gzip', index=False)

CPU times: user 4min, sys: 50.7 ms, total: 4min
Wall time: 4min


#### To Read Back Use:
```
player_game = pd.read_csv('player_game.csv.gz', parse_dates=['game_dt'])
player_game = optimize_data_types(player_game)
```

## 2. Scrape Data for Player_Game Data Dictionary
As of February 2019, I could find no published information on cwdaily.

cwdaily can be run with the '-n' flag to have it output field names, but it is not clear what some of the field names mean.

Luckily, the source code itself has a text description of each field name.  This description takes place within a single, very long, C statement.

The source code will be scraped to retrieve a field name to field description mapping.

In [31]:
# cd to dir with cwdaily.c
src = retrosheet.joinpath('src')
os.chdir(src)

In [32]:
def parse_c_source(filename, struct='field_data'):
    """Extract field name to field description from parser's C source code"""
    dd = {}
    with open(filename, 'r') as cwdaily:
        # to account for patterns across lines, read entire source code
        source = cwdaily.read()
    
        # get the single (multiline) C statement that has field descriptions
        pattern = r'(static\s+field_struct\s+' + struct + r'.*?;)'
        match = re.search(pattern, source, flags=re.DOTALL | re.MULTILINE)
    
        if match:
            pattern = r'{.*?"(.*?)".*?"(.*?)".*?}'
            for m in re.finditer(pattern, match.group(1), 
                                 flags=re.DOTALL | re.MULTILINE):
                if m:
                    if len(m.group(2).split(':')) == 2:
                        desc = m.group(2).split(':')[1].strip()
                    else:
                        desc = m.group(2).strip()
                    dd[m.group(1).lower()] = desc   
    return dd
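
The two regular expressions can be tried out on a small string before pointing them at the real file. The C snippet below is made up for illustration and is not copied from the Chadwick source; `parse_fields` is a hypothetical variant of the function above that takes the source text directly instead of a filename.

```python
import re

# Made-up C snippet in the general shape the regexes expect.
C_SNIPPET = '''
static field_struct field_data[] = {
  { example_game_id, "GAME_ID", "0: game id" },
  { example_b_sf, "B_SF", "20: sacrifice flies" }
};
'''

def parse_fields(source, struct='field_data'):
    """Map lower-cased field names to descriptions found in C source text."""
    dd = {}
    # 1st regex: the single multi-line statement, up to the first semicolon
    block = re.search(r'static\s+field_struct\s+' + struct + r'.*?;',
                      source, flags=re.DOTALL)
    if block is None:
        return dd
    # 2nd regex: each brace group holding two double-quoted strings
    for m in re.finditer(r'{.*?"(.*?)".*?"(.*?)".*?}', block.group(0),
                         flags=re.DOTALL):
        name, text = m.group(1), m.group(2)
        parts = text.split(':')
        # drop a leading "number:" label when exactly one colon is present
        desc = parts[1].strip() if len(parts) == 2 else text.strip()
        dd[name.lower()] = desc
    return dd

print(parse_fields(C_SNIPPET))
```

Because the first pattern stops at the first semicolon, a description containing a semicolon would cut the statement short; that is a limitation of this approach rather than something observed in the data here.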

In [33]:
player_game_fields_all = parse_c_source('cwdaily.c')        

In [34]:
# As of Python 3.6, dictionaries maintain insertion order
# Only the first 52 fields were selected, so that's all that is needed here
player_game_fields = {key:value for num, 
        (key, value) in enumerate(player_game_fields_all.items()) if num < 52}

# appear_dt was removed from player_game above
del player_game_fields['appear_dt']

In [35]:
# here is the explanation of each field, as scraped from the C source code
player_game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'team_id': 'team id',
 'player_id': 'player id',
 'b_g': 'games played',
 'b_pa': 'plate appearances',
 'b_ab': 'at bats',
 'b_r': 'runs',
 'b_h': 'hits',
 'b_2b': 'doubles',
 'b_3b': 'triples',
 'b_hr': 'home runs',
 'b_rbi': 'runs batted in',
 'b_bb': 'walks',
 'b_ibb': 'intentional walks',
 'b_so': 'strikeouts',
 'b_gdp': 'grounded into DP',
 'b_hp': 'hit by pitch',
 'b_sh': 'sacrifice hits',
 'b_sf': 'sacrifice flies',
 'b_sb': 'stolen bases',
 'b_cs': 'caught stealing',
 'b_xi': 'reached on interference',
 'p_g': 'games pitched',
 'p_gs': 'games started',
 'p_cg': 'complete games',
 'p_sho': 'shutouts',
 'p_gf': 'games finished',
 'p_w': 'wins',
 'p_l': 'losses',
 'p_sv': 'saves',
 'p_out': 'outs recorded (innings pitched times 3)',
 'p_tbf': 'batters faced',
 'p_ab': 'at bats',
 'p_r': 'runs allowed',
 'p_er': 'earned runs allowed',
 'p_h': 'hits allowed',
 'p_2b': 'doubles allowed',
 ...

### Data Dictionary Notes
In the above, team_id is the team_id of the player.

game_id is:  
```
0:3  Home TEAM_ID  
3:11 YYYYMMDD  
11   Game Count
```

Game Count is:
* 0 for single game
* 1 for 1st game of double header
* 2 for 2nd game of double header
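
As a quick check of this layout, a 12-character id such as 'NYA200806271' (used earlier in this notebook) can be split with plain string slicing. `split_game_id` is a hypothetical helper, sketched under the assumption that every game_id has exactly this shape.

```python
from datetime import date, datetime

def split_game_id(game_id):
    """Split a 12-character game_id into (home team, date, game count)."""
    home_team = game_id[:3]                                        # e.g. 'NYA'
    game_date = datetime.strptime(game_id[3:11], '%Y%m%d').date()  # YYYYMMDD
    game_count = int(game_id[11])                                  # 0, 1 or 2
    return home_team, game_date, game_count

print(split_game_id('NYA200806271'))
```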

### Persist player_game_fields

In [36]:
os.chdir(wrangled)

# index=[0] is required for dictionary of scalar values
# no need to compress something this small
player_game_fields_df = pd.DataFrame(player_game_fields, index=[0])
player_game_fields_df.to_csv('player_game_fields.csv', index=False)
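
To turn the one-row csv back into a dictionary later, the first row can be read and converted. The sketch below uses an in-memory buffer and a two-entry dictionary as stand-ins for the real file and the real field list.

```python
import pandas as pd
from io import StringIO

# Stand-in for player_game_fields; the real dictionary has many more entries.
fields = {'game_id': 'game id', 'b_sf': 'sacrifice flies'}

# Write one row of descriptions, with field names as the header
# (a buffer is used here in place of a file on disk).
buffer = StringIO()
pd.DataFrame(fields, index=[0]).to_csv(buffer, index=False)

# Read it back: the header row gives the keys, the single data row the values.
buffer.seek(0)
restored = pd.read_csv(buffer).iloc[0].to_dict()

print(restored)
```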

## 3. Read Parsed Game Data

An analysis of the data shows that -1 and 0 are both used as null values for attendance and temperature.

In [38]:
os.chdir(df_csv)
game = pd.read_csv('game.csv.gz',
            na_values={'attend_park_ct':[-1,0],
                       'temp_park_ct':[-1,0]})

In [39]:
game.head(3)

Unnamed: 0,game_id,game_dt,game_ct,game_dy,start_game_tm,dh_fl,daynight_park_cd,away_team_id,home_team_id,park_id,...,away_hits_ct,home_hits_ct,away_err_ct,home_err_ct,away_lob_ct,home_lob_ct,win_pit_id,lose_pit_id,save_pit_id,gwrbi_bat_id
0,BAL195504120,19550412,0,Tuesday,0,F,D,BOS,BAL,BAL11,...,13,5,0,2,8,9,sullf101,colej101,,
1,BAL195504180,19550418,0,Monday,0,F,N,NYA,BAL,BAL11,...,8,3,0,1,5,4,fordw101,moorr101,,
2,BAL195504220,19550422,0,Friday,0,F,N,WS1,BAL,BAL11,...,4,8,2,1,6,11,mcdem102,wilsj104,schmj101,


In [40]:
# the primary key is (game_id), verify no dups
game['game_id'].is_unique

True

In [41]:
# these columns will not be used in the analysis
drop_columns = ['edit_record_ts',
                'wind_direction_park_cd',
                'wind_speed_park_ct',
                'field_park_cd',
                'precip_park_cd',
                'sky_park_cd',                
                'base1_ump_id', 
                'base2_ump_id', 
                'base3_ump_id', 
                'base4_ump_id',
                'scorer_record_id', 
                'inputter_record_id', 
                'lf_ump_id', 
                'rf_ump_id',
                'translator_record_id', 
                'input_record_ts', 
                'method_record_cd',
                'pitches_record_cd']

In [42]:
game = game.drop(drop_columns, axis=1)

In [43]:
game.dtypes.value_counts()

int64      13
object     13
float64     2
dtype: int64

### Reverse Engineer am/pm for start_game_tm

1. am/pm is not specified.
2. The time is not in 24-hour format
3. The time is an integer, not a string.  For example, 1259 means 12:59.
4. A value of zero means the game start time is unknown.
5. The daynight_park_cd is never missing.  This specifies whether the game took place in the "day" or at "night".
6. MLB domain knowledge: Some games may start late, due to a rain delay for example.  But games never start after midnight.
7. MLB domain knowledge: Some games may start early, to allow for travel to the next city.  But games never start before 9 am.

Given the above, am/pm can be deduced as follows:
* start_game_tm == 0 => use midnight (to represent unknown time)
* start_game_tm >= 1200 => pm
* start_game_tm < 900 => pm
* 900 <= start_game_tm < 1200, and day/night = day, => am
* 900 <= start_game_tm < 1200, and day/night = night, => pm

In [44]:
def parse_datetime(row):
    date = row['game_dt']
    time = row['start_game_tm']
    day_night = row['daynight_park_cd']
    
    if time > 0 and time < 900:
        time += 1200
    elif (900 <= time < 1200) and day_night == 'N':
        time += 1200

    time_str = f'{time//100:02d}:{time%100:02d}'
    datetime_str = str(date) + ' ' + time_str
    return pd.to_datetime(datetime_str, format='%Y%m%d %H:%M')

In [45]:
# create new datetime column
game['game_date'] = game.apply(parse_datetime, axis=1)
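
Row-wise `apply` calls the function once per game, which can be slow on larger frames. The same am/pm rules could also be written with vectorized operations. This is a sketch on made-up rows, assuming the same three columns and the same integer encoding of date and time; `start_datetime` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def start_datetime(df):
    """Vectorized version of the am/pm rules listed above."""
    time = df['start_game_tm'].astype('int64')
    night = df['daynight_park_cd'] == 'N'

    # pm when 0 < time < 900, or when 900 <= time < 1200 at night
    add_12h = ((time > 0) & (time < 900)) | ((time >= 900) & (time < 1200) & night)
    time24 = time + np.where(add_12h, 1200, 0)

    date = pd.to_datetime(df['game_dt'].astype(str), format='%Y%m%d')
    return (date
            + pd.to_timedelta(time24 // 100, unit='h')
            + pd.to_timedelta(time24 % 100, unit='m'))

# Made-up rows covering each rule; a time of 0 stays at midnight (unknown).
toy = pd.DataFrame({
    'game_dt': [19550412] * 5,
    'start_game_tm': [0, 130, 1005, 1005, 1235],
    'daynight_park_cd': ['D', 'D', 'D', 'N', 'D'],
})
result = start_datetime(toy)
print(result)
```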

### Optimize Data Types

Normally, if the percentage of unique string values is large, there is no advantage in converting 'object' to 'category'.  (A join might work faster between two category variables than two string variables though.)

Here, optimize_data_types() will be called and it will convert all object data types to categories.

In [46]:
df_obj = game.select_dtypes(include=['object'])
df_obj.nunique() / df_obj.shape[0]

game_id              1.000000
game_dy              0.000054
dh_fl                0.000015
daynight_park_cd     0.000015
away_team_id         0.000316
home_team_id         0.000316
park_id              0.000608
away_start_pit_id    0.026401
home_start_pit_id    0.026169
win_pit_id           0.034479
lose_pit_id          0.037175
save_pit_id          0.020509
gwrbi_bat_id         0.011175
dtype: float64

In [47]:
mem_usage(game)

'113.73 MB'

In [48]:
# optimize_data_types will
#  use smallest uint that can hold value
#  convert all objects to category
game = optimize_data_types(game)

In [49]:
# about 5 times less memory is being used
mem_usage(game)

'23.48 MB'

In [50]:
# a unique key is: (date, home_team, game_count)
is_unique(game, ['game_dt', 'home_team_id', 'game_ct'])

True

In [51]:
# game_id is a string concatenation of the above 3 fields, so it is also unique
game['game_id'].is_unique

True

In [52]:
# np.float was removed from NumPy; the string 'float' selects the same columns
game_float = game.select_dtypes(include=['float'])
game_float.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
attend_park_ct,124836.0,24810.611258,12761.524179,306.0,14327.75,23709.0,34748.0,80227.0
temp_park_ct,80653.0,72.922074,10.920313,12.0,67.0,73.0,80.0,109.0


In [53]:
def is_all_int(s):
    """Returns True if all non-null values are integers"""
    notnull = s.notnull()
    is_integer = s.apply(lambda x: (x%1 == 0.0))
    return (notnull == is_integer).all()

In [54]:
# attendance and temperature are always recorded as integers
is_all_int(game['attend_park_ct'])

True

In [55]:
is_all_int(game['temp_park_ct'])

True

### Attendance and Temperature: Use fillna(), Leave as Float, Interpolate, Other

There are several ways to deal with missing values.

**fillna() with impossible integer value**  
Pro: allows column to be represented as an integer in both Pandas and the database.  
Con: mean() and other operations may inadvertently use the impossible value.

If this technique is chosen, an additional (boolean) column such as 'is_attendance_null', could be created to make analysis easier.

**Leave as float**  
Pro: mean() and other operations skip na values by default.  This is the expected behavior.  
Con: requires more storage in Pandas and the database.  
Con: data analyst or software library using this column, may think the variable can have non-integer values.

**Interpolate**  
Values could be interpolated (or predicted using machine learning) from values on either side of the missing value.

**Semantics**  
Attendance must be an integer.

Temperature is not an integer.  Rather, to date, it has been recorded to the nearest integer value. This could change in the future.

**Use Database Representation Different from Pandas**  
A database can have an integer column with null values; a NumPy-backed integer column in Pandas cannot.  (Pandas 0.24 added a nullable 'Int64' extension dtype, which is one alternative.)  One way around this is to write the values to a float column in the database, then convert that column type to integer.  However this makes it difficult for Pandas to append new null values to that column.

**Decision**  
There is no obvious best answer.  For this notebook, the fields will be left as float for easy use with Pandas.
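
For comparison, here is a minimal sketch of a further option not used in this notebook: the nullable 'Int64' extension dtype, which assumes pandas 0.24 or later. The attendance values are made up.

```python
import numpy as np
import pandas as pd

# Made-up attendance figures with one missing value, stored as float.
attendance = pd.Series([24810.0, np.nan, 306.0])

# Capital-I 'Int64' is the nullable integer dtype; the missing value is kept.
attendance_int = attendance.astype('Int64')

print(attendance_int.dtype)
print(attendance_int.isna().sum())   # the missing value is still missing
print(attendance_int.mean())         # missing values are skipped
```

Whether other libraries and the database driver handle this dtype well would need to be checked before relying on it.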

In [56]:
os.chdir(wrangled)
%time game.to_csv('game.csv.gz', compression='gzip', index=False)

CPU times: user 3.59 s, sys: 16.1 ms, total: 3.61 s
Wall time: 3.6 s


## 4. Scrape Data for Game Data Dictionary

There is a field-name to field-description mapping provided on the following web page:  
http://chadwick.sourceforge.net/doc/cwgame.html

This data could be scraped from the web page, but since a parser for the C source code was already written above, it is simpler to reuse it.

Note: the codes for some of the \_CD fields are only specified on the above web page, but the \_CD fields are not being used in this study.

In [57]:
os.chdir(src)
game_reg_fields = parse_c_source('cwgame.c')
game_ext_fields = parse_c_source('cwgame.c', 'ext_field_data')           

In [58]:
# there are 84 regular fields and 95 extended fields
len(game_reg_fields), len(game_ext_fields)

(84, 95)

#### Data Dictionary Note
dh_fl: Designated Hitter Flag, 'T' if DH in use, else 'F'  
daynight_park_cd: 'N' for night, 'D' for day  
gwrbi_bat_id: Player ID for batter who got Game Winning RBI  

In [59]:
# As of Python 3.6, dictionaries maintain insertion order
game_fields = {key:value for num, 
    (key, value) in enumerate(game_reg_fields.items()) if num < 46}

for key in drop_columns:
    del game_fields[key]

game_fields

{'game_id': 'game id',
 'game_dt': 'date',
 'game_ct': 'game number (0 = no double header)',
 'game_dy': 'day of week',
 'start_game_tm': 'start time',
 'dh_fl': 'DH used flag',
 'daynight_park_cd': 'day/night flag',
 'away_team_id': 'visiting team',
 'home_team_id': 'home team',
 'park_id': 'game site',
 'away_start_pit_id': 'vis. starting pitcher',
 'home_start_pit_id': 'home starting pitcher',
 'attend_park_ct': 'attendance',
 'temp_park_ct': 'temperature',
 'minutes_game_ct': 'time of game',
 'inn_ct': 'number of innings',
 'away_score_ct': 'visitor final score',
 'home_score_ct': 'home final score',
 'away_hits_ct': 'visitor hits',
 'home_hits_ct': 'home hits',
 'away_err_ct': 'visitor errors',
 'home_err_ct': 'home errors',
 'away_lob_ct': 'visitor left on base',
 'home_lob_ct': 'home left on base',
 'win_pit_id': 'winning pitcher',
 'lose_pit_id': 'losing pitcher',
 'save_pit_id': 'save for',
 'gwrbi_bat_id': 'GW RBI'}

#### Persist game_fields

In [60]:
os.chdir(wrangled)

# index=[0] is required for dictionary of scalar values
game_fields_df = pd.DataFrame(game_fields, index=[0])
game_fields_df.to_csv('game_fields.csv', index=False)

## 5. Scrape Data for Player Lookup Table

There is no separate file for this.  It will be scraped from a web page.

In [61]:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

In [62]:
# get the web page
r = requests.get("https://www.retrosheet.org/retroID.htm")
soup = BeautifulSoup(r.content, 'lxml')

# data is within the pre tag
table_txt = soup.pre.string

# remove unnecessary double quotes
table_txt = table_txt.replace('"','')

In [63]:
# read from this string instead of file
players = pd.read_csv(StringIO(table_txt), 
    parse_dates=['Play debut', 'Mgr debut', 'Ump debut'])

In [64]:
# Coach debut has some bad values
def parse_date(value):
    if pd.isna(value) or value == '43188' or int(value[-4:]) < 1800:
        return pd.NaT
    else:
        # pd.datetime was deprecated and later removed from pandas;
        # pd.to_datetime with an explicit format is used instead
        return pd.to_datetime(value, format='%m/%d/%Y')
players['Coach debut'] = players['Coach debut'].apply(parse_date)
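
The same cleanup could also be done without a per-value function. This is a sketch on made-up values, assuming the bad entries are either not in month/day/year form or have a year before 1800: unparseable strings are coerced to NaT, and implausibly early dates are masked out afterwards.

```python
import pandas as pd

# Made-up sample: a good date, a stray number, a missing value, an early date.
raw = pd.Series(['04/06/1979', '43188', None, '01/01/1700'])

# Anything that does not match the format becomes NaT.
debut = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Keep only dates from 1800 onwards; everything else becomes NaT.
debut = debut.where(debut.dt.year >= 1800)

print(debut)
```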

In [65]:
players.head()

Unnamed: 0,ID,Last,First,Play debut,Mgr debut,Coach debut,Ump debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


In [66]:
name_chg = {'ID':'player_id',
         'Last':'last',
         'First':'first',
         'Play debut':'player_debut',
         'Mgr debut':'mgr_debut',
         'Coach debut': 'coach_debut',
         'Ump debut':'ump_debut'}
players = players.rename(columns=name_chg)
players.head()

Unnamed: 0,player_id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,2004-04-06,NaT,NaT,NaT
1,aaroh101,Aaron,Hank,1954-04-13,NaT,NaT,NaT
2,aarot101,Aaron,Tommie,1962-04-10,NaT,1979-04-06,NaT
3,aased001,Aase,Don,1977-07-26,NaT,NaT,NaT
4,abada001,Abad,Andy,2001-09-10,NaT,NaT,NaT


#### Persist Players

In [67]:
os.chdir(wrangled)
players.to_csv('players.csv', index=False)

## 6. Scrape Data for Stadium Lookup Table
There is no separate file for this.  It will be scraped from a web page.

In [68]:
# get the web page (this is not html!)
r = requests.get("https://www.retrosheet.org/parkcode.txt")

table_txt = r.content.decode("utf-8")

# read from this string instead of file
parks = pd.read_csv(StringIO(table_txt), parse_dates=['START', 'END'])

In [69]:
parks.columns = parks.columns.str.lower()
parks.head()

Unnamed: 0,parkid,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,1880-09-11,1882-05-30,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,1884-04-30,1884-05-31,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,1966-04-19,NaT,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,1972-04-21,1993-10-03,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,1994-04-11,NaT,AL,


#### Persist Stadiums

In [70]:
os.chdir(wrangled)
parks.to_csv('parks.csv', index=False)