This notebook is dedicated to parsing the play-by-play (PBP) data from Retrosheet. Doing so lets us gather data on a per-player basis, giving us many more features in the hope of deriving more meaningful results from the models we train. The goal of this parser is to produce a box-score-like DataFrame for each game from 2014 to 2019. There will be 18 batters (9 per team) for both NL and AL games. For the sake of simplicity, pitching stats are not extracted from the PBP data for now, but this can easily be changed in the future. Let's walk through how we'll parse this data.

We will go through each play of each game and update each player's statistics after each play.

The fields of each player will be:

- (Home/Visitor) Player i:
    - player id
    - hits
    - singles
    - doubles
    - triples
    - home runs
    - walks
    - bunts
    - sacrifice hits
    - sacrifice flies
    - RBIs
    - at bats
    - number of stolen bases
    - number of times caught stealing
    - number of times picked off
    - number of errors
    
Therefore each player has 16 fields associated with them, and since there will be 18 batters, there will be a total of 288 columns of batter data. An example of one of these columns is "Visiting Player i Errors", where "i" is the ith player on the visiting team (determined by the "batting team" flag in the event file) and "Errors" is the number of errors that player committed over the course of a single game.

In addition to the 288 batter data columns, there will be one column for the game ID, which will be used to get the stats on a per-game basis.
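
A minimal sketch of how the wide column labels described above could be generated. The names `FIELDS`, `N_BATTERS`, and `make_batter_columns` are hypothetical and not part of this notebook, and the "Home Player i ..." pattern is assumed by analogy with the visiting-team example. Only three fields are used to keep the output short.

```python
# Hypothetical helper: builds box-score-style column labels.
FIELDS = ["ID", "Hits", "Errors"]  # shortened for illustration
N_BATTERS = 9                      # lineup slots per team

def make_batter_columns(fields=FIELDS, n_batters=N_BATTERS):
    """Return 'Game ID' followed by one label per (team, lineup slot, field)."""
    columns = ["Game ID"]
    for team in ("Visiting", "Home"):
        for i in range(1, n_batters + 1):
            for field in fields:
                columns.append(f"{team} Player {i} {field}")
    return columns

columns = make_batter_columns()
print(len(columns))   # 2 teams * 9 batters * 3 fields, plus the game ID column
print(columns[:4])
```

Passing the full field list instead of the shortened one gives the complete set of batter columns plus the game ID.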

In [2]:
import pandas as pd

column_labels = ["game id",
"visiting team",
"inning",
"batting team",
"outs",
"balls",
"strikes",
"pitch sequence",
"vis score",
"home score",
"batter",
"batter hand",
"res batter",
"res batter hand",
"pitcher",
"pitcher hand",
"res pitcher",
"res pitcher hand",
"catcher",
"first base",
"second base",
"third base",
"shortstop",
"left field",
"center field",
"right field",
"first runner",
"second runner",
"third runner",
"event text",
"leadoff flag",
"pinchhit flag",
"defensive position",
"lineup position",
"event type",
"batter event flag",
"ab flag",
"hit value",
"SH flag",
"SF flag",
"outs on play",
"double play flag",
"triple play flag",
"RBI on play",
"wild pitch flag",
"passed ball flag",
"fielded by",
"batted ball type",
"bunt flag",
"foul flag",
"hit location",
"num errors",
"1st error player",
"1st error type",
"2nd error player",
"2nd error type",
"3rd error player",
"3rd error type",
"batter dest (5 if scores and unearned, 6 if team unearned)",
"runner on 1st dest (5 if scores and unearned, 6 if team unearned)",
"runner on 2nd dest (5 if scores and unearned, 6 if team unearned)",
"runner on 3rd dest (5 if socres and uneanred, 6 if team unearned)",
"play on batter",
"play on runner on 1st",
"play on runner on 2nd",
"play on runner on 3rd",
"SB for runner on 1st flag",
"SB for runner on 2nd flag",
"SB for runner on 3rd flag",
"CS for runner on 1st flag",
"CS for runner on 2nd flag",
"CS for runner on 3rd flag",
"PO for runner on 1st flag",
"PO for runner on 2nd flag",
"PO for runner on 3rd flag",
"Responsible pitcher for runner on 1st",
"Responsible pitcher for runner on 2nd",
"Responsible pitcher for runner on 3rd",
"New Game Flag",
"End Game Flag",
"Pinch-runner on 1st",
"Pinch-runner on 2nd",
"Pinch-runner on 3rd",
"Runner removed for pinch-runner on 1st",
"Runner removed for pinch-runner on 2nd",
"Runner removed for pinch-runner on 3rd",
"Batter removed for pinch-hitter",
"Position of batter removed for pinch-hitter",
"Fielder with First Putout (0 if none)",
"Fielder with Second Putout (0 if none)",
"Fielder with Third Putout (0 if none)",
"Fielder with First Assist (0 if none)",
"Fielder with Second Assist (0 if none)",
"Fielder with Third Assist (0 if none)",
"Fielder with Fourth Assist (0 if none)",
"Fielder with Fifth Assist (0 if none)",
"event num",]

mets_2014 = pd.read_csv('datasets/retro_sheets_pbp_filtered/2014NYN.EVN', names=column_labels)

mets_2014.head()

Unnamed: 0,game id,visiting team,inning,batting team,outs,balls,strikes,pitch sequence,vis score,home score,...,Position of batter removed for pinch-hitter,Fielder with First Putout (0 if none),Fielder with Second Putout (0 if none),Fielder with Third Putout (0 if none),Fielder with First Assist (0 if none),Fielder with Second Assist (0 if none),Fielder with Third Assist (0 if none),Fielder with Fourth Assist (0 if none),Fielder with Fifth Assist (0 if none),event num
0,NYN201403310,WAS,1,0,0,2,2,BCSFBFFX,0,0,...,0,5,0,0,0,0,0,0,0,1
1,NYN201403310,WAS,1,0,1,1,1,BFX,0,0,...,0,0,0,0,0,0,0,0,0,2
2,NYN201403310,WAS,1,0,1,2,2,BBCC1S,0,0,...,0,2,0,0,0,0,0,0,0,3
3,NYN201403310,WAS,1,0,2,3,0,BBBX,0,0,...,0,3,0,0,5,0,0,0,0,4
4,NYN201403310,WAS,1,1,0,2,2,CCBBS,0,0,...,0,3,0,0,2,0,0,0,0,5


In [2]:
mets_2014.shape

(6263, 34)

In [89]:
# Let's start by getting Starting Lineups
import os

gl_path = 'datasets/retro_sheet_gls/'
sl_path = 'datasets/starting_lineups/'
ignore = ['.ipynb_checkpoints']

# create the output directory if it doesn't exist yet
os.makedirs(sl_path, exist_ok=True)

for file in os.listdir(gl_path):
    if file not in ignore:
        gl = pd.read_csv(gl_path + file)
        gl['Game ID'] = gl['Home Team'].str.cat(
                gl['Date'].astype('string').str.cat(
                gl['Number of game'].astype('string')))
        starting_lineups = gl.filter(items=['Game ID'] + 
                                     [f'Visiting Team Player {i+1} ID' for i in range(9)] + 
                                     [f'Home Team Player {i+1} ID' for i in range(9)],
                                    axis=1)
        starting_lineups.to_csv(sl_path + 'SL' + file[2:6] + '.csv')
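
The Game ID built in the cell above is the home team code, the date, and the game number joined into one string, which matches IDs such as `NYN201403310` in the event files. Here is a small sketch of that construction on a toy game log; the two rows are reduced to the three columns involved, and plain string addition is used in place of `str.cat` purely for illustration.

```python
import pandas as pd

# Toy game-log rows; only the three columns used to build the ID are included.
gl = pd.DataFrame({
    "Home Team": ["NYN", "ARI"],
    "Date": [20140331, 20140322],
    "Number of game": [0, 0],
})

# Home team code + date (YYYYMMDD) + game number, all as strings.
gl["Game ID"] = (
    gl["Home Team"]
    + gl["Date"].astype(str)
    + gl["Number of game"].astype(str)
)

print(gl["Game ID"].tolist())
```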

In [5]:
# get all play-by-play data
pbp_data_path = 'datasets/retro_sheet_pbp_new/'
pbp_data = []
for file in os.listdir(pbp_data_path):
    # keep only Retrosheet event files (.EVN for NL home teams, .EVA for AL)
    if file.endswith(('.EVN', '.EVA')):
        pbp_data.append(pd.read_csv(pbp_data_path + file, names=column_labels))
pbp_data = pd.concat(pbp_data, ignore_index=True)

pbp_data

Unnamed: 0,Game ID,Visiting Team,Batting Team,Batter,First Runner,Second Runner,Third Runner,Event Type,Batter Event Flag,AB Flag,...,SB For Runner On 1st Flag,SB For Runner On 2nd Flag,SB For Runner On 3rd Flag,CS For Runner On 1st Flag,CS For Runner On 2nd Flag,CS For Runner On 3rd Flag,PO For Runner On 1st Flag,PO For Runner On 2nd Flag,PO For Runner On 3rd Flag,Event Num
0,SFN201904050,TBA,0,meada001,,,,2,T,T,...,F,F,F,F,F,F,F,F,F,1
1,SFN201904050,TBA,0,phamt001,,,,2,T,T,...,F,F,F,F,F,F,F,F,F,2
2,SFN201904050,TBA,0,choij001,,,,21,T,T,...,F,F,F,F,F,F,F,F,F,3
3,SFN201904050,TBA,0,loweb001,,choij001,,21,T,T,...,F,F,F,F,F,F,F,F,F,4
4,SFN201904050,TBA,0,diazy001,,loweb001,,23,T,T,...,F,F,F,F,F,F,F,F,F,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1144617,NYN201510040,WAS,1,murpd006,wrigd002,,,2,T,T,...,F,F,F,F,F,F,F,F,F,58
1144618,NYN201510040,WAS,0,renda001,,,,2,T,T,...,F,F,F,F,F,F,F,F,F,59
1144619,NYN201510040,WAS,0,turnt001,,,,3,T,T,...,F,F,F,F,F,F,F,F,F,60
1144620,NYN201510040,WAS,0,harpb003,,,,21,T,T,...,F,F,F,F,F,F,F,F,F,61


In [6]:
# get all starting lineups
sl_path = 'datasets/starting_lineups/'

sl = []

for year in range(2014, 2020):
    sl.append(pd.read_csv(f'{sl_path}SL{year}.csv'))

# the index saved by to_csv comes back as 'Unnamed: 0', so drop it
starting_lineups = pd.concat(sl, ignore_index=True).drop(columns='Unnamed: 0')

starting_lineups

Unnamed: 0,Game ID,Visiting Team Player 1 ID,Visiting Team Player 2 ID,Visiting Team Player 3 ID,Visiting Team Player 4 ID,Visiting Team Player 5 ID,Visiting Team Player 6 ID,Visiting Team Player 7 ID,Visiting Team Player 8 ID,Visiting Team Player 9 ID,Home Team Player 1 ID,Home Team Player 2 ID,Home Team Player 3 ID,Home Team Player 4 ID,Home Team Player 5 ID,Home Team Player 6 ID,Home Team Player 7 ID,Home Team Player 8 ID,Home Team Player 9 ID
0,ARI201403220,puigy001,turnj001,ramih003,gonza003,vanss001,uribj002,ethia001,ellia001,kersc001,polla001,hilla001,goldp001,pradm001,trumm001,montm001,owinc001,parrg001,milew001
1,ARI201403230,gordd002,puigy001,ramih003,gonza003,ethia001,ellia001,baxtm001,uribj002,ryu-h001,polla001,hilla001,goldp001,pradm001,montm001,trumm001,parrg001,gregd001,cahit001
2,SDN201403300,crawc002,puigy001,ramih003,gonza003,ethia001,uribj002,ellia001,gordd002,ryu-h001,cabre001,denoc001,headc001,gyorj001,alony001,medit001,venaw001,river003,casha001
3,ANA201403310,almoa001,millb002,canor001,smoaj001,morrl001,seagk001,saunm001,ackld001,zunim001,calhk001,troum001,pujoa001,hamij003,freed001,ibanr001,kendh001,iannc001,aybae001
4,BAL201403310,navad002,pedrd001,ortid001,napom001,carpm001,sizeg001,bogax001,piera001,middw001,markn001,hardj003,jonea003,davic003,cruzn002,wietm001,yound003,flahr001,schoj001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14572,CHA201909290,reyev001,mercj002,cabrm001,hickj001,rodrr009,stewc002,demet001,greig001,castw003,sancc001,andet001,abrej003,moncy001,jimee001,collz001,castw002,palkd001,engea001
14573,KCA201909290,wadel001,polaj001,sanom001,cronc002,cavej001,schoj001,castj006,torrr001,milli001,merrw001,solej001,dozih001,gorda001,mcbrr001,cuthc001,mejie001,arteh001,dinin001
14574,SEA201909290,semim001,profj001,piscs001,davik003,brows003,phegj001,neuss001,barrf001,bolts001,longs001,crawj002,nolaa002,seagk001,lewik001,narvo001,voged001,smitm007,gordd002
14575,TEX201909290,lemad001,judga001,gardb001,stanm004,torrg001,sancg002,gregd001,urshg001,maybc001,choos001,andre001,calhw001,santd001,odorr001,solan001,guzmr001,deshd002,trevj001


In [87]:
# calculate a batter's stats for a single game
def calculate_stats(data, player_id):
    stats = dict()
    # calculate batting stats
    h, s, d, t, hr, w = 0, 0, 0, 0, 0, 0
    for event in data[data['Batter'] == player_id]['Event Type']:
        if event == 20:
            s, h = s+1, h+1
        elif event == 21:
            d, h = d+1, h+1
        elif event == 22:
            t, h = t+1, h+1
        elif event == 23:
            hr, h = hr+1, h+1
        elif event == 14 or event == 15:
            w += 1
    # calculate at-bats, sh, sf, RBIs and bunts
    tf = ['AB Flag', 'SH Flag', 'SF Flag', 'Bunt Flag']
    tf_val = [0,0,0,0]
    for label_index in range(len(tf)):
        tf_val[label_index] += data[(data['Batter'] == player_id) & (data[tf[label_index]] == 'T')].shape[0]
    ab, sh, sf, b = tf_val
    rbi = sum(data[data['Batter'] == player_id]['RBI On Play'])
    # calculate base running stats
    sb, cs, po = 0, 0, 0
    sb += data[(player_id == data['First Runner']) & ('T' == data['SB For Runner On 1st Flag'])].shape[0]
    cs += data[(player_id == data['First Runner']) & ('T' == data['CS For Runner On 1st Flag'])].shape[0]
    po += data[(player_id == data['First Runner']) & ('T' == data['PO For Runner On 1st Flag'])].shape[0]

    sb += data[(player_id == data['Second Runner']) & ('T' == data['SB For Runner On 2nd Flag'])].shape[0]
    cs += data[(player_id == data['Second Runner']) & ('T' == data['CS For Runner On 2nd Flag'])].shape[0]
    po += data[(player_id == data['Second Runner']) & ('T' == data['PO For Runner On 2nd Flag'])].shape[0]

    sb += data[(player_id == data['Third Runner']) & ('T' == data['SB For Runner On 3rd Flag'])].shape[0]
    cs += data[(player_id == data['Third Runner']) & ('T' == data['CS For Runner On 3rd Flag'])].shape[0]
    po += data[(player_id == data['Third Runner']) & ('T' == data['PO For Runner On 3rd Flag'])].shape[0]
    # calculate errors
    err = 0
    err += data[(player_id == data['1st Error Player'])].shape[0]
    err += data[(player_id == data['2nd Error Player'])].shape[0]
    err += data[(player_id == data['3rd Error Player'])].shape[0]
    # fill new dictionary with stats
    stats['ID'] = [player_id]
    stats['Hits'] = [h]
    stats['Singles'] = [s]
    stats['Doubles'] = [d]
    stats['Triples'] = [t]
    stats['Home Runs'] = [hr]
    stats['Walks'] = [w]
    stats['Bunts'] = [b]
    stats['Sacrifice Bunts'] = [sh]
    stats['Sacrifice Flies'] = [sf]
    stats['RBIs'] = [rbi]
    stats['At-bats'] = [ab]
    stats['Stolen Bases'] = [sb]
    stats['Caught Stealing'] = [cs]
    stats['Picked Off'] = [po]
    stats['Errors'] = [err]
    
    # return stats
    return stats
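
As an alternative to the `if`/`elif` chain in `calculate_stats`, the event-type counts could be computed with a lookup table and `value_counts`. This is only a sketch on made-up plays: `EVENT_NAMES` is a hypothetical name, and the mapping simply mirrors the codes handled in the function above (20 to 23 for singles through home runs, 14 and 15 for walks).

```python
import pandas as pd

# Event-type codes handled by calculate_stats, mapped to stat names.
EVENT_NAMES = {14: "Walks", 15: "Walks", 20: "Singles", 21: "Doubles",
               22: "Triples", 23: "Home Runs"}

# Toy plays for one batter in one game (values are made up for illustration).
plays = pd.DataFrame({
    "Batter": ["meada001"] * 5,
    "Event Type": [20, 2, 21, 14, 3],
})

counts = (
    plays["Event Type"]
    .map(EVENT_NAMES)   # codes without a label become NaN
    .value_counts()     # NaN is dropped by default
    .reindex(["Singles", "Doubles", "Triples", "Home Runs", "Walks"], fill_value=0)
)
# Hits are the sum of the four hit types.
hits = int(counts[["Singles", "Doubles", "Triples", "Home Runs"]].sum())
print(counts.to_dict(), hits)
```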

In [88]:
def pbp_parser(pbp_data):
    # get column labels for new play-by-play dataframe
    defaults = ['Game ID']
    player_stats = ['Hits', 'Singles',
                    'Doubles', 'Triples', 'Home Runs', 
                    'Walks', 'Bunts', 'Sacrifice Bunts', 
                    'Sacrifice Flies', 'RBIs', 'At-bats', 
                    'Stolen Bases', 'Caught Stealing', 
                    'Picked Off', 'Errors']
    pbp_labels = defaults + player_stats
    
    # make a directory for individual player stats
    players_dir = 'datasets/player_stats/'
    os.makedirs(players_dir, exist_ok=True)
    
    # batters already written by an earlier run are skipped
    done = set(os.listdir(players_dir))
    
    # group once by game so each game's full play-by-play can be looked up quickly
    games = pbp_data.groupby('Game ID')
    
    # loop through every batter that appears in the play-by-play data
    for batter_id in pbp_data['Batter'].dropna().unique():
        batter_filename = f'{batter_id}.csv'
        if batter_filename in done:
            continue
        
        # one single-row DataFrame per game in which this player batted
        game_stats = []
        batter_games = pbp_data.loc[pbp_data['Batter'] == batter_id, 'Game ID'].unique()
        for game_id in batter_games:
            # use the whole game, not only this player's plate appearances,
            # so that base-running and error events are counted too
            current_game_pbp = games.get_group(game_id)
            current_game_stats = pd.DataFrame(data=calculate_stats(data=current_game_pbp, player_id=batter_id))
            current_game_stats['Game ID'] = game_id
            game_stats.append(current_game_stats)
        
        # write batter's stats to csv, keeping the column order Game ID, stats, ID
        batter_stats = pd.concat(game_stats)[pbp_labels + ['ID']]
        batter_stats.to_csv(players_dir + batter_filename)
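
The parser above loops over batters and games one at a time. For the batting-side stats, a single `groupby` over batter and game could produce the same per-game totals in one pass. The sketch below uses made-up rows and covers only at-bats and RBIs; the base-running and error stats key on other columns and would need separate handling.

```python
import pandas as pd

# Toy play-by-play rows (made up for illustration); flags use 'T'/'F' as in the data above.
pbp = pd.DataFrame({
    "Game ID": ["G1", "G1", "G1", "G2"],
    "Batter": ["a", "a", "b", "a"],
    "AB Flag": ["T", "F", "T", "T"],
    "RBI On Play": [1, 0, 0, 2],
})

per_game = (
    pbp.assign(at_bat=pbp["AB Flag"].eq("T"))   # boolean column so it can be summed
       .groupby(["Batter", "Game ID"], as_index=False)
       .agg(**{"At-bats": ("at_bat", "sum"), "RBIs": ("RBI On Play", "sum")})
)
print(per_game)
```

Each row of `per_game` corresponds to one batter in one game, which is the same shape as the per-player CSV files written by `pbp_parser`.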

In [90]:
pbp_parser(pbp_data)

In [85]:
batter = pbp_data.loc[0,'Batter']
filename = f'{batter}.csv'
hey = pd.read_csv('datasets/player_stats/' + filename)
hey.head()

Unnamed: 0.1,Unnamed: 0,Game ID,Hits,Singles,Doubles,Triples,Home Runs,Walks,Bunts,Sacrifice Bunts,Sacrifice Flies,RBIs,At-bats,Stolen Bases,Caught Stealing,Picked Off,Errors,ID
0,0,SFN201904050,2,1,1,0,0,0,0,0,0,1,5,0,0,0,0,meada001
1,0,SFN201904060,1,1,0,0,0,0,0,0,0,0,5,0,0,0,0,meada001
2,0,SFN201904070,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,meada001
3,0,MIA201905140,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,meada001
4,0,MIA201905150,0,0,0,0,0,1,0,0,0,0,3,0,0,0,0,meada001


In [91]:
hey.shape

(196, 18)

In [102]:
# create the group directories if they don't exist yet
for group in range(1, 4):
    os.makedirs(f'datasets/player_stats/group_{group}', exist_ok=True)

directory = 1
count = 0
ignore = ['group_1', 'group_2', 'group_3']
for file in os.listdir('datasets/player_stats/'):
    if file not in ignore:
        os.rename(f'datasets/player_stats/{file}', f'datasets/player_stats/group_{directory}/{file}')
        count += 1
        if count == 1000:
            directory +=1
            count = 0

In [103]:
len(os.listdir('datasets/player_stats/'))

3

In [106]:
print(len(os.listdir('datasets/player_stats/group_1')))
print(len(os.listdir('datasets/player_stats/group_2')))
print(len(os.listdir('datasets/player_stats/group_3')))

1000
1000
75
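
The split into groups of at most 1000 files can also be expressed as integer division on each file's position in the listing. The sketch below uses a hypothetical helper, `assign_groups`, and 2,075 made-up file names (the total of the three counts printed above); it only computes the assignment and does not move any files.

```python
def assign_groups(filenames, group_size=1000):
    """Map each file name to a 1-based group number, group_size files per group."""
    return {name: index // group_size + 1 for index, name in enumerate(filenames)}

# 2,075 made-up file names, matching the total of the counts printed above.
names = [f"player{n:04d}.csv" for n in range(2075)]
groups = assign_groups(names)

# Count how many files land in each group.
sizes = {}
for group in groups.values():
    sizes[group] = sizes.get(group, 0) + 1
print(sizes)
```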
