# 29 August 2022: Retrosheet Parsing, Cont'd

I've added some functionality to my <code>retrosheet_utils</code> module so that it can now parse game play records. Let's see how well it's working.

In [1]:
import retrosheet_utils
import pandas as pd

In [2]:
# Load `play` data from sample event file
sample_event_file_path = '../data/retrosheet/reg_season/2017LAN.EVN'
data, cols = retrosheet_utils.parse_event_file_play(sample_event_file_path)

print(cols)
print()
print(data[:5])

['id', 'inning', 'hom_vis', 'player_id', 'count', 'pitches', 'event', 'vs']

[['LAN201704030', '1', '0', 'margm001', '12', 'CFBS', 'K', 'kersc001'], ['LAN201704030', '1', '0', 'myerw001', '12', 'BSSX', 'E6/TH/G+.B-2', 'kersc001'], ['LAN201704030', '1', '0', 'solay001', '00', 'B', 'WP.2-3', 'kersc001'], ['LAN201704030', '1', '0', 'solay001', '22', 'B.*BFFX', 'S7/G.3-H(UR)', 'kersc001'], ['LAN201704030', '1', '0', 'renfh001', '00', 'X', '2/P2F', 'kersc001']]


This may look confusing, but the output is already structured and ready to be turned into a DataFrame. Let me explain.

This parsing function, <code>parse_event_file_play()</code>, draws (mostly) on `play`-type records from Retrosheet event files.

Here's an example `play` record:
    <code>play,1,1,granc001,22,CBFBX,S8/G</code>

There are 7 components here:
<ol>
    <li>Record type (in this case, 'play')</li>
    <li>Inning number (in this case '1')</li>
    <li>Home (1) or Away (0) Team (in this case '1', or home)</li>
    <li>Player ID of player at bat (here granc001)</li>
    <li>Pitch count when this event occurred (here '22', meaning 2 balls and 2 strikes)</li>
    <li>Pitch sequence (here's where things start getting complicated: each character encodes a pitch, so the sequence CBFBX encodes a called strike [C], ball [B], foul [F], ball [B], and then a ball that's put into play by the batter [X])</li>
    <li>Event (this field can get even more complicated, although this is a relatively straightforward example that encodes a ground ball [/G] single [S] fielded by the center fielder [8])</li>
</ol>
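To make the field layout concrete, here is a minimal sketch that splits the example record into named fields and decodes the count and pitch sequence. The helper names `split_play_record` and `describe_pitches` are hypothetical (they are not part of `retrosheet_utils`), and the pitch table only covers the four codes described above.

```python
# Pitch codes described in the list above; Retrosheet defines many more.
PITCH_CODES = {
    'C': 'called strike',
    'B': 'ball',
    'F': 'foul',
    'X': 'ball put into play by batter',
}

FIELD_NAMES = ['record_type', 'inning', 'hom_vis', 'player_id',
               'count', 'pitches', 'event']


def split_play_record(line):
    """Split one `play` line into a dict of its 7 fields (hypothetical helper)."""
    fields = line.strip().split(',', 6)
    return dict(zip(FIELD_NAMES, fields))


def describe_pitches(pitches):
    """Translate each pitch character, flagging codes not in the table."""
    return [PITCH_CODES.get(ch, f'other ({ch})') for ch in pitches]


play = split_play_record('play,1,1,granc001,22,CBFBX,S8/G')

# The count is two characters: balls, then strikes.
balls, strikes = int(play['count'][0]), int(play['count'][1])

print(play['player_id'], balls, strikes)
print(describe_pitches(play['pitches']))
```

The count is converted to integers here only because the example is known to be numeric; a real file may need a guard for counts that are not recorded.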

My parsing function preserves the last 6 of these 7 components (we don't need the 'play' record type), to which I add a unique ID for the game to which this play belongs (`id`) and the current pitcher that the batter is facing (`vs`). It returns two variables:
<ul>
    <li>`cols` is a list of 8 column names</li>
    <li>`data` is a 2D array (a list of lists), with each row holding one value for each of the above columns</li>
</ul>
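For readers curious how the `id` and `vs` columns could be filled in, here is a rough sketch of one way to do it. It is not the actual `retrosheet_utils` code. It assumes that each game begins with an `id` record, and that `start` and `sub` records carry the player ID, name, team (0 or 1), batting order, and fielding position, with position 1 meaning pitcher. The function name `parse_play_records` and the sample lines are hand-written for illustration.

```python
import csv


def parse_play_records(lines):
    """Collect play rows, tagging each with game ID and opposing pitcher.

    Hypothetical sketch; the real retrosheet_utils parser may differ.
    """
    cols = ['id', 'inning', 'hom_vis', 'player_id',
            'count', 'pitches', 'event', 'vs']
    data = []
    game_id = None
    pitchers = {'0': None, '1': None}  # current pitcher per team

    # csv.reader handles the quoted player names in start/sub records.
    for rec in csv.reader(lines):
        if not rec:
            continue
        kind = rec[0]
        if kind == 'id':
            # New game: remember its ID and forget the previous pitchers.
            game_id = rec[1]
            pitchers = {'0': None, '1': None}
        elif kind in ('start', 'sub') and rec[5] == '1':
            # Fielding position 1 is the pitcher; rec[3] is the team.
            pitchers[rec[3]] = rec[1]
        elif kind == 'play':
            # The batter faces the pitcher of the other team.
            fielding_team = '1' if rec[2] == '0' else '0'
            data.append([game_id, *rec[1:7], pitchers[fielding_team]])
    return data, cols


# Hand-written sample lines in the assumed format (names are placeholders).
sample = [
    'id,LAN201704030',
    'start,chacj001,"Visiting Pitcher",0,9,1',
    'start,kersc001,"Home Pitcher",1,9,1',
    'play,1,0,margm001,12,CFBS,K',
    'play,2,1,gonza003,31,BBBCX,DGR/78/L',
]
data, cols = parse_play_records(sample)
print(data)
```

Tracking one pitcher per team in a small dict keeps the lookup simple: a visiting batter always faces whoever is stored for the home team, and vice versa.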

So here's what we can do with the parsed output:

In [3]:
df = pd.DataFrame.from_records(data, columns=cols)
df

Unnamed: 0,id,inning,hom_vis,player_id,count,pitches,event,vs
0,LAN201704030,1,0,margm001,12,CFBS,K,kersc001
1,LAN201704030,1,0,myerw001,12,BSSX,E6/TH/G+.B-2,kersc001
2,LAN201704030,1,0,solay001,00,B,WP.2-3,kersc001
3,LAN201704030,1,0,solay001,22,B.*BFFX,S7/G.3-H(UR),kersc001
4,LAN201704030,1,0,renfh001,00,X,2/P2F,kersc001
...,...,...,...,...,...,...,...,...
7214,LAN201709270,9,0,renfh001,00,,NP,mccab001
7215,LAN201709270,9,0,renfh001,11,.BCX,7/F,jansk001
7216,LAN201709270,9,0,villc002,02,SSS,K,jansk001
7217,LAN201709270,9,0,solay001,21,FBBX,DGR/9/L+,jansk001


Since I've loaded event file 2017LAN.EVN, what we're seeing here is every play record (at bats, plus entries such as wild pitches and `NP` placeholders) from every Los Angeles Dodgers home game in 2017. There's still a lot of work to be done before this becomes truly useful, of course. The main thing is to figure out how to parse the complicated `pitches` and `event` columns.
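As a first step toward parsing the `event` column, here is a minimal sketch that splits an event string into three parts. It assumes the layout seen in the rows above: a basic play, then optional `/`-separated modifiers, then an optional `.` followed by `;`-separated runner advances. The function name `split_event` is hypothetical, and more exotic events may need extra handling.

```python
def split_event(event):
    """Split an event string into (basic_play, modifiers, advances)."""
    # Everything after the first '.' describes runner advances.
    description, _, advance_part = event.partition('.')
    # The first '/'-separated piece is the basic play; the rest are modifiers.
    basic_play, *modifiers = description.split('/')
    advances = advance_part.split(';') if advance_part else []
    return basic_play, modifiers, advances


for ev in ['K', 'WP.2-3', 'S7/G.3-H(UR)', 'D7/G+/MREV.2-H;1-3']:
    print(ev, '->', split_event(ev))
```

Applied to a DataFrame, something like `df['event'].map(split_event)` would yield one tuple per play, which could then be expanded into separate columns.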

Because I've ostensibly started this project to learn more about trends in home runs, one thing we <i>can</i> do at the moment is look for those, which are coded as 'H' or 'HR' in the `event` column.

In [4]:
# Masks for legibility
is_in_play = (df.pitches.str.endswith('X'))
# Caution: a leading 'D' marks doubles (D, DGR); home runs begin with 'H' or 'HR'
is_hr = (df.event.str.startswith('D'))

hrs_at_LAD = df[is_in_play & is_hr]
hrs_at_LAD

Unnamed: 0,id,inning,hom_vis,player_id,count,pitches,event,vs
12,LAN201704030,2,1,gonza003,31,BBBCX,DGR/78/L,chacj001
23,LAN201704030,3,1,turnj001,12,CBCX,D8/L+,chacj001
35,LAN201704030,4,1,turnj001,30,*BB*BX,D7/G+/MREV.2-H;1-3,chacj001
46,LAN201704030,5,1,puigy001,01,FX,D7/L+,bethc001
78,LAN201704030,8,0,margm001,00,X,D9/L.1-H,hatcc002
...,...,...,...,...,...,...,...,...
7078,LAN201709260,6,1,herne001,22,.BCBFFX,D7/L+,mcgrk001
7096,LAN201709260,7,1,granc001,00,.X,D7/L,diazm004
7143,LAN201709270,2,1,barna001,32,CBBFFFBX,D9/L+,richc002
7156,LAN201709270,3,1,barna001,00,X,D8/L+.3-H;1-H,richc002


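So here we have a list of doubles rather than home runs: the mask above matches events starting with 'D', which Retrosheet uses for doubles (`D`) and ground-rule doubles (`DGR`), whereas home runs start with 'H' or 'HR'. Looks like there were 267 doubles at Dodger Stadium in 2017.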
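A mask that targets home runs could look like the sketch below. It assumes home-run events begin with `H` or `HR`, followed by something other than a capital letter, so that `HP` (hit by pitch) is not matched by accident. The events `HR/78/F` and `H9/F` are made-up examples, not rows from the file.

```python
import pandas as pd

# A few events from the table above, plus made-up home-run and HP examples.
events = pd.Series(['K', 'D8/L+', 'DGR/78/L', 'HR/78/F',
                    'H9/F', 'HP', 'S7/G.3-H(UR)'])

# str.match anchors at the start of the string. The negative lookahead
# rejects codes such as 'HP' where another capital letter follows.
is_home_run = events.str.match(r'HR?(?![A-Z])')

print(events[is_home_run].tolist())
```

In the notebook, the same pattern would be applied to `df.event` in place of the `startswith('D')` mask.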

In [5]:
hrs_at_LAD.groupby(by='id')['id'].count().sort_values(ascending=False)

id
LAN201705210    8
LAN201709090    8
LAN201706100    6
LAN201709250    6
LAN201705200    6
               ..
LAN201705260    1
LAN201705270    1
LAN201706070    1
LAN201706090    1
LAN201709040    1
Name: id, Length: 79, dtype: int64

The query above tallies these doubles by game, and therefore by date. 8 doubles on 5/21/2017 and 9/9/2017!
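To group by an actual date rather than by the raw game ID, the date can be pulled out of the ID itself. This sketch assumes the ID is a 3-letter home team code, then the date as YYYYMMDD, then a single game-number digit, which matches IDs like `LAN201705210` above. The sample Series is illustrative, not real counts.

```python
import pandas as pd

# Illustrative game IDs; in the notebook this would be a column such as df['id'].
game_ids = pd.Series(['LAN201705210', 'LAN201709090',
                      'LAN201709090', 'LAN201704030'])

# Characters 3 through 10 hold the date as YYYYMMDD.
game_dates = pd.to_datetime(game_ids.str[3:11], format='%Y%m%d')

# Count rows per calendar date, in chronological order.
per_date = game_dates.value_counts().sort_index()
print(per_date)
```

Having real datetimes makes it easy to resample by week or month later, and a doubleheader's two games fall on the same date instead of being counted as separate IDs.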

In any case, this is just a start. The next step is to load an entire season's worth of batting data all at once (or multiple seasons) so we can start looking for trends across all the ballparks and not just one.
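One possible shape for that next step is sketched below. The helper `load_play_files` is hypothetical. It takes the parsing function as an argument and assumes it behaves like `parse_event_file_play()` above, returning `(data, cols)` for a single file path. The glob pattern in the commented usage assumes that all of a season's event files sit in one directory and end in `.EVN` or `.EVA`.

```python
import os

import pandas as pd


def load_play_files(paths, parse_fn):
    """Parse each event file with parse_fn and stack the results.

    parse_fn is assumed to return (data, cols) for one file path.
    """
    frames = []
    for path in sorted(paths):
        data, cols = parse_fn(path)
        frame = pd.DataFrame.from_records(data, columns=cols)
        # Keep track of which file (and so which home park) each row came from.
        frame['source_file'] = os.path.basename(path)
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Possible usage, assuming the directory layout from the cells above:
# import glob
# import retrosheet_utils
# paths = glob.glob('../data/retrosheet/reg_season/2017*.EV*')
# season_df = load_play_files(paths, retrosheet_utils.parse_event_file_play)
```

Passing the parser in as a parameter keeps the loader easy to try out with a stand-in function before pointing it at a full season of files.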