# Parsing Retrosheet Event Files

I have more questions about trends in home run statistics that the Baseball Databank datasets unfortunately can't answer. Instead, I want to turn to the far more detailed Retrosheet datasets, which make more specific questions possible. For example, people often describe Coors Field in Denver as a scary place for pitchers: something about the high altitude and/or the shape of the stadium supposedly makes it especially home run-prone. Is it true?

To answer this kind of question, the simple offensive statlines in the Baseball Databank datasets aren't enough. What we really need are the Retrosheet event files, which preserve historical baseball games inning by inning and, in some cases, pitch by pitch.

The problem is that they aren't in a format that can be used immediately; they need to be parsed first. (<a href="https://www.retrosheet.org/eventfile.htm">Here's</a> some info on how these event files are structured.)

So, as a preliminary step, I need to build some parsing functions that can help us deal with Retrosheet's huge amounts of information.

Let me show you what I mean.

I started by downloading event file datasets for 2010–2021, which look like this:

In [1]:
ls ../data/retrosheet/reg_season/

2010ANA.EVA  2014ATL.EVN  2018BOS.EVA  ANA2016.ROS  DET2018.ROS  PHI2020.ROS
2010ARI.EVN  2014BAL.EVA  2018CHA.EVA  ANA2017.ROS  DET2019.ROS  PHI2021.ROS
2010ATL.EVN  2014BOS.EVA  2018CHN.EVN  ANA2018.ROS  DET2020.ROS  PIT2010.ROS
2010BAL.EVA  2014CHA.EVA  2018CIN.EVN  ANA2019.ROS  DET2021.ROS  PIT2011.ROS
2010BOS.EVA  2014CHN.EVN  2018CLE.EVA  ANA2020.ROS  FLO2010.ROS  PIT2012.ROS
2010CHA.EVA  2014CIN.EVN  2018COL.EVN  ANA2021.ROS  FLO2011.ROS  PIT2013.ROS
2010CHN.EVN  2014CLE.EVA  2018DET.EVA  ARI2010.ROS  HOU2010.ROS  PIT2014.ROS
2010CIN.EVN  2014COL.EVN  2018HOU.EVA  ARI2011.ROS  HOU2011.ROS  PIT2015.ROS
2010CLE.EVA  2014DET.EVA  2018KCA.EVA  ARI2012.ROS  HOU2012.ROS  PIT2016.ROS
2010COL.EVN  2014HOU.EVA  2018LAN.EVN  ARI2013.ROS  HOU2013.ROS  PIT2017.ROS
2010DET.EVA  2014KCA.EVA  2018MIA.EVN  ARI2014.ROS  HOU2014.ROS  PIT2018.ROS
2010FLO.EVN  2014LAN.EVN  2018MIL.EVN  ARI2015.ROS  HOU2015.ROS  PIT2019.ROS
2010HOU.EVN  2014MIA.EVN  2018MIN.EVA  ARI2016.ROS  HOU2016.ROS 

There are two types of files here:
<ul>
    <li> Event files are named with the year followed by the home team they cover, and carry the extension .EVA or .EVN depending on whether that team plays in the American League or the National League</li>
    <li> Roster files are named with the team followed by the year and end in .ROS; they just contain roster info for the entire season, which isn't relevant for us now</li>
</ul>
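
The naming convention can be decoded mechanically. Here's a minimal sketch, assuming event files always look like `<year><team>.EVA`/`.EVN` and roster files like `<team><year>.ROS`, as in the listing above. The helper `classify_retrosheet_file()` and both patterns are my own illustration, inferred from the file names rather than from an official spec, and are not part of any existing module.

```python
import re

# Patterns inferred from the directory listing (an assumption, not an official spec)
EVENT_RE = re.compile(r"^(\d{4})([A-Z0-9]{3})\.EV([AN])$")
ROSTER_RE = re.compile(r"^([A-Z0-9]{3})(\d{4})\.ROS$")


def classify_retrosheet_file(filename):
    """Describe an event or roster file by its name; return None if unrecognized."""
    m = EVENT_RE.match(filename)
    if m:
        year, team, league = m.groups()
        return {
            "kind": "event",
            "year": int(year),
            "team": team,
            "league": "AL" if league == "A" else "NL",
        }
    m = ROSTER_RE.match(filename)
    if m:
        team, year = m.groups()
        return {"kind": "roster", "year": int(year), "team": team}
    return None


print(classify_retrosheet_file("2017LAN.EVN"))
print(classify_retrosheet_file("ANA2016.ROS"))
```

Something like this makes it easy to pick out only the event files for a given year before parsing them.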

Here's what one of these event files looks like:

In [2]:
!head -n 50 ../data/retrosheet/reg_season/2017LAN.EVN

id,LAN201704030
version,2
info,visteam,SDN
info,hometeam,LAN
info,site,LOS03
info,date,2017/04/03
info,number,0
info,starttime,1:10PM
info,daynight,day
info,usedh,false
info,umphome,vanol901
info,ump1b,marqa901
info,ump2b,fairc901
info,ump3b,rackd901
info,howscored,park
info,pitches,pitches
info,oscorer,whitj701
info,temp,67
info,winddir,torf
info,windspeed,3
info,fieldcond,unknown
info,precip,unknown
info,sky,sunny
info,timeofgame,172
info,attendance,53701
info,wp,kersc001
info,lp,chacj001
info,save,
start,margm001,"Manuel Margot",0,1,8
start,myerw001,"Wil Myers",0,2,3
start,solay001,"Yangervis Solarte",0,3,4
start,renfh001,"Hunter Renfroe",0,4,9
start,schir001,"Ryan Schimpf",0,5,5
start,hedga001,"Austin Hedges",0,6,2
start,aybae001,"Erick Aybar",0,7,6
start,chacj001,"Jhoulys Chacin",0,8,1
start,jankt001,"Travis Jankowski",0,9,7
start,tolea001,"Andrew Toles",1,1,7
start,seagc001,"Corey Seager",1,2,6
start,tu

There are different kinds of records here, identified by the term at the start of each row.
<ul>
    <li>The <code>id</code> record at the top is a unique identifier for each game</li>
    <li>The <code>info</code> records pertain to game-specific data, such as the home team, visiting team, attendance, etc.</li>
    <li>The <code>start</code> records specify the players who are starting for both teams</li>
    <li>The <code>play</code> records specify individual plays</li>
</ul>
There are also a number of other record types that don't appear in this snippet.
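
Since the record type is just whatever comes before the first comma, tallying the types is a quick sanity check. This is a small sketch on a handful of lines copied from the excerpt above; the variable names are only for illustration.

```python
from collections import Counter

# A few lines copied from the 2017LAN.EVN excerpt
sample_lines = [
    'id,LAN201704030',
    'version,2',
    'info,visteam,SDN',
    'info,hometeam,LAN',
    'start,margm001,"Manuel Margot",0,1,8',
]

# The record type is the text before the first comma on each line
record_types = Counter(line.split(",", 1)[0] for line in sample_lines)
print(record_types)
```

Run over a whole file, the same tally would show which record types dominate and which ones a parser still needs to handle.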

For a simple place to start, I decided to write some routines that parse the <code>id</code> and <code>info</code> record types. This won't answer my question about Coors Field, but it's a first step in figuring out how to make these datasets usable.

In the <code>retrosheet_utils</code> module I'll import below, I've written two functions thus far:
<ul>
    <li><code>parse_event_file_info()</code> takes a path to a specific event file and returns a list of dictionaries, each of which contains the <code>info</code> fields for a given game <code>id</code></li>
    <li><code>load_season_info()</code> takes a year and runs <code>parse_event_file_info()</code> for each event file pertaining to that year, returning a list of dictionaries of all the games of that year</li>
    </ul>
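
To give a feel for the idea, here is a simplified sketch of how `id` and `info` records could be grouped into one dictionary per game. It is not the code in `retrosheet_utils`: `parse_info_lines()` is a hypothetical stand-in that works on an iterable of lines rather than a file path, and it assumes every `info` row has the form `info,<field>,<value>`.

```python
import csv


def parse_info_lines(lines):
    """Group `info` fields under the most recent `id` record."""
    games = []
    current = None
    # csv.reader handles the quoted names that appear in other record types
    for row in csv.reader(lines):
        if not row:
            continue
        if row[0] == "id":
            # Each `id` record starts a new game
            current = {"id": row[1]}
            games.append(current)
        elif row[0] == "info" and current is not None and len(row) >= 3:
            current[row[1]] = row[2]
        # All other record types are ignored in this sketch
    return games


sample = [
    "id,LAN201704030",
    "version,2",
    "info,visteam,SDN",
    "info,hometeam,LAN",
    "info,save,",
    'start,margm001,"Manuel Margot",0,1,8',
]
print(parse_info_lines(sample))
```

An open file handle can be passed in place of the list, since `csv.reader` accepts any iterable of strings.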
 
Here's what the output looks like:

In [3]:
import retrosheet_utils

data = retrosheet_utils.load_season_info(2017)

data

[{'id': 'BOS201704030',
  'visteam': 'PIT',
  'hometeam': 'BOS',
  'site': 'BOS07',
  'date': '2017/04/03',
  'number': '0',
  'starttime': '2:06PM',
  'daynight': 'day',
  'usedh': 'true',
  'umphome': 'demud901',
  'ump1b': 'nauep901',
  'ump2b': 'guccc901',
  'ump3b': 'torrc901',
  'howscored': 'park',
  'pitches': 'pitches',
  'oscorer': 'shalm701',
  'temp': '48',
  'winddir': 'fromrf',
  'windspeed': '13',
  'fieldcond': 'unknown',
  'precip': 'unknown',
  'sky': 'cloudy',
  'timeofgame': '183',
  'attendance': '36594',
  'wp': 'porcr001',
  'lp': 'coleg001',
  'save': 'kimbc001'},
 {'id': 'BOS201704050',
  'visteam': 'PIT',
  'hometeam': 'BOS',
  'site': 'BOS07',
  'date': '2017/04/05',
  'number': '0',
  'starttime': '7:10PM',
  'daynight': 'night',
  'usedh': 'true',
  'umphome': 'nauep901',
  'ump1b': 'guccc901',
  'ump2b': 'torrc901',
  'ump3b': 'demud901',
  'howscored': 'park',
  'pitches': 'pitches',
  'oscorer': 'shalm701',
  'temp': '40',
  'winddir': 'fromrf',
  'winds

With 30 teams and 162 regular season games (each of which, of course, involves 2 teams!), the above list should contain info for 30 * 162 / 2 = 2430 games.

In [4]:
len(data)

2400

Seems we're missing 30 games, but we'll figure that out later.
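
One way to start tracking down the gap would be to count parsed games per home team, since each team should host roughly 162 / 2 = 81 of them. Here is a sketch with a hypothetical helper, `home_game_counts()`, exercised on three trimmed-down records taken from the excerpts above rather than on the full season:

```python
from collections import Counter


def home_game_counts(games):
    """Count parsed games per home team."""
    return Counter(game["hometeam"] for game in games)


# Three records trimmed from the excerpts above, just to exercise the function
toy = [
    {"id": "BOS201704030", "hometeam": "BOS"},
    {"id": "BOS201704050", "hometeam": "BOS"},
    {"id": "LAN201704030", "hometeam": "LAN"},
]
counts = home_game_counts(toy)
print(counts)
```

On the full list, any team well short of 81 home games would point to a missing file or to records that failed to parse.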

In this format, in any case, we can easily create a DataFrame.

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame.from_records(data)
df

Unnamed: 0,id,visteam,hometeam,site,date,number,starttime,daynight,usedh,umphome,...,winddir,windspeed,fieldcond,precip,sky,timeofgame,attendance,wp,lp,save
0,BOS201704030,PIT,BOS,BOS07,2017/04/03,0,2:06PM,day,true,demud901,...,fromrf,13,unknown,unknown,cloudy,183,36594,porcr001,coleg001,kimbc001
1,BOS201704050,PIT,BOS,BOS07,2017/04/05,0,7:10PM,night,true,nauep901,...,fromrf,8,unknown,unknown,unknown,233,36137,kellj001,basta001,
2,BOS201704110,BAL,BOS,BOS07,2017/04/11,0,7:10PM,night,true,coope901,...,tolf,16,unknown,unknown,unknown,195,37497,pomed001,bundd001,
3,BOS201704120,BAL,BOS,BOS07,2017/04/12,0,7:10PM,night,true,johna901,...,rtol,6,unknown,unknown,unknown,226,32211,givem001,wrigs001,
4,BOS201704130,PIT,BOS,BOS07,2017/04/13,0,2:06PM,day,true,morag901,...,torf,15,unknown,unknown,cloudy,195,32400,barnm001,nicaj001,kimbc001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2395,CHN201709150,SLN,CHN,CHI11,2017/09/15,0,1:21PM,day,false,bakej902,...,rtol,11,unknown,unknown,sunny,187,38464,edwac001,martc006,
2396,CHN201709160,SLN,CHN,CHI11,2017/09/16,0,3:05PM,day,false,torrc901,...,rtol,11,unknown,unknown,sunny,169,40959,hendk001,wachm001,daviw001
2397,CHN201709170,SLN,CHN,CHI11,2017/09/17,0,1:23PM,day,false,drecb901,...,fromrf,5,unknown,unknown,unknown,224,37242,strop001,lyont001,daviw001
2398,CHN201709290,CIN,CHN,CHI11,2017/09/29,0,1:22PM,day,false,muchm901,...,unknown,15,unknown,unknown,sunny,176,36258,duenb001,lorem002,grimj002


In this format, we can actually make some sense of all this information!

For example, in 2017, which ballparks had the highest average attendance?

In [7]:
df['attendance'] = df['attendance'].astype('int')

avg_attendance = df.groupby('site')['attendance'].mean().sort_values(ascending=False)

avg_attendance

site
LOS03    46482.287500
STL10    42544.375000
SFO03    40810.062500
CHI11    39482.387500
TOR02    39456.150000
NYC21    39383.600000
ANA01    37308.062500
DEN02    36508.800000
BOS07    36039.512500
MIL06    31294.120482
WAS11    31116.600000
ARL02    30922.487500
ATL03    30900.862500
HOU03    30570.415584
NYC20    29897.771084
DET05    28721.462500
KAN06    27351.162500
PHO01    26429.087500
SEA03    26396.875000
SAN02    26376.900000
MIN04    25289.137500
CLE08    25226.275000
BAL12    25062.500000
PIT08    23950.278481
PHI13    23495.000000
CIN09    22642.150000
MIA02    20231.064935
CHI12    20123.425000
OAK01    18282.362500
STP01    15013.550000
WIL02     2596.000000
Name: attendance, dtype: float64

We can make this a little more legible by joining with a table containing the ballpark names that match the codes:

In [8]:
ballparks = pd.read_csv('../data/retrosheet/parkcodes.csv')
ballparks

Unnamed: 0,PARKID,NAME,AKA,CITY,STATE,START,END,LEAGUE,NOTES
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,09/29/2019,AL,
...,...,...,...,...,...,...,...,...,...
255,WIL02,BB&T Ballpark at Bowman Field,,Williamsport,PA,08/20/2017,08/20/2017,NL,PIT
256,WNY01,West New York Field Club Grounds,,West New York,NJ,09/11/1898,09/17/1899,NL,"BRO:9/18&10/2/1898; NY1:9/11/98, 6/4&7/16&8/13..."
257,WOR01,Agricultural County Fair Grounds I,,Worcester,MA,05/01/1880,09/29/1882,NL,
258,WOR02,Agricultural County Fair Grounds II,,Worcester,MA,08/17/1887,08/17/1887,NL,1 BSN game


In [9]:
pd.merge(left=avg_attendance, right=ballparks, how='left', left_on='site', right_on='PARKID')

Unnamed: 0,attendance,PARKID,NAME,AKA,CITY,STATE,START,END,LEAGUE,NOTES
0,46482.2875,LOS03,Dodger Stadium,Chavez Ravine,Los Angeles,CA,04/10/1962,,NL,LAN:1962-prsnt; LAA:1962-9/2/65; CAL:9/2to9/22/65
1,42544.375,STL10,Busch Stadium III,,St. Louis,MO,04/10/2006,,NL,
2,40810.0625,SFO03,AT&T Park,Pacific Bell Park; SBC Park,San Francisco,CA,04/11/2000,,NL,
3,39482.3875,CHI11,Wrigley Field,Weeghman Park; Cubs Park,Chicago,IL,04/23/1914,,NL,CHF:1914-15; CHN:1916-date
4,39456.15,TOR02,Rogers Centre,Skydome,Toronto,ONT,06/05/1989,,AL,
5,39383.6,NYC21,Yankee Stadium II,,New York,NY,04/16/2009,,AL,
6,37308.0625,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
7,36508.8,DEN02,Coors Field,,Denver,CO,04/26/1995,,NL,
8,36039.5125,BOS07,Fenway Park,,Boston,MA,04/20/1912,,AL,BOS:1912-date; BSN:9/7to9/29/1914;4/14to7/26/15
9,31294.120482,MIL06,Miller Park,,Milwaukee,WI,04/06/2001,,NL,


There you have it! In 2017, Dodger Stadium (go Dodgers!) had the highest average attendance at 46,482, followed by the Cardinals' Busch Stadium, the Giants' AT&T Park, and the Cubs' Wrigley Field.

Maybe we'll also want to have a look at things like weather and wind conditions, game duration, etc. On their own these might be fun to look at, but once the rest of the Retrosheet event file is parsed, we could have some potentially powerful insights on our hands about how game conditions correlate with game stats.
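
As with attendance, those fields arrive as strings, so they would need converting before any analysis. Here is a minimal sketch on two rows copied from the DataFrame above. The choice of columns is mine, and I'm assuming unparseable values should simply become `NaN`; it would be worth checking the Retrosheet documentation for how unknown values are encoded before trusting any averages.

```python
import pandas as pd

# Two rows copied from the DataFrame above; all values arrive as strings
toy = pd.DataFrame({
    "id": ["BOS201704030", "BOS201704050"],
    "temp": ["48", "40"],
    "windspeed": ["13", "8"],
    "timeofgame": ["183", "233"],
})

numeric_cols = ["temp", "windspeed", "timeofgame"]
# errors="coerce" turns anything unparseable (e.g. an empty string) into NaN
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```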