# Model Runs per Half Inning With Linear Regression

Runs scored per half-inning will be modeled using multivariate linear regression for singles, doubles, triples, home runs, and other types of plays.

Linear regression models are good explanatory models.  The coefficient in front of "singles", for example, tells how much each single contributes towards scoring a run.

The target variable, runs, is a count.  Its distribution is far from normal, which suggests that linear regression may not be the optimal predictive model.  Nevertheless, a linear model can be fit that predicts well and is easy to understand.  As the primary goal is understanding rather than the highest possible predictive accuracy, a linear model will be created.

The first step is to create a DataFrame that has stats per half inning.  As the output of the cwevent parser is missing some necessary fields, I have added these fields by parsing the Retrosheet event file using regular expressions.  These new fields were verified to be 100% consistent with those created by the cwgame parser.

## Baseball Assumptions
* The goal of each offensive half-inning is to score as many runs as possible
* The goal of each defensive half-inning is to allow as few runs as possible
  
The above assumptions are not entirely correct for the home half of the 9th or later innings in which the offensive team is trying to end the game with a win, and the defensive team is trying to prevent the game from ending with a loss.  To account for this, only the first 8 innings of play will be used.

Some games have fewer than 8 complete innings.  This is usually due to rain.  As it is difficult for either team to predict when an umpire will call a game short of the 9th, the strategy employed in the bottom of the 9th or later arguably does not apply in these games, so shortened games will be included in the model.  Shortened games are rare, making up only 0.3% of all games since 2000.

## Fields Added to Output of cwevent
This section describes the details of the parsing of the Retrosheet data.

The cwevent parser output is described at: http://chadwick.sourceforge.net/doc/cwevent.html

cwevent does not produce fields for so, sb, cs, bk, bb, ibb, hbp, or xi. These fields were added by custom parsing of the cwevent field event_tx, before the wrangled event csv file was created.
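
As a rough illustration of this kind of parsing, the sketch below derives a few of these flags from an `event_tx` string with regular expressions. The function name `parse_event_tx` and the patterns are assumptions made for illustration, not the code used to build the wrangled file. They rely on the Retrosheet event codes (K strikeout, W walk, IW or I intentional walk, HP hit by pitch, BK balk, C interference, SBx stolen base, CSx caught stealing) and ignore edge cases such as a caught stealing negated by an error.

```python
import re

def parse_event_tx(event_tx):
    """Return simple flags/counts parsed from a Retrosheet event string (hypothetical helper)."""
    # The basic play precedes the first modifier ('/') or runner advance ('.').
    basic = re.split(r'[/.]', event_tx, maxsplit=1)[0]
    ibb = re.match(r'IW?(\+|$)', basic) is not None   # 'IW' or 'I'
    ubb = re.match(r'W(\+|$)', basic) is not None     # 'W' but not 'WP'
    return {
        'so': basic.startswith('K'),
        'ibb': ibb,
        'bb': ibb or ubb,                              # bb includes ibb
        'hbp': basic.startswith('HP'),
        'bk': basic.startswith('BK'),
        'xi': re.match(r'C(\+|$)', basic) is not None, # 'C' but not 'CS2'
        'sb': len(re.findall(r'SB[23H]', basic)),      # one count per base stolen
        'cs': len(re.findall(r'CS[23H]', basic)),
    }

for tx in ['K+SB2', 'IW', 'W.1-2', 'WP.1-2', 'HP', 'S7.1-3']:
    print(tx, parse_event_tx(tx))
```

Splitting off the basic play first keeps the patterns short, since modifiers and runner advances can contain letters that would otherwise produce false matches.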

cwevent also has a h_cd field with values of 1, 2, 3, 4 for single, double, triple, home run, but for ease of processing, a boolean field was added for each.  When the event data is grouped by half-inning, the boolean values can be summed to get the total number of singles, doubles, triples, and home runs per half-inning.
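
The idea can be sketched on a toy DataFrame. The rows below are made up for illustration, and the column names simply mirror those used in this notebook; this is not the wrangling code itself.

```python
import pandas as pd

# Toy play-by-play rows (made-up data); h_cd: 0 = no hit, 1-4 = single..home run
toy = pd.DataFrame({
    'game_id':   ['G1'] * 5,
    'inn_ct':    [1, 1, 1, 1, 1],
    'home_half': [0, 0, 0, 1, 1],
    'h_cd':      [1, 0, 4, 2, 1],
})

# one boolean column per hit type
hit_codes = {'single': 1, 'double': 2, 'triple': 3, 'hr': 4}
for name, code in hit_codes.items():
    toy[name] = toy['h_cd'] == code

# summing booleans within each half-inning counts the hits of each type
key = ['game_id', 'inn_ct', 'home_half']
hits = toy.groupby(key)[list(hit_codes)].sum()
print(hits)
```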

As cwevent has been used and tested for years, I made no attempt to rewrite it.  Rather I augmented its output by parsing its event_tx field, and then verified that when aggregated, these new fields exactly match the output of the cwgame parser.

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re

In [2]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [3]:
# Linear Regression modules
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

In [4]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

## Read in the Data

In [5]:
# This may be a very large dataframe if many years were parsed and/or many fields were selected
event = dh.from_csv_with_types(retrosheet_data / 'event.csv.gz')
event.shape

(10483615, 42)

In [6]:
pd.set_option('display.max_columns', 50)
event.head(3)

Unnamed: 0,game_id,inn_ct,home_half,away_score_ct,home_score_ct,bat_id,pit_id,event_tx,h_cd,outs,e,event_id,team_id,opponent_team_id,inn_runs_ct,start_bases_cd,end_bases_cd,r,fate_runs_ct,ab,sh,sf,dp,tp,wp,pb,inn_end,pa,bat_safe_err,so,sb,cs,bk,ibb,bb,hbp,xi,single,double,triple,hr,h
0,BAL195504120,1,0,0,0,goodb101,colej101,S,1,0,0,1,BOS,BAL,0,0,1,0,1,True,False,False,False,False,False,False,False,True,False,False,0,0,False,False,False,False,False,True,False,False,False,True
1,BAL195504120,1,0,0,0,joose101,colej101,S7.1-3,1,0,0,2,BOS,BAL,0,1,5,0,1,True,False,False,False,False,False,False,False,True,False,False,0,0,False,False,False,False,False,True,False,False,False,True
2,BAL195504120,1,0,0,0,throf101,colej101,64(1)/FO.3-H,0,1,0,3,BOS,BAL,0,5,1,1,0,True,False,False,False,False,False,False,False,True,False,False,0,0,False,False,False,False,False,False,False,False,False,False


In [7]:
# as most processing is done by year, add a year column
event['year'] = event['game_id'].str[3:7].astype('int')

In [8]:
# for now, just consider half-innings since 2000
event = event.query('year >= 2000')
event.shape

(3846595, 43)

In [9]:
# what percent of official games are shorter than 9 innings?
# the game csv file has the number of innings for each game
game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz')
game['year'] = game['game_start'].dt.year
game = game.query('year >= 2000')

np.round(game.query('inn_ct < 9')['inn_ct'].count() / len(game), 3)

0.003

In [10]:
# Alternative calculation
# 8 complete innings is 8 * 3 * 2 = 48 outs
np.round(game.query('outs_ct < 48')['outs_ct'].count() / len(game), 3)

0.003

The above shows that 99.7% of all games have at least 8 complete innings of play.

In [11]:
# remove the 9th and later innings
event_8 = event.query('inn_ct < 9')

In [12]:
# these play-by-play columns will be aggregated to the half-inning
agg_cols = ['pb', 'wp', 'dp', 'bb', 'outs', 'so', 'hbp', 'triple', 'single', 'tp', 'sf',
            'r', 'double', 'pa', 'bat_safe_err', 'h', 'sb', 'sh', 'ibb', 'ab', 'cs',
            'hr', 'xi', 'e', 'bk']

In [13]:
# group by half-inning
key = ['game_id', 'inn_ct', 'home_half']
inn_8 = event_8[key + agg_cols].groupby(key).agg('sum')

In [14]:
inn_8.head(17)

game_id,inn_ct,home_half,pb,wp,dp,bb,outs,so,hbp,triple,single,tp,sf,r,double,pa,bat_safe_err,h,sb,sh,ibb,ab,cs,hr,xi,e,bk
ANA200004030,1,0,0.0,0.0,0.0,1.0,3,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,4.0,0.0,1.0,0,0.0,0.0,3.0,1,0.0,0.0,0,0.0
ANA200004030,1,1,0.0,0.0,1.0,0.0,3,1.0,0.0,0.0,1.0,0.0,0.0,0,0.0,3.0,0.0,1.0,0,0.0,0.0,3.0,0,0.0,0.0,0,0.0
ANA200004030,2,0,0.0,0.0,0.0,1.0,3,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,4.0,0.0,0.0,0,0.0,0.0,3.0,0,0.0,0.0,0,0.0
ANA200004030,2,1,0.0,0.0,0.0,1.0,3,0.0,0.0,0.0,1.0,0.0,0.0,1,0.0,5.0,0.0,2.0,0,0.0,0.0,4.0,1,1.0,0.0,0,0.0
ANA200004030,3,0,0.0,0.0,0.0,0.0,3,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,3.0,0.0,0.0,0,0.0,0.0,3.0,0,0.0,0.0,0,0.0
ANA200004030,3,1,0.0,0.0,0.0,0.0,3,1.0,0.0,0.0,0.0,0.0,0.0,0,1.0,4.0,0.0,1.0,0,0.0,0.0,4.0,0,0.0,0.0,0,0.0
ANA200004030,4,0,0.0,0.0,0.0,0.0,3,1.0,0.0,0.0,1.0,0.0,0.0,0,0.0,3.0,0.0,1.0,0,0.0,0.0,3.0,1,0.0,0.0,0,0.0
ANA200004030,4,1,0.0,0.0,0.0,0.0,3,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,3.0,0.0,0.0,0,0.0,0.0,3.0,0,0.0,0.0,0,0.0
ANA200004030,5,0,0.0,0.0,0.0,1.0,3,1.0,0.0,0.0,0.0,0.0,0.0,0,0.0,5.0,1.0,0.0,0,0.0,0.0,4.0,0,0.0,0.0,1,0.0
ANA200004030,5,1,0.0,0.0,0.0,0.0,3,1.0,0.0,0.0,3.0,0.0,0.0,0,0.0,6.0,0.0,3.0,0,0.0,0.0,6.0,0,0.0,0.0,0,0.0


In [15]:
# spot check the above
usecols = ['game_id', 'bat_last', 'team_id', 'opponent_team_id', 'r', 'h', 'e',
           'lob', 'line_tx', 'ab', 'double', 'triple', 'hr']
team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz', usecols=usecols)
team_game.query('game_id == "ANA200004030"')

Unnamed: 0,game_id,bat_last,team_id,opponent_team_id,r,h,e,lob,line_tx,ab,double,triple,hr
167420,ANA200004030,True,ANA,NYA,2,10,1,11,10000001,35,1,0,1
167421,ANA200004030,False,NYA,ANA,3,6,0,5,2100,32,0,0,2


The line score shows that after 8 innings of play, 4 runs had been scored.  This agrees with the play-by-play data aggregated to the half-inning level displayed above.
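
The count of 4 can be reproduced from the `line_tx` values with the sketch below. It assumes that the leading zeros of each line score were dropped when the field was read as an integer, that the game lasted 9 innings, and that no inning had 10 or more runs. The helper name `runs_through` is hypothetical.

```python
def runs_through(line_tx, last_inning, innings_played=9):
    """Sum the runs in innings 1..last_inning of a line score (hypothetical helper)."""
    # Assumption: leading zeros were lost when the line score was read as an
    # integer, so pad back to one digit per inning. Innings with 10+ runs
    # are not handled by this sketch.
    digits = str(line_tx).zfill(innings_played)
    return sum(int(d) for d in digits[:last_inning])

home = runs_through(10000001, 8)   # ANA
away = runs_through(2100, 8)       # NYA
print(home, away, home + away)     # 1 3 4
```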

In [16]:
# bb (base on balls) is defined to include ibb (intentional base on balls)
# create a new column for unintentional base on balls
inn_8['ubb'] = inn_8['bb'] - inn_8['ibb']

The manager of the team on the field opts for an intentional walk to reduce the number of runs in a potentially high scoring situation.  Intentional walks are rare in the first 8 innings of play.

An unintentional walk would be expected to result in more runs than an intentional walk, as the intentional walk is intended to reduce the number of runs scored.  However since intentional walks are only issued in potentially high scoring situations, intentional walks will be positively correlated with runs.

In [17]:
# average number of times each event occurs in a half-inning
inn_8.mean()

pb              0.007563
wp              0.036643
dp              0.103954
bb              0.357514
outs            2.999849
so              0.790901
hbp             0.039606
triple          0.021022
single          0.658867
tp              0.000094
sf              0.030968
r               0.519664
double          0.202162
pa              4.308033
bat_safe_err    0.039963
h               1.003564
sb              0.064520
sh              0.031648
ibb             0.023166
ab              3.837282
cs              0.026264
hr              0.121513
xi              0.000595
e               0.069912
bk              0.003552
ubb             0.334348
dtype: float64

### Field Descriptions
pb = passed ball  
wp = wild pitch  
bk = balk  
dp = double play  
tp = triple play  
bb = base on balls  
ibb = intentional base on balls  
outs = outs recorded; almost always 3 when aggregated to the half-inning (the mean above is slightly below 3, so a few half-innings ended early)  
so = strike outs  
hbp = hit by pitch  
sf = sacrifice fly  
sh = sacrifice bunt (aka sacrifice hit)  
sb = stolen base  
cs = caught stealing  
ab = at bat  
pa = plate appearance  
bat_safe_err = batter reaches on error  
xi = batter reaches on interference  
e = error  
h = hit  

Single, double, triple and hr are self-explanatory.

## Creating the Linear Model
The choice of which variables to include is not obvious.

A high value for R squared (or adjusted R squared) is the aim, as this implies the model fits the data well. However, the ability to explain as simply as possible what causes runs to be scored is also a goal.

Use of very rare plays, such as a triple play, is unlikely to improve the accuracy of the model because there is too little data to support it.

Including variables which neither help nor hurt makes the model harder to understand.

The following independent variables were chosen by experimentation to give a good adjusted R squared value with a small number of variables.

Note: as the target variable, runs, is not normally distributed, it is likely that the confidence intervals (CIs) computed by an ordinary least squares (OLS) algorithm are not accurate.
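
One way to check intervals without relying on the normality assumption is a nonparametric bootstrap. The sketch below is not part of the analysis in this notebook: it uses synthetic count data (not the Retrosheet data), a single predictor, and NumPy's least-squares solver in place of statsmodels, purely to illustrate the technique.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): x mimics singles per
# half-inning, y is a count whose mean rises by 0.5 per unit of x.
n = 2000
x = rng.poisson(0.66, size=n)
y = rng.poisson(0.2 + 0.5 * x)

def ols_slope(x, y):
    """Slope of y on x (with intercept) by ordinary least squares."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

point = ols_slope(x, y)

# Resample rows with replacement and refit to get the slope's sampling spread.
n_boot = 500
boot = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[b] = ols_slope(x[idx], y[idx])

# Percentile interval: the middle 95% of the bootstrap slopes.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f'slope = {point:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})')
```

If a bootstrap interval computed this way on the real data were close to the OLS interval, that would suggest the OLS intervals are usable despite the count-valued target.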

In [18]:
# perhaps the easiest way to specify the independent variables is to use R formula notation
formula = 'r ~ ubb + ibb + hbp + single + double + triple + hr + sf + dp + e'
model = smf.ols(formula=formula, data=inn_8)
result = model.fit()
# print(result.summary())

In [19]:
# 78% of the variance is explained by the linear model
np.round(result.rsquared_adj, 2)

0.78

In [20]:
# number of observations used to build model
result.nobs

776987.0

In [21]:
# the coefficients of the linear model
np.round(result.params, 2)

Intercept   -0.29
ubb          0.34
ibb          0.22
hbp          0.38
single       0.48
double       0.73
triple       0.96
hr           1.39
sf           0.54
dp          -0.29
e            0.45
dtype: float64
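
To see how these coefficients combine, the sketch below applies them by hand to two half-innings. The coefficients are the rounded values printed above; the helper `predict_runs` is hypothetical and only mirrors what the fitted model's prediction would compute.

```python
import math

# Rounded coefficients copied from the output above
COEF = {'Intercept': -0.29, 'ubb': 0.34, 'ibb': 0.22, 'hbp': 0.38,
        'single': 0.48, 'double': 0.73, 'triple': 0.96, 'hr': 1.39,
        'sf': 0.54, 'dp': -0.29, 'e': 0.45}

def predict_runs(**events):
    """Predicted runs for a half-inning given event counts, e.g. single=2."""
    return COEF['Intercept'] + sum(COEF[name] * count for name, count in events.items())

two_singles_and_double = predict_runs(single=2, double=1)
nothing_happened = predict_runs()

print(round(two_singles_and_double, 2))  # 1.4
print(round(nothing_happened, 2))        # -0.29
```

The second case shows a limitation of fitting a linear model to a count: a half-inning with none of these events gets a negative prediction, even though runs can never be negative.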

In [22]:
# 95% confidence interval for each coefficient.
# As the assumptions of a linear model are not quite met by the baseball data,
# the confidence intervals may not be accurate, but they are representative of the uncertainty
result.conf_int().round(3)

Unnamed: 0,0,1
Intercept,-0.29,-0.287
ubb,0.341,0.344
ibb,0.212,0.226
hbp,0.37,0.381
single,0.483,0.486
double,0.725,0.729
triple,0.953,0.968
hr,1.388,1.394
sf,0.535,0.547
dp,-0.295,-0.287


In [23]:
single = result.params['single']
double = result.params['double']
triple = result.params['triple']
hr = result.params['hr']

# Single, Double, Triple, Home Run 
print(f'{single:4.2f} {double:4.2f} {triple:4.2f} {hr:4.2f}')

0.48 0.73 0.96 1.39


## Interpreting the Linear Model

### Walks, Hit by Pitch
Each intentional walk (ibb) leads to 0.22 more runs.  
Each unintentional walk (ubb) leads to 0.34 more runs.  
It was expected that unintentional walks would lead to more runs than intentional walks.  

Each hit-by-pitch (hbp) leads to 0.38 more runs.  This is very close to that of an unintentional walk.  This is expected as the result of hbp and ubb is the same.  

### Sacrifice Fly, Single, Error
All three of these are worth roughly the same.

A defensive error advances the runners or allows the batter to reach base or both.  
A single usually advances the runners and allows the batter to reach base.  
A sacrifice fly, by definition, scores a runner (usually from third) on a fly-ball out.  

A sacrifice fly in this model is worth slightly more than a single or an error.  As a baseball enthusiast, I would have expected a sacrifice fly to be worth slightly less.  It may be that the sacrifice fly results in slightly more runs because the scenarios in which it can occur are more favorable to run scoring.  (This leads to the concept of the Run Expectancy Matrix, which will be covered in a later notebook.)

### Single, Double, Triple, Home Run
When a batter's [slugging percentage](http://m.mlb.com/glossary/standard-stats/slugging-percentage) is computed, a single counts as 1.0, a double 2.0, a triple 3.0 and a home run 4.0.  

Above we see that the value of a double is not twice that of a single, nor are triples and home runs as valuable as the slugging percentage computation suggests.  In sabermetrics, this leads to the concept of weighted on-base average (wOBA), which uses weights for single, double, triple and home run that are similar to those computed above.
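
To make the comparison concrete, the sketch below rescales the fitted coefficients so that a single counts as 1.0 and prints them beside the slugging weights. The coefficients are the rounded values from the model output above.

```python
# Rounded coefficients from the linear model above
coef = {'single': 0.48, 'double': 0.73, 'triple': 0.96, 'hr': 1.39}
slugging_weight = {'single': 1.0, 'double': 2.0, 'triple': 3.0, 'hr': 4.0}

# express each hit type relative to the value of a single
relative = {hit: round(value / coef['single'], 2) for hit, value in coef.items()}

for hit in coef:
    print(f'{hit:7s} slugging = {slugging_weight[hit]:.1f}   model = {relative[hit]:.2f}')
```

On this scale a double is worth about 1.5 singles, a triple about 2 and a home run about 2.9, well short of the 2, 3 and 4 used by slugging percentage.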

### Double Play
A double play eliminates a base runner, so any inning with a dp is less likely to have runs scored.

## Summary
A simple linear model for understanding which plays contribute to runs in a half-inning was created. This linear model was able to account for 78% of the variance in the runs scored per half-inning. 

Creating a metric which weights each player's singles, doubles, triples and home runs by the coefficients found in the linear model should better represent a player's ability to create runs than slugging percentage.  There are other refinements that could be made, but a simple linear model based on data for each half-inning of play is an excellent starting point.
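
A minimal sketch of such a metric follows. The two stat lines are hypothetical players invented for illustration, the function names are assumptions, and only the four hit coefficients (rounded, from the model above) are used.

```python
# Rounded hit coefficients from the linear model, and total-base weights for slugging
COEF = {'single': 0.48, 'double': 0.73, 'triple': 0.96, 'hr': 1.39}
BASES = {'single': 1, 'double': 2, 'triple': 3, 'hr': 4}

def slugging(hits, ab):
    """Total bases per at bat."""
    return sum(BASES[kind] * n for kind, n in hits.items()) / ab

def weighted_hits(hits, ab):
    """Hits weighted by the model coefficients, per at bat (hypothetical metric)."""
    return sum(COEF[kind] * n for kind, n in hits.items()) / ab

# Two hypothetical players with 100 at bats each and identical total bases
contact_hitter = {'single': 24}
power_hitter = {'hr': 6}

for name, hits in [('contact', contact_hitter), ('power', power_hitter)]:
    print(f'{name:8s} slg = {slugging(hits, 100):.3f}   weighted = {weighted_hits(hits, 100):.4f}')
```

Slugging percentage rates the two players the same, while the coefficient-weighted metric rates the 24 singles above the 6 home runs, because the model values a home run at less than four singles.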