## Baseball WAR Predictor ##

Data source: fangraphs.com

I downloaded batting data for the 1950-2017 seasons. The goal is to build a model that predicts a player's WAR from basic baseball stats.

In [76]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [77]:
battingStats = pd.read_csv("FanGraphs Leaderboard.csv")
battingStats.head()

Unnamed: 0,Season,Name,Team,G,PA,HR,R,RBI,SB,BB%,...,AVG,OBP,SLG,wOBA,wRC+,BsR,Off,Def,WAR,playerid
0,2002,Barry Bonds,Giants,143,612,46,117,110,9,32.4 %,...,0.37,0.582,0.799,0.544,244,-1.2,108.9,-2.0,12.7,1109
1,2001,Barry Bonds,Giants,153,664,73,129,137,13,26.7 %,...,0.328,0.515,0.863,0.537,235,1.3,118.0,-12.0,12.5,1109
2,2004,Barry Bonds,Giants,147,617,45,129,101,6,37.6 %,...,0.362,0.609,0.812,0.537,233,-0.3,105.7,-4.4,11.9,1109
3,1956,Mickey Mantle,Yankees,150,652,52,132,130,10,17.2 %,...,0.353,0.464,0.705,0.498,202,2.1,86.4,4.1,11.5,1008082
4,1957,Mickey Mantle,Yankees,144,623,34,121,94,16,23.4 %,...,0.365,0.512,0.665,0.498,217,2.7,88.9,0.2,11.4,1008082


In [78]:
battingStats.columns

Index(['Season', 'Name', 'Team', 'G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%',
       'K%', 'ISO', 'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+', 'BsR', 'Off',
       'Def', 'WAR', 'playerid'],
      dtype='object')

In [79]:
battingStats.dtypes

Season        int64
Name         object
Team         object
G             int64
PA            int64
HR            int64
R             int64
RBI           int64
SB            int64
BB%          object
K%           object
ISO         float64
BABIP       float64
AVG         float64
OBP         float64
SLG         float64
wOBA        float64
wRC+          int64
BsR         float64
Off         float64
Def         float64
WAR         float64
playerid      int64
dtype: object

Dropping columns that are not needed for the model

In [80]:
drop_columns = ["Season", "Name", "Team"]
battingStats = battingStats.drop(drop_columns, axis=1)

In [81]:
battingStats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8464 entries, 0 to 8463
Data columns (total 20 columns):
G           8464 non-null int64
PA          8464 non-null int64
HR          8464 non-null int64
R           8464 non-null int64
RBI         8464 non-null int64
SB          8464 non-null int64
BB%         8464 non-null object
K%          8464 non-null object
ISO         8464 non-null float64
BABIP       8464 non-null float64
AVG         8464 non-null float64
OBP         8464 non-null float64
SLG         8464 non-null float64
wOBA        8464 non-null float64
wRC+        8464 non-null int64
BsR         8464 non-null float64
Off         8464 non-null float64
Def         8464 non-null float64
WAR         8464 non-null float64
playerid    8464 non-null int64
dtypes: float64(10), int64(8), object(2)
memory usage: 1.3+ MB


Removing the '%' character from 2 columns and changing the columns to floats so they can be used in the model

In [82]:
battingStats['BB%'] = battingStats['BB%'].map(lambda x: x.strip('%')).astype(float)
battingStats['K%'] = battingStats['K%'].map(lambda x: x.strip('%')).astype(float)
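The same cleanup can be written as a small reusable helper. This is only a sketch on a made-up two-row frame; the helper name `percent_to_float` and the `toy` frame are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the export, where percentages arrive as strings like "32.4 %".
toy = pd.DataFrame({"BB%": ["32.4 %", "26.7 %"], "K%": ["7.7 %", "14.0 %"]})

def percent_to_float(series):
    """Remove the '%' sign and surrounding whitespace, then convert to float."""
    return series.str.replace("%", "", regex=False).str.strip().astype(float)

for col in ["BB%", "K%"]:
    toy[col] = percent_to_float(toy[col])

print(toy.dtypes)
```

Using the vectorized `.str` accessor avoids the per-element lambda and keeps the conversion in one place if more percentage columns show up later.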

In [83]:
battingStats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8464 entries, 0 to 8463
Data columns (total 20 columns):
G           8464 non-null int64
PA          8464 non-null int64
HR          8464 non-null int64
R           8464 non-null int64
RBI         8464 non-null int64
SB          8464 non-null int64
BB%         8464 non-null float64
K%          8464 non-null float64
ISO         8464 non-null float64
BABIP       8464 non-null float64
AVG         8464 non-null float64
OBP         8464 non-null float64
SLG         8464 non-null float64
wOBA        8464 non-null float64
wRC+        8464 non-null int64
BsR         8464 non-null float64
Off         8464 non-null float64
Def         8464 non-null float64
WAR         8464 non-null float64
playerid    8464 non-null int64
dtypes: float64(12), int64(8)
memory usage: 1.3 MB


In [84]:
battingStats.corr()['WAR']

G           0.245351
PA          0.394911
HR          0.496875
R           0.651702
RBI         0.501000
SB          0.147971
BB%         0.418101
K%         -0.010551
ISO         0.533763
BABIP       0.396889
AVG         0.584607
OBP         0.689709
SLG         0.663357
wOBA        0.756209
wRC+        0.796861
BsR         0.219377
Off         0.836402
Def         0.336213
WAR         1.000000
playerid   -0.032687
Name: WAR, dtype: float64

Dropping columns that don't give us useful information (playerid) or that are stats Fangraphs calculates itself. We want to keep only the more popular and easily available stats, not ones that require special calculations.

In [85]:
drop_columns = ['playerid', 'Def', 'Off', 'BsR', 'wRC+', 'wOBA']
df = battingStats.drop(drop_columns, axis=1)

Use RFECV (recursive feature elimination with cross-validation) to pull out the best features to use for our model

In [86]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X = df.drop('WAR', axis=1)
y = df['WAR']
model_1 = LinearRegression()
selector_1 = RFECV(model_1, cv=10)
selector_1.fit(X, y)
best_columns_1 = X.columns[selector_1.support_]
best_columns_1

Index(['G', 'HR', 'R', 'RBI', 'SB', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG', 'OBP',
       'SLG'],
      dtype='object')
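To make it clearer what RFECV is doing, here is a minimal sketch on synthetic data where only two of five columns actually drive the target. The data and the names `X_demo`, `y_demo`, and `selector_demo` are made up for illustration; only the `RFECV` and `LinearRegression` calls mirror the cell above.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 1 influence the target; columns 2-4 are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# RFECV repeatedly drops the weakest feature and uses cross-validation
# to decide how many features to keep.
selector_demo = RFECV(LinearRegression(), cv=5)
selector_demo.fit(X_demo, y_demo)

print(selector_demo.support_)     # boolean mask of the kept features
print(selector_demo.ranking_)     # rank 1 marks a selected feature
print(selector_demo.n_features_)  # how many features were kept
```

With a plain linear model the elimination order is driven by coefficient size, so features on very different scales (games played versus batting average, for example) may be ranked differently than their real importance would suggest.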

In [87]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(df[best_columns_1], df['WAR'], test_size=.25, random_state=3)

model_1.fit(X_train_1, y_train_1)
predictions_1 = model_1.predict(X_test_1)
mse = mean_squared_error(y_test_1, predictions_1)
rmse = mse ** (1/2)
score = model_1.score(X_test_1, y_test_1)
print(mse, rmse, score)


1.62209013924 1.2736130257 0.624604992425
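The three numbers printed above are the mean squared error, its square root (RMSE, in units of WAR), and the R² that `LinearRegression.score` returns. As a sketch of how they relate, here they are computed by hand on a tiny made-up example; the arrays are illustrative, not notebook data.

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

# Mean squared error and its square root.
mse_demo = np.mean((y_true - y_pred) ** 2)
rmse_demo = np.sqrt(mse_demo)

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(mse_demo, rmse_demo, r2_demo)
```

Read this way, an RMSE of about 1.27 means a typical prediction misses by a bit more than one win, and a score of about 0.62 means the model explains roughly 62% of the variance in WAR on the test split.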


In [88]:
Baez = [[142, 14, 50, 59, 12, 3.3, 24.0, .150, .336, .273, .314, .423],
            [145, 23, 75, 75, 10, 5.9, 28.3, .207, .345, .279, .317, .480]]

Baez_WAR = model_1.predict(Baez)
Baez_WAR

array([ 1.41923991,  2.16682954])

I used my favorite baseball player, Javier Baez, and put in his stats from 2016 and 2017. Our model predicts 1.4 and 2.2 WAR respectively, while Fangraphs has him at 2.2 and 2.3. Our model doesn't take defense into account, which probably explains most if not all of the discrepancy in 2016. Overall, not too bad for a model that uses just basic baseball stats.
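Typing a stat line as a bare list only works if the values are in exactly the same order as the training columns. A hedged alternative is to build the row from named values and reorder it. The sketch below uses a tiny made-up training set and a hypothetical three-feature model (`demo_model`); only the Baez values for G, HR, and OBP come from the cell above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration only.
features = ["G", "HR", "OBP"]
train = pd.DataFrame({
    "G": [150, 140, 120, 160],
    "HR": [30, 10, 5, 40],
    "OBP": [0.380, 0.320, 0.300, 0.400],
})
target = [5.0, 1.5, 0.5, 7.0]
demo_model = LinearRegression().fit(train[features], target)

# Stats keyed by name, in any order.
player = {"OBP": 0.314, "G": 142, "HR": 14}

# Selecting by the feature list puts the columns back in training order.
row = pd.DataFrame([player])[features]
pred = demo_model.predict(row)
print(pred)
```

In the notebook, the same idea would be to index the one-row frame with `best_columns_1` before calling `model_1.predict`.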

### Add Fielding Data ###

Add fielding data from Fangraphs to improve model

In [89]:
fielding = pd.read_csv("Fangraphs Fielding.csv")
batting = pd.read_csv("FanGraphs Leaderboard.csv")

In [90]:
fielding.head()

Unnamed: 0,Season,Name,Team,Pos,G,GS,Inn,PO,A,E,...,DPT,DPF,Scp,SB,CS,PB,WP,FP,TZ,playerid
0,1989,Barry Bonds,PIT,LF,156,146.0,1338.0,366,14,6,...,,,,,,,,0.985,37.0,1109
1,1999,Andruw Jones,ATL,CF,162,162.0,1447.1,493,13,10,...,,,,,,,,0.981,36.0,96
2,1975,Mark Belanger,BAL,SS,152,138.0,1262.1,259,508,17,...,,,,,,,,0.978,35.0,1000786
3,1998,Andruw Jones,ATL,CF,159,156.0,1372.2,413,20,2,...,,,,,,,,0.995,35.0,96
4,1968,Brooks Robinson,BAL,3B,162,161.0,1435.0,168,353,16,...,,,,,,,,0.97,33.0,1011055


In [91]:
batting.head()

Unnamed: 0,Season,Name,Team,G,PA,HR,R,RBI,SB,BB%,...,AVG,OBP,SLG,wOBA,wRC+,BsR,Off,Def,WAR,playerid
0,2002,Barry Bonds,Giants,143,612,46,117,110,9,32.4 %,...,0.37,0.582,0.799,0.544,244,-1.2,108.9,-2.0,12.7,1109
1,2001,Barry Bonds,Giants,153,664,73,129,137,13,26.7 %,...,0.328,0.515,0.863,0.537,235,1.3,118.0,-12.0,12.5,1109
2,2004,Barry Bonds,Giants,147,617,45,129,101,6,37.6 %,...,0.362,0.609,0.812,0.537,233,-0.3,105.7,-4.4,11.9,1109
3,1956,Mickey Mantle,Yankees,150,652,52,132,130,10,17.2 %,...,0.353,0.464,0.705,0.498,202,2.1,86.4,4.1,11.5,1008082
4,1957,Mickey Mantle,Yankees,144,623,34,121,94,16,23.4 %,...,0.365,0.512,0.665,0.498,217,2.7,88.9,0.2,11.4,1008082


We need to merge the two datasets into one, joining on the playerid and Season columns. First I'll clean up the fielding dataframe so it includes only the columns we need.

In [92]:
fielding.columns

Index(['Season', 'Name', 'Team', 'Pos', 'G', 'GS', 'Inn', 'PO', 'A', 'E', 'FE',
       'TE', 'DP', 'DPS', 'DPT', 'DPF', 'Scp', 'SB', 'CS', 'PB', 'WP', 'FP',
       'TZ', 'playerid'],
      dtype='object')

In [93]:
keep_columns = ['Season', 'Pos', 'Inn', 'PO', 'A', 'E', 'DP', 'FP', 'playerid']
fielding_clean = fielding[keep_columns]

In [94]:
fielding_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8247 entries, 0 to 8246
Data columns (total 9 columns):
Season      8247 non-null int64
Pos         8247 non-null object
Inn         7945 non-null float64
PO          8247 non-null int64
A           8247 non-null int64
E           8247 non-null int64
DP          8247 non-null int64
FP          8247 non-null float64
playerid    8247 non-null int64
dtypes: float64(2), int64(6), object(1)
memory usage: 579.9+ KB


The Inn (innings) column has missing values, so we'll just drop the column altogether.

In [95]:
fielding_clean = fielding_clean.drop('Inn', axis=1)
fielding_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8247 entries, 0 to 8246
Data columns (total 8 columns):
Season      8247 non-null int64
Pos         8247 non-null object
PO          8247 non-null int64
A           8247 non-null int64
E           8247 non-null int64
DP          8247 non-null int64
FP          8247 non-null float64
playerid    8247 non-null int64
dtypes: float64(1), int64(6), object(1)
memory usage: 515.5+ KB
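Instead of reading the non-null counts off `info()`, the missing values can be counted directly. This is a small sketch on a made-up frame (`demo` is illustrative; only the column names echo the fielding data).

```python
import numpy as np
import pandas as pd

# Toy fielding frame with one missing innings value.
demo = pd.DataFrame({
    "Inn": [1338.0, np.nan, 1262.1],
    "PO": [366, 493, 259],
})

# Count missing values per column and list the columns that have any.
missing = demo.isna().sum()
cols_with_missing = missing[missing > 0].index.tolist()

print(missing)
print(cols_with_missing)
```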


In [96]:
fielding_clean.head()

Unnamed: 0,Season,Pos,PO,A,E,DP,FP,playerid
0,1989,LF,366,14,6,1,0.985,1109
1,1999,CF,493,13,10,1,0.981,96
2,1975,SS,259,508,17,105,0.978,1000786
3,1998,CF,413,20,2,6,0.995,96
4,1968,3B,168,353,16,31,0.97,1011055


In [97]:
full_stats = pd.merge(batting, fielding_clean, how='left', on=['playerid','Season'])
full_stats

Unnamed: 0,Season,Name,Team,G,PA,HR,R,RBI,SB,BB%,...,Off,Def,WAR,playerid,Pos,PO,A,E,DP,FP
0,2002,Barry Bonds,Giants,143,612,46,117,110,9,32.4 %,...,108.9,-2.0,12.7,1109,LF,240.0,4.0,8.0,0.0,0.968
1,2001,Barry Bonds,Giants,153,664,73,129,137,13,26.7 %,...,118.0,-12.0,12.5,1109,LF,246.0,9.0,6.0,1.0,0.977
2,2004,Barry Bonds,Giants,147,617,45,129,101,6,37.6 %,...,105.7,-4.4,11.9,1109,LF,214.0,11.0,4.0,0.0,0.983
3,1956,Mickey Mantle,Yankees,150,652,52,132,130,10,17.2 %,...,86.4,4.1,11.5,1008082,CF,367.0,10.0,4.0,3.0,0.990
4,1957,Mickey Mantle,Yankees,144,623,34,121,94,16,23.4 %,...,88.9,0.2,11.4,1008082,CF,326.0,6.0,7.0,1.0,0.979
5,1967,Carl Yastrzemski,Red Sox,161,680,44,112,121,10,13.4 %,...,67.0,14.6,11.1,1014326,LF,289.0,13.0,7.0,1.0,0.977
6,1975,Joe Morgan,Reds,146,639,17,107,94,67,20.7 %,...,65.6,17.3,11.0,1009179,2B,356.0,425.0,11.0,96.0,0.986
7,1965,Willie Mays,Giants,157,638,52,118,112,9,11.9 %,...,63.2,14.8,10.7,1008315,CF,323.0,10.0,5.0,4.0,0.985
8,1991,Cal Ripken,Orioles,162,717,34,99,114,6,7.4 %,...,46.9,31.8,10.6,1010978,SS,267.0,528.0,11.0,114.0,0.986
9,1962,Willie Mays,Giants,162,706,49,130,141,18,11.0 %,...,58.7,20.0,10.5,1008315,CF,425.0,6.0,4.0,1.0,0.991


In [98]:
full_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8464 entries, 0 to 8463
Data columns (total 29 columns):
Season      8464 non-null int64
Name        8464 non-null object
Team        8464 non-null object
G           8464 non-null int64
PA          8464 non-null int64
HR          8464 non-null int64
R           8464 non-null int64
RBI         8464 non-null int64
SB          8464 non-null int64
BB%         8464 non-null object
K%          8464 non-null object
ISO         8464 non-null float64
BABIP       8464 non-null float64
AVG         8464 non-null float64
OBP         8464 non-null float64
SLG         8464 non-null float64
wOBA        8464 non-null float64
wRC+        8464 non-null int64
BsR         8464 non-null float64
Off         8464 non-null float64
Def         8464 non-null float64
WAR         8464 non-null float64
playerid    8464 non-null int64
Pos         6838 non-null object
PO          6838 non-null float64
A           6838 non-null float64
E           6838 non-null float64

A lot of rows from the batting dataframe didn't have a corresponding fielding row (8464 batting rows, but only 6838 with fielding data), so we'll drop any rows that have missing values.
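One way to see how many batting rows found no fielding match before dropping them is the `indicator` option of `pd.merge`. The frames below are made up for illustration and are not the notebook's data.

```python
import pandas as pd

batting_demo = pd.DataFrame({
    "playerid": [1, 1, 2],
    "Season": [2016, 2017, 2017],
    "WAR": [2.2, 2.3, 6.6],
})
fielding_demo = pd.DataFrame({
    "playerid": [1, 2],
    "Season": [2017, 2017],
    "PO": [200, 72],
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(batting_demo, fielding_demo, how="left",
                  on=["playerid", "Season"], indicator=True)
unmatched = (merged["_merge"] == "left_only").sum()

print(merged)
print(unmatched)
```

Rows marked `left_only` are the ones that end up with NaN fielding columns and would be removed by `dropna()`.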

In [99]:
clean_stats = full_stats.dropna()
clean_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6838 entries, 0 to 8463
Data columns (total 29 columns):
Season      6838 non-null int64
Name        6838 non-null object
Team        6838 non-null object
G           6838 non-null int64
PA          6838 non-null int64
HR          6838 non-null int64
R           6838 non-null int64
RBI         6838 non-null int64
SB          6838 non-null int64
BB%         6838 non-null object
K%          6838 non-null object
ISO         6838 non-null float64
BABIP       6838 non-null float64
AVG         6838 non-null float64
OBP         6838 non-null float64
SLG         6838 non-null float64
wOBA        6838 non-null float64
wRC+        6838 non-null int64
BsR         6838 non-null float64
Off         6838 non-null float64
Def         6838 non-null float64
WAR         6838 non-null float64
playerid    6838 non-null int64
Pos         6838 non-null object
PO          6838 non-null float64
A           6838 non-null float64
E           6838 non-null float64

Fix the percentage columns and drop the columns we don't want, like we did previously

In [100]:
# Work on an explicit copy so the assignments below don't trigger SettingWithCopyWarning
clean_stats = clean_stats.copy()
clean_stats['BB%'] = clean_stats['BB%'].map(lambda x: x.strip('%')).astype(float)
clean_stats['K%'] = clean_stats['K%'].map(lambda x: x.strip('%')).astype(float)
drop_columns = ['Season', 'playerid', 'Def', 'Off', 'BsR', 'wRC+', 'wOBA']
data = clean_stats.drop(drop_columns, axis=1)


In [101]:
data.drop('Team', axis=1, inplace=True)

In [102]:
data.head()

Unnamed: 0,Name,G,PA,HR,R,RBI,SB,BB%,K%,ISO,...,AVG,OBP,SLG,WAR,Pos,PO,A,E,DP,FP
0,Barry Bonds,143,612,46,117,110,9,32.4,7.7,0.429,...,0.37,0.582,0.799,12.7,LF,240.0,4.0,8.0,0.0,0.968
1,Barry Bonds,153,664,73,129,137,13,26.7,14.0,0.536,...,0.328,0.515,0.863,12.5,LF,246.0,9.0,6.0,1.0,0.977
2,Barry Bonds,147,617,45,129,101,6,37.6,6.6,0.45,...,0.362,0.609,0.812,11.9,LF,214.0,11.0,4.0,0.0,0.983
3,Mickey Mantle,150,652,52,132,130,10,17.2,15.2,0.353,...,0.353,0.464,0.705,11.5,CF,367.0,10.0,4.0,3.0,0.99
4,Mickey Mantle,144,623,34,121,94,16,23.4,12.0,0.3,...,0.365,0.512,0.665,11.4,CF,326.0,6.0,7.0,1.0,0.979


We'll create dummy columns for the Pos column so it can be used in the model.

In [103]:
dummies = pd.get_dummies(data['Pos'])
data = pd.concat([data,dummies],axis=1)
data

Unnamed: 0,Name,G,PA,HR,R,RBI,SB,BB%,K%,ISO,...,DP,FP,1B,2B,3B,C,CF,LF,RF,SS
0,Barry Bonds,143,612,46,117,110,9,32.4,7.7,0.429,...,0.0,0.968,0,0,0,0,0,1,0,0
1,Barry Bonds,153,664,73,129,137,13,26.7,14.0,0.536,...,1.0,0.977,0,0,0,0,0,1,0,0
2,Barry Bonds,147,617,45,129,101,6,37.6,6.6,0.450,...,0.0,0.983,0,0,0,0,0,1,0,0
3,Mickey Mantle,150,652,52,132,130,10,17.2,15.2,0.353,...,3.0,0.990,0,0,0,0,1,0,0,0
4,Mickey Mantle,144,623,34,121,94,16,23.4,12.0,0.300,...,1.0,0.979,0,0,0,0,1,0,0,0
5,Carl Yastrzemski,161,680,44,112,121,10,13.4,10.1,0.295,...,1.0,0.977,0,0,0,0,0,1,0,0
6,Joe Morgan,146,639,17,107,94,67,20.7,8.1,0.181,...,96.0,0.986,0,1,0,0,0,0,0,0
7,Willie Mays,157,638,52,118,112,9,11.9,11.1,0.328,...,4.0,0.985,0,0,0,0,1,0,0,0
8,Cal Ripken,162,717,34,99,114,6,7.4,6.4,0.243,...,114.0,0.986,0,0,0,0,0,0,0,1
9,Willie Mays,162,706,49,130,141,18,11.0,12.0,0.311,...,1.0,0.991,0,0,0,0,1,0,0,0
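As a smaller illustration of what the dummy encoding produces, here is a sketch on a three-row made-up frame (`demo` and `encoded` are illustrative names).

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["A", "B", "C"], "Pos": ["LF", "CF", "SS"]})

# One 0/1 column per position; column names are the sorted position labels.
dummies_demo = pd.get_dummies(demo["Pos"], dtype=int)

# Replace the text column with its 0/1 encoding.
encoded = pd.concat([demo.drop("Pos", axis=1), dummies_demo], axis=1)
print(encoded)
```

Because every row has exactly one position flag set, the flags always sum to 1 and are collinear with the model's intercept. Passing `drop_first=True` to `pd.get_dummies` is one common way to avoid that, at the cost of making one position the implicit baseline.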


In [104]:
data.drop('Pos', axis=1, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6838 entries, 0 to 8463
Data columns (total 28 columns):
Name     6838 non-null object
G        6838 non-null int64
PA       6838 non-null int64
HR       6838 non-null int64
R        6838 non-null int64
RBI      6838 non-null int64
SB       6838 non-null int64
BB%      6838 non-null float64
K%       6838 non-null float64
ISO      6838 non-null float64
BABIP    6838 non-null float64
AVG      6838 non-null float64
OBP      6838 non-null float64
SLG      6838 non-null float64
WAR      6838 non-null float64
PO       6838 non-null float64
A        6838 non-null float64
E        6838 non-null float64
DP       6838 non-null float64
FP       6838 non-null float64
1B       6838 non-null uint8
2B       6838 non-null uint8
3B       6838 non-null uint8
C        6838 non-null uint8
CF       6838 non-null uint8
LF       6838 non-null uint8
RF       6838 non-null uint8
SS       6838 non-null uint8
dtypes: float64(13), int64(6), object(1), uint8(8)
mem

In [105]:
data.corr()['WAR']

G        0.241014
PA       0.388946
HR       0.514406
R        0.645117
RBI      0.517371
SB       0.141140
BB%      0.434166
K%       0.001372
ISO      0.551283
BABIP    0.404738
AVG      0.595164
OBP      0.701089
SLG      0.676864
WAR      1.000000
PO       0.047437
A       -0.074513
E       -0.097615
DP      -0.081436
FP       0.092585
1B      -0.011571
2B      -0.086370
3B       0.055635
C        0.057482
CF       0.048205
LF       0.040771
RF       0.028537
SS      -0.103833
Name: WAR, dtype: float64

Use RFECV like before to select the best columns to use in our model and then train and test the model.

In [106]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X = data.drop(['WAR', 'Name'], axis=1)
y = data['WAR']
model_2 = LinearRegression()
selector_2 = RFECV(model_2, cv=10)
selector_2.fit(X, y)
best_columns_2 = X.columns[selector_2.support_]
best_columns_2

Index(['G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG',
       'OBP', 'SLG', 'PO', 'A', 'E', 'DP', 'FP', '1B', '2B', '3B', 'C', 'CF',
       'LF', 'RF', 'SS'],
      dtype='object')

In [107]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(data[best_columns_2], data['WAR'], test_size=.25, random_state=3)

model_2.fit(X_train_2, y_train_2)
predictions_2 = model_2.predict(X_test_2)
mse = mean_squared_error(y_test_2, predictions_2)
rmse = mse ** (1/2)
score = model_2.score(X_test_2, y_test_2)
print(mse, rmse, score)

1.1935619976 1.09250263048 0.733248350154
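Both scores so far come from a single 75/25 split, so part of the difference could be down to which rows landed in the test set. A hedged way to check is to compare feature sets with k-fold cross-validation. The sketch below uses synthetic "offense" and "defense" columns, not the notebook's data, and all names in it are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
offense = rng.normal(size=(300, 3))
defense = rng.normal(size=(300, 2))
# Synthetic target that depends on both groups of columns plus noise.
target = (offense @ np.array([1.0, 0.5, 0.8])
          + defense @ np.array([0.6, -0.4])
          + rng.normal(scale=0.3, size=300))

# Five R^2 scores per feature set, one for each held-out fold.
scores_off = cross_val_score(LinearRegression(), offense, target, cv=5, scoring="r2")
scores_all = cross_val_score(LinearRegression(), np.hstack([offense, defense]),
                             target, cv=5, scoring="r2")

print(scores_off.mean(), scores_all.mean())
```

If the fold-by-fold scores for the larger feature set are consistently higher, the improvement is less likely to be an artifact of one particular split.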


With the fielding data included we got a better score, from 0.62 to 0.73. Let's put in some player stats and see what our model predicts their WAR to be and compare to our earlier model that doesn't include defensive stats.

In [108]:
Bryant_17 = [[151, 665, 29, 111, 73, 7, 14.3, 19.2, .242, .334, .295, .409, .537, 72, 260, 18, 20, .949, 0, 0, 1, 0, 0, 0, 0, 0]]

Bryant_WAR = model_2.predict(Bryant_17)
Bryant_WAR

array([ 6.15744879])

In [114]:
Bryant_17_1 = [[151, 29, 111, 73, 7, 14.3, 19.2, .242, .334, .295, .409, .537]]

Bryant_WAR_1 = model_1.predict(Bryant_17_1)
Bryant_WAR_1

array([ 5.95160926])

Fangraphs had Kris Bryant at 6.6 WAR

In [109]:
Machado_16 = [[157, 696, 37, 105, 96, 0, 6.9, 17.2, .239, .309, .294, .343, .533, 86, 236, 7, 23, .979, 0, 0, 1, 0, 0, 0, 0, 0]]

Machado_WAR = model_2.predict(Machado_16)
Machado_WAR

array([ 4.97669483])

In [116]:
Machado_16_1 = [[157, 37, 105, 96, 0, 6.9, 17.2, .239, .309, .294, .343, .533]]

Machado_WAR_1 = model_1.predict(Machado_16_1)
Machado_WAR_1

array([ 4.43189174])

Fangraphs had Manny Machado at 6.2 WAR

In [110]:
Russell_16 = [[151, 598, 21, 67, 95, 5, 9.2, 22.6, .179, .277, .238, .321, .417, 152, 388, 14, 58, .975, 0, 0, 0, 0, 0, 0, 0, 1]]

Russell_WAR = model_2.predict(Russell_16)
Russell_WAR

array([ 2.21946023])

In [117]:
Russell_16_1 = [[151, 21, 67, 95, 5, 9.2, 22.6, .179, .277, .238, .321, .417]]

Russell_WAR_1 = model_1.predict(Russell_16_1)
Russell_WAR_1

array([ 1.44291872])

Fangraphs had Addison Russell at 3.3 WAR

In [111]:
Abreu_14 = [[145, 622, 36, 80, 107, 3, 8.2, 21.1, .264, .356, .317, .383, .581, 970, 69, 6, 100, .994, 1, 0, 0, 0, 0, 0, 0, 0]]

Abreu_WAR = model_2.predict(Abreu_14)
Abreu_WAR

array([ 4.31950772])

In [118]:
Abreu_14_1 = [[145, 36, 80, 107, 3, 8.2, 21.1, .264, .356, .317, .383, .581]]

Abreu_WAR_1 = model_1.predict(Abreu_14_1)
Abreu_WAR_1

array([ 5.16803844])

Fangraphs had Jose Abreu at 5.3 WAR

In [112]:
Alexei_14 = [[145, 506, 6, 38, 48, 8, 4.2, 12.5, .092, .265, .241, .277, .333, 162, 317, 14, 83, .972, 0, 0, 0, 0, 0, 0, 0, 1]]

Alexei_WAR = model_2.predict(Alexei_14)
Alexei_WAR

array([-0.66459271])

In [120]:
Alexei_14_1 = [[145, 6, 38, 48, 8, 4.2, 12.5, .092, .265, .241, .277, .333]]

Alexei_WAR_1 = model_1.predict(Alexei_14_1)
Alexei_WAR_1

array([-0.11853611])

Fangraphs had Alexei Ramirez at -1.5 WAR

In [113]:
Heyward_16 = [[142, 592, 7, 61, 49, 11, 9.1, 15.7, .094, .266, .230, .306, .325, 218, 4, 2, 0, .991, 0, 0, 0, 0, 0, 0, 1, 0]]

Heyward_WAR = model_2.predict(Heyward_16)
Heyward_WAR

array([-0.04303057])

In [122]:
Heyward_16_1 = [[142, 7, 61, 49, 11, 9.1, 15.7, .094, .266, .230, .306, .325]]

Heyward_WAR_1 = model_1.predict(Heyward_16_1)
Heyward_WAR_1

array([ 0.46008741])

Fangraphs had Jason Heyward at 1.0 WAR

My WAR predictor does a pretty good job of assigning defensive value when predicting WAR. The good defenders (Bryant, Machado, and Russell) got bumps in their predicted WAR compared to the previous model, which moved each of them closer to their Fangraphs WAR. The bad defenders (Alexei and Abreu) both saw drops in their predicted WAR: Alexei's prediction moved closer to his true WAR, but Abreu's actually moved further away.
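The six comparisons can be summarized in one small table. The values below are the predictions and Fangraphs figures quoted in the cells above, rounded to two decimals; the frame and column names are just for illustration.

```python
import pandas as pd

compare = pd.DataFrame(
    {
        "model_1": [5.95, 4.43, 1.44, 5.17, -0.12, 0.46],
        "model_2": [6.16, 4.98, 2.22, 4.32, -0.66, -0.04],
        "fangraphs": [6.6, 6.2, 3.3, 5.3, -1.5, 1.0],
    },
    index=["Bryant 2017", "Machado 2016", "Russell 2016",
           "Abreu 2014", "Ramirez 2014", "Heyward 2016"],
)

# Absolute error of each model against the Fangraphs value.
compare["err_1"] = (compare["model_1"] - compare["fangraphs"]).abs()
compare["err_2"] = (compare["model_2"] - compare["fangraphs"]).abs()

# Players for whom the model with fielding data is closer.
model_2_closer = compare.index[compare["err_2"] < compare["err_1"]].tolist()

print(compare.round(2))
print(model_2_closer)
print(compare[["err_1", "err_2"]].mean().round(2))
```

On these six player seasons the second model is closer for four of them, with Abreu and Heyward as the exceptions. Six hand-picked examples are only a spot check, not an evaluation.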

The oddball here is Heyward. He is regarded as the best defensive right fielder in the game today, and possibly one of the best ever. In 2016 he also put up one of the worst offensive seasons in the league. I expected our first model to punish him for his bad offensive stats and our second model to give him a small bump for his elite defense. It came out the opposite way: the first model was about 0.5 WAR below the true value (and surprisingly positive), and the second model pushed him even further away, to about 1.0 WAR below his true value and slightly negative.

Defensive value has been, and still is, one of the hardest things for baseball sabermetricians to measure. Sites like Fangraphs and Baseball Prospectus publish defensive statistics like Defensive Runs Saved and Fielding Runs Above Average to quantify it better, because the typical counting stats like the ones we used (putouts, errors, fielding percentage) don't hold enough information to properly assign defensive value. I think we can see from the Heyward example that while our model is more accurate overall with defensive stats included, it is far from perfect and has some glaring holes in determining defensive value.