### Baseball WAR Predictor ###

Data source: fangraphs.com

The batting data covers the 1950-2017 seasons. The goal is to build a model that predicts a player's WAR (Wins Above Replacement) from basic baseball stats.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
battingStats = pd.read_csv("FanGraphs Leaderboard.csv")
battingStats.head()

,Season,Name,Team,G,PA,HR,R,RBI,SB,BB%,...,AVG,OBP,SLG,wOBA,wRC+,BsR,Off,Def,WAR,playerid
0,2002,Barry Bonds,Giants,143,612,46,117,110,9,32.4 %,...,0.37,0.582,0.799,0.544,244,-1.2,108.9,-2.0,12.7,1109
1,2001,Barry Bonds,Giants,153,664,73,129,137,13,26.7 %,...,0.328,0.515,0.863,0.537,235,1.3,118.0,-12.0,12.5,1109
2,2004,Barry Bonds,Giants,147,617,45,129,101,6,37.6 %,...,0.362,0.609,0.812,0.537,233,-0.3,105.7,-4.4,11.9,1109
3,1956,Mickey Mantle,Yankees,150,652,52,132,130,10,17.2 %,...,0.353,0.464,0.705,0.498,202,2.1,86.4,4.1,11.5,1008082
4,1957,Mickey Mantle,Yankees,144,623,34,121,94,16,23.4 %,...,0.365,0.512,0.665,0.498,217,2.7,88.9,0.2,11.4,1008082


In [3]:
battingStats.columns

Index(['Season', 'Name', 'Team', 'G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%',
       'K%', 'ISO', 'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+', 'BsR', 'Off',
       'Def', 'WAR', 'playerid'],
      dtype='object')

In [4]:
battingStats.dtypes

Season        int64
Name         object
Team         object
G             int64
PA            int64
HR            int64
R             int64
RBI           int64
SB            int64
BB%          object
K%           object
ISO         float64
BABIP       float64
AVG         float64
OBP         float64
SLG         float64
wOBA        float64
wRC+          int64
BsR         float64
Off         float64
Def         float64
WAR         float64
playerid      int64
dtype: object

Dropping columns that are not needed for the model

In [5]:
drop_columns = ["Season", "Name", "Team"]
battingStats = battingStats.drop(drop_columns, axis=1)

In [6]:
battingStats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8464 entries, 0 to 8463
Data columns (total 20 columns):
G           8464 non-null int64
PA          8464 non-null int64
HR          8464 non-null int64
R           8464 non-null int64
RBI         8464 non-null int64
SB          8464 non-null int64
BB%         8464 non-null object
K%          8464 non-null object
ISO         8464 non-null float64
BABIP       8464 non-null float64
AVG         8464 non-null float64
OBP         8464 non-null float64
SLG         8464 non-null float64
wOBA        8464 non-null float64
wRC+        8464 non-null int64
BsR         8464 non-null float64
Off         8464 non-null float64
Def         8464 non-null float64
WAR         8464 non-null float64
playerid    8464 non-null int64
dtypes: float64(10), int64(8), object(2)
memory usage: 1.3+ MB


Removing the '%' character from the BB% and K% columns and converting them to floats so they can be used in the model.

In [7]:
battingStats['BB%'] = battingStats['BB%'].map(lambda x: x.strip('%')).astype(float)
battingStats['K%'] = battingStats['K%'].map(lambda x: x.strip('%')).astype(float)
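
The same cleanup can be written with pandas' vectorized string methods. The sketch below is a minimal illustration on a made-up two-row frame (the `sample` frame, its K% values, and the `percent_to_float` helper are all hypothetical, not part of the notebook); it assumes the raw values look like `"32.4 %"`, as in the `head()` output above.

```python
import pandas as pd

# Hypothetical sample that mimics the raw export, where percentages
# arrive as strings such as "32.4 %". The K% values are illustrative only.
sample = pd.DataFrame({"BB%": ["32.4 %", "26.7 %"],
                       "K%": ["7.7 %", "14.0 %"]})

def percent_to_float(col: pd.Series) -> pd.Series:
    """Drop the trailing '%' and any surrounding whitespace, then cast to float."""
    return col.str.rstrip("%").str.strip().astype(float)

for name in ["BB%", "K%"]:
    sample[name] = percent_to_float(sample[name])

print(sample.dtypes)
```

Stripping the whitespace explicitly makes the conversion independent of how the cast treats the space left behind after the `%` is removed.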

In [8]:
battingStats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8464 entries, 0 to 8463
Data columns (total 20 columns):
G           8464 non-null int64
PA          8464 non-null int64
HR          8464 non-null int64
R           8464 non-null int64
RBI         8464 non-null int64
SB          8464 non-null int64
BB%         8464 non-null float64
K%          8464 non-null float64
ISO         8464 non-null float64
BABIP       8464 non-null float64
AVG         8464 non-null float64
OBP         8464 non-null float64
SLG         8464 non-null float64
wOBA        8464 non-null float64
wRC+        8464 non-null int64
BsR         8464 non-null float64
Off         8464 non-null float64
Def         8464 non-null float64
WAR         8464 non-null float64
playerid    8464 non-null int64
dtypes: float64(12), int64(8)
memory usage: 1.3 MB


In [9]:
battingStats.corr()['WAR']

G           0.245351
PA          0.394911
HR          0.496875
R           0.651702
RBI         0.501000
SB          0.147971
BB%         0.418101
K%         -0.010551
ISO         0.533763
BABIP       0.396889
AVG         0.584607
OBP         0.689709
SLG         0.663357
wOBA        0.756209
wRC+        0.796861
BsR         0.219377
Off         0.836402
Def         0.336213
WAR         1.000000
playerid   -0.032687
Name: WAR, dtype: float64

Dropping columns that either carry no useful information (playerid) or are stats that FanGraphs calculates itself. We only want to keep the more popular, easily available stats, not ones that require special calculations.

In [10]:
drop_columns = ['playerid', 'Def', 'Off', 'BsR', 'wRC+', 'wOBA']
data = battingStats.drop(drop_columns, axis=1)

Use RFECV (recursive feature elimination with cross-validation) to pull out the best features for our model.

In [11]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X = data.drop('WAR', axis=1)
y = data['WAR']
model = LinearRegression()
selector = RFECV(model, cv=10)
selector.fit(X, y)
best_columns = X.columns[selector.support_]
best_columns

Index(['G', 'HR', 'R', 'RBI', 'SB', 'BB%', 'K%', 'ISO', 'BABIP', 'AVG', 'OBP',
       'SLG'],
      dtype='object')
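
In the output above, PA is the only candidate feature that RFECV dropped. To see how the selector reports its decision, the sketch below runs RFECV on synthetic data (the `make_regression` data and the `_demo` names are stand-ins, not the batting data) and prints the fitted attributes that describe which features were kept.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the batting data: 12 features, only 4 informative.
X_demo, y_demo = make_regression(n_samples=300, n_features=12,
                                 n_informative=4, noise=5.0, random_state=0)

selector_demo = RFECV(LinearRegression(), cv=5)
selector_demo.fit(X_demo, y_demo)

print(selector_demo.n_features_)  # how many features were kept
print(selector_demo.support_)     # boolean mask, True for kept features
print(selector_demo.ranking_)     # rank 1 marks a kept feature
```

The `support_` mask is what the notebook uses to index `X.columns`; `ranking_` additionally shows the order in which the rejected features were eliminated.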

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(data[best_columns], data['WAR'], test_size=.25, random_state=3)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** (1/2)  # root mean squared error, in units of WAR
score = model.score(X_test, y_test)  # R^2 on the held-out rows
print(mse, rmse, score)


1.62209013924 1.2736130257 0.624604992425
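
The three numbers printed above are the mean squared error, its square root (RMSE), and the R^2 returned by `model.score`. As a rough illustration of what each one measures, the sketch below computes them by hand with NumPy on four made-up values (`y_true` and `y_pred` are illustrative, not rows from the test set).

```python
import numpy as np

# Illustrative values only, not taken from the notebook's test set.
y_true = np.array([3.0, 1.0, 4.0, 2.0])
y_pred = np.array([2.5, 1.5, 3.0, 2.0])

# Mean squared error: the average squared residual.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: same information, but back in the units of the target (WAR).
rmse = np.sqrt(mse)

# R^2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

Read this way, an RMSE of about 1.27 means a typical prediction misses by a bit more than one win, and an R^2 of about 0.62 means the model explains roughly 62% of the variation in WAR on the held-out rows.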


In [25]:
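# Values must follow the order of best_columns:
# G, HR, R, RBI, SB, BB%, K%, ISO, BABIP, AVG, OBP, SLG
Baez = [[142, 14, 50, 59, 12, 3.3, 24.0, .150, .336, .273, .314, .423],  # 2016
        [145, 23, 75, 75, 10, 5.9, 28.3, .207, .345, .279, .317, .480]]  # 2017

Baez_WAR = model.predict(Baez)
Baez_WAR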

array([ 1.41923991,  2.16682954])

I used my favorite baseball player, Javier Baez, and put in his stats from 2016 and 2017. Our model predicts 1.4 and 2.2 WAR, respectively, while FanGraphs has him at 2.2 and 2.3. Our model doesn't take defense into account, which probably explains most if not all of the discrepancy between the two sets of values. Overall, not too bad for a model that uses only basic baseball stats.
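
To put the two misses in context, the short sketch below compares each prediction error with the test-set RMSE printed earlier. All numbers are copied from the outputs and text above; only the variable names are new.

```python
# Values copied from the notebook outputs and the comparison above.
predicted = {2016: 1.41923991, 2017: 2.16682954}
fangraphs = {2016: 2.2, 2017: 2.3}
rmse = 1.2736130257  # test-set RMSE printed earlier

errors = {}
for season in predicted:
    # Positive error means the model under-predicted FanGraphs' WAR.
    errors[season] = fangraphs[season] - predicted[season]
    print(season, round(errors[season], 2), abs(errors[season]) < rmse)
```

Both errors are smaller than the model's typical test-set error, and both are under-predictions, which is consistent with a model that has no defensive inputs.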