## Expected Batting Average

### About

- This project aims to replicate the MLB Statcast advanced stat "xBA" (expected batting average)

### Goals

1) Use supervised ML to predict whether a batted ball is a hit or an out based on:
    - Pitch velo
    - Exit velo
    - Launch angle
    - Hit location
    - Hit distance (in case HRs aren't given their own hit location?)
    - Batter speed (maybe)
    
2) Find the *expected* batting average for a batter given the above parameters for each batted ball in play 

3) Compare results with Statcast's xBA results
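
Goal 2 boils down to averaging per-ball hit probabilities for each batter. A minimal sketch, assuming a classifier has already produced a probability for every batted ball; the `batter` and `hit_prob` column names and all numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical batted balls: one row per ball in play, with a model-assigned
# hit probability (hit_prob) and the actual outcome (hit).
batted_balls = pd.DataFrame({
    "batter": ["A", "A", "A", "B", "B"],
    "hit_prob": [0.9, 0.1, 0.5, 0.2, 0.4],
    "hit": [1, 0, 1, 0, 0],
})

# Expected BA = mean predicted probability; actual BA = mean outcome.
summary = batted_balls.groupby("batter").agg(
    xba=("hit_prob", "mean"),
    ba=("hit", "mean"),
    n=("hit", "size"),
)
print(summary)
```

Putting `xba` next to `ba` in one table also gives a natural starting point for goal 3, since Statcast's published xBA could be joined on the same batter key.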

### Data

- Data gathered from Baseball Savant (Statcast) search
- Example search query to get all (?) batted balls resulting in hits or outs in 2018 (the event filter also includes strikeouts)
    - https://baseballsavant.mlb.com/statcast_search?hfPT=&hfAB=single%7Cdouble%7Ctriple%7Chome%5C.%5C.run%7Cfield%5C.%5C.out%7Cstrikeout%7Cstrikeout%5C.%5C.double%5C.%5C.play%7Cdouble%5C.%5C.play%7Cgrounded%5C.%5C.into%5C.%5C.double%5C.%5C.play%7Cfielders%5C.%5C.choice%7Cfielders%5C.%5C.choice%5C.%5C.out%7Cforce%5C.%5C.out%7Csac%5C.%5C.bunt%7Csac%5C.%5C.bunt%5C.%5C.double%5C.%5C.play%7Csac%5C.%5C.fly%7Csac%5C.%5C.fly%5C.%5C.double%5C.%5C.play%7Ctriple%5C.%5C.play%7C&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2018%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_pas=0#results
    - This seems to return a maximum of 40,000 results
- Data reference
    - https://baseballsavant.mlb.com/csv-docs
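
Given the apparent 40,000-row cap, one possible workaround is to run the search over several narrower date ranges (the query URL exposes `game_date_gt` and `game_date_lt` parameters) and stitch the CSV exports together. A rough sketch, using two tiny in-memory stand-ins for the exports; the columns and values here are hypothetical:

```python
import io
import pandas as pd

# In-memory stand-ins for two CSV exports covering adjacent date ranges.
# The row dated 2018-04-30 appears in both to mimic an overlapping boundary.
export_a = io.StringIO(
    "game_date,launch_speed,hit\n"
    "2018-04-01,101.2,1\n"
    "2018-04-30,88.0,0\n"
)
export_b = io.StringIO(
    "game_date,launch_speed,hit\n"
    "2018-04-30,88.0,0\n"
    "2018-05-02,95.5,1\n"
)

frames = [pd.read_csv(f) for f in (export_a, export_b)]

# Stack the exports and drop rows that are identical in every column.
combined = (
    pd.concat(frames, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(len(combined))
```

Dropping duplicates across all columns is only safe when the export carries enough columns to tell two distinct batted balls apart; with a real export it would be safer to deduplicate on whatever identifying columns the CSV docs describe.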

### Notes

- My plan is to use 2018 results for the training/test sets, then use the fitted model to determine 2019 xBA results
    - Need to think more about whether this is the right strategy
- Is it possible to get spray chart info for this?
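- Having `hc_x` and `hc_y` as two separate features doesn't really tell us much on their own; we need to combine the two into a single vector quantity:
    - $ hc = \sqrt{hc_x^2 + hc_y^2} \, \tan(hc_y / hc_x) $
    - ATTN: getting weird values here, need to plot to see if it makes sense. A likely cause is that $\tan$ blows up whenever $hc_y / hc_x$ nears an odd multiple of $\pi/2$; the polar angle of a point is $\arctan(hc_y / hc_x)$, not $\tan(hc_y / hc_x)$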
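
One way to sanity-check the hit-location idea is to keep distance and angle as two separate polar features rather than multiplying them. The sketch below is an assumption, not the notebook's method: it reuses the `125.42` and `198.27` offsets that appear later in the code, measures the angle from the shifted y axis with `np.arctan2`, and uses a hypothetical helper name `to_polar`:

```python
import numpy as np

def to_polar(hc_x, hc_y):
    """Convert raw hit coordinates to (distance, spray angle in degrees).

    Assumes home plate sits at (125.42, 198.27) in raw coordinates and
    that the shifted y axis points away from home plate.
    """
    x = np.asarray(hc_x, dtype=float) - 125.42
    y = 198.27 - np.asarray(hc_y, dtype=float)
    distance = np.hypot(x, y)
    # arctan2 handles every quadrant and never divides by zero
    spray_angle = np.degrees(np.arctan2(x, y))
    return distance, spray_angle

# A ball hit straight along the shifted y axis, and one hit 45 degrees off it
dist, angle = to_polar([125.42, 225.42], [98.27, 98.27])
print(dist, angle)
```

Unlike `tan`, `arctan2` stays bounded, so the angle cannot produce the extreme values described in the note above.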

In [43]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

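def hc(x, y):
    # Combine hit coordinates into one value: radial distance times tan(y/x).
    # NOTE: tan(y/x) is not the polar angle (that would be np.arctan2(y, x)),
    # which is a likely source of the weird values flagged in the notes.
    return np.sqrt(x**2 + y**2) * np.tan(y / x)


data = pd.read_csv('hits_outs_2018.csv')

# raw hit coordinates
x = data['hc_x']
y = data['hc_y']

# shift x so it starts at 0 and flip the y axis
xt = data['hc_x'] - data['hc_x'].min()
yt = data['hc_y'].max() - data['hc_y']

# store the shifted coordinates truncated to whole numbers (NaN rows stay NaN)
data['hc_xt'] = xt[xt.notnull()].apply(int)
data['hc_yt'] = yt[yt.notnull()].apply(int)

# left panel: raw coordinates; right panel: shifted coordinates (assumed intent)
fig, ax = plt.subplots(1, 2, figsize=(18, 6))

ax[0].scatter(x, y, s=0.01)
ax[1].scatter(xt, yt, s=0.01)

# fig = plt.figure(figsize=(9,6))
# ax = fig.add_subplot(111, projection='3d')
# ax.scatter(xs=x, ys=y, zs=y2, s=0.1)

# x = data['hc_x'] - 125.42
# y = 198.27 - data['hc_y']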

# Note: HR and ground rule doubles have hit_location=NaN, change to over_fence
# data['events'][(data['hit_location'].isnull()) & (data['hit'] == 1)].value_counts()
data['over_fence'] = (data['hit_location'].isnull()) & (data['hit'] == 1)

# get *actual* hit location in polar coordinates
# see https://baseballwithr.wordpress.com/2018/01/15/chance-of-hit-as-function-of-launch-angle-exit-velocity-and-spray-angle/
# for info about the transformations hc_x - 125.42, y - 198.27
data['hc'] = hc(data['hc_x'] - 125.42, 198.27 - data['hc_y'])

positions = ['pos_1', 'pos_2', 'pos_3', 'pos_4', 'pos_5', 'pos_6', 'pos_7', 'pos_8', 'pos_9']

# split hit_location into one boolean feature per fielder position (1-9)
for i, pos in enumerate(positions, start=1):
    data[pos] = data['hit_location'] == i

data = data[
    [
        'release_speed', 'release_spin_rate', 'hc_x', 'hc_y',
        'launch_speed', 'launch_angle', 'hit_distance_sc',
        'pos_1', 'pos_2', 'pos_3', 'pos_4', 'pos_5', 'pos_6',
        'pos_7', 'pos_8', 'pos_9', 'over_fence',
        'hit'
    ]
].dropna()

X = data.drop('hit', axis=1)
y = data['hit']

# scale each feature to the range [0, 1]
# note: the scaler is fit on all rows, before the train/test split
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
# X = pd.DataFrame(X, columns=data.drop('hit', axis=1).columns)

X.describe()

In [44]:
# split up data into 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 13
)

clf = LinearSVC().fit(X_train, y_train)

print('Accuracy of Linear SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
print('\nCoefficients: ')
print(clf.coef_)

Accuracy of Linear SVC classifier on training set: 0.76
Accuracy of Linear SVC classifier on test set: 0.76

Coefficients: 
[[-0.15965515 -0.031682    0.24363443  0.10114727 -1.95162981 -0.33318871
   0.05866753  0.56513669 -0.12123883  0.04026703  0.03296147  0.09360727
   1.25817671  1.19563104  1.23614443  2.27906772]]


In [25]:
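# z = [release_speed, release_spin_rate, launch_speed, launch_angle, hit_distance, pos_1, ... , pos_9, over_fence]
# NOTE: z holds 15 scaled values; it must have one value per column of X,
# in the same order, for whichever classifier it is passed to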
z = [0.8, 0.8, 0.7, 0.9, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
clf.predict([z])

array([1])
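
`LinearSVC.predict` returns only a hard 0/1 label, while an expected batting average needs a hit probability per batted ball. One option, sketched here as an assumption rather than as the project's chosen method, is to wrap the classifier in scikit-learn's `CalibratedClassifierCV`, which adds `predict_proba`. The data below is synthetic; `X_demo`, `y_demo` and the rule that generates the labels are purely illustrative:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the scaled feature matrix and the hit/out labels
rng = np.random.default_rng(13)
X_demo = rng.random((400, 3))
y_demo = (X_demo[:, 0] + 0.2 * rng.standard_normal(400) > 0.5).astype(int)

# Calibrate the SVC's decision scores into probabilities via 3-fold CV
calibrated = CalibratedClassifierCV(LinearSVC(), cv=3)
calibrated.fit(X_demo, y_demo)

# Column 1 of predict_proba is the probability of class 1 (a hit)
hit_prob = calibrated.predict_proba(X_demo)[:, 1]

# Averaging the probabilities gives an expected batting average for this sample
xba_demo = hit_prob.mean()
print(round(xba_demo, 3))
```

A model with native probabilities, such as logistic regression, would be another reasonable choice; the point of the sketch is only that the averaged quantity should be a probability rather than a label.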