## Expected Batting Average

### About

- This project aims to replicate MLB Statcast's advanced stat "xBA" (expected batting average)

### Goals

1) Use supervised ML to predict whether a batted ball is a hit or an out based on:
    - Pitch velo
    - Exit velo
    - Launch angle
    - Hit location
    - Hit distance (in case HRs aren't given their own hit location?)
    - Batter speed (maybe)
    
2) Find the *expected* batting average for a batter given the above parameters for each batted ball in play 

3) Compare results with Statcast's xBA results
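
Goal 2 can be sketched as a per-batter average of predicted hit probabilities. This is a minimal illustration rather than Statcast's actual method: the column names `batter` and `hit_prob` are assumptions, `hit_prob` stands in for whatever probability the model from goal 1 produces, and the toy numbers are made up. Only batted balls are counted here, so strikeouts are not part of the denominator.

```python
import pandas as pd

# Toy batted-ball table. `batter` and `hit_prob` are assumed column names;
# `hit_prob` stands in for a model's predicted probability of a hit.
batted_balls = pd.DataFrame({
    "batter": [1, 1, 1, 2, 2],
    "hit_prob": [0.9, 0.1, 0.5, 0.2, 0.4],
    "hit": [1, 0, 1, 0, 0],
})


def expected_batting_average(df):
    """Average predicted hit probability per batter, next to the observed hit rate."""
    grouped = df.groupby("batter")
    return pd.DataFrame({
        "xba": grouped["hit_prob"].mean(),
        "observed": grouped["hit"].mean(),
        "batted_balls": grouped.size(),
    })


summary = expected_batting_average(batted_balls)
print(summary)
```

Keeping the observed rate and the sample size next to `xba` makes it easier to see which batters have too few batted balls for the comparison in goal 3 to mean much.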

### Data

- Data gathered from Baseball Savant (Statcast) search
- Example search query to get all (?) batted balls resulting in hits or outs in 2018
    - https://baseballsavant.mlb.com/statcast_search?hfPT=&hfAB=single%7Cdouble%7Ctriple%7Chome%5C.%5C.run%7Cfield%5C.%5C.out%7Cstrikeout%7Cstrikeout%5C.%5C.double%5C.%5C.play%7Cdouble%5C.%5C.play%7Cgrounded%5C.%5C.into%5C.%5C.double%5C.%5C.play%7Cfielders%5C.%5C.choice%7Cfielders%5C.%5C.choice%5C.%5C.out%7Cforce%5C.%5C.out%7Csac%5C.%5C.bunt%7Csac%5C.%5C.bunt%5C.%5C.double%5C.%5C.play%7Csac%5C.%5C.fly%7Csac%5C.%5C.fly%5C.%5C.double%5C.%5C.play%7Ctriple%5C.%5C.play%7C&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2018%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_pas=0#results
    - This seems to return a maximum of 40,000 results
- Data reference
    - https://baseballsavant.mlb.com/csv-docs
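
The notebook reads a `hit` column from `hits_outs_2018.csv`, but how that column was built is not shown. One possible way to derive it from a raw export is sketched below. It assumes the export has an `events` column whose values use underscores (`home_run`, `field_out`), as the search URL suggests; `HIT_EVENTS` and `add_hit_label` are hypothetical names.

```python
import pandas as pd

# Assumption: the export has an `events` column with values such as
# "single" or "field_out". These four events are treated as hits.
HIT_EVENTS = {"single", "double", "triple", "home_run"}


def add_hit_label(df):
    """Return a copy of df with a binary `hit` column derived from `events`."""
    out = df.copy()
    out["hit"] = out["events"].isin(HIT_EVENTS).astype(int)
    return out


sample = pd.DataFrame({"events": ["single", "field_out", "home_run", "strikeout"]})
labeled = add_hit_label(sample)
print(labeled)
```

Strikeouts are labeled 0 here. If they carry no batted-ball measurements, a later `dropna()` on the feature columns would remove them.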

### Notes

- My plan is to use 2018 results in the training/test sets to determine 2019 xBA results
    - Need to think more about whether this is the right strategy
- Is it possible to get spray chart info for this?
- Having hc_x and hc_y as two separate features doesn't really tell us much; we need to combine the two into a single vector (distance and direction):
    - $ r = \sqrt{hc_x^2 + hc_y^2}, \quad \theta = \arctan(hc_y / hc_x) $
    - ATTN: getting weird values here, need to plot to see if it makes sense
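
One way to turn the hit coordinates into a distance and a direction is sketched below. It measures both from home plate rather than from the coordinate origin, which may be part of why the raw values looked odd. The home-plate location (`HOME_X`, `HOME_Y`) is an assumed value that should be checked against a plot, and `add_polar_location`, `hc_r`, and `hc_theta` are hypothetical names. `np.arctan2` is used instead of a plain arctangent of the ratio so that a zero denominator and the sign of each component are handled correctly.

```python
import numpy as np
import pandas as pd

# Assumed home-plate location in hc_x / hc_y units; verify against a plot.
HOME_X, HOME_Y = 125.42, 198.27


def add_polar_location(df):
    """Add distance (hc_r) and spray angle in degrees (hc_theta) from home plate."""
    out = df.copy()
    dx = out["hc_x"] - HOME_X   # horizontal offset from home plate
    dy = HOME_Y - out["hc_y"]   # assumes hc_y shrinks as the ball travels farther out
    out["hc_r"] = np.hypot(dx, dy)
    # Angle measured from the line through home plate and center field (0 degrees).
    out["hc_theta"] = np.degrees(np.arctan2(dx, dy))
    return out


demo = pd.DataFrame({
    "hc_x": [125.42, 225.42, 175.42],
    "hc_y": [98.27, 198.27, 148.27],
})
polar = add_polar_location(demo)
print(polar)
```

Plotting `hc_theta` against `hc_r` for a sample of batted balls is a quick check that the assumed origin is sensible: fair balls should fall roughly between -45 and 45 degrees.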

In [141]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

data = pd.read_csv('hits_outs_2018.csv')

data = data[
    [
        'launch_speed', 'launch_angle', 'hc_x', 'hc_y',
        'release_speed', 'release_spin_rate', 
        'hit'
    ]
].dropna()

# shift both coordinates so they start at 0 (y is flipped), then truncate to integers
xi = data['hc_x'] - data['hc_x'].min()
yi = data['hc_y'].max() - data['hc_y']
data['hc_xi'] = xi.astype(int)
data['hc_yi'] = yi.astype(int)

# data.plot(x='hc_xi', y='hc_yi', kind='scatter', s=0.01)

# map to an a x b grid with cell ids 0 .. a*b - 1 (cell id = hc_xi * b + hc_yi);
# size the grid from the largest index so gaps in the observed values cannot cause an IndexError
a, b = data['hc_xi'].max() + 1, data['hc_yi'].max() + 1
n, p = np.ogrid[:a*b:b, :b]
grid = np.add(n, p)

# assign values from grid to corresponding (hc_xi, hc_yi) pairs
data['hc_grid'] = grid[data['hc_xi'], data['hc_yi']]
# data['hc_grid'].value_counts()

data = data.drop(['hc_x', 'hc_y', 'hc_xi', 'hc_yi'], axis=1)

X = data.drop('hit', axis=1)
y = data['hit']

# scale each feature to [0, 1], keeping the original column names and row index
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
# X = pd.DataFrame(X, columns=data.drop('hit', axis=1).columns)

X.describe()

| | launch_speed | launch_angle | release_speed | release_spin_rate | hc_grid |
|---|---|---|---|---|---|
| count | 68379.0 | 68379.0 | 68379.0 | 68379.0 | 68379.0 |
| mean | 0.725026 | 0.560883 | 0.711766 | 0.555811 | 0.501749 |
| std | 0.12604 | 0.123617 | 0.117427 | 0.097509 | 0.191703 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.653915 | 0.5 | 0.633745 | 0.509428 | 0.366673 |
| 50% | 0.753559 | 0.568182 | 0.738683 | 0.561521 | 0.497216 |
| 75% | 0.819395 | 0.630682 | 0.800412 | 0.612336 | 0.64053 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [142]:
# split up data into 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 13
)

clf = LinearSVC().fit(X_train, y_train)

print('Accuracy of Linear SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
print('\nCoefficients: ')
print(clf.coef_)

Accuracy of Linear SVC classifier on training set: 0.60
Accuracy of Linear SVC classifier on test set: 0.60

Coefficients: 
[[ 1.65632792 -0.38213331 -0.14642394 -0.05803376 -0.09837637]]
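
`LinearSVC` returns class labels and has no `predict_proba`, while an expected batting average needs a hit probability for each batted ball. The sketch below swaps in logistic regression, which is a substitution and not the model used above, and compares it with a majority-class baseline so an accuracy figure has something to be judged against. The data is synthetic and the `*_demo` names are placeholders. Putting the scaler inside a pipeline means it is fit on the training split only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and labels (not Statcast data).
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=2000) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=13)

# Baseline: always predict the most frequent class in the training split.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Scaler and classifier are fit together on the training split only.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of class 1 (a hit).
hit_prob = model.predict_proba(X_te)[:, 1]

print('Baseline accuracy: {:.2f}'.format(baseline.score(X_te, y_te)))
print('Logistic regression accuracy: {:.2f}'.format(model.score(X_te, y_te)))
print('Mean predicted hit probability: {:.3f}'.format(hit_prob.mean()))
```

If the real classifier scores close to the baseline, it is mostly predicting the more common class, which would be worth checking before computing any per-batter averages.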


In [25]:
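# z = [release_speed, release_spin_rate, launch_speed, launch_angle, hit_distance, pos_1, ... , pos_9, over_fence]
# NOTE: z has 15 entries, matching an earlier feature set; the classifier fit above takes 5 features
# (launch_speed, launch_angle, release_speed, release_spin_rate, hc_grid)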
z = [0.8, 0.8, 0.7, 0.9, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
clf.predict([z])

array([1])
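
A hand-built input has to contain the same features, in the same order, as the matrix the classifier was trained on. A small helper along these lines can catch a mismatch before `predict` is called. The column list is taken from the `X.describe()` output above; `make_input_row` and the example values are hypothetical, and the values are assumed to be already scaled to [0, 1].

```python
import pandas as pd

# Training columns, in the order shown by X.describe() above.
feature_columns = [
    "launch_speed", "launch_angle", "release_speed",
    "release_spin_rate", "hc_grid",
]


def make_input_row(values, columns):
    """Build a one-row frame in training column order; fail loudly on a mismatch."""
    missing = set(columns) - set(values)
    extra = set(values) - set(columns)
    if missing or extra:
        raise ValueError(
            "missing: {}, unexpected: {}".format(sorted(missing), sorted(extra))
        )
    return pd.DataFrame([values])[list(columns)]


# Made-up, already-scaled values, given in a different order on purpose.
row = make_input_row(
    {
        "release_speed": 0.8,
        "release_spin_rate": 0.8,
        "launch_speed": 0.7,
        "launch_angle": 0.9,
        "hc_grid": 0.6,
    },
    feature_columns,
)
print(row)
```

With the classifier from the cells above, `clf.predict(row)` would then receive five columns in the expected order.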