## Expected Batting Average

### About

- This project aims to replicate the MLB Statcast advanced stat "xBA" (expected batting average)

### Goals

1) Use supervised ML to predict whether a batted ball is a hit or an out based on:
    - Pitch velo
    - Exit velo
    - Launch angle
    - Hit location
    - Hit distance (in case HRs aren't given their own hit location?)
    - Batter speed (maybe)
    
2) Find the *expected* batting average for a batter given the above parameters for each batted ball in play 

3) Compare results with statcast's xBA results
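
One possible reading of goal 2, sketched under assumptions: if the classifier can output a hit probability for each batted ball, the expected batting average is the mean of those probabilities. The sketch below uses `LogisticRegression` (not the `LinearSVC` used later in this notebook, which has no `predict_proba`) on made-up data; `expected_batting_average`, `X_demo`, and `y_demo` are hypothetical names, and nothing here is claimed to match how Statcast computes xBA.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# synthetic stand-in for scaled batted-ball features (NOT real Statcast data)
X_demo = rng.random((200, 3))
# synthetic labels: a "hit" is more likely when the first feature is large
y_demo = (X_demo[:, 0] + 0.2 * rng.standard_normal(200)) > 0.6

model = LogisticRegression().fit(X_demo, y_demo)

def expected_batting_average(model, X):
    """Mean predicted hit probability over a set of batted balls."""
    # find the probability column that corresponds to the class True (= hit)
    hit_col = list(model.classes_).index(True)
    return model.predict_proba(X)[:, hit_col].mean()

# treat the first 50 rows as one batter's batted balls
xba_demo = expected_batting_average(model, X_demo[:50])
print('Expected batting average (synthetic): {:.3f}'.format(xba_demo))
```

Averaging probabilities keeps the information in borderline batted balls, whereas counting hard hit/out predictions rounds each one to 0 or 1.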

### Data

- Data gathered from Baseball Savant (Statcast) search
- Example search query to get all (?) 2018 regular-season plate appearances that ended in a hit or an out (the event filter includes strikeouts as well as batted balls)
    - https://baseballsavant.mlb.com/statcast_search?hfPT=&hfAB=single%7Cdouble%7Ctriple%7Chome%5C.%5C.run%7Cfield%5C.%5C.out%7Cstrikeout%7Cstrikeout%5C.%5C.double%5C.%5C.play%7Cdouble%5C.%5C.play%7Cgrounded%5C.%5C.into%5C.%5C.double%5C.%5C.play%7Cfielders%5C.%5C.choice%7Cfielders%5C.%5C.choice%5C.%5C.out%7Cforce%5C.%5C.out%7Csac%5C.%5C.bunt%7Csac%5C.%5C.bunt%5C.%5C.double%5C.%5C.play%7Csac%5C.%5C.fly%7Csac%5C.%5C.fly%5C.%5C.double%5C.%5C.play%7Ctriple%5C.%5C.play%7C&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2018%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_pas=0#results
    - seems like this returns a maximum of 40,000 results
- Data reference
    - https://baseballsavant.mlb.com/csv-docs
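
If the 40,000-row cap is real, one workaround is to run the search several times over narrower date ranges (the URL above already carries empty `game_date_gt` / `game_date_lt` parameters), save each CSV, and stitch them together. A minimal sketch, assuming each download is an ordinary CSV with identical columns; `combine_chunks` is a hypothetical helper and the two in-memory "files" are made-up stand-ins, not real exports.

```python
import io
import pandas as pd

def combine_chunks(sources):
    """Concatenate several Statcast CSV exports and drop rows that appear twice."""
    frames = [pd.read_csv(src) for src in sources]
    combined = pd.concat(frames, ignore_index=True)
    # overlapping date ranges would download the same row more than once
    return combined.drop_duplicates().reset_index(drop=True)

# in-memory stand-ins for two hypothetical downloads with one overlapping day
chunk_a = io.StringIO(
    "game_date,events,launch_speed\n"
    "2018-04-01,single,95.1\n"
    "2018-04-02,field_out,88.0\n"
)
chunk_b = io.StringIO(
    "game_date,events,launch_speed\n"
    "2018-04-02,field_out,88.0\n"
    "2018-04-03,double,101.3\n"
)

combined = combine_chunks([chunk_a, chunk_b])
print(combined.shape)
```

With real downloads, `sources` would be a list of file paths instead of `StringIO` objects.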

### Notes

- My plan is to use 2018 results in the training/test sets to determine 2019 xBA results
    - Need to think more about if this is the right strategy
- Is it possible to get spray chart info for this?
- Having hc_x and hc_y as two separate features doesn't really tell us much; we need the combination of the two as a vector in polar form (distance and angle, both measured from home plate):
    - $ r = \sqrt{hc_x^2 + hc_y^2}, \quad \theta = \arctan(hc_x / hc_y) $
    - ATTN: getting weird values here, need to plot to see if it makes sense
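
A quick way to sanity-check the polar conversion is to run it on a few points where the answer is obvious. The sketch below reuses the home-plate offsets from `spray_angle()` in this notebook (125.42 and 198.27); `to_polar`, `hc_r`, and `hc_theta` are hypothetical names, and the direction of the x axis (larger `hc_x` toward right field) is an assumption. It uses `np.arctan2` rather than `np.arctan` so that a zero `hc_y` does not cause a division by zero.

```python
import numpy as np
import pandas as pd

# home-plate offsets in hc_x/hc_y units, taken from spray_angle() in this notebook
HOME_X, HOME_Y = 125.42, 198.27

def to_polar(df):
    """Return a copy of df with hc_r (distance) and hc_theta (degrees) columns."""
    out = df.copy()
    x = out['hc_x'] - HOME_X      # assumed: positive toward the right-field side
    y = HOME_Y - out['hc_y']      # positive toward center field
    out['hc_r'] = np.hypot(x, y)
    # arctan2(x, y) measures the angle from the y axis, so 0 is straight up the middle
    out['hc_theta'] = np.degrees(np.arctan2(x, y))
    return out

# three made-up points: up the middle, 45 degrees right, 45 degrees left
demo = pd.DataFrame({
    'hc_x': [125.42, 225.42, 25.42],
    'hc_y': [98.27, 98.27, 98.27],
})
polar = to_polar(demo)
print(polar[['hc_r', 'hc_theta']])
```

If real data gives angles well outside roughly -45 to 45 degrees for fair balls, the offsets or the axis orientation are probably off.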

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

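def spray_angle(df):
    '''
        Calculate spray angle (in degrees) from hc_x and hc_y in statcast csv output
        Note: shifts hc_x and hc_y in place so that home plate is the origin
    '''
    # make home plate (0,0)
    df['hc_x'] = df['hc_x'] - 125.42
    df['hc_y'] = 198.27 - df['hc_y']
    # debug: show events that have no hit coordinates; these rows are dropped later by dropna()
    print(df['events'][df['hc_x'].isnull()])
    # angle in radians, measured from the line running from home plate toward center field
    df['spray_angle'] = np.arctan(df['hc_x']/df['hc_y'])
    # mirror left-handed batters so the sign means the same pull/opposite-field direction for everyone
    df.loc[df['stand'] == 'L', 'spray_angle'] = -df.loc[df['stand'] == 'L', 'spray_angle']
    # convert to degrees
    df['spray_angle'] = np.rad2deg(df['spray_angle'])
    return df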

def hit_positions(df):
    '''
        Separate hit_location values into individual (one-hot) columns
        Requires the 'hit' column, so call this after it has been added
    '''
    # list for pos column names
    positions = ['pos_1', 'pos_2', 'pos_3', 'pos_4', 'pos_5', 'pos_6', 'pos_7', 'pos_8', 'pos_9']
    # Note: HR and ground rule doubles have hit_location=NaN, change to over_fence
    df['over_fence'] = (df['hit_location'].isnull()) & (df['hit'] == 1)
    # pos_i is True when the ball was fielded by position i
    for i, pos in enumerate(positions, start=1):
        df[pos] = df['hit_location'] == i
    return df
    
def pre_process(df):
    '''
          Process dataframe for SVC, break df into features (X) and outcomes (y)
    '''
    # define features
    positions = ['pos_1', 'pos_2', 'pos_3', 'pos_4', 'pos_5', 'pos_6', 'pos_7', 'pos_8', 'pos_9']
    features = [
            'launch_speed', 'launch_angle', 'spray_angle', 'hc_x', 'hc_y',
            'release_speed', 'release_spin_rate', 'over_fence', *positions
    ]
    # add outcome column: True for any kind of hit
    df['hit'] = df['events'].isin(['single', 'double', 'triple', 'home_run'])
    # process df
    df = spray_angle(df)
    df = hit_positions(df)
    # drop rows with any missing feature (e.g. strikeouts have no hit coordinates)
    df = df[['hit', *features]].dropna()
    # break into features (X) and outcomes (y)
    X, y = df[features], df['hit']
    # normalize to [0,1] scale
    # note: the scaler is refit on every call, so each dataset gets its own min/max
    scaler = MinMaxScaler()
    # keep the original index so X stays aligned with y
    X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
    return X, y

    
data = pd.read_csv('hits_outs_2018.csv')

X, y = pre_process(data)

X.describe()

# split up data into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 13, test_size=0.2
)

clf = LinearSVC().fit(X_train, y_train)

print('Accuracy of Linear SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
print('\nCoefficients: ')
print(clf.coef_)

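# calculate expected batting average for player
dee = pd.read_csv('dee_2019.csv')

X0, y0 = pre_process(dee)

# xBA: share of batted balls the classifier labels as hits
# (taking the mean of the boolean predictions avoids a KeyError when one class is never predicted)
predicted_outcomes = clf.predict(X0)
xBA = predicted_outcomes.mean()

# BA: share of hits among all rows in the file
# (isin() avoids a KeyError when an event type, e.g. triple, never occurs for this player)
hits = dee['events'].isin(['single', 'double', 'triple', 'home_run']).sum()

# note: xBA only covers rows that survive dropna() in pre_process, while BA covers
# every row in the file, so the two denominators differ
BA = hits/len(dee['events'])

print('Predicted batting average: {:.3f}'.format(xBA))
print('Actual batting average: {:.3f}'.format(BA))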

0        strikeout
1        strikeout
11       strikeout
12       strikeout
20       strikeout
           ...    
39964    strikeout
39965    strikeout
39968    strikeout
39970    strikeout
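
One thing to watch in the cell above: `pre_process` fits a fresh `MinMaxScaler` on whatever frame it receives, so the 2019 player data is scaled by its own min/max rather than the 2018 training min/max. A possible fix, sketched on made-up numbers, is to put the scaler and classifier in one pipeline so the scaling learned from the training data is reused at prediction time. `X_2018`, `y_2018`, and `X_2019` are synthetic placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(13)
# synthetic stand-ins for unscaled [launch_speed, launch_angle] features
X_2018 = rng.normal(loc=[90.0, 15.0], scale=[10.0, 20.0], size=(500, 2))
y_2018 = X_2018[:, 0] > 95.0          # made-up rule: hard-hit balls are "hits"
X_2019 = rng.normal(loc=[90.0, 15.0], scale=[10.0, 20.0], size=(40, 2))

# the scaler is fit once, on the training data only
pipe = make_pipeline(MinMaxScaler(), LinearSVC())
pipe.fit(X_2018, y_2018)

# predict() applies the training min/max to the new data before classifying
pred_2019 = pipe.predict(X_2019)
xba_2019 = pred_2019.mean()
print('Predicted batting average (synthetic): {:.3f}'.format(xba_2019))
```

In this notebook that would mean returning unscaled features from `pre_process` and fitting the pipeline on the 2018 split.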

In [2]:
data = pd.read_csv('hits_outs_2018_min100.csv')
data.shape

(40000, 89)

In [3]:
# example prediction
# z = [launch_speed, launch_angle, spray_angle, hc_x, hc_y, release_speed, release_spin_rate, over_fence, pos_1, ... , pos_9]
z = [0.8, 0.8, 0.7, 0.9, 0.6, 0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
clf.predict([z])

array([False])
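
Typing 17 numbers in the right order is easy to get wrong. A small sketch of building the same example row by feature name instead; `make_feature_row` is a hypothetical helper, and the values are the already-scaled example numbers from the cell above.

```python
import pandas as pd

# same feature order as in pre_process()
positions = [f'pos_{i}' for i in range(1, 10)]
features = [
    'launch_speed', 'launch_angle', 'spray_angle', 'hc_x', 'hc_y',
    'release_speed', 'release_spin_rate', 'over_fence', *positions
]

def make_feature_row(values, features=features):
    """Build a one-row frame in model column order; unnamed features default to 0."""
    unknown = set(values) - set(features)
    if unknown:
        raise KeyError(f'unknown feature(s): {sorted(unknown)}')
    return pd.DataFrame([values]).reindex(columns=features, fill_value=0)

row = make_feature_row({
    'launch_speed': 0.8, 'launch_angle': 0.8, 'spray_angle': 0.7,
    'hc_x': 0.9, 'hc_y': 0.6,
    'release_speed': 0.5, 'release_spin_rate': 0.5,
    'pos_9': 1,
})
print(row.iloc[0].tolist())
```

The resulting frame can be passed straight to the fitted classifier, e.g. `clf.predict(row)`.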