## Building models for batting average

From the notebook 1-MLB-predictions-EDA, I noticed that, of the features:
- features_new = ['avg_swing_speed', 'fast_swing_rate', 'blasts_contact', 'blasts_swing',
       'squared_up_contact', 'squared_up_swing', 'avg_swing_length', 'swords']

'blast_swing', 'blast contact', 'squared up', 'squared up swings', 'swords' correlate most with average. I also noticed that blast_swing and blast_contact correlate, similarly sqaured up swings and contact. We will still include all of these in the initial model exploration. 

Also, of the following quality of at bat features:
- features_quality = ['exit_velocity_avg', 'launch_angle_avg', 'sweet_spot_percent', 'barrel',
       'barrel_batted_rate', 'solidcontact_percent', 'flareburner_percent',
       'poorlyunder_percent', 'poorlytopped_percent', 'poorlyweak_percent',
       'hard_hit_percent']

'exit velo','sweet spot percentage','barrel','flareburner percentage','hard hit percentage','poorly topped','poorly under','poorly weak' correlate most with average. 

For which model we will use, we will build models using only new features since I am interested in what features most contribute to batting average, then one using new fatures + quality at bat features to see overall, what features contribute most to batting average. To do this: 
- With new features, models we will use:
    - Linear Regression with new tracking stats. Using this method since we do want interpretability of which features are important to batting.  
    - Find a regression model using Huber Regularization since there are some outliers in the data. 
    - Lasso Regression to see which features are more important.
- With all features, we will do the same as above, plus use:
    - PCA to find features, then build Regression model. We can use this just as a predictor of batting average.
- Compare the previous models. 


In [170]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, HuberRegressor
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

from sklearn.preprocessing import StandardScaler

In [171]:
## This is data from the start of the 2024 season, until 5/14/2024.
## it includes those columbs listed below. Need to update this data set. 
features_new = ['avg_swing_speed', 'fast_swing_rate', 'blasts_contact', 'blasts_swing',
                 'squared_up_contact', 'squared_up_swing', 'avg_swing_length', 'swords']
features_quality = ['exit_velocity_avg','launch_angle_avg','sweet_spot_percent','barrel','barrel_batted_rate','solidcontact_percent',
                    'flareburner_percent','poorlyunder_percent','poorlytopped_percent','poorlyweak_percent','hard_hit_percent']

features = features_new + features_quality

batters = pd.read_csv('../updated_batter_data.csv')

batters = batters[features + ['batting_avg']]

batters = batters.dropna(subset = features)
# batters.columns

In [172]:
batters.shape

(466, 20)

In [173]:
## split data
batters_train, batters_test = train_test_split(batters,
                                              shuffle = True,
                                              random_state = 555, 
                                              test_size = .2)

In [174]:
def powerset(s):
    power_set = [[]]
    for x in s:
        power_set += [s0+[x] for s0 in power_set]
    return power_set[1:]

In [214]:
all_models = ['baseline']
all_models.extend(powerset(features_new))

In [215]:
kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, len(all_models)))

for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
    batters_tt = batters_train.iloc[train_index]
    batters_ho = batters_train.iloc[train_test]
    
    for j, subset in enumerate(all_models):
        ## baseline prediction 
        if subset == 'baseline':
            subset = np.mean(batters_tt.batting_avg)*np.ones(len(batters_ho.batting_avg))
        
            mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, subset)
        else:
            if len(subset) == 1:
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values.reshape(-1,1), 
                           batters_tt.batting_avg)
            
                pred = model1.predict(batters_ho[subset].values.reshape(-1,1))
            
                mses[i,j] = mean_squared_error(batters_ho.batting_avg.values.reshape(-1,1), pred)
                
            else: 
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values, 
                           batters_tt.batting_avg.values)
            
                pred = model1.predict(batters_ho[subset].values)
                
                mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)
        

In [216]:
mses_arr = np.mean(mses,axis = 0)
print(mses_arr)
print('the minimum mean squared error is: ', mses_arr.min())
print('---')
print('the subset that gave this smallest errorr is:')
print(all_models[mses_arr.argmin()])
error_dict = {'best subset error': mses_arr.min()}

[0.00383842 0.00388921 0.00384017 0.00393403 0.00364635 0.00364906
 0.00362692 0.00371077 0.00352362 0.00349036 0.00349101 0.00354372
 0.00332536 0.00335584 0.0033335  0.003379   0.00373704 0.00375987
 0.00371978 0.00380211 0.00370262 0.00371952 0.00370795 0.00378223
 0.00360426 0.00356688 0.00358832 0.00362462 0.00339785 0.00342805
 0.00340982 0.00345957 0.0035902  0.00353448 0.00351797 0.00356611
 0.00351375 0.00354105 0.00351982 0.00358436 0.00346286 0.00348548
 0.00347576 0.00353083 0.00338524 0.00340768 0.00339024 0.00343321
 0.00354367 0.00338616 0.00340757 0.0034136  0.00323604 0.00325708
 0.00324584 0.00329905 0.00325682 0.00327247 0.00326559 0.00331365
 0.0033276  0.00334378 0.00333356 0.00336442 0.00389278 0.00393815
 0.00390482 0.00398251 0.00367778 0.00368641 0.00367027 0.00374999
 0.00354953 0.00352579 0.00353318 0.00358223 0.00336973 0.00338893
 0.00338066 0.00341023 0.00376867 0.00379497 0.00376563 0.00383751
 0.00374997 0.00376013 0.00375392 0.00382597 0.00363742 0.0036

In [217]:
error_dict

{'best subset error': 0.003126112606799465}

In [212]:
swings = ['blasts_swing', 'squared_up_swing',  'swords']
contacts = ['blasts_contact','squared_up_contact','swords' ]

subsets = [swings,contacts]
kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, 2))

for j,subset in enumerate(subsets):
    for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
        batters_tt = batters_train.iloc[train_index]
        batters_ho = batters_train.iloc[train_test]
    
        model1 = LinearRegression()
            
        model1.fit(batters_tt[subset].values, 
                    batters_tt.batting_avg.values)
            
        pred = model1.predict(batters_ho[subset].values)
                
        mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)
        

In [213]:
mses_arr = np.mean(mses,axis = 0)
mses_arr



array([0.0033536, 0.0035955])

In [210]:
best = ['squared_up_swing', 'blasts_swing']

kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, 1))

for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
    batters_tt = batters_train.iloc[train_index]
    batters_ho = batters_train.iloc[train_test]
    
    model1 = LinearRegression()
            
    model1.fit(batters_tt[best].values, 
                    batters_tt.batting_avg.values)
            
    pred = model1.predict(batters_ho[best].values)
                
    mses[i,0] = mean_squared_error(batters_ho.batting_avg.values, pred)
        

In [211]:
mses_arr = np.mean(mses,axis = 0)
mses_arr

array([0.00346286])

### Now we will use Lasso Regression, PCA, and Huber Regression

In [183]:
alpha = [0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000]

ridge_coefs = np.empty((len(alpha),len(features_new)))
lasso_coefs = np.empty((len(alpha),len(features_new)))


for i in range(len(alpha)):
    ## first scale
    ## set up the lasso pipeline
    lasso_pipe = Pipeline([('scale',StandardScaler()),
                              ('lasso', Lasso(alpha=alpha[i], max_iter=5000000))
                          ])
   
    ## fit the lasso
    lasso_pipe.fit(batters_train[features_new], batters_train.batting_avg)

    
    # record the coefficients
    lasso_coefs[i,:] = lasso_pipe['lasso'].coef_

In [184]:
print("Lasso Coefficients")

pd.DataFrame(np.round(lasso_coefs,8),
            columns = features_new,
            index = ["alpha=" + str(a) for a in alpha])

Lasso Coefficients


Unnamed: 0,avg_swing_speed,fast_swing_rate,blasts_contact,blasts_swing,squared_up_contact,squared_up_swing,avg_swing_length,swords
alpha=1e-05,0.002206,-0.006009,0.08576,-0.06147,-0.073218,0.087621,0.00587,0.010649
alpha=0.0001,-0.000662,-0.002761,0.053734,-0.030956,-0.057266,0.069461,0.005339,0.010292
alpha=0.001,-0.0,-0.0,0.0,0.016647,-0.020892,0.031012,0.002045,0.009073
alpha=0.01,0.0,0.0,0.0,0.008916,0.0,0.004008,0.0,0.000178
alpha=0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
alpha=1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
alpha=10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
alpha=100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
alpha=1000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


So from Lasso, it seems like ['blasts_swing', 'squared_up_swing',  'swords']

In [185]:
swings = [['blasts_swing', 'squared_up_swing',  'swords']]

kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, len(swings)))

for j,subset in enumerate(swings):
    for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
        batters_tt = batters_train.iloc[train_index]
        batters_ho = batters_train.iloc[train_test]
    
        model1 = LinearRegression()
            
        model1.fit(batters_train[subset].values, 
                    batters_train.batting_avg.values)
            
        pred = model1.predict(batters_ho[subset].values)
                
        mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)

mses_arr = np.mean(mses,axis = 0)
print(mses_arr)
print('the minimum mean squared error is: ', mses_arr.min())
print('---')
print('the subset that gave this smallest errorr is:')
# print(all_models[mses_arr.argmin()])
error_dict['Lasso'] = mses_arr.min()
error_dict

[0.00316171]
the minimum mean squared error is:  0.003161706283398068
---
the subset that gave this smallest errorr is:


{'best subset error': 0.0029026945861780158, 'Lasso': 0.003161706283398068}

### Now with Huber regression

In [186]:
swings = [['blasts_swing', 'squared_up_swing',  'swords']]

kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, len(swings)))

for j,subset in enumerate(swings):
    for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
        batters_tt = batters_train.iloc[train_index]
        batters_ho = batters_train.iloc[train_test]
    
        model1 = Pipeline((('scale', StandardScaler()), ('huber', HuberRegressor(max_iter = 200)))
            
        model1.fit(batters_train[subset].values, 
                    batters_train.batting_avg.values)
            
        pred = model1.predict(batters_ho[subset].values)
                
        mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)

mses_arr = np.mean(mses,axis = 0)
print(mses_arr)
print('the minimum mean squared error is: ', mses_arr.min())
print('---')
print('the subset that gave this smallest errorr is:')
# print(all_models[mses_arr.argmin()])
error_dict['Huber'] = mses_arr.min()
error_dict

[0.00317889]
the minimum mean squared error is:  0.003178886371114434
---
the subset that gave this smallest errorr is:


{'best subset error': 0.0029026945861780158,
 'Lasso': 0.003161706283398068,
 'Huber': 0.003178886371114434}

### Now with PCA

# Including new features and quality features

In [206]:
all_models = ['baseline']
all_models.extend(powerset(['blasts_swing', 'squared_up_swing',  'swords'] + features_quality))

# all_models=[['blasts_swing', 'squared_up_swing', 'sweet_spot_percent', 'barrel', 'flareburner_percent']]

kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, len(all_models)))

for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
    batters_tt = batters_train.iloc[train_index]
    batters_ho = batters_train.iloc[train_test]
    
    for j, subset in enumerate(all_models):
        ## baseline prediction 
        if subset == 'baseline':
            subset = np.mean(batters_tt.batting_avg)*np.ones(len(batters_ho.batting_avg))
        
            mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, subset)
        else:
            if len(subset) == 1:
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values.reshape(-1,1), 
                           batters_tt.batting_avg)
            
                pred = model1.predict(batters_ho[subset].values.reshape(-1,1))
            
                mses[i,j] = mean_squared_error(batters_ho.batting_avg.values.reshape(-1,1), pred)
                
            else: 
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values, 
                           batters_tt.batting_avg.values)
            
                pred = model1.predict(batters_ho[subset].values)
                

mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)
mses_arr = np.mean(mses,axis = 0)
print(mses_arr)
print('the minimum mean squared error is: ', mses_arr.min())
print('---')
print('the subset that gave this smallest errorr is:')
print(all_models[mses_arr.argmin()])
error_dict_all = {'best subset error': mses_arr.min()}        

[0.00383842 0.00352362 0.0035902  ... 0.         0.         0.00061427]
the minimum mean squared error is:  0.0
---
the subset that gave this smallest errorr is:
['blasts_swing', 'squared_up_swing']


In [207]:
alpha = [0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000]

subset = features_new + features_quality

lasso_coefs = np.empty((len(alpha),len(subset)))



for i in range(len(alpha)):
    ## first scale
    ## set up the lasso pipeline
    lasso_pipe = Pipeline([('scale',StandardScaler()),
                              ('lasso', Lasso(alpha=alpha[i], max_iter=5000000))
                          ])
   
    ## fit the lasso
    lasso_pipe.fit(batters_train[subset], batters_train.batting_avg)

    
    # record the coefficients
    lasso_coefs[i,:] = lasso_pipe['lasso'].coef_
    
print("Lasso Coefficients")

pd.DataFrame(np.round(lasso_coefs,8),
            columns = subset,
            index = ["alpha=" + str(a) for a in alpha])

Lasso Coefficients


Unnamed: 0,avg_swing_speed,fast_swing_rate,blasts_contact,blasts_swing,squared_up_contact,squared_up_swing,avg_swing_length,swords,exit_velocity_avg,launch_angle_avg,sweet_spot_percent,barrel,barrel_batted_rate,solidcontact_percent,flareburner_percent,poorlyunder_percent,poorlytopped_percent,poorlyweak_percent,hard_hit_percent
alpha=1e-05,-0.000211,-0.003607,0.076042,-0.062377,-0.062397,0.075243,0.008471,0.002699,0.002342,0.00327,0.00958,0.01513,-0.023838,-0.013964,-0.008388,-0.042609,-0.036024,-0.02167,-0.003202
alpha=0.0001,-0.002895,-0.0,0.038014,-0.026327,-0.044774,0.055348,0.007815,0.002713,0.002438,0.002865,0.009398,0.013739,-0.006177,-0.00058,0.013592,-0.017557,-0.009498,-0.007237,-0.004109
alpha=0.001,-0.0,0.0,0.0,0.006191,-0.014426,0.024296,0.004417,0.001813,0.0,0.0,0.009964,0.013296,0.0,0.002331,0.021738,-0.004819,-0.0,-0.001573,-0.0
alpha=0.01,0.0,0.0,0.0,0.005272,0.0,0.002587,0.0,0.0,0.0,-0.0,0.005161,0.007099,0.0,0.0,0.016551,-0.0,-0.0,-0.0,0.0
alpha=0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0
alpha=1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0
alpha=10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0
alpha=100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0
alpha=1000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0


In [208]:
swings = [['blasts_swing', 'squared_up_swing', 'sweet_spot_percent', 'barrel', 'flareburner_percent']]

kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, len(swings)))

for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
    batters_tt = batters_train.iloc[train_index]
    batters_ho = batters_train.iloc[train_test]
    
    for j, subset in enumerate(swings):
        ## baseline prediction 
        if subset == 'baseline':
            subset = np.mean(batters_tt.batting_avg)*np.ones(len(batters_ho.batting_avg))
        
            mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, subset)
        else:
            if len(subset) == 1:
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values.reshape(-1,1), 
                           batters_tt.batting_avg)
            
                pred = model1.predict(batters_ho[subset].values.reshape(-1,1))
            
                mses[i,j] = mean_squared_error(batters_ho.batting_avg.values.reshape(-1,1), pred)
                
            else: 
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values, 
                           batters_tt.batting_avg.values)
            
                pred = model1.predict(batters_ho[subset].values)
                

mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)
mses_arr = np.mean(mses,axis = 0)
print(mses_arr)
print('the minimum mean squared error is: ', mses_arr.min())
print('---')
print('the subset that gave this smallest errorr is:')
# print(all_models[mses_arr.argmin()])
error_dict_all['Lasso'] = mses_arr.min()     

[0.00057657]
the minimum mean squared error is:  0.0005765727448695776
---
the subset that gave this smallest errorr is:


In [209]:
error_dict_all

{'best subset error': 0.0, 'Lasso': 0.0005765727448695776}