## Building models for batting average

From the notebook 1-MLB-predictions-EDA, I noticed that, of the features:
- features_new = ['avg_swing_speed', 'fast_swing_rate', 'blasts_contact', 'blasts_swing',
       'squared_up_contact', 'squared_up_swing', 'avg_swing_length', 'swords']

'blast_swing', 'blast contact', 'squared up', 'squared up swings', 'swords' correlate most with average. I also noticed that blast_swing and blast_contact correlate, similarly sqaured up swings and contact. We will still include all of these in the initial model exploration. 

Also, of the following quality of at bat features:
- features_quality = ['exit_velocity_avg', 'launch_angle_avg', 'sweet_spot_percent', 'barrel',
       'barrel_batted_rate', 'solidcontact_percent', 'flareburner_percent',
       'poorlyunder_percent', 'poorlytopped_percent', 'poorlyweak_percent',
       'hard_hit_percent']

'exit velo','sweet spot percentage','barrel','flareburner percentage','hard hit percentage','poorly topped','poorly under','poorly weak' correlate most with average. 

For which model we will use, we will build models using only new features since I am interested in what features most contribute to batting average, then one using new fatures + quality at bat features to see overall, what features contribute most to batting average. To do this: 
- Models we will use:
    - Linear Regression with new tracking stats, then one with both. Using this method since we do want interpretability of which features are important to batting. 
    - Find features using PCA, then build Regression model. We can use this just as a predictor of batting average. 
    - Find a regression model using Huber Regularization since there are some outliers in the data. 
    - Lasso Regression to see which features are more important.
- Compare the previous models. 


In [88]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, Huber
from sklearn.decomposition import PCA
# from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

In [155]:
## This is data from the start of the 2024 season, until 5/14/2024.
## it includes those columbs listed below. Need to update this data set. 
features_new = ['blasts_swing', 'blasts_contact', 'squared_up_swing', 'squared_up_contact', 'swords']
features_quality = ['exit_velocity_avg','launch_angle_avg','sweet_spot_percent','barrel','barrel_batted_rate','solidcontact_percent',
                    'flareburner_percent','poorlyunder_percent','poorlytopped_percent','poorlyweak_percent','hard_hit_percent']

features = features_new + features_quality

batters = pd.read_csv('../updated_batter_data.csv')

batters = batters[features + ['batting_avg']]

batters = batters.dropna(subset = features)
# batters.columns

In [156]:
batters.shape

(466, 17)

In [157]:
## split data
batters_train, batters_test = train_test_split(batters,
                                              shuffle = True,
                                              random_state = 555, 
                                              test_size = .2)

In [158]:
def powerset(s):
    power_set = [[]]
    for x in s:
        power_set += [s0+[x] for s0 in power_set]
    return power_set[1:]

In [159]:
all_models = ['baseline']
all_models.extend(powerset(features_new))

In [184]:
kfold = KFold(5,
             shuffle = True,
             random_state = 555)

mses = np.zeros((5, len(all_models)))

for i, (train_index, train_test) in enumerate(kfold.split(batters_train)):
    batters_tt = batters_train.iloc[train_index]
    batters_ho = batters_train.iloc[train_test]
    
    for j, subset in enumerate(all_models):
        ## baseline prediction 
        if subset == 'baseline':
            subset = np.mean(batters_tt.batting_avg)*np.ones(len(batters_ho.batting_avg))
        
            mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, subset)
        else:
            if len(subset) == 1:
                model1 = LinearRegression()
            
                model1.fit(batters_tt[subset].values.reshape(-1,1), 
                           batters_tt.batting_avg)
            
                pred = model1.predict(batters_ho[subset].values.reshape(-1,1))
            
                mses[i,j] = mean_squared_error(batters_ho.batting_avg.values.reshape(-1,1), pred)
                
            else: 
                model1 = LinearRegression()
            
                model1.fit(batters_train[subset].values, 
                           batters_train.batting_avg.values)
            
                pred = model1.predict(batters_ho[subset].values)
                
                mses[i,j] = mean_squared_error(batters_ho.batting_avg.values, pred)
        

In [203]:
mses_arr = np.mean(mses,axis = 0)
print(mses_arr)
print('the minimum mean squared error is: ', mses_arr.min())
print('---')
print('the subset that gave this smallest errorr is:')
print(all_models[mses_arr.argmin()])
error_dict = {'best subset error': mses_arr.min()}

[0.00383842 0.00352362 0.00364635 0.00322763 0.0035902  0.00326434
 0.00333155 0.0031921  0.00373704 0.00337398 0.00349032 0.00321713
 0.00345601 0.00308695 0.00306058 0.00303867 0.00372918 0.00331356
 0.0034431  0.00314716 0.00337786 0.00316171 0.00321963 0.00309713
 0.00353427 0.00328679 0.00339255 0.00312933 0.0033074  0.00298973
 0.00296215 0.00293257]
the minimum mean squared error is:  0.0029325746182761367
---
the subset that gave this smallest errorr is:
['blasts_swing', 'blasts_contact', 'squared_up_swing', 'squared_up_contact', 'swords']


In [204]:
error_dict

{'best subset error': 0.0029325746182761367}

### Now we will attempt to use Lasso Regression, PCA, and Huber Regression