## Building models for batting average

From the notebook 1-MLB-predictions-EDA, I noticed that, of the features:
- features_new = ['avg_swing_speed', 'fast_swing_rate', 'blasts_contact', 'blasts_swing',
       'squared_up_contact', 'squared_up_swing', 'avg_swing_length', 'swords']

'blast_swing', 'blast contact', 'squared up', 'squared up swings', and 'swords' correlate most strongly with batting average. I also noticed that 'blast_swing' and 'blast contact' are correlated with each other, as are 'squared up swings' and 'squared up'. We will still include all of these in the initial model exploration.

Also, of the following quality-of-at-bat features:
- features_quality = ['exit_velocity_avg', 'launch_angle_avg', 'sweet_spot_percent', 'barrel',
       'barrel_batted_rate', 'solidcontact_percent', 'flareburner_percent',
       'poorlyunder_percent', 'poorlytopped_percent', 'poorlyweak_percent',
       'hard_hit_percent']

'exit velo', 'sweet spot percentage', 'barrel', 'flareburner percentage', 'hard hit percentage', 'poorly topped', 'poorly under', and 'poorly weak' correlate most strongly with batting average.

For the modeling, we will first build models using only the new tracking features, since I am interested in which of them contribute most to batting average. We will then build models using the new features plus the quality-of-at-bat features to see which features contribute most overall. To do this:
- Models we will use:
    - Linear regression with the new tracking stats, then with both feature sets. We use this method because we want to interpret which features are important to batting average.
    - PCA to derive components, followed by a regression model. Since the components are harder to interpret, we can use this just as a predictor of batting average.
    - Huber regression, whose loss function is robust to outliers, since there are some outliers in the data.
    - Lasso regression, to see which features are most important.
- Compare the models above.
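
The candidate models in the list above could be set up roughly as follows. This is a minimal sketch on synthetic data, not the notebook's final code: the stand-in feature matrix, the standardization step, the number of PCA components, and the `epsilon` and `alpha` values are all placeholder assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import HuberRegressor, Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

## Synthetic stand-in for the batter data (assumption: 200 batters, 5 features).
rng = np.random.default_rng(555)
X = rng.normal(size=(200, 5))
y = 0.25 + 0.02 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(scale=0.01, size=200)

## One pipeline per model family; each standardizes the features first.
candidate_models = {
    "linear": Pipeline([("scale", StandardScaler()),
                        ("reg", LinearRegression())]),
    "pca_linear": Pipeline([("scale", StandardScaler()),
                            ("pca", PCA(n_components=3)),
                            ("reg", LinearRegression())]),
    "huber": Pipeline([("scale", StandardScaler()),
                       ("reg", HuberRegressor(epsilon=1.35))]),
    "lasso": Pipeline([("scale", StandardScaler()),
                       ("reg", Lasso(alpha=0.001))]),
}

## Pipeline.fit returns the fitted pipeline itself.
fitted = {name: model.fit(X, y) for name, model in candidate_models.items()}

## Lasso coefficients shrunk to (or near) zero flag less important features.
lasso_coefs = fitted["lasso"].named_steps["reg"].coef_
```

Standardizing before PCA and Lasso keeps features on different scales from dominating the components or the penalty; whether that is appropriate for these particular stats is a choice to check against the data.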


In [15]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, HuberRegressor
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

In [8]:
## This is data from the start of the 2024 season through 5/14/2024.
## It includes the columns listed below. This data set needs to be updated.

batters = pd.read_csv('../updated_batter_data.csv')

## split data
batters_train, batters_test = train_test_split(batters,
                                              shuffle = True,
                                              random_state = 555, 
                                              test_size = .2)

In [9]:
features_new = ['blast_swing', 'blast contact', 'squared up', 'squared up swings', 'swords']
features_quality = ['exit velo','sweet spot percentage','barrel','flareburner percentage',
                    'hard hit percentage','poorly topped','poorly under','poorly weak']

In [12]:
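def powerset(s):
    ## Returns every non-empty subset of the list s, as a list of lists.
    power_set = [[]]
    for x in s:
        ## Extend each subset found so far with the new element x.
        power_set += [s0+[x] for s0 in power_set]
    ## Drop the leading empty subset.
    return power_set[1:]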

In [13]:
all_models = ['baseline']
all_models.extend(powerset(features_new))
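
As a quick sanity check on the subset enumeration, the sketch below repeats `powerset` on a three-element list and then counts the candidate models for the five features in `features_new`. The toy list `['a', 'b', 'c']` is only for illustration.

```python
def powerset(s):
    power_set = [[]]
    for x in s:
        power_set += [s0 + [x] for s0 in power_set]
    return power_set[1:]

## A list of n elements has 2**n - 1 non-empty subsets.
small = powerset(['a', 'b', 'c'])

features_new = ['blast_swing', 'blast contact', 'squared up', 'squared up swings', 'swords']
all_models = ['baseline']
all_models.extend(powerset(features_new))

## 'baseline' plus 2**5 - 1 = 31 feature subsets.
n_models = len(all_models)
```

Because the count doubles with every added feature, this exhaustive approach stays cheap for the five new features but would grow to thousands of models if the eight quality-of-at-bat features were added to the same powerset.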

In [16]:
kfold = KFold(5,
             shuffle = True,
             random_state = 555)

## one row per CV fold, one column per candidate model
mses = np.zeros((5, len(all_models)))
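
One way the cross-validation loop might fill a fold-by-model array of mean squared errors is sketched below. It runs on synthetic data: the column names `x1`, `x2`, `x3` and the target name `batting_avg` are hypothetical, and the baseline is assumed to be a prediction of the training-fold mean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

## Synthetic training set (assumption: only x1 carries signal).
rng = np.random.default_rng(555)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
train["batting_avg"] = 0.25 + 0.02 * train["x1"] + rng.normal(scale=0.01, size=100)

## 'baseline' plus a few feature subsets, mirroring the all_models layout.
models = ["baseline", ["x1"], ["x2"], ["x1", "x2", "x3"]]

kfold = KFold(5, shuffle=True, random_state=555)
mses = np.zeros((5, len(models)))

for i, (train_idx, holdout_idx) in enumerate(kfold.split(train)):
    tt = train.iloc[train_idx]
    ho = train.iloc[holdout_idx]
    for j, model in enumerate(models):
        if model == "baseline":
            ## Predict the training-fold mean for every holdout batter.
            pred = np.full(len(ho), tt["batting_avg"].mean())
        else:
            reg = LinearRegression().fit(tt[model], tt["batting_avg"])
            pred = reg.predict(ho[model])
        mses[i, j] = mean_squared_error(ho["batting_avg"], pred)

## Average over folds, then pick the model with the lowest mean MSE.
mean_mses = mses.mean(axis=0)
best_model = models[int(np.argmin(mean_mses))]
```

Averaging over the fold axis gives one score per candidate, so the feature subsets can be compared directly against the baseline.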