# Lab 6: Variable Selection and Regularization

## Part I: Different Model Specs
### A. Regression without regularization
1. Create a pipeline that includes all the columns as predictors for Salary, and performs ordinary linear regression

In [6]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.compose import ColumnTransformer

In [4]:
hitters = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Fall 2024\544\Hitters.csv")

In [15]:
# keep only rows with a recorded Salary, and align X to those rows
X = hitters.drop(columns=['Salary'])
y = hitters['Salary'].dropna()
X = X.loc[y.index]

# training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# one-hot encode League; every other column passes through unchanged
ct_dummies = ColumnTransformer(
    [("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ["League"])],
    remainder="passthrough"
).set_output(transform="pandas")

# build Years, League_A, and their interaction; remainder="drop" discards every other column
ct_inter = ColumnTransformer(
    [
        ("interaction", PolynomialFeatures(interaction_only=True, include_bias=False),
         ["remainder__Years", "dummify__League_A"])
    ],
    remainder="drop"
).set_output(transform="pandas")

# pipeline
pipeline = Pipeline([
    ('dummification', ct_dummies),
    ('interactions', ct_inter),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# fit pipeline onto training data
pipeline.fit(X_train, y_train)

2. Fit this pipeline to the full dataset, and interpret a few of the most important coefficients.

In [23]:
# fitting pipeline to whole dataset
pipeline.fit(X, y)

# linear regression model from pipeline
regressor = pipeline.named_steps['regressor']

# coefficients from fitted model
coefficients = regressor.coef_

# names of the transformed features, in the order the regressor sees them
transformed_features = pipeline.named_steps['interactions'].get_feature_names_out()

# data frame for coefficients for easier read
coeff_df = pd.DataFrame({'Feature': transformed_features, 'Coefficient': coefficients})
coeff_df = coeff_df.sort_values(by='Coefficient', key=abs, ascending=False)  # Sort by importance (absolute value)

# most important coefficients
coeff_df.head()


| | Feature | Coefficient |
|---|---|---|
| 0 | `interaction__remainder__Years` | 166.978056 |
| 2 | `interaction__remainder__Years dummify__League_A` | 27.099368 |
| 1 | `interaction__dummify__League_A` | -18.891896 |


Interpretation: Because `StandardScaler` runs before the regression, each coefficient is the change in predicted Salary for a one-standard-deviation increase in that feature, not the change per raw unit. The first feature is the main effect of Years: a one-standard-deviation increase in years played is associated with a predicted Salary about 166.98k higher. The second feature is the interaction between Years and League A. Its positive coefficient (27.10) means the Years slope is steeper for League A players than for players in the other league; it is an adjustment to the Years effect, not the full Years effect within League A. The third feature is the League A indicator. Its negative coefficient (-18.89) means League A is associated with a lower predicted Salary than the other league, and because the model includes the interaction, this gap applies at zero Years and narrows as Years increases.

3. Use cross-validation to estimate the MSE you would expect if you used this pipeline to predict 1989 salaries.

### B. Ridge regression
1. Create a pipeline that includes all the columns as predictors for Salary, and performs ordinary ridge regression

2. Use cross-validation to tune the lambda hyperparameter.

3. Fit the pipeline with your chosen lambda to the full dataset, and interpret a few of the most important coefficients.

4. Report the MSE you would expect if you used this pipeline to predict 1989 salaries.

### C. Lasso Regression
1. Create a pipeline that includes all the columns as predictors for Salary, and performs lasso regression.

2. Use cross-validation to tune the lambda hyperparameter.

3. Fit the pipeline with your chosen lambda to the full dataset, and interpret a few of the most important coefficients.

4. Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
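
The lasso steps follow the same pattern as ridge, with the difference that some coefficients can be set exactly to zero. The sketch below assumes synthetic numeric data in which only two columns affect the response; the column names, the penalty grid, and variables such as `lasso_search` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only Years and Hits drive the response (hypothetical values)
rng = np.random.default_rng(2)
n = 200
X_demo = pd.DataFrame(
    rng.normal(size=(n, 6)),
    columns=["Years", "Hits", "Walks", "Noise1", "Noise2", "Noise3"],
)
y_demo = 100 * X_demo["Years"] + 60 * X_demo["Hits"] + rng.normal(0, 50, size=n)

lasso_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Lasso(max_iter=10_000)),
])

# scikit-learn's "alpha" is the lambda penalty from the prompt
lasso_grid = {"regressor__alpha": np.logspace(-2, 2, 9)}
lasso_search = GridSearchCV(lasso_pipeline, lasso_grid, cv=5,
                            scoring="neg_mean_squared_error")
lasso_search.fit(X_demo, y_demo)

best_lambda = lasso_search.best_params_["regressor__alpha"]
cv_mse = -lasso_search.best_score_

# Inspect which coefficients the chosen penalty shrank all the way to zero
coefs = pd.Series(
    lasso_search.best_estimator_.named_steps["regressor"].coef_,
    index=X_demo.columns,
)
n_zero = int((coefs == 0).sum())
print(best_lambda, cv_mse, n_zero)
print(coefs)
```

Counting the exact zeros is what makes lasso usable as a variable selection tool in Part II.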

### D. Elastic Net
1. Create a pipeline that includes all the columns as predictors for Salary, and performs elastic net regression.

2. Use cross-validation to tune the lambda and alpha hyperparameters.

3. Fit the pipeline with your chosen hyperparameters to the full dataset, and interpret a few of the most important coefficients.

4. Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
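
For elastic net, the naming differs between the prompt and scikit-learn: the `alpha` argument of `ElasticNet` is the overall penalty strength (the prompt's lambda), and `l1_ratio` is the lasso/ridge mixing weight (the prompt's alpha). The sketch below tunes both on synthetic data; the grid values and names such as `enet_search` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (hypothetical values, not the real Hitters data)
rng = np.random.default_rng(3)
n = 200
X_demo = pd.DataFrame(rng.normal(size=(n, 4)),
                      columns=["Years", "Hits", "Walks", "Runs"])
y_demo = 100 * X_demo["Years"] + 60 * X_demo["Hits"] + rng.normal(0, 50, size=n)

enet_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", ElasticNet(max_iter=10_000)),
])

# Two-dimensional grid: penalty strength and L1/L2 mixing weight
enet_grid = {
    "regressor__alpha": np.logspace(-2, 1, 4),   # the prompt's lambda
    "regressor__l1_ratio": [0.1, 0.5, 0.9],      # the prompt's alpha
}
enet_search = GridSearchCV(enet_pipeline, enet_grid, cv=5,
                           scoring="neg_mean_squared_error")
enet_search.fit(X_demo, y_demo)

best_params = enet_search.best_params_
cv_mse = -enet_search.best_score_
print(best_params, cv_mse)
```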

## Part II. Variable Selection
Based on the above results, decide on:

* Which numeric variable is most important.

* Which five numeric variables are most important.

* Which categorical variable is most important.

For each of the four model specifications, compare the following possible feature sets:

1. Using only the one best numeric variable.

2. Using only the five best numeric variables.

3. Using the five best numeric variables and their interactions with the one best categorical variable.

Report which combination of features and model performed best, based on the validation metric of MSE. (Note: lambda and alpha must be re-tuned for each feature set.)
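
The comparison can be organized as a loop over feature sets and models, re-tuning inside each combination. The sketch below is an assumption-laden outline: it uses synthetic numeric data, covers only the first two feature sets (the interaction set would need its own preprocessing step), and the names `feature_sets`, `models`, and `results` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (hypothetical values, not the real Hitters data)
rng = np.random.default_rng(4)
n = 200
X_demo = pd.DataFrame(rng.normal(size=(n, 5)),
                      columns=["Years", "Hits", "Walks", "Runs", "RBI"])
y_demo = (100 * X_demo["Years"] + 60 * X_demo["Hits"] + 20 * X_demo["Walks"]
          + rng.normal(0, 50, size=n))

feature_sets = {
    "one numeric": ["Years"],
    "five numeric": ["Years", "Hits", "Walks", "Runs", "RBI"],
}
penalties = np.logspace(-2, 2, 5)
# Each entry pairs a model with the grid to re-tune for every feature set
models = {
    "ols": (LinearRegression(), {"regressor__fit_intercept": [True]}),
    "ridge": (Ridge(), {"regressor__alpha": penalties}),
    "lasso": (Lasso(max_iter=10_000), {"regressor__alpha": penalties}),
    "elastic net": (ElasticNet(max_iter=10_000),
                    {"regressor__alpha": penalties,
                     "regressor__l1_ratio": [0.2, 0.5, 0.8]}),
}

rows = []
for set_name, columns in feature_sets.items():
    for model_name, (model, grid) in models.items():
        pipe = Pipeline([("scaler", StandardScaler()), ("regressor", model)])
        search = GridSearchCV(pipe, grid, cv=5, scoring="neg_mean_squared_error")
        search.fit(X_demo[columns], y_demo)
        rows.append({"features": set_name, "model": model_name,
                     "cv_mse": -search.best_score_})

# Smallest cross-validated MSE first
results = pd.DataFrame(rows).sort_values("cv_mse")
print(results)
```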

## Part III. Discussion

### A. Ridge

Compare your Ridge models with your ordinary regression models. How did your coefficients compare? Why does this make sense?
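
A quick way to check the comparison numerically is to look at the overall size of the coefficient vector as the penalty grows. The sketch below uses synthetic data with one deliberately correlated pair of predictors; the data, the lambda values, and the names `ols_norm` and `ridge_norms` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic predictors; columns 0 and 1 are strongly correlated (hypothetical data)
rng = np.random.default_rng(5)
n = 100
X_raw = rng.normal(size=(n, 4))
X_raw[:, 1] = X_raw[:, 0] + 0.1 * rng.normal(size=n)
y_demo = 3 * X_raw[:, 0] + 2 * X_raw[:, 2] + rng.normal(size=n)
X_std = StandardScaler().fit_transform(X_raw)

# L2 norm of the ordinary least squares coefficients
ols_norm = np.linalg.norm(LinearRegression().fit(X_std, y_demo).coef_)

# L2 norm of the ridge coefficients at increasing penalty strengths
ridge_norms = {
    lam: np.linalg.norm(Ridge(alpha=lam).fit(X_std, y_demo).coef_)
    for lam in [0.1, 10, 1000]
}
print(ols_norm, ridge_norms)
```

The ridge norms should never exceed the ordinary regression norm, and they shrink further as lambda increases, which is the penalty doing its job.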

### B. Lasso

Compare your LASSO model in Part I with your three LASSO models in Part II. Did you get the same lambda results? Why does this make sense? Did you get the same MSEs? Why does this make sense?

### C. Elastic Net

Compare your MSEs for the Elastic Net models with those for the Ridge and LASSO models. Why does it make sense that Elastic Net always “wins”?

## Part IV: Final Model

Fit your final best pipeline on the full dataset, and summarize your results in a few short sentences and a plot.

## Appendix and References