## Lab 2 - Ridge Regression and the Lasso

Note: If you do not have a Fortran compiler, get one. Follow the steps
[here](https://www.youtube.com/watch?v=xuQL_BZydS0).

Note: Standardizing is NOT the same as normalizing.
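To make the distinction concrete, here is a minimal sketch on a made-up matrix (`X_toy` and its values are hypothetical). It takes "normalizing" to mean min-max rescaling to [0, 1], which is only one common meaning of the term; standardizing is what `scale` does later in this lab.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, scale

# Hypothetical toy matrix: two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Standardizing: each column gets mean 0 and unit variance
X_std = scale(X_toy)

# Min-max normalizing: each column is rescaled into the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X_toy)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_norm.min(axis=0), X_norm.max(axis=0))
```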

#### Import block

In [2]:
import numpy as np
import pandas as pd

import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

We use exactly the same data as in Lab 1, with all rows containing missing (NA) values removed.

Load data

In [3]:
data_path = 'D:\\PycharmProjects\\ISLR\\data\\'
hitter = pd.read_csv(f'{data_path}Hitters.csv', index_col=0, na_values='NA').dropna()

# Encode the categorical variables as integer category codes (dummies)
for i in ['League', 'Division', 'NewLeague']:
    hitter[i] = hitter[i].astype('category').cat.codes
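The integer codes produced above behave like a single dummy column only when a variable has exactly two levels, which I assume is the case for `League`, `Division`, and `NewLeague`. A small sketch on a hypothetical toy frame (`toy` is not part of the lab data) compares `cat.codes` with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical two-level column, standing in for something like League
toy = pd.DataFrame({'League': ['A', 'N', 'N', 'A']})

# Integer codes: categories are sorted, so 'A' -> 0 and 'N' -> 1
codes = toy['League'].astype('category').cat.codes

# One dummy column after dropping the first level ('A')
dummies = pd.get_dummies(toy['League'], drop_first=True, dtype=int)

# For a two-level variable the two encodings coincide
same = bool((codes.to_numpy() == dummies['N'].to_numpy()).all())
print(same)
```

With three or more levels the integer codes would impose an ordering, and `pd.get_dummies` would be the safer choice.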

### 6.6.1 - Ridge Regression

A few important points before we proceed:
1. We standardize the predictors; we do NOT normalize them.
2. The alpha (lambda) used by sklearn's `Ridge` follows the formula in the book, while the one in glmnet does not.
3. Comparing the resulting test MSE is the surest way to check this.
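For reference, the objectives behind point 2 can be written side by side. These are the forms as I understand them from the book and from the sklearn and glmnet documentation; the symbols $n$, $w$, and $\beta$ are notation chosen here, not taken from the notebook.

$$
\begin{aligned}
\text{Book (ridge):}\quad & \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j}\beta_j x_{ij}\Big)^2+\lambda\sum_{j}\beta_j^2 \\
\text{sklearn Ridge:}\quad & \lVert y-Xw\rVert_2^2+\alpha\lVert w\rVert_2^2 \\
\text{glmnet (Gaussian):}\quad & \frac{1}{2n}\sum_{i=1}^{n}\big(y_i-\beta_0-x_i^{\top}\beta\big)^2+\lambda\Big[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\Big]
\end{aligned}
$$

So the `alpha` of sklearn's `Ridge` plays the role of the book's $\lambda$ directly, whereas glmnet rescales the loss by $1/(2n)$ and uses its own `alpha` to mix the two penalties. Note that sklearn's `Lasso` also minimizes $\frac{1}{2n}\lVert y-Xw\rVert_2^2+\alpha\lVert w\rVert_1$, so its `alpha` is on a different scale from the ridge one.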

In [4]:
# Set the seed for reproducibility
np.random.seed(1)

# Initialize X and y, then standardize X (NOT y)
t_prop = 0.5
y = hitter.Salary
X = hitter.drop('Salary', axis=1)

X_scaled = scale(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y.to_numpy(), test_size=t_prop)

# Fit Ridge regression
score_dict = {}
for i in [4, 0]:
    ridge = skl_lm.Ridge(alpha=i, fit_intercept=True).fit(X_train, y_train)
    pred = ridge.predict(X_test)
    score_dict[f'alpha={i}'] = mean_squared_error(y_test, pred)

# Fit OLS
regr = skl_lm.LinearRegression(fit_intercept=True).fit(X_train, y_train)
pred = regr.predict(X_test)
score_dict['OLS'] = mean_squared_error(y_test, pred)

# Return dataframe
pd.DataFrame.from_dict(score_dict, orient='index', columns=['MSE'])


|  | MSE |
|---|---|
| alpha=4 | 102375.707696 |
| alpha=0 | 116690.468567 |
| OLS | 116690.468567 |


The result is broadly similar to the book's; the differences come from how the data are split.
As expected, the OLS result is identical to ridge with alpha = 0.
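The equivalence between OLS and ridge with alpha = 0 can also be checked at the coefficient level. Below is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the true coefficients are made up for illustration and are unrelated to the Hitters data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, well-conditioned data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.standard_normal(100)

# With no penalty, ridge solves the same least-squares problem as OLS
ols = LinearRegression().fit(X_demo, y_demo)
ridge0 = Ridge(alpha=0).fit(X_demo, y_demo)

coef_match = bool(np.allclose(ols.coef_, ridge0.coef_))
intercept_match = bool(np.isclose(ols.intercept_, ridge0.intercept_))
print(coef_match, intercept_match)
```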

Now we use RidgeCV to pick the best alpha by cross-validation.

In [37]:
ridgeCV = skl_lm.RidgeCV(alphas=np.linspace(0, 1000, num=1000), fit_intercept=True, cv=10).fit(X_train, y_train)
print('Best alpha/lambda: ', ridgeCV.alpha_)

# Print the coefficients
print(pd.Series(ridgeCV.coef_, index=X.columns))

Best alpha/lambda:  214.21421421421422
AtBat        11.789680
Hits         28.059029
HmRun         7.833336
Runs         19.611866
RBI          26.932623
Walks        34.214407
Years        12.372675
CAtBat       20.348077
CHits        27.732906
CHmRun       33.721081
CRuns        27.651778
CRBI         34.991083
CWalks       22.682022
League        5.187553
Division    -28.433171
PutOuts      54.174106
Assists      -1.400036
Errors        1.001642
NewLeague     4.207332
dtype: float64
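The idea behind the search above can be sketched by hand: fit a ridge model for each candidate alpha, score it with 10-fold cross-validation, and keep the alpha with the lowest average error. This is a sketch of the general idea on synthetic data, not `RidgeCV`'s exact internals; using MSE as the criterion and the names `X_demo`, `y_demo`, and `alphas` are assumptions made here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((120, 8))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.standard_normal(120)

# Candidate penalties on a log grid
alphas = np.logspace(-2, 3, 20)

cv_mse = []
for a in alphas:
    # sklearn returns negated MSE so that larger is always better
    scores = cross_val_score(Ridge(alpha=a), X_demo, y_demo,
                             cv=10, scoring='neg_mean_squared_error')
    cv_mse.append(-scores.mean())

# Keep the alpha with the smallest average validation MSE
best_alpha = alphas[int(np.argmin(cv_mse))]
print(best_alpha)
```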




None of the coefficients is exactly 0: ridge does not perform feature selection!
The best alpha is about 214 (compared to 212 in the book). We can now compute the test MSE at that alpha.

In [6]:
ridge = skl_lm.Ridge(alpha=214, fit_intercept=True).fit(X_train, y_train)
pred = ridge.predict(X_test)
pd.DataFrame.from_dict({'MSE':mean_squared_error(y_test, pred)}, 
                       orient='index', columns=['alpha=214'])

|  | alpha=214 |
|---|---|
| MSE | 100416.083138 |


Note: I did not refit the model to report its coefficients, since they would not match those in the book.

### 6.6.2 - The Lasso

The main advantage of the lasso over ridge is interpretability: the lasso also performs
feature selection for us.
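A small sketch on synthetic data illustrates the difference. Only the first three of ten features carry signal; `true_coef`, the noise level, and the choice `alpha=5.0` are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 3 informative features, 7 pure-noise features (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 10))
true_coef = np.array([50.0, -30.0, 20.0] + [0.0] * 7)
y_demo = X_demo @ true_coef + 5.0 * rng.standard_normal(200)

ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)

# Ridge shrinks coefficients toward zero; the lasso can set them exactly to zero
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)
```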

First, we set alpha = 1.

In [11]:
# The Lasso
score_dict = {}
lasso = skl_lm.Lasso(alpha=1, fit_intercept=True, max_iter=10000).fit(X_train, y_train)
pred = lasso.predict(X_test)
# Key the score by the alpha actually used for this fit
score_dict['alpha=1'] = mean_squared_error(y_test, pred)

pd.DataFrame.from_dict(score_dict, orient='index', columns=['MSE'])

|  | MSE |
|---|---|
| alpha=1 | 106603.291383 |


With alpha=1 the lasso already beats OLS on test MSE (106603 vs. 116690), though not the ridge fits above. Now let's use cross-validation
to find the best alpha. (max_iter is raised to ensure convergence.)

In [39]:
# Run the lassoCV
lassoCV = skl_lm.LassoCV(alphas=None, fit_intercept=True, cv=10, max_iter=100000)
lassoCV.fit(X_train, y_train)
print(f'Best alpha/lambda for lasso is: {lassoCV.alpha_}\n')
print(pd.Series(lassoCV.coef_, index=X.columns))

Best alpha/lambda for lasso is: 28.038544563299862

AtBat          0.000000
Hits          47.417650
HmRun          0.000000
Runs           0.000000
RBI            0.000000
Walks         64.015609
Years          0.000000
CAtBat         0.000000
CHits          0.000000
CHmRun        16.244204
CRuns          0.000000
CRBI         168.983059
CWalks         0.000000
League         0.000000
Division     -43.965445
PutOuts      104.196353
Assists       -0.000000
Errors        -0.000000
NewLeague      0.000000
dtype: float64


Only 6 variables (not counting the intercept) have nonzero coefficients.
However, this is one variable fewer than in the book: LeagueN is dropped here.
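The count can be read off programmatically rather than by eye. The sketch below hard-codes the coefficients printed above into a Series named `lasso_coefs` (a name chosen here); in the notebook, `pd.Series(lassoCV.coef_, index=X.columns)` would take its place.

```python
import pandas as pd

# Coefficients copied from the LassoCV output above
lasso_coefs = pd.Series({
    'AtBat': 0.0, 'Hits': 47.417650, 'HmRun': 0.0, 'Runs': 0.0,
    'RBI': 0.0, 'Walks': 64.015609, 'Years': 0.0, 'CAtBat': 0.0,
    'CHits': 0.0, 'CHmRun': 16.244204, 'CRuns': 0.0, 'CRBI': 168.983059,
    'CWalks': 0.0, 'League': 0.0, 'Division': -43.965445,
    'PutOuts': 104.196353, 'Assists': -0.0, 'Errors': -0.0, 'NewLeague': 0.0,
})

# Keep only the variables the lasso did not shrink to exactly zero
selected = lasso_coefs[lasso_coefs != 0]
print(len(selected), selected.index.tolist())
```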

