# Regularization

## Why Regularize?

When we try to fit a model closely to our data, we often end up overfitting. Regularization discourages overly complex models by adding a penalty on coefficient size to the loss function.

### The Bias-Variance Tradeoff

When we did Linear Regression, we briefly talked about the Bias-Variance Tradeoff.

![](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)

![](https://miro.medium.com/max/544/1*Y-yJiR0FzMgchPA-Fm5c1Q.jpeg)

**High bias** 

 - Systematic error in predictions (i.e. the predictions are off on average)
 - Bias is about the strength of assumptions the model makes
 - Underfit models tend to have high bias


**High variance**

 - The model is highly sensitive to changes in the data
 - Overfit models tend to have low bias and high variance
    
    
![](https://gblobscdn.gitbook.com/assets%2F-LvBP1svpACTB1R1x_U4%2F-LvNWUoWieQqaGmU_gl9%2F-LvNoby-llz4QzAK15nL%2Fimage.png?alt=media&token=41720ce9-bb66-4419-9bd8-640abf1fc415)

 - Underfit Models fail to capture all of the information in the data
 - Overfit models fit to the noise in the data and fail to generalize
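The bullets above can be checked with a small simulation. This is only a sketch under stated assumptions: the sine-shaped `true_fn`, the noise level, and the polynomial degrees are made up for illustration and have nothing to do with the housing data used later. It refits a simple model and a flexible model to many noisy copies of the same data, then estimates squared bias and variance of the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical "true" relationship, used only for this simulation
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)        # fixed design; only the noise is resampled
x_eval = np.linspace(0.05, 0.95, 50)   # points where bias and variance are measured

def bias_variance(degree, n_sets=200, noise=0.3):
    """Fit a polynomial of the given degree to many noisy copies of the data."""
    preds = np.empty((n_sets, x_eval.size))
    for s in range(n_sets):
        y_train = true_fn(x_train) + rng.normal(0, noise, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[s] = np.polyval(coefs, x_eval)
    # Squared bias: how far the average prediction is from the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_eval)) ** 2)
    # Variance: how much predictions move from one training set to the next
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

bias_simple, var_simple = bias_variance(degree=1)
bias_flexible, var_flexible = bias_variance(degree=9)

print(f"degree 1: bias^2 = {bias_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 9: bias^2 = {bias_flexible:.4f}, variance = {var_flexible:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-9 polynomial the larger variance, which is the tradeoff the pictures describe.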


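**How would we know if our model is over or underfit?**
 - Use a train-test split and compare the training error with the testing error: an overfit model scores much better on train than on test, while an underfit model scores poorly on both
 - As model complexity increases, so does the possibility of overfitting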
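A minimal sketch of that check, assuming synthetic data (a noisy sine curve) and polynomial degree as the stand-in for model complexity. None of these choices come from the lesson's dataset; they are only there to make the train/test comparison visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy sine curve (illustrative assumption)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(0, 0.3, 300)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=42)

train_mse, test_mse = {}, {}
for degree in [1, 3, 12]:
    # Higher degree = more complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    train_mse[degree] = mean_squared_error(y_tr, model.predict(x_tr))
    test_mse[degree] = mean_squared_error(y_te, model.predict(x_te))
    print(f"degree {degree:>2}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Training error can only go down as the degree grows, so it is the gap between the train and test columns that signals overfitting.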

## Ridge and Lasso

Ridge and Lasso regression are two examples of penalized estimation. Penalized estimation makes some or all of the coefficients smaller in magnitude (closer to zero). Some of the penalties have the property of performing both variable selection (setting some coefficients exactly equal to zero) and shrinking the other coefficients. 

In Ridge regression, the cost function is changed by adding a penalty term proportional to the sum of the squared coefficients.

$$ \text{cost_function_ridge} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p m_j^2 = \sum_{i=1}^n\left(y_i - \sum_{j=1}^p m_jx_{ij} - b\right)^2 + \lambda \sum_{j=1}^p m_j^2$$

Lasso regression (Least Absolute Shrinkage and Selection Operator) is very similar to Ridge regression, except that the penalty term uses the absolute values of the coefficients rather than their squares.

$$ \text{cost_function_lasso} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \mid m_j \mid = \sum_{i=1}^n\left(y_i - \sum_{j=1}^p m_jx_{ij} - b\right)^2 + \lambda \sum_{j=1}^p \mid m_j \mid$$
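To make the two cost functions concrete, here is a small NumPy sketch that evaluates them for a fixed set of coefficients. The function names (`sum_squared_errors`, `ridge_cost`, `lasso_cost`) and the tiny dataset are hypothetical and exist only for this illustration.

```python
import numpy as np

def sum_squared_errors(X, y, m, b):
    """Unpenalized part of the cost: sum of squared residuals."""
    residuals = y - (X @ m + b)
    return np.sum(residuals ** 2)

def ridge_cost(X, y, m, b, lam):
    """Squared error plus lambda times the sum of squared coefficients."""
    return sum_squared_errors(X, y, m, b) + lam * np.sum(m ** 2)

def lasso_cost(X, y, m, b, lam):
    """Squared error plus lambda times the sum of absolute coefficients."""
    return sum_squared_errors(X, y, m, b) + lam * np.sum(np.abs(m))

# Tiny made-up example
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([2.0, 0.0])
m = np.array([1.0, -2.0])   # coefficients
b = 0.0                     # intercept
lam = 0.5                   # penalty strength (lambda)

print(sum_squared_errors(X, y, m, b))  # residuals are [1, 2], so 1 + 4 = 5.0
print(ridge_cost(X, y, m, b, lam))     # 5.0 + 0.5 * (1 + 4) = 7.5
print(lasso_cost(X, y, m, b, lam))     # 5.0 + 0.5 * (1 + 2) = 6.5
```

Note that the intercept `b` is not penalized. scikit-learn calls the penalty strength `alpha`, and its `Lasso` objective divides the squared-error term by `2 * n_samples`, so the same number means a different amount of regularization in `Ridge` and `Lasso`.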

So we're penalizing large coefficients -- what are the effects/implications of that?
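One way to see the effect is to fit both models at increasing penalty strengths and watch the coefficients. The sketch below assumes synthetic data from `make_regression` and arbitrary `alpha` values chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

alphas = [1, 10, 1000]
ridge_norms = []         # overall size of the Ridge coefficient vector
lasso_zero_counts = []   # how many Lasso coefficients are exactly zero

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge_norms.append(np.linalg.norm(ridge.coef_))
    lasso_zero_counts.append(int(np.sum(lasso.coef_ == 0)))
    print(f"alpha={alpha:>4}: ridge coefficient norm = {ridge_norms[-1]:8.2f}, "
          f"lasso zero coefficients = {lasso_zero_counts[-1]}")
```

Ridge coefficients keep shrinking toward zero as `alpha` grows but generally stay nonzero, while Lasso sets coefficients exactly to zero, and with a large enough penalty it zeroes all of them.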

### Standardization before Regularization

An important step before using either Lasso or Ridge regularization is to standardize your data so that all features are on the same scale. Regularization penalizes larger coefficients, and the size of a coefficient depends on the units of its feature, so **if you have features that are on different scales, some will get unfairly penalized**. A downside of standardization is that the coefficients become less interpretable: they must be transformed back to the original scale if you want to interpret how a one-unit change in a feature impacts the target variable.

**Scaler documentation:**

* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
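Transforming coefficients back to the original scale is a short calculation when `StandardScaler` is used. The sketch below assumes made-up features (`sqft`, `rooms`) and a made-up target, and relies on the scaler's fitted `mean_` and `scale_` attributes. A different scaler such as `MinMaxScaler` would need its own version of this algebra.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Made-up data with features on very different scales
rng = np.random.default_rng(42)
sqft = rng.uniform(500, 4000, 200)
rooms = rng.integers(1, 8, 200).astype(float)
X = np.column_stack([sqft, rooms])
y = 100 * sqft + 5000 * rooms + rng.normal(0, 10000, 200)

# Fit on standardized features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

# Undo the scaling: x_scaled = (x - mean) / scale
coef_original = ridge.coef_ / scaler.scale_
intercept_original = ridge.intercept_ - np.sum(ridge.coef_ * scaler.mean_ / scaler.scale_)

# Both forms of the model give the same predictions
pred_scaled = ridge.predict(X_scaled)
pred_original = X @ coef_original + intercept_original

print("Coefficients on the standardized scale:", ridge.coef_)
print("Coefficients per original unit:", coef_original)
```

The back-transformed coefficients read as "change in target per one unit of the raw feature", which is the interpretation lost by scaling.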

## Let's Code! 

Start with a regular Linear Regression.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('data/ames_train.csv') # Ames housing data

# Drop sale detail columns 
df = df.drop(columns = ['Id', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

# Create X and y
y = df['SalePrice']
X = df.drop(columns=['SalePrice'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Time to Clean/Process

In [3]:
# Explore X_train
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1023 to 1126
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1095 non-null   int64  
 1   MSZoning       1095 non-null   object 
 2   LotFrontage    895 non-null    float64
 3   LotArea        1095 non-null   int64  
 4   Street         1095 non-null   object 
 5   Alley          70 non-null     object 
 6   LotShape       1095 non-null   object 
 7   LandContour    1095 non-null   object 
 8   Utilities      1095 non-null   object 
 9   LotConfig      1095 non-null   object 
 10  LandSlope      1095 non-null   object 
 11  Neighborhood   1095 non-null   object 
 12  Condition1     1095 non-null   object 
 13  Condition2     1095 non-null   object 
 14  BldgType       1095 non-null   object 
 15  HouseStyle     1095 non-null   object 
 16  OverallQual    1095 non-null   int64  
 17  OverallCond    1095 non-null   int64  
 18  YearB

In [5]:
len(X_train)

1095

In [29]:
# Let's check the percentage of our training data that's null per column
null_perc = X_train.isna().sum() / len(X_train)
null_perc.sort_values(ascending=False).head(15)

GarageFinish    0.052968
GarageType      0.052968
GarageCond      0.052968
GarageQual      0.052968
GarageYrBlt     0.052968
BsmtExposure    0.024658
BsmtFinType2    0.024658
BsmtCond        0.024658
BsmtQual        0.024658
BsmtFinType1    0.024658
MasVnrType      0.003653
MasVnrArea      0.003653
Electrical      0.000913
YearBuilt       0.000000
ExterQual       0.000000
dtype: float64

In [18]:
# Drop where nulls are more than 10% of column
null_cols_to_drop = list(null_perc.loc[null_perc > .1].index)
print(null_cols_to_drop)

X_train = X_train.drop(null_cols_to_drop, axis=1)
X_test = X_test.drop(null_cols_to_drop, axis=1)

['LotFrontage', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']


In [27]:
# Start with the continuous variables

# Grab only numeric features
X_train['Fireplaces'].dtype

num_types = ['int64', 'float64']

num_cols = []
for col in X_train.columns:
    if X_train[col].dtype in num_types:
        num_cols.append(col)
        
# Same result as the loop above, written as a list comprehension
num_cols = [c for c in X_train.columns if X_train[c].dtype in num_types]

In [28]:
X_train[num_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1023 to 1126
Data columns (total 33 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1095 non-null   int64  
 1   LotArea        1095 non-null   int64  
 2   OverallQual    1095 non-null   int64  
 3   OverallCond    1095 non-null   int64  
 4   YearBuilt      1095 non-null   int64  
 5   YearRemodAdd   1095 non-null   int64  
 6   MasVnrArea     1091 non-null   float64
 7   BsmtFinSF1     1095 non-null   int64  
 8   BsmtFinSF2     1095 non-null   int64  
 9   BsmtUnfSF      1095 non-null   int64  
 10  TotalBsmtSF    1095 non-null   int64  
 11  1stFlrSF       1095 non-null   int64  
 12  2ndFlrSF       1095 non-null   int64  
 13  LowQualFinSF   1095 non-null   int64  
 14  GrLivArea      1095 non-null   int64  
 15  BsmtFullBath   1095 non-null   int64  
 16  BsmtHalfBath   1095 non-null   int64  
 17  FullBath       1095 non-null   int64  
 18  HalfB

In [30]:
X_train_cont = X_train[num_cols]
X_test_cont = X_test[num_cols]

In [34]:
# Impute missing values with 0 using SimpleImputer
# (most columns look like they just don't have details)

imputer = SimpleImputer(strategy="constant", fill_value=0)

X_train_imputed = imputer.fit_transform(X_train_cont)
X_test_imputed = imputer.transform(X_test_cont)


In [35]:
# Scale the train and test data
scaler = MinMaxScaler()

X_train_imsc = scaler.fit_transform(X_train_imputed)
X_test_imsc = scaler.transform(X_test_imputed)

In [37]:
X_train['MSZoning']

1023    RL
810     RL
1384    RL
626     RL
813     RL
        ..
1095    RL
1130    RL
1294    RL
860     RL
1126    RL
Name: MSZoning, Length: 1095, dtype: object

In [39]:
# Now time for the categorical columns

# Create X_cat which contains only the categorical variables
cat_cols = [c for c in X_train.columns if X_train[c].dtype in ['object']]

X_train_cat = X_train[cat_cols]
X_test_cat = X_test[cat_cols]

# Fill missing values with the string 'missing'
X_train_cat = X_train_cat.fillna(value="missing")
# Same as:
# imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')
# imputer_cat.fit_transform(X_train_cat)

X_test_cat = X_test_cat.fillna(value="missing")


In [43]:
# Exploring column percentages

# Let's remove any column where the most common value is more than 90% of that col
col_to_drop= []
for col in X_train_cat.columns:
    col_series = X_train_cat[col].value_counts()
    display(col_series/len(X_train_cat))
    
    # value_counts() sorts descending, so the first entry (by position) is the most common value
    if col_series.iloc[0]/len(X_train_cat) > .9:
        col_to_drop.append(col)

RL         0.790868
RM         0.149772
FV         0.042922
RH         0.012785
C (all)    0.003653
Name: MSZoning, dtype: float64

Pave    0.996347
Grvl    0.003653
Name: Street, dtype: float64

Reg    0.621918
IR1    0.338813
IR2    0.031963
IR3    0.007306
Name: LotShape, dtype: float64

Lvl    0.905936
Bnk    0.041096
HLS    0.029224
Low    0.023744
Name: LandContour, dtype: float64

AllPub    0.999087
NoSeWa    0.000913
Name: Utilities, dtype: float64

Inside     0.699543
Corner     0.190868
CulDSac    0.073059
FR2        0.033790
FR3        0.002740
Name: LotConfig, dtype: float64

Gtl    0.946119
Mod    0.045662
Sev    0.008219
Name: LandSlope, dtype: float64

NAmes      0.152511
CollgCr    0.102283
OldTown    0.079452
Edwards    0.075799
Somerst    0.056621
NWAmes     0.054795
Gilbert    0.053881
NridgHt    0.052968
Sawyer     0.046575
BrkSide    0.040183
SawyerW    0.036530
Crawfor    0.035616
Mitchel    0.034703
NoRidge    0.028311
Timber     0.024658
IDOTRR     0.022831
StoneBr    0.018265
ClearCr    0.017352
SWISU      0.016438
Blmngtn    0.013699
BrDale     0.011872
MeadowV    0.009132
Veenker    0.008219
NPkVill    0.006393
Blueste    0.000913
Name: Neighborhood, dtype: float64

Norm      0.863927
Feedr     0.053881
Artery    0.033790
RRAn      0.015525
PosN      0.012785
RRAe      0.009132
PosA      0.005479
RRNn      0.004566
RRNe      0.000913
Name: Condition1, dtype: float64

Norm      0.991781
Feedr     0.002740
Artery    0.001826
PosN      0.001826
RRAn      0.000913
RRAe      0.000913
Name: Condition2, dtype: float64

1Fam      0.835616
TwnhsE    0.076712
Duplex    0.033790
Twnhs     0.029224
2fmCon    0.024658
Name: BldgType, dtype: float64

1Story    0.494064
2Story    0.309589
1.5Fin    0.104110
SLvl      0.045662
SFoyer    0.021005
1.5Unf    0.010046
2.5Unf    0.009132
2.5Fin    0.006393
Name: HouseStyle, dtype: float64

Gable      0.769863
Hip        0.205479
Flat       0.010046
Gambrel    0.008219
Mansard    0.004566
Shed       0.001826
Name: RoofStyle, dtype: float64

CompShg    0.982648
Tar&Grv    0.008219
WdShngl    0.003653
WdShake    0.002740
Metal      0.000913
ClyTile    0.000913
Roll       0.000913
Name: RoofMatl, dtype: float64

VinylSd    0.359817
HdBoard    0.152511
MetalSd    0.146119
Wd Sdng    0.145205
Plywood    0.068493
CemntBd    0.038356
BrkFace    0.034703
Stucco     0.017352
WdShing    0.017352
AsbShng    0.014612
BrkComm    0.001826
Stone      0.000913
ImStucc    0.000913
CBlock     0.000913
AsphShn    0.000913
Name: Exterior1st, dtype: float64

VinylSd    0.351598
HdBoard    0.140639
Wd Sdng    0.140639
MetalSd    0.139726
Plywood    0.094064
CmentBd    0.037443
Wd Shng    0.029224
Stucco     0.019178
AsbShng    0.015525
BrkFace    0.013699
Brk Cmn    0.005479
ImStucc    0.005479
Stone      0.002740
AsphShn    0.002740
CBlock     0.000913
Other      0.000913
Name: Exterior2nd, dtype: float64

None       0.579909
BrkFace    0.315068
Stone      0.090411
BrkCmn     0.010959
missing    0.003653
Name: MasVnrType, dtype: float64

TA    0.622831
Gd    0.331507
Ex    0.035616
Fa    0.010046
Name: ExterQual, dtype: float64

TA    0.876712
Gd    0.098630
Fa    0.021918
Ex    0.001826
Po    0.000913
Name: ExterCond, dtype: float64

PConc     0.449315
CBlock    0.427397
BrkTil    0.098630
Slab      0.017352
Stone     0.004566
Wood      0.002740
Name: Foundation, dtype: float64

TA         0.442922
Gd         0.421005
Ex         0.085845
Fa         0.025571
missing    0.024658
Name: BsmtQual, dtype: float64

TA         0.892237
Gd         0.047489
Fa         0.034703
missing    0.024658
Po         0.000913
Name: BsmtCond, dtype: float64

No         0.656621
Av         0.148858
Gd         0.089498
Mn         0.080365
missing    0.024658
Name: BsmtExposure, dtype: float64

Unf        0.287671
GLQ        0.285845
ALQ        0.155251
BLQ        0.102283
Rec        0.092237
LwQ        0.052055
missing    0.024658
Name: BsmtFinType1, dtype: float64

Unf        0.863927
Rec        0.038356
LwQ        0.030137
missing    0.024658
BLQ        0.018265
ALQ        0.015525
GLQ        0.009132
Name: BsmtFinType2, dtype: float64

GasA     0.977169
GasW     0.013699
Grav     0.003653
Wall     0.002740
OthW     0.001826
Floor    0.000913
Name: Heating, dtype: float64

Ex    0.506849
TA    0.291324
Gd    0.165297
Fa    0.035616
Po    0.000913
Name: HeatingQC, dtype: float64

Y    0.928767
N    0.071233
Name: CentralAir, dtype: float64

SBrkr      0.915068
FuseA      0.060274
FuseF      0.021005
FuseP      0.002740
missing    0.000913
Name: Electrical, dtype: float64

TA    0.502283
Gd    0.401826
Ex    0.066667
Fa    0.029224
Name: KitchenQual, dtype: float64

Typ     0.925114
Min2    0.025571
Min1    0.024658
Mod     0.011872
Maj1    0.008219
Maj2    0.003653
Sev     0.000913
Name: Functional, dtype: float64

Attchd     0.594521
Detchd     0.263927
BuiltIn    0.063014
missing    0.052968
Basment    0.013699
CarPort    0.006393
2Types     0.005479
Name: GarageType, dtype: float64

Unf        0.410046
RFn        0.293151
Fin        0.243836
missing    0.052968
Name: GarageFinish, dtype: float64

TA         0.900457
missing    0.052968
Fa         0.031963
Gd         0.010959
Ex         0.002740
Po         0.000913
Name: GarageQual, dtype: float64

TA         0.908676
missing    0.052968
Fa         0.024658
Gd         0.008219
Po         0.003653
Ex         0.001826
Name: GarageCond, dtype: float64

Y    0.916895
N    0.061187
P    0.021918
Name: PavedDrive, dtype: float64

In [45]:
col_to_drop

['Street',
 'LandContour',
 'Utilities',
 'LandSlope',
 'Condition2',
 'RoofMatl',
 'Heating',
 'CentralAir',
 'Electrical',
 'Functional',
 'GarageQual',
 'GarageCond',
 'PavedDrive']

In [46]:
# Now drop those
X_train_cat = X_train_cat.drop(col_to_drop, axis=1)
X_test_cat = X_test_cat.drop(col_to_drop, axis=1)

In [47]:
# OneHotEncode categorical variables
ohe = OneHotEncoder(handle_unknown='ignore')

X_train_ohe = ohe.fit_transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)

# Convert these columns into a DataFrame
# (get_feature_names_out replaces get_feature_names, which newer scikit-learn versions no longer provide)
ohe_col_names = ohe.get_feature_names_out(input_features=X_train_cat.columns)
cat_train_df = pd.DataFrame(X_train_ohe.toarray(), columns=ohe_col_names)
cat_test_df = pd.DataFrame(X_test_ohe.toarray(), columns=ohe_col_names)

In [48]:
cat_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Columns: 167 entries, MSZoning_C (all) to GarageFinish_missing
dtypes: float64(167)
memory usage: 1.4 MB


In [53]:
# Put it all back together
X_train_all = pd.concat([pd.DataFrame(X_train_imsc), cat_train_df], axis=1)
X_test_all = pd.concat([pd.DataFrame(X_test_imsc), cat_test_df], axis=1)

# Fit the model
linreg = LinearRegression()
linreg.fit(X_train_all, y_train)

LinearRegression()
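The same impute, scale, encode, and fit steps can also be bundled into one object, so that everything is fit on the training data and reapplied to the test data automatically. The following is only a sketch of that design choice using scikit-learn's `ColumnTransformer` and `Pipeline`. The `toy` DataFrame and its values are invented for illustration and are not the Ames data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Invented toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 2000.0, np.nan, 1200.0, 1800.0, 2400.0],
    "GarageType": pd.Series(
        ["Attchd", "Detchd", np.nan, "Attchd", "BuiltIn", "Attchd"], dtype=object
    ),
})
toy_y = pd.Series([200000, 250000, 180000, 150000, 230000, 300000])

# Numeric columns: fill missing values with 0, then scale to [0, 1]
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("scale", MinMaxScaler()),
])

# Categorical columns: fill missing values with 'missing', then one-hot encode
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", num_pipe, ["GrLivArea"]),
    ("cat", cat_pipe, ["GarageType"]),
])

model = Pipeline([("prep", preprocess), ("linreg", LinearRegression())])
model.fit(toy, toy_y)

preds = model.predict(toy)
n_cols = model.named_steps["prep"].transform(toy).shape[1]
print("Columns after preprocessing:", n_cols)
```

With this layout, swapping `LinearRegression()` for a regularized model only changes the last step of the pipeline.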

In [54]:
# Write a quick evaluation function
def evaluate(train_actual, train_predicted, test_actual, test_predicted):
    '''
    Takes in both actual and predicted values, for the train and test set
    Then prints the scores based on those values
    
    Inputs:
    -------
    train_actual - actual target values for the train set
    train_predicted - predicted target values for the train set
    test_actual - actual target values for the test set
    test_predicted - predicted target values for the test set
    '''
    print('Train R2:', r2_score(train_actual, train_predicted))
    print('Test R2:', r2_score(test_actual, test_predicted))
    print("*****")
    print('Train MSE:', mean_squared_error(train_actual, train_predicted))
    print('Test MSE:', mean_squared_error(test_actual, test_predicted))
    print("*****")
    # RMSE as the square root of MSE (works regardless of scikit-learn version)
    print('Train RMSE:', np.sqrt(mean_squared_error(train_actual, train_predicted)))
    print('Test RMSE:', np.sqrt(mean_squared_error(test_actual, test_predicted)))

In [55]:
# Grab predictions and evaluate
train_preds = linreg.predict(X_train_all)
test_preds = linreg.predict(X_test_all)

evaluate(y_train, train_preds, y_test, test_preds)

Train R2: 0.8902022283622764
Test R2: -2.2071686650036666e+20
*****
Train MSE: 666631145.9652967
Test MSE: 1.546189862898988e+30
*****
Train RMSE: 25819.201110129197
Test RMSE: 1243458830399699.2


In [56]:
# Let's wrap up that coefficient exploration in a function


def eval_coefficients(model, column_names):
    '''
    Prints an exploration of the coefficients
    
    Inputs:
    model - a fit linear model (sklearn)
    column_names - a list of feature names that matches the order passed into the model
    
    Outputs:
    coefs - a Series, sorted by coefficient value
    '''

    print("Total number of coefficients: ", len(model.coef_))
    print("Coefficients close to zero: ", sum(abs(model.coef_) < 10**(-10)))
    print(f"Intercept: {model.intercept_}")
    
    coefs = pd.Series(model.coef_, index= column_names)
    display(coefs.sort_values(ascending=False))
    return coefs.sort_values(ascending=False)

In [60]:
X_train_all.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | GarageType_Attchd | GarageType_Basment | GarageType_BuiltIn | GarageType_CarPort | GarageType_Detchd | GarageType_missing | GarageFinish_Fin | GarageFinish_RFn | GarageFinish_Unf | GarageFinish_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.588235 | 0.008797 | 0.666667 | 0.5 | 0.963768 | 0.933333 | 0.01016 | 0.002835 | 0.0 | 0.569349 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.041319 | 0.555556 | 0.625 | 0.73913 | 0.816667 | 0.071843 | 0.11747 | 0.334516 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.176471 | 0.036271 | 0.555556 | 0.5 | 0.485507 | 0.0 | 0.0 | 0.036145 | 0.0 | 0.152397 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.051611 | 0.444444 | 0.5 | 0.637681 | 0.466667 | 0.0 | 0.0 | 0.0 | 0.418664 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.039496 | 0.555556 | 0.625 | 0.623188 | 0.133333 | 0.176343 | 0.107725 | 0.0 | 0.357021 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [59]:
model_cols = [*X_train_cont.columns, *ohe_col_names]

In [61]:
len(model_cols)

200

In [63]:
# Explore coefficients
linreg_coefs = eval_coefficients(linreg, model_cols)

Total number of coefficients:  200
Coefficients close to zero:  0
Intercept: 9.110175314056923e+16


BsmtFinSF1         4.767837e+16
BldgType_1Fam      2.231566e+16
BldgType_Duplex    2.231566e+16
BldgType_2fmCon    2.231566e+16
BldgType_TwnhsE    2.231566e+16
                       ...     
ExterQual_Fa      -4.542415e+16
ExterQual_Ex      -4.542415e+16
ExterQual_Gd      -4.542415e+16
ExterQual_TA      -4.542415e+16
TotalBsmtSF       -5.161496e+16
Length: 200, dtype: float64

**Evaluate**

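- WOAH THAT MODEL IS SO BAD: train R2 is 0.89, but test R2 is hugely negative and test RMSE is around 1.2e15, so it does not generalize at all
- Also those coefficients are crazy: they are on the order of 1e16, and every dummy column from the same categorical feature (e.g. `BldgType`) gets an identical value, which points to perfectly collinear one-hot columns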
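One plausible contributor to coefficients this extreme is the "dummy variable trap": when every category of a feature gets its own one-hot column, those columns always sum to 1 and are perfectly collinear with the intercept. The sketch below uses a made-up six-row column to show the rank deficiency. It is an illustration of the general issue, not a diagnosis confirmed for this exact model.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three levels
bldg = np.array([["1Fam"], ["Duplex"], ["1Fam"], ["TwnhsE"], ["Duplex"], ["1Fam"]])
intercept = np.ones(len(bldg))

# One column per category: the dummy columns add up to the intercept column
full = OneHotEncoder().fit_transform(bldg).toarray()
full_with_intercept = np.column_stack([intercept, full])

# Dropping one category per feature removes the redundancy
reduced = OneHotEncoder(drop="first").fit_transform(bldg).toarray()
reduced_with_intercept = np.column_stack([intercept, reduced])

rank_full = np.linalg.matrix_rank(full_with_intercept)
rank_reduced = np.linalg.matrix_rank(reduced_with_intercept)

print("All categories:  columns =", full_with_intercept.shape[1], "rank =", rank_full)
print("drop='first':    columns =", reduced_with_intercept.shape[1], "rank =", rank_reduced)
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution, and huge offsetting coefficients can fit the training data equally well. A penalty on coefficient size is one way to break that tie.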


## Fitting Ridge and Lasso

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

### LASSO

In [64]:
from sklearn.linear_model import Lasso

lasso = Lasso() # Lasso applies an L1 penalty (default alpha=1.0)

# Fit
lasso.fit(X_train_all, y_train)

# Predict
l_train_preds = lasso.predict(X_train_all)
l_test_preds = lasso.predict(X_test_all)


# Evaluate
evaluate(y_train, l_train_preds, y_test, l_test_preds)

Train R2: 0.8901917658981585
Test R2: 0.8615905414873004
*****
Train MSE: 666694668.2421196
Test MSE: 969601032.6483977
*****
Train RMSE: 25820.43121719929
Test RMSE: 31138.41731123144




In [67]:
# Adjust hyperparameters: alpha controls the penalty strength (check the documentation!)
lasso_v2 = Lasso(alpha=10)

lasso_v2.fit(X_train_all, y_train)

l_train_preds_v2 = lasso_v2.predict(X_train_all)
l_test_preds_v2 = lasso_v2.predict(X_test_all)

In [68]:
evaluate(y_train, l_train_preds_v2, y_test, l_test_preds_v2)

Train R2: 0.8896209715692706
Test R2: 0.867010656536806
*****
Train MSE: 670160214.6908449
Test MSE: 931631451.6273761
*****
Train RMSE: 25887.452842851206
Test RMSE: 30522.63834643683


In [69]:
# Check Lasso Coefficients
lasso_coefs = eval_coefficients(lasso_v2, model_cols)

Total number of coefficients:  200
Coefficients close to zero:  42
Intercept: -32987.26173273483


GrLivArea               144199.948743
OverallQual              76181.228725
LotArea                  65937.181741
2ndFlrSF                 60865.832960
Neighborhood_StoneBr     57520.664444
                            ...      
BldgType_TwnhsE         -21610.237813
BldgType_Twnhs          -24887.607300
LotShape_IR3            -31765.339691
KitchenAbvGr            -44399.066444
Exterior1st_ImStucc     -54895.225279
Length: 200, dtype: float64

### Ridge

In [73]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=5) # Ridge applies an L2 (squared) penalty

# Fit
ridge.fit(X_train_all, y_train)

# Predict
r_train_preds = ridge.predict(X_train_all)
r_test_preds = ridge.predict(X_test_all)


# Evaluate
evaluate(y_train, r_train_preds, y_test, r_test_preds)

Train R2: 0.8816228437958734
Test R2: 0.8658942423728896
*****
Train MSE: 718720408.6148963
Test MSE: 939452277.8760194
*****
Train RMSE: 26808.961349050736
Test RMSE: 30650.485769005674


In [74]:
# Check Ridge Coefficients
ridge_coefs = eval_coefficients(ridge, model_cols)

Total number of coefficients:  200
Coefficients close to zero:  0
Intercept: 19532.59058687798


OverallQual             56677.063670
2ndFlrSF                50632.302560
GrLivArea               43975.219016
Neighborhood_StoneBr    39737.296291
Neighborhood_NoRidge    37688.530485
                            ...     
Neighborhood_OldTown   -13479.540465
KitchenAbvGr           -13974.639202
Neighborhood_Mitchel   -15485.495773
LotShape_IR3           -15668.899146
Neighborhood_Edwards   -20190.545360
Length: 200, dtype: float64

### Let's Discuss

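- Lasso can be used to remove uninformative features, since it sets some coefficients exactly to zero (42 of 200 here); Ridge keeps every feature but shrinks the coefficients, so their relative sizes give a sense of which features matter most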
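Here is a sketch of how the zeroed coefficients can be turned into a feature list. It assumes synthetic data and placeholder feature names (`feature_0`, `feature_1`, ...), and the `alpha` values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 3 of the 10 features influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

lasso = Lasso(alpha=5).fit(X, y)
ridge = Ridge(alpha=5).fit(X, y)

lasso_coefs = pd.Series(lasso.coef_, index=feature_names)
ridge_coefs = pd.Series(ridge.coef_, index=feature_names)

# Features Lasso kept versus features it removed
kept = lasso_coefs[lasso_coefs != 0].index.tolist()
dropped = lasso_coefs[lasso_coefs == 0].index.tolist()

print("Kept by Lasso:", kept)
print("Dropped by Lasso:", dropped)
print("Ridge coefficients ranked by size:")
print(ridge_coefs.abs().sort_values(ascending=False).head())
```

Ranking by coefficient size is only meaningful when the features share a scale, which is one more reason to scale before regularizing.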


## Ridge & Lasso: Other benefits

### Ridge:
* We can "shrink down" the effects of predictor variables instead of deleting/zeroing them
* When features are highly multicollinear, the coefficient weight is spread across the correlated features, so no single redundant feature ends up with an inflated coefficient
* Since it keeps every feature, the model can be computationally expensive when there are many variables

### Lasso:
* When you have a lot of variables, it performs feature selection for you!
* Multicollinearity is also dealt with: among a group of highly correlated features, Lasso tends to keep one and zero out the rest
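The different handling of multicollinearity shows up clearly in an extreme case: a feature that appears twice. The sketch below assumes synthetic data in which the second column is an exact copy of the first. Real collinear features are rarely this clean, so treat it as an illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])                 # second column duplicates the first
y = 3 * x1 + rng.normal(scale=0.5, size=200)  # true effect of the feature is 3

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)  # weight shared across both copies
print("Lasso coefficients:", lasso.coef_)  # weight placed on one copy
```

Ridge splits the effect evenly between the two copies, while Lasso puts it on one of them and leaves the other at (essentially) zero. Which copy Lasso keeps is arbitrary, so its selection among correlated features should not be read as a statement about which one matters.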


### Why not both??

Enter ElasticNet, which combines the Lasso (L1) and Ridge (L2) penalties: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
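A minimal sketch of ElasticNet in scikit-learn, assuming synthetic data and arbitrary hyperparameter values. The `l1_ratio` parameter sets the mix between the two penalties: 1.0 is a pure L1 (Lasso) penalty, and values closer to 0 lean toward L2.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data for illustration
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Half L1, half L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

# With l1_ratio=1.0 the penalty is pure L1, so this matches Lasso
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ElasticNet (l1_ratio=0.5):", np.round(enet.coef_, 2))
print("ElasticNet (l1_ratio=1.0):", np.round(enet_as_lasso.coef_, 2))
print("Lasso:                    ", np.round(lasso.coef_, 2))
```

Both `alpha` and `l1_ratio` are hyperparameters, so they are usually tuned together, for example with `ElasticNetCV`.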