# Regularization

## Why Regularize?

When we try to fit a model as closely as possible to our data, we often end up overfitting. Regularization discourages overly complex models by adding a penalty term to the loss function.

### The Bias-Variance Tradeoff

When we did Linear Regression, we briefly talked about the Bias-Variance Tradeoff.

![](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)

![](https://miro.medium.com/max/544/1*Y-yJiR0FzMgchPA-Fm5c1Q.jpeg)

**High bias** 

 - Systematic error in predictions (i.e., the average prediction is off from the true value)
 - Bias is about the strength of assumptions the model makes
 - Underfit models tend to have high bias


**High variance**

 - The model is highly sensitive to changes in the data
 - Overfit models tend to have low bias and high variance
    
    
![](https://gblobscdn.gitbook.com/assets%2F-LvBP1svpACTB1R1x_U4%2F-LvNWUoWieQqaGmU_gl9%2F-LvNoby-llz4QzAK15nL%2Fimage.png?alt=media&token=41720ce9-bb66-4419-9bd8-640abf1fc415)

 - Underfit models fail to capture all of the information in the data
 - Overfit models fit to the noise in the data and fail to generalize


**How would we know if our model is over or underfit?**
 - Use a train-test split and compare the training error with the testing error
 - As model complexity increases, so does the possibility of overfitting
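
As a rough illustration of that check, the sketch below fits polynomial models of increasing degree and prints training and testing error side by side. The data is synthetic (a noisy sine curve) and the names `X_demo`, `y_demo`, and `errors` are made up for this example. A training error that keeps falling while the testing error stalls or rises is the usual sign of overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 6, size=(60, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

errors = {}
for degree in [1, 4, 12]:
    # Higher degree = more flexible (more complex) model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:>2}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```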

## Ridge and Lasso

Ridge and Lasso regression are two examples of penalized estimation. Penalized estimation makes some or all of the coefficients smaller in magnitude (closer to zero). Some of the penalties have the property of performing both variable selection (setting some coefficients exactly equal to zero) and shrinking the other coefficients. 

In Ridge regression, the cost function is changed by adding a penalty term equal to $\lambda$ times the sum of the squared coefficients. 

$$ \text{cost_function_ridge}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p m_j^2 = \sum_{i=1}^n(y_i - \sum_{j=1}^p(m_jx_{ij})-b)^2 + \lambda \sum_{j=1}^p m_j^2$$

Lasso regression (Least Absolute Shrinkage and Selection Operator) is very similar to Ridge regression, except that the penalty term uses the absolute values of the coefficients rather than their squares.

$$ \text{cost_function_lasso}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \mid m_j \mid = \sum_{i=1}^n(y_i - \sum_{j=1}^p(m_jx_{ij})-b)^2 + \lambda \sum_{j=1}^p \mid m_j \mid$$
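
To make the two penalties concrete, here is a small sketch that evaluates both cost functions by hand with NumPy. The helper names `ridge_cost` and `lasso_cost` and the toy numbers are invented for this example; `m`, `b`, and `lam` play the roles of $m_j$, $b$, and $\lambda$ above.

```python
import numpy as np

def ridge_cost(X, y, m, b, lam):
    """Residual sum of squares plus lam * sum of squared coefficients."""
    residuals = y - (X @ m + b)
    return np.sum(residuals ** 2) + lam * np.sum(m ** 2)

def lasso_cost(X, y, m, b, lam):
    """Residual sum of squares plus lam * sum of absolute coefficients."""
    residuals = y - (X @ m + b)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(m))

# Toy data: 3 observations, 2 features (hypothetical values)
X_toy = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y_toy = np.array([5.0, 4.0, 8.0])
m_toy = np.array([2.0, -1.0])
b_toy = 1.0

# Residuals are [4, -1, 2], so the unpenalized sum of squares is 21
print(ridge_cost(X_toy, y_toy, m_toy, b_toy, lam=0.5))  # 21 + 0.5 * (4 + 1) = 23.5
print(lasso_cost(X_toy, y_toy, m_toy, b_toy, lam=0.5))  # 21 + 0.5 * (2 + 1) = 22.5
```

Note that the intercept `b` is not included in either penalty; only the slope coefficients are shrunk.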

So we're penalizing large coefficients -- what are the effects/implications of that?

### Standardization before Regularization

An important step before using either Lasso or Ridge regularization is to standardize your data so that all features are on the same scale. Regularization penalizes larger coefficients, and the size of a coefficient depends on the units of its feature, so **if you have features that are on different scales, some will get unfairly penalized**. A downside of standardization is that the coefficient values become less interpretable and must be transformed back to the original scale if you want to interpret how a one-unit change in a feature impacts the target variable.
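
The sketch below shows one way to standardize and then map the fitted coefficients back to the original units. The data is synthetic and the names `sqft`, `rooms`, `ridge_demo`, `coef_original`, and `intercept_original` are assumptions for illustration. In a real workflow the scaler would be fit on the training set only and then applied to the test set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
sqft = rng.normal(1500, 400, size=200)   # values in the thousands
rooms = rng.normal(3, 1, size=200)       # single-digit values
X_raw = np.column_stack([sqft, rooms])
y_raw = 100 * sqft + 20000 * rooms + rng.normal(scale=5000, size=200)

# Standardize: each column ends up with mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

ridge_demo = Ridge(alpha=1.0).fit(X_scaled, y_raw)
print("coefficients per standard deviation:", ridge_demo.coef_)

# Undo the scaling: z = (x - mean) / scale, so divide each coefficient by scale
coef_original = ridge_demo.coef_ / scaler.scale_
intercept_original = ridge_demo.intercept_ - np.sum(
    ridge_demo.coef_ * scaler.mean_ / scaler.scale_
)
print("coefficients per original unit:", coef_original)
```

The back-transformed coefficients give the same predictions on the raw features as the fitted model gives on the scaled features.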

**Scaler documentation:**

* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

## Let's Code! 

Start with a regular Linear Regression.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('data/ames_train.csv') # Ames housing data

# Drop sale detail columns 
df = df.drop(columns = ['Id', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

# Create X and y
y = df['SalePrice']
X = df.drop(columns=['SalePrice'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Time to Clean/Process

In [3]:
# Explore X_train
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1023 to 1126
Data columns (total 75 columns):
MSSubClass       1095 non-null int64
MSZoning         1095 non-null object
LotFrontage      895 non-null float64
LotArea          1095 non-null int64
Street           1095 non-null object
Alley            70 non-null object
LotShape         1095 non-null object
LandContour      1095 non-null object
Utilities        1095 non-null object
LotConfig        1095 non-null object
LandSlope        1095 non-null object
Neighborhood     1095 non-null object
Condition1       1095 non-null object
Condition2       1095 non-null object
BldgType         1095 non-null object
HouseStyle       1095 non-null object
OverallQual      1095 non-null int64
OverallCond      1095 non-null int64
YearBuilt        1095 non-null int64
YearRemodAdd     1095 non-null int64
RoofStyle        1095 non-null object
RoofMatl         1095 non-null object
Exterior1st      1095 non-null object
Exterior2nd      1095 no

In [None]:
# Let's check the percentage of our training data that's null per column


In [None]:
# Drop where nulls are more than 10% of column
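
One possible way to fill in the null check and the 10% drop is sketched here on a small made-up frame, since the real `X_train` is not reproduced in full. `toy_train`, `null_pct`, and `cols_to_drop` are hypothetical names. The columns to drop are chosen from the training data and would then be dropped from the test data as well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train (values are made up)
n = 20
toy_train = pd.DataFrame({
    "LotArea": np.arange(8000, 8000 + 100 * n, 100),
    "MasVnrArea": [np.nan] + [0.0] * (n - 1),         # 5% null
    "Alley": ["Grvl", "Pave"] + [np.nan] * (n - 2),   # 90% null
})

# Fraction of null values per column, largest first
null_pct = toy_train.isna().mean().sort_values(ascending=False)
print(null_pct)

# Drop every column where more than 10% of the values are null
cols_to_drop = null_pct[null_pct > 0.10].index
toy_train_reduced = toy_train.drop(columns=cols_to_drop)
print(list(toy_train_reduced.columns))  # ['LotArea', 'MasVnrArea']
```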


In [None]:
# Start with the continuous variables

# Grab only numeric features


# Impute missing values with 0 using SimpleImputer
# (most columns look like they just don't have details)


# Scale the train and test data
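
A minimal sketch of these numeric steps, assuming small made-up train and test frames (`toy_train`, `toy_test`) in place of the real ones. The imputer and scaler are fit on the training data and only applied to the test data, so no information from the test set leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for X_train / X_test (values are made up)
toy_train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})
toy_test = pd.DataFrame({
    "LotArea": [14260.0, 14115.0],
    "MasVnrArea": [np.nan, 350.0],
    "Street": ["Pave", "Pave"],
})

# Grab only numeric features
num_cols = toy_train.select_dtypes(include="number").columns

# Impute missing values with 0
imputer = SimpleImputer(strategy="constant", fill_value=0)
train_num = imputer.fit_transform(toy_train[num_cols])
test_num = imputer.transform(toy_test[num_cols])

# Scale: fit on train, transform both
scaler = StandardScaler()
train_num_scaled = pd.DataFrame(
    scaler.fit_transform(train_num), columns=num_cols, index=toy_train.index
)
test_num_scaled = pd.DataFrame(
    scaler.transform(test_num), columns=num_cols, index=toy_test.index
)
print(train_num_scaled.round(2))
```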


In [None]:
# Now time for the categorical columns

# Create X_cat which contains only the categorical variables

# Fill missing values with the string 'missing'


In [None]:
# Exploring column percentages

# Let's remove any column where the most common value is more than 90% of that col


In [None]:
# Now drop those
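
The categorical steps above could look something like the following sketch. It uses a made-up frame `toy_cat`; `top_share` and `dominated` are hypothetical names. The idea is to measure what share of each column is taken up by its most common value and drop the columns where that share exceeds 90%.

```python
import numpy as np
import pandas as pd

# Toy categorical data (values are made up)
toy_cat = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],        # 95% one value
    "LotShape": ["Reg"] * 12 + ["IR1"] * 8,    # 60% one value
    "Alley": ["Grvl"] * 3 + [np.nan] * 17,     # mostly missing
})

# Keep only non-numeric columns and fill missing values with 'missing'
X_cat = toy_cat.select_dtypes(exclude="number").fillna("missing")

# Share of each column taken up by its most common value
top_share = X_cat.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share)

# Drop columns dominated by a single value (more than 90%)
dominated = top_share[top_share > 0.90].index
X_cat_reduced = X_cat.drop(columns=dominated)
print(list(X_cat_reduced.columns))  # ['LotShape', 'Alley']
```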


In [None]:
# OneHotEncode categorical variables


# Convert these columns into a DataFrame 


In [None]:
# Put it all back together


# Fit the model


In [None]:
# Write a quick evaluation function
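
A possible shape for such a helper is sketched below. The function name `evaluate` and its `label` argument are choices made for this example, not part of scikit-learn. It reports R² and RMSE using the metrics imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred, label=""):
    """Print and return R^2 and RMSE for one set of predictions."""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{label} R2: {r2:.3f} | RMSE: {rmse:,.2f}")
    return r2, rmse

# Tiny worked example: residuals are [1, 0, -1]
r2_demo, rmse_demo = evaluate([3.0, 5.0, 7.0], [2.0, 5.0, 8.0], label="demo")
```

Calling it once on the training predictions and once on the testing predictions makes the train/test gap easy to compare across models.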


In [None]:
# Grab predictions and evaluate


In [None]:
# Plot residuals?


In [None]:
# Explore coefficients


**Evaluate**

- 


In [None]:
# Let's wrap up that coefficient exploration in a function
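
As a sketch of what that function could look like, the hypothetical helper `top_coefficients` below pairs each coefficient with its feature name and returns the largest ones by absolute value. It is demonstrated on noise-free synthetic data so the recovered coefficients are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def top_coefficients(model, feature_names, n=10):
    """Return the n coefficients with the largest absolute value."""
    coefs = pd.Series(model.coef_, index=feature_names)
    order = coefs.abs().sort_values(ascending=False).index
    return coefs.loc[order].head(n)

# Synthetic check: y depends on 'a' and 'b' only, with no noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
y_demo = 2 * X_demo["a"] - 5 * X_demo["b"]

lr_demo = LinearRegression().fit(X_demo, y_demo)
top = top_coefficients(lr_demo, X_demo.columns, n=2)
print(top)  # 'b' (about -5) first, then 'a' (about 2)
```

The same helper works for any fitted linear model with a `coef_` attribute, which makes it handy for comparing the unregularized fit against Lasso and Ridge.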


## Fitting Ridge and Lasso

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

### LASSO

In [None]:
from sklearn.linear_model import Lasso

lasso = Lasso() # Lasso uses an L1 penalty (sum of absolute coefficient values)

# Fit

# Predict

# Evaluate


In [None]:
# Adjust HYPERPARAMETERS -- check documentation!


In [None]:
# Check Lasso Coefficients
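
To show what adjusting `alpha` does to the Lasso coefficients, here is a sketch on synthetic data where only the first three of ten features matter. The data and the names `X_syn`, `y_syn`, and `zero_counts` are assumptions for this example. In scikit-learn, `alpha` plays the role of $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three features drive the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_syn = X_syn @ true_coef + rng.normal(scale=1.0, size=200)

X_syn_scaled = StandardScaler().fit_transform(X_syn)

zero_counts = {}
for alpha in [0.01, 0.5, 2.0]:
    lasso_demo = Lasso(alpha=alpha).fit(X_syn_scaled, y_syn)
    # Count coefficients that were set exactly to zero
    zero_counts[alpha] = int(np.sum(lasso_demo.coef_ == 0))
    print(f"alpha={alpha}: {zero_counts[alpha]} zeroed, "
          f"coefs={np.round(lasso_demo.coef_, 2)}")
```

A larger `alpha` means a stronger penalty, so more coefficients are pushed to exactly zero. That is the feature selection behavior Lasso is known for.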


### Ridge

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge() # Ridge uses an L2 penalty (sum of squared coefficient values)

# Fit

# Predict

# Evaluate


In [None]:
# Adjust HYPERPARAMETERS


In [None]:
# Check Ridge Coefficients
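
For comparison, this sketch fits Ridge at several `alpha` values on synthetic data (`X_syn` and `y_syn` are made up, and the features are drawn on a common scale so no extra scaling is applied). It tracks the overall size of the coefficient vector and how many coefficients are exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: only the first three features drive the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_syn = X_syn @ true_coef + rng.normal(scale=1.0, size=200)

ridge_norms = {}
ridge_zero_counts = {}
for alpha in [0.1, 10, 1000]:
    ridge_demo = Ridge(alpha=alpha).fit(X_syn, y_syn)
    ridge_norms[alpha] = np.linalg.norm(ridge_demo.coef_)
    ridge_zero_counts[alpha] = int(np.sum(ridge_demo.coef_ == 0))
    print(f"alpha={alpha}: coefficient norm={ridge_norms[alpha]:.3f}, "
          f"exact zeros={ridge_zero_counts[alpha]}")
```

The coefficients shrink as `alpha` grows but are not set exactly to zero, which is the main practical difference from Lasso.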


### Let's Discuss

- 


## Ridge & Lasso: Other benefits

### Ridge:
* We can "shrink down" the effects of predictor variables instead of deleting/zeroing them
* When features are highly collinear, Ridge tends to share the weight among the correlated features rather than assigning a large, unstable coefficient to any one of them
* Since it keeps every feature in the model, it can be computationally expensive when there are many variables

### Lasso:
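* When you have a lot of variables, it performs feature selection for you by setting some coefficients exactly to zero
* Multicollinearity is also dealt with: among a group of highly correlated features, Lasso tends to keep one and shrink the others toward zero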
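
The multicollinearity behavior can be seen in a small experiment with two nearly identical features. Everything here is synthetic and the names (`x1`, `x2`, `ridge_col`, `lasso_col`) are made up for this example. Which feature Lasso favors, and whether the other lands exactly on zero, depends on the particular sample.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.01, size=300)   # near-duplicate of x1
X_col = np.column_stack([x1, x2])
y_col = 3 * x1 + rng.normal(scale=0.5, size=300)

ols_col = LinearRegression().fit(X_col, y_col)
ridge_col = Ridge(alpha=1.0).fit(X_col, y_col)
lasso_col = Lasso(alpha=0.1).fit(X_col, y_col)

# OLS can split the total effect of about 3 in an unstable way;
# Ridge gives the two copies similar weights;
# Lasso often concentrates most of the weight on one of them.
print("OLS:  ", np.round(ols_col.coef_, 2))
print("Ridge:", np.round(ridge_col.coef_, 2))
print("Lasso:", np.round(lasso_col.coef_, 2))
```

In every case the two coefficients add up to roughly the true combined effect; the models differ in how they divide it between the correlated features.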


### Why Not Both?

Enter ElasticNet: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
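
A minimal sketch of ElasticNet on synthetic data follows (the data and variable names are assumptions for illustration). In scikit-learn, `alpha` sets the overall penalty strength and `l1_ratio` sets the mix: `l1_ratio=1` is the Lasso penalty, and values closer to 0 lean toward the Ridge penalty.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: only the first three features drive the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_syn = X_syn @ true_coef + rng.normal(scale=1.0, size=200)

# Half L1, half L2 penalty
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_syn, y_syn)
print("ElasticNet:", np.round(enet.coef_, 2))

# With l1_ratio=1.0 the penalty is pure L1, matching Lasso
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_syn, y_syn)
lasso_ref = Lasso(alpha=0.5).fit(X_syn, y_syn)
print("Same as Lasso:", np.allclose(enet_l1.coef_, lasso_ref.coef_))
```

Both `alpha` and `l1_ratio` are hyperparameters, so in practice they would be tuned with cross-validation rather than fixed as they are here.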