# POLI 175 - Lecture 11

## Regularization

## Adding new package

In [18]:
using Pkg
Pkg.add("MLJModels")

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.9/Project.toml`
  No Changes to `~/.julia/environments/v1.9/Manifest.toml`


## Class Examples

1. Education expenditure dataset

2. Pinochet voting dataset

Let us load them all:

In [19]:
## Packages Here
using DataFrames
using MLJ, MLJIteration
import MLJLinearModels, MLJBase, MLJModels
import MultivariateStats, MLJMultivariateStatsInterface
import CSV, Plots, GLM, StatsBase, Random
import LaTeXStrings, StatsPlots, Lowess, Gadfly, RegressionTables
import CovarianceMatrices, Econometrics, LinearAlgebra, MixedModelsExtras
import Missings, StatsAPI, FreqTables, EvalMetrics
import NearestNeighborModels

# Adapted from @xiaodaigh: https://github.com/xiaodaigh/DataConvenience.jl
function onehot!(df::AbstractDataFrame, 
        col, cate = sort(unique(df[!, col])); 
        outnames = Symbol.(col, :_, cate))
    transform!(df, @. col => ByRow(isequal(cate)) .=> outnames)
end

onehot! (generic function with 2 methods)

## Class Examples

In [20]:
## Loading the data
chile = CSV.read(
    download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/chilesurvey.csv"), 
    DataFrame,
    missingstring = ["NA"]
); dropmissing!(chile)
chile.voteyes = ifelse.(chile.vote .== "Y", 1, 0)

# One-hot encoding (we will learn a better way to do it later)
onehot!(chile, :region);
onehot!(chile, :education);
onehot!(chile, :sex);

# Drop reference categories
select!(chile, Not(:region, :sex, :education, :region_C, :education_P, :sex_M))

# Checking
first(chile, 3)

| Row | population | age | income | statusquo | vote | voteyes | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64 | String1 | Int64 | Bool | Bool | Bool | Bool | Bool | Bool | Bool |
| 1 | 175000 | 65 | 35000 | 1.0082 | Y | 1 | false | true | false | false | false | false | false |
| 2 | 175000 | 29 | 7500 | -1.29617 | N | 0 | false | true | false | false | true | false | false |
| 3 | 175000 | 38 | 15000 | 1.23072 | Y | 1 | false | true | false | false | false | false | true |


## Class Examples

In [21]:
## Education Expenditure Dataset
educ = CSV.read(download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/educexp.csv"), DataFrame)

# Processing
educ.educ_log = log.(educ.education);
educ.income_log = log.(educ.income)
educ.urban_log = log.(educ.urban)
educ.young_log = log.(educ.young)

# Drop the untransformed columns (only the logged versions are kept)
select!(educ, Not([:education, :income, :urban, :young]))

# Checking
first(educ, 3)

| Row | states | educ_log | income_log | urban_log | young_log |
|---|---|---|---|---|---|
| | String3 | Float64 | Float64 | Float64 | Float64 |
| 1 | ME | 5.24175 | 7.94591 | 6.23048 | 5.85993 |
| 2 | NH | 5.1299 | 8.08918 | 6.33505 | 5.84615 |
| 3 | VT | 5.43808 | 8.03008 | 5.77455 | 5.85364 |


## Class Examples

Now, let's go crazy: Add lots of polynomials to our model!

In [22]:
## Polynomial terms: squares, cubes, and fourth powers of each logged predictor
educ.income_log_square = educ.income_log .^ 2
educ.urban_log_square = educ.urban_log .^ 2
educ.young_log_square = educ.young_log .^ 2
educ.income_log_cube = educ.income_log .^ 3
educ.urban_log_cube = educ.urban_log .^ 3
educ.young_log_cube = educ.young_log .^ 3
educ.income_log_4th = educ.income_log .^ 4
educ.urban_log_4th = educ.urban_log .^ 4
educ.young_log_4th = educ.young_log .^ 4

educ_y, educ_X = unpack(
    educ[:, Not(:states)],
    ==(:educ_log);
    :educ_log => Continuous, 
    :income_log => Continuous,
    :urban_log => Continuous,
    :young_log => Continuous,
    :income_log_square => Continuous,
    :urban_log_square => Continuous,
    :young_log_square => Continuous,
    :income_log_cube => Continuous,
    :urban_log_cube => Continuous,
    :young_log_cube => Continuous,
    :income_log_4th => Continuous,
    :urban_log_4th => Continuous,
    :young_log_4th => Continuous,
);    

# Checking
first(educ_X, 3)

| Row | income_log | urban_log | young_log | income_log_square | urban_log_square | young_log_square | income_log_cube | urban_log_cube | young_log_cube | income_log_4th | urban_log_4th | young_log_4th |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 7.94591 | 6.23048 | 5.85993 | 63.1375 | 38.8189 | 34.3388 | 501.685 | 241.86 | 201.223 | 3986.34 | 1506.91 | 1179.15 |
| 2 | 8.08918 | 6.33505 | 5.84615 | 65.4348 | 40.1329 | 34.1775 | 529.313 | 254.244 | 199.807 | 4281.71 | 1610.65 | 1168.1 |
| 3 | 8.03008 | 5.77455 | 5.85364 | 64.4823 | 33.3454 | 34.2651 | 517.798 | 192.555 | 200.575 | 4157.96 | 1111.92 | 1174.1 |


# Linear Model Regularization

In this class, we are going to focus on two of the most used methods for model selection:
- **Ridge**
- **Lasso**

These methods differ from subset selection in one important way:
- Instead of fitting a model on a subset of the predictors, they fit a model with all $p$ predictors and "*shrink*" the coefficient estimates toward zero!

It may seem counter-intuitive, but shrinking the coefficients can substantially reduce the variance of the estimates, at the cost of some bias (recall the bias-variance trade-off).

# Ridge Regression

## Ridge Regression

Instead of minimizing the RSS, one minimizes:

$$ \text{Residual Sum of Squares} + \underbrace{\lambda \ \sum_{j=1}^p\beta_j^2}_{\text{shrinkage penalty}} $$

The parameter $\lambda \geq 0$ is called the *tuning parameter*.

Selecting a good $\lambda$ is crucial for a good set of estimates.
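
Written out in full, and assuming the usual setup where the intercept $\beta_0$ is left out of the penalty, the ridge estimates solve:

$$ \widehat{\beta}^{R}_{\lambda} \ = \ \underset{\beta}{\arg\min} \ \left\{ \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} $$

If the predictors and the outcome are centered (an assumption made here so the intercept drops out), this has the closed-form solution:

$$ \widehat{\beta}^{R}_{\lambda} \ = \ (X^\top X + \lambda I_p)^{-1} X^\top y $$

Note that software may scale the two terms differently (for example, dividing the RSS by 2 or multiplying the penalty by $n$), so the same numerical $\lambda$ can mean different amounts of shrinkage across packages.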

In [23]:
## 4-Fold CV
cv4 = CV(nfolds = 4, rng = 5624957)

CV(
  nfolds = 4, 
  shuffle = true, 
  rng = Random.MersenneTwister(5624957))

# Ridge Regression

## Ridge Regression in Julia

In [33]:
## Ridge Regression (instantiate)
ridge = MLJLinearModels.RidgeRegressor(lambda = 1.1)

RidgeRegressor(
  lambda = 1.1, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

# Ridge Regression

## Ridge Regression in Julia

In [34]:
evaluate(ridge, 
    educ_X, 
    educ_y, 
    resampling = cv4,
    measure = [rms, l2, rsq], 
    verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.143       │ 0.0225  │ [0.135, 0.155,  ⋯
│ LPLoss(                │ predict   │ 0.0203      │ 0.00619 │ [0.0181, 0.024, ⋯
│   p = 2)               │           │             │         │                 ⋯
│ RSquared()             │ predict   │ 0.364       │ 0.497   │ [0.619, -0.174, ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


# Ridge Regression

## Ridge Regression in Julia

In [35]:
## Parameters
machine1 = machine(ridge, educ_X, educ_y)
fit!(machine1)
fitted_params(machine1)

[ Info: Training machine(RidgeRegressor(lambda = 1.1, …), …).
┌ Info: Solver: MLJLinearModels.Analytical
│   iterative: Bool false
└   max_inner: Int64 200


(coefs = [:income_log => 0.0001938614453657547, :urban_log => -9.930350801419571e-5, :young_log => -2.8916895544529442e-5, :income_log_square => 0.0020754452837817607, :urban_log_square => -0.0008172682730275851, :young_log_square => -0.00021897172033659906, :income_log_cube => 0.01248357269781201, :urban_log_cube => -0.003781046851971933, :young_log_cube => -0.000856728047262753, :income_log_4th => -0.0005787286363908233, :urban_log_4th => 0.00019411956103879537, :young_log_4th => 0.0017618021164507865],
 intercept = -0.19011849054925398,)

In [27]:
machine2 = machine(MLJLinearModels.LinearRegressor(), educ_X, educ_y)
fit!(machine2)
fitted_params(machine2)

[ Info: Training machine(LinearRegressor(fit_intercept = true, …), …).
┌ Info: Solver: MLJLinearModels.Analytical
│   iterative: Bool false
└   max_inner: Int64 200


(coefs = [:income_log => 5491.685110506905, :urban_log => -5182.951483977237, :young_log => 203796.67288586378, :income_log_square => -1071.267550317659, :urban_log_square => 1241.461648527421, :young_log_square => -52466.09081907871, :income_log_cube => 92.83470022410806, :urban_log_cube => -132.02141910048752, :young_log_cube => 6000.291944567464, :income_log_4th => -3.014604876144366, :urban_log_4th => 5.2588947129884716, :young_log_4th => -257.21322365141253],
 intercept = -299159.5494253556,)

## Ridge Regression

$\lambda = 0$ is the same as the least squares regression.

As $\lambda$ grows, the shrinkage penalty increases, and the ridge coefficients approach zero.
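
Both statements can be checked numerically. The sketch below uses only Julia's standard library and simulated data; `ridge_coefs` is a hypothetical helper written for this illustration (it is not part of MLJ), and it assumes centered data so that the intercept can be left out.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: closed-form ridge on centered data,
# solving (X'X + λI) β = X'y.
ridge_coefs(X, y, λ) = (X' * X + λ * I) \ (X' * y)

# Simulated data (made up for this example)
rng = MersenneTwister(175)
X = randn(rng, 50, 3)
y = X * [1.0, -2.0, 0.5] + 0.1 * randn(rng, 50)

# Center predictors and outcome so no intercept is needed
Xc = X .- mean(X, dims = 1)
yc = y .- mean(y)

# λ = 0 should reproduce the least squares coefficients
β_ols  = Xc \ yc
β_zero = ridge_coefs(Xc, yc, 0.0)

# The length of the coefficient vector should fall as λ grows
λs = (0.0, 1.0, 10.0, 100.0, 1000.0)
norms = [norm(ridge_coefs(Xc, yc, λ)) for λ in λs]
```

Comparing `β_zero` with `β_ols`, and looking at `norms`, shows the two properties above on this simulated example.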

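**Caveat 1**: The scale of the variables influences the results.

In OLS (standard least squares regression), when you multiply $x_j$ by $c \neq 0$, the estimate $\widehat{\beta}_j$ is multiplied by $\dfrac{1}{c}$, so the product $x_j \widehat{\beta}_j$ and the predictions do not change. Ridge does not have this property: the penalty depends on the size of $\beta_j$, so rescaling a predictor changes how much its coefficient is shrunk.

Example: Measuring GDP in USD rather than in millions of USD changes the ridge regression coefficients (and predictions) significantly.

**Suggestion**: Standardize the variables before running the regression.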
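
A small simulated check of this caveat, as a sketch only: the data are made up, there is no intercept for simplicity, and `ridge_coefs` is a hypothetical closed-form helper rather than the MLJ model used in class.

```julia
using LinearAlgebra, Random

# Hypothetical helper: closed-form ridge without intercept
ridge_coefs(X, y, λ) = (X' * X + λ * I) \ (X' * y)

rng = MersenneTwister(11)
X = randn(rng, 40, 2)
y = X * [2.0, -1.0] + 0.1 * randn(rng, 40)

# Same data, but the first predictor is measured in different units
c = 1000.0
Xs = copy(X)
Xs[:, 1] .*= c

# Least squares: the coefficient is rescaled by 1/c, the predictions are not affected
b, bs = X \ y, Xs \ y
ols_same_fit = X * b ≈ Xs * bs

# Ridge with the same λ: the predictions change with the units
r, rs = ridge_coefs(X, y, 5.0), ridge_coefs(Xs, y, 5.0)
ridge_same_fit = X * r ≈ Xs * rs
```

Here `ols_same_fit` is `true` and `ridge_same_fit` is `false`: the units matter for ridge but not for least squares.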

## Ridge Regression

### Standardization

Take a variable $x_j$. Then, the standardized variable $z_j$ is obtained by:
1. Subtracting the mean of the variable $x_j$ ($\overline{x}_j$) and then,
2. Dividing the result by the standard deviation ($\sigma_{x_j}$) of the variable $x_j$.

$$ z_j \ = \ \dfrac{x_j - \overline{x}_j}{\sigma_{x_j}} $$

The resulting variable $z_j$ has mean zero, variance one, and has no unit!
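
As a quick sketch of the formula, using only the `Statistics` standard library: the values in `x` are made up, `standardize` is a hypothetical helper, and `std` here is the sample standard deviation (with $n - 1$ in the denominator, Julia's default).

```julia
using Statistics

# Hypothetical helper: subtract the mean, then divide by the standard deviation
standardize(x) = (x .- mean(x)) ./ std(x)

x = [7.9, 8.1, 8.0, 8.4, 7.6]   # made-up values
z = standardize(x)

# z should have mean (numerically) zero and standard deviation one
z_mean, z_std = mean(z), std(z)
```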

A one-unit change in $z_j$ corresponds to a change of one *standard deviation* in $x_j$: In a regression with a standardized variable, $\beta_j$ represents the change in the average $y$ when we increase $x_j$ by one standard deviation.

This is a great practice for prediction, but in general, be mindful of the units of your data.

**Again:** In a standard least squares regression, when you multiply $x_j$ by $c \neq 0$, $\widehat{\beta}_j$ is multiplied by $\dfrac{1}{c}$, so the fit does not depend on the units. This is not true for ridge.

## Ridge Regression

### Standardization

In [36]:
## Standardizing the X variables
scaler = MLJModels.Standardizer();
educ_X_std = MLJModels.transform(fit!(machine(scaler, educ_X)), educ_X);
# Original (unstandardized) data, for comparison
first(educ_X, 3)

[ Info: Training machine(Standardizer(features = Symbol[], …), …).


| Row | income_log | urban_log | young_log | income_log_square | urban_log_square | young_log_square | income_log_cube | urban_log_cube | young_log_cube | income_log_4th | urban_log_4th | young_log_4th |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 7.94591 | 6.23048 | 5.85993 | 63.1375 | 38.8189 | 34.3388 | 501.685 | 241.86 | 201.223 | 3986.34 | 1506.91 | 1179.15 |
| 2 | 8.08918 | 6.33505 | 5.84615 | 65.4348 | 40.1329 | 34.1775 | 529.313 | 254.244 | 199.807 | 4281.71 | 1610.65 | 1168.1 |
| 3 | 8.03008 | 5.77455 | 5.85364 | 64.4823 | 33.3454 | 34.2651 | 517.798 | 192.555 | 200.575 | 4157.96 | 1111.92 | 1174.1 |


In [37]:
first(educ_X_std, 3)

| Row | income_log | urban_log | young_log | income_log_square | urban_log_square | young_log_square | income_log_cube | urban_log_cube | young_log_cube | income_log_4th | urban_log_4th | young_log_4th |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | -0.66766 | -0.987876 | -0.324325 | -0.674958 | -0.999383 | -0.327454 | -0.681938 | -1.00939 | -0.330514 | -0.688598 | -1.01794 | -0.333504 |
| 2 | 0.144252 | -0.559031 | -0.537359 | 0.134041 | -0.578223 | -0.538393 | 0.123751 | -0.596509 | -0.539337 | 0.113395 | -0.613857 | -0.540191 |
| 3 | -0.190629 | -2.85761 | -0.421601 | -0.201391 | -2.7537 | -0.423835 | -0.212054 | -2.65328 | -0.425989 | -0.22261 | -2.55642 | -0.428063 |


## Ridge Regression

Now the same ridge model, fitted on the standardized predictors:

In [38]:
## Ridge Regression (standardized predictors)
evaluate(ridge, 
    educ_X_std, 
    educ_y, 
    resampling = cv4,
    measure = [rms, l2, rsq], 
    verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.152       │ 0.0162  │ [0.161, 0.141,  ⋯
│ LPLoss(                │ predict   │ 0.0231      │ 0.00489 │ [0.0259, 0.0199 ⋯
│   p = 2)               │           │             │         │                 ⋯
│ RSquared()             │ predict   │ 0.498       │ 0.163   │ [0.552, 0.286,  ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


## Ridge Regression

**Caveat 2**: The ridge regression produces a different set of parameter estimates for each $\lambda$.
- $\widehat{\beta}_{\lambda}^{R}$

Also note that the intercept ($\beta_0$) is not included in the shrinkage penalty.

- We do not want to shrink the mean of $y_i$ when $x_{ij} = 0$ for all $j$.

- And if we standardize the variables (so that each predictor has mean zero), then the intercept will be $\widehat{\beta}_0 = \overline{y}$.
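
To see why, here is a short sketch of the argument, assuming the intercept is not penalized and each standardized predictor $z_j$ has mean zero. Setting the derivative of the objective with respect to $\beta_0$ to zero gives:

$$ -2\sum_{i=1}^n \Big( y_i - \widehat{\beta}_0 - \sum_{j=1}^p \widehat{\beta}_j z_{ij} \Big) = 0 \quad \Longrightarrow \quad \widehat{\beta}_0 \ = \ \overline{y} - \sum_{j=1}^p \widehat{\beta}_j \overline{z}_j \ = \ \overline{y} $$

The last equality uses $\overline{z}_j = 0$ for every $j$.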

## Ridge Regression

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/ridge1.png?raw=true)

## Ridge Regression

Advantage: the *non-obvious* advantage is that as $\lambda$ increases, the flexibility of the model decreases.

This decreases variance (but increases bias).

This may be "optimized": You may find the best bias-variance trade-off by tuning $\lambda$.

## Ridge Regression

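Especially important when $p$ is close to $n$ (the number of predictors is close to the number of cases), where the least squares estimates have very high variance.
- This is called *high dimensional data*.

If $p > n$, least squares does not even have a unique solution, while ridge can still do very well by trading a small increase in bias for a large decrease in variance.

And it is computationally *way* cheaper than *best subset selection*, which requires searching through $2^p$ models. With ridge, you fit just one model for each $\lambda$:
- In practice, as many models as there are candidate $\lambda$s.
- There are algorithms that solve efficiently for all $\lambda$s, which means that it may be more efficient than best, forward, and backward stepwise selection.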
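
One way to get the whole set of solutions cheaply, shown here only as an illustration (it is not necessarily what MLJ or the book uses), is to compute a singular value decomposition of $X$ once and reuse it for every $\lambda$. The sketch assumes centered data with no intercept; the data are simulated and `ridge_svd` is a hypothetical helper.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(3)
X = randn(rng, 30, 4)
y = randn(rng, 30)

# One decomposition, X = U * Diagonal(S) * V', shared by all values of λ
F = svd(X)
Uty = F.U' * y

# Ridge solution for a given λ: V * diag(s / (s^2 + λ)) * U'y
ridge_svd(λ) = F.V * ((F.S ./ (F.S .^ 2 .+ λ)) .* Uty)

λs = [0.01, 0.1, 1.0, 10.0]
path = [ridge_svd(λ) for λ in λs]

# Direct solve for each λ, for comparison only
direct = [(X' * X + λ * I) \ (X' * y) for λ in λs]
```

Each entry of `path` matches the corresponding entry of `direct`, but the expensive step (the decomposition) is done only once.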

## Ridge Regression (the book calls the regularization parameter $\lambda$) 

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/ridge2.png?raw=true)


# Cross-Validation

## Cross-Validation

To select the tuning parameter, you can use cross-validation.

The idea is to search through a grid of candidate values for the tuning parameter, selecting the one with the lowest cross-validation error.

It is indeed a very straightforward idea, if you think about it.
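
The grid search can be sketched by hand as follows. This is a minimal illustration on simulated data with the standard library only; `ridge_coefs` and `cv_mse` are hypothetical helpers, and the grid of $\lambda$ values is an arbitrary choice. In practice, MLJ's tuning tools would do this for you.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: closed-form ridge on centered data
ridge_coefs(X, y, λ) = (X' * X + λ * I) \ (X' * y)

# Mean squared prediction error of ridge at a given λ, averaged over the folds
function cv_mse(X, y, λ, folds)
    errs = Float64[]
    for test in folds
        train = setdiff(1:length(y), test)
        # Center with training-fold means only, so the test fold stays unseen
        μx, μy = mean(X[train, :], dims = 1), mean(y[train])
        β = ridge_coefs(X[train, :] .- μx, y[train] .- μy, λ)
        pred = (X[test, :] .- μx) * β .+ μy
        push!(errs, mean((y[test] .- pred) .^ 2))
    end
    return mean(errs)
end

# Simulated data: only the first two of eight predictors matter
rng = MersenneTwister(5624957)
n, p = 60, 8
X = randn(rng, n, p)
y = X[:, 1] .- 0.5 .* X[:, 2] .+ randn(rng, n)

# Four folds from a shuffled index
idx = shuffle(rng, 1:n)
folds = [idx[i:4:n] for i in 1:4]

# Candidate λ values on a log scale, and the CV error for each
grid = 10.0 .^ (-2:0.5:3)
scores = [cv_mse(X, y, λ, folds) for λ in grid]
best_λ = grid[argmin(scores)]
```

`best_λ` is the candidate with the lowest average out-of-fold error. Note that the centering is redone inside each fold, so nothing from the held-out fold leaks into the fit.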

## Cross-Validation

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/cvridge.png?raw=true)

# Lasso Regression

## Lasso Regression

Ridge regression has one disadvantage: it always includes all $p$ predictors in the final model.

The shrinkage pushes coefficients toward zero but never sets them to be exactly zero (that is, no predictor is ever removed from the model).

This can make subset selection preferable to ridge when we want a model that is easy to interpret.

## Lasso Regression

But one alternative, using the same principles as the ridge regression, is the **Lasso** Regression.

In the lasso regression, the objective function becomes:

$$ \text{Residual Sum of Squares} + \lambda \sum_{j=1}^p|\beta_j| $$

As in ridge, the larger the $\lambda$, the greater the *shrinkage*.

## Lasso Regression

Unlike ridge, for some values of $\lambda$, **Lasso** actually forces some coefficients to be exactly equal to zero.

Thus, **Lasso** performs variable selection, much like the subset selection models we have seen.

**Side-effect**: Makes models easier to interpret!

- Yields *sparse* models: models that only involve a subset of the variables.
    
Like ridge, selecting a good $\lambda$ is critical.
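
The difference between the two kinds of shrinkage is easiest to see in a textbook special case: no intercept and $X$ equal to the identity matrix, so that each least squares estimate is just $y_j$. Under those assumptions (and with the objectives written as on these slides), ridge rescales every estimate, while lasso applies *soft-thresholding*. The sketch below uses made-up numbers, and `ridge_shrink` and `lasso_shrink` are hypothetical helper names.

```julia
# Special case only: X is the identity matrix and there is no intercept.

# Ridge: proportional shrinkage, never exactly zero for a nonzero estimate
ridge_shrink(b, λ) = b / (1 + λ)

# Lasso: soft-thresholding, exactly zero whenever |b| ≤ λ / 2
lasso_shrink(b, λ) = sign(b) * max(abs(b) - λ / 2, 0)

b = [3.0, -1.2, 0.4, -0.1]   # made-up least squares estimates
λ = 1.0

ridge_b = ridge_shrink.(b, λ)
lasso_b = lasso_shrink.(b, λ)

# How many coefficients each method sets exactly to zero
n_zero_ridge = count(iszero, ridge_b)
n_zero_lasso = count(iszero, lasso_b)
```

With these numbers, ridge keeps all four coefficients (each one halved), while lasso zeroes out the two whose absolute value is below $\lambda / 2 = 0.5$.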

# Lasso Regression

## Lasso Regression in Julia

In [47]:
## Lasso Regression (instantiate)
lasso = MLJLinearModels.LassoRegressor(lambda = 1.1)

LassoRegressor(
  lambda = 1.1, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

# Lasso Regression

## Lasso Regression in Julia

In [43]:
evaluate(lasso, 
    educ_X, 
    educ_y, 
    resampling = cv4,
    measure = [rms, l2, rsq], 
    verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.414       │ 0.055   │ [0.452, 0.368,  ⋯
│ LPLoss(                │ predict   │ 0.171       │ 0.0452  │ [0.205, 0.135,  ⋯
│   p = 2)               │           │             │         │                 ⋯
│ RSquared()             │ predict   │ -2.96       │ 1.66    │ [-4.48, -1.06,  ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


# Lasso Regression

## Lasso Regression in Julia

In [44]:
## Lasso Parameters
## Note: the training log below reports lambda = 1000.0, so this output
## comes from a run with a much larger penalty than the lambda = 1.1 set above
machine2 = machine(lasso, educ_X, educ_y)
fit!(machine2)
fitted_params(machine2)

[ Info: Training machine(LassoRegressor(lambda = 1000.0, …), …).
┌ Info: Solver: MLJLinearModels.ProxGrad
│   accel: Bool true
│   max_iter: Int64 1000
│   tol: Float64 0.0001
│   max_inner: Int64 100
│   beta: Float64 0.8
└   gram: Bool false


(coefs = [:income_log => 0.0, :urban_log => 0.0, :young_log => 0.0, :income_log_square => 0.0, :urban_log_square => 0.0, :young_log_square => 0.0, :income_log_cube => 0.0, :urban_log_cube => 0.0, :young_log_cube => 0.0, :income_log_4th => 0.0011781991399936443, :urban_log_4th => 0.0, :young_log_4th => 0.0],
 intercept = 9.124195115236468e-7,)

In [16]:
## Ridge Parameters
fitted_params(machine1)

(coefs = [:income_log => 0.0001938614453657547, :urban_log => -9.930350801419571e-5, :young_log => -2.8916895544529442e-5, :income_log_square => 0.0020754452837817607, :urban_log_square => -0.0008172682730275851, :young_log_square => -0.00021897172033659906, :income_log_cube => 0.01248357269781201, :urban_log_cube => -0.003781046851971933, :young_log_cube => -0.000856728047262753, :income_log_4th => -0.0005787286363908233, :urban_log_4th => 0.00019411956103879537, :young_log_4th => 0.0017618021164507865],
 intercept = -0.19011849054925398,)

# Lasso Regression

## Lasso Regression in Julia

In [54]:
## Your turn: Run the Lasso Regression with the Standardized data
lasso = MLJLinearModels.LassoRegressor(lambda = 0.01)
machine3 = machine(lasso, educ_X_std, educ_y)
fit!(machine3)
fitted_params(machine3)

[ Info: Training machine(LassoRegressor(lambda = 0.01, …), …).
┌ Info: Solver: MLJLinearModels.ProxGrad
│   accel: Bool true
│   max_iter: Int64 1000
│   tol: Float64 0.0001
│   max_inner: Int64 100
│   beta: Float64 0.8
└   gram: Bool false


(coefs = [:income_log => 0.1939668465318225, :urban_log => -0.03906820937405762, :young_log => 0.0, :income_log_square => 0.0, :urban_log_square => -0.0006708201024971472, :young_log_square => 0.0, :income_log_cube => 0.0, :urban_log_cube => -0.0003040293652511153, :young_log_cube => 0.00035077161164430697, :income_log_4th => 0.0, :urban_log_4th => -7.944699145649931e-5, :young_log_4th => 0.0766495988405259],
 intercept = 5.253605704566857,)

## Lasso

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/lasso1.png?raw=true)

## Lasso x Ridge Regression

Selection property of lasso: 

- Lasso and ridge are each equivalent to minimizing the RSS subject to a constraint on the region where the parameters are allowed to be.

- But the "diamond shape" of the lasso constraint region has corners on the axes, which makes it set some coefficients exactly to zero.
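
In symbols, and following the standard textbook formulation (the budget $s$ is notation introduced here, not used elsewhere in these slides), for every $\lambda$ there is some $s$ such that lasso and ridge solve, respectively:

$$ \min_{\beta} \ \text{RSS} \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \leq s $$

$$ \min_{\beta} \ \text{RSS} \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \leq s $$

With $p = 2$, the first constraint region is a diamond and the second is a disk. The diamond has corners on the axes, which is where one of the coefficients is exactly zero.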

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/lassovsridge3.png?raw=true)

## Lasso x Ridge Regression (the book calls the regularization parameter $\lambda$) 

Lasso performs similarly to ridge in most cases. In these cases, I'd say that lasso is better:

- It reduces the complexity of the model.

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/lassovsridge1.png?raw=true)

## Lasso x Ridge Regression

But when all coefficients are truly different from zero, ridge tends to do better.

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/lassovsridge2.png?raw=true)

## Lasso x Ridge Regression

To summarize, neither is better in all situations.

- But you should know by now that this is true for most ML algorithms.

You may need to try both and compare which model does better.

Moreover, finding $\lambda$ is also a big deal. Cross-validation can help us with that!

# Questions?

# See you next class
