# Summary
Regression is one of the most widely used techniques in predictive analytics, but it comes with a few disadvantages. In this blog, I give an overview of regularization, particularly Ridge and Lasso, as a way to control model complexity and improve results.

# What is regression?

Regression is one of the most commonly used machine learning algorithms for supervised learning. Regression models predict a value based on the relationship between the dependent variable (target) and the independent variables (features). 
Statistically, the linear model, which is usually fitted by ordinary least squares (OLS), is given as

\begin{equation}
y_{i}=\beta_{1} x_{i 1}+\beta_{2} x_{i 2}+\cdots+\beta_{p} x_{i p}+\varepsilon_{i}
\end{equation}

Fitting the model is an optimization problem: we choose the coefficients that minimize the objective function, the residual sum of squares (RSS), written here for a single feature with intercept $\alpha$:

\begin{equation}
R S S=\sum_{i=1}^{n}\left(\varepsilon_{i}\right)^{2}=\sum_{i=1}^{n}\left(y_{i}-\left(\alpha+\beta x_{i}\right)\right)^{2}
\end{equation}
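As a minimal sketch of what minimizing the RSS means in practice, the snippet below fits the single-feature model $y = \alpha + \beta x$ with the closed-form least-squares solution. The data values and the function names `fit_ols` and `rss` are made up for illustration.

```python
def fit_ols(x, y):
    """Closed-form least-squares fit of y = alpha + beta * x (one feature)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered cross-product and sum of squares
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_mean - beta * x_mean
    return alpha, beta


def rss(x, y, alpha, beta):
    """Residual sum of squares for a given intercept and slope."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = fit_ols(x, y)
print("alpha:", alpha, "beta:", beta, "RSS:", rss(x, y, alpha, beta))
```

Any other choice of intercept and slope gives a larger RSS on this data, which is what makes the OLS estimate the minimizer.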

# Need of regularization

The objective function above performs well when the assumptions of OLS are satisfied. In brief, the OLS assumptions are:

1.	No multicollinearity
2.	Linear relationship between features and target
3.	Homoskedasticity, i.e., the variance of the errors is constant across X
4.	Normal distribution of errors

But as the number of data points and features grows, real-world data usually does not satisfy the OLS assumptions.

Additionally, as the number of features increases, the model becomes more complex in its attempt to remain an unbiased estimator. This low bias comes at the cost of high variance, which causes overfitting. To deal with such issues, we use regularization.

# What is regularization?

Regularization is a technique to reduce the variance and the model complexity of regression models, at the cost of a small increase in bias. The basic idea is to add a constraint, in the form of a penalty term on the parameter estimates (feature coefficients), to our original objective function. 

# Mechanics of regularization

Earlier, our objective function was

\begin{equation}
\min\sum_{i=1}^{n}\left(y_{i}-\left(\alpha+\beta x_{i}\right)\right)^{2}
\end{equation}

and our goal was to minimize it. In regularization, we still want to minimize the objective function, but it now includes a penalty term scaled by a regularization parameter $\lambda \geq 0$. The penalty is applied to the parameter estimates, pushing them toward smaller values and thus decreasing the model complexity. The larger $\lambda$ is, the stronger the shrinkage; $\lambda = 0$ recovers OLS. The main difference between Ridge and Lasso regression is the choice of penalty term.


# Ridge Regression

In ridge regression, the penalty term is the sum of the squared coefficients (the squared L2 norm), which is why it is also called L2 regularization. With two coefficients, the constraint region is a circle.

<img src='images\ridge.png' style="width:250px;height:250px">


\begin{equation}
\min\sum_{i=1}^{n}\left(y_{i}-\left(\alpha+\beta x_{i}\right)\right)^{2}+
\lambda\|\beta\|_{2}^{2}
\end{equation}

One limitation of ridge regression is that it cannot force coefficients to be exactly zero, so every feature is still used in the model. Ridge therefore only shrinks the coefficients; it does not remove any of them.
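To make the shrinkage concrete, here is a small sketch for the single-feature objective above. Assuming the intercept $\alpha$ is left unpenalized, setting the derivative to zero gives $\beta = S_{xy} / (S_{xx} + \lambda)$, where $S_{xy}$ and $S_{xx}$ are the centered cross-product and sum of squares. The data and the name `fit_ridge` are illustrative.

```python
def fit_ridge(x, y, lam):
    """Single-feature ridge fit: minimizes RSS + lam * beta**2.

    Assumes the intercept alpha is not penalized.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta = s_xy / (s_xx + lam)  # lam in the denominator shrinks beta
    alpha = y_mean - beta * x_mean
    return alpha, beta


# Toy data (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

lambdas = [0.0, 1.0, 10.0, 100.0]
ridge_betas = [fit_ridge(x, y, lam)[1] for lam in lambdas]
for lam, b in zip(lambdas, ridge_betas):
    print("lambda =", lam, "beta =", b)
```

The coefficient gets smaller as $\lambda$ grows, but because $\lambda$ only enlarges the denominator, it never reaches exactly zero for any finite $\lambda$.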

# Lasso Regression
### Least Absolute Shrinkage and Selection Operator
In Lasso regression, by contrast, the penalty term is the sum of the absolute values of the coefficients (the L1 norm), resulting in a constraint region in the shape of a diamond. This diamond serves as a constraint on the original objective and shrinks the coefficients toward zero. Thus the “Absolute Shrinkage” in Lasso. 

<img src='images\lasso.png' style="width:250px;height:250px">
<br>
\begin{equation}
\min\sum_{i=1}^{n}\left(y_{i}-\left(\alpha+\beta x_{i}\right)\right)^{2}+\lambda\|\beta\|_{1}
\end{equation}

Since the corners of the diamond lie on the axes, the contours of the RSS can meet the constraint region at a corner, where one of the coefficients is exactly zero. When this happens, that feature is effectively removed from the model. Thus the term “Selection” in LASSO.
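The same effect can be seen numerically in the single-feature case. Assuming an unpenalized intercept, the lasso objective above is minimized by soft-thresholding: $\beta = \operatorname{sign}(S_{xy}) \max(|S_{xy}| - \lambda/2,\, 0) / S_{xx}$. The sketch below uses made-up data, and the names `soft_threshold` and `fit_lasso` are illustrative.

```python
def soft_threshold(z, t):
    """Move z toward zero by t; return exactly 0.0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_lasso(x, y, lam):
    """Single-feature lasso fit: minimizes RSS + lam * |beta|.

    Assumes the intercept alpha is not penalized.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta = soft_threshold(s_xy, lam / 2.0) / s_xx
    alpha = y_mean - beta * x_mean
    return alpha, beta


# Toy data (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in [0.0, 10.0, 40.0]:
    print("lambda =", lam, "beta =", fit_lasso(x, y, lam)[1])
```

Once $\lambda/2$ exceeds $|S_{xy}|$ (about 19.9 for this data, so $\lambda = 40$ is enough), the coefficient is exactly zero and the feature drops out of the model.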

# Selection of regularization parameter
The regularization parameter $\lambda$ can be chosen by cross-validation: fit the model for a range of candidate values and select the $\lambda$ that minimizes the cross-validated prediction error.
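A rough sketch of this procedure, using k-fold cross-validation with a single-feature ridge fit, might look as follows. The data, the candidate grid, the fold assignment, and the names `fit_ridge` and `cv_error` are all assumptions made for illustration.

```python
def fit_ridge(x, y, lam):
    """Single-feature ridge fit with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta = s_xy / (s_xx + lam)
    return y_mean - beta * x_mean, beta


def cv_error(x, y, lam, k=5):
    """Mean squared prediction error over k held-out folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        # Every k-th point, starting at `fold`, is held out for testing
        test_idx = set(range(fold, n, k))
        x_train = [x[i] for i in range(n) if i not in test_idx]
        y_train = [y[i] for i in range(n) if i not in test_idx]
        alpha, beta = fit_ridge(x_train, y_train, lam)
        total += sum((y[i] - (alpha + beta * x[i])) ** 2 for i in test_idx)
    return total / n


# Toy data (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1, 9.2, 9.7]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(x, y, lam) for lam in candidates}
best_lam = min(errors, key=errors.get)
print("cross-validated errors:", errors)
print("selected lambda:", best_lam)
```

In practice the candidate grid is usually spaced on a log scale, and the error is measured only on data the model was not fitted on, as done here.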

# Advantages

1.	Helps reduce overfitting
2.	Lasso allows shrinkage as well as variable selection
3.	Variable selection in Lasso provides sparse solutions

# Disadvantages

1.	Ridge cannot be used for variable selection
2.	They increase the bias
3.	They are sensitive to feature scale, so features should be standardized before fitting
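The scale sensitivity in the last point can be illustrated with a short sketch. It fits a single-feature ridge model to the same hypothetical measurements expressed in two units (meters and centimeters, chosen only for illustration) and then repeats the fit after standardizing the feature. The names `ridge_slope` and `standardize` are illustrative.

```python
def ridge_slope(x, y, lam):
    """Slope of a single-feature ridge fit with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    return s_xy / (s_xx + lam)


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


# Toy data (illustrative values only): the same feature in two units
x_m = [1.0, 2.0, 3.0, 4.0, 5.0]
x_cm = [100.0 * v for v in x_m]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
lam = 10.0

# Raw features: express both slopes per meter so they are comparable
slope_m = ridge_slope(x_m, y, lam)
slope_cm_as_m = 100.0 * ridge_slope(x_cm, y, lam)

# Standardized features: the unit no longer matters
slope_std_m = ridge_slope(standardize(x_m), y, lam)
slope_std_cm = ridge_slope(standardize(x_cm), y, lam)

print("raw:", slope_m, slope_cm_as_m)
print("standardized:", slope_std_m, slope_std_cm)
```

With raw features, the same $\lambda$ penalizes the meter-scale coefficient much more heavily than the centimeter-scale one, so the two fits disagree. After standardization both fits give the same slope.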
