## Lasso Regression
Like Ridge regression, Lasso is a regularized form of linear regression, but it uses `L1 Regularization`. Lasso encourages simple, sparse models (i.e. models with fewer non-zero parameters). It is well suited to data with high levels of multicollinearity, or to cases where you want to automate parts of model selection, such as variable selection/parameter elimination. The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.

### L1 Regularization
Lasso regression performs L1 regularization, which adds a penalty equal to the sum of the absolute values of the coefficients. This type of regularization can produce sparse models with few coefficients: some coefficients become exactly zero and are effectively eliminated from the model. Larger penalties push coefficient values closer to zero, which produces simpler models. L2 regularization (e.g. Ridge regression), on the other hand, shrinks coefficients but does not set them exactly to zero, so it does not produce sparse models. This often makes a Lasso model easier to interpret than a Ridge model.
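
To see why an L1 penalty can produce exact zeros while an L2 penalty cannot, here is a minimal sketch. It assumes the simplified case of an orthonormal design, where the Lasso solution is the soft-thresholded least-squares coefficient and the Ridge solution is a rescaled one (the exact constants depend on how the objective is normalized). The coefficient values and the helper names `soft_threshold` and `ridge_shrink` are made up for illustration.

```python
import numpy as np

def soft_threshold(w, lam):
    """L1-style shrinkage: move each value toward zero by lam, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ridge_shrink(w, lam):
    """L2-style shrinkage (orthonormal case): rescale every value by 1 / (1 + lam)."""
    return w / (1.0 + lam)

# Made-up least-squares coefficients, two large and two small.
ols_coefs = np.array([3.0, -0.4, 0.2, -1.5])
lam = 0.5

lasso_coefs = soft_threshold(ols_coefs, lam)
ridge_coefs = ridge_shrink(ols_coefs, lam)

print("lasso-style:", lasso_coefs)
print("ridge-style:", ridge_coefs)
print("exact zeros (lasso, ridge):",
      int(np.sum(lasso_coefs == 0)), int(np.sum(ridge_coefs == 0)))
```

The two small coefficients become exactly zero under soft-thresholding, while the ridge-style rescaling leaves all four non-zero.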

### Lasso (L1 Regularization) modifies the loss function as 
> Sometimes `alpha (α)` is written as `lambda (λ)`; the two denote the same parameter. <br>
<img src="https://user.oc-static.com/upload/2019/10/07/15704536640472_lasso.png" width="500">
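
Because the formula is embedded as an image, a text version of the usual Lasso objective may be useful. The notation here (n samples, p features, coefficients β) is an assumption and may differ slightly from the image:

$$
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

The first term is the ordinary least-squares loss and the second is the L1 penalty. Note that scikit-learn's `Lasso` scales the squared-error term by 1/(2n), so its `alpha` is not numerically identical to the λ of this unscaled form.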

Here, the tuning parameter λ controls the strength of the L1 penalty, i.e. the amount of shrinkage:
* When λ = 0, no parameters are eliminated. The estimate is equal to the one found with linear regression.
* As λ increases, more and more coefficients are set to zero and eliminated (theoretically, when λ = ∞, all coefficients are eliminated).
* As λ increases, bias increases.
* As λ decreases, variance increases.
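
The effect of λ described in the list above can be checked with a small experiment. This is a sketch on synthetic data from `make_regression`, not on the house price dataset; the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data: 20 features, only 5 of which drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

alphas = [0.01, 1.0, 10.0, 1e6]  # arbitrary values, from very weak to very strong
zero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))  # coefficients eliminated at this alpha
    zero_counts.append(n_zero)
    print(f"alpha={alpha:g}: {n_zero} of {X.shape[1]} coefficients are zero")
```

With a very large alpha every coefficient is driven to zero, which mirrors the λ = ∞ case.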


### Terminologies
* `Multi-collinearity` refers to a situation in which two or more features of a dataset are highly correlated with each other.
* In machine learning, `Model complexity` often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on. It can also refer to the algorithmic learning complexity or computational complexity. As you increase the complexity of a model, it becomes more likely to overfit, meaning it adapts to the training data very well but fails to capture the general relationships in the data. In that case, performance on a test set will be poor.
> Underfitting vs Good fitting vs Overfitting <br>
<img src="https://user.oc-static.com/upload/2019/09/12/15682951882177_biva-11.png" width="600">
* A `Sparse model` is a model in which many parameters (feature coefficients) are exactly zero. L1 regularization tends to produce such models; L2 regularization generally does not.
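
As a quick illustration of multicollinearity, the sketch below builds three made-up features, one of which is almost a rescaled copy of another, and inspects their correlation matrix. The data and variable names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)  # almost a rescaled copy of x1
x3 = rng.normal(size=200)                        # independent of x1 and x2

# Pairwise Pearson correlations between the three feature columns.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))
```

An off-diagonal entry close to 1 or -1 (here, between `x1` and `x2`) signals a highly correlated pair of features.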

## Building Lasso Regression

In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

In [2]:
# from sklearn.datasets import load_boston
# boston = load_boston()
# df = pd.DataFrame(boston.data, columns=boston.feature_names)
# df['target'] = boston.target
df = pd.read_csv('house_prices.csv')
df.head()

Unnamed: 0,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,...,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,Sale Condition_Abnorml,Sale Condition_AdjLand,Sale Condition_Alloca,Sale Condition_Family,Sale Condition_Normal,Sale Condition_Partial,SalesPrice
0,20,141.0,31770,6,5,1960,1960,112.0,639.0,0.0,...,0,0,1,0,0,0,0,1,0,215000
1,20,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,...,0,0,1,0,0,0,0,1,0,105000
2,20,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,...,0,0,1,0,0,0,0,1,0,172000
3,20,93.0,11160,7,5,1968,1968,0.0,1065.0,0.0,...,0,0,1,0,0,0,0,1,0,244000
4,60,74.0,13830,5,5,1997,1998,0.0,791.0,0.0,...,0,0,1,0,0,0,0,1,0,189900


In [3]:
features = df.iloc[:, :-1]
label = df.iloc[:, -1]
print(features.shape, label.shape)

(2930, 304) (2930,)


In [4]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
features_scaled = MinMaxScaler().fit_transform(features)

In [5]:
x_train, x_test, y_train, y_test = train_test_split(features_scaled, label, test_size=0.2, random_state=31)
print('x_train',x_train.shape, 'x_test',x_test.shape, 'y_train',y_train.shape, 'y_test', y_test.shape)

x_train (2344, 304) x_test (586, 304) y_train (2344,) y_test (586,)
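
Note that the cells above fit `MinMaxScaler` on the full feature matrix before splitting, so the scaler also sees the test rows. One possible alternative, sketched here on synthetic data with hypothetical variable names, is to split first and wrap the scaler and the model in a pipeline so that scaling is learned from the training rows only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the house price features and label.
X_demo, y_demo = make_regression(n_samples=300, n_features=30, n_informative=8,
                                 noise=5.0, random_state=31)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=31)

# The scaler is fitted inside the pipeline, on the training rows only.
pipe = make_pipeline(MinMaxScaler(), Lasso(alpha=1.0, max_iter=10_000))
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # test rows are transformed, never used for fitting
print("test R^2:", test_r2)
```

The same pipeline object can be passed to `cross_val_score`, which then refits the scaler within each fold.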


## Linear Regression

In [6]:
from sklearn.linear_model import LinearRegression

In [7]:
lmodel = LinearRegression().fit(x_train, y_train)
y_pred = lmodel.predict(x_test)

# R² score of the model on the test set
print(metrics.r2_score(y_test, y_pred))

-3.1351239397758332e+16


In [8]:
print(metrics.r2_score(y_test, y_pred))
print(lmodel.score(x_train, y_train))
print(lmodel.score(x_test, y_test))

-3.1351239397758332e+16
0.9347017960202054
-3.1351239397758332e+16


From the above metrics we can clearly see that our model is overfitted: it scores about 0.93 R² on the training set but a hugely negative R² on the test set.
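
A negative R² may look surprising, so here is a small worked sketch of the definition R² = 1 - SS_res / SS_tot. The helper `r2_by_hand` and the toy numbers are made up for illustration.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Predictions that run in the opposite direction of the targets.
r2_bad = r2_by_hand([1, 2, 3, 4], [4, 3, 2, 1])
# Always predicting the mean of the targets.
r2_mean = r2_by_hand([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])

print(r2_bad)   # 1 - 20/5 = -3.0
print(r2_mean)  # 1 - 5/5  =  0.0
```

R² falls below zero whenever the predictions are worse than always predicting the mean, and it has no lower bound, which is how a score as extreme as the one above can occur.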

### Linear regression with cross validation
A single train/test split can be misleading, so to get a more reliable estimate of how Linear Regression generalizes we can use cross validation, which trains and evaluates the model on several different folds of the training data.

In [9]:
from sklearn.model_selection import cross_val_score
cross_val_score(LinearRegression(), x_train, y_train, cv=5, scoring='r2')

array([-1.51396960e+20, -4.42046903e+19, -1.31796110e+17, -2.62812729e+20,
       -1.99474402e+19])

All five fold scores are hugely negative, so plain Linear Regression does not generalize on this dataset, whichever split is used.
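
The `KFold` class imported at the top of the notebook can be passed to `cross_val_score` to control shuffling and reproducibility. The sketch below uses synthetic data, so the scores bear no relation to the ones above; the `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, well-conditioned stand-in data.
X_cv, y_cv = make_regression(n_samples=200, n_features=10,
                             noise=5.0, random_state=0)

# Shuffle rows before splitting into 5 folds; fix the seed for reproducibility.
kfold = KFold(n_splits=5, shuffle=True, random_state=17)
scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=kfold, scoring="r2")

print("fold scores:", np.round(scores, 3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a more compact summary than the raw array.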

## Lasso Regression

In [10]:
from sklearn.linear_model import Lasso

In [11]:
lsmodel = Lasso(alpha=10).fit(x_train, y_train)
y_pred = lsmodel.predict(x_test)

# R² score of the model on the test set
print(metrics.r2_score(y_test, y_pred))

0.9212986569497557


In [12]:
print(lsmodel.score(x_train, y_train))
print(lsmodel.score(x_test, y_test))

0.9328604709671917
0.9212986569497557


The metrics above show how Lasso regression improved the model: the test R² (about 0.92) is now close to the training R² (about 0.93).

### Lasso Regression with cross validation
To find the best value of alpha (λ) we can use the LassoCV estimator, which uses cross validation to select λ.

In [13]:
from sklearn.linear_model import LassoCV
# from sklearn.model_selection import KFold
# kfold = KFold(n_splits=5, shuffle=True, random_state=17)
models = LassoCV()
models.fit(x_train, y_train)
y_pred = models.predict(x_test)
print('Score', metrics.r2_score(y_test, y_pred))
print('Values of Alpha(aka λ)', models.alpha_)

Score 0.917601834831105
Values of Alpha(aka λ) 93.56315655078515
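
`LassoCV` also accepts an explicit grid of candidate alphas and a custom splitter, along the lines of the commented-out `KFold` code above. The following is a sketch on synthetic data; the grid bounds and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Synthetic stand-in data: 20 features, 5 informative.
X_l, y_l = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)

alphas = np.logspace(-2, 2, 20)  # candidate alphas from 0.01 to 100
kfold = KFold(n_splits=5, shuffle=True, random_state=17)

cv_model = LassoCV(alphas=alphas, cv=kfold, max_iter=10_000).fit(X_l, y_l)

# mse_path_ has one row per candidate alpha and one column per fold.
mean_mse = cv_model.mse_path_.mean(axis=1)
print("chosen alpha:", cv_model.alpha_)
print("lowest mean CV MSE:", mean_mse.min())
```

The fitted `alpha_` is the candidate with the lowest mean cross-validated error, and the model is then refitted on all the data passed to `fit` using that alpha.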


In [14]:
# Count coefficients that are exactly zero in each model
print(np.sum(lmodel.coef_ == 0))
print(np.sum(lsmodel.coef_ == 0))

0
91


So in total, Lasso regression set the coefficients of 91 columns to zero, effectively eliminating features that it judges do not contribute to predicting the price of a house.
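
To see which columns survive rather than just how many, the coefficients can be paired with the column names. This is a sketch on a synthetic DataFrame with hypothetical column names (`feature_0`, `feature_1`, ...); in the notebook, the same idea would use the fitted model's `coef_` together with `features.columns`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in: 8 features, only 3 of which influence the target.
X_s, y_s = make_regression(n_samples=200, n_features=8, n_informative=3,
                           noise=5.0, random_state=0)
demo = pd.DataFrame(X_s, columns=[f"feature_{i}" for i in range(X_s.shape[1])])

model = Lasso(alpha=1.0).fit(demo, y_s)

# Label each coefficient with its column name.
coefs = pd.Series(model.coef_, index=demo.columns)
kept = coefs[coefs != 0]                   # features the model still uses
dropped = coefs[coefs == 0].index.tolist() # features eliminated by the L1 penalty

print(kept.sort_values(key=abs, ascending=False))
print("dropped:", dropped)
```

Sorting the kept coefficients by absolute value gives a rough ranking of influence, which is only meaningful when the features are on comparable scales.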