## Ordinary Least Squares

In [1]:
from sklearn.linear_model import LinearRegression

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = LinearRegression()
reg.fit(X, y)
print(reg.coef_)
print(reg.intercept_)

[-3.  4.]
-3.000000000000005


- Mind the problem of **multicollinearity**
    - it means the features are correlated with each other, so the design matrix **X** becomes close to singular
    - the least-squares estimate then becomes **highly sensitive to random errors** in the observed target, which produces a large variance
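
The effect of multicollinearity described above can be illustrated with a small experiment. This is only a sketch on synthetic data: the sample size, the noise levels, and the helper `coef_spread` are assumptions made for illustration and are not part of the original notes. It compares the condition number of the design matrix and the spread of one fitted coefficient for independent versus nearly duplicated features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples = 50

# Two designs sharing the same first feature: one with an independent second
# feature, one whose second feature is almost an exact copy of the first.
x1 = rng.normal(size=n_samples)
X_indep = np.column_stack([x1, rng.normal(size=n_samples)])
X_collinear = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=n_samples)])

# A large condition number signals a design matrix that is close to singular.
cond_indep = np.linalg.cond(X_indep)
cond_collinear = np.linalg.cond(X_collinear)


def coef_spread(X, n_repeats=20, noise=0.1):
    """Std of the first OLS coefficient across refits with fresh noise in y."""
    coefs = []
    for _ in range(n_repeats):
        y = X[:, 0] + X[:, 1] + noise * rng.normal(size=len(X))
        coefs.append(LinearRegression().fit(X, y).coef_[0])
    return np.std(coefs)


spread_indep = coef_spread(X_indep)
spread_collinear = coef_spread(X_collinear)

print(f"condition number: independent={cond_indep:.2f}, collinear={cond_collinear:.2e}")
print(f"coefficient std:  independent={spread_indep:.4f}, collinear={spread_collinear:.2f}")
```

With the nearly duplicated feature, the same small amount of noise in `y` moves the fitted coefficient far more than it does with independent features.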

## Ridge Regression

In [10]:
from sklearn.linear_model import Ridge

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = Ridge(alpha=0.5)
reg.fit(X, y)
print(reg.coef_)
print(reg.intercept_)

[0.44444444 0.94444444]
-0.08333333333333393


- `alpha` controls the amount of shrinkage: the larger the value, the stronger the shrinkage, which makes the coefficients more robust to collinearity


- L2 norm regularization
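
To see the shrinkage at work, the sketch below refits `Ridge` on the same toy data for several values of `alpha` and records the L2 norm of the coefficients. The grid of `alpha` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # illustrative grid, not tuned
norms = []
for alpha in alphas:
    reg = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the coefficient vector is what the ridge penalty shrinks.
    norms.append(np.linalg.norm(reg.coef_))
    print(f"alpha={alpha:>6}: coef={reg.coef_}, L2 norm={norms[-1]:.4f}")
```

The norm decreases as `alpha` grows, moving from values close to the ordinary least squares solution toward zero.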

#### Ridge regression with built-in cross-validation

In [12]:
from sklearn.linear_model import RidgeCV
import numpy as np

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = RidgeCV(alphas=np.logspace(-6, 6, 13), cv=2) # candidate alphas to search, 2-fold cross-validation
reg.fit(X, y)
print(reg.alpha_) # the best alpha found by cross-validation

1e-06


## Lasso Regression

In [15]:
from sklearn.linear_model import Lasso

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000)
reg.fit(X, y)
print(reg.predict([[1, 1]]))

[0.84594595]


- Uses an L1 norm penalty to obtain a sparse model, meaning some coefficients are driven to exactly 0


- Often used for **feature selection**
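
The sparsity that makes Lasso useful for feature selection can be checked on synthetic data. This is a minimal sketch: the dataset from `make_regression`, with 3 informative features out of 20, and the value `alpha=1.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic problem in which only 3 of the 20 features carry signal.
X, y = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=1.0, random_state=0
)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# OLS assigns a weight to every feature; Lasso sets many weights to exactly 0.
n_nonzero_ols = np.count_nonzero(ols.coef_)
n_nonzero_lasso = np.count_nonzero(lasso.coef_)
selected = np.flatnonzero(lasso.coef_)  # indices of the features Lasso kept

print("non-zero OLS coefficients:  ", n_nonzero_ols)
print("non-zero Lasso coefficients:", n_nonzero_lasso)
print("features kept by Lasso:     ", selected)
```

The indices in `selected` are the features a downstream model could be restricted to.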


- For high-dimensional datasets with many collinear features, **LassoCV** is most often preferable


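- Model selection: built-in cross-validation with **LassoCV** and **LassoLarsCV**
    - **LassoLarsCV** is based on **Least Angle Regression**; it explores more relevant values of **alpha** and is often faster when the number of samples is very small compared to the number of features
    - compared with the `C` parameter of SVM, `alpha = 1/C` or `alpha = 1/(n_samples*C)`, depending on the estimator and the exact objective it optimizes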
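
Both cross-validated estimators can be tried side by side. The sketch below assumes a synthetic dataset from `make_regression` and 5-fold cross-validation; these choices are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsCV

X, y = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=10.0, random_state=0
)

# LassoCV searches a grid of alphas with coordinate descent;
# LassoLarsCV picks alpha along the path computed by Least Angle Regression.
lasso_cv = LassoCV(cv=5).fit(X, y)
lars_cv = LassoLarsCV(cv=5).fit(X, y)

print("alpha chosen by LassoCV:    ", lasso_cv.alpha_)
print("alpha chosen by LassoLarsCV:", lars_cv.alpha_)
```

The two estimators need not agree exactly on `alpha_`, since they evaluate different sets of candidate values.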
    
    
- Model selection: **LassoLarsIC** can also be used; it selects `alpha` with the **Akaike information criterion (AIC)** or the **Bayes information criterion (BIC)**. This is a cheaper alternative to cross-validation, but it needs a proper estimate of the degrees of freedom
    - **Akaike information criterion (AIC)**
        - **-2L<sub>m</sub> + 2m**, where L<sub>m</sub> is the maximized log-likelihood and m is the number of parameters
        - trades goodness of fit against model complexity
        - the smaller the better
    - **Bayes information criterion (BIC)**
        - **-2L<sub>m</sub> + m ln(n)**, where L<sub>m</sub> is the maximized log-likelihood, m is the number of parameters, and n is the number of samples
        - penalizes each extra parameter more heavily than AIC once n > e<sup>2</sup> ≈ 7.4, so it tends to select sparser models
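
The two criteria can be written out directly and compared with what `LassoLarsIC` selects. In this sketch the helper functions `aic` and `bic` and the synthetic dataset are assumptions for illustration; the helpers use the standard textbook definitions, with the BIC penalty depending on the number of samples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 L_m + 2 m."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_samples):
    """Bayes information criterion: -2 L_m + m ln(n)."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)


# Same fit quality and model size; only the complexity penalty differs.
print("AIC:", aic(-50.0, 5))
print("BIC:", bic(-50.0, 5, n_samples=100))

X, y = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=10.0, random_state=0
)

# LassoLarsIC picks the alpha on the LARS path that minimizes the criterion.
model_aic = LassoLarsIC(criterion="aic").fit(X, y)
model_bic = LassoLarsIC(criterion="bic").fit(X, y)

print("alpha selected with AIC:", model_aic.alpha_)
print("alpha selected with BIC:", model_bic.alpha_)
```

No cross-validation folds are fitted here, which is why the information criteria are cheaper to evaluate.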
        
        
- **MultiTaskLasso** is used when y is a 2D array of shape (n_samples, n_tasks); the selected features are the same for all tasks
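
A minimal sketch of the multi-task case, assuming synthetic data in which three tasks depend on the same two features. The shapes, the noise level, and `alpha=0.1` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n_samples, n_features, n_tasks = 60, 10, 3

X = rng.normal(size=(n_samples, n_features))

# Every task uses the same two features, each with its own weights.
true_coef = np.zeros((n_tasks, n_features))
true_coef[:, :2] = rng.normal(size=(n_tasks, 2))
Y = X @ true_coef.T + 0.1 * rng.normal(size=(n_samples, n_tasks))

reg = MultiTaskLasso(alpha=0.1).fit(X, Y)

# coef_ has one row per task and one column per feature.
print("Y shape:    ", Y.shape)
print("coef_ shape:", reg.coef_.shape)

# Features with a non-zero coefficient in at least one task.
active = np.flatnonzero(np.any(reg.coef_ != 0, axis=0))
print("active features:", active)
```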


- **ElasticNet**, a linear regression trained with both L1 and L2 norm regularization; the **l1_ratio** parameter controls the convex combination of L1 and L2
    - **ElasticNetCV**, which sets `alpha` and `l1_ratio` by cross-validation
    - **MultiTaskElasticNet**, used when y is a 2D array of shape (n_samples, n_tasks)
    - **MultiTaskElasticNetCV**, its cross-validated counterpart
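
The role of `l1_ratio` can be seen on the same toy data used in the earlier cells. This is a sketch with arbitrarily chosen values of `l1_ratio`; it also checks that `l1_ratio=1.0` reproduces `Lasso` with the same `alpha`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

# l1_ratio near 0 weights the L2 penalty more; l1_ratio=1 is a pure L1 penalty.
for l1_ratio in (0.1, 0.5, 1.0):
    reg = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
    print(f"l1_ratio={l1_ratio}: coef={reg.coef_}, intercept={reg.intercept_:.4f}")

# With l1_ratio=1.0 the L2 term vanishes, so the fit should match Lasso.
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
same = np.allclose(enet_as_lasso.coef_, lasso.coef_)
print("ElasticNet(l1_ratio=1.0) matches Lasso:", same)
```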