# Generalized Linear Model

## Ordinary Least Squares

In [1]:
from sklearn.linear_model import LinearRegression

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = LinearRegression()
reg.fit(X, y)
print(reg.coef_)
print(reg.intercept_)

[-3.  4.]
-3.000000000000005


- Mind the problem of **multicollinearity**
    - it means the features are correlated with each other, so the design matrix **X** will be close to singular
    - the model will be **highly sensitive to random variance** in the observed target
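To make the "close to singular" point concrete, here is a small sketch that compares the condition number of a design matrix with independent features against one whose second feature is nearly a copy of the first. The synthetic data and the `1e-6` perturbation are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100

# Two independent features: well-conditioned design matrix
x1 = rng.normal(size=n_samples)
x2 = rng.normal(size=n_samples)
X_independent = np.column_stack([x1, x2])

# Second feature is almost a copy of the first: nearly singular design matrix
x2_collinear = x1 + 1e-6 * rng.normal(size=n_samples)
X_collinear = np.column_stack([x1, x2_collinear])

cond_independent = np.linalg.cond(X_independent)
cond_collinear = np.linalg.cond(X_collinear)
print(cond_independent, cond_collinear)
```

A very large condition number means small changes in y can produce large changes in the fitted coefficients.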

## Ridge Regression

In [10]:
from sklearn.linear_model import Ridge

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = Ridge(alpha=0.5)
reg.fit(X, y)
print(reg.coef_)
print(reg.intercept_)

[0.44444444 0.94444444]
-0.08333333333333393


- Use **alpha** to control the amount of shrinkage, which makes the model more robust to collinearity
- L2 norm regularization
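The shrinkage can also be written out by hand. The sketch below assumes the intercept is handled by centering X and y (so it is not penalized) and solves the L2-penalized normal equations with NumPy on the same toy data and `alpha=0.5`; under that assumption it should agree with the coefficients printed by the Ridge cell above. It is an illustration, not the solver scikit-learn uses internally.

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [3, 4.5]])
y = np.array([1.0, 3.0, 6.0])
alpha = 0.5

# Center the data so the intercept is left unpenalized
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Ridge solution: (Xc^T Xc + alpha * I) w = Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
intercept = y_mean - X_mean @ w

# Unpenalized least-squares solution on the same centered data, for comparison
w_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

print(w, intercept)
print(np.linalg.norm(w), np.linalg.norm(w_ols))
```

The ridge coefficient vector has a much smaller norm than the unpenalized one, which is the shrinkage that alpha controls.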

#### Ridge regression with built-in cross-validation

In [12]:
from sklearn.linear_model import RidgeCV
import numpy as np

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = RidgeCV(alphas=np.logspace(-6, 6, 13), cv=2) # candidate alphas and number of CV folds
reg.fit(X, y)
print(reg.alpha_) # the best alpha found by cross-validation

1e-06


## Lasso Regression

In [15]:
from sklearn.linear_model import Lasso

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000)
reg.fit(X, y)
print(reg.predict([[1, 1]]))

[0.84594595]


- Uses the L1 norm to get a sparse model, meaning some coefficients are driven to exactly 0


- Often used for **feature selection**


- When tuning alpha by cross-validation, **LassoCV** is most often **preferable for high-dimensional datasets with many collinear features**
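To illustrate the sparsity and feature-selection points above, the following sketch fits OLS and Lasso on synthetic data where only the first two of ten features carry signal. The data, the noise level, and `alpha=0.5` are assumptions picked for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count how many coefficients each model keeps away from zero
n_nonzero_ols = np.count_nonzero(ols.coef_)
n_nonzero_lasso = np.count_nonzero(lasso.coef_)
print(n_nonzero_ols, n_nonzero_lasso)
print(lasso.coef_)
```

The features with non-zero Lasso coefficients are the "selected" ones; note that the surviving coefficients are also shrunk toward zero.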


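- Model selection: built-in cross-validation with **LassoCV** and **LassoLarsCV**
    - **LassoLarsCV** is based on **Least Angle Regression**; it explores more relevant values of **alpha** and is often faster when the number of samples is very small compared to the number of features
    - compared to the C parameter of SVM, alpha = 1/C or alpha = 1/(n_samples*C), depending on the estimator and the exact objective function being optimized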
    
    
- Model selection: could also use **LassoLarsIC**, which uses the **Akaike information criterion (AIC)** or the **Bayes information criterion (BIC)**. This is considered a cheaper alternative to cross-validation, but it needs a proper estimate of the degrees of freedom
    - **Akaike information criterion (AIC)**
        - **-2L<sub>m</sub> + 2m**, where L<sub>m</sub> is the maximized log-likelihood and m the number of parameters
        - balances goodness of fit against model complexity
        - the smaller the better
    - **Bayes information criterion (BIC)**
        - **-2L<sub>m</sub> + m ln(n)**, where L<sub>m</sub> is the maximized log-likelihood, m the number of parameters, and n the number of samples
        - penalizes extra parameters more heavily than AIC once n exceeds about 7 (ln(n) > 2), so it tends to select sparser models
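A small worked example may help compare the two criteria. The sketch below assumes the standard definitions, AIC = -2L + 2m and BIC = -2L + m ln(n) with n the number of samples; the two candidate models and their log-likelihoods are hypothetical numbers, not results from any dataset in this notebook.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2L + 2m (smaller is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """Bayes information criterion: -2L + m * ln(n) (smaller is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

n_samples = 100
# Hypothetical fits: the larger model fits slightly better but uses more parameters
small_model = {"log_likelihood": -120.0, "n_params": 3}
large_model = {"log_likelihood": -116.0, "n_params": 6}

aic_small = aic(small_model["log_likelihood"], small_model["n_params"])
aic_large = aic(large_model["log_likelihood"], large_model["n_params"])
bic_small = bic(small_model["log_likelihood"], small_model["n_params"], n_samples)
bic_large = bic(large_model["log_likelihood"], large_model["n_params"], n_samples)

print(aic_small, aic_large)
print(bic_small, bic_large)
```

With these made-up numbers AIC prefers the larger model while BIC prefers the smaller one, which shows how the heavier BIC penalty favors sparser models.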
        
        
- **MultiTaskLasso**, used when y is a 2D array of shape (n_samples, n_tasks)


- **ElasticNet**, a linear regression trained with both L1 and L2 norm regularization; the **l1_ratio** parameter controls the convex combination of L1 and L2
    - **ElasticNetCV**, which sets the parameters by cross-validation
    - **MultiTaskElasticNet**, used when y is a 2D array of shape (n_samples, n_tasks)
    - **MultiTaskElasticNetCV**
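To show what the convex combination controlled by **l1_ratio** looks like, here is a sketch of the penalty term alone. It assumes the form alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, which is my reading of the objective in the scikit-learn documentation; the helper name `elastic_net_penalty` is hypothetical.

```python
def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term only (no data-fit term), under the assumed ElasticNet form."""
    l1_norm = sum(abs(w) for w in coef)
    l2_norm_squared = sum(w * w for w in coef)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared

coef = [1.0, -2.0, 0.0]
pure_l1 = elastic_net_penalty(coef, alpha=0.1, l1_ratio=1.0)  # Lasso-like penalty
pure_l2 = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.0)  # Ridge-like penalty
mixed = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.5)    # halfway between
print(pure_l1, pure_l2, mixed)
```

Setting `l1_ratio=1` leaves only the L1 term and `l1_ratio=0` leaves only the L2 term; values in between mix the two.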

## Least Angle Regression (LARS)

- Use the **Lars** class
- Used to **select features**, similar to forward stepwise regression and forward stagewise regression
- Pros: faster, easily modified
- Cons: sensitive to noise

#### LARS Lasso

In [5]:
from sklearn.linear_model import LassoLars

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = LassoLars(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000)
reg.fit(X, y)
print(reg.predict([[1, 1]]))
print(reg.coef_)
print(reg.coef_path_) # Coefficients along the regularization path, shape (n_features, n_alphas + 1)

[0.97823825]
[0.         1.28459732]
[[0.         0.        ]
 [0.         3.19001149]]


## Orthogonal Matching Pursuit (OMP)

- Use the **OrthogonalMatchingPursuit** and **OrthogonalMatchingPursuitCV** classes
- Similar to LARS: performs **feature selection** and yields a sparse model
- Differs in imposing an **L0 norm** constraint, so the number of non-zero coefficients must be specified (or, alternatively, a target error tolerance)
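As a sketch of the L0-style constraint, the example below asks OMP for at most three non-zero coefficients on synthetic data whose true signal lives in three of twenty features. The data and the chosen feature indices are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# Only features 2, 7 and 11 contribute to the target
true_coef = np.zeros(20)
true_coef[[2, 7, 11]] = [1.5, -2.0, 3.0]
y = X @ true_coef + 0.01 * rng.normal(size=200)

# Cap the number of non-zero coefficients at 3
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
selected = np.flatnonzero(omp.coef_)
print(selected)
print(omp.coef_[selected])
```

The indices in `selected` are the features OMP kept; every other coefficient is exactly zero.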

## Bayesian Regression

- Introduces uninformative priors over the parameters, which act as regularization
- Similar to Ridge Regression
- Uses the data at hand to fit the regularization parameters
- Inference can be time consuming

#### Bayesian Ridge Regression

In [31]:
from sklearn.linear_model import BayesianRidge

X = [[0, 1], [2, 3], [3, 4.5]]
y = [1, 3, 6]

reg = BayesianRidge()
reg.fit(X, y)
print(reg.predict([[2, 3]]))

[3.61066477]


- The parameters **w**, **alpha**, **lambda** are inferred from the data
- The hyperparameters, which define the priors over **alpha** and **lambda**, can be specified
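The inferred quantities can be inspected after fitting. The sketch below uses synthetic data (an assumption) and passes the four prior hyperparameters explicitly at what I believe are their default values, then reads back `alpha_` and `lambda_` and asks for a predictive standard deviation.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=50)

# Hyperparameters of the Gamma priors over alpha and lambda (assumed defaults)
reg = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)
reg.fit(X, y)

# alpha_: estimated noise precision; lambda_: estimated precision of the weights
print(reg.alpha_, reg.lambda_)

# Predictive mean and standard deviation for a few points
y_mean, y_std = reg.predict(X[:5], return_std=True)
print(y_mean, y_std)
```

The predictive standard deviation is one practical benefit of the Bayesian formulation over plain Ridge.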

#### Automatic Relevance Determination (ARD)

- Use the **ARDRegression** class
- Similar to Bayesian Ridge Regression, but produces a sparser model
    - It drops the assumption of a spherical Gaussian prior for **w**
    - It assigns a separate precision **lambda** to each coordinate of **w**
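The per-coefficient prior shows up directly in the fitted attributes. In this sketch (synthetic data, chosen only for illustration) `BayesianRidge` ends up with a single scalar `lambda_`, while `ARDRegression` keeps one value per feature.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# Only two of the eight features are relevant
true_coef = np.array([2.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 0.1 * rng.normal(size=100)

ridge = BayesianRidge().fit(X, y)
ard = ARDRegression().fit(X, y)

# One shared precision for BayesianRidge, one precision per feature for ARD
print(np.ndim(ridge.lambda_), ard.lambda_.shape)
print(np.round(ridge.coef_, 3))
print(np.round(ard.coef_, 3))
```

Comparing the two printed coefficient vectors is a quick way to see how strongly ARD suppresses the irrelevant features on a given dataset.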

## A Toy Example for Regressions

In [91]:
import numpy as np
# NOTE: load_boston was removed in scikit-learn 1.2; this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, \
ElasticNet, ElasticNetCV, Lars, LarsCV, LassoLars, LassoLarsCV, LassoLarsIC, \
OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV, BayesianRidge, ARDRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Prepare the data
np.random.seed(42)
data = load_boston()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Parameters
alpha = np.arange(0.1, 0.6, 0.1)
l1_ratio = 0.5

# OLS regression
clf = LinearRegression()
clf.fit(X_train, y_train)
print('OLS Regression')
print('mse: %.4f'%(mean_squared_error(y_test, clf.predict(X_test))))
print('\n')

# Ridge regression
print("Ridge Regression")
for i in alpha:
    clf = Ridge(alpha=i)
    clf.fit(X_train, y_train)
    print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), i))
print('\n')

# Ridge CV
print("Ridge CV")
clf = RidgeCV(alphas=alpha)
clf.fit(X_train, y_train)
print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), clf.alpha_))
print('\n')

# Lasso regression
print("Lasso Regression")
for i in alpha:
    clf = Lasso(alpha=i)
    clf.fit(X_train, y_train)
    print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), i))
print('\n')

# Lasso CV
print("Lasso CV")
clf = LassoCV(alphas=alpha, cv=3)
clf.fit(X_train, y_train)
print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), clf.alpha_))
print('\n')

# ElasticNet regression
print("ElasticNet Regression")
for i in alpha:
    clf = ElasticNet(alpha=i, l1_ratio=l1_ratio)
    clf.fit(X_train, y_train)
    print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), i))
print('\n')

# ElasticNet CV
print("ElasticNet CV")
clf = ElasticNetCV(alphas=alpha, cv=3, l1_ratio=l1_ratio)
clf.fit(X_train, y_train)
print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), clf.alpha_))
print('\n')

# Least Angle Regression (LARS)
print("LARS")
clf = Lars()
clf.fit(X_train, y_train)
print('mse: %.4f'%(mean_squared_error(y_test, clf.predict(X_test))))
print('\n')

# LARS Lasso
print("LARS Lasso")
for i in alpha:
    clf = LassoLars(alpha=i)
    clf.fit(X_train, y_train)
    print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), i))
print('\n')

# LARS Lasso CV
print("LARS Lasso CV")
clf = LassoLarsCV(cv=3) # Find the alpha, no need to specify
clf.fit(X_train, y_train)
print('mse: %.4f (alpha: %.2f)'%(mean_squared_error(y_test, clf.predict(X_test)), clf.alpha_))
print('\n')

# LARS Lasso IC
print("LARS Lasso IC")
clf = LassoLarsIC(criterion='aic') # Use IC to find the best alpha
clf.fit(X_train, y_train)
print('mse: %.4f (alpha: %.2f using aic)'%(mean_squared_error(y_test, clf.predict(X_test)), clf.alpha_))
clf = LassoLarsIC(criterion='bic') # Use IC to find the best alpha
clf.fit(X_train, y_train)
print('mse: %.4f (alpha: %.2f using bic)'%(mean_squared_error(y_test, clf.predict(X_test)), clf.alpha_))
print('\n')

# Orthogonal Matching Pursuit (OMP)
print("OMP")
clf = OrthogonalMatchingPursuit(n_nonzero_coefs=10) # Must specify how many non-zero coefficients 
clf.fit(X_train, y_train)
print('mse: %.4f'%(mean_squared_error(y_test, clf.predict(X_test))))
print(clf.coef_)
print('\n')

# OMP CV
print("OMP CV")
clf = OrthogonalMatchingPursuitCV(cv=3) # Use CV to find the best non-zero numbers
clf.fit(X_train, y_train)
print('mse: %.4f'%(mean_squared_error(y_test, clf.predict(X_test))))
print(clf.coef_)
print('\n')

# Bayesian Ridge Regression
print("Bayesian Ridge Regression")
clf = BayesianRidge() # See, most parameters are inferred
clf.fit(X_train, y_train)
print('mse: %.4f'%(mean_squared_error(y_test, clf.predict(X_test))))
print(clf.coef_)
print('\n')

# Automatic Relevance Determination
print("ARD")
clf = ARDRegression() # See, it will be sparser than the Bayesian Ridge
clf.fit(X_train, y_train)
print('mse: %.4f'%(mean_squared_error(y_test, clf.predict(X_test))))
print(clf.coef_)
print('\n')

OLS Regression
mse: 22.0987


Ridge Regression
mse: 22.1422 (alpha: 0.10)
mse: 22.1867 (alpha: 0.20)
mse: 22.2305 (alpha: 0.30)
mse: 22.2727 (alpha: 0.40)
mse: 22.3129 (alpha: 0.50)


Ridge CV
mse: 22.1422 (alpha: 0.10)


Lasso Regression
mse: 23.3859 (alpha: 0.10)
mse: 23.4004 (alpha: 0.20)
mse: 23.2705 (alpha: 0.30)
mse: 23.2139 (alpha: 0.40)
mse: 23.2303 (alpha: 0.50)


Lasso CV
mse: 23.3859 (alpha: 0.10)


ElasticNet Regression
mse: 22.9952 (alpha: 0.10)
mse: 22.8678 (alpha: 0.20)
mse: 22.8932 (alpha: 0.30)
mse: 22.9980 (alpha: 0.40)
mse: 23.1302 (alpha: 0.50)


ElasticNet CV
mse: 22.9952 (alpha: 0.10)


LARS
mse: 22.0987


LARS Lasso
mse: 29.3274 (alpha: 0.10)
mse: 40.9072 (alpha: 0.20)
mse: 58.8988 (alpha: 0.30)
mse: 72.2608 (alpha: 0.40)
mse: 72.2608 (alpha: 0.50)


LARS Lasso CV
mse: 22.0987 (alpha: 0.00)


LARS Lasso IC
mse: 24.3174 (alpha: 0.01 using aic)
mse: 26.2103 (alpha: 0.06 using bic)


OMP
mse: 23.0326
[-1.28289621e-01  2.10004125e-02  0.00000000e+00  2.98248555e+00
 

## Logistic Regression

- Need to specify the regularizer: L1, L2, elasticnet, or none


- The choice of **solver** determines which regularizers are available, as well as whether **OVR (one-vs-rest)** or **multinomial** is supported
     - **liblinear**: OVR, L1, L2
     - **lbfgs**, **sag**, **newton-cg**: OVR, multinomial, L2; faster for high-dimensional data
         - **sag** is best for datasets with a large number of samples and a large number of features
         - **lbfgs** is recommended for small datasets
     - **saga**: OVR, multinomial, L1, L2, elasticnet (the only solver that supports it); a variant of **sag**
     
     
- Use **LogisticRegressionCV** to find the optimal **C** and **l1_ratio**
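A minimal sketch of pairing a solver with a regularizer, on synthetic data (an assumption). It uses the `penalty`, `solver`, `l1_ratio` and `C` arguments as they exist in the scikit-learn versions this notebook targets; argument names may differ in later releases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)  # sag/saga converge faster on scaled features

# liblinear supports the L1 penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(X, y)

# saga supports the elasticnet penalty; l1_ratio mixes L1 and L2
enet_model = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000
)
enet_model.fit(X, y)

l1_accuracy = l1_model.score(X, y)
enet_accuracy = enet_model.score(X, y)
print(l1_model.coef_)
print(l1_accuracy, enet_accuracy)
```

A smaller `C` means stronger regularization, so lowering it further should zero out more of the L1 model's coefficients.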

## Stochastic Gradient Descent (SGD)

- Has **SGDClassifier** and **SGDRegressor**
- Efficient for large datasets
- Use **partial_fit** to enable online/out-of-core learning
- Setting the **hinge** loss yields a linear SVM, while the **log** loss (named **log_loss** in recent scikit-learn releases) yields a logistic regression
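The online-learning workflow can be sketched as follows. The mini-batches are generated on the fly from a synthetic, linearly separable rule, which stands in for data arriving as a stream; the batch size and number of batches are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge", random_state=0)  # hinge loss gives a linear SVM
classes = np.array([0, 1])  # partial_fit needs the full set of labels up front

# Feed the data in mini-batches, as if they arrived as a stream
for _ in range(20):
    X_batch = rng.normal(size=(50, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh samples drawn from the same rule
X_new = rng.normal(size=(200, 2))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
accuracy = clf.score(X_new, y_new)
print(accuracy)
```

Because each call only sees one batch, the full dataset never has to fit in memory.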

## Perceptron

## Passive Aggressive Algorithms

## Robustness Regression

- Deals with modeling errors and outliers in the data
- Estimators: **RANSACRegressor**, **TheilSenRegressor**, **HuberRegressor**
- When in doubt, use **RANSAC** (RANdom SAmple Consensus)
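To see why a robust estimator matters, this sketch corrupts 20% of the targets on a synthetic line (true slope 2, an assumption of the example) and compares the slope recovered by OLS with the one recovered by RANSAC at its default settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)
y[-20:] += 50.0  # corrupt the last 20 targets with gross outliers
X = x.reshape(-1, 1)

ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

ols_slope = ols.coef_[0]
ransac_slope = ransac.estimator_.coef_[0]  # model refit on the detected inliers
print(ols_slope, ransac_slope)
print(ransac.inlier_mask_.sum())  # number of samples treated as inliers
```

OLS is pulled toward the outliers, while RANSAC fits only the samples it classifies as inliers.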

## Polynomial Regression

- Use **PolynomialFeatures** to expand the inputs, then fit a linear model on the expanded features
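A short sketch of the idea: the feature expansion is done by `PolynomialFeatures`, and an ordinary linear model is then fitted on the expanded columns. The quadratic target below is a made-up example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of [a, b] gives [1, a, b, a^2, a*b, b^2]
expanded = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))
print(expanded)

# A linear model on the expanded features can fit a quadratic relationship
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

prediction = model.predict(np.array([[4.0]]))
print(prediction)
```

The model stays linear in its coefficients, so any of the regularized estimators above could replace `LinearRegression` in the pipeline.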