In [1]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

## Linear Regression

The most common performance measure of a regression model is the Root Mean Square Error (RMSE).
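
As a small illustration of the measure itself, RMSE is the square root of the mean squared difference between predictions and targets. The helper name `rmse` and the sample values below are made up for this sketch and are not part of the original notes.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# errors are 1, 0 and -2, so the mean squared error is 5/3
example_rmse = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Because the errors are squared before averaging, a single large error weighs more heavily than several small ones.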

In [2]:
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

The Normal Equation: 
$\hat{\theta} = (X^TX)^{-1}X^Ty$

In [3]:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
theta_best

array([[ 3.81767399],
       [ 3.05748155]])
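
The explicit inverse in the Normal Equation can be avoided. The sketch below, on freshly generated data with a fixed seed (an assumption made only for reproducibility), compares the inverse-based solution with NumPy's least-squares solver and pseudoinverse. The variable names with the `_demo` suffix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, assumed here for reproducibility
X_demo = 2 * rng.random((100, 1))
y_demo = 4 + 3 * X_demo + rng.standard_normal((100, 1))
X_demo_b = np.c_[np.ones((100, 1)), X_demo]  # add x0 = 1 to each instance

# Normal Equation with an explicit matrix inverse
theta_inv = np.linalg.inv(X_demo_b.T @ X_demo_b) @ X_demo_b.T @ y_demo

# Least-squares solver: same minimizer, no explicit inverse
theta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X_demo_b, y_demo, rcond=None)

# Moore-Penrose pseudoinverse: also defined when X^T X is singular
theta_pinv = np.linalg.pinv(X_demo_b) @ y_demo
```

All three should agree on well-conditioned data like this; the solver-based routes are usually the safer choice when features are redundant.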

Equivalent code using sklearn:

In [4]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

X_new = np.array([[0], [2]])
lin_reg.predict(X_new)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

(array([ 3.81767399]), array([[ 3.05748155]]))

array([[ 3.81767399],
       [ 9.9326371 ]])

## Gradient Descent
- The MSE cost function for a Linear Regression model happens to be a convex function, so it has a single global minimum and no local minima.
- The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very
different scales.

When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler
class), or else it will take much longer to converge.
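
A minimal sketch of the scaling step, assuming two hypothetical features whose ranges differ by a factor of 1000. `StandardScaler` subtracts each column's mean and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two made-up features on very different scales: [0, 1) and [0, 1000)
X_raw = np.c_[rng.random(200), 1000 * rng.random(200)]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

col_means = X_scaled.mean(axis=0)  # approximately 0 for every column
col_stds = X_scaled.std(axis=0)    # approximately 1 for every column
```

The scaler should be fit on the training set only and then reused, via `scaler.transform`, on validation and test data.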

### Batch Gradient Descent
The gradient vector of the cost function, $\nabla_\theta MSE(\theta) = \frac{2}{m}X^T(X\theta - y)$, involves calculations over the __full__ training set X at each Gradient Descent step. This is why the algorithm
is called Batch Gradient Descent: it uses the __whole__ batch of training data at every step. As a result it is terribly slow on very
large training sets. However, Gradient Descent scales well
with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster
using Gradient Descent than using the Normal Equation.
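
A minimal NumPy sketch of Batch Gradient Descent for Linear Regression, assuming the standard MSE gradient $\frac{2}{m}X^T(X\theta - y)$. The learning rate, iteration count, seed, and variable names are illustrative choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(42)
m_gd = 100
X_gd = 2 * rng.random((m_gd, 1))
y_gd = 4 + 3 * X_gd + rng.standard_normal((m_gd, 1))
X_gd_b = np.c_[np.ones((m_gd, 1)), X_gd]  # add x0 = 1 to each instance

eta = 0.1            # learning rate (illustrative)
n_iterations = 1000  # number of full passes (illustrative)
theta_gd = rng.standard_normal((2, 1))  # random initialization

for _ in range(n_iterations):
    # gradient of the MSE computed over the whole training set
    gradients = 2 / m_gd * X_gd_b.T @ (X_gd_b @ theta_gd - y_gd)
    theta_gd = theta_gd - eta * gradients

# reference solution from a least-squares solver, for comparison
theta_reference = np.linalg.lstsq(X_gd_b, y_gd, rcond=None)[0]
```

With this learning rate the iterates should end up very close to the least-squares solution; a much larger `eta` can make the updates diverge.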

### Stochastic Gradient Descent
Stochastic Gradient Descent picks a random instance in the training set at every step and
computes the gradients based only on that single instance.

- Randomness helps the algorithm escape from local optima, but it also means that the algorithm
can never settle at the minimum.
- One solution to this dilemma is to gradually reduce the learning rate.
(The function that determines the learning rate at each iteration is called the learning
schedule.)
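
The points above can be sketched in NumPy: one random instance per step, with a learning schedule that shrinks the learning rate over time. The schedule form `t0 / (t + t1)` and its hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m_sgd = 100
X_sgd = 2 * rng.random((m_sgd, 1))
y_sgd = 4 + 3 * X_sgd + rng.standard_normal((m_sgd, 1))
X_sgd_b = np.c_[np.ones((m_sgd, 1)), X_sgd]  # add x0 = 1 to each instance

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters (illustrative values)

def learning_schedule(t):
    """Learning rate that decays as the step counter t grows."""
    return t0 / (t + t1)

theta_sgd = rng.standard_normal((2, 1))  # random initialization
for epoch in range(n_epochs):
    for i in range(m_sgd):
        idx = rng.integers(m_sgd)            # pick one random instance
        xi = X_sgd_b[idx:idx + 1]
        yi = y_sgd[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta_sgd - yi)  # gradient on that instance only
        eta = learning_schedule(epoch * m_sgd + i)
        theta_sgd = theta_sgd - eta * gradients
```

The result is expected to land near the least-squares solution rather than exactly on it, because each step uses a noisy gradient estimate.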

Perform Linear Regression using SGD with Scikit-Learn’s SGDRegressor class, which defaults to optimizing the squared error cost function:

In [5]:
from sklearn.linear_model import SGDRegressor
# runs 50 epochs, learning rate of 0.1, default learning schedule
# (n_iter is deprecated in favor of max_iter in newer scikit-learn releases)
sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
sgd_reg.intercept_, sgd_reg.coef_



SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.1,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', max_iter=None, n_iter=50, penalty=None,
       power_t=0.25, random_state=None, shuffle=True, tol=None, verbose=0,
       warm_start=False)

(array([ 3.80106953]), array([ 3.0465731]))

### Mini-batch Gradient Descent
Mini-batch Gradient Descent computes the gradients on small random sets of instances called mini-batches.

The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance
boost from hardware optimization of matrix operations, especially when using GPUs.
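
A minimal NumPy sketch of Mini-batch Gradient Descent, assuming the training set is reshuffled every epoch and then walked through in fixed-size slices. The batch size, schedule, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
m_mb = 100
X_mb = 2 * rng.random((m_mb, 1))
y_mb = 4 + 3 * X_mb + rng.standard_normal((m_mb, 1))
X_mb_b = np.c_[np.ones((m_mb, 1)), X_mb]  # add x0 = 1 to each instance

n_epochs = 100
batch_size = 20
t0, t1 = 10, 100  # learning schedule hyperparameters (illustrative values)

theta_mb = rng.standard_normal((2, 1))  # random initialization
t = 0
for epoch in range(n_epochs):
    shuffled = rng.permutation(m_mb)  # new random order every epoch
    for start in range(0, m_mb, batch_size):
        t += 1
        batch = shuffled[start:start + batch_size]
        xb, yb = X_mb_b[batch], y_mb[batch]
        # gradient of the MSE computed on the mini-batch only
        gradients = 2 / len(batch) * xb.T @ (xb @ theta_mb - yb)
        eta = t0 / (t + t1)
        theta_mb = theta_mb - eta * gradients
```

Each gradient is a single matrix product over 20 rows, which is the kind of operation that vectorized hardware handles efficiently.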


| Algorithm | Large m (instances) | Out-of-core support | Large n (features) | Hyperparams | Scaling required | Scikit-Learn |
|------|------|------|------|------|------|------|
| Normal Equation | Fast | No | Slow | 0 | No | LinearRegression |
| Batch GD | Slow | No | Fast | 2 | Yes | n/a |
| Stochastic GD | Fast | Yes | Fast | $\ge$2 | Yes | SGDRegressor |
| Mini-batch GD | Fast | Yes | Fast | $\ge$2 | Yes | n/a |

## Polynomial Regression

In [6]:
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1) # a quadratic equation

Transform the training data by adding the square (2nd-degree polynomial) of each feature in the
training set as a new feature (in this case there is just one feature):

In [7]:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
X_poly[0]

array([-1.79919609])

array([-1.79919609,  3.23710658])

X_poly now contains the original feature of X plus the square of this feature.
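
For a single feature, the transformation can be reproduced by hand, which makes it clear what `PolynomialFeatures` computes here. The sketch uses its own seeded data, and the names with the `_pf` suffix are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X_pf = 6 * rng.random((100, 1)) - 3

# library transformation: columns x and x^2 (no bias column)
X_pf_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_pf)

# manual equivalent for one input feature
X_pf_manual = np.c_[X_pf, X_pf ** 2]
```

With more than one input feature the transformer also adds cross terms such as $x_1 x_2$, which the manual version above does not cover.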

In [8]:
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

(array([ 1.87883458]), array([[ 0.93746118,  0.46913615]]))

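The `intercept_` and `coef_` attributes give the fitted coefficients of the linear regression. Here the model estimates roughly $0.47x^2 + 0.94x + 1.88$, close to the original function $0.5x^2 + x + 2$.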

## Learning Curves
If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.
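
A sketch of how such curves can be computed with Scikit-Learn's `learning_curve` helper, assuming a plain linear model fit to quadratic data so that it underfits. The data, seed, and train sizes are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(42)
m_lc = 200
X_lc = 6 * rng.random((m_lc, 1)) - 3
y_lc = (0.5 * X_lc ** 2 + X_lc + 2 + rng.standard_normal((m_lc, 1))).ravel()

# a straight line cannot follow the quadratic trend, so the model underfits
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X_lc, y_lc,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_squared_error",
)

# scores are negated MSE values; convert them to RMSE per training size
train_rmse = np.sqrt(-train_scores.mean(axis=1))
valid_rmse = np.sqrt(-valid_scores.mean(axis=1))
```

For an underfitting model both curves are expected to plateau at a fairly high error, clearly above the noise level of 1 used to generate the data.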

## Regularized Linear Models
For a linear model, regularization is typically achieved by constraining the weights of the model.
### Ridge Regression
Note that the regularization term
should **only** be added to the cost function during training.

$J(\theta) = MSE(\theta) + \alpha\frac{1}{2}\sum_{i=1}^n\theta_i^2$

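Note that the bias term $\theta_0$ is not regularized (the sum starts at i = 1, not 0).

If $w$ is the vector of feature weights ($\theta_1$ to $\theta_n$), the regularization term equals $\frac{1}{2}\alpha(\lVert w \rVert_2)^2$, where $\lVert \cdot \rVert_2$ is the $\ell_2$ norm of the weight vector.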

**It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale
of the input features. This is true of most regularized models.**

In [9]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='cholesky', tol=0.001)

array([[ 4.45057126]])
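
A Ridge fit with the cholesky solver can be cross-checked against a closed-form solution. The sketch assumes Scikit-Learn's documented Ridge objective (sum of squared errors plus $\alpha$ times the squared $\ell_2$ norm of the weights, with an unpenalized intercept), which gives $\hat{\theta} = (X^TX + \alpha A)^{-1}X^Ty$ where $A$ is the identity matrix with a 0 in the top-left cell. Data and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
m_r = 100
X_r = 6 * rng.random((m_r, 1)) - 3
y_r = 0.5 * X_r ** 2 + X_r + 2 + rng.standard_normal((m_r, 1))

alpha = 1.0
X_r_b = np.c_[np.ones((m_r, 1)), X_r]  # add x0 = 1 to each instance
A = np.eye(2)
A[0, 0] = 0.0  # leave the bias term unregularized

# closed-form Ridge solution, solved as a linear system instead of inverting
theta_ridge = np.linalg.solve(X_r_b.T @ X_r_b + alpha * A, X_r_b.T @ y_r)

sk_ridge = Ridge(alpha=alpha, solver="cholesky").fit(X_r, y_r)
```

Note that this objective uses the sum rather than the mean of squared errors, so the same numeric `alpha` corresponds to a differently scaled penalty than in the cost function written above.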

### Lasso Regression

Lasso Regression adds the $\ell_1$ norm of the weight vector to the cost function as its regularization term:

$J(\theta) = MSE(\theta) + \alpha\sum_{i=1}^n|\theta_i|$

**An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the
least important features (i.e., set them to zero).**

**In other words, Lasso Regression automatically performs feature selection and outputs a
sparse model (i.e., with few nonzero feature weights).**
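
The sparsity effect can be seen on synthetic data. In this sketch, only the first of five made-up features influences the target, so Lasso is expected to shrink the other four weights to exactly zero. Data, seed, and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X_sparse = rng.standard_normal((200, 5))
# only the first feature influences the target; the other four are pure noise
y_sparse = 3 * X_sparse[:, 0] + 0.1 * rng.standard_normal(200)

lasso_demo = Lasso(alpha=0.1).fit(X_sparse, y_sparse)
lasso_weights = lasso_demo.coef_  # one weight per feature
n_nonzero = np.count_nonzero(lasso_weights)
```

The surviving weight is also shrunk slightly below the true value of 3, which is the price the penalty charges for the sparse solution.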

In [10]:
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

array([ 4.40235736])

### Elastic Net
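Elastic Net is a middle ground between Ridge Regression and Lasso Regression: its regularization term is a mix of the Ridge and Lasso terms, and the mix ratio is set by l1_ratio in Scikit-Learn.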

So when should you use Linear Regression, Ridge, Lasso, or Elastic Net? It is almost always preferable
to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge
is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or
Elastic Net since they tend to reduce the useless features’ weights down to zero as we have discussed. In
general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of
features is greater than the number of training instances or when several features are strongly correlated.

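One penalty function is differentiable (Ridge, $\ell_2$) and the other is not (Lasso, $\ell_1$, at zero).
One can perform variable selection (Lasso) and the other cannot (Ridge).
Overall, in machine learning, both are ways of overcoming overfitting.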

In [11]:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

array([ 4.40593684])

### Early Stopping
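
Early stopping halts an iterative learner such as Gradient Descent once the validation error stops improving, and keeps the parameters from the best step. The sketch below is one possible way to do this in NumPy, assuming a high-degree polynomial model trained by Batch Gradient Descent. The helper `poly_design`, the `patience` rule, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
m_es = 200
X_es = 6 * rng.random((m_es, 1)) - 3
y_es = 0.5 * X_es ** 2 + X_es + 2 + rng.standard_normal((m_es, 1))

def poly_design(X, degree):
    """Columns 1, x, x^2, ..., x^degree, with x rescaled from [-3, 3] to [-1, 1]."""
    return np.hstack([(X / 3.0) ** d for d in range(degree + 1)])

X_all = poly_design(X_es, degree=10)
X_train, y_train = X_all[:150], y_es[:150]
X_valid, y_valid = X_all[150:], y_es[150:]

eta, max_iterations, patience = 0.05, 5000, 50
theta_es = np.zeros((X_all.shape[1], 1))
best_valid_mse, best_theta, best_iter = np.inf, None, 0

for iteration in range(max_iterations):
    gradients = 2 / len(X_train) * X_train.T @ (X_train @ theta_es - y_train)
    theta_es = theta_es - eta * gradients
    valid_mse = float(np.mean((X_valid @ theta_es - y_valid) ** 2))
    if valid_mse < best_valid_mse:
        # remember the best parameters seen so far
        best_valid_mse, best_theta, best_iter = valid_mse, theta_es.copy(), iteration
    elif iteration - best_iter >= patience:
        break  # validation error has not improved for `patience` steps
```

After the loop, `best_theta` holds the parameters to keep; the final `theta_es` may already be past the point of lowest validation error.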

## Logistic Regression