In [6]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

## Linear Regression

The most common performance measure of a regression model is the Root Mean Square Error (RMSE).
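
As a quick illustration, RMSE is the square root of the mean squared difference between predictions and targets: $\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2}$. The sketch below computes it with NumPy; the helper name `rmse` and the sample arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between two arrays of equal shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
error = rmse(y_true, y_pred)  # sqrt((1 + 0 + 4) / 3)
```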

In [1]:
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

The Normal Equation: 
$\hat{\theta} = (X^TX)^{-1}X^Ty$

In [2]:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
theta_best

array([[ 3.91992426],
       [ 3.09086772]])

Equivalent code using sklearn:

In [8]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

X_new = np.array([[0], [2]])
lin_reg.predict(X_new)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

(array([ 3.91992426]), array([[ 3.09086772]]))

array([[  3.91992426],
       [ 10.1016597 ]])

## Gradient Descent
- The MSE cost function for a Linear Regression model is a convex function, so it has no local minima, just one global minimum.
- The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales.

When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler
class), or else it will take much longer to converge.
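
A minimal sketch of feature scaling with `StandardScaler`, assuming two hypothetical features whose ranges differ by a factor of 1000 (the data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
# Hypothetical features on very different scales: [0, 1) and [0, 1000)
X_raw = np.c_[rng.rand(100), 1000 * rng.rand(100)]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)  # learn mean/std per column, then standardize

col_means = X_scaled.mean(axis=0)  # approximately 0 for each feature
col_stds = X_scaled.std(axis=0)    # approximately 1 for each feature
```

In practice the scaler would be fitted on the training set only and then reused (via `scaler.transform`) on new data.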

### Batch Gradient Descent
Each Gradient Descent step computes the gradient vector of the cost function, $\nabla_\theta \text{MSE}(\theta) = \frac{2}{m} X^T (X\theta - y)$. Notice that this formula involves calculations over the __full__ training set X, at each Gradient Descent step! This is why the algorithm
is called Batch Gradient Descent: it uses the __whole__ batch of training data at every step. As a result, it is terribly slow on very
large training sets. However, Gradient Descent scales well
with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster
using Gradient Descent than using the Normal Equation.
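
Batch Gradient Descent can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from the notebook: the learning rate `eta`, the iteration count, and the random seed are assumed values, and the data mimics the linear dataset generated earlier.

```python
import numpy as np

rng = np.random.RandomState(42)
m = 100
X = 2 * rng.rand(m, 1)
y = 4 + 3 * X + rng.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

eta = 0.1            # learning rate (assumed value)
n_iterations = 1000  # number of steps (assumed value)
theta = rng.randn(2, 1)  # random initialization

for _ in range(n_iterations):
    # Gradient of the MSE over the full training set: (2/m) * X^T (X theta - y)
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

# Normal Equation solution on the same data, for comparison
theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
```

With these settings `theta` should end up very close to `theta_normal`; a learning rate that is too large would make the iterations diverge instead.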

### Stochastic Gradient Descent
Stochastic Gradient Descent picks a random instance in the training set at every step and
computes the gradients based only on that single instance.

- Randomness helps the algorithm escape from local optima, but it also means that the algorithm can never settle at the minimum.
- One solution to this dilemma is to gradually reduce the learning rate. (The function that determines the learning rate at each iteration is called the learning schedule.)
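
The idea of a decaying learning rate can be sketched with a hand-written SGD loop. The schedule `t0 / (t + t1)` and its hyperparameters `t0`, `t1` are assumptions chosen for illustration, as are the seed and the number of epochs.

```python
import numpy as np

rng = np.random.RandomState(42)
m = 100
X = 2 * rng.rand(m, 1)
y = 4 + 3 * X + rng.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

n_epochs = 50
t0, t1 = 5, 50  # hypothetical learning schedule hyperparameters

def learning_schedule(t):
    """Learning rate that shrinks as the step counter t grows."""
    return t0 / (t + t1)

theta = rng.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.randint(m)       # pick one random instance
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        # Gradient of the squared error on that single instance
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

# Least-squares solution on the same data, for comparison
theta_normal = np.linalg.lstsq(X_b, y, rcond=None)[0]
```

Because the steps shrink over time, `theta` settles near the least-squares solution rather than bouncing around it indefinitely.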

Perform Linear Regression using SGD with Scikit-Learn's SGDRegressor class, which defaults to optimizing the squared error cost function:

In [10]:
from sklearn.linear_model import SGDRegressor
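# runs 50 epochs, initial learning rate eta0 of 0.1, default learning schedule
# (n_iter was deprecated in scikit-learn 0.19 and removed in 0.21; use max_iter=50 there)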
sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1) 
sgd_reg.fit(X, y.ravel())
sgd_reg.intercept_, sgd_reg.coef_



SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.1,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', max_iter=None, n_iter=50, penalty=None,
       power_t=0.25, random_state=None, shuffle=True, tol=None, verbose=0,
       warm_start=False)

(array([ 3.90502601]), array([ 3.07268167]))

### Mini-batch Gradient Descent
Mini-batch Gradient Descent computes the gradients on small random sets of instances called mini-batches.

The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance
boost from hardware optimization of matrix operations, especially when using GPUs.
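
A rough sketch of Mini-batch Gradient Descent in NumPy follows. The batch size, the constant learning rate, the epoch count, and the seed are all assumed values picked for illustration; a decaying learning schedule could be substituted.

```python
import numpy as np

rng = np.random.RandomState(42)
m = 100
X = 2 * rng.rand(m, 1)
y = 4 + 3 * X + rng.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

n_epochs = 200
batch_size = 20  # hypothetical mini-batch size
eta = 0.05       # constant learning rate, kept fixed for simplicity

theta = rng.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    shuffled = rng.permutation(m)  # new random order every epoch
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # Gradient of the MSE computed on the mini-batch only
        gradients = 2 / len(batch) * X_batch.T.dot(X_batch.dot(theta) - y_batch)
        theta = theta - eta * gradients

# Least-squares solution on the same data, for comparison
theta_normal = np.linalg.lstsq(X_b, y, rcond=None)[0]
```

Each update is a single matrix product over the mini-batch, which is the kind of operation that optimized linear algebra libraries and GPUs handle efficiently.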


| Algorithm | Large m (instances) | Out-of-core support | Large n (features) | Hyperparams | Scaling required | Scikit-Learn |
|------|------|------|------|------|------|------|
| Normal Equation | Fast | No | Slow | 0 | No | LinearRegression |
| Batch GD | Slow | No | Fast | 2 | Yes | n/a |
| Stochastic GD | Fast | Yes | Fast | $\ge$2 | Yes | SGDRegressor |
| Mini-batch GD | Fast | Yes | Fast | $\ge$2 | Yes | n/a |

## Polynomial Regression

In [11]:
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1) # a quadratic equation

Transform the training data by adding the square (2nd-degree polynomial) of each feature in the
training set as a new feature (in this case there is just one feature):

In [12]:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
X_poly[0]

array([-1.09840714])

array([-1.09840714,  1.20649825])

X_poly now contains the original feature of X plus the square of this feature.

In [14]:
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

(array([ 2.11122778]), array([[ 0.92575444,  0.42506331]]))

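The `intercept_` and `coef_` attributes show the coefficients of the fitted Linear Regression model. Here the model estimates $\hat{y} = 0.43x^2 + 0.93x + 2.11$, while the original function was $y = 0.5x^2 + x + 2$ plus Gaussian noise.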
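
The two steps (feature expansion, then linear fit) can also be chained so that predictions on new inputs go through the same transformation. The sketch below uses `make_pipeline` from `sklearn.pipeline`; the seed and the `X_new` values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
m = 100
X = 6 * rng.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + rng.randn(m, 1)  # a quadratic equation plus noise

# Chain the polynomial expansion and the linear model into one estimator
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_reg.fit(X, y)

# Hypothetical new inputs; the pipeline squares them before predicting
X_new = np.array([[-2.0], [0.0], [2.0]])
y_new = poly_reg.predict(X_new)

# The fitted coefficients live on the final step of the pipeline
fitted = poly_reg.named_steps["linearregression"]
```

`fitted.intercept_` and `fitted.coef_` can then be inspected in the same way as for a standalone `LinearRegression`.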