Regularized linear models: for a linear model, regularization is typically achieved by constraining the model's weights.

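Ridge regression (Tikhonov regularization): a regularized version of linear regression in which a regularization term equal to $\frac{\alpha}{2}\sum^{n}_{i=1}\theta^{2}_{i}$ is added to the cost function. The hyperparameter $\alpha$ controls how strongly the model is regularized.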

Ridge regression cost function: $J(\theta)=MSE(\theta)+\frac{\alpha}{2}\sum^{n}_{i=1}\theta^{2}_{i}$

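The bias term $\theta_{0}$ is not regularized (the sum starts at $i=1$, not $0$). If $w$ is defined as the vector of feature weights ($\theta_{1}$ to $\theta_{n}$), the regularization term equals $\frac{\alpha}{2}||w||^{2}_{2}$, where $||w||_{2}$ is the $l_{2}$ norm of the weight vector.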
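
As a short worked step (a sketch that follows directly from the cost function above, not an additional result): differentiating the penalty $\frac{\alpha}{2}\sum^{n}_{i=1}\theta^{2}_{i}$ with respect to each weight shows that, for gradient descent, ridge regression simply adds $\alpha w$ to the MSE gradient and leaves the bias component unchanged.

$$
\frac{\partial}{\partial\theta_{j}}J(\theta)=\frac{\partial}{\partial\theta_{j}}MSE(\theta)+\alpha\theta_{j}\quad(j=1,\dots,n),\qquad
\frac{\partial}{\partial\theta_{0}}J(\theta)=\frac{\partial}{\partial\theta_{0}}MSE(\theta)
$$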

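Closed-form solution for ridge regression: $\hat{\theta}=(X^{T}X+\alpha A)^{-1}X^{T}y$, where $A$ is the $(n+1)\times(n+1)$ identity matrix except for a 0 in the top-left cell, which corresponds to the bias term.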
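
The closed-form expression can be checked numerically. The following is a minimal sketch, assuming the same kind of noisy quadratic data used in these notes; the helper name `ridge_closed_form` and the fixed seed are illustrative choices, not part of the original. It compares the result with scikit-learn's `Ridge`, which minimizes the sum of squared errors plus $\alpha||w||^{2}_{2}$ and does not penalize the intercept.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)  # fixed seed, chosen only for reproducibility
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))


def ridge_closed_form(X, y, alpha):
    """Solve (X_b^T X_b + alpha * A) theta = X_b^T y (hypothetical helper)."""
    X_b = np.c_[np.ones((X.shape[0], 1)), X]  # prepend the bias column x0 = 1
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0  # top-left cell is 0: the bias term is not regularized
    # Solving the linear system is more stable than forming an explicit inverse
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)


theta_hat = ridge_closed_form(X, y, alpha=1.0)

# Cross-check against scikit-learn's Cholesky-based solver
ridge_reg = Ridge(alpha=1.0, solver="cholesky").fit(X, y)
print("closed form :", theta_hat.ravel())
print("scikit-learn:", ridge_reg.intercept_, ridge_reg.coef_.ravel())
```

`theta_hat[0]` plays the role of the intercept and `theta_hat[1:]` the feature weights, so the two printed lines should agree up to floating-point error.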

In [6]:
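import numpy as np
from sklearn.linear_model import Ridge

# Noisy quadratic data: y = 0.5*x^2 + x + 2 + Gaussian noise
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

# Ridge regression solved in closed form via a Cholesky factorization
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])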

array([5.07977381])

In [8]:
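from sklearn.linear_model import SGDRegressor

# penalty="l2" adds the ridge (l2) regularization term to the SGD cost function
sgd_reg = SGDRegressor(penalty="l2")
sgd_reg.fit(X, y.ravel())  # SGDRegressor expects a 1D target array
sgd_reg.predict([[1.5]])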

array([5.08241634])

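Lasso regression: short for Least Absolute Shrinkage and Selection Operator regression. The regularization term it adds is the $l_{1}$ norm of the weight vector.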

Lasso regression cost function: $J(\theta)=MSE(\theta)+\alpha\sum^{n}_{i=1}|\theta_{i}|$

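For comparison, with the $l_{2}$ (ridge) penalty the gradients get smaller as the parameters approach the global optimum, so gradient descent naturally slows down, which helps convergence. As $\alpha$ increases, the optimal parameters move closer and closer to the origin, but they are never eliminated entirely. Lasso, by contrast, tends to set the weights of the least important features to exactly zero.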

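To keep gradient descent from bouncing around the optimum at the end of training when using Lasso, gradually reduce the learning rate during training.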

Lasso regression subgradient vector:
$$
g(\theta, J)=\nabla_{\theta}MSE(\theta) + \alpha\begin{pmatrix}
sign(\theta_{1}) \\
sign(\theta_{2})\\
\vdots\\
sign(\theta_{n})
\end{pmatrix}
$$

where $sign(\theta_{i})$ is $-1$ if $\theta_{i}<0$, $0$ if $\theta_{i}=0$, and $+1$ if $\theta_{i}>0$.
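
The subgradient step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming batch gradient descent on noisy quadratic data like that used in these notes; the function name `lasso_subgradient`, the seed, and the $1/\sqrt{t}$ learning-rate decay are assumptions made for the example, not choices taken from the original.

```python
import numpy as np


def lasso_subgradient(theta, X_b, y, alpha):
    """Subgradient of MSE(theta) + alpha * sum(|theta_i|) for i >= 1."""
    m = X_b.shape[0]
    mse_grad = 2.0 / m * X_b.T @ (X_b @ theta - y)  # gradient of the MSE term
    penalty = alpha * np.sign(theta)                # np.sign(0) is 0
    penalty[0] = 0.0                                # bias term is not regularized
    return mse_grad + penalty


rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # prepend the bias column x0 = 1

theta = np.zeros((2, 1))
eta0 = 0.1
for t in range(1, 2001):
    eta = eta0 / np.sqrt(t)  # gradually reduce the learning rate
    theta = theta - eta * lasso_subgradient(theta, X_b, y, alpha=0.1)

print(theta.ravel())
```

The decaying `eta` is what keeps the iterates from bouncing around the optimum, since the $sign$ term does not shrink as the weights approach their final values.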

In [13]:
from sklearn.linear_model import Lasso

# alpha controls the strength of the l1 penalty
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

array([5.04241451])
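
A practical consequence of the $l_{1}$ penalty is that Lasso can drive some weights to exactly zero, whereas ridge regression only shrinks them. The sketch below illustrates this on a hypothetical synthetic dataset (five features, of which only the first two influence the target); the data, seed, and `alpha` value are assumptions chosen for the illustration and are not the dataset used in these notes.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
m = 200
X_demo = rng.standard_normal((m, 5))
# Only features 0 and 1 matter; features 2 to 4 are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.standard_normal(m)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients:", lasso.coef_, "zeros:", n_zero_lasso)
print("Ridge coefficients:", ridge.coef_, "zeros:", n_zero_ridge)
```

In this setup Lasso effectively performs feature selection, while the ridge weights for the noise features stay small but nonzero.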

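Elastic Net: a middle ground between ridge regression and Lasso regression. Its regularization term is a simple mix of the ridge and Lasso regularization terms, controlled by the mix ratio $r$: with $r=0$ Elastic Net is equivalent to ridge regression, and with $r=1$ it is equivalent to Lasso regression.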

Elastic Net cost function: $J(\theta)=MSE(\theta)+r\alpha\sum^{n}_{i=1}|\theta_{i}|+\frac{1-r}{2}\alpha\sum^{n}_{i=1}\theta^{2}_{i}$

In [14]:
from sklearn.linear_model import ElasticNet

# l1_ratio corresponds to the mix ratio r
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

array([5.04014293])
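
To make the role of the mix ratio $r$ concrete, the cost function can be written out directly. This is a minimal sketch of the formula as stated in these notes; the function name `elastic_net_cost` and the tiny hand-made inputs are hypothetical. Note that scikit-learn's `ElasticNet` scales its data-fit term differently, so this function is not meant to reproduce its internal objective value.

```python
import numpy as np


def elastic_net_cost(theta, X_b, y, alpha, r):
    """J = MSE + r*alpha*sum|theta_i| + (1-r)/2*alpha*sum(theta_i^2), i >= 1."""
    m = X_b.shape[0]
    mse = np.sum((X_b @ theta - y) ** 2) / m
    w = theta[1:]  # exclude the bias term theta_0 from the penalty
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return mse + r * alpha * l1 + (1 - r) / 2 * alpha * l2


# Tiny example in which the model fits the data exactly (MSE = 0),
# so only the penalty terms contribute to the cost
theta = np.array([[2.0], [1.0], [-3.0]])
X_b = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
y = np.array([[2.0], [3.0]])

cost_lasso = elastic_net_cost(theta, X_b, y, alpha=0.1, r=1.0)  # pure l1 penalty
cost_ridge = elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.0)  # pure l2 penalty
cost_mixed = elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.5)  # equal mix
print(cost_lasso, cost_ridge, cost_mixed)
```

Setting `r=1.0` leaves only the Lasso penalty and `r=0.0` only the ridge penalty, which matches the two limiting cases of the formula.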

In [18]:
# Early stopping
from copy import deepcopy

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Prepare the data: high-degree polynomial features, then standardize
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

# max_iter=1 with warm_start=True: each fit() call runs one more epoch,
# continuing from the current weights instead of restarting
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train.ravel())
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # deepcopy keeps the fitted weights; clone() would return an unfitted copy
        best_model = deepcopy(sgd_reg)
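
As an alternative to the manual loop, `SGDRegressor` also has built-in early stopping through its `early_stopping`, `validation_fraction`, and `n_iter_no_change` parameters. The following is a minimal sketch under assumed settings: the polynomial degree of 10, the patience of 10 epochs, and the seeds are illustrative values, not the configuration used above. The built-in criterion also differs from the loop: it stops once the validation score has not improved for `n_iter_no_change` consecutive epochs, rather than tracking the best epoch over a fixed number of epochs.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)  # fixed seed, for reproducibility only
m = 100
X = 6 * rng.random((m, 1)) - 3
y = (0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))).ravel()

model = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),  # assumed degree
    StandardScaler(),
    SGDRegressor(
        penalty=None,
        learning_rate="constant",
        eta0=0.0005,
        early_stopping=True,       # hold out part of the training data
        validation_fraction=0.2,   # size of the held-out validation set
        n_iter_no_change=10,       # patience, in epochs (assumed value)
        max_iter=1000,
        tol=1e-4,
        random_state=42,
    ),
)
model.fit(X, y)

sgd = model.named_steps["sgdregressor"]
print("epochs run:", sgd.n_iter_)
print("prediction at x=1.5:", model.predict([[1.5]]))
```

Wrapping the preprocessing and the regressor in one pipeline means the scaler is fitted only inside `fit`, so `predict` can be called on raw inputs.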