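*The exercises use the Boston housing dataset
1. Principles of linear regression
    Mean squared error (MSE) is used to measure how well the model fits
    Maximum likelihood estimation justifies MSE: under Gaussian noise, maximizing the likelihood is equivalent to minimizing the squared error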
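
The maximum-likelihood argument can be written out as follows. This is a sketch that assumes the noise is i.i.d. Gaussian with zero mean and fixed variance; the symbols x_i, y_i, w, sigma and m (number of samples) are notation introduced here, not taken from the notes.

```latex
y_i = w^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

\log L(w)
  = \sum_{i=1}^{m} \log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left( -\frac{(y_i - w^\top x_i)^2}{2\sigma^2} \right) \right]
  = -m \log\!\left( \sqrt{2\pi}\,\sigma \right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - w^\top x_i)^2

\arg\max_w \log L(w) = \arg\min_w \frac{1}{m} \sum_{i=1}^{m} (y_i - w^\top x_i)^2
```

The first term does not depend on w, so maximizing the log-likelihood leaves only the sum of squared errors to minimize, which is the MSE up to a constant factor.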
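2. Loss function, cost function, and objective function of linear regression
    Loss function: measures the prediction error on a single sample; the smaller the loss, the better the fit on that sample
    Cost function: measures the average error over the whole training set
    Objective function: the cost function plus a regularization term; this is the function that is ultimately optimized
    *Related concepts: overfitting, structural risk minimization, and model complexity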
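
The three terms can be told apart with a small numeric sketch. The function names, the toy numbers, and the choice of an L2 penalty with weight `lam` are assumptions made for illustration; the notes do not fix a particular regularizer.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Loss: squared error of each individual prediction."""
    return (np.asarray(y_true) - np.asarray(y_pred)) ** 2

def cost(y_true, y_pred):
    """Cost: mean of the per-sample losses over the data set (MSE)."""
    return squared_loss(y_true, y_pred).mean()

def objective(y_true, y_pred, w, lam=0.1):
    """Objective: cost plus an L2 penalty on the weights (assumed form)."""
    return cost(y_true, y_pred) + lam * np.sum(np.asarray(w) ** 2)

# Toy values, made up for illustration
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])
w = np.array([0.5, -0.5])

print(squared_loss(y_true, y_pred))   # one value per sample
print(cost(y_true, y_pred))           # a single average
print(objective(y_true, y_pred, w))   # average plus penalty
```

A larger `lam` pushes the optimizer toward smaller weights, which is how the objective trades fit against model complexity.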
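3. Optimizing linear regression
    a. Gradient descent
        Pros: scales well to large numbers of samples
        Cons: on a non-convex function it may stop at a local optimum instead of the global one (the MSE cost of linear regression itself is convex)
    b. Least squares (closed-form solution via the normal equation)
    c. Newton's method
    d. Quasi-Newton methods
        DFP, BFGS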
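
To show how Newton's method relates to the least-squares solution, here is a minimal sketch on synthetic data (the data, seed, and variable names are made up). Because the MSE cost is quadratic, its Hessian is constant and a single Newton step from any starting point lands on the closed-form solution.

```python
import numpy as np

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

def gradient(w):
    """Gradient of the cost (1 / 2m) * ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

hessian = X.T @ X / len(y)   # constant, since the cost is quadratic

w0 = np.zeros(3)
w_newton = w0 - np.linalg.solve(hessian, gradient(w0))   # one Newton step

# Closed-form least-squares solution for comparison
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_newton, w_closed))
```

Quasi-Newton methods such as DFP and BFGS follow the same update but build an approximation of the Hessian (or its inverse) from successive gradients instead of computing it directly.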
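4. Evaluation metrics for linear regression
    Mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE)
    The most commonly used is R² (coefficient of determination): the closer it is to 1, the more of the target's variance the model explains and the better the fit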
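
The four metrics can be computed directly from their definitions. The helper name `regression_metrics` and the sample values below are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred

    mse = np.mean(residual ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residual))

    ss_res = np.sum(residual ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```

MSE, RMSE and MAE depend on the scale of the target, while R² is scale-free, which is one reason it is convenient for comparing models.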
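5. sklearn.linear_model.LinearRegression parameters
    a. fit_intercept: default True; whether to fit an intercept. An intercept is usually needed; with already-centered data, decide case by case
    b. normalize: default False; standardization is usually done before training. Deprecated in scikit-learn 1.0 and removed in 1.2
    c. copy_X: default True; if False, X may be overwritten
    d. n_jobs: int, default None (1 in older versions); number of parallel jobs
    
    Attributes:
    a. coef_: regression coefficients of the fitted model; a 2-D array of shape (n_targets, n_features) when y is passed as a 2-D array (for example with two target columns)
    b. intercept_: intercept
    
    Methods:
    a. fit(X, y, sample_weight=None):
        X: array or sparse matrix, shape [n_samples, n_features]
        y: array, shape [n_samples, n_targets]
        sample_weight: per-sample weights
    b. get_params(deep=True): returns the estimator's parameter settings
    c. predict(X): returns the predicted target values for X
    d. score(X, y): evaluates the model; returns R²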
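
The parameters, attributes, and methods listed above can be tried on a tiny data set. The data below is made up, generated from y = 2x + 1, so the fitted slope and intercept are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2*x + 1 (for illustration only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

print(model.coef_)                        # one coefficient per feature
print(model.intercept_)                   # fitted intercept
print(model.predict(np.array([[4.0]])))   # prediction for a new sample
print(model.score(X, y))                  # R^2 on the given data
print(model.get_params())                 # constructor settings as a dict
```

Here y is 1-D, so `coef_` is a 1-D array; passing y as a column vector would make it 2-D.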

In [5]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
pip install joblib

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Boston housing dataset
lb = load_boston()

In [7]:
# Split the dataset into training and test sets
x_train,x_test,y_train,y_test = train_test_split(lb.data , lb.target, test_size=0.25)

In [9]:
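# Standardize (zero mean, unit variance); applied to both features and target
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)  # reuse the training statistics; do not refit on the test set

# The target must be reshaped from 1-D to 2-D with reshape(-1, 1)
std_y = StandardScaler()
y_train = std_y.fit_transform(y_train.reshape(-1, 1))
y_test = std_y.transform(y_test.reshape(-1, 1))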
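
A scaler is normally fitted on the training split only, and the same mean and standard deviation are then applied to the test split. The sketch below reproduces that arithmetic in plain NumPy on made-up data; it is an illustration of the idea, not scikit-learn's actual implementation.

```python
import numpy as np

# Made-up data standing in for a train/test split
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std    # same statistics, no refitting

# Undoing the scaling recovers the original values
restored = test_scaled * std + mean
print(np.allclose(restored, test))
```

Refitting on the test split would give the test data its own mean and variance, so predictions mapped back through the scaler would no longer be on the scale the model was trained on.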

Method 1: Linear regression

In [10]:
# Instantiate the model
lr = LinearRegression(fit_intercept=True)
# Train the model
lr.fit(x_train, y_train)
# Print the fitted weights
print(lr.coef_)
# Compute R² on the training set
print('R2 : %s' % (lr.score(x_train, y_train)))

[[-0.04833565  0.13701495  0.04766017  0.09052687 -0.268753    0.2534092
   0.0257136  -0.33239964  0.26785557 -0.19461213 -0.23715599  0.06928937
  -0.45197094]]
R2 : 0.7287476764521696


In [11]:
# Save the trained model
joblib.dump(lr , "test.pkl")

['test.pkl']

In [12]:
# Load the model that was just saved
lr_model = joblib.load('test.pkl')
y_predict = lr_model.predict(x_test)

In [14]:
# Predict house prices for the test set; .inverse_transform() undoes the earlier standardization
y_predict = std_y.inverse_transform(lr.predict(x_test))
print("Predicted house prices:", y_predict)
print("Mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_predict))

Predicted house prices: [[22.30022494]
 [25.05344943]
 [23.61414285]
 [15.49032167]
 [23.67461967]
 [16.92755336]
 [17.95368154]
 [28.00861325]
 [19.28250746]
 [15.14099842]
 [16.50381259]
 [ 7.40852631]
 [29.44768223]
 [18.93280679]
 [16.83432286]
 [14.33736871]
 [30.19988952]
 [18.63455213]
 [16.88940662]
 [27.13043817]
 [15.51743409]
 [20.61306011]
 [10.90056847]
 [12.16493397]
 [26.33540919]
 [32.68486091]
 [29.74420914]
 [23.50207839]
 [32.27959172]
 [34.67105042]
 [12.75581666]
 [15.88012404]
 [33.92815807]
 [18.14438007]
 [10.49415343]
 [23.57355177]
 [24.40285215]
 [41.12903422]
 [34.3947034 ]
 [10.3789342 ]
 [16.39868287]
 [17.62684585]
 [18.50428455]
 [20.30512216]
 [19.69691826]
 [22.87386859]
 [ 8.08706497]
 [14.94378151]
 [38.93783953]
 [28.03526251]
 [18.77776281]
 [16.30100637]
 [22.60712249]
 [24.18430582]
 [21.3718732 ]
 [26.3650467 ]
 [31.7998305 ]
 [19.11968763]
 [11.83699044]
 [31.70702257]
 [22.23890304]
 [33.31293999]
 [19.27128431]
 [24.56326548]
 [31.2466413 ]
 [16.42206909]
 [

Method 2: Least squares

In [17]:
class LR_LS():
    """Linear regression solved in closed form with the normal equation."""
    def __init__(self):
        self.w = None

    def fit(self, X, y):
        # w = (X^T X)^(-1) X^T y; no intercept term, which is fine here because the data is standardized
        # np.linalg.inv(): matrix inverse
        # np.linalg.det(): matrix determinant
        # np.linalg.norm(): vector or matrix norm
        self.w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

    def predict(self, X):
        y_pred = X.dot(self.w)
        return y_pred

if __name__ == "__main__":
    lr_ls = LR_LS()
    lr_ls.fit(x_train, y_train)
    print("Estimated parameters: %s" % (lr_ls.w))

    y_predict = std_y.inverse_transform(lr_ls.predict(x_test))
    print("Predicted house prices:", y_predict)

Estimated parameters: [[-0.04833565]
 [ 0.13701495]
 [ 0.04766017]
 [ 0.09052687]
 [-0.268753  ]
 [ 0.2534092 ]
 [ 0.0257136 ]
 [-0.33239964]
 [ 0.26785557]
 [-0.19461213]
 [-0.23715599]
 [ 0.06928937]
 [-0.45197094]]
Predicted house prices: [[22.30022494]
 [25.05344943]
 [23.61414285]
 [15.49032167]
 [23.67461967]
 [16.92755336]
 [17.95368154]
 [28.00861325]
 [19.28250746]
 [15.14099842]
 [16.50381259]
 [ 7.40852631]
 [29.44768223]
 [18.93280679]
 [16.83432286]
 [14.33736871]
 [30.19988952]
 [18.63455213]
 [16.88940662]
 [27.13043817]
 [15.51743409]
 [20.61306011]
 [10.90056847]
 [12.16493397]
 [26.33540919]
 [32.68486091]
 [29.74420914]
 [23.50207839]
 [32.27959172]
 [34.67105042]
 [12.75581666]
 [15.88012404]
 [33.92815807]
 [18.14438007]
 [10.49415343]
 [23.57355177]
 [24.40285215]
 [41.12903422]
 [34.3947034 ]
 [10.3789342 ]
 [16.39868287]
 [17.62684585]
 [18.50428455]
 [20.30512216]
 [19.69691826]
 [22.87386859]
 [ 8.08706497]
 [14.94378151]
 [38.93783953]
 [28.03526251]
 [18.77776281]
 [16.30100637]
 [22.60712
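
Inverting X^T X explicitly can be numerically fragile when features are nearly collinear. As a hedged alternative, the sketch below uses `np.linalg.lstsq` and adds an explicit column of ones so that an intercept is fitted even when the data is not centered. The helper name `fit_least_squares` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_least_squares(X, y):
    """Least-squares fit with an explicit intercept column (hypothetical helper)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of ones
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)      # solves min ||X_aug w - y||
    return w   # w[0] is the intercept, w[1:] are the feature weights

# Synthetic, noise-free data: y = 3 + 1.5*x1 - 0.5*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 3.0 + X @ np.array([1.5, -0.5])

w = fit_least_squares(X, y)
print(w)
```

On well-conditioned data this gives the same weights as the normal equation; the difference only shows up when X^T X is close to singular.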

Method 3: Gradient descent

In [18]:
class LR_GD():
    """Linear regression fitted by batch gradient descent."""
    def __init__(self):
        self.w = None

    def fit(self, X, y, alpha=0.02, loss=1e-10):
        # alpha: learning rate; loss: stop once the total parameter change falls below this tolerance
        y = y.reshape(-1, 1)
        [m, d] = np.shape(X)    # m samples, d features
        self.w = np.zeros((d))  # initialize all parameters to 0
        tol = 1e5

        while tol > loss:
            h_f = X.dot(self.w).reshape(-1, 1)
            theta = self.w + alpha * np.mean(X * (y - h_f), axis=0)  # gradient step on the MSE cost
            tol = np.sum(np.abs(theta - self.w))
            self.w = theta

    def predict(self, X):
        y_pred = X.dot(self.w)
        return y_pred

if __name__ == "__main__":
    lr_gd = LR_GD()
    lr_gd.fit(x_train, y_train)
    print("Estimated parameters: %s" % (lr_gd.w))
    # predict() returns a 1-D array; newer scikit-learn versions expect 2-D input for inverse_transform
    y_predict = std_y.inverse_transform(lr_gd.predict(x_test).reshape(-1, 1)).ravel()
    print("Predicted house prices:", y_predict)

Estimated parameters: [-0.04833565  0.13701495  0.04766016  0.09052687 -0.268753    0.2534092
  0.0257136  -0.33239964  0.26785554 -0.1946121  -0.23715599  0.06928937
 -0.45197094]
Predicted house prices: [22.30022486 25.05344944 23.61414292 15.49032171 23.67461968 16.92755331
 17.9536815  28.00861323 19.28250749 15.1409984  16.50381256  7.40852663
 29.44768226 18.93280672 16.83432292 14.33736874 30.19988955 18.63455216
 16.88940658 27.13043823 15.51743405 20.61306022 10.90056842 12.16493397
 26.33540914 32.68486089 29.74420923 23.5020785  32.27959169 34.67105039
 12.75581664 15.88012403 33.92815812 18.14438003 10.49415343 23.57355178
 24.40285212 41.12903412 34.39470346 10.37893414 16.39868285 17.626846
 18.50428449 20.30512209 19.69691824 22.87386869  8.08706497 14.94378153
 38.93783954 28.03526248 18.77776288 16.30100632 22.60712246 24.18430582
 21.3718733  26.36504664 31.79983047 19.11968773 11.83699051 31.70702259
 22.23890303 33.31293988 19.27128442 24.56326548 31.24664119 16.42206913
 26.58463427 16.02592432
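
A `while` loop that stops only on a tolerance can run forever if the learning rate is too large. The sketch below is one possible variant, not the notebook's class: it adds an iteration cap (`max_iter`, an assumed parameter), returns the number of steps taken, and checks the result against the closed-form solution on synthetic data.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, tol=1e-10, max_iter=100_000):
    """Batch gradient descent on the MSE cost with an iteration cap."""
    m, d = X.shape
    w = np.zeros(d)
    for step in range(max_iter):
        grad = X.T @ (X @ w - y) / m          # gradient of (1 / 2m) * ||Xw - y||^2
        w_new = w - alpha * grad
        if np.sum(np.abs(w_new - w)) < tol:   # parameters have stopped moving
            return w_new, step + 1
        w = w_new
    return w, max_iter                        # cap reached without converging

# Synthetic, noise-free data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0])

w_gd, n_steps = gradient_descent(X, y)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(n_steps, np.allclose(w_gd, w_exact, atol=1e-6))
```

Agreement with the closed-form weights mirrors what the three methods above show on the housing data: they all minimize the same convex cost, so they arrive at the same parameters.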