# 1. Ordinary Linear Regression

### 1.1 OLS Regression

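* In a linear regression model, the estimated coefficients of the explanatory variables are available in the **coef_** attribute, and the intercept is available in **intercept_**.

* A linear regression model minimizes the sum of squared residuals between the fitted values and the observed values.

* **LinearRegression** takes the arrays X and y in its **fit** method and, after fitting, stores the computed coefficients of the explanatory variables and the intercept in coef_ and intercept_ respectively.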
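The three points above can be sketched with plain NumPy. The data here are hypothetical and noise-free, and `np.linalg.lstsq` stands in for the fitting step. It is meant only to illustrate what "minimizing the sum of squared residuals" yields, not to show how scikit-learn is implemented internally.

```python
import numpy as np

# Hypothetical toy data: y = 2 + 3*x1 - 1*x2 exactly (no noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated together with the slopes.
A = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq returns the beta that minimizes ||A @ beta - y||^2.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# These play the roles of intercept_ and coef_ in the fitted estimator.
intercept, coef = beta[0], beta[1:]
print(intercept, coef)
```

Because the toy data contain no noise, the recovered values should be very close to 2, 3 and -1.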

In [6]:
from sklearn import linear_model
import pandas as pd
import numpy as np
# Generate simulated data
df = pd.DataFrame({'x1':np.random.randn(10),'x2':np.random.randn(10),'y':np.arange(1,11)})
print(df)

         x1        x2   y
0 -0.846334 -0.599632   1
1  0.098812 -1.172374   2
2  0.282343 -1.394880   3
3 -0.864165  0.520508   4
4 -0.444380 -0.132791   5
5  1.999830  0.674573   6
6  0.615407  0.583752   7
7  1.131488 -0.598677   8
8  0.949680  1.230386   9
9  1.948539 -0.224125  10


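#### Data type requirements for X and y in reg.fit(X, y):
* scikit-learn expects X to be a feature matrix (a 2-D array of shape n_samples × n_features) and y to be a NumPy vector
* pandas is built on top of NumPy
* So X can be a pandas DataFrame and y can be a pandas Series; scikit-learn understands these structures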
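The shape requirement is easiest to see on a small array. The sketch below uses hypothetical NumPy data and only illustrates the difference between a 2-D feature matrix and a 1-D vector, including three common ways to keep a single column two-dimensional.

```python
import numpy as np

# Hypothetical data: 5 samples, 2 features.
X = np.arange(10, dtype=float).reshape(5, 2)   # feature matrix, shape (5, 2)
y = np.arange(5, dtype=float)                  # target vector, shape (5,)

# Selecting a single column with a plain index yields a 1-D array ...
col_1d = X[:, 1]                  # shape (5,)

# ... so it has to be made 2-D again before it can serve as a feature matrix.
col_2d = X[:, np.newaxis, 1]      # shape (5, 1)
col_list = X[:, [1]]              # same values, shape (5, 1)
col_reshaped = col_1d.reshape(-1, 1)

print(X.shape, y.shape, col_1d.shape, col_2d.shape)
```

The pandas analogue is that double brackets such as `df[['x1','x2']]` give a 2-D DataFrame, while single brackets such as `df['y']` give a 1-D Series.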

In [7]:
# Build the linear regression model
reg = linear_model.LinearRegression()
reg.fit(df[['x1','x2']],df['y'])
print(reg.coef_,reg.intercept_)

[ 1.90166226  1.36417865] 4.72552726548


### Example 1:

In [10]:
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
print(data.head())
print(data.shape)    # inspect the dimensions of the DataFrame

      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
(200, 4)


#### Building the linear regression model

In [11]:
x = data[['TV', 'Radio', 'Newspaper']]   # explanatory variables
y = data['Sales']                        # response variable
# Build the training and test sets
from sklearn.model_selection import train_test_split
# By default, 75% of the rows go to the training set and 25% to the test set
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=1)
print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

(150, 3) (150,) (50, 3) (50,)
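The idea behind the split can be sketched in a few lines of NumPy. `simple_train_test_split` is a hypothetical helper written for illustration: it shuffles the row indices with a fixed seed and reserves a quarter of them for testing. It will not reproduce the exact rows that scikit-learn selects for a given `random_state`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.25, seed=1):
    """Hypothetical helper: shuffle row indices, then hold out a test_size share."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    n_test = int(np.ceil(n * test_size))
    # A seeded permutation makes the split reproducible.
    idx = np.random.default_rng(seed).permutation(n)
    train_idx, test_idx = idx[:n - n_test], idx[n - n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical data with the same dimensions as the advertising features.
X_demo = np.arange(600).reshape(200, 3)
y_demo = np.arange(200)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

With 200 rows this yields 150 training rows and 50 test rows, the same proportions as in the output above.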


In [17]:
linreg = linear_model.LinearRegression()
linreg.fit(x_train,y_train)
print('Intercept: %s' % linreg.intercept_,'Coefficients: %s' % linreg.coef_)

Intercept: 2.87696662232 Coefficients: [ 0.04656457  0.17915812  0.00345046]


#### Prediction

In [18]:
y_pred = linreg.predict(x_test)
print(y_pred)

[ 21.70910292  16.41055243   7.60955058  17.80769552  18.6146359
  23.83573998  16.32488681  13.43225536   9.17173403  17.333853
  14.44479482   9.83511973  17.18797614  16.73086831  15.05529391
  15.61434433  12.42541574  17.17716376  11.08827566  18.00537501
   9.28438889  12.98458458   8.79950614  10.42382499  11.3846456
  14.98082512   9.78853268  19.39643187  18.18099936  17.12807566
  21.54670213  14.69809481  16.24641438  12.32114579  19.92422501
  15.32498602  13.88726522  10.03162255  20.93105915   7.44936831
   3.64695761   7.22020178   5.9962782   18.43381853   8.39408045
  14.08371047  15.02195699  20.35836418  20.57036347  19.60636679]
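For a linear model, each prediction is the intercept plus the dot product of the feature row with the coefficients. The sketch below redoes that arithmetic by hand. The intercept and coefficients are copied from the fitted output above, and the feature row is the first row of the advertising table. The variable names are chosen for illustration only.

```python
import numpy as np

# Values copied from the fitted model's printed output.
intercept = 2.87696662232
coef = np.array([0.04656457, 0.17915812, 0.00345046])

# First row of the advertising table: TV, Radio, Newspaper.
row = np.array([230.1, 37.8, 69.2])

# Prediction = intercept + sum(coefficient_j * feature_j)
manual_pred = intercept + row @ coef
print(round(manual_pred, 4))
```

This comes to roughly 20.60, compared with the observed Sales value of 22.1 for that row.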


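#### Evaluating the model fit
* (1) Mean absolute error (MAE): $\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$
* (2) Mean squared error (MSE): $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$
* (3) Root mean squared error (RMSE): $\sqrt{\mathrm{MSE}}$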
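A small worked example shows how the three metrics relate. The observed and predicted values are hypothetical and were picked so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical observed values and predictions, chosen only for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat            # [1, 0, -2, -1]
mae = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                # sqrt(1.5), about 1.2247

print(mae, mse, rmse)
```

RMSE is in the same units as the target, which makes it easier to interpret than MSE. Squaring also penalizes large errors more heavily than MAE does.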

In [19]:
# Compute the RMSE
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

1.40465142303


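#### Feature selection
In the data shown earlier, the linear relationship between Newspaper and Sales looks fairly weak (its fitted coefficient, about 0.003, is far smaller than those of TV and Radio). Let's remove this feature and see how the RMSE of the linear regression predictions changes.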
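One way to put a number on a "weak linear relationship" before dropping a feature is the Pearson correlation between that feature and the target. The sketch below uses synthetic data rather than the advertising set. The names `strong`, `weak` and `target` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

strong = rng.normal(size=n)                   # feature that drives the target
weak = rng.normal(size=n)                     # feature unrelated to the target
target = 3.0 * strong + rng.normal(size=n)    # target depends only on `strong`

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the pairwise coefficient.
r_strong = np.corrcoef(strong, target)[0, 1]
r_weak = np.corrcoef(weak, target)[0, 1]

print(r_strong, r_weak)
```

A coefficient near zero, as for `weak` here, marks a candidate for removal. On a pandas DataFrame such as `data`, calling `data.corr()` should give the corresponding pairwise table.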

In [20]:
X = data[['TV','Radio']]
y = data.Sales
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

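1.38790346994

On this split the RMSE falls slightly, from about 1.405 with all three features to about 1.388 without Newspaper, so dropping the feature does not hurt the predictions.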


### Example 2:

In [21]:
import matplotlib.pyplot as plt
from sklearn import datasets

In [22]:
# Load the dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_x = diabetes.data[:,np.newaxis,2]
# Split the data into training and test sets
diabetes_x_train = diabetes_x[:-20]
diabetes_x_test = diabetes_x[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Build the linear regression model
regr = linear_model.LinearRegression()
regr.fit(diabetes_x_train, diabetes_y_train)
print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_x_test)-diabetes_y_test)**2))

Coefficients: 
 [ 938.23786125]
Mean squared error: 2548.07


In [23]:
# Plot the regression line
plt.scatter(diabetes_x_test,diabetes_y_test,color='black')
plt.plot(diabetes_x_test,regr.predict(diabetes_x_test),color='blue',linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()