Predicting House Prices in the Boston Area with Linear Regression

In [1]:
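# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston
boston=load_boston()
print(boston.DESCR)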

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [2]:
from sklearn.model_selection import train_test_split
import numpy as np
X=boston.data
y=boston.target
# Hold out 25% of the samples for testing; random_state makes the split reproducible
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=33)
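
To make the split concrete, the following is a minimal NumPy sketch of a shuffled hold-out split. It is an illustration, not scikit-learn's implementation: the helper name `simple_train_test_split` is hypothetical, and its shuffling will not select the same rows as `train_test_split(..., random_state=33)`. It assumes the test-set size is rounded up, which is consistent with 506 samples being divided into 379 training and 127 test samples.

```python
import math
import numpy as np

def simple_train_test_split(X, y, test_size=0.25, seed=0):
    """Hypothetical helper: shuffle row indices, then cut off a test set."""
    n_samples = X.shape[0]
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test size up
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)         # random order of row indices
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in with the same shape as the Boston data (506 rows, 13 features)
X_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)
Xtr, Xte, ytr, yte = simple_train_test_split(X_demo, y_demo, test_size=0.25, seed=33)
print(ytr.shape, yte.shape)
```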

In [5]:
print(y_train.shape)
print(y_test.shape)
# Standardize the features and the target
from sklearn.preprocessing import StandardScaler
ss_X=StandardScaler()
ss_y=StandardScaler()

X_train=ss_X.fit_transform(X_train)
X_test=ss_X.transform(X_test)
y_train=ss_y.fit_transform(y_train.reshape(-1,1))   # reshape is required here: [1,2,3,4] must become [[1],[2],[3],[4]]
y_test=ss_y.transform(y_test.reshape(-1,1))

(379,)
(127,)
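
The scaling step above can be reproduced by hand. Below is a minimal NumPy sketch, on made-up numbers, of what fitting on the training data and reusing those statistics on the test data amounts to: subtract the training mean and divide by the training standard deviation (the population standard deviation, `ddof=0`, which is what `StandardScaler` uses). The arrays and variable names are illustrative only.

```python
import numpy as np

# Made-up target values; reshape(-1, 1) turns [1, 2, 3, 4] into [[1], [2], [3], [4]]
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_test_demo = np.array([2.5, 5.0]).reshape(-1, 1)

# "fit": learn the statistics from the training data only
mean = y_train_demo.mean(axis=0)
std = y_train_demo.std(axis=0)          # ddof=0, the population standard deviation

# "transform": apply the training statistics to both sets
y_train_scaled = (y_train_demo - mean) / std
y_test_scaled = (y_test_demo - mean) / std

# "inverse_transform": map scaled values back to the original units
y_test_restored = y_test_scaled * std + mean

print(y_train_scaled.ravel())
print(y_test_restored.ravel())
```

Reusing the training statistics on the test set, rather than refitting, keeps information about the test data out of the preprocessing step.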


In [8]:
# Predict with LinearRegression and SGDRegressor
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_train,y_train)
lr_y_predict=lr.predict(X_test)


In [9]:
from sklearn.linear_model import SGDRegressor
sgd=SGDRegressor()
# ravel() passes y as a 1-D array, which SGDRegressor expects
sgd.fit(X_train,y_train.ravel())
sgd_y_predict=sgd.predict(X_test)


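Unlike classification, regression cannot be expected to produce predictions that match the true values exactly. We therefore evaluate the models with mean absolute error (MAE) and mean squared error (MSE), together with the R² score.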
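
As a reference for these metrics, here is a minimal NumPy sketch of MAE, MSE, and the R² score on made-up values. The function names are illustrative; the formulas are the standard definitions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error: average of (y - y_hat)^2."""
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """R^2: 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values, not taken from the Boston data
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

print(mae(y_true_demo, y_pred_demo))
print(mse(y_true_demo, y_pred_demo))
print(r2(y_true_demo, y_pred_demo))

# R^2 is unchanged when both arrays are rescaled the same way; MSE and MAE are not
scale, shift = 10.0, 4.0
print(r2(scale * y_true_demo + shift, scale * y_pred_demo + shift))
```

This is why `inverse_transform` is applied before computing MSE and MAE, so that the errors are expressed in the original target units, while R² can be computed directly on the standardized values.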

In [10]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
print("the r2_score of lr is :",r2_score(y_test,lr_y_predict))

# inverse_transform maps the standardized values back to the original scale
print("the MSE of lr is :",mean_squared_error(ss_y.inverse_transform(y_test),ss_y.inverse_transform(lr_y_predict)))
print("the MAE of lr is :",mean_absolute_error(ss_y.inverse_transform(y_test),ss_y.inverse_transform(lr_y_predict)))

the r2_score of lr is : 0.6757955014529483
the MSE of lr is : 25.13923652035344
the MAE of lr is : 3.5325325437053965


In [12]:
# SGDRegressor returns a 1-D prediction array; reshape it to a column before inverse_transform
sgd_y_2d=sgd_y_predict.reshape(-1,1)
print("the r2_score of sgd is :",r2_score(y_test,sgd_y_predict))
print("the MSE of sgd is :",mean_squared_error(ss_y.inverse_transform(y_test),ss_y.inverse_transform(sgd_y_2d)))
print("the MAE of sgd is :",mean_absolute_error(ss_y.inverse_transform(y_test),ss_y.inverse_transform(sgd_y_2d)))

the r2_score of sgd is : 0.6558880014802
the MSE of sgd is : 26.68288983974506
the MAE of sgd is : 3.513016326793664


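SGDRegressor, which estimates the parameters by gradient descent, scores slightly lower on R² and MSE here than LinearRegression, which uses the analytic solution. On very large datasets, however, stochastic gradient methods are highly efficient for both classification and regression.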
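
To illustrate the trade-off, the sketch below fits the same linear model on synthetic data in two ways: once with the closed-form least-squares solution and once with plain mini-batch stochastic gradient descent. It is a simplified illustration, not how `LinearRegression` or `SGDRegressor` are implemented; the learning rate, batch size, epoch count, and data are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 3
X_demo = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=n_samples)

# Analytic route: solve the least-squares problem in one shot
w_exact, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Iterative route: mini-batch SGD on the squared error
w_sgd = np.zeros(n_features)
learning_rate, batch_size, n_epochs = 0.05, 32, 20   # assumed hyperparameters
for _ in range(n_epochs):
    order = rng.permutation(n_samples)               # reshuffle every epoch
    for start in range(0, n_samples, batch_size):
        batch = order[start:start + batch_size]
        residual = X_demo[batch] @ w_sgd - y_demo[batch]
        grad = 2.0 * X_demo[batch].T @ residual / len(batch)
        w_sgd -= learning_rate * grad

print("closed form:", w_exact)
print("SGD        :", w_sgd)
```

The closed-form route needs the whole dataset at once, whereas each SGD update touches only one mini-batch, which is what makes the iterative approach attractive when the data is very large.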

In [13]:
# Regression with support vector machines
from sklearn.svm import SVR
# 1. Linear kernel (ravel() passes y as the 1-D array that SVR expects)
linear_svr=SVR(kernel='linear')
linear_svr.fit(X_train,y_train.ravel())
linear_svr_y_predict=linear_svr.predict(X_test)

# 2. Polynomial kernel
poly_svr=SVR(kernel='poly')
poly_svr.fit(X_train,y_train.ravel())
poly_svr_y_predict=poly_svr.predict(X_test)

# 3. Radial basis function (RBF) kernel
rbf_svr=SVR(kernel='rbf')
rbf_svr.fit(X_train,y_train.ravel())
rbf_svr_y_predict=rbf_svr.predict(X_test)


Choosing a different kernel function changes the model's performance.
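
For reference, the three kernels used above can be written out directly. This is a minimal NumPy sketch on two made-up vectors, not scikit-learn's code; `gamma`, `coef0`, and `degree` are set by hand here, whereas `SVR` chooses its own defaults.

```python
import numpy as np

def linear_kernel(a, b):
    """k(a, b) = <a, b>"""
    return a @ b

def poly_kernel(a, b, gamma=1.0, coef0=0.0, degree=3):
    """k(a, b) = (gamma * <a, b> + coef0) ** degree"""
    return (gamma * (a @ b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2)"""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Two made-up feature vectors
a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

print(linear_kernel(a, b))
print(poly_kernel(a, b))
print(rbf_kernel(a, b, gamma=0.5))
```

The linear kernel yields a model that is linear in the features, while the polynomial and RBF kernels let SVR fit non-linear relationships.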

In [14]:
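# Predict with a regression tree
from sklearn.tree import DecisionTreeRegressor
dtr=DecisionTreeRegressor()
dtr.fit(X_train,y_train)
dtr_y_predict=dtr.predict(X_test)

# Report MSE and MAE on the original price scale, then the R² returned by score();
# the 1-D predictions are reshaped to a column before inverse_transform
print("the performance is :",mean_squared_error(ss_y.inverse_transform(y_test),ss_y.inverse_transform(dtr_y_predict.reshape(-1,1))),
      mean_absolute_error(ss_y.inverse_transform(y_test),ss_y.inverse_transform(dtr_y_predict.reshape(-1,1))),dtr.score(X_test,y_test))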

the performance is : 25.389842519685043 3.192913385826772 0.6725635977202951


Predicting with ensemble models

In [15]:
from sklearn.ensemble import RandomForestRegressor,ExtraTreesRegressor,GradientBoostingRegressor

# ravel() passes y as the 1-D array these estimators expect
rfr=RandomForestRegressor()
rfr.fit(X_train,y_train.ravel())
rfr_y_predict=rfr.predict(X_test)

etr=ExtraTreesRegressor()
etr.fit(X_train,y_train.ravel())
etr_y_predict=etr.predict(X_test)

gbr=GradientBoostingRegressor()
gbr.fit(X_train,y_train.ravel())
gbr_y_predict=gbr.predict(X_test)

# score() returns the R² on the test set
print("the performance of rfr is:",rfr.score(X_test,y_test))
print("the performance of etr is:",etr.score(X_test,y_test))
print("the performance of gbr is:",gbr.score(X_test,y_test))

the performance of rfr is: 0.783110191410873
the performance of etr is: 0.8083476457625046
the performance of gbr is: 0.8391170295536652
