### 继续使用集成模型来进行数据的回归分析

### 要使用到普通随机森林、提升树回归模型、和极端随机森林。并对比这三种树模型在波士顿房价预测上的性能差异

##### 极端随机森林extremely randomized tress，与普通随机森林不同的地方是，它每当构建一个分裂节点的时候，不会任意的选择特征；
##### 而是随机搜集一部分特征，然后使用Entropy信息熵或基尼不纯度Jini Impurity来挑选最佳的节点

### 1.加载数据


In [1]:
from sklearn.datasets import load_boston
df = load_boston()

### 2.数据分割

In [2]:
from sklearn.cross_validation import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df.data,df.target,test_size=0.25,random_state=123)



### 3.使用三种树形回归预测房价

In [4]:
from sklearn.ensemble import RandomForestRegressor,ExtraTreesRegressor,GradientBoostingRegressor

#普通随机深林
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(x_train,y_train)
rfr_predict = rfr.predict(x_test)

#极端随机森林
etr = ExtraTreesRegressor(n_estimators=100)
etr.fit(x_train,y_train)
etr_predict = etr.predict(x_test)

#梯度提升树
gbr = GradientBoostingRegressor(n_estimators=100)
gbr.fit(x_train,y_train)
gbr_predict = gbr.predict(x_test)

### 4.性能评估

In [5]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
print('普通随机森林：R-squared : %s , MSE : %s , MAE : %s'%(r2_score(rfr_predict,y_test),
                                                     mean_squared_error(rfr_predict,y_test),mean_absolute_error(rfr_predict,y_test)))

print('极端随机森林：R-squared : %s , MSE : %s , MAE : %s'%(r2_score(etr_predict,y_test),
                                                     mean_squared_error(etr_predict,y_test),mean_absolute_error(etr_predict,y_test)))

print('梯度提升树：R-squared : %s , MSE : %s , MAE : %s'%(r2_score(gbr_predict,y_test),
                                                     mean_squared_error(gbr_predict,y_test),mean_absolute_error(gbr_predict,y_test)))

普通随机森林：R-squared : 0.798777604544 , MSE : 15.2364673543 , MAE : 2.22626771654
极端随机森林：R-squared : 0.830096663585 , MSE : 12.205984685 , MAE : 2.1658503937
梯度提升树：R-squared : 0.831776348656 , MSE : 13.6826340285 , MAE : 2.26740684706


In [6]:
import numpy as np
#利用训练好的极端随机森林，输入每种特征对输出目标的贡献度
print(np.sort(tuple(zip(etr.feature_importances_,df.feature_names)),axis=0))

[['0.00284444758598' 'AGE']
 ['0.00484871887672' 'B']
 ['0.0144323983971' 'CHAS']
 ['0.0145231261998' 'CRIM']
 ['0.0173157045183' 'DIS']
 ['0.0288086860723' 'INDUS']
 ['0.0325966755958' 'LSTAT']
 ['0.0377769643196' 'NOX']
 ['0.0398316431178' 'PTRATIO']
 ['0.0404295609625' 'RAD']
 ['0.0478236994209' 'RM']
 ['0.356460621393' 'TAX']
 ['0.36230775354' 'ZN']]


In [19]:
df.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'],
      dtype='<U7')

In [20]:
etr.feature_importances_

array([ 0.02445915,  0.00986784,  0.0158564 ,  0.00163278,  0.03230127,
        0.34264414,  0.01901012,  0.04515863,  0.02683744,  0.06400414,
        0.0367347 ,  0.01189511,  0.36959827])

In [23]:
list(zip(etr.feature_importances_,df.feature_names))

[(0.024459154970015155, 'CRIM'),
 (0.0098678442495268182, 'ZN'),
 (0.015856402704781794, 'INDUS'),
 (0.0016327793087089963, 'CHAS'),
 (0.032301271608473452, 'NOX'),
 (0.34264414143704819, 'RM'),
 (0.019010115373657859, 'AGE'),
 (0.045158626829870076, 'DIS'),
 (0.026837436977527336, 'RAD'),
 (0.064004141082167412, 'TAX'),
 (0.03673470174942451, 'PTRATIO'),
 (0.011895111923940787, 'B'),
 (0.36959827178485755, 'LSTAT')]

In [27]:
print(gbr.feature_importances_)

[  8.12203929e-02   9.96693026e-03   3.77328837e-02   1.50855139e-04
   7.13855382e-02   1.87724845e-01   9.04964973e-02   1.26455338e-01
   2.03974848e-02   7.38847461e-02   5.76633451e-02   7.86477480e-02
   1.64273396e-01]


In [28]:
rfr.feature_importances_

array([ 0.02976292,  0.00117874,  0.00585249,  0.00114585,  0.02098487,
        0.5730904 ,  0.01480721,  0.04071374,  0.00294294,  0.02012298,
        0.01178815,  0.00996938,  0.26764032])

### 5.集成算法特点分析

##### 集成算法的性能一般都比较高，因此备受青睐