# 🍡 Model Ensembling

Ensembling combines several strong learners to offset the bias and variance weaknesses of any single algorithm.

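Averaging:
* Combine the outputs of all algorithms with a simple or weighted average.
* When the strong learners' errors are independent and unbiased, averaging reduces the variance of the error, so the averaged prediction has a lower expected squared error than a typical single learner (though not necessarily lower than the best one).
* In practice, weighted and simple averaging often perform about equally well.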
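
The variance-reduction argument above can be checked with a small simulation. This is a minimal sketch under idealized assumptions that are not part of the notebook: three hypothetical learners whose errors are independent, zero-mean, and of equal variance `sigma**2`. In that setting the simple average has an expected MSE of about `sigma**2 / n_models`, and the weights used for the weighted average are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_models, sigma = 10_000, 3, 1.0
y_true = rng.normal(size=n_samples)

# Hypothetical learners: the truth plus independent, zero-mean noise.
preds = y_true + rng.normal(scale=sigma, size=(n_models, n_samples))


def mse(pred):
    """Mean squared error against the simulated ground truth."""
    return float(np.mean((pred - y_true) ** 2))


single_mse = [mse(p) for p in preds]
simple_avg_mse = mse(preds.mean(axis=0))

# Weighted average; these weights are arbitrary and only need to sum to 1.
weights = np.array([0.5, 0.3, 0.2])
weighted_avg_mse = mse(weights @ preds)

print("single learners :", np.round(single_mse, 3))
print("simple average  :", round(simple_avg_mse, 3))
print("weighted average:", round(weighted_avg_mse, 3))
```

With equally good learners, uneven weights give up part of the variance reduction. Real learners trained on the same data have correlated errors, so the gain is usually smaller than in this simulation.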

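Voting
* Includes absolute-majority voting, plurality voting, weighted voting, and soft voting; applies to classification only.
* Absolute-majority voting outputs a class only if it receives more than 50% of the votes; otherwise it returns no prediction (a null value).
* Plurality voting outputs the class with the most votes, even if that share is below 50%.
* Soft voting averages the class probabilities predicted by the strong learners and outputs the class with the highest average probability.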
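
The voting rules above can be sketched for a single sample as follows. The function names, the example votes, and the probabilities are hypothetical and chosen only to show where the rules disagree; class labels are assumed to be integers starting at 0.

```python
import numpy as np


def majority_vote(labels, n_classes):
    """Absolute majority: return the top class only if it holds more than half the votes."""
    counts = np.bincount(labels, minlength=n_classes)
    winner = int(counts.argmax())
    return winner if counts[winner] > len(labels) / 2 else None


def plurality_vote(labels, n_classes, weights=None):
    """Plurality (optionally weighted): the class with the largest vote total wins."""
    counts = np.bincount(labels, weights=weights, minlength=n_classes)
    return int(counts.argmax())


def soft_vote(probas, weights=None):
    """Soft voting: average the class probabilities, then take the argmax."""
    return int(np.average(probas, axis=0, weights=weights).argmax())


# Four learners vote on one sample with three possible classes.
labels = np.array([0, 1, 2, 1])
absolute = majority_vote(labels, 3)        # class 1 has only 2 of 4 votes
plurality = plurality_vote(labels, 3)      # class 1 has the most votes
weighted = plurality_vote(labels, 3, weights=[3.0, 1.0, 1.0, 1.0])

# Three learners output probabilities; hard labels would be 0, 1, 1.
probas = np.array([[0.9, 0.1, 0.0],
                   [0.4, 0.6, 0.0],
                   [0.4, 0.6, 0.0]])
soft = soft_vote(probas)

print(absolute, plurality, weighted, soft)
```

In the last example, plurality voting on the hard labels would pick class 1, while soft voting picks class 0 because the first learner is far more confident than the other two.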

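Stacking
* Build one meta-learner and one or more base (individual) learners, and split the original data into train and test. Train the base learners on train, use their predictions for the train samples as the meta-learner's training data, and let the meta-learner produce the final output on test from the base learners' test predictions.
* To keep the meta-learner from training on predictions the base learners made for samples they were fitted on, these predictions are usually produced out-of-fold with cross-validation. scikit-learn's `StackingRegressor` does this for every base learner, not only when there is a single one.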
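
A minimal stacking sketch with scikit-learn's `StackingRegressor` is shown below. The synthetic dataset and the choice of `Ridge`, `RandomForestRegressor`, and `LinearRegression` are assumptions made to keep the example small and quick; they are not the models used elsewhere in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, used only for illustration).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=22)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

# Base learners are passed as a list of (name, estimator) tuples.
base_learners = [
    ("ridge", Ridge(alpha=1.0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=22)),
]

# cv=5: the meta-learner is trained on 5-fold out-of-fold predictions
# of the base learners, which are then refitted on the full training split.
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_pred = stack.predict(X_test)
stack_r2 = stack.score(X_test, y_test)
print(f"R^2 on the test split = {stack_r2:.3f}")
```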

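Blending
* A special case of Stacking.
* Build one meta-learner and one or more base (individual) learners, and split the original data into train, val, and test. Train the base learners on train, use their predictions on val as the meta-learner's training data, and let the meta-learner produce the final output on test.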
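
Blending has no dedicated scikit-learn estimator, so the sketch below wires the steps together by hand. The synthetic data, the split sizes, the base learners, and the `meta_features` helper are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data split into train / val / test (sizes are arbitrary).
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=22)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=22
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=22
)

# Step 1: fit the base learners on the train split only.
base_learners = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, random_state=22),
]
for model in base_learners:
    model.fit(X_train, y_train)


def meta_features(X_part):
    """Stack the base learners' predictions column-wise, one column per learner."""
    return np.column_stack([m.predict(X_part) for m in base_learners])


# Step 2: fit the meta-learner on the base learners' predictions for val.
meta = LinearRegression().fit(meta_features(X_val), y_val)

# Step 3: the meta-learner produces the final predictions on test.
blend_pred = meta.predict(meta_features(X_test))
blend_rmse = mean_squared_error(y_test, blend_pred) ** 0.5
print(f"Blending RMSE on the test split = {blend_rmse:.3f}")
```

Compared with cross-validated stacking, the meta-learner here sees only the val split, which is cheaper to compute but uses less data.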

In [1]:
# Load the California housing dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split as TTS

data = fetch_california_housing()
print(data.keys())

# Split into training and test sets
x = data['data']
y = data['target']
train_x, test_x, train_y, test_y = TTS(x, y, test_size=0.3, random_state=22)

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [2]:
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from sklearn.model_selection import cross_val_score, KFold
import time

In [3]:
cv = KFold(n_splits=5, shuffle=True, random_state=22)

## 1 | Training GBDT, LGBM, and XGBoost

In [4]:
# 1. GBDT
gbr = GradientBoostingRegressor(n_estimators=100, random_state=22)
start = time.time()
res = cross_val_score(gbr, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

print(time.time() - start)
tmp = (abs(res)**0.5).mean()
print(f'Mean(RMSE) = {tmp}')
tmp = (abs(res)**0.5).var()
print(f'Var(RMSE) = {tmp}')

28.284186124801636
Mean(RMSE) = 0.5435198338668081
Var(RMSE) = 0.00023465103020537168


In [5]:
# 2. LGBM
lgb = LGBMRegressor(n_estimators=100, force_col_wise=True, metric='rmse', random_state=22)
start = time.time()
res = cross_val_score(lgb, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

print(time.time() - start)
tmp = (abs(res)**0.5).mean()
print(f'Mean(RMSE) = {tmp}')
tmp = (abs(res)**0.5).var()
print(f'Var(RMSE) = {tmp}')

[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.070339
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.067614
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.067731
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11559, number of used features: 8
[LightGBM] [Info] Start training from score 2.072033
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11559, number of used features: 8
[LightGBM] [Info] Start training from score 2.075613
1.1566441059112549
Mean(RMSE) = 0.4774297398269434
Var(RMSE) = 0.00024916379465450555


In [6]:
# 3. XGBoost
xgb = XGBRegressor(n_estimators=100, random_state=22)
start = time.time()
res = cross_val_score(xgb, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

print(time.time() - start)
tmp = (abs(res)**0.5).mean()
print(f'Mean(RMSE) = {tmp}')
tmp = (abs(res)**0.5).var()
print(f'Var(RMSE) = {tmp}')

0.8879978656768799
Mean(RMSE) = 0.4830401686217102
Var(RMSE) = 0.0002604404270682063


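The results are summarized below (times are the wall-clock seconds printed above):

| Algorithm | Mean RMSE | Var(RMSE) | Time (s) |
| --------- | --------- | --------- | -------- |
| GBDT      | 0.5435    | 0.0002    | 28.2842  |
| LGBM      | 0.4774    | 0.0002    | 1.1566   |
| XGB       | 0.4830    | 0.0003    | 0.8880   |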

## 2 | Model Ensembling

In [8]:
from sklearn.ensemble import VotingRegressor

# Re-create the three estimators; they are later passed as a list of (name, estimator) tuples
gbr = GradientBoostingRegressor(n_estimators=100, random_state=22)
lgb = LGBMRegressor(n_estimators=100, random_state=22, force_col_wise=True, metric='rmse')
xgb = XGBRegressor(n_estimators=100, random_state=22)

In [9]:
# Ensemble with VotingRegressor (for regression it averages the estimators' predictions)
estimators = [('GBR', gbr), ('LGBM', lgb), ('XGB', xgb)]
mix = VotingRegressor(estimators, verbose=True)
cv_res = cross_val_score(mix, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

[Voting] ...................... (1 of 3) Processing GBR, total=   5.6s
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.070339
[Voting] ..................... (2 of 3) Processing LGBM, total=   0.1s
[Voting] ...................... (3 of 3) Processing XGB, total=   0.2s
[Voting] ...................... (1 of 3) Processing GBR, total=   5.4s
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.067614
[Voting] ..................... (2 of 3) Processing LGBM, total=   0.2s
[Voting] ...................... (3 of 3) Processing XGB, total=   0.2s
[Voting] ...................... (1 of 3) Processing GBR, total=   5.4s
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] S

In [10]:
tmp = (abs(cv_res)**0.5).mean()
print(f'Mean(RMSE) = {tmp}')
tmp = (abs(cv_res)**0.5).var()
print(f'Var(RMSE) = {tmp}')

Mean(RMSE) = 0.47908001792266736
Var(RMSE) = 0.00023419211448020937


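The averaged ensemble (mean RMSE 0.4791) is slightly worse than LGBM alone (0.4774). Averaging tends to improve on the individual models when:

* The estimators are strong learners that have already been carefully tuned.
* The estimators being combined have similar cross-validation scores.
* The estimators are as independent of one another as possible (adjust their randomness so that they differ more).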
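
One rough way to gauge the independence condition is to compare the estimators' out-of-fold residuals: highly correlated residuals suggest the models make the same mistakes, which leaves averaging little to cancel out. The sketch below assumes synthetic data and three scikit-learn models chosen only for illustration; it is a diagnostic idea, not a step of this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data (an assumption, used only for illustration).
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=22)
cv = KFold(n_splits=5, shuffle=True, random_state=22)

models = {
    "GBR": GradientBoostingRegressor(n_estimators=50, random_state=22),
    "RF": RandomForestRegressor(n_estimators=50, random_state=22),
    "Ridge": Ridge(alpha=1.0),
}

# Out-of-fold residuals: one row per model, one column per sample.
residuals = np.array(
    [cross_val_predict(model, X, y, cv=cv) - y for model in models.values()]
)

# Pairwise Pearson correlation between the models' residuals.
corr = np.corrcoef(residuals)
print(list(models))
print(np.round(corr, 2))
```

Off-diagonal values near 1 indicate near-duplicate models; lower values indicate more diverse errors.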

## 3 | Ensembling After Adjusting Hyperparameters

In [11]:
gbr = GradientBoostingRegressor(n_estimators=300, 
                                learning_rate=0.5,
                                max_features=0.6,
                                random_state=22)
start = time.time()
res = cross_val_score(gbr, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

print(time.time() - start)
tmp = (abs(res)**0.5).mean()
print(f'Mean(RMSE) = {tmp:.4f}')
tmp = (abs(res)**0.5).var()
print(f'Var(RMSE) = {tmp:.4f}')

44.16127824783325
Mean(RMSE) = 0.5111
Var(RMSE) = 0.0004


In [12]:
lgb = LGBMRegressor(n_estimators=200, 
                    learning_rate=0.5,
                    force_col_wise=True,
                    colsample_bytree=0.6,
                    metric='rmse',
                    random_state=22)
start = time.time()
res = cross_val_score(lgb, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

print(time.time() - start)
tmp = (abs(res)**0.5).mean()
print(f'Mean(RMSE) = {tmp:.4f}')
tmp = (abs(res)**0.5).var()
print(f'Var(RMSE) = {tmp:.4f}')

[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.070339
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.067614
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.067731
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11559, number of used features: 8
[LightGBM] [Info] Start training from score 2.072033
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11559, number of used features: 8
[LightGBM] [Info] Start training from score 2.075613
1.304314136505127
Mean(RMSE) = 0.5187
Var(RMSE) = 0.0002


In [13]:
xgb = XGBRegressor(n_estimators=100,
                   learning_rate=0.5,
                   colsample_bytree=0.6,
                   random_state=22)
start = time.time()
res = cross_val_score(xgb, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

print(time.time() - start)
tmp = (abs(res)**0.5).mean()
print(f'Mean(RMSE) = {tmp:.4f}')
tmp = (abs(res)**0.5).var()
print(f'Var(RMSE) = {tmp:.4f}')

0.8009979724884033
Mean(RMSE) = 0.5025
Var(RMSE) = 0.0003


In [14]:
estimators = [('GBR',gbr),('LGBM',lgb),('XGB',xgb)]
mix = VotingRegressor(estimators,verbose=True)
cv_res = cross_val_score(mix,train_x,train_y,cv=cv,scoring='neg_mean_squared_error')

[Voting] ...................... (1 of 3) Processing GBR, total=   8.5s
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.070339
[Voting] ..................... (2 of 3) Processing LGBM, total=   0.2s
[Voting] ...................... (3 of 3) Processing XGB, total=   0.1s
[Voting] ...................... (1 of 3) Processing GBR, total=   8.3s
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] Start training from score 2.067614
[Voting] ..................... (2 of 3) Processing LGBM, total=   0.2s
[Voting] ...................... (3 of 3) Processing XGB, total=   0.1s
[Voting] ...................... (1 of 3) Processing GBR, total=   8.4s
[LightGBM] [Info] Total Bins 1837
[LightGBM] [Info] Number of data points in the train set: 11558, number of used features: 8
[LightGBM] [Info] S

In [15]:
tmp = (abs(cv_res)**0.5).mean()
print(f'Mean(RMSE) = {tmp:.4f}')
tmp = (abs(cv_res)**0.5).var()
print(f'Var(RMSE) = {tmp:.4f}')

Mean(RMSE) = 0.4682
Var(RMSE) = 0.0003


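The results are summarized below (times are the wall-clock seconds printed above; the ensemble run was not timed):

| Algorithm       | Mean RMSE | Var(RMSE) | Time (s) |
| --------------- | --------- | --------- | -------- |
| Baseline (LGBM) | 0.4774    | 0.0002    | 1.1566   |
| GBDT            | 0.5111    | 0.0004    | 44.1613  |
| LGBM            | 0.5187    | 0.0002    | 1.3043   |
| XGB             | 0.5025    | 0.0003    | 0.8010   |
| Averaging       | 0.4682    | 0.0003    | n/a      |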