# B XGBoost

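XGBoost (Extreme Gradient Boosting; in Chinese, 极致提升树 or 极限提升树) was developed by Tianqi Chen. It is an algorithm that improves on GBDT (gradient-boosted decision trees).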

Official documentation: https://xgboost.readthedocs.io/en/stable/

Installation: `pip install xgboost`

In [1]:
# Load the California housing dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split as TTS

data = fetch_california_housing()
print(data.keys())

# Split into training and test sets
x = data['data']
y = data['target']
train_x, test_x, train_y, test_y = TTS(x, y, test_size=0.3, random_state=22)

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


## I | Using the XGBoost native API

In [2]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error as MSE

### 1. Data conversion

In [3]:
# The native XGBoost API requires the data to be converted to its own DMatrix format
train_data = xgb.DMatrix(train_x, train_y)
test_data = xgb.DMatrix(test_x, test_y)

### 2. Training and testing with XGB

In [4]:
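# Set parameters
param = {'seed': 22}

# Train the model (num_boost_round defaults to 10)
reg = xgb.train(param, train_data)

# Predict on the test set
y_pred = reg.predict(test_data)

# Compute the RMSE, i.e. the square root of the MSE
# (the `squared=False` argument is no longer available in recent scikit-learn releases)
rmse = MSE(test_y, y_pred) ** 0.5
print(f'RMSE = {rmse}')

RMSE = 0.5131600787859798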


### 3. Cross-validation

In [5]:
xgb.cv(param, train_data, nfold=5, seed=22)

| round | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
| ----- | --------------- | -------------- | -------------- | ------------- |
| 0     | 0.942639        | 0.004828       | 0.950416       | 0.015797      |
| 1     | 0.7955          | 0.004464       | 0.810545       | 0.015531      |
| 2     | 0.698609        | 0.003671       | 0.722387       | 0.019121      |
| 3     | 0.633481        | 0.002354       | 0.665639       | 0.020278      |
| 4     | 0.587897        | 0.003754       | 0.627098       | 0.018831      |
| 5     | 0.557956        | 0.004437       | 0.604098       | 0.01647       |
| 6     | 0.533969        | 0.004048       | 0.585822       | 0.018141      |
| 7     | 0.511882        | 0.004187       | 0.568054       | 0.01649       |
| 8     | 0.495614        | 0.005177       | 0.556252       | 0.016304      |
| 9     | 0.480777        | 0.005718       | 0.545853       | 0.012671      |


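Each row gives the metric after one more tree has been added, averaged over the n folds. The final result is therefore the last row, which is the n-fold cross-validation mean for the complete model.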
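Because `xgb.cv` returns a pandas DataFrame by default, reading off the final score is an ordinary DataFrame lookup. The sketch below rebuilds only the `test-rmse-mean` column from the output above; the names `cv_res`, `final_score` and `best_round` are illustrative, not part of the notebook.

```python
import pandas as pd

# test-rmse-mean per boosting round, copied from the xgb.cv output above
test_rmse_mean = [0.950416, 0.810545, 0.722387, 0.665639, 0.627098,
                  0.604098, 0.585822, 0.568054, 0.556252, 0.545853]
cv_res = pd.DataFrame({'test-rmse-mean': test_rmse_mean})

# The last row corresponds to the model with all trees built
final_score = cv_res['test-rmse-mean'].iloc[-1]

# Index of the round with the lowest cross-validated error
best_round = int(cv_res['test-rmse-mean'].idxmin())

print(f'final test RMSE = {final_score}, best round = {best_round}')
```

Here the error is still falling at the last round, so the best round and the final round coincide; with more rounds the two may differ.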

## II | Using the XGBoost sklearn API

In [6]:
from xgboost import XGBRegressor

### 1. Training and testing

In [7]:
# Create the estimator
reg = xgb.XGBRegressor(n_estimators=100, random_state=22)

# Train
reg.fit(train_x, train_y)

# Evaluate; score() returns R^2 by default
reg.score(test_x, test_y)

0.833294092003024

In [8]:
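# Predict
y_pred = reg.predict(test_x)

# Compute the RMSE, i.e. the square root of the MSE
# (the `squared=False` argument is no longer available in recent scikit-learn releases)
rmse = MSE(test_y, y_pred) ** 0.5
print(f'RMSE = {rmse}')

RMSE = 0.4609130051083081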


In [9]:
# Print the feature importances
for fn, fi in zip(data['feature_names'], reg.feature_importances_):
    print(f'{fn}: {fi:.4f}')

MedInc: 0.4889
HouseAge: 0.0658
AveRooms: 0.0363
AveBedrms: 0.0228
Population: 0.0245
AveOccup: 0.1429
Latitude: 0.1079
Longitude: 0.1109
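The importances are easier to read once sorted. A small sketch using the rounded values printed above (the dictionary is copied by hand from that output; in the notebook the same ranking could be built from `reg.feature_importances_`):

```python
# Importances copied from the output above (rounded to 4 decimals)
importances = {
    'MedInc': 0.4889, 'HouseAge': 0.0658, 'AveRooms': 0.0363,
    'AveBedrms': 0.0228, 'Population': 0.0245, 'AveOccup': 0.1429,
    'Latitude': 0.1079, 'Longitude': 0.1109,
}

# Sort from most to least important
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f'{name:<10} {value:.4f}')

# The sklearn wrapper normalizes importances, so they should sum to about 1
total = sum(importances.values())
print(f'total = {total:.4f}')
```

In this run `MedInc` (median income) accounts for almost half of the total importance.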


### 2. Cross-validation

In [10]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score as CVS

In [11]:
reg = xgb.XGBRegressor(n_estimators=100, random_state=22)
# Cross-validation (scores are negative MSE, so take abs() and the square root to get RMSE)
cv = KFold(n_splits=5, shuffle=True, random_state=22)
res = CVS(reg, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')
rmse = (abs(res) ** 0.5).mean()
vrmse = (abs(res) ** 0.5).var()
print(f'Mean(RMSE) = {rmse}\n'
      f'Var(RMSE) = {vrmse}')

Mean(RMSE) = 0.4830401686217102
Var(RMSE) = 0.0002604404270682063


## III | XGBoost parameters

### 1. Inspecting XGBoost parameters


In [None]:
# Inspect the sklearn API parameters
from xgboost import XGBRegressor as XGBR

reg = XGBR()
reg

### 2. sklearn API vs. native API

| Stage                            | Native API parameters                                        | sklearn API parameters                                       |
| -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Loss function                    | <b>`objective`</b>, `lambda`, `alpha`                           | <b>`objective`</b>, `reg_alpha`, `reg_lambda`                   |
| Ensemble rules                   | <b>`eta`</b>, `base_score`, `eval_metric`, `subsample`, `sampling_method`, `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` | <b>`learning_rate`</b>, `base_score`, `eval_metric`, `subsample`, `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` |
| Weak learner                     | <b>`num_boost_round`</b>, `booster`, `tree_method`, `sketch_eps`, `updater`, `grow_policy` | <b>`n_estimators`</b>, `booster`, `tree_method`                 |
| Training (overfitting control)   | `num_feature`, `max_depth`, `gamma`, `min_child_weight`, `max_delta_step`, `max_leaves`, `max_bin` | `max_depth`, `gamma`, `min_child_weight`, `max_delta_step`   |
| Training (progress monitoring)   | <b>`verbosity`</b>                                              | <b>`verbosity`</b>                                              |
| Training (early stopping)        | `early_stopping_rounds`                                      | `early_stopping_rounds`                                      |
| Training (incremental learning)  | `xgb_model` (argument of `xgb.train`)                        | `xgb_model` (argument of `fit`)                              |
| Randomness control               | `seed`                                                       | `random_state`                                               |
| Other                            | <b>`missing`</b>, `scale_pos_weight`, `predictor`, <b>`num_parallel_tree`</b> | `n_jobs`, `scale_pos_weight`, `num_parallel_tree`, `enable_categorical`, `importance_type` |


### 3. Parameter details

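`objective`: the objective function (loss function + model complexity)

$$
\begin{aligned}
Obj &= \sum_{i=1}^{m} l(y_i, f(x_i)) + \sum_{k=1}^{K} \Omega(g_k) \\
\Omega(g_k) &= \gamma T + \text{regularization term}
\end{aligned}
$$

Here $m$ is the number of samples, $K$ is the number of trees, and $T$ is the number of leaves in tree $g_k$.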


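When setting the `objective` parameter, you supply only the name of the loss function, without the regularization term. For regression the default is squared-error loss, `reg:squarederror`.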

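| Task           | Loss function         | Description                                                                     |
| -------------- | --------------------- | ------------------------------------------------------------------------------- |
| Regression     | `reg:squarederror`    | Squared-error loss                                                              |
| Regression     | `reg:squaredlogerror` | Squared log error loss                                                          |
| Classification | `binary:logistic`     | Binary cross-entropy loss; outputs probabilities                                |
| Classification | `binary:logitraw`     | Binary cross-entropy loss; outputs the value before the sigmoid                 |
| Classification | `multi:softmax`       | Multiclass cross-entropy loss; outputs the predicted class                      |
| Classification | `multi:softprob`      | Multiclass cross-entropy loss; outputs per-class probabilities for every sample |
| …              | …                     | …                                                                               |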

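`lambda`: part of the full objective function; the coefficient of the L2 regularization term, default 1.

`alpha`: part of the full objective function; the coefficient of the L1 regularization term, default 0.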
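With `gamma`, `lambda` and `alpha` in hand, the complexity term of the objective can be written out in full. The symbol $w_j$ (the weight, or output value, of leaf $j$) is introduced here for illustration and does not appear in the notation above:

$$
\Omega(g_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

With the defaults ($\lambda = 1$, $\alpha = 0$) only the L2 penalty on the leaf weights is active, alongside the per-leaf cost $\gamma T$.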

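`num_boost_round`: the number of trees (boosting rounds), default 10 in the native API. (The sklearn counterpart, `n_estimators`, defaults to 100.)

`eta`: the learning rate, default 0.3. (The sklearn counterpart, `learning_rate`, also defaults to 0.3.)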

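`verbosity`: controls how much information is printed.

| Value | Behavior                                                     |
| ----- | ------------------------------------------------------------ |
| 0     | Print nothing (silent)                                       |
| 1     | Print warnings only                                          |
| 2     | Print info messages, including details of tree construction  |
| 3     | Print additional debug information                           |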

`num_parallel_tree`: the number of trees built in parallel in each boosting round; this lets XGBoost build a random forest.

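`missing`: the value that marks missing data. XGBoost tolerates missing values in the input and handles them during training.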

## IV | Parameter tuning

Using the learning rate as an example:

In [12]:
# sklearn API
from xgboost import XGBRegressor
from sklearn.model_selection import KFold, cross_val_score as CVS

import numpy as np

for lr in np.linspace(0.1, 0.5, 5):
    reg = XGBRegressor(objective='reg:squarederror',
                       learning_rate=lr,
                       random_state=22)
    cv = KFold(n_splits=5, shuffle=True, random_state=22)
    res = CVS(reg, train_x, train_y, cv=cv, scoring='neg_mean_squared_error')

    print(f'learning_rate = {lr:.4f}, res = {(abs(res) ** 0.5).mean()}')

learning_rate = 0.1000, res = 0.4868025769066027
learning_rate = 0.2000, res = 0.4793697761831933
learning_rate = 0.3000, res = 0.4830401686217102
learning_rate = 0.4000, res = 0.5026510715198398
learning_rate = 0.5000, res = 0.512333817939106


In [13]:
# Native API
import xgboost as xgb

train_data = xgb.DMatrix(train_x, train_y)

for eta in np.linspace(0.1, 0.5, 5):
    param = {'objective': 'reg:squarederror',
             'eta': eta,
             'seed': 22}
    res = xgb.cv(param, train_data,
                 num_boost_round=100, nfold=5, seed=22,
                 shuffle=True, metrics='rmse')
    print(f'eta = {eta:.4f}, res = {res.iloc[-1, -2]}')

eta = 0.1000, res = 0.4873787736549489
eta = 0.2000, res = 0.4836093275141261
eta = 0.3000, res = 0.4846305601993509
eta = 0.4000, res = 0.49834340288402323
eta = 0.5000, res = 0.5076573688820785
