# Preparation

+ Load the required packages

In [1]:
library(xgboost)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(pROC)


Attaching package: 'dplyr'

The following object is masked from 'package:xgboost':

    slice

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

"package 'pROC' was built under R version 3.3.3"Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var



+ Load the dataset

In [2]:
df_train = read.csv("F:/XGBoost/data/cs-training.csv", stringsAsFactors = FALSE) %>%
  na.omit() %>%   # drop rows that contain missing values
  select(-`X`)    # drop the first column, which is only a row index

In [3]:
train_data = as.matrix(df_train %>% select(-SeriousDlqin2yrs))
train_label = df_train$SeriousDlqin2yrs

# Eight parameters need to be tuned:
+ 1 eta [default 0.3]
+ 2 nrounds, chosen with xgb.cv and early.stop.round
+ 3 max.depth [default 6]
+ 4 min.child.weight [default 1]
+ 5 gamma [default 0]
+ 6 subsample [default 1]
+ 7 colsample.bytree [default 1]
+ 8 scale.pos.weight [default 1]
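
For `scale.pos.weight`, the XGBoost documentation suggests the ratio of negative to positive samples as a typical value to consider for imbalanced classes. This walkthrough keeps the default of 1, so the following is only a minimal sketch of how that ratio could be computed; the `labels` vector is made up and stands in for `train_label`.

```r
# Hypothetical 0/1 label vector standing in for train_label
# (0 = negative class, 1 = positive class).
labels <- c(rep(0, 93), rep(1, 7))

# Ratio of negative to positive samples, a commonly suggested
# starting value for scale.pos.weight on imbalanced data.
neg_pos_ratio <- sum(labels == 0) / sum(labels == 1)
neg_pos_ratio
```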

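Tuning strategy:  

+ 1 Choose a relatively high learning rate (`eta`). A value of 0.1 is a common starting point, although the best value depends on the problem and typically lies between 0.05 and 0.3. Then find the number of trees (`nrounds`) that suits this learning rate. XGBoost provides `xgb.cv`, which runs cross-validation at every boosting iteration and can be used to find that number.

+ 2 With the learning rate and number of trees fixed, tune the tree-specific parameters (`max.depth`, `min.child.weight`, `gamma`, `subsample`, `colsample.bytree`).

+ 3 Tune the XGBoost regularization parameters (`lambda`, `alpha`). They reduce model complexity, which can improve generalization.

+ 4 Lower the learning rate and settle on the final parameters.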

## 1 Tune `nrounds` at a relatively high `eta`

In [4]:
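xgb_params = list(
    objective = "binary:logistic", # binary classification
    eval_metric = "auc",           # evaluate with AUC
      
    # initial values of the parameters to be tuned
    eta = 0.1,                     # start at 0.1
    max.depth = 5,                 # usually best between 3 and 10; 4 to 6 is a good starting point
    min.child.weight = 1,          # small value because the classes are imbalanced
    gamma = 0,                     # start at 0
    subsample = 0.8,               # common starting value; typical range 0.5 to 0.9
    colsample_bytree = 0.8,        # common starting value; typical range 0.5 to 0.9
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)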

In [5]:
set.seed(27)
xgb = xgb.cv(data = train_data,
             label = train_label,
             params = xgb_params,
             nrounds = 1000,
             
             # fixed settings
             nfold = 5,                                                   # 5-fold cross-validation
             stratified = TRUE,                                           # stratified sampling because the classes are imbalanced
             verbose = TRUE,
             early.stop.round = 50
)

[0]	train-auc:0.814889+0.016538	test-auc:0.810415+0.024144
[1]	train-auc:0.840380+0.002048	test-auc:0.836830+0.005340
[2]	train-auc:0.842096+0.002344	test-auc:0.838683+0.005185
[3]	train-auc:0.844532+0.002250	test-auc:0.841315+0.004608
[4]	train-auc:0.846417+0.001315	test-auc:0.842477+0.005932
[5]	train-auc:0.848501+0.002065	test-auc:0.844527+0.005591
[6]	train-auc:0.849803+0.001776	test-auc:0.845627+0.005784
[7]	train-auc:0.850456+0.002176	test-auc:0.846332+0.005722
[8]	train-auc:0.851198+0.002268	test-auc:0.847078+0.005692
[9]	train-auc:0.851655+0.002288	test-auc:0.847603+0.005781
[10]	train-auc:0.852052+0.002455	test-auc:0.847953+0.005337
[11]	train-auc:0.851989+0.002545	test-auc:0.847898+0.005024
[12]	train-auc:0.852956+0.001816	test-auc:0.848917+0.005845
[13]	train-auc:0.853242+0.001728	test-auc:0.849200+0.005922
[14]	train-auc:0.853648+0.002068	test-auc:0.849581+0.005479
[15]	train-auc:0.854119+0.001876	test-auc:0.849917+0.005434
[16]	train-auc:0.854302+0.001560	test-auc:0.850149

### With `eta` = 0.1, the best `nrounds` is 77
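
The rule behind early stopping can be sketched in base R: keep track of the best test score seen so far and stop once it has not improved for a given number of rounds. This is an illustration of the idea, not the xgboost implementation; `best_round_early_stop`, `patience` and the scores are made up, and the returned index is 1-based whereas the log above counts rounds from 0.

```r
# Return the index of the best round under an early-stopping rule.
# `scores` holds one test metric per round, where larger is better (e.g. AUC).
# `patience` is the number of rounds without improvement that triggers a stop.
best_round_early_stop <- function(scores, patience) {
  best_idx <- 1
  for (i in seq_along(scores)) {
    if (scores[i] > scores[best_idx]) {
      best_idx <- i
    } else if (i - best_idx >= patience) {
      break  # no improvement for `patience` rounds in a row
    }
  }
  best_idx
}

# Made-up scores: steady improvement, a plateau, then a late jump.
toy_scores <- c(0.80, 0.83, 0.84, 0.85, 0.849, 0.848, 0.85, 0.847, 0.86)
best_round_early_stop(toy_scores, patience = 3)
best_round_early_stop(toy_scores, patience = 10)
```

A small `patience` stops at the plateau and misses the late jump, which is one reason the walkthrough uses a fairly generous value of 50.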

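## 2 Tune the tree parameters for the given `eta` and `nrounds`
`max.depth`, `min.child.weight`, `gamma`, `subsample`, `colsample.bytree`

### 2.1 Tuning `max.depth` and `min.child.weight`
Start with a coarse search over a wide range, then fine-tune over a narrow range around the best values.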
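
Every search in this section repeats the same loop over a parameter grid. As a minimal sketch of that pattern in base R, the loop can be wrapped in one function. The names `grid_search`, `score_fn` and `toy_score` are hypothetical: `score_fn` stands in for a call to `xgb.cv` and is replaced here by a made-up formula so the example runs without the data.

```r
# Evaluate every row of a parameter grid with a user-supplied scoring function.
# `score_fn` takes a named list of parameters and returns
# c(mean = ..., std = ...), for example the final test AUC from cross-validation.
grid_search <- function(grid, score_fn) {
  rows <- lapply(seq_len(nrow(grid)), function(i) {
    params <- as.list(grid[i, , drop = FALSE])
    score <- score_fn(params)
    data.frame(test.auc.mean = score[["mean"]],
               test.auc.std = score[["std"]],
               grid[i, , drop = FALSE],
               row.names = NULL)
  })
  do.call(rbind, rows)
}

# Made-up scoring function that peaks at max.depth = 5, min.child.weight = 3.
toy_score <- function(p) {
  c(mean = 0.85 - 0.001 * (p$max.depth - 5)^2 - 0.0005 * (p$min.child.weight - 3)^2,
    std = 0.005)
}

grid <- expand.grid(max.depth = seq(3, 9, 2), min.child.weight = seq(1, 5, 2))
result <- grid_search(grid, toy_score)
result[which.max(result$test.auc.mean), ]
```

Collecting the rows with `lapply` and a single `rbind` avoids the placeholder first row that the loops in this section have to remove afterwards.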

In [6]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,                     # start at 0.1
    gamma = 0,                     # start at 0
    subsample = 0.8,               # common starting value; typical range 0.5 to 0.9
    colsample_bytree = 0.8,        # common starting value; typical range 0.5 to 0.9
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)

In [7]:
max.depth = seq(3, 9, 2)
min.child.weight = seq(1, 5, 2)
to_tune = expand.grid(max.depth = max.depth, min.child.weight = min.child.weight)       # one row per parameter combination

result = vector(mode = 'numeric', length = 4)                                           # placeholder first row: 2 score columns + 2 tuned parameters
result.names = c('test.auc.mean', 'test.auc.std', 'max.depth', 'min.child.weight')      # column names follow the tuned parameters
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$max.depth = to_tune[i, 1]
    xgb_params$min.child.weight = to_tune[i, 2]                                         # set the parameters being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                         # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                     # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1], to_tune[i, 2])     # test AUC mean and std of the last round, plus the parameter values
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | max.depth | min.child.weight |
|---|---|---|---|---|
| stats_params | 0.854751 | 0.005685 | 3 | 1 |
| stats_params.1 | 0.855423 | 0.005471 | 5 | 1 |
| stats_params.2 | 0.853199 | 0.005482 | 7 | 1 |
| stats_params.3 | 0.850534 | 0.004927 | 9 | 1 |
| stats_params.4 | 0.854745 | 0.005654 | 3 | 3 |
| stats_params.5 | 0.855603 | 0.005375 | 5 | 3 |
| stats_params.6 | 0.853897 | 0.005362 | 7 | 3 |
| stats_params.7 | 0.850961 | 0.005332 | 9 | 3 |
| stats_params.8 | 0.854725 | 0.005492 | 3 | 5 |
| stats_params.9 | 0.855361 | 0.005456 | 5 | 5 |


In [8]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | max.depth | min.child.weight |
|---|---|---|---|---|
| stats_params.5 | 0.855603 | 0.005375 | 5 | 3 |


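The best `max.depth` is 5 and the best `min.child.weight` is 3. Because the coarse grid used a step of 2, the optimum may lie between grid points, so we search again one unit on either side of these values.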
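
The refinement step can be written as a small helper that builds the candidate values around a coarse optimum. This is only a sketch; the function name `refine` is an assumption and not part of the original notebook.

```r
# Candidate values one step below and one step above a coarse optimum.
refine <- function(best, step) seq(best - step, best + step, by = step)

max.depth <- refine(5, 1)          # 4 5 6
min.child.weight <- refine(3, 1)   # 2 3 4
expand.grid(max.depth = max.depth, min.child.weight = min.child.weight)
```

The same helper works for fractional parameters, for example `refine(0.6, 0.05)` for a subsampling ratio.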

In [9]:
max.depth = c(4, 5, 6)
min.child.weight = c(2, 3, 4)
to_tune = expand.grid(max.depth = max.depth, min.child.weight = min.child.weight)       # one row per parameter combination

result = vector(mode = 'numeric', length = 4)                                           # placeholder first row: 2 score columns + 2 tuned parameters
result.names = c('test.auc.mean', 'test.auc.std', 'max.depth', 'min.child.weight')      # column names follow the tuned parameters
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$max.depth = to_tune[i, 1]
    xgb_params$min.child.weight = to_tune[i, 2]                                         # set the parameters being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                         # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                     # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1], to_tune[i, 2])     # test AUC mean and std of the last round, plus the parameter values
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | max.depth | min.child.weight |
|---|---|---|---|---|
| stats_params | 0.855297 | 0.005597 | 4 | 2 |
| stats_params.1 | 0.855357 | 0.005455 | 5 | 2 |
| stats_params.2 | 0.854612 | 0.005226 | 6 | 2 |
| stats_params.3 | 0.855269 | 0.005419 | 4 | 3 |
| stats_params.4 | 0.855603 | 0.005375 | 5 | 3 |
| stats_params.5 | 0.855113 | 0.005816 | 6 | 3 |
| stats_params.6 | 0.855368 | 0.005504 | 4 | 4 |
| stats_params.7 | 0.855651 | 0.005409 | 5 | 4 |
| stats_params.8 | 0.85483 | 0.005499 | 6 | 4 |


In [10]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | max.depth | min.child.weight |
|---|---|---|---|---|
| stats_params.7 | 0.855651 | 0.005409 | 5 | 4 |


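This gives a best `max.depth` of 5 and a best `min.child.weight` of 4, with a slight improvement in the cross-validated AUC (from 0.855603 to 0.855651). Note that as a model improves, further gains become increasingly hard to obtain, especially once performance is close to its ceiling.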
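
To put the size of this improvement in perspective, it can be compared with the fold-to-fold standard deviation reported in the same tables. The short calculation below uses only numbers copied from the results above; the variable names are illustrative.

```r
# Values copied from the result tables in this section.
coarse_best <- c(mean = 0.855603, std = 0.005375)  # max.depth = 5, min.child.weight = 3
fine_best   <- c(mean = 0.855651, std = 0.005409)  # max.depth = 5, min.child.weight = 4

# Absolute gain in mean test AUC, and the same gain in units of one std.
gain <- fine_best[["mean"]] - coarse_best[["mean"]]
gain_in_std_units <- gain / fine_best[["std"]]
c(gain = gain, gain_in_std_units = gain_in_std_units)
```

The gain is under one percent of one standard deviation across folds, so a difference this small may well be within cross-validation noise.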

### 2.2 Tuning `gamma`
`gamma` can take a wide range of values. Here the search covers 0 to 0.5 in steps of 0.1; a finer grid can be used to find a more precise value.

In [18]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,                     # start at 0.1
    max.depth = 5,
    min.child.weight = 4,
    subsample = 0.8,               # common starting value; typical range 0.5 to 0.9
    colsample_bytree = 0.8,        # common starting value; typical range 0.5 to 0.9
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)

In [19]:
gamma = seq(0, 0.5, 0.1)
to_tune = expand.grid(gamma = gamma)                                                # one row per candidate value

result = vector(mode = 'numeric', length = 3)                                       # placeholder first row: 2 score columns + 1 tuned parameter
result.names = c('test.auc.mean', 'test.auc.std', 'gamma')                          # column names follow the tuned parameter
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$gamma = to_tune[i, 1]                                                # set the parameter being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                      # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                  # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1]) # test AUC mean and std of the last round, plus the parameter value
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | gamma |
|---|---|---|---|
| stats_params | 0.855651 | 0.005409 | 0.0 |
| stats_params.1 | 0.855701 | 0.005334 | 0.1 |
| stats_params.2 | 0.85559 | 0.005359 | 0.2 |
| stats_params.3 | 0.855524 | 0.005394 | 0.3 |
| stats_params.4 | 0.855557 | 0.005343 | 0.4 |
| stats_params.5 | 0.855437 | 0.005455 | 0.5 |


In [20]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | gamma |
|---|---|---|---|
| stats_params.1 | 0.855701 | 0.005334 | 0.1 |


The best `gamma` is 0.1. Since several parameters have changed, re-tune `nrounds`.

In [21]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,
    max.depth = 5,
    min.child.weight = 4,
    gamma = 0.1,
    subsample = 0.8,
    colsample_bytree = 0.8,
    scale.pos.weight = 1
)

In [22]:
set.seed(27)
xgb = xgb.cv(data = train_data,
             label = train_label,
             params = xgb_params,
             nrounds = 1000,
             
             nfold = 5,
             stratified = TRUE,
             verbose = TRUE,
             early.stop.round = 50
)

[0]	train-auc:0.814978+0.016563	test-auc:0.810560+0.024164
[1]	train-auc:0.840399+0.002071	test-auc:0.836829+0.005364
[2]	train-auc:0.842136+0.002398	test-auc:0.838778+0.005137
[3]	train-auc:0.844550+0.002283	test-auc:0.841368+0.004577
[4]	train-auc:0.846347+0.001174	test-auc:0.842460+0.005922
[5]	train-auc:0.848440+0.001954	test-auc:0.844594+0.005602
[6]	train-auc:0.849744+0.001748	test-auc:0.845703+0.005815
[7]	train-auc:0.850471+0.002122	test-auc:0.846536+0.005865
[8]	train-auc:0.851194+0.002244	test-auc:0.847212+0.005745
[9]	train-auc:0.851607+0.002309	test-auc:0.847662+0.005739
[10]	train-auc:0.852017+0.002441	test-auc:0.848026+0.005289
[11]	train-auc:0.851955+0.002552	test-auc:0.847938+0.005028
[12]	train-auc:0.852957+0.001818	test-auc:0.849040+0.005728
[13]	train-auc:0.853278+0.001761	test-auc:0.849292+0.005841
[14]	train-auc:0.853617+0.002064	test-auc:0.849568+0.005525
[15]	train-auc:0.854101+0.001883	test-auc:0.849923+0.005416
[16]	train-auc:0.854291+0.001580	test-auc:0.850217

The best `nrounds` is still 77.

### 2.3 Tuning `subsample` and `colsample.bytree`
Both parameters are first searched from 0.6 to 1.0 in steps of 0.1.

In [23]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,                     # start at 0.1
    max.depth = 5,
    min.child.weight = 4,
    gamma = 0.1,
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)

In [24]:
subsample = c(0.6, 0.7, 0.8, 0.9, 1.0)
colsample.bytree = c(0.6, 0.7, 0.8, 0.9, 1.0)
to_tune = expand.grid(subsample = subsample, colsample.bytree = colsample.bytree)       # one row per parameter combination

result = vector(mode = 'numeric', length = 4)                                           # placeholder first row: 2 score columns + 2 tuned parameters
result.names = c('test.auc.mean', 'test.auc.std', 'subsample', 'colsample.bytree')      # column names follow the tuned parameters
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$subsample = to_tune[i, 1]
    xgb_params$colsample.bytree = to_tune[i, 2]                                         # set the parameters being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                         # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                     # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1], to_tune[i, 2])     # test AUC mean and std of the last round, plus the parameter values
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | subsample | colsample.bytree |
|---|---|---|---|---|
| stats_params | 0.855951 | 0.005536 | 0.6 | 0.6 |
| stats_params.1 | 0.855781 | 0.005637 | 0.7 | 0.6 |
| stats_params.2 | 0.855817 | 0.005608 | 0.8 | 0.6 |
| stats_params.3 | 0.855552 | 0.005467 | 0.9 | 0.6 |
| stats_params.4 | 0.855865 | 0.005612 | 1.0 | 0.6 |
| stats_params.5 | 0.855721 | 0.005437 | 0.6 | 0.7 |
| stats_params.6 | 0.855675 | 0.005617 | 0.7 | 0.7 |
| stats_params.7 | 0.855505 | 0.005663 | 0.8 | 0.7 |
| stats_params.8 | 0.85572 | 0.005435 | 0.9 | 0.7 |
| stats_params.9 | 0.85564 | 0.005583 | 1.0 | 0.7 |


In [25]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | subsample | colsample.bytree |
|---|---|---|---|---|
| stats_params | 0.855951 | 0.005536 | 0.6 | 0.6 |


The best value of both `subsample` and `colsample.bytree` is 0.6, but this is the lower edge of the grid, and values below 0.6 have not been tried yet.

In [26]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,                     # start at 0.1
    max.depth = 5,
    min.child.weight = 4,
    gamma = 0.1,
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)

In [27]:
subsample = c(0.4, 0.5, 0.6)
colsample.bytree = c(0.4, 0.5, 0.6)
to_tune = expand.grid(subsample = subsample, colsample.bytree = colsample.bytree)       # one row per parameter combination

result = vector(mode = 'numeric', length = 4)                                           # placeholder first row: 2 score columns + 2 tuned parameters
result.names = c('test.auc.mean', 'test.auc.std', 'subsample', 'colsample.bytree')      # column names follow the tuned parameters
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$subsample = to_tune[i, 1]
    xgb_params$colsample.bytree = to_tune[i, 2]                                         # set the parameters being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                         # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                     # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1], to_tune[i, 2])     # test AUC mean and std of the last round, plus the parameter values
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | subsample | colsample.bytree |
|---|---|---|---|---|
| stats_params | 0.855194 | 0.005897 | 0.4 | 0.4 |
| stats_params.1 | 0.85504 | 0.005227 | 0.5 | 0.4 |
| stats_params.2 | 0.855417 | 0.005802 | 0.6 | 0.4 |
| stats_params.3 | 0.855119 | 0.006109 | 0.4 | 0.5 |
| stats_params.4 | 0.855604 | 0.005773 | 0.5 | 0.5 |
| stats_params.5 | 0.855965 | 0.005655 | 0.6 | 0.5 |
| stats_params.6 | 0.855133 | 0.00536 | 0.4 | 0.6 |
| stats_params.7 | 0.855822 | 0.005636 | 0.5 | 0.6 |
| stats_params.8 | 0.855951 | 0.005536 | 0.6 | 0.6 |


In [28]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | subsample | colsample.bytree |
|---|---|---|---|---|
| stats_params.5 | 0.855965 | 0.005655 | 0.6 | 0.5 |


The best values of `subsample` and `colsample.bytree` are 0.6 and 0.5 respectively. Next, search around these values with a step of 0.05.

In [29]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,                     # start at 0.1
    max.depth = 5,
    min.child.weight = 4,
    gamma = 0.1,
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)

In [30]:
subsample = c(0.55, 0.6, 0.65)
colsample.bytree = c(0.45, 0.5, 0.55)
to_tune = expand.grid(subsample = subsample, colsample.bytree = colsample.bytree)       # one row per parameter combination

result = vector(mode = 'numeric', length = 4)                                           # placeholder first row: 2 score columns + 2 tuned parameters
result.names = c('test.auc.mean', 'test.auc.std', 'subsample', 'colsample.bytree')      # column names follow the tuned parameters
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$subsample = to_tune[i, 1]
    xgb_params$colsample.bytree = to_tune[i, 2]                                         # set the parameters being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                         # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                     # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1], to_tune[i, 2])     # test AUC mean and std of the last round, plus the parameter values
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | subsample | colsample.bytree |
|---|---|---|---|---|
| stats_params | 0.855661 | 0.005851 | 0.55 | 0.45 |
| stats_params.1 | 0.855417 | 0.005802 | 0.6 | 0.45 |
| stats_params.2 | 0.855462 | 0.005832 | 0.65 | 0.45 |
| stats_params.3 | 0.855321 | 0.005901 | 0.55 | 0.5 |
| stats_params.4 | 0.855965 | 0.005655 | 0.6 | 0.5 |
| stats_params.5 | 0.856129 | 0.005717 | 0.65 | 0.5 |
| stats_params.6 | 0.855321 | 0.005901 | 0.55 | 0.55 |
| stats_params.7 | 0.855965 | 0.005655 | 0.6 | 0.55 |
| stats_params.8 | 0.856129 | 0.005717 | 0.65 | 0.55 |


In [31]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | subsample | colsample.bytree |
|---|---|---|---|---|
| stats_params.5 | 0.856129 | 0.005717 | 0.65 | 0.5 |


The best values of `subsample` and `colsample.bytree` are 0.65 and 0.5 respectively.
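
In the last table, the rows for `colsample.bytree` = 0.5 and 0.55 are identical, and the row for 0.45 with `subsample` = 0.6 matches the earlier row for 0.4. One plausible explanation is that the number of columns sampled per tree is the fraction times the number of features, truncated to an integer. The sketch below assumes that truncation and assumes `train_data` has 10 feature columns; neither is stated in the notebook.

```r
n_features <- 10  # assumption: number of feature columns in train_data

# Number of columns each tree would see if the product is truncated to an integer.
fractions <- c(0.4, 0.45, 0.5, 0.55)
cols_per_tree <- floor(fractions * n_features)
data.frame(colsample.bytree = fractions, cols_per_tree = cols_per_tree)
```

Under these assumptions, a step of 0.05 cannot change the model when there are only 10 features, so a step of 0.1 is the finest useful resolution for this parameter.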

## 3 Tuning the regularization parameters `reg.alpha` and `reg.lambda`

In [35]:
xgb_params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    
    eta = 0.1,                     # start at 0.1
    max.depth = 5,
    min.child.weight = 4,
    gamma = 0.1,
    subsample = 0.65,
    colsample.bytree = 0.5,
    scale.pos.weight = 1           # classes are imbalanced; start at 1
)

In [36]:
reg.alpha = c(1e-5, 1e-2, 0.1, 1, 100)
reg.lambda = c(1e-5, 1e-2, 0.1, 1, 100)
to_tune = expand.grid(reg.alpha = reg.alpha, reg.lambda = reg.lambda)                   # one row per parameter combination

result = vector(mode = 'numeric', length = 4)                                           # placeholder first row: 2 score columns + 2 tuned parameters
result.names = c('test.auc.mean', 'test.auc.std', 'reg.alpha', 'reg.lambda')            # column names follow the tuned parameters
names(result) = result.names

for (i in seq(dim(to_tune)[1])) {
    xgb_params$reg.alpha = to_tune[i, 1]
    xgb_params$reg.lambda = to_tune[i, 2]                                               # set the parameters being tuned
    
    set.seed(27)
    xgb = xgb.cv(data = train_data,
                 label = train_label,
                 params = xgb_params,

                 nrounds = 77,                                                         # nrounds found in the previous step
                 nfold = 5,
                 stratified = TRUE,
                 verbose = FALSE,
                 prediction = TRUE                                                     # needed so that xgb$dt holds the per-round AUC mean and std
    )
    
    stats = as.data.frame(xgb$dt)
    stats_params = c(stats[nrow(xgb$dt), 3], stats[nrow(xgb$dt), 4], to_tune[i, 1], to_tune[i, 2])     # test AUC mean and std of the last round, plus the parameter values
    names(stats_params) = result.names
    
    result = rbind(result, stats_params)
}

result = as.data.frame(result)[-1,]
result

| | test.auc.mean | test.auc.std | reg.alpha | reg.lambda |
|---|---|---|---|---|
| stats_params | 0.855965 | 0.005795 | 1e-05 | 1e-05 |
| stats_params.1 | 0.855929 | 0.005869 | 0.01 | 1e-05 |
| stats_params.2 | 0.855818 | 0.005702 | 0.1 | 1e-05 |
| stats_params.3 | 0.856205 | 0.005623 | 1.0 | 1e-05 |
| stats_params.4 | 0.852955 | 0.006003 | 100.0 | 1e-05 |
| stats_params.5 | 0.855869 | 0.005565 | 1e-05 | 0.01 |
| stats_params.6 | 0.856067 | 0.005714 | 0.01 | 0.01 |
| stats_params.7 | 0.855873 | 0.005652 | 0.1 | 0.01 |
| stats_params.8 | 0.856202 | 0.005622 | 1.0 | 0.01 |
| stats_params.9 | 0.852955 | 0.006003 | 100.0 | 0.01 |


In [37]:
result[which.max(result$test.auc.mean),]

| | test.auc.mean | test.auc.std | reg.alpha | reg.lambda |
|---|---|---|---|---|
| stats_params.3 | 0.856205 | 0.005623 | 1 | 1e-05 |


The best values of `reg.alpha` and `reg.lambda` are 1 and 1e-05 respectively.

## 4 Lower the learning rate

In [38]:
xgb_params = list(
    objective = "binary:logistic", # binary classification
    eval_metric = "auc",           # evaluate with AUC
      
    # tuned values from the previous steps, with a lower learning rate
    eta = 0.01,                    # lowered to 0.01
    max.depth = 5,
    min.child.weight = 4,
    gamma = 0.1,
    subsample = 0.65,
    colsample_bytree = 0.5,
    scale.pos.weight = 1,
    reg.alpha = 1,
    reg.lambda = 1e-05
)

In [39]:
set.seed(27)
xgb = xgb.cv(data = train_data,
             label = train_label,
             params = xgb_params,
             nrounds = 10000,
             
             # fixed settings
             nfold = 5,                                                   # 5-fold cross-validation
             stratified = TRUE,                                           # stratified sampling because the classes are imbalanced
             verbose = TRUE,
             early.stop.round = 50
)

[0]	train-auc:0.771708+0.022965	test-auc:0.773125+0.021977
[1]	train-auc:0.836288+0.013674	test-auc:0.835571+0.011022
[2]	train-auc:0.845811+0.004724	test-auc:0.844420+0.005791
[3]	train-auc:0.846838+0.005489	test-auc:0.844786+0.004868
[4]	train-auc:0.848338+0.004375	test-auc:0.845884+0.003927
[5]	train-auc:0.849174+0.004460	test-auc:0.846711+0.004717
[6]	train-auc:0.851631+0.001489	test-auc:0.849149+0.005906
[7]	train-auc:0.852786+0.001471	test-auc:0.850241+0.005878
[8]	train-auc:0.852928+0.002109	test-auc:0.850035+0.005632
[9]	train-auc:0.852956+0.002102	test-auc:0.849678+0.006017
[10]	train-auc:0.853483+0.002041	test-auc:0.850144+0.005855
[11]	train-auc:0.853637+0.002176	test-auc:0.850182+0.005493
[12]	train-auc:0.853728+0.002096	test-auc:0.850243+0.005587
[13]	train-auc:0.853814+0.001873	test-auc:0.850042+0.005660
[14]	train-auc:0.853943+0.001789	test-auc:0.850231+0.005878
[15]	train-auc:0.853772+0.002173	test-auc:0.850042+0.005673
[16]	train-auc:0.854086+0.002000	test-auc:0.850348

### With `eta` = 0.01, the best `nrounds` is 733, and the final cross-validated AUC is 0.856639.
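
As a quick sanity check on these numbers, and not as a general law, the growth in the number of rounds can be compared with the reduction in the learning rate. The values are the ones reported in this walkthrough; the variable names are illustrative.

```r
# Values reported in this walkthrough.
eta_high <- 0.1
nrounds_high <- 77
eta_low <- 0.01
nrounds_low <- 733

eta_ratio <- eta_high / eta_low             # factor by which eta was reduced
rounds_ratio <- nrounds_low / nrounds_high  # factor by which nrounds grew
round(c(eta_ratio = eta_ratio, rounds_ratio = rounds_ratio), 2)
```

Here a tenfold reduction in `eta` required roughly nine and a half times as many rounds, which is why `nrounds` has to be re-tuned with `xgb.cv` whenever the learning rate changes.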