[Automated Model Tuning](https://www.kaggle.com/willkoehrsen/automated-model-tuning)

Thanks to Will Koehrsen for the great spirit of sharing.

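As usual, I am studying Will Koehrsen's code here. Kaggle's culture of sharing is wonderful. This rough translation of mine is admittedly poor and exists mainly to push myself to learn, so please read the original wherever you can.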


## About LightGBM

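Gradient boosting has recently become one of the top machine learning models and is especially well suited to structured data. The idea, briefly: in a random forest the sub-models are independent of each other and can be trained in parallel, whereas in a GBM each new sub-model learns from the errors of the models before it, and each added sub-model is chosen to reduce the model loss as much as possible.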
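
To make the "each new sub-model learns from the previous errors" idea concrete, here is a minimal sketch of boosting for 1-D regression with squared loss. It is not how LightGBM is implemented; the helper names `fit_stump` and `boost` and the toy data are my own assumptions, and each weak learner is just a one-split decision stump fitted to the current residuals.

```python
# Toy gradient boosting for 1-D regression with squared loss.
# Each round fits a decision stump to the residuals of the current model.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return the training MSE after each boosting round."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Add a shrunken copy of the new stump to the ensemble's predictions
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history


xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]
mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error shrinks round after round because every stump is fitted to what the ensemble still gets wrong, which is the contrast with the independently trained trees of a random forest.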

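Because the model structure is fairly complex, it has many parameters. The original kernel essentially searches over all of them directly, but I think the tuning approach in [a summary of GBDT tuning in scikit-learn (in Chinese)](http://www.cnblogs.com/pinard/p/6143927.html) is worth consulting, mainly for its different strategies for the boosting-framework parameters and the weak-learner parameters.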

[Parameters](http://lightgbm.apachecn.org/cn/latest/Parameters.html): the Chinese-language documentation explains the meaning of each parameter in detail.

## The four steps of tuning

1. Define the objective function
2. Define the search domain
3. Choose the tuning algorithm
4. Record the results
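
The four steps can be sketched end to end on a toy problem. This is only an illustration under my own assumptions: the quadratic objective, the names `sample_domain` and `random_search`, and the use of plain random search as a stand-in for the tuning algorithm are not from the original kernel.

```python
import random

# 1. Objective function: returns a loss to minimize (toy quadratic, minimum at x = 2).
def objective(params):
    return (params["x"] - 2.0) ** 2

# 2. Search domain: how to draw one candidate set of parameters.
def sample_domain(rng):
    return {"x": rng.uniform(-10.0, 10.0)}

# 3. Tuning algorithm: plain random search as a stand-in.
def random_search(objective, sample_domain, max_evals, seed=0):
    rng = random.Random(seed)
    trials = []
    for iteration in range(1, max_evals + 1):
        params = sample_domain(rng)
        trials.append({"iteration": iteration, "params": params,
                       "loss": objective(params)})
    return trials

# 4. Results: keep every trial and report the best one.
trials = random_search(objective, sample_domain, max_evals=200)
best_trial = min(trials, key=lambda t: t["loss"])
print(best_trial)
```

Swapping step 3 for a Bayesian method changes only how the next candidate is chosen; the objective, the domain and the record of trials keep the same shape.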

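## A first look at Bayesian optimization

Bayesian optimization is an approach to tuning distinct from grid search and random search. The key difference is that it is an informed method: it uses the results of the settings already tried to decide which set of parameters to try next. The benefit is that more trials are spent in the regions most likely to contain the best result.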

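## About Hyperopt

Many open-source libraries support Bayesian optimization, for example Spearmint (based on Gaussian processes) and SMAC (based on random forests). Hyperopt uses the Tree-structured Parzen Estimator (TPE) algorithm to build a surrogate function that selects the next set of parameters. The author chose it because its documentation is fairly complete. I wanted to get it working first, so I have not dug into the algorithmic details; I will look at them when I have time.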

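## Starting the experiment

Basic experimental setup

- 5-fold cross validation
- Early stopping after 100 rounds
- 10,000 samples are used as the training set and 6,000 as the test set, which allows fast iteration while learning the tuning method; at the end we check whether the tuned parameters transfer to the full dataset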
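
The early stopping rule in this setup can be sketched in a few lines. This is my own simplified version of the idea, not LightGBM's code: `best_iteration` is a hypothetical helper, the scores are made-up numbers, and a patience of 3 is used instead of 100 so the example stays short.

```python
def best_iteration(scores, patience):
    """Return (best_round, best_score) for a higher-is-better metric.

    Training stops once `patience` rounds pass without a new best score.
    Rounds are numbered from 1.
    """
    best_round, best_score = 0, float("-inf")
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_number, score
        elif round_number - best_round >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_round, best_score


# Made-up validation AUC per boosting round
validation_auc = [0.70, 0.72, 0.74, 0.73, 0.735, 0.739, 0.80]
print(best_iteration(validation_auc, patience=3))  # (3, 0.74)
```

With a patience of 3 the search stops at round 6 and never sees the later 0.80, which is why a generous value such as 100 is used together with a large cap on the number of boosting rounds.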











In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Modeling
import lightgbm as lgb

# Evaluation of the model
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import roc_auc_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['font.size'] = 18
%matplotlib inline

# Governing choices for search
N_FOLDS = 5
MAX_EVALS = 5

In [None]:
features = pd.read_csv('../input/home-credit-default-risk/application_train.csv')

# Sample 16000 rows (10000 for training, 6000 for testing)
features = features.sample(n = 16000, random_state = 42)

# Only numeric features
features = features.select_dtypes('number')

# Extract the labels
labels = np.array(features['TARGET'].astype(np.int32)).reshape((-1, ))
features = features.drop(columns = ['TARGET', 'SK_ID_CURR'])

# Split into training and testing data
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 6000, random_state = 42)

print('Train shape: ', train_features.shape)
print('Test shape: ', test_features.shape)

train_features.head()

In [None]:
model = lgb.LGBMClassifier(random_state=50)

# Training set
train_set = lgb.Dataset(train_features, label = train_labels)
test_set = lgb.Dataset(test_features, label = test_labels)

#### 1 Train a baseline model with the default parameters, for comparison with the tuned model

In [None]:
# Default hyperparameters
hyperparameters = model.get_params()

# Using early stopping to determine number of estimators.
del hyperparameters['n_estimators']

# Perform cross validation with early stopping
cv_results = lgb.cv(hyperparameters, train_set, num_boost_round = 10000, nfold = N_FOLDS, metrics = 'auc', 
           early_stopping_rounds = 100, verbose_eval = False, seed = 42)

# Highest score
best = cv_results['auc-mean'][-1]

# Standard deviation of best score
best_std = cv_results['auc-stdv'][-1]

print('The maximum ROC AUC in cross validation was {:.5f} with std of {:.5f}.'.format(best, best_std))
print('The ideal number of iterations was {}.'.format(len(cv_results['auc-mean'])))

In [None]:
model.n_estimators = len(cv_results['auc-mean'])

model.fit(train_features, train_labels)
preds = model.predict_proba(test_features)[:, 1]
baseline_auc = roc_auc_score(test_labels, preds)

print('The baseline model scores {:.5f} ROC AUC on the test set.'.format(baseline_auc))

#### 2 Build the objective function

In [None]:
import csv
from hyperopt import STATUS_OK
from timeit import default_timer as timer

def objective(hyperparameters):
    """Objective function for Gradient Boosting Machine Hyperparameter Optimization.
       Writes a new line to `OUT_FILE` on every iteration"""
    
    # Keep track of how many evaluations have been run
    global ITERATION
    
    ITERATION += 1
    
    # Use early stopping to determine the number of weak learners
    if 'n_estimators' in hyperparameters:
        del hyperparameters['n_estimators']
    
    # Each boosting type needs its own subsample setting, so the search space nests them;
    # flatten the nested dict back into top-level parameters
    subsample = hyperparameters['boosting_type'].get('subsample', 1.0)
    hyperparameters['boosting_type'] = hyperparameters['boosting_type']['boosting_type']
    hyperparameters['subsample'] = subsample
    
    # Make sure the integer parameters really are integers
    for parameter_name in ['num_leaves', 'subsample_for_bin', 'min_child_samples']:
        hyperparameters[parameter_name] = int(hyperparameters[parameter_name])
        
    start = timer()
    
    cv_results = lgb.cv(hyperparameters, train_set, num_boost_round = 10000, nfold = N_FOLDS, 
                        early_stopping_rounds = 100, metrics = 'auc', seed = 50)

    run_time = timer() - start
    
    best_score = cv_results['auc-mean'][-1]
    
    # Hyperopt minimizes the loss, so use 1 - AUC (smaller is better)
    loss = 1 - best_score
    
    # Number of boosting rounds that gave the best score
    n_estimators = len(cv_results['auc-mean'])
    
    hyperparameters['n_estimators'] = n_estimators

    # Write to the csv file ('a' means append)
    of_connection = open(OUT_FILE, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, hyperparameters, ITERATION, run_time, best_score])
    of_connection.close()
    
    # Dictionary with information for evaluation
    return {'loss': loss, 'hyperparameters': hyperparameters, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

#### 2 Define the search domain

In [None]:
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

In [None]:
learning_rate = {'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))}

- Log-uniform distribution

In [None]:
learning_rate_dist = []

# Draw 10000 samples from the learning rate domain
for _ in range(10000):
    learning_rate_dist.append(sample(learning_rate)['learning_rate'])
    
plt.figure(figsize = (8, 6))
sns.kdeplot(learning_rate_dist, color = 'red', linewidth = 2, shade = True);
plt.title('Learning Rate Distribution', size = 18); plt.xlabel('Learning Rate', size = 16); plt.ylabel('Density', size = 16);

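The plot above shows the distribution of 10,000 sampled learning rates. Because the distribution is log-uniform, the samples are denser at smaller values.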
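
Why the samples pile up at small values can be checked without Hyperopt. The sketch below is my own standard-library illustration of a log-uniform draw (the helper `sample_loguniform` is hypothetical): the logarithm of the value is uniform, so half of the draws fall below the geometric mean of the bounds, which sits close to the lower end of the range.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high = 0.005, 0.2
draws = [sample_loguniform(low, high, rng) for _ in range(10000)]

# Midpoint on the log scale: sqrt(0.005 * 0.2), roughly 0.0316
geometric_mean = math.sqrt(low * high)
share_below = sum(d < geometric_mean for d in draws) / len(draws)
print(round(geometric_mean, 4), round(share_below, 2))
```

About half of the draws land in the narrow interval from 0.005 to roughly 0.0316, and the other half are spread over the much wider interval up to 0.2.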

- Uniform distribution (quantized to integers with `hp.quniform`)

In [None]:
num_leaves = {'num_leaves': hp.quniform('num_leaves', 30, 150, 1)}
num_leaves_dist = []

for _ in range(10000):
    num_leaves_dist.append(sample(num_leaves)['num_leaves'])
    
plt.figure(figsize = (8, 6))
sns.kdeplot(num_leaves_dist, linewidth = 2, shade = True);
plt.title('Number of Leaves Distribution', size = 18); plt.xlabel('Number of Leaves', size = 16); plt.ylabel('Density', size = 16);

- Handling parameters that depend on each other

When boosting_type is set to goss, subsample can only be 1, so the two parameters have to be defined together.

In [None]:
boosting_type = {'boosting_type': hp.choice('boosting_type', 
                                            [{'boosting_type': 'gbdt', 'subsample': hp.uniform('subsample', 0.5, 1)}, 
                                             {'boosting_type': 'dart', 'subsample': hp.uniform('subsample', 0.5, 1)},
                                             {'boosting_type': 'goss', 'subsample': 1.0}])}

hyperparams = sample(boosting_type)
hyperparams
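
A sample drawn from such a nested choice is a dict inside a dict, which has to be flattened before the parameters can be passed to LightGBM. A minimal standalone sketch of that flattening, assuming a hand-written sample in place of a real Hyperopt draw (the helper `flatten_boosting_type` is my own name):

```python
def flatten_boosting_type(sampled):
    """Return a copy with the nested 'boosting_type' dict lifted to the top level."""
    flat = dict(sampled)  # shallow copy so the input is left untouched
    nested = flat["boosting_type"]
    flat["boosting_type"] = nested["boosting_type"]
    flat["subsample"] = nested.get("subsample", 1.0)
    return flat


# Hand-written stand-in for one draw from the nested search space
sampled = {"boosting_type": {"boosting_type": "goss", "subsample": 1.0},
           "num_leaves": 64.0}
print(flatten_boosting_type(sampled))
# {'boosting_type': 'goss', 'num_leaves': 64.0, 'subsample': 1.0}
```

Working on a copy keeps the sampled dict intact, which is convenient when the same sample is also written to a log.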

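- The full search domain for the remaining parameters

[The official documentation on parameter expressions](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions) lists the other ways to define a search domain; the two shown above are simply the most commonly used.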

In [None]:
space = {
    'boosting_type': hp.choice('boosting_type', 
                                            [{'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1)}, 
                                             {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},
                                             {'boosting_type': 'goss', 'subsample': 1.0}]),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.5)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0),
    'is_unbalance': hp.choice('is_unbalance', [True, False]),
}

**Sampling from the search domain**

In [None]:
x = sample(space)

# The interdependent parameters were nested above, so flatten them here to put every parameter at the same level
subsample = x['boosting_type'].get('subsample', 1.0)
x['boosting_type'] = x['boosting_type']['boosting_type']
x['subsample'] = subsample

x

In [None]:
x = sample(space)
subsample = x['boosting_type'].get('subsample', 1.0)
x['boosting_type'] = x['boosting_type']['boosting_type']
x['subsample'] = subsample
x

In [None]:
# Create a file to store the tuning results
OUT_FILE = 'bayes_test.csv'
of_connection = open(OUT_FILE, 'w')
writer = csv.writer(of_connection)

ITERATION = 0

# Write the column names
headers = ['loss', 'hyperparameters', 'iteration', 'runtime', 'score']
writer.writerow(headers)
of_connection.close()

# Test the objective function
results = objective(sample(space))
print('The cross validation loss = {:.5f}.'.format(results['loss']))
print('The optimal number of estimators was {}.'.format(results['hyperparameters']['n_estimators']))

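#### 3 Choosing the optimization algorithm

Choosing a tuning algorithm really means choosing how to build a model of the objective function. Given that model, the algorithm can work out which parameters to try next, namely the ones it rates as most likely to improve the result.

Below are two documents the author recommends; I have not read them yet.

[Technical details](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)

[A conceptual explanation](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f), a blog post by the author of this kernel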
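
A rough sketch of how a TPE-style algorithm picks the next point, for a single continuous parameter. This is a heavily simplified illustration under my own assumptions, not Hyperopt's implementation: the fixed bandwidth, the `gamma` quantile, and the names `kde`, `suggest` and `loss` are all hypothetical. Past trials are split into a "good" and a "bad" group by loss, a density is estimated for each group, candidates are drawn from the good density, and the candidate with the highest ratio of good to bad density is tried next.

```python
import math
import random

def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate over `points`."""
    def density(x):
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return total / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return density

def suggest(history, rng, low, high, gamma=0.25, n_candidates=50, bandwidth=1.0):
    """Pick the next x from a list of (x, loss) pairs using a density ratio."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # lowest losses so far
    bad = [x for x, _ in ranked[n_good:]]    # everything else
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from the "good" density, clipped to the search bounds
    candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                  for _ in range(n_candidates)]
    # Keep the candidate that looks most like a good point and least like a bad one
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

def loss(x):
    return (x - 2.0) ** 2  # toy objective with its minimum at x = 2

rng = random.Random(0)
low, high = -10.0, 10.0
# Start with a few random trials, then let the density ratio guide the search
history = [(x, loss(x)) for x in (rng.uniform(low, high) for _ in range(10))]
initial_best = min(pair[1] for pair in history)
for _ in range(40):
    x = suggest(history, rng, low, high)
    history.append((x, loss(x)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(best_x, best_loss)
```

The first ten trials are purely random, which matches the observation that an informed search needs some history before it can behave differently from random search.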

In [None]:
from hyperopt import tpe

# Create the algorithm
tpe_algorithm = tpe.suggest

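#### 4 Recording the results

Hyperopt provides a tool for recording results (the `Trials` object), but writing our own log as well makes it easy to monitor progress in real time.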

In [None]:
from hyperopt import Trials

# Record results
trials = Trials()

In [None]:
OUT_FILE = 'bayes_test.csv'
of_connection = open(OUT_FILE, 'w')
writer = csv.writer(of_connection)

ITERATION = 0

headers = ['loss', 'hyperparameters', 'iteration', 'runtime', 'score']
writer.writerow(headers)
of_connection.close()

## Tuning in practice

In [None]:
from hyperopt import fmin

In [None]:
global  ITERATION

ITERATION = 0

# This call runs the actual tuning process
best = fmin(fn = objective, space = space, algo = tpe.suggest, trials = trials,
            max_evals = MAX_EVALS)

best

Running this cell may raise a minor problem; see [this post](https://blog.csdn.net/FontThrone/article/details/79012616) for a fix.

In [None]:
results = pd.read_csv(OUT_FILE)

In [None]:
import ast

def evaluate(results, name):
    """Evaluate the best hyperparameters found by a search.
    Returns the results logged to csv as a structured dataframe, which makes it
    easy to analyse the distribution of each hyperparameter afterwards."""
    
    new_results = results.copy()
    # Use ast.literal_eval to convert str -> dict
    new_results['hyperparameters'] = new_results['hyperparameters'].map(ast.literal_eval)
    
    # Sort with best values on top
    new_results = new_results.sort_values('score', ascending = False).reset_index(drop = True)
    
    # Print the highest score
    print('The highest cross validation score from {} was {:.5f} found on iteration {}.'.format(name, new_results.loc[0, 'score'], new_results.loc[0, 'iteration']))
    
    # Train a model with the best hyperparameters and report its test score
    hyperparameters = new_results.loc[0, 'hyperparameters']
    model = lgb.LGBMClassifier(**hyperparameters)
    model.fit(train_features, train_labels)
    preds = model.predict_proba(test_features)[:, 1]
    
    print('ROC AUC from {} on test data = {:.5f}.'.format(name, roc_auc_score(test_labels, preds)))
    
    # Convert the hyperparameter dicts into a dataframe, one row per trial
    # (built in one step because DataFrame.append was removed in pandas 2.0)
    hyp_df = pd.DataFrame(list(new_results['hyperparameters']))
        
    # Add the iteration and score columns
    hyp_df['iteration'] = new_results['iteration']
    hyp_df['score'] = new_results['score']
    
    return hyp_df

ast.literal_eval is a safer alternative to eval: it parses a string into a Python object but only accepts literals, so it cannot execute arbitrary code.
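
A small check of that difference, using a made-up row in the same shape as the `hyperparameters` column of the csv log:

```python
import ast

# A dict that was written to csv comes back as a plain string
row = "{'boosting_type': 'gbdt', 'num_leaves': 64, 'subsample': 0.8}"
params = ast.literal_eval(row)
print(type(params).__name__, params["num_leaves"])  # dict 64

# Anything that is not a plain literal is rejected instead of executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

With `eval`, the second string would have been executed as code, which is the risk when a log file can be edited by someone else.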

In [None]:
bayes_results = evaluate(results, name = 'Bayesian')
bayes_results

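## Exploring the search results

The author visualizes 1,000 trials each from Bayesian optimization and random search to compare the behaviour of the two tuning methods more directly.

- Parameter distributions, to see where the sampled values concentrate
- Values over time, to see how the sampled values change as the search iterates

The original author draws these two kinds of plot separately. I think showing both for the same parameter together is more interesting, so I rearranged the plots using the author's ideas and code.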

In [None]:
bayes_results = pd.read_csv('../input/home-credit-model-tuning/bayesian_trials_1000.csv').sort_values('score', ascending = False).reset_index()
random_results = pd.read_csv('../input/home-credit-model-tuning/random_search_trials_1000.csv').sort_values('score', ascending = False).reset_index()
random_results['loss'] = 1 - random_results['score']

bayes_params = evaluate(bayes_results, name = 'Bayesian')
random_params = evaluate(random_results, name = 'random')

In [None]:
# Dataframe of just scores
scores = pd.DataFrame({'ROC AUC': random_params['score'], 'iteration': random_params['iteration'], 'search': 'Random'})
scores = pd.concat([scores, pd.DataFrame({'ROC AUC': bayes_params['score'], 'iteration': bayes_params['iteration'], 'search': 'Bayesian'})])

scores['ROC AUC'] = scores['ROC AUC'].astype(np.float32)
scores['iteration'] = scores['iteration'].astype(np.int32)

scores.head()

In [None]:
best_random_params = random_params.iloc[random_params['score'].idxmax(), :].copy()
best_bayes_params = bayes_params.iloc[bayes_params['score'].idxmax(), :].copy()

In [None]:
best_random_params

In [None]:
# Plot of scores over the course of searching
sns.lmplot(x = 'iteration', y = 'ROC AUC', hue = 'search', data = scores, height = 8);
plt.scatter(best_bayes_params['iteration'], best_bayes_params['score'], marker = '*', s = 400, c = 'orange', edgecolor = 'k')
plt.scatter(best_random_params['iteration'], best_random_params['score'], marker = '*', s = 400, c = 'blue', edgecolor = 'k')
plt.xlabel('Iteration'); plt.ylabel('ROC AUC'); plt.title("Validation ROC AUC versus Iteration");

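This plot shows that random search actually works very well: with limited compute it finds good parameters within a small number of iterations. Because it is random, however, its scores do not clearly improve as the number of iterations grows.

Bayesian optimization shows a steady improvement in score. In theory it should need fewer searches to reach a good result, but in practice it still needs a certain number of iterations before the advantage shows.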
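
A complementary way to compare two searches is to track the best score seen so far after each iteration. A minimal sketch with made-up scores (these numbers are purely illustrative and are not results from the kernel; `running_best` is my own helper):

```python
from itertools import accumulate

def running_best(scores):
    """Best score seen so far after each iteration (higher is better)."""
    return list(accumulate(scores, max))


# Made-up scores in iteration order, for illustration only
scores_by_iteration = [0.71, 0.74, 0.73, 0.72, 0.76, 0.75]
print(running_best(scores_by_iteration))
# [0.71, 0.74, 0.74, 0.74, 0.76, 0.76]
```

Applied to the `score` column of each search, sorted by `iteration`, this gives one curve per method showing how quickly it reaches its best result.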

In [None]:
fig, axs = plt.subplots(3, 1, figsize = (24, 22))

# Panel 1: distribution of the hyperparameter
hyper = 'learning_rate'

sns.kdeplot(learning_rate_dist, label='Sampling Distribution', linewidth=4, ax=axs[0])
sns.kdeplot(random_params['learning_rate'], label='Random Search', linewidth=4, ax=axs[0])
sns.kdeplot(bayes_params['learning_rate'], label='Bayes Optimization', linewidth=4, ax=axs[0])
# Mark the best value from each search (Bayesian first, then random)
axs[0].vlines(x=[best_bayes_params['learning_rate'], best_random_params['learning_rate']], ymin=0.0, 
              ymax=20.0, linestyles='--', linewidth=4, colors=['orange', 'green'])
axs[0].set(xlabel='Learning Rate', ylabel='Density', title = 'Bayes Optimization Search vs Random Search');

# Panel 2: Bayesian optimization values over iterations
sns.regplot(x = 'iteration', y = hyper, data = bayes_params, ax = axs[1])
axs[1].scatter(best_bayes_params['iteration'], best_bayes_params[hyper], marker = '*', s = 200, c = 'k')
axs[1].set(xlabel = 'Iteration', ylabel = '{}'.format(hyper), title = 'Bayes Optimization');

# Panel 3: random search values over iterations
sns.regplot(x = 'iteration', y = hyper, data=random_params, ax=axs[2])
axs[2].scatter(best_random_params['iteration'], best_random_params[hyper], marker='*', s=300, c='k')
axs[2].set(xlabel='Iteration', ylabel='{}'.format(hyper), title='Random Search')

plt.show()

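In roughly the first 40% of the search, Bayesian optimization behaves much like random search; only later does it concentrate on the region likely to score higher.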

In [None]:
hyper_list = ['colsample_bytree', 
              'min_child_samples', 
              'num_leaves',
              'reg_alpha',
              'reg_lambda',
              'subsample_for_bin']

vline_heights = [10, 0.01, 0.012, 2.6, 1.7, 0.000007]

for hyper, vheight in zip(hyper_list, vline_heights):
        
    fig, axs = plt.subplots(3, 1, figsize = (24, 22))

    # Panel 1: distribution of the hyperparameter
    sns.kdeplot([sample(space[hyper]) for _ in range(1000)], label = 'Sampling Distribution', linewidth = 4, ax=axs[0])
    sns.kdeplot(random_params[hyper], label='Random Search', linewidth=4, ax=axs[0])
    sns.kdeplot(bayes_params[hyper], label='Bayes Optimization', linewidth=4, ax=axs[0])
    axs[0].vlines(x=[best_bayes_params[hyper],best_random_params[hyper]], ymin=0.0, 
                  ymax=vheight, linestyles='--', linewidth=4, colors=['orange', 'green'])
    axs[0].set(xlabel=hyper, ylabel='density', title = 'Bayes Optimization Search vs Random Search');

    # Panel 2: Bayesian optimization values over iterations
    sns.regplot(x = 'iteration', y = hyper, data = bayes_params, ax = axs[1])
    axs[1].scatter(best_bayes_params['iteration'], best_bayes_params[hyper], marker = '*', s = 200, c = 'k')
    axs[1].set(xlabel = 'Iteration', ylabel = '{}'.format(hyper), title = 'Bayes Optimization');

    # Panel 3: random search values over iterations
    sns.regplot(x = 'iteration', y = hyper, data=random_params, ax=axs[2])
    axs[2].scatter(best_random_params['iteration'], best_random_params[hyper], marker='*', s=300, c='k')
    axs[2].set(xlabel='Iteration', ylabel='{}'.format(hyper), title='Random Search')

    plt.show()


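Unlike the original, I tried showing each parameter's probability distribution and its values over time together, which may be a little clearer than the original.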

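## Follow-up

Some further explorations that I do not list here:
- For non-numeric parameters, the values also show a concentration trend during Bayesian optimization, which could even guide grid search and random search
- Random search did better on the test set, while Bayesian optimization did better on the validation set
- Transferred to the full dataset, Bayesian optimization actually reached the LB score of 0.792 given by <https://www.kaggle.com/jsaguiar/updated-0-792-lb-lightgbm-with-simple-features>, so the transfer works surprisingly well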

## Conclusion

- Bayesian optimization generalizes better on the test set
- Compared with grid search and random search, Bayesian optimization needs fewer iterations


