# 🦺 Randomized Grid Search

Grid search in sklearn can be optimized in two ways:
1. Adjusting the search space.
2. Adjusting the data used in each training run.

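Adjusting the search space: pick a subset of the parameter combinations to form a parameter subspace, and search only within that subspace. For example, if 5 candidate values for `n_estimators` and 6 candidate values for `max_depth` form a 5×6=30 grid, randomized search draws a random subset of those 30 combinations and evaluates only those. For the same full space it finishes faster than an exhaustive grid search; for the same number of training runs it can cover a larger space; and the minimum loss it finds is usually very close to the minimum loss found by grid search.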
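A minimal sketch of the idea, using only the standard library rather than sklearn: build the 5×6 grid from the example and draw a random subset of it. The candidate values and the subspace size of 10 are illustrative assumptions, not values from the text.

```python
import itertools
import random

# Hypothetical candidate values: 5 for n_estimators, 6 for max_depth
n_estimators_values = [20, 40, 60, 80, 100]
max_depth_values = [4, 6, 8, 10, 12, 14]

# Full grid: every combination of the two lists (5 x 6 = 30)
full_grid = [
    {"n_estimators": n, "max_depth": d}
    for n, d in itertools.product(n_estimators_values, max_depth_values)
]

# Subspace: 10 combinations drawn without replacement (size chosen for illustration)
rng = random.Random(22)
subspace = rng.sample(full_grid, k=10)

print(len(full_grid), len(subspace))
```

Only the combinations in `subspace` would be trained and scored; the remaining 20 are never evaluated.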

`sklearn.model_selection.RandomizedSearchCV()`

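| Name                  | Description                                                  |
| --------------------- | ------------------------------------------------------------ |
| `estimator`           | The estimator to tune                                        |
| `param_distributions` | The full parameter space, as a `dict` or a `list` of `dict`  |
| `n_iter`              | Number of iterations, i.e. the number of parameter combinations sampled |
| `scoring`             | Evaluation metric; multiple metrics are supported            |
| `n_jobs`              | Number of jobs to run in parallel                            |
| `refit`               | Selects the metric and best parameters used to retrain on the full dataset |
| `cv`                  | Number of cross-validation folds, or a cross-validation splitter |
| `verbose`             | Verbosity of the log output                                  |
| `pre_dispatch`        | Number of jobs dispatched during parallel execution          |
| `random_state`        | Random seed                                                  |
| `error_score`         | Value assigned to the score when a fit raises an error; with `'raise'` the error is raised and the search stops, otherwise a warning is shown and the search continues |
| `return_train_score`  | Whether the cross-validation results include training scores |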

In [1]:
# Load the California housing data
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
print(data.keys())

# Split the dataset
x = data['data']
y = data['target']
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=22)

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [2]:
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

In [3]:
def count_space(param_ranges):
    """Compute the size of the full parameter space."""
    space_size = 1
    for param_range in param_ranges.values():
        space_size *= len(param_range)
    return space_size

In [4]:
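# Build the parameter space
param_grid_sample = {'criterion': ['squared_error', 'poisson'],
                     'n_estimators': [*range(20, 100, 5)],
                     'max_depth': [*range(10, 25, 2)],
                     # The dataset has only 8 features, so 16, 32, and 64 exceed the feature count;
                     # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3
                     'max_features': ['log2', 'sqrt', 16, 32, 64, 'auto'],
                     # np.arange(0, 5, 10) has step 10, so it yields only [0]
                     'min_impurity_decrease': [*np.arange(0, 5, 10)]}

# Build the regressor and the cross-validation splitter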
reg = RandomForestRegressor(random_state=22, verbose=True, n_jobs=-1)
cv = KFold(n_splits=5, shuffle=True, random_state=22)

In [5]:
# Size of the full parameter space, i.e. the maximum number of combinations that can be sampled
count_space(param_grid_sample)

1536
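The value 1536 can be checked by hand as the product of the number of candidates per parameter. The sketch below uses only the standard library; `range(0, 5, 10)` stands in for `np.arange(0, 5, 10)`, which produces the same single value, and the `sizes` dictionary is an illustrative helper rather than part of the original notebook.

```python
import math

# Number of candidate values per parameter in param_grid_sample
sizes = {
    "criterion": 2,                                 # 'squared_error', 'poisson'
    "n_estimators": len(range(20, 100, 5)),         # 20, 25, ..., 95
    "max_depth": len(range(10, 25, 2)),             # 10, 12, ..., 24
    "max_features": 6,                              # six listed options
    "min_impurity_decrease": len(range(0, 5, 10)),  # step 10 overshoots, so only 0
}

# The full space is the Cartesian product, so its size is the product of the counts
space_size = math.prod(sizes.values())
print(sizes, space_size)
```

Because `min_impurity_decrease` contributes a single value, it does not enlarge the space; widening that range would multiply the total accordingly.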

In [6]:
# Create the randomized search estimator
search = RandomizedSearchCV(estimator=reg,
                            param_distributions=param_grid_sample,
                            n_iter=200,  # sample 200 of the 1536 combinations (about 13% of the full space)
                            scoring='neg_mean_squared_error',
                            verbose=True,
                            cv=cv,
                            random_state=22,
                            n_jobs=-1)
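As a quick check of the search budget, the sketch below computes what fraction of the full space `n_iter=200` covers and how many model fits 5-fold cross-validation implies. The variable names are illustrative.

```python
n_iter = 200          # sampled parameter combinations
space_size = 1536     # size of the full parameter space
n_splits = 5          # folds in the KFold splitter

coverage = n_iter / space_size   # fraction of the full space that is evaluated
total_fits = n_iter * n_splits   # one fit per candidate per fold

print(f"coverage: {coverage:.1%}, total fits: {total_fits}")
# prints: coverage: 13.0%, total fits: 1000
```

An exhaustive grid search over the same space with the same splitter would need 1536 × 5 = 7680 fits.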

In [7]:
# Fit the randomized search estimator
start = time.time()
search.fit(train_x, train_y)
print(time.time() - start)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.

194.0148138999939


[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:    0.3s finished


In [8]:
# Inspect the best parameters
print(f'Best parameters: {search.best_params_}\n'
      f'Best score (RMSE): {abs(search.best_score_) ** 0.5}')

Best parameters: {'n_estimators': 90, 'min_impurity_decrease': 0, 'max_features': 'log2', 'max_depth': 18, 'criterion': 'squared_error'}
Best score (RMSE): 0.4991504254535447
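With `scoring='neg_mean_squared_error'`, `best_score_` is the negated MSE averaged over the folds, which is why the cell above takes the absolute value before the square root. A small sketch of that conversion follows; the helper name `neg_mse_to_rmse` is hypothetical.

```python
import math

def neg_mse_to_rmse(neg_mse: float) -> float:
    """Convert a 'neg_mean_squared_error' score (always <= 0) to RMSE."""
    return math.sqrt(-neg_mse)

print(neg_mse_to_rmse(-0.25))  # 0.5
```

Note that this is the square root of the mean MSE across folds, which is not in general the same as the mean of the per-fold RMSE values.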


In [9]:
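# Rebuild the model from the tuned parameters and evaluate it on the test set.
# Note: n_estimators=85 and max_depth=24 differ from best_params_ printed above (90 and 18);
# pass **search.best_params_ to reproduce the selected configuration.
best_reg = RandomForestRegressor(n_estimators=85, max_depth=24, max_features='log2', min_impurity_decrease=0,
                                 criterion='squared_error')
best_reg.fit(train_x, train_y)
best_reg.score(test_x, test_y)  # R² on the held-out test set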

0.8223297122582989
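For a regressor, `score` returns the coefficient of determination R², not an error such as RMSE, so the value above is not directly comparable to the cross-validated RMSE. A minimal standard-library sketch of the R² formula, with made-up toy values and a hypothetical helper name:

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Toy values for illustration only
print(r2_manual([1, 2, 3], [1, 2, 3]))  # perfect predictions -> 1.0
print(r2_manual([1, 2, 3], [2, 2, 2]))  # always predicting the mean -> 0.0
```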