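## [Assignment Focus]
Learn how to use the hyper-parameter search tools in Sklearn to find the best hyperparameters.

### Assignment
Pick a different dataset and apply hyper-parameter search to see whether it can find the best combination of hyperparameters.

??? Ugh, that is all the assignment says! Suggesting a dataset would have been more helpful.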

In [1]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
# Try a regression dataset
# diabetes
diabetes = datasets.load_diabetes()
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.25)
diabetes.data.shape

(442, 10)

In [3]:
# Build the model
clf = GradientBoostingRegressor()
# Fit with the default parameters
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(metrics.mean_squared_error(y_test, y_pred))

3313.532244969265


In [4]:
# Define the hyperparameter combinations to search
n_estimators = [100, 200, 300] # default=100
max_features = [1, 3, None] # default=None
max_depth = [1, 3, 5] # default=3

param_grid = dict(n_estimators=n_estimators, max_features=max_features, max_depth=max_depth)

# Create the search object from the model and the parameter grid (n_jobs=-1 uses all CPU cores in parallel)
grid_search = GridSearchCV(clf, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:    4.4s finished
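
The "27 candidates, totalling 81 fits" in the log above follows from simple counting: the grid has 3 × 3 × 3 combinations, and each one is fit once per fold. A minimal sketch of that arithmetic using `ParameterGrid`; the fold count of 3 is read off the log (scikit-learn 0.22 and later default to 5 folds instead), and `n_folds` is only an illustrative variable name.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_features": [1, 3, None],
    "max_depth": [1, 3, 5],
}

# ParameterGrid enumerates the Cartesian product of all value lists
n_candidates = len(ParameterGrid(param_grid))

n_folds = 3  # assumption: the fold count reported in the log above
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```

With `cv=5`, the same grid would need 27 × 5 = 135 fits, so the search cost grows quickly with both the grid size and the fold count.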


In [5]:
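# Print the best score and the best parameters
# (the score is the negated MSE because scoring="neg_mean_squared_error", not an accuracy)
print("Best score (neg. MSE): %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best score (neg. MSE): -3395.777636 using {'max_depth': 3, 'max_features': 1, 'n_estimators': 200}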
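
The best score above is negative because `scoring="neg_mean_squared_error"` negates the MSE so that "larger is better" holds for every scorer. A small sketch of turning it back into an error figure; the number is copied from the output above, and `cv_mse` / `cv_rmse` are illustrative names.

```python
import math

best_score = -3395.777636  # best_score_ copied from the output above

cv_mse = -best_score          # undo the sign flip to get the mean cross-validated MSE
cv_rmse = math.sqrt(cv_mse)   # RMSE is in the same units as the target

print(f"CV MSE: {cv_mse:.2f}, CV RMSE: {cv_rmse:.2f}")
```

Note that this is a cross-validated score on the training split, so it is not directly comparable to the test-set MSE printed in the other cells.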


In [6]:
# Rebuild the model with the best parameters
clf_bestparam = GradientBoostingRegressor(max_depth=grid_result.best_params_['max_depth'],
                                          max_features=grid_result.best_params_['max_features'],
                                          n_estimators=grid_result.best_params_['n_estimators'])
# Train the model
clf_bestparam.fit(x_train, y_train)
# Predict on the test set
y_pred = clf_bestparam.predict(x_test)
print(metrics.mean_squared_error(y_test, y_pred))

2864.5714978025826
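
Rebuilding the model by hand works, but with the default `refit=True` the search object already holds a model refit on the whole training set in `best_estimator_`. The sketch below compares the two routes on a deliberately small grid. The fixed `random_state=0`, the reduced grid, and names such as `small_grid` and `search` are assumptions made to keep the example quick and reproducible; they are not settings from the cells above.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Small illustrative grid (assumption) so the example runs in a few seconds
small_grid = {"n_estimators": [50, 100], "max_depth": [1, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), small_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(x_train, y_train)

# Route 1: best_estimator_ is already refit on all of x_train
best_model = search.best_estimator_
mse_refit = metrics.mean_squared_error(y_test, best_model.predict(x_test))

# Route 2: rebuild manually, unpacking best_params_ instead of indexing each key
manual = GradientBoostingRegressor(random_state=0, **search.best_params_)
manual.fit(x_train, y_train)
mse_manual = metrics.mean_squared_error(y_test, manual.predict(x_test))

print(mse_refit, mse_manual)
```

With a fixed seed the two routes should give the same test MSE. Without one, a manual rebuild can differ slightly from the refit model.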


In [7]:
# Next, try an example adapted from the scikit-learn gallery (https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html)
# use a full grid over all parameters
from time import time
#diabetes = datasets.load_diabetes()
#X, y = diabetes.data, diabetes.target
#clf = GradientBoostingRegressor()

param_grid = {"max_depth": [1, 3, 5],
              "max_features": [1, 3, None],
              "n_estimators": [100, 200, 300]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5)
start = time()
grid_search.fit(x_train, y_train)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
print("Best score: %f using %s" % (grid_search.best_score_, grid_search.best_params_))

GridSearchCV took 9.22 seconds for 27 candidate parameter settings.
Best score: 0.408937 using {'max_depth': 1, 'max_features': 3, 'n_estimators': 200}
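
The best score here is about 0.41 rather than a large negative number because no `scoring` argument was passed: the search then falls back to the estimator's own `score` method, which is R² for regressors. A minimal sketch of that default, shown with `cross_val_score`, which resolves `scoring=None` the same way. The `KFold` settings, `n_estimators=50`, and `random_state=0` are assumptions chosen only to make the comparison deterministic.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_diabetes(return_X_y=True)
model = GradientBoostingRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both runs

# scoring=None -> uses model.score, i.e. R^2 for a regressor
default_scores = cross_val_score(model, X, y, cv=cv)
# explicit R^2 scorer for comparison
r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(np.allclose(default_scores, r2_scores))
```

This is also why the scores from this cell and from the earlier `neg_mean_squared_error` search cannot be compared directly: they are measured on different scales.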


In [8]:
# Continuing from the previous cell
# Compare with randomized search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

# specify parameters and distributions to sample from
param_dist = {"max_depth": [1, 3, 5],
              "max_features": sp_randint(1, 11),
              "n_estimators": [100, 200, 300]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(x_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
print("Best score: %f using %s" % (random_search.best_score_, random_search.best_params_))

RandomizedSearchCV took 7.24 seconds for 20 candidate parameter settings.
Best score: 0.408634 using {'max_depth': 1, 'max_features': 3, 'n_estimators': 100}
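
To see what the randomized search actually draws, the same distributions can be passed to `ParameterSampler`. In the sketch below, `sp_randint(1, 11)` yields integers from 1 to 10 (the upper bound is exclusive), which matches the 10 features of the diabetes data. The `random_state=0` and the names `samples` and `feature_values` are assumptions for illustration.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Same search space as in the cell above
param_dist = {"max_depth": [1, 3, 5],
              "max_features": sp_randint(1, 11),  # integers 1..10, upper bound excluded
              "n_estimators": [100, 200, 300]}

# Draw 20 candidate settings, as n_iter_search=20 does in the search above
samples = list(ParameterSampler(param_dist, n_iter=20, random_state=0))

# Collect the distinct max_features values that were drawn
feature_values = sorted({int(s["max_features"]) for s in samples})
print(len(samples), feature_values)
```

Unlike the grid, which only tries `max_features` in `[1, 3, None]`, the sampler can land on any value in the range, at the cost of not covering every combination.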


In [9]:
# Rebuild the model with the best parameters
clf_bestparam = GradientBoostingRegressor(max_depth=random_search.best_params_['max_depth'],
                                          max_features=random_search.best_params_['max_features'],
                                          n_estimators=random_search.best_params_['n_estimators'])
# Train the model
clf_bestparam.fit(x_train, y_train)
# Predict on the test set
y_pred = clf_bestparam.predict(x_test)
print(metrics.mean_squared_error(y_test, y_pred))

2662.554459997909
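
When comparing the test MSE values printed in this notebook, keep in mind that none of the cells fix a `random_state`, and gradient boosting with `max_features` smaller than the number of features picks features at random at each split. Part of the difference between runs may therefore be noise. A rough sketch of how to gauge that spread, assuming the parameters found by the randomized search above. The split seed, the five model seeds, and the names `params` and `mses` are illustrative choices.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # fixed split (assumption)

# Best parameters reported by the randomized search above
params = {"max_depth": 1, "max_features": 3, "n_estimators": 100}

mses = []
for seed in range(5):
    # Only the model seed changes; data and hyperparameters stay fixed
    model = GradientBoostingRegressor(random_state=seed, **params)
    model.fit(x_train, y_train)
    mses.append(metrics.mean_squared_error(y_test, model.predict(x_test)))

print(np.round(mses, 1), "std:", round(float(np.std(mses)), 1))
```

If the spread across seeds is of the same order as the gap between two configurations, a single test-set number is not enough to call one of them better.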
