## [Assignment Focus]
Learn how to use the hyper-parameter search tools in scikit-learn to find the best hyperparameters.

### Assignment
Try several different datasets and use hyper-parameter search to see whether a better combination of hyperparameters can be found.

In [1]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [2]:
### Dataset: handwritten digits
digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=4)
clf = GradientBoostingClassifier()

clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Accuracy(before): ", metrics.accuracy_score(y_test, y_pred))

Accuracy(before):  0.9666666666666667


In [3]:
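# Define the hyperparameter combinations to search
### For reference, decision tree classifier defaults: criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1
### For reference, gradient boosting classifier defaults: loss="deviance", learning_rate=0.1, n_estimators=100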
n_estimators = [100, 150, 200, 250, 300]
max_depth = [1, 3, 5, 7, 9]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

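## Build the search object from the model and the parameter grid (n_jobs=-1 uses all CPU cores in parallel)
## NOTE: "neg_mean_squared_error" treats the class labels as numbers; for a classifier, scoring="accuracy" is usually the more meaningful choice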
grid_search = GridSearchCV(clf, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)

# This run uses 3-fold cross-validation over 5 x 5 = 25 parameter combinations, so 75 models are trained in total

Fitting 3 folds for each of 25 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:   38.4s finished
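
The fit count in the log can be checked without training anything. The sketch below uses `ParameterGrid` to enumerate the same grid; the fold count of 3 is taken from the log above, not from a library default. `GridSearchCV` also performs one final refit on the full training set, which the log does not count.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = dict(n_estimators=[100, 150, 200, 250, 300],
                  max_depth=[1, 3, 5, 7, 9])

n_candidates = len(ParameterGrid(param_grid))  # 5 * 5 combinations
n_folds = 3                                    # fold count reported in the log above
n_fits = n_candidates * n_folds                # one fit per candidate per fold

print("candidates:", n_candidates, "fits:", n_fits)
```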


In [4]:
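# Print the best score and the best parameters
# (with the scoring used above, the score is the negated MSE, not an accuracy)
print("Best score (neg. MSE): %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best score (neg. MSE): -0.926503 using {'max_depth': 3, 'n_estimators': 150}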
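
The best score is negative because the search scored a classifier with `neg_mean_squared_error`. A minimal sketch of the same kind of search with `scoring="accuracy"` follows. It is not the notebook's original run: the Wine dataset and the small grid `small_grid` are assumptions chosen only so that the example finishes in a few seconds.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Wine is used here only because it trains quickly (assumption for this sketch).
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4)

# Deliberately small, hypothetical grid to keep the example fast.
small_grid = dict(n_estimators=[20, 50], max_depth=[1, 3])

acc_search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                          small_grid, scoring="accuracy", cv=3)
acc_search.fit(x_train, y_train)

# With scoring="accuracy", best_score_ is a mean cross-validated accuracy in [0, 1].
print("Best CV accuracy: %f using %s"
      % (acc_search.best_score_, acc_search.best_params_))
```

With this setting the candidates are ranked by the same metric that is later reported on the test set.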


In [5]:
# Rebuild the model with the best parameters
clf_bestparam = GradientBoostingClassifier(max_depth=grid_result.best_params_['max_depth'],
                                           n_estimators=grid_result.best_params_['n_estimators'])
# Train the model
clf_bestparam.fit(x_train, y_train)
# Predict on the test set
y_pred = clf_bestparam.predict(x_test)
# Accuracy after tuning
print("Accuracy(after): ", metrics.accuracy_score(y_test, y_pred))

Accuracy(after):  0.9666666666666667
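
Rebuilding the model by hand is not strictly needed: with the default `refit=True`, `GridSearchCV` retrains the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below illustrates this on the Wine dataset with a small hypothetical grid (both are assumptions made to keep the run short), and also shows unpacking `best_params_` for a manual rebuild.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

wine = datasets.load_wine()  # small stand-in dataset (assumption)
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4)

small_grid = dict(n_estimators=[20, 50], max_depth=[1, 3])  # hypothetical grid
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      small_grid, scoring="accuracy", cv=3)
search.fit(x_train, y_train)

# refit=True (the default) has already retrained the best candidate on x_train.
best_model = search.best_estimator_
y_pred = best_model.predict(x_test)
print("Accuracy (best_estimator_):", metrics.accuracy_score(y_test, y_pred))

# Equivalent manual rebuild: unpack the whole best_params_ dict at once.
manual = GradientBoostingClassifier(random_state=0, **search.best_params_)
manual.fit(x_train, y_train)
```

Fixing `random_state` on the estimator keeps repeated runs comparable.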


In [6]:
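### Dataset: Boston housing
### NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older version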
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)
GB_R = GradientBoostingRegressor()
GB_R.fit(x_train, y_train)
y_pred = GB_R.predict(x_test)
print("MSE(before): ", metrics.mean_squared_error(y_test, y_pred))

# Define the hyperparameter combinations to search
n_estimators = [100, 150, 200, 250, 300]
max_depth = [1, 3, 5, 7, 9]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
grid_search = GridSearchCV(GB_R, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)
# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)
# This run uses 3-fold cross-validation over 5 x 5 = 25 parameter combinations, so 75 models are trained in total

MSE(before):  11.752551120010553
Fitting 3 folds for each of 25 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:    1.5s finished


In [7]:
# Print the best score and the best parameters
# (the score is the negated MSE, so values closer to 0 are better)
print("Best score (neg. MSE): %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Rebuild the model with the best parameters
GB_R_bestparam = GradientBoostingRegressor(max_depth=grid_result.best_params_['max_depth'],
                                           n_estimators=grid_result.best_params_['n_estimators'])
# Train the model
GB_R_bestparam.fit(x_train, y_train)
# Predict on the test set
y_pred = GB_R_bestparam.predict(x_test)
# MSE after tuning
print("MSE(after): ", metrics.mean_squared_error(y_test, y_pred))

Best score (neg. MSE): -10.877299 using {'max_depth': 3, 'n_estimators': 300}
MSE(after):  10.112537353275997
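
Since `load_boston` is no longer available in recent scikit-learn releases, the regression experiment can be reproduced in spirit on another built-in dataset. The sketch below swaps in `load_diabetes` as a stand-in (an assumption, not the notebook's data, so the numbers will differ) and negates `best_score_` to report a conventional positive MSE. The small grid is also hypothetical.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# load_diabetes stands in for the removed Boston dataset (assumption).
diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=4)

small_grid = dict(n_estimators=[50, 100], max_depth=[1, 3])  # hypothetical grid
reg_search = GridSearchCV(GradientBoostingRegressor(random_state=0), small_grid,
                          scoring="neg_mean_squared_error", cv=3)
reg_search.fit(x_train, y_train)

# scikit-learn maximizes scores, so MSE is stored negated; flip the sign to read it.
cv_mse = -reg_search.best_score_
test_mse = metrics.mean_squared_error(y_test, reg_search.predict(x_test))

print("Best CV MSE: %f using %s" % (cv_mse, reg_search.best_params_))
print("Test MSE: %f" % test_mse)
```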


In [8]:
### Dataset: Wine
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)
GB_C = GradientBoostingClassifier()
GB_C.fit(x_train, y_train)
y_pred = GB_C.predict(x_test)
print("Accuracy(before): ", metrics.accuracy_score(y_test, y_pred))

# Define the hyperparameter combinations to search
n_estimators = [100, 150, 200, 250, 300]
max_depth = [1, 3, 5, 7, 9]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
# NOTE: "neg_mean_squared_error" treats the class labels as numbers; scoring="accuracy" is usually more meaningful for a classifier
grid_search = GridSearchCV(GB_C, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)
# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)
# This run uses 3-fold cross-validation over 5 x 5 = 25 parameter combinations, so 75 models are trained in total

Accuracy(before):  0.9777777777777777
Fitting 3 folds for each of 25 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  60 out of  75 | elapsed:    2.3s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:    2.9s finished


In [9]:
# Print the best score and the best parameters
# (with the scoring used above, the score is the negated MSE, not an accuracy)
print("Best score (neg. MSE): %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Rebuild the model with the best parameters
GB_C_bestparam = GradientBoostingClassifier(max_depth=grid_result.best_params_['max_depth'],
                                           n_estimators=grid_result.best_params_['n_estimators'])
# Train the model
GB_C_bestparam.fit(x_train, y_train)
# Predict on the test set
y_pred = GB_C_bestparam.predict(x_test)
# Accuracy after tuning
print("Accuracy(after): ", metrics.accuracy_score(y_test, y_pred))

Best score (neg. MSE): -0.037594 using {'max_depth': 1, 'n_estimators': 200}
Accuracy(after):  0.9555555555555556
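
Here the tuned model scores lower on the test set than the default one. The Wine test split holds 45 samples, so the gap between 44/45 and 43/45 is a single prediction and may well be noise. One way to judge this is to look at the spread of the cross-validation scores in `cv_results_`. The sketch below does that with a small hypothetical grid and `scoring="accuracy"` (both assumptions, not the notebook's settings).

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4)

small_grid = dict(n_estimators=[20, 50], max_depth=[1, 3])  # hypothetical grid
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      small_grid, scoring="accuracy", cv=3)
search.fit(x_train, y_train)

# cv_results_ holds per-candidate mean and standard deviation across folds.
results = search.cv_results_
order = np.argsort(results["rank_test_score"])  # best-ranked candidate first
for i in order:
    print("%.4f +/- %.4f  %s" % (results["mean_test_score"][i],
                                 results["std_test_score"][i],
                                 results["params"][i]))
```

If the standard deviations overlap across candidates, the grid search has little evidence for preferring one setting over another on a dataset this small.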
