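## [Assignment focus]
Learn how to use scikit-learn's hyper-parameter search to find the best hyper-parameters.

### Assignment
Using a different dataset, apply hyper-parameter search and see whether it can find the best combination of hyper-parameters.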

In [1]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Load the dataset
digits = datasets.load_digits()
digits.target

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)

# Build the model
gbc = GradientBoostingClassifier(random_state = 10)

In [3]:
# First, look at the result obtained with the default parameters
gbc.fit(x_train, y_train)
y_pred = gbc.predict(x_test)

In [4]:
print('accuracy :', (y_pred==y_test).mean())

accuracy : 0.9688888888888889


In [5]:
gbc.get_params()

{'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'deviance',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'presort': 'auto',
 'random_state': 10,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [6]:
# Define the hyper-parameter values to search over
learning_rate = [0.1, 0.2, 0.3]
n_estimators = [100, 200, 300]
max_depth = [1, 3, 5]
min_samples_leaf = [1, 2, 4]
min_samples_split = [2, 4, 8]

param_grid = dict(learning_rate=learning_rate, n_estimators=n_estimators,
                  max_depth=max_depth, min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split)

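# Build the search object from the model and the parameter dictionary (n_jobs=-1 runs in parallel on all CPU cores).
# Despite the variable name, this is a randomized search: it samples n_iter candidates (10 by default) instead of trying every combination.
grid_search = RandomizedSearchCV(gbc, param_grid, n_jobs=-1, verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)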


Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   35.9s finished
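
The log above reports 10 candidates, which matches the default `n_iter=10` of `RandomizedSearchCV`. The 3 folds are consistent with the default `cv` of older scikit-learn releases; newer releases default to 5. As a rough sketch of how much of the search space that covers, the snippet below counts the full grid with `ParameterGrid` and draws 10 combinations with `ParameterSampler`. The `random_state=0` seed is an arbitrary assumption for illustration and is not the seed used in the run above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the cell above
param_grid = {
    "learning_rate": [0.1, 0.2, 0.3],
    "n_estimators": [100, 200, 300],
    "max_depth": [1, 3, 5],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 8],
}

# An exhaustive grid search would evaluate every combination (3 ** 5)
n_grid = len(ParameterGrid(param_grid))
print("grid candidates:", n_grid)

# A randomized search draws n_iter combinations; the seed here is arbitrary
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=0))
print("sampled candidates:", len(sampled))
print("fraction of grid covered: {:.1%}".format(len(sampled) / n_grid))
```

Since only a small fraction of the grid is visited, the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid. Passing `random_state` to `RandomizedSearchCV` would make the sampled candidates repeatable between runs.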


In [7]:
# print best estimator
grid_result.best_estimator_ 

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.2, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=4, min_samples_split=8,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_iter_no_change=None, presort='auto',
                           random_state=10, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [8]:
# print best score
grid_result.best_score_

0.949517446176689
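
Note that `best_score_` is the mean cross-validated score of the best candidate on the training folds, so it is not directly comparable to the test-set accuracy printed earlier. A minimal sketch of where that number comes from is shown below. To keep it fast it assumes a much smaller model and search space than the notebook (`n_estimators=10`, 300 training rows, a hypothetical `small_space` dictionary), so its scores will differ from the ones recorded here.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# Deliberately tiny search space and model so the sketch runs in seconds
small_space = {"learning_rate": [0.1, 0.3], "max_depth": [1, 2]}
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=10),
    small_space, n_iter=3, cv=3, random_state=0)
search.fit(x_train[:300], y_train[:300])

# One mean cross-validation score per sampled candidate
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(score, 4))

# best_score_ is the entry of that array at best_index_
print(search.best_score_ == mean_scores[search.best_index_])
```

Looking at all entries of `cv_results_` rather than only the winner shows how close the candidates are to each other, which helps judge whether the difference between them is meaningful.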

In [9]:
# print best parameters
grid_result.best_params_ 

{'n_estimators': 200,
 'min_samples_split': 8,
 'min_samples_leaf': 4,
 'max_depth': 3,
 'learning_rate': 0.2}

In [10]:
# Rebuild the classifier from the best parameters found by the search
gbc_bestparameter = GradientBoostingClassifier(**grid_result.best_params_)

In [11]:
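gbc_bestparameter.fit(x_train, y_train)
# Predict with the tuned model. The output recorded below came from a run that called
# gbc.predict, so it repeats the default model's accuracy from In [4].
y_pred = gbc_bestparameter.predict(x_test)
print('accuracy :', (y_pred==y_test).mean())

accuracy : 0.9688888888888889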
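
With the default `refit=True`, the search object already refits the best candidate on the full training set, so `best_estimator_` can be evaluated on the test set without rebuilding a classifier by hand. The sketch below illustrates this under simplifying assumptions: a reduced model (`n_estimators=10`) and a two-parameter search space chosen only to keep the run short. Its accuracy is therefore not the accuracy of the tuned model in this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# Small search so the sketch finishes quickly (assumed values, not the notebook's)
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=10),
    {"learning_rate": [0.1, 0.3], "max_depth": [1, 2]},
    n_iter=2, cv=3, random_state=0)
search.fit(x_train, y_train)

# refit=True (the default) leaves a fitted copy of the best model on the search object
best_model = search.best_estimator_
y_pred_best = best_model.predict(x_test)
test_accuracy = (y_pred_best == y_test).mean()
print("test accuracy:", test_accuracy)

# search.predict delegates to best_estimator_, so both give the same predictions
same = np.array_equal(y_pred_best, search.predict(x_test))
print("same predictions:", same)
```

Reusing `best_estimator_` also keeps every setting of the searched model, including its `random_state`, which a manually rebuilt classifier drops unless it is passed again.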
