Cambridge lab tutorial on how to tune model parameters
https://cambridgecoding.wordpress.com/2016/04/03/scanning-hyperspace-how-to-tune-machine-learning-models/

Random Forest hyperparameter tuning
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

Intro to Model Tuning: Grid and Random Search
https://www.google.com/search?q=random+hyperparameter+tune&spell=1&sa=X&ved=2ahUKEwiDwfzu0dXtAhWIbc0KHS4WB-4QBSgAegQIBhA2&biw=1600&bih=876

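## [Assignment focus]
Learn how to use the hyper-parameter search tools in Sklearn to find the best hyperparameters

### Assignment
Pick several different datasets and use hyper-parameter search to see whether you can find the best hyperparameter combination for each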

In [1]:
from sklearn import datasets, metrics
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston=datasets.load_boston()
wine=datasets.load_wine()
digits=datasets.load_digits()

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

# boston + Grid Search

In [3]:
x_train,x_test,y_train,y_test=train_test_split(boston.data,boston.target,test_size=0.2,random_state=4)

In [4]:
clf=GradientBoostingRegressor()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
mse=metrics.mean_squared_error(y_test,y_pred)
print(f"MSE:{mse:.4f}")

MSE:11.1448


In [5]:
# Hyperparameter values to search over
n_estimators = [100, 200, 300]
max_depth = [1, 3, 5]
learning_rate = [0.1, 0.5, 0.05]

param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate)

# Build the search object from the model and the parameter grid (n_jobs=-1 runs in parallel on all CPU cores)
grid_search = GridSearchCV(clf, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)

# Run the search for the best parameters
grid_result = grid_search.fit(x_train, y_train)

# This scikit-learn version defaults to 3-fold cross-validation (releases from 0.22 on default to 5-fold);
# with 27 parameter combinations that is 81 model fits in total

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:    5.9s finished
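The "27 candidates, 81 fits" figure in the log can be checked by counting the grid directly. The snippet below is a minimal sketch that rebuilds the same grid and uses `ParameterGrid` to enumerate it; the fold count of 3 is taken from the log above rather than from any particular scikit-learn default.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [1, 3, 5],
    "learning_rate": [0.1, 0.5, 0.05],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
n_folds = 3                                    # fold count reported in the log above
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 27 81
```

Grid search cost grows multiplicatively: adding a fourth value to each of the three lists would already mean 64 candidates and 192 fits at 3 folds.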


In [6]:
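# Print the best score and the best parameters
# (the scoring is neg_mean_squared_error, so this is a negative MSE, not an accuracy)
print("Best score (negative MSE): %f using %s" % (
    grid_result.best_score_, grid_result.best_params_))

Best score (negative MSE): -10.546482 using {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}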


In [7]:
grid_result.best_params_

{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
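Because the search was scored with `neg_mean_squared_error`, `best_score_` is the negated cross-validated MSE (scikit-learn maximizes scores, so errors are sign-flipped). A small sketch of converting it back, using the value reported above as a hard-coded constant:

```python
import math

# Value copied from the search output above (scoring="neg_mean_squared_error")
best_score = -10.546482

cv_mse = -best_score          # flip the sign to recover the mean squared error
cv_rmse = math.sqrt(cv_mse)   # root of the MSE, in the same units as the target

print(f"CV MSE: {cv_mse:.4f}, CV RMSE: {cv_rmse:.4f}")
```

Note that this is an average over the validation folds of the training split, so it is not directly comparable to the single test-set MSE printed in the other cells.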

In [8]:
# Pass every tuned value (including learning_rate) to the final model
clf_best_params=GradientBoostingRegressor(**grid_result.best_params_)
clf_best_params.fit(x_train,y_train)
y_pred=clf_best_params.predict(x_test)
mse=metrics.mean_squared_error(y_test,y_pred)
print(f"MSE:{mse:.4f}")

MSE:12.0913
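Rebuilding the model by hand from `best_params_` is optional: with the default `refit=True`, the search object already holds a model refitted on the whole training split as `best_estimator_`. The sketch below illustrates this on synthetic data from `make_regression` (an assumption made so the example runs on any recent scikit-learn; it is not the dataset used above), with a deliberately tiny grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the dataset used above)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Deliberately tiny grid so the sketch runs quickly
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [1, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is already trained on the
# full training split using every entry of best_params_.
best_model = search.best_estimator_
test_mse = mean_squared_error(y_test, best_model.predict(X_test))
print(search.best_params_, f"test MSE: {test_mse:.4f}")
```

Fixing `random_state` on the estimator also makes the baseline and tuned runs repeatable, which helps when the two test scores are as close as they are in this notebook.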


# wine + Random Search

In [9]:
x_train,x_test,y_train,y_test=train_test_split(wine.data,wine.target,test_size=0.2,random_state=4)

In [10]:
clf=GradientBoostingClassifier()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
acc=metrics.accuracy_score(y_test,y_pred)
print(f"Accuracy:{acc:.4f}")

Accuracy:1.0000


In [11]:
from sklearn.model_selection import RandomizedSearchCV

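# Hyperparameter values to sample from
n_estimators = [100, 200, 300]
max_depth = [1, 3, 5]
learning_rate = [0.1, 0.5, 0.05]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate)

# Build the search object from the model and the parameter dictionary (n_jobs=-1 runs in parallel on all CPU cores)
# n_iter defaults to 10, i.e. 10 sampled candidates
# Note: scoring is neg_mean_squared_error even though the model is a classifier, so best_score_ is a
# negative MSE over the class labels; scoring="accuracy" would be the more usual choice
grid_search = RandomizedSearchCV(clf, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1,n_iter=10)

# Run the search for the best parameters
grid_result = grid_search.fit(x_train, y_train)

# With 3-fold cross-validation (the default in this scikit-learn version), only 10 of the parameter
# combinations are sampled, so 30 model fits in total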

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    4.8s finished


In [12]:
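# Print the best score and the best parameters
# (the scoring is neg_mean_squared_error, so this is a negative MSE over class labels, not an accuracy)
print("Best score (negative MSE): %f using %s" % (
    grid_result.best_score_, grid_result.best_params_))

Best score (negative MSE): -0.035211 using {'n_estimators': 200, 'max_depth': 1, 'learning_rate': 0.1}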


In [13]:
grid_result.best_params_

{'n_estimators': 200, 'max_depth': 1, 'learning_rate': 0.1}
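The search above scores a classifier with `neg_mean_squared_error`, which treats the wine class labels as numbers. A variant that scores by accuracy instead might look like the sketch below. The `n_estimators` values are scaled down and `n_iter` is reduced to 4 purely to keep the example quick, and the fixed `random_state` values are assumptions added for repeatability.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=4
)

# Smaller n_estimators values than in the notebook, for speed (an assumption)
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [1, 3, 5],
    "learning_rate": [0.1, 0.5, 0.05],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=4,             # sample 4 of the 27 combinations
    scoring="accuracy",   # classification metric instead of negative MSE
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

print(f"Best CV accuracy: {search.best_score_:.4f} using {search.best_params_}")
# score() applies the same scorer to the refitted best model
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

With accuracy as the scorer, `best_score_` lies between 0 and 1 and can be read directly alongside the test accuracy.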

In [14]:
# Pass every tuned value (including learning_rate) to the final model
clf_best_params=GradientBoostingClassifier(**grid_result.best_params_)
clf_best_params.fit(x_train,y_train)
y_pred=clf_best_params.predict(x_test)
acc=metrics.accuracy_score(y_test,y_pred)
print(f"Accuracy:{acc:.4f}")

Accuracy:0.9722


# digits + Random Search 

In [15]:
import numpy as np

In [16]:
x_train,x_test,y_train,y_test=train_test_split(digits.data,digits.target,test_size=0.2,random_state=4)

In [17]:
clf=GradientBoostingRegressor()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
mse=metrics.mean_squared_error(y_test,y_pred)
print(f"MSE:{mse:.4f}")

MSE:1.4604


In [18]:
# Hyperparameter values to sample from
n_estimators = [100, 200, 300]
max_depth = np.arange(1,5)
learning_rate = np.linspace(0.01,0.3)
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth,learning_rate=learning_rate)

# Build the search object from the model and the parameter dictionary
# (n_jobs is not set here, so the search runs sequentially on one core)
# n_iter defaults to 10, i.e. 10 sampled candidates
grid_search = RandomizedSearchCV(clf, param_grid, scoring="neg_mean_squared_error", verbose=3)

# Run the search for the best parameters
grid_result = grid_search.fit(x_train, y_train)

# With 3-fold cross-validation (the default in this scikit-learn version), 10 parameter
# combinations are sampled, so 30 model fits in total

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] n_estimators=100, max_depth=1, learning_rate=0.2289795918367347 .
[CV]  n_estimators=100, max_depth=1, learning_rate=0.2289795918367347, score=-3.314, total=   0.1s
[CV] n_estimators=100, max_depth=1, learning_rate=0.2289795918367347 .
[CV]  n_estimators=100, max_depth=1, learning_rate=0.2289795918367347, score=-3.541, total=   0.1s
[CV] n_estimators=100, max_depth=1, learning_rate=0.2289795918367347 .


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV]  n_estimators=100, max_depth=1, learning_rate=0.2289795918367347, score=-3.473, total=   0.2s
[CV] n_estimators=300, max_depth=4, learning_rate=0.2289795918367347 .
[CV]  n_estimators=300, max_depth=4, learning_rate=0.2289795918367347, score=-1.202, total=   1.2s
[CV] n_estimators=300, max_depth=4, learning_rate=0.2289795918367347 .
[CV]  n_estimators=300, max_depth=4, learning_rate=0.2289795918367347, score=-1.309, total=   1.1s
[CV] n_estimators=300, max_depth=4, learning_rate=0.2289795918367347 .
[CV]  n_estimators=300, max_depth=4, learning_rate=0.2289795918367347, score=-1.513, total=   1.0s
[CV] n_estimators=100, max_depth=2, learning_rate=0.1520408163265306 .
[CV]  n_estimators=100, max_depth=2, learning_rate=0.1520408163265306, score=-1.749, total=   0.1s
[CV] n_estimators=100, max_depth=2, learning_rate=0.1520408163265306 .
[CV]  n_estimators=100, max_depth=2, learning_rate=0.1520408163265306, score=-2.235, total=   0.1s
[CV] n_estimators=100, max_depth=2, learning_rate=0

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   13.3s finished
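The 10 candidates in this log are drawn from a much larger space than in the earlier searches. As a rough sketch, the size of that space and the sampling step can be reproduced with `ParameterGrid` and `ParameterSampler`; the `random_state=0` here is an assumption, so the sampled combinations will not match the ones in the log.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": np.arange(1, 5),             # 1, 2, 3, 4
    "learning_rate": np.linspace(0.01, 0.3),  # 50 evenly spaced values by default
}

# Full discrete space: 3 * 4 * 50 combinations
space_size = len(ParameterGrid(param_distributions))

# Draw 10 combinations, as RandomizedSearchCV does with n_iter=10
sampled = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))

print(space_size, len(sampled))  # 600 10
for params in sampled[:3]:
    print(params)
```

An exhaustive grid search over this space would need 1800 fits at 3 folds, against 30 for the random search, which is the trade-off random search makes: far fewer fits in exchange for covering only a small sample of the space.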


In [19]:
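# Print the best score and the best parameters
# (the scoring is neg_mean_squared_error, so this is a negative MSE, not an accuracy)
print("Best score (negative MSE): %f using %s" % (
    grid_result.best_score_, grid_result.best_params_))

Best score (negative MSE): -1.341483 using {'n_estimators': 300, 'max_depth': 4, 'learning_rate': 0.2289795918367347}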


In [20]:
grid_result.best_params_

{'n_estimators': 300, 'max_depth': 4, 'learning_rate': 0.2289795918367347}

In [21]:
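# Note: the search above tuned a GradientBoostingRegressor, but the final model here is a
# GradientBoostingClassifier, and the tuned learning_rate is not passed, so the default (0.1) is used
clf_best_params=GradientBoostingClassifier(max_depth=grid_result.best_params_["max_depth"],n_estimators=grid_result.best_params_["n_estimators"])
clf_best_params.fit(x_train,y_train)
y_pred=clf_best_params.predict(x_test)
# This is the mean squared error between predicted and true digit labels, not an accuracy
mse=metrics.mean_squared_error(y_test,y_pred)
print(f"MSE:{mse:.4f}")

MSE:0.3611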
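The figure printed above comes from `mean_squared_error` applied to digit labels, so it depends on how numerically far apart the confused digits are, not only on how many predictions are wrong. A toy illustration with made-up labels (hypothetical values, not taken from the digits run):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical digit labels, made up for illustration
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y_pred_far = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]   # one mistake: a 9 predicted as 0
y_pred_near = [0, 1, 2, 3, 4, 5, 6, 7, 8, 8]  # one mistake: a 9 predicted as 8

acc_far = accuracy_score(y_true, y_pred_far)        # 9 of 10 correct
acc_near = accuracy_score(y_true, y_pred_near)      # 9 of 10 correct
mse_far = mean_squared_error(y_true, y_pred_far)    # (9 - 0)**2 / 10
mse_near = mean_squared_error(y_true, y_pred_near)  # (9 - 8)**2 / 10

print(f"far:  accuracy={acc_far:.2f}, MSE={mse_far:.2f}")    # accuracy=0.90, MSE=8.10
print(f"near: accuracy={acc_near:.2f}, MSE={mse_near:.2f}")  # accuracy=0.90, MSE=0.10
```

Both predictions make exactly one error, yet the MSE differs by a factor of 81, which is why `metrics.accuracy_score` is usually the more meaningful number when the model is a classifier.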
