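## [Assignment Focus]
Learn how to use hyper-parameter search in scikit-learn to find the best hyperparameters.

### Assignment
Pick a different dataset and apply hyper-parameter search to see whether you can find the best combination of hyperparameters.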

In [1]:
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

In [2]:
diabetes = datasets.load_diabetes()
wine = datasets.load_wine()

## Diabetes dataset

#### Baseline

In [3]:
x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size = 0.25, random_state = 42)
reg = GradientBoostingRegressor(random_state = 20)
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
print(f'train loss: {metrics.mean_squared_error(y_train, reg.predict(x_train))}')
print(f'test loss: {metrics.mean_squared_error(y_test, y_pred)}')

train loss: 905.4545327948354
test loss: 3203.1262904754335


#### Grid Search CV Hyperparameter tuning

In [4]:
learning = [0.001, 0.01, 0.1, 1]
n_estimators = [50, 100, 200, 300]
max_depth = [1, 3, 5, 7]
grid_parms = dict(learning_rate = learning, 
                  n_estimators = n_estimators, 
                  max_depth = max_depth)
grid_search = GridSearchCV(reg, grid_parms, scoring = 'neg_mean_squared_error', n_jobs = -1, verbose = 1)
grid_search.fit(x_train, y_train)

# This gives 4 * 4 * 4 = 64 parameter combinations; with 5-fold cross-validation that is 64 * 5 = 320 fits in total

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:   27.4s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=3,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_ite...one,
                        

In [5]:
print(f' best loss: {grid_search.best_score_}, HPs:{grid_search.best_params_}')

 best loss: -3249.328994113577, HPs:{'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 200}
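
Two details of the search above are easy to check in isolation. The number of candidates is the product of the list lengths, and `scoring = 'neg_mean_squared_error'` reports a negated MSE so that a larger score is always better, which is why `best_score_` is negative. The following is a minimal sketch, assuming the same grid as above; `ParameterGrid` is scikit-learn's helper for enumerating a grid, and `neg_mse` is simply the value copied from the printed output.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
grid_parms = dict(learning_rate=[0.001, 0.01, 0.1, 1],
                  n_estimators=[50, 100, 200, 300],
                  max_depth=[1, 3, 5, 7])

n_candidates = len(ParameterGrid(grid_parms))  # 4 * 4 * 4 combinations
n_folds = 5                                    # 5 folds, as reported in the log above
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)

# best_score_ is a negated MSE; flip the sign to read it as a loss
neg_mse = -3249.328994113577   # value copied from the output above
cv_mse = -neg_mse
print(f'cross-validated MSE: {cv_mse:.1f}')
```

Note that this cross-validated MSE is computed on held-out folds of the training split, so it is not directly comparable to the train loss printed in the baseline cell.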


#### Random Search CV Hyperparameter tuning
> You can also pass a distribution and let the search sample from it at random. [ref](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
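
As a rough illustration of sampling from distributions rather than fixed lists, the sketch below uses `scipy.stats.loguniform` and `scipy.stats.randint`. The ranges, `n_iter = 5`, `cv = 3`, and the name `param_dist` are assumptions chosen to keep the example quick, not settings taken from this notebook.

```python
from scipy.stats import loguniform, randint
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=42)

# Distributions instead of fixed lists (ranges are illustrative assumptions)
param_dist = dict(learning_rate=loguniform(1e-3, 1e0),  # log-uniform over [0.001, 1]
                  n_estimators=randint(50, 301),        # integers 50..300 (upper bound is exclusive)
                  max_depth=randint(1, 8))              # integers 1..7

search = RandomizedSearchCV(GradientBoostingRegressor(random_state=20),
                            param_dist,
                            n_iter=5,         # number of sampled candidates
                            cv=3,
                            scoring='neg_mean_squared_error',
                            random_state=0)   # makes the sampling reproducible
search.fit(x_train, y_train)
print(search.best_params_)
```

A log-uniform distribution is a common choice for `learning_rate` because plausible values span several orders of magnitude.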

In [6]:
learning = [0.001, 0.01, 0.1, 1]
n_estimators = [50, 100, 200, 300]
max_depth = [1, 3, 5, 7]
grid_parms = dict(learning_rate = learning, 
                  n_estimators = n_estimators, 
                  max_depth = max_depth)
random_search = RandomizedSearchCV(reg, grid_parms, scoring = 'neg_mean_squared_error', n_jobs = -1, verbose = 1)
random_search.fit(x_train, y_train)

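# The grid still has 4 * 4 * 4 = 64 combinations, but RandomizedSearchCV samples only n_iter = 10 of them by default; with 5-fold cross-validation that is 10 * 5 = 50 fits in total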

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    4.7s finished


RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                                       criterion='friedman_mse',
                                                       init=None,
                                                       learning_rate=0.1,
                                                       loss='ls', max_depth=3,
                                                       max_features=None,
                                                       max_leaf_nodes=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                          

In [7]:
print(f' best loss: {random_search.best_score_}, HPs:{random_search.best_params_}')

 best loss: -3436.5317447592324, HPs:{'n_estimators': 300, 'max_depth': 3, 'learning_rate': 0.01}


#### Build model with best HPs
> Although the train loss rose considerably, the test loss dropped. This suggests that the low train loss of the baseline was a result of overfitting.

In [8]:
reg = GradientBoostingRegressor(random_state = 20,
                                learning_rate = grid_search.best_params_['learning_rate'],
                                max_depth = grid_search.best_params_['max_depth'],
                                n_estimators = grid_search.best_params_['n_estimators'])

# reg = grid_search.best_estimator_  # this also works
# reg = random_search.best_estimator_

reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
print(f'train loss: {metrics.mean_squared_error(y_train, reg.predict(x_train))}')
print(f'test loss: {metrics.mean_squared_error(y_test, y_pred)}')

train loss: 2235.5237932523937
test loss: 2812.9857279113453
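
The commented-out `best_estimator_` shortcut works because, with the default `refit = True`, the search retrains the best candidate on the whole training set once cross-validation finishes. A minimal sketch of that behaviour follows; the reduced grid `small_grid` and `cv = 3` are assumptions made only to keep it fast.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=42)

# Reduced grid (illustrative) so the example runs in a few seconds
small_grid = dict(learning_rate=[0.01, 0.1], max_depth=[1, 3])
search = GridSearchCV(GradientBoostingRegressor(random_state=20), small_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(x_train, y_train)

# With refit=True (the default), best_estimator_ is already trained on all of x_train,
# and search.predict delegates to it
pred_via_search = search.predict(x_test)
pred_via_best = search.best_estimator_.predict(x_test)
same = np.allclose(pred_via_search, pred_via_best)
print(same)
```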


## Wine dataset

#### Baseline

In [9]:
wine.data.shape

(178, 13)

#### Normalization

In [10]:
x_mean = np.mean(wine.data, axis = 0)
x_std = np.std(wine.data, axis = 0)
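# Use zeros rather than np.empty so that any zero-variance column is left at 0 instead of uninitialized memory
x = np.zeros(wine.data.shape)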

for i in range(x.shape[1]):
    for j in range(x.shape[0]):
        if x_std[i] != 0:
            x[j][i] = (wine.data[j][i] - x_mean[i]) / x_std[i]
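
The same standardization can be written without explicit loops. The sketch below assumes every column has non-zero variance (which holds for the wine data) and compares a broadcast version against scikit-learn's `StandardScaler`; the names `x_vec` and `x_scaled` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
x_mean = wine.data.mean(axis=0)
x_std = wine.data.std(axis=0)   # population std (ddof=0), same as the np.std default

# Broadcasting applies the per-column mean and std to every row at once
x_vec = (wine.data - x_mean) / x_std

# StandardScaler also divides by the population std
x_scaled = StandardScaler().fit_transform(wine.data)
print(np.allclose(x_vec, x_scaled))
```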

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, wine.target, test_size = 0.25, random_state = 45)
clf = LogisticRegression(random_state = 1)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(f'train accuracy: {metrics.accuracy_score(y_train, clf.predict(x_train))}')
print(f'test accuracy: {metrics.accuracy_score(y_test, y_pred)}')

train accuracy: 1.0
test accuracy: 0.9777777777777777


#### Hyperparameter tuning

In [12]:
penalty = ['l1', 'l2', 'elasticnet', 'none']
tol = [1e-5, 1e-4, 1e-3]
C = [0.1, 1, 10]
max_iter = [50, 100, 200, 300]
grid_parms = dict(penalty = penalty, 
                  tol = tol, 
                  C = C,
                  max_iter = max_iter)
grid_search = GridSearchCV(clf, grid_parms, n_jobs = -1, verbose = 1)
grid_search.fit(x_train, y_train)
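# This gives 4 * 3 * 3 * 4 = 144 parameter combinations; with 5-fold cross-validation that is 144 * 5 = 720 fits in total
# Note: the default 'lbfgs' solver supports only 'l2' and 'none', so the 'l1' and 'elasticnet' candidates fail to fit and are scored as nan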

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 312 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:    1.9s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=1, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [0.1, 1, 10], 'max_iter': [50, 100, 200, 300],
                         'penalty': ['l1', 'l2', 'elasticnet', 'none'],
                         'tol': [1e-05, 0.0001, 0.001]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

In [13]:
print(f' best score: {grid_search.best_score_}, HPs:{grid_search.best_params_}')

 best score: 0.9777777777777779, HPs:{'C': 0.1, 'max_iter': 50, 'penalty': 'l2', 'tol': 1e-05}
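
Not every penalty works with every solver, so a single flat grid can contain candidates that cannot be fit. `GridSearchCV` also accepts a list of dicts, which lets each solver be paired only with penalties it supports. The sketch below is one possible arrangement, not the notebook's method: the choice of `saga` for `'l1'`, `max_iter = 5000`, and the `C` values are assumptions, and newer scikit-learn releases have changed how penalties are specified, so treat the parameter values as tied to the version used in this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
x = StandardScaler().fit_transform(wine.data)
x_train, x_test, y_train, y_test = train_test_split(
    x, wine.target, test_size=0.25, random_state=45)

# One dict per solver, listing only penalties that solver supports
grid_parms = [
    dict(solver=['lbfgs'], penalty=['l2'], C=[0.1, 1, 10]),
    dict(solver=['saga'], penalty=['l1', 'l2'], C=[0.1, 1, 10]),
]
search = GridSearchCV(LogisticRegression(max_iter=5000, random_state=1), grid_parms)
search.fit(x_train, y_train)

# Every candidate should now have a valid cross-validated accuracy
scores = search.cv_results_['mean_test_score']
print(len(scores), int(np.isnan(scores).sum()))
print(search.best_params_)
```

The grid has 3 + 6 = 9 candidates, and none of them is wasted on an unsupported solver and penalty pair.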


#### Build model with best HPs

In [14]:
clf = grid_search.best_estimator_  # already refit on x_train (refit=True), so the fit below is redundant but harmless

clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(f'train accuracy: {metrics.accuracy_score(y_train, clf.predict(x_train))}')
print(f'test accuracy: {metrics.accuracy_score(y_test, y_pred)}')

train accuracy: 0.9924812030075187
test accuracy: 1.0
