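## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how adjusting the hyperparameters affects the results.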

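## Assignment

1. Try adjusting the parameters in RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.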

## Tuning RandomForestClassifier

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [4]:
# Load the data
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
features = iris.feature_names
X = pd.DataFrame(X, columns=features)

In [5]:
X.shape

(150, 4)

In [6]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)

In [13]:
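def get_the_best_params_performance(model, params, X, y):
    # Exhaustive grid search with cross-validation; a failed fit scores 0.
    grid = GridSearchCV(model, params, error_score=0.)
    grid.fit(X, y)
    # best_score_ is the mean cross-validated score of the best candidate
    # (accuracy for classifiers, R^2 for regressors).
    print('Best_Accuracy:{}'.format(grid.best_score_))
    print('Best_Params:{}'.format(grid.best_params_))
    # Both timings are averaged over every candidate in the grid,
    # not measured for the best model alone.
    print('Best fit time:{}'.format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    print('Average of testing time:{}'.format(round(grid.cv_results_['mean_score_time'].mean(), 3)))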


In [14]:
# Train the model: random forest

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()

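# Note: 'auto' (equivalent to 'sqrt' for classifiers) was deprecated in
# scikit-learn 1.1 and removed in 1.3; use 'sqrt' on newer versions.
forest_params = {'n_estimators':[100, 200, 300, 400, 500],
                'max_depth':[None, 5, 7, 9, 11],
                'min_samples_leaf':[1, 3, 5, 7, 9],
                'max_features':['auto', 'log2']}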
get_the_best_params_performance(forest, forest_params, X_train, y_train)



Best_Accuracy:0.9732142857142857
Best_Params:{'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 5, 'n_estimators': 200}
Best fit time:0.286
Average of testing time:0.02


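After the grid search, the best hyperparameter setting is:

'max_depth': 5

'max_features': 'auto'

'min_samples_leaf': 5

'n_estimators': 200

With this setting, the mean cross-validated accuracy on the training data reaches 0.973.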
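The grid search result can also be reused directly instead of retyping the winning parameters by hand. The following is a minimal sketch, assuming the same iris split as above; the reduced grid `small_grid` and the fixed `random_state=0` are assumptions made only to keep the example fast and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Same data and split as in the cells above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)

# A deliberately small grid (an assumption, to keep the sketch fast).
small_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ has already been refit
# on all of X_train using best_params_.
best_model = grid.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("Best params:", grid.best_params_)
print("Mean CV accuracy:", grid.best_score_)
print("Test accuracy:", test_accuracy)
```

Using `best_estimator_` avoids transcription mistakes when the grid changes, at the cost of one extra fit on the full training set.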

In [16]:
import time
from sklearn.metrics import accuracy_score

In [17]:
forest_best_params = RandomForestClassifier(n_estimators=200, max_depth=5, min_samples_leaf=5, max_features='auto')
start_time = time.time()
forest_best_params.fit(X_train, y_train)
end_time = time.time()
y_pred = forest_best_params.predict(X_test)
print('Best_accuracy:{}'.format(accuracy_score(y_test, y_pred)))
print('Fit time:{}'.format(end_time-start_time))

Best_accuracy:0.9736842105263158
Fit time:0.2050487995147705


### Using the warm_start hyperparameter to speed up training

In [20]:
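# With warm_start=True, each call to fit() keeps the trees already trained
# and only fits the newly added ones, so later iterations are much faster.
forest = RandomForestClassifier(max_depth=5, warm_start=True)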
for n_estimator in range(100, 210, 10):
    forest.set_params(n_estimators=n_estimator)
    start_time = time.time()
    forest.fit(X_train, y_train)
    end_time = time.time()
    duration_time = (end_time-start_time)
    print('n_estimators:{}, duration time:{}'.format(n_estimator, duration_time))
    
y_pred = forest.predict(X_test)
print('Accuracy:{}'.format(accuracy_score(y_test, y_pred)))


n_estimators:100, duration time:0.1092216968536377
n_estimators:110, duration time:0.010002851486206055
n_estimators:120, duration time:0.011252880096435547
n_estimators:130, duration time:0.009974956512451172
n_estimators:140, duration time:0.01092386245727539
n_estimators:150, duration time:0.010266542434692383
n_estimators:160, duration time:0.009929656982421875
n_estimators:170, duration time:0.011171579360961914
n_estimators:180, duration time:0.010281801223754883
n_estimators:190, duration time:0.009955167770385742
n_estimators:200, duration time:0.009507417678833008
Accuracy:0.9736842105263158
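The timings above drop sharply after the first iteration because only the 10 new trees are trained on each later call. The sketch below checks this directly by confirming that the first 100 trees are the very same objects after the second fit. The variable names and the fixed `random_state=0` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, warm_start=True, random_state=0
)
forest.fit(X, y)
# Keep references to the 100 trees fitted so far.
first_trees = list(forest.estimators_)

# Raising n_estimators and refitting only trains the 10 additional trees.
forest.set_params(n_estimators=110)
forest.fit(X, y)

n_trees = len(forest.estimators_)
# True if the original trees were kept rather than retrained.
reused = all(a is b for a, b in zip(first_trees, forest.estimators_))
print("trees:", n_trees, "original trees reused:", reused)
```

Note that warm_start only helps when growing the ensemble; changing other hyperparameters such as `max_depth` between calls would leave the older trees built with the previous setting.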


### Changing the hyperparameters

In [21]:
forest_change = RandomForestClassifier(n_estimators=300, max_depth=5, min_samples_leaf=5, max_features='auto')
start_time = time.time()
forest_change.fit(X_train, y_train)
end_time = time.time()
y_pred = forest_change.predict(X_test)
print('Best_accuracy:{}'.format(accuracy_score(y_test, y_pred)))
print('Fit time:{}'.format(end_time-start_time))

Best_accuracy:0.9736842105263158
Fit time:0.3052036762237549


Increasing the number of CART trees to 300 does not improve accuracy, but the training time increases.

In [22]:
forest_change = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5, max_features='auto')
start_time = time.time()
forest_change.fit(X_train, y_train)
end_time = time.time()
y_pred = forest_change.predict(X_test)
print('Best_accuracy:{}'.format(accuracy_score(y_test, y_pred)))
print('Fit time:{}'.format(end_time-start_time))

Best_accuracy:0.9736842105263158
Fit time:0.12648963928222656
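The two cells above can be folded into one loop that records fit time and test accuracy for each forest size. This is a minimal sketch, assuming the same iris split; `time.perf_counter` and the fixed `random_state=0` are swapped in here for steadier measurements and are not part of the original cells.

```python
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)

results = []  # (n_estimators, fit time in seconds, test accuracy)
for n in (100, 200, 300):
    model = RandomForestClassifier(
        n_estimators=n, max_depth=5, min_samples_leaf=5, random_state=0
    )
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    acc = accuracy_score(y_test, model.predict(X_test))
    results.append((n, fit_seconds, acc))
    print("n_estimators:{}, fit time:{:.3f}s, accuracy:{:.4f}".format(n, fit_seconds, acc))
```

On a dataset as small as iris the accuracy is likely to plateau quickly, so the loop mainly makes the time cost of extra trees visible.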


## Regression task

In [24]:
from sklearn.datasets import load_wine
# Load the data (wine.target holds class labels 0-2, treated here as a numeric regression target)

wine = load_wine()
X = wine.data
y = wine.target

features = wine.feature_names
X = pd.DataFrame(X, columns=features)

In [25]:
X.shape

(178, 13)

In [26]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
alcohol                         178 non-null float64
malic_acid                      178 non-null float64
ash                             178 non-null float64
alcalinity_of_ash               178 non-null float64
magnesium                       178 non-null float64
total_phenols                   178 non-null float64
flavanoids                      178 non-null float64
nonflavanoid_phenols            178 non-null float64
proanthocyanins                 178 non-null float64
color_intensity                 178 non-null float64
hue                             178 non-null float64
od280/od315_of_diluted_wines    178 non-null float64
proline                         178 non-null float64
dtypes: float64(13)
memory usage: 18.2 KB


In [27]:
X.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| alcohol | 178.0 | 13.000618 | 0.811827 | 11.03 | 12.3625 | 13.05 | 13.6775 | 14.83 |
| malic_acid | 178.0 | 2.336348 | 1.117146 | 0.74 | 1.6025 | 1.865 | 3.0825 | 5.8 |
| ash | 178.0 | 2.366517 | 0.274344 | 1.36 | 2.21 | 2.36 | 2.5575 | 3.23 |
| alcalinity_of_ash | 178.0 | 19.494944 | 3.339564 | 10.6 | 17.2 | 19.5 | 21.5 | 30.0 |
| magnesium | 178.0 | 99.741573 | 14.282484 | 70.0 | 88.0 | 98.0 | 107.0 | 162.0 |
| total_phenols | 178.0 | 2.295112 | 0.625851 | 0.98 | 1.7425 | 2.355 | 2.8 | 3.88 |
| flavanoids | 178.0 | 2.02927 | 0.998859 | 0.34 | 1.205 | 2.135 | 2.875 | 5.08 |
| nonflavanoid_phenols | 178.0 | 0.361854 | 0.124453 | 0.13 | 0.27 | 0.34 | 0.4375 | 0.66 |
| proanthocyanins | 178.0 | 1.590899 | 0.572359 | 0.41 | 1.25 | 1.555 | 1.95 | 3.58 |
| color_intensity | 178.0 | 5.05809 | 2.318286 | 1.28 | 3.22 | 4.69 | 6.2 | 13.0 |


In [28]:
# Split the data into training and test sets

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=4)

In [29]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
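# Note: for regressors 'auto' meant all features (max_features=1.0);
# it was deprecated in scikit-learn 1.1 and removed in 1.3.
forest_params = {'n_estimators':[100, 200, 300, 400, 500],
                'max_depth':[None, 5, 7, 9, 11],
                'min_samples_leaf':[1, 3, 5, 7, 9],
                'max_features':['auto', 'log2']}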
get_the_best_params_performance(forest, forest_params, x_train, y_train)



Best_Accuracy:0.942893656253014
Best_Params:{'max_depth': 5, 'max_features': 'log2', 'min_samples_leaf': 1, 'n_estimators': 300}
Best fit time:0.269
Average of testing time:0.015
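Although the helper labels it `Best_Accuracy`, the score reported above for a regressor is the mean cross-validated R^2, because GridSearchCV falls back to the estimator's default `score` method. The sketch below requests the metrics explicitly so the output is unambiguous. The reduced grid `small_grid`, `n_estimators=50`, and `random_state=0` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=4
)

# A deliberately small grid (an assumption, to keep the sketch fast).
small_grid = {"max_depth": [3, 5]}
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    small_grid,
    scoring={"r2": "r2", "mse": "neg_mean_squared_error"},
    refit="r2",  # pick the best candidate by R^2
    cv=5,
)
grid.fit(x_train, y_train)

best = grid.best_index_
mean_r2 = grid.cv_results_["mean_test_r2"][best]
# scikit-learn negates MSE so that larger is better; flip the sign back.
mean_mse = -grid.cv_results_["mean_test_mse"][best]
print("Best params:", grid.best_params_)
print("Mean CV R^2:", mean_r2)
print("Mean CV MSE:", mean_mse)
```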


In [31]:
from sklearn.metrics import mean_squared_error
forest_tuned = RandomForestRegressor(n_estimators=300, 
                                    max_depth=5,
                                    min_samples_leaf=1,
                                    max_features='log2')
forest_tuned.fit(x_train, y_train)
y_pred = forest_tuned.predict(x_test)
print('R^2 of the test set by RF:{}'.format(forest_tuned.score(x_test, y_test)))
print('MSE of the test set by RF:{}'.format(mean_squared_error(y_test, y_pred)))

R^2 of the test set by RF:0.980574154021891
MSE of the test set by RF:0.012710738726417125


In [32]:
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(x_train, y_train)
y_pred = linear.predict(x_test)
print('R^2 of the test set by LR:{}'.format(linear.score(x_test, y_test)))
print('MSE of the test set by LR:{}'.format(mean_squared_error(y_test, y_pred)))

R^2 of the test set by LR:0.8990855103432577
MSE of the test set by LR:0.06603046854083143


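The linear model performs worse overall than the random forest on the test set (R^2 of 0.899 versus 0.981, MSE of 0.066 versus 0.013).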
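The assignment also asks for a comparison with a decision tree, which the cells above do not run. The following is a minimal sketch, assuming the same wine split; `max_depth=5` and `random_state=0` are illustrative choices rather than tuned values, and no result is claimed here.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=4
)

# A single tree, for comparison with the forest and the linear model.
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(x_train, y_train)
y_pred = tree.predict(x_test)

tree_r2 = tree.score(x_test, y_test)
tree_mse = mean_squared_error(y_test, y_pred)
print('R^2 of the test set by DT:{}'.format(tree_r2))
print('MSE of the test set by DT:{}'.format(tree_mse))
```

Because the wine target is a class label, a classifier with accuracy as the metric would arguably be the more natural comparison; the regressor is used here only to stay consistent with the cells above.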