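## [Key Points]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning them affects the results.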

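## Homework

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results against a regression model and a decision tree.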
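The cells below work through the boston dataset only. For the wine half of task 2, a minimal sketch might look like the following. It assumes logistic regression as the "regression model" baseline (wine is a classification problem), and the specific hyperparameter values are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=15, stratify=wine.target)

# Assumed baselines and settings, chosen only for illustration
models = {
    # Scale the features first so that logistic regression converges cleanly
    'LogisticRegression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'DecisionTree': DecisionTreeClassifier(max_depth=4, random_state=15),
    # Parameters to experiment with: n_estimators, max_depth,
    # min_samples_split, min_samples_leaf, max_features
    'RandomForest': RandomForestClassifier(n_estimators=100, max_depth=None,
                                           min_samples_leaf=1, max_features='sqrt',
                                           random_state=15),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))
    print(f'Accuracy of {name} : {scores[name]:.4f}')
```

The wine test split holds only 36 samples, so small accuracy differences between models may not be meaningful. Changing one random forest parameter at a time makes it easier to see which one moves the score.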

In [36]:
# Use a single decision tree, with limits placed on the tree
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import sklearn.datasets as datasets
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 15)
clf = DecisionTreeRegressor(max_depth = 4, min_samples_split = 2,
                           min_samples_leaf = 1, criterion='friedman_mse')
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(f'MSE of DT : {mean_squared_error(y_test, y_pred)}')
print(f'MAE of DT : {mean_absolute_error(y_test, y_pred)}')

MSE of DT : 21.09584274826357
MAE of DT : 3.141036365298898


In [47]:
# Use 1000 trees, with no limits placed on the trees
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import sklearn.datasets as datasets
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 15)
# No random_state is set on the forest, so the metrics vary slightly between runs
clf = RandomForestRegressor(n_estimators=1000)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(f'MSE of RF : {mean_squared_error(y_test, y_pred)}')
print(f'MAE of RF : {mean_absolute_error(y_test, y_pred)}')
import pandas as pd
print(pd.DataFrame(data = zip(boston.feature_names, clf.feature_importances_),
                   columns=['name', 'importance']).sort_values('importance', ascending = False))

MSE of RF : 11.33036844156864
MAE of RF : 2.0061803921568693
       name  importance
12    LSTAT    0.452769
5        RM    0.356478
7       DIS    0.070875
0      CRIM    0.029397
4       NOX    0.022951
10  PTRATIO    0.018890
6       AGE    0.015298
11        B    0.011863
9       TAX    0.009662
2     INDUS    0.006451
8       RAD    0.003465
3      CHAS    0.000993
1        ZN    0.000909


In [48]:
# Use 1000 trees, with limits placed on the trees
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import sklearn.datasets as datasets
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 15)
# Cap every tree at depth 5; pd comes from the import in the previous cell
clf = RandomForestRegressor(n_estimators=1000, max_depth = 5)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(f'MSE of RF : {mean_squared_error(y_test, y_pred)}')
print(f'MAE of RF : {mean_absolute_error(y_test, y_pred)}')
print(pd.DataFrame(data = zip(boston.feature_names, clf.feature_importances_),
                   columns=['name', 'importance']).sort_values('importance', ascending = False))

MSE of RF : 13.043877185023078
MAE of RF : 2.276169548610828
       name  importance
12    LSTAT    0.485723
5        RM    0.358030
7       DIS    0.068567
0      CRIM    0.025389
4       NOX    0.019292
10  PTRATIO    0.015396
6       AGE    0.008168
11        B    0.006651
9       TAX    0.006278
2     INDUS    0.003873
8       RAD    0.001777
3      CHAS    0.000553
1        ZN    0.000303


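#### After adding limits on the trees, the model actually performed worse.
With `max_depth = 5`, the test MSE rose from 11.33 to 13.04 and the MAE from 2.01 to 2.28. A random forest already curbs overfitting by averaging the predictions of many trees (by voting, in the classification case), so on this dataset there was no need to restrict the individual trees as well.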
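To look at the effect of tree limits more systematically, one option is to sweep `max_depth` and compare training and test error. The sketch below is an assumption-laden illustration: it uses `load_diabetes` as a stand-in because `load_boston` is no longer available in recent scikit-learn releases, and the depth values and `n_estimators=200` are arbitrary. Whether the pattern seen on boston also appears here is something to observe, not assume.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): diabetes ships with scikit-learn
data = load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=15)

results = {}
for max_depth in [2, 5, 10, None]:  # None means the trees grow without a depth limit
    rf = RandomForestRegressor(n_estimators=200, max_depth=max_depth, random_state=15)
    rf.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, rf.predict(x_train))
    test_mse = mean_squared_error(y_test, rf.predict(x_test))
    results[max_depth] = (train_mse, test_mse)
    print(f'max_depth={max_depth} : train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}')
```

A large gap between training and test MSE suggests overfitting. If the unrestricted forest still has the lowest test MSE, that agrees with the conclusion above; if a shallower setting wins, the benefit of limiting the trees depends on the dataset.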