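## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning them affects the results.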

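## Assignment

1. Try adjusting the parameters in RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with the regression model and the decision tree.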

In [103]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

In [None]:
'''
RandomForestClassifier
clf = RandomForestClassifier(
        n_estimators=10,       # number of decision trees
        criterion="gini",
        max_features="sqrt",   # number of features considered per split: {"sqrt", "log2", None}, int or float
                               # ("auto" meant sqrt(n_features) for classifiers; it was removed in scikit-learn 1.3)
        max_depth=10,
        min_samples_split=2,
        min_samples_leaf=1
        )
'''

In [96]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# n_estimators: number of decision trees
# max_features: number of features to consider when looking for the best split;
#               "sqrt" uses sqrt(n_features) (the older "auto" option meant the same for classifiers)
# Build the model (50 trees, each with a maximum depth of 4)
clf = RandomForestClassifier(n_estimators=50, max_depth=4, max_features='sqrt')

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)
y_pred

array([2, 0, 2, 2, 2, 1, 2, 0, 0, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 2, 0, 2,
       1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 0, 1, 2, 2, 1])

In [97]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158


In [98]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [99]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.13659409 0.03232761 0.39149355 0.43958475]
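
The raw importance array is easier to read when each value is paired with its feature name. The following is a minimal sketch of one way to do that; `random_state=0` is an assumption added here for reproducibility and is not part of the original cell, so the exact numbers will differ from the output above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4)

# random_state=0 is an assumption added for reproducibility
clf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
clf.fit(x_train, y_train)

# Pair each feature name with its importance and sort from most to least important
ranked = sorted(zip(iris.feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

In the output shown above, the two petal measurements account for most of the importance, and a ranking like this makes that pattern visible at a glance.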


In [100]:
# Multiclass scoring requires the average parameter: average in {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'
f1 = metrics.f1_score(y_test, y_pred, average=None)  # F1-score; with average=None, the score for each class is returned
precision = metrics.precision_score(y_test, y_pred, average=None)  # precision for each class
recall = metrics.recall_score(y_test, y_pred, average=None)  # recall for each class
print("F1-Score: ", f1) 
print("Precision: ", precision)
print("Recall: ", recall)

F1-Score:  [1.         0.93333333 0.96      ]
Precision:  [1.         1.         0.92307692]
Recall:  [1.    0.875 1.   ]
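
As a sanity check on how the three metrics relate, the per-class F1 values can be recomputed as the harmonic mean of precision and recall. This is a small sketch using the numbers printed above; the helper name `f1_from` is made up for this example, and the macro average is simply the unweighted mean of the per-class scores.

```python
# Per-class precision and recall copied from the output above
precision = [1.0, 1.0, 0.92307692]
recall = [1.0, 0.875, 1.0]

def f1_from(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

f1 = [f1_from(p, r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1) / len(f1)  # unweighted mean over classes

print("Per-class F1:", [round(v, 4) for v in f1])
print("Macro F1:", round(macro_f1, 4))
```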


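### ANS: Adjusting the parameters makes little difference to the predictions. Increasing n_estimators (the number of trees) sometimes even lowers the accuracy.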
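
One possible reason for the occasional drop is run-to-run randomness rather than the number of trees itself: the model above is built without a `random_state`, and the test set has only 38 samples, so a single misclassification changes accuracy by about 2.6 percentage points. The sketch below repeats each setting over several seeds to show the spread. The tree counts and the five seeds are arbitrary choices for illustration.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4)

results = {}
for n_trees in (5, 20, 50, 100):      # arbitrary values to sweep
    scores = []
    for seed in range(5):             # five seeds is an arbitrary choice
        clf = RandomForestClassifier(n_estimators=n_trees, max_depth=4,
                                     random_state=seed)
        clf.fit(x_train, y_train)
        scores.append(metrics.accuracy_score(y_test, clf.predict(x_test)))
    results[n_trees] = scores
    print(f"n_estimators={n_trees:>3}: min={min(scores):.3f} max={max(scores):.3f}")
```

If the min and max for one setting overlap with those of another, the difference seen in a single run is probably not meaningful.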

#### 2. Switch to other datasets (boston, wine) and compare the results with the regression model and the decision tree

In [101]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()
boston.target

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21

In [102]:
boston.data.shape

(506, 13)

In [124]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.1, random_state=4)

# Build the model (30 trees, each with a maximum depth of 4); for a regression problem, use RandomForestRegressor
clf = RandomForestRegressor(n_estimators=30, max_depth=4)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)
y_pred

array([17.36574439, 25.51211989, 21.0636694 , 17.63535728, 47.24529542,
       23.34595906, 33.72779001, 16.83203112, 13.45678491, 17.36476079,
       29.05232721, 23.97173137, 21.14083109, 25.0136991 , 21.3501658 ,
       14.32403947, 21.14083109, 11.7325699 , 12.5553167 , 16.41153019,
       10.19887173, 17.30548302, 18.29342747, 21.19424123, 21.14716501,
       21.19424123, 17.358714  , 15.72782369, 20.19845622, 18.44965104,
       13.60533119, 23.2575491 , 30.03211342, 21.87040674, 13.9233906 ,
       12.35352181, 32.70020566, 45.16402226, 23.34595906, 23.12867325,
       44.57590029, 31.30259498, 14.52737125, 27.60373937, 27.96321335,
       21.35593384, 47.26432593, 18.5749687 , 20.76031875, 21.3501658 ,
       29.2275653 ])

In [125]:
# Gap between predicted and actual values, measured with MSE
print("Mean squared error: %.2f"
      % metrics.mean_squared_error(y_test, y_pred))

Mean squared error: 12.64
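
MSE is expressed in squared target units, which can be hard to interpret. Taking the square root (RMSE) puts the error back on the scale of the target. The sketch below spells out the MSE definition on toy values that are invented for illustration only, then converts the MSE reported above.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values for illustration only (not taken from the Boston data)
toy_true = [3.0, 5.0, 2.0]
toy_pred = [2.0, 5.0, 4.0]
toy_mse = mean_squared_error(toy_true, toy_pred)  # (1 + 0 + 4) / 3
print("Toy MSE:", round(toy_mse, 4))

# RMSE is in the same units as the target
reported_mse = 12.64  # value printed above
rmse = math.sqrt(reported_mse)
print("RMSE for the reported MSE:", round(rmse, 2))
```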


### ANS: Compared with the D038 regression model (LinearRegression), RandomForestRegressor predicts better (smaller error)
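
A comparison like this is easiest to trust when all models are fitted on the same split in one cell. The sketch below shows one way to set that up. It uses `load_diabetes` as a stand-in dataset, which is an assumption made here because `load_boston` is unavailable in recent scikit-learn releases; the ranking of the models may well differ from the Boston result. The `random_state=0` values are also added for reproducibility.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes stands in for boston, which newer scikit-learn no longer ships
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.1, random_state=4)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=4, random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=30, max_depth=4,
                                                   random_state=0),
}

# Fit every model on the same split and record its test MSE
mse_by_model = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    mse_by_model[name] = metrics.mean_squared_error(y_test, model.predict(x_test))
    print(f"{name:<22} MSE: {mse_by_model[name]:.2f}")
```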

In [126]:
wine = datasets.load_wine()
wine.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [127]:
wine.data.shape

(178, 13)

In [140]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2, random_state=4)

# Build the model (50 trees, each with a maximum depth of 3)
clf = RandomForestClassifier(n_estimators=50, max_depth=3)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)
y_pred

array([2, 2, 0, 0, 1, 2, 0, 1, 0, 1, 1, 0, 2, 2, 0, 1, 0, 1, 1, 2, 1, 2,
       1, 2, 0, 2, 1, 1, 2, 2, 0, 1, 0, 1, 2, 2])

In [141]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  1.0


In [142]:
# Multiclass scoring requires the average parameter: average in {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'
f1 = metrics.f1_score(y_test, y_pred, average=None)  # F1-score; with average=None, the score for each class is returned
precision = metrics.precision_score(y_test, y_pred, average=None)  # precision for each class
recall = metrics.recall_score(y_test, y_pred, average=None)  # recall for each class
print("F1-Score: ", f1) 
print("Precision: ", precision)
print("Recall: ", recall)

F1-Score:  [1. 1. 1.]
Precision:  [1. 1. 1.]
Recall:  [1. 1. 1.]


### ANS: Compared with the D038 regression model (LogisticRegression), RandomForestClassifier performs better: accuracy reaches 1 (100%), and adding more trees still gives 1
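
A perfect score on a 36-sample test set is encouraging, but it comes from a single split. Cross-validation gives a broader view of how the models compare. The sketch below is one way to run it, under a few assumptions not in the original: 5 folds, `random_state=0`, and a `StandardScaler` in front of `LogisticRegression` to help it converge on the unscaled wine features.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

models = {
    # Scaling is an assumption added here to help LogisticRegression converge
    "LogisticRegression": make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=1000)),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=3, random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, max_depth=3,
                                                     random_state=0),
}

# Mean accuracy over 5 stratified folds for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:<23} mean accuracy: {cv_means[name]:.3f}")
```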