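## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how adjusting the hyperparameters affects the results.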

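## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression model and the decision tree.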

# 1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change

In [22]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [23]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model (5 trees, each with a maximum depth of 8)
clf = RandomForestClassifier(n_estimators=5, max_depth=8)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [24]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158


In [25]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [26]:
clf_feature_importances = list(zip(iris.feature_names, clf.feature_importances_))
clf_feature_importances_sorted = sorted(clf_feature_importances, key=lambda x: x[1], reverse=True)
print( "Feature importance: ", clf_feature_importances_sorted )

Feature importance:  [('petal length (cm)', 0.5346120532647282), ('petal width (cm)', 0.25333802015911705), ('sepal length (cm)', 0.20097034712689524), ('sepal width (cm)', 0.011079579449259632)]


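# Ans 1:

Increasing n_estimators has little effect on the results, while decreasing n_estimators has a larger effect.
Changing max_depth, whether up or down, has a noticeable effect.

What changes:

Accuracy stays the same, but the feature importances change.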
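
The observations above can be checked more systematically with a small sweep. The following is a minimal sketch, not the notebook's original code: the helper `sweep`, the grid values, and the fixed `random_state=0` on the classifier are assumptions added for illustration, so that differences between rows come from the hyperparameters rather than from random seeding.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

def sweep(n_estimators_grid=(1, 5, 20, 100), max_depth_grid=(1, 4, 8), seed=0):
    """Fit one forest per (n_estimators, max_depth) pair and collect the results."""
    results = []
    for n_estimators in n_estimators_grid:
        for max_depth in max_depth_grid:
            # seed is fixed (an assumption) so runs are repeatable
            clf = RandomForestClassifier(
                n_estimators=n_estimators, max_depth=max_depth, random_state=seed
            )
            clf.fit(x_train, y_train)
            acc = metrics.accuracy_score(y_test, clf.predict(x_test))
            results.append((n_estimators, max_depth, acc, clf.feature_importances_))
    return results

results = sweep()
for n_estimators, max_depth, acc, importances in results:
    # Rank features by importance and report the strongest one
    ranked = sorted(zip(iris.feature_names, importances), key=lambda x: x[1], reverse=True)
    top_name, top_score = ranked[0]
    print(f"n_estimators={n_estimators:3d} max_depth={max_depth} "
          f"accuracy={acc:.3f} top feature={top_name} ({top_score:.3f})")
```

Reading the printed rows side by side makes it easier to see whether accuracy or the feature ranking is the quantity that actually moves when a hyperparameter changes.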

# 2. Switch to other datasets (boston, wine) and compare the results with those of the regression model and the decision tree

In [27]:
# Load the Wine dataset
wine = datasets.load_wine()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# Build the model (5 trees, each with a maximum depth of 8)
clf = RandomForestClassifier(n_estimators=5, max_depth=8)
# clf = RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=10, 
#                             max_features=8, max_depth=6, bootstrap=True)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [28]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9111111111111111


In [29]:
print(wine.feature_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


In [30]:
clf_feature_importances = list(zip(wine.feature_names, clf.feature_importances_))
clf_feature_importances_sorted = sorted(clf_feature_importances, key=lambda x: x[1], reverse=True)
print( "Feature importance: ", clf_feature_importances_sorted )

Feature importance:  [('od280/od315_of_diluted_wines', 0.24986952996939454), ('alcohol', 0.20202926674846156), ('total_phenols', 0.10699048551678705), ('magnesium', 0.10382716791219418), ('proline', 0.10268723792860804), ('color_intensity', 0.07698840233985241), ('flavanoids', 0.040585942476144846), ('hue', 0.03684909956286532), ('alcalinity_of_ash', 0.034463524231256794), ('ash', 0.0184507808694069), ('proanthocyanins', 0.01425421276243873), ('malic_acid', 0.013004349682589613), ('nonflavanoid_phenols', 0.0)]


# Ans 2.

On the Wine dataset in HW38:

Logistic regression (LR) had an Accuracy of 0.94

Random forest had an Accuracy of 1.0

Decision tree had an Accuracy of 0.91
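
The figures above come from a separate notebook (HW38). To compare the three model families on the same split used here, something like the following sketch could be run. The specific settings are assumptions for illustration, not the HW38 configuration: a `StandardScaler` plus `max_iter=1000` for logistic regression, `random_state=0` on the tree models, and a forest with 20 trees of depth 4.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# Hypothetical settings chosen for illustration only
models = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0),
}

scores = {}
for name, model in models.items():
    # Every model sees exactly the same training and test rows
    model.fit(x_train, y_train)
    scores[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

Because all three models share one train/test split, any gap between their accuracies reflects the models themselves rather than a difference in how the data was divided.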


In [31]:
rf = RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=10, 
                            max_features=8, max_depth=6, bootstrap=True)
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features=8, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [32]:
# Output the random forest predictions
pred_rf_prob = rf.predict_proba(x_test)
# Pick the most probable class for each row (equivalent to rf.predict(x_test))
pred_rf = rf.classes_[pred_rf_prob.argmax(axis=1)]

count_misclassified = (y_test != pred_rf).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, pred_rf)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 1
Accuracy: 0.98
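
The cell above derives class labels from `predict_proba` instead of calling `predict`. The sketch below checks that the two routes agree when the arg max of each probability row is mapped through `rf.classes_`. It refits the same model configuration; the added `random_state=0` is an assumption made only so the check is repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

rf = RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=10,
                            max_features=8, max_depth=6, bootstrap=True,
                            random_state=0)  # random_state added for repeatability
rf.fit(x_train, y_train)

# Each row of proba holds one probability per class, in the order of rf.classes_
proba = rf.predict_proba(x_test)
labels_from_proba = rf.classes_[np.argmax(proba, axis=1)]
labels_from_predict = rf.predict(x_test)

same = np.array_equal(labels_from_proba, labels_from_predict)
print("Same labels from both routes:", same)
```

Mapping through `rf.classes_` matters when the class labels are not simply 0, 1, 2, ...; a bare column index is only a valid label in that special case, which happens to hold for the Wine dataset.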


In [33]:
# Use rf (the model fitted in In [31]); clf is still the earlier 5-tree forest from In [27]
# The values vary between runs because rf was created without a random_state
rf_feature_importances = list(zip(wine.feature_names, rf.feature_importances_))
rf_feature_importances_sorted = sorted(rf_feature_importances, key=lambda x: x[1], reverse=True)
print( "Feature importance: ", rf_feature_importances_sorted )


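# Ans 2. (Cont.)

RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=10, 
                            max_features=8, max_depth=6, bootstrap=True)
                            
With the same parameters as in HW38, the Accuracy obtained here is 0.98. Possible reasons:

1. In HW38 the data was split into three parts for training and testing, whereas here it is split into two.

2. The output of a random forest is inherently random, so it is not necessarily the same on every run.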
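
The second point can be illustrated with a short sketch. The helper `fit_and_score` and the seed value 42 are assumptions for illustration. The model is refitted several times on the same split, first with `random_state=None` (the default, as in the notebook) and then with a fixed seed.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

def fit_and_score(seed):
    """Fit the HW38-style forest with the given random_state and return test accuracy."""
    rf = RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=10,
                                max_features=8, max_depth=6, bootstrap=True,
                                random_state=seed)
    rf.fit(x_train, y_train)
    return metrics.accuracy_score(y_test, rf.predict(x_test))

# random_state=None: bootstrap samples and feature subsets differ on every fit
unseeded = [fit_and_score(None) for _ in range(5)]
# Fixed seed: every fit builds the same forest, so the score repeats exactly
seeded = [fit_and_score(42) for _ in range(5)]

print("random_state=None:", unseeded)
print("random_state=42:  ", seeded)
```

The unseeded scores may or may not differ on a dataset this small, but the seeded scores are identical by construction, so fixing `random_state` is the usual way to make comparisons between notebooks reproducible.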
