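## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.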

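## Assignment

1. Try adjusting the parameters in RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.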
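
As a starting point for item 1, the hyperparameters the estimator exposes can be listed with `get_params()`. This is only a minimal sketch: the shortlist in `main_params` is my own pick of commonly tuned names, not something the assignment prescribes, and the defaults printed depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters and their defaults for the installed scikit-learn version.
params = RandomForestClassifier().get_params()

# Hypothetical shortlist of commonly tuned hyperparameters (an assumption,
# not part of the original assignment).
main_params = ["n_estimators", "max_depth", "min_samples_split",
               "min_samples_leaf", "max_features"]

for name in main_params:
    print(f"{name}: {params[name]}")
```

Printing the defaults first makes it easier to tell which settings in the cells below actually differ from the baseline.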

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Part 1

In [2]:
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

In [3]:
# Build the model (20 trees, each with a maximum depth of 4)
clf0 = RandomForestClassifier(n_estimators=20, max_depth=4)
clf0.fit(x_train, y_train)
y_pred = clf0.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf0.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.09032712 0.03223463 0.44703657 0.43040169]


In [4]:
# 30 trees, each with a maximum depth of 4
clf1 = RandomForestClassifier(n_estimators=30, max_depth=4)
clf1.fit(x_train, y_train)
y_pred = clf1.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf1.feature_importances_)

Accuracy:  0.9473684210526315
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.10170551 0.02746194 0.36738361 0.50344893]


In [5]:
# 20 trees, each with a maximum depth of 3
clf2 = RandomForestClassifier(n_estimators=20, max_depth=3)
clf2.fit(x_train, y_train)
y_pred = clf2.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf2.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.13398192 0.03246949 0.35505935 0.47848924]


In [6]:
# 15 trees, each with a maximum depth of 3
clf3 = RandomForestClassifier(n_estimators=15, max_depth=3)
clf3.fit(x_train, y_train)
y_pred = clf3.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf3.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.16461422 0.03666324 0.32864921 0.47007333]


In [7]:
# 10 trees, each with a maximum depth of 3
clf4 = RandomForestClassifier(n_estimators=10, max_depth=3)
clf4.fit(x_train, y_train)
y_pred = clf4.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf4.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.07335386 0.01612695 0.29932724 0.61119195]


In [8]:
# with default settings
clf5 = RandomForestClassifier()
clf5.fit(x_train, y_train)
y_pred = clf5.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf5.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.14670809 0.01245585 0.40905192 0.43178414]




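Well, I tried a bunch of different settings, but accuracy barely changes: every run scores 0.9737 except the 30-tree model (0.9474). With 38 test samples, that gap is exactly one misclassified sample, and since the models were built without a fixed random_state, it may just be run-to-run noise. The feature importances do shift between runs, though petal length and petal width dominate every time.
End of part 1...
:(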
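
One possible way to get a less noisy comparison than a single 38-sample test split is cross-validation with a fixed seed. The sketch below is an assumption about how that could look, not what the cells above did: the `settings` list simply mirrors the combinations tried in Part 1, and `random_state=0` and `cv=5` are arbitrary choices.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# (n_estimators, max_depth) pairs mirroring the settings tried above.
settings = [(20, 4), (30, 4), (20, 3), (15, 3), (10, 3)]

results = {}
for n_estimators, max_depth in settings:
    # Fixing random_state makes each configuration reproducible.
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth,
                                 random_state=0)
    # 5-fold cross-validation scores every sample once as test data.
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    results[(n_estimators, max_depth)] = scores.mean()
    print(f"n_estimators={n_estimators}, max_depth={max_depth}: "
          f"mean accuracy = {scores.mean():.4f} (std {scores.std():.4f})")
```

If the mean accuracies still sit within a standard deviation of each other, the honest conclusion is that iris is too easy for these hyperparameters to matter much.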

## Part 2

In [9]:
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

In [10]:
# 1 tree
clf_t1 = RandomForestClassifier(n_estimators=1)
clf_t1.fit(x_train, y_train)
y_pred = clf_t1.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(wine.feature_names)
print("Feature importance: ", clf_t1.feature_importances_)

Accuracy:  0.8444444444444444
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.         0.         0.         0.03084398 0.02269108 0.
 0.36831289 0.02127955 0.03843477 0.42739642 0.00161779 0.
 0.08942352]


In [11]:
# default settings (n_estimators defaulted to 10 before scikit-learn 0.22; it is 100 since 0.22)
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(wine.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  1.0
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.16860867 0.04094972 0.01879261 0.06124486 0.00498586 0.04017253
 0.08088877 0.00245564 0.01943166 0.1499546  0.07149096 0.13163521
 0.20938892]




In [12]:
# compare to decision tree
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier()
clf_dt.fit(x_train, y_train)
y_pred = clf_dt.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(wine.feature_names)
print("Feature importance: ", clf_dt.feature_importances_)

Accuracy:  0.9111111111111111
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.01364138 0.0184594  0.         0.         0.         0.
 0.12455196 0.         0.         0.41184168 0.         0.04285558
 0.38865   ]
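
The assignment also asks for a comparison against a regression model, which the cells above leave out. The sketch below assumes logistic regression is the intended "regression model" for a classification dataset like wine; the scaling step, `max_iter=1000`, `n_estimators=20`, `random_state=0`, and 5-fold cross-validation are all my own choices for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

models = {
    # Scaling helps logistic regression converge on the wine features.
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=20, random_state=0),
}

cv_accuracy = {}
for name, model in models.items():
    # Mean accuracy over 5 stratified folds.
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    cv_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Averaging over folds gives a steadier picture than the single 45-sample test split used in the cells above.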


In [13]:
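# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release
boston = datasets.load_boston()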
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

In [15]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
print("Mean squared error: %.2f" % metrics.mean_squared_error(y_test, y_pred))
print(boston.feature_names)
# Use the fitted regressor here, not the earlier wine classifier `clf`
print("Feature importance: ", regr.feature_importances_)

Mean squared error: 17.39
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']




In [22]:
# 20 trees
regr = RandomForestRegressor(n_estimators=20)
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
print("Mean squared error: %.2f" % metrics.mean_squared_error(y_test, y_pred))
print(boston.feature_names)
# Use the fitted regressor here, not the earlier wine classifier `clf`
print("Feature importance: ", regr.feature_importances_)

Mean squared error: 14.51
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [23]:
# compare to decision tree
from sklearn.tree import DecisionTreeRegressor

regr = DecisionTreeRegressor()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
print("Mean squared error: %.2f" % metrics.mean_squared_error(y_test, y_pred))
print(boston.feature_names)
# Use the fitted regressor here, not the earlier wine classifier `clf`
print("Feature importance: ", regr.feature_importances_)

Mean squared error: 29.65
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
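
Since `load_boston` was removed in scikit-learn 1.2, the regression comparison can be sketched on a dataset that still ships with the library. The example below swaps in `load_diabetes` purely as a stand-in (an assumption; its numbers are not comparable to the Boston results above), adds a linear regression baseline, and reads the importances from the fitted forest itself. `n_estimators=20` and `random_state=0` are arbitrary.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in dataset (assumption): load_diabetes replaces the removed load_boston.
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=4)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

mse = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    mse[name] = metrics.mean_squared_error(y_test, model.predict(x_test))
    print(f"{name}: MSE = {mse[name]:.2f}")

# Pair each feature name with the importance from the fitted forest.
forest = models["random forest"]
importances = dict(zip(data.feature_names, forest.feature_importances_))
for feature, importance in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {importance:.3f}")
```

Pairing names with importances in one dictionary avoids the mismatch that arises when the importances of one model are printed next to the feature names of another dataset.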
