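## [Key Points]
Make sure you understand what each hyperparameter of the random forest model means, and observe how adjusting the hyperparameters affects the results.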

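## Assignment

1. Try adjusting the parameters in RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression model and the decision tree.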
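
One possible way to approach the first task is to sweep a couple of hyperparameters and print the test accuracy for each combination. The following is a minimal sketch, assuming the same iris split used later in this notebook. The grid values and the fixed `random_state=0` on the forest are illustrative assumptions, not part of the original assignment.

```python
from itertools import product

from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Illustrative grid (assumed values, chosen only for demonstration)
n_estimators_grid = [5, 20, 100]
max_depth_grid = [1, 2, 4, None]

results = {}
for n_estimators, max_depth in product(n_estimators_grid, max_depth_grid):
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=0,  # fixed so repeated runs give the same forest
    )
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    results[(n_estimators, max_depth)] = acc
    print(f"n_estimators={n_estimators:>3}, max_depth={max_depth}: accuracy={acc:.3f}")
```

Because the iris test set is small (38 samples), differences between settings may amount to only one or two samples, so small changes in accuracy should be read with care.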

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

### Iris dataset

In [3]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = RandomForestClassifier(n_estimators=20, criterion="gini", max_features="sqrt", max_depth=4, min_samples_split=2, min_samples_leaf=1)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Evaluate accuracy and inspect the feature importances
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.06756644 0.02975784 0.39731311 0.50536261]
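
The importances above are printed in the same order as `iris.feature_names`, which makes them hard to read at a glance. As a small sketch, the names and importances can be paired and sorted. The `random_state=0` on the forest and the variable names `order` and `ranked` are assumptions added for illustration, so the exact numbers will differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

clf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
clf.fit(x_train, y_train)

# Indices of the features, from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
ranked = [(iris.feature_names[i], float(clf.feature_importances_[i])) for i in order]

for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

These are impurity-based importances, normalized to sum to 1, so each value can be read as a share of the total impurity reduction across the forest.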


### Random forest - wine

In [4]:
# Load the dataset
wine = datasets.load_wine()
datas = wine

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(datas.data, datas.target, test_size=0.25, random_state=4)

# Build the model
clf = RandomForestClassifier(n_estimators=20, criterion="gini", max_features="sqrt", max_depth=4, min_samples_split=2, min_samples_leaf=1)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Evaluate accuracy and inspect the feature importances
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(datas.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9777777777777777
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.14098294 0.02808645 0.01667575 0.01084289 0.03495177 0.06176719
 0.19441582 0.00576066 0.00872231 0.17722477 0.05836787 0.12793063
 0.13427095]


### Regression model - wine

In [5]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

# Load the dataset
datas = datasets.load_wine()
#print(datas.keys(),'\n')
#print(datas['DESCR'])

#---
# Optionally use only one feature (column) of the dataset; here all features are used
#X = datas.data[:, np.newaxis, 2]
X = datas.data
print("Data shape: ", X.shape)

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, datas.target, test_size=0.1, random_state=4)

# Build the model (logistic regression, because the wine target is a class label)
#regr = linear_model.LinearRegression()
reg = linear_model.LogisticRegression()

# Fit the model on the training data
reg.fit(x_train, y_train)

# Run the test data through the model to get predictions
y_pred = reg.predict(x_test)

#---
# Inspect the fitted coefficients of the model
print('Coefficients: ', reg.coef_)

# Gap between predicted and actual values, measured with MSE
# (computed on class labels here, so it is only a rough indicator)
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))
#---
# Plot the regression model against the actual data
#plt.scatter(x_test, y_test,  color='black')
#plt.plot(x_test, y_pred, color='blue', linewidth=3)
#plt.show()

#---
# Classification accuracy on the test set
acc = accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Data shape:  (178, 13)
Coefficients:  [[-5.88852656e-01  6.67300827e-01  1.00960693e+00 -5.80989219e-01
  -3.55178256e-02  3.62071144e-01  1.18894658e+00  3.78340624e-03
  -4.54784892e-01 -1.53560698e-01 -1.62107824e-01  9.11550191e-01
   1.77906683e-02]
 [ 9.31771389e-01 -1.08459849e+00 -7.53390627e-01  2.41931110e-01
   1.24181909e-02  3.53858216e-02  5.76719638e-01  5.39359650e-01
   6.06710292e-01 -1.86151560e+00  9.52831552e-01  7.69014213e-02
  -1.44579779e-02]
 [-3.44877619e-01  6.57378630e-01  3.90432260e-02  1.20175740e-01
   1.94696375e-02 -6.60620544e-01 -1.84324382e+00 -9.24618142e-02
  -6.79666411e-01  1.08773341e+00 -4.94768310e-01 -1.20152083e+00
   2.92068606e-04]]
Mean squared error: 0.06
Accuracy:  0.9444444444444444
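
The second task also asks for a comparison with a decision tree, which this notebook does not run. The random forest and logistic regression cells above also use different test sizes (0.25 and 0.1), so their accuracies are not directly comparable. A rough sketch of a like-for-like comparison, assuming 5-fold cross-validation on the wine data, could look like this. The choice of `cv=5`, `random_state=0`, `max_iter=10000`, and the tree's `max_depth=4` are illustrative assumptions rather than settings from the original notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

# Three classifiers evaluated on identical folds (settings are assumptions)
models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0),
    # A larger max_iter gives the solver more room on the unscaled features
    "logistic regression": LogisticRegression(max_iter=10000),
}

scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, wine.data, wine.target, cv=5)
    scores[name] = cv_scores.mean()
    print(f"{name:<20} mean accuracy = {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

Averaging over folds reduces the dependence on a single 18-sample or 45-sample test split, which is why cross-validation is used in this sketch instead of `train_test_split`.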
