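## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Keep in mind that in any future project, the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, then try adjusting them to see how they affect the final predictions.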
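
One way to explore a hyperparameter is to sweep it and watch both the fitted tree and the test score. The sketch below does this for `max_depth` on the iris split used later in this notebook. The candidate values and `random_state=0` are illustrative assumptions, not part of the assignment.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Candidate depths are illustrative; None lets the tree grow until leaves are pure.
results = {}
for max_depth in (1, 2, 3, None):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    # Record the depth the tree actually reached alongside the test accuracy.
    results[max_depth] = (clf.get_depth(), acc)
    print(f"max_depth={max_depth}: fitted depth={clf.get_depth()}, accuracy={acc:.4f}")
```

The same loop works for other hyperparameters such as `min_samples_split` or `min_samples_leaf`; only the keyword passed to the constructor changes.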

## Assignment

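1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression models.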
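
For the wine dataset mentioned in item 2, a minimal sketch might look like the following. Wine is a classification dataset, so a `DecisionTreeClassifier` is used here. The split settings mirror the iris cells in this notebook, and `random_state=0` on the classifier is an added assumption to make runs repeatable.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# random_state is an assumption for reproducibility, not part of the assignment.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
acc = metrics.accuracy_score(y_test, clf.predict(x_test))
print(f"Accuracy: {acc:.4f}")

# List features from most to least important.
ranked = sorted(
    zip(wine.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature, importance in ranked:
    print(f"{feature} - {importance:.4f}")
```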

In [1]:
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

### Iris dataset

In [2]:
# Load the data
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)
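
To make the "features and labels" format concrete, the sketch below inspects the arrays produced by the split above. It only prints shapes and names and assumes nothing beyond the iris data bundled with scikit-learn.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Features: one row per sample, one column per measurement.
print(iris.data.shape)               # (150, 4)
# Labels: one integer class id per sample.
print(iris.target.shape)             # (150,)
print(list(iris.target_names))
# A 25% test split of 150 samples gives 112 training and 38 test rows.
print(x_train.shape, x_test.shape)   # (112, 4) (38, 4)
```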

In [3]:
# Decision tree classifier (default hyperparameters)
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature} - {importance}")

Accuracy: 0.9736842105263158
sepal length (cm) - 0.017965992941931345
sepal width (cm) - 0.0
petal length (cm) - 0.522291342258947
petal width (cm) - 0.4597426647991217
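
The importances listed above are normalized, so they add up to 1 across the features. A small sketch that checks this and ranks the features is shown below. `random_state=0` is an added assumption. Without a fixed seed, ties between equally good splits can be broken differently from run to run, so the exact values may not match the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
importances = clf.feature_importances_

# Indices of the features, from most to least important.
order = np.argsort(importances)[::-1]
for i in order:
    print(f"{iris.feature_names[i]} - {importances[i]:.4f}")

total = importances.sum()
print(f"Sum of importances: {total:.4f}")
```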


In [5]:
# Decision tree classifier (criterion='entropy')
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature} - {importance}")

Accuracy: 0.9736842105263158
sepal length (cm) - 0.01560620187870998
sepal width (cm) - 0.0
petal length (cm) - 0.07501716294579418
petal width (cm) - 0.9093766351754958
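
The `criterion` argument selects the impurity measure used to score candidate splits: Gini impurity (the default, `'gini'`) or entropy. As a rough illustration of what the two measures compute, the sketch below evaluates both on made-up class counts. The helper names and the counts are hypothetical, and the base-2 logarithm is a choice made for this sketch.

```python
import math


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


balanced = [10, 10, 10]  # three equally likely classes (made-up counts)
pure = [30, 0, 0]        # every sample belongs to one class

for name, counts in (("balanced", balanced), ("pure", pure)):
    print(f"{name}: gini={gini(counts):.4f}, entropy={entropy(counts):.4f}")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often produce similar trees even though the importances they report can differ.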


In [6]:
# Decision tree classifier (min_samples_leaf=5)
clf = DecisionTreeClassifier(min_samples_leaf=5)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature} - {importance}")

Accuracy: 0.9736842105263158
sepal length (cm) - 0.0
sepal width (cm) - 0.0
petal length (cm) - 0.03953519962361139
petal width (cm) - 0.9604648003763887
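
`min_samples_leaf=5` forbids any split that would leave fewer than five training samples in a leaf, which tends to produce a smaller tree. The sketch below compares leaf sizes with and without the constraint. The helper `leaf_sizes` is hypothetical, and `random_state=0` is an added assumption.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)


def leaf_sizes(clf):
    """Return the number of training samples that ended up in each leaf."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # leaves have no children
    return tree.n_node_samples[is_leaf]


clf_default = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
clf_leaf5 = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(x_train, y_train)

for name, clf in (("default", clf_default), ("min_samples_leaf=5", clf_leaf5)):
    sizes = leaf_sizes(clf)
    print(f"{name}: {clf.get_n_leaves()} leaves, smallest leaf has {sizes.min()} samples")
```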


### Boston dataset

In [7]:
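# Load the data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older releases.
boston = datasets.load_boston()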
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

In [11]:
# Decision tree regressor (default hyperparameters)
reg = DecisionTreeRegressor()
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"MSE: {mse}")
for feature, importance in zip(boston.feature_names, reg.feature_importances_):
    print(f"{feature} - {importance}")

MSE: 30.74448818897638
CRIM - 0.06510927762581185
ZN - 0.0014281278518604073
INDUS - 0.0014342964341626716
CHAS - 0.009734988538214306
NOX - 0.030793061494979096
RM - 0.5440198015969876
AGE - 0.014417615158382956
DIS - 0.059054275757724525
RAD - 0.0059174498878867944
TAX - 0.01254643351649498
PTRATIO - 0.02467830219214734
B - 0.00848046943581101
LSTAT - 0.22238590050953655
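
The assignment also asks for a comparison with a regression model. Since `load_boston` is not available in recent scikit-learn releases, the sketch below runs that comparison on a substitute: the `load_diabetes` dataset. The dataset choice, `LinearRegression` as the baseline, and `random_state=0` are all assumptions made for illustration. The numbers it prints are not comparable to the Boston results above.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Substitute dataset (assumption): diabetes ships with scikit-learn.
diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=4
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
}

# Fit each model on the same split and record its test MSE.
mse_by_model = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    mse_by_model[name] = metrics.mean_squared_error(y_test, model.predict(x_test))
    print(f"{name}: MSE = {mse_by_model[name]:.2f}")
```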


In [13]:
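# Decision tree regressor (criterion="mae")
# Note: scikit-learn 1.0 renamed this criterion to "absolute_error"; "mae" was removed in 1.2.
reg = DecisionTreeRegressor(criterion="mae")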
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"MSE: {mse}")
for feature, importance in zip(boston.feature_names, reg.feature_importances_):
    print(f"{feature} - {importance}")

MSE: 24.70692913385827
CRIM - 0.10090234440883783
ZN - 0.005397200202394999
INDUS - 0.017034913138809237
CHAS - 0.0027407657277787106
NOX - 0.042587282847023074
RM - 0.29773148928993043
AGE - 0.028883454208129496
DIS - 0.06978411199190417
RAD - 0.024287400910777506
TAX - 0.02099848203744308
PTRATIO - 0.026480013493000477
B - 0.02264294147411027
LSTAT - 0.3405296002698605
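
One intuition for how the two regression criteria differ: a squared-error criterion favors predicting the mean of the targets in a leaf, while an absolute-error criterion favors the median, which is less sensitive to outliers. The sketch below checks this on made-up target values. The helper names and the numbers are hypothetical and are not taken from the Boston data.

```python
from statistics import mean, median

# Made-up target values for a single leaf, with one outlier.
leaf_targets = [10.0, 12.0, 13.0, 15.0, 50.0]


def mse(values, prediction):
    """Mean squared error of predicting one constant for every value."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


def mae(values, prediction):
    """Mean absolute error of predicting one constant for every value."""
    return sum(abs(v - prediction) for v in values) / len(values)


leaf_mean = mean(leaf_targets)      # 20.0
leaf_median = median(leaf_targets)  # 13.0

for name, pred in (("mean", leaf_mean), ("median", leaf_median)):
    print(f"predict {name} ({pred}): MSE={mse(leaf_targets, pred):.1f}, "
          f"MAE={mae(leaf_targets, pred):.1f}")
```

The mean gives the lower MSE and the median the lower MAE, so the outlier at 50 pulls the squared-error prediction much further than the absolute-error one.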
