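## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels! Keep in mind that in any future project the data has to be cleaned into this same format before it can be fed to a model for training.
Today's assignment moves on to decision trees, a very important model. Make sure you understand what each of the model's hyperparameters means, and try adjusting them to see how they affect the final predictions.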
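
As a starting point for that kind of experiment, the sketch below varies a single hyperparameter, `max_depth`, and records training and test accuracy for each setting. It assumes the iris data and the same split used in the first code cell of this notebook; the list of depths, the fixed `random_state=0`, and the name `depth_results` are illustrative choices, not part of the assignment.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4)

# Depths to try are an arbitrary illustrative choice; None means "no limit"
depth_results = {}
for depth in (1, 2, 3, None):
    # random_state is fixed so that only max_depth changes between runs
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(x_train, y_train)
    depth_results[depth] = {
        "train_acc": clf.score(x_train, y_train),
        "test_acc": clf.score(x_test, y_test),
        "actual_depth": clf.get_depth(),
    }
    print(depth, depth_results[depth])
```

The same loop can be reused for `min_samples_split`, `min_samples_leaf`, or `criterion` by swapping the keyword that is varied.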

## Assignment

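1. Try adjusting the parameters in DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare against the results from the regression models.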
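
One possible way to approach the second task is sketched below. It compares a decision tree with a regression-family model on the same split: `DecisionTreeClassifier` against `LogisticRegression` on wine, and `DecisionTreeRegressor` against `LinearRegression` on a regression dataset. Note that `load_boston` was removed in scikit-learn 1.2, so the sketch substitutes `load_diabetes` for it. That substitution, the `test_size`, the `random_state` values, and the scaling step are all assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# --- Classification: wine ---
wine = datasets.load_wine()
xw_train, xw_test, yw_train, yw_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4)

wine_models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Scaling helps logistic regression converge; trees do not need it
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
}
wine_acc = {}
for name, model in wine_models.items():
    model.fit(xw_train, yw_train)
    wine_acc[name] = model.score(xw_test, yw_test)  # mean accuracy
    print("wine |", name, "| accuracy:", wine_acc[name])

# --- Regression: load_diabetes stands in for the removed boston dataset ---
reg_data = datasets.load_diabetes()
xr_train, xr_test, yr_train, yr_test = train_test_split(
    reg_data.data, reg_data.target, test_size=0.25, random_state=4)

reg_models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "linear regression": LinearRegression(),
}
reg_r2 = {}
for name, model in reg_models.items():
    model.fit(xr_train, yr_train)
    reg_r2[name] = model.score(xr_test, yr_test)  # R^2 on the test set
    print("diabetes |", name, "| R^2:", reg_r2[name])
```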

In [2]:
from sklearn import datasets, metrics

# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(iris.feature_names)

print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.01796599 0.         0.52229134 0.45974266]


### Tuning the Decision Tree parameters

In [3]:
# Build the model
clf = DecisionTreeClassifier(
        criterion='entropy',
        max_depth=10,
        min_samples_split=3,
        min_samples_leaf=1)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(iris.feature_names)

print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.0156062  0.         0.62264163 0.36175217]


### Observation: accuracy is unchanged; the difference shows up mainly in the feature importances.
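
Part of that difference may be unrelated to the hyperparameters: neither model above sets `random_state`, and scikit-learn permutes the features at each split, so ties between equally good splits can be broken differently from run to run. The sketch below fixes `random_state=0` (an assumed value) and changes only `criterion`, so any remaining difference in the importances is tied to the criterion rather than to run-to-run randomness. The name `importances` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4)

importances = {}
for criterion in ("gini", "entropy"):
    # Only the split criterion differs between the two models
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(x_train, y_train)
    importances[criterion] = np.asarray(clf.feature_importances_)
    print(criterion, "test accuracy:", clf.score(x_test, y_test))
    for name, value in zip(iris.feature_names, importances[criterion]):
        print(f"  {name:<20} {value:.4f}")
```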

### Trying the Breast Cancer dataset

In [9]:
# Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()

# Split into training and test sets
x_train_d, x_test_d, y_train_d, y_test_d = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=66)

# Build the model
clf = DecisionTreeClassifier(
        criterion='gini',
        # max_depth=10,
        min_samples_split=2,
        min_samples_leaf=1, random_state=0)

# Train the model
clf.fit(x_train_d, y_train_d)

# Score the training and test sets (clf.score returns mean accuracy)
# y_pred_d = clf.predict(x_test_d)
print("Accuracy on training set: {:.3f}".format(clf.score(x_train_d, y_train_d)))
print("Accuracy on test set: {:.3f}".format(clf.score(x_test_d, y_test_d)))

print(cancer.feature_names)

print("Feature importance: ", clf.feature_importances_)

Accuracy on training set: 1.000
Accuracy on test set: 0.939
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Feature importance:  [0.         0.05940707 0.00468454 0.         0.00702681 0.
 0.         0.01967507 0.         0.         0.01233852 0.
 0.         0.         0.00468454 0.         0.         0.00433754
 0.00624605 0.03653941 0.         0.01612033 0.         0.71474329
 0.         0.         0.00461856 0.10957827 0.         0.        ]
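
The raw importance array is hard to read against 30 feature names. In the output above, the largest value (0.715) sits at index 23, which is 'worst area', followed by index 27, 'worst concave points' (0.110). A small sketch of pairing names with importances and sorting them follows. It assumes the same split as the cell above and a tree with default settings plus `random_state=0`, which matches the explicit values used there; the names `order` and `ranked` and the choice to print the top five are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = datasets.load_breast_cancer()
x_train_d, x_test_d, y_train_d, y_test_d = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=66)

# gini, min_samples_split=2 and min_samples_leaf=1 are the defaults
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train_d, y_train_d)

# Indices of features ordered from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
ranked = [(str(cancer.feature_names[i]), float(clf.feature_importances_[i]))
          for i in order]

for name, value in ranked[:5]:
    print(f"{name:<25} {value:.4f}")
```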


### Trying a Random Forest

In [10]:
from sklearn.ensemble import RandomForestClassifier

# Forest of 100 trees with no depth limit
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train_d, y_train_d)
print("Accuracy on training set: {:.3f}".format(rf.score(x_train_d, y_train_d)))
print("Accuracy on test set: {:.3f}".format(rf.score(x_test_d, y_test_d)))

# Same forest, but each tree is limited to depth 3
rf1 = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)
rf1.fit(x_train_d, y_train_d)
print("Accuracy on training set: {:.3f}".format(rf1.score(x_train_d, y_train_d)))
print("Accuracy on test set: {:.3f}".format(rf1.score(x_test_d, y_test_d)))

Accuracy on training set: 1.000
Accuracy on test set: 0.965
Accuracy on training set: 0.978
Accuracy on test set: 0.956
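
Limiting the depth lowers the forest's training accuracy from 1.000 to 0.978 while test accuracy stays close (0.965 vs 0.956), which may indicate that the unrestricted trees largely memorize the training set. The same check can be run on a single decision tree. The sketch below assumes the same breast cancer split as above and compares an unrestricted tree with one limited to `max_depth=3`; that depth simply mirrors the value used for `rf1`, and `tree_scores` is an illustrative name.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = datasets.load_breast_cancer()
x_train_d, x_test_d, y_train_d, y_test_d = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=66)

tree_scores = {}
for depth in (None, 3):
    # None grows the tree until the leaves are pure; 3 caps the depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(x_train_d, y_train_d)
    tree_scores[depth] = (tree.score(x_train_d, y_train_d),
                          tree.score(x_test_d, y_test_d))
    print("max_depth={}: train {:.3f}, test {:.3f}".format(
        depth, *tree_scores[depth]))
```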
