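## [Assignment Focus]
By now you should have a clear picture of how the data in a dataset is structured, including both the features and the labels. Keep in mind that in any future project, the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, then try adjusting them and observe how the final predictions change.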
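
As a starting point for getting to know the hyperparameters, the sketch below lists every parameter a `DecisionTreeClassifier` accepts together with its default value, using the estimator's `get_params()` method. The variable names are illustrative, and the exact set of parameters may differ between scikit-learn versions.

```python
from sklearn.tree import DecisionTreeClassifier

# Create an estimator with library defaults and inspect its hyperparameters
clf = DecisionTreeClassifier()
params = clf.get_params()

# Print each hyperparameter with its default value, sorted by name
for name in sorted(params):
    print(f"{name}: {params[name]}")
```

Commonly adjusted entries include `criterion`, `max_depth`, `min_samples_split`, and `min_samples_leaf`.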

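## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.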

In [1]:
import pandas as pd
from sklearn import datasets, metrics

# For a classification problem use DecisionTreeClassifier; for a regression problem use DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

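### Four Steps to Build a Model
Building a machine learning model in Scikit-learn is quite simple. The workflow is roughly the following four steps:  

1. Load the data and check its shape (how many samples (rows), how many features (columns), and what type the label is).  
Ways to load data:  
Read a .csv file with pandas: pd.read_csv  
Read a .txt file with numpy: np.loadtxt  
Use a dataset built into Scikit-learn: sklearn.datasets.load_xxx  
Check the data size: data.shape (data should be an np.array or a DataFrame)  
2. Split the data into training (train) and testing (test) sets  
  train_test_split(data)
3. Build the model and fit it to the data to start training  
clf = DecisionTreeClassifier()  
clf.fit(x_train, y_train)
4. Feed the test data (features) into the trained model to get predictions, then evaluate them against the test labels (y_test)  
clf.predict(x_test)  
accuracy_score(y_test, y_pred)  
f1_score(y_test, y_pred) (for multiclass labels, pass an averaging mode such as average="macro")  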
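
The four steps above can be sketched end to end as follows. This is a minimal example on the built-in iris dataset. The `random_state=0` passed to the classifier and the choice of `average="macro"` for the F1 score are assumptions made here for reproducibility and for the multiclass labels. They are not part of the original notebook.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: load the data and check its shape
iris = datasets.load_iris()
X, y = iris.data, iris.target
print("features:", X.shape, "labels:", y.shape)

# Step 2: split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)

# Step 3: build the model and fit it on the training data
# (random_state=0 is an assumption, added so repeated runs give the same tree)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)

# Step 4: predict on the test features and evaluate against y_test
y_pred = clf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
# iris has three classes, so f1_score needs an explicit averaging mode
f1 = f1_score(y_test, y_pred, average="macro")
print("Accuracy:", acc)
print("Macro F1:", f1)
```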

#### Dataset IRIS

In [2]:
# Load the iris dataset
iris = datasets.load_iris()
# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)
# Build the model
clf = DecisionTreeClassifier()
# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

# Pair each feature name with its importance (avoid shadowing the built-in dict)
importance_dict = {"features": iris.feature_names, "importances": clf.feature_importances_}
select_df = pd.DataFrame(importance_dict)
print(select_df)

Accuracy:  0.9736842105263158
            features  importances
0  sepal length (cm)     0.000000
1   sepal width (cm)     0.017966
2  petal length (cm)     0.059924
3   petal width (cm)     0.922110
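
For the first assignment task, one way to see how the hyperparameters affect the result is to train several trees on the same split and compare them. The sketch below does this for a handful of settings. The specific values in `param_grid` are hypothetical and chosen only for illustration, and `random_state=0` is an added assumption so that the comparison is repeatable.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Hypothetical settings to compare; an empty dict means library defaults
param_grid = [
    {"max_depth": 1},
    {"max_depth": 3},
    {"criterion": "entropy", "max_depth": 3},
    {"min_samples_split": 10, "min_samples_leaf": 5},
    {},
]

results = []
for params in param_grid:
    # Fix the seed so differences come from the hyperparameters only
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    # Record the settings, the depth the tree actually reached, and accuracy
    results.append((params, clf.get_depth(), acc))
    print(params, "depth:", clf.get_depth(), "accuracy:", round(acc, 4))
```

Restricting `max_depth` or raising `min_samples_leaf` yields a smaller tree. Whether that helps or hurts test accuracy depends on the dataset and the split.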


#### Dataset BOSTON

In [3]:
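# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)
reg = DecisionTreeRegressor()
reg.fit(x_train, y_train)
# Predict on the test set
y_pred = reg.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
print("MSE: ", mse)

# Pair each feature name with its importance (avoid shadowing the built-in dict)
importance_dict = {"features": boston.feature_names, "importances": reg.feature_importances_}
select_df = pd.DataFrame(importance_dict)
print(select_df)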

MSE:  26.34629921259842
   features  importances
0      CRIM     0.065927
1        ZN     0.002177
2     INDUS     0.008894
3      CHAS     0.000053
4       NOX     0.028675
5        RM     0.541745
6       AGE     0.015328
7       DIS     0.066598
8       RAD     0.005448
9       TAX     0.012772
10  PTRATIO     0.019825
11        B     0.010514
12    LSTAT     0.222043
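
For the second assignment task, a regression tree can be compared with a linear regression on the same split. Because `load_boston` is no longer available in recent scikit-learn releases, the sketch below substitutes the built-in diabetes dataset (`load_diabetes`) as a stand-in. That substitution, the `models` and `scores` names, and `random_state=0` are all assumptions of this example, so the numbers it prints are not comparable to the Boston output above.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: used here instead of boston)
diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=4
)

# Two regressors trained and evaluated on the same split
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    # Lower MSE is better; R^2 closer to 1 is better
    scores[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(name, "MSE:", scores[name]["mse"], "R2:", scores[name]["r2"])
```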


#### Dataset WINE

In [4]:
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
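print("Accuracy: ", acc)

# Pair each feature name with its importance (avoid shadowing the built-in dict)
importance_dict = {"features": wine.feature_names, "importances": clf.feature_importances_}
select_df = pd.DataFrame(importance_dict)
print(select_df)

Accuracy:  0.8888888888888888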
                        features  importances
0                        alcohol     0.013641
1                     malic_acid     0.030766
2                            ash     0.000000
3              alcalinity_of_ash     0.000000
4                      magnesium     0.044051
5                  total_phenols     0.000000
6                     flavanoids     0.124552
7           nonflavanoid_phenols     0.000000
8                proanthocyanins     0.000000
9                color_intensity     0.337025
10                           hue     0.018459
11  od280/od315_of_diluted_wines     0.042856
12                       proline     0.388650
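
Since wine is a classification dataset, its natural regression-family counterpart is logistic regression. The sketch below compares the two on the same split. The scaling pipeline, `max_iter=1000`, and `random_state=0` are assumptions added here so that the logistic regression converges and the tree is reproducible. They do not come from the original notebook.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    # Logistic regression is sensitive to feature scale, so standardize first
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    accuracies[name] = metrics.accuracy_score(y_test, y_pred)
    print(name, "accuracy:", accuracies[name])
```

Decision trees do not need feature scaling because each split thresholds a single feature, which is why only the logistic regression is wrapped in a scaler.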
