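## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including features and labels. Whatever the project, remember that the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, then try adjusting them and observe how the final predictions change.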

## Assignment

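1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of regression models.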
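
For task 2, one possible way to set up the comparison is sketched below. It assumes the wine dataset (`load_wine`), 5-fold cross-validation, and `LogisticRegression` as the baseline "regression model"; none of these choices are prescribed by the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: wine is treated as a 3-class classification problem.
X, y = load_wine(return_X_y=True)

models = {
    # max_depth=3 is an arbitrary illustrative setting.
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    # Scaling helps the linear model converge; trees do not need it.
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print("{}: {:.3f}".format(name, score))
```

Using the same folds and the same metric for both models keeps the comparison fair; swapping in a different baseline only requires changing the `models` dictionary.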

In [2]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [12]:
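# Load the data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version

from sklearn.datasets import load_boston
boston = load_boston()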
X = boston.data
y = boston.target
features = boston.feature_names
X.shape

(506, 13)

In [13]:
X = pd.DataFrame(X, columns=features)

In [14]:
# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)


In [15]:
X_train.head()

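| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.9 | 9.08 |
| 172 | 0.13914 | 0.0 | 4.05 | 0.0 | 0.51 | 5.572 | 88.5 | 2.5961 | 5.0 | 296.0 | 16.6 | 396.9 | 14.69 |
| 80 | 0.04113 | 25.0 | 4.86 | 0.0 | 0.426 | 6.727 | 33.5 | 5.4007 | 4.0 | 281.0 | 19.0 | 396.9 | 5.29 |
| 46 | 0.18836 | 0.0 | 6.91 | 0.0 | 0.448 | 5.786 | 33.3 | 5.1004 | 3.0 | 233.0 | 17.9 | 396.9 | 14.15 |
| 318 | 0.40202 | 0.0 | 9.9 | 0.0 | 0.544 | 6.382 | 67.2 | 3.5325 | 4.0 | 304.0 | 18.4 | 395.21 | 10.36 |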


In [16]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 379 entries, 502 to 37
Data columns (total 13 columns):
CRIM       379 non-null float64
ZN         379 non-null float64
INDUS      379 non-null float64
CHAS       379 non-null float64
NOX        379 non-null float64
RM         379 non-null float64
AGE        379 non-null float64
DIS        379 non-null float64
RAD        379 non-null float64
TAX        379 non-null float64
PTRATIO    379 non-null float64
B          379 non-null float64
LSTAT      379 non-null float64
dtypes: float64(13)
memory usage: 41.5 KB


In [17]:
X_train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CRIM | 379.0 | 3.805183 | 9.375846 | 0.00632 | 0.083475 | 0.24522 | 3.68339 | 88.9762 |
| ZN | 379.0 | 11.521108 | 23.492644 | 0.0 | 0.0 | 0.0 | 17.75 | 100.0 |
| INDUS | 379.0 | 11.220053 | 6.875362 | 0.46 | 5.255 | 9.69 | 18.1 | 27.74 |
| CHAS | 379.0 | 0.081794 | 0.274413 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| NOX | 379.0 | 0.554073 | 0.117825 | 0.385 | 0.4475 | 0.538 | 0.624 | 0.871 |
| RM | 379.0 | 6.255726 | 0.687415 | 3.561 | 5.875 | 6.172 | 6.611 | 8.78 |
| AGE | 379.0 | 68.751451 | 28.276504 | 6.0 | 45.05 | 79.2 | 94.05 | 100.0 |
| DIS | 379.0 | 3.824433 | 2.138449 | 1.1296 | 2.09445 | 3.3175 | 5.10855 | 12.1265 |
| RAD | 379.0 | 9.525066 | 8.73455 | 1.0 | 4.0 | 5.0 | 24.0 | 24.0 |
| TAX | 379.0 | 405.182058 | 169.483657 | 187.0 | 277.0 | 329.0 | 666.0 | 711.0 |


In [18]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

In [24]:
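def get_the_best_model_and_performance(model, params, X, y):
    """Grid-search `params` for `model` on (X, y) and print a short summary."""
    # error_score=0. scores a failed fit as 0 instead of raising an error
    grid = GridSearchCV(model, params, error_score=0.)
    grid.fit(X, y)
    # best_score_ is the mean cross-validated score of the best combination
    print('Best Performance:{}'.format(grid.best_score_))
    print('The parameters of best performance:{}'.format(grid.best_params_))
    # Fit and score times are averaged over all parameter combinations
    print('Average time to fit(s):{}'.format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    print('Average time to predict(s):{}'.format(round(grid.cv_results_['mean_score_time'].mean(), 3)))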
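
The helper above prints only the best combination. To compare every combination that was tried, `cv_results_` can be loaded into a DataFrame, as in the sketch below. It assumes the iris data and a deliberately small grid, and the column selection is only one reasonable choice.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumption: a small illustrative grid to keep the table readable.
params = {"max_depth": [1, 3, 5], "min_samples_leaf": [3, 7, 11]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5)
grid.fit(X, y)

# One row per parameter combination, best-ranked first.
results = (
    pd.DataFrame(grid.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "mean_fit_time"]]
    .reset_index(drop=True)
)
print(results.head())
```

Looking at `std_test_score` next to `mean_test_score` shows whether the top few combinations are really distinguishable or differ only by fold-to-fold noise.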
    

In [28]:
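clf = DecisionTreeRegressor()
# Note: scikit-learn 1.2 and later only accept 'squared_error' and 'absolute_error' in place of 'mse' and 'mae'
params = {'criterion':['mse', 'friedman_mse', 'mae'],
         'max_depth':[None, 1, 3, 5, 7, 9, 11],
         'min_samples_leaf':[3, 5, 7, 9, 11],
         'splitter':['best', 'random']}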

In [29]:
get_the_best_model_and_performance(clf, params, X_train, y_train)



Best Performance:0.7320795174862316
The parameters of best performance:{'criterion': 'mse', 'max_depth': None, 'min_samples_leaf': 11, 'splitter': 'best'}
Average time to fit(s):0.002
Average time to predict(s):0.001


## Results
Grid search found the following best-performing hyperparameters:
'criterion': 'mse', 'max_depth': None, 'min_samples_leaf': 11, 'splitter': 'best'

The mean cross-validated R^2 for this combination is 0.732.
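
This score comes from cross-validation on the training split only; the held-out test split has not been used yet. A minimal sketch of that last step follows. Because `load_boston` is not available in recent scikit-learn releases, it assumes `load_diabetes` as a stand-in dataset and a reduced grid, so its numbers are not comparable to the ones above.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes stands in for the Boston housing data.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

params = {
    "max_depth": [None, 1, 3, 5, 7],
    "min_samples_leaf": [3, 5, 7, 9, 11],
}
grid = GridSearchCV(DecisionTreeRegressor(random_state=0), params, cv=5)
grid.fit(X_train, y_train)

# GridSearchCV refits the best configuration on all of X_train by default,
# so best_estimator_ can be scored directly on the untouched test split.
cv_r2 = grid.best_score_
test_r2 = grid.best_estimator_.score(X_test, y_test)

print("Cross-validated R^2: {:.3f}".format(cv_r2))
print("Held-out test R^2:   {:.3f}".format(test_r2))
```

A large gap between the two values would suggest that the selected tree is overfitting the training folds.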

## Understanding the Hyperparameters

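1. criterion: the metric used to measure how similar the samples within a node are
    * For classification (DecisionTreeClassifier): 'gini' or 'entropy'
    * For regression (DecisionTreeRegressor): 'mse', 'friedman_mse' or 'mae'
2. splitter: controls the randomness of the tree; 'best' picks the best split, while 'random' picks the best among randomly drawn candidate splits
3. Pruning settings
    * max_depth: limits the maximum depth the tree may grow to
    * min_samples_leaf: after a split, every child node must contain at least min_samples_leaf training samples (used together with max_depth)
    * min_samples_split: an internal node must contain at least min_samples_split training samples before it is allowed to split; otherwise no split occurs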
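
To make the criterion values concrete, the impurity of a single node can be computed by hand. The helpers below are hypothetical illustrations written with NumPy, not scikit-learn's internal implementation; they assume entropy is measured in bits and that the absolute-error criterion is taken around the node median.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mse(values):
    """Mean squared deviation from the node mean."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

def mae(values):
    """Mean absolute deviation from the node median."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - np.median(values)))

# A pure node (one class) versus an evenly mixed two-class node.
pure_node = [0, 0, 0, 0]
mixed_node = [0, 0, 1, 1]
print("gini:", gini(pure_node), gini(mixed_node))
print("entropy:", entropy(pure_node), entropy(mixed_node))

# Regression targets falling into one node.
targets = [1.0, 2.0, 3.0, 10.0]
print("mse:", mse(targets), "mae:", mae(targets))
```

The pure node has zero impurity under both classification measures, while the evenly mixed node reaches a Gini impurity of 0.5 and an entropy of 1 bit. A split is considered good when it moves the child nodes closer to the pure case.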

# Tuning Other Hyperparameters (Classification)

In [31]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier()

# Parameter grid
params = {'criterion':['gini', 'entropy'],
         'max_depth':[None, 1, 3, 5, 7, 9, 11],
         'min_samples_leaf':[3, 5, 7, 9, 11],
         'splitter':['best', 'random']}

In [32]:
get_the_best_model_and_performance(clf, params, x_train, y_train)



Best Performance:0.9821428571428571
The parameters of best performance:{'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 3, 'splitter': 'best'}
Average time to fit(s):0.0
Average time to predict(s):0.0


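Grid search selected the following best hyperparameters: 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 3, 'splitter': 'best', with a mean cross-validated accuracy of 0.982.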
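
Separately from the grid search, it can help to see how each growth limit changes the fitted tree itself. The sketch below reuses the same iris split and compares tree depth, leaf count, and test accuracy; the specific settings are hypothetical values chosen only for contrast.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Hypothetical settings: an unrestricted tree versus three growth limits.
settings = {
    "unrestricted": {},
    "max_depth=2": {"max_depth": 2},
    "min_samples_leaf=11": {"min_samples_leaf": 11},
    "min_samples_split=40": {"min_samples_split": 40},
}

summary = {}
for name, kwargs in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **kwargs)
    tree.fit(x_train, y_train)
    summary[name] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "test_accuracy": tree.score(x_test, y_test),
    }

for name, stats in summary.items():
    print(name, stats)
```

Comparing the rows shows the trade-off each limit makes between tree size and accuracy on this particular split.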

