# hw2 decision tree
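This notebook loads randomly generated data about girls as training data, cleans it up, feeds it to a decision tree for training, and finally compares the predictions against the labels to measure prediction accuracy.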

### import packages

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from sklearn.tree import export_graphviz
from sklearn.metrics import accuracy_score
import pydotplus
import os
os.environ['PATH'] = os.environ['PATH'] + (';C:\\Users\\Mimori\\Anaconda3\\pkgs\\graphviz-2.38-hfd603c8_2\\Library\\bin\\graphviz\\')

### data preprocessing
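Because we generated the data ourselves, it is fairly tidy and unlikely to have missing values or similar problems. We only need to separate the columns we need from the ones we don't and fix the data types.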
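
The "tidy" assumption is cheap to verify before training. The following is a minimal sketch on a hypothetical miniature of `DT_data.csv` (the rows and the variable names `sample`, `missing_per_column`, and `non_numeric` are illustrative, not taken from the real file); it counts missing values per column and lists the columns that still need a numeric encoding.

```python
import pandas as pd

# Hypothetical miniature of DT_data.csv; the values are illustrative only.
sample = pd.DataFrame({
    "height": [154, 162, 140],
    "cup": ["f", "d", "f"],
    "income": [66000, 59000, 70000],
    "suki": [1, 1, 0],
})

# Number of missing values in each column (all zeros means nothing to impute).
missing_per_column = sample.isna().sum()

# Columns that are not numeric and therefore cannot be fed to the tree as-is.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column.to_dict())
print(non_numeric)
```

On the real data, the same two checks could be run on `df` right after `pd.read_csv`.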

### Drop score, keeping only the like/dislike label (1 or 0)

In [2]:
df = pd.read_csv('./DT_data.csv')
df = df.drop('score', axis=1)  # pass axis by keyword; the positional form fails on pandas 2.x
df.head()

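| | height | weight | age | cup | income | suki |
|---|---|---|---|---|---|---|
| 0 | 154 | 46 | 36 | f | 66000 | 1 |
| 1 | 162 | 32 | 12 | d | 59000 | 1 |
| 2 | 140 | 53 | 17 | f | 70000 | 1 |
| 3 | 169 | 54 | 36 | e | 70000 | 1 |
| 4 | 143 | 56 | 17 | e | 22000 | 0 |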


### Convert the character-typed cup column to numbers

In [3]:
# deal with the char data in cup!
cup_map = {'a' : 0, 'b' :1, 'c' :2, 'd' :3, 'e' :4, 'f' :5}
df['cup'] = df['cup'].map(cup_map)
df.head()

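| | height | weight | age | cup | income | suki |
|---|---|---|---|---|---|---|
| 0 | 154 | 46 | 36 | 5 | 66000 | 1 |
| 1 | 162 | 32 | 12 | 3 | 59000 | 1 |
| 2 | 140 | 53 | 17 | 5 | 70000 | 1 |
| 3 | 169 | 54 | 36 | 4 | 70000 | 1 |
| 4 | 143 | 56 | 17 | 4 | 22000 | 0 |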
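
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary silently becomes `NaN`. The sketch below reuses the `cup_map` from the cell above on a small hypothetical series (the value `'g'` is invented for illustration) and shows how to list the values that failed to map.

```python
import pandas as pd

cup_map = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5}

# Hypothetical input; 'g' is an unexpected value that cup_map does not cover.
cups = pd.Series(['f', 'd', 'g'])

encoded = cups.map(cup_map)               # unmapped entries become NaN
unmapped = cups[encoded.isna()].tolist()  # original values that failed to map

print(encoded.tolist())
print(unmapped)
```

If `unmapped` is empty on the real column, the encoding covered every value.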


### Split the data into training and test sets

In [4]:
df_train = df[:1700]
df_train.head()

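| | height | weight | age | cup | income | suki |
|---|---|---|---|---|---|---|
| 0 | 154 | 46 | 36 | 5 | 66000 | 1 |
| 1 | 162 | 32 | 12 | 3 | 59000 | 1 |
| 2 | 140 | 53 | 17 | 5 | 70000 | 1 |
| 3 | 169 | 54 | 36 | 4 | 70000 | 1 |
| 4 | 143 | 56 | 17 | 4 | 22000 | 0 |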


In [5]:
df_test = df[1700:]
df_test.head()

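| | height | weight | age | cup | income | suki |
|---|---|---|---|---|---|---|
| 1700 | 147 | 30 | 16 | 2 | 24000 | 0 |
| 1701 | 155 | 43 | 36 | 5 | 32000 | 0 |
| 1702 | 146 | 66 | 28 | 2 | 44000 | 0 |
| 1703 | 151 | 67 | 12 | 0 | 29000 | 0 |
| 1704 | 140 | 62 | 17 | 3 | 60000 | 1 |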


### Extract the like/dislike column from the training and test data as the target y

In [6]:
y = df_train['suki'].values
df_train = df_train.drop('suki', axis=1)

In [7]:
y_test = df_test['suki'].values
X_test = df_test.drop('suki', axis=1)
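
The split above takes the first 1700 rows for training and the rest for testing, which is fine when the rows are already in random order. As an alternative, here is a hedged sketch using scikit-learn's `train_test_split`, which shuffles and can keep the class ratio the same in both parts. The data frame `demo`, its labelling rule, and the split sizes are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: two features and a binary label.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(12, 40, size=200),
    "income": rng.integers(20, 80, size=200) * 1000,
})
demo["suki"] = (demo["income"] > 50000).astype(int)

X_all = demo.drop("suki", axis=1)
y_all = demo["suki"]

# Shuffle, hold out 30 rows, and keep the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=30, random_state=0, stratify=y_all
)

print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible between runs.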

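### Build the decision tree and test different depths

#### Decision tree with depth = 5
We start with depth set to 5, use every feature for the prediction, and look at the importance of each of the five features. The two most important features turn out to be age and monthly income, both of which we gave a weight of 3 when generating the data, so the result matches our setup.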

In [11]:
depth = 5

dtree=DecisionTreeClassifier(max_depth = depth)
dtree.fit(df_train,y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['kirai','suki'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree_depth_" + str(depth) + ".pdf")
dtree.feature_importances_

array([0.01859873, 0.01551502, 0.31096149, 0.24995026, 0.4049745 ])
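
The raw array is easier to read once each value is paired with its column. The sketch below copies the depth-5 importances printed above and assumes the column order shown by `df.head()` after `suki` is dropped (height, weight, age, cup, income); the names `feature_names` and `ranked` are illustrative.

```python
import pandas as pd

# Column order assumed from the df.head() output, with 'suki' dropped.
feature_names = ["height", "weight", "age", "cup", "income"]

# Values copied from the depth-5 output above.
importances_depth5 = [0.01859873, 0.01551502, 0.31096149, 0.24995026, 0.4049745]

# Pair each importance with its feature name and sort from most to least important.
ranked = pd.Series(importances_depth5, index=feature_names).sort_values(ascending=False)

print(ranked)
```

In the notebook itself, the equivalent would be `pd.Series(dtree.feature_importances_, index=df_train.columns)`.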

Feed the test data into the trained decision tree to get the predictions y_predict, then compute the accuracy.

In [14]:
y_predict = dtree.predict(X_test)
accuracy_score(y_test, y_predict)

0.8933333333333333
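
For reference, `accuracy_score` on a binary problem like this one is simply the fraction of predictions that match the labels. A small sketch with hypothetical label arrays (not the notebook's data) shows that the manual computation and the library call agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where prediction and label agree (6 of 8 here).
manual_accuracy = (y_true_demo == y_pred_demo).mean()
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```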

#### Decision tree with depth = 4
Height, one of the less important features, is no longer used (its importance is now 0), and accuracy barely drops: from 0.893 to 0.887.

In [15]:
depth = 4

dtree=DecisionTreeClassifier(max_depth = depth)
dtree.fit(df_train,y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['kirai','suki'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree_depth_" + str(depth) + ".pdf")
dtree.feature_importances_

array([0.        , 0.01450567, 0.30132692, 0.23309695, 0.45107046])

In [16]:
y_predict = dtree.predict(X_test)
accuracy_score(y_test, y_predict)

0.8866666666666667

#### Decision tree with depth = 3
This time the other unimportant feature, weight, is dropped. With two features gone, accuracy starts to fall more noticeably, from 0.887 to 0.853.

In [17]:
depth = 3

dtree=DecisionTreeClassifier(max_depth = depth)
dtree.fit(df_train,y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['kirai','suki'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree_depth_" + str(depth) + ".pdf")
dtree.feature_importances_

array([0.        , 0.        , 0.31747819, 0.22401151, 0.45851029])

In [18]:
y_predict = dtree.predict(X_test)
accuracy_score(y_test, y_predict)

0.8533333333333334

#### Decision tree with depth = 2
One more feature, cup, is dropped, and accuracy falls again from 0.853 to 0.783.

In [22]:
depth = 2

dtree=DecisionTreeClassifier(max_depth = depth)
dtree.fit(df_train,y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['kirai','suki'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree_depth_" + str(depth) + ".pdf")
dtree.feature_importances_

array([0.        , 0.        , 0.42475314, 0.        , 0.57524686])

In [23]:
y_predict = dtree.predict(X_test)
accuracy_score(y_test, y_predict)

0.7833333333333333

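#### Decision tree with depth = 1
Only one feature, income, is left, yet prediction accuracy is still 0.73, much higher than expected. The data design may be at fault: although the two top features (age and income) were given the same weight, their actual influence on the label is not the same.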

![tree_depth_1.png](attachment:tree_depth_1.png)
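
Whether 0.73 is surprisingly high depends on the class balance of the test set, which this notebook does not report. One way to put the number in context is to compare it with a classifier that always predicts the most frequent training label. The sketch below does this with scikit-learn's `DummyClassifier` on hypothetical label counts (60/40 in training, 13/7 in test); these counts are invented and say nothing about the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical label counts; the real class balance is not shown in this notebook.
y_train_demo = np.array([0] * 60 + [1] * 40)
y_test_demo = np.array([0] * 13 + [1] * 7)

# The features are irrelevant for this baseline, so placeholders are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))

print(baseline_accuracy)
```

Running the same baseline on `df_train`, `y`, `X_test`, and `y_test` would show how much of the 0.73 comes from the income split rather than from class imbalance.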

In [52]:
depth = 1

dtree=DecisionTreeClassifier(max_depth = depth)
dtree.fit(df_train,y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['kirai','suki'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree_depth_" + str(depth) + ".pdf")
dtree.feature_importances_

array([0., 0., 0., 0., 1.])

In [53]:
y_predict = dtree.predict(X_test)
accuracy_score(y_test, y_predict)

0.73
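
The five depth experiments above repeat the same cell with a different `depth`. They could be folded into one loop, as in the following sketch. It runs on synthetic data (the frame `demo` and its labelling rule are invented), skips the Graphviz export, and sets `random_state=0` for reproducibility, which the original cells do not do.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data; the labelling rule is made up.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "age": rng.integers(12, 40, size=n),
    "income": rng.integers(20, 80, size=n) * 1000,
})
labels = ((demo["age"] < 25) & (demo["income"] > 50000)).astype(int).to_numpy()

X_tr, X_te = demo[:400], demo[400:]
y_tr, y_te = labels[:400], labels[400:]

results = {}
for depth in range(5, 0, -1):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = {
        # Test accuracy at this depth.
        "accuracy": accuracy_score(y_te, tree.predict(X_te)),
        # Number of features with non-zero importance, i.e. actually used in a split.
        "features_used": int((tree.feature_importances_ > 0).sum()),
    }

for depth, row in results.items():
    print(depth, row)
```

Collecting the results in one dictionary makes the accuracy-versus-depth trend visible in a single output.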

### try random forest
A random forest lets many decision trees vote on a single result. Here we test how the result changes with the number of trees (the n_estimators parameter).
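
The voting idea can be sketched in a few lines of NumPy. The matrix `tree_votes` below is hypothetical: each row holds one tree's 0/1 predictions for four samples. Note that this shows plain majority voting; as far as I know, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead, which usually gives similar but not always identical answers.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per sample.
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# Share of trees voting for class 1 on each sample.
vote_share = tree_votes.mean(axis=0)

# Class 1 wins a sample when more than half of the trees vote for it.
majority = (vote_share > 0.5).astype(int)

print(majority.tolist())
```

Using an odd number of trees avoids ties in this simple scheme.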

In [30]:
from sklearn.ensemble import RandomForestClassifier

#### n_estimators = 2 (two trees)
Accuracy is 0.88, about the same as with a single tree.

In [50]:
forest = RandomForestClassifier(criterion='entropy', n_estimators=2,random_state=3,n_jobs=2)
forest.fit(df_train,y)
forest.feature_importances_

array([0.08985848, 0.08487033, 0.27650429, 0.171533  , 0.37723391])

In [51]:
y_predict = forest.predict(X_test)
accuracy_score(y_test, y_predict)

0.88

#### n_estimators = 10 (ten trees)
With ten trees, accuracy improves noticeably, reaching 0.937.

In [56]:
forest = RandomForestClassifier(criterion='entropy', n_estimators=10,random_state=3,n_jobs=2)
forest.fit(df_train,y)
forest.feature_importances_

array([0.08540742, 0.08240534, 0.29760691, 0.15897417, 0.37560615])

In [57]:
y_predict = forest.predict(X_test)
accuracy_score(y_test, y_predict)

0.9366666666666666

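#### n_estimators = 100 (one hundred trees)
Jumping straight to 100 trees leaves accuracy at about 0.94, so this looks like the ceiling for this data; adding more trees does not appear to help.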

In [58]:
forest = RandomForestClassifier(criterion='entropy', n_estimators=100,random_state=3,n_jobs=2)
forest.fit(df_train,y)
forest.feature_importances_

array([0.0859987 , 0.08947261, 0.28655467, 0.17908004, 0.35889398])

In [59]:
y_predict = forest.predict(X_test)
accuracy_score(y_test, y_predict)

0.94
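
The three forest experiments can also be written as one sweep over `n_estimators`. The sketch below keeps the notebook's `criterion='entropy'` and `random_state=3`, but it runs on synthetic data (the arrays `X_demo` and `y_demo` and their labelling rule are invented), so its accuracies are not comparable to the numbers above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: five features, label driven by two of them plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 5))
noise = 0.5 * rng.normal(size=600)
y_demo = (X_demo[:, 2] + X_demo[:, 4] + noise > 0).astype(int)

X_tr, X_te = X_demo[:500], X_demo[500:]
y_tr, y_te = y_demo[:500], y_demo[500:]

scores = {}
for n_trees in (2, 10, 100):
    forest = RandomForestClassifier(
        criterion="entropy", n_estimators=n_trees, random_state=3
    )
    forest.fit(X_tr, y_tr)
    # Test accuracy for this forest size.
    scores[n_trees] = accuracy_score(y_te, forest.predict(X_te))

for n_trees, score in scores.items():
    print(n_trees, round(score, 3))
```

Adding intermediate sizes such as 25 or 50 to the tuple would show more precisely where the accuracy curve flattens.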