# Algorithms Using Decision Trees

#### Decision Tree  
A decision tree is a supervised learning model, used mainly for classification and regression.

The tree type depends on the target variable:
* *Continuous variable*: Regression Tree
* *Discrete variable*: Classification Tree
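
As a minimal sketch of that distinction (not taken from the original notebook), the two tree types differ mainly in how a leaf turns the training targets it holds into a prediction. A regression tree typically returns their mean, while a classification tree returns their majority class. The helper names `regression_leaf` and `classification_leaf` are hypothetical.

```python
from collections import Counter
from statistics import mean


def regression_leaf(targets):
    """Predict a continuous value: the mean of the targets in the leaf."""
    return mean(targets)


def classification_leaf(targets):
    """Predict a discrete label: the most common class in the leaf."""
    return Counter(targets).most_common(1)[0][0]


print(regression_leaf([2.0, 3.0, 4.0]))           # 3.0
print(classification_leaf(["Yes", "No", "Yes"]))  # Yes
```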

#### Composition
1. Root Node
2. Parent Node
3. Child Node
4. Terminal Node (leaf node)
5. Branch: Connections from the root to leaves
6. Depth: Number of node layers from root to leaves
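
The terms above can be made concrete with a small hand-built tree. This is only an illustrative sketch: the `Node` class and the `depth` and `leaves` helpers are hypothetical names, and depth is counted in node layers as defined in the list (some libraries count edges instead, which gives a value one smaller).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; a node with no children is a terminal (leaf) node."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def depth(node: Node) -> int:
    """Number of node layers from this node down to its deepest leaf."""
    if node.is_leaf():
        return 1
    return 1 + max(depth(child) for child in node.children)


def leaves(node: Node) -> List[str]:
    """Names of all terminal nodes under this node, left to right."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


# "Outlook" is the root and the parent of "Sunny" and "Overcast";
# "Sunny" is in turn the parent of two leaves.
root = Node("Outlook", [
    Node("Sunny", [Node("No"), Node("Yes")]),
    Node("Overcast"),
])

print(depth(root))   # 3
print(leaves(root))  # ['No', 'Yes', 'Overcast']
```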

#### Decision Tree Analysis Process
1. Growing: Find the optimal split criterion at each node and grow the tree, applying an appropriate stopping rule.
2. Pruning: Remove unnecessary branches that increase the risk of misclassification or that carry inappropriate induction rules.
3. Validity assessment
4. Interpretation and prediction
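
The growing and pruning steps can be sketched with scikit-learn, assuming its `DecisionTreeClassifier` parameters `max_depth` and `min_samples_leaf` as stopping rules and `ccp_alpha` (cost-complexity pruning) as one possible pruning method. The synthetic data and the specific parameter values are arbitrary choices for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Growing without a stopping rule: the tree expands until its leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Growing with stopping rules: cap the depth and require a minimum leaf size.
stopped = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

# Pruning: cost-complexity pruning collapses branches that do not reduce
# impurity enough to justify their added complexity.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, model in [("full", full), ("stopped", stopped), ("pruned", pruned)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

Comparing depth and leaf counts across the three models shows how much each rule restricts the tree. Whether a restricted tree generalizes better has to be checked in the validity assessment step, for example on held-out data.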

#### Decision Tree Split Criterion
1. Information gain: The difference between the entropy of the parent node and the weighted entropy of its child nodes
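
As a worked example, information gain can be written as `IG = H(parent) - sum_k (n_k / n) * H(child_k)`, where `H` is Shannon entropy and each child is weighted by its share of the parent's samples. The sketch below applies this to the `Outlook` column of the PlayTennis data used in the assignment that follows. The function names `entropy` and `information_gain` are hypothetical helpers, not library calls.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    children = {}
    for value, label in zip(feature, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(c) / len(labels) * entropy(c) for c in children.values())
    return entropy(labels) - weighted


# Outlook and PlayTennis columns of the 14-row PlayTennis table.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(f"{entropy(play):.3f}")                    # 0.940
print(f"{information_gain(outlook, play):.3f}")  # 0.247
```

When growing a tree with this criterion, the feature with the largest gain is chosen for the split at each node.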

### Assignment: Predict whether to play tennis using a decision tree

In [8]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from IPython.display import Image

import pandas as pd
import numpy as np
import pydotplus
import os

In [40]:
tennis_data = pd.read_csv('PlayTennis.csv')
print(tennis_data)

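# Data preprocessing: encode every categorical value as an integer.
# Note: the counter i is not reset between columns, so the codes are unique
# across the whole table (Outlook 0-2, Temperature 3-5, Humidity 6-7,
# Wind 8-9, PlayTennis 10-11).

labels = {'Outlook': ['Sunny', 'Overcast', 'Rain'],
          'Temperature': ['Hot', 'Mild', 'Cool'],
          'Humidity': ['High', 'Normal'],
          'Wind': ['Weak', 'Strong'],
          'PlayTennis': ['No', 'Yes']}
i = 0
for col in labels:
    for val in labels[col]:
        tennis_data[col] = tennis_data[col].replace(val, i)
        i += 1

print("Preprocessing done: \n", tennis_data)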
#tennis_data.Outlook = tennis_data.Outlook.replace('Sunny', 0)

     Outlook Temperature Humidity    Wind PlayTennis
0      Sunny         Hot     High    Weak         No
1      Sunny         Hot     High  Strong         No
2   Overcast         Hot     High    Weak        Yes
3       Rain        Mild     High    Weak        Yes
4       Rain        Cool   Normal    Weak        Yes
5       Rain        Cool   Normal  Strong         No
6   Overcast        Cool   Normal  Strong        Yes
7      Sunny        Mild     High    Weak         No
8      Sunny        Cool   Normal    Weak        Yes
9       Rain        Mild   Normal    Weak        Yes
10     Sunny        Mild   Normal  Strong        Yes
11  Overcast        Mild     High  Strong        Yes
12  Overcast         Hot   Normal    Weak        Yes
13      Rain        Mild     High  Strong         No
Preprocessing done: 
     Outlook  Temperature  Humidity  Wind  PlayTennis
0         0            3         6     8          10
1         0            3         6     9          10
2         1            3

In [47]:
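# Features and target as NumPy arrays; y is kept 1-D, as scikit-learn expects.
feature_cols = ['Outlook', 'Temperature', 'Humidity', 'Wind']
X = tennis_data[feature_cols].to_numpy()
y = tennis_data['PlayTennis'].to_numpy()

# The default split holds out 25% of the rows (4 of 14). No random_state is
# set, so the split and the scores below vary from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y)

dt_clf = DecisionTreeClassifier()
dt_clf = dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)

# Rows of the confusion matrix are true labels; columns are predicted labels.
print(confusion_matrix(y_test, dt_pred))
print(classification_report(y_test, dt_pred))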

[[1 1]
 [0 2]]
              precision    recall  f1-score   support

          10       1.00      0.50      0.67         2
          11       0.67      1.00      0.80         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4
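
To make the report easier to read, the per-class numbers can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch: `per_class_metrics` is a hypothetical helper, the class codes 10 and 11 correspond to `No` and `Yes` in the encoding used by the preprocessing cell, and the function does not guard against division by zero for a class that is never predicted.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, in the order [10 (No), 11 (Yes)].
class_codes = [10, 11]
cm = [[1, 1],
      [0, 2]]


def per_class_metrics(cm, k):
    """Precision, recall and F1 for the class at index k."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)  # column sum: samples predicted as k
    actual = sum(cm[k])                    # row sum: samples that truly are k
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k, code in enumerate(class_codes):
    p, r, f = per_class_metrics(cm, k)
    print(f"{code}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
# 10: precision=1.00 recall=0.50 f1=0.67
# 11: precision=0.67 recall=1.00 f1=0.80
# accuracy=0.75
```

With only four test samples, each misclassification moves accuracy by 25 percentage points, so these scores are a rough indication at best.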

