### **Entropy calculation and the classification procedure**

* Entropy: $H[Y] = -\sum_{k=1}^K p(y_k) \log_2 p(y_k)$

* Conditional entropy: $H[Y \mid X] = - \sum_i \sum_j \,p(x_i, y_j) \log_2 p(y_j \mid x_i)$

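**Classification procedure**

1. Split the data on a candidate feature, tabulate the class counts of each resulting group as a histogram, and compute the conditional entropy.
2. Select as the best feature the one with the largest information gain, defined as the entropy before the split minus the conditional entropy after it.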
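
The two formulas and the selection rule above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names `entropy`, `conditional_entropy`, and `information_gain` are hypothetical, and the four-sample toy data is made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """H[Y] in bits for a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H[Y | X] in bits: joint probability times log2 of the conditional."""
    n = len(labels)
    joint = Counter(zip(feature, labels))   # counts of each (x_i, y_j) pair
    marginal = Counter(feature)             # counts of each x_i
    return sum(-(c / n) * math.log2(c / marginal[x]) for (x, _), c in joint.items())

def information_gain(feature, labels):
    """Entropy before the split minus conditional entropy after it."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Toy data (assumed for illustration): x_good separates the classes, x_bad does not.
y = ["a", "a", "b", "b"]
x_good = [0, 0, 1, 1]
x_bad = [0, 1, 0, 1]
print(information_gain(x_good, y))  # one full bit of gain
print(information_gain(x_bad, y))   # no gain
```

Only pairs that actually occur are counted, so the sums never evaluate $\log_2 0$.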



### **When the data is discrete**

Consider the following dataset:

| f1 | f2 | f3 | y |
|:-:|:-:|:-:|:-:|
|0|1|1| yes|
|0|1|0| no|
|1|1|1| no|
|0|1|0| no|
|0|0|1| no|
|1|0|1| yes|
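
The per-feature count tables used in the following steps can be tallied directly from these rows. The snippet below is a small sketch under the assumption that the dataset is held as a list of tuples; `contingency` is a hypothetical helper name.

```python
from collections import Counter

# Rows of the table above: (f1, f2, f3, y)
rows = [
    (0, 1, 1, "yes"),
    (0, 1, 0, "no"),
    (1, 1, 1, "no"),
    (0, 1, 0, "no"),
    (0, 0, 1, "no"),
    (1, 0, 1, "yes"),
]

def contingency(rows, col):
    """Count (feature value, label) pairs for the feature in column `col`."""
    return Counter((r[col], r[-1]) for r in rows)

for col, name in enumerate(["f1", "f2", "f3"]):
    counts = contingency(rows, col)
    for value in (1, 0):
        yes, no = counts[(value, "yes")], counts[(value, "no")]
        print(f"{name} x={value}: yes={yes} no={no} total={yes + no}")
```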

#### **1. Base entropy: the entropy before any split**

|y=yes|y=no|total|
|:-:|:-:|:-:|
|2|4|6|

$E_{base} = -[\ P(y_{=yes})\log_2{P(y_{=yes})} + P(y_{=no})\log_2{P(y_{=no})}\ ]\\
= -(\frac{2}{6}\log_2{\frac{2}{6}}+\frac{4}{6}\log_2{\frac{4}{6}})$

In [1]:
import numpy as np

# Base entropy of the 2 "yes" / 4 "no" labels, in bits
Ebase = -((2/6)*np.log2(2/6) + (4/6)*np.log2(4/6))
Ebase

0.91829583405448956

#### **2. Compute each feature's conditional entropy and information gain**

* feature 1  

|f1|y=yes|y=no|total|
|:-:|:-:|:-:|:-:|
|x=1|1|1|2|
|x=0|1|3|4|
|total|2|4|6|

$E_1 = -[\ P(y_{=yes},x_{=1})\log_2{P(y_{=yes}|x_{=1})} + P(y_{=no},x_{=1})\log_2{P(y_{=no}|x_{=1})} \\
\qquad\ \ \ +P(y_{=yes},x_{=0})\log_2{P(y_{=yes}|x_{=0})} + P(y_{=no},x_{=0})\log_2{P(y_{=no}|x_{=0})}\ ] \\
= -[\ \frac{1}{6}\log_2{\frac{1}{2}}+\frac{1}{6}\log_2{\frac{1}{2}}+\frac{1}{6}\log_2{\frac{1}{4}}+\frac{3}{6}\log_2{\frac{3}{4}}\ ] $

$IG_1 = E_{base} - E_1$

In [2]:
E1 = -((1/6)*np.log2(1/2) + (1/6)*np.log2(1/2) + (1/6)*np.log2(1/4) + (3/6)*np.log2(3/4))
IG1 = Ebase - E1

* feature 2

|f2|y=yes|y=no|total|
|:-:|:-:|:-:|:-:|
|x=1|1|3|4|
|x=0|1|1|2|
|total|2|4|6|

$E_2 = -[\ \frac{1}{6}\log_2{\frac{1}{4}} + \frac{3}{6}\log_2{\frac{3}{4}} + \frac{1}{6}\log_2{\frac{1}{2}} + \frac{1}{6}\log_2{\frac{1}{2}}]$

$IG_2 = E_{base} - E_2$

In [3]:
E2 = -((1/6)*np.log2(1/4) + (3/6)*np.log2(3/4) + (1/6)*np.log2(1/2) + (1/6)*np.log2(1/2))
IG2 = Ebase - E2

* feature 3

|f3|y=yes|y=no|total|
|:-:|:-:|:-:|:-:|
|x=1|2|2|4|
|x=0|0|2|2|
|total|2|4|6|

$E_3 = -[\ \frac{2}{6}\log_2{\frac{2}{4}} + \frac{2}{6}\log_2{\frac{2}{4}} + \frac{0}{6}\log_2{\frac{0}{2}} + \frac{2}{6}\log_2{\frac{2}{2}}]$

Here $0 \log_2 0$ is taken to be 0.

$IG_3 = E_{base} - E_3$

In [4]:
# Only the two (2/6)*log2(2/4) terms are nonzero: log2(2/2) = 0, and 0*log2(0) is taken as 0
E3 = -((2/6)*np.log2(2/4) + (2/6)*np.log2(2/4))
IG3 = Ebase - E3

In [5]:
print('f1 | Entropy: {0:.4f}, IG: {1:.4f}'.format(E1, IG1))
print('f2 | Entropy: {0:.4f}, IG: {1:.4f}'.format(E2, IG2))
print('f3 | Entropy: {0:.4f}, IG: {1:.4f}'.format(E3, IG3))

f1 | Entropy: 0.8742, IG: 0.0441
f2 | Entropy: 0.8742, IG: 0.0441
f3 | Entropy: 0.6667, IG: 0.2516


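As these results show, f3 has the largest information gain, so it is selected for the first split; f1 and f2 tie as the next candidates. Strictly speaking, the gains are recomputed on each branch's subset before the next feature is chosen.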
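
Repeating this selection on every branch yields a tree. The following is a minimal ID3-style sketch on the six-row dataset, not the `Decision_Tree` class used in the next section; `entropy`, `gain`, and `build` are hypothetical helper names, and the conditional entropy is computed in the equivalent form $\sum_x p(x)\,H[Y \mid X = x]$.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting `rows` on column `col` (label is the last column)."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)  # p(x) * H[Y | X = x]
    return entropy(labels) - remainder

def build(rows, cols, names):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:      # pure node: return the class
        return labels[0]
    if not cols:                   # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))   # ties go to the earliest column
    rest = [c for c in cols if c != best]
    return {names[best]: {
        value: build([r for r in rows if r[best] == value], rest, names)
        for value in sorted(set(r[best] for r in rows))
    }}

rows = [(0, 1, 1, "yes"), (0, 1, 0, "no"), (1, 1, 1, "no"),
        (0, 1, 0, "no"), (0, 0, 1, "no"), (1, 0, 1, "yes")]
names = ["f1", "f2", "f3"]
tree = build(rows, [0, 1, 2], names)
print(tree)
```

In this sketch the f3 = 0 branch is already pure ("no"). Inside the f3 = 1 branch, f1 and f2 each have zero gain on their own, so the tie is broken by column order and both features are needed to separate the remaining rows.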

---

### **Code**

In [6]:
from pprint import pprint

import numpy as np
import pandas as pd

In [7]:
from DecisionTree import Decision_Tree

In [8]:
tree = Decision_Tree()

In [9]:
data, labels = tree.example_dataset()

### Building the frequency matrix
The entropy is computed from this matrix.

In [10]:
mat, _labels = tree.frequency_matrix(data[:, -1])
pd.DataFrame(mat.astype(int), index=_labels)  # np.int was removed in NumPy 1.24; use the built-in int

| |0|
|:-:|:-:|
|no|7|
|yes|1|
|total|8|


In [19]:
-((7/8)*np.log2(7/8) + (1/8)*np.log2(1/8))

0.5435644431995964

In [18]:
tree.cal_entropy(data[:, -1])

0.5435644431995964

In [11]:
mat, _labels = tree.frequency_matrix(data[:, 0], data[:, -1])
pd.DataFrame(mat.astype(int), index=_labels['rows'], columns=_labels['cols'])

| |no|yes|total|
|:-:|:-:|:-:|:-:|
|0|3|1|4|
|1|4|0|4|
|total|7|1|8|


In [12]:
-((3/8) * np.log2(3/4) + (1/8) * np.log2(1/4) + (4/8) * np.log2(4/4))  # the (0/8) * np.log2(0/4) term is omitted: 0 * log2(0) is taken as 0

0.40563906222956642

In [13]:
tree.cal_entropy(data[:, [0, -1]])

0.40563906222956642
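
The hand calculation above can also be written as a function of the frequency matrix itself. This is a sketch only and not the implementation behind `tree.cal_entropy`: the function name `conditional_entropy_from_counts` is hypothetical, and the matrix is assumed to hold raw counts without the `total` row and column.

```python
import math

def conditional_entropy_from_counts(counts):
    """H[Y | X] in bits; rows are feature values, columns are classes (no margins)."""
    n = sum(sum(row) for row in counts)
    h = 0.0
    for row in counts:
        row_total = sum(row)
        for c in row:
            if c > 0:  # empty cells are skipped, i.e. 0 * log2(0) is treated as 0
                h -= (c / n) * math.log2(c / row_total)
    return h

# Counts from the table above: feature value 0 -> (3 no, 1 yes), value 1 -> (4 no, 0 yes)
counts = [[3, 1], [4, 0]]
print(conditional_entropy_from_counts(counts))  # approximately 0.4056
```

Skipping zero cells explicitly avoids the `log2(0)` evaluation that the commented-out term in the manual expression would otherwise trigger.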

### Building the tree

In [14]:
myTree = tree.build_Tree(data, labels)

In [15]:
pprint(myTree)

{'cartoon': {'0': {'more than 1 person': {'0': 'no',
                                          '1': {'winter': {'0': 'no',
                                                           '1': 'yes'}}}},
             '1': 'no'}}


In [16]:
test_list = [np.array([0, 0, 0]), np.array([0, 1, 1])]
test_labels = ['yes', 'yes']

In [17]:
for test_data, test_label in zip(test_list, test_labels):
    print(test_data, tree.predict(myTree, test_data, test_label))

[0 0 0] ('no', False)
[0 1 1] ('yes', True)
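
The lookup behind these predictions amounts to walking the nested dictionary until a leaf is reached. The sketch below is a standalone illustration, not the actual `tree.predict` method: the function `predict` and its signature are hypothetical, and the feature order in `feature_names` is an assumption, since the contents of `labels` are not shown above.

```python
def predict(tree, sample, feature_names):
    """Follow the branches of a nested-dict tree until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                          # each node has one key: the feature name
        value = str(sample[feature_names.index(feature)])   # branch keys are strings in the printed tree
        node = node[feature][value]
    return node

# Tree as printed above
my_tree = {'cartoon': {'0': {'more than 1 person': {'0': 'no',
                                                    '1': {'winter': {'0': 'no',
                                                                     '1': 'yes'}}}},
                       '1': 'no'}}

# ASSUMED column order; the real order comes from tree.example_dataset()
feature_names = ['cartoon', 'winter', 'more than 1 person']

for sample in ([0, 0, 0], [0, 1, 1]):
    print(sample, predict(my_tree, sample, feature_names))
```

A feature value that never appeared during training has no matching branch, so this sketch would raise a `KeyError` for it; a fuller version would need a fallback such as the majority class.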
