### The Problem with Information Gain

- Information gain is biased toward attributes that have many distinct values

##### The more distinct values an attribute has, the larger splitinfo becomes
$
\text{splitinfo} = - \sum_{i=1}^{n} \left( \frac{|D_i|}{|D|} \cdot \log_2 \left( \frac{|D_i|}{|D|} \right) \right)
$
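The growth of splitinfo can be checked numerically. The following is a minimal sketch, not part of the original notebook: the function name `split_info` and the partition sizes are assumptions chosen for illustration.

```python
import math

def split_info(partition_sizes):
    """splitinfo for a split that divides D into partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum((size / total) * math.log2(size / total)
                for size in partition_sizes if size > 0)

# Hypothetical example: 12 tuples split evenly into 2, 3, or 12 partitions.
two_way = split_info([6, 6])
three_way = split_info([4, 4, 4])
twelve_way = split_info([1] * 12)

# More partitions (more distinct attribute values) give a larger splitinfo.
print(two_way, three_way, twelve_way)
```

For an even split into `n` partitions the value is `log2(n)`, so it keeps increasing as the attribute takes on more distinct values.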

##### As the number of distinct values grows, splitinfo grows and GainRatio shrinks. In other words, attributes with many distinct values are penalized.
$
\text{GainRatio} = \frac{\text{Gain}(A)}{\text{splitinfo}} = \frac{Info(D) - Info_A(D)}{\text{splitinfo}}
$
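As a worked example of the formula above, the sketch below computes Gain, splitinfo, and GainRatio for the `age` attribute. The helper names `entropy` and `gain_and_ratio` are assumptions introduced here; the two lists are the `age` and `class_buys_computer` columns of the AllElectronics data used later in this notebook.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_and_ratio(attribute_values, labels):
    """Return (Gain(A), splitinfo, GainRatio) for one attribute."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Info_A(D): size-weighted entropy of the class labels in each partition.
    info_a = sum(len(part) / total * entropy(part)
                 for part in partitions.values())
    gain = entropy(labels) - info_a
    # splitinfo is the entropy of the attribute's own value distribution.
    split_info = entropy(attribute_values)
    return gain, split_info, gain / split_info

age = ["youth", "youth", "middle_aged", "senior", "senior", "senior",
       "middle_aged", "youth", "youth", "senior", "youth", "middle_aged",
       "middle_aged", "senior"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

gain, split_info, ratio = gain_and_ratio(age, buys)
print(round(gain, 3), round(split_info, 3), round(ratio, 3))  # 0.247 1.577 0.156
```

Note that this sketch divides by splitinfo directly, so it assumes the attribute has at least two distinct values.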


### Gini Index

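- The split measure used by the CART algorithm
- Measures the impurity of the partitions produced when the set of training tuples is split
- Estimates how often the target attribute of a tuple would be misclassified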
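
For reference, the standard definition behind these bullets can be written as follows. The symbols $m$ (number of classes), $p_i$ (proportion of tuples in $D$ that belong to class $i$), and $D_1$, $D_2$ (the two partitions of a binary split on attribute $A$) do not appear in the original notes and are introduced here as assumptions.

$
Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
$

$
Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
$

A pure partition (all tuples in one class) has $Gini(D) = 0$, and for two classes the maximum is $0.5$, reached when the classes are evenly mixed.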

### Binary Split

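- The CART algorithm assumes binary splits
- For example, with the values youth, middle_aged, and senior, the candidate splits are youth vs. the rest, middle_aged vs. the rest, and so on
- An attribute with $k$ distinct values yields $2^{k-1} - 1$ distinct binary splits
- The split with the smallest Gini value is selected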
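
The count $2^{k-1} - 1$ can be verified by enumeration. The sketch below is an illustration only; `binary_partitions` is a hypothetical helper, and it lists each partition once by always keeping the first value on the left side.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each distinct binary partition (left, right) exactly once."""
    values = list(values)
    first, rest = values[0], values[1:]
    # Fixing the first value on the left avoids emitting mirrored duplicates.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:  # skip the trivial split with everything on one side
                yield left, right

ages = ["youth", "middle_aged", "senior"]
splits = list(binary_partitions(ages))
for left, right in splits:
    print(left, "vs.", right)
```

With `k = 3` values this produces three splits, matching $2^{3-1} - 1 = 3$.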

In [1]:
import pandas as pd
import numpy as np

In [2]:
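pd_data = pd.read_csv('https://raw.githubusercontent.com/AugustLONG/ML01/master/01decisiontree/AllElectronics.csv')
# Note: drop() returns a new DataFrame. The result is not assigned here,
# so pd_data still contains the RID column, as the output below shows.
pd_data.drop("RID", axis=1)
pd_data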

Unnamed: 0,RID,age,income,student,credit_rating,class_buys_computer
0,1,youth,high,no,fair,no
1,2,youth,high,no,excellent,no
2,3,middle_aged,high,no,fair,yes
3,4,senior,medium,no,fair,yes
4,5,senior,low,yes,fair,yes
5,6,senior,low,yes,excellent,no
6,7,middle_aged,low,yes,excellent,yes
7,8,youth,medium,no,fair,no
8,9,youth,low,yes,fair,yes
9,10,senior,medium,yes,fair,yes
10,11,youth,medium,yes,excellent,yes
11,12,middle_aged,medium,no,excellent,yes
12,13,middle_aged,high,yes,fair,yes
13,14,senior,medium,no,excellent,no


In [4]:
youth = pd_data.loc[pd_data['age'] == 'youth']
middle_senior = pd_data.loc[pd_data.index.difference(youth.index)]

In [5]:
youth

Unnamed: 0,RID,age,income,student,credit_rating,class_buys_computer
0,1,youth,high,no,fair,no
1,2,youth,high,no,excellent,no
7,8,youth,medium,no,fair,no
8,9,youth,low,yes,fair,yes
10,11,youth,medium,yes,excellent,yes


In [8]:
# Set difference (every row except the youth rows)
middle_senior

Unnamed: 0,RID,age,income,student,credit_rating,class_buys_computer
2,3,middle_aged,high,no,fair,yes
3,4,senior,medium,no,fair,yes
4,5,senior,low,yes,fair,yes
5,6,senior,low,yes,excellent,no
6,7,middle_aged,low,yes,excellent,yes
9,10,senior,medium,yes,fair,yes
11,12,middle_aged,medium,no,excellent,yes
12,13,middle_aged,high,yes,fair,yes
13,14,senior,medium,no,excellent,no


In [9]:
def get_gini(df):
    # Count the tuples in each class of the target attribute.
    # len() of an empty selection is already 0, so no special case is needed.
    buy_value = len(df.loc[df["class_buys_computer"] == "yes"])
    not_buy_value = len(df.loc[df["class_buys_computer"] == "no"])

    # Gini(D) = 1 - sum of squared class proportions (assumes df is non-empty).
    result = (buy_value / len(df)) ** 2 + (not_buy_value / len(df)) ** 2

    return 1 - result
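
The arithmetic of `get_gini` can be checked by hand for the youth vs. rest split. The following is a pandas-free sketch under stated assumptions: `gini` and `gini_split` are hypothetical helpers, and the label lists are the `class_buys_computer` values copied from the `youth` and `middle_senior` outputs above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini of a binary split into `left` and `right`."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

youth_labels = ["no", "no", "no", "yes", "yes"]
rest_labels = ["yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "no"]

print(round(gini(youth_labels), 3))                     # 0.48
print(round(gini(rest_labels), 3))                      # 0.346
print(round(gini_split(youth_labels, rest_labels), 3))  # 0.394
```

The weighted value is what should be compared across candidate splits, since the two partitions differ in size.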

In [12]:
from itertools import chain, combinations

def powerset(data):
    listed_data = list(data)
    # Chain together the combinations of every size, from 0 (the empty tuple)
    # up to len(listed_data) (the full set).
    chain_set = chain.from_iterable(combinations(listed_data, i)
                                    for i in range(len(listed_data) + 1))
    return list(chain_set)

In [16]:
# Enumerate every possible subset of the age values
powerset(pd_data['age'].unique())

[(),
 ('youth',),
 ('middle_aged',),
 ('senior',),
 ('youth', 'middle_aged'),
 ('youth', 'senior'),
 ('middle_aged', 'senior'),
 ('youth', 'middle_aged', 'senior')]

In [17]:
def get_binary_split(df, attribute):
    unique_values = df[attribute].unique()
    # Keep only the non-empty proper subsets. Each one defines a split
    # "subset vs. the remaining values", so every partition appears twice
    # (once as the subset and once as its complement).
    result = [set(data) for data in powerset(unique_values)
              if 0 < len(data) < len(unique_values)]

    return result
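
One way to put the pieces together is to score every candidate subset by its weighted Gini and keep the minimum. The sketch below is self-contained and makes several assumptions: it rebuilds only the `age` and `class_buys_computer` columns inline instead of downloading the CSV, and `gini_of`, `candidate_subsets`, and `split_gini` are hypothetical names that restate the notebook's logic compactly rather than reuse its functions.

```python
from itertools import chain, combinations
import pandas as pd

# Two columns of the AllElectronics data, copied from the outputs above.
df = pd.DataFrame({
    "age": ["youth", "youth", "middle_aged", "senior", "senior", "senior",
            "middle_aged", "youth", "youth", "senior", "youth",
            "middle_aged", "middle_aged", "senior"],
    "class_buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                            "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

def gini_of(part):
    """Gini impurity of the target attribute within one partition."""
    proportions = part["class_buys_computer"].value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()

def candidate_subsets(frame, attribute):
    """All non-empty proper subsets of the attribute's distinct values."""
    values = list(frame[attribute].unique())
    subsets = chain.from_iterable(combinations(values, r)
                                  for r in range(1, len(values)))
    return [set(subset) for subset in subsets]

def split_gini(frame, attribute, subset):
    """Size-weighted Gini of the split 'subset vs. remaining values'."""
    mask = frame[attribute].isin(list(subset))
    left, right = frame[mask], frame[~mask]
    return (len(left) / len(frame) * gini_of(left)
            + len(right) / len(frame) * gini_of(right))

scores = [(subset, split_gini(df, "age", subset))
          for subset in candidate_subsets(df, "age")]
best_subset, best_gini = min(scores, key=lambda item: item[1])
print(best_subset, round(best_gini, 3))
```

Because each partition is listed twice (a subset and its complement), the two mirrored entries share the same score; `min` simply returns whichever appears first.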