# I. Algorithm 

## 1. Mathematics
### Gini impurity: 
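$$I_G(P) = 1 - \sum_{i=1}^J p_i^2$$

where $J$ is the number of classes and $p_i$ is the fraction of items in the node that belong to class $i$.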
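
As a quick check of the formula, here is a minimal sketch that computes the impurity of a plain list of labels. The helper name `gini_impurity` and the label lists are made up for illustration and are separate from the pandas-based functions in the Code section.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    counts = Counter(labels)
    return 1 - sum((n / total) ** 2 for n in counts.values())

# Hypothetical label sets, chosen only to illustrate the formula.
pure = ["Grape", "Grape"]    # a single class: impurity 0
mixed = ["Apple", "Lemon"]   # two equally likely classes: impurity 0.5

print(gini_impurity(pure))
print(gini_impurity(mixed))
```

A node with only one class has impurity 0, and the impurity grows as the classes become more evenly mixed.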

### Information gain:
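$$IG(\text{Question}) = I_G(\text{Node}) - P(\text{True}) \cdot I_G(\text{True}) - P(\text{False}) \cdot I_G(\text{False})$$

where $P(\text{True})$ and $P(\text{False})$ are the fractions of the node's items for which the question is answered true and false, and $I_G(\text{True})$, $I_G(\text{False})$ are the Gini impurities of the two resulting child nodes.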
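
The following sketch works through the formula on plain Python lists. The names `gini_impurity` and `information_gain` and the five-label node are assumptions made for this example only.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(node, true_branch, false_branch):
    """Impurity of the node minus the size-weighted impurity of its children."""
    p_true = len(true_branch) / len(node)
    return (gini_impurity(node)
            - p_true * gini_impurity(true_branch)
            - (1 - p_true) * gini_impurity(false_branch))

# Hypothetical node of five fruit labels, split into a True and a False branch.
node = ["Apple", "Apple", "Lemon", "Grape", "Grape"]
true_branch = ["Apple", "Apple", "Lemon"]
false_branch = ["Grape", "Grape"]

gain = information_gain(node, true_branch, false_branch)
print(round(gain, 4))
```

Here the node impurity is 0.64, the True branch has impurity 4/9 with weight 3/5, and the False branch is pure, so the gain is roughly 0.64 - 0.6 * 0.444 = 0.373.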

## 2. Code
- Step 1: Create a function to calculate the Gini impurity of each node.
- Step 2: Create a function to calculate the information gain of each question.
- Step 3: Loop over every value of every feature to find the best question, i.e. the question with the highest information gain.
- Step 4: Use a recursive algorithm to build the tree.
    - Base case: If no question yields any information gain, return a Leaf Node.
    - Recursive case: If there are still questions to ask, ask the best one and then recurse into the two child nodes; this node is a Decision Node.

In [175]:
import pandas as pd
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]
header = ["color", "diameter", "label"]
data = pd.DataFrame(data = training_data, columns = header)
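# Sanity check: sum of squared value proportions for "color", i.e. 1 - Gini impurity of that column.
((data["color"].value_counts()/data["color"].value_counts().sum())**2).sum()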

0.3600000000000001

In [176]:
def gini(label):
    """Gini impurity of a pandas Series of class labels."""
    proportions = label.value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()

In [177]:
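def info_gain(true_label, false_label, current_uncertainty):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    # p is the fraction of rows that answered the question with True.
    p = len(true_label) / (len(true_label) + len(false_label))
    return current_uncertainty - p * gini(true_label) - (1 - p) * gini(false_label)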

In [178]:
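def question(data, column, condition):
    """Split the rows of data into (true, false) according to one question."""
    if isinstance(condition, str):
        # Categorical feature: "is column equal to condition?"
        true = data[data[column] == condition]
        false = data[data[column] != condition]
    else:
        # Numeric feature: "is column greater than or equal to condition?"
        true = data[data[column] >= condition]
        false = data[data[column] < condition]
    return true, false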

In [179]:
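def find_best_split(data):
    """Try every (column, value) question and return the one with the highest information gain."""
    best_gain = 0
    best_column = None
    best_condition = None
    current_uncertainty = gini(data["label"])
    # The last column is assumed to hold the label, so it is never used as a feature.
    for column in data.columns[:-1]:
        for value in data[column].unique():
            true, false = question(data, column, value)
            # Skip questions that do not actually divide the rows.
            if len(true) == 0 or len(false) == 0:
                continue
            gain = info_gain(true["label"], false["label"], current_uncertainty)
            # ">=" means that on ties the question tried last wins.
            if gain >= best_gain:
                best_gain, best_column, best_condition = gain, column, value
    return best_gain, best_column, best_condition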

In [186]:
def match(row, column, condition):
    """Answer one question for a single row, using the same rule as question()."""
    if isinstance(condition, str):
        return row[column] == condition
    return row[column] >= condition

In [180]:
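def build_tree(data):
    """Recursively build the tree as nested lists [true_branch, false_branch, column, condition]."""
    gain, column, condition = find_best_split(data)
    # Base case: no question reduces impurity, so return a leaf mapping each label to its share.
    if gain == 0:
        shares = data["label"].value_counts(normalize=True)
        # Round instead of truncating, so that e.g. 56.999... is reported as 57%.
        return shares.apply(lambda x: str(int(round(x * 100))) + "%").to_dict()
    # Recursive case: partition the rows and build a subtree for each side.
    true, false = question(data, column, condition)
    true = build_tree(true)
    false = build_tree(false)
    return [true, false, column, condition]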

In [181]:
tree = build_tree(data)
for i in tree:
    print(i)

[{'Lemon': '50%', 'Apple': '50%'}, {'Apple': '100%'}, 'color', 'Yellow']
{'Grape': '100%'}
diameter
3
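
The nested list printed above is hard to read at a glance. As a rough sketch, a small helper can render it as an indented tree; `format_tree` is a hypothetical name introduced here, and the tree literal is copied from the output above so the snippet runs on its own.

```python
def format_tree(node, indent=""):
    """Return the lines of an indented view of a [true, false, column, condition] tree."""
    if isinstance(node, dict):  # leaf: label -> percentage
        return [indent + "Predict " + str(node)]
    true_branch, false_branch, column, condition = node
    # Mirror the convention of question() and match(): "==" for strings, ">=" for numbers.
    op = "==" if isinstance(condition, str) else ">="
    lines = [f"{indent}Is {column} {op} {condition}?"]
    lines.append(indent + "--> True:")
    lines.extend(format_tree(true_branch, indent + "    "))
    lines.append(indent + "--> False:")
    lines.extend(format_tree(false_branch, indent + "    "))
    return lines

# Tree literal copied from the output above.
tree = [
    [{'Lemon': '50%', 'Apple': '50%'}, {'Apple': '100%'}, 'color', 'Yellow'],
    {'Grape': '100%'},
    'diameter',
    3,
]

print("\n".join(format_tree(tree)))
```

The root asks whether `diameter >= 3`; its False branch is a pure Grape leaf, while its True branch asks a second question about `color`.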


In [189]:
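def classify(row, node):
    """Walk the tree for one row and return the leaf it reaches (label -> percentage)."""
    # Leaves are dicts; decision nodes are [true, false, column, condition] lists.
    # Checking the type avoids mistaking a leaf with four labels for a decision node.
    if isinstance(node, dict):
        return node
    true, false, column, condition = node
    if match(row, column, condition):
        return classify(row, true)
    return classify(row, false)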

In [194]:
classify(data.loc[0], tree)

{'Apple': '100%'}


# II. Practice

# III. References