# I. Algorithm 

## 1. Mathematics
### Gini Index: 
$$I_G(P) = 1 - \sum_{i=1}^J p_i^2$$
### Entropy:
$$H(\mathbf{p}) = -\sum_{i=1}^J p_i \log(p_i)$$
Note:
- $p_i = \frac{N_i}{N}$, where $N_i$ is the number of samples in class $i$, $N$ is the total number of samples in the dataset (or node) being evaluated, and $J$ is the number of classes.
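Both impurity measures above can be checked on a small example. The sketch below is a minimal illustration that works on raw class counts; the helper names `gini` and `entropy` are assumptions for this example and are not part of the class defined later. It uses the natural logarithm, as the code in this notebook does.

```python
import numpy as np

def gini(counts):
    # counts: number of samples per class, e.g. [5, 5]
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip empty classes, since log(0) is undefined
    return -np.sum(p * np.log(p))

print(gini([5, 5]))     # two balanced classes: 0.5
print(entropy([5, 5]))  # two balanced classes: ln(2)
print(gini([10, 0]))    # a pure node: 0.0
```

A pure node has zero impurity under both measures, and impurity grows as the classes become more evenly mixed.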
### Information gain:
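$$IG(\text{Question}) = I_G(\text{Node}) - P(\text{True}) \cdot I_G(\text{True}) - P(\text{False}) \cdot I_G(\text{False})$$
Note:
- $P(\text{True})$ and $P(\text{False})$ are the fractions of the node's samples that fall into the true and false branches. The same formula applies with the entropy $H$ in place of $I_G$.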
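To make the information gain formula concrete, here is a small sketch that evaluates it on class counts with the Gini index as the impurity. The function names and the toy counts are assumptions chosen for illustration only.

```python
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def info_gain(parent, true_counts, false_counts):
    # Weighted impurity of the two children, subtracted from the parent's impurity
    n_true, n_false = sum(true_counts), sum(false_counts)
    p_true = n_true / (n_true + n_false)
    return (gini(parent)
            - p_true * gini(true_counts)
            - (1 - p_true) * gini(false_counts))

# A question that separates the two classes perfectly
print(info_gain([4, 4], [4, 0], [0, 4]))  # 0.5
# A question that leaves both children as mixed as the parent
print(info_gain([4, 4], [2, 2], [2, 2]))  # 0.0
```

The first question removes all impurity, so its gain equals the parent's impurity; the second gains nothing and would never be chosen.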

## 2. Code
- Step 1: Create a function that computes the information gain of a question, using either the Gini index or entropy as the impurity measure.
- Step 2: Ask a question and split the dataset into two child nodes (true and false).
- Step 3: Loop over every value of every feature to find the best question, i.e. the one with the highest information gain.
- Step 4: Build the tree recursively.
    - Base case: if no further question can be asked of the node, return a Leaf Node.
    - Recursive case: if a useful question remains, ask it, split the data, and recurse into both child nodes; the current node becomes a Decision Node.
- Step 5: Predict new data with the tree that has been built.

In [2]:
import pandas as pd
import numpy as np

class Question(object):
    def __init__(self, column, condition):
        self.column = column
        self.condition = condition
        
    def match(self,row):
        if isinstance(self.condition, str):
            return row[self.column] == self.condition
        else:
            return row[self.column] >= self.condition
         
class LeafNode(object):
    def __init__(self, label, samples, depth):
        self.label = label
        self.samples = samples
        self.depth = depth

class DecisionNode(object):
    def __init__(self, question, true, false, samples, depth):
        self.question = question
        self.true = true
        self.false = false
        self.samples = samples
        self.depth = depth
        
class DecisionTree(object):
    def __init__(self, max_depth= 10, criterion = "gini", min_samples_split = 2, min_samples_leaf = 1):
        self.train = pd.DataFrame()
        self.test = pd.DataFrame()
        self.label = []
        self.criterion = criterion
        self.max_depth = max_depth  
        self.min_samples_split = min_samples_split 
        self.min_samples_leaf = min_samples_leaf 
        self.tree = None
        
    def fit(self,train):
        self.train = train
        self.tree = self.build_tree(self.train, 0)
        self.print_tree(self.tree)
        
    # Step 1: Create a function to calculate the information gain of each question based on gini index or entropy
    def impurity(self, label):
        # Class proportions p_i within this node
        p = label.value_counts(normalize=True)
        if self.criterion == "gini":
            return 1 - (p**2).sum()
        if self.criterion == "entropy":
            return -(p * np.log(p)).sum()
        raise ValueError("criterion must be 'gini' or 'entropy'")
    
    def info_gain(self, true_label, false_label, current_uncertainty):
        p = float(len(true_label)) / (len(true_label) + len(false_label))
        return current_uncertainty - p * self.impurity(true_label) - (1 - p) * self.impurity(false_label)
    
    # Step 2: Ask a question and split the dataset into child nodes
    def split(self, data, question):  
        if isinstance(question.condition, str):
            true = data[data[question.column] == question.condition]
            false = data[data[question.column] != question.condition]
        else:
            true = data[data[question.column] >= question.condition]
            false = data[data[question.column] < question.condition]
        return true,false
    
    # Step 3: Loop over each value to find the best question, i.e. the one with the highest information gain
    def find_best_split(self, data):
        best_gain = 0  
        best_question = Question(None, None)
        current_uncertainty = self.impurity(data["label"])
        for column in data.columns[:-1]:
            for condition in data[column].unique():
                true, false = self.split(data, Question(column, condition))
                if len(true) == 0 or len(false) == 0:
                    continue
                gain = self.info_gain(true["label"], false["label"], current_uncertainty)
                if gain >= best_gain:
                    best_gain, best_question = gain, Question(column, condition)
        return best_gain, best_question
    
    # Step 4: Use a recursive algorithm to build a tree
    def build_tree(self, data, depth):
        # Find best question         
        gain, question = self.find_best_split(data)
        samples = data["label"].value_counts().sum()
        # No useful question, too few samples to split, or maximum depth reached
        if gain == 0 or samples < self.min_samples_split or depth == self.max_depth:
            label = (data["label"].value_counts()/data["label"].value_counts().sum()).apply(lambda x: str(int(x*100))+"%").to_dict()
            return LeafNode(label, samples, depth)
        # Split based on best question         
        true, false = self.split(data, question)
        true_samples = true["label"].value_counts().sum() 
        false_samples = false["label"].value_counts().sum() 
        # Check if leaf node is smaller than min samples leaf or not
        if true_samples < self.min_samples_leaf or false_samples < self.min_samples_leaf:
            label = (data["label"].value_counts()/data["label"].value_counts().sum()).apply(lambda x: str(int(x*100))+"%").to_dict()
            return LeafNode(label, samples, depth)
        true = self.build_tree(true, depth + 1)
        false = self.build_tree(false, depth + 1)
        return DecisionNode(question, true, false, samples, depth)
    
    # Print tree
    def print_tree(self, node, spacing=""):
        # Base case: we've reached a leaf
        if isinstance(node, LeafNode):
            print (spacing + "Predict", node.label, ", Samples: ", node.samples, ", Depth: ", node.depth)
            return

        # Print the question at this node
        print (spacing + str(node.question.column) + " " + str(node.question.condition), "Samples: ", node.samples, ", Depth: ", node.depth)

        # Call this function recursively on the true branch
        print (spacing + '--> True:')
        self.print_tree(node.true, spacing + "  ")

        # Call this function recursively on the false branch
        print (spacing + '--> False:')
        self.print_tree(node.false, spacing + "  ")
    
    # Step 5: Predict new data based on the tree already built
    def classify(self, row, node):
        # Base case: we've reached a leaf, so return its class distribution
        if isinstance(node, LeafNode):
            return node.label
        # Decide whether to follow the true-branch or the false-branch.
        # Compare the feature / value stored in the node,
        # to the example we're considering.
        if node.question.match(row):
            return self.classify(row, node.true)
        return self.classify(row, node.false)

    def predict(self, test):
        # Work on a copy so the caller's DataFrame is not modified,
        # and rebuild the labels on every call so repeated calls do not accumulate
        self.test = test.copy()
        self.label = [self.classify(row, self.tree) for _, row in self.test.iterrows()]
        self.test["label"] = self.label
        return self.test

# II. Practice

In [3]:
# Create data
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 3, 'Apple'],
    ['Yellow', 3, 'Lemon'],
    ['Blue', 1, 'Berry'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Blue', 2, 'Berry'],
    ['Red', 1, 'Grape'],
    ['Yellow', 5, 'Banana'],
    ['Red', 1, 'Grape'],
    ['Green', 4, 'Banana'],
    ['Blue', 2, 'Berry'],
]
header = ["color", "diameter", "label"]
data = pd.DataFrame(data = training_data, columns = header)

In [4]:
# Create and Plot model
model = DecisionTree(min_samples_split = 3, min_samples_leaf = 2, max_depth = 3)
model.fit(data)

diameter 2 Samples:  16 , Depth:  0
--> True:
  diameter 4 Samples:  9 , Depth:  1
  --> True:
    Predict {'Banana': '100%'} , Samples:  2 , Depth:  2
  --> False:
    diameter 3 Samples:  7 , Depth:  2
    --> True:
      Predict {'Apple': '60%', 'Lemon': '40%'} , Samples:  5 , Depth:  3
    --> False:
      Predict {'Berry': '100%'} , Samples:  2 , Depth:  3
--> False:
  Predict {'Grape': '85%', 'Berry': '14%'} , Samples:  7 , Depth:  1
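The root question printed above (`diameter >= 2`, splitting 16 samples into 9 and 7) can be checked by hand. The sketch below recomputes its Gini-based information gain directly with pandas; it rebuilds the same training data so that it runs on its own, and the standalone `gini` helper is an assumption for this example rather than a method of the `DecisionTree` class.

```python
import pandas as pd

rows = [
    ['Green', 3, 'Apple'], ['Yellow', 3, 'Apple'], ['Red', 1, 'Grape'],
    ['Red', 3, 'Apple'], ['Yellow', 3, 'Lemon'], ['Blue', 1, 'Berry'],
    ['Red', 1, 'Grape'], ['Yellow', 3, 'Lemon'], ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'], ['Blue', 2, 'Berry'], ['Red', 1, 'Grape'],
    ['Yellow', 5, 'Banana'], ['Red', 1, 'Grape'], ['Green', 4, 'Banana'],
    ['Blue', 2, 'Berry'],
]
data = pd.DataFrame(rows, columns=["color", "diameter", "label"])

def gini(labels):
    # Gini index from the class proportions of a label column
    p = labels.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

# Apply the root question: diameter >= 2
true = data[data["diameter"] >= 2]
false = data[data["diameter"] < 2]
p_true = len(true) / len(data)

gain = (gini(data["label"])
        - p_true * gini(true["label"])
        - (1 - p_true) * gini(false["label"]))

print(len(true), len(false))  # 9 7
print(round(gain, 4))         # 0.234
```

The branch sizes match the `Samples` values reported for the two children of the root.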


In [5]:
# Predict
test = data.loc[0:3][["color","diameter"]]
model.predict(test)

| | color | diameter | label |
|---|---|---|---|
| 0 | Green | 3 | {'Apple': '60%', 'Lemon': '40%'} |
| 1 | Yellow | 3 | {'Apple': '60%', 'Lemon': '40%'} |
| 2 | Red | 1 | {'Grape': '85%', 'Berry': '14%'} |
| 3 | Red | 3 | {'Apple': '60%', 'Lemon': '40%'} |
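Each prediction above is a class distribution rather than a single class. If one label per row is needed, a possible approach is to pick the class with the largest percentage. The helper `top_class` below is hypothetical and not part of the original class; it assumes the `{'class': 'NN%'}` format shown in the output.

```python
def top_class(distribution):
    """Return the class with the largest percentage from a dict like {'Apple': '60%'}."""
    return max(distribution, key=lambda name: int(distribution[name].rstrip("%")))

print(top_class({"Apple": "60%", "Lemon": "40%"}))  # Apple
print(top_class({"Grape": "85%", "Berry": "14%"}))  # Grape
```

Ties are resolved in favor of the class that appears first in the dictionary, which may or may not be the desired behavior.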


# III. References

YouTube - Let's Write a Decision Tree Classifier from Scratch - Machine Learning Recipes #8: https://www.youtube.com/watch?v=LDRbO9a6XPU

Machine Learning cơ bản - Lesson 34: Decision Trees (1): Iterative Dichotomiser 3: https://machinelearningcoban.com/2018/01/14/id3/