# I. Algorithm 

## 1. Mathematics
### Gini impurity: 
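$$I_G(P) = 1 - \sum_{i=1}^J p_i^2$$

where $J$ is the number of classes and $p_i$ is the fraction of samples in the node that belong to class $i$.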
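
As a quick check of the formula, here is a minimal sketch in plain Python. The function name `gini_impurity` and the fruit labels are illustrative assumptions, not part of the notebook's class.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# A pure node has no impurity.
print(gini_impurity(["Grape", "Grape"]))  # 0.0
# Two equally likely classes give the two-class maximum.
print(gini_impurity(["Apple", "Lemon"]))  # 0.5
```

A pure node scores 0, and the score grows as the classes become more evenly mixed.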

### Information gain:
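$$IG(\text{Question}) = I_G(\text{Node}) - P(\text{True}) \cdot I_G(\text{True}) - P(\text{False}) \cdot I_G(\text{False})$$

where $P(\text{True})$ and $P(\text{False})$ are the fractions of the node's samples for which the answer to the question is true and false, respectively.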
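
As a worked example, assume a node holding five labels (2 Apple, 2 Grape, 1 Lemon, the same labels as in the Practice data) and a question that sends the 2 Apples and the Lemon to the true branch and the 2 Grapes to the false branch:

$$I_G(\text{Node}) = 1 - \left(\tfrac{2}{5}\right)^2 - \left(\tfrac{2}{5}\right)^2 - \left(\tfrac{1}{5}\right)^2 = 0.64$$

$$I_G(\text{True}) = 1 - \left(\tfrac{2}{3}\right)^2 - \left(\tfrac{1}{3}\right)^2 = \tfrac{4}{9} \approx 0.444, \qquad I_G(\text{False}) = 1 - 1^2 = 0$$

$$IG(\text{Question}) = 0.64 - \tfrac{3}{5} \cdot \tfrac{4}{9} - \tfrac{2}{5} \cdot 0 = \tfrac{28}{75} \approx 0.373$$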

## 2. Code
- Step 1: Create a function that calculates the information gain of each question, based on the Gini index or entropy.
- Step 2: Ask a question and split the dataset into sub-nodes.
- Step 3: Loop over every value of every feature to find the best question, i.e. the one with the highest information gain.
- Step 4: Use a recursive algorithm to build the tree.
    - Base case: no further question can be asked of the node (the best information gain is 0), so return a Leaf Node.
    - Recursive case: there is still a useful question to ask, so ask it and recurse into the child nodes; this node is a Decision Node.
- Step 5: Predict new data using the tree that was built.
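
Steps 1 to 3 can be sketched without pandas on plain lists of rows. This is a simplified illustration rather than the class used in this notebook: the helper names (`gini`, `matches`, `split`, `find_best_split`) and the convention that the label is the last element of each row are assumptions made for the example. The rows are the fruit rows from the Practice section.

```python
from collections import Counter

def gini(labels):
    """Step 1 (part): Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def matches(row, column, condition):
    """Strings are compared with ==, numbers with >=."""
    if isinstance(condition, str):
        return row[column] == condition
    return row[column] >= condition

def split(rows, column, condition):
    """Step 2: partition the rows by the answer to the question."""
    true_rows = [r for r in rows if matches(r, column, condition)]
    false_rows = [r for r in rows if not matches(r, column, condition)]
    return true_rows, false_rows

def find_best_split(rows):
    """Step 3: try every (column, value) question, keep the highest gain."""
    current = gini([r[-1] for r in rows])
    best_gain, best_question = 0.0, None
    for column in range(len(rows[0]) - 1):  # last element is the label
        for condition in sorted({r[column] for r in rows}):
            true_rows, false_rows = split(rows, column, condition)
            if not true_rows or not false_rows:
                continue  # the question does not divide the data
            p = len(true_rows) / len(rows)
            gain = (current
                    - p * gini([r[-1] for r in true_rows])
                    - (1 - p) * gini([r[-1] for r in false_rows]))
            if gain > best_gain:
                best_gain, best_question = gain, (column, condition)
    return best_gain, best_question

rows = [
    ["Green", 3, "Apple"],
    ["Yellow", 3, "Apple"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]
gain, question = find_best_split(rows)
print(question, round(gain, 4))
```

On these rows two questions, `color == Red` and `diameter >= 3`, produce the same partition and therefore the same gain of 28/75 (about 0.3733). Which of the two is reported depends on the tie-breaking rule and on floating-point rounding.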

In [150]:
import pandas as pd
import numpy as np

class Question(object):
    def __init__(self, column, condition):
        self.column = column
        self.condition = condition
        
    def match(self,row):
        if isinstance(self.condition, str):
            return row[self.column] == self.condition
        else:
            return row[self.column] >= self.condition
         
class LeafNode(object):
    def __init__(self, label, samples):
        self.label = label
        self.samples = samples

class DecisionNode(object):
    def __init__(self, question, true, false, samples):
        self.question = question
        self.true = true
        self.false = false
        self.samples = samples
        
class DecisionTree(object):
    def __init__(self, max_depth= 10, criterion = "gini", min_samples_split = 2, min_samples_leaf = 1):
        self.data = pd.DataFrame()
        self.criterion = criterion
        # Note: the three stopping parameters below are stored but not yet used by build_tree
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.tree = None
        
    def fit(self,data):
        self.data = data
        self.tree = self.build_tree(self.data)
        self.print_tree(self.tree)
        
    # Step 1: Create a function to calculate the information gain of each question based on gini index or entropy
    def impurity(self, label):
        # Class proportions p_i of the labels in this node
        p = label.value_counts(normalize=True)
        if self.criterion == "gini":
            return 1 - (p**2).sum()
        if self.criterion == "entropy":
            # Natural logarithm, so the entropy is measured in nats
            return -(p * np.log(p)).sum()
        raise ValueError("criterion must be 'gini' or 'entropy'")
    
    def info_gain(self, true_label, false_label, current_uncertainty):
        p = float(len(true_label)) / (len(true_label) + len(false_label))
        return current_uncertainty - p * self.impurity(true_label) - (1 - p) * self.impurity(false_label)
    
    # Step 2: Ask a question and split the dataset into sub-nodes
    def split(self, data, question):  
        if isinstance(question.condition, str):
            true = data[data[question.column] == question.condition]
            false = data[data[question.column] != question.condition]
        else:
            true = data[data[question.column] >= question.condition]
            false = data[data[question.column] < question.condition]
        return true,false
    
    # Step 3: Loop over every value to find the best question, i.e. the one with the highest information gain
    def find_best_split(self, data):
        best_gain = 0  
        best_question = Question(None, None)
        current_uncertainty = self.impurity(data["label"])
        for column in data.columns[:-1]:
            for condition in data[column].unique():
                true, false = self.split(data, Question(column, condition))
                if len(true) == 0 or len(false) == 0:
                    continue
                gain = self.info_gain(true["label"], false["label"], current_uncertainty)
                if gain >= best_gain:
                    best_gain, best_question = gain, Question(column, condition)
        return best_gain, best_question
    
    # Step 4: Use a recursive algorithm to build a tree
    def build_tree(self, data):
        gain, question = self.find_best_split(data)
        samples = data["label"].value_counts().sum()
        if gain == 0:
            label = data["label"].value_counts(normalize=True).apply(lambda x: str(int(x*100))+"%").to_dict()
            return LeafNode(label, samples)
        true, false = self.split(data, question)
        true = self.build_tree(true)
        false = self.build_tree(false)
        return DecisionNode(question, true, false, samples)
    
    # Print tree
    def print_tree(self, node, spacing=""):
        # Base case: we've reached a leaf
        if isinstance(node, LeafNode):
            print (spacing + "Predict", node.label, "Samples: ", node.samples)
            return

        # Print the question at this node
        print (spacing + str(node.question.column) + " " + str(node.question.condition), "Samples: ", node.samples)

        # Call this function recursively on the true branch
        print (spacing + '--> True:')
        self.print_tree(node.true, spacing + "  ")

        # Call this function recursively on the false branch
        print (spacing + '--> False:')
        self.print_tree(node.false, spacing + "  ")
    
    # Step 5: Predict new data based on the tree already built
    def classify(self, row, node):
        # Base case: we've reached a leaf
        if isinstance(node, LeafNode):
            return node.label

        # Decide whether to follow the true-branch or the false-branch.
        # Compare the feature / value stored in the node,
        # to the example we're considering.
        if node.question.match(row):
            return self.classify(row, node.true)
        else:
            return self.classify(row, node.false)

# II. Practice

In [151]:
# Create data
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]
header = ["color", "diameter", "label"]
data = pd.DataFrame(data = training_data, columns = header)

In [152]:
model = DecisionTree()
model.fit(data)

diameter 3 Samples:  5
--> True:
  color Yellow Samples:  3
  --> True:
    Predict {'Lemon': '50%', 'Apple': '50%'} Samples:  2
  --> False:
    Predict {'Apple': '100%'} Samples:  1
--> False:
  Predict {'Grape': '100%'} Samples:  2


In [153]:
model.classify(data.loc[0], model.tree)

{'Apple': '100%'}
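
`classify` returns the leaf's dictionary of percentage strings rather than a single class. If one predicted label is needed, a small helper along these lines could pick the most frequent class. The name `most_likely_label` is hypothetical and not part of the class above.

```python
def most_likely_label(leaf_label):
    """Return the class with the largest percentage in a leaf's label dict."""
    # Values look like "50%"; strip the sign and compare them as integers.
    return max(leaf_label, key=lambda name: int(leaf_label[name].rstrip("%")))

print(most_likely_label({"Apple": "100%"}))                # Apple
print(most_likely_label({"Lemon": "50%", "Apple": "50%"}))  # Lemon
```

In a tie, `max` keeps the first key it encounters, so the second call returns `Lemon`; a different tie-breaking rule may be preferable in practice.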

# III. References

YouTube - Let’s Write a Decision Tree Classifier from Scratch - Machine Learning Recipes #8: [https://www.youtube.com/watch?v=LDRbO9a6XPU]

Machine Learning cơ bản - Lesson 34: Decision Trees (1): Iterative Dichotomiser 3: [https://machinelearningcoban.com/2018/01/14/id3/]