# Coding a custom decision tree algorithm (modified CART)
## Part 4: Assembling trees


### Introduction

This is the **final part** of the four-part series on building a custom decision tree algorithm.

Let us quickly **recap** the previous three parts. Part 1 gave a brief overview of decision trees and motivated the whole series. In Part 2 I explored the Gini impurity index, and in Part 3 I showed how to go through all features in a dataset and, for each, find the optimal splitting value (threshold) and the corresponding Gini impurity index. This last result was achieved via the ```split_node()``` function, which will be used below. Its output is the name of the feature on which the maximum Gini impurity decrease was achieved, the threshold value for which it was achieved, and the Gini impurity decrease itself.

In this part I will explain the remaining parts of the logic, and will tie everything up together so that in the end we have a single function, a call to which trains a custom (modified CART) decision tree.

To begin with, **I will introduce the *node* class**. This will be our building block of the tree. As the name suggests, *node* objects will represent nodes of decision trees. They will store indices of the corresponding records in the dataset. These indices will be used when splitting conditions have to be determined. Nodes will also store information such as the splitting feature, the threshold, and the Gini impurity decrease. Finally, they will know who their parent and child nodes are.

After I show you the *node* class, **we will move on to the learning process**. In this section, called "Training the tree", I will show you how to build a tree on a dataset using the modified CART algorithm. This is where all the fun happens. The biggest challenges here are to grow a tree, which we achieve with the *node* class - a **binary-tree**-like data structure - and to stop growing the tree as required by the stopping conditions. I handle the latter via another data structure - **queues**.

I found this part the most challenging and the most entertaining of the four. I hope you do too.
Let us now move on to the analysis of the *node* class.


### Node class 
#### A brief overview

I will define a class ```node``` which will contain all relevant info pertaining to nodes of decision trees.
During the learning process, a node is created, the splitting conditions for the node are identified, and then two child nodes are produced. The process then repeats for each of the child nodes until the stopping conditions are met.

The ```node``` class will be "bare-bones" in the sense that only essential attributes and methods will be provided. I'm sure the reader can think of many additional useful things that could be added to this class. 

As for the methods, I include the *add* method, used to add child nodes to the current node, the *split* method, which determines the splitting conditions, and *node_info*, which prints out essential node attributes. Let's explore each method.


#### Node class - *add*

In [None]:
    def add(self,child,direction=None):
        """
        child is an instance of node to be added to the tree.
        direction is a string, 'left' or 'right', or an integer, 0 (for left) or 1 (for right).
        """
        if direction=='left' or direction==0:
            self.left=child
        elif direction=='right' or direction==1:
            self.right=child
        else:
            print('direction not specified!')

The *add* method will take two parameters: a ```node``` object, *child*, that will become the child of the current node (*self*), and the *direction* parameter which specifies whether the *child* is a left or right child node. Direction may be specified either via integer value (0=left, 1=right) or via string ("left" and "right").

#### Node class - *split*

In [None]:
    def split(self, dataset, target, target_values):
        """
        Given dataset [pd.Dataframe], target [string], and target_values [list], determine the feature,
        the threshold value, and the Gini impurity decrease for the optimal split.
        """
        self.feature, self.threshold, self.gini = split_node(dataset.loc[self.records], target, target_values)

The *split* method simply calls on the *split_node* function that was presented in Part 3. 

Recall that *split_node* takes a dataset, the name of the target column, and a list of possible target values, and it returns the name of the optimal feature on which to split, the threshold value at which to split, and the Gini impurity decrease due to the split. 

*split_node* is called by the *split* method of the ```node``` class in order to provide information about the split in the node. Its inputs are self-explanatory, except for the first parameter - the dataset. Remember that the tree is trained on some dataset which is kept in a Pandas DataFrame. Each ```node``` object has an attribute *records*, which stores the indices of the rows of the dataset that belong to the given node. (For example, say the original dataset has 100 records. The root node's *records* attribute stores the list of all indices \[1,...,100\]. If the split is such that the first 50 records end up in the left child node and the last 50 records in the right child node, then the child nodes' *records* attributes store the indices \[1,...,50\] and \[51,...,100\], respectively.)

To summarize the above paragraph: *self.records* are indices of the records in the given node, and 
```python
dataset.loc[self.records]
```
is a dataframe with all records in the node (with all predictive features and the target variable).
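
To make the indexing concrete, here is a small sketch on a made-up five-row DataFrame (the column names, values, and index labels are invented for illustration). It shows that `.loc` selects rows by index label rather than by position, which is why a node can keep working with the labels stored in *records* no matter how the data was subset earlier.

```python
import pandas as pd

# Toy dataset; column names and values are made up for illustration.
dataset = pd.DataFrame(
    {"age": [22, 35, 47, 51, 63], "target": [0, 0, 1, 1, 1]},
    index=[10, 11, 12, 13, 14],  # non-default labels, to show .loc is label-based
)

records = [11, 13, 14]            # index labels held by some hypothetical node
node_data = dataset.loc[records]  # rows with those labels, all columns kept

print(node_data)
```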

The outputs of the *split_node* function are the feature name, the threshold value, and the Gini impurity decrease. They are saved into the corresponding attributes of the node.

#### Node class - *node_info*

In [None]:
    def node_info(self):
        print(f'Splitting feature: {self.feature}')
        print(f'Splitting threshold: {self.threshold:.2f}')
        print(f'Gini decrease: {self.gini:.3f}')
        print(f'Node depth is: {self.depth}')
        if self.records:
            print(f'Number of records: {len(self.records)}')
        else:
            print('No records!')
        if self.parent is not None:
            print(f"Parent splitting feature is {self.parent.feature}, and threshold is {self.parent.threshold}")
        if self.left is None:
            print(f"No left child")
        else:
            print(f"Left child splitting feature is {self.left.feature}, and threshold is {self.left.threshold}")
        if self.right is None:
            print(f"No right child")
        else:
            print(f"Right child splitting feature is {self.right.feature}, and threshold is {self.right.threshold}")

The last method that I'll define is *node_info*, which allows us to print out some key information about the node. This is quite straightforward. We will want to know the name of the splitting feature, the threshold value for which the split is realized, and the resulting Gini impurity decrease.

Further, we print out node depth, which is the "level" in the tree at which the node sits. For root I will set *depth* to 0, for its children to 1, and so on. I will also print out the number of records in the node.

Finally, I add some more print statements which are useful to check that the tree is correctly built. These should be removed after testing. In particular, I print out some info about the parent and child nodes, in order to make sure that the tree structure and everything else works as it's supposed to.

#### Node class - recap

Let's put everything together. The *\_\_init\_\_* method allows us to specify many attributes when we create new nodes. 
One can specify the splitting feature name, *split_feature*, its corresponding threshold, *split_threshold*, and the Gini impurity decrease, *gini_decrease*. Further, we can specify whether the node has a *parent*, or *left* or *right* child nodes. Finally, we may specify the list of indices, *records*, of the dataset corresponding to the given node, and its *depth* in the structure.

All the other methods were already discussed above, so here is the complete class.

In [None]:
class node:
    """
    The objects of class node are nodes of the decision tree.
    """
    def __init__(self, split_feature=None, split_threshold=None, gini_decrease=None,\
                 parent=None, left=None, right=None, records=None, depth=None):
        
        self.feature=split_feature
        self.threshold=split_threshold
        self.gini=gini_decrease
        self.parent=parent
        self.left=left
        self.right=right
        self.records=records
        self.depth=depth
        
    def add(self,child,direction=None):
        """
        child is an instance of node to be added to the tree.
        direction is a string, 'left' or 'right', or an integer, 0 (for left) or 1 (for right).
        """
        if direction=='left' or direction==0:
            self.left=child
        elif direction=='right' or direction==1:
            self.right=child
        else:
            print('direction not specified!')
    
    def split(self, dataset, target, target_values):
        """
        Given dataset [pd.Dataframe], target [string], and target_values [list], determine the feature,
        the threshold value, and the Gini impurity decrease for the optimal split.
        """
        self.feature, self.threshold, self.gini = split_node(dataset.loc[self.records], target, target_values)
        
    def node_info(self):
        print(f'Splitting feature: {self.feature}')
        print(f'Splitting threshold: {self.threshold:.2f}')
        print(f'Gini decrease: {self.gini:.3f}')
        print(f'Node depth is: {self.depth}')
        if self.records:
            print(f'Number of records: {len(self.records)}')
        else:
            print('No records!')
        if self.parent is not None:
            print(f"Parent splitting feature is {self.parent.feature}, and threshold is {self.parent.threshold}")
        if self.left is None:
            print(f"No left child")
        else:
            print(f"Left child splitting feature is {self.left.feature}, and threshold is {self.left.threshold}")
        if self.right is None:
            print(f"No right child")
        else:
            print(f"Right child splitting feature is {self.right.feature}, and threshold is {self.right.threshold}")
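
As a quick sanity check of the parent/child wiring, the sketch below links three nodes by hand. The trimmed stand-in class `MiniNode` is a hypothetical name, redefined here only so that the snippet runs on its own; with the full class above you would use `node` directly. The record lists are made up.

```python
class MiniNode:
    """Trimmed stand-in for the node class (hypothetical name), so this snippet runs alone."""
    def __init__(self, parent=None, left=None, right=None, records=None, depth=None):
        self.parent = parent
        self.left = left
        self.right = right
        self.records = records
        self.depth = depth

    def add(self, child, direction=None):
        # same logic as the add method of the node class
        if direction == 'left' or direction == 0:
            self.left = child
        elif direction == 'right' or direction == 1:
            self.right = child
        else:
            print('direction not specified!')

root = MiniNode(depth=0, records=[0, 1, 2, 3])
root.add(MiniNode(parent=root, depth=1, records=[0, 1]), 'left')
root.add(MiniNode(parent=root, depth=1, records=[2, 3]), 1)  # integer form of 'right'

print(root.left.records, root.right.records)
print(root.left.parent is root)
```

Note that *add* only sets the link from parent to child. The link from child to parent comes from passing *parent* when the child is created.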

Now that we have all the building blocks ready, we can finally start building trees.

### Training the tree

Let me explain **the workflow**, so that the code is easier to follow.
At the highest level, there are two parts: first, we prepare the root node and split it into two children nodes, and second, we enter a loop that "recursively" builds the tree. Let's see the steps in detail.

First of all, input variables need to be passed to the *train_tree* function. This may be considered part 0.
**The *train_tree* function will build an entire tree, consisting of multiple nodes, which will learn on a *dataset* using the modified CART algorithm.** Besides the *dataset* in Pandas DataFrame format, the function needs the name of the target variable, *target*, and the stopping conditions. 

Stopping conditions are the maximum depth, *max_depth*, the minimum Gini impurity decrease, *min_gini_dec*, and the minimum number of records, *min_records*. If any of the stopping conditions is met, the node will not be split any further. The *max_depth* condition is a global one - it restricts the maximum number of levels that the tree can grow. The other two conditions terminate node splitting when the best split does not decrease Gini impurity enough, or when there are too few records in the node for the split to be meaningful.
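
The three conditions can be collected into a single predicate. The sketch below is only an illustration: the function name `should_split` is hypothetical, and it assumes that *max_depth* is the depth of the deepest node allowed (root at depth 0), so a node already sitting at *max_depth* is not split.

```python
def should_split(depth, n_records, gini_decrease,
                 max_depth=3, min_gini_dec=0.001, min_records=50):
    """Return True only if none of the three stopping conditions is met."""
    return (
        depth < max_depth                 # children would sit at depth + 1 <= max_depth
        and n_records >= min_records      # enough data in the node
        and gini_decrease > min_gini_dec  # the split is worth making
    )

print(should_split(depth=1, n_records=120, gini_decrease=0.05))
print(should_split(depth=3, n_records=120, gini_decrease=0.05))
print(should_split(depth=1, n_records=20, gini_decrease=0.05))
```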
    
    
**First part** of the function begins with *train_tree* determining all unique values in the *target* column and storing them in the variable *target_values*. Then:
* The root node is created and assigned to the variable *tree*. We specify *depth*=0, and assign the indices of the entire dataset to the *records* attribute of the root node.
* The *split* method is called on the root node, generating values for its splitting feature and the corresponding threshold value and Gini impurity decrease.
* The indices of the dataset are split into a *left* and a *right* subset. These two subsets are the data corresponding to the two child nodes. The key logic for the *left* node is: ```dataset[tree.feature]<=tree.threshold```. This is a "mask" (a boolean Series of True/False values) telling us which rows of the parent dataset have a splitting feature value less than or equal to the splitting threshold. Indices of the records that satisfy the condition are converted to a list and stored in the variable *left*. Variable *right* is completely analogous, using ```>``` instead of ```<=```.
* *train_tree* now adds child nodes to the root node. It does this by calling the *tree.add()* method, passing *node* objects and *direction* strings to it. The *node* objects of both child nodes have *parent=tree* and *depth=1*. Their *records* are the lists of indices stored in the variables *left* and *right*.
* As we build the tree we need to handle the fact that stopping conditions will be reached in different nodes at different points. For this purpose we use the ***queue* library**. We construct a new Queue() (imported from the queue library), and add (*put*) the child nodes into the queue.
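
The mask logic from the list above can be tried in isolation. This is a minimal sketch on invented data: the feature name `age`, its values, and the threshold are all made up, and the threshold would normally come from *split_node*.

```python
import pandas as pd

# Toy node data; the feature name, values and threshold are made up.
node_data = pd.DataFrame(
    {"age": [22, 35, 47, 51, 63], "target": [0, 0, 1, 1, 1]},
    index=[10, 11, 12, 13, 14],
)
feature, threshold = "age", 47

mask = node_data[feature] <= threshold  # boolean Series aligned with the index
left = node_data.index[mask].to_list()  # labels of rows going to the left child
right = node_data.index[node_data[feature] > threshold].to_list()  # labels for the right child

print(mask.to_list())
print(left, right)
```

One thing to keep in mind: comparisons with a missing value (NaN) evaluate to False in both directions, so a row with a missing splitting feature would end up in neither list.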

**Second part** of the function consists of the loop that builds the tree "recursively". We will use the queue *children* to store the nodes that still have to be split. As we split each node, we add its child nodes to the queue. In the loop: 
* We first get the next node in the queue, and store it in the variable *c*. 
* Next, we get the dataset corresponding to the given node and store it in the variable *temp_data*. 
* Next, for node *c* we go through all the steps that we went through with the root node above: we call *c.split()* on *temp_data*, we compare the value of the splitting feature for all records in *temp_data* to the splitting threshold, and we store the resulting record indices in the *left* and *right* lists.
* We check the stopping conditions. They tell us whether new nodes should be added to the tree. If none of the stopping conditions is met, then new nodes are added via *c.add()*. The *node* objects passed to the *c.add()* method have *c* as the parent, have a depth one greater than the parent's, and are assigned the corresponding indices of records.
* Finally, provided the split happened, the new child nodes are added to the queue.

Finally, if *train_tree* was called with *silent=False*, that is, not in silent mode, a statement is printed with details about the split.
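
The queue-driven loop described above can be seen in miniature on a toy problem. In this sketch the nodes are plain dictionaries and a "split" simply halves the list of record ids. Both are stand-ins chosen to keep the example short, not the Gini-based split of this series, and the names `grow` and `pending` are hypothetical.

```python
from queue import Queue

def grow(root_records, max_depth=2, min_records=2):
    """Breadth-first growth on a toy problem: each node holds a list of record ids,
    and a 'split' halves that list (a stand-in for the real Gini-based split)."""
    root = {"records": root_records, "depth": 0, "left": None, "right": None}
    pending = Queue()
    pending.put(root)
    while not pending.empty():
        current = pending.get()  # FIFO: the oldest unprocessed node comes first
        too_deep = current["depth"] >= max_depth
        too_small = len(current["records"]) < min_records
        if too_deep or too_small:
            continue  # stopping condition met: the node stays a leaf
        half = len(current["records"]) // 2
        for side, recs in (("left", current["records"][:half]),
                           ("right", current["records"][half:])):
            child = {"records": recs, "depth": current["depth"] + 1,
                     "left": None, "right": None}
            current[side] = child
            pending.put(child)  # children wait their turn in the queue
    return root

toy_tree = grow(list(range(6)))
print(toy_tree["left"]["records"], toy_tree["right"]["records"])
```

Because each branch stops independently, the queue simply runs dry once every remaining node has hit a stopping condition, and no explicit recursion is needed.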

In [None]:
from queue import Queue

# split_node() is the function from Part 3; it must be defined before train_tree is called.

def train_tree(dataset, target, max_depth=3, min_gini_dec=0.001, min_records=50, silent=True):
    """
    This function trains a classification tree. The tree is stored as a binary tree with nodes of class "node".
    It takes dataset [pd.DataFrame] to be trained on, and the column name of the target variable in the dataset, target [string].
    User can specify max depth of the tree, max_depth [integer], minimum Gini decrease required to split a node,
    min_gini_dec [float], and minimum number of records that a node must hold to be split, min_records [integer].
    If silent is False, details of each processed node are printed.
    """
    # get all distinct values that the target variable takes (all the labels)
    target_values = dataset[target].unique()
    # define the binary tree that will store the decision tree. root node depth is set to zero, and records holds indices of the entire dataset
    tree = node(depth=0, records=dataset.index.to_list() )
    #do the first split: get splitting feature, threshold, and gini impurity decrease for the node
    tree.split(dataset, target, target_values)
    left = dataset.index[ dataset[tree.feature]<=tree.threshold ].to_list() #get indices of records to go to left node
    right = dataset.index[ dataset[tree.feature]>tree.threshold ].to_list() #get indices of records to go to right node
    #prepare children nodes
    tree.add(node(parent=tree, depth=1, records=left), 'left' )
    tree.add(node(parent=tree, depth=1, records=right), 'right')
    #put children nodes in a Python queue
    children = Queue() 
    children.put(tree.left)
    children.put(tree.right) 
    #build the tree while there are children nodes to process
    while not children.empty():
        #run through nodes in queue
        c=children.get()
        #get node dataset
        temp_data=dataset.loc[c.records]
        c.split(temp_data, target, target_values)
        #get node properties
        left = temp_data.index[ temp_data[c.feature]<=c.threshold ].to_list() #records to go to left child
        right = temp_data.index[ temp_data[c.feature]>c.threshold ].to_list() #records to go to right child
        #check if node should be split. if yes generate children nodes.
        #strict inequality on depth: children sit at c.depth+1, which must not exceed max_depth
        if c.depth<max_depth and len(c.records)>= min_records and c.gini>min_gini_dec:
            c.add(node(parent=c, depth=c.depth+1, records=left), 'left')
            c.add(node(parent=c, depth=c.depth+1, records=right), 'right')
            #add children nodes to queue
            children.put(c.left) 
            children.put(c.right)  
            
        if not silent:
            print(f"split node on feature {c.feature}, for value {c.threshold:.2f}. decreased impurity by {c.gini:.3f}")
    return tree
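
Once a tree has been built, it is handy to check its shape against the stopping conditions. The helper below is a sketch rather than part of the original code: the name `summarize` is hypothetical, and it relies only on the *left*, *right*, and *depth* attributes. It is exercised here on a hand-built stand-in tree so that it runs without training anything.

```python
from types import SimpleNamespace

def summarize(root):
    """Return (number of nodes, number of leaves, maximum depth) of a binary tree
    whose nodes expose .left, .right and .depth."""
    n_nodes = n_leaves = max_depth = 0
    stack = [root]
    while stack:
        current = stack.pop()
        n_nodes += 1
        max_depth = max(max_depth, current.depth)
        if current.left is None and current.right is None:
            n_leaves += 1
        for child in (current.left, current.right):
            if child is not None:
                stack.append(child)
    return n_nodes, n_leaves, max_depth

def leaf(depth):
    # stand-in leaf node with no children
    return SimpleNamespace(left=None, right=None, depth=depth)

# Hand-built stand-in tree: the root has a split left child and a leaf right child.
toy = SimpleNamespace(
    depth=0,
    left=SimpleNamespace(depth=1, left=leaf(2), right=leaf(2)),
    right=leaf(1),
)
print(summarize(toy))
```

Calling the same helper on the object returned by *train_tree* would give a quick way to confirm that the depth limit was respected.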

### Conclusion

In this part I've introduced a binary-tree-like data structure named *node* whose objects were used to represent the nodes of decision trees.
The *node* class had methods to add new child nodes to it, and to perform node splitting. The latter method identified the optimal feature and threshold for splitting, and it calculated the Gini impurity decrease. The lists of records to be assigned to the child nodes were then derived in *train_tree* by comparing the splitting feature to the threshold.

With the *node* class ready, we were able to write the function that builds, or trains, the tree. Admittedly, I'm leaving a lot out, but the goal was not to build a very versatile class like scikit-learn's DecisionTreeClassifier, but rather to show how to build a custom decision tree learning algorithm - and we've achieved that.

The next steps would be to write functions for prediction, to add quality evaluation metrics, and so on. Some optimization wouldn't hurt either. All of this should then be wrapped into one larger class, which can be called in a simple way similar to how you call scikit-learn classifiers.
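
As one possible starting point for the prediction step, the sketch below walks a record from the root down to a leaf using the stored splitting rules. It is not part of the original code: the name `find_leaf` is hypothetical, the tree is a hand-built stand-in, and a real prediction function would still need to turn the leaf's *records* into a class label (for example by majority vote).

```python
from types import SimpleNamespace

def find_leaf(root, record):
    """Follow the splitting rules from the root down to a leaf.
    `record` is anything indexable by feature name (a dict or a pandas Series)."""
    current = root
    # A node is treated as a leaf when it has no children. Its feature and threshold
    # may still be set, because train_tree calls split() before checking the
    # stopping conditions.
    while current.left is not None and current.right is not None:
        if record[current.feature] <= current.threshold:
            current = current.left
        else:
            current = current.right
    return current

# Hand-built stand-in tree with one split on a made-up feature.
left_leaf = SimpleNamespace(feature=None, threshold=None, left=None, right=None, records=[0, 1])
right_leaf = SimpleNamespace(feature=None, threshold=None, left=None, right=None, records=[2, 3])
toy_root = SimpleNamespace(feature="age", threshold=40.0,
                           left=left_leaf, right=right_leaf, records=[0, 1, 2, 3])

print(find_leaf(toy_root, {"age": 35}).records)
print(find_leaf(toy_root, {"age": 58}).records)
```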

This concludes the four-part series on building a custom decision tree classifier. **I will hopefully prepare one more post, directly related to this series, in which I'll show how this algorithm performs on real data.**
I've already tested it myself and I've compared it to scikit-learn, but on a dataset that I cannot share. I will therefore try to do a systematic test on publicly available data that I can share online.

What I can already say is the following: the code is fast, but not as fast as scikit-learn. It generates very similar results. In what I've seen so far, the two implementations always agree on the optimal feature, and they almost entirely agree on the threshold value. 

When there is a significant difference in the threshold values, this generates child nodes with different datasets, and from then on the results of scikit-learn and the custom algorithm diverge significantly. This is not a shortcoming of the above code, however. Trees are notorious for their "instability" - this is one of the reasons why one has to be very careful with relying on their decision path and interpreting it.

I hope you've enjoyed reading this part as well as the previous ones, and I hope you found it useful.
With this, I conclude the series.