# CART Decision Tree
## Introduction 
Decision trees are a simple yet remarkably powerful type of classification model. Decision trees categorize data by proposing a question based on each feature (or a subset of features) from your dataset, and then splitting the data based on that question. There are many different types of decision trees, but they all work on the same premise: 

**All decision trees partition the data recursively by examining the features of the data and choosing the feature(s) that best split the data. In other words, these decisions are based on maximizing data homogeneity (similarity) within the tree's leaf nodes, and heterogeneity (difference) between the leaf nodes.**

Since such models are built from the labeled data we provide them, they belong to the *supervised learning* family of algorithms. This notebook examines the CART algorithm, a common decision tree algorithm which produces binary classification or regression trees, depending on whether the target variable is categorical or numeric. Below is an example of a very simple decision tree. We use this tree as an example to describe how CART works. 
<br>
<br>

<img src="tree.png" width=500 height=500 />

A CART decision tree repeatedly splits the data at each **node** by proposing a question with a binary outcome. The first node is the root node, and the last nodes are the leaf nodes which store our predictions. 

## Constructing the tree
For our simple made-up dataset, it was easy to construct a tree that classified our samples with 100% accuracy. We only have three entries and very few features. However, when handling much larger datasets (think thousands of different fruits with many more features), the ideal decision tree structure becomes far less apparent. Constructing a good decision tree involves understanding *what questions to ask, and why to ask them.* Ideally, we want to choose our questions in a way that best classifies our data and also works well at predicting new data. We would like a rigorous theoretical framework to arrive at the best possible tree structure. CART uses two key concepts to achieve this:

1. A measure of *impurity* with the **Gini score**
2. A measure of **information gain**

## Impurity and information gain
The Gini score describes the **probability that a randomly chosen example in a set of data is labeled incorrectly, if its label is assigned at random according to the distribution of labels in that set.** 

It is computed as: 
### Gini score = $ \sum_{i=1}^{C} p_{i}~(1~-~p_{i})$
where $C$ is the number of distinct labels and $p_{i}$ is the probability of the label $i$ being chosen. We can simplify this formula:
### $ \sum_{i=1}^{C} p_{i}~(1~-~p_{i}) $ = $ \sum_{i=1}^{C} p_{i}~-\sum_{i=1}^{C}~(p_{i})^2 $ 
Since the probabilities of all the labels sum to one,
### $ \sum_{i=1}^{C} p_{i} $ = 1
Therefore 
### Gini score = $ 1 - \sum_{i=1}^{C} (p_{i})^2 $
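
The simplified formula translates almost directly into code. Below is a minimal sketch, assuming the labels at a node are given as a plain Python list. The function name `gini` and the choice to treat an empty node as pure are assumptions made for illustration, not part of CART itself.

```python
from collections import Counter


def gini(labels):
    """Gini score of a collection of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        # Assumption for this sketch: an empty node counts as pure.
        return 0.0
    counts = Counter(labels)  # how often each label occurs
    return 1.0 - sum((count / n) ** 2 for count in counts.values())
```

Here each $p_{i}$ is estimated as the fraction of examples at the node that carry label $i$.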

<br>

Passing our data through our example decision tree results in our leaf nodes having zero impurity (a Gini score of 0). This is because, for our three examples, each leaf node ends up holding exactly one type of fruit, so it is impossible to mismatch the one existing label at each node. The outcome is considered perfectly pure. On the other hand, if we introduced another fruit, say a yellow pineapple with a height of 8cm, then our leftmost leaf node would hold both pineapple and banana. This results in an impurity of 50% (i.e., a 50% chance that we classify incorrectly). 
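
As a quick check of the 50% figure, assume that leaf holds exactly one banana and one pineapple, so that $p_{banana} = p_{pineapple} = 0.5$. Then:

### Gini score = $ 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5 $

For a leaf holding a single type of fruit, $p = 1$ and the score is $ 1 - 1^2 = 0 $, which matches the perfectly pure case.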

Our goal is to choose the question at each node that *minimizes the Gini score* of the resulting nodes. Below is a step-by-step description of how this is done. 

1. Compute the impurity (i.e., the Gini score) of the starting set of data.
2. Ask a question based on a feature in the data. 
3. Compute the weighted average Gini score of the child nodes produced by that question, weighting each node by the fraction of examples it receives.
4. Compute the information gain: the Gini score from before the question was asked, minus the weighted average Gini score from step 3.
5. Repeat steps 2 through 4 for all the features, and choose the question which produces the **biggest information gain**. 
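
The five steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a full CART implementation: the rows are dictionaries mapping feature names to values, a question is a `(feature, value)` pair that asks `>=` for numeric values and `==` otherwise, and the names `gini`, `information_gain`, `best_question`, and the small fruit dataset are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini score of a list of labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node counts as pure
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent Gini score minus the weighted average of the children (steps 3-4)."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def best_question(rows, labels):
    """Try every (feature, value) question and keep the biggest gain (step 5)."""
    best_gain, best = 0.0, None
    for feature in rows[0]:
        # Assumes all values of one feature share a type, so they can be sorted.
        for value in sorted({row[feature] for row in rows}):
            if isinstance(value, (int, float)):
                matches = [row[feature] >= value for row in rows]
            else:
                matches = [row[feature] == value for row in rows]
            left = [y for y, m in zip(labels, matches) if m]
            right = [y for y, m in zip(labels, matches) if not m]
            if not left or not right:
                continue  # the question does not actually split the data
            gain = information_gain(labels, left, right)
            if gain > best_gain:
                best_gain, best = gain, (feature, value)
    return best, best_gain


# Hypothetical fruit data, made up for this sketch.
rows = [
    {"color": "red", "height_cm": 7},
    {"color": "green", "height_cm": 7},
    {"color": "yellow", "height_cm": 18},
    {"color": "yellow", "height_cm": 8},
]
labels = ["apple", "apple", "banana", "pineapple"]

question, gain = best_question(rows, labels)
print(question, gain)  # ('color', 'yellow') 0.375
```

In this made-up data the starting Gini score is 0.625. Asking "is the color yellow?" leaves a pure apple node and a banana/pineapple node with a Gini score of 0.5, giving a weighted average of 0.25 and a gain of 0.375. The question "is the height at least 8cm?" produces the same split and ties; the sketch simply keeps the first question it found.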

We summarize these steps below in image form: