# CART (Classification and Regression Trees)
## Introduction 
Decision trees are simple yet remarkably powerful models that can be used for both classification and regression problems. A decision tree categorizes data by posing a question about a feature (or a subset of features) of your dataset, and then splitting the data based on the answer. There are many different types of decision trees, but they all work on the same premise: 

**All decision trees partition the data recursively by examining the features of the data and choosing the feature(s) that best split the data. In other words, these decisions aim to maximize data homogeneity (similarity) within the tree's leaf nodes and heterogeneity (difference) between the leaf nodes.**

Since such models are built from the labeled data we provide them, they belong to the *supervised learning* family of algorithms. This notebook examines the CART algorithm, a common decision tree algorithm which produces binary classification or regression trees depending on whether the target variable is categorical or numeric. 

# CART for Classification
Below is an example of a very simple decision tree for classification. We use this tree as an example to describe how CART works in classification problems. 
<br>
<br>

<img src="tree.png" width=400 height=400 />

A CART decision tree repeatedly splits the data at each **node** by posing a question with a binary outcome. The first node is the root node, and the last nodes are the leaf nodes, which store our predictions. For our simple made-up dataset, it was easy to construct a tree that classified our samples with 100% accuracy, because we only have three entries and very few features. However, when handling much larger datasets (think thousands of different fruits with many more features), the ideal decision tree structure becomes far less apparent. Constructing a good decision tree involves understanding *what questions to ask, and why to ask them.* Ideally, we want to choose our questions in a way that best classifies our data and also works well at predicting new data. We would like a rigorous theoretical framework to arrive at the best possible tree structure. CART uses two key concepts to achieve this:

1. A measure of *impurity* with the **Gini score**
2. A measure of **information gain**

## Impurity and information gain
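The Gini score describes the **probability that a randomly chosen example in a set of data is labeled incorrectly when its label is assigned at random according to the distribution of labels in that set.** Information gain refers to the **reduction in Gini score produced by a split: the Gini score of the data before the split minus the weighted average Gini score of the resulting subsets.** We elaborate on this later on. First, we describe the formula for the Gini score.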

### Gini score = $ \sum_{i=1}^{C} p_{i}~(1~-~p_{i})$
where $C$ is the number of distinct labels and $p_{i}$ is the probability of the label $i$ being chosen. We can simplify this formula:
### $ \sum_{i=1}^{C} p_{i}~(1~-~p_{i}) $ = $ \sum_{i=1}^{C} p_{i}~-\sum_{i=1}^{C}~(p_{i})^2 $ 
Since the label probabilities sum to one,
### $ \sum_{i=1}^{C} p_{i} $ = 1
Therefore 
### Gini score = $ 1 - \sum_{i=1}^{C} (p_{i})^2 $

<br>

Passing our data through our example decision tree results in leaf nodes with zero impurity (a Gini score of 0). This is because, for our three examples, every leaf node ends up containing exactly one type of fruit, so it is impossible to mismatch the one existing label at each node. The outcome is considered perfectly pure. On the other hand, if we instead introduced another fruit, say a yellow pineapple with a height of 8 cm, then our leftmost leaf node would contain both the pineapple and the banana. This results in an impurity of 0.5 (i.e., a 50% chance that we classify incorrectly). 
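
As a minimal sketch of the formula above, the Gini score of a node can be computed from the list of labels it contains. The function name `gini` and the example label lists are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def gini(labels):
    """Gini score of a collection of labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention assumed here: an empty node is treated as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# A leaf holding a single type of fruit is perfectly pure.
print(gini(["banana"]))               # prints 0.0
# A leaf holding one banana and one pineapple is evenly mixed.
print(gini(["banana", "pineapple"]))  # prints 0.5
```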

Our goal is to choose the question at each node that *minimizes the Gini score* of the resulting nodes. Below is a step-by-step description of how this is done. 

1. Compute the impurity (i.e., the Gini score) of the starting set of data.
2. Ask a question based on a feature in the data. 
3. Compute the weighted average Gini score of the nodes that result from that question, weighting each node by the fraction of the data it receives.
4. Compute the information gain: the Gini score from before the question was asked minus the weighted average Gini score from step 3.
5. Repeat steps 2 through 4 for all the features, and choose the question which produces the **biggest information gain**. 

Note that if the best information gain we can obtain is zero (no change in Gini score), we no longer split the node and it becomes a leaf node. 
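
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows and labels are made up here (they are not the fruit table from the figures), `information_gain` and `best_question` are hypothetical helper names, and the form of each question (equality for categorical values, `>=` for numeric values) is one possible choice.

```python
from collections import Counter

def gini(labels):
    """Gini score of a collection of labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent Gini score minus the size-weighted average Gini score of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def best_question(rows, labels):
    """Try every (feature, value) question and return (best_gain, question).

    Returns (0.0, None) when no question improves on the parent,
    in which case the node would become a leaf.
    """
    best_gain, best = 0.0, None
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            if isinstance(value, (int, float)):
                answers = [row[feature] >= value for row in rows]
            else:
                answers = [row[feature] == value for row in rows]
            left = [y for y, a in zip(labels, answers) if a]
            right = [y for y, a in zip(labels, answers) if not a]
            if not left or not right:
                continue  # the question does not actually split the data
            gain = information_gain(labels, left, right)
            if gain > best_gain:
                best_gain, best = gain, (feature, value)
    return best_gain, best

# Made-up example data (an assumption for illustration only).
rows = [
    {"color": "red", "height_cm": 6},
    {"color": "green", "height_cm": 7},
    {"color": "yellow", "height_cm": 10},
    {"color": "yellow", "height_cm": 12},
]
labels = ["apple", "apple", "banana", "banana"]

gain, question = best_question(rows, labels)
print(gain, question)
```

Applying `best_question` again to the rows that fall on each side of the chosen question gives the recursive construction of the tree.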

Using these steps, we recursively build each branch of the tree until there are no more questions left to ask. We summarize these steps below using our original made-up fruit data table, slightly modified to include a quantity feature:

<table><tr>
<td> <img src="step_1.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="step_2.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

<table><tr>
<td> <img src="step_3.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="step_4.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

<br>

<img src="step_5.png" width=300 height=300 />


# CART for Regression 
CART can be used in cases where the target variable is continuous. In such a case, CART assumes a regression framework and needs a measure of how well its predictions model the original data. Just like linear regression, CART takes advantage of the **Least Squares Deviation**, where the goal is to minimize the sum of the squared residuals between the predicted values and the datapoints. 
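
For a single split into a left region and a right region, this objective can be written as follows. The symbols $R_L$, $R_R$ and $\bar{y}$ are notation assumed here for illustration; each region predicts the mean of the target values it contains.

$$ \mathrm{RSS} = \sum_{i \in R_L} \left(y_i - \bar{y}_{R_L}\right)^2 + \sum_{i \in R_R} \left(y_i - \bar{y}_{R_R}\right)^2 $$
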
## Using 1 predictor 
To demonstrate, we use a made-up graph of a fictional population's monthly rent as a function of disposable income. In this case, we use one predictor (income) to model monthly rent. Using one predictor also makes it easier to demonstrate the least squares regression. 

<img src="data.png" width=400 height=400 />
<br>

CART begins its regression by iterating through each datapoint, taking the predictor value and carrying out these steps:  

1. Produce a partition based on the datapoint. 
2. Compute the residual sum of squares for both leaf nodes. For each leaf node, we compute the residuals with respect to the average value of the data in that leaf node. 
3. **Choose the datapoint which produces the smallest total residual sum of squares (RSS) as the partition value at the node.**
4. Recursively reapply the same steps to the children nodes and build the tree until some stopping criterion is reached.

Stopping criteria are discussed later on.  
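
A minimal sketch of steps 1 through 3 for a single predictor is shown below. The income and rent values are made up for illustration (they are not the data in the figure), and `rss` and `best_split` are hypothetical helper names. Following the description above, candidate partitions are taken at the datapoints themselves; other implementations may choose thresholds differently.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_rss) for the split `x < threshold` with the smallest RSS."""
    best_threshold, best_total = None, float("inf")
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        if not left or not right:
            continue  # a partition must put data on both sides
        total = rss(left) + rss(right)
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total

# Made-up data: disposable income (predictor) and monthly rent (target).
income = [1000, 1200, 1500, 3000, 3200, 3500]
rent = [500, 520, 510, 1200, 1250, 1230]

threshold, total = best_split(income, rent)
print(threshold)  # prints 3000 for this made-up data
```

Step 4 would call `best_split` again on the datapoints on each side of the threshold until a stopping criterion is met.
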
<table><tr>
<td> <img src="reg_s1.png" alt="Drawing" style="width: 500px;"/> </td>
<td> <img src="reg_s2.png" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>
<img src="reg_s34.png" width=350 height=400 />

# CART using scikit-learn
We write a simple implementation of a CART decision tree (DT) using scikit-learn. For our data, we use a made-up pharmaceutical dataset from Kaggle. The dataset features a list of patients, some diagnostic data (such as age, blood pressure, and cholesterol) and one of five drugs (A, B, C, X, Y) prescribed to cure an illness. We aim to construct a CART DT which best predicts the appropriate drug for each patient.

In [None]:
import pandas as pd
