# C4.5 
## Introduction

This notebook assumes you have read the ID3 notebook and are familiar with *entropy* and *information gain*. 
The C4.5 algorithm is an improved version of the ID3 decision tree algorithm. Like ID3 it is a classification algorithm, but it can also handle continuous features in addition to categorical ones. This notebook focuses on two key differences between C4.5 and ID3:

1. The partitioning rule uses **Gain Ratio** rather than raw Information Gain
2. The use of **windowing** of the data for decision tree (DT) generation

We describe the implications of these differences in this notebook. 

## The problem with Information Gain in ID3

A major drawback of the ID3 algorithm is its ***oversensitivity to class-rich features***, that is, features with many distinct values (referred to here as the classes of a feature). ID3 tends to inflate the information gain for features with many classes, and thus tends to prefer these features for partitioning. Consider a situation where one feature has far more classes than any other feature in the dataset.

**We can deduce that a split using this feature may partition the dataset so finely that the weighted class entropy is small simply due to the sparsity of labels in each child node.** As such, the information gain becomes very high, not because the labels functionally depend on the feature, but solely because the feature has so many classes.

Datasets with a substantial imbalance in the number of classes between features can lead to very small weighted entropies, and thus very large information gains. ***Ideally, the number of classes per feature should not be a determining factor in node partitioning. Gain ratio was designed to tackle this precise weakness.***
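To make the inflation concrete, here is a minimal sketch on a hypothetical toy dataset. The features `weather` and `row_id` and the helper names `entropy` and `information_gain` are illustrative assumptions, not part of any library. `row_id` is unique per row and carries no real signal, yet it reaches the largest possible information gain because every child node holds a single label.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature, labels):
    """Label entropy minus the weighted label entropy after splitting on `feature`."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 8 rows with a binary label.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
weather = ["sun", "sun", "sun", "rain", "sun", "rain", "sun", "rain"]  # 2 classes
row_id = ["a", "b", "c", "d", "e", "f", "g", "h"]  # 8 classes, one per row

gain_weather = information_gain(weather, labels)
gain_row_id = information_gain(row_id, labels)

print(f"gain(weather) = {gain_weather:.3f}")
print(f"gain(row_id)  = {gain_row_id:.3f}")
```

On this data, ID3 would pick `row_id` over `weather`, even though a split on `row_id` cannot generalize to unseen rows.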

## Gain Ratio 
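Gain ratio aims to penalize the inflation caused by the number of classes by **normalizing the information gain with the feature entropy**, i.e. the entropy of the feature's own class distribution (often called the split information).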

### $ Gain~Ratio~=~ \frac{Information~Gain}{Entropy_{feature}}$

Normalizing in this way diminishes the difference in information gain between features with many classes and features with few: a feature with many classes also has a large feature entropy, so dividing by it brings the values numerically closer and does not magnify the inflated values as much.
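The effect of the normalization can be checked with a short sketch. The toy data and the helper names (`entropy`, `information_gain`, `gain_ratio`) are hypothetical and chosen for illustration only. Returning `0.0` when the feature entropy is zero (a feature with a single class) is an assumption made here to avoid dividing by zero.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature, labels):
    """Label entropy minus the weighted label entropy after splitting on `feature`."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature itself."""
    feature_entropy = entropy(feature)
    if feature_entropy == 0:  # single-class feature: nothing to split on
        return 0.0
    return information_gain(feature, labels) / feature_entropy


# Hypothetical toy data: 8 rows with a binary label.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
weather = ["sun", "sun", "sun", "rain", "sun", "rain", "sun", "rain"]  # 2 classes
row_id = ["a", "b", "c", "d", "e", "f", "g", "h"]  # 8 classes, one per row

ratio_weather = gain_ratio(weather, labels)
ratio_row_id = gain_ratio(row_id, labels)

print(f"gain ratio(weather) = {ratio_weather:.3f}")
print(f"gain ratio(row_id)  = {ratio_row_id:.3f}")
```

Here `row_id` has an information gain of 1 bit but a feature entropy of 3 bits, so its gain ratio drops to about 0.333, below that of `weather` (about 0.575). The ordering happens to flip on this particular dataset; in general the normalization only reduces the advantage of class-rich features.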

## Algorithm
Algorithmically, C4.5 is the same as ID3 barring the aforementioned differences. Its stepwise process is as follows:

1. Compute the entropy of the data in the current node (at the root node, this is the whole dataset).
2. Compute the **average entropy** for each feature, i.e. the weighted average entropy of the child nodes if we were to partition on that feature.
3. Determine the **information gain** by subtracting the average feature entropy from the entropy in step 1.
4. Compute the **gain ratio** for each feature by dividing the information gain by the feature entropy (the entropy of the feature's own class distribution).
5. Choose the feature which produces the **largest gain ratio** as the partition attribute.
6. Partition the node into N children, one child for each class in the feature selected for partitioning.
7. Repeat from step 1 for the child nodes until a stopping criterion is reached (such as reaching a pure leaf node, or a maximum depth).

<br>
<table><tr>
<td> <img src="images/step_1.png" alt="Drawing" style="width: 4000px;"/> </td>
<td> <img src="images/step_2.png" alt="Drawing" style="width: 4000px;"/> </td>
</tr></table>
<br>
<table><tr>
<td> <img src="images/step_3.png" alt="Drawing" style="width: 4000px;"/> </td>
<td> <img src="images/step_4.png" alt="Drawing" style="width: 4000px;"/> </td>
</tr></table>
<br>
<img src="images/step_5.png" width=500 height=500 />
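
The stepwise process in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a full C4.5 implementation: it handles categorical features only and omits windowing, continuous features, missing values and pruning. The function names (`entropy`, `gain_ratio`, `build_tree`, `predict`), the nested-dict tree representation and the toy dataset are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, feature, target):
    """Steps 1-4: information gain of `feature` divided by its own entropy."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    node_entropy = entropy([row[target] for row in rows])              # step 1
    average = sum(len(g) / len(rows) * entropy(g) for g in groups.values())  # step 2
    gain = node_entropy - average                                      # step 3
    feature_entropy = entropy([row[feature] for row in rows])
    return gain / feature_entropy if feature_entropy > 0 else 0.0      # step 4


def build_tree(rows, features, target, max_depth=5):
    """Return a leaf label, or a nested dict {feature: {class: subtree}}."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority
    best = max(features, key=lambda f: gain_ratio(rows, f, target))    # step 5
    if gain_ratio(rows, best, target) <= 0:
        return majority
    remaining = [f for f in features if f != best]
    children = {}
    for value in {row[best] for row in rows}:                          # step 6
        subset = [row for row in rows if row[best] == value]
        children[value] = build_tree(subset, remaining, target, max_depth - 1)  # step 7
    return {best: children}


def predict(tree, row):
    """Walk the nested dict until a leaf label is reached (None if a class is unseen)."""
    while isinstance(tree, dict):
        feature, children = next(iter(tree.items()))
        tree = children.get(row[feature])
    return tree


# Hypothetical toy data.
data = [
    {"outlook": "sunny", "wind": "calm", "play": "yes"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "rain", "wind": "calm", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "calm", "play": "yes"},
    {"outlook": "overcast", "wind": "strong", "play": "yes"},
]

tree = build_tree(data, ["outlook", "wind"], "play")
print(tree)
print(predict(tree, {"outlook": "overcast", "wind": "strong"}))
```

On this data the root is split on `wind`, which has the larger gain ratio, and only the `strong` branch needs a further split on `outlook`.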