# Decision Trees

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.

In the example below, the table lists the attribute values along some root-to-leaf paths of the tree, together with the resulting class. For instance, for the part of the tree shown in the image, the table is:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/decision_tree.svg" align="center" width="300"/>
    
| Occupied | Type  | Rainy | Hungry | Hot | Date | Happiness | Class  |
| :------: | :---: | :---: | :----: | :-: | :--: | :-------: | :----: |
|    T     | Pizza |   T   |   T    |  T  |  T   |     T     | go     |
|    T     | Thai  |   T   |   T    |  T  |  T   |     F     | not go |
|    T     | Thai  |   T   |   T    |  T  |  F   |     T     | not go |
|    T     | Other |   F   |   T    |  T  |  T   |     F     | not go |
|    T     | Other |   F   |   T    |  T  |  T   |     T     | not go |
  

Quiz 1: Which is the best division of data in a decision tree?

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/quiz_1.svg" align="center" width="600"/>

The best division is the one that minimizes the entropy of the resulting groups. The division in the center is therefore the best one, since it separates the balls and the crosses into two different groups.
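
To make "minimizes the entropy" concrete, the sketch below scores a split by the weighted entropy of the groups it produces. The counts are hypothetical (they are not read from the quiz image), and `split_entropy` is an illustrative helper name, with entropy measured in bits (log base 2).

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a group described by its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_entropy(groups):
    """Weighted average entropy of the groups produced by a split."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(g) for g in groups)

# Hypothetical counts: each group is (balls, crosses).
mixed_split = [(4, 4), (4, 4)]   # both sides are still half and half
pure_split = [(8, 0), (0, 8)]    # each side holds a single class

print(split_entropy(mixed_split))  # 1.0
print(split_entropy(pure_split))   # 0.0
```

A split that leaves both sides evenly mixed keeps the entropy at its maximum of 1 bit, while a split into pure groups brings it down to 0.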

## Decision Trees Expressiveness: AND, OR, XOR

Create a decision tree that represents the boolean expression **A AND B**:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/decision_and.svg" align="center" width="400"/>

Create a decision tree that represents the boolean expression **A OR B**:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/decision_or.svg" align="center" width="400"/>

Create a decision tree that represents the boolean expression **A XOR B**:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/decision_xor.svg" align="center" width="200"/>
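
The three boolean expressions above can also be read as nested conditionals, one attribute test per level. This is a minimal sketch: the function names are illustrative, and testing A before B is an assumption about how the figures are drawn.

```python
def tree_and(a, b):
    # If A is False the answer is already known, so B is never tested.
    if not a:
        return False
    return b

def tree_or(a, b):
    # If A is True the answer is already known, so B is never tested.
    if a:
        return True
    return b

def tree_xor(a, b):
    # No shortcut: B must be tested on both branches of A.
    if a:
        return not b
    return b

for a in (False, True):
    for b in (False, True):
        print(a, b, tree_and(a, b), tree_or(a, b), tree_xor(a, b))
```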

As we can see, the XOR expression requires the entire decision tree, whereas the AND and OR trees can be pruned. When a decision tree has more attributes, the number of nodes can grow fast. For example, generalizing the **OR** function to more than two attributes gives the **N-OR** expression, also called the **ANY** expression. The decision tree for an **ANY** expression is:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/decision_any.svg" align="center" width="150"/>

In the case of an **N-OR** expression, we need $n$ nodes in the tree, so its size is linear in $n$. Now consider an **N-XOR** decision tree using *odd parity*, i.e., the output is TRUE if and only if the number of attributes that are TRUE is odd:

<img src="https://cdn.jsdelivr.net/gh/rogergranada/MOOCs/edX/GTx/CS7641/SL1/images/decision_nxor.svg" align="center" width="250"/>

As we can see, the **N-XOR** decision tree needs an exponential number of nodes as $n$ grows, i.e., $O(2^n)$ nodes.
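
The two growth rates can be checked by building both trees and counting their test nodes. This is only a sketch: the nested-dict representation and the helper names (`any_tree`, `parity_tree`, `count_tests`) are illustrative and do not come from the lecture.

```python
def any_tree(n):
    """N-OR: test one attribute per level; a True answer ends at a leaf."""
    if n == 0:
        return False                      # no attribute was True
    return {"T": True, "F": any_tree(n - 1)}

def parity_tree(n, odd=False):
    """N-XOR (odd parity): every path must test all n attributes."""
    if n == 0:
        return odd                        # leaf: was the number of Trues odd?
    return {"T": parity_tree(n - 1, not odd), "F": parity_tree(n - 1, odd)}

def count_tests(tree):
    """Number of internal (attribute-test) nodes in a nested-dict tree."""
    if not isinstance(tree, dict):
        return 0
    return 1 + count_tests(tree["T"]) + count_tests(tree["F"])

for n in range(1, 6):
    print(n, count_tests(any_tree(n)), count_tests(parity_tree(n)))
```

For each $n$, the ANY tree has $n$ test nodes, while the parity tree has $2^n - 1$.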

## Decision trees: Expressiveness

- XOR is hard  
- $n$ attributes (boolean): $O(n!)$ possible attribute orderings
- How many trees? (a lot!)
- Output is boolean

**Truth table**:

| $A_1$ | $A_2$ | $A_3$ | ... | $A_n$ | Output |
| ----- | ----- | ----- | --- | ----- | ------ |
|   T   |   T   |   T   | ... |   T   |   T/F  |
|   T   |   T   |   T   | ... |   F   |   T/F  |
|   F   |   T   |   T   | ... |   T   |   T/F  | 
|  ...  |  ...  |  ...  | ... |  ...  |        |
|   F   |   F   |   F   | ... |   F   |   T/F  |

- How many rows? $2^n$
- How many different truth tables are there, i.e., how many ways to fill the output column? $2^{2^n}$

    # Checking how it grows: number of ways to fill the output column
    def size_truth_table(n):
        return 2**(2**n)

    for n in range(1, 9):
        print('Size of truth table for n=%d: %d' % (n, size_truth_table(n)))

    Size of truth table for n=1: 4
    Size of truth table for n=2: 16
    Size of truth table for n=3: 256
    Size of truth table for n=4: 65536
    Size of truth table for n=5: 4294967296
    Size of truth table for n=6: 18446744073709551616
    Size of truth table for n=7: 340282366920938463463374607431768211456
    Size of truth table for n=8: 115792089237316195423570985008687907853269984665640564039457584007913129639936


## ID3 Algorithm

    Loop:
        A <- best attribute        # highest Information Gain(S, A)
        Assign A as decision attribute for Node
        For each value of A:
            Create a descendant of Node
        Sort training examples to leaves
        If examples are perfectly classified:
            STOP
        Else:
            Iterate over leaves
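$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_v \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where $S$ is the collection of examples, $A$ is an attribute, and $S_v$ is the subset of $S$ in which $A$ takes the value $v$. Entropy is measured as:

$$\mathrm{Entropy}(S) = - \sum_v p(v) \log p(v)$$

where here $v$ ranges over the class labels and $p(v)$ is the fraction of examples in $S$ with label $v$.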
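
The loop and the two formulas can be put together in a short recursive sketch. This is an illustration under stated assumptions rather than a reference implementation: entropy uses log base 2, examples are dicts of attribute values, the tree is a nested tuple, ties between attributes go to the first one listed, and a majority vote is used when no attributes remain. The toy data set is hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting the examples on attribute `attr`."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    if len(set(labels)) == 1:            # perfectly classified: stop
        return labels[0]
    if not attrs:                        # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    children = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep], rest)
    return (best, children)

# Toy data (hypothetical): the label is A AND B.
rows = [{"A": a, "B": b} for a in (False, True) for b in (False, True)]
labels = [row["A"] and row["B"] for row in rows]
print(id3(rows, labels, ["A", "B"]))
```

On this toy data the learned tree tests one attribute at the root and only tests the other on the branch where the first is True, matching the pruned AND tree. Attribute values never seen during training are not handled in this sketch.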


### ID3: Inductive bias

- Prefer trees with good splits near the top
- Prefer trees that classify the training data correctly over ones that do not
- Prefer shorter trees to longer ones

## Other considerations

When dealing with continuous values, split on ranges instead of individual values. For example, age could be divided into <20 and >=20, so that each continuous attribute becomes a True/False test on a range.
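
One way to pick such a range is sketched below. The ages and labels are made up, and `best_threshold` is an illustrative helper: it tries the midpoint between each pair of consecutive sorted values and keeps the cut with the lowest weighted entropy. This is a common approach, not something these notes prescribe, and it assumes at least two distinct values.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return the cut t minimizing the weighted entropy of (x < t) vs (x >= t)."""
    points = sorted(set(values))
    # Candidate cuts: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]

    def score(t):
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        return (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)

    return min(candidates, key=score)

ages = [12, 15, 18, 22, 30, 41]                  # hypothetical data
labels = ["no", "no", "no", "yes", "yes", "yes"]
t = best_threshold(ages, labels)
print(t)                                         # 20.0
print([age < t for age in ages])                 # boolean attribute "age < 20"
```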

The algorithm finishes when everything is correctly classified or when there are no more attributes to split on.