# Decision Trees

## Decision Trees Basics

### Classification Problems with Nominal Features

Setting:
* $ X $ is a multiset of feature vectors
* $ C $ is a set of classes
* $ D \subseteq X \times C $ is a multiset of examples

Learning task:
* Fit $ D $ using a decision tree $ T $.

**Decision Tree for the Concept "EnjoySurfing"**

![img1](img/topic5img1.png)

**Splitting, Induced Splitting**

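Let $ X $ be a set of feature vectors and $ D $ a set of examples. A splitting of $ X $ is a decomposition of $ X $ into mutually exclusive, non-empty subsets $ X_1, \dots, X_m $, i.e., $ X = X_1 \cup \dots \cup X_m $ with $ X_j \cap X_{j'} = \emptyset $ for $ j \neq j' $.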

This induces a splitting $ D_1, \dots, D_m $ of $ D $, where $ D_l $, $ l = 1, \dots, m $, is defined as $ \{(\mathbf{x}, c) \in D \mid \mathbf{x} \in X_l\} $.

![img2](img/topic5img2.png)

A splitting of $ X $ depends on the measurement scale of a feature:

1. $ m $-ary splitting induced by a (nominal) feature $ A $:
   - $ dom(A) = \{a_1, \dots, a_m\}: X = \{\mathbf{x} \in X: \mathbf{x}|_A = a_1\} \cup \dots \cup \{\mathbf{x} \in X: \mathbf{x}|_A = a_m\} $
2. Binary splitting induced by a (nominal) feature $ A $:
   - $ B \subset dom(A): X = \{\mathbf{x} \in X: \mathbf{x}|_A \in B\} \cup \{\mathbf{x} \in X: \mathbf{x}|_A \notin B\} $
3. Binary splitting induced by an ordinal feature $ A $:
   - $ v \in dom(A): X = \{\mathbf{x} \in X: \mathbf{x}|_A \succeq v\} \cup \{\mathbf{x} \in X: \mathbf{x}|_A \prec v\} $

Note:
* $ \mathbf{x}|_A $ denotes the projection operator, which returns the vector component (dimension) of $ \mathbf{x} $ that is associated with the feature $ A $.
* A splitting of $ X $ into two disjoint, non-empty subsets is called a binary splitting.
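
The three kinds of splitting, and the splitting of $ D $ they induce, can be sketched in Python. This is a minimal illustration and not part of the original notes: feature vectors are assumed to be dicts mapping feature names to values, and the helper names (`induced_splitting`, `m_ary`, `binary_nominal`, `binary_ordinal`) as well as the example data are hypothetical.

```python
from collections import defaultdict

def induced_splitting(D, key):
    """Group examples (x, c) by key(x); the groups form D_1, ..., D_m."""
    parts = defaultdict(list)
    for x, c in D:
        parts[key(x)].append((x, c))
    return dict(parts)

def m_ary(A):
    """m-ary splitting: one subset per value of the nominal feature A."""
    return lambda x: x[A]

def binary_nominal(A, B):
    """Binary splitting: value of A inside B vs. outside B."""
    return lambda x: x[A] in B

def binary_ordinal(A, v):
    """Binary splitting: value of A at least v vs. strictly below v."""
    return lambda x: x[A] >= v

# Hypothetical examples (feature vector, class); "waves" is treated as ordinal.
D = [
    ({"wind": "weak", "waves": 1}, "no"),
    ({"wind": "strong", "waves": 3}, "yes"),
    ({"wind": "medium", "waves": 2}, "yes"),
    ({"wind": "strong", "waves": 1}, "no"),
]

by_wind = induced_splitting(D, m_ary("wind"))
by_strong = induced_splitting(D, binary_nominal("wind", {"strong"}))
by_waves = induced_splitting(D, binary_ordinal("waves", 2))

print({k: len(v) for k, v in by_wind.items()})
print({k: len(v) for k, v in by_waves.items()})
```

Because every example is placed in exactly one group, the groups are mutually exclusive and together cover $ D $, which is what the definition of an induced splitting requires.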

**Decision Tree**

Let $ X $ be a set of feature vectors and $ C $ be a set of classes. A decision tree $ T $ for $ X $ and $ C $ is a finite tree with a distinguished root node. A non-leaf node $ t $ of $ T $ is assigned (1) a set $ X(t) \subseteq X $, (2) a splitting of $ X(t) $, and (3) a one-to-one mapping of the subsets of the splitting to its successors.

Recap: $ X(t) = X $ iff $ t $ is the root node. A leaf node is assigned a class from $ C $.

Classification of some $ \mathbf{x} \in X $ given a decision tree $ T $:
1. Find root node $ t $ of $ T $
2. If $ t $ is a non-leaf node, find among its successors the node $ t' $ whose subset of the splitting of $ X(t) $ contains $ \mathbf{x} $. Repeat step 2 with $ t = t' $.
3. If $ t $ is a leaf node, label $ \mathbf{x} $ with the associated class

The set of possible decision trees over $ D $ forms the hypothesis space $ H $.

### Notation

Let $ T $ be a decision tree for $ X $ and $ C $, let $ D $ be a set of examples, and let $ t $ be a node of $ T $.
* $ X(t) $ denotes the subset of $ X $ that is represented by $ t $
* $ D(t) $ denotes the subset of the example set that is represented by $ t $, where $ D(t) = \{(\mathbf{x}, c) \in D \mid \mathbf{x} \in X(t)\} $.

![img3](img/topic5img3.png)

### Algorithm Template: Construction

Algorithm: Decision Tree Construction

Input: Multiset of examples $ D $

Output: Root node of decision tree $ t $

![img4](img/topic5img4.png)
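
The template itself is given in the figure above. As a rough sketch of how such a recursive construction can look (not necessarily the exact procedure in the figure), the following Python code builds a tree with m-ary splittings on nominal features. The `Node` class, the function names, the example data, and in particular the placeholder split criterion `choose_feature` (it simply takes the first feature with more than one value) are assumptions made for illustration.

```python
from collections import Counter

class Node:
    def __init__(self, label):
        self.label = label      # majority class of the examples at this node
        self.feature = None     # feature used for the splitting (None for a leaf)
        self.successors = {}    # feature value -> successor node

def majority_class(D):
    return Counter(c for _, c in D).most_common(1)[0][0]

def choose_feature(D, features):
    """Placeholder criterion: first feature that takes more than one value in D."""
    for A in features:
        if len({x[A] for x, _ in D}) > 1:
            return A
    return None

def dt_construct(D, features):
    t = Node(majority_class(D))
    if len({c for _, c in D}) == 1:   # D is pure, so t stays a leaf
        return t
    A = choose_feature(D, features)
    if A is None:                     # no feature separates the examples
        return t
    t.feature = A
    rest = [f for f in features if f != A]
    for a in sorted({x[A] for x, _ in D}):
        D_a = [(x, c) for x, c in D if x[A] == a]   # induced subset of D
        t.successors[a] = dt_construct(D_a, rest)
    return t

# Hypothetical training examples (feature vector, class).
D = [
    ({"wind": "weak", "waves": "low"}, "no"),
    ({"wind": "strong", "waves": "high"}, "yes"),
    ({"wind": "strong", "waves": "low"}, "no"),
    ({"wind": "weak", "waves": "high"}, "no"),
]
root = dt_construct(D, ["wind", "waves"])
print(root.feature, sorted(root.successors))
```

A practical algorithm would replace `choose_feature` with a criterion that prefers splittings yielding purer subsets; the recursion itself stays the same.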

### Algorithm Template: Classification

Algorithm: Decision Tree Classification

Input:
* Feature vector $ \mathbf{x} $
* Root node of DT $ t $

Output: $ y(\mathbf{x}) $ (class of feature vector in the decision tree)

DT-Classify($ \mathbf{x}, t $)

```
IF isLeafNode(t)
THEN return (label(t))
ELSE return (DT-Classify(x, splitSuccessor(t, x)))
```
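
A direct Python transcription of DT-Classify might look as follows. It is a sketch under assumptions: the `Node` class, the dict-based feature vectors, and the small example tree are hypothetical, and every non-leaf node is assumed to use an m-ary splitting on a single nominal feature.

```python
class Node:
    def __init__(self, label=None, feature=None, successors=None):
        self.label = label                  # class, set for leaf nodes
        self.feature = feature              # splitting feature, set for non-leaf nodes
        self.successors = successors or {}  # feature value -> successor node

def is_leaf_node(t):
    return not t.successors

def split_successor(t, x):
    """Return the successor whose subset of the splitting contains x."""
    return t.successors[x[t.feature]]

def dt_classify(x, t):
    if is_leaf_node(t):
        return t.label
    return dt_classify(x, split_successor(t, x))

# Hypothetical tree: split on "wind", then on "waves" for strong wind.
tree = Node(feature="wind", successors={
    "weak": Node(label="no"),
    "strong": Node(feature="waves", successors={
        "high": Node(label="yes"),
        "low": Node(label="no"),
    }),
})

print(dt_classify({"wind": "strong", "waves": "high"}, tree))  # yes
```

In this sketch a feature value that does not occur in the tree raises a `KeyError`; how unseen values should be handled is a separate design decision.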

### When to use Decision Trees

* the objects can be described by feature-value combinations
* the domain and range of the target function are discrete
* hypotheses can be represented in DNF
* the training set contains noise

Typical application areas:
* medical diagnosis
* fault detection in technical systems
* risk analysis for credit approval
* basic scheduling tasks such as calendar management
* classification of design flaws in software engineering

### Assessment of Decision Trees

1. Size

Among those theories that can explain an observation, the simplest one is to be preferred (Ockham's Razor).

Here: Among all decision trees of minimum classification error, we choose the one of smallest size.

2. Classification Error

Quantifies the rigor according to which a class label is assigned to $ \mathbf{x} $ in a leaf node based on the examples in the example set.

If each leaf node of a decision tree $ T $ represents a single example in $ D $, the classification error of $ T $ w.r.t. $ D $ is zero.

**Assessment of Decision Trees: Size**

* Leaf node number
  - corresponds to number of rules encoded in DT
* Tree height
  - corresponds to the maximum rule length and bounds the number of premises to be evaluated to reach a class decision
* External path length
  - totals lengths of all paths required to reach leaf nodes from root (corresponds to the total space to store all rules encoded within the decision tree)
* Weighted external path length
  - external path length with each length value weighted by the number of examples in $ D $ which are classified by this path
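
The four size measures can be computed with a short traversal. The sketch below uses a deliberately bare, hypothetical tree encoding (not from the notes): a non-leaf node is a list of its successors, and a leaf is an integer giving the number of examples in $ D $ that reach it.

```python
def leaf_depths(t, depth=0):
    """Yield (path length from root, number of examples) for every leaf."""
    if isinstance(t, int):
        yield depth, t
    else:
        for successor in t:
            yield from leaf_depths(successor, depth + 1)

# Root with one leaf (4 examples) and one non-leaf node with two leaves (3 and 1 examples).
tree = [4, [3, 1]]

leaves = list(leaf_depths(tree))
leaf_count = len(leaves)
height = max(d for d, _ in leaves)
external_path_length = sum(d for d, _ in leaves)
weighted_external_path_length = sum(d * n for d, n in leaves)

print(leaf_count, height, external_path_length, weighted_external_path_length)  # 3 2 5 12
```

Here the leaf paths have lengths 1, 2, and 2, so the external path length is 5, while the weighted variant is $ 1 \cdot 4 + 2 \cdot 3 + 2 \cdot 1 = 12 $.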

Example:

![img5](img/topic5img5.png)

The following trees accurately classify all examples in $ D $:

![img6](img/topic5img6.png)

The problem of deciding, for a set of examples $ D $, whether a decision tree exists whose external path length is bounded by $ b $ is NP-complete.

**Assessment of Decision Trees: Classification Error**

The class that is assigned to a node $ t $, $ label(t) $, is defined as follows:

$ label(t) = \text{argmax}_{c \in C} |\{(\mathbf{x}, c) \in D(t)\}| $

Misclassification rate of node classifier $ t $ w.r.t. $ D(t) $:

$ \text{Err}(t, D(t)) = \frac{|\{(\mathbf{x}, c) \in D(t): c \neq label(t)\}|}{|D(t)|} $

$ = 1 - \text{max}_{c \in C} \frac{|\{(\mathbf{x}, c) \in D(t)\}|}{|D(t)|} $

Misclassification rate of decision tree classifier $ T $ w.r.t. $ D $:

$ \text{Err}(T, D) = \sum\limits_{t \in leaves(T)}\frac{|D(t)|}{|D|} \cdot \text{Err}(t, D(t)) $
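
As a small worked example of these formulas, the following sketch computes $ label(t) $, $ \text{Err}(t, D(t)) $, and $ \text{Err}(T, D) $ for a tree with two leaves. The function names and the example sets are hypothetical; each leaf is represented only by its example set $ D(t) $, and feature vectors are left out (`None`) because only the class labels matter here.

```python
from collections import Counter
from fractions import Fraction

def label(D_t):
    """Most frequent class among the examples at a node."""
    return Counter(c for _, c in D_t).most_common(1)[0][0]

def err_node(D_t):
    """Fraction of examples in D(t) whose class differs from label(t)."""
    wrong = sum(1 for _, c in D_t if c != label(D_t))
    return Fraction(wrong, len(D_t))

def err_tree(leaf_example_sets):
    """Leaf errors weighted by the share of D that reaches each leaf."""
    n = sum(len(D_t) for D_t in leaf_example_sets)
    return sum(Fraction(len(D_t), n) * err_node(D_t) for D_t in leaf_example_sets)

# Hypothetical example sets of the two leaves.
leaf_1 = [(None, "yes"), (None, "yes"), (None, "no")]
leaf_2 = [(None, "no"), (None, "no"), (None, "no"), (None, "yes"), (None, "no")]

print(label(leaf_1), err_node(leaf_1))  # yes 1/3
print(label(leaf_2), err_node(leaf_2))  # no 1/5
print(err_tree([leaf_1, leaf_2]))       # 1/4
```

The tree error is $ \frac{3}{8} \cdot \frac{1}{3} + \frac{5}{8} \cdot \frac{1}{5} = \frac{1}{4} $, i.e., two of the eight examples are misclassified.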