<a href="https://colab.research.google.com/github/SzymonNowakowski/Machine-Learning-2024/blob/master/Lab06_tree-methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 6 - Tree-based Methods

### Author: Szymon Nowakowski


# Introduction
-------------------

In today's class, we will explore **Classification and Regression Trees (CART)** and the **Random Forest** algorithm, both of which represent **tree-based methods** in machine learning. **CART** serves as a foundational algorithm capable of handling both **classification** and **regression** tasks. It works by recursively partitioning the data based on the most informative features, resulting in a simple yet powerful binary decision tree. At each node, **CART** greedily selects the split that most reduces a chosen criterion: **Gini impurity** (or cross-entropy) for classification and **mean squared error** for regression.

While individual decision trees are easy to interpret, they can suffer from **overfitting**, limiting their generalization to new data. This challenge is addressed by **Random Forest**, an **ensemble method** that grows many decision trees, each on a bootstrap sample of the training data and typically considering only a random subset of the features at each split, and then aggregates their predictions. By averaging results in regression tasks or using majority voting in classification, **Random Forest** usually improves accuracy and robustness while reducing overfitting.

Both **CART** and **Random Forest** are considered **off-the-shelf** methods, meaning they can be applied directly to a wide range of problems with minimal tuning, making them go-to solutions for many real-world machine learning tasks. Today, we will implement these algorithms, apply them to datasets, and evaluate their performance to better understand their practical applications.

# CART: Detailed Explanation
----------------------------

## The Split into Regions in Decision Trees

In both **classification** and **regression trees**, the data space is recursively partitioned into rectangular regions based on feature values. At each node of the tree, the algorithm selects a **feature** and a **threshold** that best splits the data into two subsets. This process continues recursively, resulting in a hierarchical partitioning of the feature space.

- Each split corresponds to a decision rule, like $X_j < t$, where $X_j$ is a feature and $t$ is the threshold.
- The data points that satisfy the rule go to the left branch; the rest go to the right.
- The process continues until a stopping criterion is met (e.g., maximum depth, minimum number of samples, or impurity threshold).

The end result is a division of the space into **non-overlapping regions** $R_1, R_2, \dots, R_M$, where each region corresponds to a terminal (leaf) node in the tree.
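To make the idea of regions concrete, here is a minimal sketch of how a sequence of rules $X_j < t$ assigns every point to exactly one leaf. The tree below is hand-written for illustration only; its thresholds, the tuple layout, and the names `TREE` and `region_of` are assumptions, not part of any library.

```python
# Hypothetical hand-written tree on two features, x[0] and x[1].
# Internal node: (feature_index, threshold, left_subtree, right_subtree)
# Leaf: a string naming the region.
TREE = (0, 0.5,
        (1, 0.3, "R1", "R2"),  # x[0] < 0.5: split again on x[1]
        "R3")                  # x[0] >= 0.5: a single region


def region_of(x, node=TREE):
    """Follow the rules X_j < t from the root down to a leaf."""
    while not isinstance(node, str):
        j, t, left, right = node
        node = left if x[j] < t else right
    return node


points = [(0.2, 0.1), (0.2, 0.9), (0.8, 0.1), (0.5, 0.3)]
regions = [region_of(p) for p in points]
print(regions)  # ['R1', 'R2', 'R3', 'R3']
```

Note that the last point sits exactly on the threshold $X_0 = 0.5$; because the rule is a strict inequality, it goes to the right branch.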




## How the Regression Tree Predicts a New Value

In a **regression tree**, the prediction for a new observation is based on the **mean** of the target values in the region where the observation falls.

- When a new data point is passed through the tree, it follows the decision rules from the root to a specific leaf node.
- The predicted value is the **average** of the training data points within that leaf's region.

**Mathematically:**

If a region $R_m$ contains $N_m$ training points $x_i$ with target values $y_i$, the prediction $\hat{y}$ for any $x \in R_m$ is:

$$
\hat{y} = \frac{1}{N_m} \sum_{i: x_i \in R_m} y_i
$$
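As a small illustration of the formula, the sketch below uses a one-split tree on a single feature and predicts with the leaf means. The threshold, the toy data, and the names `THRESHOLD`, `leaf_mean`, and `predict` are made up for this example.

```python
from statistics import mean

# Hypothetical one-split tree: X < 3.0 goes left, everything else goes right.
THRESHOLD = 3.0
x_train = [1.0, 2.0, 2.5, 4.0, 5.0]
y_train = [10.0, 12.0, 14.0, 30.0, 34.0]

# Each leaf stores the mean target of the training points that fall into it.
left_targets = [y for x, y in zip(x_train, y_train) if x < THRESHOLD]
right_targets = [y for x, y in zip(x_train, y_train) if x >= THRESHOLD]
leaf_mean = {"left": mean(left_targets), "right": mean(right_targets)}


def predict(x_new):
    """Route the new point to its leaf and return that leaf's mean."""
    return leaf_mean["left"] if x_new < THRESHOLD else leaf_mean["right"]


print(predict(2.2))   # 12.0
print(predict(10.0))  # 32.0
```

Every new point in the same region receives the same prediction, which is why a regression tree produces a piecewise-constant function.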

## How the Regression Tree Builds the Region Partitioning

The goal in regression is to minimize the **sum of squared residuals (SSR)** within each region. At each split, the algorithm selects:

- The **feature** $X_j$ to split on.
- The corresponding **threshold** $t$ for that feature.

**The selection process involves both $X_j$ (the feature) and $t$ (the split point).**

The algorithm proceeds as follows:

1. **For each feature** $X_j$:
   - Consider all possible thresholds $t$ (often midpoints between sorted unique values).
   - Evaluate the SSR for each possible split.

2. **Select the feature $X_j^*$ and threshold $t^*$** that minimize the total SSR:

  $$
  \sum_{m=1}^{M} \sum_{i: x_i \in R_m} (y_i - \bar{y}_{R_m})^2
  $$

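  Where:

  - $R_m$ is a region (leaf node) defined by the splits.
  - $\bar{y}_{R_m}$ is the mean target value in region $R_m$.
  - A single split changes only the two child regions of the node being split, so minimizing the total is equivalent to minimizing the SSR summed over those two children.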

  This process is repeated recursively for each new subset until a stopping criterion is met.
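The search described above can be sketched as an exhaustive loop over features and midpoint thresholds. This is a minimal, unoptimized illustration on toy data, not how production libraries implement it; the function names `ssr` and `best_split` are hypothetical.

```python
def ssr(values):
    """Sum of squared residuals around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y):
    """Return (feature index, threshold, total SSR) of the best single split."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive unique values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            total = ssr(left) + ssr(right)
            if total < best[2]:
                best = (j, t, total)
    return best


# Toy data: feature 0 separates low targets from high targets, feature 1 does not.
X = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
y = [5.0, 6.0, 20.0, 21.0]

j_star, t_star, total_ssr = best_split(X, y)
print(j_star, t_star, total_ssr)  # 0 2.5 1.0
```

The split $X_0 < 2.5$ wins because it puts the targets 5 and 6 in one region and 20 and 21 in the other, leaving an SSR of 0.5 in each child.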


## How the Classification Tree Predicts a Class for a New Observation

In a **classification tree**, the prediction for a new observation is based on the **majority class** within the region it falls into.

- The new observation follows the decision rules down the tree until it reaches a leaf node.
- The predicted class is the one with the highest proportion of samples in that leaf.

If region $R_m$ contains samples from classes $C_1, C_2, \dots, C_K$, the predicted class $\hat{C}$ is:

$$
\hat{C} = \arg\max_{c} \, p_{mc}
$$

Where $p_{mc}$ is the proportion of samples of class $c$ in region $R_m$.
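A minimal sketch of the majority rule for a single leaf follows. The labels are invented for illustration, and ties (not covered in the text) are simply broken in favor of the class that appears first in the leaf.

```python
from collections import Counter

# Hypothetical class labels of the training samples that ended up in one leaf.
leaf_labels = ["cat", "dog", "cat", "bird", "cat", "dog"]

counts = Counter(leaf_labels)
n = len(leaf_labels)

# Class proportions within the leaf.
proportions = {c: k / n for c, k in counts.items()}

# The prediction is the class with the largest proportion (arg max).
predicted = max(proportions, key=proportions.get)

print(proportions["cat"])  # 0.5
print(predicted)           # cat
```

The proportions themselves are also useful: they can be reported as estimated class probabilities for any observation that falls into this leaf.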


## How the Classification Tree Builds the Region Partitioning

In classification, the goal is to minimize impurity in the resulting regions. At each split, the algorithm selects:

- The **feature** $X_j$ to split on.
- The corresponding **threshold** $t$ for that feature.

**The selection process again involves both $X_j$ (the feature) and $t$ (the split point).**

The process is as follows:

1. **For each feature** $X_j$:
   - Consider all possible thresholds $t$.
   - Calculate impurity measures (Gini or Cross-Entropy) for the resulting splits.

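2. **Select the feature $X_j^*$ and threshold $t^*$** that result in the **largest decrease in impurity**, that is, the impurity of the parent node minus the average impurity of the two child nodes, each weighted by its share of the parent's samples.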

  a) Gini Index

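    The **Gini Index** can be read as the probability of misclassifying a sample drawn at random from $R_m$ if it is labeled at random according to the class proportions in $R_m$: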

    $$
    G(R_m) = 1 - \sum_{c=1}^{K} p_{mc}^2
    $$

    Where:

    - $p_{mc}$ is the proportion of samples of class $c$ in region $R_m$.

  b) Cross-Entropy (Deviance)

    The **Cross-Entropy** measure (with the convention $0 \log 0 = 0$) is:

    $$
    H(R_m) = - \sum_{c=1}^{K} p_{mc} \log(p_{mc})
    $$

The split that leads to the **greatest reduction in impurity** (using either Gini or Cross-Entropy) is selected.
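The two impurity measures and the resulting impurity decrease can be computed in a few lines. The sketch below makes two assumptions: the logarithm is taken in base 2 (the formula above does not fix a base), and each child is weighted by its share of the parent's samples, which is the usual CART convention. The function names and the toy split are hypothetical.

```python
from collections import Counter
from math import log2


def proportions(labels):
    """Class proportions p_mc for the samples in one region."""
    n = len(labels)
    return [k / n for k in Counter(labels).values()]


def gini(labels):
    return 1.0 - sum(p * p for p in proportions(labels))


def entropy(labels):
    # Assumption: base-2 logarithm. Absent classes never appear in the
    # Counter, so log2(0) is never evaluated.
    return -sum(p * log2(p) for p in proportions(labels))


def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy split of six samples into a mixed left child and a pure right child.
parent = ["a", "a", "a", "b", "b", "b"]
left = ["a", "a", "a", "b"]
right = ["b", "b"]

print(gini(parent))                                      # 0.5
print(entropy(parent))                                   # 1.0
print(round(impurity_decrease(parent, left, right), 4))  # 0.25
print(impurity_decrease(parent, left, right, impurity=entropy))
```

Both measures are zero for a pure region and largest when the classes are evenly mixed, so in practice they often rank candidate splits similarly.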





## Key Points

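- The **tree-building process selects both the feature** $X_j$ **and the threshold** $t$ **that best split the data at the current node** based on the chosen objective (SSR for regression, impurity reduction for classification). The choice is greedy: it is the best split at that step, not necessarily the best for the final tree.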
- This process continues recursively, resulting in a tree that partitions the data space into regions with either low variance (for regression) or low impurity (for classification).
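Putting the pieces together, the recursion can be sketched as a small classification tree builder based on the Gini index. This is a teaching sketch under simplifying assumptions: the only stopping rules are a maximum depth and "no split lowers the weighted Gini", and the tuple-based tree layout, the function names, and the toy data are all hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def best_split(X, y):
    """Best (feature, threshold) by weighted Gini, or None if nothing improves."""
    n = len(y)
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive unique values
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best, best_score = (j, t), score
    return best


def build_tree(X, y, depth=0, max_depth=3):
    """Return a leaf (class label) or a node (feature, threshold, left, right)."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < t]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= t]
    return (j, t,
            build_tree([r for r, _ in left], [c for _, c in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [c for _, c in right], depth + 1, max_depth))


def predict(tree, x):
    """Follow the rules X_j < t until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] < t else right
    return tree


# Toy data with three classes in two dimensions.
X = [[1.0, 1.0], [2.0, 1.5], [1.5, 4.0], [4.0, 4.5], [5.0, 1.0], [5.5, 2.0]]
y = ["a", "a", "b", "b", "c", "c"]

tree = build_tree(X, y)
print(predict(tree, [1.2, 1.1]))  # a
print(predict(tree, [3.0, 4.2]))  # b
print(predict(tree, [5.2, 1.5]))  # c
```

Swapping `gini` for an SSR criterion and the majority class for the leaf mean would turn the same recursion into a regression tree.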
