---
title: "Tree-based Methods"
subtitle: "Unit 3.3"
author: "Sean Sylvia"
format:
  revealjs:
    theme: default
    slide-number: true
    chalkboard: true
    transition: fade
    progress: true
    incremental: false
    toc: false
    scrollable: false
    smaller: true
    footer: "UNC HPM 883 - Advanced Quantitative Methods"
draft: true
---


## Unit 2: Basic ML Crash Course

1. Introduction to ML

2. Lasso and friends (Linear High-dimensional Regression)

3. Tree-based methods (Nonlinear)

# Review

## Basic ML Setup

1. Flexible functional forms

2. Limit expressiveness via **regularization**

3. Learn how much to regularize via **tuning**

::: {.fragment}

- What do the features imply about properties of $\hat{f}$?
- How can we use $\hat{f}$ in applied data analyses?
:::

## The Approximation-Overfit Tradeoff

::: {.columns}
::: {.column width="45%"}
**The Fundamental Challenge**

As model complexity increases, we face two competing forces:

1. **Approximation error decreases** as we better capture the true underlying function
2. **Estimation error increases** as we begin to fit noise in our training data

This creates the **bias-variance tradeoff** that defines machine learning:

- **Simple models**: High bias, low variance
- **Complex models**: Low bias, high variance

:::

::: {.column width="55%"}
![Bias-variance tradeoff visualization. The blue curve represents test error, while the red curve shows training error. The gap widens as model complexity increases, indicating overfitting.](media/bias_variance_tradeoff.png)
:::
:::
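
A minimal simulation sketch of the tradeoff, assuming a made-up true function (`true_f`), noise level, and sample sizes that are not from the course materials. It fits polynomials of increasing degree and reports training and test error. Training error cannot rise as the degree grows, while test error typically falls and then rises.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" regression function, chosen only for illustration
    return np.sin(np.pi * x)

# Simulated training and test samples with additive noise
x_train = rng.uniform(-1, 1, 40)
x_test = rng.uniform(-1, 1, 400)
y_train = true_f(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = true_f(x_test) + rng.normal(0, 0.3, x_test.size)

degrees = [1, 3, 6, 9]   # model complexity = polynomial degree
train_mse, test_mse = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_mse.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```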

## Supervised Learning

For supervised learners, we need three things:

1. A function class
2. A regularizer
3. An optimization algorithm to find the best fit

## Choosing a Regularization Parameter Using k-fold Cross-Validation

![](media/k-fold.png)
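
The procedure in the figure can be sketched in a few lines. This is an illustration under stated assumptions, not the course's implementation: it uses ridge regression (because it has a closed-form solution) on simulated data, and the helper names `ridge_fit` and `kfold_cv_mse` and the grid of candidate values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (no intercept; simulated data are mean zero)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv_mse(X, y, lam, k=5, seed=0):
    # Shuffle the row indices once, then split them into k roughly equal folds.
    # A fixed seed means every candidate lambda is scored on the same folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)        # fit on the other k-1 folds
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(errors))                        # average validation MSE

# Simulated data: 100 observations, 20 predictors, only 3 of which matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]                  # candidate penalties
cv_mse = [kfold_cv_mse(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_mse))]
print("CV-selected lambda:", best_lambda)
```

After selecting `best_lambda`, the usual next step is to refit on the full training sample with that value.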

## Full ML Exercise 

![](media/ml-exercise.png)

## The Regularization Spectrum: How We Control Complexity

Every model class has its own unique form of regularization that controls the bias-variance tradeoff. Understanding this spectrum reveals the fundamental unity behind seemingly diverse machine learning approaches.

::: {.columns}
::: {.column width="100%"}
<table class="regularization-table">
  <thead>
    <tr>
      <th>Function Class</th>
      <th>Regularization Parameters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Linear</td>
      <td>LASSO, ridge, elastic net</td>
    </tr>
    <tr>
      <td>Decision/regression trees</td>
      <td>Depth, leaves, leaf size, info gain</td>
    </tr>
    <tr>
      <td>Random forest</td>
      <td>Trees, variables per tree, sample sizes, complexity</td>
    </tr>
    <tr>
      <td>Nearest neighbors</td>
      <td>Number of neighbors</td>
    </tr>
    <tr>
      <td>Kernel regression</td>
      <td>Bandwidth</td>
    </tr>
    <tr>
      <td>Splines</td>
      <td>Number of knots, order</td>
    </tr>
    <tr>
      <td>Neural nets</td>
      <td>Layers, sizes, connectivity, drop-out, early stopping</td>
    </tr>
  </tbody>
</table>
:::
:::

::: {.fragment}
> **Cross-cutting insight**: While the specific mechanisms differ, regularization always involves restricting a model's capacity to memorize training data, instead encouraging it to generalize underlying patterns. Tree-based methods share this fundamental principle with linear models, but implement it through structural constraints rather than coefficient penalties.
:::
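
For readers working in Python, the sketch below maps a few rows of the table to the scikit-learn estimator arguments that play the corresponding role. It is an illustrative lookup only: the choice of scikit-learn is an assumption (the course may use other software), and the list is not exhaustive.

```python
# Illustrative lookup: scikit-learn estimator name -> arguments that act as
# the regularization "knobs" for that function class.
sklearn_knobs = {
    "Lasso": ["alpha"],
    "Ridge": ["alpha"],
    "ElasticNet": ["alpha", "l1_ratio"],
    "DecisionTreeRegressor": [
        "max_depth", "max_leaf_nodes", "min_samples_leaf", "min_impurity_decrease",
    ],
    "RandomForestRegressor": [
        "n_estimators", "max_features", "max_samples", "max_depth",
    ],
    "KNeighborsRegressor": ["n_neighbors"],
}

for estimator, knobs in sklearn_knobs.items():
    print(f"{estimator}: {', '.join(knobs)}")
```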

## Lasso Regression: Constrained Minimization to Regularize

::: {.nonincremental}
**Objective**: Minimize the sum of squared errors while keeping coefficients small
:::

::: {.columns}
::: {.column width="40%"}
### Constrained Form

$$
\begin{align}
\min_{\beta} &\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2\\
\text{subject to } &\sum_{j=1}^{p} |\beta_j| \leq t
\end{align}
$$

- $t \geq 0$ is the constraint parameter
- Smaller $t$ means more regularization
- $t = 0$ forces all $\beta_j = 0$
:::

::: {.column width="60%"}
![](media/lasso-contours.png){width=150%}

*Contours of the RSS function and the lasso constraint region (diamond). The solution often lands on a corner, where one or more coefficients are exactly zero.*
:::
:::

## Lasso: Lagrangian Form


$$
\min_{\beta} \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

- $\lambda \geq 0$ is the penalty parameter
- $\lambda$ and $t$ have an inverse relationship: a larger $\lambda$ corresponds to a smaller $t$
- As $\lambda \rightarrow \infty$, all $\beta_j \rightarrow 0$
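
The inverse relationship can be checked numerically. The sketch below is a bare-bones coordinate descent solver for the Lagrangian objective exactly as written above (no $1/2$ or $1/n$ scaling, so its `lam` is not on the same scale as the penalty argument in packaged software). The function name `lasso_cd`, the simulated data, and the grid of penalties are assumptions for illustration. For each $\lambda$ it reports $\sum_j |\hat\beta_j|$, which is the implied $t$.

```python
import numpy as np

def soft_threshold(z, g):
    # Shrink z toward zero by g; return exactly zero when |z| <= g
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize sum((y - b0 - X b)^2) + lam * sum(|b_j|) by coordinate descent."""
    Xc = X - X.mean(axis=0)          # centering absorbs the intercept b0
    yc = y - y.mean()
    beta = np.zeros(X.shape[1])
    col_ss = (Xc ** 2).sum(axis=0)   # x_j' x_j for each column
    resid = yc.copy()                # residual at beta = 0
    for _ in range(n_iter):
        for j in range(len(beta)):
            # x_j' (residual with coordinate j's contribution added back)
            rho = Xc[:, j] @ resid + col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2) / col_ss[j]
            resid += Xc[:, j] * (beta[j] - new_bj)   # keep residual in sync
            beta[j] = new_bj
    return beta

# Simulated data: 6 predictors, 3 with nonzero true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=80)

# Smallest penalty at which every coefficient is zero for this objective
lam_max = 2 * np.max(np.abs((X - X.mean(axis=0)).T @ (y - y.mean())))
lambdas = [0.0, 0.1 * lam_max, 0.5 * lam_max, 1.1 * lam_max]
l1_norms = [float(np.abs(lasso_cd(X, y, lam)).sum()) for lam in lambdas]

for lam, t in zip(lambdas, l1_norms):
    print(f"lambda = {lam:8.2f}  ->  implied t = {t:.3f}")
```

At $\lambda = 0$ the solver approaches the OLS fit, and past `lam_max` every coefficient is exactly zero, which corresponds to $t = 0$ in the constrained form.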


## Geometric Interpretation

::: {.columns}
::: {.column width="50%"}
- **The constraint region**: $\sum_{j=1}^{p} |\beta_j| \leq t$ forms a diamond (L1 norm)
- Ridge regression's constraint region is instead a circle (L2 norm)
- **Key insight**: The corners of the diamond lie on the coordinate axes, and the RSS contours often first touch the region at a corner
- At a corner, some coefficients are exactly zero
:::

::: {.column width="50%"}
![](media/lasso-contours.png){width=150%}

*Contours of the RSS function and the lasso constraint region (diamond). The solution often lands on a corner, where one or more coefficients are exactly zero.*
:::
:::

## Lasso: Lambda and Coefficient Paths (Relaxing the Constraint)

![Each line traces one coefficient. Lambda decreases (the constraint is relaxed) moving from left to right.](media/lambda.png)

## Why Lasso Performs Variable Selection

::: {.nonincremental}
The L1 penalty's diamond shape makes it likely for solutions to occur at corners where some $\beta_j = 0$
:::

::: {.columns}
::: {.column width="35%"}
- The solution occurs where the RSS contours first touch the constraint region
- The corners of the diamond lie on the coordinate axes
- When the solution is at a corner, some coefficients equal zero
- **Result**: Automatic variable selection
:::

::: {.column width="65%"}

![](media/lasso_vs_ridge_vs_en.png){width=125%}

*Comparison of lasso (diamond) vs. ridge (circle) vs. elastic net (rounded diamond) constraint regions.*
:::
:::
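
The same point has a simple algebraic counterpart in one special case. Assuming an orthonormal design ($X^\top X = I$), the lasso solution for the objective $\text{RSS} + \lambda \sum_j |\beta_j|$ is soft-thresholding of the OLS estimates at $\lambda/2$, while ridge with penalty $\lambda \sum_j \beta_j^2$ rescales them by $1/(1+\lambda)$. The sketch below compares the two; the OLS values in `b_ols` are hypothetical.

```python
import numpy as np

def lasso_orthonormal(b_ols, lam):
    # Soft-thresholding: shift toward zero by lam/2, exactly zero if |b| <= lam/2
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2, 0.0)

def ridge_orthonormal(b_ols, lam):
    # Proportional shrinkage: nonzero whenever the OLS estimate is nonzero
    return b_ols / (1.0 + lam)

b_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical OLS estimates
lam = 1.0

lasso_b = lasso_orthonormal(b_ols, lam)
ridge_b = ridge_orthonormal(b_ols, lam)

print("lasso:", lasso_b)   # the two small coefficients are set to zero
print("ridge:", ridge_b)   # every coefficient is halved, none is zero
```

Outside the orthonormal case there is no closed form, but the qualitative contrast (exact zeros versus proportional shrinkage) is the same.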

## The Lasso Selection Problem

::: {.nonincremental}
**Challenge**: Different values of λ along the regularization path can lead to different selected variables
:::

::: {.fragment}
**OLS (All Variables):**  
Health = β₀ + β₁·Age + β₂·Income + β₃·Education + β₄·SES + ... + βₚ·Xₚ + ε
:::

::: {.fragment}
**Lasso (λ = 0.1):**  
Health = β₀ + β₁·Age + β₂·Income + β₃·Education + <span class="faded">β₄·SES</span> + ... + βₚ·Xₚ + ε
:::

::: {.fragment}
**Lasso (λ = 0.2):**  
Health = β₀ + β₁·Age + <span class="faded">β₂·Income + β₃·Education</span> + β₄·SES + ... + βₚ·Xₚ + ε
:::

::: {.fragment}
**Key implications:**

- Variable selection depends heavily on choice of λ
- Highly correlated predictors compete for selection
- Different random seeds in cross-validation → different final models
- Selection can be unstable with small changes in data
:::
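
A small simulation can make the instability concrete. The sketch below is illustrative only: the data-generating process (two nearly identical predictors plus one irrelevant one), the bootstrap design, the penalty rule (half of each resample's all-zero threshold), and the compact solver `lasso_cd` are all assumptions, not the course's procedure. It records which predictors have nonzero coefficients in each bootstrap resample.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for sum((y - b0 - X b)^2) + lam * sum(|b_j|)
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    beta, resid = np.zeros(X.shape[1]), yc.copy()
    col_ss = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(len(beta)):
            rho = Xc[:, j] @ resid + col_ss[j] * beta[j]
            new_bj = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_ss[j]
            resid += Xc[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)               # unrelated to the outcome
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(size=n)

n_boot = 50
selected = np.zeros((n_boot, 3), dtype=bool)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of the rows
    Xb, yb = X[idx], y[idx]
    # Penalty set to half of the value that would zero out every coefficient
    lam_max = 2 * np.max(np.abs((Xb - Xb.mean(axis=0)).T @ (yb - yb.mean())))
    selected[b] = lasso_cd(Xb, yb, 0.5 * lam_max) != 0

# Share of resamples in which each predictor was kept
print(dict(zip(["x1", "x2", "x3"], selected.mean(axis=0))))
```

If the selection shares for `x1` and `x2` are both well away from 0 and 1, the lasso is switching between the two correlated predictors from one resample to the next.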


```{css, echo=FALSE}
.faded {
  color: #999999;
  text-decoration: line-through;
}
```