# Boosting

## Forward stagewise additive modeling
Forward stagewise additive modeling fits an additive expansion of the form:

$f(x) = \sum_{m=1}^{M} \beta_m b_m(x; \gamma_m)$

where $\beta_m$ are the expansion coefficients and $b_m(x; \gamma_m) = b(x; \gamma_m)$ are usually simple functions of $x$ parameterized by $\gamma_m$ (e.g., weak classifiers).
### Algorithm

1. Initialize $f_0(x) = 0$
2. For m = 1 to M:
 - a. Compute: $(\beta_m, \gamma_m) = \text{argmin}_{\beta, \gamma} \sum^{N}_{i=1} L(y_i, f_{m-1} (x_i) + \beta b(x_i; \gamma))$
 - b. Set $f_m(x) = f_{m-1} (x) + \beta_m b(x; \gamma_m)$

When $b(x; \gamma)$ is a tree, $\gamma$ parameterizes the splitting variables and split points at the internal nodes and the predictions at the terminal nodes.

Forward stagewise modeling approximates the solution to the joint minimization over all $\{\beta_m, \gamma_m\}_{m=1}^{M}$ by sequentially adding new basis functions
to the expansion without adjusting the parameters and coefficients of those that have already been added. At each iteration m, one solves for the
optimal basis function $b(x; \gamma_m)$ and coefficient $\beta_m$ and adds the new term to the current expansion $f_{m-1} (x)$; previously added
terms are not modified.
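
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not part of the original notes: it assumes squared-error loss, 1-D inputs, and regression stumps as basis functions, and the helper names (`fit_stump`, `predict_stump`, `forward_stagewise`) are hypothetical. With squared error, step 2a reduces to a least-squares fit to the current residuals, and $\beta_m$ is absorbed into the stump's leaf values.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump on 1-D inputs: (threshold, left_value, right_value).

    Assumes x has at least two distinct values.
    """
    best = None
    for t in np.unique(x)[:-1]:          # candidate split points
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def forward_stagewise(x, y, M):
    f = np.zeros(len(y))                 # step 1: f_0(x) = 0
    terms, losses = [], []
    for _ in range(M):
        # step 2a: for squared error, L(y, f + beta*b) = (r - beta*b)^2 with r = y - f,
        # so the best new term is the least-squares fit to the current residuals
        stump = fit_stump(x, y - f)
        # step 2b: add the new term; earlier terms are left untouched
        f = f + predict_stump(stump, x)
        terms.append(stump)
        losses.append(np.mean((y - f) ** 2))
    return terms, losses

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)
terms, losses = forward_stagewise(x, y, M=20)
print(losses[0], losses[-1])
```

Because each stage only adds a term and never revisits earlier ones, the training loss recorded in `losses` cannot increase from one stage to the next in this setup.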

## AdaBoost as forward stagewise additive modeling

Assume we have a two-class classification problem with class labels $y \in \{-1, 1\}$, and we want to build a classifier from weak classifiers $\{b_m\}$,
with $b_m(x) \in \{-1, 1\}$, that each do only slightly better than random guessing.

Let the loss function be the exponential loss $L(y, f(x)) = e^{-yf(x)}$.

For forward stagewise additive modeling, at each iteration m, one must solve

$(\beta_m, b_m) = \text{argmin}_{\beta, b} \sum_{i=1}^{N} e^{-y_i[f_{m-1}(x_i) + \beta b(x_i)]}$

\begin{aligned}
\implies (\beta_m, b_m) & = \text{argmin}_{\beta, b} \sum_{i=1}^{N} \underbrace{e^{-y_i f_{m-1}(x_i)}}_{\text{this term does not depend on $\beta, b$}}e^{-y_i \beta b(x_i)}\\
& = \text{argmin}_{\beta, b} \sum_{i=1}^{N} w_i^{(m)}e^{-y_i \beta b(x_i)}\\
\end{aligned}

where $w_i^{(m)} = e^{-y_i f_{m-1}(x_i)}$. Since $w_i^{(m)}$ depends on neither $\beta$ nor $b$, it can be regarded as a weight applied to sample $i$ at iteration m.

For any value of $\beta$,

  \begin{equation}
    w_i^{(m)}e^{-y_i \beta b(x_i)} =
    \begin{cases}
      w_i^{(m)} e^{-\beta}, & y_i = b(x_i)\\
      w_i^{(m)} e^{\beta}, & y_i \neq b(x_i)
    \end{cases}
  \end{equation}

\begin{aligned}
\implies \sum_{i=1}^{N} w_i^{(m)}e^{-y_i \beta b(x_i)} & = e^\beta \sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b(x_i)) + e^{-\beta}\sum_{i=1}^{N} w_i^{(m)}I(y_i = b(x_i))\\
& =  e^\beta \sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b(x_i)) + e^{-\beta}\left(\sum_{i=1}^{N} w_i^{(m)} - \sum_{i=1}^{N}w_i^{(m)}I(y_i \neq b(x_i))\right)\\
& = (e^\beta - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b(x_i)) + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}
\end{aligned}

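Since $e^\beta - e^{-\beta} > 0$ for any $\beta > 0$ and the second term does not depend on $b$, the minimizer over $b$ is

$b_m = \text{argmin}_{b} \sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b(x_i))$

which is the classifier that minimizes the weighted classification error.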
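
To make the weighted-error criterion concrete, here is a small sketch that searches over 1-D decision stumps for the one minimizing $\sum_i w_i I(y_i \neq b(x_i))$. The stump family, the function name `fit_weighted_stump`, and the toy data are assumptions chosen for illustration; any family of weak classifiers could be used instead.

```python
import numpy as np

def fit_weighted_stump(x, y, w):
    """Return (threshold, polarity, weighted_error) minimizing sum_i w_i * I(y_i != b(x_i)),
    where b(x) = polarity if x <= threshold else -polarity."""
    best_err, best_t, best_s = np.inf, None, None
    # -inf allows the constant classifier; other candidates are the observed values
    thresholds = np.concatenate(([-np.inf], np.unique(x)))
    for t in thresholds:
        for s in (1, -1):
            pred = np.where(x <= t, s, -s)
            err = w[pred != y].sum()          # weighted misclassification
            if err < best_err:
                best_err, best_t, best_s = err, t, s
    return best_t, best_s, best_err

# Toy data, made up for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, 1, -1])
uniform = np.full(5, 0.2)
skewed = np.array([0.1, 0.1, 0.1, 0.6, 0.1])   # the fourth sample carries most of the weight

stump_uniform = fit_weighted_stump(x, y, uniform)
stump_skewed = fit_weighted_stump(x, y, skewed)
print(stump_uniform)
print(stump_skewed)
```

With uniform weights the search settles on a split at 2, which misclassifies only the fourth sample. Once that sample carries most of the weight, the chosen split moves to 4 so that the heavy sample is classified correctly.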

Plugging $b_m$ into the objective and setting its derivative with respect to $\beta$ to zero:

$\implies(e^\beta + e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b_m(x_i)) - e^{-\beta} \sum_{i=1}^{N} w_i^{(m)} = 0$

$\implies \frac{e^\beta + e^{-\beta}}{e^{-\beta}} = \frac{\sum_{i=1}^{N} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b(x_i))}$

$\implies \frac{e^\beta}{e^{-\beta}} = \frac{\sum_{i=1}^{N} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b(x_i))} - 1$

$\implies 2\beta = \log\left(\frac{1}{error_m} - 1\right)$

$\implies \beta_m = \frac{1}{2} \log \left(\frac{1 - error_m}{error_m}\right)$

where $error_m = \frac{\sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b_m(x_i))}{\sum_{i=1}^{N} w_i^{(m)}}$ is the weighted misclassification rate of $b_m$.
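
As a sanity check on the closed form, the sketch below compares $\beta_m = \frac{1}{2}\log\frac{1 - error_m}{error_m}$ with a brute-force grid search over the objective $(e^\beta - e^{-\beta}) \sum_i w_i I(y_i \neq b(x_i)) + e^{-\beta} \sum_i w_i$. The weights and the misclassification mask are made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.1, 1.0, size=50)      # arbitrary positive sample weights (hypothetical)
miss = np.arange(50) % 4 == 0           # hypothetical mask: True where y_i != b_m(x_i)

def objective(beta):
    # (e^beta - e^-beta) * sum_i w_i I(miss_i) + e^-beta * sum_i w_i
    return (np.exp(beta) - np.exp(-beta)) * w[miss].sum() + np.exp(-beta) * w.sum()

error = w[miss].sum() / w.sum()                    # weighted error rate
beta_closed = 0.5 * np.log((1 - error) / error)    # closed-form minimizer

grid = np.linspace(-3.0, 3.0, 60001)               # grid spacing of 1e-4
beta_grid = grid[np.argmin(objective(grid))]       # brute-force minimizer

print(beta_closed, beta_grid)
```

The two values should agree up to the grid resolution. Note that $\beta_m > 0$ exactly when $error_m < \frac{1}{2}$, that is, when the weak classifier does better than random guessing on the weighted sample.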

Since $f_m(x) = f_{m-1}(x) + \beta_m b_m(x) \implies w_i^{(m + 1)} = e^{-y_i f_{m}(x_i)} = e^{-y_i[f_{m-1}(x_i) + \beta_m b_m(x_i)]} = w_i^{(m)} e^{-y_i\beta_mb_m(x_i)}$

This means that the weight applied on each sample is adjusted by a factor of $e^{-y_i\beta_mb_m(x_i)}$
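
The recursion can be checked numerically: applying the multiplicative update once per round should give the same weights as evaluating $e^{-y_i f_m(x_i)}$ directly. The labels, weak-classifier outputs, and coefficients below are hypothetical numbers chosen only to exercise the identity.

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1])
# Hypothetical outputs b_m(x_i) of three weak classifiers (one row per round)
B = np.array([[ 1, -1, -1,  1,  1],
              [ 1,  1,  1,  1, -1],
              [-1, -1,  1,  1, -1]])
betas = np.array([0.4, 0.7, 0.3])       # hypothetical coefficients beta_m

w = np.ones(5)                          # w^(1) = exp(-y * f_0) with f_0 = 0
for beta, b in zip(betas, B):
    w = w * np.exp(-y * beta * b)       # w^(m+1) = w^(m) * exp(-y * beta_m * b_m)

f = betas @ B                           # f_3(x_i) = sum_m beta_m * b_m(x_i)
direct = np.exp(-y * f)                 # weights computed directly from f_3

print(np.allclose(w, direct))
```

Starting from $w_i^{(1)} = 1$ rather than $\frac{1}{N}$ only rescales every weight by the same constant, which changes neither the minimizing classifier nor $error_m$.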

## AdaBoost

Even though the AdaBoost algorithm was originally motivated from a very different perspective than the one presented in the previous section, it is equivalent to
forward stagewise additive modeling with exponential loss.

### Algorithm
1. Initialize the observation weights $w_i^{(1)} = \frac{1}{N}, i\in \{1, \dots, N\}$
2. For m = 1 to M:
 - a. Fit a classifier $b_m(x)$ to the training data using weights $w_i^{(m)}$
 - b. Compute $error_m = \frac{\sum_{i=1}^{N} w_i^{(m)}I(y_i \neq b_m(x_i))}{\sum_{i=1}^{N} w_i^{(m)}}$
 - c. Compute $\beta_m = \frac{1}{2} \log \left(\frac{1 - error_m}{error_m}\right)$
 - d. Set $w_i^{(m + 1)} = w_i^{(m)} e^{-y_i\beta_mb_m(x_i)}, \forall i \in \{1, \dots, N\}$

3. Output $f(x) = \text{sign}\left[\sum_{m=1}^{M} \beta_m b_m(x)\right]$

At each iteration m, the weight of each sample is multiplied by:
1. $e^{\beta_m}$ if the sample is misclassified
2. $e^{-\beta_m}$ if the sample is classified correctly
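
The steps above translate fairly directly into NumPy. The following is a sketch under stated assumptions rather than a reference implementation: the weak learner is assumed to be a decision stump searched over all features, the function names are hypothetical, and two numerical safeguards (clipping $error_m$ and rescaling the weights each round) are additions that are not part of the algorithm as written. Rescaling does not change the result because $error_m$ is a ratio of weights.

```python
import numpy as np

def fit_stump(X, y, w):
    """Stump minimizing weighted error over all features: (feature, threshold, polarity)."""
    best = (np.inf, 0, -np.inf, 1)
    total = w.sum()
    for j in range(X.shape[1]):
        for t in np.concatenate(([-np.inf], np.unique(X[:, j]))):
            pred = np.where(X[:, j] <= t, 1, -1)
            err = w[pred != y].sum()
            # flipping the polarity flips every prediction, so its error is total - err
            for s, e in ((1, err), (-1, total - err)):
                if e < best[0]:
                    best = (e, j, t, s)
    return best[1:]

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(X[:, j] <= t, s, -s)

def adaboost_fit(X, y, M):
    N = len(y)
    w = np.full(N, 1.0 / N)                          # step 1
    ensemble = []
    for _ in range(M):
        stump = fit_stump(X, y, w)                   # step 2a
        pred = stump_predict(stump, X)
        error = w[pred != y].sum() / w.sum()         # step 2b
        error = np.clip(error, 1e-12, 1 - 1e-12)     # safeguard (assumption): avoid log(0)
        beta = 0.5 * np.log((1 - error) / error)     # step 2c
        w = w * np.exp(-y * beta * pred)             # step 2d
        w = w / w.sum()                              # safeguard (assumption): keep weights in range
        ensemble.append((beta, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(beta * stump_predict(stump, X) for beta, stump in ensemble)
    return np.sign(score)                            # step 3 (returns 0 on an exact tie)

# Synthetic two-class data with a diagonal boundary, made up for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

single = adaboost_fit(X, y, M=1)
boosted = adaboost_fit(X, y, M=50)
acc_single = np.mean(adaboost_predict(single, X) == y)
acc_boosted = np.mean(adaboost_predict(boosted, X) == y)
print(acc_single, acc_boosted)
```

A single axis-aligned stump cannot represent the diagonal boundary in this toy data, so comparing the two training accuracies shows what the weighted combination adds over one weak classifier.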


