## AdaBoost

Ensemble methods:
<img src="images/ensemble_outline.jpg" width="500"/>

**WHAT**

AdaBoost [Freund and Schapire](#Firstpaper) is a method to produce a powerful ensemble by building a sequence of estimators.

**WHY**

Theoretical foundation: any weak learning algorithm (one that performs only slightly better than random guessing, i.e. its error is less than $1/2-\gamma$ for some $\gamma > 0$) can be efficiently transformed, or boosted, into a strong learning algorithm \[[Schapire](#Schapire)\].


**HOW**

AdaBoost fits a sequence of weak learners on repeatedly modified versions of the data. The data are modified by applying weights $w_1, w_2, ... , w_N$ to each of the training samples in each iteration. 

Initially, all samples have the same weight $w_i = 1/N$, so the first weak learner is simply trained on the original data. 

Then, in each successive iteration, the sample weights are individually modified and a learner is trained on the reweighted data. At a given step, the weights of samples that were incorrectly predicted by the boosted model induced at the previous step are increased, while the weights of samples that were predicted correctly are decreased.

As iterations proceed, examples that are hard to predict will receive very high weights.
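
As a rough illustration of a single reweighting round, the sketch below uses made-up labels and predictions (they are assumptions, not data from this notebook) and the AdaBoost.M1 form of the learner weight. After renormalisation the one misclassified sample gains weight and the correctly classified ones lose weight.

```python
import numpy as np

# Toy labels and weak-learner predictions (hypothetical, for illustration only).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the weak learner misses sample 1

N = len(y_true)
w = np.full(N, 1.0 / N)                 # initial weights w_i = 1/N

miss = (y_pred != y_true)               # boolean mask of misclassified samples
err = np.sum(w[miss]) / np.sum(w)       # weighted error of the weak learner
alpha = np.log((1.0 - err) / err)       # learner weight (AdaBoost.M1 form)

w_new = w * np.exp(alpha * miss)        # only misclassified samples are scaled up
w_new /= w_new.sum()                    # renormalise so the weights sum to 1

print("old weights:", w)
print("new weights:", w_new)
```

Renormalising is optional for the algorithm itself, because the weighted error divides by the sum of the weights; it is done here only to make the increase and decrease visible.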

<a name="Firstpaper"></a> \[[Freund and Schapire](https://books.google.com.au/books?hl=en&lr=&id=CAnxFA6DaagC&oi=fnd&pg=PA23&dq=a+decision+theoretic+generalization+of+online+learning&ots=XaVCZZZXsD&sig=2YyHZOvGolS_gG7uhZ1KuPZYjPc#v=onepage&q=a%20decision%20theoretic%20generalization%20of%20online%20learning&f=false)\] Freund, Y. and Schapire, R.E., 1995, March. A decision-theoretic generalization of on-line learning and an application to boosting. In Second European Conference on Computational Learning Theory (pp. 23-37).

<a name="Schapire"></a> \[[Schapire](https://cs.rochester.edu/u/stefanko/Teaching/09CS446/Boosting.pdf)\] Schapire, R.E., 1990. The strength of weak learnability. Machine learning, 5(2), pp.197-227.

### The 1995 paper

#### 1.  1st reading
##### 1.1 Understanding
1. AdaBoost was originally motivated from a very different perspective (an application of on-line allocation) than the version most often discussed today (forward stagewise additive modelling based on exponential loss). 
2. At a given iteration $t$, the output is based only on the specific hypothesis $h_t$
<img src="images/AdaBoost_process.jpg" width="600"/>
3. The details and formulas
<img src="images/AdaBoost_Alg.jpg" width="600"/>

##### 1.2 Questions 
1. For the calculation of $\beta_t$, some versions include a coefficient 1/2 while others do not. What does the 1/2 do?
    
    Here the 1/2 arises from the solution obtained when minimising the loss function:
    $\beta_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$
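
One practical consequence can be checked numerically: variants without the 1/2 use learner weights that are exactly twice as large, and scaling every learner weight by the same positive constant does not change the sign of the weighted vote. The sketch below uses randomly generated predictions and errors (assumptions made purely for illustration) to compare the two votes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: predictions of M weak learners on N samples, in {-1, +1},
# and a weighted error in (0, 0.5) for each learner.
M, N = 5, 8
G = rng.choice([-1, 1], size=(M, N))
err = rng.uniform(0.1, 0.45, size=M)

beta = 0.5 * np.log((1 - err) / err)    # with the factor 1/2
alpha = np.log((1 - err) / err)         # without it, so alpha = 2 * beta

vote_beta = np.sign(beta @ G)           # weighted majority vote with beta
vote_alpha = np.sign(alpha @ G)         # weighted majority vote with alpha

same_vote = np.array_equal(vote_beta, vote_alpha)
print("identical final predictions:", same_vote)
```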

### AdaBoost in ESL
<img src='images/AdaBoost_M1.png' width='600'>

**Additive Expansion**
- General form:
\begin{align*}
f_m(x) &=  f_{m-1}(x) + \beta_m b(x; \gamma_m) \\
f(x) &=  \sum_{m=1}^M \beta_m b(x; \gamma_m)
\end{align*}

- In AdaBoost:
\begin{equation*}
f_m(x) = f_{m-1}(x) + \beta_m G_m(x)
\end{equation*}
where $G_m(x) \in \{-1,1\}$.

**AdaBoost is forward stagewise additive modelling based on exponential loss**

- The Exponential Loss Function:
    \begin{equation*}
    L(y_i, \hat{f}(x_i)) = \exp(-y_i \cdot \hat{f}(x_i))
    \label{eq:exp_loss} \tag{1}
    \end{equation*}
    where $y_i \in \{-1,1\}$.

- We want to minimise the loss for all examples and all basis estimators (*Attention to the equation here!*)
    \begin{equation*}
    \min\limits_{\{\beta_m, G_m(x)\}_1^M} \sum_{i=1}^N L \Big(y_i, \sum_{m=1}^M \beta_m G_m(x_i) \Big)
    \label{eq:c_el} \tag{2}
    \end{equation*}

- Minimising **Eq. $\eqref{eq:c_el}$** over all terms jointly is difficult, so we instead solve a simpler stagewise version: at step $m$, with $\hat{f}_{m-1}$ held fixed, we optimise only the new term
    \begin{equation*}
    \min\limits_{\beta_m, G_m(x)} \sum_{i=1}^N L(y_i, \hat{f}_m(x_i)) = \min\limits_{\beta_m, G_m(x)} \sum_{i=1}^N e^{-y_i \cdot \hat{f}_m(x_i)}
    \label{eq:s_el} \tag{3}
    \end{equation*}

<img src='images/AdaBoost_Forward_Stagewise.png' width='600'>

- Let's solve **Eq. $\eqref{eq:s_el}$**. It can be written as:
    \begin{align}
    \sum_{i=1}^N L(y_i, f_m(x_i)) &= \sum_{i=1}^N e^{-y_i(f_{m-1}(x_i) + \beta_m G_m(x_i))} \\
    &= \sum_{i=1}^N e^{-y_if_{m-1}(x_i)} \cdot e^{-y_i\beta_m G_m(x_i)} \\
    &= \sum_{i=1}^N w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)}
    \label{eq:4} \tag{4}
    \end{align}
    
    In line 2 of Eq. $\eqref{eq:4}$, the left factor depends on neither $\beta_m$ nor $G_m(x)$, so we treat it as a "coefficient": $w_i^{(m)}=e^{-y_if_{m-1}(x_i)}$ 

- **Eq. $\eqref{eq:4}$** is solved in two steps. For any fixed $\beta_m > 0$, a correctly classified sample ($y_i G_m(x_i)=1$) contributes $w_i^{(m)} e^{-\beta_m}$, while a misclassified sample ($y_i \neq G_m(x_i)$) contributes the larger $w_i^{(m)} e^{\beta_m}$. **Eq. $\eqref{eq:4}$** is therefore smallest when the total weight of the misclassified samples is smallest. Thus we first choose $G_m(x)$ to minimise the weighted error, and then, given that $G_m(x)$, calculate $\beta_m$.

- **First step: optimal $G_m(x)$** <br/> 
    The optimal $G_m(x)$ satisfies:
    \begin{equation*}
    G_m = \arg\min\limits_{G} 
    \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i))} {\sum_{i=1}^N w_i^{(m)}}
    = \arg\min\limits_{G} \frac{\sum_{y_i \neq G(x_i)} w_i^{(m)}}{\sum_{i=1}^N w_i^{(m)}}
    \end{equation*}
    
    Let the weighted prediction error of $G_m(x)$ be $err_m$; we have:
    \begin{equation*}
    err_m = \frac{\sum_{y_i \neq G_m(x_i)} w_i^{(m)}}{\sum_{i=1}^N w_i^{(m)}}
    \label{eq:5} \tag{5}
    \end{equation*}

- **Second step: optimal $\beta_m$** <br/> 
    Given this $err_m$, we want to solve $\beta_m$ in **Eq. $\eqref{eq:4}$**
    \begin{align}
    Eq.(4) = \sum_{i=1}^N w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)} &= \sum_{y_i=G_m(x_i)} w_i^{(m)} \cdot e^{-\beta_m} + \sum_{y_i \neq G_m(x_i)} w_i^{(m)} \cdot e^{\beta_m} \\
    &= [\sum_{i=1}^N w_i^{(m)} \cdot e^{-\beta_m} - \sum_{y_i \neq G_m(x_i)} w_i^{(m)} \cdot e^{-\beta_m}] + \sum_{y_i \neq G_m(x_i)} w_i^{(m)} \cdot e^{\beta_m}\\
    & = e^{-\beta_m} \sum_{i=1}^N w_i^{(m)}  + (e^{\beta_m}-e^{-\beta_m})\sum_{y_i \neq G_m(x_i)} w_i^{(m)}
    \label{eq:6} \tag{6}
    \end{align}
    
    Next we minimise $\eqref{eq:6}$ over $\beta_m$. Dividing it by $\sum_{i=1}^N w_i^{(m)}$, which does not depend on $\beta_m$, and using Eq. $\eqref{eq:5}$, we get
    \begin{equation*}
    e^{-\beta_m} + (e^{\beta_m}-e^{-\beta_m})err_m
    \label{eq:7} \tag{7}
    \end{equation*}
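    Setting the derivative of $\eqref{eq:7}$ with respect to $\beta_m$ to zero gives $-e^{-\beta_m} + (e^{\beta_m}+e^{-\beta_m})\,err_m = 0$, i.e. $e^{2\beta_m} = \frac{1-err_m}{err_m}$, so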
    \begin{equation*}
    \beta_m = \frac{1}{2} \log\frac{1-err_m}{err_m}
    \label{eq:8} \tag{8}
    \end{equation*}

- We have now obtained the optimal $G_m(x)$ and $\beta_m$ for **Eq. $\eqref{eq:s_el}$**: $G_m(x)$ is the weak learner with the smallest weighted error at iteration $m$ given the weights at iteration $m$, and, with $err_m$ denoting that weighted error, $\beta_m = \frac{1}{2} \log\frac{1-err_m}{err_m}$.
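
The closed form for $\beta_m$ can be sanity-checked numerically. The sketch below picks an arbitrary illustrative error of 0.3 (an assumption, not a value from this notebook), evaluates Eq. (7) on a fine grid, and compares the grid minimiser with the closed-form value; the helper name `normalised_loss` is hypothetical.

```python
import numpy as np

def normalised_loss(beta, err):
    """Eq. (7): e^{-beta} + (e^{beta} - e^{-beta}) * err."""
    return np.exp(-beta) + (np.exp(beta) - np.exp(-beta)) * err

err_m = 0.3                                       # illustrative weighted error
beta_closed = 0.5 * np.log((1 - err_m) / err_m)   # closed form, Eq. (8)

# Brute-force search over a fine grid of candidate beta values.
grid = np.linspace(0.0, 3.0, 300001)
beta_grid = grid[np.argmin(normalised_loss(grid, err_m))]

print("closed form:", beta_closed)
print("grid search:", beta_grid)
```

The two values should agree up to the grid spacing, which supports the derivation without replacing it.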

**AdaBoost minimises the exponential loss (eq. $\eqref{eq:exp_loss}$) via a forward-stagewise additive modelling approach.**

**AdaBoost is not optimising training set misclassification error.** It is therefore common for the training misclassification error to reach 0 after XXX rounds of iterations while the exponential loss keeps decreasing after XXX rounds (and the error on test data keeps decreasing as well). 
The exponential loss is more sensitive to changes in the estimated class probabilities.
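
A small sketch of why the two criteria can diverge: once every sample is on the correct side, the misclassification error is stuck at 0, but the exponential loss still shrinks as the margins $y_i f(x_i)$ grow. The scores below are made-up values chosen so that all four samples are already classified correctly.

```python
import numpy as np

y = np.array([1, -1, 1, -1])
base_scores = np.array([0.5, -0.2, 0.1, -0.8])   # hypothetical f(x_i), all correct

errors, losses = [], []
for scale in (1.0, 2.0, 4.0):
    f = scale * base_scores                      # larger scale means larger margins
    errors.append(np.mean(np.sign(f) != y))      # misclassification error
    losses.append(np.mean(np.exp(-y * f)))       # mean exponential loss

for scale, e, l in zip((1.0, 2.0, 4.0), errors, losses):
    print(f"scale={scale}: error={e:.2f}, exp loss={l:.4f}")
```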

The only remaining piece is the update rule for the "coefficient" $w_i^{(m)}$.
\begin{align}
w_i^{(m+1)}&=e^{-y_i f_{m}(x_i)} \\
&=e^{-y_i [f_{m-1}(x_i)+\beta_m G_m(x_i)]} \\
&= e^{-\beta_m y_iG_m(x_i)} e^{-y_i f_{m-1}(x_i)} \\
&= e^{-\beta_m y_iG_m(x_i)} w_i^{(m)} \\
&= e^{-\beta_m(1-2I(y_i \neq G_m(x_i)))} w_i^{(m)} \\
&= e^{2\beta_m I(y_i \neq G_m(x_i))} e^{-\beta_m} w_i^{(m)}
\label{eq:9} \tag{9}
\end{align}

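The step that introduces the indicator function uses the fact that $y_i$ and $G_m(x_i)$ both take values in $\{-1,1\}$:
\begin{align}
y_iG_m(x_i)&=1, \quad \text{if } y_i = G_m(x_i) \\
y_iG_m(x_i)&=-1, \quad \text{if } y_i \neq G_m(x_i) \\
\Rightarrow \quad y_iG_m(x_i)&= 1-2I(y_i \neq G_m(x_i))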
\end{align}

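**Eq. $\eqref{eq:9}$** gives the examples' weights at the next iteration, $w_i^{(m+1)}$, from $w_i^{(m)}$. The factor $e^{-\beta_m}$ multiplies every weight by the same value, so it has no effect on the weighted error in Eq. $\eqref{eq:5}$ and can be dropped. Writing $\alpha_m = 2\beta_m = \log\frac{1-err_m}{err_m}$ then gives the AdaBoost.M1 update $w_i \leftarrow w_i \cdot e^{\alpha_m I(y_i \neq G_m(x_i))}$, which is the form without the coefficient 1/2.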
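
The algebra in Eq. (9) can be checked numerically. The sketch below draws random labels, predictions and weights (all assumptions for illustration) and confirms that the exponential form and the indicator form of the update agree, and that dropping the common factor $e^{-\beta_m}$ leaves the normalised weights unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 6
y = rng.choice([-1, 1], size=N)          # true labels
G = rng.choice([-1, 1], size=N)          # hypothetical weak-learner predictions
w = rng.uniform(0.1, 1.0, size=N)        # current weights w_i^{(m)}
beta = 0.4                               # illustrative value of beta_m

miss = (y != G).astype(float)            # I(y_i != G_m(x_i))

w_direct = w * np.exp(-beta * y * G)                        # 4th line of Eq. (9)
w_indicator = w * np.exp(2 * beta * miss) * np.exp(-beta)   # last line of Eq. (9)
w_no_const = w * np.exp(2 * beta * miss)                    # common factor dropped

same_update = np.allclose(w_direct, w_indicator)
same_after_norm = np.allclose(w_direct / w_direct.sum(),
                              w_no_const / w_no_const.sum())
print(same_update, same_after_norm)
```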

### Write an AdaBoost algorithm step by step

In [44]:
# Implement the AdaBoost
import numpy as np


def myAdaBoost(X_train, y_train, X_test, y_test, weak_learner, m_iteration=200):
    """
    Implementation of AdaBoost.M1, Freund and Schapire (1997), see ESL p. 339.
    Two-class prediction problem with the classes -1 and 1;
    the final prediction uses sign(sum over learners of wl_weight*wl_prediction).
    """
    # 0. preprocess: y label to -1 and 1
    ori_labels = np.unique(y_train).astype(int)
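    if len(ori_labels) != 2:
        # raise instead of exit() so that the caller's session is not terminated
        raise ValueError('Can only handle exactly 2 classes')
    elif np.all(ori_labels==np.array([-1,1])):
        pass
    else:
        print('reset labels to -1 or 1')
        # smaller original label -> -1, larger -> 1;
        # y_test must use the same original labels as y_train
        y_train = np.where(y_train==ori_labels[0],-1,1)
        y_test = np.where(y_test==ori_labels[0],-1,1)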
    
    # 1. initialise weights for each of the N examples, w_i=1/N
    N, N_test = len(X_train), len(X_test)
    w = np.ones(X_train.shape[0])/N
    
    # 2. boosting
    
    predictions, test_predictions = [], []
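    wl_weights = []  # weight of each weak learner
    
    for m in range(1, m_iteration+1):
        # (a). train a weak classifier G_m(x) on the training data with weights w
        weak_learner.fit(X_train, y_train, sample_weight=w)
        
        # Note: appending weak_learner itself would store the same object every round,
        # because fit() refits it in place; a fresh copy per round
        # (e.g. via sklearn.base.clone) would be needed to keep the fitted models.
        # Here the predictions of each round are stored instead.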
        
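        # (b). calculate the weighted error of G_m(x)
        y_pre = weak_learner.predict(X_train)
        predictions.append(y_pre)
        error = np.sum(w*1*(y_pre != y_train)) / np.sum(w)
        # keep error away from 0 and 1 so that log((1-error)/error) stays finite
        error = np.clip(error, 1e-10, 1 - 1e-10)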
        
        # for testing
        y_test_pre = weak_learner.predict(X_test)
        test_predictions.append(y_test_pre)
        
        # (c). compute alpha_m
        alpha_m = np.log((1-error)/error)
        wl_weights.append(alpha_m)
        
        # (d). update training examples' weights
        w = w * np.exp(alpha_m*1*(y_pre != y_train))
        
    # 3. get the weighted majority vote
    wl_weights = np.array(wl_weights)  # m
    print('model weights (first three): ', wl_weights[:3])
    
    
    # training
    predictions = np.array(predictions) # m*N, predict value is either 1 or -1
    train_final = np.sign(np.sum(np.transpose(predictions)*wl_weights, axis=1))
    print('Train accuracy', np.sum(y_train==train_final)/N)
    
    
    # testing
    test_predictions = np.array(test_predictions)
    test_final = np.sign(np.sum(np.transpose(test_predictions)*wl_weights, axis=1))
    print('Test accuracy', np.sum(y_test==test_final)/N_test)

Test case from Li Hang (李航), 《统计机器学习》 (*Statistical Machine Learning*)

In [47]:
# 1. let's define a weak learner: 1-depth decision tree
from sklearn.tree import DecisionTreeClassifier
clf_tree = DecisionTreeClassifier(max_depth = 1, random_state = 1)

# 2. set up train and test examples
X_train = np.arange(10).reshape(10,1)
y_train = np.array([1,1,1,-1,-1,-1,1,1,1,-1])-6

X_test = np.array([5.8, 2.1, 4.6, 7.9]).reshape(4,1)
# same label offset as y_train, so that both are remapped to -1/1 consistently
y_test = np.array([-1, 1, -1, 1])-6

# 3. use AdaBoost
myAdaBoost(X_train, y_train, X_test, y_test, clf_tree, m_iteration=3)

reset labels to -1 or 1
model weights (first three):  [0.84729786 1.29928298 1.5040774 ]
Train accuracy 1.0
Test accuracy 0.75
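
As a cross-check that does not depend on scikit-learn, the same ten training points can be run through a hand-written decision stump. This is only a sketch: `fit_stump` and `predict_stump` are hypothetical helpers that exhaustively try thresholds at the midpoints between samples, which is an assumption about how a depth-1 tree behaves on this data rather than a description of `DecisionTreeClassifier` internals. The learner weights should agree with the ones printed above up to rounding.

```python
import numpy as np

def predict_stump(X, thr, polarity):
    """Predict `polarity` for x < thr and `-polarity` otherwise."""
    return np.where(X < thr, polarity, -polarity)

def fit_stump(X, y, w):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    xs = np.sort(np.unique(X))
    thresholds = (xs[:-1] + xs[1:]) / 2.0        # midpoints between samples
    for thr in thresholds:
        for polarity in (1, -1):
            pred = predict_stump(X, thr, polarity)
            err = np.sum(w * (pred != y)) / np.sum(w)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

X = np.arange(10, dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

w = np.full(len(X), 1.0 / len(X))                # w_i = 1/N
stumps, alphas = [], []
for m in range(3):
    err, thr, polarity = fit_stump(X, y, w)      # (a) fit on weighted data
    alpha = np.log((1.0 - err) / err)            # (c) learner weight, no 1/2
    pred = predict_stump(X, thr, polarity)
    w = w * np.exp(alpha * (pred != y))          # (d) reweight the samples
    stumps.append((thr, polarity))               # each round keeps its own stump
    alphas.append(alpha)

# Weighted majority vote over the three stored stumps.
votes = sum(a * predict_stump(X, t, p) for a, (t, p) in zip(alphas, stumps))
train_acc = np.mean(np.sign(votes) == y)

print("learner weights:", np.round(alphas, 4))
print("train accuracy:", train_acc)
```

Storing a `(threshold, polarity)` pair per round sidesteps the problem of appending one shared model object inside the loop.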


Test case from [sklearn.datasets.make_hastie_10_2](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_hastie_10_2.html#sklearn.datasets.make_hastie_10_2)

In [36]:
np.all(np.unique(y_train).astype(int)==np.array([-1,1]))

True

In [25]:
from sklearn import datasets
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)

X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]

myAdaBoost(X_train, y_train, X_test, y_test, clf_tree, m_iteration=400)

model weights (first three):  [0.17645644 0.16017128 0.24968398]
Train accuracy 0.9415
Test accuracy 0.884


### Example: Two-class AdaBoost

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles

In [None]:
X1, y1 = make_gaussian_quantiles(cov=2., n_samples=200, n_features=2,
                                n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
                                 n_samples=300, n_features=2,
                                 n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2+1))

In [None]:
set(list(y2))

In [None]:
# Create AdaBoosted DT
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME",n_estimators=200)

In [None]:
# Fit
bdt.fit(X, y)

In [None]:
# Mean accuracy of self.predict(X) wrt. y.
bdt.score(X,y)

In [None]:
n_trees = len(bdt)

# error at each iteration
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.plot(range(1, n_trees + 1), bdt.estimator_errors_[:n_trees],
         "b", label='SAMME', alpha=.5)
plt.legend()
plt.ylabel('Error')
plt.xlabel('Number of Trees')

# boost weight of each tree
plt.subplot(122)
plt.plot(range(1, n_trees + 1), bdt.estimator_weights_[:n_trees],
         "b", label='SAMME', alpha=.5)
plt.legend()
plt.ylabel('Weight')
plt.xlabel('Number of Trees')

In [None]:
plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(5, 5))

# Plot the decision boundaries
# plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
                c=c, cmap=plt.cm.Paired,
                s=20, edgecolor='k',
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')

In [None]:
# Plot the two-class decision scores
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.figure(figsize=(5,5))
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
             bins=10,
             range=plot_range,
             facecolor=c,
             label='Class %s' % n,
             alpha=.5,
             edgecolor='k')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right')
plt.ylabel('Samples')
plt.xlabel('Score')
plt.title('Decision Scores')

plt.tight_layout()
plt.show()