<a href="https://colab.research.google.com/github/xiaochengJF/MachineLearning/blob/master/AdaBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AdaBoost


In [0]:
from numpy import *

## Dataset

In [0]:
def loadDataSet():
    x = [0, 1, 2, 3, 4, 5]
    y = [1, 1, -1, -1, 1, -1]
    return x, y

## Computing candidate split points

The midpoint of each pair of adjacent samples is used as a candidate split point.

In [0]:
def generateGxList(x):
    gxlist = []
    for i in range(len(x) - 1):
        gx = (x[i] + x[i + 1]) / 2
        gxlist.append(gx)
    return gxlist
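
The midpoint rule can also be written in vectorized form. The sketch below is a minimal illustration, assuming one-dimensional feature values; the helper name `midpoints` and the explicit `import numpy as np` are assumptions and not part of the original notebook.

```python
import numpy as np

def midpoints(x):
    """Hypothetical helper: midpoints between adjacent sorted feature values."""
    xs = np.sort(np.asarray(x, dtype=float))
    return (xs[:-1] + xs[1:]) / 2

# The six samples used in this notebook
split_points = midpoints([0, 1, 2, 3, 4, 5])
print(split_points)
```

For the six samples here this yields the five candidates 0.5, 1.5, 2.5, 3.5 and 4.5.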

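## Computing the error
Here $m = 1, 2, \dots, M$ indexes the boosting round, $i$ indexes the samples, $w_{mi}$ is the weight of sample $i$ in round $m$, and $I(\cdot)$ is the indicator function, equal to 1 when the expression in parentheses is true and 0 otherwise.
$$e_m = \sum^N_{i=1}w_{mi}I(G_m(x_i)\neq y_i)$$
<font color=red size=5>Note</font>: in the first round, every sample weight is initialized to $1/N$, where $N$ is the total number of samples.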

In [0]:
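def calcErrorNum(gx, x, y, weight):
    # Evaluate both orientations of a split at gx and keep the one with the smaller weighted error
    error1 = 0     # weighted error when x < gx is labeled +1 and x > gx is labeled -1
    errorNeg1 = 0  # weighted error when x < gx is labeled -1 and x > gx is labeled +1
    for i in range(len(x)):
        # Compare the feature value x[i], not the index i, with the split point
        if x[i] < gx and y[i] != 1: error1 += weight[i]
        if x[i] > gx and y[i] != -1: error1 += weight[i]
        if x[i] < gx and y[i] != -1: errorNeg1 += weight[i]
        if x[i] > gx and y[i] != 1: errorNeg1 += weight[i]
    if errorNeg1 < error1:
        return errorNeg1, -1 # samples with x > gx are labeled +1
    return error1, 1 # samples with x < gx are labeled +1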
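
To make the formula concrete, the sketch below evaluates the first-round weighted error for every candidate split with uniform weights $1/N$. It is a hedged illustration only: `weighted_error` is a hypothetical helper, and it fixes the orientation "predict $+1$ for $x$ below the threshold" rather than trying both orientations.

```python
import numpy as np

def weighted_error(threshold, polarity, x, y, w):
    """Hypothetical helper: weighted error of a stump that predicts
    `polarity` for x < threshold and `-polarity` otherwise."""
    pred = np.where(x < threshold, polarity, -polarity)
    return float(np.sum(w[pred != y]))

x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1, 1, -1, -1, 1, -1])
w = np.full(len(x), 1 / len(x))  # uniform first-round weights

# Weighted error of each candidate split with polarity +1
errors = {t: weighted_error(t, 1, x, y, w) for t in (0.5, 1.5, 2.5, 3.5, 4.5)}
best = min(errors, key=errors.get)
print(best, errors[best])
```

The split at 1.5 misclassifies only the sample at $x = 4$, so $e_1 = 1/6$.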

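## Computing the weak classifier weight
The lower a weak classifier's error rate, the larger its weight; the higher the error rate, the smaller its weight. Here $\log$ denotes the natural logarithm.
$$\alpha_m = \frac12\log(\frac{1-e_m}{e_m})$$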

In [0]:
def calcAlpha(minError):
    alpha = 1/2 * log((1-minError)/minError)
    return alpha
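
A few sample values show how $\alpha_m$ shrinks as the error grows. The sketch below is an illustration under stated assumptions: `stump_alpha` is a hypothetical name, and the clipping with `eps` is an added safeguard against division by zero when the error is exactly 0, which the notebook code does not include.

```python
import math

def stump_alpha(error, eps=1e-10):
    """Hypothetical helper: alpha = 0.5 * ln((1 - e) / e).
    The error is clipped away from 0 and 1 (an assumption, not in the notebook)."""
    e = min(max(error, eps), 1 - eps)
    return 0.5 * math.log((1 - e) / e)

for e in (1 / 6, 0.2, 0.5):
    print(e, stump_alpha(e))
```

An error of $1/6$ gives $\alpha = \frac12\ln 5 \approx 0.8047$, while an error of $0.5$ (no better than chance) gives $\alpha = 0$.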

## Computing the new sample weights

$$\begin{aligned}
\left\{\begin{aligned}
&w_{m+1,i}=\frac{w_{mi}}{z_m}\exp(-\alpha_my_iG_m(x_i))\\
&z_m=\sum^N_{i=1}w_{mi}\exp(-\alpha_my_iG_m(x_i))
\end{aligned}\right.
\end{aligned}$$

In [0]:
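def calcNewWeight(alpha,ygx, weight, gx, y):
    # Note: the sample index i stands in for the feature value here,
    # which is valid only because x[i] == i in this dataset
    newWeight = []
    sumWeight = 0  # normalization factor z_m
    for i in range(len(weight)):
        flag = 1  # y_i * G_m(x_i): +1 if classified correctly, -1 otherwise
        if i < gx and y[i] != ygx: flag = -1
        if i > gx and y[i] != -ygx: flag = -1
        weighti = weight[i]*exp(-alpha*flag)
        newWeight.append(weighti)
        sumWeight += weighti
    # Convert to an array explicitly, then normalize so the weights sum to 1
    newWeight = array(newWeight) / sumWeight

    return newWeight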
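
As a worked example of the update rule, the sketch below applies it once, starting from uniform weights and the first-round stump (split at 1.5, $+1$ below the split). The helper `update_weights` is hypothetical and written in vectorized NumPy; it is meant as an illustration of the formula, not a replacement for the function above.

```python
import numpy as np

def update_weights(w, alpha, y, pred):
    """Hypothetical helper: w_i * exp(-alpha * y_i * G(x_i)), normalized to sum to 1."""
    unnormalized = w * np.exp(-alpha * y * pred)
    return unnormalized / unnormalized.sum()

x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1, 1, -1, -1, 1, -1])
w1 = np.full(len(x), 1 / len(x))              # uniform first-round weights
pred = np.where(x < 1.5, 1, -1)               # first-round stump: +1 below 1.5
alpha1 = 0.5 * np.log((1 - 1 / 6) / (1 / 6))  # error 1/6 in round one
w2 = update_weights(w1, alpha1, y, pred)
print(w2)
```

The single misclassified sample ($x = 4$) ends up with weight 0.5, and each of the five correctly classified samples drops to 0.1, so the next round concentrates on the hard sample.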

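## Training a base weak classifier
The weak classifier with the lowest weighted error rate is selected as the best weak classifier for the current iteration.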

In [0]:
def trainfxi(fx, i, x, y, weight):
    minError = inf
    bestGx = 0.5
    gxlist = generateGxList(x)
    bestygx = 1
    # Find the base classifier (split point and orientation) with the lowest weighted error
    for xi in gxlist:
        error, ygx = calcErrorNum(xi, x, y, weight)
        if error < minError:
            minError = error
            bestGx = xi
            bestygx = ygx
    fx[i]['gx'] = bestGx
    # Compute alpha, the weight of this weak classifier
    alpha = calcAlpha(minError)
    fx[i]['alpha'] = alpha
    fx[i]['ygx'] = bestygx
    # Compute the new weights of the training samples
    newWeight = calcNewWeight(alpha,bestygx, weight, bestGx, y)
    return newWeight

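## Error rate of the current linear combination of weak classifiers
Compute the error rate of the strong classifier formed by the linear combination of all weak classifiers trained so far.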

In [0]:
def calcFxError(fx, n, x, y):
    errorNum = 0
    for i in range(len(x)):
        fi = 0
        for j in range(n):
            fxiAlpha = fx[j]['alpha']
            fxiGx = fx[j]['gx']
            ygx = fx[j]['ygx']
            if x[i] < fxiGx: fgx = ygx  # compare the feature value, not the index
            else: fgx = -ygx
            fi += fxiAlpha * fgx
        if sign(fi) != y[i]: errorNum += 1

    return errorNum/len(x)

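## Training the strong classifier
First, every sample weight is initialized to $1/N$; then weak classifiers are trained one at a time, with the sample weights updated after each. `errorThreshold` is the error rate below which training stops, and `maxIterNum` is the maximum number of iterations; together these two parameters decide when the loop ends.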

In [1]:
def trainAdaBoost(x, y, errorThreshold, maxIterNum):
    fx = {}
    weight = []
    xNum = len(x)
    for i in range(xNum):
        w = float(1/xNum)
        weight.append(w)

    for i in range(maxIterNum):
        fx[i] = {}
        newWeight = trainfxi(fx, i, x, y, weight)
        weight = newWeight
        fxError = calcFxError(fx, (i+1), x, y)
        if fxError < errorThreshold: break

    return fx

if __name__ == '__main__':
    x, y = loadDataSet()
    errorThreshold = 0.01
    maxIterNum = 10
    fx = trainAdaBoost(x, y, errorThreshold, maxIterNum)
    print(fx)

{0: {'gx': 1.5, 'alpha': 0.8047189562170503, 'ygx': 1}, 1: {'gx': 4.5, 'alpha': 0.6931471805599453, 'ygx': 1}, 2: {'gx': 3.5, 'alpha': 0.7331685343967135, 'ygx': -1}}
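
The printed dictionary is the full model: each entry is a stump with a split point `gx`, an orientation `ygx`, and a vote weight `alpha`. The sketch below shows one way such a model might be used for prediction, taking the sign of the weighted vote. The `predict` function is a hypothetical helper that is not defined in the notebook, and the model values are copied from the output above.

```python
import numpy as np

# Model copied from the printed training output
fx = {0: {'gx': 1.5, 'alpha': 0.8047189562170503, 'ygx': 1},
      1: {'gx': 4.5, 'alpha': 0.6931471805599453, 'ygx': 1},
      2: {'gx': 3.5, 'alpha': 0.7331685343967135, 'ygx': -1}}

def predict(fx, x):
    """Hypothetical helper: sign of the alpha-weighted vote of all stumps."""
    x = np.asarray(x, dtype=float)
    score = np.zeros_like(x)
    for stump in fx.values():
        # Each stump votes ygx below its split point and -ygx otherwise
        vote = np.where(x < stump['gx'], stump['ygx'], -stump['ygx'])
        score += stump['alpha'] * vote
    return np.sign(score).astype(int)

print(predict(fx, [0, 1, 2, 3, 4, 5]))
```

On the six training samples the combined vote reproduces the labels `[1, 1, -1, -1, 1, -1]`, which is why training stopped after three rounds.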
