## Boosting Algorithms

### AdaBoost

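Building a single node (a decision stump):
1. Iterate over every feature.
2. For each feature, scan candidate split points with a fixed stride (or generate your own candidates, e.g. the midpoints between adjacent values) and compute the weighted error $e_m=\sum_{i=1}^{N}w_{mi}I(G_m(x_i)\neq y_i)$ for each candidate.
3. Keep the feature and split point that minimize $e_m$.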
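
Step 2 mentions midpoints between adjacent values as an alternative to a fixed stride. A minimal sketch of that idea is shown here. `midpoint_candidates` is a hypothetical helper that is not part of this notebook, whose own code uses the fixed-stride scan. It accepts any iterable of numbers, so a NumPy column such as `dataset[:, i]` should work as well.

```python
def midpoint_candidates(values):
    """Return midpoints between adjacent distinct values of one feature column."""
    # Deduplicate and sort so that each midpoint separates two distinct values.
    distinct = sorted(set(float(v) for v in values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]


# Example: a column that contains a repeated value.
print(midpoint_candidates([3, 1, 2, 2, 5]))  # [1.5, 2.5, 4.0]
```

Compared with a fixed stride, this never places two candidates between the same pair of neighbouring values, at the cost of sorting each column.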

In [2]:
%matplotlib inline

import numpy as np
from matplotlib import pyplot as plt

In [3]:
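'''
label every sample with a stump at a given split point
parameter:
    dataset: data matrix, shape (n_samples, n_features)
    index: index of the feature to split on
    value: split point value
    ope: 'lt' labels samples with feature < value as -1;
         any other value labels samples with feature >= value as -1
return:
    ret_label: array of +1/-1 labels, one per sample
'''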
def subset_label(dataset, index, value, ope):
    ret_label = np.ones(dataset.shape[0])
    
    if ope == 'lt':
        ret_label[dataset[:, index] < value] = -1
    else:
        ret_label[dataset[:, index] >= value] = -1
    
    return ret_label

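'''
stump generation: scan every feature, operator and split point,
and keep the combination with the smallest weighted error
parameter:
    dataset: data
    labels: labels
    weight: weight D
return:
    best_em: smallest weighted error found
    best_index: index of the feature that achieves it
    best_split_point: [split value, operator]
'''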
def generate_stump(dataset, labels, weight):
    n, m = dataset.shape
    
    best_em = np.inf
    best_index = 0
    best_split_point = [0, 'lt']
    max_num = 10  # number of strides across each feature's range
    for i in range(m):
        min_value = dataset[:, i].min()
        max_value = dataset[:, i].max()
        stride =  (max_value - min_value) / max_num
        for ope in ['lt', 'gt']:
            for j in range(-1, max_num+1):
                split_point = min_value + stride*j
                sub_labels = subset_label(dataset, i, split_point, ope)
                em = np.sum(weight[sub_labels != labels])
                # print("split: {} {} Em {}".format(ope, split_point, em))
                
                if em < best_em:
                    best_em = em
                    best_index = i
                    best_split_point = [split_point, ope]
    return best_em, best_index, best_split_point

## test function

1. Load a simple dataset.
2. Check whether the best split point can be found.

In [4]:
'''
load_dataset, generate the dataset and labels
'''
def load_dataset():
    data = np.array([0,1,2,3,4,5,6,7,8,9]).reshape(-1, 1)
    labels = np.array([1,1,1,-1,-1,-1,1,1,1,-1])
    
    return data, labels


In [5]:
dataset, labels = load_dataset()
D0 = np.ones_like(labels)/len(labels)

best_em, best_index, best_split = generate_stump(dataset, labels, D0)
print(best_em, best_index, best_split)

0.30000000000000004 0 [2.7, 'gt']


## complete process

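1. Train sub-classifiers in a loop to obtain a sequence of weak classifiers.
2. Compute each sub-classifier's weight from its weighted error $e_m$.
3. Use the classification results to update the sample weights for the next sub-classifier.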
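
As a worked example of steps 2 and 3, the sketch below runs one boosting round by hand on the toy data from the test cell, using the stump found there (predict -1 where $x \geq 2.7$). It assumes the standard AdaBoost formulas that the notebook's code also uses: $\alpha_m=\frac{1}{2}\ln\frac{1-e_m}{e_m}$ and $w_{m+1,i}=\frac{w_{mi}}{Z_m}\exp(-\alpha_m y_i G_m(x_i))$, where $Z_m$ normalizes the weights to sum to 1. It is written with plain lists for illustration and is not the notebook's implementation.

```python
import math

# Toy data from the test cell: x = 0..9 with these labels.
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
# Round-1 stump: predict -1 where x >= 2.7, otherwise +1.
preds = [-1 if x >= 2.7 else 1 for x in range(10)]
weights = [0.1] * 10

# Weighted error e_m: total weight of the misclassified samples.
em = sum(w for w, y, g in zip(weights, labels, preds) if y != g)
alpha = 0.5 * math.log((1 - em) / em)

# Up-weight mistakes, down-weight correct samples, then normalize.
unnorm = [w * math.exp(-alpha * y * g) for w, y, g in zip(weights, labels, preds)]
z = sum(unnorm)
new_weights = [u / z for u in unnorm]

print(round(em, 4), round(alpha, 4))
print([round(w, 4) for w in new_weights])
```

The three misclassified samples (x = 6, 7, 8) end up with weight 1/6 each and the seven correct ones with 1/14 each, so the mistakes now carry half of the total weight.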

In [6]:
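'''
update weight
parameter:
    dataset: dataset
    labels: labels
    weightk: sample weights of the current round
    alphak: weight of the current sub-classifier
    Gx: the current stump as [feature index, split value, operator]
return:
    weight: normalized sample weights for the next round
'''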
def update_weight(dataset, labels, weightk, alphak, Gx):
    idx, value, ope = Gx
    
    res_labels = np.ones_like(labels)
    if ope == 'lt':
        res_labels[dataset[:, idx] < value] = -1
    else:
        res_labels[dataset[:, idx] >= value] = -1
    '''
    update weight
    '''
    weight = weightk*np.exp(-alphak*labels*res_labels)
    weight = weight/weight.sum()
    
    return weight


In [11]:
'''
pred_result
parameter:
    dataset: dataset
    idx: feature index
    value: model parameter
    ope: operator
'''
def pred_result(dataset, idx, value, ope):
    N = dataset.shape[0]
    res = np.ones(N)
    
    if ope == 'lt':
        res[dataset[:, idx] < value] = -1
    else:
        res[dataset[:, idx] >= value] = -1
        
    return res

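'''
pred: count the samples misclassified by the weighted vote of all stumps
parameter:
    dataset: dataset
    labels: true labels
    model: list of [idx, value, ope, alpha], one entry per stump
return:
    number of misclassified samples
'''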
def pred(dataset, labels, model):
    add = np.zeros_like(labels)
    for idx, value, ope, alpha in model:
        res = pred_result(dataset, idx, value, ope)
        add = add + alpha*res
    
    add[add >= 0] = 1
    add[add < 0] = -1
    add = add.astype(labels.dtype)
    return np.sum(add != labels)
    
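'''
train adaboost
parameter:
    dataset: dataset
    labels: label
    M: maximum number of boosting rounds
    toler: stop early once the training error count drops below this value
return:
    models: list of [idx, value, ope, alpha], one entry per round
'''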
def train(dataset, labels, M, toler):
    n, m = dataset.shape
    last_weight = np.ones_like(labels)/n
    
    models = []
    for i in range(M):
        ''' generate stump '''
        Em, idx, split = generate_stump(dataset, labels, last_weight)
        value, ope = split
        ''' update alpha; clamp Em away from 0 so a perfect stump does not divide by zero '''
        alpha = np.log((1-Em)/max(Em, 1e-16))/2.0
        ''' update weight '''
        last_weight = update_weight(dataset, labels, last_weight, alpha, [idx, value, ope])
        models.append([idx, value, ope, alpha])
        error_cnt = pred(dataset, labels, models)
        print(">>> alpha {} Em {} ErrorRate {}/{}={}".format(alpha, Em, error_cnt, n, error_cnt/n))
        if error_cnt < toler:
            break
    
    return models

In [12]:
models = train(dataset, labels, 4, 1)
for m in models:
    print(m)

>>> alpha 0.4236489301936017 Em 0.30000000000000004 ErrorRate 3/10=0.3
>>> alpha 0.6496414920651304 Em 0.21428571428571427 ErrorRate 3/10=0.3
>>> alpha 0.752038698388137 Em 0.18181818181818185 ErrorRate 0/10=0.0
[0, 2.7, 'gt', 0.4236489301936017]
[0, 8.1, 'gt', 0.6496414920651304]
[0, 5.4, 'lt', 0.752038698388137]
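
`pred` only reports how many samples are misclassified. To see the labels that the three stumps above actually vote for, a small sketch follows. `predict_labels` is a hypothetical helper that is not defined in this notebook. It applies the same stump rule as `pred_result` and takes the sign of the alpha-weighted sum, using plain lists for illustration.

```python
def predict_labels(rows, model):
    """Return the ensemble's +1/-1 prediction for each row (hypothetical helper)."""
    out = []
    for row in rows:
        score = 0.0
        for idx, value, ope, alpha in model:
            # Same rule as pred_result: 'lt' marks x < value as -1,
            # any other operator marks x >= value as -1.
            if ope == 'lt':
                vote = -1 if row[idx] < value else 1
            else:
                vote = -1 if row[idx] >= value else 1
            score += alpha * vote
        # Sign of the weighted vote; ties go to +1, as in pred.
        out.append(1 if score >= 0 else -1)
    return out


# The three stumps printed by the training run above.
model = [
    [0, 2.7, 'gt', 0.4236489301936017],
    [0, 8.1, 'gt', 0.6496414920651304],
    [0, 5.4, 'lt', 0.752038698388137],
]
rows = [[x] for x in range(10)]
print(predict_labels(rows, model))  # [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
```

The printed labels match the toy labels exactly, which agrees with the 0/10 error rate reported after the third round. No single stump classifies this data correctly; only the weighted combination does.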


## Validating on a larger dataset

1. load dataset

In [13]:
'''
load dataset
parameter:
    train_path: train_dataset file path
    test_path: test dataset file path
'''
def load_dataset(train_path, test_path):
    
    def load(file_path):
        data = []
        with open(file_path, 'r') as fp:
            for line in fp.readlines():
                line = line.strip()
                elem = [float(x) for x in line.split('\t')]
                data.append(elem)
        return np.array(data)
    
    train_data = load(train_path)
    test_data = load(test_path)
    
    return train_data[:, :-1], train_data[:, -1].astype(np.int64), test_data[:, :-1], test_data[:, -1].astype(np.int64)

In [14]:
train_data, train_label, test_data, test_label = load_dataset('./horseColicTraining2.txt', './horseColicTest2.txt')

In [18]:
model = train(train_data, train_label, 100, 0.01)

for m in model:
    print(m)

>>> alpha 0.4616623792657674 Em 0.28428093645484953 ErrorRate 85/299=0.2842809364548495
>>> alpha 0.3124824504246708 Em 0.3486531061022541 ErrorRate 85/299=0.2842809364548495
>>> alpha 0.28680973201695786 Em 0.3604020725787441 ErrorRate 74/299=0.24749163879598662
>>> alpha 0.23297004638939492 Em 0.38557761823256775 ErrorRate 74/299=0.24749163879598662
>>> alpha 0.19803846151213766 Em 0.40225526514026255 ErrorRate 76/299=0.25418060200668896
>>> alpha 0.18847887349020642 Em 0.4068608605454535 ErrorRate 72/299=0.2408026755852843
>>> alpha 0.15227368997476795 Em 0.42444621646246306 ErrorRate 72/299=0.2408026755852843
>>> alpha 0.15510870821690512 Em 0.4230616708699977 ErrorRate 66/299=0.22073578595317725
>>> alpha 0.1353619735335938 Em 0.4327293757248841 ErrorRate 74/299=0.24749163879598662
>>> alpha 0.12521587326132078 Em 0.437717234434148 ErrorRate 69/299=0.23076923076923078
>>> alpha 0.1334764812820768 Em 0.4336552906609674 ErrorRate 72/299=0.2408026755852843
>>> alpha 0.141822432537710

## test classifier

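Use the held-out test data to evaluate the classifier. Note that `pred` returns the number of misclassified samples, so the ratio printed here is an error rate.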

In [19]:
N = test_data.shape[0]
ErrorCnt = pred(test_data, test_label, model)

print("error rate {}/{}={}".format(ErrorCnt, N, ErrorCnt/N))

error rate 13/67=0.19402985074626866
