<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#AdaBoost-类" data-toc-modified-id="AdaBoost-类-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The AdaBoost class</a></span><ul class="toc-item"><li><span><a href="#基础分类器" data-toc-modified-id="基础分类器-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Base classifier</a></span></li><li><span><a href="#验证" data-toc-modified-id="验证-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Verification</a></span></li><li><span><a href="#使用鸢尾花数据集进行测试" data-toc-modified-id="使用鸢尾花数据集进行测试-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Testing on the Iris dataset</a></span></li><li><span><a href="#sklearn.AdaBoostClassifier" data-toc-modified-id="sklearn.AdaBoostClassifier-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>sklearn.AdaBoostClassifier</a></span></li></ul></li><li><span><a href="#提升树模型" data-toc-modified-id="提升树模型-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Boosting tree model</a></span><ul class="toc-item"><li><span><a href="#使用-sklearn.ensemble.AdaBoostRegressor" data-toc-modified-id="使用-sklearn.ensemble.AdaBoostRegressor-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Using sklearn.ensemble.AdaBoostRegressor</a></span></li><li><span><a href="#使用-sklearn.ensemble.GradientBoostingRegressor" data-toc-modified-id="使用-sklearn.ensemble.GradientBoostingRegressor-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Using sklearn.ensemble.GradientBoostingRegressor</a></span></li></ul></li></ul></div>

![title](boosting.gif)

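Today's topic is AdaBoost (adaptive boosting), a method from **ensemble learning**. Let us first establish what adaptive boosting means:
AdaBoost builds several **weak classifiers** and combines them, as a **weighted linear combination**, into a **strong classifier**. A weak classifier is one that performs only slightly better than random guessing. The weak classifiers are trained iteratively: each new one pays more attention to the samples that the previous one misclassified, which compensates for the shortcomings of its predecessors. AdaBoost works in the spirit of the Chinese proverb "three cobblers together can match Zhuge Liang": many mediocre learners add up to a good one.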
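
The iterative procedure described above can be written compactly. The notation is assumed here for illustration and is not taken from the notebook: $G_m$ is the $m$-th weak classifier, $w_{m,i}$ the weight of sample $i$ in round $m$, $Z_m$ the normalizing constant, and the labels satisfy $y_i \in \{-1, +1\}$. This is the standard formulation that the `AdaBoost` class below appears to follow:

$$
\begin{aligned}
w_{1,i} &= \frac{1}{N} \\
e_m &= \sum_{i=1}^{N} w_{m,i}\, I\big(G_m(x_i) \ne y_i\big) \\
\alpha_m &= \frac{1}{2} \ln \frac{1 - e_m}{e_m} \\
w_{m+1,i} &= \frac{w_{m,i}}{Z_m} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big), \qquad Z_m = \sum_{i=1}^{N} w_{m,i} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big) \\
G(x) &= \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m\, G_m(x)\Big)
\end{aligned}
$$

Since $y_i G_m(x_i)$ is $+1$ for a correct prediction and $-1$ for a wrong one, the update shrinks the weights of correctly classified samples and grows the weights of misclassified ones whenever $e_m < 0.5$.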


In [1]:
import numpy as np
from collections import Counter
import copy

## The AdaBoost class

In [2]:
class AdaBoost:
    def __init__(self, base):
        self.base = base # weak classifier, used as a template for every round
    
    def fit(self, X, Y, max_step):
        N = len(X)
        self.W = np.ones(N) / N # initialize the sample weights uniformly
        self.alpha = [] # weights of the weak classifiers
        self.weaker = [] # trained weak classifiers
        for step in range(max_step): # train the weak classifiers one round at a time
            weaker = copy.deepcopy(self.base) # create a fresh weak classifier
            weaker.fit(X, Y, sample_weight = self.W) # train it on the weighted samples
            results = weaker.predict(X)
            # report the (unweighted) accuracy
            scores = (results == Y)
            print('Weaker {:} accuracy = {:3.2f}%'.format(step, Counter(scores)[True]/len(Y) * 100))
            # classification error rate of the current weak classifier, weighted by the sample weights
            error = np.dot(self.W, [0 if score else 1 for score in scores])
            alpha = 0.5 * np.log((1 - error)/error) # weight of the current weak classifier (assumes 0 < error < 1)
            self.alpha.append(alpha)
            self.W = self.W * np.exp(alpha * np.array([-1 if score else 1 for score in scores])) # update the sample weights: -y*G(x) is -1 for a correct prediction, 1 otherwise
            self.W = self.W / sum(self.W) # normalize so the weights sum to 1
            self.weaker.append(weaker)
        # evaluate the final (combined) classifier on the training data
        f = np.dot(np.c_[[weaker.predict(X) for weaker in self.weaker]].T, np.array(self.alpha).reshape(-1, 1))
        results = [1 if y >= 0 else -1 for y in f]
        scores = (results == Y)
        print('\nAdaBoost accuracy = {:3.2f}%, training completed!'.format(Counter(scores)[True]/len(Y) * 100))
    
    def predict(self, X):
        # sign of the alpha-weighted sum of the weak classifiers' predictions
        f = np.dot(np.c_[[weaker.predict(X) for weaker in self.weaker]].T, np.array(self.alpha).reshape(-1, 1))
        return [1 if y >= 0 else -1 for y in f]

In [3]:
import operator

### Base classifier

In [4]:
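class Simple:
    # Decision stump for one-dimensional inputs: predicts 1 if compare(x, threshold) holds, else -1
    def fit(self, X, Y, sample_weight):
        minloss = float('inf')
        for compare in [operator.le, operator.ge]: # try both directions: x <= threshold and x >= threshold
            for threshold in X: # candidate thresholds are the sample values themselves
                results = compare(np.array(X), threshold)
                results = [1 if result else -1 for result in results]
                results = results == Y # True where the stump agrees with the label
                results = [0 if result else 1 for result in results] # 1 marks a misclassified sample
                loss = np.dot(results, sample_weight) # weighted error of this candidate
                if minloss > loss: # keep the first candidate with the lowest weighted error
                    minloss = loss
                    self.threshold = threshold
                    self.compare = compare
    
    def predict(self, X):
        return [1 if self.compare(x, self.threshold) else -1 for x in X]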

### Verification

In [5]:
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

In [6]:
ada = AdaBoost(Simple())
ada.fit(np.array(X), np.array(Y), 3)
print(ada.alpha)

Weaker 0 accuracy = 70.00%
Weaker 1 accuracy = 70.00%
Weaker 2 accuracy = 60.00%

AdaBoost accuracy = 100.00%, training completed!
[0.4236489301936017, 0.6496414920651304, 0.752038698388137]
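
To see where the first value of `alpha` comes from, the first boosting round can be reproduced by hand. This is a minimal sketch that assumes the first stump is "predict 1 if x <= 2, else -1", which misclassifies three of the ten samples; the names `pred`, `miss` and `W_new` are introduced here for illustration only.

```python
import numpy as np

X = np.arange(10)
Y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

W = np.ones(len(X)) / len(X)       # uniform initial sample weights
pred = np.where(X <= 2, 1, -1)     # assumed first stump: x <= 2 -> 1, else -1
miss = pred != Y                   # True for misclassified samples

error = np.dot(W, miss)            # weighted error rate of the stump
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified samples are scaled by exp(alpha), the others by exp(-alpha)
W_new = W * np.exp(np.where(miss, alpha, -alpha))
W_new /= W_new.sum()               # normalize so the weights sum to 1

print(error, alpha)
print(W_new)
```

With a weighted error of 0.3, `alpha` comes out close to the first value printed above, and after normalization the three misclassified samples carry weight 1/6 each while the seven correct ones carry 1/14 each, so both groups hold half of the total weight.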


In [7]:
from sklearn.tree import DecisionTreeClassifier
ada = AdaBoost(DecisionTreeClassifier(max_depth = 1))
ada.fit(np.array(X).reshape(-1, 1), np.array(Y), 3)
print(ada.alpha)

Weaker 0 accuracy = 70.00%
Weaker 1 accuracy = 70.00%
Weaker 2 accuracy = 60.00%

AdaBoost accuracy = 100.00%, training completed!
[0.4236489301936017, 0.6496414920651304, 0.752038698388137]


### Testing on the Iris dataset

In [8]:
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
iris_npz = np.load('iris.npz')
data = iris_npz['data']
X = iris_npz['X']
Y = iris_npz['Y']
# convert the labels to the {1, -1} form that AdaBoost expects
Y[:50] = 1
Y[50:] = -1

In [10]:
XTRAIN, XTEST, YTRAIN, YTEST = train_test_split(X, Y, test_size = 0.25)

In [11]:
import copy

ada = AdaBoost(DecisionTreeClassifier(max_depth = 1)) # use a decision stump as the weak classifier
ada.fit(XTRAIN, YTRAIN, 10)

Weaker 0 accuracy = 90.67%
Weaker 1 accuracy = 78.67%
Weaker 2 accuracy = 84.00%
Weaker 3 accuracy = 66.67%
Weaker 4 accuracy = 72.00%
Weaker 5 accuracy = 86.67%
Weaker 6 accuracy = 69.33%
Weaker 7 accuracy = 52.00%
Weaker 8 accuracy = 78.67%
Weaker 9 accuracy = 86.67%

AdaBoost accuracy = 100.00%, training completed!


In [12]:
results = ada.predict(XTEST)
scores = (results == YTEST)
print('Accuracy = {:3.2f}%'.format(Counter(scores)[True]/len(YTEST) * 100))

Accuracy = 96.00%


### sklearn.AdaBoostClassifier

In [13]:
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators = 10)
clf.fit(XTRAIN, YTRAIN)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=10, random_state=None)

In [14]:
clf.score(XTEST, YTEST)

1.0
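
scikit-learn's `AdaBoostClassifier` can also report how the accuracy evolves as weak classifiers are added, through its `staged_score` method. The sketch below is illustrative only: it assumes scikit-learn's bundled Iris data as a stand-in for `iris.npz`, and it picks the versicolor and virginica classes because a single stump cannot separate them perfectly. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data replaces iris.npz; keep two classes only
iris = load_iris()
X_demo = iris.data[50:]                           # versicolor and virginica
y_demo = np.where(iris.target[50:] == 1, 1, -1)   # relabel to {1, -1}

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

clf_demo = AdaBoostClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round
staged = list(clf_demo.staged_score(X_te, y_te))
for m, acc in enumerate(staged, start=1):
    print('After {} weak classifier(s): accuracy = {:.2f}'.format(m, acc))
```

Boosting may stop before `n_estimators` rounds if a weak classifier fits the weighted data perfectly, so the list can be shorter than 10.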

## Boosting tree model

In [15]:
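class BoostTree:
    def __init__(self, base):
        self.base = base # weak regressor, used as a template for every round
    
    def fit(self, X, Y, max_step):
        self.weaker = [] # trained weak regressors
        R = copy.deepcopy(Y) # residuals, initially the targets themselves
        for step in range(max_step): # train the weak regressors one round at a time
            weaker = copy.deepcopy(self.base) # create a fresh weak regressor
            weaker.fit(X, R) # fit it to the current residuals
            results = weaker.predict(X)
            R = R - results # update the residuals
            self.weaker.append(weaker)
            # squared loss of the current ensemble on the training data
            f = np.sum(np.c_[[weaker.predict(X) for weaker in self.weaker]].T, axis = 1)
            print('Step {}, square loss {}'.format(step, np.linalg.norm(f - Y)**2))
        print('\nBoosting tree training completed!')
    
    def predict(self, X):
        # the ensemble prediction is the sum of all weak regressors' predictions
        return np.sum(np.c_[[weaker.predict(X) for weaker in self.weaker]].T, axis = 1)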

In [16]:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]
XTEST = [1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9, 9.5, 10.8]

In [17]:
from sklearn.tree import DecisionTreeRegressor
bt = BoostTree(DecisionTreeRegressor(max_depth = 1)) # use a regression stump as the weak regressor
bt.fit(np.array(X).reshape(-1, 1), Y, 6)

Step 0, square loss 1.9300083333333338
Step 1, square loss 0.800675
Step 2, square loss 0.4780083333333336
Step 3, square loss 0.3055592592592599
Step 4, square loss 0.22891522633744946
Step 5, square loss 0.1721780649862837

Boosting tree training completed!


In [18]:
bt.predict(np.array(XTEST).reshape(-1, 1))

array([5.63      , 5.63      , 5.81831019, 6.55164352, 6.81969907,
       8.95016204, 8.95016204, 8.95016204, 8.95016204, 8.95016204])
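
The residual-fitting idea behind `BoostTree` can be traced by hand for the first two rounds. This is a minimal sketch on the same ten training points; `residual`, `prediction` and `losses` are names introduced here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(1, 11).reshape(-1, 1)
Y = np.array([5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05])

residual = Y.copy()               # round 0 fits the targets themselves
prediction = np.zeros_like(Y)     # running sum of the stumps' outputs
losses = []
for step in range(2):
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    prediction += stump.predict(X)    # add the new stump to the ensemble
    residual = Y - prediction         # what is still unexplained
    losses.append(np.sum(residual ** 2))
    print('Step {}, square loss {}'.format(step, losses[-1]))
```

Each stump only has to explain what the previous ones left over, which is why the squared loss shrinks from one step to the next in the training log above.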

### Using sklearn.ensemble.AdaBoostRegressor

In [19]:
from sklearn.ensemble import AdaBoostRegressor
rsr = AdaBoostRegressor(loss = 'square', n_estimators = 6)
rsr.fit(np.array(X).reshape(-1, 1), Y)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='square',
                  n_estimators=6, random_state=None)

In [20]:
print('Final square loss {}'.format(np.linalg.norm(rsr.predict(np.array(X).reshape(-1, 1)) - Y)**2))

Final square loss 0.05230000000000058


In [21]:
rsr.predict(np.array(XTEST).reshape(-1, 1))

array([5.63, 5.7 , 5.91, 6.4 , 7.05, 8.9 , 8.9 , 9.  , 9.  , 9.  ])

### Using sklearn.ensemble.GradientBoostingRegressor

In [22]:
from sklearn.ensemble import GradientBoostingRegressor
rsr = GradientBoostingRegressor(loss = 'ls', n_estimators = 6) # note: 'ls' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
rsr.fit(np.array(X).reshape(-1, 1), Y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=6,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [23]:
print('Final square loss {}'.format(np.linalg.norm(rsr.predict(np.array(X).reshape(-1, 1)) - Y)**2))

Final square loss 5.424059630005812


In [24]:
rsr.predict(np.array(XTEST).reshape(-1, 1))

array([6.50762479, 6.54599735, 6.64125405, 6.89189235, 7.14105909,
       8.05341449, 7.95970269, 8.10027039, 8.10027039, 8.12369834])
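
The training loss here is much larger than for `BoostTree`. A likely reason is the default `learning_rate=0.1` shown in the repr above, which shrinks every tree's contribution so that six rounds cannot remove much of the residual. The sketch below compares two learning rates under that assumption; `train_loss` is a hypothetical helper, and the `loss` argument is left at its default (squared error) so the code does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X = np.arange(1, 11).reshape(-1, 1)
Y = np.array([5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05])

def train_loss(learning_rate):
    """Squared training loss after 6 boosting rounds (hypothetical helper)."""
    model = GradientBoostingRegressor(
        n_estimators=6, learning_rate=learning_rate, random_state=0)
    model.fit(X, Y)
    return np.sum((model.predict(X) - Y) ** 2)

loss_shrunk = train_loss(0.1)   # default shrinkage
loss_full = train_loss(1.0)     # every tree applied at full strength
print('learning_rate=0.1: {}'.format(loss_shrunk))
print('learning_rate=1.0: {}'.format(loss_full))
```

A small learning rate usually needs more estimators to reach the same training loss, in exchange for smoother and often better-generalizing fits.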