# Boosting
## AdaBoost

References:
[AdaBoost Explained in Detail: Even a Fool Can Understand It](https://blog.csdn.net/qq_38890412/article/details/120360354)
[A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://github.com/apachecn/ml-mastery-zh/blob/7a3e2c2c60acdd90a85413a5af5d1241e09a6294/docs/xgboost/gentle-introduction-gradient-boosting-algorithm-machine-learning.md)

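## What is an additive model?

An additive model expresses the final predictor $H$ as a weighted sum of $T$ base learners $h_t$ with weights $\alpha_t$: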

$$
H(\boldsymbol{x})=\sum_{t=1}^T \alpha_t h_t(\boldsymbol{x})
$$
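
As a minimal sketch of the formula above, the snippet below evaluates $H(\boldsymbol{x})$ for three hand-written weak learners. The threshold rules, the weights, and the helper name `additive_predict` are illustrative assumptions, not part of the original notes; the learners are assumed to output labels in $\{-1,+1\}$.

```python
def additive_predict(x, learners, alphas):
    """Return H(x) = sum_t alpha_t * h_t(x) for weak learners with outputs in {-1, +1}."""
    return sum(alpha * h(x) for alpha, h in zip(alphas, learners))


# Three hypothetical threshold rules on a scalar input (illustrative only).
learners = [
    lambda x: 1 if x > 0.0 else -1,
    lambda x: 1 if x > 1.0 else -1,
    lambda x: 1 if x < 2.0 else -1,
]
alphas = [0.5, 0.3, 0.2]

score = additive_predict(1.5, learners, alphas)  # all three rules vote +1 at x = 1.5
label = 1 if score >= 0 else -1                  # the sign of H(x) is the predicted class
```

For binary classification the sign of the score is usually taken as the prediction, and its magnitude can be read as a confidence.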

## AdaBoost
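At its core, machine learning makes specific assumptions about the data, builds a model on them, and then treats fitting that model as a numerical optimization problem.
AdaBoost-type models follow the same pattern: weak learners are added one at a time, in a manner that can be interpreted as gradient descent, to minimize the model's loss.

The idea behind AdaBoost originates from the following question: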
Hypothesis Boosting Problem: an efficient algorithm for converting relatively poor hypotheses into very good hypotheses.
![](images/2021091809193281.png)

## Assume an additive model
$$
H(\boldsymbol{x})=\sum_{t=1}^T \alpha_t h_t(\boldsymbol{x})
$$

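The model is fitted by minimizing the exponential loss, where $f(\boldsymbol{x}) \in\{-1,+1\}$ denotes the ground-truth label:
$$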
\ell_{\exp }(h \mid \mathcal{D})=\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) h(\boldsymbol{x})}\right]
$$
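
The following sketch computes an empirical version of the exponential loss from the margins $f(\boldsymbol{x}_i) H(\boldsymbol{x}_i)$ and compares it with the 0-1 loss. The margin values and function names are made up for illustration. Because $e^{-m} \geq 1$ whenever $m \leq 0$, the exponential loss upper-bounds the misclassification rate, which is one reason it is a convenient surrogate.

```python
import math


def exp_loss(margins):
    """Empirical exponential loss: mean of exp(-f(x) * H(x)) over the samples."""
    return sum(math.exp(-m) for m in margins) / len(margins)


def zero_one_loss(margins):
    """Fraction of samples with a non-positive margin, i.e. misclassified samples."""
    return sum(1 for m in margins if m <= 0) / len(margins)


# Hypothetical margins f(x_i) * H(x_i); negative values are misclassified samples.
margins = [2.0, 0.5, -0.3, 1.2, -1.0]

surrogate = exp_loss(margins)
error_rate = zero_one_loss(margins)
```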



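Next, consider the optimization problem solved in each iteration. Optimizing with respect to $h_{t}(\boldsymbol{x})$ and $\alpha_{t}$ in turn yields two corollaries:
- Corollary 1: expanding $e^{-f(\boldsymbol{x}) h_t(\boldsymbol{x})}$ to second order (Taylor expansion) gives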
$$
\begin{aligned}
\ell_{\exp }\left(H_{t-1}+h_t \mid \mathcal{D}\right) & \approx \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\left(1-f(\boldsymbol{x}) h_t(\boldsymbol{x})+\frac{f(\boldsymbol{x})^2 h_t(\boldsymbol{x})^2}{2}\right)\right] \\
&=\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\left(1-f(\boldsymbol{x}) h_t(\boldsymbol{x})+\frac{1}{2}\right)\right],
\end{aligned}
$$
by noticing that $f(\boldsymbol{x})^2=1$ and $h_t(\boldsymbol{x})^2=1$.
Thus, the ideal classifier $h_t$ is
$$
\begin{aligned}
h_t(\boldsymbol{x}) &=\underset{h}{\arg \min } \ell_{\exp }\left(H_{t-1}+h \mid \mathcal{D}\right) \\
&=\underset{h}{\arg \min } \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\left(1-f(\boldsymbol{x}) h(\boldsymbol{x})+\frac{1}{2}\right)\right] \\
&=\underset{h}{\arg \max } \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} f(\boldsymbol{x}) h(\boldsymbol{x})\right] \\
&=\underset{h}{\arg \max } \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[\frac{e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\right]} f(\boldsymbol{x}) h(\boldsymbol{x})\right]
\end{aligned}
$$
by noticing that $\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\right]$ is a constant.
Denote a distribution $\mathcal{D}_t$ as
$$
\mathcal{D}_t(\boldsymbol{x})=\frac{\mathcal{D}(\boldsymbol{x}) e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\right]} .
$$
Then, by the definition of mathematical expectation, it is equivalent to write that
$$
\begin{aligned}
h_t(\boldsymbol{x}) &=\underset{h}{\arg \max } \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[\frac{e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\right]} f(\boldsymbol{x}) h(\boldsymbol{x})\right] \\
&=\underset{h}{\arg \max } \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}_t}[f(\boldsymbol{x}) h(\boldsymbol{x})] .
\end{aligned}
$$
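
Since $f(\boldsymbol{x}) h(\boldsymbol{x})$ is $+1$ on correct predictions and $-1$ on mistakes, $\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}_t}[f(\boldsymbol{x}) h(\boldsymbol{x})]=1-2 \epsilon$, where $\epsilon$ is the weighted error of $h$ under $\mathcal{D}_t$. Maximizing the expectation is therefore the same as minimizing the weighted error. The sketch below checks this on a tiny hand-made sample; the data, weights, and candidate rules are assumptions chosen for illustration.

```python
def weighted_edge(h, xs, ys, weights):
    """E_{x ~ D_t}[f(x) h(x)] for a discrete distribution given by `weights`."""
    return sum(w * y * h(x) for x, y, w in zip(xs, ys, weights))


def weighted_error(h, xs, ys, weights):
    """Total weight of the samples that h misclassifies."""
    return sum(w for x, y, w in zip(xs, ys, weights) if h(x) != y)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, 1]
weights = [0.1, 0.2, 0.5, 0.2]  # a hypothetical D_t; the weights sum to 1

# Hypothetical pool of weak learners to choose from.
candidates = {
    "x < 1.5": lambda x: 1 if x < 1.5 else -1,
    "x > 0.5": lambda x: 1 if x > 0.5 else -1,
    "always +1": lambda x: 1,
}

best_by_edge = max(candidates, key=lambda k: weighted_edge(candidates[k], xs, ys, weights))
best_by_error = min(candidates, key=lambda k: weighted_error(candidates[k], xs, ys, weights))
```

Both selection rules pick the same candidate, which is why practical implementations simply train the weak learner on the reweighted data.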


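Once $h_t$ and its weight $\alpha_t$ are fixed, $H_t=H_{t-1}+\alpha_t h_t$, and the distribution for the next round becomes
$$
\begin{aligned}
\mathcal{D}_{t+1}(\boldsymbol{x}) &=\frac{\mathcal{D}(\boldsymbol{x}) e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})}\right]} \\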
&=\frac{\mathcal{D}(\boldsymbol{x}) e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} e^{-f(\boldsymbol{x}) \alpha_t h_t(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})}\right]} \\
&=\mathcal{D}_t(\boldsymbol{x}) \cdot e^{-f(\boldsymbol{x}) \alpha_t h_t(\boldsymbol{x})} \frac{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\right]}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})}\right]},
\end{aligned}
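$$
where the final ratio of expectations does not depend on $\boldsymbol{x}$; it only re-normalizes $\mathcal{D}_{t+1}$ so that it remains a distribution.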
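
A minimal sketch of this reweighting step on four samples, assuming labels and predictions in $\{-1,+1\}$. The function name `update_distribution` and the value of `alpha` are illustrative. Each weight is multiplied by $e^{-\alpha_t f(\boldsymbol{x}) h_t(\boldsymbol{x})}$ and the result is divided by its sum, which plays the role of the constant ratio of expectations.

```python
import math


def update_distribution(dist, ys, preds, alpha):
    """One step D_t -> D_{t+1}: scale by exp(-alpha * f(x) * h_t(x)), then normalize."""
    unnormalized = [d * math.exp(-alpha * y * p) for d, y, p in zip(dist, ys, preds)]
    z = sum(unnormalized)  # normalizing constant
    return [u / z for u in unnormalized]


ys = [1, 1, -1, -1]
preds = [1, -1, -1, -1]            # h_t misclassifies only the second sample
d_t = [0.25, 0.25, 0.25, 0.25]     # uniform starting distribution

d_next = update_distribution(d_t, ys, preds, alpha=0.5)
```

The misclassified sample ends up with a larger weight than the correctly classified ones, so the next weak learner is pushed to focus on it.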


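- Corollary 2: with $h_t$ fixed, let $\epsilon_t=P_{\boldsymbol{x} \sim \mathcal{D}_t}\left(h_t(\boldsymbol{x}) \neq f(\boldsymbol{x})\right)$ denote its error under $\mathcal{D}_t$. The exponential loss of $\alpha_t h_t$ is
$$
\begin{aligned}
\ell_{\exp }\left(\alpha_t h_t \mid \mathcal{D}_t\right) &=\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}_t}\left[e^{-f(\boldsymbol{x}) \alpha_t h_t(\boldsymbol{x})}\right] \\
&=e^{-\alpha_t}\left(1-\epsilon_t\right)+e^{\alpha_t} \epsilon_t .
\end{aligned}
$$

Setting its derivative with respect to $\alpha_t$ to zero,
$$
\frac{\partial \ell_{\exp }\left(\alpha_t h_t \mid \mathcal{D}_t\right)}{\partial \alpha_t}=-e^{-\alpha_t}\left(1-\epsilon_t\right)+e^{\alpha_t} \epsilon_t=0,
$$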
then the solution is
$$
\alpha_t=\frac{1}{2} \ln \left(\frac{1-\epsilon_t}{\epsilon_t}\right)
$$
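
As a quick numerical sanity check of this closed form, the sketch below compares it with a brute-force grid search over $\alpha_t$ for one assumed error rate, $\epsilon_t=0.2$. The grid range and step are arbitrary choices.

```python
import math


def alpha_closed_form(eps):
    """alpha_t = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1.0 - eps) / eps)


def loss(alpha, eps):
    """Exponential loss of alpha * h_t under D_t: e^{-alpha}(1 - eps) + e^{alpha} eps."""
    return math.exp(-alpha) * (1.0 - eps) + math.exp(alpha) * eps


eps = 0.2  # hypothetical weighted error of the weak learner
alpha_star = alpha_closed_form(eps)

# Brute-force search on a grid from -3 to 3 with step 0.001.
grid = [i / 1000.0 for i in range(-3000, 3001)]
alpha_grid = min(grid, key=lambda a: loss(a, eps))
```

The grid minimizer agrees with the closed form up to the grid resolution. Substituting the closed form back into the loss gives $2 \sqrt{\epsilon_t\left(1-\epsilon_t\right)}$, which is below 1 whenever $\epsilon_t \neq 0.5$.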


The AdaBoost iteration therefore proceeds as follows. The listing is written for $K$ class labels and rescales only the misclassified samples, so its $\alpha^{(m)}$ is twice the $\alpha_t$ derived above; after re-normalization the resulting weights are the same.
1. Given training data $\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\}$ with $y_i \in \{1, \ldots, K\}$, initialize the observation weights $w_i=1 / n, i=1,2, \ldots, n$.
2. For $m=1$ to $M$:
   - (a) Fit a classifier $T^{(m)}(\boldsymbol{x})$ to the training data using weights $w_i$.
   - (b) Compute $\operatorname{err}^{(m)}=\sum_{i=1}^n w_i I\left(y_i \neq T^{(m)}\left(\boldsymbol{x}_i\right)\right) / \sum_{i=1}^n w_i$.
   - (c) Compute $\alpha^{(m)}=\log \frac{1-\operatorname{err}^{(m)}}{\operatorname{err}^{(m)}}$.
   - (d) Set $w_i \leftarrow w_i \cdot \exp \left(\alpha^{(m)} \cdot I\left(y_i \neq T^{(m)}\left(\boldsymbol{x}_i\right)\right)\right), i=1,2, \ldots, n$.
   - (e) Re-normalize $w_i$.
3. Output
$$
C(\boldsymbol{x})=\arg \max _k \sum_{m=1}^M \alpha^{(m)} \cdot I\left(T^{(m)}(\boldsymbol{x})=k\right) .
$$
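
The steps above can be sketched in plain Python with decision stumps as the weak classifiers. This is only an illustration under stated assumptions: the stump learner `fit_stump`, the toy data, and the early-exit guard for $\operatorname{err}^{(m)}$ equal to 0 or at least 0.5 are additions for this sketch and do not appear in the listing.

```python
import math


def fit_stump(X, y, w):
    """Return the single-feature threshold rule with the lowest weighted error."""
    classes = sorted(set(y))

    def majority(idx):
        # Weighted majority label among samples idx (ties go to the smallest label).
        if not idx:
            return classes[0]
        return max(classes, key=lambda c: sum(w[i] for i in idx if y[i] == c))

    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] <= thr]
            right = [i for i, row in enumerate(X) if row[j] > thr]
            left_label, right_label = majority(left), majority(right)
            err = (sum(w[i] for i in left if y[i] != left_label)
                   + sum(w[i] for i in right if y[i] != right_label))
            if best is None or err < best[0]:
                best = (err, j, thr, left_label, right_label)
    _, j, thr, left_label, right_label = best
    return lambda row: left_label if row[j] <= thr else right_label


def adaboost_fit(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                                             # step 1
    ensemble = []
    for _ in range(n_rounds):                                     # step 2
        h = fit_stump(X, y, w)                                    # (a)
        miss = [h(row) != label for row, label in zip(X, y)]
        err = sum(wi for wi, m in zip(w, miss) if m) / sum(w)     # (b)
        if err <= 0.0 or err >= 0.5:
            break  # guard added for this sketch; not part of the listing
        alpha = math.log((1.0 - err) / err)                       # (c)
        w = [wi * math.exp(alpha * m) for wi, m in zip(w, miss)]  # (d)
        total = sum(w)
        w = [wi / total for wi in w]                              # (e)
        ensemble.append((alpha, h))
    return ensemble


def adaboost_predict(ensemble, row, classes):
    """Step 3: the class with the largest total alpha among the voting classifiers."""
    return max(classes, key=lambda k: sum(a for a, h in ensemble if h(row) == k))


# Toy 1-D data (illustrative): class 2 sits between two groups of class 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, 2, 2, 1, 1]

ensemble = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(ensemble, row, [1, 2]) for row in X]
```

On this toy set no single stump separates the two classes, while the weighted vote of three stumps reproduces every training label.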


In [None]:
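from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

# A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting
# https://www.sciencedirect.com/science/article/pii/S002200009791504X
# https://hastie.su.domains/Papers/samme.pdf
# predict() needs a fitted model and input samples; the iris data is an assumed stand-in dataset.
X, y = load_iris(return_X_y=True)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
clf.predict(X[:5])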

[Greedy Function Approximation: A Gradient Boosting Machine](https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boosting-machine/10.1214/aos/1013203451.full) (GBDT)

[XGBoost: A Scalable Tree Boosting System](https://arxiv.org/abs/1603.02754) (XGBoost)


Other:
https://zhuanlan.zhihu.com/p/142115015

In [1]:


import pandas as pd
import pymysql

with pymysql.connect(
        user='dbabigd',
        password='dbabigddbabigd',
        host='175.24.125.182',
        port=9030,
        database='crisps_order_center_v2'
) as conn:
    data = pd.read_sql_query(
        sql="""select a.cus_order_no as 客户单编号,cast(a.customer_id as char)as 客户id,a.customer_name,b.classify_three_name,a.create_time,a.cus_order_status_no
from crisps_order_center_v2.order_cus a
left join crisps_order_center_v2.order_sku b on a.cus_order_no = b.cus_order_no
where b.classify_three_name not like '%测试%' and a.customer_name not like '%测试%' and
a.cus_order_status_no not like 'ORDER_CUS_STATUS_CANCELLED'
order by a.create_time asc""",
        con=conn)
data.head()  # placed outside the with-block so that the notebook displays the result

In [2]:
data

| | 客户单编号 | 客户id | customer_name | classify_three_name | create_time | cus_order_status_no |
|---|---|---|---|---|---|---|
| 0 | C21122500002 | 8162886898785517568 | 138******16 | 有限公司注册 | 2021-12-25 10:08:20 | ORDER_CUS_STATUS_PROGRESSING |
| 1 | C21122500010 | 8091319090696617984 | 135******70 | 个人抵押贷 | 2021-12-25 11:00:23 | ORDER_CUS_STATUS_PROGRESSING |
| 2 | C21122500019 | 7957543119100125184 | 182******74 | 有限公司注册 | 2021-12-25 11:07:23 | ORDER_CUS_STATUS_PROGRESSING |
| 3 | C21122500046 | 8096575238874005504 | 155******46 | 法律咨询 | 2021-12-25 11:56:28 | ORDER_CUS_STATUS_PROGRESSING |
| 4 | C21122500049 | 7822989022025662464 | 夏茂源 | 个体变更 | 2021-12-25 13:23:24 | ORDER_CUS_STATUS_COMPLETED |
| 5 | C21122500051 | 8091319090696617984 | 135******70 | 公积金代理 | 2021-12-25 13:28:53 | ORDER_CUS_STATUS_PROGRESSING |
| 6 | C21122500090 | 8091319090696617984 | 135******70 | 个体变更 | 2021-12-25 14:20:20 | ORDER_CUS_STATUS_PROGRESSING |
| 7 | C21122500091 | 7957543119100125184 | 182******74 | 有限公司注册 | 2021-12-25 14:20:26 | ORDER_CUS_STATUS_PROGRESSING |
| 8 | C21122500097 | 8068071036795289600 | 183******50 | 个体变更 | 2021-12-25 14:44:15 | ORDER_CUS_STATUS_PROGRESSING |
| 9 | C21122500171 | 7822989022025662464 | 夏茂源 | 个体变更 | 2021-12-25 15:38:26 | ORDER_CUS_STATUS_PROGRESSING |


In [47]:
from datetime import datetime

import pandas as pd

data = pd.DataFrame(
    {
        '客户id': ['111', '112', '111', '113', '111'],
        '商品id': ['A', 'B', 'C', 'D', 'E'],
        'date': ['2022-03-01', '2022-07-31', '2022-06-01', '2020-01-13', '2022-07-13']  # dates are stored as strings
    }
)


def get_custom(x):
    """Collect one customer's products, walking from the newest order backwards
    until two consecutive orders are 100 or more days apart."""
    user_id = x['客户id'].values[0]
    last_date = None
    collector = []
    for date, item in x[['date', '商品id']].values:
        if last_date is not None:
            gap = abs((datetime.strptime(last_date, '%Y-%m-%d') - datetime.strptime(date, '%Y-%m-%d')).days)
            if gap >= 100:
                break  # the chain is broken: ignore all older orders
        collector.append(item)
        last_date = date

    return pd.DataFrame(
        [(user_id, set(collector))],
        columns=['客户id', '商品集合']
    )


# Sort newest first so that each group is scanned from its most recent order.
data = data.sort_values(by='date', ascending=False)
user_record = data.groupby('客户id').apply(get_custom)
user_record

| 客户id (index) | | 客户id | 商品集合 |
|---|---|---|---|
| 111 | 0 | 111 | {E, A, C} |
| 112 | 0 | 112 | {B} |
| 113 | 0 | 113 | {D} |


In [30]:
# Keep only the customers whose product set contains more than one item.
user_record[user_record['商品集合'].map(len) > 1]

| 客户id (index) | | 客户id | 商品集合 |
|---|---|---|---|
| 111 | 0 | 111 | {A, C} |


In [34]:

(datetime.strptime('2022-06-01', '%Y-%m-%d') - datetime.strptime('2022-03-01', '%Y-%m-%d')).days

92