# GBDT_LR
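FFM strengthens feature crossing by introducing feature fields, but it can still only model second-order feature crosses. Raising the order of the crosses further inevitably runs into combinatorial explosion and excessive computational cost.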

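In 2014, Facebook proposed using GBDT to select and combine features automatically, producing a new discrete feature vector that is then fed to an LR model to generate the final prediction. This is the well-known GBDT+LR model. Its most common application is CTR (click-through rate) prediction, i.e. predicting whether a user will click an ad that is shown to them.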

<img src="../img/gbdt_lr.png" width="400" >


**Steps**

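1. **At training time**, growing the GBDT trees amounts to automatic feature combination and discretization. Each path from the root to a leaf can be viewed as a combination of the features split on along that path, and the leaf uniquely identifies the path. The leaf is therefore passed to LR as a discrete feature for a *second round of training*.

    For example, the figure above shows two trees and an input sample $x$. After traversing both trees, $x$ lands on one leaf in each tree, and every leaf corresponds to one dimension of the LR feature vector, so traversing the trees yields all LR features for that sample. The new feature vector takes values in 0/1. If the left tree has three leaves and the right tree has two, the final feature is a five-dimensional vector. Suppose $x$ falls into the second leaf of the left tree, encoded as \[0,1,0\], and into the second leaf of the right tree, encoded as \[0,1\]; the overall encoding is then \[0,1,0,0,1\]. This encoding is fed as features to a linear classifier (LR or FM) for classification.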


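2. **At prediction time**, the sample is first passed through every GBDT tree to find the leaf it reaches, which gives a discrete feature (i.e. one feature combination) per tree. These features are then passed to LR in one-hot form for a linearly weighted prediction.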
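
The encoding described in step 1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's implementation: the helper name `encode_leaves` and its arguments are made up for this example, and leaf indices are assumed to be 0-based, so the "second leaf" is index 1.

```python
def encode_leaves(leaf_indices, leaves_per_tree):
    """One-hot encode the leaf reached in each tree and concatenate the blocks.

    leaf_indices[t]    -- 0-based index of the leaf the sample reaches in tree t
    leaves_per_tree[t] -- number of leaves in tree t
    (Both argument names are hypothetical, chosen for this sketch.)
    """
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("need exactly one leaf index per tree")
    encoded = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError(f"leaf {leaf} is out of range for a tree with {n_leaves} leaves")
        block = [0] * n_leaves   # one slot per leaf of this tree
        block[leaf] = 1          # mark the leaf the sample landed on
        encoded.extend(block)
    return encoded


# The sample from the figure: second leaf of the left tree (3 leaves),
# second leaf of the right tree (2 leaves).
x_encoded = encode_leaves(leaf_indices=[1, 1], leaves_per_tree=[3, 2])
print(x_encoded)  # [0, 1, 0, 0, 1]
```

Each tree contributes exactly one active position, so the encoded vector always contains as many ones as there are trees.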

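**Notes**
1. Building features with GBDT and predicting CTR with LR are two steps that are trained independently.
2. The discrete vector produced by the GBDT feature combination is fed to logistic regression together with the original features of the training data, rather than on its own.
3. An ensemble is used to build the trees because a single tree has weak expressive power and cannot capture many discriminative feature combinations; multiple trees are more expressive. Each GBDT tree learns what the preceding trees still get wrong, and the number of boosting iterations equals the number of trees.
4. RF also consists of multiple trees, but it has been reported to perform worse than GBDT in practice. In GBDT, the splits in the earlier trees mainly reflect features that discriminate well across most samples, while the later trees mainly reflect features that discriminate among the minority of samples whose residuals are still large after the first N trees. Picking features that are discriminative overall first, and only then features that are discriminative for a few samples, is a more sensible order, which is probably also why GBDT is used.
5. In CTR prediction, GBDT usually builds two kinds of trees (one for non-ID features and one for ID features). AD ID features are very important in CTR prediction, but building trees directly on the AD ID as a feature is not feasible, so GBDT trees are built per AD ID instead.
    1. Non-ID trees: built without fine-grained IDs. These trees serve as the base, so that even ads and advertisers with little exposure still get discriminative features and feature combinations.
    2. ID trees: built on fine-grained IDs, used to discover discriminative features and feature combinations for IDs with sufficient exposure.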
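
To make notes 3 and 4 concrete, the sketch below boosts one-feature decision stumps on a toy regression problem, with each new stump fit to the residuals left by the previous ones. It is a deliberately simplified stand-in that assumes squared loss and an exhaustive threshold search; LightGBM works on gradients of the log loss and builds trees in a far more elaborate way. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # split: x <= threshold vs x > threshold
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=5, learning_rate=0.5):
    """Fit n_trees stumps, each one on the residuals left by the previous ones."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean prediction
    stumps, mse_history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        stumps.append(stump)
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, mse_history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
stumps, mse_history = boost(xs, ys)
print(len(stumps))   # one stump per boosting round
print(mse_history)   # training error after each round
```

The number of stumps equals the number of boosting rounds, and the training error does not increase from one round to the next because every stump only targets what the earlier ones left unexplained.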

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.metrics import log_loss
from scipy import sparse
import numpy as np
import pandas as pd
import lightgbm as lgb
import gc
import warnings
warnings.filterwarnings('ignore')

In [2]:
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

# Data preprocessing
# Drop the Id column
train_df.drop(['Id'], axis=1, inplace=True)
test_df.drop(['Id'], axis=1, inplace=True)

# Merge the training and test sets; test rows are marked with Label = -1
test_df['Label'] = -1
data = pd.concat([train_df, test_df])

# Fill missing values
data.fillna(-1, inplace=True)

# Separate the numerical (continuous) features from the categorical ones
numberical_fea = ['I' + str(i + 1) for i in range(13)]
categorical_fea = ['C' + str(i + 1) for i in range(26)]

In [3]:
def lr_model(data, numberical_fea, categorical_fea):
    # Min-max scale the continuous features
    scaler = MinMaxScaler()
    for col in numberical_fea:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))

    # One-hot encode the categorical features
    for col in categorical_fea:
        onehot_feats = pd.get_dummies(data[col], prefix=col)
        data.drop([col], axis=1, inplace=True)
        data = pd.concat([data, onehot_feats], axis=1)

    # Split the merged data back into training and test sets
    train = data[data['Label'] != -1]
    target = train.pop('Label')
    test = data[data['Label'] == -1]
    test.drop(['Label'], axis=1, inplace=True)

    # Hold out part of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(train,
                                                      target,
                                                      test_size=0.2,
                                                      random_state=2020)

    # Fit the model
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    train_logloss = log_loss(y_train, lr.predict_proba(X_train)[:, 1])  # log loss: -(y*log(p) + (1-y)*log(1-p))
    val_logloss = log_loss(y_val, lr.predict_proba(X_val)[:, 1])
    
    print('train_logloss: ', train_logloss)
    print('val_logloss: ', val_logloss)

    # Predict on the test set
    # predict_proba returns an n x k matrix whose entry (i, j) is the predicted probability that sample i has label j; column 1 is the click probability
    y_pred = lr.predict_proba(test)[:, 1]
    print('predict: ', y_pred[:10])

In [4]:
lr_model(data.copy(), numberical_fea, categorical_fea)

train_logloss:  0.12423395164771733
val_logloss:  0.444072456988388
predict:  [0.44783059 0.80628705 0.1756691  0.02070154 0.13984202 0.46490042
 0.43386417 0.07089967 0.07121148 0.27896238]
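
The log loss reported above follows the formula in the code comment, $-(y\log p + (1-y)\log(1-p))$, averaged over samples. The sketch below computes it by hand as a sanity check. `binary_log_loss` is a hypothetical helper, and the clipping constant `eps` is an assumption for this sketch rather than a reproduction of scikit-learn's exact behaviour.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# An uninformative model that always predicts 0.5 scores ln(2), about 0.693.
print(binary_log_loss([1, 0], [0.5, 0.5]))
# Confident and correct predictions score lower than confident and wrong ones.
print(binary_log_loss([1, 0], [0.9, 0.1]))
print(binary_log_loss([1, 0], [0.1, 0.9]))
```

Against the $\ln 2 \approx 0.693$ baseline, the validation log loss of about 0.444 above is an improvement, while the much lower training value of about 0.124 hints that the one-hot LR model is overfitting.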


In [5]:
def gbdt_model(data, numberical_fea, categorical_fea):

    # One-hot encode the categorical features
    for col in categorical_fea:
        onehot_feats = pd.get_dummies(data[col], prefix=col)
        data.drop([col], axis=1, inplace=True)
        data = pd.concat([data, onehot_feats], axis=1)

    # Split the merged data back into training and test sets
    train = data[data['Label'] != -1]
    target = train.pop('Label')
    test = data[data['Label'] == -1]
    test.drop(['Label'], axis=1, inplace=True)

    # Hold out part of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(train,
                                                      target,
                                                      test_size=0.2,
                                                      random_state=2020)

    # Build the model
    gbm = lgb.LGBMClassifier(
        boosting_type='gbdt',  # use the GBDT booster
        objective='binary',
        subsample=0.8,
        min_child_weight=0.5,
        colsample_bytree=0.7,
        num_leaves=100,
        max_depth=12,
        learning_rate=0.01,
        n_estimators=10000)
    gbm.fit(
        X_train,
        y_train,
        eval_set=[(X_train, y_train), (X_val, y_val)],
        eval_names=['train', 'val'],
        eval_metric='binary_logloss',
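        # Note: LightGBM 4.0 removed early_stopping_rounds from fit(); on newer versions pass callbacks=[lgb.early_stopping(100)] instead.
        early_stopping_rounds=100,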
    )

    train_logloss = log_loss(y_train, gbm.predict_proba(X_train)[:, 1])  # log loss: -(y*log(p) + (1-y)*log(1-p))
    val_logloss = log_loss(y_val, gbm.predict_proba(X_val)[:, 1])
    print('train_logloss: ', train_logloss)
    print('val_logloss: ', val_logloss)

    # Predict on the test set
    # predict_proba returns an n x k matrix whose entry (i, j) is the predicted probability that sample i has label j; column 1 is the click probability
    y_pred = gbm.predict_proba(test)[:, 1]
    print('predict: ', y_pred[:10])

In [6]:
gbdt_model(data.copy(), numberical_fea, categorical_fea)

[1]	train's binary_logloss: 0.523857	val's binary_logloss: 0.457806
Training until validation scores don't improve for 100 rounds
[2]	train's binary_logloss: 0.521371	val's binary_logloss: 0.457213
[3]	train's binary_logloss: 0.519084	val's binary_logloss: 0.456616
[4]	train's binary_logloss: 0.516882	val's binary_logloss: 0.456046
[5]	train's binary_logloss: 0.514449	val's binary_logloss: 0.455649
[6]	train's binary_logloss: 0.512277	val's binary_logloss: 0.455319
[7]	train's binary_logloss: 0.509973	val's binary_logloss: 0.455039
[8]	train's binary_logloss: 0.507717	val's binary_logloss: 0.454523
[9]	train's binary_logloss: 0.505668	val's binary_logloss: 0.454546
[10]	train's binary_logloss: 0.503491	val's binary_logloss: 0.454134
[11]	train's binary_logloss: 0.501469	val's binary_logloss: 0.453151
[12]	train's binary_logloss: 0.499463	val's binary_logloss: 0.452609
[13]	train's binary_logloss: 0.497257	val's binary_logloss: 0.452419
[14]	train's binary_logloss: 0.495206	val's binary

[198]	train's binary_logloss: 0.299242	val's binary_logloss: 0.436053
[199]	train's binary_logloss: 0.298515	val's binary_logloss: 0.43649
[200]	train's binary_logloss: 0.298193	val's binary_logloss: 0.436511
[201]	train's binary_logloss: 0.297521	val's binary_logloss: 0.436551
[202]	train's binary_logloss: 0.296716	val's binary_logloss: 0.436666
[203]	train's binary_logloss: 0.295858	val's binary_logloss: 0.436609
[204]	train's binary_logloss: 0.295111	val's binary_logloss: 0.43681
[205]	train's binary_logloss: 0.294636	val's binary_logloss: 0.436886
[206]	train's binary_logloss: 0.293799	val's binary_logloss: 0.437118
[207]	train's binary_logloss: 0.293252	val's binary_logloss: 0.437264
[208]	train's binary_logloss: 0.29258	val's binary_logloss: 0.437418
[209]	train's binary_logloss: 0.292261	val's binary_logloss: 0.437434
[210]	train's binary_logloss: 0.291606	val's binary_logloss: 0.437469
[211]	train's binary_logloss: 0.291045	val's binary_logloss: 0.437558
[212]	train's binary_lo

In [7]:
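# LR + GBDT model
# This combines the two models above: GBDT crosses and combines the features, turning the original feature vector into a new discrete feature vector, which is then fed to a logistic regression model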
def gbdt_lr_model(data, numberical_fea, categorical_fea):

    # One-hot encode the categorical features
    for col in categorical_fea:
        onehot_feats = pd.get_dummies(data[col], prefix=col)
        data.drop([col], axis=1, inplace=True)
        data = pd.concat([data, onehot_feats], axis=1)

    train = data[data['Label'] != -1]
    target = train.pop('Label')
    test = data[data['Label'] == -1]
    test.drop(['Label'], axis=1, inplace=True)

    # Hold out part of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(train,
                                                      target,
                                                      test_size=0.2,
                                                      random_state=2020)

    gbm = lgb.LGBMClassifier(
        objective='binary',
        subsample=0.8,
        min_child_weight=0.5,
        colsample_bytree=0.7,
        num_leaves=100,
        max_depth=12,
        learning_rate=0.01,
        n_estimators=100
    )

    gbm.fit(
        X_train,
        y_train,
        eval_set=[(X_train, y_train), (X_val, y_val)],
        eval_names=['train', 'val'],
        eval_metric='binary_logloss',
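        # With n_estimators=100, a patience of 100 rounds cannot cut training short. (LightGBM 4.0+ expects callbacks=[lgb.early_stopping(100)] here.)
        early_stopping_rounds=100,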
    )

    model = gbm.booster_

    gbdt_feats_train = model.predict(train, pred_leaf=True)  # shape (n_samples, n_trees); each entry is the index of the leaf that the sample reaches in that tree
    gbdt_feats_test = model.predict(test, pred_leaf=True)

    gbdt_feats_name = [
        'gbdt_leaf_' + str(i) for i in range(gbdt_feats_train.shape[1])
    ]

    gbdt_feats_train_df = pd.DataFrame(gbdt_feats_train,
                                       columns=gbdt_feats_name)
    gbdt_feats_test_df = pd.DataFrame(gbdt_feats_test, columns=gbdt_feats_name)

    # Append the leaf-index columns to the original features
    train = pd.concat([train, gbdt_feats_train_df], axis=1)
    test = pd.concat([test, gbdt_feats_test_df], axis=1)
    train_size = train.shape[0]
    data = pd.concat([train, test])
    del train
    del test
    gc.collect()

    # Min-max scale the continuous features
    scaler = MinMaxScaler()
    for col in numberical_fea:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))

    # One-hot encode the GBDT leaf-index features
    for col in gbdt_feats_name:
        onehot_feats = pd.get_dummies(data[col], prefix=col)
        data.drop([col], axis=1, inplace=True)
        data = pd.concat([data, onehot_feats], axis=1)

    train = data[:train_size]
    test = data[train_size:]
    del data
    gc.collect()

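    # Note: this split (30%, seed 2018) differs from the one used to fit the GBDT (20%, seed 2020),
    # so part of this validation set was seen by the GBDT and the validation log loss may be optimistic.
    X_train, X_val, y_train, y_val = train_test_split(train,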
                                                      target,
                                                      test_size=0.3,
                                                      random_state=2018)

    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    train_logloss = log_loss(y_train, lr.predict_proba(X_train)[:, 1])
    print('train-logloss: ', train_logloss)
    val_logloss = log_loss(y_val, lr.predict_proba(X_val)[:, 1])
    print('val-logloss: ', val_logloss)
    y_pred = lr.predict_proba(test)[:, 1]
    print('predict: ', y_pred[:10])

In [8]:
gbdt_lr_model(data.copy(), numberical_fea, categorical_fea)

[1]	train's binary_logloss: 0.523857	val's binary_logloss: 0.457806
Training until validation scores don't improve for 100 rounds
[2]	train's binary_logloss: 0.521371	val's binary_logloss: 0.457213
[3]	train's binary_logloss: 0.519084	val's binary_logloss: 0.456616
[4]	train's binary_logloss: 0.516882	val's binary_logloss: 0.456046
[5]	train's binary_logloss: 0.514449	val's binary_logloss: 0.455649
[6]	train's binary_logloss: 0.512277	val's binary_logloss: 0.455319
[7]	train's binary_logloss: 0.509973	val's binary_logloss: 0.455039
[8]	train's binary_logloss: 0.507717	val's binary_logloss: 0.454523
[9]	train's binary_logloss: 0.505668	val's binary_logloss: 0.454546
[10]	train's binary_logloss: 0.503491	val's binary_logloss: 0.454134
[11]	train's binary_logloss: 0.501469	val's binary_logloss: 0.453151
[12]	train's binary_logloss: 0.499463	val's binary_logloss: 0.452609
[13]	train's binary_logloss: 0.497257	val's binary_logloss: 0.452419
[14]	train's binary_logloss: 0.495206	val's binary