# 模型融合

什么是 stacking 简单来说 stacking 就是当用初始训练数据学习出若干个基学习器后，将这几个学习器的预测结果作为新的训练集，来学习一个新的学习器。 将个体学习器结合在一起的时候使用的方法叫做结合策略。对于分类问题，我们可以使用投票法来选择输出最多的类。对于回归问题，我们可以将分类器输出的结果求平均值。 上面说的投票法和平均法都是很有效的结合策略，还有一种结合策略是使用另外一个机器学习算法来将个体机器学习器的结果结合在一起，这个方法就是Stacking。

# 模型融合目标

对于多种调参完成的模型进行模型融合。<p>
完成对于多种模型的融合，提交融合结果并打卡。

# 简单加权融合:

回归（分类概率）：算术平均融合（Arithmetic mean),几何平均融合（Geometric mean）<p>
分类：投票（Voting) <p>
综合：排序融合(Rank averaging)，log融合

# stacking/blending:

对于一般的blending，主要思路是把原始的训练集先分成两部分，比如70%的数据作为新的训练集，剩下30%的数据作为测试集。<p>
我们在这70%的数据上训练多个初级学习器，然后用余下的30%预测相应的P。<p>
我们直接用这30%数据在第一层预测的P结合真实值，作为新特征继续训练次级学习器。

### Stacking工作机制

step1 先从包含m个样本点的初始数据集，训练出T个初级学习器<p>
step2 根据“生成”（交叉验证法/留一法等）的新数据集训练次级学习器

# boosting/bagging（在xgboost，Adaboost,GBDT中已经用到）：

### Boosting工作机制

step1  先从初试训练集训练出基学<p>
step2  根据基学习器的表现，调整训练样本分布（调整方法见参考链接2），使得基学习器1做错的训练样本在后续得到更多关注<p>
step3  根据调整后的样本分布训练基学习器2<p>
step4  重复进行,得到 T个基学习器<p>
step5  将 T个基学习器加权结合

### Bagging工作机制

step1  基于自助采样法进行采样，可以采样出T个样本集，每个样本集包含m个样本点<p>
step2  基于 T个样本集训练得到T个基学习器<p>
step3  将T个基学习器结合，分类任务常用简单投票法，回归任务常用简单平均法<p>


# 代码示例

## 1）简单加权平均，结果直接融合

In [16]:
## 生成一些简单的样本数据，test_prei 代表第i个模型的预测值
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true 代表第模型的真实值
y_test_true = [1, 3, 2, 6]

In [17]:
def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result

In [18]:
from sklearn import metrics
# 各模型的预测结果计算MAE
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true, test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true, test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true, test_pre3))

Pred1 MAE: 0.1750000000000001
Pred2 MAE: 0.07499999999999993
Pred3 MAE: 0.10000000000000009


In [19]:
## 根据加权计算MAE
w = [0.3,0.4,0.3] # 定义比重权值
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true, Weighted_pre))

Weighted_pre MAE: 0.05750000000000027


# 回归\分类概率-融合：

In [20]:
## 定义结果的加权平均函数
def Mean_method(test_pre1,test_pre2,test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
    return Mean_result

In [21]:
Mean_pre = Mean_method(test_pre1,test_pre2,test_pre3)
print('Mean_pre MAE:',metrics.mean_absolute_error(y_test_true, Mean_pre))

Mean_pre MAE: 0.06666666666666693


In [22]:
## 定义结果的加权平均函数
def Median_method(test_pre1,test_pre2,test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).median(axis=1)
    return Median_result

In [23]:
Median_pre = Median_method(test_pre1,test_pre2,test_pre3)
print('Median_pre MAE:',metrics.mean_absolute_error(y_test_true, Median_pre))

Median_pre MAE: 0.07500000000000007


# Stacking融合(回归):

In [24]:
from sklearn import linear_model

def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2= linear_model.LinearRegression()):
    model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
    return Stacking_result

In [25]:
# 生成一些简单的样本数据，train_regi 代表第i个模型的预测值
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
# y_trian_true 代表train模型的真实值
y_train_true = [3, 8, 9, 5] 

# 生成一些简单的样本数据，test_prei 代表第i个模型的预测值
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true 代表test模型的真实值
y_test_true = [1, 3, 2, 6]

In [26]:
model_L2= linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
                               test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))

Stacking_pre MAE: 0.04213483146067476


# 分类模型融合

# 1）Voting投票机制：

# 2）分类的Stacking\Blending融合：

# 本赛题示例

In [1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
%matplotlib inline

import itertools
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
# from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# from mlxtend.plotting import plot_learning_curves
# from mlxtend.plotting import plot_decision_regions

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV,cross_val_score
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error

In [3]:
Train_data = pd.read_csv('data_train.csv', sep=',')
Test_data = pd.read_csv('data_test.csv', sep=',')
print('Train data shape:',Train_data.shape)
print('TestA data shape:',Test_data.shape)

Train data shape: (150000, 335)
TestA data shape: (50000, 336)


In [8]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

In [9]:
sample_feature = reduce_mem_usage(pd.read_csv('data_train.csv'))

Memory usage of dataframe is 402000080.00 MB
Memory usage after optimization is: 54450080.00 MB
Decreased by 86.5%


In [37]:
test_features=reduce_mem_usage(pd.read_csv('data_test.csv'))

Memory usage of dataframe is 134400080.00 MB
Memory usage after optimization is: 18550080.00 MB
Decreased by 86.2%


In [41]:
test_features.shape

(50000, 336)

In [None]:
continuous_feature_names = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14']

In [39]:
test_X = test_features[continuous_feature_names]

In [40]:
test_X = test_X.dropna().replace('-', 0).reset_index(drop=True)

In [10]:
train_y=sample_feature['price']

In [11]:
train_x=sample_feature.drop(['price'],axis=1)

In [49]:
train_x.shape,train_y.shape,test_X.shape

((119833, 17), (119833,), (50000, 17))

In [51]:
test_X.head()

Unnamed: 0,power,kilometer,v_0,v_1,v_2,v_3,v_4,v_5,v_6,v_7,v_8,v_9,v_10,v_11,v_12,v_13,v_14
0,5.75,1.0,49.59375,5.246094,1.000977,-4.121094,0.737305,0.264404,0.121826,0.070923,0.106567,0.078857,-7.050781,-0.854492,4.800781,0.620117,-3.664062
1,4.332031,0.928223,42.40625,-3.253906,-1.753906,3.646484,-0.725586,0.261719,0.0,0.096741,0.013702,0.052368,3.679688,-0.729004,-3.796875,-1.541016,-0.756836
2,4.699219,0.707031,45.84375,4.703125,0.155396,-1.118164,-0.229126,0.260254,0.112061,0.078064,0.062073,0.050537,-4.925781,1.000977,0.82666,0.138184,0.753906
3,5.082031,0.707031,46.4375,4.320312,0.428955,-2.037109,-0.234741,0.260498,0.10675,0.081116,0.075989,0.048279,-4.863281,0.505371,1.870117,0.365967,1.3125
4,4.332031,1.0,42.1875,-3.166016,-1.572266,2.603516,0.387451,0.250977,0.0,0.07782,0.028595,0.081726,3.617188,-0.67334,-3.197266,-0.025681,-0.101318


In [13]:
continuous_feature_names = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14']

In [14]:
train_X = train_x[continuous_feature_names]
#train_y = train['price']

In [15]:
train_X = train_X.dropna().replace('-', 0).reset_index(drop=True)

In [31]:
def build_model_lr(x_train,y_train):
    reg_model = linear_model.LinearRegression()
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_ridge(x_train,y_train):
    reg_model = linear_model.Ridge(alpha=0.8)#alphas=range(1,100,5)
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_lasso(x_train,y_train):
    reg_model = linear_model.LassoCV()
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_gbdt(x_train,y_train):
    estimator =GradientBoostingRegressor(loss='ls',subsample= 0.85,max_depth= 5,n_estimators = 100)
    param_grid = { 
            'learning_rate': [0.05,0.08,0.1,0.2],
            }
    gbdt = GridSearchCV(estimator, param_grid,cv=3)
    gbdt.fit(x_train,y_train)
    print(gbdt.best_params_)
    # print(gbdt.best_estimator_ )
    return gbdt

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=120, learning_rate=0.08, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=5) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=63,n_estimators = 100)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm

# XGBoost的五折交叉回归验证实现

In [63]:
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) # ,objective ='reg:squarederror'

scores_train = []
scores = []

## 5折交叉验证方式
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(x_train,y_train):
    
    train_x=x_train.iloc[train_ind].values
    train_y=y_train.iloc[train_ind]
    val_x=x_train.iloc[val_ind].values
    val_y=y_train.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))


Train mae: 609.9809794734151
Val mae 720.139287886939


# 划分数据集，并用多种方法训练和预测

In [66]:
x_train,x_val,y_train,y_val = train_test_split(x_train,y_train,test_size=0.3)

## Train and Predict
print('Predict LR...')
model_lr = build_model_lr(x_train,y_train)
val_lr = model_lr.predict(x_val)
subA_lr = model_lr.predict(test_X)

print('Predict Ridge...')
model_ridge = build_model_ridge(x_train,y_train)
val_ridge = model_ridge.predict(x_val)
subA_ridge = model_ridge.predict(test_X)

print('Predict Lasso...')
model_lasso = build_model_lasso(x_train,y_train)
val_lasso = model_lasso.predict(x_val)
subA_lasso = model_lasso.predict(test_X)

print('Predict GBDT...')
model_gbdt = build_model_gbdt(x_train,y_train)
val_gbdt = model_gbdt.predict(x_val)
subA_gbdt = model_gbdt.predict(test_X)


Predict LR...
Predict Ridge...
Predict Lasso...
Predict GBDT...
{'learning_rate': 0.1}


In [30]:
x_train,x_val,y_train,y_val = train_test_split(train_X,train_y,test_size=0.3)

# 一般比赛中效果最为显著的两种方法

In [46]:

def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))

In [44]:
print('predict XGB...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
subA_xgb = model_xgb.predict(test_X)

print('predict lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
subA_lgb = model_lgb.predict(test_X)

predict XGB...
predict lgb...


In [47]:
print('Sta inf of lgb:')
Sta_inf(subA_lgb)

Sta inf of lgb:
_min -326.4078284252571
_max: 90876.06238076159
_mean 5923.242385425989
_ptp 91202.47020918685
_std 7370.306693837621
_var 54321420.761227645


# 总结

模型融合很重要，比赛用的比较多的就是stacking和boosting，由于时间紧张，后续要好好理解其中的细节