# Ensembling & Stacking Models

### After a round of feature engineering, we have finally reached the key point of this notebook:<br>building a stacking ensemble!

## Creating helpers with a Python class

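### We use a Python class to make our calls more convenient. Newcomers to programming will often hear that classes belong to object-oriented programming. In short, a class lets us extend existing code with new functions and methods.
### In the code below we write a SklearnHelper class that wraps the built-in methods shared by the Sklearn classifiers (such as train, predict and fit). That way we do not have to repeat the same calls for every model we use.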

In [1]:
# Load in our libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# Going to use these 5 base models for the stacking
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
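from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20



In [2]:
test = pd.read_csv('new_test.csv', index_col= 0)
train = pd.read_csv('new_train.csv', index_col= 0)

In [3]:
# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # set folds for out-of-fold prediction (5-fold cross-validation)
# Precompute the (train_index, test_index) pairs once so every model sees the same folds.
# The folds are contiguous and unshuffled, so no random_state is needed here.
kf = list(KFold(n_splits=NFOLDS).split(np.arange(ntrain)))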

# Class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})  # copy so the caller's dict is not mutated
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        # Refit on the given data, then print and return the importances
        importances = self.clf.fit(x, y).feature_importances_
        print(importances)
        return importances

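### Apologies to those who already know this. For those who do not, here is what the code above does. Our base classifiers all come from the Sklearn library, so one small class is enough to wrap them.
### def __init__: Python's standard constructor, called whenever an object of the class is created. To create a helper you pass in the model class to use (clf), the random seed (seed), and the remaining model parameters (params).
### The remaining methods simply call the corresponding built-in methods of the wrapped classifier.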
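The wrapper pattern described above can be tried without Sklearn installed. The sketch below is only an illustration: `MajorityClassifier` and `Helper` are hypothetical names invented here, and `Helper` copies the params dict before adding the seed, which is a defensive choice of this sketch.

```python
class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common training label."""

    def __init__(self, random_state=None):
        self.random_state = random_state  # unused; kept to mirror the Sklearn signature
        self.majority_ = None

    def fit(self, x, y):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return [self.majority_ for _ in x]


class Helper:
    """Same idea as SklearnHelper: build the model from a class plus a params dict."""

    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})  # copy, so the caller's dict is left untouched
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)


shared_params = {}
model = Helper(clf=MajorityClassifier, seed=42, params=shared_params)
model.train([[0], [1], [2]], [1, 1, 0])
toy_predictions = model.predict([[5], [6]])
print(toy_predictions)
```

Because every wrapped model exposes the same `train` and `predict` methods, code that loops over models never needs to know which classifier it is holding.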


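## Out-of-Fold Predictions
### Feeding the input to the base classifiers produces a set of outputs; this is the first level. Stacking trains a second-level classifier that takes those first-level outputs as its input. However, we cannot simply train every base model on the full training data and pass its predictions on that same data to the second-level model. Doing so risks overfitting, because the base models would be predicting on rows they have already 'seen' during training.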
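The idea behind out-of-fold prediction is that each training row is predicted by a model that never saw it. A minimal pure-Python sketch of the index bookkeeping follows; `kfold_indices` is a hypothetical helper written for this illustration (it is not Sklearn's KFold) and assumes contiguous, unshuffled folds.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, holdout_idx) pairs over contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, holdout
        start += size


seen = []
for train_idx, holdout_idx in kfold_indices(n_samples=10, n_folds=5):
    # A model fitted on train_idx never sees the rows it is asked to predict.
    assert not set(train_idx) & set(holdout_idx)
    seen.extend(holdout_idx)

# Every row is held out exactly once across the folds.
print(seen == list(range(10)))
```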

In [4]:
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))  # 1-D array of ntrain zeros
    oof_test = np.zeros((ntest,))  # 1-D array of ntest zeros
    oof_test_skf = np.empty((NFOLDS, ntest))  # uninitialized array, NFOLDS rows by ntest columns

    # kf yields index arrays: each of the NFOLDS iterations splits the training
    # rows into train_index and test_index.
    for i, (train_index, test_index) in enumerate(kf):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        # Fit on this fold's training part, then predict its held-out part.
        # The held-out parts of all folds are disjoint and together cover every
        # training row, so after the loop oof_train holds one out-of-fold
        # prediction per training row (a vector of length ntrain).
        oof_train[test_index] = clf.predict(x_te)
        # Each fold's model also predicts the full test set; fold i fills row i,
        # giving NFOLDS rows by ntest columns.
        oof_test_skf[i, :] = clf.predict(x_test)

    # Average the NFOLDS test-set predictions for each test row.
    oof_test[:] = oof_test_skf.mean(axis=0)

    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
# If you are curious about the flow, step through it in a debugger.
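One consequence of the averaging step is worth seeing on a toy case: when the base models return 0/1 class labels, the averaged test predictions are fractions rather than labels. The numbers below are hypothetical and only mimic the shape of oof_test_skf (one row per fold, one column per test row).

```python
# Hypothetical 0/1 predictions: 5 folds (rows) by 4 test rows (columns).
fold_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
n_folds = len(fold_predictions)

# Pure-Python counterpart of oof_test_skf.mean(axis=0): average each column.
averaged = [sum(column) / n_folds for column in zip(*fold_predictions)]
print(averaged)  # [1.0, 0.2, 0.6, 0.0]
```

Each value is the share of fold models that voted for class 1 on that test row, which is what the second-level model receives as a feature.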

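# Building our first-level models
### We have prepared 5 models as our first-level classifiers. All of them can be called directly from Sklearn:<br>1. Random Forest classifier<br>2. Extra Trees classifier<br>3. AdaBoost classifier<br>4. Gradient Boosting classifier<br>5. Support Vector Machine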

In [5]:
# Put in our parameters for said classifiers
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 700,
    'random_state': 1,  # overridden by the seed passed to SklearnHelper
    'min_samples_split': 16,
    #'max_features': 0.2,
    'min_samples_leaf': 1,
    'max_features': 'sqrt',  # same as the former 'auto' setting for classifiers
    'oob_score': True,
    'criterion': 'gini'
}


# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':500,
    #'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Support Vector Classifier parameters 
svc_params = {
    'kernel' : 'linear',
    'C' : 0.025
    }

### Furthermore, having introduced classes and objects above, we now create 5 objects to represent our 5 models.

In [6]:
# Create 5 objects that represent our 5 models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

### Preparing the train and test data

In [7]:
# Create Numpy arrays of train, test and target (Survived) dataframes to feed into our models
y_train = train['Survived'].values
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.drop(['PassengerId'], axis=1).values # Creates an array of the test data

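## Predictions from the first-level models
### We now feed the training and test data into the 5 base classifiers and use the out-of-fold prediction function defined earlier to generate their predictions.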

In [8]:
# Create our OOF train and test predictions. These base results will be used as new features
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")

Training is complete


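## Feature importances from the different models
### Now that the first-level classifiers are trained, a single line of code is enough to output each model's feature importances.

### According to the Sklearn documentation, many classifiers, including the four tree-based ensembles used here (but not the SVC), expose a feature_importances_ attribute, so we can get the result simply by reading it.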

In [9]:
rf_feature = rf.feature_importances(x_train, y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train, y_train)

[ 0.07913504  0.08869822  0.00763216  0.12225162  0.04515253  0.00255544
  0.05439717  0.01071831  0.00444041  0.00355836  0.03184209  0.04604151
  0.02142887  0.01365789  0.12409112  0.12692719  0.01068549  0.00696667
  0.00534     0.00347815  0.00344502  0.00517275  0.02019557  0.01381173
  0.00690553  0.00346757  0.01288248  0.0005032   0.02482869  0.01147307
  0.03285788  0.03299536  0.00283527  0.00774885  0.00105529  0.00484766
  0.00144391  0.00392849  0.00060345]
[ 0.01697154  0.01750097  0.00580434  0.1698187   0.05364356  0.00427037
  0.0679527   0.01363845  0.00558665  0.00401758  0.01115545  0.05975828
  0.02622538  0.01455904  0.13467212  0.14199794  0.01055417  0.0075047
  0.00399455  0.00323075  0.00532836  0.00267509  0.02764401  0.01500582
  0.00659896  0.00216067  0.01408285  0.00031966  0.03081482  0.01600647
  0.04241731  0.03818561  0.0039503   0.00980412  0.00108097  0.00470314
  0.00174225  0.00373936  0.00088298]
[ 0.188  0.642  0.004  0.002  0.016  0.002  0.   

In [10]:
rf_features =[ 0.07913504,0.08869822,0.00763216,0.12225162,0.04515253,0.00255544
,0.05439717,0.01071831,0.00444041,0.00355836,0.03184209,0.04604151
,0.02142887,0.01365789,0.12409112,0.12692719,0.01068549,0.00696667
,0.00534,0.00347815,0.00344502,0.00517275,0.02019557,0.01381173
,0.00690553,0.00346757,0.01288248,0.0005032, 0.02482869,0.01147307
,0.03285788,0.03299536,0.00283527,0.00774885,0.00105529,0.00484766
,0.00144391,0.00392849,0.00060345]

et_features =[ 0.01697154,0.01750097,0.00580434,0.1698187, 0.05364356,0.00427037
,0.0679527, 0.01363845,0.00558665,0.00401758,0.01115545,0.05975828
,0.02622538,0.01455904,0.13467212,0.14199794,0.01055417,0.0075047
,0.00399455,0.00323075,0.00532836,0.00267509,0.02764401,0.01500582
,0.00659896,0.00216067,0.01408285,0.00031966,0.03081482,0.01600647
,0.04241731,0.03818561,0.0039503, 0.00980412,0.00108097,0.00470314
,0.00174225,0.00373936,0.00088298]

ada_features = [ 0.188,0.642,0.004,0.002,0.016,0.002,0., 0.002,0., 0.002
,0.054,0.008,0., 0., 0., 0.01, 0.012,0., 0., 0.002,0.
,0., 0.004,0.008,0., 0., 0.002,0.002,0., 0.006,0.014
,0.002,0.004,0.004,0.006,0.002,0.002,0., 0., ]

gb_features = [3.21733143e-01, 3.29573449e-01, 1.84363236e-02, 3.94561432e-02
               , 3.92816755e-03, 1.49388411e-03, 1.14686277e-02, 5.51937330e-03
               , 4.81525893e-03, 4.76301001e-03, 5.28379065e-02, 9.65731856e-03
               , 6.50368781e-03, 1.29486815e-02, 7.25186433e-03, 8.11562169e-03
               , 1.26996100e-02, 3.29183654e-03, 9.53558241e-03, 7.78874360e-03
               , 1.12255908e-03, 6.24651095e-03, 9.80159275e-03, 1.08733196e-02
               , 2.05540685e-02, 2.02789313e-03, 1.39294387e-02, 1.27223403e-04
               , 1.49278162e-02, 1.79914919e-02, 4.23428069e-03, 6.35377966e-03
               , 3.26543125e-03, 2.80984448e-03, 1.63738175e-03, 4.43559768e-03
               , 4.20954252e-03, 1.42179816e-03, 2.21219689e-03]

### Create a dataframe that holds the feature importance data of all the models together

In [11]:
cols = train.columns.values
# Create a dataframe with features
feature_dataframe = pd.DataFrame( {'features': cols,
     'Random Forest feature importances': rf_features,
     'Extra Trees  feature importances': et_features,
      'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
    })

## Plotting the feature importances
### I use the interactive Plotly package to draw scatter plots.

In [12]:
# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Random Forest feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#         size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Random Forest feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Extra Trees  feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#         size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Extra Trees  feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Extra Trees Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['AdaBoost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#         size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['AdaBoost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'AdaBoost Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Gradient Boost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#         size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Gradient Boost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

### Next we compute the mean of all the feature importances and store it in a new column

In [13]:
# Create the new column containing the average of values

# numeric_only=True skips the string 'features' column; axis=1 computes the mean row-wise
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)
feature_dataframe.head(10)

| | AdaBoost feature importances | Extra Trees feature importances | Gradient Boost feature importances | Random Forest feature importances | features | mean |
|---|---|---|---|---|---|---|
| 0 | 0.188 | 0.016972 | 0.321733 | 0.079135 | Age | 0.15146 |
| 1 | 0.642 | 0.017501 | 0.329573 | 0.088698 | Fare | 0.269443 |
| 2 | 0.004 | 0.005804 | 0.018436 | 0.007632 | Age_Null_Flag | 0.008968 |
| 3 | 0.002 | 0.169819 | 0.039456 | 0.122252 | Name_Title_1 | 0.083382 |
| 4 | 0.016 | 0.053644 | 0.003928 | 0.045153 | Name_Title_2 | 0.029681 |
| 5 | 0.002 | 0.00427 | 0.001494 | 0.002555 | Name_Title_4 | 0.00258 |
| 6 | 0.0 | 0.067953 | 0.011469 | 0.054397 | Name_Title_5 | 0.033455 |
| 7 | 0.002 | 0.013638 | 0.005519 | 0.010718 | Cabin_num_(1.999, 28.667] | 0.007969 |
| 8 | 0.0 | 0.005587 | 0.004815 | 0.00444 | Cabin_num_(28.667, 65.667] | 0.003711 |
| 9 | 0.002 | 0.004018 | 0.004763 | 0.003558 | Cabin_num_(65.667, 148.0] | 0.003585 |
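As a quick sanity check on the mean column, the row-wise average can be reproduced by hand for a single feature. The sketch below uses only the four Age importances shown in the first row of the output above.

```python
from statistics import mean

# Importances for the Age feature, copied from the first output row:
# AdaBoost, Extra Trees, Gradient Boost, Random Forest.
age_importances = [0.188, 0.016972, 0.321733, 0.079135]

age_mean = mean(age_importances)
print(round(age_mean, 5))  # 0.15146, matching the 'mean' column
```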


## Plotting the mean feature importances
### With the mean feature importance across all models in hand, we can draw a bar chart to see how things look.

In [14]:
y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values
data = [go.Bar(
            x= x,
             y= y,
            width = 0.5,
            marker=dict(
               color = feature_dataframe['mean'].values,
            colorscale='Portland',
            showscale=True,
            reversescale = False
            ),
            opacity=0.6
        )]

layout= go.Layout(
    autosize= True,
    title= 'Barplots of Mean Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')

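# Second-level predictions from the first-level output

## First-level output as new features
### We now have the first-level predictions. We can treat them as a new set of features and feed them to the second-level classifier as its training data.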

In [15]:
base_predictions_train = pd.DataFrame( {'RandomForest': rf_oof_train.ravel(),
     'ExtraTrees': et_oof_train.ravel(),
     'AdaBoost': ada_oof_train.ravel(),
      'GradientBoost': gb_oof_train.ravel()
    })
base_predictions_train.head()

| | AdaBoost | ExtraTrees | GradientBoost | RandomForest |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |


## Correlation heatmap of the second-level training set

In [16]:
data = [
    go.Heatmap(
        z= base_predictions_train.astype(float).corr().values ,
        x=base_predictions_train.columns.values,
        y= base_predictions_train.columns.values,
          colorscale='Viridis',
            showscale=True,
            reversescale = True
    )
]
py.iplot(data, filename='labelled-heatmap')
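The heatmap is built from the pairwise Pearson correlations that `DataFrame.corr()` computes by default. As a rough illustration of what those numbers mean for 0/1 predictions, here is a pure-Python sketch; the three prediction vectors are hypothetical, and `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical out-of-fold predictions from three base models.
model_a = [0, 1, 0, 1, 0, 1, 1, 0]
model_b = [0, 1, 0, 1, 0, 1, 0, 0]  # disagrees with model_a on one row
model_c = [1, 0, 1, 0, 1, 0, 0, 1]  # always the opposite of model_a

print(pearson(model_a, model_a))  # identical predictions
print(pearson(model_a, model_b))  # mostly agreeing predictions
print(pearson(model_a, model_c))  # fully opposed predictions
```

Base models whose predictions are less than perfectly correlated give the second-level model something to choose between, which is the usual reason for inspecting this plot before stacking.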

In [17]:
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)
# Concatenate the first-level predictions column-wise: one column per base model.
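Concatenating with axis=1 places the models' columns side by side, so each row of the new feature matrix holds one prediction per base model; nothing is summed. A small pure-Python sketch of the same reshaping, using hypothetical prediction columns for three models:

```python
# Hypothetical OOF outputs: each model returns an (n_samples, 1) column.
et_col = [[0.0], [1.0], [0.0]]
rf_col = [[0.0], [1.0], [1.0]]
svc_col = [[0.0], [0.0], [1.0]]

# Pure-Python counterpart of np.concatenate((et_col, rf_col, svc_col), axis=1):
# join the per-model pieces of each row into a single row.
stacked = [sum(row_parts, []) for row_parts in zip(et_col, rf_col, svc_col)]

for row in stacked:
    print(row)
```

The result has 3 rows and 3 columns here; in the notebook it has ntrain rows and 5 columns, one per base model.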

## Fitting the second-level classifier with XGBoost

In [18]:
gbm = xgb.XGBClassifier(
    #learning_rate = 0.02,
 n_estimators= 2000,
 max_depth= 4,
 min_child_weight= 2,
 #gamma=1,
 gamma=0.9,                        
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread= -1,
 scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

In [19]:
# Generate Submission File 
StackingSubmission = pd.DataFrame({ 'PassengerId': test['PassengerId'],
                            'Survived': predictions.astype(int) })
StackingSubmission.to_csv("StackingSubmission.csv", index=False)