## Stacking

Wolpert introduced stacking in 1992. It involves:
1. Splitting the training data set into two disjoint sets.
2. Training several base learners on the first part.
3. Making predictions with the base learners on the second (validation) part.
4. Using the predictions from step 3 as the input to train a higher-level learner.
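
The four steps above can be sketched on a small synthetic dataset. This is a minimal illustration, not the notebook's pipeline: the data from `make_classification`, the choice of base learners, and the logistic-regression higher-level learner are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data stands in for a real dataset (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=3, random_state=0)

# Step 1: split the training data into two disjoint parts.
X_part1, X_part2, y_part1, y_part2 = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Step 2: train several base learners on the first part.
base_learners = [KNeighborsClassifier(),
                 RandomForestClassifier(n_estimators=25, random_state=0)]
for clf in base_learners:
    clf.fit(X_part1, y_part1)

# Step 3: predict class probabilities on the second (validation) part.
meta_features = np.hstack([clf.predict_proba(X_part2) for clf in base_learners])

# Step 4: train the higher-level learner on those predictions.
meta_learner = LogisticRegression(max_iter=1000)
meta_learner.fit(meta_features, y_part2)

# One block of 3 probability columns per base learner.
print(meta_features.shape)
```

Each base learner contributes one probability column per class, so the higher-level learner sees `n_learners * n_classes` features.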

<img src="https://raw.githubusercontent.com/teja/Machine_Learning/master/Images/Stacking1.png" width="640" align="left"/>

<img src="https://raw.githubusercontent.com/teja/Machine_Learning/master/Images/Stacking2.png" width="740" align="left"/>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split,KFold,GridSearchCV
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,ExtraTreesClassifier,BaggingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score,log_loss
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from xgboost.sklearn import XGBClassifier

%matplotlib inline

#### Dataset- https://www.kaggle.com/c/otto-group-product-classification-challenge
The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We are selling millions of products worldwide every day, with several thousand products being added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range.

For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. The winning models will be open sourced.

Data fields

    id - an anonymous id unique to a product
    feat_1, feat_2, ..., feat_93 - the various features of a product
    target - the class of a product


In [2]:
df_train = pd.read_csv("https://raw.githubusercontent.com/teja/Data_Files/master/otto_group/train.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/teja/Data_Files/master/otto_group/test.csv")

In [3]:
print(df_train.shape)
print(df_test.shape)

(61878, 95)
(144368, 94)


In [4]:
## Verify the classes
df_train.target.unique()

array(['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6',
       'Class_7', 'Class_8', 'Class_9'], dtype=object)

In [5]:
## Check the number of samples in each class
df_train.target.value_counts()

Class_2    16122
Class_6    14135
Class_8     8464
Class_3     8004
Class_9     4955
Class_7     2839
Class_5     2739
Class_4     2691
Class_1     1929
Name: target, dtype: int64

In [6]:
## The same per-class counts as a cross-tabulation
pd.crosstab(df_train.target,1)

col_0        1
target        
Class_1   1929
Class_2  16122
Class_3   8004
Class_4   2691
Class_5   2739
Class_6  14135
Class_7   2839
Class_8   8464
Class_9   4955


In [7]:
## Strip the "Class_" prefix so that Class_1 becomes 1, and likewise for all 9 classes
df_train["target"] = df_train["target"].str.replace("Class_","")

In [8]:
df_train.target.head(2)   ## dtype is still object; it needs to be int

0    1
1    1
Name: target, dtype: object

In [9]:
## Change the dtype from object to int
df_train["target"] = pd.to_numeric(df_train["target"])

In [10]:
df_train.target.head(2)

0    1
1    1
Name: target, dtype: int64

In [11]:
## Define X (the 93 feature columns) and y (the target)
X  = df_train.iloc[:,1:-1]
y = df_train.target

In [12]:
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))

(61878, 93)
(61878,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [13]:
### Split the training data into train and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)
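
The class counts above are uneven (Class_2 has 16122 rows, Class_1 only 1929), and the split in this notebook is not stratified. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses toy labels (an assumption) rather than the Otto data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 90 samples of class 0, 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 9:1 class ratio in both parts.
X_toy_tr, X_toy_te, y_toy_tr, y_toy_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Per-class counts in the train part and in the test part.
print(np.bincount(y_toy_tr), np.bincount(y_toy_te))
```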

In [14]:
print(X_train.shape)
print(y_train.shape)
print(type(X_train))
print(type(y_train))
print(X_test.shape)
print(y_test.shape)
print(type(X_test))
print(type(y_test))

(49502, 93)
(49502,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(12376, 93)
(12376,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [15]:
##### Function to Evaluate the Model

In [16]:
def model_evaluate(y_true,y_pred1,y_pred2):
    ## y_pred1: predicted class labels, y_pred2: predicted class probabilities
    f1score = f1_score(y_true,y_pred1,average="weighted")
    print("F1 Score : ",f1score)
    ## eps is not passed: it was deprecated in scikit-learn 1.3 and later removed; normalize=True is the default
    logloss = log_loss(y_true,y_pred2)
    print("Log Loss for Predicted Probabilities : ",logloss)
## Log loss is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks,
## defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training
## data y_true. The log loss is only defined for two or more labels. For a single sample with true
## label yt in {0,1} and estimated probability yp that yt = 1, the log loss is
## -(yt * log(yp) + (1 - yt) * log(1 - yp))
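
To make the definition concrete, the binary log loss for one sample is `-(yt * log(yp) + (1 - yt) * log(1 - yp))`, averaged over all samples. The sketch below checks a hand computation against `sklearn.metrics.log_loss`; the four labels and probabilities are made-up values used only for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy binary example (assumption): true labels and predicted P(y = 1).
y_true_toy = np.array([1, 0, 1, 1])
p_pred_toy = np.array([0.9, 0.2, 0.6, 0.3])

# Per-sample loss -(yt * log(yp) + (1 - yt) * log(1 - yp)), then the mean.
manual_loss = -np.mean(y_true_toy * np.log(p_pred_toy)
                       + (1 - y_true_toy) * np.log(1 - p_pred_toy))

# The two numbers should agree.
print(manual_loss, log_loss(y_true_toy, p_pred_toy))
```

The last sample (true label 1, predicted probability 0.3) contributes the largest term, which is how confident wrong predictions inflate the log loss.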

### Let's train the base learners on the full X_train dataset

**SGD**

In [17]:
sgd_clf = SGDClassifier(max_iter=40000,tol=0.001,loss="log_loss")  ## logistic loss; this option is named "log" in scikit-learn < 1.1
%time sgd_clf.fit(X_train,y_train)
## Predict class labels (y_pred1) and class probabilities (y_pred2)
sgd_y_pred1 = sgd_clf.predict(X_test)
sgd_y_pred2 = sgd_clf.predict_proba(X_test)
### Evaluate the model
model_evaluate(y_test,sgd_y_pred1,sgd_y_pred2)

Wall time: 8.33 s
F1 Score :  0.7276665392288761
Log Loss for Predicted Probabilities :  0.6988421299318078


**KNN**

In [18]:
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train,y_train)
knn_pred1 = knn_clf.predict(X_test)
knn_pred2 = knn_clf.predict_proba(X_test)
model_evaluate(y_test,knn_pred1,knn_pred2)

Wall time: 3.76 s
F1 Score :  0.7739562191425366
Log Loss for Predicted Probabilities :  2.3086776076579794


**Random Forest**

In [19]:
rf_clf = RandomForestClassifier(n_estimators=100)
%time rf_clf.fit(X_train,y_train)
rf_pred1 = rf_clf.predict(X_test)
rf_pred2 = rf_clf.predict_proba(X_test)
model_evaluate(y_test,rf_pred1,rf_pred2)

Wall time: 11.8 s
F1 Score :  0.800866394700877
Log Loss for Predicted Probabilities :  0.6014111699538554


**ExtraTree Classifier**

In [20]:
Extree_clf = ExtraTreesClassifier(n_estimators=50)
%time Extree_clf.fit(X_train,y_train)
Extree_pred1 = Extree_clf.predict(X_test)
Extree_pred2 = Extree_clf.predict_proba(X_test)
model_evaluate(y_test,Extree_pred1,Extree_pred2)

Wall time: 6.88 s
F1 Score :  0.7941928401694024
Log Loss for Predicted Probabilities :  0.6891734010963404


**Multilayer Perceptron Classifier**

In [21]:
mlp_clf = MLPClassifier(max_iter=400)
%time mlp_clf.fit(X_train,y_train)
mlp_pred1 = mlp_clf.predict(X_test)
mlp_pred2 = mlp_clf.predict_proba(X_test)
model_evaluate(y_test,mlp_pred1,mlp_pred2)

Wall time: 1min 53s
F1 Score :  0.7739095996725924
Log Loss for Predicted Probabilities :  0.8608981965927792


**AdaBoost**

In [22]:
ada_clf = AdaBoostClassifier()
%time ada_clf.fit(X_train,y_train)
ada_pred1 = ada_clf.predict(X_test)
ada_pred2 = ada_clf.predict_proba(X_test)
## It's important to remember log loss does not have an upper bound. Log loss exists on the range [0, ∞)
model_evaluate(y_test,ada_pred1,ada_pred2)

Wall time: 5.55 s
F1 Score :  0.6744072921883629
Log Loss for Predicted Probabilities :  2.0261936531308335


**NaiveBayes Classifier**

In [23]:
nb_clf = GaussianNB()
%time nb_clf.fit(X_train,y_train)
nb_pred1 = nb_clf.predict(X_test)
nb_pred2 = nb_clf.predict_proba(X_test)
model_evaluate(y_test,nb_pred1,nb_pred2)

Wall time: 116 ms
F1 Score :  0.6286302230265768
Log Loss for Predicted Probabilities :  7.251880092416661


**XGBoost Classifier**

In [25]:
xgb_clf = XGBClassifier(num_class = 9,objective="multi:softprob",eval_metric="mlogloss")
%time xgb_clf.fit(X_train,y_train)
xgb_pred1 = xgb_clf.predict(X_test)
xgb_pred2 = xgb_clf.predict_proba(X_test)
model_evaluate(y_test,xgb_pred1,xgb_pred2)

F1 Score :  0.7578247463382258
Log Loss for Predicted Probabilities :  0.6483268005374851


Both `multi:softmax` and `multi:softprob` are XGBoost objectives for multiclass classification; they differ only in their output. With `multi:softmax` you get the class with the highest probability, while with `multi:softprob` you get a matrix with the predicted probability of each class.
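
The relationship between the two outputs can be illustrated with plain NumPy. This is not XGBoost code: the raw scores are hypothetical, and the names `softprob_like` and `softmax_like` are chosen only to mirror the two objectives.

```python
import numpy as np

# Hypothetical raw scores for 2 samples and 3 classes (assumption).
raw_scores = np.array([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])

# Softmax transform: exponentiate (shifted for numerical stability) and normalise each row.
exp_scores = np.exp(raw_scores - raw_scores.max(axis=1, keepdims=True))
softprob_like = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# "softprob"-style output: one probability per class, each row sums to 1.
# "softmax"-style output: only the index of the most probable class.
softmax_like = softprob_like.argmax(axis=1)

print(softprob_like.round(3))
print(softmax_like)
```

Log loss needs the full probability matrix, which is why the models in this notebook are configured with `multi:softprob`.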

**XGBoost Classifier with GridSearch**

In [26]:
xgb_gs_clf = XGBClassifier(num_class = 9,objective="multi:softprob",eval_metric="mlogloss",n_jobs=8)
gs_param = {'max_depth': [10], 'n_estimators': [300], 'gamma': [0.03], 'learning_rate': [0.08], 'min_child_weight': [5], 'colsample_bytree': [0.8], 'subsample': [0.85]}
gs = GridSearchCV(xgb_gs_clf,param_grid=gs_param,cv =2, scoring="neg_log_loss")
gs.fit(X_train,y_train)


GridSearchCV(cv=2, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, eval_metric='mlogloss',
                                     gamma=0, learning_rate=0.1,
                                     max_delta_step=0, max_depth=3,
                                     min_child_weight=1, missing=None,
                                     n_estimators=100, n_jobs=8, nthread=None,
                                     num_class=9, objective='multi:softprob',
                                     random_state=0...bda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'colsample_bytree': [0.8], 'gamma': [0.03],
                         'learning_rate': [0.08], 'max

In [27]:
xgb_gs_pred1 = gs.predict(X_test)
xgb_gs_pred2 = gs.predict_proba(X_test)
model_evaluate(y_test,xgb_gs_pred1,xgb_gs_pred2)

F1 Score :  0.8246957240832267
Log Loss for Predicted Probabilities :  0.44924238754435664


**Create a new dataset from the base learners' predicted values**

In [28]:
new_dataset = pd.DataFrame({"SGD":sgd_y_pred1,"KNN":knn_pred1,
                           "Random Forest":rf_pred1,"Extree":Extree_pred1,
                           "MLP":mlp_pred1,"Adaboost":ada_pred1,
                           "NaiveBayes":nb_pred1,"XGBoost_GS":xgb_gs_pred1})
new_dataset.head()

   SGD  KNN  Random Forest  Extree  MLP  Adaboost  NaiveBayes  XGBoost_GS
0    2    2              2       2    7         2           4           7
1    7    7              7       7    7         1           3           7
2    6    6              6       6    6         6           6           6
3    4    6              6       6    6         4           9           6
4    6    6              6       6    6         6           6           6


As the diagrams at the start show, stacking uses the predictions of the base learners as input for training a second-level model. However, we cannot simply train the base models on the full training data, generate predictions on the full test set, and feed those to the second-level model: training it would then require the test labels, and predictions made on rows a model has already seen are overly optimistic. We have also seen that F1 score and log loss differ widely among the base learners.
Instead, we will use K-fold splitting of the training data, so that every training row gets a prediction from a base model that was not trained on it.
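
One way to picture these out-of-fold predictions is scikit-learn's `cross_val_predict`, which returns, for every row, a prediction from a model fitted on the other folds. This is a sketch of the idea, not the helper defined later in this notebook; the synthetic data and the KNN base learner are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (assumption).
X_oof, y_oof = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each row's probabilities come from a model that never saw that row.
oof_proba = cross_val_predict(KNeighborsClassifier(), X_oof, y_oof,
                              cv=cv, method="predict_proba")

# One row per training sample, one column per class.
print(oof_proba.shape)
```

The resulting matrix has the same number of rows as the training set, so it can serve directly as training input for a second-level model.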

#### Let's build the stacking model by splitting the training data for the base learners

In [24]:
## Convert X_train, X_test, y_train and y_test to numpy ndarrays
X_train_np = X_train.values
X_test_np = X_test.values
y_train_np = y_train.values.ravel()
y_test_np = y_test.values.ravel()

In [25]:
print(X_train_np.shape)
print(y_train_np.shape)
print(type(X_train_np))
print(type(y_train_np))
print(X_test_np.shape)
print(y_test_np.shape)
print(type(X_test_np))
print(type(y_test_np))

(49502, 93)
(49502,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(12376, 93)
(12376,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [26]:
n_folds = 5  ## Number of K-fold splits (independent of the number of base learners, which also happens to be 5)
n_class = len(df_train.target.unique())
kf = KFold(n_splits= n_folds, shuffle=True, random_state=42)
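
A quick check of what `KFold` provides: across the 5 splits, every row index lands in exactly one held-out fold, and the train and held-out indices of a split never overlap. The sketch uses 20 dummy rows and a separate `kf_demo` object (both assumptions) so it does not touch the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import KFold

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
X_idx = np.arange(20).reshape(-1, 1)   # 20 dummy rows (assumption)

# Count how many times each row appears in a held-out fold.
held_out_count = np.zeros(20, dtype=int)
overlap_found = False
for train_index, val_index in kf_demo.split(X_idx):
    if set(train_index) & set(val_index):
        overlap_found = True
    held_out_count[val_index] += 1

print(held_out_count, overlap_found)
```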

In [27]:
## Build the stage-1 (out-of-fold) features for one base learner using K-fold
def get_train_test(clf, x_train, y_train, x_test):   ## x_train, y_train and x_test must be np.ndarray
    train_stage1 = np.zeros((x_train.shape[0],n_class)) ## stage-1 training features, shape (row count, 9)
    test_stage1 = np.zeros((x_test.shape[0],n_class))  ## stage-1 test features, shape (row count, 9)

    for i,(train_index,test_index) in enumerate(kf.split(x_train)):

        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]   ## held-out fold

        clf.fit(x_tr, y_tr)

        pred_test1 = clf.predict_proba(x_te)
        train_stage1[test_index,:] = pred_test1  ## out-of-fold predictions fill the held-out rows

        pred_test2 = clf.predict_proba(x_test)
        test_stage1 =  test_stage1 + pred_test2  ## accumulate test predictions across folds

    return train_stage1, test_stage1/n_folds   ## test predictions are averaged over the folds

In [29]:
## Choose the base learners (get_train_test refits each of them on every fold)
### Stage 1 models
base1_clf = sgd_clf
base2_clf = mlp_clf
base3_clf = knn_clf
base4_clf = Extree_clf
base5_clf = ada_clf
### Stage 2 model
final_clf = rf_clf  

#### Build Stacking model

In [30]:
base1_train,base1_test = get_train_test(base1_clf,X_train_np,y_train_np,X_test_np)
base2_train,base2_test = get_train_test(base2_clf,X_train_np,y_train_np,X_test_np)
base3_train,base3_test = get_train_test(base3_clf,X_train_np,y_train_np,X_test_np)
base4_train,base4_test = get_train_test(base4_clf,X_train_np,y_train_np,X_test_np)
base5_train,base5_test = get_train_test(base5_clf,X_train_np,y_train_np,X_test_np)

In [31]:
## Now use output of these base learners as input for final model
X_train_f = np.concatenate((base1_train,base2_train,base3_train,base4_train,base5_train),axis = 1)
X_test_f = np.concatenate((base1_test,base2_test,base3_test,base4_test,base5_test),axis = 1)

In [32]:
X_train_f.shape,X_test_f.shape

((49502, 45), (12376, 45))

In [33]:
## Train the final model
%time final_clf.fit(X_train_f,y_train_np)

Wall time: 20.2 s


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [34]:
## Evaluate the final model
final_pred1 = final_clf.predict(X_test_f)
final_pred2 = final_clf.predict_proba(X_test_f)
model_evaluate(y_test,final_pred1,final_pred2)

F1 Score :  0.8290447215920956
Log Loss for Predicted Probabilities :  0.5566684256504653


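After stacking, the F1 score (0.829) is higher than that of any individual model above, and the log loss (0.557) is lower than that of every base learner in the stack, although it is still above the tuned XGBoost model's 0.449. Using XGBoost as the second-level model would probably give a better result.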
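
scikit-learn (0.22 and later) also ships `StackingClassifier`, which builds the second-level training features from cross-validated predictions in a similar way. The sketch below is an alternative to the manual helper, not a reproduction of it: it runs on synthetic data (an assumption) with only two base learners, and unlike `get_train_test` it refits each base learner on the full training set for test-time predictions instead of averaging the per-fold models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (assumption).
X_s, y_s = make_classification(n_samples=500, n_features=10, n_informative=6,
                               n_classes=3, random_state=0)
X_s_tr, X_s_te, y_s_tr, y_s_te = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("extree", ExtraTreesClassifier(n_estimators=25, random_state=0))],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5,                          # folds used to build the second-level training features
    stack_method="predict_proba",  # feed class probabilities to the final estimator
)
stack.fit(X_s_tr, y_s_tr)

stack_proba = stack.predict_proba(X_s_te)
print(stack_proba.shape)
```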