This notebook explores stacking, blending, and ensembling as part of the model-building process. It contains what I learned and implemented from Abhishek Thakur's awesome video tutorial on YouTube. The link is below:
https://www.youtube.com/watch?v=TuIgtitqJho&ab_channel=AbhishekThakur

To start, let's answer some questions that came to me while watching the tutorial; then we will move on to the implementation.

**1. Where can we use these methods?<br>**
These techniques were initially used mostly in competitions, but they are now widely used to improve models in various organizations and open-source projects. I have used stacking in one of my projects at work and observed a significant improvement in model accuracy.<br>

**2. How to implement?<br>**
In the video, the implementation is neatly organized into multiple Python scripts. Here I implement the same in the form of functions, as I am using a notebook.
<br>
<br>

P.S.: Do watch Abhishek's video on this topic and other topics as well.

## Import Required Libraries

In [None]:
from sklearn import model_selection,linear_model,metrics
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn import decomposition,ensemble
from sklearn.preprocessing import StandardScaler
from scipy.optimize import fmin
from functools import partial
import xgboost

import numpy as np 
import pandas as pd 
import glob
import time
import os

import warnings
warnings.filterwarnings("ignore")

In [None]:
## getting all the data files available
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Creating Data Folds
1. It takes the path of the input data and the number of folds as arguments.
2. We use stratified K-fold here so that each fold maintains the class distribution of the data.
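
To see what stratification buys us, here is a minimal sketch on made-up data. The `toy` frame and its 20% positive rate are assumptions for illustration only and are not part of the tutorial dataset.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 100 rows, 20 of them positive
toy = pd.DataFrame({
    "review": [f"row {i}" for i in range(100)],
    "sentiment": [1] * 20 + [0] * 80,
})
toy = toy.sample(frac=1, random_state=42).reset_index(drop=True)
toy["kfold"] = -1

skf = StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(skf.split(X=toy, y=toy["sentiment"].values)):
    toy.loc[valid_idx, "kfold"] = fold

# Share of positive rows in each fold; stratification keeps it at the overall rate
fold_pos_rate = toy.groupby("kfold")["sentiment"].mean()
print(fold_pos_rate)
```

With a plain `KFold` the positive rate could drift from fold to fold, which makes per-fold AUC values harder to compare.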

In [None]:
def create_folds(df_path,num_folds):
    df  = pd.read_csv(df_path,sep='\t')
    df.loc[:,"kfold"] = -1
    
    ## Shuffle the data randomly: frac=1 returns all records in shuffled order.
    ## To extract a random subsample instead, change the value of the frac parameter.
    
    df = df.sample(frac=1).reset_index(drop=True) 
    
    y = df.sentiment.values
    skf = model_selection.StratifiedKFold(n_splits=num_folds)
    
    ## t_ : indices for training, v_: indices for validation, f : fold number
    for f,(t_,v_) in enumerate(skf.split(X=df,y=y)):
        df.loc[v_,"kfold"] = f
        
    df.to_csv(f"train_folds_{num_folds}.csv",index=False)
    


In [None]:
create_folds("/kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip",5)
pd.read_csv("train_folds_5.csv").kfold.value_counts()

## Model Development

<b>Classification Model-</b><br>
1.We are using logistic regression and random forest<br>
2.AUC is taken as the evaluation metric<br>
3.We use predict_proba because the AUC calculation requires probability values
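
A small sketch of why probabilities matter for AUC, using made-up labels and scores (the arrays below are assumptions for illustration). AUC measures how well the scores rank positives above negatives, so thresholding to hard labels can throw that ranking away.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.55, 0.70, 0.80, 0.90])  # hypothetical predict_proba output
hard = (proba >= 0.5).astype(int)           # every row becomes class 1

auc_proba = roc_auc_score(y_true, proba)    # 1.0: all positives outrank all negatives
auc_hard = roc_auc_score(y_true, hard)      # 0.5: constant labels carry no ranking
print(auc_proba, auc_hard)
```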


<b>Feature Creation -</b><br>
1.We are using TF-IDF and count vectorization to create features from the text column "review"<br>
2.We are using SVD on top of TF-IDF for feature transformation
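
As a rough sketch of the TF-IDF followed by SVD step, here is the same pattern on a tiny invented corpus. The documents and `n_components=2` are assumptions chosen to keep the example small; the notebook itself uses 1000 TF-IDF features and 120 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and validation texts
train_docs = [
    "a great and moving film",
    "a dull and boring film",
    "great acting and a great plot",
    "boring plot and dull acting",
]
valid_docs = ["a moving plot", "dull film"]

# Fit on the training texts only, then reuse the fitted vocabulary for validation
tfidf = TfidfVectorizer()
x_train_sparse = tfidf.fit_transform(train_docs)
x_valid_sparse = tfidf.transform(valid_docs)

# Compress the sparse TF-IDF matrix into a few dense components
svd = TruncatedSVD(n_components=2, random_state=0)
x_train = svd.fit_transform(x_train_sparse)
x_valid = svd.transform(x_valid_sparse)

print(x_train_sparse.shape, x_train.shape, x_valid.shape)
```

Fitting both transformers on the training fold only, as `run_training` does, keeps information from the validation fold out of the features.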



In [None]:
model_dict = {'lr':linear_model.LogisticRegression(),'rf':ensemble.RandomForestClassifier()}
feat_create_dict = {'tfidf':TfidfVectorizer(max_features=1000),'cntvec':CountVectorizer(),'svd':decomposition.TruncatedSVD(n_components=120)}

In [None]:
def run_training(path,fold,feat_create_mode,model_name):
    df = pd.read_csv(path)
    
    df_train = df[df.kfold!=fold].reset_index(drop=True)
    df_valid = df[df.kfold==fold].reset_index(drop=True)

    if feat_create_mode!='svd':
        featvec = feat_create_dict[feat_create_mode]
        featvec.fit(df_train.review.values)

        xtrain = featvec.transform(df_train.review.values)
        xvalid = featvec.transform(df_valid.review.values)
    else :
        featvec = feat_create_dict['tfidf']
        featvec.fit(df_train.review.values)

        xtrain = featvec.transform(df_train.review.values)
        xvalid = featvec.transform(df_valid.review.values)
        
        svd = feat_create_dict[feat_create_mode]
        svd.fit(xtrain)
        
        xtrain = svd.transform(xtrain)
        xvalid = svd.transform(xvalid)
        
    
    ytrain = df_train.sentiment.values
    yvalid = df_valid.sentiment.values
    
    clf = model_dict[model_name]
    clf.fit(xtrain,ytrain)
    
    ypred = clf.predict_proba(xvalid)[:,1]
    auc = metrics.roc_auc_score(yvalid,ypred)
    
    print(f"fold = {fold} AUC = {auc}")
    
    df_valid.loc[:,f'{feat_create_mode}_{model_name}_pred'] = ypred
    
    return df_valid[['id','sentiment','kfold',f'{feat_create_mode}_{model_name}_pred']]

In [None]:
for key_feat,val_feat in  feat_create_dict.items():
    for key_model,val_model in model_dict.items():
        strt = time.time()
        print(f" Training model feature creation:{key_feat} and model:{key_model}")
        dfs = []
        for j in range(5):
            temp_df = run_training("train_folds_5.csv",j,key_feat,key_model)
            dfs.append(temp_df)
        fin_valid_df = pd.concat(dfs)
        fin_valid_df.to_csv(f"{key_feat}_{key_model}.csv",index=False)
        end = time.time()
        print(f"Time Taken: {end-strt}secs")
        print(fin_valid_df.shape)

## Combining all Model Results
From the above, we get 6 different models (3 feature-creation methods and 2 classifiers). Now we are going to see how to combine their predictions to get a better overall AUC by blending, stacking, or ensembling.

In [None]:
files = glob.glob("*.csv")
files.remove("train_folds_5.csv")
df = pd.DataFrame()
for f in files :
    if len(df)<=0:
        df = pd.read_csv(f)
    else :
        temp_df = pd.read_csv(f).drop(columns = ['sentiment','kfold'])
        df = pd.merge(df,temp_df,on=['id'],how='left')
pred_cols = [col for col in df.columns if col.find("pred")>=0]
targets = df.sentiment.values

pred_dict = {col:df[col].values for col in pred_cols}
pred_rank_dict = {col:df[col].rank().values for col in pred_cols}

## Getting AUC for all models separately 
for col in pred_cols:
    auc = metrics.roc_auc_score(targets,df[col].values)
    print(f"pred_col = {col}; overall_auc ={auc}")

## Blending
In this section, we are going to try the following ways of combining all the predictions:
1. Average 
2. Weighted Average : Manual Setting of Weights
3. Rank Average : something I first learned from the video tutorial
4. Weighted Rank Average

Manual setting of weights :<br>

As visible from the overall AUC scores above, the cntvec_lr_pred and tfidf_lr_pred models perform best. So, we will try giving more weight to these models while combining.
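
The four blending variants can be sketched on a toy example as follows. The labels, the two prediction columns, and the weights are made up for illustration. Rank averaging replaces each score by its rank within the column, so a model whose scores sit in a narrow range contributes as much as one that uses the full range.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predictions from two models on different scales
y = np.array([0, 0, 1, 1, 0, 1])
preds = pd.DataFrame({
    "model_a": [0.10, 0.40, 0.35, 0.80, 0.20, 0.70],
    "model_b": [0.48, 0.49, 0.52, 0.51, 0.47, 0.53],  # squeezed into a narrow range
})
weights = pd.Series({"model_a": 1, "model_b": 3})  # manually chosen weights

ranks = preds.rank()  # per column: 1 = lowest score, n = highest score

blends = {
    "average": preds.mean(axis=1),
    "weighted average": (preds * weights).sum(axis=1) / weights.sum(),
    "rank average": ranks.mean(axis=1),
    "weighted rank average": (ranks * weights).sum(axis=1) / weights.sum(),
}
for name, blended in blends.items():
    print(f"{name}: AUC = {roc_auc_score(y, blended):.4f}")
```

Because AUC depends only on the ordering of the scores, multiplying one model's predictions by a positive constant changes a plain average but leaves a rank average untouched.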

In [None]:
print("-------------------------------------------")
print("Blending Results")
print("-------------------------------------------")

print("average")
avg_pred = df[pred_cols].mean(axis=1).values
print(metrics.roc_auc_score(targets,avg_pred))

print("-------------------------------------------")
print("weighted average")
wt_dict = {col:1 for col in pred_cols}
wt_dict['tfidf_lr_pred'] = 3
print("weights used")
print(wt_dict)
avg_pred = np.sum(np.array([val*wt_dict[key] for key,val in pred_dict.items()]),axis=0)/sum(list(wt_dict.values()))
print(metrics.roc_auc_score(targets,avg_pred))

print("-------------------------------------------")
print("rank averaging")
## average the per-model ranks (pred_rank_dict), not the raw probabilities
avg_pred  = np.mean(np.array([val for key,val in pred_rank_dict.items()]),axis=0)
print(metrics.roc_auc_score(targets,avg_pred))

print("-------------------------------------------")
print("weighted rank averaging")
wt_rank_dict = {col:1 for col in pred_cols}
wt_rank_dict['tfidf_lr_pred'] = 3
print("weights used")
print(wt_rank_dict)
avg_pred = np.sum(np.array([val*wt_rank_dict[key] for key,val in pred_rank_dict.items()]),axis=0)/sum(list(wt_rank_dict.values()))
print(metrics.roc_auc_score(targets,avg_pred))

print("-------------------------------------------")
print("weighted rank averaging")
wt_rank_dict = {col:1 for col in pred_cols}
wt_rank_dict['cntvec_lr_pred'] = 3
print("weights used")
print(wt_rank_dict)
avg_pred = np.sum(np.array([val*wt_rank_dict[key] for key,val in pred_rank_dict.items()]),axis=0)/sum(list(wt_rank_dict.values()))
print(metrics.roc_auc_score(targets,avg_pred))

As we can see from the previous section, the best single-model AUC is 0.9457357184, from the cntvec_lr_pred model. After combining, weighted rank averaging gives 0.9503327391999999, an improvement of about 0.005. The next step is to find the optimal weights automatically instead of setting them manually. Let's see how to do that:

### Optimizing the Weighting Parameters for Blending

Below, we obtain optimal weights by treating all prediction columns as features, with the objective of maximizing AUC.
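
Before the real optimizer, here is a minimal sketch of the sign-flip trick on a toy objective. `scipy.optimize.fmin` is a derivative-free minimizer (Nelder-Mead simplex), so to maximize a score we hand it the negated score. The `negative_score` function and the `target` weights are hypothetical stand-ins for the AUC objective.

```python
import numpy as np
from functools import partial
from scipy.optimize import fmin

def negative_score(weights, target):
    # Toy "score" that peaks when weights == target (stand-in for AUC)
    score = -np.sum((weights - target) ** 2)
    # fmin minimizes, so return the negated score to maximize it
    return -1.0 * score

target = np.array([0.7, 0.3])            # hypothetical best weights
loss = partial(negative_score, target=target)

best_weights = fmin(loss, np.array([0.5, 0.5]), disp=False)
print(best_weights)
```

A derivative-free method is a reasonable fit here because AUC changes only when the ordering of predictions changes, so it offers no useful gradient with respect to the weights.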

In [None]:
## custom optimizer class that learns the blending weights
class OptimizeAUC():
    def __init__(self):
        self.coef_ = 0
    
    ## Calculates the AUC for the given weights and multiplies it by -1,
    ## because fmin minimizes its objective; minimizing -AUC maximizes AUC
    def _auc(self,coef,X,y):
        X_coef = X*coef
        predictions = np.sum(X_coef,axis=1)
        auc_score = metrics.roc_auc_score(y,predictions)
        return -1.0*auc_score
    
    ## Starts the optimization. Here the weights are initialized from a Dirichlet distribution;
    ## any other initial values could also be used
    def fit(self,X,y):
        partial_loss = partial(self._auc,X=X,y=y)
        init_coef = np.random.dirichlet(np.ones(X.shape[1]))
        self.coef_ = fmin(partial_loss,init_coef,disp=True)
    
    ## function to make prediction using weights obtained while training
    def predict(self,X):
        x_coef = X*self.coef_
        predictions = np.sum(x_coef,axis=1)
        return predictions

In [None]:
## fits a weighting model on one fold's training predictions and returns the learned weights
def run_training_wts(pred_df,fold,pred_cols,model_name,std=False):

    train_df = pred_df[pred_df.kfold!=fold].reset_index(drop=True)
    valid_df = pred_df[pred_df.kfold==fold].reset_index(drop=True)
    
    xtrain = train_df[pred_cols].values
    xvalid = valid_df[pred_cols].values
    
    if std:
        std_ = StandardScaler()
        std_.fit(xtrain)
        
        xtrain = std_.transform(xtrain)
        xvalid = std_.transform(xvalid)
    
    opt = model_dict[model_name]
    opt.fit(xtrain,train_df.sentiment.values)
    if model_name != 'lr':
        preds = opt.predict(xvalid)
    else :
        preds = opt.predict_proba(xvalid)[:,1]
    
    auc = metrics.roc_auc_score(valid_df.sentiment.values,preds)
    
    print(f"fold={fold} auc={auc}")

    return opt.coef_

In [None]:
model_dict = {'lr':linear_model.LogisticRegression(),'custom_opt':OptimizeAUC(),'linear':linear_model.LinearRegression()}
for key_model,val_model in model_dict.items():
    for std in [True,False]:
        print(f"model:{key_model}; Scaling:{std}")
        coefs = []
        for j in range(5):
            fold_coefs = run_training_wts(df,j,pred_cols,key_model,std)
            coefs.append(fold_coefs)
        coefs  = np.mean(np.array(coefs),axis=0)
        print(coefs)
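        ## lr stores coef_ with shape (1, n_features), hence the extra index in the else branch.
        ## Note: the averaged weights are applied to the unscaled predictions, even when std=True.
        if key_model!='lr':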
            wt_avg = np.sum(np.array([coefs[idx]*df[col].values for idx,col in enumerate(pred_cols)]),axis=0)
        else :
            wt_avg = np.sum(np.array([coefs[0,idx]*df[col].values for idx,col in enumerate(pred_cols)]),axis=0)
            
        auc = metrics.roc_auc_score(targets,wt_avg)
        print(f"optimal coefs overall auc = {auc}")
        print("==================================")


So, we again see an improvement in AUC over the previous result, from 0.9503327391999999 to 0.9535015936000001, by using the weights obtained from the custom optimization function without scaling the prediction values. In this way, we can find weights that maximize the AUC gain from weighted blending.

## Stacking 

Stacking means placing another model on top of the base models: the base models' predictions are used as features for a second model that makes the final predictions.
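
As a compact sketch of the idea, the example below builds out-of-fold predictions with scikit-learn's `cross_val_predict` and fits a second-level model on them. The synthetic dataset and the choice of logistic regression as the second-level model are assumptions for illustration; the notebook itself uses saved fold predictions and XGBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data (stand-in for the review features)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

base_models = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Level 1: out-of-fold probabilities, one column per base model
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Level 2: a model trained on the base predictions, again scored out of fold
meta_model = LogisticRegression()
meta_oof = cross_val_predict(meta_model, oof, y, cv=cv, method="predict_proba")[:, 1]

stack_auc = roc_auc_score(y, meta_oof)
print(f"stacked AUC = {stack_auc:.4f}")
```

Using out-of-fold predictions as the level-2 features matters: each row's feature comes from a model that never saw that row's label.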

In [None]:
def run_training_stack(pred_df,fold,pred_cols):

    train_df = pred_df[pred_df.kfold!=fold].reset_index(drop=True)
    valid_df = pred_df[pred_df.kfold==fold].reset_index(drop=True)
    
    xtrain = train_df[pred_cols].values
    xvalid = valid_df[pred_cols].values
    
    clf = xgboost.XGBClassifier()
    clf.fit(xtrain,train_df.sentiment.values)
    preds = clf.predict_proba(xvalid)[:,1]
    
    auc = metrics.roc_auc_score(valid_df.sentiment.values,preds)
    print(f"fold={fold} auc={auc}")

    valid_df.loc[:,"xgb_pred"] = preds
    
    return valid_df

In [None]:
dfs = []
for j in range(5):
    temp_df = run_training_stack(df,j,pred_cols)
    dfs.append(temp_df)
fin_valid_df = pd.concat(dfs)
fin_valid_df.to_csv("xgb.csv",index=False)
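## score against the labels carried in fin_valid_df so that labels and predictions stay row-aligned
auc = metrics.roc_auc_score(fin_valid_df.sentiment.values,fin_valid_df.xgb_pred.values)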
print(f"AUC using Xgboost : {auc}")

Further Experiments :
1. Here, we are getting an AUC score lower than in the previous section. We can improve this by tuning XGBoost or experimenting with other models.
2. We can also try adding the predictions from the previous models (the ones before stacking) as features to the main training data and building a model on top of that.
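
One possible way to set up the second experiment is sketched below: the out-of-fold predictions are appended as extra columns to the original sparse text features. Both matrices are tiny hypothetical stand-ins, not outputs of the notebook.

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse text features (e.g. TF-IDF output) for 4 rows
x_text = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.9],
    [0.2, 0.1, 0.0],
]))

# Hypothetical out-of-fold predictions from two base models for the same rows
oof_preds = np.array([
    [0.2, 0.3],
    [0.8, 0.7],
    [0.6, 0.9],
    [0.1, 0.2],
])

# Append the prediction columns to the text features, staying sparse
x_augmented = sparse.hstack([x_text, sparse.csr_matrix(oof_preds)], format="csr")
print(x_augmented.shape)
```

The rows of both matrices must be in the same order, and the predictions should be out-of-fold ones; otherwise the new columns leak the labels into the features.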

# **Again, thanks to Abhishek Thakur for making these tutorial videos and making them easy to understand.**