# [TPS - May] Stacking+Blending+Pseudolabelling+Averaging 🥇
![tab](https://i.imgur.com/uHVJtv0.png)

**Description:** The dataset in this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes of the listing. Although the features are anonymized, they have properties relating to real-world features.

**Solution:** Two levels of modeling. (1) Define single models (including some taken from public notebooks of this competition) and obtain their predictions. (2) Then define a metamodel trained on the predictions from the previous step.

<div class="alert alert-block alert-success"> Hello everybody 👋 This is my first notebook in the TPS community. I hope you'll like it. 😊</div>


I present a possible solution for this competition. I'm relatively new to Kaggle, so if you find any improvement to the proposal or the code, please let me know.

![stackedml](https://storage.googleapis.com/kagglesdsdata/datasets/1251020/2086737/stacked.PNG?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210529%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210529T233928Z&X-Goog-Expires=345599&X-Goog-SignedHeaders=host&X-Goog-Signature=5fb5391b275b6879052666f8134f5d83ed8a3d359cdf89c23675c2300fdf20fcd7b0c1e52731aca74c09cf884a81338e768f8b2104f00c2ed519b5b45b70377fd1fe4f883e4f3f5cee125b982f376c68e8a57a9ab15786d44f3feb566dfde02d32210889fe17c08ee4d6aff60364c401dfd10d4c26d923736adface3063026e4b6eec0db5feb27cea05e778428232e97e830f57c8abe2e025c50d28665962f8bf9852754091f90844ab40ce708131729093110969f6a812a5be0e9c0f56f04465c5f3e2db951b1aa229204066c6da3783dfbdfa592efe5a0bd3fb2798f3f39acfeb88d75f426bee0008631359ad73865aafbba32e3a37f0cb1aa65c7a794adb4)

I made this diagram for the **TPS - March Competition [Top 1%]** ([link](https://www.kaggle.com/c/tabular-playground-series-mar-2021/leaderboard)). I think it is a good general workflow. It shows which information you should save and how you reach the 2nd stage. You can add more ideas (e.g. pseudolabelling, leakage, selection of the best models, and post-processing) to get a better score. More information [here](https://mlwave.com/kaggle-ensembling-guide/) and in the additional resources.

So this time I propose the same approach. In the 1st stage I obtained the following models: 11 LightGBM, 4 XGBoost, 7 CatBoost, 1 Keras, 1 DeepTables, 2 logistic regressions, and 5 LightAutoML.
Some models come from my own ideas and others from the competition's notebooks (not all of them: some are leaky, and my stacking or blending ensemble showed weird results with them). The 2nd stage is presented in this notebook.

P.S.: You can now see the corresponding dataset!
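
To make the 1st stage concrete, here is a minimal sketch of how one base model could produce the two arrays that the 2nd stage needs: out-of-fold (OOF) predictions on train and fold-averaged predictions on test. The function name `first_stage_predictions`, the toy data, and the logistic regression are assumptions for illustration only, not the exact code behind the models listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def first_stage_predictions(model, X, y, X_test, n_splits=5, seed=60):
    """Return OOF train predictions, fold-averaged test predictions and the OOF log loss."""
    n_classes = len(np.unique(y))
    oof = np.zeros((len(X), n_classes))
    test_pred = np.zeros((len(X_test), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for idx_tr, idx_vl in skf.split(X, y):
        model.fit(X[idx_tr], y[idx_tr])
        # Each train row is predicted by a model that never saw it
        oof[idx_vl] = model.predict_proba(X[idx_vl])
        # Test predictions are averaged over the folds
        test_pred += model.predict_proba(X_test) / n_splits
    return oof, test_pred, log_loss(y, oof)

# Toy data standing in for the competition files (assumption)
X_all, y_all = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=4, random_state=0)
X_toy, y_toy, X_toy_test = X_all[:400], y_all[:400], X_all[400:]
oof_toy, test_toy, score = first_stage_predictions(
    LogisticRegression(max_iter=1000), X_toy, y_toy, X_toy_test)
print(oof_toy.shape, test_toy.shape, round(score, 4))
# Saving with the CV score in the file name keeps models easy to compare, e.g.:
# np.save(f"train_logreg_{score}.npy", oof_toy); np.save(f"test_logreg_{score}.npy", test_toy)
```

The important point is that every model uses the same rows in the same order, so the saved arrays can be concatenated column-wise in the 2nd stage.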

In [None]:
# LIBRARIES
# General
import pandas as pd
import numpy as np
import os
import tqdm
import gc
from itertools import chain
from glob import glob
import warnings
warnings.filterwarnings('ignore')
# Preprocessing
from sklearn.preprocessing import QuantileTransformer, StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.neighbors import LocalOutlierFactor
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Hyperparameter Tuning
!pip install optuna
import optuna
from optuna.samplers import TPESampler
# Modeling
import lightgbm as lgb
import catboost as cb
import xgboost as xgb
from sklearn.linear_model import RidgeClassifier,RidgeClassifierCV,Ridge
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeRegressor
#Inference
from sklearn.model_selection import StratifiedKFold,train_test_split
from catboost import CatBoostClassifier, Pool
# Metric
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score,confusion_matrix,classification_report
# Set constants
NUM_FOLDS=10
N_TRIALS = 100
S_TRIALS = 7000
TIME = 3600*6
RANDOM_STATE=60
EARLY_STOPPING_ROUNDS=1000
sub_columns=['Class_1','Class_2','Class_3','Class_4']

In [None]:
# Get data
train=pd.read_csv("../input/tabular-playground-series-may-2021/train.csv",index_col = 0)
test=pd.read_csv("../input/tabular-playground-series-may-2021/test.csv",index_col = 0)
sample=pd.read_csv("../input/tabular-playground-series-may-2021/sample_submission.csv")
# Preprocessing: keep only the rows with unique feature values
features=[f'feature_{i}' for i in range(50)]
idr=train[features].drop_duplicates().index # Index labels of the rows kept after dropping duplicates
# inr=np.load() # https://www.kaggle.com/fusioncenter/keras-embeddings-with-resnet-architecture. This idea makes sense
train=train.loc[idr] # idr holds index labels (ids), so select with .loc instead of .iloc
train.reset_index(inplace=True,drop=True)
# Encoding Target
y_train= LabelEncoder().fit_transform(train['target'])

# Analyze the 1st stage of models
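
Before looking at the distributions, it can help to rank the 1st-stage models by their OOF log loss, which is one simple way to do the "selection of the best models" mentioned earlier. This is only a sketch on toy data: `rank_models` is a hypothetical helper, and the three fake models are random softmax outputs, not the real predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

def rank_models(oof_list, model_names, y_true):
    """Score each OOF probability matrix with log loss and sort ascending (best first)."""
    scores = {name: log_loss(y_true, np.asarray(p))
              for name, p in zip(model_names, oof_list)}
    return pd.Series(scores, name="oof_log_loss").sort_values()

# Toy example (assumption): three fake models with increasing noise
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 4, size=500)

def noisy_probs(noise):
    """Softmax of a one-hot signal plus Gaussian noise."""
    logits = np.eye(4)[y_demo] + rng.normal(scale=noise, size=(500, 4))
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

demo_oof = [noisy_probs(s) for s in (0.5, 1.0, 2.0)]
ranking = rank_models(demo_oof, ["model_a", "model_b", "model_c"], y_demo)
print(ranking)
```

In the notebook, the same helper could be called with the `oof` list, `names` and `y_train` built in the cells below.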

In [None]:
# Get models
train_files=sorted(glob(os.path.join('../input/tps-may-1st-stage-of-modeling/train_*')))
test_files=sorted(glob(os.path.join('../input/tps-may-1st-stage-of-modeling/test_*')))
# del train_files[??] ; del test_files[??] # In a previous analysis I found that some models were leaky (their predictions were worse when re-run). I'm not sure why. (Those models have now been removed.)
names=[file[file.find('train_')+6:-4] for file in train_files] # Strip the 'train_' prefix and the '.npy' extension
train_predictions=pd.DataFrame()
test_predictions=pd.DataFrame()
oof=[]
preds=[]
for n_file,file in enumerate(train_files):
    train_tmp=pd.DataFrame(np.load(file),columns = [f'{names[n_file]}_Class_1',f'{names[n_file]}_Class_2',f'{names[n_file]}_Class_3',f'{names[n_file]}_Class_4'])
    test_tmp=pd.DataFrame(np.load(test_files[n_file]),columns =[f'{names[n_file]}_Class_1',f'{names[n_file]}_Class_2',f'{names[n_file]}_Class_3',f'{names[n_file]}_Class_4'])
    # Prepare the files_1: One dataframe with all classes in each file
    train_predictions=pd.concat([train_predictions,train_tmp],axis=1)
    test_predictions=pd.concat([test_predictions,test_tmp],axis=1)
    # Prepare the files_2: A list of dataframes
    train_tmp.columns=sub_columns
    test_tmp.columns=sub_columns
    oof.append(train_tmp)
    preds.append(test_tmp)

In [None]:
#Let's see the dataset containing all models
train_predictions #dataframe with all fields
# oof # a list of dataframes

In [None]:
# Analyze the models
def see_model(file,yreal=y_train):
    df=pd.DataFrame(np.load(file,allow_pickle=False),columns=sub_columns)
    print(f'Description of the classes:\n {df.describe()}') # TODO: render as a table image
    print(f'Most likely class proportion:\n {dict(df.idxmax(axis=1).value_counts())}') # TODO: render as a pie chart
    print('Distribution of the classes:')
    fig,ax=plt.subplots(figsize=(8.5,6.5),nrows=2,ncols=2)
    sns.histplot(df['Class_1'],ax=ax[0,0],color='khaki')
    sns.histplot(df['Class_2'],ax=ax[0,1],color='steelblue')
    sns.histplot(df['Class_3'],ax=ax[1,0],color='lightcoral')
    sns.histplot(df['Class_4'],ax=ax[1,1],color='palegreen')
    ax[0,0].set_ylabel(' ')
    ax[0,1].set_ylabel(' ')
    ax[1,0].set_ylabel(' ')
    ax[1,1].set_ylabel(' ')
    plt.savefig(f"model_{file[file.find('modeling/')+9:-4]}.png") # Drop the '.npy' extension from the file name

In [None]:
see_model('../input/tps-may-1st-stage-of-modeling/test_lgb_1.0914046652034706.npy') # A dashboard is useful here if you have a lot of models
# for file in glob('../input/tps-may-1st-stage-of-modeling/*.npy'): see_model(file) # An alternative

# Stacking

In [None]:
def stacking(X_train,y_train,X_test):
    skf=StratifiedKFold(n_splits = NUM_FOLDS,shuffle = True,random_state = RANDOM_STATE)
    yv=np.zeros((len(X_train),4))
    yt=np.zeros((len(X_test),4))
    for fold,(idx_tr,idx_vl) in enumerate(skf.split(X_train,y_train)):
        X_tr,y_tr=pd.DataFrame(X_train.iloc[idx_tr]),pd.Series(y_train).iloc[idx_tr]
        X_vl,y_vl=pd.DataFrame(X_train.iloc[idx_vl]),pd.Series(y_train).iloc[idx_vl]
        model = CalibratedClassifierCV(RidgeClassifier(random_state=RANDOM_STATE),cv=10).fit(X_tr,y_tr)
        # Pattern: predict the validation fold, predict the test set, then evaluate
        yv[idx_vl]=model.predict_proba(X_vl)
        yt+=model.predict_proba(X_test)/NUM_FOLDS
        print(f"Found Metric in {fold}:{log_loss(y_vl,yv[idx_vl])}")
    metric=log_loss(y_train,yv)
    print(f'Results of the training: {metric}')
    # Create submission
    sub=pd.DataFrame(yt,columns=sub_columns)
    sub['id']=sample.id
    sub=sub[['id','Class_1','Class_2','Class_3','Class_4']]
    sub.to_csv(f'stacking_{metric}.csv',index=False)
    return yv,sub
def evaluation(y,yreal=y_train):
    metric=log_loss(yreal,y)
    print(f'Results of the training: {metric}')
    print(classification_report(yreal, np.argmax(y, axis = 1), target_names=sub_columns))
    sns.heatmap(pd.DataFrame(confusion_matrix(yreal, np.argmax(y, axis = 1))), annot=True, linewidths=.5, fmt="d")
    plt.savefig(f'evaluation_{metric}.png')

In [None]:
yst,ysf=stacking(train_predictions,y_train,test_predictions) # lb: 1.08563

In [None]:
evaluation(yst)

# Blending
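
The idea behind the blending cells below is a weighted average of the models' probability matrices, with the weights divided by their sum so that every row still sums to 1. Here is a minimal sketch of that idea on toy data. Note the swap: a plain random search stands in for the Optuna TPE sampler used in the notebook, and `blend`, `random_search_weights` and `fake_model` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import log_loss

def blend(pred_list, raw_weights):
    """Convex combination of probability matrices: weights are rescaled to sum to 1."""
    w = np.asarray(raw_weights, dtype=float)
    w = w / w.sum()
    return sum(wi * np.asarray(p) for wi, p in zip(w, pred_list))

def random_search_weights(pred_list, y_true, n_trials=200, seed=60):
    """Try random weight vectors in [0, 3) and keep the one with the lowest log loss."""
    rng = np.random.default_rng(seed)
    best_w = np.ones(len(pred_list))  # start from the plain average
    best_loss = log_loss(y_true, blend(pred_list, best_w))
    for _ in range(n_trials):
        w = rng.uniform(0, 3, size=len(pred_list))
        loss = log_loss(y_true, blend(pred_list, w))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w / best_w.sum(), best_loss

# Toy stand-ins for the OOF matrices (assumption): noisy softmax outputs
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 4, size=300)

def fake_model(noise):
    logits = 2 * np.eye(4)[y_demo] + rng.normal(scale=noise, size=(300, 4))
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

demo_oof = [fake_model(s) for s in (0.5, 1.0, 1.5)]
equal_loss = log_loss(y_demo, blend(demo_oof, [1, 1, 1]))
best_w, best_loss = random_search_weights(demo_oof, y_demo)
print(f"equal weights: {equal_loss:.4f} | searched weights: {best_loss:.4f}")
```

Because the weights are fitted on the same OOF predictions that are scored, the searched loss is optimistic; the leaderboard is the real check.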

In [None]:
class optimizer_1():    
    def __init__(self, metric='mse', n_weights=len(oof),trials=S_TRIALS):
        self.metric = metric
        self.trials = trials
        self.sampler = TPESampler(seed=RANDOM_STATE)
        self.n_weights=n_weights
    def objective(self,trial):
        # One non-negative weight per model; suggest_float replaces the deprecated suggest_uniform
        weights=[trial.suggest_float(f"w_{i}",0,3) for i in range(self.n_weights)]
        sum_weights=sum(weights)
        y_pred=0
        for i in range(self.n_weights): y_pred+=oof[i]*weights[i]/sum_weights
        return log_loss(y_train,np.array(y_pred))
    def optimize(self):
        study = optuna.create_study(direction="minimize", sampler=self.sampler)
        study.optimize(self.objective, n_trials=self.trials)
        # Best weights
        sum_weights_test=0
        for i in range(self.n_weights): sum_weights_test+=study.best_params[f'w_{i}']
        # Get predictions
        yv,yt=0,0
        for i in range(self.n_weights): yv+=oof[i]*study.best_params[f'w_{i}']/sum_weights_test
        for i in range(self.n_weights): yt+=preds[i]*study.best_params[f'w_{i}']/sum_weights_test
        # Save submission and send it 
        yt['id']=sample.id
        yt=yt[['id','Class_1','Class_2','Class_3','Class_4']]
        yt.to_csv(f'blending_{study.best_value}.csv',index=False)
        return yv,yt,study.best_params

In [None]:
ybt1,ybf1,wb1=optimizer_1(trials=1200).optimize()

In [None]:
evaluation(np.array(ybt1))

In [None]:
class optimizer_2():    
    def __init__(self, metric='mse',n_weights=train_predictions.shape[1],trials=S_TRIALS):
        self.metric = metric
        self.trials = trials
        self.sampler = TPESampler(seed=RANDOM_STATE)
        self.n_weights=n_weights
    def objective(self,trial):
        # One weight per (model, class) column; suggest_float replaces the deprecated suggest_uniform
        weights=[trial.suggest_float(f"w_{i}",0,3) for i in range(self.n_weights)]
        sum_weights=sum(weights)
        y_pred1,y_pred2,y_pred3,y_pred4=0,0,0,0
        for i in range(0,self.n_weights,4): y_pred1+=train_predictions.iloc[:,i]*weights[i]/sum_weights
        for i in range(1,self.n_weights,4): y_pred2+=train_predictions.iloc[:,i]*weights[i]/sum_weights
        for i in range(2,self.n_weights,4): y_pred3+=train_predictions.iloc[:,i]*weights[i]/sum_weights
        for i in range(3,self.n_weights,4): y_pred4+=train_predictions.iloc[:,i]*weights[i]/sum_weights
        y_pred=pd.concat([y_pred1,y_pred2,y_pred3,y_pred4],axis=1)
        y_pred.columns=sub_columns
        # With per-class weights the rows no longer sum to 1, so renormalize before scoring
        y_pred=y_pred.div(y_pred.sum(axis=1),axis=0)
        return log_loss(y_train,np.array(y_pred))
    def optimize(self):
        study = optuna.create_study(direction="minimize", sampler=self.sampler)
        study.optimize(self.objective, n_trials=self.trials)
        # Get predictions
        yv=self.get_matrix(train_predictions,study.best_params)
        yt=self.get_matrix(test_predictions,study.best_params)
        # Save submission and send it 
        yt['id']=sample.id
        yt=yt[['id','Class_1','Class_2','Class_3','Class_4']]
        yt.to_csv(f'blending_{study.best_value}.csv',index=False)
        return yv,yt,study.best_params
    def get_matrix(self,df,mat):
        sum_weights=0
        for i in range(self.n_weights): sum_weights+=mat[f'w_{i}']
        yv1,yv2,yv3,yv4=0,0,0,0
        for i in range(0,self.n_weights,4): yv1+=df.iloc[:,i]*mat[f'w_{i}']*4/sum_weights
        for i in range(1,self.n_weights,4): yv2+=df.iloc[:,i]*mat[f'w_{i}']*4/sum_weights
        for i in range(2,self.n_weights,4): yv3+=df.iloc[:,i]*mat[f'w_{i}']*4/sum_weights
        for i in range(3,self.n_weights,4): yv4+=df.iloc[:,i]*mat[f'w_{i}']*4/sum_weights
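        yv=pd.concat([yv1,yv2,yv3,yv4],axis=1)
        yv.columns=sub_columns
        # The factor 4 above is only an approximate rescaling; renormalize so each row sums to 1
        yv=yv.div(yv.sum(axis=1),axis=0)
        return yv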

In [None]:
ybt2,ybf2,wb2=optimizer_2(trials=1200).optimize()

In [None]:
evaluation(np.array(ybt2))

# Pseudolabel

In [None]:
def new_data(df_train,df_test,labels,p=0.7):
    idx=(labels>p).any(axis=1) # Rows with at least one class probability above p
    # New labels
    y_test=labels[idx].idxmax(axis=1)
    y=pd.concat([train['target'],y_test],axis=0)
    y.reset_index(inplace = True,drop=True)
    y=LabelEncoder().fit_transform(y)
    # New observations
    new_obs=df_test.iloc[labels[idx].index]
    X=pd.concat([df_train,new_obs],axis=0)
    X.reset_index(inplace = True,drop=True)
    return X,y
# p should probably be higher. Honestly, I'm not sure whether pseudolabelling works; it's very risky. However, a notebook by a GM shows the similarity between the train and test splits

In [None]:
X,y=new_data(train_predictions,test_predictions,ysf[sub_columns],0.7)
yspt,yspf=stacking(X,y,test_predictions) # lb: 1.08563

In [None]:
evaluation(yspt,y)

# Averaging
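
When averaging submission files, two details are easy to get wrong: the `id` column should not be averaged, and the rows of the files should be matched by `id` rather than by position. Here is a small sketch that handles both, assuming submissions with an `id` column and four class columns; `average_submissions` is a hypothetical helper, separate from the `avg` function in the cell below.

```python
import pandas as pd

CLASS_COLS = ["Class_1", "Class_2", "Class_3", "Class_4"]

def average_submissions(subs, weights, class_cols=CLASS_COLS):
    """Weighted average of submission frames aligned on 'id'; the id column is never averaged."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    indexed = [s.set_index("id")[class_cols] for s in subs]
    out = sum(w * s for w, s in zip(weights, indexed))  # pandas aligns rows by id
    return out.reset_index()

# Toy submissions (assumption); the second one lists the ids in a different order
sub_a = pd.DataFrame({"id": [100, 101],
                      "Class_1": [0.7, 0.1], "Class_2": [0.1, 0.7],
                      "Class_3": [0.1, 0.1], "Class_4": [0.1, 0.1]})
sub_b = pd.DataFrame({"id": [101, 100],
                      "Class_1": [0.3, 0.5], "Class_2": [0.5, 0.3],
                      "Class_3": [0.1, 0.1], "Class_4": [0.1, 0.1]})
averaged = average_submissions([sub_a, sub_b], [0.7, 0.3])
print(averaged)
```

With weights that sum to 1, each averaged row is still a valid probability distribution.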

In [None]:
def avg(models,weights):
    sub_lb=0
    for n_model,model in enumerate(models): sub_lb+=model*weights[n_model]
    sub_lb['id']=sample.id # The weighted sum also touched the id column, so restore it
    sub_lb.to_csv(f'sub_lb_{sub_lb.iloc[0,1]}_{weights[0]}.csv',index=False) # sub_lb.iloc[0,1] is used as an identifier in the file name
    return sub_lb
best_lb=pd.read_csv('../input/tps-may2021-stacking/sub.csv') # 2nd-best public LB submission. You can try the others
yvg_1=avg([best_lb,ysf],[0.7,0.3]) #lb:1.08510
yvg_2=avg([best_lb,yspf],[0.7,0.3])
yvg_3=avg([best_lb,yspf,ybf2],[0.7,0.25,0.05])

# Additional resources
- https://www.kaggle.com/davidedwards1/tabmar21-tabular-blend-final-sub
- https://www.kaggle.com/hiro5299834/3rd-tps-mar-2021-stacking
- https://www.kaggle.com/cdeotte/pseudo-labeling-qda-0-969
- https://www.kaggle.com/gomes555/tps-may2021-stacking
- https://www.kaggle.com/tunguz/adversarial-tps-may-2021
- https://www.kaggle.com/c/tabular-playground-series-may-2021/discussion/236561

## Last thoughts
- I'm not sure if this workflow is good, but I'm sharing the resources I used anyway. I'd like to hear your comments. How would you improve this? Any fixes to the code?
- On the other hand, I'm not sure why the average (CV + LB) is so much better. The LB is very weird. See these results: best CV only -> 1.08563, and best CV + best LB -> 1.08510.
- Hmm... do you think a shakeup is possible? What is the good side?
- Things to do: add deep learning models (DAE + model) and consider other preprocessing possibilities.
<div class="alert alert-block alert-success">
Don't forget to upvote if you think this notebook is useful (or interesting); I appreciate it. Well, that's all. Thanks for reading, and good luck with your projects!
</div>
