## Stacking Ensemble of 3 Models
1. Roberta Large model from - https://www.kaggle.com/rhtsingh/commonlit-readability-prize-roberta-torch-infer-2 
2. Ridge Regression on Sentence Embeddings from Sentence Transformers 
3. XGBoost Regression on Sentence Embeddings from Sentence Transformers
The last 2 models have been referred from - 

I have gone through multiple notebooks published in the competition that used weighted averaging to ensemble models and got good results. I also wanted to try stacking with a meta-model and see how it performs.

Below, I provide a basic introduction to the stacking ensemble method and the code to implement it. Note that I have not tuned the hyperparameters of the regression models or the meta-model. Further tuning, or selecting a better model, may help you improve your score.

**I am starting out with notebooks on Kaggle and this is one of my first attempts. If you find anything useful, please UPVOTE and leave SUGGESTIONS in the comments on how I can improve this and my future notebooks.**


### Stacking Ensemble
![Stacking Ensemble - TowardsDataScience](https://miro.medium.com/max/1400/1*1ArQEf8OFkxVOckdWi7mSA.png)

[Image Source](https://towardsdatascience.com/the-power-of-ensembles-in-deep-learning-a8900ff42be9)

Stacking is a technique for ensembling multiple models. An individual model may perform well on a subset of the data but not on the whole set. By building different models that each perform well on part of the data and using a meta-model to generalize from their predictions, we can aim for a better overall score.

* The base models are trained on the original train dataset. The predictions from all 3 models are then combined into an intermediate train dataset, which is used to train the final meta-model.
* The best way to build this intermediate train dataset is to train the base models with K-Fold cross-validation and use their Out-Of-Fold (OOF) predictions, so that every prediction comes from a model that did not see that row during training.
* Here I have used 5-fold cross-validation for all 3 models to generate the predictions.
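
The three bullets above can be sketched on synthetic data. This is only a minimal illustration, not the pipeline used in this notebook: the data, the two base models (`Ridge` and `DecisionTreeRegressor`) and the `LinearRegression` meta-model are stand-ins I picked to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=2021)

# Hypothetical base models standing in for Roberta, Ridge and XGBoost.
base_models = {
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# One OOF column per base model: each row is predicted by a model
# that was fitted on the other 4 folds only.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=kf) for model in base_models.values()
])

# The meta-model is trained on the OOF predictions, not on the raw features.
meta_model = LinearRegression().fit(oof, y)
print(oof.shape)  # (200, 2)
```

Using OOF predictions rather than in-sample predictions matters because a base model usually looks far better on rows it was trained on, and the meta-model would learn to trust it too much.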

#### Meta Model selected - LGBMRegressor

Other experiments that can be tried out - 
1. Providing Sentence Embeddings as input to the final meta-model along with the predictions from the base models.
2. Trying out a different meta model for ensembling.
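
For the first experiment, the meta-model input could be built by placing the base-model predictions and the sentence embeddings side by side. The sketch below uses random toy arrays in place of the real OOF columns and embeddings, so the names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, emb_dim = 6, 4  # toy sizes; real sentence embeddings are much wider

# Stand-in for the three OOF columns [roberta, xgboost, ridge].
base_preds = rng.normal(size=(n_rows, 3))
# Stand-in for the sentence embeddings of the same rows.
embeddings = rng.normal(size=(n_rows, emb_dim))

# One row per excerpt: 3 prediction columns followed by the embedding columns.
meta_features = np.hstack([base_preds, embeddings])
print(meta_features.shape)  # (6, 7)
```

The same column layout would have to be used for the test set before calling the meta-model's `predict`.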

If you have any other experiments that could be tried out, please comment below. That would be really helpful.



### Importing Libraries

In [None]:
import os
import gc
gc.enable()

import sys
sys.path.append('../input/sentence-transformers/sentence-transformers-master')

In [None]:
from tqdm import tqdm, trange
import pandas as pd
import numpy as np

import xgboost as xgb
from sklearn.linear_model import BayesianRidge
from lightgbm import LGBMRegressor

import sentence_transformers
from sentence_transformers import SentenceTransformer, models

from sklearn.metrics import mean_squared_error
from sklearn import model_selection
from sklearn.model_selection import StratifiedKFold

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.optimizer import Optimizer
import torch.optim.lr_scheduler as lr_scheduler
from torch.utils.data import (
    Dataset, DataLoader, 
    SequentialSampler, RandomSampler
)
from transformers import RobertaConfig
from transformers import (
    get_cosine_schedule_with_warmup, 
    get_cosine_with_hard_restarts_schedule_with_warmup
)
from transformers import RobertaTokenizer
from transformers import RobertaModel


### CONFIG

In [None]:
FOLDS = 5
TRAIN_FILE = '../input/commonlitreadabilityprize/train.csv'
TEST_FILE = '../input/commonlitreadabilityprize/test.csv'
SAMPLE_SUB_FILE = '../input/commonlitreadabilityprize/sample_submission.csv'
ROBERTA_LARGE_PATH = '../input/robertalarge/'
SENTENCE_EMBEDDINGS_MODEL_PATH = '../input/finetuned-model1/checkpoint-568'

### Loading Dataset

In [None]:
train = pd.read_csv(TRAIN_FILE)
train.head()

In [None]:
test = pd.read_csv(TEST_FILE)
test.head()

### RMSE helper method

In [None]:
def rmse(targets, preds):
    return round(np.sqrt(mean_squared_error(targets, preds)), 4)
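
As a quick sanity check of what the helper computes, here is a NumPy-only equivalent applied to three toy values. `rmse_numpy` is a hypothetical name used only in this example.

```python
import numpy as np

def rmse_numpy(targets, preds):
    """Root mean squared error, rounded to 4 decimals like the helper above."""
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return round(float(np.sqrt(np.mean((targets - preds) ** 2))), 4)

# Errors are 0, 0 and 2, so the mean squared error is 4/3.
print(rmse_numpy([0.0, 1.0, 2.0], [0.0, 1.0, 4.0]))  # 1.1547
```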

### Create Folds helper method

In [None]:
def create_folds(data, num_splits):
    data["kfold"] = -1
    kf = model_selection.KFold(n_splits=num_splits, shuffle=True, random_state=2021)
    for f, (t_, v_) in enumerate(kf.split(X=data)):
        data.loc[v_, 'kfold'] = f
    return data

In [None]:
train = create_folds(train, num_splits=5)

### Roberta Large Inference
The code below for Roberta inference has been taken from the notebook - https://www.kaggle.com/rhtsingh/commonlit-readability-prize-roberta-torch-infer-2. 

#### Dataset class

In [None]:
class DatasetRetriever(Dataset):
    def __init__(self, data, tokenizer, max_len, is_test=False):
        self.data = data
        self.excerpts = self.data.excerpt.values.tolist()
        if not is_test:
            self.targets = self.data.target.values.tolist()
        self.tokenizer = tokenizer
        self.is_test = is_test
        self.max_len = max_len
        
    def convert_examples_to_features(self, data, tokenizer, max_len, is_test=False):
        data = data.replace('\n', '')
        tok = tokenizer.encode_plus(
            data, 
            max_length=max_len, 
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True
        )
        curr_sent = {}
        padding_length = max_len - len(tok['input_ids'])
        curr_sent['input_ids'] = tok['input_ids'] + ([0] * padding_length)
        curr_sent['token_type_ids'] = tok['token_type_ids'] + \
            ([0] * padding_length)
        curr_sent['attention_mask'] = tok['attention_mask'] + \
            ([0] * padding_length)
        return curr_sent
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, item):
        if not self.is_test:
            excerpt, label = self.excerpts[item], self.targets[item]
            features = self.convert_examples_to_features(
                excerpt, self.tokenizer, 
                self.max_len, self.is_test
            )
            return {
                'input_ids':torch.tensor(features['input_ids'], dtype=torch.long),
                'token_type_ids':torch.tensor(features['token_type_ids'], dtype=torch.long),
                'attention_mask':torch.tensor(features['attention_mask'], dtype=torch.long),
                'label':torch.tensor(label, dtype=torch.double),
            }
        else:
            excerpt = self.excerpts[item]
            features = self.convert_examples_to_features(
                excerpt, self.tokenizer, 
                self.max_len, self.is_test
            )
            return {
                'input_ids':torch.tensor(features['input_ids'], dtype=torch.long),
                'token_type_ids':torch.tensor(features['token_type_ids'], dtype=torch.long),
                'attention_mask':torch.tensor(features['attention_mask'], dtype=torch.long),
            }

#### Model

In [None]:
class CommonLitModel(nn.Module):
    def __init__(
        self, 
        model_name, 
        config,  
        multisample_dropout=False,
        output_hidden_states=False
    ):
        super(CommonLitModel, self).__init__()
        self.config = config
        self.roberta = RobertaModel.from_pretrained(
            model_name, 
            output_hidden_states=output_hidden_states
        )
        self.layer_norm = nn.LayerNorm(config.hidden_size)
        if multisample_dropout:
            self.dropouts = nn.ModuleList([
                nn.Dropout(0.5) for _ in range(5)
            ])
        else:
            self.dropouts = nn.ModuleList([nn.Dropout(0.3)])
        self.regressor = nn.Linear(config.hidden_size, 1)
        self._init_weights(self.layer_norm)
        self._init_weights(self.regressor)
 
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
 
    def forward(
        self, 
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        labels=None
    ):
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        sequence_output = outputs[1]
        sequence_output = self.layer_norm(sequence_output)
 
        # multi-sample dropout
        for i, dropout in enumerate(self.dropouts):
            if i == 0:
                logits = self.regressor(dropout(sequence_output))
            else:
                logits += self.regressor(dropout(sequence_output))
        
        logits /= len(self.dropouts)
 
        # calculate loss
        loss = None
        if labels is not None:
            loss_fn = torch.nn.MSELoss()
            logits = logits.view(-1).to(labels.dtype)
            loss = torch.sqrt(loss_fn(logits, labels.view(-1)))
        
        output = (logits,) + outputs[1:]
        return ((loss,) + output) if loss is not None else output

#### Utils

In [None]:
def make_model(model_name='roberta-large', num_labels=1):
    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    config = RobertaConfig.from_pretrained(model_name)
    config.update({'num_labels':num_labels})
    model = CommonLitModel(model_name, config=config)
    return model, tokenizer

def make_loader(
    data, 
    tokenizer, 
    max_len,
    batch_size,
    fold=0,
    is_test = False
):
    if is_test:
        test_dataset = DatasetRetriever(data, tokenizer, max_len, is_test=True)
        test_sampler = SequentialSampler(test_dataset)
        test_loader = DataLoader(
            test_dataset, 
            batch_size=batch_size // 2, 
            sampler=test_sampler, 
            pin_memory=False, 
            drop_last=False, 
            num_workers=0
        )

        return test_loader
    else:
        train_set, valid_set = data[data['kfold']!=fold], data[data['kfold']==fold]
        train_dataset = DatasetRetriever(train_set, tokenizer, max_len)
        valid_dataset = DatasetRetriever(valid_set, tokenizer, max_len)

        train_sampler = RandomSampler(train_dataset)
        train_loader = DataLoader(
            train_dataset, 
            batch_size=batch_size, 
            sampler=train_sampler, 
            pin_memory=True, 
            drop_last=False, 
            num_workers=4
        )

        valid_sampler = SequentialSampler(valid_dataset)
        valid_loader = DataLoader(
            valid_dataset, 
            batch_size=batch_size // 2, 
            sampler=valid_sampler, 
            pin_memory=True, 
            drop_last=False, 
            num_workers=4
        )

        return train_loader, valid_loader


#### Evaluator

In [None]:
class Evaluator:
    def __init__(self, model, scalar=None):
        self.model = model
        self.scalar = scalar

    def evaluate(self, data_loader, tokenizer):
        preds = []
        self.model.eval()
        total_loss = 0
        with torch.no_grad():
            for batch_idx, batch_data in enumerate(data_loader):
                input_ids, attention_mask, token_type_ids = batch_data['input_ids'], \
                    batch_data['attention_mask'], batch_data['token_type_ids']
                input_ids, attention_mask, token_type_ids = input_ids.cuda(), \
                    attention_mask.cuda(), token_type_ids.cuda()
                
                if self.scalar is not None:
                    with torch.cuda.amp.autocast():
                        outputs = self.model(
                            input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids
                        )
                else:
                    outputs = self.model(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids
                    )
                
                # reshape(-1) keeps a 1-D list even when the last batch holds a single row
                logits = outputs[0].detach().cpu().numpy().reshape(-1).tolist()
                preds += logits
        return preds

#### Config

In [None]:
def config(fold):
    torch.manual_seed(2021)
    torch.cuda.manual_seed(2021)
    torch.cuda.manual_seed_all(2021)
    
    max_len = 250
    batch_size = 8

    model, tokenizer = make_model(
        model_name=ROBERTA_LARGE_PATH, 
        num_labels=1
    )
    model.load_state_dict(
        torch.load(f'../input/roberta-large-itptfit/model{fold}.bin')
    )
    test_loader = make_loader(
        test, tokenizer, max_len=max_len,
        batch_size=batch_size, is_test = True
    )

    if torch.cuda.device_count() >= 1:
        model = model.cuda() 
    else:
        raise ValueError('CPU inference is not supported')

    # scaler = torch.cuda.amp.GradScaler()
    scaler = None
    return (
        model, tokenizer, 
        test_loader, scaler
    )

#### Test Run

In [None]:
def run(fold=0):
    model, tokenizer, \
        test_loader, scaler = config(fold)
    
    import time

    evaluator = Evaluator(model, scaler)

    test_time_list = []

    torch.cuda.synchronize()
    tic1 = time.time()

    preds = evaluator.evaluate(test_loader, tokenizer)

    torch.cuda.synchronize()
    tic2 = time.time() 
    test_time_list.append(tic2 - tic1)
    
    del model, tokenizer, test_loader, scaler
    gc.collect()
    torch.cuda.empty_cache()
    
    return preds

#### OOF Roberta Large Predictions

In [None]:
oof_roberta = np.zeros(len(train))
for fold in tqdm(range(5), total=5):
    model, tokenizer = make_model(
        model_name='../input/robertalarge/', 
        num_labels=1
    )
    model.load_state_dict(
        torch.load(f'../input/roberta-large-itptfit/model{fold}.bin')
    )
    model.cuda()
    model.eval()
    val_index = train[train.kfold==fold].index.tolist()
    train_loader, val_loader = make_loader(train, tokenizer, 250, 16, fold=fold)
    # scalar = torch.cuda.amp.GradScaler()
    scalar = None
    val_preds = []
    for index, data in enumerate(val_loader):
        input_ids, attention_mask, token_type_ids, labels = data['input_ids'], \
            data['attention_mask'], data['token_type_ids'], data['label']
        input_ids, attention_mask, token_type_ids, labels = input_ids.cuda(), \
            attention_mask.cuda(), token_type_ids.cuda(), labels.cuda()
        if scalar is not None:
            with torch.cuda.amp.autocast():
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids,
                    labels=labels
                )
        else:
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                labels=labels
            )
        
        loss, logits = outputs[:2]
        val_preds += logits.cpu().detach().numpy().tolist()
    oof_roberta[val_index] = val_preds
    
    del model, tokenizer, train_loader, val_loader
    gc.collect()
    torch.cuda.empty_cache()
    
print('Roberta: Mean OOF RMSE = {}'.format(rmse(train.target.values, oof_roberta)))

#### Roberta Large Predictions for the Final Test Set
We take mean across the predictions from the 5 folds.

In [None]:
pred_df_roberta = pd.DataFrame()
for fold in tqdm(range(5)):
    pred_df_roberta[f'fold{fold}'] = run(fold)
test_preds_roberta = pred_df_roberta.mean(axis=1).values.tolist()
test_preds_roberta

### Sentence Embeddings model using the offline Sentence Transformers library
Code adapted from - https://www.kaggle.com/datafan07/eda-simple-bayesian-ridge-with-sentence-embeddings
The model is the fine-tuned Roberta model used in the linked notebook.

In [None]:
# setting model path for fine-tuned roberta weights
model_path = SENTENCE_EMBEDDINGS_MODEL_PATH
word_embedding_model = models.Transformer(model_path, max_seq_length=275)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
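
To my understanding, `models.Pooling` with only the embedding dimension passed averages the token vectors (mean pooling) to get one vector per excerpt. The NumPy sketch below shows that idea on a toy input. It is an illustration under that assumption, not the library's actual implementation, and `mean_pool` is a hypothetical helper.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# One sequence with 3 tokens of dimension 2; the last token is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]]
```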

In [None]:
X_train = model.encode(train.excerpt, device='cuda')
display(X_train.shape)

In [None]:
X_test = model.encode(test.excerpt, device='cuda')
display(X_test.shape)

### XGBoost model on sentence embeddings along with OOF and test set predictions

In [None]:
%%time

xgb_params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    
    'eta': 0.05,
    'max_depth': 3,
    
    'gamma': 1,
    'subsample': 0.8,
    
    'nthread': 2
}

best_iterations = []
oof_rmses = []
oof_xgboost = np.zeros(len(train))

test_preds_xgboost = []

for fold in range(FOLDS):
    print(f'\nTraining Fold {fold + 1} / {FOLDS}')
    
    train_idx, val_idx = train.index[train['kfold']!=fold].tolist(), train.index[train['kfold']==fold].tolist()
    
    dtrain = xgb.DMatrix(X_train[train_idx], train.target[train_idx])
    dvalid = xgb.DMatrix(X_train[val_idx], train.target[val_idx])
    evals_result = dict()
    booster = xgb.train(xgb_params,
                        dtrain,
                        evals=[(dtrain, 'train'), (dvalid, 'valid')],
                        num_boost_round=300,
                        early_stopping_rounds=20,
                        evals_result=evals_result,
                        verbose_eval=False)
    
    best_iteration = np.argmin(evals_result['valid']['rmse'])
    best_iterations.append(best_iteration)
    oof_rmse = evals_result['valid']['rmse'][best_iteration]
    oof_rmses.append(oof_rmse)
    print(f"Fold {fold+1}: train OOF RMSE: {oof_rmse}")
    
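    # Note: newer XGBoost releases deprecate `ntree_limit`; there, use iteration_range=(0, best_iteration + 1) instead.
    val_pred = booster.predict(dvalid, ntree_limit=int(best_iteration+1))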
    oof_xgboost[val_idx] = val_pred
    test_preds_xgboost.append(booster.predict(xgb.DMatrix(X_test), ntree_limit=int(best_iteration+1)))
    
print('XGBoost: Mean OOF RMSE = {}'.format(rmse(train.target.values, oof_xgboost)))

Taking mean across the 5 folds for the test set prediction

In [None]:
test_preds_xgboost = np.mean(test_preds_xgboost,0)
test_preds_xgboost
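
A toy example of the averaging step: each fold's model predicts the full test set, and `np.mean` along axis 0 averages those predictions row by row. The numbers are made up for illustration.

```python
import numpy as np

# Made-up test predictions from two folds for three test rows.
fold_preds = [
    np.array([0.1, -1.0, 0.5]),
    np.array([0.3, -1.2, 0.5]),
]

# axis=0 averages across folds and keeps one value per test row.
mean_preds = np.mean(fold_preds, axis=0)
print(mean_preds.shape)  # (3,)
```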

### Bayesian Ridge model on sentence embeddings along with OOF and test set predictions

In [None]:
%%time

oof_ridge = np.zeros(len(train))
test_preds_ridge = []

for fold in range(FOLDS):
    print(f'\nTraining Fold {fold + 1} / {FOLDS}')
    
    train_idx, val_idx = train.index[train['kfold']!=fold].tolist(), train.index[train['kfold']==fold].tolist()

    # Note: newer scikit-learn releases rename `n_iter` to `max_iter`.
    reg = BayesianRidge(n_iter=300, verbose=True)
    reg.fit(X_train[train_idx],train.target[train_idx])
    
    
    val_pred = reg.predict(X_train[val_idx])
    oof_ridge[val_idx] = val_pred
    oof_rmse = rmse(val_pred, train.target[val_idx].values)
    print(f"Fold {fold+1}: train OOF RMSE: {oof_rmse}\n")
    
    test_preds_ridge.append(reg.predict(X_test))

print('BayesianRidge: Mean OOF RMSE = {}'.format(rmse(train.target.values, oof_ridge)))

Taking mean across the 5 folds for the test set prediction

In [None]:
test_preds_ridge = np.mean(test_preds_ridge,0)
test_preds_ridge

### Intermediate Train set built from the OOF predictions of all the 3 models
You can experiment with your own models and add their OOF predictions to this train set, which is used for training the meta-model.

In [None]:
oof_train = pd.DataFrame()
oof_train['roberta'] = oof_roberta
oof_train['xgboost'] = oof_xgboost
oof_train['ridge'] = oof_ridge
oof_train['target'] = train.target.values
oof_train = create_folds(oof_train, num_splits=5)
display(oof_train.shape)
oof_train.head()
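
Calling `create_folds` again on `oof_train` should reproduce the fold assignment used for the base models, because a shuffled `KFold` with a fixed `random_state` depends only on the number of rows. A small check of that property, where `fold_ids` is a hypothetical helper that mirrors `create_folds` without pandas:

```python
import numpy as np
from sklearn.model_selection import KFold

def fold_ids(n_rows, num_splits=5, seed=2021):
    """Return an array holding the validation-fold id of every row."""
    ids = np.full(n_rows, -1)
    kf = KFold(n_splits=num_splits, shuffle=True, random_state=seed)
    for f, (_, val_idx) in enumerate(kf.split(np.zeros(n_rows))):
        ids[val_idx] = f
    return ids

a = fold_ids(100)
b = fold_ids(100)
print(np.array_equal(a, b))  # True
```

If the number of rows or the seed differed between the two calls, the folds would no longer line up and the meta-model's validation rows would stop matching the base models' validation rows.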

Train set with the columns that will be required for training.

In [None]:
x_oof_train = oof_train[['roberta', 'xgboost', 'ridge']]
x_oof_train.head()

### Creating the test set for generating the final prediction from the meta-model.
The mean of the predictions across all the 5 folds on the test set by the base models is used to create this final test set.

In [None]:
final_test = pd.DataFrame()
final_test['roberta'] = test_preds_roberta
final_test['xgboost'] = test_preds_xgboost
final_test['ridge'] = test_preds_ridge
display(final_test.shape)
final_test.head()

### Stacking Ensemble Model - LGBMRegressor
I haven't performed any hyperparameter tuning or experimented with any other models for the meta-model. Please leave a comment below if you want to share your experiments.

In [None]:
%%time

stacking_preds = []
oof_rmses = []


for fold in range(FOLDS):
    print(f'\nTraining Fold {fold + 1} / {FOLDS}')
    
    train_idx, val_idx = oof_train.index[oof_train['kfold']!=fold].tolist(), oof_train.index[oof_train['kfold']==fold].tolist()
    x_train, y_train = x_oof_train.iloc[list(train_idx)], oof_train.target.iloc[train_idx]
    x_val, y_val = x_oof_train.iloc[list(val_idx)], oof_train.target.iloc[val_idx]

    reg = LGBMRegressor(max_depth=5, n_estimators=40)
    reg.fit(x_train, y_train)
    
    val_pred = reg.predict(x_val)
    oof_rmse = rmse(val_pred, oof_train.target[val_idx].values)
    oof_rmses.append(oof_rmse)
    print(f"Fold {fold+1} train OOF RMSE: {oof_rmse}")
    
    stacking_preds.append(reg.predict(final_test))

print('Stacking LGBMRegressor: Mean OOF RMSE = {}'.format(np.mean(oof_rmses)))

### Submission

In [None]:
sub = pd.read_csv(SAMPLE_SUB_FILE)

In [None]:
sub['target'] = np.mean(stacking_preds,0)
sub.to_csv('submission.csv', index=False)

In [None]:
sub

### That's it from this notebook. Hope the beginners on Kaggle find something useful here. 