<a id='introduction'></a>
# Introduction

I used Maunish's ideas from [this notebook](https://www.kaggle.com/maunish/clrp-pytorch-roberta-inference).

This offline notebook loads the models and tokenizers produced by [my other notebook](https://www.kaggle.com/angyalfold/roberta-large-k-fold-models/) to make predictions. Each of the k fold models consists of my fine-tuned RoBERTa-large model ([from here](https://www.kaggle.com/angyalfold/pretrain-roberta-large-on-clrp-data/)) with an attention head and a custom regressor on top. The final prediction is the average of the k models' predictions.

This notebook is part of a series:
1. Pretrain RoBERTa-large on the CommonLit dataset [here](https://www.kaggle.com/angyalfold/pretrain-roberta-large-on-clrp-data/).
2. Produce k models which can later be used for determining the readability of texts [here](https://www.kaggle.com/angyalfold/roberta-large-k-fold-models/).
3. Make predictions with a custom NN regressor (this notebook).
4. Ensemble (RoBERTa-large + SVR, RoBERTa-large + Ridge, RoBERTa-large + custom NN head) [here](https://www.kaggle.com/angyalfold/ensemble-for-commonlit/).

<a id="toc"></a>
# Table of contents
* [Introduction](#introduction)
* [Classes & configs](#classes)
    * [Configs](#classes_config)
    * [Data set](#classes_data_set)
    * [Model](#classes_model)
* [Make predictions](#make_predictions)
    * [Read test data](#make_predictions_read_test_data)
    * [Make predictions with pretrained models](#make_predictions_make_predictions)
    * [Save results](#make_predictions_save_results)

<a id='classes'></a>
# Classes & configs
[[back to top]](#toc)

<a id='classes_config'></a>
## Configs
[[back to top]](#toc)

In [None]:
import torch

config = {
    'batch_size': 8,
    'best_pretrained_roberta_folder': '../input/pretrain-roberta-large-on-clrp-data/clrp_roberta_large/best_model/',
    'num_of_models': 5,
    'sentence_max_length' : 256
}

for (k, v) in config.items():
    print(f"The value for {k}: {v}")
    
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print(f"Device: {device}")

<a id='classes_data_set'></a>
## Data set
[[back to top]](#toc)

In [None]:
import torch

class ReadabilityDataset(torch.utils.data.Dataset):
    """Custom dataset for the Readability task"""
    def __init__(self, encodings):
        self.encodings = encodings
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item
        
    def __len__(self):
        return len(self.encodings['input_ids'])
    

print(ReadabilityDataset.__doc__)

<a id='classes_model'></a>
## Model
[[back to top]](#toc)

Note that the concept of attention is explained very clearly in [Lena Voita](https://lena-voita.github.io/)'s NLP course [here](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html).

In [None]:
from torch import nn

class AttentionHead(nn.Module):
    """Class implementing the attention head of the model."""
    def __init__(self, in_features, hidden_dim):
        super().__init__()
        self.in_features = in_features
        self.middle_features = hidden_dim
        self.W = nn.Linear(in_features, hidden_dim)
        self.V = nn.Linear(hidden_dim, 1)
        self.out_features = hidden_dim
       
    
    def forward(self, features):
        att = torch.tanh(self.W(features))
        score = self.V(att)
        attention_weights = torch.softmax(score, dim=1)
        context_vector = attention_weights * features
        
        return torch.sum(context_vector, dim=1)


print(AttentionHead.__doc__)
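
The pooling step in `AttentionHead.forward` can be traced on a small example. The sketch below re-expresses the same computation in NumPy rather than PyTorch, so it runs without a GPU or model weights. The function name `attention_pool`, the random inputs, and the tiny dimensions are assumptions made purely for illustration; they are not part of the trained model.

```python
import numpy as np


def attention_pool(features, W, b_w, V, b_v):
    """Pool (batch, seq_len, hidden) token features into (batch, hidden)."""
    att = np.tanh(features @ W + b_w)        # (batch, seq_len, hidden_dim)
    score = att @ V + b_v                    # (batch, seq_len, 1)
    # Softmax over the sequence axis, shifted by the max for numerical stability.
    score = score - score.max(axis=1, keepdims=True)
    weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
    pooled = (weights * features).sum(axis=1)  # (batch, hidden)
    return pooled, weights


# Illustrative sizes only; the real model uses RoBERTa-large's hidden size.
rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 5, 4
features = rng.normal(size=(batch, seq_len, hidden))
W, b_w = rng.normal(size=(hidden, hidden)), np.zeros(hidden)
V, b_v = rng.normal(size=(hidden, 1)), np.zeros(1)

pooled, weights = attention_pool(features, W, b_w, V, b_v)
print(pooled.shape)   # (2, 4)
print(weights.shape)  # (2, 5, 1)
```

Because the weights sum to 1 along the sequence axis, each pooled vector is a weighted average of that excerpt's token vectors, which is what the regressor receives.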

In [None]:
from torch import nn
from transformers import RobertaModel
from transformers import RobertaConfig

class ReadabilityRobertaModel(nn.Module):
    """Custom model for the Readability task containing a Roberta layer and a custom NN head."""
        
    def __init__(self):
        super(ReadabilityRobertaModel, self).__init__()
        
        self.model_config = RobertaConfig.from_pretrained(config['best_pretrained_roberta_folder'])
        self.model_config.update({
            "output_hidden_states": True,
            "hidden_dropout_prob": 0.0,
            "layer_norm_eps": 1e-7
        })
        
        self.roberta = RobertaModel.from_pretrained(config['best_pretrained_roberta_folder'],
                                                    config=self.model_config)
        self.attention_head = AttentionHead(self.model_config.hidden_size, 
                                            self.model_config.hidden_size)
        self.dropout = nn.Dropout(0.1)
        self.regressor = nn.Linear(self.model_config.hidden_size, 1)
        
        
    def forward(self, tokens, attention_mask):
        x = self.roberta(input_ids=tokens, attention_mask=attention_mask)[0]
        x = self.attention_head(x)
        x = self.dropout(x)
        x = self.regressor(x)
        return x
    
    
    def freeze_roberta(self):
        """
        Freezes the parameters of the Roberta model so when ReadabilityRobertaModel is 
        trained only the weights of the attention head and the custom regressor are modified.
        """
        for param in self.roberta.parameters():
            param.requires_grad = False
    
    def unfreeze_roberta(self):
        """
        Unfreezes the parameters of the Roberta model so when ReadabilityRobertaModel is 
        trained both the weights of the custom head and of the underlying Roberta
        model are modified.
        """
        for param in self.roberta.parameters():
            param.requires_grad = True

    
print(ReadabilityRobertaModel.__doc__)

<a id='make_predictions'></a>
# Make predictions
[[back to top]](#toc)

<a id='make_predictions_read_test_data'></a>
## Read test data
[[back to top]](#toc)

In [None]:
import pandas as pd

test_csv_path = '/kaggle/input/commonlitreadabilityprize/test.csv'
test_data = pd.read_csv(test_csv_path)

print('The total # of samples is {}.'.format(test_data.shape[0]))

<a id='make_predictions_make_predictions'></a>
## Make predictions with pretrained models
[[back to top]](#toc)

This section is based on [this notebook](https://www.kaggle.com/maunish/clrp-pytorch-roberta-inference).

In [None]:
from torch.utils.data import DataLoader

def get_dataloader_from_dataframes(df, tokenizer):
    """Converts a complete dataframe (with all columns included) into a dataloader."""
    texts = df['excerpt'].values.tolist()
    data_encodings = tokenizer(texts, max_length=config['sentence_max_length'],
                              truncation=True, padding=True)
    dataset = ReadabilityDataset(data_encodings)
    dataloader = DataLoader(dataset, batch_size=config['batch_size'])
    
    return dataloader

print(get_dataloader_from_dataframes.__doc__)
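
To make the effect of `truncation=True, padding=True` concrete, here is a minimal pure-Python sketch of what happens to the token ids. The helper `pad_and_truncate` and the token ids are hypothetical; `pad_id=1` is RoBERTa's padding id. One simplification: the real tokenizer keeps the closing special token when it truncates, whereas this sketch simply slices.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=1):
    """Hypothetical helper that mimics truncation=True, padding=True."""
    # Cut every sequence to at most max_length tokens.
    truncated = [ids[:max_length] for ids in token_id_lists]
    # padding=True pads to the longest sequence in the call, not to max_length.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up excerpts: one short, one longer than max_length.
encodings = pad_and_truncate(
    [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2]], max_length=6
)
print(encodings["input_ids"])       # [[0, 11, 12, 2, 1, 1], [0, 21, 22, 23, 24, 25]]
print(encodings["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Since `get_dataloader_from_dataframes` tokenizes all excerpts in a single call, every row ends up with the same length, so the default collation of `DataLoader` can stack them into batches directly.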

In [None]:
import torch

from tqdm.auto import tqdm


def get_predictions(data, model_path, tokenizer):
    """Makes predictions for the given data with the model stored at model_path."""
    
    # setup model
    model = ReadabilityRobertaModel()
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.to(device)
    model.eval()
    
    # convert data into dataloader
    dataloader = get_dataloader_from_dataframes(data, tokenizer)
    
    # iteration for predictions; gradients are not needed at inference time
    predictions = list()
    with torch.no_grad():
        for batch in tqdm(dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            
            output = torch.flatten(model(tokens=input_ids, attention_mask=attention_mask))
            output = output.cpu().numpy().tolist()
            
            predictions.extend(output)
    
    # drop the model so its GPU memory can be released before the next fold is loaded
    del model
    torch.cuda.empty_cache()
    return predictions

print(get_predictions.__doc__)

In [None]:
import numpy as np
from transformers import RobertaTokenizer

predictions = np.zeros(test_data.shape[0])

for i in range(config['num_of_models']):
    model_path = f'../input/roberta-large-k-fold-models/model{i}/model{i}.bin'
    tokenizer_path = f'../input/roberta-large-k-fold-models/model{i}/'
    tokenizer = RobertaTokenizer.from_pretrained(tokenizer_path)
    
    pred = get_predictions(test_data, model_path, tokenizer)
    print(f"The predictions from model {i}:")
    print(pred)
    predictions = predictions + pred
    
    
predictions = predictions / config['num_of_models']
predictions = predictions.tolist()
print("Overall predictions:")
print(predictions)
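
The running sum followed by a division above is equivalent to stacking the per-fold predictions and taking the mean along the fold axis. The sketch below shows this on made-up numbers (three hypothetical folds, three excerpts); the values are illustrative only and are not results from the models.

```python
import numpy as np

# Hypothetical per-fold predictions for three excerpts (made-up numbers).
fold_predictions = [
    [-0.50, 0.10, -1.20],
    [-0.40, 0.20, -1.00],
    [-0.60, 0.00, -1.10],
]

stacked = np.array(fold_predictions)  # shape: (num_folds, num_samples)
ensemble = stacked.mean(axis=0)       # average over folds, one value per excerpt
spread = stacked.std(axis=0)          # disagreement between folds per excerpt

print(ensemble)
print(spread)
```

Keeping the stacked array around also makes it easy to inspect how much the folds disagree on a given excerpt, which the running sum does not allow.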

<a id='make_predictions_save_results'></a>
## Save results
[[back to top]](#toc)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_data['id']
submission['target'] = predictions
submission.to_csv('submission.csv', index=False)

print('Saved predictions.')