## Introduction

The aim of this notebook is to train models for the [CommonLit Readability Prize competition](https://www.kaggle.com/c/commonlitreadabilityprize/overview/) using K-fold cross-validation: it produces k=5 models, one per fold, which can later be used for making predictions.

This notebook is part of a series:
1. Pretrain roberta large on the CommonLit dataset [here](https://www.kaggle.com/angyalfold/pretrain-roberta-large-on-clrp-data/).
2. Produce k models which can later be used for determining the readability of texts (this notebook).
3. Make predictions with a custom NN regressor [here](https://www.kaggle.com/angyalfold/roberta-large-with-custom-regressor-pytorch/).
4. Ensemble (Roberta large + SVR, Roberta large + Ridge, Roberta large + custom NN head) [here](https://www.kaggle.com/angyalfold/ensemble-for-commonlit/).

The K-Fold ideas and the approach of producing multiple models are taken from Maunish's [notebook](https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune).

<a id="toc"></a>
# Table of contents
* [Parameters](#parameters)
* [Processing data](#processing_data)
    * [Load data](#processing_data_load)
    * [Organize data into folds](#processing_data_folds)
    * [Clearing texts](#processing_data_clearing_texts)
* [Setup](#setup)
    * [Tokenizer](#setup_tokenizer)
    * [Data set](#setup_dataset)
    * [Model](#setup_model)
    * [Loss function](#setup_loss)
* [Training process](#training)
    * [Iteration to train & validate model](#training_model_iteration)
    * [Training a model for each fold](#training_model_fold)

<a id="parameters"></a>
# Parameters
[[back to top]](#toc)

In [None]:
import gc
import torch

gc.enable()

config = {
    'batch_size': 8, # the batch size for the dataloaders
    'best_pretrained_roberta_folder': '../input/pretrain-roberta-large-on-clrp-data/clrp_roberta_large/best_model/',
    'effective_batch_size': 128, # the number of samples within a gradient accumulation set
    'lr': 2e-5,
    'model_name': 'roberta-large',
    'num_of_folds': 5,
    'num_of_epochs': 5,
    'seed': 2021,
    'sentence_max_length': 256, # the max size of the input for the tokenizer
    'wd': 0.01
}

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

for k, v in config.items():
    print('The value for {}: {}'.format(k, v))

print('Device: {}'.format(device))

In [None]:
import os

for i in range(config['num_of_folds']):
    os.makedirs(f'model{i}',exist_ok=True)
    
print('Created folders for the models.')

<a id="processing_data"></a>
# Processing data
[[back to top]](#toc)

<a id="processing_data_load"></a>
## Load data
[[back to top]](#toc)

In [None]:
import pandas as pd

train_csv_path = '/kaggle/input/commonlitreadabilityprize/train.csv'
train_data = pd.read_csv(train_csv_path)

print('The total # of samples is {}.'.format(len(train_data)))

<a id="processing_data_folds"></a>
## Organize data into folds
[[back to top]](#toc)

The following cell splits the data into 5 folds. To do that, a new column (named *bin*) is added to the pandas dataframe. The new column is produced with [pandas' *cut*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas-cut), which converts continuous values into discrete bins. The number of bins is set to floor(1 + log2(n)), where n is the number of samples.

Once there is a column (in this case *bin*) which splits the target values into categories, [scikit-learn's StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) can be used. An advantage of stratified k-fold is that "*the folds are made by preserving the percentage of samples for each class*".

(This idea was also borrowed from [Maunish's previously referenced notebook](https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune).)
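
As a small illustration of the binning step, the sketch below applies the same bin-count formula and `pd.cut` call to eight made-up target values. The values and the variable names are illustrative assumptions, not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Eight hypothetical readability targets (not real competition values).
toy_targets = pd.Series([-3.0, -2.1, -1.5, -0.9, -0.2, 0.4, 1.1, 1.7])

# Same formula as in the notebook: floor(1 + log2(n)) bins.
toy_num_of_bins = int(np.floor(1 + np.log2(len(toy_targets))))

# labels=False returns the integer index of the equal-width bin for each value.
toy_bin_ids = pd.cut(toy_targets, bins=toy_num_of_bins, labels=False)

print(toy_num_of_bins)       # 4
print(toy_bin_ids.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The integer bin ids play the role of class labels, which is what `StratifiedKFold` needs in order to stratify a regression target.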

In [None]:
import numpy as np

from sklearn.model_selection import StratifiedKFold

# create & fill bins column (needed for kfold)
num_of_bins = int(np.floor(1 + np.log2(len(train_data))))
train_data.loc[:,'bin'] = pd.cut(train_data['target'], bins=num_of_bins, labels=False)
bins = train_data.bin.to_numpy()

# kfold
train_data['fold'] = -1
kfold = StratifiedKFold(n_splits=config['num_of_folds'], shuffle=True, random_state=config['seed'])
for k, (train_idx, valid_idx) in enumerate(kfold.split(X=train_data, y=bins)):
    train_data.loc[valid_idx, 'fold'] = k

print('Performed K-fold split on training data (K={}).'.format(config['num_of_folds']))

<a id='processing_data_clearing_texts'></a>
## Clearing texts
[[back to top]](#toc)

So far the only cleaning which takes place is to replace new line characters with spaces.

In [None]:
train_data['excerpt'] = train_data['excerpt'].apply(lambda x: x.replace('\n', ' '))
train_texts = train_data['excerpt'].values.tolist()

print('Text data has been cleaned.')

<a id="setup"></a>
# Setup
[[back to top]](#toc)

<a id="setup_tokenizer"></a>
## Tokenizer
[[back to top]](#toc)

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained(config['model_name'])

print('Tokenizer is ready.')

<a id="setup_dataset"></a>
## Data set
[[back to top]](#toc)

In [None]:
import torch

class ReadabilityDataset(torch.utils.data.Dataset):
    """Custom dataset for the Readability task."""
    def __init__(self, encodings, targets):
        self.encodings = encodings
        self.targets = targets
        
        
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['target'] = torch.tensor(self.targets[idx])
        return item
        
    def __len__(self):
        return len(self.targets)
    

print(ReadabilityDataset.__doc__)
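
To see what a batch drawn from such a dataset looks like, here is a minimal sketch. `ToyReadabilityDataset` repeats the class above only so that the snippet runs on its own, and the encodings are hand-written stand-ins for tokenizer output (the token ids are arbitrary assumptions).

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyReadabilityDataset(Dataset):
    """Stand-in mirroring ReadabilityDataset so the sketch is self-contained."""
    def __init__(self, encodings, targets):
        self.encodings = encodings
        self.targets = targets

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['target'] = torch.tensor(self.targets[idx])
        return item

    def __len__(self):
        return len(self.targets)

# Three padded sequences of length 4; the ids are made up for illustration.
toy_encodings = {
    'input_ids': [[0, 11, 12, 2], [0, 21, 2, 1], [0, 31, 32, 2]],
    'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_targets = [-0.5, 0.25, -1.75]

toy_loader = DataLoader(ToyReadabilityDataset(toy_encodings, toy_targets),
                        batch_size=2, shuffle=False)
first_batch = next(iter(toy_loader))

print(first_batch['input_ids'].shape)  # torch.Size([2, 4])
print(first_batch['target'].dtype)     # torch.float32
```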

<a id="setup_model"></a>
## Model
[[back to top]](#toc)

Loads a pretrained Roberta model from another notebook titled [Fine-tuning Roberta-large on CLRP data](https://www.kaggle.com/angyalfold/fine-tuning-roberta-large-on-clrp-data).

The following model is based on Maunish's [notebook](https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune/data).

In [None]:
from torch import nn

class AttentionHead(nn.Module):
    """Class implementing the attention head of the model."""
    def __init__(self, in_features, hidden_dim):
        super().__init__()
        self.in_features = in_features
        self.middle_features = hidden_dim
        self.W = nn.Linear(in_features, hidden_dim)
        self.V = nn.Linear(hidden_dim, 1)
        self.out_features = hidden_dim
       
    
    def forward(self, features):
        att = torch.tanh(self.W(features))
        score = self.V(att)
        attention_weights = torch.softmax(score, dim=1)
        context_vector = attention_weights * features
        
        return torch.sum(context_vector, dim=1)


print(AttentionHead.__doc__)

Note that the concept of attention is explained very well in [Lena Voita](https://lena-voita.github.io/)'s NLP course [here](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html).
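
The pooling done by the attention head can be checked on random data. The sketch below repeats the three steps of `forward` (score, softmax over tokens, weighted sum) with small, arbitrarily chosen dimensions. The tensor sizes and variable names are assumptions made for the illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, hidden = 2, 5, 8      # arbitrary toy sizes
features = torch.randn(batch_size, seq_len, hidden)

W = nn.Linear(hidden, hidden)
V = nn.Linear(hidden, 1)

scores = V(torch.tanh(W(features)))            # one score per token: (batch, seq_len, 1)
weights = torch.softmax(scores, dim=1)         # normalised across the token dimension
pooled = torch.sum(weights * features, dim=1)  # weighted sum of tokens: (batch, hidden)

print(weights.sum(dim=1).flatten())  # each entry is (approximately) 1
print(pooled.shape)                  # torch.Size([2, 8])
```

Note that the softmax runs over all token positions, so padded positions also receive some weight unless they are masked out beforehand.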

In [None]:
from torch import nn
from transformers import RobertaModel
from transformers import RobertaConfig

class ReadabilityRobertaModel(nn.Module):
    """Custom model for the Readability task containing a Roberta layer and a custom NN head."""
        
    def __init__(self):
        super(ReadabilityRobertaModel, self).__init__()
        
        self.model_config = RobertaConfig.from_pretrained(config['best_pretrained_roberta_folder'])
        self.model_config.update({
            "output_hidden_states": True
        })
        
        self.roberta = RobertaModel.from_pretrained(config['best_pretrained_roberta_folder'],
                                                    config=self.model_config)
        self.attention_head = AttentionHead(self.model_config.hidden_size, 
                                            self.model_config.hidden_size)
        self.dropout = nn.Dropout(0.1)
        self.regressor = nn.Linear(self.model_config.hidden_size, 1)
        
        
    def forward(self, tokens, attention_mask):
        x = self.roberta(input_ids=tokens, attention_mask=attention_mask)[0]
        x = self.attention_head(x)
        x = self.dropout(x)
        x = self.regressor(x)
        return x
    
    
    def freeze_roberta(self):
        """
        Freezes the parameters of the Roberta model so when ReadabilityRobertaModel is 
        trained only the weights of the custom regressor are modified.
        """
        for param in self.roberta.parameters():
            param.requires_grad = False
    
    def unfreeze_roberta(self):
        """
        Unfreezes the parameters of the Roberta model so when ReadabilityRobertaModel is 
        trained both the weights of the custom regressor and of the underlying Roberta
        model are modified.
        """
        for param in self.roberta.parameters():
            param.requires_grad = True

    
print(ReadabilityRobertaModel.__doc__)
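
The effect of freezing can be illustrated without loading RoBERTa. In the sketch below a small linear layer stands in for the backbone (an assumption made purely to keep the example light), and `count_trainable` is a hypothetical helper that is not part of the notebook.

```python
from torch import nn

toy_backbone = nn.Linear(4, 4)   # stand-in for the Roberta layers
toy_head = nn.Linear(4, 1)       # stand-in for the custom head
toy_model = nn.Sequential(toy_backbone, toy_head)

def count_trainable(module):
    """Counts the parameters that would receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

trainable_before = count_trainable(toy_model)   # 4*4 + 4 + 4 + 1 = 25

# Same idea as freeze_roberta: switch off gradients for the backbone only.
for param in toy_backbone.parameters():
    param.requires_grad = False

trainable_after = count_trainable(toy_model)    # only the head: 4 + 1 = 5

print(trainable_before, trainable_after)
```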

<a id="setup_loss"></a>
## Loss function
[[back to top]](#toc)

The loss function is the root mean squared error (RMSE), following [Maunish's notebook](https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune).

In [None]:
def loss_fn(y_predicted, y_actual):
    return torch.sqrt(nn.MSELoss()(y_predicted, y_actual))
    
print('Defined loss function.')
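
As a quick sanity check of the formula, the following sketch evaluates the same expression on three made-up predictions and compares it with a manual computation. The numbers are illustrative assumptions.

```python
import torch
from torch import nn

toy_predicted = torch.tensor([1.0, 2.0, 3.0])
toy_actual = torch.tensor([1.0, 2.0, 5.0])

# Same expression as loss_fn above.
toy_rmse = torch.sqrt(nn.MSELoss()(toy_predicted, toy_actual))

# Manual computation: squared errors are 0, 0 and 4, so the mean is 4/3.
toy_manual = (((toy_predicted - toy_actual) ** 2).mean()) ** 0.5

print(round(toy_rmse.item(), 4))    # 1.1547
print(round(toy_manual.item(), 4))  # 1.1547
```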

<a id="training"></a>
# Training process
[[back to top]](#toc)

<a id='training_model_iteration'></a>
## Iteration to train & validate model
[[back to top]](#toc)

In [None]:
import torch

def get_batch_prediction(model, batch):
    """Executes the provided model on the batch to make predictions."""
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)

    batch_prediction = torch.flatten(model(tokens=input_ids, attention_mask=attention_mask))
    
    del input_ids, attention_mask
    torch.cuda.empty_cache()
    
    return batch_prediction

print(get_batch_prediction.__doc__)

In [None]:
import torch
from tqdm.auto import tqdm

def validate_model(model, dataloader, loss_fn):
    """Validates the given model with the data obtained from the provided dataloader. Returns the mean of the per-batch losses."""
    model.eval()

    loss = 0
    size = len(dataloader)
            
    with torch.no_grad():
        for batch in tqdm(dataloader):
            targets = batch['target'].to(device)
            predictions = get_batch_prediction(model, batch)

            loss += loss_fn(predictions, targets).item()

    loss = loss / size
    
    return loss

print(validate_model.__doc__)
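
One detail worth keeping in mind: `validate_model` averages the per-batch RMSE values, and because the square root is not linear this average is generally not equal to the RMSE computed over the whole validation set at once. The sketch below shows the gap on four made-up predictions split into two batches. The helper `toy_rmse_fn` and all values are assumptions for illustration.

```python
import torch

def toy_rmse_fn(pred, target):
    """Root mean squared error of two 1-D tensors."""
    return torch.sqrt(torch.mean((pred - target) ** 2))

val_targets = torch.zeros(4)
val_predictions = torch.tensor([0.0, 0.0, 2.0, 2.0])

# Two batches of two samples each: RMSE 0 for the first, 2 for the second.
batch_rmses = [toy_rmse_fn(val_predictions[:2], val_targets[:2]),
               toy_rmse_fn(val_predictions[2:], val_targets[2:])]
mean_of_batch_rmse = torch.stack(batch_rmses).mean().item()

# RMSE over all four samples at once: sqrt((0 + 0 + 4 + 4) / 4) = sqrt(2).
overall_rmse = toy_rmse_fn(val_predictions, val_targets).item()

print(mean_of_batch_rmse)      # 1.0
print(round(overall_rmse, 4))  # 1.4142
```

The averaged value is still useful for comparing checkpoints against each other, which is all the training loop needs it for.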

In [None]:
def save_if_best_model(model, fold, best_loss, val_loss):
    """Saves the state dictionary of the provided model if the current validation loss is better than the best loss."""
    if val_loss < best_loss:
        best_loss = val_loss
        torch.save(model.state_dict(), f'./model{fold}/model{fold}.bin')
        tokenizer.save_pretrained(f'./model{fold}')
        print(f"Model & tokenizer have been saved to ./model{fold}.")
        
    return best_loss

print(save_if_best_model.__doc__)
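
A state dictionary saved this way can be loaded back into a freshly constructed model of the same architecture. The sketch below shows the round trip with a small linear layer and a temporary directory. Both are stand-ins chosen for illustration, not the notebook's actual model or paths.

```python
import os
import tempfile

import torch
from torch import nn

toy_saved_model = nn.Linear(4, 1)   # stand-in for ReadabilityRobertaModel

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, 'model0.bin')
    torch.save(toy_saved_model.state_dict(), checkpoint_path)

    # A new instance starts with different random weights ...
    toy_restored_model = nn.Linear(4, 1)
    # ... and takes over the saved ones after load_state_dict.
    toy_restored_model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))

print(torch.equal(toy_saved_model.weight, toy_restored_model.weight))  # True
```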

The effective batch size (reached through gradient accumulation) is used as described in [this Stack Overflow question](https://stackoverflow.com/questions/68479235/cuda-out-of-memory-error-cannot-reduce-batch-size).
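
The idea behind gradient accumulation can be sketched on a toy problem: gradients from several small batches are summed by repeated `backward()` calls, and a single optimizer step then acts as if one large batch had been processed. The example assumes a plain MSE loss, equally sized micro-batches and a loss scaled by the number of accumulation steps. With the RMSE loss used in this notebook the match is only approximate, since the square root is not linear.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)
toy_inputs = torch.randn(8, 3)
toy_labels = torch.randn(8, 1)

reference_model = nn.Linear(3, 1)
accumulating_model = copy.deepcopy(reference_model)   # identical starting weights
mse = nn.MSELoss()

# (a) One large batch containing all 8 samples.
mse(reference_model(toy_inputs), toy_labels).backward()

# (b) Four micro-batches of 2 samples; .grad is summed across backward() calls.
micro_batch_size = 2
accumulation_steps = len(toy_inputs) // micro_batch_size
for start in range(0, len(toy_inputs), micro_batch_size):
    chunk = slice(start, start + micro_batch_size)
    micro_loss = mse(accumulating_model(toy_inputs[chunk]), toy_labels[chunk])
    # Dividing by the number of steps turns the sum of gradients into their average.
    (micro_loss / accumulation_steps).backward()

grads_match = torch.allclose(reference_model.weight.grad,
                             accumulating_model.weight.grad, atol=1e-6)
print(grads_match)  # True
```

Only the micro-batch has to fit into GPU memory, which is why this helps with out-of-memory errors at large effective batch sizes.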

In [None]:
from tqdm.auto import tqdm

import torch

tqdm.pandas()

def train_and_eval(model, train_dataloader, val_dataloader, optimizer,
                   loss_fn, best_loss, fold):
    """Trains the model for one epoch, validating after every optimizer step. Returns the best validation loss."""
    
    model.train()
    total_train_loss = 0
    # number of batches whose gradients are accumulated before each optimizer step
    accumulation_steps = max(1, config['effective_batch_size'] // config['batch_size'])
    optimizer.zero_grad()
    
    for i, train_batch in enumerate(tqdm(train_dataloader)):
        train_targets = train_batch['target'].to(device)
        train_predictions = get_batch_prediction(model, train_batch)     
        
        train_loss = loss_fn(train_predictions, train_targets)
        train_loss.backward()
        total_train_loss += train_loss.item()  # track every batch so the average below is correct
                
        # step once a full accumulation set has been processed (or at the last batch)
        if ((i + 1) % accumulation_steps == 0) or ((i + 1) == len(train_dataloader)):
            print(f"{i}th batch:")
            optimizer.step()
            optimizer.zero_grad()
        
            val_loss = validate_model(model, val_dataloader, loss_fn)
            best_loss = save_if_best_model(model, fold, best_loss, val_loss)
            model.train()  # validate_model switched the model to eval mode
            
            if not (val_loss > best_loss):
                print(f"Best validation loss: {best_loss}")
                print(f"Training loss: {total_train_loss/(i+1)}")
    
    return best_loss
                
print(train_and_eval.__doc__)

<a id='training_model_fold'></a>
## Training a model for each fold
[[back to top]](#toc)

In [None]:
from torch.utils.data import DataLoader

def create_dataloader(texts, targets):
    """Converts the provided texts & targets into a dataloader"""
    encodings = tokenizer(texts.values.tolist(),
                         max_length=config['sentence_max_length'],
                         truncation=True, padding=True)
    dataset = ReadabilityDataset(encodings, targets.values.tolist())
    dataloader = DataLoader(dataset, batch_size=config['batch_size'], shuffle=True)
    
    return dataloader

print(create_dataloader.__doc__)

In [None]:
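def train_fold(fold, model, tokenizer, optimizer):
    """Trains the foldth model."""
    # the split and the tokenized texts do not change between epochs, so build them once
    x_train = train_data.query(f"fold != {fold}")
    x_val = train_data.query(f"fold == {fold}")
    
    train_dataloader = create_dataloader(x_train['excerpt'], x_train['target'])
    val_dataloader = create_dataloader(x_val['excerpt'], x_val['target'])
    
    best_loss = float('inf')
    for i in range(config['num_of_epochs']):
        print(f'Epoch # {i+1}:')
        result = train_and_eval(model=model,
                                train_dataloader=train_dataloader, val_dataloader=val_dataloader,
                                optimizer=optimizer, loss_fn=loss_fn,
                                best_loss=best_loss,
                                fold=fold)
        # if train_and_eval returns the best loss, carry it over so that a later
        # epoch cannot overwrite a better checkpoint; otherwise keep the old value
        best_loss = result if result is not None else best_loss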
        
print(train_fold.__doc__)
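
The two `query` calls split the dataframe into a training part and a validation part based on the *fold* column. A minimal sketch with a made-up six-row dataframe (all values are illustrative assumptions):

```python
import pandas as pd

toy_frame = pd.DataFrame({
    'excerpt': ['a', 'b', 'c', 'd', 'e', 'f'],
    'target': [-1.0, 0.5, -0.3, 0.2, -2.0, 1.1],
    'fold': [0, 1, 2, 0, 1, 2],
})

toy_fold = 1
toy_train_part = toy_frame.query(f"fold != {toy_fold}")  # rows from folds 0 and 2
toy_val_part = toy_frame.query(f"fold == {toy_fold}")    # rows from fold 1 only

print(toy_train_part['excerpt'].tolist())  # ['a', 'c', 'd', 'f']
print(toy_val_part['excerpt'].tolist())    # ['b', 'e']
```

Every row lands in the validation part of exactly one fold, so each of the k models is validated on data it has not been trained on.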

In [None]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation

print('Train the models:')
for fold in range(config['num_of_folds']):
    model = ReadabilityRobertaModel()
    model.to(device)
    
    optimizer = AdamW(model.parameters(), lr=config['lr'], weight_decay=config['wd'])
    
    print(f"Fold #: {fold}")
    train_fold(fold, model, tokenizer, optimizer)
    
    del model
    torch.cuda.empty_cache()
    print(f"Fold # {fold} successfully finished.")