<h1>RoBERTa-large 5-fold single model (MeanPooling)</h1>

This is an inference notebook. 

It averages 5 folds of a single model built on the <b>MeanPoolingModel</b> representation (see my notebook on [the best transformer representations](https://www.kaggle.com/jcesquiveld/best-transformer-representations)), which is the representation that has given me the best results.

To train this model, I've used the following <b>strategy</b>:
<ul>
    <li>I've used RoBERTa-large as my base model.</li>
    <li>Because the dataset is small, and in my experiments I've seen that 5 epochs achieve almost the same results as training for a longer time, I've done all the runs just for 5 epochs.</li>
    <li>I've pretrained RoBERTa-large on the competition data and on an extended dataset with extra excerpts from the same Wikipedia and SimpleWikipedia entries that appear in the competition dataset. Also, more than 1200 excerpts in the training set come from books freely available on <a href='https://gutenberg.org/'>the Project Gutenberg website</a>, so I've used extra excerpts from the <b>same books</b>. I've combined them in different ways: for example, pretraining on only the competition data for 2 epochs, on both datasets for 2 epochs, on only the extended dataset for 5 epochs, etc.</li>
    <li>I've also used layer-wise learning rate decay (see my post <a href='https://www.kaggle.com/c/commonlitreadabilityprize/discussion/251761'>Layer-wise learning rate decay. What values to use?</a>). As you can see there, there's no clear winner, though the best mean RMSE is obtained with an initial learning rate of 3e-5 and a multiplier of 0.975.</li>
    <li>Because the competition dataset is small, I've tried different combinations of pretraining strategy, initial learning rate and multiplier, each with 5 different seeds. From these experiments, I've kept the best loss value for each fold (evaluating every 20 iterations). These are the results (LB 0.462):</li>
</ul>
        <table align='left'>
            <tr><th>Fold</th><th>Loss</th></tr>
            <tr><td>0</td><td>0.2227</td></tr>
            <tr><td>1</td><td>0.2424</td></tr>
            <tr><td>2</td><td>0.2143</td></tr>
            <tr><td>3</td><td>0.1960</td></tr>
            <tr><td>4</td><td>0.2354</td></tr>
        </table>
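
The layer-wise learning rate decay mentioned in the list above can be sketched in a few lines. This is only an illustration, not the training code: the helper name `layerwise_learning_rates` is hypothetical, and I'm assuming that the initial learning rate applies to the top transformer layer and is multiplied by the factor once for every layer below it.

```python
# Hypothetical helper (not part of the training code): per-layer learning rates
# under layer-wise decay, assuming the initial rate belongs to the top layer.
def layerwise_learning_rates(num_layers, top_lr, multiplier):
    """Return learning rates ordered from the bottom layer (index 0) to the top layer."""
    return [top_lr * multiplier ** (num_layers - 1 - i) for i in range(num_layers)]


# RoBERTa-large has 24 transformer layers; 3e-5 and 0.975 are the values quoted above.
rates = layerwise_learning_rates(num_layers=24, top_lr=3e-5, multiplier=0.975)

for i in (0, 11, 23):
    print(f"layer {i:2d}: lr = {rates[i]:.3e}")
```

In a real training loop, each rate would go into its own optimizer parameter group, so the lower layers (which hold more general features) move more slowly than the upper ones.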

In [None]:
# Imports

import os
import random
import numpy as np
import pandas as pd
import glob
import re
import gc; gc.enable()

import torch
import torch.nn as nn
from torch.utils.data import Dataset, SequentialSampler, DataLoader

# Only the classes needed for inference
from transformers import AutoConfig, AutoModel, AutoTokenizer

import transformers

from tqdm.notebook import tqdm

import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import mean_squared_error


In [None]:
# Constants

SEED = 42

HIDDEN_SIZE = 1024
MAX_LEN = 300

INPUT_DIR = '../input/commonlitreadabilityprize'
BASELINE_DIR = '../input/baseline-mp-ft2'
MODEL_DIR = '../input/roberta-transformers-pytorch/roberta-large'

TOKENIZER = AutoTokenizer.from_pretrained(MODEL_DIR)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 8

In [None]:
# Utility functions

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark mode lets cuDNN pick algorithms per run, which can break reproducibility
    torch.backends.cudnn.benchmark = False
    
    
seed_everything(SEED)

In [None]:
# Data

submission = pd.read_csv(os.path.join(INPUT_DIR, 'sample_submission.csv'))
test = pd.read_csv(os.path.join(INPUT_DIR, 'test.csv'))
test.head()

In [None]:
# Dataset

class CLRPDataset(Dataset):
    def __init__(self, texts, tokenizer):
        self.texts = texts
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encode = self.tokenizer(
            self.texts[idx],
            padding='max_length',
            max_length=MAX_LEN,
            truncation=True,
            add_special_tokens=True,
            return_attention_mask=True,
            return_tensors='pt'
        ) 
        return encode

In [None]:
# Model

class MeanPoolingModel(nn.Module):
    
    def __init__(self, model_name):
        super().__init__()
        
        config = AutoConfig.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name, config=config)
        self.layer_norm = nn.LayerNorm(HIDDEN_SIZE)
        self.linear = nn.Linear(HIDDEN_SIZE, 1)
        self.loss = nn.MSELoss()
        
    def forward(self, input_ids, attention_mask, labels=None):
        
        outputs = self.model(input_ids, attention_mask)
        last_hidden_state = outputs[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        mean_embeddings = sum_embeddings / sum_mask
        norm_mean_embeddings = self.layer_norm(mean_embeddings)
        logits = self.linear(norm_mean_embeddings)
        
        # (batch, 1) -> (batch,); a single squeeze keeps the batch axis even when batch size is 1
        preds = logits.squeeze(-1)
        
        if labels is not None:
            loss = self.loss(preds.view(-1).float(), labels.view(-1).float())
            return loss
        else:
            return preds
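
To make the pooling step in `forward` easier to follow, here is a minimal sketch of the same masked mean on a toy input. It uses NumPy instead of PyTorch so that it runs without a GPU or model weights; the function name `masked_mean` and the toy values are my own, chosen only for illustration.

```python
import numpy as np


def masked_mean(hidden, mask):
    """Average token vectors over the sequence axis, ignoring padded positions.

    hidden: (batch, seq_len, hidden_size) array of token embeddings
    mask:   (batch, seq_len) array with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


# One sequence of three tokens with hidden size 2; the third token is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(hidden, mask)
print(pooled)  # the padded token does not affect the average: [[2. 3.]]
```

A plain mean over the sequence axis would let the padding vector dominate the result, which is why the attention mask is applied before summing.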
        


<h2>Prediction</h2>

In [None]:
def predict(df, model):
    
    ds = CLRPDataset(df.excerpt.tolist(), TOKENIZER)
    dl = DataLoader(
        ds,
        batch_size=BATCH_SIZE,
        shuffle=False,
        pin_memory=False
    )
    
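    model.to(DEVICE)
    model.eval()
    
    predictions = []
    # Disable gradient tracking: inference does not need the autograd graph
    with torch.no_grad():
        for batch in tqdm(dl):
            # Each encoded item has shape (1, MAX_LEN), so a batch arrives as (B, 1, MAX_LEN); flatten it to (B, MAX_LEN)
            inputs = {key: val.reshape(val.shape[0], -1).to(DEVICE) for key, val in batch.items()}
            outputs = model(**inputs)
            predictions.extend(outputs.cpu().numpy().ravel())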
        
    return predictions
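
The `reshape` inside `predict` is needed because of how the batches are assembled. The sketch below reproduces only the shape arithmetic, with NumPy arrays standing in for the tensors, on the assumption that the default collation stacks the items along a new leading axis.

```python
import numpy as np

MAX_LEN = 300   # same value as the constant defined earlier
BATCH = 8       # same value as BATCH_SIZE

# One encoded item as returned with return_tensors='pt': shape (1, MAX_LEN).
item = np.zeros((1, MAX_LEN), dtype=np.int64)

# Stacking the items along a new leading axis gives (BATCH, 1, MAX_LEN).
batch = np.stack([item] * BATCH)

# The reshape used in predict() drops the singleton axis: (BATCH, MAX_LEN).
flat = batch.reshape(batch.shape[0], -1)

print(batch.shape, flat.shape)
```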
    

In [None]:
# Calculate predictions of each fold and average them

fold_predictions = []
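for path in sorted(glob.glob(BASELINE_DIR + '/*.ckpt')):
    model = MeanPoolingModel(MODEL_DIR)
    # map_location lets checkpoints saved on a GPU load on a CPU-only machine too
    model.load_state_dict(torch.load(path, map_location=DEVICE))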
    fold = int(re.match(r'.*_f_?(\d)_.*', path).group(1))
    print(f'*** fold {fold}: {path} ***')
    y_pred = predict(test, model)
    fold_predictions.append(y_pred)
    
    # Free memory
    del model
    gc.collect()
    
predictions = np.mean(fold_predictions, axis=0)
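
As a small illustration of the two steps above (reading the fold number from the file name and averaging across folds), here is a sketch with made-up inputs. The checkpoint names and prediction values are hypothetical; the real files in `BASELINE_DIR` may be named differently.

```python
import re
import numpy as np

# Hypothetical checkpoint names, chosen only to exercise both forms the regex accepts.
paths = ["model_f_0_loss.ckpt", "model_f3_loss.ckpt"]
folds = [int(re.match(r'.*_f_?(\d)_.*', p).group(1)) for p in paths]

# Made-up predictions from three folds for two test excerpts.
fold_predictions = [[-1.0, 0.5], [-0.8, 0.3], [-1.2, 0.1]]

# axis=0 averages across folds, leaving one value per excerpt.
averaged = np.mean(fold_predictions, axis=0)

print(folds, averaged)
```

Note that `(\d)` captures a single digit, which is enough for 5 folds but would need to become `(\d+)` for 10 or more.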

In [None]:
# Submission

submission['target'] = predictions
submission.to_csv('submission.csv', index=False)
submission.head()