# Context:
I trained a simple regressor on top of RoBERTa base and reached a leaderboard (LB) score of 0.526. I was not thrilled with it, but it beat my previous models, so I was OK with it. Later I came across Maunish's notebook, which does almost the same thing with a few small additions, such as a learning-rate scheduler, and scores 0.479, a substantial improvement. Congratulations to Maunish on it. I wanted to find out which of his components are responsible for the better score. In this notebook I therefore start from Maunish's model and progressively replace his components with mine, to identify which of his additions contribute to the improved performance.

Maunish's finetuner notebook : https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune  
Maunish's inference notebook: https://www.kaggle.com/maunish/clrp-pytorch-roberta-inference  
My finetuner notebook: https://www.kaggle.com/vigneshbaskaran/commonlit-easy-transformer-finetuner  
My inference notebook: https://www.kaggle.com/vigneshbaskaran/commonlit-easy-finetuner-inference  

The important components in an algorithm:


|      Component     |                   Maunish                   |             Vignesh              |
|:------------------:|:-------------------------------------------:|:--------------------------------:|
|  Pretrained model  | RoBERTa pretrained further on additional data |    Plain RoBERTa base from HF    |
|      Optimizer     |           AdamW with weight decay           |    AdamW with no weight decay    |
|      Scheduler     |               Cosine annealing              |           No scheduler           |
| Model architecture |    Last hidden state with attention head    | Pooler output, no attention head |
|        Loss        |          RMSE (square root of MSE)          |               MSE                |
|      Tokenizer     |                 max_len: 256                |         max_len: default         |
|  Train/valid split |           Stratified KFold on bins          |           Simple KFold           |


# Plan
1. Start by recreating Maunish's work - Make sure I can get his score of ~ 0.479
2. Sequentially replace Maunish's components with my components and track how the validation score changes
3. Finally make sure when every component of Maunish's work is replaced with my component I get my old score of ~ 0.526
4. Summarize the findings
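
The plan amounts to a one-component-at-a-time ablation. A minimal sketch of how the experiments below differ from the baseline, assuming hypothetical shorthand labels (`BASELINE`, `VARIANTS`, and the key names are made up for illustration; the settings themselves are taken from the experiment descriptions in this notebook):

```python
# Component settings of the baseline (Maunish's setup) as described in this notebook.
# The dictionary keys and value labels are hypothetical shorthand.
BASELINE = {
    "pretrained": "further-pretrained",
    "head": "attention",
    "loss": "rmse",
    "max_len": 256,
    "scheduler": "cosine",
}

VARIANTS = {
    "experiment_2": {**BASELINE, "pretrained": "hf-base"},
    "experiment_3": {**BASELINE, "head": "pooler", "loss": "mse"},
    "experiment_4": {**BASELINE, "max_len": None},
    "experiment_5": {**BASELINE, "scheduler": None},
    "experiment_6": {**BASELINE, "head": "pooler"},
}


def changed_components(baseline, variant):
    """Return the sorted names of components that differ from the baseline."""
    return sorted(key for key in baseline if baseline[key] != variant[key])


for name, variant in VARIANTS.items():
    print(name, changed_components(BASELINE, variant))
```

Listing the differences this way makes it visible that Experiment 3 changes two components at once (head and loss), while Experiment 6 changes only the head.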

# Recreating Maunish's model
1. Configuration: `{'lr': 2e-5, 'wd': 0.01, 'batch_size': 16, 'validated_every_n_iteration': 10, 'max_len': 256, 'epochs': 3, 'nfolds': 5, 'seed': 42}`
2. Seeds everything
3. Bins the data
4. KFold split based on bins
5. Define Dataset
6. Define model: Roberta + Attention head on last hidden state
7. Optimizer: `optim.AdamW(model.parameters(),lr=2e-5,weight_decay=0.01)`
8. LR scheduler: `lr_scheduler = get_cosine_schedule_with_warmup(optimizer,num_warmup_steps=0,num_training_steps= 10 * len(train_dl))`
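
To get a feel for what the scheduler in step 8 does, here is a minimal pure-Python sketch of a half-cosine decay with optional linear warmup, which is my understanding of the default shape of `get_cosine_schedule_with_warmup`. The function name `cosine_multiplier` and the value of `steps_per_epoch` are assumptions for illustration only.

```python
import math


def cosine_multiplier(step, num_training_steps, num_warmup_steps=0):
    """LR multiplier for a half-cosine decay with optional linear warmup."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 2e-5
steps_per_epoch = 100                 # hypothetical stand-in for len(train_dl)
total_steps = 10 * steps_per_epoch    # same factor of 10 as in step 8

for epoch in (0, 1, 2, 3, 10):
    step = epoch * steps_per_epoch
    print(epoch, base_lr * cosine_multiplier(step, total_steps))
```

Because the schedule is sized for 10 epochs' worth of steps while the configuration trains for 3, the learning rate under this sketch only drops to roughly 79% of its initial value by the end of training rather than to zero.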

In [None]:
import os
import gc
import abc
import torch
import random
import shutil
import numpy as np
import pandas as pd

from torch import nn
from pathlib import Path
from transformers.file_utils import ModelOutput
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import StratifiedKFold
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import get_cosine_schedule_with_warmup, AutoModel, AutoTokenizer

In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode may select different kernels between runs
seed_everything()

# Common parts
## Dataset and Dataloader

In [None]:
class TrainingDataset(Dataset):
    def __init__(self, text_excerpts, targets):
        self.text_excerpts = text_excerpts
        self.targets = targets
    
    def __len__(self):
        return len(self.text_excerpts)
    
    def __getitem__(self, idx):
        sample = {'text_excerpt': self.text_excerpts[idx],
                  'target': self.targets[idx]}
        return sample

In [None]:
def transform_targets(targets):
    targets = targets.astype(np.float32).reshape(-1, 1)
    return targets

In [None]:
def create_training_dataloader(data, batch_size, shuffle, num_workers=4):
    text_excerpts = data['excerpt'].tolist()
    targets = transform_targets(data['target'].to_numpy())
    dataset = TrainingDataset(text_excerpts=text_excerpts, targets=targets)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers, pin_memory=True, drop_last=False)
    return dataloader

In [None]:
def split_into_kfolds(data, n_splits, shuffle, random_state):
    num_bins = int(np.floor(1 + np.log2(len(data))))
    data['bins'] = pd.cut(data['target'], bins=num_bins, labels=False)
    kf = StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    for train_indices, valid_indices in kf.split(X=data, y=data['bins'].tolist()):
        yield data.iloc[train_indices], data.iloc[valid_indices]
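
The binning step above turns the continuous target into pseudo-classes so that `StratifiedKFold` can balance them across folds. A minimal stdlib-only sketch of the same idea, assuming synthetic targets and a simplified equal-width binning (the helper names are made up, and `pd.cut` handles bin edges slightly differently):

```python
import math
import random
from collections import Counter


def sturges_num_bins(num_samples):
    """Sturges-style bin count, the same formula used in split_into_kfolds."""
    return int(math.floor(1 + math.log2(num_samples)))


def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width bins (approximates pd.cut)."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [min(int((v - low) / width), num_bins - 1) for v in values]


random.seed(42)
# Synthetic targets for illustration, not the competition data
targets = [random.gauss(-1.0, 1.0) for _ in range(1000)]
num_bins = sturges_num_bins(len(targets))
bins = equal_width_bins(targets, num_bins)
print(num_bins, sorted(Counter(bins).items()))
```

With bell-shaped targets the outermost bins tend to hold only a handful of samples, which is worth checking because scikit-learn warns when a class has fewer members than `n_splits`.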

## Metric, EarlyStopping, Saver, Monitor

In [None]:
class Metric:
    def __init__(self):
        self.sse = 0
        self.num_samples = 0
    
    def update(self, targets, predictions):
        self.sse += np.sum(np.square(targets - predictions))
        self.num_samples += len(targets)
        
    def get_rmse(self):
        rmse = np.sqrt(self.sse / self.num_samples)
        return rmse
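
`Metric` accumulates the sum of squared errors over all batches and takes the square root once at the end. A small sketch with made-up numbers of why that differs from averaging per-batch RMSE values:

```python
import math

# (targets, predictions) per batch; values are made up for illustration
batches = [
    ([0.0, 0.0], [1.0, 1.0]),  # two samples, each with error 1
    ([0.0], [4.0]),            # one sample with error 4
]

sse, num_samples, per_batch_rmse = 0.0, 0, []
for targets, predictions in batches:
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    sse += sum(squared_errors)
    num_samples += len(squared_errors)
    per_batch_rmse.append(math.sqrt(sum(squared_errors) / len(squared_errors)))

global_rmse = math.sqrt(sse / num_samples)                       # what Metric.get_rmse computes
mean_of_batch_rmse = sum(per_batch_rmse) / len(per_batch_rmse)   # a different quantity

print(global_rmse, mean_of_batch_rmse)
```

The two numbers disagree whenever batches differ in size or error level, so the accumulated version is the one that matches the RMSE over the whole validation set.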

In [None]:
class Monitor:
    def __init__(self, num_patient_epochs):
        self.num_patient_epochs = num_patient_epochs
        self.best_epoch_num = None
        self.best_score = np.inf
        self.best_model = None
        
    def early_stopping(self, current_epoch_num):
        # Nothing to compare against until a best epoch has been recorded
        if self.best_epoch_num is None:
            return False
        return current_epoch_num > self.best_epoch_num + self.num_patient_epochs
    
    def update_best_model(self, current_epoch_num, current_score, current_model, current_tokenizer, save_path):
        if current_score < self.best_score:
            self.best_epoch_num = current_epoch_num
            self.best_score = current_score
#             self.best_model = current_model
            # Save model and tokenizer
            shutil.rmtree(save_path, ignore_errors=True)
            os.makedirs(save_path)
            torch.save(current_model.state_dict(), save_path / 'model.pth')
            current_tokenizer.save_pretrained(save_path)
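
The patience rule in `Monitor.early_stopping` is defined above but is not called by the training loop later in this notebook. As a minimal sketch of how such a check could behave, here is a stand-alone function (hypothetical name `run_with_patience`, made-up scores) that applies the same comparison to a list of per-epoch validation scores:

```python
def run_with_patience(epoch_scores, num_patient_epochs):
    """Return (last_epoch_run, best_epoch, best_score) under the patience rule."""
    best_score, best_epoch = float("inf"), None
    for epoch, score in enumerate(epoch_scores):
        if score < best_score:
            best_score, best_epoch = score, epoch
        # Same comparison as Monitor.early_stopping
        if epoch > best_epoch + num_patient_epochs:
            return epoch, best_epoch, best_score
    return len(epoch_scores) - 1, best_epoch, best_score


# Made-up validation RMSE values per epoch
scores = [0.60, 0.55, 0.56, 0.57, 0.58]
print(run_with_patience(scores, num_patient_epochs=1))
```

With a patience of 1 the loop stops at epoch 3, two epochs after the best score at epoch 1, because the comparison is strict.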

In [None]:
class KfoldMonitor:
    def __init__(self):
        self.fold_monitor = {}
        
    def update(self, fold, monitor):
        self.fold_monitor[fold] = monitor
        
    def get_mean_fold_score(self):
        mean_cross_validation_score = np.mean([fold_monitor.best_score for fold_monitor in self.fold_monitor.values()])
        return mean_cross_validation_score
        

## Training and Validation loop

In [None]:
def clear_cuda():
    gc.collect()
    torch.cuda.empty_cache()

In [None]:
def train(dataloader, model, tokenizer, padding, max_length, optimizer, device, scheduler=None):
    clear_cuda()
    model.train()
    model.to(device)
    epoch_loss = 0
    for batch_num, batch in enumerate(dataloader):
        # Forward Propagation
        inputs = tokenizer(batch['text_excerpt'], padding=padding, truncation=True, max_length=max_length, return_tensors="pt")
        inputs = {key:value.to(device) for key, value in inputs.items()}
        targets = batch['target'].to(device)
        optimizer.zero_grad()
        outputs = model(**inputs, labels=targets)
        epoch_loss += outputs.loss.item()
        # Backpropagation
        outputs.loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
    average_epoch_loss = epoch_loss/len(dataloader)
    return model, average_epoch_loss

In [None]:
def evaluate(dataloader, model, tokenizer, padding, max_length, device):
    clear_cuda()
    model.eval()
    model.to(device)
    epoch_loss = 0
    metric = Metric()
    for batch_num, batch in enumerate(dataloader):
        # Forward Propagation
        inputs = tokenizer(batch['text_excerpt'], padding=padding, truncation=True, max_length=max_length,return_tensors="pt")
        inputs = {key:value.to(device) for key, value in inputs.items()}
        targets = batch['target'].to(device)
        with torch.no_grad():
            outputs = model(**inputs, labels=targets)
        epoch_loss += outputs.loss.item()
        targets = targets.detach().cpu().numpy()
        predictions = outputs.logits.detach().cpu().numpy()
        metric.update(targets=targets, predictions=predictions)
    average_epoch_loss = epoch_loss/len(dataloader)
    return average_epoch_loss, metric

In [None]:
def train_and_evaluate(fold, epoch_num, train_dataloader, valid_dataloader, model, tokenizer, padding, max_length, optimizer, device, scheduler, monitor, save_path):
    clear_cuda()
    model.to(device)
    for batch_num, batch in enumerate(train_dataloader):
        model.train()
        # Forward Propagation
        inputs = tokenizer(batch['text_excerpt'], padding=padding, truncation=True, max_length=max_length, return_tensors="pt")
        inputs = {key:value.to(device) for key, value in inputs.items()}
        targets = batch['target'].to(device)
        optimizer.zero_grad()
        outputs = model(**inputs, labels=targets)
        # Backpropagation
        outputs.loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
        # Evaluate
        if batch_num  % 10 == 0:
            model.eval()
            metric = Metric()
            for _, batch in enumerate(valid_dataloader):
                # Forward Propagation
                inputs = tokenizer(batch['text_excerpt'], padding=padding, truncation=True, max_length=max_length,return_tensors="pt")
                inputs = {key:value.to(device) for key, value in inputs.items()}
                targets = batch['target'].to(device)
                with torch.no_grad():
                    outputs = model(**inputs, labels=targets)
                targets = targets.detach().cpu().numpy()
                predictions = outputs.logits.detach().cpu().numpy()
                metric.update(targets=targets, predictions=predictions)
            print(f'Fold: {fold}, epoch: {epoch_num}, Iteration num: {batch_num}, RMSE: {metric.get_rmse()}')
            monitor.update_best_model(current_epoch_num=epoch_num, current_score=metric.get_rmse(),
                                      current_model=model, current_tokenizer=tokenizer, save_path=save_path)
    return monitor

In [None]:
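from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RegressorOutput(ModelOutput):
    # ModelOutput subclasses are expected to be dataclasses whose fields default to None
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None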

# Different parts

## Vignesh Model

In [None]:
class RobertaPoolerRegressor(nn.Module):
    def __init__(self, model_path, apply_sqrt_to_loss):
        super(RobertaPoolerRegressor, self).__init__()
        self.roberta = AutoModel.from_pretrained(model_path)
        self.dropout = nn.Dropout(self.roberta.config.hidden_dropout_prob)
        self.regressor = nn.Linear(self.roberta.config.hidden_size, 1)
        self.loss_fn = nn.MSELoss()
        self.apply_sqrt_to_loss = apply_sqrt_to_loss
    
    def forward(self, input_ids=None, attention_mask=None, labels=None):
        roberta_outputs = self.roberta(input_ids=input_ids, 
                                       attention_mask=attention_mask)
        pooler_output = roberta_outputs['pooler_output']
        pooler_output = self.dropout(pooler_output)
        logits = self.regressor(pooler_output)
        loss = None
        if labels is not None:
            # MSELoss expects (input, target)
            loss = self.loss_fn(logits, labels)
            if self.apply_sqrt_to_loss:
                loss = torch.sqrt(loss)
        return RegressorOutput(loss=loss, logits=logits)

## Maunish Model

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, hidden_dim):
        super(AttentionHead, self).__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim)
        self.V = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        attention_scores = self.V(torch.tanh(self.W(x)))
        attention_scores = torch.softmax(attention_scores, dim=1)
        attentive_x = attention_scores * x
        attentive_x = attentive_x.sum(axis=1)
        return attentive_x
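
The head above scores every token position, normalizes the scores with a softmax over the sequence dimension, and returns the weighted sum of token vectors. Note that `forward` receives only `x`, so padded positions also get a share of the weight. The stdlib-only sketch below illustrates the pooling with made-up vectors and, as a hypothetical variant that is not part of either notebook, shows what masking the padded position would change:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, scores, mask=None):
    """Weighted sum of token vectors; masked-out positions get zero weight."""
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, token_vectors)) for d in range(dim)]


# Made-up example: two real tokens followed by one padding position
token_vectors = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [0.0, 0.0, 0.0]  # stand-in for V(tanh(W(x))) per position

unmasked = attention_pool(token_vectors, scores)
masked = attention_pool(token_vectors, scores, mask=[1, 1, 0])
print(unmasked, masked)
```

In the real model the scores are learned, so the head may well learn to down-weight padding on its own; the sketch only shows the mechanics.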

In [None]:
class RobertaLastHiddenStateRegressor(nn.Module):
    def __init__(self, model_path):
        super(RobertaLastHiddenStateRegressor, self).__init__()
        self.roberta = AutoModel.from_pretrained(model_path)
        self.head = AttentionHead(self.roberta.config.hidden_size)
        self.dropout = nn.Dropout(self.roberta.config.hidden_dropout_prob)
        self.regressor = nn.Linear(self.roberta.config.hidden_size, 1)
        self.loss_fn = nn.MSELoss()
    
    def forward(self, input_ids=None, attention_mask=None, labels=None):
        roberta_outputs = self.roberta(input_ids=input_ids,
                                       attention_mask=attention_mask)
        last_hidden_state = roberta_outputs['last_hidden_state']
        attentive_vector = self.head(last_hidden_state)
        attentive_vector = self.dropout(attentive_vector)
        logits = self.regressor(attentive_vector)
        # MSELoss expects (input, target); the square root turns MSE into RMSE
        loss = torch.sqrt(self.loss_fn(logits, labels)) if labels is not None else None
        return RegressorOutput(loss=loss, logits=logits)

# Training

In [None]:
class Experiment(abc.ABC):
    def __init__(self, save_name, run_on_sample, n_splits=5, random_state=42, batch_size=8, num_epochs=3, num_patient_epochs=1):
        # Everything common across experiments is declared here
        self.full_train_data = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
        self.random_state = random_state
        self.run_on_sample = run_on_sample
        if run_on_sample:
            self.full_train_data = self.full_train_data.sample(frac=0.02, random_state=self.random_state)
        self.save_name = save_name
        self.n_splits = n_splits
        self.batch_size = batch_size
        self.num_epochs = num_epochs
        self.num_patient_epochs = num_patient_epochs
        self.kfold_monitor = KfoldMonitor()
        self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    
    @abc.abstractmethod
    def get_experiment_params(self):
        raise NotImplementedError
            
    def run(self):
        clear_cuda()
        for fold, (train_data, valid_data) in enumerate(split_into_kfolds(data=self.full_train_data, n_splits=self.n_splits,
                                                                              shuffle=True, random_state=self.random_state)):
            clear_cuda()
            train_dataloader = create_training_dataloader(data=train_data, batch_size=self.batch_size, shuffle=True)
            valid_dataloader = create_training_dataloader(data=valid_data, batch_size=self.batch_size, shuffle=False)
            model, tokenizer, padding, max_length, optimizer, scheduler = self.get_experiment_params()
            monitor = Monitor(num_patient_epochs=self.num_patient_epochs)
            save_path = Path(f'{self.save_name}/fold_{fold}')
            for epoch_num in range(self.num_epochs):
                monitor = train_and_evaluate(fold=fold, epoch_num=epoch_num, train_dataloader=train_dataloader, valid_dataloader=valid_dataloader,
                                             model=model, tokenizer=tokenizer, padding=padding, max_length=max_length,
                                             optimizer=optimizer, device=self.device, scheduler=scheduler, monitor=monitor, save_path=save_path)
            del model
            self.kfold_monitor.update(fold=fold, monitor=monitor)
            print('----------------------------------------------------')
            
        mean_cross_validation_score = self.kfold_monitor.get_mean_fold_score()
        print(f'Mean cross validation score: {mean_cross_validation_score}')       

# Experiment 1
Maunish's Original Algorithm. **No modification**. Details:
1. Pretrained model: Additional pretraining done on Goodreads data
2. Regressor: On top of Last hidden state - vectorized by attention
3. Tokenizer: max_length: 256
4. Scheduler: cosine schedule with warmup

In [None]:
class Experiment1(Experiment):
    def __init__(self):
        super(Experiment1, self).__init__(save_name='experiment_1', run_on_sample=False)

    def get_experiment_params(self):        
        model_path = '../input/maunish-clrp-model/clrp_roberta_base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaLastHiddenStateRegressor(model_path=model_path)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = 'max_length'
        max_length = 256
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10 * ((len(self.full_train_data) // self.batch_size) + 1))
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
experiment_1 = Experiment1()
experiment_1.run()

# Experiment 2
Maunish's Original Algorithm. **Pretrained model modified**. Details:
1. Pretrained model: **Base model from Huggingface**
2. Regressor: On top of Last hidden state - vectorized by attention
3. Tokenizer: max_length: 256
4. Scheduler: cosine schedule with warmup

In [None]:
class Experiment2(Experiment):
    def __init__(self):
        super(Experiment2, self).__init__(save_name='experiment_2', run_on_sample=True)

    def get_experiment_params(self):        
        model_path = '../input/commonlit-data-download/roberta-base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaLastHiddenStateRegressor(model_path=model_path)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = 'max_length'
        max_length = 256
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10 * ((len(self.full_train_data) // self.batch_size) + 1))
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
    
experiment_2 = Experiment2()
experiment_2.run()

# Experiment 3
Maunish's Original Algorithm. **Regressor modified with MSE loss**. Details:
1. Pretrained model: Additional pretraining done on Goodreads data
2. Regressor: On top of **Pooler output** with **MSE** loss  
3. Tokenizer: max_length: 256
4. Scheduler: cosine schedule with warmup

In [None]:
class Experiment3(Experiment):
    def __init__(self):
        super(Experiment3, self).__init__(save_name='experiment_3', run_on_sample=True)

    def get_experiment_params(self):        
        model_path = '../input/maunish-clrp-model/clrp_roberta_base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaPoolerRegressor(model_path=model_path, apply_sqrt_to_loss=False)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = 'max_length'
        max_length = 256
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10 * ((len(self.full_train_data) // self.batch_size) + 1))
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
    
experiment_3 = Experiment3()
experiment_3.run()

# Experiment 4
Maunish's Original Algorithm. **Max length of tokenizer modified**. Details:
1. Pretrained model: Additional pretraining done on Goodreads data
2. Regressor: On top of Last hidden state - vectorized by attention
3. Tokenizer: max_length: **None**
4. Scheduler: cosine schedule with warmup
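
This experiment switches from `padding='max_length'` with `max_length=256` to `padding=True` with `max_length=None`, i.e. from fixed-length batches to batches padded only to their longest sequence. A simplified stdlib-only sketch of the two behaviours, using made-up token ids and an assumed pad id (the helper `pad_batch` is hypothetical, and a real tokenizer also keeps the closing special token when truncating):

```python
def pad_batch(token_id_lists, pad_id, max_length=None):
    """Pad (and truncate) token id lists to a common length.

    max_length=None pads to the longest sequence in the batch (like padding=True);
    an integer pads or truncates every sequence to exactly that length
    (like padding='max_length' with truncation=True).
    """
    if max_length is not None:
        target_len = max_length
    else:
        target_len = max(len(ids) for ids in token_id_lists)
    padded = []
    for ids in token_id_lists:
        ids = ids[:target_len]
        padded.append(ids + [pad_id] * (target_len - len(ids)))
    return padded


batch = [[0, 11, 12, 2], [0, 21, 2]]  # made-up token ids
PAD_ID = 1                            # assumed pad id for illustration

print(pad_batch(batch, PAD_ID))                # dynamic padding: length 4
print(pad_batch(batch, PAD_ID, max_length=6))  # fixed padding: length 6
```

Dynamic padding wastes less compute on pad tokens, while fixed padding gives every batch the same shape; which one helps the score here is what the experiment is meant to show.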

In [None]:
class Experiment4(Experiment):
    def __init__(self):
        super(Experiment4, self).__init__(save_name='experiment_4', run_on_sample=True)

    def get_experiment_params(self):        
        model_path = '../input/maunish-clrp-model/clrp_roberta_base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaLastHiddenStateRegressor(model_path=model_path)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = True
        max_length = None
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10 * ((len(self.full_train_data) // self.batch_size) + 1))
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
    
experiment_4 = Experiment4()
experiment_4.run()

# Experiment 5
Maunish's Original Algorithm. **Scheduler removed**. Details:
1. Pretrained model: Additional pretraining done on Goodreads data
2. Regressor: On top of Last hidden state - vectorized by attention
3. Tokenizer: max_length: 256
4. Scheduler: **None**

In [None]:
class Experiment5(Experiment):
    def __init__(self):
        super(Experiment5, self).__init__(save_name='experiment_5', run_on_sample=True)

    def get_experiment_params(self):        
        model_path = '../input/maunish-clrp-model/clrp_roberta_base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaLastHiddenStateRegressor(model_path=model_path)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = 'max_length'
        max_length = 256
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = None
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
experiment_5 = Experiment5()
experiment_5.run()

# Experiment 6
Maunish's Original Algorithm. **Regressor modified with RMSE loss**. Details:
1. Pretrained model: Additional pretraining done on Goodreads data
2. Regressor: On top of **Pooler output** with **RMSE** loss (Same as Experiment 3 but instead of MSE loss using RMSE loss)
3. Tokenizer: max_length: 256
4. Scheduler: cosine schedule with warmup
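
As a short aside on why the square root could matter for optimization (a derivation, not a claim about the results): writing $N$ for the batch size, $\hat{y}_i(\theta)$ for the prediction, $y_i$ for the target and $\theta$ for the model parameters (notation introduced here), and assuming $\mathrm{MSE} > 0$,

$$
\mathrm{MSE}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}_i(\theta) - y_i\bigr)^2,
\qquad
\mathrm{RMSE}(\theta) = \sqrt{\mathrm{MSE}(\theta)}
$$

$$
\nabla_\theta\,\mathrm{RMSE}(\theta) = \frac{1}{2\,\mathrm{RMSE}(\theta)}\,\nabla_\theta\,\mathrm{MSE}(\theta)
$$

Both losses therefore push the parameters in the same direction for a given batch; only the step scale differs, by a factor that is larger than 1 when the batch RMSE is below 0.5 and smaller than 1 when it is above 0.5. Since that factor changes from batch to batch, the two losses are not guaranteed to train identically even with an adaptive optimizer such as AdamW.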

In [None]:
class Experiment6(Experiment):
    def __init__(self):
        super(Experiment6, self).__init__(save_name='experiment_6', run_on_sample=True)

    def get_experiment_params(self):        
        model_path = '../input/maunish-clrp-model/clrp_roberta_base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaPoolerRegressor(model_path=model_path, apply_sqrt_to_loss=True)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = 'max_length'
        max_length = 256
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=10 * ((len(self.full_train_data) // self.batch_size) + 1))
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
experiment_6 = Experiment6()
experiment_6.run()

# Experiment 7
My original algorithm. **No modification**. Details:
1. Pretrained model: No additional pretraining. Simply downloaded from HF
2. Regressor: On top of **Pooler output** with **MSE** loss
3. Tokenizer: max_length: None
4. Scheduler: None

In [None]:
class Experiment7(Experiment):
    def __init__(self):
        super(Experiment7, self).__init__(save_name='experiment_7', run_on_sample=False)

    def get_experiment_params(self):        
        model_path = '../input/commonlit-data-download/roberta-base'
        tokenizer_path = '../input/commonlit-data-download/roberta-base'
        model = RobertaPoolerRegressor(model_path=model_path, apply_sqrt_to_loss=False)
        model.to(self.device)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        padding = True
        max_length = None
        optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)
        scheduler = None
        return model, tokenizer, padding, max_length, optimizer, scheduler
    
experiment_7 = Experiment7()
experiment_7.run()

# Summarizing results

In [None]:
pd.options.display.max_colwidth = None

In [None]:
results = pd.DataFrame([{'experiment': 'experiment_1',
                         'description': ['Pretrained on Goodreads', 'Attention on Last hidden states', 'Tokenizer max_len: 256', 'Cosine scheduler with warmup'],
                         'sample': experiment_1.run_on_sample,
                         'score': experiment_1.kfold_monitor.get_mean_fold_score()},
                        
                        {'experiment': 'experiment_2',
                         'description': ['Base model from Huggingface', 'Attention on Last hidden states', 'Tokenizer max_len: 256', 'Cosine scheduler with warmup'],
                         'sample': experiment_2.run_on_sample,
                         'score': experiment_2.kfold_monitor.get_mean_fold_score()},
                        
                        {'experiment': 'experiment_3',
                         'description': ['Pretrained on Goodreads', 'Pooler output with MSE loss', 'Tokenizer max_len: 256', 'Cosine scheduler with warmup'],
                         'sample': experiment_3.run_on_sample,
                         'score': experiment_3.kfold_monitor.get_mean_fold_score()},
                        
                        {'experiment': 'experiment_4',
                         'description': ['Pretrained on Goodreads', 'Attention on Last hidden states', 'Tokenizer max_len: None', 'Cosine scheduler with warmup'],
                         'sample': experiment_4.run_on_sample,
                         'score': experiment_4.kfold_monitor.get_mean_fold_score()},
                        
                        {'experiment': 'experiment_5',
                         'description': ['Pretrained on Goodreads', 'Attention on Last hidden states', 'Tokenizer max_len: 256', 'No scheduler'],
                         'sample': experiment_5.run_on_sample,
                         'score': experiment_5.kfold_monitor.get_mean_fold_score()},
                        
                        {'experiment': 'experiment_6',
                         'description': ['Pretrained on Goodreads', 'Pooler output with RMSE loss', 'Tokenizer max_len: 256', 'Cosine scheduler with warmup'],
                         'sample': experiment_6.run_on_sample,
                         'score': experiment_6.kfold_monitor.get_mean_fold_score()},
                        
                        {'experiment': 'experiment_7',
                         'description': ['Base model from Huggingface', 'Pooler output with MSE loss', 'Tokenizer max_len: None', 'No scheduler'],
                         'sample': experiment_7.run_on_sample,
                         'score': experiment_7.kfold_monitor.get_mean_fold_score()}])

In [None]:
results