# 2- Learning 🤗 - Out-of-the-box RoBERTa [LB: 0.53]

Hi, and welcome! This is the second kernel of the series `Learning 🤗`, a personal project I'm currently working on. I am an experienced data scientist diving into the Hugging Face transformers library, and this series of kernels is a "working diary" that I write as I go. The approach I'm taking is the following: 
1. Explore various out-of-the-box models, without digging into their technical details. 
2. After that, go over the best-ranked public kernels, understand their ideas, and reproduce them myself. 

You are invited to follow me on this journey. In this short kernel we fine-tune an out-of-the-box cased RoBERTa, with just the minimal setup required for it to run in this competition, obtaining a leaderboard score of `0.53`. 


This is an ongoing project, so expect more notebooks to be added to the series soon. The series so far, including the notebooks still in progress (WIP):


1. [Learning 🤗 - Out-of-the-box BERT [LB: 0.577]](https://www.kaggle.com/julian3833/1-learning-out-of-the-box-bert-lb-0-577)
2. [Learning 🤗 - Out-of-the-box RoBERTa [LB: 0.53]](https://www.kaggle.com/julian3833/2-learning-out-of-the-box-roberta-lb-0-53) (this notebook)
3. [Learning 🤗 - Out-of-the-box Electra [LB: 0.58]](https://www.kaggle.com/julian3833/3-learning-out-of-the-box-electra-lb/)
4. _Learning 🤗 - Minimal fine tuning (WIP)_
5. _Learning 🤗 - Preprocessing (WIP)_
6. _Learning 🤗 - Reviewing public kernels (WIP)_
7. _Learning 🤗 - Intra-domain pre training RoBERTa (WIP)_



## This notebook

The code below is a copy of the code in [1- Learning 🤗 - Out-of-the-box BERT [LB: 0.577]](https://www.kaggle.com/julian3833/1-learning-out-of-the-box-bert-lb-0-577) with just two changes, shown in the next cell. Refer to that notebook for a more detailed description of the process and more verbose, commented code.

In [None]:
MODEL_NAME = "../input/huggingface-roberta-variants/roberta-base/roberta-base"
EPOCHS = 1

We import the [Huggingface Roberta Variants](https://www.kaggle.com/sauravmaheshkar/huggingface-roberta-variants) dataset instead of BERT this time.
And we train for 1 epoch because it gave better results than 3 epochs.
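
To put the 1-versus-3 epoch choice in perspective, the sketch below estimates how many optimizer steps and evaluation passes each setting implies. It rests on several assumptions: a training file of 2834 rows (not stated in this notebook), the `0.066` validation split and `eval_steps=100` used in the code below, a single device, and a batch size of 8, which I believe is the `Trainer` default. `split_sizes` and `steps_and_evals` are hypothetical helpers written only for this illustration.

```python
import math

N_SAMPLES = 2834  # ASSUMPTION: number of rows in train.csv, not stated in this notebook


def split_sizes(n_samples, test_size):
    """Sizes of the train/validation parts for a float test_size.

    Mirrors scikit-learn's rule of rounding the held-out part up.
    """
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val


def steps_and_evals(n_train, epochs, batch_size=8, eval_steps=100):
    """Optimizer steps and evaluation passes for a single-device run.

    ASSUMPTION: batch_size=8 and no gradient accumulation.
    """
    total_steps = math.ceil(n_train / batch_size) * epochs
    # An evaluation runs each time the step counter hits a multiple of eval_steps
    return total_steps, total_steps // eval_steps


n_train, n_val = split_sizes(N_SAMPLES, test_size=0.066)
for epochs in (1, 3):
    total_steps, n_evals = steps_and_evals(n_train, epochs)
    print(f"epochs={epochs}: {n_train} train / {n_val} val rows, "
          f"{total_steps} steps, {n_evals} evaluations")
```

Under these assumptions one epoch is only a few hundred optimizer steps, so the validation RMSE is reported just a handful of times per run. Keep that in mind when reading the training log.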


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer

def load_dfs():
    """Load the competition CSVs and rename the columns to `text` and `label`."""
    train_csv = '../input/commonlitreadabilityprize/train.csv'
    test_csv = '../input/commonlitreadabilityprize/test.csv'
    df_train = pd.read_csv(train_csv)[["excerpt", "target"]].rename(columns={"target": "label", "excerpt": "text"})
    df_test = pd.read_csv(test_csv)[["id", "excerpt"]].rename(columns={"excerpt": "text"})
    return df_train, df_test

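def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    return np.sqrt(((y_true - y_pred) ** 2).mean().item())

def compute_metrics(pred_results):
    # Predictions come as shape (n, 1); squeeze to (n,) so they line up with the labels
    y_pred = pred_results.predictions.squeeze()
    y_true = pred_results.label_ids
    return {"rmse": rmse(y_true, y_pred)}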

def submit(trainer, ds_test):
    """Predict on the test set and write the submission file."""
    sample_sub_csv = '../input/commonlitreadabilityprize/sample_submission.csv'
    pred_csv = '/kaggle/working/submission.csv'
    pred_results = trainer.predict(ds_test)
    y_pred = pred_results.predictions.squeeze()
    # Reuse the sample submission so the ids and column layout match what is expected
    df_res = pd.read_csv(sample_sub_csv)
    df_res['target'] = y_pred.tolist()
    df_res.to_csv(pred_csv, index=False)

def tokenize(tokenizer, df_train, df_val, df_test):
    """Tokenize the three splits and return each as a list of per-example dicts."""
    # Pad or truncate every excerpt to 512 tokens so all examples have the same length
    train_tokenized = tokenizer(df_train['text'].tolist(), padding="max_length", truncation=True, max_length=512)
    val_tokenized = tokenizer(df_val['text'].tolist(), padding="max_length", truncation=True, max_length=512)
    test_tokenized = tokenizer(df_test['text'].tolist(), padding="max_length", truncation=True, max_length=512)
    # Attach the regression targets; the test split has none
    train_tokenized['label'] = df_train['label'].tolist()
    val_tokenized['label'] = df_val['label'].tolist()
    # Turn the dict-of-lists tokenizer output into a list of dicts, one per example
    ds_train = [dict(zip(train_tokenized,t)) for t in zip(*train_tokenized.values())]
    ds_val = [dict(zip(val_tokenized,t)) for t in zip(*val_tokenized.values())]
    ds_test = [dict(zip(test_tokenized,t)) for t in zip(*test_tokenized.values())]
    return ds_train, ds_val, ds_test

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)  # one output: regression
df_base, df_test = load_dfs()
df_train, df_val = train_test_split(df_base, test_size=0.066)  # hold out about 6.6% for validation
ds_train, ds_val, ds_test = tokenize(tokenizer, df_train, df_val, df_test)
args = TrainingArguments("/kaggle/working/model/", num_train_epochs=EPOCHS,
                         evaluation_strategy="steps", eval_steps=100,  # newer transformers releases name this eval_strategy
                         report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=ds_train, eval_dataset=ds_val,
                  compute_metrics=compute_metrics)
trainer.train()
submit(trainer, ds_test)
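
The `squeeze()` call in `compute_metrics` is easy to overlook, so here is a small sketch of why it matters. It uses made-up numbers and a `SimpleNamespace` as a hypothetical stand-in for the object the `Trainer` passes to `compute_metrics`, on the assumption that a single-output model returns predictions of shape `(n, 1)`.

```python
from types import SimpleNamespace

import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical stand-in for the Trainer's prediction results (values are made up)
fake_results = SimpleNamespace(
    predictions=np.array([[0.5], [-1.0], [-1.0]]),  # shape (3, 1): one output per example
    label_ids=np.array([0.0, -1.0, -2.0]),          # shape (3,)
)

y_true = fake_results.label_ids
y_pred = fake_results.predictions.squeeze()  # (3, 1) -> (3,)
rmse_ok = rmse(y_true, y_pred)

# Without squeeze, (3,) - (3, 1) broadcasts to a (3, 3) matrix of all pairwise
# differences, and the "RMSE" silently becomes a different quantity.
broadcast_shape = (y_true - fake_results.predictions).shape
rmse_wrong = rmse(y_true, fake_results.predictions)

print(rmse_ok, broadcast_shape, rmse_wrong)
```

No error is raised in the second case, so a shape mismatch like this would only show up as a metric that looks off.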

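The list comprehensions at the end of `tokenize` turn the tokenizer output (one list per field) into one dict per example. Here is a minimal sketch of that idiom, using a plain dict with made-up token ids as a hypothetical stand-in for the real tokenizer output:

```python
# Hypothetical stand-in for the tokenizer output: field name -> one entry per example.
# The token ids and labels are made up for illustration.
tokenized = {
    "input_ids": [[0, 713, 2], [0, 1437, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
tokenized["label"] = [-0.34, 1.2]

# zip(*values) walks all fields in parallel and yields one tuple per example;
# dict(zip(keys, row)) re-attaches the field names to that example.
examples = [dict(zip(tokenized, row)) for row in zip(*tokenized.values())]

print(len(examples))
print(examples[0])
```

Each element of `examples` holds everything belonging to one excerpt, which is the per-example layout the notebook feeds to the `Trainer` as `train_dataset` and `eval_dataset`.
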
## 🤗🤗 Thanks for reading this notebook! Remember to upvote if you found it useful, and stay tuned for the next installments! 🤗🤗