<a id='introduction'></a>
# Introduction

In this notebook I fine-tune Hugging Face's RoBERTa-large model on the [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize) dataset. I follow Hugging Face's [relevant Colab page](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and Maunish's [guide](https://www.kaggle.com/maunish/clrp-pytorch-roberta-pretrain/) for the process.

This notebook is part of a series:
1. Pretrain RoBERTa-large on the CommonLit dataset (this notebook).
2. Produce k models that can later be used to determine the readability of texts, [here](https://www.kaggle.com/angyalfold/roberta-large-k-fold-models).
3. Make predictions with a custom NN regressor [here](https://www.kaggle.com/angyalfold/roberta-large-with-custom-regressor-pytorch/).
4. Ensemble (Roberta large + SVR, Roberta large + Ridge, Roberta large + custom NN head) [here](https://www.kaggle.com/angyalfold/ensemble-for-commonlit/).

To run this notebook, the `datasets` package needs to be installed.

In [None]:
!conda install -c huggingface -c conda-forge datasets -y

<a id='toc'></a>
# Table of contents
* [Introduction](#introduction)
* [Overview](#overview)
    * [The idea explained](#overview_idea)
    * [Technicalities](#overview_technicalities)
* [Parameters](#parameters)
* [Prepare data](#prepare_data)
* [Setup training](#setup_training)
    * [Setup tokenizer](#setup_training_tokenizer)
    * [Setup model](#setup_training_model)
    * [Create datasets](#setup_training_datasets)
    * [Create DataCollator](#setup_training_datacollator)
    * [Setup training arguments](#setup_training_arguments)
    * [Create trainer](#setup_training_trainer)
* [Train](#train)

<a id='overview'></a>
# Overview
[[back to top]](#toc)

<a id='overview_idea'></a>
## The idea explained
[[back to top]](#toc)

The aim of this notebook is to fine-tune Hugging Face's RoBERTa-large model on the dataset provided for the [CommonLit Readability Prize](https://www.kaggle.com/c/commonlitreadabilityprize) competition.

Every language model is trained on a large amount of text. RoBERTa-large, for example, was trained on, among other sources, thousands of English-language books and the English Wikipedia (see Hugging Face's [related description](https://huggingface.co/roberta-large#training-data) for more details). This vast corpus, however, does not contain the texts we want to work with (here, the data for the CommonLit Readability Prize).

The expectation is that a model fine-tuned on the data of the target NLP task yields better results than a model that has never encountered that data.

<a id='overview_technicalities'></a>
## Technicalities
[[back to top]](#toc)

Fine-tuning can use either *causal language modeling* (the model has to generate the next tokens given, for example, the beginning of a sentence) or *masked language modeling* (some tokens are hidden from a text and the model has to predict them). Both ideas are discussed in detail on Hugging Face's relevant [Colab page](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb#scrollTo=JEA1ju653l-p).

In this notebook I'm using *masked language modeling*.
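
To make the difference between the two objectives concrete, here is a toy sketch of the training pairs each one produces. It is only an illustration: it splits on whitespace instead of using a real tokenizer, and the helper names `causal_lm_examples` and `masked_lm_example` are made up for this example.

```python
# Toy illustration only: whitespace "tokens" instead of a real tokenizer.
MASK = '<mask>'  # the mask token string used by RoBERTa tokenizers

def causal_lm_examples(tokens):
    """Causal LM: each prefix is the input, the token that follows is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, masked_positions):
    """Masked LM: hide the chosen positions, the targets are the hidden tokens."""
    positions = set(masked_positions)
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets

tokens = 'the cat sat on the mat'.split()

# Causal: predict the next token from everything to its left.
for prefix, target in causal_lm_examples(tokens)[:2]:
    print(prefix, '->', target)

# Masked: predict the hidden tokens from context on both sides.
inputs, targets = masked_lm_example(tokens, masked_positions=[1, 5])
print(inputs, targets)
```

In the masked case the model sees context on both sides of each gap, which matches how RoBERTa was originally trained.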

<a id='parameters'></a>
# Parameters
[[back to top]](#toc)

In [None]:
import transformers

model_name = 'roberta-large' # the name of the model in Hugging Face
checkpoint_output = './clrp_roberta_large_chk'
output_path = './clrp_roberta_large' # the folder to which the tokenizer & models are saved

<a id='prepare_data'></a>
# Prepare data
[[back to top]](#toc)

In [None]:
import pandas as pd

df_train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
df_test =  pd.read_csv('../input/commonlitreadabilityprize/test.csv')
df_data = pd.concat([df_train, df_test], ignore_index=True) # avoid duplicate index values

df_data['excerpt'] = df_data['excerpt'].apply(lambda x: x.replace('\n', ' '))
text_data = df_data['excerpt'].to_frame('excerpt')

print('Loaded and preprocessed {} entries.'.format(text_data.shape[0]))

<a id='setup_training'></a>
# Setup training
[[back to top]](#toc)

<a id='setup_training_tokenizer'></a>
## Setup tokenizer
[[back to top]](#toc)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_path)

print('Tokenizer for {} has successfully been saved.'.format(model_name))

<a id='setup_training_model'></a>
## Setup model
[[back to top]](#toc)

In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(model_name)

print('Model {} has been initialized.'.format(model_name))

<a id='setup_training_datasets'></a>
## Create datasets
[[back to top]](#toc)

I used [this description](https://huggingface.co/docs/datasets/loading_datasets.html) from Hugging Face to create a dataset from a pandas DataFrame. (Note that I installed the `datasets` package with conda because I ran into errors when I tried to install it with pip. I used this command: `conda install -c huggingface -c conda-forge datasets -y`)

[This page](https://huggingface.co/docs/datasets/processing.html) describes how `train_test_split()` works.

The following snippet encodes the content of the dataset as described [here](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb#scrollTo=5io6fY_d3l-u).

In [None]:
def encode_text(text_data):
    return tokenizer(text_data['excerpt'])

In [None]:
from datasets import Dataset

dataset = Dataset.from_pandas(text_data)
tokenized_dataset = dataset.map(encode_text, batched=True)
tokenized_datasets = tokenized_dataset.train_test_split()

print('Setup dataset & performed train/test split.')
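
As far as I understand the documentation, `train_test_split()` called without arguments shuffles the rows and holds out 25% of them as the test set. The sketch below mimics that behaviour with the standard library only. The function `shuffled_split`, the `rows` list, and the rounding rule are assumptions made for illustration, not the library's implementation.

```python
import math
import random

def shuffled_split(rows, test_size=0.25, seed=0):
    """Shuffle the rows, then hold out a `test_size` fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)  # rounding up is a choice made for this sketch
    return {
        'train': [rows[i] for i in indices[n_test:]],
        'test': [rows[i] for i in indices[:n_test]],
    }

rows = ['excerpt {}'.format(i) for i in range(8)]  # hypothetical stand-ins for excerpts
split = shuffled_split(rows)
print(len(split['train']), len(split['test']))
```

The real method also accepts `test_size` and `seed` arguments, so a call such as `tokenized_dataset.train_test_split(test_size=0.1, seed=42)` should give a smaller, reproducible evaluation set.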

<a id='setup_training_datacollator'></a>
## Create DataCollator
[[back to top]](#toc)

Data collators assemble batches from the data. For masked language modeling they are important because they perform the random masking on each batch they create. See Hugging Face's related documentation [here](https://huggingface.co/transformers/main_classes/trainer.html#id1) & [here](https://huggingface.co/transformers/main_classes/data_collator.html).

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)

print('Initialized data collator.')
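
The random masking mentioned above can be sketched in plain Python. With its default settings, `DataCollatorForLanguageModeling` selects roughly 15% of the tokens; of those, about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged, while all non-selected positions get the label `-100` so that the loss ignores them. The function below is a simplified illustration of that rule, not the library code: `mask_tokens`, the toy `mask_id` and `vocab_size` values are assumptions, and the real collator additionally never masks special tokens.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) after BERT-style random masking."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: input unchanged, label ignored
        labels[i] = tok  # the model has to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

token_ids = list(range(10, 60))  # hypothetical token ids
inputs, labels = mask_tokens(token_ids, mask_id=0, vocab_size=100)
n_selected = sum(label != IGNORE_INDEX for label in labels)
print('{} of {} positions selected for prediction'.format(n_selected, len(token_ids)))
```

Because the masking is redone every time a batch is built, the model sees different masked positions for the same excerpt in different epochs.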

<a id='setup_training_arguments'></a>
## Setup training arguments
[[back to top]](#toc)

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=checkpoint_output,
    overwrite_output_dir=True,
    num_train_epochs=8,
    evaluation_strategy='epoch',
    save_strategy='epoch', # should match evaluation_strategy when load_best_model_at_end=True
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to='none',
    learning_rate=2e-5,
    weight_decay=0.01
)

print('Created training arguments.')
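
To get a feeling for what these arguments imply, the sketch below estimates the number of optimization steps and the learning rate during training. It assumes the `Trainer` defaults that are not overridden above (a batch size of 8 on a single device, no gradient accumulation, and a linear learning-rate schedule without warmup). The training set size `n_train` is a made-up number, and both helper functions are illustrative only.

```python
import math

def total_training_steps(n_train, batch_size=8, epochs=8):
    """Number of optimizer steps when the last partial batch is kept."""
    return math.ceil(n_train / batch_size) * epochs

def linear_lr(step, total_steps, base_lr=2e-5):
    """Linear decay from base_lr at step 0 down to 0 at the last step (no warmup)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

n_train = 2000  # hypothetical number of training excerpts
total = total_training_steps(n_train)
print('total steps:', total)
print('learning rate halfway through:', linear_lr(total // 2, total))
```

With `evaluation_strategy='epoch'`, evaluation runs once after every `total / 8` steps in this setting.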

<a id='setup_training_trainer'></a>
## Create trainer
[[back to top]](#toc)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

print('Setup trainer.')

<a id='train'></a>
# Train
[[back to top]](#toc)

In [None]:
trainer.train()
trainer.save_model(output_path + '/best_model/')

print('Trained & saved model.')
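
A common way to interpret the evaluation loss of a language model is perplexity, which is the exponential of the average cross-entropy loss. The helper below is a minimal sketch; `perplexity_from_loss` is a name introduced here, and the commented-out usage assumes that `trainer.evaluate()` returns a dictionary containing an `eval_loss` entry.

```python
import math

def perplexity_from_loss(eval_loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)

# Possible usage after training (not executed here):
# eval_results = trainer.evaluate()
# print('Perplexity: {:.2f}'.format(perplexity_from_loss(eval_results['eval_loss'])))
```

Lower perplexity on the held-out excerpts suggests that the model has adapted to the CommonLit texts.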