**Solution Overview:**

Pretrain RoBERTa-base and RoBERTa-large with the masked language modeling objective on the contest data plus supplemental sources similar to that data. Fine-tune each pretrained model on five cross-validation folds. Inference weights all 10 resulting models (two pretrained models × five folds per model) equally.
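
The equal weighting described above amounts to a plain mean over the per-model predictions. Here is a minimal sketch, assuming each model produces one score per excerpt; the function name `equal_weight_ensemble` and the random predictions are hypothetical and are not taken from the inference notebook.

```python
import numpy as np

def equal_weight_ensemble(predictions):
    """Average per-model predictions with equal weights.

    predictions: array-like of shape (n_models, n_samples).
    Returns an array of shape (n_samples,).
    """
    preds = np.asarray(predictions, dtype=float)
    if preds.ndim != 2:
        raise ValueError("expected shape (n_models, n_samples)")
    return preds.mean(axis=0)

# Hypothetical predictions: 2 pretrained models x 5 folds = 10 models, 3 excerpts.
rng = np.random.default_rng(0)
fake_preds = rng.normal(loc=-1.0, scale=0.5, size=(10, 3))

ensemble = equal_weight_ensemble(fake_preds)
print(ensemble.shape)
```

Unequal weights would be a drop-in change (`np.average(preds, axis=0, weights=...)`), but the overview states that all 10 models count equally.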

**Notebook Sequence:**
* [Train Roberta Base Model -- **This Notebook**](https://www.kaggle.com/charliezimmerman/clrp-train-robertabase-maskedlm-model)
* [Train Roberta Large Model](https://www.kaggle.com/charliezimmerman/clrp-train-robertalarge-masked-lm-model/)
* [Fine Tune Trained Roberta-Base Model](https://www.kaggle.com/charliezimmerman/clrp-finetune-trained-robertabase)
* [Fine Tune Trained Roberta Large Model](https://www.kaggle.com/charliezimmerman/clrp-finetune-trained-robertalarge)
* [Inference Notebook](https://www.kaggle.com/charliezimmerman/clrp-inference-robertabase-robertalarge-ensemble)

**This notebook was influenced by:**

https://www.kaggle.com/maunish/clrp-pytorch-roberta-pretrain

and by examples/documentation at https://huggingface.co/

Note that, due to copyright concerns, I am not making the data in the additional-clrp-input folder public. CRLP_Input.csv contains excerpts I manually downloaded from various places, including the site of the contest sponsor, [CommonLit](https://www.commonlit.org/), and [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page). books.csv was auto-generated using the code at https://www.kaggle.com/charliezimmerman/fetch-clrp-data-from-web/



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import transformers
from transformers import (AutoModel,AutoModelForMaskedLM, 
                          AutoTokenizer, LineByLineTextDataset,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

import torch
import gc
import warnings
warnings.filterwarnings('ignore')
gc.enable()

In [None]:
TRAINED_ROBERTA_FOLDER="./robertabase_clrp_model"
TRAIN_FILE_IN="../input/commonlitreadabilityprize/train.csv"
VAL_FILE_IN="../input/commonlitreadabilityprize/test.csv"
BOOK_DATA="../input/additional-clrp-input/books.csv" #from Project Gutenberg
ADDL_CLRP_DATA = "../input/additional-clrp-input/CRLP_Input.csv" #additional passages from
                                                                #commonlit.org and Simple English Wikipedia
TRAIN_FILE_OUT= "./clrp_corpus.csv"
MODEL_PATH  = '../input/roberta-base'
EPOCHS=5

In [None]:
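#set up gpu (falls back to cpu when CUDA is unavailable)
scaler = torch.cuda.amp.GradScaler() #not used by the Trainer below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    print(torch.cuda.get_device_name(0)) #only query the device name when a GPU exists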

In [None]:
train = pd.read_csv(TRAIN_FILE_IN)
test = pd.read_csv(VAL_FILE_IN)
lit = pd.read_csv(ADDL_CLRP_DATA)
books=pd.read_csv(BOOK_DATA)
train2=train[["excerpt"]]
test2=test[["excerpt"]]
lit2=lit[["excerpt"]]
books2=books[["excerpt"]]

#use everything for training
train=pd.concat([train2,test2, lit2, books2])

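#keep the first 512 characters of each excerpt (characters, not tokens)
train['excerpt'] = train['excerpt'].apply(lambda x: x[:512])
#one example per line: replace newlines with spaces so adjacent words are not fused
train['excerpt'] = train['excerpt'].apply(lambda x: x.replace('\n', ' '))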


train.to_csv(TRAIN_FILE_OUT, index=False)
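
One detail of the corpus file written above is worth checking: a CSV is being handed to a line-oriented reader. The sketch below uses a hypothetical two-row DataFrame and temporary files to show what such a reader would see, assuming it treats every non-empty line of the file as one training example. The plain-text alternative is an illustration only, not what this notebook does.

```python
import os
import tempfile

import pandas as pd

# Hypothetical excerpts; the first one contains a comma on purpose.
toy = pd.DataFrame({"excerpt": ["A short, simple passage.", "Another passage"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "corpus.csv")
    txt_path = os.path.join(tmp, "corpus.txt")

    # Same call pattern as the notebook: header row plus CSV quoting rules.
    toy.to_csv(csv_path, index=False)
    with open(csv_path, encoding="utf-8") as f:
        csv_lines = f.read().splitlines()

    # Alternative: write one raw excerpt per line, with no header and no quoting.
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy["excerpt"]) + "\n")
    with open(txt_path, encoding="utf-8") as f:
        txt_lines = f.read().splitlines()

print(csv_lines)  # header line and a quoted first excerpt
print(txt_lines)  # raw excerpts only
```

With the CSV, the column name `excerpt` becomes an extra line and any excerpt containing a comma or quote is wrapped in quote characters. Across thousands of excerpts this is a small amount of noise, so whether it matters is a judgment call.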

In [None]:
model = AutoModelForMaskedLM.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

In [None]:
#Because data is limited, the same corpus is used for both training and evaluation,
# so eval_loss is not a held-out estimate
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=TRAIN_FILE_OUT,
    block_size=256)

valid_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=TRAIN_FILE_OUT, 
    block_size=256)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./checkpoints", 
    overwrite_output_dir=True,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy= 'steps',
    save_total_limit=2,
    eval_steps=500,
    save_steps=1000,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    load_best_model_at_end =True,
    prediction_loss_only=True,
    report_to = "none")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset)
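
To make the `mlm=True, mlm_probability=0.15` setting above more concrete, here is a NumPy sketch of BERT-style masking: roughly 15% of positions are selected as prediction targets, and of those about 80% are replaced by the mask token, 10% by a random token, and 10% left unchanged. This is a simplified illustration rather than the library's code: it does not skip special or padding tokens, and the function name, `mask_id`, and `vocab_size` values are assumptions chosen for the example.

```python
import numpy as np

def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids.

    labels is -100 wherever no prediction is required, which is the
    value PyTorch's cross-entropy loss ignores by default.
    """
    rng = np.random.default_rng(seed)
    masked = np.array(input_ids, dtype=np.int64)  # np.array copies its input
    labels = masked.copy()

    # Pick the positions the model must predict.
    selected = rng.random(masked.shape) < mlm_probability
    labels[~selected] = -100

    # Split the selected positions 80% / 10% / 10%.
    roll = rng.random(masked.shape)
    to_mask = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)
    # The remaining selected positions keep their original token.

    masked[to_mask] = mask_id
    masked[to_random] = rng.integers(0, vocab_size, size=int(to_random.sum()))
    return masked, labels

# Hypothetical token ids and vocabulary values, for illustration only.
token_ids = np.arange(1000) % 500 + 10
masked_ids, labels = mask_tokens(token_ids, mask_id=50264, vocab_size=50265)
print((labels != -100).mean())  # fraction of positions chosen as targets
```

Only the selected positions contribute to the loss, so each pass over the corpus trains on a different random subset of tokens when the masking is redrawn per batch.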

In [None]:
#~45 minutes
trainer.train()
trainer.save_model(TRAINED_ROBERTA_FOLDER)