# Fine-tuning RoBERTa

We adapt the training process from the Hugging Face language modeling Colab notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=h_2R1lfx31hj

In [1]:
import matplotlib.pyplot as plt
import torch
import evaluate
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
from transformers import RobertaTokenizer, RobertaForMaskedLM, Trainer, TrainingArguments
import numpy as np
import pandas as pd
# Get CPU or GPU device for training
device = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(device)

We load the pretrained RoBERTa model and its associated tokenizer.

In [2]:
model = RobertaForMaskedLM.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

We load the data. Only the validation and test splits are read for now.

In [3]:
#df_train = pd.read_pickle("train.pickle") # leaving out train.pickle for now because it is too big for the kernel along with the models
df_valid = pd.read_pickle("valid.pickle")
df_test = pd.read_pickle("test.pickle")

We need a custom data collator that masks out the variables in some examples, but not in all of them.

In [None]:
class mydatacollator:
    pass  # placeholder: the masking logic is not implemented yet
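As a rough illustration of what "sometimes mask the variables" could mean, here is a minimal sketch in plain Python. Everything in it is an assumption rather than the notebook's actual collator: the function name `mask_example`, the `variable_positions` input (token indices that belong to variable names, which would have to come from some earlier parsing step), the two probabilities, and the placeholder mask token id. It also skips the 80/10/10 replacement scheme of standard masked language modeling and always substitutes the mask token. A real collator would additionally pad the batch and return tensors.

```python
import random
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_example(
    input_ids: Sequence[int],
    variable_positions: Sequence[int],  # assumed input: indices of variable-name tokens
    mask_token_id: int,
    rng: random.Random,
    variable_prob: float = 0.5,  # assumed: chance of using variable masking
    mlm_prob: float = 0.15,      # assumed: per-token rate for random masking
) -> Dict[str, List[int]]:
    """Mask one tokenized example and return masked inputs plus MLM labels."""
    if variable_positions and rng.random() < variable_prob:
        # Variable mode: hide every token that belongs to a variable name.
        chosen = set(variable_positions)
    else:
        # Fallback mode: hide a random subset of tokens.
        chosen = {i for i in range(len(input_ids)) if rng.random() < mlm_prob}

    masked = [mask_token_id if i in chosen else tok for i, tok in enumerate(input_ids)]
    # Only masked positions contribute to the loss; the rest are ignored.
    labels = [tok if i in chosen else IGNORE_INDEX for i, tok in enumerate(input_ids)]
    return {"input_ids": masked, "labels": labels}


# Usage with made-up token ids; 99 stands in for the tokenizer's mask token id.
rng = random.Random(0)
example = mask_example([11, 12, 13, 14], variable_positions=[1, 3],
                       mask_token_id=99, rng=rng, variable_prob=1.0)
```

Setting `variable_prob=1.0` forces variable masking, so positions 1 and 3 are replaced by the mask token and are the only positions with real labels.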

We set the training hyperparameters. See the TrainingArguments documentation for the available options: https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/trainer#transformers.TrainingArguments

In [None]:
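training_args = TrainingArguments(
    output_dir="roberta-finetuned",  # assumed path; TrainingArguments requires an output directory
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=10,
)

Now we define the trainer. The data collator is passed here, since it is an argument of Trainer rather than of TrainingArguments.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=mydata_collator,  # instance of the custom collator described above
    train_dataset=df_valid['code'], # set to validate for the time being, make sure to change for the real thing
    eval_dataset=df_test['code'],
)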
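The trainer reads its datasets item by item, and each item is expected to be a dict of model inputs rather than a raw string, so the `code` column would need to be tokenized and wrapped first. The sketch below shows one possible shape for such a wrapper. The class name `CodeDataset`, the `encode` callable, and the toy encoder are all hypothetical; in the notebook the encoder role would be played by the RoBERTa tokenizer.

```python
from typing import Callable, Dict, List, Sequence


class CodeDataset:
    """Map-style dataset: each item is a dict of model inputs for one snippet."""

    def __init__(self, texts: Sequence[str],
                 encode: Callable[[str], List[int]],
                 max_length: int = 512) -> None:
        self.texts = list(texts)
        self.encode = encode          # hypothetical: any function from str to token ids
        self.max_length = max_length  # longer sequences are truncated

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        ids = self.encode(self.texts[index])[: self.max_length]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


# Stand-in encoder for demonstration only: maps each word to its length.
def toy_encode(text: str) -> List[int]:
    return [len(word) for word in text.split()]


dataset = CodeDataset(["x = 1", "return x + y"], toy_encode, max_length=3)
```

With a wrapper of this kind, something like `CodeDataset(df_valid['code'], ...)` could be passed as `train_dataset` in place of the raw column.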

Finally, we compute the perplexity on the evaluation set (currently the test data) as the exponential of the evaluation loss.

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
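To make the perplexity formula concrete, here is a small standalone sketch, assuming the loss is an average cross-entropy per predicted token measured in nats. The helper name `perplexity` and the example losses are illustrative and not part of the notebook.

```python
import math
from typing import Sequence


def perplexity(token_losses: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(sum(token_losses) / len(token_losses))


# A model that gives probability 1/4 to every correct token has loss ln(4)
# on each token, so its perplexity is 4: it is as uncertain as a uniform
# choice among four options.
uniform_losses = [math.log(4)] * 10
ppl = perplexity(uniform_losses)
```

Lower perplexity means the model assigns higher probability to the masked tokens; a perfect model with zero loss would score 1.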