# Introduction

RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no humans labelling it in any way (which is why it can use lots of publicly available data). Instead, an automatic process generates the inputs and labels from the text.


The model we'll look at in this notebook was trained using a masked language modeling (MLM) objective. It was introduced in this [paper](https://arxiv.org/abs/1907.11692) and first released in this [repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
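
To make the "automatic process" concrete, here is a minimal sketch of how masked inputs and labels can be produced from raw text alone. It is a simplification under stated assumptions: it works on whitespace-split words rather than subword IDs, the name `mask_tokens` and the `"<mask>"` string are chosen for illustration, and the actual pretraining procedure has additional rules that are not reproduced here.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask string, chosen to match the one used later in this notebook


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified masked-language-modeling example.

    Each token is replaced by MASK_TOKEN with probability mask_prob.
    labels holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where something was hidden.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3)
print(inputs)
print(labels)
```

No human annotation is involved: the labels are simply the words that were hidden.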

# Installation
We'll be using the [Transformers](https://huggingface.co/transformers/) library by HuggingFace throughout this notebook.
It provides a simple interface for using NLP models with both PyTorch and TensorFlow.

To get started, use ```pip``` to install the package ```transformers```

In [None]:
!pip install -q transformers

# Quick Start

The easiest way to use a pretrained model with HuggingFace is to use ```pipeline()```. This gives us access to a wide variety of NLP [tasks](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.pipeline). The one we're interested in is ```fill-mask```

In [None]:
from transformers import pipeline
unmask = pipeline('fill-mask')

We can now pass a string with a masked word to ```unmask()``` and it'll return a list of predictions. The mask for this tokenizer is given by `unmask.tokenizer.mask_token`. We can type it in manually, or insert it with an `f"string"`.
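
As a small sketch of the options for building the input, the snippet below constructs the same masked sentence three ways. It runs without loading a model, so `MASK_TOKEN` is hard-coded to `"<mask>"` as an assumption (in the notebook you would read it from `unmask.tokenizer.mask_token`), and `mask_word` is a hypothetical helper, not part of the library.

```python
MASK_TOKEN = "<mask>"  # assumption: stands in for unmask.tokenizer.mask_token


def mask_word(sentence, word, mask_token=MASK_TOKEN):
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


# Typed in manually
manual = "Elon Musk is the founder of <mask>"
# Built with an f-string, so the mask string is not hard-coded in the text
formatted = f"Elon Musk is the founder of {MASK_TOKEN}"
# Built by hiding a word from a complete sentence
hidden = mask_word("Elon Musk is the founder of Tesla", "Tesla")

print(manual == formatted == hidden)
```

The f-string form is the safer habit, because other checkpoints may use a different mask string.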

In [None]:
unmask.tokenizer.mask_token

In [None]:
predictions = unmask('Elon Musk is the founder of <mask>')
for prediction in predictions:
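    # str.strip() removes a set of characters, not a substring, so use replace() to drop the special tokens
    print(prediction['sequence'].replace('<s>', '').replace('</s>', '').strip(), end='\t--- ')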
    print(f"{round(100*prediction['score'],2)}% confidence")

Out of the box, the model fares pretty well. From the top 5 results returned, we see relatively high confidence for the correct answers and significantly lower scores for the incorrect ones. The default model used by this pipeline task is `distilroberta-base`. From its info [page](https://huggingface.co/distilroberta-base) we see that it's a distilled version of the `roberta-base` model. We can use the parent model directly by passing it as an argument when creating the pipeline. HuggingFace offers multiple [models](https://huggingface.co/models), each fine-tuned for a different task.

In [None]:
roberta_unmask = pipeline('fill-mask', model='roberta-base')
predictions = roberta_unmask('Elon Musk is the founder of <mask>')
for prediction in predictions:
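    # str.strip() removes a set of characters, not a substring, so use replace() to drop the special tokens
    print(prediction['sequence'].replace('<s>', '').replace('</s>', '').strip(), end='\t--- ')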
    print(prediction['score'])

# Setting up our own workflow
The `pipeline()` method works well if you don't need a lot of customisation. But there will be times when you want more control over the process. In that case we can instantiate, train and use our own model and tokenizer. The HuggingFace [docs](https://huggingface.co/transformers/task_summary.html#masked-language-modeling) give us a concise way of doing this.

1. Instantiate a tokenizer and a model from the checkpoint name. With the `roberta-base` checkpoint used below, the model is identified as a RoBERTa model and loaded with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the `tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the values are the scores attributed to each token. The model gives higher score to tokens it deems probable in that context.
5. Retrieve the top 5 tokens using the PyTorch `topk` or TensorFlow `top_k` methods.
6. Replace the mask token with each of those tokens and print the results.
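
Steps 4 and 5 can be sketched without either framework. The snippet below is an illustration only: the vocabulary and logits are made-up toy values rather than model output, and `softmax` and `top_k` are plain-Python stand-ins for the framework functions. It also shows how raw scores can be turned into probabilities like the confidence percentages printed earlier, assuming a softmax is used for that conversion.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical toy vocabulary and logits for one masked position (not model output)
vocab = ["2020", "fire", "ice", "chaos", "peace", "banana"]
logits = [2.0, 1.0, 0.5, 0.0, -1.0, -3.0]

probs = softmax(logits)
for index in top_k(logits, 3):
    print(f"{vocab[index]}\t{round(100 * probs[index], 2)}% confidence")
```

Because softmax preserves ordering, taking the top k of the logits or of the probabilities selects the same tokens.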


### PyTorch

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

sequence = f"The world will end in {tokenizer.mask_token}" # "The world will end in <mask>"

input_seq = tokenizer.encode(sequence, return_tensors='pt') # tensor([[0, 133, 232, 40, 253, 11, 50264, 2]])
mask_token_index = torch.where(input_seq == tokenizer.mask_token_id)[1] # torch.where returns (tensor([0]), tensor([6])); we only want the 2nd dimension

token_logits = model(input_seq).logits
masked_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(masked_token_logits, 5, dim=1).indices[0].tolist()

# print('sequence:', sequence)
# print('input_seq:', input_seq)
# print('mask_token_index:', mask_token_index)
# print('token_logits:', token_logits)
# print('masked_token_logits:', masked_token_logits)
# print('top_5_tokens:', top_5_tokens)

In [None]:
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

### TensorFlow

In [None]:
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = TFAutoModelForMaskedLM.from_pretrained('roberta-base')

sequence = f"The world will end in {tokenizer.mask_token}" # "The world will end in <mask>"

input_seq = tokenizer.encode(sequence, return_tensors='tf') # tensor([[0, 133, 232, 40, 253, 11, 50264, 2]])
mask_token_index = tf.where(input_seq == tokenizer.mask_token_id)[0, 1] # tf.where returns [[0, 6]] as (row, column); we only want the column index

token_logits = model(input_seq)[0]
masked_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(masked_token_logits, 5).indices.numpy()

# print('sequence:', sequence)
# print('input_seq:', input_seq)
# print('mask_token_index:', mask_token_index)
# print('token_logits:', token_logits)
# print('masked_token_logits:', masked_token_logits)
# print('top_5_tokens:', top_5_tokens)

In [None]:
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

# Fine tuning the model
Most of the models on HuggingFace are meant to be fine-tuned for specific tasks. To save valuable time, HuggingFace offers a `Trainer` that'll fine-tune our model on a dataset. All we need to do is provide it with a config. You can still [train directly](https://huggingface.co/transformers/training.html#fine-tuning-in-native-pytorch) through PyTorch or TensorFlow if you want, but there's very little benefit to doing that.

To make things even easier, HuggingFace offers [scripts](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) that can be run to generate the model. For fine-tuning our model, we'll use `run_mlm.py`. This script requires [version 4.5.0](https://github.com/huggingface/transformers/blob/3f48b2bc3e5b555a06492f1e7b999ff29bb6058a/examples/language-modeling/run_mlm.py#L51) which, at the time of writing, hasn't been released, so we'll need to install the library from the master branch. The script also requires the `datasets` package, which lets us import datasets using only their name.

In [None]:
!wget --quiet https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_mlm.py
!pip install --quiet datasets
!pip install --quiet git+https://github.com/huggingface/transformers

*Be sure to use an accelerator when training; otherwise this can take a long time.*

In the cell below we're fine-tuning `roberta-base` on the `wikitext` dataset. Since this isn't an interactive shell, and we don't want to send training logs to an external service such as Weights & Biases, we pass in `none` for the `report_to` flag. Even with an accelerator, this can still take a couple of minutes, so we limit the number of training/validation samples.
Take note of the `output_dir`; we'll need it later.
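
One way to avoid mistyping the output directory later is to assemble the command in Python and keep the path in a single variable. This is only a sketch: `build_run_mlm_command` is a hypothetical helper, and the flags are copied from the shell command used in this notebook, so they may differ in other versions of the script.

```python
import shlex


def build_run_mlm_command(output_dir, max_samples=500):
    """Assemble the run_mlm.py invocation as an argument list."""
    return [
        "python", "./run_mlm.py",
        "--model_name_or_path", "roberta-base",
        "--dataset_name", "wikitext",
        "--dataset_config_name", "wikitext-2-raw-v1",
        "--do_train",
        "--do_eval",
        "--report_to", "none",
        "--max_train_samples", str(max_samples),
        "--max_val_samples", str(max_samples),
        "--output_dir", output_dir,
    ]


OUTPUT_DIR = "./test-mlm"  # reuse this variable when loading the model afterwards
command = build_run_mlm_command(OUTPUT_DIR)
print(shlex.join(command))
# To launch it from Python instead of a `!` cell: subprocess.run(command, check=True)
```

The snippet only prints the command; nothing is trained until the command is actually run.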

In [None]:
# Clear GPU memory (sometimes needed on Kaggle/Colab)
from numba import cuda
cuda.select_device(0)
cuda.close()
cuda.select_device(0)

In [None]:
!python './run_mlm.py' \
--model_name_or_path 'roberta-base' \
--dataset_name 'wikitext' \
--dataset_config_name 'wikitext-2-raw-v1' \
--do_train \
--do_eval \
--report_to none \
--max_train_samples 500 \
--max_val_samples 500 \
--output_dir './test-mlm'

We can now load in this model by passing in the `output_dir` to `from_pretrained()`

In [None]:
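from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the fine-tuned weights saved by run_mlm.py
model = AutoModelForMaskedLM.from_pretrained('./test-mlm/')
# Use the tokenizer of the checkpoint we fine-tuned (roberta-base)
tokenizer = AutoTokenizer.from_pretrained('roberta-base')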

unmask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

predictions = unmask('Elon Musk is the founder of <mask>')
for prediction in predictions:
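    # str.strip() removes a set of characters, not a substring, so use replace() to drop the special tokens
    print(prediction['sequence'].replace('<s>', '').replace('</s>', '').strip(), end='\t--- ')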
    print(f"{round(100*prediction['score'],2)}% confidence")