In [2]:
! pip install transformers datasets accelerate -U



# **Masked language modeling**

- Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. 
- This means the model has full access to the tokens on the left and right. 
- Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence.
- BERT is an example of a masked language model.

This guide will show you how to:

- Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset
- Use your finetuned model for inference


## **Load ELI5 dataset**

The **[ELI5 dataset](https://huggingface.co/datasets/eli5)** is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers.

The dataset was created to support the task of open-domain long-form abstractive question answering. It covers questions about general topics in its r/explainlikeimfive subset, science in its r/askscience subset, and history in its r/AskHistorians subset.

In [6]:
from datasets import load_dataset
import warnings; warnings.filterwarnings('ignore')

eli5 = load_dataset("eli5", split="train_asks[:5000]")
eli5.column_names

['q_id',
 'title',
 'selftext',
 'document',
 'subreddit',
 'answers',
 'title_urls',
 'selftext_urls',
 'answers_urls']

In [7]:
eli5 = eli5.train_test_split(test_size=0.2)
eli5["train"][0]

{'q_id': '1dpyfn',
 'title': 'Is it possible for their to be light of a frequency that would disrupt nuclear bonding?',
 'selftext': 'Ultraviolet light can break down ionic bonds (or was it covalent?). Is it possible that a light of the correct frequency could break down the bonds holding the nucleus of an atom together?',
 'document': '',
 'subreddit': 'askscience',
 'answers': {'a_id': ['c9spkh6', 'c9sp02d'],
  'text': ["Theoretically, yes. The nucleus is a bound state, hit it with the right bundle of energy (a photon), and you can excite the weakest bound nucleon off of the atom. It's easiest to think of deuterium: It's an electron, a proton, and a neutron. Knock off the electron with something low energy (an ultraviolet photon, only 15.5 eV). This leaves you with a bound proton-neutron system. Theoretically, you can then zap it with another photon of the proton neutron binding energy (2.22 MeV). The binding energy of the deuterium nucleus is thus greater than the energy output of m

While this may look like a lot, you're only really interested in the `text` field. What's cool about language modeling tasks is that you don't need labels (which makes this an unsupervised task): for masked language modeling, the masked tokens themselves *are* the labels.

## **Preprocess**

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

- You'll notice from the example above, the `text` field is actually nested inside `answers`.
- This means you'll need to extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method:

In [9]:
eli5 = eli5.flatten()
eli5["train"][0]['answers.text']

["Theoretically, yes. The nucleus is a bound state, hit it with the right bundle of energy (a photon), and you can excite the weakest bound nucleon off of the atom. It's easiest to think of deuterium: It's an electron, a proton, and a neutron. Knock off the electron with something low energy (an ultraviolet photon, only 15.5 eV). This leaves you with a bound proton-neutron system. Theoretically, you can then zap it with another photon of the proton neutron binding energy (2.22 MeV). The binding energy of the deuterium nucleus is thus greater than the energy output of most lasers (it's firmly in the gamma ray spectrum, which is expected, because nuclear transitions make gamma rays). So, in most cases, you'll find that to complete disintegrate your nucleus, you're going to need to do some careful thinking. Deuterium is the simplest non-trivial nucleus, and already we'd need a second radioactive source producing the necessary x-rays. But, again, in theory, yes, light can break up nuclei."
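To see what happened to the nested `answers` field, here is a minimal plain-Python sketch of flattening one record into dotted column names. It is only an illustration of the idea, not how the `datasets` library implements `flatten`; the helper `flatten_record` and the toy record are hypothetical.

```python
def flatten_record(record, sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so deeper nesting also ends up with dotted names.
            for sub_key, sub_value in flatten_record(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Toy record with the same shape as an ELI5 example (values are made up).
toy_record = {
    "q_id": "abc123",
    "answers": {"a_id": ["a1", "a2"], "text": ["First answer.", "Second answer."]},
}

flat_record = flatten_record(toy_record)
print(list(flat_record))
```

The nested `text` list ends up under the key `answers.text`, which is the column name used in the rest of the notebook.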

- Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is now a list.
- Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.
- To apply this preprocessing function over the entire dataset, use the `map` method.
- You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`.
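As a small illustration of the join step, the sketch below shows what a batched call receives and what gets handed to the tokenizer. The batch contents are made up, and no tokenizer is involved; the point is only the shape of the data when `batched=True`.

```python
# With batched=True, each column arrives as a list with one entry per example.
# Here each entry of "answers.text" is itself a list of answer strings.
toy_batch = {"answers.text": [["Yes.", "It depends on the energy."], ["No."]]}

# Join the answers of each example into one string, as preprocess_function does
# before calling the tokenizer.
joined = [" ".join(answers) for answers in toy_batch["answers.text"]]
print(joined)
```

Each example contributes exactly one string, so the tokenizer returns one token sequence per example regardless of how many answers it had.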

In [10]:
# join the list of strings for each example and tokenise the result
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names, # remove columns we don't need
)

Token indices sequence length is longer than the specified maximum sequence length for this model (557 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1478 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (747 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (555 > 512). Running this sequence through the model will result in indexing errors


Token indices sequence length is longer than the specified maximum sequence length for this model (528 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (895 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2062 > 512). Running this sequence through the model will result in indexing errors


This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.

In [11]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. You could pad instead if the model supports it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # With masked language modeling, DataCollatorForLanguageModeling builds the
    # labels itself when it masks tokens, so this copy only acts as a placeholder.
    result["labels"] = result["input_ids"].copy()
    return result
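To make the concatenate-and-chunk step concrete, here is a self-contained toy run with a block size of 4 instead of 128. The function `group_into_blocks` is a hypothetical stand-in that mirrors the logic of `group_texts` (without the `labels` copy), and the token ids are made up.

```python
from itertools import chain


def group_into_blocks(examples, block_size):
    # Concatenate every column into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Slice each column into consecutive blocks of block_size.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Three tokenized "documents" of lengths 4, 6 and 3 (13 tokens in total).
toy = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 22, 23, 24, 2], [0, 31, 2]],
    "attention_mask": [[1] * 4, [1] * 6, [1] * 3],
}

blocks = group_into_blocks(toy, block_size=4)
for block in blocks["input_ids"]:
    print(block)
```

Thirteen tokens yield three blocks of four, and the final token is dropped. Note that blocks can straddle document boundaries, which is the trade-off this approach makes in exchange for fixed-length inputs.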

In [12]:
# Apply the `group_texts` function over the entire dataset
lm_dataset = tokenized_eli5.map(group_texts,
                                batched=True,
                                num_proc=4)

Now create a batch of examples using `DataCollatorForLanguageModeling`.
- It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
- Use the EOS token as the padding token, and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:

In [13]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm_probability=0.15)
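The following plain-Python sketch illustrates the kind of random masking the collator applies on every pass over the data. It is a simplified illustration, not the library's code: the function `mask_tokens`, the toy token ids, and the RoBERTa-style special ids passed in are assumptions for the example. It follows the usual default recipe, in which a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time, while unselected positions get the label `-100` so the loss ignores them.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Never select special tokens; select other tokens with mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# Made-up token ids wrapped in assumed BOS (0) and EOS (2) ids.
ids = [0] + list(range(100, 140)) + [2]
masked, labels = mask_tokens(
    ids, mask_id=50264, vocab_size=50265, special_ids={0, 1, 2}, rng=random.Random(0)
)
print(sum(label != IGNORE_INDEX for label in labels), "of", len(ids), "positions selected")
```

Because the selection is re-drawn on every call, the same block is masked differently in each epoch, which is why the collator does this at batch time rather than during preprocessing.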

## **Fine Tune Model**

Load DistilRoBERTa with `AutoModelForMaskedLM`:

In [14]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


At this point, only three steps remain:

- Define your training hyperparameters in `TrainingArguments`. The only required parameter is `output_dir`, which specifies where to save your model.
- Pass the training arguments to `Trainer` along with the model, datasets, and data collator.
- Call `train()` to finetune your model.

In [15]:
from transformers import TrainingArguments, Trainer
import os
os.environ['WANDB_DISABLED'] = 'true'

training_args = TrainingArguments(
    output_dir="my_awesome_eli5_mlm_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="no",
    save_total_limit=2,
    load_best_model_at_end=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,2.2511,2.065639
2,2.1645,2.017307
3,2.1168,1.984272


TrainOutput(global_step=3375, training_loss=2.1876857096354168, metrics={'train_runtime': 396.1183, 'train_samples_per_second': 68.116, 'train_steps_per_second': 8.52, 'total_flos': 894600396366336.0, 'train_loss': 2.1876857096354168, 'epoch': 3.0})
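As a rough sanity check on the training log above, the arithmetic below relates the step count to the number of training blocks. It assumes the default per-device batch size of 8 on a single device with no gradient accumulation; the notebook does not set these explicitly, so treat the batch size as an assumption.

```python
# Values copied from the TrainOutput above.
global_step = 3375
num_train_epochs = 3
train_samples_per_second = 68.116
train_runtime = 396.1183

# Assumption: default batch size, one device, no gradient accumulation.
batch_size = 8

steps_per_epoch = global_step // num_train_epochs
blocks_from_steps = steps_per_epoch * batch_size  # upper bound if the last batch is partial
blocks_from_throughput = train_samples_per_second * train_runtime / num_train_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"training blocks (from steps): about {blocks_from_steps}")
print(f"training blocks (from throughput): about {blocks_from_throughput:.0f}")
```

Both estimates land close to each other, which suggests the 4,000 training examples were regrouped into roughly nine thousand blocks of 128 tokens.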

Once training is completed, use the `evaluate()` method to evaluate your model and get its perplexity:

In [16]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 7.14
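Perplexity here is simply the exponential of the average cross-entropy loss. As a worked example, the snippet below applies the same formula to the validation losses logged per epoch during training. The last logged loss gives a value of about 7.27 rather than 7.14; a plausible reason is that the data collator draws new random masks on every evaluation pass, so the evaluation loss varies slightly from run to run.

```python
import math

# Validation losses copied from the training log above, keyed by epoch.
validation_losses = {1: 2.065639, 2: 2.017307, 3: 1.984272}

# Perplexity is exp(mean cross-entropy loss), with the loss in natural-log units.
perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}

for epoch, perplexity in perplexities.items():
    print(f"epoch {epoch}: perplexity = {perplexity:.2f}")
```

For a masked language model this number only reflects how well the masked positions are predicted, so it is best used to compare runs of the same setup rather than against perplexities of causal language models.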


In [17]:
# save model and tokeniser
trainer.save_model("my_awesome_eli5_mlm_model")
tokenizer.save_pretrained("my_awesome_eli5_mlm_model")

('my_awesome_eli5_mlm_model/tokenizer_config.json',
 'my_awesome_eli5_mlm_model/special_tokens_map.json',
 'my_awesome_eli5_mlm_model/vocab.json',
 'my_awesome_eli5_mlm_model/merges.txt',
 'my_awesome_eli5_mlm_model/added_tokens.json',
 'my_awesome_eli5_mlm_model/tokenizer.json')

## **Inference**

- Great, now that you've finetuned a model, you can use it for inference!
- Come up with some text you'd like the model to fill in the blank with, and use the special `<mask>` token to indicate the blank

In [19]:
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM
import torch

text = "The Milky Way is a <mask> galaxy."

# Tokenize the text and return the `input_ids` as PyTorch tensors. 
# You'll also need to specify the position of the `<mask>` token:
tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_mlm_model")
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

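# Pass your inputs to the model and return the `logits` of the masked token.
# Note: this loads a checkpoint published on the Hub under the `stevhliu` namespace;
# pass "my_awesome_eli5_mlm_model" instead to use the model saved locally above.
model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")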
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

# Then return the three masked tokens with the highest probability and print them out
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

The Milky Way is a  spiral galaxy.
The Milky Way is a  massive galaxy.
The Milky Way is a  small galaxy.


## **Pipeline**

- The simplest way to try out your finetuned model for inference is to use it in a `pipeline()`.
- Instantiate a `pipeline` for fill-mask with your model, and pass your text to it.
- If you like, you can use the `top_k` parameter to specify how many predictions to return.

In [20]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "my_awesome_eli5_mlm_model")
mask_filler(text, top_k=3)

[{'score': 0.37157338857650757,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.11231834441423416,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.06813511252403259,
  'token': 30794,
  'token_str': ' dwarf',
  'sequence': 'The Milky Way is a dwarf galaxy.'}]
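The `score` values returned by the pipeline are probabilities, obtained by applying a softmax to the logits at the masked position. The sketch below shows that step in plain Python on a made-up five-word vocabulary; the logits are invented for illustration and are not outputs of the model.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    # Indices of the k largest scores, highest first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical candidates and logits for the masked position.
toy_vocab = [" spiral", " massive", " dwarf", " small", " red"]
toy_logits = [4.0, 2.8, 2.3, 2.2, 0.5]

probs = softmax(toy_logits)
best = top_k(probs, 3)
for index in best:
    print(f"{toy_vocab[index]!r}: {probs[index]:.3f}")
```

Ranking by logits and ranking by probabilities give the same order, which is why the manual example earlier can call `torch.topk` on the raw logits while the pipeline reports normalized scores.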