In this experiment, I followed the NLP course on HuggingFace to finetune a pre-trained masked language model on a subset of the *imdb* movie review dataset. The code is mostly from the [HuggingFace tutorial](https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt). For learning purposes, I added comments explaining what the code does, plus some cells that inspect the contents of the nested data objects, which other learners may refer to if necessary.

Overall, it was an interesting domain adaptation experiment. The pre-trained masked language model originally predicted [MASK] tokens with generic words. But after finetuning, the model learned to fill in the [MASK]s with words related to movies, since it got to see more movie reviews. For example, the input sequence was `This is a great [MASK]`.

**Before finetuning:**
```
This is a great deal.
This is a great success.
This is a great adventure.
This is a great idea.
This is a great feat.
```
**After finetuning:**
```
this is a great film.
this is a great movie.
this is a great idea.
this is a great one.
this is a great story.
```

In [1]:
from transformers import AutoModelForMaskedLM

# the checkpoint to be used for domain adaptation is "distilbert-base-uncased"
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [2]:
# DistilBERT has far fewer parameters than BERT
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")


'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


# [MASK] prediction: Before domain adaptation

First, let's see how the pretrained model predicts the [MASK] token on this simple sentence.

In [3]:
text = "This is a great [MASK]."

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [5]:
import torch

inputs = tokenizer(text, return_tensors="pt")

# the model outputs logits over the whole vocabulary for each position in the sequence
token_logits = model(**inputs).logits
token_logits.shape

torch.Size([1, 8, 30522])

In [6]:
# tokenized inputs to the model
inputs

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 2307,  103, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [7]:
# note that the [MASK] token has been converted into the id "103" by the tokenizer
tokenizer.mask_token_id

103

In [8]:
token_logits.shape

torch.Size([1, 8, 30522])

In [9]:
# this function finds which token in the input sequence is the [MASK] token - that's the prediction we want to look at
torch.where(inputs["input_ids"] == tokenizer.mask_token_id)

(tensor([0]), tensor([5]))
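
To make the indexing concrete, here is a plain-Python sketch of the same lookup without tensors. The ids are copied from the tokenizer output above; the token strings are my reading of the uncased BERT vocabulary and are included only as an illustration, not as tokenizer output.

```python
# input_ids copied from the tokenizer output above (batch dimension dropped)
input_ids = [101, 2023, 2003, 1037, 2307, 103, 1012, 102]

# Illustrative id -> token table (assumed from the uncased BERT vocabulary)
id_to_token = {
    101: "[CLS]", 2023: "this", 2003: "is", 1037: "a",
    2307: "great", 103: "[MASK]", 1012: ".", 102: "[SEP]",
}

mask_token_id = 103  # value of tokenizer.mask_token_id shown above

# Position of [MASK] within the sequence; this is the column index that
# torch.where reports as the second element of its result
mask_position = input_ids.index(mask_token_id)

tokens = [id_to_token[i] for i in input_ids]
print(tokens)
print(mask_position)
```

The eight ids explain the middle dimension of `torch.Size([1, 8, 30522])`: six tokens from the sentence plus the two special tokens added by the tokenizer.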

In [10]:
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
print(f"mask_token_index = {mask_token_index}")
mask_token_logits = token_logits[0, mask_token_index, :]
print(f"mask_token_logits.shape = {mask_token_logits.shape}")


mask_token_index = tensor([5])
mask_token_logits.shape = torch.Size([1, 30522])


In [11]:
# pick the [MASK] candidates with the highest logits
temp = torch.topk(mask_token_logits, 5, dim=1)
print(temp)
temp = temp.indices
print(temp)
temp = temp[0].tolist()
print(temp)
top_5_tokens = temp



torch.return_types.topk(
values=tensor([[7.0727, 6.6514, 6.6425, 6.2530, 5.8618]], grad_fn=<TopkBackward0>),
indices=tensor([[3066, 3112, 6172, 2801, 8658]]))
tensor([[3066, 3112, 6172, 2801, 8658]])
[3066, 3112, 6172, 2801, 8658]
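
As the output shows, `torch.topk` returns the k largest values together with their positions. A minimal pure-Python equivalent on a made-up six-entry logit vector may help to see what is going on; the numbers and the tiny vocabulary are hypothetical, not model output.

```python
import heapq

# Hypothetical logits over a toy six-word vocabulary
vocab = ["deal", "movie", "idea", "banana", "success", "film"]
logits = [7.1, 2.3, 6.2, -1.0, 6.7, 2.9]

def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    best = heapq.nlargest(k, enumerate(values), key=lambda pair: pair[1])
    indices = [i for i, _ in best]
    return [values[i] for i in indices], indices

top_values, top_indices = top_k(logits, 3)
print(top_indices)                      # positions in the vocabulary
print([vocab[i] for i in top_indices])  # decoded candidates
```

In the notebook, the indices are token ids in DistilBERT's 30522-entry vocabulary, which is why they can be passed straight to `tokenizer.decode`.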


In [12]:
# the top prediction to replace the [MASK] token in the sentence is "deal"
tokenizer.mask_token, tokenizer.decode([top_5_tokens[0]])

('[MASK]', 'deal')

As observed in the HuggingFace tutorial, the top 5 predicted tokens to replace [MASK] all make sense, but are quite generic.

In [13]:
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


# [MASK] prediction: After domain adaptation with *imdb* movie reviews

In [14]:
# we will finetune this pre-trained DistilBERT on a small part of the "imdb" dataset to see how the [MASK] predictions change
from datasets import load_dataset
imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Data preprocessing

The dataset has a "label" column holding the sentiment of each review, but we won't use it here.

In [15]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

In [16]:
sample

Dataset({
    features: ['text', 'label'],
    num_rows: 3
})

In [17]:
for row in sample:
    print(f"\n>>> Review: {row['text']}")
    print(f">>> Label: {row['label']}")


>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
>>> Label: 1

>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub you

In [18]:
sample_unsupervised = imdb_dataset["unsupervised"].shuffle(seed=42).select(range(3))

for row in sample_unsupervised:
    print(f"\n>>> Review: {row['text']}")
    print(f">>> Label: {row['label']}")


>>> Review: If you've seen the classic Roger Corman version starring Vincent Price it's hard to put it out of your head, but you probably should do because this one is totally different. Subtlety has been abandoned in favour of gross-out horror - nudity, gore and all-round unpleasantness. OK it's ridiculous, trashy, sensationalised and historically dubious (did any members of the Inquisition really wear horn-rimmed glasses?), but despite all this it is strangely compelling. I literally couldn't tear myself away from the screen until the end of the movie. If there's a bigger compliment you can pay to a film I don't know what it is.
>>> Label: -1

>>> Review: For me, this was the most moving film of the decade. Samira Makhmalbaf shows pure bravery and vision in the making. She has an intelligence and gift for speaking to the people, regardless of their nationality or beliefs. I am inspired and touched by her humanity and can only hope that she has touched many people the same way. Her m

We keep the "word_ids" column (which maps each token to the word it came from in the original input) during tokenization, because we want to do "whole word masking" rather than masking individual tokens.
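
As a rough illustration of what word ids look like: a fast tokenizer gives `None` for special tokens and the same integer to every sub-token of one word. The tokenization below is hypothetical (written by hand, not produced by the tokenizer), and the variable names are mine.

```python
from collections import defaultdict

# Hypothetical tokenization of "unwatchable film" (illustrative only)
tokens   = ["[CLS]", "un", "##watch", "##able", "film", "[SEP]"]
word_ids = [None,    0,    0,         0,        1,      None]

# Group token positions by the word they belong to; special tokens (None) are skipped
word_to_token_positions = defaultdict(list)
for position, word_id in enumerate(word_ids):
    if word_id is not None:
        word_to_token_positions[word_id].append(position)

print(dict(word_to_token_positions))
```

With this grouping, masking word 0 means masking positions 1, 2 and 3 together, so the model never sees part of a word it is asked to predict.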

In [19]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)

tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

Here, the tokenized sequences are of different lengths. So we concatenate all sequences and divide into equal chunks of length 128.

In [20]:
tokenizer.model_max_length

512

In [21]:
chunk_size = 128

In [22]:
tokenized_datasets["train"][:3]

{'input_ids': [[101,
   1045,
   12524,
   1045,
   2572,
   8025,
   1011,
   3756,
   2013,
   2026,
   2678,
   3573,
   2138,
   1997,
   2035,
   1996,
   6704,
   2008,
   5129,
   2009,
   2043,
   2009,
   2001,
   2034,
   2207,
   1999,
   3476,
   1012,
   1045,
   2036,
   2657,
   2008,
   2012,
   2034,
   2009,
   2001,
   8243,
   2011,
   1057,
   1012,
   1055,
   1012,
   8205,
   2065,
   2009,
   2412,
   2699,
   2000,
   4607,
   2023,
   2406,
   1010,
   3568,
   2108,
   1037,
   5470,
   1997,
   3152,
   2641,
   1000,
   6801,
   1000,
   1045,
   2428,
   2018,
   2000,
   2156,
   2023,
   2005,
   2870,
   1012,
   1026,
   7987,
   1013,
   1028,
   1026,
   7987,
   1013,
   1028,
   1996,
   5436,
   2003,
   8857,
   2105,
   1037,
   2402,
   4467,
   3689,
   3076,
   2315,
   14229,
   2040,
   4122,
   2000,
   4553,
   2673,
   2016,
   2064,
   2055,
   2166,
   1012,
   1999,
   3327,
   2016,
   4122,
   2000,
   3579,
   2014,
   3086,
   20

In [23]:
# checking the lengths of different tokenized sequences
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f">>> Review {idx} length: {len(sample)}")

>>> Review 0 length: 363
>>> Review 1 length: 304
>>> Review 2 length: 133


In [24]:
# checking the concatenation function
sum(tokenized_samples["input_ids"], [])

[101,
 1045,
 12524,
 1045,
 2572,
 8025,
 1011,
 3756,
 2013,
 2026,
 2678,
 3573,
 2138,
 1997,
 2035,
 1996,
 6704,
 2008,
 5129,
 2009,
 2043,
 2009,
 2001,
 2034,
 2207,
 1999,
 3476,
 1012,
 1045,
 2036,
 2657,
 2008,
 2012,
 2034,
 2009,
 2001,
 8243,
 2011,
 1057,
 1012,
 1055,
 1012,
 8205,
 2065,
 2009,
 2412,
 2699,
 2000,
 4607,
 2023,
 2406,
 1010,
 3568,
 2108,
 1037,
 5470,
 1997,
 3152,
 2641,
 1000,
 6801,
 1000,
 1045,
 2428,
 2018,
 2000,
 2156,
 2023,
 2005,
 2870,
 1012,
 1026,
 7987,
 1013,
 1028,
 1026,
 7987,
 1013,
 1028,
 1996,
 5436,
 2003,
 8857,
 2105,
 1037,
 2402,
 4467,
 3689,
 3076,
 2315,
 14229,
 2040,
 4122,
 2000,
 4553,
 2673,
 2016,
 2064,
 2055,
 2166,
 1012,
 1999,
 3327,
 2016,
 4122,
 2000,
 3579,
 2014,
 3086,
 2015,
 2000,
 2437,
 2070,
 4066,
 1997,
 4516,
 2006,
 2054,
 1996,
 2779,
 25430,
 14728,
 2245,
 2055,
 3056,
 2576,
 3314,
 2107,
 2004,
 1996,
 5148,
 2162,
 1998,
 2679,
 3314,
 1999,
 1996,
 2142,
 2163,
 1012,
 1999,
 2090,
 48

In [25]:
# testing: the concatenated length should be 800, for these 3 sequences in this sample:
# >>> Review 0 length: 363
# >>> Review 1 length: 304
# >>> Review 2 length: 133
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f">>> Concatenated reviews length: {total_length}")


>>> Concatenated reviews length: 800


At the end of dividing the big concatenated sequence into equal chunks, there may be a short chunk left over, like in the example below. We will discard it.

In [26]:
chunks = {
    k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}
print(chunks)
for chunk in chunks["input_ids"]:
    print(f">>> Chunk length: {len(chunk)}")

{'input_ids': [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107], [2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 
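
Since the output above is cut off before the chunk lengths are printed, here is the arithmetic for the 800-token sample, as a small sketch using only integer division:

```python
total_length = 800   # concatenated length of the three sample reviews
chunk_size = 128

full_chunks, leftover = divmod(total_length, chunk_size)
chunk_lengths = [chunk_size] * full_chunks + ([leftover] if leftover else [])
print(chunk_lengths)

# Rounding down to a multiple of chunk_size discards the short last chunk
usable_length = (total_length // chunk_size) * chunk_size
print(usable_length)
```

So the sample yields six full chunks of 128 tokens and one leftover chunk of 32 tokens, and rounding the length down to 768 removes that leftover.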

This function combines the steps explained above. Again, this code comes from the Hugging Face tutorial.

In this function, we create a new "labels" column - an exact copy of the "input_ids" column. We will later mask parts of "input_ids", so this clone stored in "labels" will be the ground-truth for training.

In [27]:
def group_texts(examples):
    # concatenate all texts
    concatenated_examples = {
        k: sum(examples[k], []) for k in examples.keys()
    }
    # compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # we drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    
    # split into chunks of chunk_size
    result = {
        k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    
    # create a new labels column - same as the input_ids. We will later mask parts of "input_ids", so the clone stored in "labels" will be the ground-truth for training
    result["labels"] = result["input_ids"].copy()
    return result

In [28]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)

In [29]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])


"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [30]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [31]:
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

## Testing the *DataCollatorForLanguageModeling* - not used here since we want "whole word masking"

In [32]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [33]:
samples = [lm_datasets["train"][i] for i in range(2)]
print(f"samples: {samples}")

# We remove the "word_ids" key for this data collator as it does not expect it:
for sample in samples:
    _ = sample.pop("word_ids")
    
print(f"samples: {samples}")

for chunk in data_collator(samples)["input_ids"]:
    print(f"chunk = {chunk}")
    print(f">>> {tokenizer.decode(chunk)}")
    print()

samples: [{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

## Whole word masking

In [37]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2

def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")
        
        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)
                
        # randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping), ))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()  # convert the NumPy integer to a plain Python int
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
                
        feature["labels"] = new_labels
    ret = default_data_collator(features)
    
    return ret
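
The inner loop of the collator can be traced by hand on a toy chunk. In the sketch below, the ids, the word-to-position mapping and the choice of which word to mask are all hypothetical; the real collator draws the words at random. The value -100 is the label that PyTorch's cross-entropy loss ignores by default, so only masked positions contribute to the loss.

```python
MASK_TOKEN_ID = 103   # tokenizer.mask_token_id for this checkpoint
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss

# Hypothetical chunk: the ids are made up for illustration
input_ids = [101, 11, 22, 33, 44, 102]
labels = list(input_ids)                     # ground-truth copy
word_to_positions = {0: [1, 2, 3], 1: [4]}   # word index -> token positions
words_to_mask = [0]                          # fixed here; sampled randomly in the collator

new_labels = [IGNORE_INDEX] * len(labels)
for word in words_to_mask:
    for pos in word_to_positions[word]:
        new_labels[pos] = labels[pos]     # keep the target only where we mask
        input_ids[pos] = MASK_TOKEN_ID    # hide every sub-token of the word

print(input_ids)
print(new_labels)
```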

In [38]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n>>> {tokenizer.decode(chunk)}")


>>> [CLS] [MASK] rented i am curious - [MASK] from my video store because of all the controversy that surrounded it when [MASK] was [MASK] released [MASK] 1967. i [MASK] heard that at first it was seized by u. [MASK]. customs if [MASK] ever [MASK] to enter this country, therefore being [MASK] [MASK] of films considered " controversial " i really had to see this for myself. < br / > [MASK] br / > the plot is [MASK] around a young swedish drama [MASK] named lena who wants to learn [MASK] she can about life [MASK] [MASK] particular she wants [MASK] focus her attentions to making some sort of [MASK] on what the average swede [MASK] [MASK] certain political [MASK] [MASK]

>>> as [MASK] [MASK] war and race issues in the united states. in between asking [MASK] and ordinary denizens of stockholm about their opinions on politics, [MASK] [MASK] sex with her drama teacher, classmates, [MASK] married men. < br / > [MASK] br / > what kills me about i am curious - yellow [MASK] that 40 [MASK] ago, 

Due to GPU compute constraints, we will perform this experiment on a small subset of the *imdb* dataset.

In [39]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(train_size=train_size, test_size = test_size, seed=42)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

# Finetune the model using Trainer

In [40]:
from huggingface_hub import notebook_login
notebook_login()

In [41]:
# FYI
len(downsampled_dataset["train"])

10000

In [42]:
from transformers import TrainingArguments

batch_size = 64
# Log the training loss roughly once per epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]


training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb-HF-tutorial-using-trainer",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,  #  By default, the repository used will be in your namespace and named after the output directory you set
    fp16=True,  # used fp16=True to enable mixed-precision training, which gives us another boost in speed
    logging_steps=logging_steps,
    remove_unused_columns=False,  #  By default, the Trainer will remove any columns that are not part of the model’s forward() method. This means that if you’re using the whole word masking collator, you’ll also need to set remove_unused_columns=False to ensure we don’t lose the word_ids column during training.
)




In [43]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer,
)

Before finetuning, evaluating the model on the *imdb* `eval_dataset` gives a high perplexity. We will see that this perplexity decreases after the model is finetuned on the data.

In [44]:
import math
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 62.48


In [45]:
eval_results

{'eval_loss': 4.134828090667725,
 'eval_runtime': 9.112,
 'eval_samples_per_second': 109.745,
 'eval_steps_per_second': 1.756}
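
Perplexity here is simply the exponential of the mean cross-entropy loss, so the two numbers above can be checked against each other. A minimal check using the logged `eval_loss`:

```python
import math

eval_loss = 4.134828090667725  # 'eval_loss' from the evaluation output above
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")
```

One informal way to read the result: on average, the model is about as uncertain about a masked token as if it were choosing uniformly among roughly 62 candidates.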

In [46]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,3.5591,3.321807
2,3.4085,3.286242
3,3.3696,3.27964


TrainOutput(global_step=471, training_loss=3.4457162646477895, metrics={'train_runtime': 180.0438, 'train_samples_per_second': 166.626, 'train_steps_per_second': 2.616, 'total_flos': 994208670720000.0, 'train_loss': 3.4457162646477895, 'epoch': 3.0})
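
The step count in the output above can be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the `Trainer` default of 3 epochs (consistent with the three logged epochs); none of these are set explicitly in the training arguments.

```python
import math

num_train_samples = 10_000
batch_size = 64
num_epochs = 3  # assumed: the Trainer default, matching the three logged epochs

steps_per_epoch = math.ceil(num_train_samples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
logging_steps = num_train_samples // batch_size

print(steps_per_epoch, total_steps, logging_steps)
```

Because `logging_steps` (156) is one less than the number of steps per epoch (157), the training loss is logged at steps 156, 312 and 468, which is close to, but not exactly, once per epoch.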

After this short training run on a small subset of the data, the perplexity did go down, from 62.48 to 27.29.

In [48]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 27.29


Now we can try to predict on "This is a great [MASK]." As noted in the HuggingFace tutorial, the model's predictions are now a little more aligned to the language of the *imdb* movie reviews.

In [61]:
text = "This is a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")

inputs = inputs.to(model.device)

# the model outputs logits over the whole vocabulary for each position in the sequence
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
print(f"mask_token_index = {mask_token_index}")
mask_token_logits = token_logits[0, mask_token_index, :]
print(f"mask_token_logits.shape = {mask_token_logits.shape}")

# pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

mask_token_index = tensor([5], device='cuda:0')
mask_token_logits.shape = torch.Size([1, 30522])
'>>> This is a great film.'
'>>> This is a great movie.'
'>>> This is a great idea.'
'>>> This is a great adventure.'
'>>> This is a great one.'


# Finetuning using `accelerate`

The technique presented in the HF tutorial was to apply the masking once on the whole test set, thus eliminating the randomness in the evaluation stage. First, here is a function that applies the masking on a batch:


In [64]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = whole_word_masking_data_collator(features)
    return {
        "masked_" + k: v.numpy() for k, v in masked_inputs.items()
    }
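
To see why masking once matters, here is a small stdlib-only sketch contrasting a fresh random draw at every evaluation pass with a draw that is stored and reused. The helper `random_mask`, the sequence length, and the seed are all hypothetical and chosen only for illustration; masking is per token here for brevity rather than per whole word.

```python
import random

def random_mask(num_tokens, probability, rng):
    """Return the positions chosen for masking (one Bernoulli draw per token)."""
    return [i for i in range(num_tokens) if rng.random() < probability]

rng = random.Random(0)  # seed chosen arbitrarily for the illustration

# Dynamic masking: a fresh draw at every evaluation pass
dynamic = [random_mask(20, 0.2, rng) for _ in range(3)]

# Static masking: draw once, then reuse the stored result at every pass
stored = random_mask(20, 0.2, rng)
static = [stored for _ in range(3)]

print(dynamic)
print(static)
```

With the stored mask, every epoch is scored on exactly the same masked positions, so changes in perplexity reflect the model rather than the luck of the draw.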

Next, we apply this function to the test set once, then prepare everything needed for training with the `Accelerator` object: the model, the train and eval dataloaders, and the optimizer.

In [66]:
# downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])

# apply this function to our test set and drop the unmasked columns so we can replace them with the masked ones.
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [67]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=whole_word_masking_data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [68]:
# follow the standard steps with 🤗 Accelerate. First, load a fresh version of the pretrained model:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)


With this fresh model, let's check the predictions again. As expected, the top 5 predicted tokens to replace [MASK] are quite generic.

In [69]:
text = "This is a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")

inputs = inputs.to(model.device)

# the model outputs logits over the whole vocabulary for each position in the sequence
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
print(f"mask_token_index = {mask_token_index}")
mask_token_logits = token_logits[0, mask_token_index, :]
print(f"mask_token_logits.shape = {mask_token_logits.shape}")

# pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

mask_token_index = tensor([5])
mask_token_logits.shape = torch.Size([1, 30522])
'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


In [70]:
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

In [71]:

from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader,
)

In [72]:
from transformers import get_scheduler
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
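
For intuition, a linear schedule with zero warmup steps decays the learning rate in a straight line from its initial value to zero over the training steps. The function below is my own sketch of that shape, not the library's code; the value 471 assumes 157 batches per epoch over 3 epochs.

```python
def linear_lr(step, base_lr=5e-5, num_training_steps=471, num_warmup_steps=0):
    """Linear warmup followed by linear decay to zero (sketch, not library code)."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

# Learning rate at the start of training and at the end of each epoch
for step in (0, 157, 314, 471):
    print(step, linear_lr(step))
```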

In [73]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-HF-tutorial-using-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'thuann2cats/distilbert-base-uncased-finetuned-imdb-HF-tutorial-using-accelerate'

In [74]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/thuann2cats/distilbert-base-uncased-finetuned-imdb-HF-tutorial-using-accelerate into local empty directory.


The training loop below follows the standard 🤗 Accelerate steps used in the HuggingFace tutorial.

In [75]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/471 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 27.295514988873794
>>> Epoch 1: Perplexity: 25.70487689993093
>>> Epoch 2: Perplexity: 25.22046200054443


As expected, the perplexity went down compared to before finetuning. This is unsurprising, since the training setup is essentially the same as before; we just replaced `Trainer` with a manual loop driven by `Accelerator`.
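
One detail of the evaluation loop is worth spelling out: each batch loss is repeated `batch_size` times before gathering, and the result is then truncated to the dataset length. The bookkeeping below is a sketch that assumes a single process and the 1000-sample evaluation set used here.

```python
num_eval_samples = 1000
batch_size = 64

# Sizes of the evaluation batches: 15 full batches and one partial batch
batch_sizes = [min(batch_size, num_eval_samples - start)
               for start in range(0, num_eval_samples, batch_size)]

# Each batch loss is repeated batch_size times, even for the short last batch
repeated = len(batch_sizes) * batch_size

# Truncating to the dataset size drops the surplus copies of the last batch's loss
surplus = repeated - num_eval_samples

print(len(batch_sizes), batch_sizes[-1], repeated, surplus)
```

After truncation, the last batch's loss is counted 40 times instead of 64, so the mean weights each batch by the number of samples it actually contains.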

Predictions are more related to movies, as expected.

Since the model was also uploaded as a repository on the HuggingFace Hub, we can load it with `pipeline`:


In [77]:
from transformers import pipeline 
predictor = pipeline("fill-mask", model="thuann2cats/distilbert-base-uncased-finetuned-imdb-HF-tutorial-using-accelerate")

In [87]:
text = "This is a great [MASK]."
preds = predictor(text)
for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great film.
>>> this is a great movie.
>>> this is a great idea.
>>> this is a great one.
>>> this is a great story.


If we make the model predict a [MASK] in a generic setting, the model seems to veer towards movies as well:

In [101]:
text = "I thought the [MASK] was interesting."
preds = predictor(text)
for pred in preds:
    print(f">>> {pred['sequence']}")


>>> i thought the movie was interesting.
>>> i thought the film was interesting.
>>> i thought the story was interesting.
>>> i thought the plot was interesting.
>>> i thought the show was interesting.


In [102]:
text = "A lot of people did not like the [MASK]."
preds = predictor(text)
for pred in preds:
    print(f">>> {pred['sequence']}")


>>> a lot of people did not like the film.
>>> a lot of people did not like the movie.
>>> a lot of people did not like the show.
>>> a lot of people did not like the book.
>>> a lot of people did not like the story.


In [103]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/thuann2cats/distilbert-base-uncased-finetuned-imdb-HF-tutorial-using-trainer/commit/d505f785167ee9e0aa6908a6b9af80cbf02067b5', commit_message='End of training', commit_description='', oid='d505f785167ee9e0aa6908a6b9af80cbf02067b5', pr_url=None, pr_revision=None, pr_num=None)