In [1]:
# We will use the DistilBERT model for this masked language modeling (MLM) fine-tuning task.
# DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base.
# It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
# DistilBERT is thus a good fit for quick prototyping and for production environments where real-time inference is necessary.

from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [2]:
# how many parameters does our model have?
distilbert_num_parameters = model.num_parameters()
print(f"Number of parameters in {model_checkpoint}: {distilbert_num_parameters}")

Number of parameters in distilbert-base-uncased: 66985530
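
As a rough sanity check on the "40% fewer parameters" claim, the count printed above can be compared with BERT base. This is only a sketch: the 110M figure for bert-base-uncased is the commonly quoted approximate size and is an assumption here, not a value measured in this notebook.

```python
# Parameter count printed by the cell above.
distilbert_num_parameters = 66_985_530
# Assumption: approximate, commonly quoted size of bert-base-uncased.
bert_base_num_parameters = 110_000_000

# Fraction of parameters saved relative to BERT base.
reduction = 1 - distilbert_num_parameters / bert_base_num_parameters

print(f"DistilBERT: {distilbert_num_parameters / 1e6:.1f}M parameters")
print(f"Reduction vs. BERT base: {reduction:.0%}")
```

With these numbers the reduction comes out at about 39%, in line with the rounded 40% quoted in the comments.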


Testing the model as is...

In [3]:
# test example
text = "This is a great [MASK]."

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [12]:
# tokenize our text, pass it to the model and get output predictions
import torch

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    token_logits = model(**inputs).logits

# locate the [MASK] position, then get the top 5 predicted tokens and their probabilities
masked_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = torch.nn.functional.softmax(token_logits[0, masked_index], dim=-1)
top_5 = torch.topk(probs, 5, dim=-1)  # a single top-k call keeps tokens and probabilities aligned

# print the results
for token, prob in zip(top_5.indices.tolist(), top_5.values.tolist()):
    print(tokenizer.decode([token]), prob)

deal 0.0365118607878685
success 0.0239587239921093
adventure 0.0237447340041399
idea 0.016085002571344376
feat 0.010877519845962524
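
To make the softmax-then-top-k step concrete, here is a minimal pure-Python sketch on a toy vocabulary. The five words and their logits are invented for illustration and are not model outputs; the helper names `softmax` and `top_k` are hypothetical as well.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]

# Hypothetical vocabulary and logits for the [MASK] position.
toy_vocab = ["deal", "movie", "idea", "film", "day"]
toy_logits = [2.0, 0.5, 1.0, 0.1, -1.0]

probs = softmax(toy_logits)
top = top_k(probs, 3)
for idx, p in top:
    print(toy_vocab[idx], round(p, 4))
```

Because softmax preserves the ordering of its inputs, ranking by logits and ranking by probabilities select the same tokens. The real model works the same way, only over a vocabulary of roughly 30,000 tokens, which is why even the top prediction above has a probability of only a few percent.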


These are generic, everyday completions, which is what we would expect from a model pretrained on general-purpose text.

To showcase domain adaptation, we'll use the famous Large Movie Review Dataset (or IMDb for short), a corpus of movie reviews that is often used to benchmark sentiment analysis models.

By fine-tuning DistilBERT on this corpus, we expect the language model to adapt its vocabulary from the factual Wikipedia data it was pretrained on to the more subjective language of movie reviews.

We can get the movie review data from the Hugging Face Hub with the load_dataset() function from the Datasets library:

In [13]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
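
The split structure above can be explored with plain Python. The sketch below uses a toy stand-in with the same split names and fields; every row is invented for illustration. It counts labels per split (0 is negative, 1 is positive, and -1 marks unlabeled rows) and builds a text-only corpus. Masked language modeling does not need the sentiment labels, so pooling the train and unsupervised splits is one possible choice, stated here as an assumption rather than something this notebook has done.

```python
from collections import Counter

# Toy stand-in for the DatasetDict above: same split names and fields, invented rows.
toy_imdb = {
    "train": [
        {"text": "A great film.", "label": 1},
        {"text": "Dull plot.", "label": 0},
    ],
    "test": [{"text": "Loved it.", "label": 1}],
    "unsupervised": [{"text": "Saw it twice.", "label": -1}],
}

# Count how often each label occurs in each split.
label_counts = {
    split: Counter(row["label"] for row in rows)
    for split, rows in toy_imdb.items()
}

# Text-only corpus for MLM; pooling these two splits is an assumption.
mlm_corpus = [
    row["text"]
    for split in ("train", "unsupervised")
    for row in toy_imdb[split]
]

print(label_counts)
print(len(mlm_corpus), "texts available for MLM")
```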

In [15]:
# let's check out the train dataset: shuffle with a fixed seed and print 3 samples
sample_train = imdb_dataset["train"].shuffle(seed=42).select(range(3))
for row in sample_train:
    print("text: ", row["text"])
    print("label: ", row["label"])
    print()


text:  There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
label:  1

text:  This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the m

In [16]:
# sanity check on the test dataset: do the labels match the sentiment of the reviews?
sample_test = imdb_dataset["test"].shuffle(seed=42).select(range(3))
for row in sample_test:
    print("text: ", row["text"])
    print("label: ", row["label"])
    print()

text:  <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some others out there.<br /><br /

In [17]:
# what does the unsupervised dataset look like? (its rows are unlabeled, so the label is the placeholder -1)
sample_unsupervised = imdb_dataset["unsupervised"].shuffle(seed=42).select(range(3))
for row in sample_unsupervised:
    print("text: ", row["text"])
    print("label: ", row["label"])
    print()

text:  If you've seen the classic Roger Corman version starring Vincent Price it's hard to put it out of your head, but you probably should do because this one is totally different. Subtlety has been abandoned in favour of gross-out horror - nudity, gore and all-round unpleasantness. OK it's ridiculous, trashy, sensationalised and historically dubious (did any members of the Inquisition really wear horn-rimmed glasses?), but despite all this it is strangely compelling. I literally couldn't tear myself away from the screen until the end of the movie. If there's a bigger compliment you can pay to a film I don't know what it is.
label:  -1

text:  For me, this was the most moving film of the decade. Samira Makhmalbaf shows pure bravery and vision in the making. She has an intelligence and gift for speaking to the people, regardless of their nationality or beliefs. I am inspired and touched by her humanity and can only hope that she has touched many people the same way. Her message in this