# Fine-tuning a masked language model
If your text uses domain-specific language, fine-tuning only the model head is not enough, because the underlying language model might treat important domain tokens as unknown.

In these cases you must first fine-tune the underlying model (e.g. BERT) on your corpus, and THEN build/train a task-specific model on top of it. This process is called *domain adaptation*.
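To make the vocabulary problem concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenization, the general scheme BERT-family tokenizers follow. The function `wordpiece_tokenize` and the tiny `toy_vocab` are illustrative assumptions, not the real DistilBERT tokenizer or vocabulary. A word the vocabulary does not cover is either broken into fragments or mapped to `[UNK]`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real DistilBERT one
toy_vocab = {"cinema", "##tography", "##tic", "film", "the"}

for word in ("film", "cinematography", "giallo"):
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

In this toy setup a covered word stays whole, a partly covered word is split into pieces, and an uncovered word collapses to `[UNK]`.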

Let's do this for a **masked language model** that can autocomplete sentences, using **DistilBERT**.

In [1]:
import os
import sys


# Connect to google drive
from google.colab import drive
os.chdir("/content")
drive.mount("/content/gdrive")

# Load colab_utils functions
sys.path.append("/content/gdrive/MyDrive/repos/colab-utils")
import colab_utils

colab_utils.load_env_vars()
colab_utils.git_set_config()

PARENT_FOLDER = "/content/gdrive/MyDrive/repos"
os.chdir(PARENT_FOLDER)

git_repo = 'trevorki/huggingface-nlp' # replace with actual values
colab_utils.git_clone_repo(git_repo)

REPO_FOLDER = f"{PARENT_FOLDER}/{git_repo.split('/')[1]}"
os.chdir(REPO_FOLDER)

!pip install -r requirements.txt

Mounted at /content/gdrive
Collecting evaluate (from -r requirements.txt (line 8))
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting zstandard (from -r requirements.txt (line 10))
  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting python-dotenv (from -r requirements.txt (line 11))
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Collecting datasets (from -r requirements.txt (line 12))
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
Collecting dill (from evaluate->-r requirements.txt (line 8))
  Downloading dill-0.3.8-py3-none-any.whl 

In [2]:
from transformers import TFAutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


All PyTorch model weights were used when initializing TFDistilBertForMaskedLM.

All the weights of TFDistilBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [4]:
# Try it out on sample text
text = "This is a great [MASK]."

In [5]:
import numpy as np
import tensorflow as tf

inputs = tokenizer(text, return_tensors="np")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = np.argwhere(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
# We negate the array before argsort to get the largest, not the smallest, logits
top_5_tokens = np.argsort(-mask_token_logits)[:5].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

>>> This is a great deal.
>>> This is a great success.
>>> This is a great adventure.
>>> This is a great idea.
>>> This is a great feat.
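The ranking above works directly on raw logits. If you also want to see how confident the model is, the logits can be passed through a softmax first. Below is a minimal sketch in plain Python; `top_k_with_probs` and the `toy_logits` values are made up for illustration and stand in for `mask_token_logits`.

```python
import math


def top_k_with_probs(logits, k):
    """Return (index, probability) pairs for the k largest logits."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort indices by logit, largest first (same ordering as argsort on the negated array)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


toy_logits = [1.0, 3.0, 0.5, 2.0]  # made-up scores over a 4-token vocabulary
for index, prob in top_k_with_probs(toy_logits, k=2):
    print(f"token {index}: p = {prob:.3f}")
```

Softmax preserves the order of the logits, so the top candidates are the same either way; only the scale of the scores changes.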


# Fine-tuning dataset
These are very general terms based on the generic DistilBERT vocabulary. Let's make them more specific to movie reviews by training on the IMDb [Large Movie Review Dataset](https://huggingface.co/datasets/imdb).

The dataset has labels `[0,1]` for negative and positive reviews, but we will ignore those and just use the text.

In [6]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [7]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

## Preprocessing
Some reviews are very short and some are very long, which causes a problem with input size. We don't want to truncate the long ones (losing information), nor do we want to pad the short ones up to a common length (computationally wasteful).

The standard approach for a corpus whose entries vary widely in length is to concatenate all the examples and then split the whole corpus into chunks of equal size, instead of tokenizing and truncating each example individually.

Steps:
- tokenize the corpus ***without*** setting `truncation=True`, and also collect the `word_ids`
- remove the `text` and `label` columns (no longer needed)
- group together all examples
- break into chunks

In [8]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

### Chunking the data
To determine the chunk size we need to know the model's maximum input size. For DistilBERT it is `512` tokens, though other models accept longer inputs.
To make this work in a Google Colab environment, set the chunk size to something smaller, like `128` (but bigger really is better).

In [9]:
chunk_size = 128

print(f"model max input length = {tokenizer.model_max_length}")
print(f"we will use chunk_size = {chunk_size}")

model max input length = 512
we will use chunk_size = 128


In [10]:
# See how many tokens in each review
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


In [11]:
# concatenate them together
concatenated_examples = {k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()}
total_length = len(concatenated_examples["input_ids"])
print(f"'Concatenated reviews length: {total_length}'")

# break into chunks
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'Concatenated reviews length: 800'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


We can either pad the last chunk up to `chunk_size`, or we can just drop it. Let's drop it.
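For comparison, here is a minimal sketch of both options on a plain Python list. The helper `chunk_tokens` and the choice of `pad_id=0` are assumptions for illustration. If you did pad, you would also need to mark the padded positions in `attention_mask` and exclude them from the labels, which this sketch leaves out.

```python
def chunk_tokens(token_ids, chunk_size, pad_id=None):
    """Split a flat token list into chunks; pad the tail if pad_id is given, else drop it."""
    chunks = [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
    if chunks and len(chunks[-1]) < chunk_size:
        if pad_id is None:
            chunks.pop()  # drop the short tail
        else:
            missing = chunk_size - len(chunks[-1])
            chunks[-1] = chunks[-1] + [pad_id] * missing  # pad the short tail
    return chunks


tokens = list(range(800))  # stand-in for the 800 concatenated token ids above
dropped = chunk_tokens(tokens, chunk_size=128)
padded = chunk_tokens(tokens, chunk_size=128, pad_id=0)  # pad_id=0 is an assumption
print(f"dropped tail: {len(dropped)} chunks")
print(f"padded tail:  {len(padded)} chunks, last chunk length {len(padded[-1])}")
```

Dropping loses at most `chunk_size - 1` tokens per batch of concatenated text, which is negligible for a corpus of this size.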

In [12]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column that contains the ground truth for the prediction task
    result["labels"] = result["input_ids"].copy()
    return result

In [13]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [14]:
# Examine example
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

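Note that wherever two different reviews meet inside a chunk, they are separated by the `[SEP] [CLS]` tokens. (The chunk shown above falls entirely within one review, so it contains neither.)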
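A toy example makes the boundary visible. The two "reviews" below are made up and use string tokens instead of ids; `spans_two_reviews` is a hypothetical helper, and the tiny chunk size is chosen only so that one chunk straddles the boundary.

```python
CLS, SEP = "[CLS]", "[SEP]"

# Two made-up "reviews", already wrapped in special tokens as the tokenizer would do
review_a = [CLS, "great", "film", "with", "a", "clever", "plot", SEP]
review_b = [CLS, "dull", "and", "far", "too", "long", SEP]

stream = review_a + review_b  # concatenate, exactly as in the chunking step
toy_chunk_size = 5
toy_chunks = [stream[i : i + toy_chunk_size] for i in range(0, len(stream), toy_chunk_size)]


def spans_two_reviews(chunk):
    """True if the chunk contains a [SEP] immediately followed by a [CLS]."""
    return any(a == SEP and b == CLS for a, b in zip(chunk, chunk[1:]))


for chunk in toy_chunks:
    note = "  <- spans two reviews" if spans_two_reviews(chunk) else ""
    print(" ".join(chunk) + note)
```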

## DataCollator: insert random `[MASK]` tokens so that our model can learn.
Up to this point the inputs and labels are identical. We need to insert `[MASK]` tokens into the inputs; otherwise the inputs would simply equal the labels and there would be nothing to learn.

In [15]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"]) == tokenizer.decode(lm_datasets["train"][1]["labels"])

True

In [16]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability = 0.15 # fraction of tokens to mask
)

In [17]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] i rented i am curious - yellow from my video store [MASK] of all the controversy that surrounded it when [MASK] was first released in 1967. i [MASK] heard [MASK] at first it was seized by u. s. customs if it ever tried [MASK] enter this country, therefore being a fan [MASK] films considered " controversial " i really had to see this for myself. < br / [MASK] < br / > the plot 街 centered around [MASK] [MASK] swedish宇 student named lena [MASK] wants to learn everything she can about [MASK]. in particular she wants to focus her attention [MASK] to making some sort of documentary on what the average swede thought about certain [MASK] issues such'

'>>> as [MASK] vietnam war [MASK] race issues in the united states. in [MASK] asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me [MASK] i am curious - yellow is that 40 years [MASK] [MASK] this was cons
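The decoded chunks contain a few unexpected tokens in addition to `[MASK]`. That is consistent with the usual BERT-style masking scheme: of the positions selected for prediction, roughly 80% become `[MASK]`, 10% are swapped for a random token and 10% are left unchanged, while the labels at all unselected positions are set to `-100` so the loss ignores them. The sketch below illustrates that scheme in plain Python. It is not the library's implementation: `mask_tokens` is a hypothetical helper, the token ids are made up, and unlike the real collator it makes no attempt to skip special tokens.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) following the 80/10/10 scheme."""
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 means "ignored by the loss"
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: input and label stay untouched
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return masked, labels


ids = list(range(1000, 1128))  # 128 made-up token ids
# mask_id and vocab_size below are assumed values for an uncased BERT vocabulary
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522, rng=random.Random(0))
n_selected = sum(label != -100 for label in labels)
print(f"{n_selected} of {len(ids)} positions selected for prediction")
```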

**Problem:** This collator masks individual tokens, which may or may not make up an entire word. We want to use **whole-word masking**, so we will write our own data collator, which is just a function that takes a list of samples and converts them into a batch.

In [18]:
import collections
import numpy as np
from transformers.data.data_collator import tf_default_data_collator

wwm_probability = 0.2 # probability that a whole word is masked


def whole_word_masking_data_collator(datasets):
    for dataset in datasets:
        word_ids = dataset.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = dataset["input_ids"]
        labels = dataset["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        dataset["labels"] = new_labels

    return tf_default_data_collator(datasets)
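To see what the `mapping` built inside the collator looks like, the grouping step can be isolated in a small helper. This is a sketch of the same loop on a hand-written input; `group_tokens_by_word` and `toy_word_ids` are illustrative names, not part of the notebook's pipeline.

```python
import collections


def group_tokens_by_word(word_ids):
    """Map each word index to the list of token positions that make up that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP] belong to no word
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical word_ids for a 6-token input: [CLS], one single-token word,
# one word split into two tokens, another single-token word, [SEP]
toy_word_ids = [None, 0, 1, 1, 2, None]
print(group_tokens_by_word(toy_word_ids))
# {0: [1], 1: [2, 3], 2: [4]}
```

Masking word 1 therefore means masking token positions 2 and 3 together, which is exactly what the inner loop of the collator does.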

In [19]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for i, chunk in enumerate(batch["input_ids"]):
    print(f"\n>>> {tokenizer.decode(chunk)}'")


>>> [CLS] i rented i [MASK] curious - yellow from my video store because of all the [MASK] that surrounded [MASK] [MASK] it was [MASK] released in 1967 [MASK] i [MASK] [MASK] that at first it was [MASK] by u. s. customs if [MASK] ever tried to enter this country, therefore being [MASK] fan [MASK] [MASK] considered [MASK] controversial " i really had to see this for myself [MASK] < br [MASK] > < [MASK] / > the plot is centered around a [MASK] swedish drama [MASK] named lena who wants to learn everything she [MASK] about life. in particular she [MASK] to focus her [MASK] [MASK] to making some sort of documentary on what the average swede thought about certain political issues such'

>>> as [MASK] vietnam war and race issues in the united states. in between asking [MASK] [MASK] ordinary denizens of stockholm about their [MASK] on politics, she has sex with her drama teacher, [MASK] [MASK] and married men [MASK] [MASK] br / > < br / [MASK] [MASK] kills [MASK] about i am [MASK] [MASK] yell

# Training
We will downsample the training set to reduce training time (this step is optional).


In [21]:
hf_user = "Roverto"
model_name = f"{model_checkpoint}-finetuned-imdb"
hf_repo = f"{hf_user}/{model_name}"
hf_repo

'Roverto/distilbert-base-uncased-finetuned-imdb'

In [27]:
train_size = 20_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2000
    })
})

In [28]:
tf_train_dataset = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    downsampled_dataset["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

In [29]:
from transformers import create_optimizer

import tensorflow as tf

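# Note: len(tf_train_dataset) counts the batches in ONE epoch; multiply by the
# number of epochs if the schedule should cover the whole training run
num_train_steps = len(tf_train_dataset)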
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
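The warmup and step-count arguments above describe a learning-rate schedule. As a rough illustration of the idea, here is a generic linear-warmup-then-linear-decay schedule in plain Python. It is a sketch in the spirit of such schedules, not the code `create_optimizer` runs; `warmup_linear_decay` is a hypothetical helper and the step counts are illustrative numbers.

```python
def warmup_linear_decay(step, init_lr, num_warmup_steps, num_train_steps):
    """Learning rate at `step`: linear ramp up to init_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return init_lr * (1.0 - progress)


# Illustrative numbers only: 3 epochs of 625 batches, 10% of the steps used for warmup
total_steps = 3 * 625
warmup_steps = total_steps // 10
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = warmup_linear_decay(step, 2e-5, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

The point of the sketch is that the warmup length only makes sense relative to the total number of steps, so the two should be chosen together.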


In [30]:
# Check perplexity of model before retraining
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 13.11
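Perplexity is the exponential of the mean cross-entropy loss, which is why the cell above calls `math.exp(eval_loss)`. A minimal sketch of that relationship, using made-up probabilities (`perplexity` is a hypothetical helper):

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Made-up probabilities a model assigned to the correct token at four masked positions
print(f"{perplexity([0.5, 0.25, 0.1, 0.05]):.2f}")

# A model that always spreads its belief uniformly over 8 options has perplexity 8
print(f"{perplexity([1 / 8] * 10):.2f}")
```

Read this way, a perplexity of 13.11 corresponds to a mean loss of about 2.57 and means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 13 tokens at each masked position.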


In [33]:
n_epochs = 3
model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    epochs = n_epochs,
    verbose = 1
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7e6ef5a819f0>

In [35]:
# Check perplexity of model AFTER retraining
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 11.01


In [34]:
model_folder = f"{REPO_FOLDER}/{model_name}"
model_folder

'/content/gdrive/MyDrive/repos/huggingface-nlp/distilbert-base-uncased-finetuned-imdb'

In [38]:
from huggingface_hub import HfApi

# # Create repo if it doesn't exist
# from huggingface_hub import create_repo
# create_repo(f"{model_name}", token = os.environ["HF_TOKEN"])

# save model and tokenizer to local folder
model.save_pretrained(model_folder)
tokenizer.save_pretrained(model_folder)

# upload folder to Huggingface Hub
api = HfApi()
api.upload_folder(
    folder_path=model_folder,
    repo_id=hf_repo,
    repo_type="model"
)

CommitInfo(commit_url='https://huggingface.co/Roverto/distilbert-base-uncased-finetuned-imdb/commit/e2e0f7dc89f1c689d34561f14677dc409671e8cc', commit_message='Upload folder using huggingface_hub', commit_description='', oid='e2e0f7dc89f1c689d34561f14677dc409671e8cc', pr_url=None, pr_revision=None, pr_num=None)

In [39]:
# Test out the model
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask",
    model=hf_repo
)

All model checkpoint layers were used when initializing TFDistilBertForMaskedLM.

All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at Roverto/distilbert-base-uncased-finetuned-imdb.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [40]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great film.
>>> this is a great movie.
>>> this is a great idea.
>>> this is a great deal.
>>> this is a great one.
