# Fine-tuning a masked language model
If your text uses domain-specific language, fine-tuning only the model head is not enough, because the underlying language model may treat important domain terms as unknown tokens.

In these cases you must first fine-tune the underlying model (e.g. BERT) on your corpus, and then build and train a task-specific model on top of it. This process is called *domain adaptation*.
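To see why the vocabulary matters, here is a minimal sketch of WordPiece-style greedy longest-match tokenization. The `wordpiece` helper and `toy_vocab` are hypothetical and made up for illustration; they are not the DistilBERT tokenizer or its real vocabulary. A word that is missing from the vocabulary is either split into smaller known pieces or, failing that, mapped to `[UNK]`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (toy version)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real DistilBERT vocabulary
toy_vocab = {"great", "film", "cine", "##ma", "##tic"}

print(wordpiece("great", toy_vocab))       # ['great']
print(wordpiece("cinematic", toy_vocab))   # ['cine', '##ma', '##tic']
print(wordpiece("grindhouse", toy_vocab))  # ['[UNK]']
```

Either outcome loses information about domain terms, which is part of the motivation for adapting the underlying model to the corpus.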

Let's do this for a **masked language model** that can autocomplete sentences, using **DistilBERT**.

In [1]:
from transformers import TFAutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

All PyTorch model weights were used when initializing TFDistilBertForMaskedLM.

All the weights of TFDistilBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [12]:
# Try it out on sample text
text = "This is a great [MASK]."

In [13]:
import numpy as np
import tensorflow as tf

inputs = tokenizer(text, return_tensors="np")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = np.argwhere(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
# We negate the array before argsort to get the largest, not the smallest, logits
top_5_tokens = np.argsort(-mask_token_logits)[:5].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

>>> This is a great deal.
>>> This is a great success.
>>> This is a great adventure.
>>> This is a great idea.
>>> This is a great feat.
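The logits above are unnormalized scores. As a small dependency-free sketch of the same top-k selection, the hypothetical helper `top_k_with_probs` below also converts the scores to softmax probabilities. The `toy_logits` values are made up and do not come from the model.

```python
import math


def top_k_with_probs(logits, k):
    """Return (index, probability) pairs for the k largest logits."""
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Sort indices by descending logit, mirroring np.argsort(-logits)
    order = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    return [(i, exps[i] / total) for i in order]


toy_logits = [1.0, 3.5, -0.5, 2.0, 0.0]  # made-up scores for 5 "tokens"
for index, prob in top_k_with_probs(toy_logits, 3):
    print(f"token index {index}: p = {prob:.3f}")
```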


# Fine-tuning dataset
These are very general terms, reflecting DistilBERT's generic pretraining. Let's make the predictions more specific to movie reviews by training on the IMDb [Large Movie Review Dataset](https://huggingface.co/datasets/imdb).

The dataset has labels `[0,1]` for negative and positive reviews, but we will ignore those and just use the text.

In [14]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [17]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

## Preprocessing
Some reviews are very short and some are very long, which causes a problem with input size. We don't want to truncate the long ones (losing information), nor do we want to pad the short ones up to a fixed length (computationally wasteful).

The standard approach for a corpus whose entries vary widely in length is to tokenize everything, concatenate all the examples, and then split the whole corpus into chunks of equal size, instead of tokenizing and padding individual examples.

Steps:
- tokenize the corpus ***without*** setting `truncation=True`, and keep the `word_ids`
- remove the `text` and `label` columns (no longer needed)
- group together all examples
- break into chunks

In [18]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets


Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors



DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

### Chunking the data
To determine the chunk size we need to know the model's maximum input size. For DistilBERT it is `512`, though other models accept longer inputs.
To make this work in a Google Colab environment, use a smaller value such as `128`. (Larger chunks are better if memory allows.)

In [25]:
chunk_size = 128

print(f"model max input length = {tokenizer.model_max_length}")
print(f"we will use chunk_size = {chunk_size}")

model max input length = 512
we will use chunk_size = 128


In [27]:
# See how many tokens in each review
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'
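Using the three lengths printed above, a quick back-of-the-envelope comparison shows why chunking is attractive. This is only an illustrative calculation; the variable names are made up for the sketch.

```python
review_lengths = [363, 304, 133]  # token counts printed above
model_max_len = 512               # DistilBERT's maximum input length
demo_chunk_size = 128             # the chunk size chosen for this notebook

total_tokens = sum(review_lengths)

# Option 1: pad every review up to the model maximum
pad_tokens = sum(model_max_len - n for n in review_lengths)

# Option 2: truncate every review to a single chunk
truncated_tokens = sum(max(n - demo_chunk_size, 0) for n in review_lengths)

# Option 3: concatenate, chunk, and drop the incomplete last chunk
dropped_tokens = total_tokens % demo_chunk_size

print(f"total tokens:          {total_tokens}")      # 800
print(f"padding would add:     {pad_tokens}")        # 736
print(f"truncation would drop: {truncated_tokens}")  # 416
print(f"chunking drops only:   {dropped_tokens}")    # 32
```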


In [48]:
# concatenate them together
concatenated_examples = {k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()}
total_length = len(concatenated_examples["input_ids"])
print(f"'Concatenated reviews length: {total_length}'")

# break into chunks
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'Concatenated reviews length: 800'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


We can either pad the last chunk up to `chunk_size`, or we can just drop it. Let's drop it.
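For reference, the padding alternative could look roughly like the sketch below. `pad_last_chunk` is a hypothetical helper that is not used in the rest of this notebook. It assumes a pad token id of `0` (in practice, use `tokenizer.pad_token_id`) and uses `-100` for labels that the loss should ignore.

```python
def pad_last_chunk(chunk, chunk_size, pad_token_id=0):
    """Pad a short chunk up to chunk_size (hypothetical helper)."""
    n_pad = chunk_size - len(chunk["input_ids"])
    return {
        # Pad tokens fill the remainder of the chunk
        "input_ids": chunk["input_ids"] + [pad_token_id] * n_pad,
        # Attention mask is 0 on padding so the model ignores it
        "attention_mask": chunk["attention_mask"] + [0] * n_pad,
        # -100 marks positions that should not contribute to the loss
        "labels": chunk["labels"] + [-100] * n_pad,
    }


# Toy 32-token chunk with made-up token ids
short_chunk = {
    "input_ids": list(range(1000, 1032)),
    "attention_mask": [1] * 32,
    "labels": list(range(1000, 1032)),
}
padded = pad_last_chunk(short_chunk, chunk_size=128)
print(len(padded["input_ids"]), sum(padded["attention_mask"]))  # 128 32
```

Dropping the last chunk is simpler and loses at most `chunk_size - 1` tokens per batch of grouped texts.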

In [50]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    
    # Drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column that contains the ground truth for the prediction task
    result["labels"] = result["input_ids"].copy()
    return result

In [51]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [52]:
# Examine example
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

Note that wherever two different reviews meet inside a chunk, they are separated by the `[SEP] [CLS]` tokens.

## DataCollator: insert random `[MASK]` tokens so that our model can learn.
Up to this point the inputs and labels are identical. We need to replace some of the input tokens with `[MASK]`; otherwise the model has nothing to predict.

In [54]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"]) == tokenizer.decode(lm_datasets["train"][1]["labels"])

True

In [55]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm_probability = 0.15 # fraction of tokens to mask
)

In [83]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i am [MASK] - yellow from my video store because of all the controversy that surrounded [MASK] when it was first released in 1967. i also heard that at first [MASK] was seized by u. s. customs if it ever tried to [MASK] this country, therefore being a fan [MASK] films considered " controversial [MASK] i really [MASK] [MASK] see this migrated [MASK]. < br / > < br / jeanne the plot is centered around a young swedish drama student [MASK] lena who wants to learn everything she can about [MASK]. in particular she wants to focus her attentions [MASK] making some sort of documentary on what the average swede [MASK] about [MASK] political issues such'

'>>> as the vietnam war [MASK] race issues in the [MASK] states [MASK] in between asking politicians and [MASK] denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates [MASK] [MASK] married [MASK] [MASK] < br / > < br / > what kills me about i am mud - yellow is that 40 years a
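Besides `[MASK]`, the output above contains a few out-of-place words such as `jeanne`, `migrated` and `mud`. These likely come from the BERT-style masking scheme: of the tokens selected for prediction, about 80% become `[MASK]`, 10% are swapped for a random token, and 10% are left unchanged. The sketch below illustrates that scheme in plain Python. `mask_tokens` is a hypothetical helper, not the library's implementation; unlike the real collator it does not skip special tokens, and the ids `103` and `30522` are assumed values for the mask token and vocabulary size.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Toy BERT-style masking: returns (masked_inputs, labels)."""
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = ignored by the loss
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


toy_ids = list(range(1000, 1020))  # made-up token ids
masked, labels = mask_tokens(
    toy_ids, mask_id=103, vocab_size=30522, rng=random.Random(0)
)
print(masked)
print(labels)
```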

**Problem:** This collator masks individual tokens, which may or may not make up an entire word. We want to use **whole-word masking** instead, so we write our own data collator, which is just a function that takes a list of samples and converts them into a batch.

In [72]:
import collections
import numpy as np
from transformers.data.data_collator import tf_default_data_collator

wwm_probability = 0.2 # probability that a whole word is masked


def whole_word_masking_data_collator(datasets):
    for dataset in datasets:
        word_ids = dataset.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = dataset["input_ids"]
        labels = dataset["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        dataset["labels"] = new_labels

    return tf_default_data_collator(datasets)

In [84]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for i, chunk in enumerate(batch["input_ids"]):
    print(f"\n>>> {tokenizer.decode(chunk)}'")


>>> [CLS] i rented i am curious [MASK] [MASK] from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at [MASK] [MASK] was seized by u [MASK] s. customs if [MASK] ever [MASK] to [MASK] this country [MASK] therefore being a fan of films considered " [MASK] [MASK] i really had to see this for myself. < br [MASK] > < [MASK] [MASK] > the plot is centered around a young swedish drama student [MASK] lena [MASK] wants [MASK] learn everything she can about life. in particular she wants [MASK] focus her attentions to making some sort of documentary on what [MASK] average swede thought about certain [MASK] issues such'

>>> as [MASK] vietnam war and race issues in [MASK] united states [MASK] in between asking politicians and ordinary denizens of [MASK] about their opinions on [MASK], she has [MASK] with [MASK] drama teacher, classmates [MASK] and married men [MASK] < [MASK] / [MASK] < [MASK] [MASK] > what kills me [MASK] i am c

# Training
We will downsample the training set to reduce training time (optional).


In [93]:
model_checkpoint

'distilbert-base-uncased'

In [97]:
hf_user = "Roverto"
model_name = f"{model_checkpoint}-finetuned-imdb"
hf_repo = f"{hf_user}/{model_name}"
hf_repo

'Roverto/distilbert-base-uncased-finetuned-imdb'

In [86]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [88]:
from huggingface_hub import notebook_login

notebook_login()


In [89]:
tf_train_dataset = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    downsampled_dataset["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

In [100]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

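# One epoch over the 10,000-chunk training set at batch size 32 is 312 steps,
# so the 1,000-step warmup below never completes in a single epoch; consider
# lowering num_warmup_steps or training for more epochs.
num_train_steps = len(tf_train_dataset)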
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# # Train in mixed-precision float16
# tf.keras.mixed_precision.set_global_policy("mixed_float16")


# callback = PushToHubCallback(
#     output_dir=hf_repo, tokenizer=tokenizer
# )
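To get a feel for the learning-rate schedule, here is a rough pure-Python sketch. It assumes linear warmup followed by linear decay to zero, which is our understanding of what `create_optimizer` does by default; `lr_at` is a hypothetical helper, not part of the library.

```python
def lr_at(step, init_lr=2e-5, warmup=1_000, total=312):
    """Approximate learning rate at a given step (illustrative only)."""
    if step < warmup:
        # Linear warmup from 0 up to init_lr
        return init_lr * step / warmup
    # Linear decay from init_lr down to 0 over the remaining steps
    remaining = max(total - warmup, 1)
    progress = min((step - warmup) / remaining, 1.0)
    return init_lr * (1.0 - progress)


# With the settings used in this notebook (312 steps, 1,000 warmup steps),
# the last step of the epoch is still early in the warmup phase
print(f"{lr_at(311):.2e}")

# With a shorter, hypothetical warmup of 100 steps the peak is reached
print(f"{lr_at(100, warmup=100):.2e}")
```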

In [92]:
# Check perplexity of model before retraining
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")


Perplexity: 23.49
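Perplexity is the exponential of the mean cross-entropy loss, so it can be read as the effective number of equally likely candidates the model is choosing between. The sketch below assumes the loss is measured in nats and uses `30522` as the vocabulary size of `distilbert-base-uncased`.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# A model that guesses uniformly over the vocabulary has perplexity
# equal to the vocabulary size
vocab_size = 30522
uniform_loss = math.log(vocab_size)
print(f"uniform guessing: {perplexity(uniform_loss):.0f}")

# Conversely, the perplexity reported above corresponds to this loss
loss_before = math.log(23.49)
print(f"loss before fine-tuning: {loss_before:.2f}")
```

A lower perplexity after fine-tuning would indicate that the model has adapted to the movie-review domain.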


In [None]:
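# `callback` is only defined if the PushToHubCallback cell above is uncommented;
# in that case, pass callbacks=[callback] here as well.
model.fit(tf_train_dataset, validation_data=tf_eval_dataset)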

  7/312 [..............................] - ETA: 44:30 - loss: 3.2779

In [None]:
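# Test out the model. This loads the course's published checkpoint;
# use hf_repo instead once your own fine-tuned model has been pushed.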
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", 
    model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)