<a href="https://colab.research.google.com/github/samirp92/temp-1/blob/main/LLM_code_example_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a masked language model (PyTorch)

In [None]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

The following logs us in to the Hugging Face Hub. Run the cell and enter your credentials, i.e. your User Access Token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

The following downloads DistilBERT using the AutoModelForMaskedLM class:

In [None]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We can see how many parameters this model has by calling the num_parameters() method

In [None]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [None]:
text = "This is a great [MASK]."

Next, we download DistilBERT's tokenizer, which converts the text into the inputs the model needs to predict the masked token.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
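
To make the "pick the highest logits" step concrete, here is a minimal pure-Python sketch of top-k selection. The six-word vocabulary and the logit values are made up for illustration; they are not the model's real scores.

```python
import math

# Hypothetical toy vocabulary and logits (made-up numbers, not model output)
vocab = ["deal", "success", "adventure", "idea", "feat", "banana"]
logits = [4.1, 3.9, 3.7, 3.6, 3.2, -1.0]


def softmax(values):
    """Turn raw scores into probabilities, subtracting the max for stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return the indices of the k largest values, highest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


probs = softmax(logits)
top_5 = top_k(logits, 5)

for i in top_5:
    print(f"{vocab[i]:<10} logit={logits[i]:.1f} prob={probs[i]:.3f}")
```

Softmax preserves the ordering of the scores, so ranking by logits (as the cell above does) and ranking by probabilities select the same candidates.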


We can get the data from the Hugging Face Hub with the load_dataset() function from Datasets. Here we load the sentiment configuration of the tweet_eval dataset.

In [None]:
from datasets import load_dataset

tweet_dataset = load_dataset("tweet_eval", "sentiment")
tweet_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

We chain the Dataset.shuffle() and Dataset.select() functions to create a random sample:

In [None]:
sample = tweet_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: Few more hours to iPhone 6s launch and im still using the 4th generation ^_^'
'>>> Label: 2'

'>>> Review: Last night we were named NZ's 27th fastest growing co. in the Deloitte Fast 50. Our 2nd year making the list and we are totally thrilled!'
'>>> Label: 2'

'>>> Review: All the hoes will be out this Saturday at the Chris brown concert.'
'>>> Label: 0'


We tokenize our corpus without setting the truncation=True option in our tokenizer. We also store the word IDs if they are available, which is the case when using a fast tokenizer. We wrap this in a function and remove the text and label columns.

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = tweet_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 2000
    })
})

In [None]:
tokenizer.model_max_length

512

In [None]:
chunk_size = 128

To show how the concatenation works, we take a few examples from our tokenized training set and print out the number of tokens in each:

In [None]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 38'
'>>> Review 1 length: 24'
'>>> Review 2 length: 29'


We then concatenate all these examples with a simple dictionary comprehension.

In [None]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 91'


We split the concatenated examples into chunks of the size given by chunk_size. Here the total length (91) is smaller than chunk_size (128), so we get a single chunk.

In [None]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 91'


We wrap all of the above logic in a single function that we can apply to our tokenized datasets

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
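
As an illustration, the sketch below runs the same steps on a tiny hand-made batch. The function name `group_texts_sketch`, the token ids, and the explicit `chunk_size` argument are assumptions made for this example; the logic mirrors group_texts above.

```python
def group_texts_sketch(examples, chunk_size):
    # Concatenate every column into one long list
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing tokens that do not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start out as a copy of the inputs; masking happens later
    result["labels"] = result["input_ids"].copy()
    return result


# Hypothetical batch of three short "tweets" (12 tokens in total)
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]],
}

grouped = group_texts_sketch(toy_batch, chunk_size=5)
for chunk in grouped["input_ids"]:
    print(chunk)
```

With 12 tokens and a chunk size of 5, two full chunks are kept and the last two tokens are dropped. Chunks can also span example boundaries, which is why decoded chunks contain [SEP] [CLS] pairs in the middle.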

We apply group_texts() to our tokenized datasets with Dataset.map():

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10600
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2433
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 466
    })
})

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'will invest 150 million in january, another 200 in the summer and plans to bring messi by 2017 " [SEP] [CLS] @ user lit my mum\'kerry the louboutins i wonder how many willam owns!!! look kerry warner wednesday!\'[SEP] [CLS] " \\ " " " " soul train \\ " " " " oct 27 halloween special ft t. dot finest rocking the mic... crazy cactus night club.. adv ticket $ 10 wt out costume $ 15... " [SEP] [CLS] so disappointed in wwe summerslam! i want to see john cena wins his 16th title [SEP] [CLS] " this is the'

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] " qt @ user in the original draft [MASK] [MASK] 7th book, remus lupin survived the battle of [MASK]warts. # happybirthday [MASK]us [MASK]pin [MASK] [SEP] [CLS] " ben smith / smith ( concussion ) [MASK] out of [MASK] lineup thursday, curtis # nhl # s [MASK] " [SEP] [CLS] sorry bout [MASK] [MASK] last night [MASK] crashed out but will be on tonight [MASK] sure. then back to minecraft in pc tomorrow night. [SEP] [CLS] chase headley's rbi double in the [MASK] inning off david price snapped a yankees [MASK] of 33 rewarded scoreless innings against blue jays [SEP] [CLS] @ user alciato [MASK] bee'

'>>> will [MASK] 150 million in january, another [MASK] in the summer and plans to bring messi by [MASK] " [SEP] [CLS] [MASK] [MASK] lit my mum'[MASK] the louboutins i wonder impending [MASK] will [MASK] owns!!! look [MASK] warner [MASK]!'[SEP] [CLS] " \ " " " " soul train \ " " " " oct 27 halloween special ft [MASK] [MASK] dot finest [MASK] the mic... crazy cactus night club.. [MASK]v 
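
Some words in the masked output do not appear in the original chunk (for example "impending" where the unmasked text reads "how"). This is consistent with BERT-style masking, in which a selected token is usually replaced by [MASK] but is sometimes swapped for a random token or left unchanged. The sketch below illustrates that rule in plain Python; it is not the library's implementation. The 80/10/10 split, the `MASK_ID` and `VOCAB_SIZE` constants, and the function name are assumptions for illustration.

```python
import random

MASK_ID = 103       # assumed id of [MASK] in the uncased BERT vocabulary
VOCAB_SIZE = 30522  # assumed vocabulary size
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens_sketch(input_ids, special_tokens_mask, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) following an 80/10/10 masking rule."""
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(input_ids):
        # Never select special tokens such as [CLS] and [SEP]
        if special_tokens_mask[i] or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID  # most selected tokens become [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # some become a random token
        # otherwise the token is left unchanged but is still predicted
    return inputs, labels


# Hypothetical token ids: 101 = [CLS], 102 = [SEP]
ids = [101] + list(range(2000, 2040)) + [102]
special = [1] + [0] * 40 + [1]
masked, labels = mask_tokens_sketch(ids, special, rng=random.Random(0))
print(sum(label != IGNORE_INDEX for label in labels), "tokens selected for prediction")
```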

In [None]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] " qt @ user [MASK] the original draft of the 7th book [MASK] [MASK] [MASK] lupin survived the [MASK] of [MASK] [MASK] [MASK] [MASK] # happybirthdayremuslupin " [SEP] [CLS] " ben smith / smith ( concussion ) remains out of the lineup thursday, [MASK] # [MASK] [MASK] [MASK] [MASK] " [SEP] [CLS] [MASK] bout the stream last night i crashed out but [MASK] [MASK] on tonight for sure. then back [MASK] minecraft [MASK] [MASK] tomorrow night. [SEP] [CLS] [MASK] headley's rbi double in the [MASK] inning off david price snapped a yankees streak of [MASK] consecutive [MASK] [MASK] innings against blue jays [SEP] [CLS] @ user alciato : bee'

'>>> will invest 150 [MASK] in january, another [MASK] in the summer and plans to bring messi by 2017 " [SEP] [CLS] @ user lit my mum'kerry [MASK] louboutins [MASK] wonder how many [MASK] [MASK] [MASK]!! [MASK] look kerry warner wednesday [MASK]'[SEP] [CLS] " \ " " " " [MASK] train \ " [MASK] " " [MASK] 27 halloween [MASK] ft t. [MASK] finest rockin
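
The key step in the whole word masking collator is the map from words to token positions. The sketch below isolates that loop and runs it on hand-written word IDs. The helper name `build_word_mapping` and the toy `word_ids` list are hypothetical; the loop body is the same as in the collator above.

```python
import collections


def build_word_mapping(word_ids):
    """Map a running word index to the token positions that make up that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:  # None marks special tokens such as [CLS] and [SEP]
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical chunk: [CLS] lou ##bout ##ins rock [SEP] [CLS] hi [SEP]
toy_word_ids = [None, 0, 0, 0, 1, None, None, 0, None]
print(build_word_mapping(toy_word_ids))
```

Word IDs restart at 0 for each original example inside a concatenated chunk, so the running index `current_word_index` is what keeps words from different examples apart. Masking an entry of the mapping then masks all of its sub-word tokens together.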

In [None]:
train_size = 9500
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 950
    })
})

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-tweet",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Cloning https://huggingface.co/shreyasdatar/distilbert-base-uncased-finetuned-tweet into local empty directory.


In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 104.64


In [None]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,3.6538,3.304507
2,3.3379,3.194887
3,3.2875,3.116638


TrainOutput(global_step=447, training_loss=3.4249038994979006, metrics={'train_runtime': 146.7593, 'train_samples_per_second': 194.196, 'train_steps_per_second': 3.046, 'total_flos': 944498237184000.0, 'train_loss': 3.4249038994979006, 'epoch': 3.0})

After training, we compute the resulting perplexity on the test set:

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 23.62
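
Perplexity here is the exponential of the mean cross-entropy loss. The sketch below applies that formula to the validation losses from the training table above. The helper name `perplexity` is an assumption for illustration.

```python
import math


def perplexity(mean_loss):
    """Exponentiate the mean cross-entropy loss, guarding against overflow."""
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")


# Validation losses copied from the training table above
validation_losses = [3.304507, 3.194887, 3.116638]

for epoch, loss in enumerate(validation_losses, start=1):
    print(f"epoch {epoch}: loss={loss:.4f} perplexity={perplexity(loss):.2f}")
```

The value for the last epoch need not match the perplexity printed above exactly. A likely reason is that the data collator draws new random masks on each evaluation pass, so the loss varies slightly from run to run.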


Once training is finished, we can push the model card with the training information to the Hub

In [None]:
trainer.push_to_hub()

To https://huggingface.co/shreyasdatar/distilbert-base-uncased-finetuned-tweet
   2f0c263..e94497a  main -> main

To https://huggingface.co/shreyasdatar/distilbert-base-uncased-finetuned-tweet
   e94497a..66cb08b  main -> main



'https://huggingface.co/shreyasdatar/distilbert-base-uncased-finetuned-tweet/commit/e94497aaca294682079295a8bc5771b3e53a3c89'

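Because the data collator applies random masking on every pass, evaluation scores fluctuate between runs. To remove this source of randomness, we implement a function that applies the masking to a batch once, ahead of time: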

In [None]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
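
The first line of insert_random_mask converts a batch from a dictionary of columns into a list of per-example dictionaries, which is the format the data collator expects. The sketch below shows that conversion on its own, using a made-up two-example batch.

```python
# Hypothetical batch in the column-oriented layout that Dataset.map(batched=True) passes in
batch = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# zip(*batch.values()) walks the columns in parallel, one example at a time;
# zip(batch, t) pairs each column name with that example's value
features = [dict(zip(batch, t)) for t in zip(*batch.values())]

for feature in features:
    print(feature)

# Prefixing the keys, as insert_random_mask does, keeps the new columns
# from clashing with the original ones
renamed = {"masked_" + k: v for k, v in batch.items()}
print(list(renamed))
```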

We apply this function to our test set and drop the unmasked columns so we can replace them with the masked ones:

In [None]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

We set up the dataloaders. For the evaluation set, which is already masked, we use default_data_collator from Hugging Face Transformers:

In [None]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

We use the standard AdamW optimizer

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

We prepare everything for training with the Accelerator object:

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

We specify the learning rate scheduler as follows:

In [None]:
from transformers import get_scheduler

num_train_epochs = 5
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
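
To make the step count and the shape of the schedule concrete, here is a small pure-Python sketch. It assumes a single device and that the last, partial batch is kept. `linear_lr` is a hypothetical helper that mimics a linear schedule with optional warmup; it is not the library function.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# Values taken from the cells above
train_examples, batch_size, num_train_epochs = 9500, 64, 5

steps_per_epoch = math.ceil(train_examples / batch_size)
num_training_steps = num_train_epochs * steps_per_epoch
print(f"{steps_per_epoch} steps per epoch, {num_training_steps} steps in total")

for step in (0, num_training_steps // 2, num_training_steps):
    print(f"step {step}: lr={linear_lr(step, 5e-5, num_training_steps):.2e}")
```

With no warmup, the learning rate starts at the optimizer's value of 5e-5 and falls linearly to zero at the final step.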

We create a model repository on the Hugging Face Hub. We can use the Hub library to first generate the full name of our repo

In [None]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-tweet"
repo_name = get_full_repo_name(model_name)
repo_name

'shreyasdatar/distilbert-base-uncased-finetuned-tweet'

We create and clone the repository using the Repository class from the Hub

In [None]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

/content/distilbert-base-uncased-finetuned-tweet is already a clone of https://huggingface.co/shreyasdatar/distilbert-base-uncased-finetuned-tweet. Make sure you pull the latest changes with `repo.git_pull()`.


The following is the full training and evaluation loop:

In [None]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/745 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 18.20023780048101
>>> Epoch 1: Perplexity: 17.167966501690874
>>> Epoch 2: Perplexity: 16.580796395161
>>> Epoch 3: Perplexity: 16.195119483857173
>>> Epoch 4: Perplexity: 16.075096460388853
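
The evaluation loop repeats each batch loss batch_size times and then truncates the result to the size of the evaluation set. The sketch below shows, in plain Python and for a single process, why this amounts to a mean weighted by the number of examples in each batch. The per-batch loss values are made up for illustration.

```python
eval_examples, batch_size = 950, 64

# 14 full batches plus one final batch with the remaining 54 examples
full_batches, remainder = divmod(eval_examples, batch_size)
batch_sizes = [batch_size] * full_batches + ([remainder] if remainder else [])

# Hypothetical mean loss for each batch (made-up numbers)
batch_losses = [2.5 + 0.01 * i for i in range(len(batch_sizes))]

# Mimic loss.repeat(batch_size) followed by concatenation and truncation
repeated = []
for loss in batch_losses:
    repeated.extend([loss] * batch_size)
truncated = repeated[:eval_examples]

mean_from_trick = sum(truncated) / len(truncated)
weighted_mean = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / eval_examples

print(len(repeated), len(truncated))
print(mean_from_trick, weighted_mean)
```

Truncating removes the extra copies of the last batch's loss, so the smaller final batch counts in proportion to its actual size.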


We can interact with our fine-tuned model locally using the pipeline function from Transformers. We download our model with the fill-mask pipeline:

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="shreyasdatar/distilbert-base-uncased-finetuned-tweet"
)

We feed the pipeline our sample text of “This is a great [MASK]” and see what the top 5 predictions are

In [None]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great day.
>>> this is a great idea.
>>> this is a great time.
>>> this is a great one.
>>> this is a great deal.
