Fine-tune a masked language model on IMDB data, following the Hugging Face NLP course: https://huggingface.co/learn/nlp-course/en/chapter7/3?fw=tf  
After fine-tuning, the model's predictions should reflect the vocabulary and style of the given corpus.
 - Domain adaptation (a form of transfer learning): fine-tuning a pretrained language model on in-domain data  
 - Masked language modeling: predict the missing (masked) word

In [1]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_ckpt = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_ckpt)
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(distilbert_num_parameters)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

66.98553


# How a masked language model works

In [2]:
# Make prediction
import torch

text = "This is a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

token_logits = outputs.logits  # shape: [batch, seq_len, vocab_size]
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]  # input_ids is 2-D, so [1] picks the sequence positions
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, k=5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode(token))}'")


'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
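
The lookup above boils down to picking the highest-scoring vocabulary entries at the masked position. Here is a minimal pure-Python sketch of that step. The five-word vocabulary and the logits are made up for illustration and are not taken from DistilBERT.

```python
import math

# Hypothetical toy vocabulary and logits for the single [MASK] position.
vocab = ["deal", "success", "idea", "banana", "the"]
mask_logits = [4.0, 3.5, 3.0, -1.0, 0.5]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return the indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


probs = softmax(mask_logits)
top_3 = top_k(mask_logits, k=3)
for idx in top_3:
    print(f">>> This is a great {vocab[idx]}. (p={probs[idx]:.2f})")
```

`torch.topk` plays the role of `top_k` in the real cell; the softmax is only needed if you want probabilities rather than a ranking.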


# Load IMDB data

In [3]:
from datasets import load_dataset
imdb_dataset = load_dataset("imdb")
print(imdb_dataset)
sample = imdb_dataset['train'].shuffle(seed=42).select(range(3))
for i in sample:
    print(i)

DatasetDict({
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are re

# Concatenate all the examples, then split the whole corpus into chunks of equal size
For a corpus with texts of varying length, padding or truncating each text individually wastes space (padding) and loses information (truncation). Concatenating first and then splitting avoids both.
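
To make the trade-off concrete, here is a small sketch that compares the two strategies on made-up token ids. The function names and the toy data are hypothetical; the actual notebook code follows in the next cells.

```python
# Hypothetical token-id lists of very different lengths.
docs = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11, 12]]
chunk_size = 4


def pad_or_truncate(doc, size, pad_id=0):
    """Per-document approach: cut long documents, pad short ones."""
    return doc[:size] + [pad_id] * max(0, size - len(doc))


def concat_and_chunk(docs, size):
    """Join everything, then cut into equal chunks, dropping the remainder."""
    flat = [tok for doc in docs for tok in doc]
    usable = len(flat) // size * size
    return [flat[i:i + size] for i in range(0, usable, size)]


padded = [pad_or_truncate(d, chunk_size) for d in docs]
chunks = concat_and_chunk(docs, chunk_size)
print(padded)  # contains pad ids and no longer holds tokens 8, 9, 10
print(chunks)  # every position holds a real token
```

The price of chunking is that a chunk can span two documents, and at most `chunk_size - 1` trailing tokens are dropped.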

In [4]:
# First tokenize the corpus as usual
def tokenize_function(examples):
    '''Tokenize the text and keep each token's word ID, needed later for whole word masking.'''
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = imdb_dataset.map(
    tokenize_function, 
    batched=True,
    remove_columns=["text", "label"]
)

print(tokenized_datasets)


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (519 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})


In [5]:
# Then concatenate and split
chunk_size = 128
print(f"The maximum context size is: {tokenizer.model_max_length}; Choose a smaller chunk size for GPU limitation: {chunk_size}")

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}  # flatten each column's list of lists into one long list

    # Drop the last chunk if it's smaller than chunk_size
    total_length = len(concatenated_examples[list(concatenated_examples.keys())[0]])
    total_length = total_length // chunk_size * chunk_size

    # Split by chunks
    result = {
        k: [v[i:i + chunk_size] for i in range(0, total_length, chunk_size)] for k, v in concatenated_examples.items()
    }

    # Create a new labels column as a copy of input_ids. This is the ground truth when predicting masked tokens
    result["labels"] = result["input_ids"].copy()
    
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

The maximum context size is: 512; Choose a smaller chunk size for GPU limitation: 128


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [6]:
# The [SEP] and [CLS] tokens mark the boundary between two concatenated reviews
print(tokenizer.decode(lm_datasets["train"][2]["input_ids"])[400:])

 ) of swedish cinema. but really, this film doesn't have much of a plot. [SEP] [CLS] " i am curious : yellow " is a risible and pretentious steaming pile. it doesn


# Mask tokens as the training target

In [7]:
# Method 1: Token masking

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)  # Mask 15% of the tokens randomly

# Do a test
samples = [lm_datasets['train'][i] for i in range(2)]  # The collator expects a list of dicts; slicing lm_datasets would return a dict of lists
for sample in samples:
    _ = sample.pop("word_ids")
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n>>> {tokenizer.decode(chunk)}")
    print(f"\n>>> {tokenizer.convert_ids_to_tokens(chunk)}")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



>>> [CLS] i rented i hooker curious - yellow from my video [MASK] because [MASK] all the controversy that surrounded it when it was first [MASK] in [MASK]. i also heard that at [MASK] it was seized by u. s. customs if it ever tried to enter this country, [MASK] being a fan of films considered " controversial [MASK] i really had to see this for myself. [MASK] br / > < br / > the plot is centered around a young swedish drama [MASK] [MASK] lena who wants to learn everything she can about life. in particular she wants to focus [MASK] attentions to making some sort of documentary on what the average [MASK]ede thought about certain political issues such

>>> ['[CLS]', 'i', 'rented', 'i', 'hooker', 'curious', '-', 'yellow', 'from', 'my', 'video', '[MASK]', 'because', '[MASK]', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was', 'first', '[MASK]', 'in', '[MASK]', '.', 'i', 'also', 'heard', 'that', 'at', '[MASK]', 'it', 'was', 'seized', 'by', 'u', '.', 's', '.', 'cust
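
Note that the output above contains not only `[MASK]` tokens but also an unexpected word ("hooker"). To my understanding, this is consistent with the collator's default BERT-style rule: of the selected tokens, about 80% become `[MASK]`, about 10% are replaced by a random token, and about 10% are left unchanged, while unselected positions get the label -100 so the loss ignores them. The sketch below imitates that rule in plain Python. It is an illustration under those assumptions, not the library's implementation, and the ids and vocabulary size are illustrative constants.

```python
import random

MASK_ID = 103        # illustrative id for [MASK]
VOCAB_SIZE = 30522   # illustrative vocabulary size
IGNORE = -100        # label value ignored by the loss


def mask_tokens(input_ids, special_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) following the assumed 80/10/10 rule."""
    inputs = list(input_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(input_ids):
        # Special tokens such as [CLS]/[SEP] are never selected.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                    # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID            # replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # replace with a random token
        # otherwise keep the original token unchanged
    return inputs, labels


random.seed(0)
original = [101, 1045, 12524, 1045, 2572, 8025, 102]  # made-up ids
masked, labels = mask_tokens(original, special_ids={101, 102}, mlm_probability=0.5)
print(masked)
print(labels)
```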

In [8]:
# Method 2: Whole word masking
import collections
import numpy as np
from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):  # idx corresponds to the token index
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)  # mapping the index of word to that of tokens

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))  # Binomial distribution
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:  # loop over the word indices where mask is 1
            word_id = word_id.item()
            for idx in mapping[word_id]:  # get the corresponding token indices from mapping
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
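
To see what the word-to-token mapping looks like, here is the same loop applied to a short example. The tokens and `word_ids` are made up for illustration (a real fast tokenizer would supply them), and `build_word_to_tokens` is a hypothetical helper name.

```python
import collections

# Hypothetical subword split: two words are broken into two tokens each.
tokens = ["[CLS]", "brom", "##well", "high", "is", "hil", "##arious", "[SEP]"]
word_ids = [None, 0, 0, 1, 2, 3, 3, None]  # None marks special tokens


def build_word_to_tokens(word_ids):
    """Map a running word index to the list of token positions it covers."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)
    return dict(mapping)


mapping = build_word_to_tokens(word_ids)
print(mapping)

# Masking word 0 hides both of its subword tokens at once.
masked = ["[MASK]" if i in mapping[0] else t for i, t in enumerate(tokens)]
print(masked)
```

Token-level masking could hide only `##well` and leave `brom` visible, which makes the prediction easier; whole word masking removes that shortcut.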

# Downsampling data for training

In [9]:
# Downsampling and train test split
is_whole_word_masking = False  # The last section (Train with Accelerate) is for token masking only
train_size = 50_000
test_size = int(0.1 * train_size)

if is_whole_word_masking:
    run_datasets = lm_datasets["train"]
    run_data_collator = whole_word_masking_data_collator
else:
    run_datasets = lm_datasets["train"].remove_columns("word_ids")
    run_data_collator = data_collator

downsampled_dataset = run_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 50000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
})

# Log in to the Hugging Face Hub

In [10]:
from huggingface_hub import notebook_login
notebook_login()



# Fine tuning - Training

## Train with trainer

In [11]:
model_name = model_ckpt.split("/")[-1]
output_dir = f"{model_name}-finetuned-imdb"
hub_model_id = f"hf-nlp-course-{model_name}-finetuned-imdb"

In [13]:
# Define the Trainer
from transformers import TrainingArguments, Trainer

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size


training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,  # default is 3
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,  # enable mixed-precision training, which gives a boost in speed
    logging_steps=logging_steps,
    remove_unused_columns=False,  # By default, the Trainer will remove any columns that are not part of the model’s forward() method. 
    # So if you’re using the whole word masking collator, you’ll also need to set remove_unused_columns=False to ensure we don’t lose the word_ids column during training.
    push_to_hub=True,
    hub_model_id=hub_model_id,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=run_data_collator,
    tokenizer=tokenizer,
)

In [27]:
# Evaluation using perplexity
import math
eval_results = trainer.evaluate()
ppl = math.exp(eval_results['eval_loss'])
print(f"Perplexity of the original model: {ppl:.2f}")

trainer.train()
eval_results = trainer.evaluate()
ppl = math.exp(eval_results['eval_loss'])
print(f"Perplexity after training: {ppl:.2f}")

  0%|          | 0/79 [00:00<?, ?it/s]

Perplexity of the original model: 10.84


  0%|          | 0/2346 [00:00<?, ?it/s]

{'loss': 2.496, 'learning_rate': 1.3341858482523444e-05, 'epoch': 1.0}


  0%|          | 0/79 [00:00<?, ?it/s]

{'eval_loss': 2.311464309692383, 'eval_runtime': 10.1976, 'eval_samples_per_second': 490.314, 'eval_steps_per_second': 7.747, 'epoch': 1.0}
{'loss': 2.4268, 'learning_rate': 6.683716965046889e-06, 'epoch': 2.0}


  0%|          | 0/79 [00:00<?, ?it/s]

{'eval_loss': 2.2842066287994385, 'eval_runtime': 10.3646, 'eval_samples_per_second': 482.409, 'eval_steps_per_second': 7.622, 'epoch': 2.0}
{'loss': 2.3915, 'learning_rate': 2.5575447570332485e-08, 'epoch': 3.0}


  0%|          | 0/79 [00:00<?, ?it/s]

{'eval_loss': 2.2513575553894043, 'eval_runtime': 8.6924, 'eval_samples_per_second': 575.213, 'eval_steps_per_second': 9.088, 'epoch': 3.0}
{'train_runtime': 1717.3667, 'train_samples_per_second': 87.343, 'train_steps_per_second': 1.366, 'train_loss': 2.438064855891014, 'epoch': 3.0}


  0%|          | 0/79 [00:00<?, ?it/s]

Perplexity after training: 9.75
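
Perplexity is just the exponential of the mean cross-entropy loss, so it can be read as the effective number of equally likely choices the model is hesitating between. A small sketch, with the vocabulary size as an illustrative assumption:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


# A model that guesses uniformly over the vocabulary has loss log(V),
# so its perplexity equals the vocabulary size V.
vocab_size = 30522  # illustrative assumption
print(perplexity(math.log(vocab_size)))

# Eval losses copied from the training log above, one per epoch.
for eval_loss in (2.311464309692383, 2.2842066287994385, 2.2513575553894043):
    print(f"{perplexity(eval_loss):.2f}")
```

The last logged eval loss corresponds to a perplexity of roughly 9.5, while the separate `trainer.evaluate()` call reports 9.75. A plausible reason is that the collator draws a fresh random mask on every evaluation pass.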


In [28]:
trainer.push_to_hub()

pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

'https://huggingface.co/yuwei2342/hf-nlp-course-distilbert-base-uncased-finetuned-imdb/tree/main/'

## Train with Accelerate for customization

### Create an eval dataset with random masking
In the run above, DataCollatorForLanguageModeling applies a new random mask to eval_dataset on every evaluation pass, which adds randomness to the evaluation.  
The following cell applies the masking once to the whole test set, then uses the default data collator in HF Transformers to collect the batches during evaluation.

In [14]:
def insert_random_mask(batch):
    # dict of lists -> list: data_collator can only take list
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # use collator to insert random mask
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
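
The one-liner at the top of `insert_random_mask` converts the batch from a dict of lists (how `map(batched=True)` passes data) into a list of dicts (what the collator expects). A toy example with made-up values:

```python
# Hypothetical batch in the "dict of lists" layout used by batched map().
batch = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "labels": [[101, 7, 102], [101, 8, 102]],
}

# zip(*batch.values()) walks the columns in parallel, yielding one row at a time;
# zip(batch, row) pairs each column name with that row's value.
features = [dict(zip(batch, row)) for row in zip(*batch.values())]

for feature in features:
    print(feature)
```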

### Prepare for training: create data loaders, configure the model, define eval metrics, and set up the HF Hub

In [15]:
# Create data loader. Use default data collator for eval data since it's already masked

from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, 
    batch_size=batch_size, 
    collate_fn=default_data_collator,
)

In [16]:
# Configure model, optimizer and scheduler
from torch.optim import AdamW 

model = AutoModelForMaskedLM.from_pretrained(model_ckpt)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch


# Prepare for training with Accelerator
from accelerate import Accelerator

# Accelerator handles device placement and distributed training, so it replaces the following manual steps:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu"); model.to(device); inputs = inputs.to(device)
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader,
)


# Set lr schedule
from transformers import get_scheduler

lr_scheduler = get_scheduler(
    "linear",  
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
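
With `"linear"` and no warmup, the learning rate decays in a straight line from its initial value to zero over `num_training_steps`. The sketch below writes that rule out by hand; `linear_lr` is a hypothetical helper, not a library function. As a sanity check it uses the settings of the earlier Trainer run (initial rate 2e-5, 2346 steps, logging every 781 steps), assuming the Trainer used the same linear schedule.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` scheduler steps under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


base_lr = 2e-5
total_steps = 2346                   # 3 epochs in the Trainer run above
for step in (781, 1562, 2343):       # the steps at which the Trainer logged
    print(step, linear_lr(step, base_lr, total_steps))
```

The printed values should line up with the `learning_rate` entries in the Trainer log (about 1.33e-05, 6.68e-06 and 2.56e-08).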

In [17]:
# Define evaluation metrics
def compute_loss_ppl(model, eval_dataloader):
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        # accelerator.gather collects the loss from every process and concatenates the results.
        # The loss is a scalar (0-dim) tensor, which cannot be gathered and concatenated directly, so .repeat(batch_size) expands it into a 1-D tensor with one copy per sample in the batch.
        losses.append(accelerator.gather(loss.repeat(batch_size)))  

    losses = torch.cat(losses)
    # Trim the gathered losses to the size of the evaluation dataset, since repeating and gathering can produce more entries than there are samples.
    losses = losses[: len(eval_dataloader.dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    return losses, perplexity
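
The repeat-then-trim trick is easier to follow with plain lists. In this sketch (single process, made-up batch losses), each batch's mean loss is repeated `batch_size` times, the copies are concatenated, and the result is trimmed to the dataset size. The trimming cuts the surplus copies of the last, smaller batch, so the final mean weights each batch by its true number of samples.

```python
dataset_size = 10
batch_size = 4
# Hypothetical mean losses for batches of 4, 4 and 2 samples.
batch_losses = [1.0, 2.0, 4.0]

# Mimic loss.repeat(batch_size) followed by concatenation.
repeated = []
for loss in batch_losses:
    repeated.extend([loss] * batch_size)

# Mimic losses[: len(eval_dataloader.dataset)].
trimmed = repeated[:dataset_size]
mean_loss = sum(trimmed) / len(trimmed)

# The same number computed directly as a sample-weighted mean.
batch_sizes = [4, 4, 2]
weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / dataset_size

print(len(repeated), len(trimmed), mean_loss, weighted)
```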

In [18]:
# Setup hugging face hub

# import huggingface_hub as hfh
# hfh.delete_repo(repo_id='yuwei2342/distilbert-base-uncased-finetuned-imdb')

# Clone the Hub repo to a local folder so that we can push to it
from huggingface_hub import get_full_repo_name, Repository
repo_name = get_full_repo_name(hub_model_id)
output_dir = f"manual-{output_dir}"
repo = Repository(output_dir, clone_from=repo_name)

d:\Code\nlp\transformers\manual-distilbert-base-uncased-finetuned-imdb is already a clone of https://huggingface.co/yuwei2342/hf-nlp-course-distilbert-base-uncased-finetuned-imdb. Make sure you pull the latest changes with `repo.git_pull()`.


### Training

In [22]:
from tqdm.auto import tqdm
import torch
import math

_, perplexity = compute_loss_ppl(model, eval_dataloader)
print(f"Before training: Perplexity: {perplexity}")

progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Eval
    _, perplexity = compute_loss_ppl(model, eval_dataloader)
    print(f"Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload, equivalent to trainer.push_to_hub()
    accelerator.wait_for_everyone()
    # This reverses accelerator.prepare(model)
    unwrapped_model = accelerator.unwrap_model(model)  
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

Before training: Perplexity: 23.057392216984677


  0%|          | 0/237 [00:00<?, ?it/s]

Epoch 0: Perplexity: 12.073483717062386
Epoch 1: Perplexity: 11.538012292958447
Epoch 2: Perplexity: 11.399004637899463


# Run fine-tuned model with pipeline

In [19]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", 
    model=repo_name,  # output_dir
)

preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great film.
>>> this is a great movie.
>>> this is a great idea.
>>> this is a great show.
>>> this is a great story.
