<a href="https://colab.research.google.com/github/ujjalkumarmaity/NLP/blob/main/Huggingface-NLP-Course/7_Fine_tuning_a_masked_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install datasets seqeval evaluate transformers[torch] accelerate

For most NLP tasks, we start from a pretrained model from the Hugging Face Hub. For domain-specific tasks, however, fine-tuning the language model on in-domain data often gives better results.

Let's pick a suitable pretrained model for masked language modeling.

In [1]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [2]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

In [3]:
print(f"Number of parameters: {model.num_parameters()/1_000_000} M")

Number of parameters: 66.98553 M


## Load dataset

In [3]:
from datasets import load_dataset
data = load_dataset("imdb")

In [4]:
data = data['train'].train_test_split(train_size=10_000,test_size=1000)

## Preprocessing the data
- For both auto-regressive and masked language modeling, a common preprocessing step is to **concatenate all the examples and then split the whole corpus into chunks of equal size**.
- This is quite different from our usual approach, where we simply tokenize individual examples.
- Why concatenate everything together?
    - *The reason is that **individual examples might get truncated** if they’re too long, and that would result in **losing information that might be useful for the language modeling task**!*



#### Let's first analyze the data with a few samples

In [5]:
sample_data = data['train'].shuffle().select(range(10))
sample_data

Dataset({
    features: ['text', 'label'],
    num_rows: 10
})

In [6]:
def tokenize_fun(x):
    tokenize = tokenizer(x['text'])
    tokenize['word_ids'] = [tokenize.word_ids(t) for t in range(len(tokenize['input_ids']))]
    return tokenize
tokenize_sample_data = sample_data.map(tokenize_fun,batched=True,remove_columns=sample_data.column_names)
tokenize_sample_data

Token indices sequence length is longer than the specified maximum sequence length for this model (787 > 512). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids'],
    num_rows: 10
})

In [7]:
# Review 2 (787 tokens) exceeds the model's 512-token limit, so it would be truncated
for i in range(3):
    print(f"len of review {i} = {len(tokenize_sample_data[i]['input_ids'])}")

len of review 0 = 167
len of review 1 = 178
len of review 2 = 787
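
To make the cost of truncation concrete, here is a small sketch that compares how many tokens survive under per-example truncation and under concatenate-then-chunk. It uses the three review lengths printed above, the 512-token limit from the tokenizer warning, and the chunk size of 128 used later in this notebook. The helper names `kept_by_truncation` and `kept_by_chunking` are hypothetical and exist only for this illustration.

```python
# Token counts of the three sample reviews printed above.
review_lengths = [167, 178, 787]
max_len = 512      # model limit reported by the tokenizer warning
chunk_size = 128   # chunk size used later in this notebook


def kept_by_truncation(lengths, max_len):
    """Tokens that survive when each example is truncated to max_len."""
    return sum(min(n, max_len) for n in lengths)


def kept_by_chunking(lengths, chunk_size):
    """Tokens that survive when examples are concatenated and only full chunks are kept."""
    total = sum(lengths)
    return (total // chunk_size) * chunk_size


total_tokens = sum(review_lengths)
truncated = kept_by_truncation(review_lengths, max_len)
chunked = kept_by_chunking(review_lengths, chunk_size)

print(f"total tokens:          {total_tokens}")   # 1132
print(f"kept after truncation: {truncated}")      # 857 (275 lost from the long review)
print(f"kept after chunking:   {chunked}")        # 1024 (108 lost in the final partial chunk)
```

On this sample, chunking discards only the trailing partial chunk, while truncation discards everything past token 512 of the long review.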


#### Concatenate all the examples and chunk them into equal lengths

In [8]:
def tokenize_fun(x):
    tokenize = tokenizer(x['text'])
    tokenize['word_ids'] = [tokenize.word_ids(t) for t in range(len(tokenize['input_ids']))]
    return tokenize
tokenize_data = data.map(tokenize_fun,batched=True,remove_columns=data['train'].column_names)

In [9]:
# Concatenate the first four tokenized examples
tokenize_sample_data = tokenize_data['train'][:4]
concatenated_examples = {k:sum(tokenize_sample_data[k],[]) for k in tokenize_sample_data.keys()}
print(f"total length {len(concatenated_examples['input_ids'])}")

total length 1554


In [10]:
# Chunk the concatenated output
total_length = len(concatenated_examples['input_ids'])
chunk_size = 128
chunk_data = {k:[concatenated_examples[k][i:i+chunk_size] for i in range(0,total_length,chunk_size)] for k in concatenated_examples.keys()}
for val in chunk_data['input_ids']:
    print(f"chunk length {len(val)}")

chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 128
chunk length 18


In [11]:
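chunk_size = 128
def chunk_text(batch):
    # Concatenate every column (input_ids, attention_mask, word_ids) across the batch
    concatenated_batch = {k:sum(batch[k],[]) for k in batch.keys()}
    total_length = len(concatenated_batch['input_ids'])
    # Drop the final partial chunk so that every chunk has exactly chunk_size tokens
    total_length = (total_length//chunk_size)*chunk_size

    # Split each column into consecutive chunks of chunk_size tokens
    chunk_data = {k:[concatenated_batch[k][i:i+chunk_size] for i in range(0,total_length,chunk_size)] for k in concatenated_batch.keys()}
    # Labels start as a copy of the inputs; the data collator applies the masking later
    chunk_data['labels'] = chunk_data['input_ids'].copy()
    return chunk_data
label_data = tokenize_data.map(chunk_text,batched=True)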

In [12]:
label_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 24277
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2505
    })
})
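
One consequence of concatenating before chunking is that a chunk can straddle two reviews. The sketch below reproduces the same logic as `chunk_text` on plain Python lists to make that visible. The function `concat_and_chunk` and the toy "reviews" are hypothetical and are not part of the notebook's pipeline.

```python
def concat_and_chunk(examples, chunk_size):
    """Concatenate token lists, then split into full chunks (the remainder is dropped)."""
    flat = [tok for example in examples for tok in example]
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Hypothetical toy "reviews", each wrapped in [CLS] ... [SEP] as the tokenizer does.
toy_examples = [
    ["[CLS]", "great", "movie", "[SEP]"],
    ["[CLS]", "bad", "plot", "but", "fun", "[SEP]"],
    ["[CLS]", "meh", "[SEP]"],
]

chunks = concat_and_chunk(toy_examples, chunk_size=4)
for chunk in chunks:
    print(chunk)
# ['[CLS]', 'great', 'movie', '[SEP]']
# ['[CLS]', 'bad', 'plot', 'but']
# ['fun', '[SEP]', '[CLS]', 'meh']   <- spans the end of one review and the start of the next
```

The 13 toy tokens yield three chunks of four, and the last `[SEP]` is discarded as the remainder.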

# Data Collator

In [13]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer,mlm_probability=0.15) # randomly mask 15% of the tokens
sample = [label_data["train"][i] for i in range(2)]
for i in sample:
    i.pop('word_ids')
for ch in data_collator(sample)['input_ids']:
    print(tokenizer.decode(ch))
    print()


[CLS] this collection [MASK] sucks! < br / > < br / > i rented it, thinking [MASK]´ [MASK] really would [MASK] some [MASK] fighting. man this sucked! quick flashy cuts, an extremely annoying speaker, [MASK] [MASK] fights them selves were heavily edited and shortened ( i´m thinking especially of jet li´s fight in fists of legend and jackie chan´s fight from drunken [MASK] [MASK] ). < br / > < br [MASK] > and what´s the deal [MASK] those brawling streetfighters?! what´s so " [MASK] [MASK] about that? i [MASK]ve seen more [MASK] fights on martial [MASK]! < trials / > < br

/ > this [MASK] stupid collection [MASK] cuts [MASK] stupid people. < [MASK] / > [MASK] br / > do not ever buy [MASK] film! do not [MASK] the [MASK] [MASK] made this crap to make more of this crap [MASK] < br / > < br / > instead, go buy the movies the fights were from [MASK] wath the fights in their uncut glory [MASK] [SEP] [CLS] * * * spoil [MASK] * * * when undercover brooklyn north det. eddie santos, nestor serra [M
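
The decoded output shows mostly `[MASK]` tokens, but also an occasional unexpected word (such as `trials` inside a `< br / >` tag). This matches the collator's default behaviour as I understand it: of the selected tokens, roughly 80% become `[MASK]`, 10% are replaced by a random token, and 10% are left unchanged, while the labels of all unselected positions are set to `-100` so the loss ignores them. Below is a minimal pure-Python sketch of that scheme, not the library's implementation. The function `mask_tokens` is hypothetical, and the constants assume the `distilbert-base-uncased` vocabulary.

```python
import random

MASK_ID = 103         # assumed id of [MASK] in the distilbert-base-uncased vocab
VOCAB_SIZE = 30522    # assumed vocabulary size of distilbert-base-uncased
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss


def mask_tokens(input_ids, special_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) following the 80/10/10 masking scheme."""
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(input_ids):
        # Never select special tokens; select other tokens with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                      # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID              # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Toy sequence: [CLS] (101), fifty arbitrary token ids, [SEP] (102).
toy_ids = [101] + list(range(2000, 2050)) + [102]
masked, labels = mask_tokens(toy_ids, special_ids={101, 102}, rng=random.Random(0))
n_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_selected} of {len(toy_ids)} positions selected for prediction")
```

Because the masking is redrawn every time a batch is built, the model sees different masked positions for the same chunk across epochs.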

# Define Trainer

In [14]:
from transformers import TrainingArguments
logging_steps = len(label_data["train"]) // 64

args = TrainingArguments(
    output_dir = '/',
    overwrite_output_dir = True,
    evaluation_strategy='epoch',
    learning_rate = 1e-04,
    weight_decay = 0.01,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    logging_steps=logging_steps
)

In [15]:
from transformers import Trainer
trainer = Trainer(
    model,
    args = args,
    train_dataset = label_data["train"],
    eval_dataset = label_data['test'],
    data_collator = data_collator,
    tokenizer = tokenizer
)

In [16]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.5958        | 2.39004         |
| 2     | 2.4415        | 2.316498        |
| 3     | 2.3512        | 2.285029        |


TrainOutput(global_step=1140, training_loss=2.4627483351188793, metrics={'train_runtime': 1198.5613, 'train_samples_per_second': 60.765, 'train_steps_per_second': 0.951, 'total_flos': 2413640389906944.0, 'train_loss': 2.4627483351188793, 'epoch': 3.0})
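
The `global_step=1140` in the output can be checked against the dataset size. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the row count is the one reported by the first `label_data` output, and the epoch count is taken from `'epoch': 3.0`.

```python
import math

num_train_chunks = 24277   # train rows reported by the first `label_data` output
batch_size = 64            # per_device_train_batch_size
num_epochs = 3             # matches 'epoch': 3.0 in the TrainOutput

# The last, smaller batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_train_chunks / batch_size)
total_steps = steps_per_epoch * num_epochs

# logging_steps in the notebook uses floor division, so it logs about once per epoch.
logging_steps = num_train_chunks // batch_size

print(steps_per_epoch)  # 380
print(total_steps)      # 1140
print(logging_steps)    # 379
```

Setting `logging_steps` to roughly one epoch's worth of steps is why the table shows one training-loss value per epoch.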

## Perplexity for language models
Perplexity is a common metric for evaluating the **quality of a language model**. It **measures how well the model predicts held-out tokens**: for a causal model, the next word in a sequence; for a masked model such as DistilBERT, the masked tokens. **Lower perplexity indicates a better model**, because it means the model assigns higher probability to the correct tokens.

Here's a breakdown of the concept:

- **Imagine you have a text corpus.** It is split into **tokens**.
- **The language model tries to predict each target token.** Based on the context it has seen, it assigns a probability to every token in the vocabulary.
- **Perplexity is the exponential of the average negative log-probability assigned to the correct tokens**, i.e. the exponential of the cross-entropy loss. Equivalently, it is the geometric mean of the inverse probabilities, and it can be read as the effective number of equally likely choices the model is weighing at each step.
- **A lower perplexity value means the model was effectively choosing among fewer candidates, so it predicted the tokens more confidently.** A higher perplexity value means the model was unsure about the tokens, suggesting lower quality.
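
A minimal sketch of the computation, assuming the usual definition (the exponential of the mean negative log-probability, which is what `math.exp(eval_loss)` computes when the loss is a mean cross-entropy). The function `perplexity` and the toy probabilities are hypothetical; the last line reuses the epoch-3 validation loss from the training table.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-probability) of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that spreads its belief evenly over 4 candidates has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is confident and right has perplexity close to 1.
confident = perplexity([0.9, 0.8, 0.95])

# Perplexity implied by the epoch-3 validation loss in the training table.
ppl_epoch3 = math.exp(2.285029)

print(f"uniform over 4 choices: {uniform:.2f}")
print(f"confident model:        {confident:.2f}")
print(f"epoch-3 validation:     {ppl_epoch3:.2f}")
```

Since the collator redraws the masks on every evaluation pass, the value from `trainer.evaluate()` will vary slightly from run to run.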




In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")


# Fine-tuning DistilBERT with Accelerate

In [22]:
label_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 24375
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2338
    })
})