# BERT

Bidirectional Encoder Representations from Transformers (BERT) is a Large Language Model (LLM) developed by Google AI Language that significantly advanced the field of Natural Language Processing (NLP). [ref](https://medium.com/data-science/a-complete-guide-to-bert-with-code-9f87602e4a11)

Decoder-Only Models: e.g. GPT
* focused on Natural Language Generation (NLG)
* predict a new output sequence in response to an input sequence
* accept prompts as inputs and generate responses by predicting the next most probable token, one at a time in a task known as Next Token Prediction (NTP)
* excel in NLG tasks such as: conversational chatbots, machine translation, and code generation

Encoder-Only Models: e.g. BERT
* prioritized for Natural Language Understanding (NLU)
* make predictions about words within an input sequence
* these models do not accept prompts as such, but rather an input sequence for a prediction to be made upon (e.g. predicting a missing word within the sequence)
* encoder-only models are most often used for NLU tasks such as: Named Entity Recognition (NER) and sentiment analysis
  * NER: the goal is to identify entities such as people, organizations, and locations in some input text
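
The difference between the two families shows up in how training examples are built. The sketch below is a toy illustration in plain Python with no model involved; the function names `next_token_examples` and `masked_example` are made up for this note.

```python
def next_token_examples(tokens):
    """Decoder-style: predict each token from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, position, mask_token="[MASK]"):
    """Encoder-style: hide one token and predict it from both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


tokens = "a man was fishing on the river".split()

# Next Token Prediction: the context grows from left to right
for context, target in next_token_examples(tokens)[:3]:
    print(context, "->", target)

# Masked prediction: the whole sentence is visible except the hidden word
masked, target = masked_example(tokens, position=3)
print(masked, "->", target)
```

In the decoder-style pairs the target always sits to the right of its context, while in the encoder-style example the target is surrounded by context on both sides.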


## Bidirectional Context

* Like any language model, BERT predicts the probability of observing a word given the words around it; unlike a decoder, it is not limited to the prior words
* Making any meaningful predictions about text requires the surrounding context to be understood
* BERT ensures good understanding through one of its key properties: bidirectionality
* Decoder-only models are unidirectional: each token gains context only from the tokens that precede it
* Bidirectional means that each token in the input sequence can gain context from both preceding and succeeding words (left and right context)
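
One way to picture the two kinds of context is as an attention mask: a grid that says which positions each token is allowed to look at. The following is a simplified sketch in plain Python (padding is ignored, and the helper names `causal_mask` and `bidirectional_mask` are hypothetical), not the internal implementation of BERT or GPT.

```python
def causal_mask(n):
    """Position i may attend to positions 0..i (left context only)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


tokens = ["a", "man", "was", "[MASK]", "on", "the", "river"]
i = tokens.index("[MASK]")

# Tokens visible from the [MASK] position under each mask
visible_causal = [t for t, m in zip(tokens, causal_mask(len(tokens))[i]) if m]
visible_bidir = [t for t, m in zip(tokens, bidirectional_mask(len(tokens))[i]) if m]

print(visible_causal)
print(visible_bidir)
```

Under the causal mask the `[MASK]` position never sees "on the river", which is exactly the context that makes "fishing" a likely answer.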


## Pre-training

* GPT popularized pre-training
* it involves training a single large model to acquire a broad understanding of language (word usage and grammar)
* this produces a task-agnostic foundation model
* once trained, copies of this model can be fine-tuned to address specific tasks


## Fine-tuning

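* faster and cheaper in compute than training an architecture from scratch for a specific task
* in its simplest form, fine-tuning trains only a linear layer added on top of the pre-trained model
* the linear layer is a feedforward neural network, often called the head
* the weights and biases of the rest of the model remain unchanged, or frozen; it is also common to update all weights, as the training loop later in this notebook does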
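
The idea of freezing can be sketched without any deep learning library. The toy below uses made-up parameter names and gradient values and a hypothetical `sgd_step` helper; it only illustrates that frozen parameters are skipped during the update.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Update only the parameters listed in `trainable`; leave the rest frozen."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }


# Made-up scalar "weights" standing in for the encoder body and the head
params = {"encoder.weight": 0.50, "head.weight": 0.20, "head.bias": 0.00}
grads = {"encoder.weight": 1.00, "head.weight": -0.50, "head.bias": 0.25}

updated = sgd_step(params, grads, trainable={"head.weight", "head.bias"})
print(updated)
```

In PyTorch the same effect is usually obtained by setting `requires_grad = False` on the parameters that should stay frozen, so the optimizer never changes them.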
 

# Fine-tuning BERT for Masked Predictions

* the model is trained to use both left and right context
* for example: `A man was fishing on the river` --> `A man was [MASK] on the river`
* the model has to predict the word `fishing`

The code below follows this [tutorial](https://medium.com/@a.radojevic01/fine-tuning-bert-with-masked-language-modelling-7777f441db7d).

In [2]:
from datasets import load_dataset
import torch
from tqdm.auto import tqdm
from transformers import BertTokenizer, BertForMaskedLM, set_seed
from torch.optim import AdamW
import pandas as pd
import warnings

set_seed(42)
warnings.filterwarnings("ignore")

In [3]:
model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
data = load_dataset("andjela-r/mlm-harry-potter", split="train[:10%]").to_pandas()
data.head()

| Unnamed: 0 | text |
|---|---|
| 0 | Harry Potter and the Sorcerer's Stone |
| 1 | CHAPTER ONE |
| 2 | THE BOY WHO LIVED |
| 3 | Mr. and Mrs. Dursley, of number four, Privet D... |
| 4 | Mr. Dursley was the director of a firm called ... |


In [5]:
prep_data = data['text'].tolist()
inputs = tokenizer(prep_data, max_length=512, truncation=True, padding=True, return_tensors='pt')
inputs['labels'] = inputs['input_ids'].detach().clone()

# select roughly 15% of the token positions at random
random_tensor = torch.rand(inputs['input_ids'].shape)
# never mask [CLS] (ID 101), [SEP] (ID 102), or padding (ID 0)
masked_tensor = (random_tensor < 0.15) & (inputs['input_ids'] != 101) & (inputs['input_ids'] != 102) & (inputs['input_ids'] != 0)

# replace the selected positions with [MASK], whose ID is 103
inputs['input_ids'][masked_tensor] = 103
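
The cell above keeps the original token ID as the label at every position, so the loss also covers unmasked and padding tokens. A common alternative is to score only the masked positions by setting every other label to -100, the value that PyTorch's cross-entropy loss ignores by default. The plain-Python sketch below shows the idea on a single sequence; `mask_tokens` is a hypothetical helper and the example IDs are illustrative, not real tokenizer output.

```python
import random

CLS_ID, SEP_ID, PAD_ID, MASK_ID = 101, 102, 0, 103  # bert-base-uncased IDs used above
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_tokens(input_ids, mask_prob=0.15, rng=random):
    """Return (masked_ids, labels) for one sequence of token IDs."""
    special = {CLS_ID, SEP_ID, PAD_ID}
    masked_ids, labels = [], []
    for token_id in input_ids:
        if token_id not in special and rng.random() < mask_prob:
            masked_ids.append(MASK_ID)
            labels.append(token_id)      # the model must recover this token
        else:
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)  # position does not contribute to the loss
    return masked_ids, labels


# Illustrative sequence: [CLS], seven word IDs, [SEP], two padding tokens
ids = [101, 1037, 2158, 2001, 5645, 2006, 1996, 2314, 102, 0, 0]
masked_ids, labels = mask_tokens(ids, rng=random.Random(42))
print(masked_ids)
print(labels)
```

With labels built this way, the reported loss reflects only how well the hidden tokens are recovered, which is the quantity masked language modelling is meant to optimize.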

In [6]:
# to handle our tokenized inputs
class HPDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings # Store the tokenized input data

    def __len__(self):
        return len(self.encodings['input_ids']) # Returns the number of examples in the dataset

    # Retrieves a specific item (dictionary of tokenized inputs) at a given index
    def __getitem__(self, index):
        return {key: val[index] for key, val in self.encodings.items()}

dataset = HPDataset(inputs)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

In [7]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)
model.to(device) # Move the model to the device ("cpu" or "cuda")

cuda


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [8]:
def calculate_accuracy(data, model, tokenizer):
    """Mask one random token per sentence and report how often the model recovers it."""
    model.eval() # Puts the model in evaluation mode
    correct = 0
    total = 0

    for sentence in data:
        # Replace a random token with [MASK] and store the original token
        # (truncate to BERT's 512-token limit, as for the training inputs)
        tokens = tokenizer.encode(sentence, return_tensors='pt', max_length=512, truncation=True)[0]
        # Note: the random position can also land on [CLS] or [SEP]
        masked_index = torch.randint(0, len(tokens), (1,)).item()
        original_token = tokens[masked_index].item()
        tokens[masked_index] = tokenizer.mask_token_id

        inputs = {'input_ids': tokens.unsqueeze(0).to(device)}

        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits

        predicted_token_id = logits[0, masked_index].argmax().item()

        if predicted_token_id == original_token:
            correct += 1
        total += 1

    accuracy = correct / total if total else 0.0 # Avoid dividing by zero on empty data
    print(f"Accuracy: {accuracy * 100:.2f}%")

model.eval()
calculate_accuracy(prep_data, model, tokenizer)

Accuracy: 52.13%
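
The helper above draws the masked position from the whole sequence, so it can land on `[CLS]` or `[SEP]`, which are not words from the sentence. A possible variant restricts the draw to ordinary tokens. The sketch below is plain Python; `choose_maskable_index` is a hypothetical helper and the token IDs are illustrative.

```python
import random


def choose_maskable_index(token_ids, special_ids, rng=random):
    """Pick a random position whose token is not a special token.

    Returns None when the sequence holds only special tokens.
    """
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    return rng.choice(candidates) if candidates else None


token_ids = [101, 7592, 2088, 102]  # illustrative IDs: [CLS], two word pieces, [SEP]
index = choose_maskable_index(token_ids, special_ids={101, 102, 0}, rng=random.Random(0))
print(index)
```

With a Hugging Face tokenizer, the set of special IDs can be taken from `tokenizer.all_special_ids` instead of being hard-coded. Switching to this selection would change the accuracy figures reported in this notebook, since those were computed with the original helper.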


In [9]:
epochs = 3 # The model will train for 3 full passes over the dataset.
optimizer = AdamW(model.parameters(), lr=1e-5)

model.train() # Puts the model in training mode

for epoch in range(epochs):
    loop = tqdm(dataloader) # We use this to display a progress bar
    for batch in loop:
        optimizer.zero_grad() # Reset gradients before each batch
        # Move input_ids, labels, attention_mask
        # to be on the same device as the model
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels) # Forward pass
        loss = outputs.loss
        loss.backward() # Compute gradients, backward pass
        optimizer.step() # Update model parameters

        loop.set_description("Epoch: {}".format(epoch)) # Display epoch number
        loop.set_postfix(loss=loss.item()) # Show loss in the progress bar

In [10]:
model.eval() # Puts the model in evaluation mode

# Example corpus
test_corpus = [
    "Harry [MASK] is a wizzard.",

    "He pulled out the letter and read: \
    HOGWARTS SCHOOL of [MASK] and WIZARDRY Headmaster: ALBUS DUMBLEDORE (Order of Merlin, First Class, Grand Sorc., Chf. Warlock, \
    Supreme Mugwump, International Confed. of Wizards) Dear Mr. Potter, We are pleased to inform you that you have been accepted at Hogwarts \
    School of Witchcraft and Wizardry.",

    'I know that," said [MASK] McGonagall irritably. "But that\'s no reason to lose our heads. People are being downright \
    careless, out on the streets in broad daylight, not even dressed in Muggle clothes, swapping rumors.',

    "I'm sorry... You think that He-[MASK]-Must-Not-Be-Named is still alive, then?"
]

# Loop through each example sentence
for sentence in test_corpus:
    inputs = tokenizer(sentence, return_tensors='pt', max_length=512, truncation=True, padding=True)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    masked_index = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)[0].item()

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    predicted_token_id = logits[0, masked_index].argmax().item()
    predicted_token = tokenizer.decode([predicted_token_id])

    print(f"Original sentence: {sentence}")
    print(f"Predicted token: {predicted_token}")
    print("-" * 50)

Original sentence: Harry [MASK] is a wizzard.
Predicted token: potter
--------------------------------------------------
Original sentence: He pulled out the letter and read:     HOGWARTS SCHOOL of [MASK] and WIZARDRY Headmaster: ALBUS DUMBLEDORE (Order of Merlin, First Class, Grand Sorc., Chf. Warlock,     Supreme Mugwump, International Confed. of Wizards) Dear Mr. Potter, We are pleased to inform you that you have been accepted at Hogwarts     School of Witchcraft and Wizardry.
Predicted token: witchcraft
--------------------------------------------------
Original sentence: I know that," said [MASK] McGonagall irritably. "But that's no reason to lose our heads. People are being downright     careless, out on the streets in broad daylight, not even dressed in Muggle clothes, swapping rumors.
Predicted token: professor
--------------------------------------------------
Original sentence: I'm sorry... You think that He-[MASK]-Must-Not-Be-Named is still alive, then?
Predicted token: who
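
The loop above reports only the single most likely token (`argmax`). Looking at the top few candidates with their probabilities often says more about what the model has learned. The sketch below applies a softmax and ranks the results in plain Python; the vocabulary and logits are made-up toy values, and `top_k` is a hypothetical helper.

```python
import math


def top_k(logits, k=3):
    """Return the k highest-scoring (index, probability) pairs under softmax."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


vocab = ["potter", "weasley", "granger", "malfoy"]  # toy vocabulary
logits = [4.0, 2.5, 1.0, 0.5]  # made-up scores for one [MASK] position

for index, prob in top_k(logits, k=3):
    print(f"{vocab[index]}: {prob:.3f}")
```

On real model output, something along the lines of `torch.topk(logits[0, masked_index].softmax(dim=-1), k=3)` should give the same ranking directly on the tensor.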


In [11]:
calculate_accuracy(prep_data, model, tokenizer)

Accuracy: 61.77%
