<a href="https://colab.research.google.com/github/vektor8891/llm/blob/main/projects/11_bert_pretraining/11_bert_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pretraining a BERT model

## Loading data

In [None]:
# !pip install huggingface_hub[hf_xet]

In [1]:
!wget -O BERT_dataset.zip https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/bZaoQD52DcMpE7-kxwAG8A.zip
!unzip BERT_dataset.zip

--2025-04-25 12:27:56--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/bZaoQD52DcMpE7-kxwAG8A.zip
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 88958506 (85M) [application/zip]
Saving to: ‘BERT_dataset.zip’


2025-04-25 12:28:00 (32.8 MB/s) - ‘BERT_dataset.zip’ saved [88958506/88958506]

Archive:  BERT_dataset.zip
   creating: bert_dataset/
  inflating: bert_dataset/.DS_Store  
  inflating: bert_dataset/bert_train_data.csv  
  inflating: bert_dataset/bert_test_data_sampled.csv  
  inflating: bert_dataset/bert_test_data.csv  
  inflating: bert_dataset/bert_train_data_sampled.csv  


In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer
import pandas as pd
import json


class BERTCSVDataset(Dataset):
    def __init__(self, filename):
        self.data = pd.read_csv(filename)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        try:

            bert_input = torch.tensor(json.loads(row['BERT Input']), dtype=torch.long)
            bert_label = torch.tensor(json.loads(row['BERT Label']), dtype=torch.long)
            segment_label = torch.tensor([int(x) for x in row['Segment Label'].split(',')], dtype=torch.long)
            is_next = torch.tensor(row['Is Next'], dtype=torch.long)
            original_text = row['Original Text']  # If you want to use it
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON for row {idx}: {e}")
            print("BERT Input:", row['BERT Input'])
            print("BERT Label:", row['BERT Label'])
            # Handle the error, e.g., by skipping this row or using default values
            return None  # or some default values

            # Tokenizing the original text with BERT
        encoded_input = self.tokenizer.encode_plus(
            original_text,
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors="pt"
        )

        input_ids = encoded_input['input_ids'].squeeze()
        attention_mask = encoded_input['attention_mask'].squeeze()

        return(bert_input, bert_label, segment_label, is_next, input_ids, attention_mask, original_text)

In [3]:
from torch.nn.utils.rnn import pad_sequence

# create a collate function that applies transformations on batches of data iterator
PAD_IDX = 0
def collate_batch(batch):


    bert_inputs_batch, bert_labels_batch, segment_labels_batch, is_nexts_batch,input_ids_batch,attention_mask_batch,original_text_battch = [], [], [], [],[],[],[]

    for bert_input, bert_label, segment_label, is_next,input_ids,attention_mask,original_text in batch:
        # Convert each sequence to a tensor and append to the respective list
        bert_inputs_batch.append(bert_input.clone().detach())
        bert_labels_batch.append(bert_label.clone().detach())
        segment_labels_batch.append(segment_label.clone().detach())
        is_nexts_batch.append(is_next)
        input_ids_batch.append(input_ids)
        attention_mask_batch.append(attention_mask)
        original_text_battch.append(original_text)

    # Pad the sequences in the batch
    bert_inputs_final = pad_sequence(bert_inputs_batch, padding_value=PAD_IDX, batch_first=False)
    bert_labels_final = pad_sequence(bert_labels_batch, padding_value=PAD_IDX, batch_first=False)
    segment_labels_final = pad_sequence(segment_labels_batch, padding_value=PAD_IDX, batch_first=False)
    is_nexts_batch = torch.tensor(is_nexts_batch, dtype=torch.long)

    return bert_inputs_final, bert_labels_final, segment_labels_final, is_nexts_batch

In [4]:
# create train and test dataloaders
BATCH_SIZE = 2

train_dataset_path = './bert_dataset/bert_train_data.csv'
test_dataset_path = './bert_dataset/bert_test_data.csv'

train_dataset = BERTCSVDataset(train_dataset_path)
test_dataset = BERTCSVDataset(test_dataset_path)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

## Model creation

3 types of embeddings used in BERT to represent input tokens:

1. Token Embedding: initial representation of each token
2. Positional Embedding: captures the order of tokens
3. Segment Embedding: differentiates between different segments (e.g. sentences)

Model components:

1. Initialization: subclass of `torch.nn.Module`
2. Embedding Layer: combines token embeddings and segment embeddings
3. Transformer Encoder: encodes the input embeddings
4. Next Sentence Prediction: predicts the relationship between two consecutive sentences using the output of Transformer encoder
5. Masked Language Modeling: predicts the masked tokens in the input sequence
6. Forward Pass: defines the forward pass. Returns predictions for Next Sentence Prediction and Masked Language Modeling using input tokens and segment labels


In [5]:
import torch.nn as nn
from torch import Tensor
import math

EMBEDDING_DIM = 10

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Define the PositionalEncoding class as a PyTorch module for adding positional information to token embeddings
class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        # Create a positional encoding matrix as per the Transformer paper's formula
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: torch.Tensor):
        # Apply the positional encodings to the input token embeddings

        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

class BERTEmbedding (nn.Module):

    def __init__(self, vocab_size, emb_size ,dropout=0.1,train=True):

        super().__init__()

        self.token_embedding = TokenEmbedding( vocab_size,emb_size )
        self.positional_encoding = PositionalEncoding(emb_size,dropout)
        self.segment_embedding = nn.Embedding(3, emb_size)
        self.dropout = torch.nn.Dropout(p=dropout)

    def forward(self, bert_inputs, segment_labels=False):
        my_embeddings=self.token_embedding(bert_inputs)
        if self.train:
          x = self.dropout(my_embeddings + self.positional_encoding(my_embeddings) + self.segment_embedding(segment_labels))
        else:
          x = my_embeddings + self.positional_encoding(my_embeddings)

        return x

In [6]:
# example
VOCAB_SIZE=147161
batch = 2
count = 0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load sample batches from dataloader
for batch in train_dataloader:
    bert_inputs, bert_labels, segment_labels, is_nexts = [b.to(device) for b in batch]
    count += 1
    if count == 5:
        break

print(bert_inputs.shape)
print(bert_inputs[:,0])
print(segment_labels.shape)
print(segment_labels[:,0])

torch.Size([58, 2])
tensor([    1,    18,    12,     3,   425,    11,    37,   709,    58,     2,
            0,     0,     0,    57,   564,    63,    29,   205,     9,  3270,
          138, 12173,     7,     3,    58,     2,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
torch.Size([58, 2])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


In [7]:
# Instantiate the TokenEmbedding
token_embedding = TokenEmbedding(VOCAB_SIZE, emb_size=EMBEDDING_DIM )

# Get the token embeddings for a sample input
t_embeddings = token_embedding(bert_inputs)
# Each token is transformed into a tensor of size emb_size
print(f"Dimensions of token embeddings: {t_embeddings.size()}") # Expected: (sequence_length, batch_size, EMBEDDING_DIM)
# Check the embedded vectors for first 3 tokens of the first sample in the batch
# you get embeddings[i,0,:] where i refers to the i'th token of the first sample in the batch (b=0)
for i in range(3):
    print(f"Token Embeddings for the {i}th token of the first sample: {t_embeddings[i,0,:]}")

Dimensions of token embeddings: torch.Size([58, 2, 10])
Token Embeddings for the 0th token of the first sample: tensor([-2.0008, -3.8798,  6.5487,  0.5648,  0.8633, -2.9579,  5.6600, -4.0631,
         2.4579, -0.9972], grad_fn=<SliceBackward0>)
Token Embeddings for the 1th token of the first sample: tensor([-0.1088, -1.3450, -0.6220, -0.0096, -2.9501, -2.8600, -2.1335, -7.2497,
        -3.1335,  0.6685], grad_fn=<SliceBackward0>)
Token Embeddings for the 2th token of the first sample: tensor([ 4.8187, -0.1377,  1.5525, -2.5659, -2.7598,  1.9919,  1.5602, -0.5791,
        -4.5657, -7.0449], grad_fn=<SliceBackward0>)


In [8]:
positional_encoding = PositionalEncoding(emb_size=EMBEDDING_DIM,dropout=0)

# Apply positional encoding to token embeddings
p_embedding = positional_encoding(t_embeddings)

print(f"Dimensions of positionally encoded tokens: {p_embedding.size()}") # Expected: (sequence_length, batch_size, EMBEDDING_DIM)
# Check the positional encoded vectors for first 3 tokens of the first sample in the batch
# you get encoded_tokens[i,0,:] where i refers to the i'th token of the first sample(b=0) in the batch
for i in range(3):
    print(f"Positional Embeddings for the {i}th token of the first sample: {p_embedding[i,0,:]}")

Dimensions of positionally encoded tokens: torch.Size([58, 2, 10])
Positional Embeddings for the 0th token of the first sample: tensor([-2.0008e+00, -2.8798e+00,  6.5487e+00,  1.5648e+00,  8.6328e-01,
        -1.9579e+00,  5.6600e+00, -3.0631e+00,  2.4579e+00,  2.7899e-03],
       grad_fn=<SliceBackward0>)
Positional Embeddings for the 1th token of the first sample: tensor([ 0.7327, -0.8047, -0.4642,  0.9779, -2.9250, -1.8603, -2.1295, -6.2498,
        -3.1329,  1.6685], grad_fn=<SliceBackward0>)
Positional Embeddings for the 2th token of the first sample: tensor([ 5.7280, -0.5539,  1.8642, -1.6158, -2.7096,  2.9907,  1.5682,  0.4208,
        -4.5644, -6.0449], grad_fn=<SliceBackward0>)


In [9]:
segment_embedding = nn.Embedding(3, EMBEDDING_DIM)
s_embedding = segment_embedding(segment_labels)
print(f"Dimensions of segment embedding: {s_embedding.size()}") # Expected: (sequence_length, batch_size, EMBEDDING_DIM)
# Check the Segment Embedding vectors for first 3 tokens of the first sample in the batch
# you get segment_embedded[i,0,:] where i refers to the i'th token of the first sample(b=0) in the batch
for i in range(3):
    print(f"Segment Embeddings for the {i}th token of the first sample: {s_embedding[i,0,:]}")

Dimensions of segment embedding: torch.Size([58, 2, 10])
Segment Embeddings for the 0th token of the first sample: tensor([-1.1096,  1.4824, -0.0884, -1.5296, -1.6593, -1.0829,  0.1962,  0.1979,
         0.5246, -0.0645], grad_fn=<SliceBackward0>)
Segment Embeddings for the 1th token of the first sample: tensor([-1.1096,  1.4824, -0.0884, -1.5296, -1.6593, -1.0829,  0.1962,  0.1979,
         0.5246, -0.0645], grad_fn=<SliceBackward0>)
Segment Embeddings for the 2th token of the first sample: tensor([-1.1096,  1.4824, -0.0884, -1.5296, -1.6593, -1.0829,  0.1962,  0.1979,
         0.5246, -0.0645], grad_fn=<SliceBackward0>)


In [10]:
# Create the combined embedding vectors
bert_embeddings = t_embeddings + p_embedding + s_embedding
print(f"Dimensions of token + position + segment encoded tokens: {bert_embeddings.size()}")
#Check the BERT Embedding vectors for first 3 tokens of the first sample in the batch
# you get bert_embeddings[i,0,:] where i refers to the i'th token of the first sample(b=0) in the batch
for i in range(3):
    print(f"BERT_Embedding for {i}th token: {bert_embeddings[i,0,:]}")

Dimensions of token + position + segment encoded tokens: torch.Size([58, 2, 10])
BERT_Embedding for 0th token: tensor([-5.1111, -5.2772, 13.0091,  0.6000,  0.0673, -5.9986, 11.5161, -6.9284,
         5.4405, -1.0590], grad_fn=<SliceBackward0>)
BERT_Embedding for 1th token: tensor([ -0.4857,  -0.6672,  -1.1746,  -0.5613,  -7.5343,  -5.8032,  -4.0669,
        -13.3016,  -5.7418,   2.2724], grad_fn=<SliceBackward0>)
BERT_Embedding for 2th token: tensor([  9.4372,   0.7908,   3.3283,  -5.7113,  -7.1287,   3.8997,   3.3245,
          0.0396,  -8.6054, -13.1543], grad_fn=<SliceBackward0>)


In [11]:
class BERT(torch.nn.Module):

    def __init__(self, vocab_size, d_model=768, n_layers=12, heads=12, dropout=0.1):
        """
        vocab_size: The size of the vocabulary.
        d_model: The size of the embeddings (hidden size).
        n_layers: The number of Transformer layers.
        heads: The number of attention heads in each Transformer layer.
        dropout: The dropout rate applied to embeddings and Transformer layers.
        """
        super().__init__()
        self.d_model = d_model
        self.n_layers = n_layers
        self.heads = heads

        # Embedding layer that combines token embeddings and segment embeddings
        self.bert_embedding = BERTEmbedding(vocab_size, d_model, dropout)

        # Transformer Encoder layers
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads, dropout=dropout,batch_first=False)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=n_layers)

        # Linear layer for Next Sentence Prediction
        self.nextsentenceprediction = nn.Linear(d_model, 2)

        # Linear layer for Masked Language Modeling
        self.masked_language = nn.Linear(d_model, vocab_size)

    def forward(self, bert_inputs, segment_labels):
        """
        bert_inputs: Input tokens.
        segment_labels: Segment IDs for distinguishing different segments in the input.
        mask: Attention mask to prevent attention to padding tokens.

        return: Predictions for next sentence task and masked language modeling task.
        """

        padding_mask = (bert_inputs == PAD_IDX).transpose(0, 1)
        # Generate embeddings from input tokens and segment labels
        my_bert_embedding = self.bert_embedding(bert_inputs, segment_labels)

        # Pass embeddings through the Transformer encoder
        transformer_encoder_output = self.transformer_encoder(my_bert_embedding,src_key_padding_mask=padding_mask)


        next_sentence_prediction = self.nextsentenceprediction(transformer_encoder_output[ 0,:])


        # Masked Language Modeling: Predict all tokens in the sequence
        masked_language = self.masked_language(transformer_encoder_output)

        return  next_sentence_prediction, masked_language

In [12]:
# create an instance of the model
EMBEDDING_DIM = 10

# Define parameters
vocab_size = 147161  # Replace VOCAB_SIZE with your vocabulary size
d_model = EMBEDDING_DIM  # Replace EMBEDDING_DIM with your embedding dimension
n_layers = 2  # Number of Transformer layers
initial_heads = 12 # Initial number of attention heads
initial_heads = 2
# Ensure the number of heads is a factor of the embedding dimension
heads = initial_heads - d_model % initial_heads

dropout = 0.1  # Dropout rate

# Create an instance of the BERT model
model = BERT(vocab_size, d_model, n_layers, heads, dropout)



In [13]:
padding_mask = (bert_inputs == PAD_IDX).transpose(0, 1)
padding_mask.shape

torch.Size([2, 58])

In [14]:
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads, dropout=dropout,batch_first=False)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
# Pass embeddings through the Transformer encoder
transformer_encoder_output = transformer_encoder(bert_embeddings,src_key_padding_mask=padding_mask)
transformer_encoder_output.shape

torch.Size([58, 2, 10])

In [15]:
nextsentenceprediction = nn.Linear(d_model, 2)
nsp = nextsentenceprediction(transformer_encoder_output[ 0,:])
#logits for NSP task
print(f"NSP Output Shape: {nsp.shape}")  # Expected shape: (batch_size, 2)

NSP Output Shape: torch.Size([2, 2])


In [16]:
masked_language = nn.Linear(d_model, vocab_size)
# Masked Language Modeling: Predict all tokens in the sequence
mlm = masked_language(transformer_encoder_output)
#logits for MLM task
print(f"MLM Output Shape: {mlm.shape}")  # Expected shape: (seq_length, batch_size, vocab_size)

MLM Output Shape: torch.Size([58, 2, 147161])


## Evaluation

In [17]:
PAD_IDX=0
loss_fn_mlm = nn.CrossEntropyLoss(ignore_index=PAD_IDX)# The loss function must ignore PAD tokens and only calculates loss for the masked tokens
loss_fn_nsp = nn.CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
device

def evaluate(dataloader=test_dataloader, model=model, loss_fn_mlm=loss_fn_mlm, loss_fn_nsp=loss_fn_nsp, device=device):
    model.eval()  # Turn off dropout and other training-specific behaviors

    total_loss = 0
    total_next_sentence_loss = 0
    total_mask_loss = 0
    total_batches = 0
    with torch.no_grad():  # Turn off gradients for validation, saves memory and computations
        for batch in dataloader:
            bert_inputs, bert_labels, segment_labels, is_nexts = [b.to(device) for b in batch]

            # Forward pass
            next_sentence_prediction, masked_language = model(bert_inputs, segment_labels)

            # Calculate loss for next sentence prediction
            # Ensure is_nexts is of the correct shape for CrossEntropyLoss
            next_loss = loss_fn_nsp(next_sentence_prediction, is_nexts.view(-1))

            # Calculate loss for predicting masked tokens
            # Flatten both masked_language predictions and bert_labels to match CrossEntropyLoss input requirements
            mask_loss = loss_fn_mlm(masked_language.view(-1, masked_language.size(-1)), bert_labels.view(-1))

            # Sum up the two losses
            loss = next_loss + mask_loss
            if torch.isnan(loss):
                continue
            else:
                total_loss += loss.item()
                total_next_sentence_loss += next_loss.item()
                total_mask_loss += mask_loss.item()
                total_batches += 1

    avg_loss = total_loss / (total_batches + 1)
    avg_next_sentence_loss = total_next_sentence_loss / (total_batches + 1)
    avg_mask_loss = total_mask_loss / (total_batches + 1)

    print(f"Average Loss: {avg_loss:.4f}, Average Next Sentence Loss: {avg_next_sentence_loss:.4f}, Average Mask Loss: {avg_mask_loss:.4f}")
    return avg_loss

## Training

(Use randomly sampled dataset to reduce processing time.)

In [18]:
# BATCH_SIZE = 3

# train_dataset_path = './bert_dataset/bert_train_data_sampled.csv'
# test_dataset_path = './bert_dataset/bert_test_data_sampled.csv'

# train_dataset = BERTCSVDataset(train_dataset_path)
# test_dataset = BERTCSVDataset(test_dataset_path)

# train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
# test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

In [19]:
# from torch.optim import Adam
# from transformers import get_linear_schedule_with_warmup
# from tqdm import tqdm


# # Define the optimizer
# optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=0.01, betas=(0.9, 0.999))

# # Training loop setup
# num_epochs = 1
# total_steps = num_epochs * len(train_dataloader)

# # Define the number of warmup steps, e.g., 10% of total
# warmup_steps = int(total_steps * 0.1)

# # Create the learning rate scheduler
# scheduler = get_linear_schedule_with_warmup(optimizer,
#                                             num_warmup_steps=warmup_steps,
#                                             num_training_steps=total_steps)

# # Lists to store losses for plotting
# train_losses = []
# eval_losses = []

# for epoch in tqdm(range(num_epochs), desc="Training Epochs"):
#     model.train()
#     total_loss = 0

#     for step, batch in enumerate(tqdm(train_dataloader, desc=f"Epoch {epoch + 1}")):
#         bert_inputs, bert_labels, segment_labels, is_nexts = [b.to(device) for b in batch]

#         optimizer.zero_grad()
#         next_sentence_prediction, masked_language = model(bert_inputs, segment_labels)

#         next_loss = loss_fn_nsp(next_sentence_prediction, is_nexts)
#         mask_loss = loss_fn_mlm(masked_language.view(-1, masked_language.size(-1)), bert_labels.view(-1))

#         loss = next_loss + mask_loss
#         loss.backward()
#         torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#         optimizer.step()
#         scheduler.step()  # Update the learning rate

#         total_loss += loss.item()

#         if torch.isnan(loss):
#             continue
#         else:
#             total_loss += loss.item()

#     avg_train_loss = total_loss / len(train_dataloader) + 1
#     train_losses.append(avg_train_loss)
#     print(f"Epoch {epoch+1} - Average training loss: {avg_train_loss:.4f}")

#     # Evaluation after each epoch
#     eval_loss = evaluate(test_dataloader, model, loss_fn_nsp, loss_fn_mlm, device)
#     eval_losses.append(eval_loss)

In [20]:
# import matplotlib.pyplot as plt

# # Plotting the loss values
# plt.figure(figsize=(6, 4))
# plt.scatter(range(1,num_epochs+1), train_losses, label="Training Loss", color='blue')
# plt.scatter(range(1,num_epochs+1), eval_losses, label="Evaluation Loss", color='orange')
# plt.xlabel('Epoch')
# plt.ylabel('Loss')
# plt.title('Training and Evaluation Loss')
# plt.legend()
# plt.show()

## Inference

(load model from pt file)

In [21]:
model = BERT(vocab_size, d_model, n_layers, heads, dropout)  # Ensure these parameters match the original model's
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/H04Cs7O75aOfmJ4YP2HdPw.pt'
model.load_state_dict(torch.load('H04Cs7O75aOfmJ4YP2HdPw.pt',map_location=torch.device('cpu')))
model.to(device)

--2025-04-25 12:28:44--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/H04Cs7O75aOfmJ4YP2HdPw.pt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 



169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13099721 (12M) [binary/octet-stream]
Saving to: ‘H04Cs7O75aOfmJ4YP2HdPw.pt’


2025-04-25 12:28:46 (13.3 MB/s) - ‘H04Cs7O75aOfmJ4YP2HdPw.pt’ saved [13099721/13099721]



BERT(
  (bert_embedding): BERTEmbedding(
    (token_embedding): TokenEmbedding(
      (embedding): Embedding(147161, 10)
    )
    (positional_encoding): PositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (segment_embedding): Embedding(3, 10)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder_layer): TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=10, out_features=10, bias=True)
    )
    (linear1): Linear(in_features=10, out_features=2048, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=2048, out_features=10, bias=True)
    (norm1): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x Tran

In [22]:
# Initialize the tokenizer with the BERT model's vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.eval()

def predict_nsp(sentence1, sentence2, model, tokenizer):
    # Tokenize sentences with special tokens
    tokens = tokenizer.encode_plus(sentence1, sentence2, return_tensors="pt")
    tokens_tensor = tokens["input_ids"].to(device)
    segment_tensor = tokens["token_type_ids"].to(device)

    # Predict
    with torch.no_grad():
        # Assuming the model returns NSP predictions first
        nsp_prediction, _ = model(tokens_tensor, segment_tensor)
        # Select the first element (first sequence) of the logits tensor
        first_logits = nsp_prediction[0].unsqueeze(0)  # Adds an extra dimension, making it [1, 2]
        logits = torch.softmax(first_logits, dim=1)
        prediction = torch.argmax(logits, dim=1).item()

    # Interpret the prediction
    return "Second sentence follows the first" if prediction == 1 else "Second sentence does not follow the first"

# Example usage
sentence1 = "The S&P dropped 10% after the Iraqi war."
sentence2 = "I hate chocolate"

print(predict_nsp(sentence1, sentence2, model, tokenizer))

Second sentence follows the first


In [23]:
def predict_mlm(sentence, model, tokenizer):
    # Tokenize the input sentence and convert to token IDs, including special tokens
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens_tensor = inputs.input_ids

    # Create dummy segment labels filled with zeros, assuming it's needed by your model
    segment_labels = torch.zeros_like(tokens_tensor)

    with torch.no_grad():
        # Forward pass through the model, now correctly handling the output tuple
        output_tuple = model(tokens_tensor, segment_labels)

        # Assuming the second element of the tuple contains the MLM logits
        predictions = output_tuple[1]  # Adjusted based on your model's output

        # Identify the position of the [MASK] token
        mask_token_index = (tokens_tensor == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

        # Get the predicted index for the [MASK] token from the MLM logits
        predicted_index = torch.argmax(predictions[0, mask_token_index.item(), :], dim=-1)
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index.item()])[0]

        # Replace [MASK] in the original sentence with the predicted token
        predicted_sentence = sentence.replace(tokenizer.mask_token, predicted_token, 1)

    return predicted_sentence


# Example usage
sentence = "The cat sat on the [MASK]."
print(predict_mlm(sentence, model, tokenizer))

The cat sat on the [unused8].


## Exercise 1: Next Sentence Prediction (NSP) with BERT

In [26]:
from transformers import BertForPreTraining

# Load pretrained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pretrained model (weights)
model = BertForPreTraining.from_pretrained('bert-base-uncased')
# Prepare text pair for NSP
text_1 = "The cat sat on the mat"
text_2 = "It was a sunny day"
# Encode text
inputs = tokenizer(text_1, text_2, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs, next_sentence_label=torch.LongTensor([1]))
    nsp_logits = outputs.seq_relationship_logits

# Interpret the result for NSP
if torch.argmax(nsp_logits, dim=-1).item() == 0:
    print("The model thinks these sentences are NOT consecutive.")
else:
    print("The model thinks these sentences are consecutive.")

The model thinks these sentences are NOT consecutive.


## Exercise 2: Masked Language Modeling (MLM) with BERT

In [27]:
from transformers import BertForPreTraining, BertTokenizer
import torch

# Load pretrained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pretrained model (weights)
model = BertForPreTraining.from_pretrained('bert-base-uncased')

# Prepare text with masked token
masked_text = "The capital of France is [MASK]."
# Tokenize and prepare for the model: Convert to tokens and add special tokens
input_ids = tokenizer(masked_text, return_tensors="pt")["input_ids"]

# Predict all tokens
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = outputs.prediction_logits

# Confirm we were able to predict 'Paris' as the masked token
predicted_index = torch.argmax(predictions[0, input_ids[0] == tokenizer.mask_token_id]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])

print(f"Predicted token: {predicted_token}")

Predicted token: ['paris']
