# A4: Do you AGREE?

This assignment guides you through training a BERT-style language model from scratch, focusing on leveraging text embeddings to capture semantic similarity.

Additionally, we will explore how to adapt the loss function for tasks like Natural Language Inference (NLI) to enhance the model’s ability to understand semantic relationships between texts.

#### Step 0: Prepare Environment - Import Libraries and select device

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

import datasets, math, re, random, time
from collections import Counter
from tqdm import tqdm

In [20]:
# minimum required torch version for MPS support: 1.12+
torch.__version__

'2.10.0'

In [21]:
# Hugging Face Hub login token
# Make sure to set the HF_TOKEN environment variable in your .env file with your Hugging Face token
import os
HF_TOKEN = os.environ.get("HF_TOKEN")


In [22]:
# universal device selection: use gpu if available, else cpu
import torch

def get_device():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # Clear CUDA cache to free up memory 
        return torch.device("cuda")      # NVIDIA GPU
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()  # Clear MPS cache to avoid memory issues
        return torch.device("mps")       # Apple Silicon GPU
    else:
        # Nothing to clear on CPU: torch has no top-level empty_cache() function
        return torch.device("cpu")

device = get_device()

print(f"Using device: {device}")

Using device: mps


## Task 1. Training BERT from Scratch - Based on Masked Language Model/BERT-update.ipynb, modify as follows: (2 points)

### 1.1) Implement Bidirectional Encoder Representations from Transformers (BERT) from scratch, following the concepts learned in class.

NOTE: BERT-update.ipynb is available to use CUDA.
NOTE: You may refer to the BERT paper$^1$ and use large corpora such as BookCorpus$^2$ or English Wikipedia$^3$. However, you should only use a subset, such as 100k samples, rather than the entire dataset.

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

#### Step 1: Data Acquisition

In [23]:
# Hugging Face Hub login token
# Make sure to set the HF_TOKEN environment variable in your .env file with your Hugging Face token
from dotenv import load_dotenv
load_dotenv()

import os
HF_TOKEN = os.environ.get("HF_TOKEN")

Using dataset from [Salesforce wikitext](https://huggingface.co/datasets/Salesforce/wikitext)

Info about Dataset:
wikitext-103-raw-v1
```sh
    Size of downloaded dataset files: 191.98 MB
    Size of the generated dataset: 549.42 MB
    Total amount of disk used: 741.41 MB
```

In [24]:
import os
from datasets import load_dataset

# The data folder is not uploaded to GitHub.
_DATA_PATH = "../data/wikitext-103"
_DATA_FILENAME = os.path.join(_DATA_PATH, "wikitext-103-train.arrow")
os.makedirs(_DATA_PATH, exist_ok=True)

if not os.path.exists(_DATA_FILENAME):
    # Download the splits, then save local Parquet copies so later runs can skip the download
    dataset_train = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", cache_dir=_DATA_PATH)
    dataset_valid = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation", cache_dir=_DATA_PATH)
    dataset_test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test", cache_dir=_DATA_PATH)
    dataset_train.to_parquet(_DATA_FILENAME)
    dataset_valid.to_parquet(os.path.join(_DATA_PATH, "wikitext-103-validation.arrow"))
    dataset_test.to_parquet(os.path.join(_DATA_PATH, "wikitext-103-test.arrow"))

else:
    # Load from the local Parquet files (Parquet content, despite the .arrow file extension)
    from datasets import Dataset
    dataset_train = Dataset.from_parquet(_DATA_FILENAME)
    dataset_valid = Dataset.from_parquet(os.path.join(_DATA_PATH, "wikitext-103-validation.arrow"))
    dataset_test = Dataset.from_parquet(os.path.join(_DATA_PATH, "wikitext-103-test.arrow"))
    print("Loaded datasets from local Parquet files.")

print(f"Dataset size: {len(dataset_train)}")
print(f"Validation set size: {len(dataset_valid)}")
print(f"Test set size: {len(dataset_test)}")

Dataset size: 1801350
Validation set size: 3760
Test set size: 4358


Check dataset 

In [25]:
dataset_train

Dataset({
    features: ['text'],
    num_rows: 1801350
})

In [26]:
dataset_test

Dataset({
    features: ['text'],
    num_rows: 4358
})

In [27]:
dataset_valid

Dataset({
    features: ['text'],
    num_rows: 3760
})

In [28]:
# Display the first 5 entries in the dataset 
# with only the first 80 characters of the text for brevity
[text[:80] for text in dataset_train[:5]['text']]

['',
 ' = Valkyria Chronicles III = \n',
 '',
 ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Va',
 ' The game began development in 2010 , carrying over a large portion of the work ']

#### Step 2: Data Preprocessing

Before building the vocabulary, lowercase the text and remove the punctuation characters `.`, `,`, `!`, `?`, and `-`. Empty lines are dropped.

In [90]:
sentences = [re.sub("[.,!?\\-]", '', t.lower()) for t in dataset_train["text"] if t.strip()]
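To make the effect of this cleaning step concrete, here is a minimal sketch on a made-up sentence. The helper name `preprocess` and the sample string are assumptions for illustration; the regular expression is the one used in the cell above.

```python
import re

def preprocess(text):
    # Lowercase, then delete only the characters . , ! ? and -
    return re.sub("[.,!?\\-]", "", text.lower())

sample = "Well-known games, like Valkyria Chronicles III, sold well!"
cleaned = preprocess(sample)
print(cleaned)          # the cleaned string
print(cleaned.split())  # whitespace tokens, as used later for the vocabulary
```

Two side effects are worth keeping in mind: hyphenated words are fused into one token ("well-known" becomes "wellknown"), and other symbols such as `:`, `(`, `=` or `"` are left in place and end up as vocabulary entries.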

In [91]:
sentences[:5]

[' = valkyria chronicles iii = \n',
 ' senjō no valkyria 3 : unrecorded chronicles ( japanese : 戦場のヴァルキュリア3  lit  valkyria of the battlefield 3 )  commonly referred to as valkyria chronicles iii outside japan  is a tactical role @@ playing video game developed by sega and mediavision for the playstation portable  released in january 2011 in japan  it is the third game in the valkyria series  employing the same fusion of tactical and real @@ time gameplay as its predecessors  the story runs parallel to the first game and follows the " nameless "  a penal military unit serving the nation of gallia during the second europan war who perform secret black operations and are pitted against the imperial unit " calamaty raven "  \n',
 " the game began development in 2010  carrying over a large portion of the work done on valkyria chronicles ii  while it retained the standard features of the series  it also underwent multiple adjustments  such as making the game more forgiving for series newcome

#### Step 3: Build Vocabulary - Numericalization

In [92]:
from tqdm.auto import tqdm

# Combine everything into one to make vocab
word_list = list(set(" ".join(sentences).split()))
word2id = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}  # special tokens

# Create the word2id in a single pass
for i, w in tqdm(enumerate(word_list), desc="Creating word2id"):
    word2id[w] = i + 4  # because 0-3 are already occupied

# Precompute the id2word mapping (this can be done once after word2id is fully populated)
id2word = {v: k for k, v in word2id.items()}
vocab_size = len(word2id)
vocab_size

Creating word2id: 0it [00:00, ?it/s]

539719

In [95]:
for i in range(10):
    print(f"id2word[{i}] = {id2word[i]}")


id2word[0] = [PAD]
id2word[1] = [CLS]
id2word[2] = [SEP]
id2word[3] = [MASK]
id2word[4] = ἁνθρώπου
id2word[5] = kirsopp
id2word[6] = soarin
id2word[7] = ruckdeschel
id2word[8] = gubarev
id2word[9] = vagharshapat
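A vocabulary of 539,719 entries contains many words that occur only once, and it has no entry for words that never appeared in training. One possible variation, sketched here as an assumption and not used in this notebook, is to keep only the most frequent words and add an `[UNK]` token for everything else. The names `build_vocab`, `encode`, and `max_vocab` are hypothetical.

```python
from collections import Counter

def build_vocab(sentences, max_vocab=None):
    """Build word2id/id2word, keeping at most `max_vocab` of the most frequent words."""
    specials = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']  # [UNK] is an addition
    counts = Counter(w for s in sentences for w in s.split())
    kept = [w for w, _ in counts.most_common(max_vocab)]
    word2id = {tok: i for i, tok in enumerate(specials)}
    for w in kept:
        word2id[w] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

def encode(sentence, word2id):
    """Map a sentence to IDs, sending unknown words to [UNK]."""
    unk = word2id['[UNK]']
    return [word2id.get(w, unk) for w in sentence.split()]

toy = ["the cat is walking", "the dog is barking"]
toy_word2id, toy_id2word = build_vocab(toy, max_vocab=3)
print(toy_word2id)
print(encode("the bird is walking", toy_word2id))
```

Sorting by frequency also makes the ID assignment reproducible across runs, which iterating over a `set` of strings does not guarantee.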


#### Step 4: Tokenization 

Each sentence is split on whitespace and mapped to IDs with `word2id`; the spaCy and NLTK tokenizers are not used here.

In [97]:
# List of all tokens for the whole text
token_list = []

# Process sentences more efficiently
for sentence in tqdm(sentences, desc="Processing sentences"):
    token_list.append([word2id[word] for word in sentence.split()])

Processing sentences:   0%|          | 0/1165029 [00:00<?, ?it/s]

In [104]:
for i in range(5):
    print(f"sentences[{i}] => {sentences[i].split()[:10]}...")
    print(f"token_list[{i}] => {token_list[i][:10]}...")

sentences[0] => ['=', 'valkyria', 'chronicles', 'iii', '=']...
token_list[0] => [177347, 259061, 100513, 372364, 177347]...
sentences[1] => ['senjō', 'no', 'valkyria', '3', ':', 'unrecorded', 'chronicles', '(', 'japanese', ':']...
token_list[1] => [246676, 488586, 259061, 446802, 30650, 190007, 100513, 230477, 375006, 30650]...
sentences[2] => ['the', 'game', 'began', 'development', 'in', '2010', 'carrying', 'over', 'a', 'large']...
token_list[2] => [65286, 78660, 345996, 287783, 335490, 244805, 407664, 350545, 272677, 399248]...
sentences[3] => ['it', 'met', 'with', 'positive', 'sales', 'in', 'japan', 'and', 'was', 'praised']...
token_list[3] => [342902, 197976, 167790, 335913, 417748, 335490, 394738, 85077, 42664, 55868]...
sentences[4] => ['=', '=', 'gameplay', '=', '=']...
token_list[4] => [177347, 177347, 148161, 177347, 177347]...


In [96]:
dataset_train["text"][1]

' = Valkyria Chronicles III = \n'

In [31]:
raw_text = " ".join(dataset_train['text']) # all sentences in a line
print(f"Total characters in raw text: {len(raw_text)}")
raw_text[:80]

Total characters in raw text: 540095682


'  = Valkyria Chronicles III = \n   Senjō no Valkyria 3 : Unrecorded Chronicles ( '

Simple whitespace tokenization is used here; spaCy and NLTK were dropped for simplicity.

The dataset is too large to process as a single string: the concatenated text (about 540 million characters) fails spaCy's `max_length` check, which exists to prevent memory allocation errors. Instead of concatenating, loop over the dataset entry by entry.

spaCy is also slow on a corpus of this size, so NLTK was considered as an alternative.

The command `nltk.download('punkt')` downloads the "punkt" tokenizer models for NLTK. "Punkt" is a pre-trained model used by NLTK for sentence splitting and word tokenization in English and other languages. Without downloading "punkt", `word_tokenize` raises a `LookupError`.

In [32]:
# NLTK is not used for tokenization; simple whitespace tokenization is used instead.
# This cell is kept for reference only and is not necessary for this assignment.

# import nltk

# _DOWNLOAD_DIR = "../models/nltk_data"
# os.makedirs(_DOWNLOAD_DIR, exist_ok=True)
# nltk.download('punkt', download_dir=_DOWNLOAD_DIR)
# nltk.download('punkt_tab', download_dir=_DOWNLOAD_DIR)
# from nltk.tokenize import word_tokenize


# nltk.data.path.append(_DOWNLOAD_DIR)
# tokenized_texts = [word_tokenize(text) for text in sentences]

# # Example: print the first 5 tokenized samples
# for tokens in tokenized_texts[:5]:
    # print(tokens)


### 2) Train the model on a suitable dataset. Ensure to source this dataset from reputable public databases or repositories. It is imperative to give proper credit to the dataset source in your documentation.


#### Step 5: Data Loader

We now build the data loader. Each example needs the inputs for two kinds of embeddings, **token embedding** and **segment embedding**, followed by masking and padding:

1. **Token embedding** - Given “The cat is walking. The dog is barking”, we add [CLS] and [SEP] >> “[CLS] the cat is walking [SEP] the dog is barking [SEP]”. 

2. **Segment embedding**
A segment embedding separates the two sentences. For the example above the segment IDs are [0 0 0 0 0 0 1 1 1 1 1]: [CLS], the first sentence, and its [SEP] get 0; the second sentence and its [SEP] get 1.

3. **Masking**
As described in the original paper, BERT selects 15% of the tokens in a sequence for prediction. Of these, 80% are replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are left unchanged. Here the number of predictions per sequence is capped at `max_mask`.

4. **Padding**
After masking, we add padding. For simplicity, every sequence is padded to a fixed `max_len`. 

Note: `positive` and `negative` are simply counters that keep each batch balanced. `positive` counts pairs in which the second sentence really follows the first; `negative` counts all other pairs.
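The steps above can be tried on the two-sentence example with a toy vocabulary. This is a minimal sketch using only the standard library; `build_pair`, `mask_tokens`, and the six-word vocabulary are hypothetical names for illustration, and the 80/10/10 split follows the description above.

```python
import random

SPECIALS = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}

def build_pair(tokens_a, tokens_b):
    """Return (input_ids, segment_ids) for one sentence pair."""
    input_ids = ([SPECIALS['[CLS]']] + tokens_a + [SPECIALS['[SEP]']]
                 + tokens_b + [SPECIALS['[SEP]']])
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return input_ids, segment_ids

def mask_tokens(input_ids, vocab_size, max_mask=5, rng=random):
    """Mask in place; return the original tokens and their positions."""
    candidates = [i for i, t in enumerate(input_ids)
                  if t not in (SPECIALS['[CLS]'], SPECIALS['[SEP]'])]
    rng.shuffle(candidates)
    n_pred = min(max_mask, max(1, round(len(input_ids) * 0.15)))
    masked_tokens, masked_pos = [], []
    for pos in candidates[:n_pred]:
        masked_pos.append(pos)
        masked_tokens.append(input_ids[pos])
        r = rng.random()
        if r < 0.8:        # 80%: replace with [MASK]
            input_ids[pos] = SPECIALS['[MASK]']
        elif r < 0.9:      # 10%: replace with a random non-special token
            input_ids[pos] = rng.randrange(4, vocab_size)
        # remaining 10%: keep the original token
    return masked_tokens, masked_pos

words = ['the', 'cat', 'is', 'walking', 'dog', 'barking']
toy_vocab = {w: i + 4 for i, w in enumerate(words)}
a = [toy_vocab[w] for w in 'the cat is walking'.split()]
b = [toy_vocab[w] for w in 'the dog is barking'.split()]

input_ids, segment_ids = build_pair(a, b)
original = list(input_ids)
masked_tokens, masked_pos = mask_tokens(
    input_ids, vocab_size=len(toy_vocab) + 4, rng=random.Random(0))
print(original, segment_ids)
print(input_ids, masked_tokens, masked_pos)
```

With 11 tokens, 15% rounds to 2 predictions, so two positions are selected and every other position is left untouched.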

In [83]:
batch_size = 6
max_mask   = 5    # cap on masked tokens per sequence; below 15% for long sequences, so raise it if needed
max_len    = 1000 # maximum sequence length the transformer accepts; every sequence is padded to this length

In [105]:
def make_batch(token_seqs=None):
    # Use the global training token_list unless another tokenized corpus is passed in
    token_seqs = token_list if token_seqs is None else token_seqs
    batch = []
    half_batch_size = batch_size // 2
    positive = negative = 0
    while positive != half_batch_size or negative != half_batch_size:

        #randomly choose two sentences
        tokens_a_index, tokens_b_index = np.random.randint(len(token_seqs), size=2)
        tokens_a, tokens_b            = token_seqs[tokens_a_index], token_seqs[tokens_b_index]

        #1. token embedding - add CLS and SEP
        input_ids = [word2id['[CLS]']] + tokens_a + [word2id['[SEP]']] + tokens_b + [word2id['[SEP]']]

        #skip pairs longer than max_len: they cannot be padded to a fixed length
        if len(input_ids) > max_len:
            continue

        #2. segment embedding - which sentence is 0 and 1
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)

        n_pred = min(max_mask, max(1, int(round(len(input_ids) * 0.15))))
        #get all the pos excluding CLS and SEP
        candidates_masked_pos = [i for i, token in enumerate(input_ids) if token != word2id['[CLS]']
                                 and token != word2id['[SEP]']]
        np.random.shuffle(candidates_masked_pos)
        masked_tokens, masked_pos = [], []
        #simply loop and mask accordingly
        for pos in candidates_masked_pos[:n_pred]:
            masked_pos.append(pos)
            masked_tokens.append(input_ids[pos])
            rand_val = np.random.random()
            if rand_val < 0.1:  #10% replace with random token
                index = np.random.randint(4, vocab_size)  # high is exclusive; ids 0-3 ([PAD], [CLS], [SEP], [MASK]) are never drawn
                input_ids[pos] = index
            elif rand_val < 0.9:  #80% replace with [MASK]
                input_ids[pos] = word2id['[MASK]']
            else:  #10% keep the original token
                pass

        #4. pad the sentence to the max length
        n_pad = max_len - len(input_ids)
        input_ids.extend([0] * n_pad)
        segment_ids.extend([0] * n_pad)

        #5. pad the mask tokens to the max length
        if max_mask > n_pred:
            n_pad = max_mask - n_pred
            masked_tokens.extend([0] * n_pad)
            masked_pos.extend([0] * n_pad)

        #6. check whether is positive or negative
        if tokens_a_index + 1 == tokens_b_index and positive < half_batch_size:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True])
            positive += 1
        elif tokens_a_index + 1 != tokens_b_index and negative < half_batch_size:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False])
            negative += 1

    return batch

In [107]:
batch = make_batch()

In [108]:
len(batch)

6

In [109]:
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

In [110]:
input_ids.shape, segment_ids.shape, masked_tokens.shape, masked_pos.shape, isNext

(torch.Size([6, 1000]),
 torch.Size([6, 1000]),
 torch.Size([6, 5]),
 torch.Size([6, 5]),
 tensor([0, 0, 0, 1, 1, 1]))

In [111]:
masked_tokens

tensor([[ 65286, 119198,  57329, 139900, 285794],
        [146375,  66567,  33021,      0,      0],
        [ 63192, 531369, 192655, 108065,  65286],
        [532999, 421534, 532999,  12867,  45351],
        [ 69515, 178483, 299341,      0,      0],
        [485383,  65286,  22983,   2190, 497091]])
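In `make_batch`, both sentence indices are drawn uniformly at random, so a consecutive pair (`tokens_a_index + 1 == tokens_b_index`) turns up with a probability of roughly one in `len(sentences)`. On a corpus of over a million lines, filling the positive half of a batch can therefore take a very long time. One common alternative, sketched below as an assumption rather than what this notebook does, is to decide the label first and then sample a pair that matches it. `sample_nsp_pair` is a hypothetical helper.

```python
import random

def sample_nsp_pair(n_sentences, want_next, rng=random):
    """Return (index_a, index_b) whose adjacency matches `want_next`."""
    index_a = rng.randrange(n_sentences - 1)   # leave room for index_a + 1
    if want_next:
        return index_a, index_a + 1            # true next sentence
    index_b = rng.randrange(n_sentences)
    while index_b == index_a + 1:              # reject accidental positives
        index_b = rng.randrange(n_sentences)
    return index_a, index_b

rng = random.Random(0)
labels = [True, False] * 3                     # a balanced batch of six pairs
pairs = [sample_nsp_pair(1000, want_next=lab, rng=rng) for lab in labels]
for (a, b), lab in zip(pairs, labels):
    print(a, b, lab)
```

Note that in WikiText two consecutive lines can also be a heading followed by a paragraph, so "next line" is only an approximation of "next sentence" under either sampling scheme.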

#### Step 6. Model



Recall that BERT only uses the encoder.

BERT has the following components:

- Embedding layers
- Attention Mask
- Encoder layer
- Multi-head attention
- Scaled dot product attention
- Position-wise feed-forward network
- BERT (assembling all the components)

##### 6.1 Embedding

In [None]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, max_len, n_segments, d_model, device):
        super(Embedding, self).__init__()
        self.device = device
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(max_len, d_model)      # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        #x, seg: (bs, len)
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device)
        pos = pos.unsqueeze(0).expand_as(x)  # (len,) -> (bs, len)
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

##### 6.2 Attention mask

In [None]:
def get_attn_pad_mask(seq_q, seq_k, device=None):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(zero) is PAD token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # batch_size x 1 x len_k(=len_q), one is masking
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

In [None]:
print(get_attn_pad_mask(input_ids, input_ids).shape)
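A small example makes the mask layout easier to read. The sketch below restates the function without the unused `device` argument so that it runs on its own; the toy ID sequence is an assumption for illustration.

```python
import torch

def get_attn_pad_mask(seq_q, seq_k):
    """True marks key positions that hold the [PAD] id (0)."""
    batch_size, len_q = seq_q.size()
    _, len_k = seq_k.size()
    pad_attn_mask = seq_k.eq(0).unsqueeze(1)               # (batch, 1, len_k)
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # (batch, len_q, len_k)

# [CLS] w1 w2 [SEP] [PAD] [PAD]
toy_ids = torch.tensor([[1, 7, 8, 2, 0, 0]])
mask = get_attn_pad_mask(toy_ids, toy_ids)
print(mask.shape)
print(mask[0])
```

Every row of the mask is identical: the mask hides padded keys (columns), but it does not remove padded queries (rows), so the outputs at padded positions are still computed and simply ignored downstream.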

##### 6.3 Encoder

The encoder has two main components: 

- Multi-head Attention
- Position-wise feed-forward network

First let's make the wrapper called `EncoderLayer`

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, n_heads, d_model, d_ff, d_k, device):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention(n_heads, d_model, d_k)
        self.pos_ffn       = PoswiseFeedForwardNet(d_model, d_ff)

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size x len_q x d_model]
        return enc_outputs, attn

Let's define the scaled dot attention, to be used inside the multihead attention

In [None]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super(ScaledDotProductAttention, self).__init__()
        self.d_k = d_k

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(self.d_k) # scores : [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is one.
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        return context, attn 
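The class above computes $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$ after filling masked scores with a large negative value. As a quick check of that behaviour, the following sketch uses a functional version with random tensors; the function name and the toy shapes are assumptions for illustration.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, attn_mask):
    """Q, K, V: (batch, heads, len, d_k); attn_mask: bool, True = hide this key."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(d_k)
    scores = scores.masked_fill(attn_mask, -1e9)   # masked keys get ~zero weight
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, V), attn

torch.manual_seed(0)
Q = torch.randn(1, 2, 4, 8)
K = torch.randn(1, 2, 4, 8)
V = torch.randn(1, 2, 4, 8)
attn_mask = torch.zeros(1, 2, 4, 4, dtype=torch.bool)
attn_mask[..., 3] = True                           # pretend the last key is [PAD]

context, attn = scaled_dot_product_attention(Q, K, V, attn_mask)
print(context.shape, attn.shape)
print(attn[0, 0])
```

Each row of `attn` still sums to 1, and the column of the masked key receives no weight.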

Let's define the parameters first

In [None]:
n_layers = 6    # number of encoder layers
n_heads  = 8    # number of heads in Multi-Head Attention
d_model  = 768  # Embedding Size
d_ff = 768 * 4  # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2
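The arithmetic behind these settings is easy to check by hand. The sketch below only multiplies the numbers defined in this notebook (`vocab_size` and `max_len` are copied from earlier cells); it is a rough size estimate for the embedding tables, not a full parameter count.

```python
n_heads, d_model, d_k = 8, 768, 64
vocab_size, max_len, n_segments = 539719, 1000, 2

# Width of the concatenated attention heads
proj_width = n_heads * d_k
print(f"n_heads * d_k = {proj_width} (d_model = {d_model})")

# Sizes of the three embedding tables
tok_params = vocab_size * d_model
pos_params = max_len * d_model
seg_params = n_segments * d_model
print(f"token embedding   : {tok_params:,}")
print(f"position embedding: {pos_params:,}")
print(f"segment embedding : {seg_params:,}")
```

With 8 heads of width 64, the attention projections map 768 dimensions down to 512 and back; BERT-base instead uses 12 heads of width 64, so that `n_heads * d_k` equals `d_model`. The token embedding table dominates the model size because of the very large whitespace vocabulary.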

Here is the Multiheadattention.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, d_model, d_k):
        super(MultiHeadAttention, self).__init__()
        self.n_heads = n_heads
        self.d_k = d_k
        self.d_v = d_k  # d_v = d_k
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_k * n_heads)
        self.linear = nn.Linear(n_heads * d_k, d_model)
        self.norm = nn.LayerNorm(d_model)
    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, H*W) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1,2)
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1,2)
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1,2)

        attn_mask = attn_mask.unsqueeze(1).repeat(1, self.n_heads, 1, 1)

        context, attn = ScaledDotProductAttention(self.d_k)(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v)
        output = self.linear(context)
        return self.norm(output + residual), attn

Here is the PoswiseFeedForwardNet.

In [None]:
class PoswiseFeedForwardNet(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch_size, len_seq, d_model) -> (batch_size, len_seq, d_ff) -> (batch_size, len_seq, d_model)
        return self.fc2(F.gelu(self.fc1(x)))


##### 6.4 Putting them together

In [None]:
class BERT(nn.Module):
    def __init__(self, n_layers, n_heads, d_model, d_ff, d_k, n_segments, vocab_size, max_len, device):
        super(BERT, self).__init__()
        self.params = {'n_layers': n_layers, 'n_heads': n_heads, 'd_model': d_model,
                       'd_ff': d_ff, 'd_k': d_k, 'n_segments': n_segments,
                       'vocab_size': vocab_size, 'max_len': max_len}
        self.embedding = Embedding(vocab_size, max_len, n_segments, d_model, device)
        self.layers = nn.ModuleList([EncoderLayer(n_heads, d_model, d_ff, d_k, device) for _ in range(n_layers)])
        self.fc = nn.Linear(d_model, d_model)
        self.activ = nn.Tanh()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 2)
        # decoder is shared with embedding layer
        embed_weight = self.embedding.tok_embed.weight
        n_vocab, n_dim = embed_weight.size()
        self.decoder = nn.Linear(n_dim, n_vocab, bias=False)
        self.decoder.weight = embed_weight
        self.decoder_bias = nn.Parameter(torch.zeros(n_vocab))
        self.device = device

    def forward(self, input_ids, segment_ids, masked_pos):
        output = self.embedding(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids, self.device)
        for layer in self.layers:
            output, enc_self_attn = layer(output, enc_self_attn_mask)
        # output : [batch_size, len, d_model], attn : [batch_size, n_heads, len, len]
        
        # 1. predict next sentence
        # it will be decided by first token(CLS)
        h_pooled   = self.activ(self.fc(output[:, 0])) # [batch_size, d_model]
        logits_nsp = self.classifier(h_pooled) # [batch_size, 2]

        # 2. predict the masked token
        masked_pos = masked_pos[:, :, None].expand(-1, -1, output.size(-1)) # [batch_size, max_mask, d_model]
        h_masked = torch.gather(output, 1, masked_pos) # hidden states at the masked positions [batch_size, max_mask, d_model]
        h_masked  = self.norm(F.gelu(self.linear(h_masked)))
        logits_lm = self.decoder(h_masked) + self.decoder_bias # [batch_size, max_mask, n_vocab]

        return logits_lm, logits_nsp
    
    def get_last_hidden_state(self, input_ids, segment_ids):
        output = self.embedding(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids, self.device)
        for layer in self.layers:
            output, enc_self_attn = layer(output, enc_self_attn_mask)

        return output

#### Step 7. Training

In [None]:
import numpy as np
import torch.nn.functional as F

num_epoch = 500
model = BERT(
    n_layers, 
    n_heads, 
    d_model, 
    d_ff, 
    d_k, 
    n_segments, 
    vocab_size, 
    max_len, 
    device
).to(device) 
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

pbar = tqdm(range(num_epoch), desc="Training BERT")
for epoch in pbar:
    batch = make_batch(token_list)  # fresh batch each epoch
    input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

    # Move tensors to the same device as the model
    input_ids = input_ids.to(device)
    segment_ids = segment_ids.to(device)
    masked_tokens = masked_tokens.to(device)
    masked_pos = masked_pos.to(device)
    isNext = isNext.to(device)

    optimizer.zero_grad()
    logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos) 

    #1. mlm loss
    loss_lm = criterion(logits_lm.transpose(1, 2), masked_tokens)
    loss_lm = (loss_lm.float()).mean()
    #2. nsp loss
    loss_nsp = criterion(logits_nsp, isNext)
    
    #3. combine loss
    loss = loss_lm + loss_nsp
    if (epoch + 1) % 100 == 0 or epoch == 0:
        print(f"Epoch {epoch+1}/{num_epoch} - Loss: {loss.item():.4f} (MLM: {loss_lm.item():.4f}, NSP: {loss_nsp.item():.4f})")
    pbar.set_postfix(loss=f"{loss:.4f}", mlm=f"{loss_lm:.4f}", nsp=f"{loss_nsp:.4f}")
    loss.backward()
    optimizer.step()

print(f'\nTraining complete! Final loss = {loss:.6f}')
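`make_batch` pads `masked_tokens` with 0 when fewer than `max_mask` positions are masked, and the training loop above scores those padded slots as if `[PAD]` were a real target. A possible refinement, sketched here as an assumption and not part of this notebook, is to pass `ignore_index=0` to `nn.CrossEntropyLoss` for the MLM term only. The toy shapes and values below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, max_mask, vocab_size = 2, 5, 12
logits_lm = torch.randn(batch_size, max_mask, vocab_size)
masked_tokens = torch.tensor([[7, 4, 9, 0, 0],      # two padded slots
                              [5, 6, 8, 10, 11]])

criterion_all = nn.CrossEntropyLoss()                # counts padded slots
criterion_mlm = nn.CrossEntropyLoss(ignore_index=0)  # skips targets equal to 0

loss_all = criterion_all(logits_lm.transpose(1, 2), masked_tokens)
loss_mlm = criterion_mlm(logits_lm.transpose(1, 2), masked_tokens)

# Cross-check: average the loss over the non-padded slots by hand
keep = masked_tokens != 0
loss_manual = F.cross_entropy(logits_lm[keep], masked_tokens[keep])

print(loss_all.item(), loss_mlm.item(), loss_manual.item())
```

The NSP term would keep a plain `nn.CrossEntropyLoss()`, because label 0 is a valid class there.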

#### Step 7. Inference

In [None]:
# Predict masked tokens and isNext
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(batch[2]))
input_ids, segment_ids, masked_tokens, masked_pos = input_ids.to(device), segment_ids.to(device), masked_tokens.to(device), masked_pos.to(device)
print([id2word[w.item()] for w in input_ids[0] if id2word[w.item()] != '[PAD]'])

logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)
#logits_lm:  (1, max_mask, vocab_size) ==> (1, 5, 539719)
#logits_nsp: (1, yes/no) ==> (1, 2)

#predict masked tokens
#max the probability along the vocab dim (2), [1] is the indices of the max, and [0] is the first value
logits_lm = logits_lm.data.max(2)[1][0].data.cpu().numpy()
#note that zero is padding we add to the masked_tokens
print('masked tokens (words) : ',[id2word[pos.item()] for pos in masked_tokens[0]])
print('masked tokens list : ',[pos.item() for pos in masked_tokens[0]])
print('predicted masked tokens (words) : ',[id2word[pos.item()] for pos in logits_lm])
print('predict masked tokens list : ', [pos for pos in logits_lm])

#predict nsp
logits_nsp = logits_nsp.data.max(1)[1][0].data.cpu().numpy()
print(logits_nsp)
print('isNext : ', True if isNext else False)
print('predict isNext : ',True if logits_nsp else False)

#### Step 8: Model Evaluation

Evaluate the BERT model on multiple batches to measure:
- **MLM Accuracy**: How well the model predicts the masked tokens (excluding padding)
- **NSP Accuracy**: How well the model predicts whether sentence B follows sentence A
- **Combined Loss**: Average loss across evaluation batches

In [None]:
# Apply the same preprocessing as the training set: lowercase, strip . , ! ? - and drop empty lines
sentences_valid = [re.sub("[.,!?\\-]", '', t.lower()) for t in dataset_valid["text"] if t.strip()]

# Example: print the first 5 tokenized samples
for s in sentences_valid[:5]:
    print(s.split()[:10])

print(f"Number of validation samples: {len(sentences_valid)}")


# Word to ID conversion for validation set, reusing the training word2id.
# The vocabulary has no [UNK] token, so words unseen in training are dropped (an assumption of this cell).
token_list_valid = [[word2id[w] for w in s.split() if w in word2id] for s in sentences_valid]
token_list_valid = [ids for ids in token_list_valid if ids]  # discard samples with no known words

# Example: print the first 5 tokenized samples
for token in token_list_valid[:5]:
    print(token[:10])  # Print first 10 token IDs for brevity


In [None]:
model.eval()

n_eval_batches = 50
total_mlm_correct = 0
total_mlm_total = 0
total_nsp_correct = 0
total_nsp_total = 0
total_loss = 0.0

# Pre-generate all batches on CPU first (the bottleneck is make_batch(), not GPU)
print(f"Pre-generating {n_eval_batches} evaluation batches...")
eval_batches = []
for _ in tqdm(range(n_eval_batches), desc="Generating batches"):
    b = make_batch(token_list_valid)
    ids, segs, mtoks, mpos, nxt = map(torch.LongTensor, zip(*b))
    eval_batches.append((ids, segs, mtoks, mpos, nxt))

# Run inference on GPU in a tight loop
print(f"Running evaluation on {device}...")
start_time = time.time()

with torch.no_grad():
    for ids, segs, mtoks, mpos, nxt in tqdm(eval_batches, desc="Evaluating"):
        # Move to device with non_blocking for async transfer
        ids   = ids.to(device, non_blocking=True)
        segs  = segs.to(device, non_blocking=True)
        mtoks = mtoks.to(device, non_blocking=True)
        mpos  = mpos.to(device, non_blocking=True)
        nxt   = nxt.to(device, non_blocking=True)

        logits_lm, logits_nsp = model(ids, segs, mpos)

        # --- Loss ---
        loss_lm = criterion(logits_lm.transpose(1, 2), mtoks).float().mean()
        loss_nsp = criterion(logits_nsp, nxt)
        total_loss += (loss_lm + loss_nsp).item()

        # --- MLM Accuracy (ignore padding tokens where masked_tokens == 0) ---
        pred_lm = logits_lm.argmax(dim=2)  # faster than .data.max(2)[1]
        non_pad_mask = mtoks != 0
        total_mlm_correct += (pred_lm[non_pad_mask] == mtoks[non_pad_mask]).sum().item()
        total_mlm_total += non_pad_mask.sum().item()

        # --- NSP Accuracy ---
        pred_nsp = logits_nsp.argmax(dim=1)  # faster than .data.max(1)[1]
        total_nsp_correct += (pred_nsp == nxt).sum().item()
        total_nsp_total += nxt.size(0)

elapsed = time.time() - start_time
avg_loss = total_loss / n_eval_batches
mlm_accuracy = total_mlm_correct / total_mlm_total * 100 if total_mlm_total > 0 else 0
nsp_accuracy = total_nsp_correct / total_nsp_total * 100 if total_nsp_total > 0 else 0

print(f"\n{'='*50}")
print(f"Evaluation Results over {n_eval_batches} batches ({n_eval_batches * batch_size} samples)")
print(f"{'='*50}")
print(f"  Average Loss     : {avg_loss:.4f}")
print(f"  MLM Accuracy     : {mlm_accuracy:.2f}% ({total_mlm_correct}/{total_mlm_total})")
print(f"  NSP Accuracy     : {nsp_accuracy:.2f}% ({total_nsp_correct}/{total_nsp_total})")
print(f"  Inference Time   : {elapsed:.2f}s ({elapsed/n_eval_batches*1000:.1f}ms per batch)")
print(f"  Device           : {device}")
print(f"{'='*50}")

# Switch back to training mode
model.train()
print("\nModel set back to training mode.")


### 3) Save the trained model weights for later use in Task 2.

1 https://aclanthology.org/N19-1423.pdf

2 https://huggingface.co/datasets/bookcorpus/bookcorpus

3 https://huggingface.co/datasets/legacy-datasets/wikipedia


In [None]:
# Save checkpoints
_MODEL_CHECKPOINT_PATH = "../models"
_MODEL_CHECKPOINT_FILENAME = "bert_checkpoint.pt"
os.makedirs(_MODEL_CHECKPOINT_PATH, exist_ok=True)  # make sure the target folder exists
checkpoint_full_path = os.path.join(_MODEL_CHECKPOINT_PATH, _MODEL_CHECKPOINT_FILENAME)

torch.save([model.params, model.state_dict()], checkpoint_full_path)
print("BERT Model checkpoint saved!")

## Task 2. Sentence Embedding with Sentence BERT - Implement trained BERT from task 1 with siamese network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. (3 points)


1) Use the SNLI$^4$ OR MNLI$^5$ datasets from Hugging Face, or any dataset related to classification tasks.
2) Reproduce training the Sentence-BERT as described in the paper$^6$.
3) Focus on the Classification Objective Function: (SoftmaxLoss)
$$o = \mathrm{softmax}\left(W^T \cdot (u, v, |u - v|)\right)$$
HINT : You can take a look at how to implement Softmax loss in the file 04 - Huggingface/Appendix - Sentence Embedding/S-BERT.ipynb.

<img src="../images/sbert-architecture.png">


4 https://huggingface.co/datasets/stanfordnlp/snli

5 https://huggingface.co/datasets/glue/viewer/mnli

6 https://aclanthology.org/D19-1410/
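The classification objective can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the implementation from the hinted S-BERT notebook: random tensors stand in for the encoder output (for example from `model.get_last_hidden_state`), mean pooling is used to obtain $u$ and $v$, and `mean_pool` and `SoftmaxLossHead` are hypothetical names. The weight $W^T$ corresponds to a linear layer from $3 \cdot d_{model}$ features to the three NLI classes.

```python
import torch
import torch.nn as nn

def mean_pool(token_embeds, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()       # (bs, len, 1)
    summed = (token_embeds * mask).sum(dim=1)         # (bs, d_model)
    counts = mask.sum(dim=1).clamp(min=1e-9)          # avoid division by zero
    return summed / counts

class SoftmaxLossHead(nn.Module):
    """Linear classifier over the concatenation (u, v, |u - v|)."""
    def __init__(self, d_model, n_classes=3):
        super().__init__()
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, u, v):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)              # logits; softmax is inside the loss

torch.manual_seed(0)
bs, seq_len, d_model = 4, 6, 16
hidden_a = torch.randn(bs, seq_len, d_model)          # stand-in for premise encodings
hidden_b = torch.randn(bs, seq_len, d_model)          # stand-in for hypothesis encodings
mask_a = torch.tensor([[1, 1, 1, 0, 0, 0]] * bs)
mask_b = torch.ones(bs, seq_len, dtype=torch.long)

u = mean_pool(hidden_a, mask_a)
v = mean_pool(hidden_b, mask_b)
head = SoftmaxLossHead(d_model)
logits = head(u, v)
labels = torch.tensor([0, 1, 2, 1])                   # entailment / neutral / contradiction
loss = nn.CrossEntropyLoss()(logits, labels)
similarity = torch.cosine_similarity(u, v, dim=-1)    # used at inference time
print(logits.shape, loss.item(), similarity)
```

During training both sentences pass through the same encoder (the siamese structure), and only at inference time are `u` and `v` compared directly with cosine similarity.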


#### 1. Data

In [None]:
import datasets
snli = datasets.load_dataset('snli')
mnli = datasets.load_dataset('glue', 'mnli')
mnli['train'].features, snli['train'].features

In [None]:
# List of datasets to remove 'idx' column from
mnli.column_names.keys()

In [None]:
# Remove 'idx' column from each dataset
for column_names in mnli.column_names.keys():
    mnli[column_names] = mnli[column_names].remove_columns('idx')

In [None]:
mnli.column_names.keys()

In [None]:
import numpy as np
np.unique(mnli['train']['label']), np.unique(snli['train']['label'])
# snli also has -1 labels

In [None]:
# there are -1 values in the label feature, these are where no class could be decided so we remove
snli = snli.filter(
    lambda x: 0 if x['label'] == -1 else 1
)

In [None]:
import numpy as np
np.unique(mnli['train']['label']), np.unique(snli['train']['label'])
# after filtering, snli should no longer contain -1

In [None]:
# Assuming you have your two DatasetDict objects named snli and mnli
from datasets import DatasetDict
# Merge the two DatasetDict objects
raw_dataset = DatasetDict({
    'train': datasets.concatenate_datasets([snli['train'], mnli['train']]).shuffle(seed=55).select(list(range(1000))),
    'test': datasets.concatenate_datasets([snli['test'], mnli['test_mismatched']]).shuffle(seed=55).select(list(range(100))),
    'validation': datasets.concatenate_datasets([snli['validation'], mnli['validation_mismatched']]).shuffle(seed=55).select(list(range(1000)))
})
# Remove the .select(list(range(...))) calls in order to use the full dataset
# raw_dataset now contains the combined (subsampled) splits from snli and mnli
raw_dataset

#### 2. Preprocessing

Use a custom tokenizer, `CustomBERTTokenizer`, built on the vocabulary (`word2id`) from Task 1.

In [None]:
# Custom Tokenizer using Task 1 vocabulary
import re
from nltk.tokenize import word_tokenize

class CustomBERTTokenizer:
    """Custom tokenizer that uses the word2id from Task 1"""
    
    def __init__(self, word2id, max_length=128):
        self.word2id = word2id
        self.max_length = max_length
        self.pad_token_id = word2id[_PAD_TOKEN]
        self.cls_token_id = word2id[_CLS_TOKEN]
        self.sep_token_id = word2id[_SEP_TOKEN]
        self.unk_token_id = word2id.get('[UNK]', 0)  # Use [PAD] as fallback for unknown
        
    def __call__(self, text, return_tensors=None, max_length=None, truncation=True, padding=True):
        """Tokenize text using NLTK and Task 1 vocabulary"""
        if max_length is None:
            max_length = self.max_length
            
        # Preprocess: lowercase and remove punctuation (same as Task 1)
        text = re.sub("[.,!?\\-]", '', text.lower())
        
        # Tokenize using NLTK (same as Task 1)
        tokens = word_tokenize(text)
        
        # Convert to IDs; unknown words map to [UNK], or to [PAD] if the vocabulary has no [UNK]
        token_ids = [self.word2id.get(token, self.unk_token_id) for token in tokens]
        
        # Add [CLS] at start and [SEP] at end
        token_ids = [self.cls_token_id] + token_ids + [self.sep_token_id]
        
        # Create attention mask (1 for real tokens, 0 for padding)
        attention_mask = [1] * len(token_ids)
        
        # Truncate if needed
        if truncation and len(token_ids) > max_length:
            token_ids = token_ids[:max_length]
            attention_mask = attention_mask[:max_length]
            # Ensure [SEP] at end after truncation
            token_ids[-1] = self.sep_token_id
        
        # Pad if needed
        if padding:
            padding_length = max_length - len(token_ids)
            if padding_length > 0:
                token_ids = token_ids + [self.pad_token_id] * padding_length
                attention_mask = attention_mask + [0] * padding_length
        
        # Return in HuggingFace format
        result = {
            'input_ids': token_ids,
            'attention_mask': attention_mask
        }
        
        # Convert to tensors if requested
        if return_tensors == 'pt':
            import torch
            result['input_ids'] = torch.tensor([result['input_ids']])
            result['attention_mask'] = torch.tensor([result['attention_mask']])
        
        return result

# Create the custom tokenizer
custom_tokenizer = CustomBERTTokenizer(word2id, max_length=128)

# Test it
test_sentence = "The cat is walking on the street."
test_output = custom_tokenizer(test_sentence, return_tensors='pt')
print(f"Test sentence: {test_sentence}")
print(f"Input IDs shape: {test_output['input_ids'].shape}")
print(f"First 10 token IDs: {test_output['input_ids'][0, :10]}")
print(f"Attention mask shape: {test_output['attention_mask'].shape}")
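
To make the truncation and padding behavior concrete, here is a minimal pure-Python sketch of the same encoding steps on a toy vocabulary. The names `toy_vocab` and `encode` are hypothetical, and whitespace splitting stands in for NLTK's `word_tokenize`; the toy vocabulary has no `[UNK]` entry, so unknown words fall back to the `[PAD]` ID.

```python
import re

# Hypothetical toy vocabulary; the real word2id comes from Task 1.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
             "the": 4, "cat": 5, "is": 6, "walking": 7}

def encode(text, word2id, max_length=8):
    """Lowercase, strip punctuation, map to IDs, then truncate and pad."""
    text = re.sub(r"[.,!?\-]", "", text.lower())
    tokens = text.split()  # whitespace split stands in for word_tokenize
    pad, cls, sep = word2id["[PAD]"], word2id["[CLS]"], word2id["[SEP]"]
    ids = [cls] + [word2id.get(t, pad) for t in tokens] + [sep]
    if len(ids) > max_length:
        ids = ids[:max_length]
        ids[-1] = sep  # keep [SEP] as the final token after truncation
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad] * n_pad, mask + [0] * n_pad

short_ids, short_mask = encode("The cat.", toy_vocab)
long_ids, long_mask = encode("The cat is walking on the street.", toy_vocab)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

In the long sentence, "on" and "street" are out of vocabulary, so their IDs equal the padding ID even though the attention mask marks them as real tokens. This is one reason a dedicated `[UNK]` entry in the vocabulary can be preferable.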


In [None]:
def preprocess_function(examples):
    """Preprocess using custom tokenizer from Task 1"""
    max_seq_length = 128
    
    # Process each example individually since custom tokenizer doesn't support batching
    premise_input_ids = []
    premise_attention_mask = []
    hypothesis_input_ids = []
    hypothesis_attention_mask = []
    
    # Get the number of examples
    num_examples = len(examples['premise'])
    
    for i in range(num_examples):
        # Tokenize premise
        premise_result = custom_tokenizer(
            examples['premise'][i], 
            max_length=max_seq_length, 
            truncation=True, 
            padding=True
        )
        premise_input_ids.append(premise_result['input_ids'])
        premise_attention_mask.append(premise_result['attention_mask'])
        
        # Tokenize hypothesis
        hypothesis_result = custom_tokenizer(
            examples['hypothesis'][i], 
            max_length=max_seq_length, 
            truncation=True, 
            padding=True
        )
        hypothesis_input_ids.append(hypothesis_result['input_ids'])
        hypothesis_attention_mask.append(hypothesis_result['attention_mask'])
    
    # Extract labels
    labels = examples["label"]
    
    return {
        "premise_input_ids": premise_input_ids,
        "premise_attention_mask": premise_attention_mask,
        "hypothesis_input_ids": hypothesis_input_ids,
        "hypothesis_attention_mask": hypothesis_attention_mask,
        "labels": labels
    }

print("Preprocessing with custom tokenizer...")
tokenized_datasets = raw_dataset.map(
    preprocess_function,
    batched=True,
    desc="Tokenizing with custom vocabulary"
)

tokenized_datasets = tokenized_datasets.remove_columns(['premise', 'hypothesis', 'label'])
tokenized_datasets.set_format("torch")

print(f"Dataset preprocessed with custom tokenizer!")
print(f"Train size: {len(tokenized_datasets['train'])}")
print(f"Validation size: {len(tokenized_datasets['validation'])}")
print(f"Test size: {len(tokenized_datasets['test'])}")

In [None]:
tokenized_datasets

#### 3. Data loader

In [None]:
from torch.utils.data import DataLoader

# initialize the dataloader
batch_size = 32
train_dataloader = DataLoader(
    tokenized_datasets['train'], 
    batch_size=batch_size, 
    shuffle=True
)
eval_dataloader = DataLoader(
    tokenized_datasets['validation'], 
    batch_size=batch_size
)
test_dataloader = DataLoader(
    tokenized_datasets['test'], 
    batch_size=batch_size
)

In [None]:
for batch in train_dataloader:
    print(batch['premise_input_ids'].shape)
    print(batch['premise_attention_mask'].shape)
    print(batch['hypothesis_input_ids'].shape)
    print(batch['hypothesis_attention_mask'].shape)
    print(batch['labels'].shape)
    break

#### 4. Model

In [None]:
# Load checkpoint
params, state_dict = torch.load(f'{checkpoint_full_path}', map_location=device)

# Recreate model with saved hyperparameters
loaded_model = BERT(**params, device=device).to(device)

# Load weights
loaded_model.load_state_dict(state_dict)
loaded_model.eval()

print("Model loaded from checkpoint!")

#### 5. Pooling
SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-size sentence embedding.

In [None]:
# define mean pooling function
def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover 768-dimension embeddings
    in_mask = attention_mask.unsqueeze(-1).expand(
        token_embeds.size()
    ).float()
    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(
        in_mask.sum(1), min=1e-9
    )
    return pool
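
As a quick sanity check of masked mean pooling, the sketch below uses a tiny hand-made tensor in which the last position is padding. The function `masked_mean` and the toy values are illustrative assumptions; it follows the same formula as `mean_pool`.

```python
import torch

def masked_mean(token_embeds, attention_mask):
    # expand the mask to the embedding shape, then average only over real tokens
    mask = attention_mask.unsqueeze(-1).expand(token_embeds.size()).float()
    return (token_embeds * mask).sum(1) / torch.clamp(mask.sum(1), min=1e-9)

# one sentence, three positions, hidden size 2; the last position is padding
token_embeds = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean(token_embeds, attention_mask)
naive = token_embeds.mean(dim=1)  # includes the padding position
print(pooled)
print(naive)
```

The masked result is the average of the two real token vectors, `[2., 3.]`, while the naive mean is pulled far away by the padding vector.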

#### 6. Loss Function



#### Classification Objective Function 
We concatenate the sentence embeddings $u$ and $v$ with the element-wise difference $\lvert u - v \rvert$ and multiply the result by the trainable weight $W_t \in \mathbb{R}^{3n \times k}$:

$o = \text{softmax}\left(W_t^\top \left(u, v, \lvert u - v \rvert\right)\right)$

where $n$ is the dimension of the sentence embeddings and $k$ the number of labels. We optimize cross-entropy loss. This structure is depicted in Figure 1.

#### Regression Objective Function
The cosine similarity between the two sentence embeddings $u$ and $v$ is computed (Figure 2). We use mean squared-error loss as the objective function.

(With a distance measure such as Manhattan or Euclidean distance, semantically similar sentences can be found.)
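
The regression objective is not used in the training loop of this notebook, so here is a minimal sketch of what it could look like. The random tensors stand in for pooled sentence embeddings, and the `gold` similarity scores are hypothetical; NLI labels are categorical, so this objective would need a dataset with graded similarity scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# stand-ins for mean-pooled embeddings of 4 sentence pairs, hidden size 8
u = torch.randn(4, 8, requires_grad=True)
v = torch.randn(4, 8)

# hypothetical gold similarity scores, one per pair
gold = torch.tensor([1.0, 0.8, 0.2, 0.0])

cos = F.cosine_similarity(u, v, dim=-1)  # shape: (batch_size,)
loss = F.mse_loss(cos, gold)             # mean squared error against the gold scores
loss.backward()                          # gradients flow back into u
print(cos.shape, loss.item())
```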

<img src="../images/sbert-architecture.png" >

In [None]:
def configurations(u,v):
    # build the |u-v| tensor
    uv = torch.sub(u, v)   # batch_size,hidden_dim
    uv_abs = torch.abs(uv) # batch_size,hidden_dim
    
    # concatenate u, v, |u-v|
    x = torch.cat([u, v, uv_abs], dim=-1) # batch_size, 3*hidden_dim
    return x

def cosine_similarity(u, v):
    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    similarity = dot_product / (norm_u * norm_v)
    return similarity

<img src="../images/sbert-ablation.png" width="350" height="300">

In [None]:
classifier_head = torch.nn.Linear(768*3, 3).to(device)

# Use LOWER learning rate for BERT encoder (it's already pretrained)
# Use HIGHER learning rate for classifier head (training from scratch)
optimizer = torch.optim.Adam(loaded_model.parameters(), lr=1e-5)  # Reduced from 2e-5; loaded_model is the model trained below
optimizer_classifier = torch.optim.Adam(classifier_head.parameters(), lr=1e-3)  # Higher for new layers

criterion = nn.CrossEntropyLoss()

print(f"Optimizer LR - BERT encoder: 1e-5, Classifier head: 1e-3")
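
An alternative to two separate optimizers is a single optimizer with per-parameter-group learning rates. The sketch below shows the idea; `encoder` and `head` are small hypothetical stand-ins, not the BERT model or classifier head from this notebook.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(16, 8)   # hypothetical stand-in for the BERT encoder
head = nn.Linear(8 * 3, 3)   # classifier over the concatenation (u, v, |u - v|)

# one optimizer, two parameter groups with different learning rates
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

learning_rates = [group["lr"] for group in optimizer.param_groups]
print(learning_rates)
```

With this layout a single learning-rate scheduler scales both groups together, so the training loop needs only one `optimizer.step()` and one `scheduler.step()` per batch.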

In [None]:
from transformers import get_linear_schedule_with_warmup

# set up a warmup for the first ~10% of optimizer steps
num_epoch = 5  # keep in sync with the training loop below
total_steps = len(train_dataloader) * num_epoch
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)
scheduler_classifier = get_linear_schedule_with_warmup(
    optimizer_classifier, num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

# both schedulers are stepped once per batch inside the training loop,
# after the corresponding optimizer.step()
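
To see what a linear warmup schedule does to the learning rate, here is a pure-Python sketch of the multiplier applied to the base learning rate at each step. The function `linear_warmup_factor` is a hypothetical helper that reflects my understanding of the schedule's shape, not the library's code. The step counts assume 1000 training samples, a batch size of 32 (32 batches per epoch), and 5 epochs.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # then decay linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 32 * 5                   # batches per epoch * epochs
warmup_steps = int(0.1 * total_steps)  # first ~10% of steps

for step in (0, 8, 16, 88, 160):
    print(step, linear_warmup_factor(step, warmup_steps, total_steps))
```

The factor peaks at the end of warmup and reaches zero at the final step, which is why `num_training_steps` should count all optimizer steps across all epochs.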

#### 7. Training

In [None]:
from tqdm.auto import tqdm

num_epoch = 5
model = loaded_model  # Use the loaded model from Task 1 checkpoint

# train for num_epoch epochs; lower num_epoch for a quick run
for epoch in range(num_epoch):
    model.train()  
    classifier_head.train()
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    for step, batch in enumerate(tqdm(train_dataloader, leave=True)):
        # zero all gradients on each new step
        optimizer.zero_grad()
        optimizer_classifier.zero_grad()
        
        # prepare batches and move all tensors to the active device
        inputs_ids_a = batch['premise_input_ids'].to(device)
        inputs_ids_b = batch['hypothesis_input_ids'].to(device)
        attention_a = batch['premise_attention_mask'].to(device)
        attention_b = batch['hypothesis_attention_mask'].to(device)
        label = batch['labels'].to(device)

        # segment ids are all zeros because each sentence is encoded on its own
        segment_ids_a = torch.zeros_like(inputs_ids_a).to(device)
        segment_ids_b = torch.zeros_like(inputs_ids_b).to(device)

        # extract token embeddings from BERT at last_hidden_state
        u_last_hidden_state = model.get_last_hidden_state(inputs_ids_a, segment_ids_a)  # batch_size, seq_len, hidden_dim
        v_last_hidden_state = model.get_last_hidden_state(inputs_ids_b, segment_ids_b)  # batch_size, seq_len, hidden_dim

        # get the mean pooled vectors
        u_mean_pool = mean_pool(u_last_hidden_state, attention_a) # batch_size, hidden_dim
        v_mean_pool = mean_pool(v_last_hidden_state, attention_b) # batch_size, hidden_dim
        
        # build the |u-v| tensor
        uv = torch.sub(u_mean_pool, v_mean_pool)   # batch_size,hidden_dim
        uv_abs = torch.abs(uv) # batch_size,hidden_dim
        
        # concatenate u, v, |u-v|
        x = torch.cat([u_mean_pool, v_mean_pool, uv_abs], dim=-1) # batch_size, 3*hidden_dim
        
        # process concatenated tensor through classifier_head
        x = classifier_head(x)  # batch_size, num_classes
        
        # calculate the 'softmax-loss' between predicted and true label
        loss = criterion(x, label)
        
        # backpropagate, clip gradients, then step the optimizers before the schedulers
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Optional: gradient clipping for stability
        optimizer.step()
        optimizer_classifier.step()
        scheduler.step()  # After optimizer!
        scheduler_classifier.step()
        
    print(f'Epoch: {epoch + 1} | loss = {loss.item():.6f}')

In [None]:
labels = []
predictions = []
probabilities = []
classes = ["entailment", "neutral", "contradiction"]

In [None]:
model.eval()
classifier_head.eval()
total_similarity = 0
with torch.no_grad():
    for step, batch in enumerate(eval_dataloader):
        # Move batches to the active device
        inputs_ids_a = batch['premise_input_ids'].to(device)
        inputs_ids_b = batch['hypothesis_input_ids'].to(device)
        attention_a = batch['premise_attention_mask'].to(device)
        attention_b = batch['hypothesis_attention_mask'].to(device)
        segment_ids = torch.zeros(inputs_ids_a.shape[0], inputs_ids_a.shape[1], dtype=torch.int32).to(device)
        label = batch['labels'].to(device)

        # Extract token embeddings from BERT
        u = model.get_last_hidden_state(inputs_ids_a, segment_ids)  # (batch_size, seq_len, hidden_dim)
        v = model.get_last_hidden_state(inputs_ids_b, segment_ids)  # (batch_size, seq_len, hidden_dim)

        # Get the mean pooled vectors (Keep them as Tensors)
        u_mean_pool = mean_pool(u, attention_a)  # (batch_size, hidden_dim)
        v_mean_pool = mean_pool(v, attention_b)  # (batch_size, hidden_dim)

        # Mean cosine similarity over the premise/hypothesis pairs in this batch
        similarity_score = F.cosine_similarity(u_mean_pool, v_mean_pool, dim=-1).mean().item()
        total_similarity += similarity_score

        # Concatenate [u, v, |u - v|]
        uv_abs = torch.abs(u_mean_pool - v_mean_pool)  # [batch_size, hidden_dim]
        x = torch.cat([u_mean_pool, v_mean_pool, uv_abs], dim=-1)  # [batch_size, 3*hidden_dim]

        # Classification
        logit_fn = classifier_head(x)  # (batch_size, num_classes)
        probs = torch.nn.functional.softmax(logit_fn, dim=-1)

        preds = torch.argmax(logit_fn, dim=-1)

        labels.extend(label.cpu().tolist())
        probabilities.extend(probs.cpu().tolist())
        predictions.extend(preds.cpu().tolist())

average_similarity = total_similarity / len(eval_dataloader)
print(f"Average Cosine Similarity: {average_similarity:.4f}")

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels, predictions, target_names=classes))

In [None]:
# saving the model checkpoints
torch.save([model.params, model.state_dict()], '../models/sentence_bert_checkpoint.pt')
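
The checkpoint above stores only the encoder. Predicting NLI labels later (as in Task 4) also needs the weights of the classifier head, so one option is to save them in a separate file. The sketch below is a self-contained illustration: the file name `classifier_head.pt` is hypothetical, and a temporary directory is used so that it runs anywhere.

```python
import os
import tempfile

import torch
import torch.nn as nn

head = nn.Linear(768 * 3, 3)  # same shape as classifier_head in this notebook

# hypothetical file name; a temporary directory keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "classifier_head.pt")
torch.save(head.state_dict(), path)

# later: rebuild the layer with the same shape and restore its weights
restored = nn.Linear(768 * 3, 3)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()
print(torch.equal(head.weight, restored.weight))
```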

#### 8. Inference

In [None]:
import torch
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(model, tokenizer, sentence_a, sentence_b, device):
    # Set model to evaluation mode
    model.eval()

    # Tokenize and convert sentences to input IDs and attention masks
    inputs_a = tokenizer(sentence_a, return_tensors='pt', max_length=128, truncation=True, padding=True)
    inputs_b = tokenizer(sentence_b, return_tensors='pt', max_length=128, truncation=True, padding=True)

    # Move to device
    inputs_ids_a = inputs_a['input_ids'].to(device)
    attention_a = inputs_a['attention_mask'].to(device)
    inputs_ids_b = inputs_b['input_ids'].to(device)
    attention_b = inputs_b['attention_mask'].to(device)

    # Create segment_ids (all 0s for single sentences)
    segment_ids_a = torch.zeros_like(inputs_ids_a).to(device)
    segment_ids_b = torch.zeros_like(inputs_ids_b).to(device)

    # Disable gradient computation for inference
    with torch.no_grad():
        # Extract embeddings using custom BERT API
        u_last_hidden_state = model.get_last_hidden_state(inputs_ids_a, segment_ids_a)
        v_last_hidden_state = model.get_last_hidden_state(inputs_ids_b, segment_ids_b)

        # Get the mean-pooled vectors - keep as 2D arrays for sklearn
        u = mean_pool(u_last_hidden_state, attention_a).detach().cpu().numpy()  # (1, hidden_dim)
        v = mean_pool(v_last_hidden_state, attention_b).detach().cpu().numpy()  # (1, hidden_dim)

    # Calculate cosine similarity (sklearn expects 2D arrays)
    similarity_score = cosine_similarity(u, v)[0, 0]

    return similarity_score


In [None]:
# DIAGNOSTIC: Check what embeddings look like with CUSTOM TOKENIZER
sentence_a = 'Your contribution helped make it possible for us to provide our students with a quality education.'
sentence_b = "Your contributions were of no help with our students' education."

# Test with loaded_model and custom_tokenizer
loaded_model.eval()
inputs_a = custom_tokenizer(sentence_a, return_tensors='pt', max_length=128, truncation=True, padding=True)
inputs_b = custom_tokenizer(sentence_b, return_tensors='pt', max_length=128, truncation=True, padding=True)

inputs_a['input_ids'] = inputs_a['input_ids'].to(device)
inputs_a['attention_mask'] = inputs_a['attention_mask'].to(device)
inputs_b['input_ids'] = inputs_b['input_ids'].to(device)
inputs_b['attention_mask'] = inputs_b['attention_mask'].to(device)

segment_ids_a = torch.zeros_like(inputs_a['input_ids']).to(device)
segment_ids_b = torch.zeros_like(inputs_b['input_ids']).to(device)

with torch.no_grad():
    u_hidden = loaded_model.get_last_hidden_state(inputs_a['input_ids'], segment_ids_a)
    v_hidden = loaded_model.get_last_hidden_state(inputs_b['input_ids'], segment_ids_b)
    
    u = mean_pool(u_hidden, inputs_a['attention_mask']).detach().cpu().numpy()
    v = mean_pool(v_hidden, inputs_b['attention_mask']).detach().cpu().numpy()

print(f"Embedding A shape: {u.shape}")
print(f"Embedding B shape: {v.shape}")
print(f"Embedding A first 10 values: {u[0, :10]}")
print(f"Embedding B first 10 values: {v[0, :10]}")
print(f"Are embeddings identical? {np.allclose(u, v)}")
print(f"Max value in A: {u.max():.4f}, Min: {u.min():.4f}")
print(f"Max value in B: {v.max():.4f}, Min: {v.min():.4f}")

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(u, v)[0, 0]
print(f"\nCosine Similarity: {similarity:.4f}")
print("\nNow using CUSTOM TOKENIZER from Task 1!")

In [None]:
# Example usage: Similar meaning between two sentences (using custom tokenizer)
sentence_a = 'Authorities have announced a national holiday today.'
sentence_b = "Authorities have announced that today is a national holiday."
similarity = calculate_similarity(loaded_model, custom_tokenizer, sentence_a, sentence_b, device)
print(f"Cosine Similarity: {similarity:.4f}")

In [None]:
# DIAGNOSTIC: Check what embeddings look like
sentence_a = 'Your contribution helped make it possible for us to provide our students with a quality education.'
sentence_b = "Your contributions were of no help with our students' education."

# Test with loaded_model
loaded_model.eval()
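inputs_a = custom_tokenizer(sentence_a, return_tensors='pt', truncation=True, padding=True)
inputs_b = custom_tokenizer(sentence_b, return_tensors='pt', truncation=True, padding=True)
# the custom tokenizer returns a plain dict, so move each tensor to the device individually
inputs_a = {k: t.to(device) for k, t in inputs_a.items()}
inputs_b = {k: t.to(device) for k, t in inputs_b.items()}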

segment_ids_a = torch.zeros_like(inputs_a['input_ids']).to(device)
segment_ids_b = torch.zeros_like(inputs_b['input_ids']).to(device)

with torch.no_grad():
    u_hidden = loaded_model.get_last_hidden_state(inputs_a['input_ids'], segment_ids_a)
    v_hidden = loaded_model.get_last_hidden_state(inputs_b['input_ids'], segment_ids_b)
    
    u = mean_pool(u_hidden, inputs_a['attention_mask']).detach().cpu().numpy()
    v = mean_pool(v_hidden, inputs_b['attention_mask']).detach().cpu().numpy()

print(f"Embedding A shape: {u.shape}")
print(f"Embedding B shape: {v.shape}")
print(f"Embedding A first 10 values: {u[0, :10]}")
print(f"Embedding B first 10 values: {v[0, :10]}")
print(f"Are embeddings identical? {np.allclose(u, v)}")
print(f"Max value in A: {u.max():.4f}, Min: {u.min():.4f}")
print(f"Max value in B: {v.max():.4f}, Min: {v.min():.4f}")

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(u, v)[0, 0]
print(f"\nCosine Similarity: {similarity:.4f}")

## Task 3. Evaluation and Analysis (1 point)


### 1) Provide the performance metrics (classification report) based on the SNLI or MNLI datasets for the Natural Language Inference (NLI) task.

| | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| entailment | 0.42 | 0.02 | 0.05 | 3486 |
| neutral | 0.33 | 0.75 | 0.46 | 3199 |
| contradiction | 0.33 | 0.25 | 0.28 | 3315 |
| accuracy | | | 0.33 | 10000 |
| macro avg | 0.36 | 0.34 | 0.26 | 10000 |
| weighted avg | 0.36 | 0.33 | 0.26 | 10000 |

Table 1. Sample of Classification Report
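
To read a report like this, it helps to see how the per-class numbers are derived. The pure-Python sketch below computes precision, recall, and F1 for each class from two label lists; `per_class_metrics` is a hypothetical helper and the toy labels are made up for illustration, not taken from the model's predictions.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, computed from raw label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels: 0 = entailment, 1 = neutral, 2 = contradiction
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

results = {}
for cls, name in enumerate(["entailment", "neutral", "contradiction"]):
    results[name] = per_class_metrics(y_true, y_pred, cls)
    p, r, f = results[name]
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With three roughly balanced classes, an accuracy near 0.33 corresponds to chance level.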

In [None]:

from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
pre_trained_model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

In [None]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

In [None]:
pos_sentence = ["Authorities have announced a national holiday today.", "Authorities have announced that today is a national holiday."]
opp_sentence = ["Your contribution helped make it possible for us to provide our students with a quality education.", "Your contributions were of no help with our students' education."]

Positive Sentences

In [None]:
encoded_input = tokenizer(pos_sentence, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = pre_trained_model(**encoded_input)

In [None]:
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

sent_a_emb = sentence_embeddings[0].cpu().numpy().reshape(1, -1)
sent_b_emb = sentence_embeddings[1].cpu().numpy().reshape(1, -1)
cosine_similarity(sent_a_emb, sent_b_emb)[0][0]
print(f"Cosine Similarity Positive sentences: {cosine_similarity(sent_a_emb, sent_b_emb)[0][0]:.4f}")

Opposite Sentences

In [None]:
encoded_input = tokenizer(opp_sentence, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = pre_trained_model(**encoded_input)

In [None]:
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

sent_a_emb = sentence_embeddings[0].cpu().numpy().reshape(1, -1)
sent_b_emb = sentence_embeddings[1].cpu().numpy().reshape(1, -1)
cosine_similarity(sent_a_emb, sent_b_emb)[0][0]
print(f"Cosine Similarity Opposite sentences: {cosine_similarity(sent_a_emb, sent_b_emb)[0][0]:.4f}")

### 2) Discuss any limitations or challenges encountered during the implementation and propose potential improvements or modifications.

NOTE: Make sure to provide proper documentation, including details of the datasets used, hyperparameters,
and any modifications made to the original models.



### <font color="red">ANSWER: </font>



## Task 4. Text similarity - Web Application Development - Develop a simple web application that demonstrates the capabilities of your text-embedding model. (1 point)
1) Develop a simple website with two input boxes for search queries.
2) Utilize a custom-trained sentence transformer model to predict the Natural Language Inference (NLI) label (entailment, neutral, or contradiction).

For example:

- Premise: A man is playing a guitar on stage.
- Hypothesis: The man is performing music.
- Label: Entailment
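
For the web application, the classifier's raw logits have to be turned into a label that can be shown to the user. The sketch below is one way to do that in plain Python; `label_from_logits` is a hypothetical helper, the example logits are made up, and the class order is assumed to match the `classes` list used during evaluation.

```python
import math

# assumed to match the label order used during evaluation
CLASSES = ["entailment", "neutral", "contradiction"]

def label_from_logits(logits):
    """Return the most likely NLI label and its softmax probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

# made-up logits for one premise/hypothesis pair
label, confidence = label_from_logits([2.0, 0.1, -1.0])
print(label, round(confidence, 3))
```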