# A4: Do you AGREE?

This assignment guides you through training a BERT-style language model from scratch, with a focus on using text embeddings to capture semantic similarity.

Additionally, we will explore how to adapt the loss function for tasks like Natural Language Inference (NLI) to enhance the model’s ability to understand semantic relationships between texts.

#### Step 0: Prepare Environment - Import Libraries and select device

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # used later as F.gelu
import torch.optim as optim

import datasets, math, re, random, time
from collections import Counter
from tqdm import tqdm

In [2]:
# minimum required torch version for MPS support: 1.12+
torch.__version__

'2.10.0'

In [3]:
# Hugging Face Hub login token
# Make sure to set the HF_TOKEN environment variable in your .env file with your Hugging Face token
import os
HF_TOKEN = os.environ.get("HF_TOKEN")


In [4]:
# universal device selection: use a GPU if available, else the CPU
import torch

def get_device():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached CUDA memory
        return torch.device("cuda")      # NVIDIA GPU
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()  # release cached MPS memory
        return torch.device("mps")       # Apple Silicon GPU
    else:
        # nothing to clear on the CPU
        return torch.device("cpu")

device = get_device()

print(f"Using device: {device}")

Using device: mps


## Task 1. Training BERT from Scratch - Based on Masked Language Model/BERT-update.ipynb, modify as follows: (2 points)

### 1.1) Implement Bidirectional Encoder Representations from Transformers (BERT) from scratch, following the concepts learned in class.

[BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/pdf/1810.04805)

#### Step 1: Data Acquisition

In [5]:
# Hugging Face Hub login token
# Make sure to set the HF_TOKEN environment variable in your .env file with your Hugging Face token
from dotenv import load_dotenv
load_dotenv()

import os
HF_TOKEN = os.environ.get("HF_TOKEN")

Using dataset from [Salesforce wikitext](https://huggingface.co/datasets/Salesforce/wikitext)

Info about Dataset:
wikitext-103-raw-v1
```sh
    Size of downloaded dataset files: 191.98 MB
    Size of the generated dataset: 549.42 MB
    Total amount of disk used: 741.41 MB
```

In [6]:
import os
from datasets import load_dataset

# The data folder is not uploaded to GitHub.
_DATA_PATH = "../data/wikitext-103"
_DATA_FILENAME = os.path.join(_DATA_PATH, "wikitext-103-train.arrow")
os.makedirs(_DATA_PATH, exist_ok=True)

if not os.path.exists(_DATA_FILENAME):
    # Download and save to local folder
    dataset_train = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", cache_dir=_DATA_PATH)
    dataset_valid = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation", cache_dir=_DATA_PATH)
    dataset_test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test", cache_dir=_DATA_PATH)

else:
    # Load from local Parquet file
    from datasets import Dataset
    dataset_train = Dataset.from_parquet(_DATA_FILENAME)
    dataset_valid = Dataset.from_parquet(os.path.join(_DATA_PATH, "wikitext-103-validation.arrow"))
    dataset_test = Dataset.from_parquet(os.path.join(_DATA_PATH, "wikitext-103-test.arrow"))
    print("Loaded datasets from local Parquet files.")

print(f"Dataset size: {len(dataset_train)}")
print(f"Validation set size: {len(dataset_valid)}")
print(f"Test set size: {len(dataset_test)}")



Dataset size: 1801350
Validation set size: 3760
Test set size: 4358


Check dataset 

In [7]:
dataset_train

Dataset({
    features: ['text'],
    num_rows: 1801350
})

In [8]:
dataset_test

Dataset({
    features: ['text'],
    num_rows: 4358
})

In [9]:
dataset_valid

Dataset({
    features: ['text'],
    num_rows: 3760
})

In [10]:
# Display the first 5 entries in the dataset 
# with only the first 80 characters of the text for brevity
[text[:80] for text in dataset_train[:5]['text']]

['',
 ' = Valkyria Chronicles III = \n',
 '',
 ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Va',
 ' The game began development in 2010 , carrying over a large portion of the work ']

#### Step 2: Data Preprocessing

In [None]:
sentences = [re.sub("[.,!?\\-]", '', t.lower()) for t in dataset_train["text"] if t.strip()]

#### Step 3: Tokenization 

Tokenize with spaCy or NLTK.

In [None]:
raw_text = "".join(dataset_train['text'])
print(f"Total characters in raw text: {len(raw_text)}")


The dataset is too large to concatenate into a single string: the result exceeds spaCy's `max_length` limit, a check that exists to prevent memory allocation errors. Instead of concatenating, loop over the dataset and process one entry at a time.
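The per-entry loop described above can be sketched as follows. The helper name `iter_tokenized` and the whitespace `str.split` tokenizer are placeholders chosen for illustration; in this notebook the tokenizer would be spaCy's or NLTK's.

```python
from typing import Callable, Iterable, Iterator, List

def iter_tokenized(texts: Iterable[str],
                   tokenize: Callable[[str], List[str]] = str.split) -> Iterator[List[str]]:
    """Yield one token list per non-empty entry, never building one giant string."""
    for text in texts:
        if text.strip():  # skip blank rows such as '' in WikiText
            yield tokenize(text)

# A few rows shaped like the WikiText entries shown earlier
sample = ["", " = Valkyria Chronicles III = \n", "", " The game began development in 2010 "]
tokenized_sample = list(iter_tokenized(sample))
print(tokenized_sample)
```

Because the helper is a generator, memory use is bounded by the longest single entry rather than by the whole corpus.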

In [None]:
dataset_train["text"][1]

spaCy is slow on a corpus of this size, so NLTK is used instead.

`nltk.download('punkt')` downloads the Punkt tokenizer models. Punkt is NLTK's sentence splitter, and `word_tokenize` relies on it to split text into sentences before splitting each sentence into words. Without these resources, `word_tokenize` raises a `LookupError`.

In [None]:
import nltk

_DOWNLOAD_DIR = "../models/nltk_data"
os.makedirs(_DOWNLOAD_DIR, exist_ok=True)
nltk.download('punkt', download_dir=_DOWNLOAD_DIR)
nltk.download('punkt_tab', download_dir=_DOWNLOAD_DIR)
from nltk.tokenize import word_tokenize


nltk.data.path.append(_DOWNLOAD_DIR)
tokenized_texts = [word_tokenize(text) for text in sentences]

# Example: print the first 5 tokenized samples
for tokens in tokenized_texts[:5]:
    print(tokens)

In [None]:
len(tokenized_texts)

#### Step 4: Build Vocabulary - Numericalization

In [None]:
# making vocabs - numericalization
_PAD_TOKEN = '[PAD]'
_CLS_TOKEN = '[CLS]'
_SEP_TOKEN = '[SEP]'
_MASK_TOKEN = '[MASK]'
# unique tokens across the corpus; sorted so that IDs are reproducible between runs
word_list = sorted({token for sentence in tokenized_texts for token in sentence})
word2id   = {_PAD_TOKEN: 0, _CLS_TOKEN: 1, _SEP_TOKEN: 2, _MASK_TOKEN: 3}


In [None]:
# Build word2id
for i, w in enumerate(word_list):
    word2id[w] = i + 4  # IDs 0-3 are reserved for [PAD], [CLS], [SEP], [MASK]

# Create id2word and vocab_size after word2id is complete
id2word = {i: w for w, i in word2id.items()}
vocab_size = len(word2id)

# Numericalize tokenized_texts
token_list = [
    [word2id[word] for word in sentence if word in word2id]
    for sentence in tokenized_texts
]

In [None]:
len(tokenized_texts)

In [None]:
for i, token in enumerate(token_list[:5]):
    print(f"Sample {i+1} token IDs: {token[:10]}...")  # Print first 10 token IDs for brevity


### 1.2) Train the model on a suitable dataset. Ensure to source this dataset from reputable public databases or repositories. It is imperative to give proper credit to the dataset source in your documentation.


#### Step 5: Data Loader

Next we build the data loader. Each sample needs the inputs for two embeddings, the **token embedding** and the **segment embedding**, followed by masking and padding.

1. **Token embedding** - Given “The cat is walking. The dog is barking”, we add [CLS] and [SEP] to get “[CLS] the cat is walking [SEP] the dog is barking [SEP]”.

2. **Segment embedding** - A segment embedding tells the two sentences apart. For the example above the segment IDs are [0 0 0 0 0 0 1 1 1 1 1]: the first sentence with its [CLS] and [SEP] gets 0, and the second sentence with its closing [SEP] gets 1.

3. **Masking** - As in the original paper, 15% of the tokens in each sequence are selected for prediction. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are left unchanged. The number of selected tokens per sample is capped at `max_mask`.

4. **Padding** - After masking, each sequence is padded. For simplicity, every sequence is padded to a fixed `max_len`.

Note: `positive` and `negative` are counters that keep the batch balanced, so each batch holds `batch_size / 2` pairs of each kind. `positive` counts pairs in which the second sentence really follows the first.
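The four steps above can be traced on the two example sentences with a small standalone sketch. The toy vocabulary, the helper name `build_sample`, and the small `max_len` and `max_mask` values are assumptions made for illustration; they are not the values used in the notebook.

```python
import random

PAD, CLS, SEP, MASK = 0, 1, 2, 3  # special-token IDs, same order as word2id
toy_vocab = {"the": 4, "cat": 5, "is": 6, "walking": 7, "dog": 8, "barking": 9}

def build_sample(tokens_a, tokens_b, max_len=16, max_mask=3, rng=None):
    rng = rng or random.Random(0)
    # 1. token IDs with [CLS] and [SEP]
    input_ids = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # 2. segment IDs: 0 for the first sentence, 1 for the second
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 3. masking: select ~15% of the non-special positions, capped at max_mask
    n_pred = min(max_mask, max(1, round(len(input_ids) * 0.15)))
    candidates = [i for i, t in enumerate(input_ids) if t not in (CLS, SEP)]
    rng.shuffle(candidates)
    masked_pos, masked_tokens = [], []
    for pos in candidates[:n_pred]:
        masked_pos.append(pos)
        masked_tokens.append(input_ids[pos])
        r = rng.random()  # one draw per position gives an exact 80/10/10 split
        if r < 0.8:
            input_ids[pos] = MASK
        elif r < 0.9:
            input_ids[pos] = rng.randint(4, 3 + len(toy_vocab))  # random regular token
        # otherwise the original token is kept
    # 4. padding up to max_len
    n_pad = max_len - len(input_ids)
    input_ids += [PAD] * n_pad
    segment_ids += [0] * n_pad
    return input_ids, segment_ids, masked_tokens, masked_pos

a = [toy_vocab[w] for w in "the cat is walking".split()]
b = [toy_vocab[w] for w in "the dog is barking".split()]
input_ids, segment_ids, masked_tokens, masked_pos = build_sample(a, b)
print(input_ids, segment_ids, masked_tokens, masked_pos, sep="\n")
```

The positions and original tokens are recorded before the input is overwritten, which is what lets the masked-language-model loss compare predictions against the true tokens.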

In [None]:
batch_size = 6
max_mask   = 5     # cap on masked tokens per sample; this is below 15% for long sequences, so raise it if needed
max_len    = 1000  # maximum sequence length the transformer accepts; every sample is padded to this length

In [None]:
from random import randint, randrange
from random import shuffle
from random import random


def make_batch():
    batch = []
    positive = negative = 0
    half = batch_size // 2
    while positive < half or negative < half:
        # Two independently sampled indices are almost never consecutive in a large
        # corpus, so decide up front whether this pair is a positive (b follows a).
        is_next = positive < half and (negative >= half or random() < 0.5)
        tokens_a_index = randrange(len(token_list) - 1)
        tokens_b_index = tokens_a_index + 1 if is_next else randrange(len(token_list))
        if not is_next and tokens_b_index == tokens_a_index + 1:
            continue  # accidental positive; resample
        tokens_a, tokens_b = token_list[tokens_a_index], token_list[tokens_b_index]

        #1. token embedding - add CLS and SEP
        input_ids = [word2id['[CLS]']] + tokens_a + [word2id['[SEP]']] + tokens_b + [word2id['[SEP]']]
        if len(input_ids) > max_len:
            continue  # longer than the position embedding allows; skip this pair

        #2. segment embedding - which sentence is 0 and 1
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)

        #3. masking
        n_pred = min(max_mask, max(1, int(round(len(input_ids) * 0.15))))
        #get all the pos excluding CLS and SEP
        candidates_masked_pos = [i for i, token in enumerate(input_ids) if token != word2id['[CLS]']
                                 and token != word2id['[SEP]']]
        shuffle(candidates_masked_pos)
        masked_tokens, masked_pos = [], []
        #simply loop and mask accordingly
        for pos in candidates_masked_pos[:n_pred]:
            masked_pos.append(pos)
            masked_tokens.append(input_ids[pos])
            r = random()  # a single draw, so the split is exactly 80/10/10
            if r < 0.8:    #80% replace with [MASK]
                input_ids[pos] = word2id['[MASK]']
            elif r < 0.9:  #10% replace with a random non-special token
                input_ids[pos] = randint(4, vocab_size - 1)
            # remaining 10%: keep the original token

        #4. pad the sentence to the max length
        n_pad = max_len - len(input_ids)
        input_ids.extend([0] * n_pad)
        segment_ids.extend([0] * n_pad)

        #5. pad the mask tokens to max_mask
        n_pad = max_mask - len(masked_tokens)
        masked_tokens.extend([0] * n_pad)
        masked_pos.extend([0] * n_pad)

        #6. label the pair as positive or negative
        if is_next:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True])
            positive += 1
        else:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False])
            negative += 1

    return batch
        

In [None]:
batch = make_batch()

In [None]:
len(batch)

In [None]:
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

In [None]:
input_ids.shape, segment_ids.shape, masked_tokens.shape, masked_pos.shape, isNext

In [None]:
masked_tokens

#### Step 6. Model



Recall that BERT only uses the encoder.

BERT has the following components:

- Embedding layers
- Attention Mask
- Encoder layer
- Multi-head attention
- Scaled dot product attention
- Position-wise feed-forward network
- BERT (assembling all the components)

##### 6.1 Embedding

In [None]:
class Embedding(nn.Module):
    def __init__(self):
        super(Embedding, self).__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(max_len, d_model)      # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        #x, seg: (bs, len)
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device)  # same device as the input
        pos = pos.unsqueeze(0).expand_as(x)  # (len,) -> (bs, len)
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

##### 6.2 Attention mask

In [None]:
def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(zero) is PAD token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # batch_size x 1 x len_k(=len_q), one is masking
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

In [None]:
print(get_attn_pad_mask(input_ids, input_ids).shape)

##### 6.3 Encoder

The encoder has two main components: 

- Multi-head Attention
- Position-wise feed-forward network

First let's make the wrapper called `EncoderLayer`

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn       = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size x len_q x d_model]
        return enc_outputs, attn

Let's define the scaled dot-product attention, to be used inside the multi-head attention.

In [None]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(d_k) # scores : [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is one.
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        return context, attn 
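To see what the padding mask does inside the attention above, here is a dependency-free sketch of the same computation on plain Python lists. It is an illustration for a single head and a single sequence, not a replacement for the tensor version; the function names and the toy vectors are assumptions.

```python
import math

def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, key_is_pad):
    """Q, K, V: one vector per position; key_is_pad: one bool per key position."""
    d_k = len(K[0])
    context, attn = [], []
    for q in Q:
        # dot product of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # padded keys get a very negative score, so softmax gives them ~0 weight
        scores = [-1e9 if pad else s for s, pad in zip(scores, key_is_pad)]
        weights = softmax(scores)
        attn.append(weights)
        # weighted sum of the value vectors
        context.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return context, attn

# three positions; the last one plays the role of [PAD]
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
context, attn = scaled_dot_product_attention(X, X, X, key_is_pad=[False, False, True])
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` still sums to 1, but the column that belongs to the padded key carries no weight, so padding never contributes to `context`.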

Let's define the parameters first

In [None]:
n_layers = 6    # number of encoder layers
n_heads  = 8    # number of heads in Multi-Head Attention
d_model  = 768  # Embedding Size
d_ff = 768 * 4  # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2

Here is the multi-head attention.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        # Created once here so their weights are registered and trained;
        # building them inside forward() would re-initialize them on every call.
        self.fc = nn.Linear(n_heads * d_v, d_model)
        self.norm = nn.LayerNorm(d_model)
    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2)  # v_s: [batch_size x n_heads x len_k x d_v]

        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size x n_heads x len_q x len_k]

        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v) # context: [batch_size x len_q x n_heads * d_v]
        output = self.fc(context)
        return self.norm(output + residual), attn # output: [batch_size x len_q x d_model]


Here is the PoswiseFeedForwardNet.

In [None]:
class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch_size, len_seq, d_model) -> (batch_size, len_seq, d_ff) -> (batch_size, len_seq, d_model)
        return self.fc2(F.gelu(self.fc1(x)))


##### 6.4 Putting them together

In [None]:
class BERT(nn.Module):
    def __init__(self):
        super(BERT, self).__init__()
        self.embedding = Embedding()
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
        self.fc = nn.Linear(d_model, d_model)
        self.activ = nn.Tanh()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 2)
        # decoder is shared with embedding layer
        embed_weight = self.embedding.tok_embed.weight
        n_vocab, n_dim = embed_weight.size()
        self.decoder = nn.Linear(n_dim, n_vocab, bias=False)
        self.decoder.weight = embed_weight
        self.decoder_bias = nn.Parameter(torch.zeros(n_vocab))

    def forward(self, input_ids, segment_ids, masked_pos):
        output = self.embedding(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids)
        for layer in self.layers:
            output, enc_self_attn = layer(output, enc_self_attn_mask)
        # output : [batch_size, len, d_model], attn : [batch_size, n_heads, len, len]
        
        # 1. predict next sentence
        # it will be decided by first token(CLS)
        h_pooled   = self.activ(self.fc(output[:, 0])) # [batch_size, d_model]
        logits_nsp = self.classifier(h_pooled) # [batch_size, 2]

        # 2. predict the masked token
        masked_pos = masked_pos[:, :, None].expand(-1, -1, output.size(-1)) # [batch_size, max_mask, d_model]
        h_masked = torch.gather(output, 1, masked_pos) # hidden states at the masked positions [batch_size, max_mask, d_model]
        h_masked  = self.norm(F.gelu(self.linear(h_masked)))
        logits_lm = self.decoder(h_masked) + self.decoder_bias # [batch_size, max_mask, n_vocab]

        return logits_lm, logits_nsp

#### Step 7. Training

In [None]:
num_epoch = 5 # e.g. 500 for a longer run
model = BERT()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# A single batch is reused for every step, so this loop only checks that the loss decreases.
batch = make_batch()
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

for epoch in range(num_epoch):
    optimizer.zero_grad()
    logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)
    #logits_lm: (bs, max_mask, vocab_size) ==> (6, 5, vocab_size)
    #logits_nsp: (bs, yes/no) ==> (6, 2)

    #1. mlm loss
    #logits_lm.transpose: (bs, vocab_size, max_mask) vs. masked_tokens: (bs, max_mask)
    loss_lm = criterion(logits_lm.transpose(1, 2), masked_tokens) # for masked LM; already averaged to a scalar
    #2. nsp loss
    #logits_nsp: (bs, 2) vs. isNext: (bs, )
    loss_nsp = criterion(logits_nsp, isNext) # for sentence classification

    #3. combine loss
    loss = loss_lm + loss_nsp
    # with only a few epochs, report every one of them
    print('Epoch:', '%02d' % (epoch), 'loss =', '{:.6f}'.format(loss.item()))
    loss.backward()
    optimizer.step()

#### Step 8. Inference

In [None]:
# Predict masked tokens and isNext for the third sample of the batch
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(batch[2]))
print([id2word[w.item()] for w in input_ids[0] if id2word[w.item()] != '[PAD]'])

model.eval()
with torch.no_grad():  # no gradients are needed for inference
    logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)
#logits_lm:  (1, max_mask, vocab_size) ==> (1, 5, vocab_size)
#logits_nsp: (1, yes/no) ==> (1, 2)

#predict masked tokens
#argmax over the vocab dim (2), then take the first (and only) sample
pred_tokens = logits_lm.argmax(dim=2)[0].tolist()
#note that zero is the padding we added to masked_tokens
print('masked tokens (words) : ', [id2word[t.item()] for t in masked_tokens[0]])
print('masked tokens list : ', [t.item() for t in masked_tokens[0]])
print('predicted masked tokens (words) : ', [id2word[t] for t in pred_tokens])
print('predicted masked tokens list : ', pred_tokens)

#predict nsp
pred_nsp = logits_nsp.argmax(dim=1)[0].item()
print(pred_nsp)
print('isNext : ', bool(isNext.item()))
print('predict isNext : ', bool(pred_nsp))


### 1.3) Save the trained model weights for later use in Task 2.
NOTE: BERT-update.ipynb supports CUDA.

NOTE: You may refer to the BERT paper¹ and use large corpora such as BookCorpus² or English Wikipedia³. However, you should only use a subset, such as 100k samples, rather than the entire dataset.
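The notebook does not yet include a cell that saves the weights. One common way is to save the model's `state_dict` with `torch.save` and restore it with `load_state_dict`. The sketch below uses a small `nn.Linear` as a stand-in for the trained `BERT` instance, and the file name `bert_scratch.pt` is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for the trained BERT model; any nn.Module is saved the same way.
model = nn.Linear(4, 2)

# Hypothetical file name, written to a temporary folder for this demo.
save_path = os.path.join(tempfile.mkdtemp(), "bert_scratch.pt")
torch.save(model.state_dict(), save_path)

# Later (e.g. in Task 2): rebuild the same architecture, then load the weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(save_path, map_location="cpu"))
print(torch.equal(model.weight, restored.weight))
```

A `state_dict` holds only tensors, so Task 2 also needs the vocabulary (`word2id`) and the hyperparameters (`d_model`, `n_layers`, `max_len`, and so on) to rebuild the same model before loading. For the 100k-sample subset mentioned in the note, `dataset_train.select(range(100_000))` is one option.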

1 https://aclanthology.org/N19-1423.pdf

2 https://huggingface.co/datasets/bookcorpus/bookcorpus

3 https://huggingface.co/datasets/legacy-datasets/wikipedia

4 https://huggingface.co/datasets/snli

5 https://huggingface.co/datasets/glue/viewer/mnli

6 https://aclanthology.org/D19-1410/


## Task 2. Sentence Embedding with Sentence BERT - Implement trained BERT from task 1 with siamese network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. (3 points)

1) Use the SNLI⁴ or MNLI⁵ datasets from Hugging Face, or any dataset related to classification tasks.
2) Reproduce training the Sentence-BERT as described in the paper⁶.
3) Focus on the Classification Objective Function (SoftmaxLoss):

$$o = \mathrm{softmax}\left(W^{T} \cdot (u, v, |u - v|)\right)$$

HINT: You can take a look at how to implement Softmax loss in the file 04 - Huggingface/Appendix - Sentence Embedding/S-BERT.ipynb.
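The classification objective above can be sketched as a small PyTorch module. This is an illustration under stated assumptions, not the reference implementation from S-BERT.ipynb: the names `mean_pool` and `SoftmaxLossHead` are made up here, random tensors stand in for the token-level outputs of the Task 1 encoder, and mean pooling is assumed as the pooling strategy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_pool(token_embeds, attention_mask):
    """Average token embeddings over the non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()              # (bs, len, 1)
    return (token_embeds * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

class SoftmaxLossHead(nn.Module):
    def __init__(self, d_model, n_labels=3):
        super().__init__()
        self.classifier = nn.Linear(3 * d_model, n_labels)   # plays the role of W^T

    def forward(self, u, v, labels):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # (u, v, |u - v|)
        logits = self.classifier(features)
        # cross_entropy applies the softmax internally
        return F.cross_entropy(logits, labels), logits

torch.manual_seed(0)
bs, seq_len, d_model = 4, 7, 16
# Random tensors stand in for the encoder outputs of premise and hypothesis.
premise_tokens = torch.randn(bs, seq_len, d_model)
hypothesis_tokens = torch.randn(bs, seq_len, d_model)
mask = torch.ones(bs, seq_len, dtype=torch.long)
mask[:, 5:] = 0                                              # last two positions are padding

u = mean_pool(premise_tokens, mask)
v = mean_pool(hypothesis_tokens, mask)
labels = torch.tensor([0, 1, 2, 1])                          # one NLI class index per pair

head = SoftmaxLossHead(d_model)
loss, logits = head(u, v, labels)
similarity = F.cosine_similarity(u, v, dim=-1)               # used to compare embeddings
print(loss.item(), logits.shape, similarity.shape)
```

In the siamese setup both sentences pass through the same encoder weights, so only one encoder and one classification head are trained.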


## Task 3. Evaluation and Analysis (1 point)
1) Provide the performance metrics (classification report) based on the SNLI or MNLI datasets for the Natural Language Inference (NLI) task.

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| entailment | 0.42 | 0.02 | 0.05 | 3486 |
| neutral | 0.33 | 0.75 | 0.46 | 3199 |
| contradiction | 0.33 | 0.25 | 0.28 | 3315 |
| accuracy | | | 0.33 | 10000 |
| macro avg | 0.36 | 0.34 | 0.26 | 10000 |
| weighted avg | 0.36 | 0.33 | 0.26 | 10000 |

Table 1. Sample of Classification Report

2) Discuss any limitations or challenges encountered during the implementation and propose potential improvements or modifications.

NOTE: Make sure to provide proper documentation, including details of the datasets used, hyperparameters, and any modifications made to the original models.
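As a reminder of how the numbers in such a report are derived, here is a dependency-free sketch that computes per-class precision, recall, F1, and support from two label lists. The function name `nli_report` and the six toy predictions are made up for illustration; scikit-learn's `sklearn.metrics.classification_report` produces the table layout shown in Table 1.

```python
from collections import Counter

def nli_report(y_true, y_pred, labels):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}
    report["accuracy"] = sum(tp.values()) / len(y_true)
    return report

labels = ["entailment", "neutral", "contradiction"]
y_true = ["entailment", "entailment", "neutral", "neutral", "contradiction", "contradiction"]
y_pred = ["entailment", "neutral", "neutral", "neutral", "neutral", "contradiction"]
report = nli_report(y_true, y_pred, labels)
for name in labels:
    print(name, report[name])
print("accuracy", report["accuracy"])
```

In this toy run the model over-predicts `neutral`, which gives that class perfect recall but only 0.5 precision; the sample report in Table 1 shows the same pattern.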


## Task 4. Text similarity - Web Application Development - Develop a simple web application that demonstrates the capabilities of your text-embedding model. (1 point)
1) Develop a simple website with two input boxes for search queries.
2) Utilize a custom-trained sentence transformer model to predict the Natural Language Inference (NLI) label (entailment, neutral, or contradiction).

For example:
- Premise: A man is playing a guitar on stage.
- Hypothesis: The man is performing music.
- Label: Entailment
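One possible shape for such an app, using only the Python standard library, is sketched below. Everything here is an assumption made for illustration: `predict_nli` is a placeholder that returns a fixed label and must be replaced with a call to the trained model, the label order is assumed, and a framework such as Flask or Streamlit would work equally well.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

LABELS = ["entailment", "neutral", "contradiction"]  # assumed label order

def predict_nli(premise: str, hypothesis: str) -> str:
    """Placeholder: replace the dummy scores with the trained classifier's logits."""
    logits = [0.0, 1.0, 0.0]
    return LABELS[max(range(len(LABELS)), key=logits.__getitem__)]

FORM = """<form method="post">
  <input name="premise" placeholder="Premise">
  <input name="hypothesis" placeholder="Hypothesis">
  <button type="submit">Predict</button>
</form>"""

def render_page(form_body: str = "") -> str:
    """Build the HTML page; form_body is the URL-encoded POST body, if any."""
    fields = parse_qs(form_body)
    premise = fields.get("premise", [""])[0]
    hypothesis = fields.get("hypothesis", [""])[0]
    result = ""
    if premise and hypothesis:
        result = f"<p>Label: {escape(predict_nli(premise, hypothesis))}</p>"
    return f"<html><body>{FORM}{result}</body></html>"

class Handler(BaseHTTPRequestHandler):
    def _send(self, body: str) -> None:
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._send(render_page())

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._send(render_page(self.rfile.read(length).decode("utf-8")))

def serve(port: int = 8000) -> None:
    """Start the demo server; not called automatically."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()

page = render_page("premise=A+man+is+playing+a+guitar+on+stage.&hypothesis=The+man+is+performing+music.")
print(page)
```

Calling `serve()` starts the page on `http://127.0.0.1:8000`. Keeping `render_page` separate from the request handler makes the prediction path easy to test without starting a server.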