# Fine-tune SentenceBERT using MNR
In this experiment, we will learn about SBERT (Sentence-BERT) and walk through a practical demonstration of fine-tuning it on a custom dataset, using the Multiple Negatives Ranking (MNR) loss function.

## SBERT
- SBERT is based on the Siamese neural network architecture.

**Intuition**  
- A Siamese network is a class of neural network that contains two identical sub-networks (same configuration, same parameters and weights).
- Parameter updates are mirrored across both sub-networks.
- A Siamese network measures the similarity of two inputs by comparing their feature vectors.
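
As a rough illustration of the points above, here is a minimal sketch of the Siamese idea in plain Python. The tiny linear encoder and its weights are hypothetical stand-ins for BERT; the only thing the sketch is meant to show is that both inputs pass through the same weights and the resulting vectors are then compared.

```python
import math

# Hypothetical toy encoder: a single weight matrix shared by both "towers".
WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
]

def encode(features):
    """Project a 3-d input to a 2-d embedding using the shared weights."""
    return [sum(w * x for w, x in zip(row, features)) for row in WEIGHTS]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def siamese_similarity(x1, x2):
    """Run both inputs through the same encoder, then compare the embeddings."""
    return cosine(encode(x1), encode(x2))

same = siamese_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = siamese_similarity([1.0, 2.0, 3.0], [-3.0, 0.5, 1.0])
print(same, different)
```

Because there is only one set of weights, any update made while processing one input automatically applies to the other, which is what "mirrored" parameter updates amounts to in practice.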

Let's understand the high-level overview of SBERT architecture.

![Screenshot from 16-02-24 22:01:29](https://github.com/surajkarki66/shortIT/assets/50628520/f62a89c5-220c-4c63-bbd7-2486f60d35e8)

- Sentence-BERT starts from a pre-trained BERT network and only fine-tunes it to yield useful sentence embeddings.
- To fine-tune the model, we build a Siamese network (bi-encoder) and update the weights so that the produced sentence embeddings are semantically meaningful.

## Implementation
- Inspired by [this Pinecone article](https://www.pinecone.io/learn/series/nlp/train-sentence-transformers-softmax/)

**NLI Training**  
- There are several ways of training sentence transformers. One of the most popular (and the approach we will cover) is using Natural Language Inference (NLI) datasets.

- NLI focuses on identifying sentence pairs that infer or do not infer one another. We will use two of these datasets: the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI (MNLI) corpora.

- Combining these two corpora gives 943K sentence pairs (550K from SNLI and 393K from MNLI). Every pair has a premise and a hypothesis, and each pair is assigned a label:
  - 0 (entailment): the premise suggests the hypothesis.
  - 1 (neutral): the premise and hypothesis could both be true, but they are not necessarily related.
  - 2 (contradiction): the premise and hypothesis contradict each other.

- When training the model, we will be feeding sentence A (the premise) into BERT, followed by sentence B (the hypothesis) on the next step.
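
To make the label scheme concrete, here is a small sketch in plain Python. The three rows are made-up examples in the same premise/hypothesis/label shape as the NLI data, not actual dataset entries, and the helper names are hypothetical.

```python
# Label ids as described in the list above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical rows, shaped like the NLI datasets (not real SNLI entries).
rows = [
    {"premise": "A man plays a guitar.", "hypothesis": "A person makes music.", "label": 0},
    {"premise": "A man plays a guitar.", "hypothesis": "The man is famous.", "label": 1},
    {"premise": "A man plays a guitar.", "hypothesis": "Nobody is playing music.", "label": 2},
]

def describe(row):
    """Render a row with its human-readable label name."""
    return f"{row['premise']} -> {row['hypothesis']} [{LABEL_NAMES[row['label']]}]"

for row in rows:
    print(describe(row))

# Entailment pairs are the natural (anchor, positive) candidates for ranking losses.
anchor_positive = [(r["premise"], r["hypothesis"]) for r in rows if r["label"] == 0]
```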

## Load the dataset
For now, let's download only the SNLI dataset. We will use the datasets library from Hugging Face.

In [1]:
!pip install datasets

In [2]:
import datasets

snli = datasets.load_dataset('snli', split='train')
print(snli)

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})


In [3]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("CUDA (GPU) is available")
else:
    device = torch.device("cpu")
    print("CUDA (GPU) is not available, falling back to CPU")

CUDA (GPU) is not available, falling back to CPU


In [4]:
snli[0]

{'premise': 'A person on a horse jumps over a broken down airplane.',
 'hypothesis': 'A person is training his horse for a competition.',
 'label': 1}

In [5]:
print(snli)

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})


Both corpora contain rows labeled -1, which mark pairs where no confident class could be assigned. We remove these rows from the SNLI data using the filter method.

In [16]:
# rows labeled -1 have no agreed-upon class, so we drop them
dataset = snli.filter(
    lambda x: x['label'] != -1
)
print(len(dataset))

549367


Next, we need to tokenize our sentences to translate them from human-readable text into the token IDs that a transformer can read. The input_ids and attention_mask tensors need to be produced separately for the premise and hypothesis features.
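
To make these two tensors concrete, here is a toy sketch of what padding and truncation to a fixed length produce. It does not use BertTokenizer: the whitespace tokenizer, the tiny vocabulary, and the special-token ids are all made up for illustration.

```python
# Hypothetical special-token ids and vocabulary (not BERT's real ones).
PAD, CLS, SEP, UNK = 0, 1, 2, 7
VOCAB = {"a": 3, "person": 4, "on": 5, "horse": 6}

def toy_encode(text, max_length):
    """Whitespace-tokenize, add CLS/SEP, truncate, and pad to max_length."""
    tokens = text.lower().split()
    ids = [CLS] + [VOCAB.get(tok, UNK) for tok in tokens]
    ids = ids[: max_length - 1] + [SEP]   # truncate, leaving room for SEP
    mask = [1] * len(ids)                 # 1 marks a real token
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD] * pad,
        "attention_mask": mask + [0] * pad,  # 0 marks padding
    }

enc = toy_encode("A person on a horse", max_length=8)
print(enc["input_ids"])       # [1, 3, 4, 5, 3, 6, 2, 0]
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask is what later lets the pooling step ignore the padded positions.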

In [7]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [17]:
dataset = dataset.map(
    lambda x: tokenizer(
            x['premise'], max_length=128, padding='max_length',
            truncation=True
        ), batched=True
)

In [18]:
dataset = dataset.rename_column('input_ids', 'anchor_ids')
dataset = dataset.rename_column('attention_mask', 'anchor_mask')
print(dataset)

Dataset({
    features: ['premise', 'hypothesis', 'label', 'anchor_ids', 'token_type_ids', 'anchor_mask'],
    num_rows: 549367
})


In [19]:
dataset = dataset.map(
    lambda x: tokenizer(
            x['hypothesis'], max_length=128, padding='max_length',
            truncation=True
    ), batched=True
)
dataset = dataset.rename_column('input_ids', 'positive_ids')
dataset = dataset.rename_column('attention_mask', 'positive_mask')

dataset = dataset.remove_columns(['premise', 'hypothesis', 'label', 'token_type_ids'])

In [20]:
print(dataset)

Dataset({
    features: ['anchor_ids', 'anchor_mask', 'positive_ids', 'positive_mask'],
    num_rows: 549367
})


Training on the full dataset would take a very long time, even for a single epoch. For demonstration purposes only, we will use a small part of the dataset.

In [21]:
subset_dataset = dataset.select(range(1000))

All that's left to do is get the data ready for the model to read. In order to accomplish this, we first translate the features in the dataset into PyTorch tensors and then build up a data loader to provide data to our model while it is being trained.

In [25]:
import torch

# convert dataset features to PyTorch tensors
subset_dataset.set_format(type='torch', output_all_columns=True)

# initialize the dataloader
batch_size = 16
train_loader = torch.utils.data.DataLoader(
    subset_dataset, batch_size=batch_size, shuffle=True
)

## Model Building
When we train an SBERT model, we don’t need to start from scratch. We begin with an already pretrained BERT model (and tokenizer).

In [26]:
from transformers import BertModel

# start from a pretrained bert-base-uncased model
model = BertModel.from_pretrained('bert-base-uncased')

MNR and softmax loss training approaches use a *siamese*-BERT architecture during fine-tuning. This means that during each step, we pass sentence A (our anchor) through BERT, followed by sentence B (our positive).

![image](https://cdn.sanity.io/images/vr8gru94/production/f570df278a344cd53fca7f045cef4db9b7c81ac9-1920x1080.png)

Because these two sentences are processed separately, it creates a siamese-like network with two identical BERTs trained in parallel. In reality, there is only a single BERT being used twice in each step.

We can extend this further with triplet networks. In the case of triplet networks for MNR, we would pass three sentences: an anchor, its positive, and its negative.

![image](https://cdn.sanity.io/images/vr8gru94/production/b6eb33679dc0961b6f6f5d7a58be466bbfd0f5de-1920x1080.png)

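However, we are not using triplet networks here, so each training example is just an (anchor, positive) pair built from a premise and its hypothesis. Note that the filter applied earlier only removed the rows labeled -1. For MNR training proper, one would typically also drop the neutral (label 1) and contradiction (label 2) rows, so that every positive is a genuine entailment of its anchor.
Triplet networks use the same logic but with an added sentence. For MNR loss, this extra sentence is the negative for the anchor.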

In [27]:
# define mean pooling function
def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover 768-dimension embeddings
    in_mask = attention_mask.unsqueeze(-1).expand(
        token_embeds.size()
    ).float()

    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(
        in_mask.sum(1), min=1e-9
    )
    return pool

Here we take BERT's token embeddings output (we'll see this all in full soon) and the sentence's attention_mask tensor. We then expand the attention_mask to match the 768-dimensional shape of the token embeddings.

We apply this expanded mask, in_mask, to the token embeddings to exclude padding tokens from the mean pooling operation. Mean pooling averages over the tokens for each of the 768 dimensions, producing a single value per dimension. For each sentence, this reduces the token embeddings from (128 × 768), since we padded to max_length=128, to a single 768-dimensional vector.
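
A tiny worked example may help here. The sketch below redoes the same masked average with plain Python lists and 2-dimensional toy embeddings instead of 768-dimensional tensors; the function name mean_pool_lists and the numbers are illustrative only.

```python
def mean_pool_lists(token_embeds, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeds[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeds, attention_mask):
        for j in range(dim):
            totals[j] += vec[j] * keep      # padded tokens contribute 0
    count = max(sum(attention_mask), 1e-9)  # same guard as the clamp above
    return [t / count for t in totals]

# Three token positions; the last one is padding and must be ignored.
token_embeds = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool_lists(token_embeds, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would drag the average far away from the two real tokens.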

## Loss function
**Multiple Negatives Ranking Loss**



In [32]:
# define cosine sim layer
cos_sim = torch.nn.CosineSimilarity()
loss_func = torch.nn.CrossEntropyLoss()
scale = 20.0  # we multiply similarity score by this scale value

First, we calculate the cosine similarity between each anchor embedding (a) and all of the positive embeddings in the same batch (p).

From here, we produce a vector of cosine similarity scores (of size batch_size) for each anchor embedding a_i (or size 2 * batch_size for triplets). Each anchor should share the highest score with its positive pair, p_i.

To optimize for this, we use a set of increasing label values (0, 1, 2, ...) to mark where the highest score should be for each a_i, and apply categorical cross-entropy loss.
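
The computation can be sketched without any tensors. The function below is a plain-Python illustration, not the training code: it takes a ready-made matrix of cosine similarities (row i holds the scores of anchor a_i against every positive in the batch), scales it, and applies cross-entropy with the target for row i set to column i. The two similarity matrices are invented for the example.

```python
import math

def mnr_loss(sim_matrix, scale=20.0):
    """Mean cross-entropy over rows; the target for row i is column i."""
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s * scale for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Each anchor scores highest with its own positive (the diagonal).
good = [[0.90, 0.10, 0.00],
        [0.20, 0.80, 0.10],
        [0.00, 0.10, 0.95]]

# Each anchor scores highest with someone else's positive.
bad = [[0.10, 0.90, 0.00],
       [0.80, 0.20, 0.10],
       [0.95, 0.10, 0.00]]

print(mnr_loss(good), mnr_loss(bad))
```

The loss is small when the diagonal dominates each row and large otherwise, so minimizing it pulls every anchor toward its own positive and away from the other positives in the batch.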

In [29]:
from transformers.optimization import get_linear_schedule_with_warmup


# initialize the optimizer
optim = torch.optim.Adam(model.parameters(), lr=2e-5)

# and set up a warmup for the first ~10% of steps
# (one epoch here; multiply by the number of epochs if training longer)
total_steps = len(train_loader)
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
    optim, num_warmup_steps=warmup_steps,
    # num_training_steps is the total step count, warmup included
    num_training_steps=total_steps
)
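
To see what a linear warmup schedule does to the learning rate, here is a small stand-alone sketch. It reproduces, as far as I understand it, the piecewise-linear multiplier behind get_linear_schedule_with_warmup; the function name and the step counts are illustrative, so treat it as an approximation of the idea rather than the library's code.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # ramp up linearly from 0 to 1
        return step / max(1, warmup_steps)
    # then decay linearly from 1 down to 0 at total_steps
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

base_lr = 2e-5
total_steps, warmup_steps = 63, 6  # illustrative values

schedule = [
    base_lr * linear_warmup_multiplier(s, warmup_steps, total_steps)
    for s in range(total_steps + 1)
]
print(schedule[0], schedule[warmup_steps], schedule[-1])
```

The learning rate starts at zero, peaks at the base value when warmup ends, and returns to zero on the final step.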

## Train the model


In [30]:
EPOCHS = 1

In [31]:
model = model.to(device)

In [34]:
from tqdm.auto import tqdm

# 1 epoch should be enough, increase if wanted
for epoch in range(EPOCHS):
    model.train()  # make sure model is in training mode
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # zero all gradients on each new step
        optim.zero_grad()
        # prepare batches and move all to the active device
        anchor_ids = batch['anchor_ids'].to(device)
        anchor_mask = batch['anchor_mask'].to(device)
        pos_ids = batch['positive_ids'].to(device)
        pos_mask = batch['positive_mask'].to(device)
        # extract token embeddings from BERT
        a = model(
            anchor_ids, attention_mask=anchor_mask
        )[0]  # all token embeddings
        p = model(
            pos_ids, attention_mask=pos_mask
        )[0]
        # get the mean pooled vectors
        a = mean_pool(a, anchor_mask)
        p = mean_pool(p, pos_mask)

        # calculate the cosine similarities
        scores = torch.stack([
            cos_sim(
                a_i.reshape(1, a_i.shape[0]), p
            ) for a_i in a])

        # get label(s) - we could define this before if confident of consistent batch sizes
        labels = torch.arange(len(scores), dtype=torch.long, device=scores.device)

        # and now calculate the loss
        loss = loss_func(scores*scale, labels)

        # using loss, calculate gradients and then optimize
        loss.backward()
        optim.step()
        # update learning rate scheduler
        scheduler.step()
        # update the tqdm progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Here, we train for just one epoch. On the full dataset this should be sufficient in practice, as it is consistent with the original SBERT paper's description. The last step is to save the model.

In [37]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [38]:
import os

model_path = '/content/drive/MyDrive/Colab Notebooks/sbert_mnr_scratch'

# create the target directory (and any missing parents) if needed
os.makedirs(model_path, exist_ok=True)

model.save_pretrained(model_path)

## Fine-tuning using the sentence-transformers library
The sentence-transformers library has excellent support for those of us who just want to train a model without worrying about the underlying training mechanisms.

For this approach, we have to change the way the data will be input to the model.

In [39]:
# download
snli = datasets.load_dataset('snli', split='train')

In [40]:
# and remove bad rows
snli = snli.filter(
    lambda x: x['label'] != -1
)

Now we’re ready to format our data for sentence-transformers. All we do is convert the current premise, hypothesis, and label format into an almost matching format with the InputExample class.

In [41]:
!pip install -U sentence-transformers


In [42]:
subset_snli = snli.select(range(1000))

In [43]:
from sentence_transformers import InputExample, losses
from tqdm.auto import tqdm

train_samples = []
for row in tqdm(subset_snli):
    train_samples.append(InputExample(
        texts=[row['premise'], row['hypothesis']],
        label=row['label']
    ))

Next, we set up the data loader.

In [44]:
# note: this import shadows the Hugging Face datasets module imported earlier
from sentence_transformers import datasets

batch_size = 32

loader = datasets.NoDuplicatesDataLoader(
    train_samples, batch_size=batch_size)

Our InputExample contains just our a and p sentence pairs, which we then feed into the NoDuplicatesDataLoader object. This data loader ensures that each batch is duplicate-free — a helpful feature when ranking pair similarity across randomly sampled pairs with MNR loss.
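
The idea of duplicate-free batches can be sketched in a few lines of plain Python. This is only an illustration of the concept under simplifying assumptions, not how NoDuplicatesDataLoader is implemented: the function name is hypothetical, the sentence pairs are made up, and a final partial batch is simply yielded as-is.

```python
def no_duplicate_batches(pairs, batch_size):
    """Yield batches in which no sentence text appears more than once."""
    pending = list(pairs)
    while pending:
        batch, seen, leftover = [], set(), []
        for pair in pending:
            texts = set(pair)
            if len(batch) < batch_size and not (texts & seen):
                batch.append(pair)
                seen |= texts
            else:
                # conflicts with this batch (or batch is full): try again later
                leftover.append(pair)
        yield batch
        pending = leftover

# Hypothetical (anchor, positive) pairs with repeated sentences.
pairs = [
    ("A dog runs.", "An animal moves."),
    ("A dog runs.", "A pet is outside."),
    ("A child sings.", "A kid makes sound."),
    ("A cat sleeps.", "An animal moves."),
]

batches = list(no_duplicate_batches(pairs, batch_size=2))
for b in batches:
    print(b)
```

This matters for MNR loss because every other positive in the batch is treated as a negative for a given anchor. If the same sentence appeared twice in a batch, a correct match would be penalized as if it were wrong.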

## Model Building

In [45]:
from sentence_transformers import models, SentenceTransformer

bert = models.Transformer('bert-base-uncased')
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[bert, pooler])

model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

In [46]:
from sentence_transformers import losses

loss = losses.MultipleNegativesRankingLoss(model)
print(loss)

MultipleNegativesRankingLoss(
  (model): SentenceTransformer(
    (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
    (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  )
  (cross_entropy_loss): CrossEntropyLoss()
)


Now we’re ready to train the model. We train for a single epoch and warm up for 10% of training as before.

In [47]:
epochs = 1
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='/content/drive/MyDrive/Colab Notebooks/sbert_mnr_lib',
    show_progress_bar=True,
)

## Compare SBERT Models
We're going to test the model on sentence pairs from the validation split of the STS benchmark (STSb).

In [48]:
import datasets

sts = datasets.load_dataset('glue', 'stsb', split='validation')

print(sts)

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1500
})


STSb (or STS benchmark) contains sentence pairs in the features sentence1 and sentence2, each assigned a similarity score from 0 to 5.

Because the similarity scores range from 0 -> 5, we need to normalize them to a range of 0 -> 1. We use map to do this.

In [49]:
sts = sts.map(lambda x: {'label': x['label'] / 5.0})

We're going to use the sentence-transformers evaluation utilities. We first need to reformat the STSb data using the InputExample class, passing the sentence features as texts and the similarity scores as the label.

In [50]:
subset_sts = sts.select(range(100))

In [51]:
from sentence_transformers import InputExample

samples = []
for sample in subset_sts:
    samples.append(InputExample(
        texts=[sample['sentence1'], sample['sentence2']],
        label=sample['label']
    ))

To evaluate the models, we need to initialize the appropriate evaluator object. As we are evaluating continuous similarity scores, we use the EmbeddingSimilarityEvaluator.

In [52]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    samples, write_csv=False
)

In [57]:
from sentence_transformers import SentenceTransformer
from transformers import BertModel

model_b = SentenceTransformer('/content/drive/MyDrive/Colab Notebooks/sbert_mnr_lib')

print(evaluator(model_b))

0.9017779443358193


For the model fine-tuned with sentence-transformers, we obtain a correlation of 0.90 on this 100-pair subset, meaning the model's similarity scores agree well with the scores assigned in STSb.
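
For readers unfamiliar with this kind of score, the sketch below computes a Spearman rank correlation between gold similarity labels and predicted similarities in plain Python. It assumes the evaluator's reported number is a rank correlation of this kind (worth checking against the library documentation), and the gold and predicted values are invented, not taken from the run above.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical gold labels (already scaled to 0..1) and model similarities.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.35, 0.10, 0.30, 0.70, 0.95]

rho = spearman(gold, predicted)
print(round(rho, 3))  # 0.7
```

A rank correlation only asks whether the model orders the pairs the same way the annotators did, so the predicted similarities do not need to match the gold scores numerically.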

## Conclusion
As we have seen, we trained the same model in two different ways and evaluated the result on a small set of sentence pairs. The model's performance remains limited for the following two reasons:
- Only 1000 samples were used for training
- Training ran for only a single epoch