This notebook performs experiments on studying the impact of different loss functions on training and evaluating different language models. The models are trained on the Natural Language Inference (NLI) dataset, and tested on the Semantic Textual Similarity (STS) and SentEval (classification) datasets as transfer tasks.

- For this demo, the **all-mpnet-base-v2 (SBERT)** model is trained on NLI dataset using all three loss functions **CoSENT**, **In-Batch Negatives** and **Angle**, and tested on **STS-13** dataset for Spearman's Rank and **Customer Reviews (CR)** dataset from SentEval for classification accuracy.

- The loss functions CoSENT, In-Batch Negatives and Angle are taken from <a href="https://github.com/SeanLee97/AnglE">AnglE</a>, and the Cosine Similarity Loss is modified from <a href="https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss">SBERT</a>.

## **Colab Links**

1. <a href="https://colab.research.google.com/drive/1l7TpcEIr8D0zIOpqILFrxRL5U1QgRDwE">Dataset-Specific Fine-Tuning BERT</a> - This notebook fine-tunes BERT-based language models by training and testing on the generated train and test splits of the individual datasets of STS and SentEval.
2. <a href="https://colab.research.google.com/drive/1pdbY_mck2PjbneenGdK8PVjQiaUDVCzO">Plot Builder</a> - The results generated for STS and SentEval at the bottom of this notebook, will be required to be put inside the **Plot Builder** Colab notebook for it to work. Also, all the combinations of Models, Datasets and Loss Functions will be required to plot correctly. **(In Progress)**
3. <a href="https://colab.research.google.com/drive/1KPLWxoCDb77w9TUlHeQ2sME6ikh2Fe58">Training Llama 2 on NLI</a> - This notebook trains Llama 2 with LoRA on NLI dataset and tests on STS datasets. **(In Progress)**
4. <a href="https://colab.research.google.com/drive/1jLi2X_fLccrmFXP643I4gOfrQ3Nw_5m4">Fine-Tuning Llama 2 using LoRA (Runnable on STS13 dataset only)</a> - This notebook trains Llama 2 with LoRA on the generated train set of STS-13 dataset and tests Spearman's Rank Correlation Coefficient on the generated test set. **(In Progress)**

## Library Installations

In [1]:
! pip install -q datasets transformers accelerate scikit-learn scipy -U

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.7/43.7 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.8/60.8 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m527.3/527.3 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.5/9.5 MB[0m [31m29.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m315.1/315.1 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.4/13.4 MB[0m [31m42.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m41.2/41.2 MB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Imports

In [2]:
from typing import Dict, List, Optional, Union, Any, Tuple
from datasets import load_dataset, concatenate_datasets, Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments, AutoModelForCausalLM, AutoModelForSeq2SeqLM
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.stats import spearmanr
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.linear_model import LogisticRegression
import gzip
import csv

In [3]:
torch.cuda.get_device_name()

'Tesla T4'

## Tokenizer

The Tokenizer method, for now the maximum length is fixed to 512, which is the maximum permissible token length limit for BERT models. This needs to be set to the length of the longest sentence in the batch...

1. For **Classification Task**, there is only one column, `text1` containing the sentences and their corresponding `label` column. To work with the loss functions in this system, the sentences need to be tokenized and processed in pairs. Hence, the contents of the `text1` column needs to be duplicated into another column `text2` where an example will then contain 2 sentence columns (`text1` and `text2`) and a `label` column. For example,

```python
text1 = ["Sentence1", "Sentence2", "Sentence3", "Sentence4", "Sentence5"]
label = [0, 1, 0, 1, 0]
```

needs to be converted into the following format before tokenization:

```python
text1 = ["Sentence1", "Sentence2", "Sentence3", "Sentence4", "Sentence5"]
text2 = ["Sentence1", "Sentence2", "Sentence3", "Sentence4", "Sentence5"]
label = [0, 1, 0, 1, 0]
```

where the `text2` column is just a duplicate of the column, `text1`

2. In case of **STS task**, this step is not required, since it already contains a sentence pair and a label for each example. They are already in the following format:

```python
text1 = ["Sentence1", "Sentence2", "Sentence3", "Sentence4", "Sentence5"]
text2 = ["Sentence1*", "Sentence2*", "Sentence3*", "Sentence4*", "Sentence5*"]
label = [0.2, 1.2, 2.0, 3.8, 0.4]
```
where `"Sentence1"` and `"Sentence1*"` are the sentence pairs of the same example

The output of the tokenizer will be the **tokens** for sentence pairs along with a `separate_id` demarcating `"Sentence1"` and `"Sentence1*"`.

```python
text1 = ["Sentence1", "Sentence2", ...]
text2 = ["Sentence1*", "Sentence2*", ...]
label = [0, ...]

# The tokens will look like
tokens = [
  {
    "input_ids": [102, 2019, 2093, 2910, 2, 0, 28823, 29371, 5738, 2],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "separate_ids": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "label": [0]
  }, # sentence_1_1*_token
  {
    "input_ids": [102, 2019, 2093, 2910, 2, 0, 28823, 29371, 5738, 2],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "separate_ids": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "label": [1]
  }, #sentence_2_2*_token
  ...
]
```

Hence, for each example, the token generated by the tokenizer will contain both the tokenized sentences and can be expressed as `sentence_1_1*_token`, `sentence_2_2*_token`, etc. which will be fed to the data collator, later in the system.

In [4]:
class CustomDataTokenizer:
    def __init__(self, tokenizer, max_length = 512):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, data: Dict) -> Dict:
        text_columns = ['text1', 'text2']

        tokens_list = []
        for text_column in text_columns:
            tokens_list.append(self.tokenizer(data[text_column], max_length=self.max_length, truncation=True))

        token = {}
        seperate_ids = []
        for i, t in enumerate(tokens_list):
            # Input IDs and Attention Masks are in "t"...
            for key, val in t.items():
                if i == 0:
                    token[key] = val
                else:
                    token[key] += val
                if key == 'input_ids':
                    seperate_ids += [i] * len(val)

        token['labels'] = [int(data['label']) if 'label' in data else -1]
        token['seperate_ids'] = seperate_ids

        return token

## Data Collator

The custom data collator processes the data passed on from the tokenizer, applies padding and processes the batches to be fed to the model for training. The list of tokenized inputs is converted into torch tensors in batches.

The tokenized inputs containing tokenized sentence pairs for an example are expressed as `sentence_1_1*_token`, `sentence_2_2*_token`, etc. as discussed before.

The collator performs the transformation of data from:

```python
text1 = [sentence_1_1*_token, sentence_2_2*_token, sentence_3_3*_token, sentence_4_4*_token, sentence_5_5*_token]
labels = [0, 1, 2, 3, 0]
```

where `sentence_1_1*_token`, `sentence_2_2*_token`, ... are the combined tokenized outputs from the tokenizer containing both sentences of `text1` and `text2` for each example, demarcated by `separate_id`,

To:

```python
text = [sentence_1_token, sentence_1*_token, sentence_2_token, sentence_2*_token,sentence_3_token, sentence_3*_token, sentence_4_token, sentence_4*_token, sentence_5_token, sentence_5*_token]
labels = [0, 0, 1, 1, 2, 2, 4, 4, 0, 0]
```

where the sentences have been separated based on their `separate_id` values into separate samples, sharing the same labels which have been duplicated.

In [73]:
# Modified from https://github.com/SeanLee97/AnglE/blob/main/angle_emb/angle.py#L568
class CustomDataCollator:
    tokenizer = None
    padding = 'longest'
    max_length: Optional[int] = 512
    return_tensors: str = "pt"

    def __init__(self, tokenizer_base):
        self.tokenizer = tokenizer_base

    def __call__(self, features: List[Dict], return_tensors: str = "pt") -> Dict[str, torch.Tensor]:
        if return_tensors is None:
            return_tensors = self.return_tensors

        # print("Unprocessed Features: ", features)
        new_features = []
        for feature in features:
            seperate_ids = feature['seperate_ids']
            input_ids = feature['input_ids']
            attention_mask = feature['attention_mask']

            max_seperate_id = max(seperate_ids)
            prev_start_idx = 0
            for seperate_id in range(1, max_seperate_id + 1):
                start_idx = seperate_ids.index(seperate_id)

                new_feature = {}
                new_feature['input_ids'] = input_ids[prev_start_idx:start_idx]
                new_feature['attention_mask'] = attention_mask[prev_start_idx:start_idx]
                new_feature['labels'] = feature['labels']
                new_features.append(new_feature)
                prev_start_idx = start_idx

            new_feature = {}
            new_feature['input_ids'] = input_ids[prev_start_idx:]
            new_feature['attention_mask'] = attention_mask[prev_start_idx:]
            new_feature['labels'] = feature['labels']
            new_features.append(new_feature)

        del features
        features = self.tokenizer.pad(
            {'input_ids': [feature['input_ids'] for feature in new_features]},
            padding=self.padding,
            max_length=self.max_length,
            return_tensors=return_tensors,
        )
        features['attention_mask'] = self.tokenizer.pad(
            {'input_ids': [feature['attention_mask'] for feature in new_features]},
            padding=self.padding,
            max_length=self.max_length,
            return_tensors=return_tensors,
        )['input_ids']

        features['labels'] = torch.Tensor([feature['labels'] for feature in new_features])
        # print("Processed Features: ", features)
        return features

## Losses

### 1. Default Pairwise Cosine Similarity Loss

The `default_cosine_similarity_loss` is a basic similarity-based loss function which calculates a loss value based on the cosine similarity of paired embeddings, and the respective ground truth labels. It normalizes the true similarity scores, computes cosine similarities between paired embeddings, and applies the mean squared error (MSE) function to produce a final loss value.

<br>

**Pairwise Cosine Similarity Loss Equation**:

$$\mathcal{L}_{\text{Cosine MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_{\text{true}} - \cos(X_i, X_j) \right)^2$$

  - ${y_\text{true}}$ - Labelled similarity scores ranging between 0 and 5, which after normalization vary between 0 and 1.
  - ${X_i, X_j}$ - Embeddings of the corresponding sentences ${S_i}$ and ${S_j}$ respectively.
  - ${\cos(X, Y)}$ - Cosine similarity between the sentence embeddings ${X}$ and ${Y}$.

<br>

**Loss Function SBERT Reference**:
<a href="https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#sentence_transformers.losses.CosineSimilarityLoss"> Sentence Transformers/CosineSimilarityLoss</a>

**GitHub Reference**:
<a href="https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/CosineSimilarityLoss.py#L10-L81">Sentence Transformers/CosineSimilarityLoss</a>

<br>

**Implementation**:

The true labels and predicted values are arranged in a paired manner such as `[x[0], x[1], x[2], x[3], ...]`, where `x[0]` and `x[1]` stand for a sentence pair. According to this illustration, the index of `x` is the index of an example, and the true labels are adjusted to fall between 0 and 1.

This format, achieved by the data collator, is required for the dataset to accommodate the definition of the loss function. In the case of the STS task, pairs of sentences along with their corresponding similarity scores are provided.

1. **Function Definition and Parameters**:
  - **Parameters**:
    - `y_true`: A tensor of ground truth labels in a specific paired style where each pair of true values is structured sequentially. This represents the similarity score corresponding to each sentence pair. Shape: `(batch_size, 1)`
    - `y_pred`: A tensor of model predictions (embedding vectors) also in the mentioned paired style, where each pair of predicted values is structured sequentially. Shape: `(batch_size, 2 * embedding_vector_length)`, where `embedding_vector_length` depends on the model.
    - `tau`: A scaling factor (default is 1.0), but not used in this specific implementation.
  - **Returns**:
    - A tensor representing the loss value.

  - **Semantic Textual Similarity (STS) Task**:
  The STS datasets consist of sentence pairs and their corresponding similarity scores. For example, the datasets are arranged in the following way:
  ```python
  text1 = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4", "Sentence 5"]
  text2 = ["Sentence 1*", "Sentence 2*", "Sentence 3*", "Sentence 4*", "Sentence 5*"]
  label = [0.2, 1.2, 2.0, 3.8, 0.4]
  ```
  
  After tokenization, collation, and passing through the model, the `y_true` and `y_pred` will be transformed into the following format to accommodate the definition of the loss function:

  ```python
  y_true: [0.2, 1.2, 2.0, 3.8, 0.4]
  y_pred: [emb_1, emb_1*, emb_2, emb_2*, emb_3, emb_3*, emb_4, emb_4*, emb_5, emb_5*]
  ```

2. **Normalizing `y_true`:**
  ```python
  y_true = y_true / 5.0
  y_true = y_true[::2, 0]
  ```
  This normalizes the labelled similarity scores to a range of `0` to `1`, since the the labels vary between `0` and `5`. It also selects every second element from `y_true` starting from the first element, since that is how the data had been arranged before entry into the loss function, where the 0th and 1st labels will be same as they belong to the sentence 1 and sentence 2 of the same data point respectively sharing the same labels, and so on.

3. **Splitting y_pred into Pairs**:
  ```python
  y_pred1 = y_pred[0::2]
  y_pred2 = y_pred[1::2]
  ```

  `y_pred1` and `y_pred2` refer to the predicted embeddings of sentences ${S_1}$ and ${S_1^*}$ respectively of a data point or sentence pair.

4. **Computing Cosine Similarity for Pairs**:
  ```python
  cos_sim = F.cosine_similarity(y_pred1, y_pred2)
  ```

  The cosine similarity between the pairs of embeddings is computed. This measures the similarity between two vectors of an inner product space which is determined by the cosine of the angle between the two vectors and determines whether they are pointing in the same direction.

5. **Calculating MSE Loss**:
  ```python
  squared_difference = (y_true - cos_sim) ** 2
  loss = squared_difference.mean()
  ```

  The mean squared error (MSE) loss is computed by taking the squared difference between the true similarity scores (`y_true`) and the calculated cosine similarities (`cos_sim`). The mean of these squared differences is then computed to get the final loss value.

In [6]:
# Modified from https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/CosineSimilarityLoss.py#L10-L81
def default_cosine_similarity_loss(y_true, y_pred, tau=1):
    # Normalizing y_true values to fall between 0 and 1...
    y_true = y_true / 5.0
    y_true = y_true[::2, 0]
    y_pred1 = y_pred[0::2]
    y_pred2 = y_pred[1::2]

    # Calculating the cosine similarity between the pairs of embeddings...
    cos_sim = F.cosine_similarity(y_pred1, y_pred2)

    # MSE loss...
    squared_difference = (y_true - cos_sim) ** 2
    loss = squared_difference.mean()

    return loss

### 2. CoSENT Loss

The `cosent_loss` function also calculates a loss value based on the cosine similarity of paired embeddings, and depends on the respective ground truth labels for maintaining the relative ranking of the data points. It normalizes the embeddings, computes cosine similarities, adjusts these similarities based on the order of true labels, and applies the $\text{log-sum-exp}$ function to produce a final loss value.

<br>

**CoSENT Loss Equation**:

  $$ \mathcal{L}_{\text{CoSENT}} = \log \left[ 1 + \sum_{s(X_i, X_j) > s(X_m, X_n)} e^{\frac{\cos(X_m, X_n) - \cos(X_i, X_j)}{\tau}} \right] $$
  
  - ${X_i, X_j, X_m, X_n}$ - Embeddings of the corresponding sentences ${S_i, S_j, S_m}$ and ${S_n}$ respectively.
  - ${s(X_i, X_j), s(X_m, X_n)}$ - Provided similarity scores between the sentence pairs ${(X_i, X_j)}$ and ${(X_m, X_n)}$ respectively.
  - ${\cos(X, Y)}$ - Cosine similarity between the sentence embeddings ${X}$ and ${Y}$.
  - ${\tau}$ - Temperature hyperparameter.

<br>

**Reference**:
    Huang, X., Peng, H., Zou, D., Liu, Z., Li, J., Liu, K., Wu, J., Su, J., & Yu, P. S. (2024). CoSENT: Consistent Sentence Embedding via Similarity Ranking. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://doi.org/10.1109/TASLP.2024.3402087

<br>

**Implementation**:
<br>

The true labels and predicted values are arranged in a "zigzag" manner such as `[x[0][0], x[0][1], x[1][0], x[1][1], ...]`, where `(x[0][0], x[0][1])` stands for a sentence pair. According to this illustration, the first index of `x` is the index of an example and the second index of `x` refers to `text1` denoted by '0' and `text2` denoted by 1.

This format, achieved by the data collator, is required for the dataset to accommodate the definition of the loss function. In case of classification task, `x[0][0]` and `x[0][1]` refer to the same sentence (`text1` and `text2`), having the same labels, and in case of STS, `x[0][0]` stands for `text1` and `x[0][1]` stands for `text2`, also having the same label. This is explained with an example below.

1. **Function Definition and Parameters**:
   - **Parameters**:
     - `y_true`: A tensor of ground truth labels in the mentioned specific "zigzag" style where each pair of true values is structured sequentially. This represents the label corresponding to each sentence. Shape: `(batch_size, 1)`
     - `y_pred`: A tensor of model predictions (embedding vectors) also in the mentioned "zigzag" style, where each pair of predicted values is structured sequentially. Shape: `(batch_size, 2 * embedding_vector_length)`, where `embedding_vector_length` depends on the model. For example, for BERT Base, it is ${768}$ and ${1024}$ in case of BERT Large etc.
     - `tau`: A scaling factor (default is 20.0) representing the temperature hyperparameter which controls the sharpness of the output distribution. A higher value of tau makes the output distribution sharper or makes the model more sensitive to differences between examples, whereas a lower value makes it smoother or less sensitive to differences.
   - **Returns**:
     - A tensor representing the loss value.

  - **Classification Task**:
  For example, in classification tasks, where there is a list of sentences and their corresponding binary labels:

  ```python
  text1 = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4", "Sentence 5"]
  text2 = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4", "Sentence 5"]
  label = [0, 1, 0, 1, 0]
  ```

  After tokenization, collation and passing through the model, the true labels (`y_true`) and predicted value (embedding vectors) (`y_pred`) will be transformed into duplicates in the following way to accommodate the definition of the loss function:

  ```python
  y_true: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
  y_pred: [emb_1, emb_1, emb_2, emb_2, emb_3, emb_3, emb_4, emb_4, emb_5, emb_5]
  ```

  - **STS Task**:
  Such a transformation is not required for STS tasks since they are already sentence pairs (text1 and text2), along with their corresponding labels. Their arrangement is handled by the tokenizer and data collator which form the required zigzag pattern for training.

  For Example, the dataset is arranged in the following way:
  ```python
  text1 = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4", "Sentence 5"]
  text2 = ["Sentence 1*", "Sentence 2*", "Sentence 3*", "Sentence 4*", "Sentence 5*"]
  label = [0.2, 1.2, 2.0, 3.8, 0.4]
  ```

  where `sentence1` (`text1`) and `sentence1*` (`text2`) are sentence pairs of an example with label, `o.2`

  After tokenization, collation and passing through the model, the `y_true` and `y_pred` will be transformed into the following format to accomodate the definition of the loss function:

  ```python
  y_true: [0, 0, 1, 1, 2, 2, 4, 4, 0, 0]
  y_pred: [emb_1, emb_1*, emb_2, emb_2*, emb_3, emb_3*, emb_4, emb_4*, emb_5, emb_5*]
  ```

2. **Reshaping `y_true`**:
   ```python
   y_true = y_true[::2, 0]
   ```
   - This line selects every second element from `y_true` starting from the first element and takes the first value from each pair. Essentially, it reduces the `y_true` tensor to half its length, focusing only on the first element of each pair.

3. **Creating Pairwise Label Matrix**:
   ```python
   y_true = (y_true[:, None] < y_true[None, :]).float()
   ```
   - This creates a pairwise comparison matrix for the ground truth labels. For each pair `(i, j)`, it checks if `y_true[i]` is less than `y_true[j]`. The result is a binary matrix (0 or 1) where each element indicates the order relationship between pairs.

4. **Normalizing `y_pred`**:
   ```python
   y_pred = F.normalize(y_pred, p=2, dim=1)
   ```
   - The predicted vectors (`y_pred`) are normalized to have unit length ($L^2$ norm).

5. **Computing Cosine Similarity for Pairs**:
   ```python
   y_pred = torch.sum(y_pred[::2] * y_pred[1::2], dim=1) * tau
   ```
   - This computes the cosine similarity between the pairs of vectors in `y_pred`. For each pair `(i, i + 1)`, it multiplies the corresponding vectors element-wise, sums them up, and scales by the factor `tau`.

6. **Creating Pairwise Score Differences**:
   ```python
   y_pred = y_pred[:, None] - y_pred[None, :]
   ```
   - This line computes the pairwise differences between the cosine similarity scores obtained in the previous step.

7. **Adjusting Pairwise Differences with `y_true`**:
   ```python
   y_pred = (y_pred - (1 - y_true) * 1e12).view(-1)
   ```
   - The pairwise differences are adjusted by a large negative value (`-1e12`) for pairs that are not in the correct order according to `y_true`. This effectively masks out the incorrect pairs by making their differences very large and negative.

8. **Adding Zero to `y_pred`**:
   ```python
   zero = torch.Tensor([0]).to(y_pred.device)
   y_pred = torch.concat((zero, y_pred), dim=0)
   ```
   - A zero tensor is concatenated to `y_pred` to ensure numerical stability in the next step.

  The $\text{log-sum-exp}$ (LSE) function is used to compute a stable logarithm of the sum of exponentials, which is a common operation in various loss functions. The formula for the LSE is:

  $$\text{log-sum-exp}(x) = \log \left( \sum_i e^{x_i} \right)$$

  This function is sensitive to large negative values in ${x_i}$, which can cause numerical instability or underflow issues. To prevent this, a zero element is added to the vector before applying the $\text{log-sum-exp}$ operation which ensures that there is at least one element in the tensor that does not contribute to the instability.

  - **Creating Zero Tensor**:
    ```python
    zero = torch.Tensor([0]).to(y_pred.device)
    ```
    - This line creates a tensor containing a single zero and moves it to the same device (CPU or GPU) as `y_pred`. This ensures that tensor operations are performed on the same hardware, avoiding device mismatch errors.

  - **Concatenating Zero with `y_pred`**:
    ```python
    y_pred = torch.concat((zero, y_pred), dim=0)
    ```
    - This line concatenates the zero tensor to the beginning of `y_pred`. The result is a new tensor that has the zero element as its first element, followed by all elements of the original `y_pred`.

  Effects of Zero Addition:

  - **Avoiding Underflow**: By adding a zero, it is ensured that the $\text{log-sum-exp}$ computation includes a stable baseline value. Since the exponential of zero is one ${e^0 = 1}$, it does not affect the sum in a significant way but prevents the entire sum from becoming too small (which can cause underflow).
  - **Ensuring Positivity**: In certain cases, especially when all elements in `y_pred` are negative or very small, the sum of exponentials can become exceedingly small. Adding a zero ensures the sum remains positive and within a stable numerical range.

9. **Computing the final Log-Sum-Exp loss value**:
   ```python
   return torch.logsumexp(y_pred, dim=0)
   ```
   - Finally, the $\text{log-sum-exp}$ function is applied to `y_pred`. This operation is used to compute a smooth maximum and is commonly used in loss functions to ensure numerical stability and to handle a large range of values.

In [74]:
# modified from: https://github.com/bojone/CoSENT/blob/124c368efc8a4b179469be99cb6e62e1f2949d39/cosent.py#L79
def cosent_loss(y_true: torch.Tensor, y_pred: torch.Tensor, tau: float = 20.0) -> torch.Tensor:
    # Input preparation...
    y_true = y_true[::2, 0]
    y_true = (y_true[:, None] < y_true[None, :]).float()

    # Normalization of Logits...
    y_pred = F.normalize(y_pred, p=2, dim=1)

    # Cosine Similarity Calculation...
    # y_pred[::2] and y_pred[1::2] select alternating embeddings, assuming they are paired...
    # The dot product of these pairs gives the cosine similarity, scaled by a factor of tau to control the sharpness of similarity scores...
    y_pred = torch.sum(y_pred[::2] * y_pred[1::2], dim=1) * tau

    # Pairwise cosine similarity difference calculation...
    y_pred = y_pred[:, None] - y_pred[None, :]

    y_pred = (y_pred - (1 - y_true) * 1e12).view(-1)

    zero = torch.Tensor([0]).to(y_pred.device)
    y_pred = torch.concat((zero, y_pred), dim=0)
    return torch.logsumexp(y_pred, dim=0)

### 3. In-Batch Negatives Loss

The `in_batch_negative_loss` function function uses in-batch negatives which include the sentences not declared as positive pairs explicitly in a batch, to calculate the loss thus encouraging the model to learn embeddings such that positives are closer together and negatives are farther apart in the embedding space.

<br>

**In-Batch Negative Loss Equation**:

$$ \mathcal{L}_{\text{ibn}} = - \sum_{b} \sum_{i}^{m} \log \left[ \frac{e^{\cos \left( \frac{X_{bi}, X_{bi}^+}{\tau} \right)}}{ \sum_{j}^{N} e^{ \cos \left( \frac{X_{bi}, X_{bj}^+}{\tau} \right)}} \right]$$

- ${b}$ - Batch number
- ${X_{bi}}$ and ${X_{bj}}$ - Embeddings of the corresponding sentences ${S_{bi}}$ and ${S_{bj}}$ respectively.
- ${X_{bi}^+}$ and ${X_{bj}^+}$ - Positive Samples of ${X_{bi}}$ and ${X_{bj}}$
- ${m}$ - Number of Positive Pairs in ${b^{th}}$ batch
- ${N}$ - Batch Size

<br>

**Reference**:
    Tang, Y., Cheng, H., Fang, Y., & Pan, Y. (2022, October). In-Batch Negatives' Enhanced Self-Supervised Learning. In 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 161-166). IEEE. https://doi.org/10.1109/ICTAI.2022.00029

<br>

**Implementation**:
<br>

1. **Function Definition and Parameters**:
   - **Parameters**:
     - `y_true`: A tensor of ground truth labels for the STS-Benchmark dataset, representing similarity scores for sentence pairs. Shape: `(batch_size, 1)`
     - `y_pred`: A tensor of model predictions (embedding vectors) generated after passing through the BERT encoder layers. Shape: `(batch_size, 2 * embedding_vector_length)`
     - `tau`: A scaling factor (default is 20.0) representing the temperature hyperparameter which controls the sharpness of the output distribution.
     - `negative_weights`: A weight for negative samples, defaulting to 0.0.
   - **Returns**:
     - A tensor representing the loss value.

  - **Example**:
  For example, in the STS task, the dataset is arranged in the following way:
  ```python
  text1 = ["Sentence 1", "Sentence 2", "Sentence 3"]
  text2 = ["Sentence 1*", "Sentence 2*", "Sentence 3*"]
  label = [5, 1, 2]
  ```

  After tokenization, collation, and passing through the model, the true labels (`y_true`) and predicted value (embedding vectors) (`y_pred`) will be transformed into the following format to accommodate the definition of the loss function:

  ```python
  y_true: [5, 1, 2]
  y_pred: [emb_1, emb_1*, emb_2, emb_2*, emb_3, emb_3*]
  ```

2. **Creating Target Matrix**:

   ```python
   def make_target_matrix(y_true: torch.Tensor):
       idxs = torch.arange(0, y_pred.shape[0]).int().to(device)
       y_true = y_true.int()
       idxs_1 = idxs[None, :]
       idxs_2 = (idxs + 1 - idxs % 2 * 2)[:, None]

       idxs_1 *= y_true.T
       idxs_1 += (y_true.T == 0).int() * -2

       idxs_2 *= y_true
       idxs_2 += (y_true == 0).int() * -1

       y_true = (idxs_1 == idxs_2).float()
       return y_true
   ```

   - This constructs a matrix to identify positive and negative pairs within the batch.
     - **`idxs = torch.arange(0, y_pred.shape[0]).int().to(device)`**: Creates an array of indices for the batch.
     - **`y_true = y_true.int()`**: Converts `y_true` to integer type.
     - **`idxs_1 = idxs[None, :]`**: Expands `idxs` for broadcasting.
     - **`idxs_2 = (idxs + 1 - idxs % 2 * 2)[:, None]`**: Creates alternating pairs of indices.
     - **`idxs_1 *= y_true.T`**: Multiplies by transposed `y_true` to retain valid pairs.
     - **`idxs_1 += (y_true.T == 0).int() * -2`**: Sets invalid pairs to -2.
     - **`idxs_2 *= y_true`**: Multiplies by `y_true` to retain valid pairs.
     - **`idxs_2 += (y_true == 0).int() * -1`**: Sets invalid pairs to -1.
     - **`y_true = (idxs_1 == idxs_2).float()`**: Creates a binary target matrix where positive pairs are marked with 1 and negative pairs with 0.

3. **Negative Mask**:

   ```python
   neg_mask = make_target_matrix(y_true == 0)
   ```

   - This creates a mask for identifying negative samples within the batch.
     - **`make_target_matrix(y_true == 0)`**: Calls the `make_target_matrix` function with a condition to identify where the true labels are 0 (negative pairs).

4. **Positive Samples Target Matrix**:

   ```python
   y_true = make_target_matrix(y_true)
   ```

   - This creates a target matrix for positive samples, representing the correct pairs within the batch.

5. **Normalization and Similarity Calculation**:

   ```python
   y_pred = F.normalize(y_pred, dim=1, p=2)
   similarities = y_pred @ y_pred.T
   similarities = similarities - torch.eye(y_pred.shape[0]).to(device) * 1e12
   similarities = similarities * tau
   ```

   - **Normalization**:
     - **`y_pred = F.normalize(y_pred, dim=1, p=2)`**: Normalizes the embeddings to unit length (L2 norm) to ensure that the cosine similarity is valid by normalizing the embeddings.

   - **Similarity Calculation**:
     - **`similarities = y_pred @ y_pred.T`**: It computes the cosine similarity between all pairs of embeddings within the batch to calculate the similarities needed for the numerator and denominator in the loss equation.

     $${\cos \left( X_{bi}, X_{bi}^+ \right) \text{,} \cos \left( X_{bi}, X_{bj}^+ \right)}$$

     - **Avoiding Self-Similarity**:
       - **`similarities = similarities - torch.eye(y_pred.shape[0]).to(device) * 1e12`**: It subtracts a large value on the diagonal to avoid self-similarity to ensure that each embedding is not compared with itself.
     - **Scaling**:
       - **`similarities = similarities * tau`**: It scales the similarities by the temperature parameter ${\tau}$ to adjust the sharpness of the distribution, making the model more sensitive to differences between examples.

6. **Adjusting Similarities with Negative Weights**:

   ```python
   if negative_weights > 0:
       similarities += neg_mask * negative_weights
   ```

   - This adjusts the similarities for negative samples if `negative_weights` is specified.
     - **`similarities += neg_mask * negative_weights`**: Adds the negative mask weighted by `negative_weights` to the similarities.

7. **Calculating Loss**:

   The `categorical_crossentropy` function is implemented as:

   ```python
   def categorical_crossentropy(y_true, y_pred):
       return -(F.log_softmax(y_pred, dim=1) * y_true).sum(dim=1)
   ```
   $${ \log \left[ \frac{e^{\cos \left( X_{bi}, X_{bi}^+ \right)}}{ \sum_{j}^{N} e^{ \cos \left( X_{bi}, X_{bj}^+ \right)}} \right] }$$

   - This function computes the loss value.

8. **Calculating Mean Loss**:
   
   The mean loss is calculated by the `.mean()` after the `categorical_crossentropy` function.

   ```python
   return categorical_crossentropy(y_true, similarities).mean()
   ```

   $${- \sum_{b} \sum_{i}^{m}}$$

   Where `m` represents the positive pairs in batch `b`.
   - This calculates the mean loss value and returns the final loss for backpropagation.

In [8]:
def categorical_crossentropy(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    return -(F.log_softmax(y_pred, dim=1) * y_true).sum(dim=1)

# Taken from https://github.com/SeanLee97/AnglE/blob/main/angle_emb/angle.py#L166
def in_batch_negative_loss(y_true: torch.Tensor,
                           y_pred: torch.Tensor,
                           tau: float = 20.0,
                           negative_weights: float = 0.0) -> torch.Tensor:
    device = y_true.device

    def make_target_matrix(y_true: torch.Tensor):
        idxs = torch.arange(0, y_pred.shape[0]).int().to(device)
        y_true = y_true.int()
        idxs_1 = idxs[None, :]
        idxs_2 = (idxs + 1 - idxs % 2 * 2)[:, None]

        idxs_1 *= y_true.T
        idxs_1 += (y_true.T == 0).int() * -2

        idxs_2 *= y_true
        idxs_2 += (y_true == 0).int() * -1

        y_true = (idxs_1 == idxs_2).float()
        return y_true

    neg_mask = make_target_matrix(y_true == 0)

    y_true = make_target_matrix(y_true)

    y_pred = F.normalize(y_pred, dim=1, p=2)
    similarities = y_pred @ y_pred.T
    similarities = similarities - torch.eye(y_pred.shape[0]).to(device) * 1e12
    similarities = similarities * tau

    if negative_weights > 0:
        similarities += neg_mask * negative_weights

    return categorical_crossentropy(y_true, similarities).mean()

### 4. Angle Loss

The `angle_loss` function calculates the angle difference in complex space to address the saturation zone problem in cosine similarity, optimizing the model's ability to distinguish between similar and dissimilar samples effectively.

<br>

**Angle Loss Equation**:

$${\mathcal{L}_{\text{angle}} = \log \left[ 1 + \sum_{s(X_i, X_j) > s(X_m, X_n)} e^{\frac{\Delta \theta_{ij} - \Delta \theta_{mn}}{\tau}} \right]}$$

- ${\Delta \theta_{ij}}$: Angle difference between embeddings ${X_i}$ and ${X_j}$
- ${\Delta \theta_{mn}}$: Angle difference between embeddings ${X_m}$ and ${X_n}$
- ${\tau}$: Temperature hyperparameter

<br>

**Reference**:
    Li, X., & Li, J. (2023). Angle-Optimized Text Embeddings. In Proceedings of the International Conference on Learning Representations (ICLR 2024). https://doi.org/10.48550/arXiv.2309.12871

<br>

**Implementation**:
<br>

1. **Function Definition and Parameters**:
   - **Parameters**:
     - `y_true`: Ground truth labels for the dataset, representing similarity scores for pairs. Shape: `(batch_size, 1)`
     - `y_pred`: Model predictions (embedding vectors). Shape: `(batch_size, 2 * embedding_vector_length)`
     - `tau`: Temperature hyperparameter, default is 1.0.
   - **Returns**:
     - A tensor representing the loss value.

2. **Processing Ground Truth Labels**:
   ```python
   y_true = y_true[::2, 0]
   ```
   - This selects every second element from the ground truth tensor and prepares the ground truth labels for calculating pairwise comparisons.

   ```python
   y_true = (y_true[:, None] < y_true[None, :]).float()
   ```
   - It creates a matrix of pairwise comparisons, indicating which samples are less than others. It is a binary matrix for ${s(X_i, X_j) > s(X_m, X_n)}$ comparisons.

3. **Splitting Predicted Embeddings into Real and Imaginary Parts**:
   ```python
   y_pred_re, y_pred_im = torch.chunk(y_pred, 2, dim=1)
   ```
   - This splits the predicted embeddings into real and imaginary parts by chunking which is dividing ${X_i}$ and ${X_j}$ into their real and imaginary components ${z = a + bi}$ and ${w = c + di}$.

   ```python
   a = y_pred_re[::2]
   b = y_pred_im[::2]
   c = y_pred_re[1::2]
   d = y_pred_im[1::2]
   ```
   - It assigns the real and imaginary parts to variables corresponding to pairs thus preparing the variables for calculating the angle differences ${\Delta \theta_{ij}}$ and ${\Delta \theta_{mn}}$.

4. **Calculating Angle Difference in Complex Space**:
   ```python
   z = torch.sum(c* *2 + d* *2, dim=1, keepdim=True)
   ```
   - It calculates the magnitude of the complex number for normalization which computes ${\sqrt{c^2 + d^2}}$, the denominator in the angle difference calculation.

   ```python
   re = (a * c + b * d) / z
   im = (b * c - a * d) / z
   ```
   - It calculates the real and imaginary parts of the normalized angle difference which implements the real and imaginary parts of the division ${\frac{z}{w}}$ in the complex space.

   ```python
   dz = torch.sum(a**2 + b**2, dim=1, keepdim=True)**0.5
   dw = torch.sum(c**2 + d**2, dim=1, keepdim=True)**0.5
   ```
   - This computes the magnitudes of the embeddings or ${|z|}$ and ${|w|}$ for normalization.

   ```python
   re /= (dz / dw)
   im /= (dz / dw)
   ```
   - It normalizes the real and imaginary parts by their respective magnitudes. This is related to normalizing the angle difference to mitigate the impact of high variance of magnitudes, aligning with ${\Delta \theta_{ij}}$ and ${\Delta \theta_{mn}}$.

5. **Combining and Adjusting Predictions**:
   ```python
   y_pred = torch.concat((re, im), dim=1)
   ```
   - It concatenates the real and imaginary parts thus combining the normalized real and imaginary parts into a single tensor.

   ```python
   y_pred = torch.abs(torch.sum(y_pred, dim=1)) * tau
   ```
   - It computes the absolute value of the sum of the angle differences, scaled by ${\tau}$ which calculates the scaled angle differences, corresponding to ${\frac{\Delta \theta_{ij} - \Delta \theta_{mn}}{\tau}}$ in the loss equation.

   ```python
   y_pred = y_pred[:, None] - y_pred[None, :]
   ```
   - It computes the pairwise differences between angle differences to apply the exponential function in the loss equation.

   ```python
   y_pred = (y_pred - (1 - y_true) * 1e12).view(-1)
   ```
   - It adjusts the pairwise differences based on ground truth labels to emphasize correct pairs by appliying a large negative value to ensure that incorrect pairs do not dominate the loss calculation.

6. **Final Loss Calculation**:
   ```python
   zero = torch.Tensor([0]).to(y_pred.device)
   ```
   - It initializes a zero tensor on the same device as the predictions to ensure a zero baseline for the ${\text{log-sum-exp}}$ calculation.

   ```python
   y_pred = torch.concat((zero, y_pred), dim=0)
   ```
   - It concatenates the zero tensor with the predictions and prepares the tensor for the ${\text{log-sum-exp}}$ operation.

   ```python
   return torch.logsumexp(y_pred, dim=0)
   ```
   - It computes the ${\text{log-sum-exp}}$ of the adjusted predictions to get the final loss value:
   $${\log \left[ 1 + \sum e^{\frac{\Delta \theta_{ij} - \Delta \theta_{mn}}{\tau}} \right]}$$

In [9]:
# Taken from https://github.com/SeanLee97/AnglE/blob/main/angle_emb/angle.py#L117
def angle_loss(y_true: torch.Tensor, y_pred: torch.Tensor, tau: float = 1.0):
    y_true = y_true[::2, 0]
    y_true = (y_true[:, None] < y_true[None, :]).float()

    y_pred_re, y_pred_im = torch.chunk(y_pred, 2, dim=1)
    a = y_pred_re[::2]
    b = y_pred_im[::2]
    c = y_pred_re[1::2]
    d = y_pred_im[1::2]

    z = torch.sum(c**2 + d**2, dim=1, keepdim=True)
    re = (a * c + b * d) / z
    im = (b * c - a * d) / z

    dz = torch.sum(a**2 + b**2, dim=1, keepdim=True)**0.5
    dw = torch.sum(c**2 + d**2, dim=1, keepdim=True)**0.5
    re /= (dz / dw)
    im /= (dz / dw)

    y_pred = torch.concat((re, im), dim=1)
    y_pred = torch.abs(torch.sum(y_pred, dim=1)) * tau
    y_pred = y_pred[:, None] - y_pred[None, :]
    y_pred = (y_pred - (1 - y_true) * 1e12).view(-1)
    zero = torch.Tensor([0]).to(y_pred.device)
    y_pred = torch.concat((zero, y_pred), dim=0)
    return torch.logsumexp(y_pred, dim=0)

### Combination of the Loss Functions with Weights

The `TotalLoss` class combines three different loss functions, each contributing to the total loss based on their respective weights. This allows for adjustment of the importance of each loss component in the overall training objective.

<br>

**Total Loss Equation**:

$${\mathcal{L}_{\text{total}} = w_1 \cdot \mathcal{L}_{\text{cosent}} + w_2 \cdot \mathcal{L}_{\text{ibn}} + w_3 \cdot \mathcal{L}_{\text{angle}}}$$

- ${w_1}$: Weight for the CoSent loss
- ${w_2}$: Weight for the in-batch negative loss
- ${w_3}$: Weight for the angle loss

<br>

**Class Parameters**:
  - `w1`: Weight for the CoSENT loss component, default is 1.0
  - `w2`: Weight for the In-Batch Negatives loss component, default is 1.0
  - `w3`: Weight for the Angle loss component, default is 1.0
  - `cosent_tau`: Temperature parameter for the CoSENT loss, default is 20.0
  - `ibn_tau`: Temperature parameter for the in-batch negative loss, default is 20.0
  - `angle_tau`: Temperature parameter for the angle loss, default is 1.0

In [10]:
class TotalLoss:
    def __init__(self,
                w_cosent: float = 1.0,
                w_ibn: float = 1.0,
                w_angle: float = 1.0,
                cosent_tau: float = 20.0,
                ibn_tau: float = 20.0,
                angle_tau: float = 1.0):
        self.w_cosent = w_cosent
        self.w_ibn = w_ibn
        self.w_angle = w_angle
        self.cosent_tau = cosent_tau
        self.ibn_tau = ibn_tau
        self.angle_tau = angle_tau

    def __call__(self, labels: torch.Tensor, outputs: torch.Tensor) -> torch.Tensor:
        loss = 0.
        if (self.w_cosent == 0 and self.w_ibn == 0 and self.w_angle == 0):
            loss += default_cosine_similarity_loss(labels, outputs)
        if self.w_cosent > 0:
            loss += self.w_cosent * cosent_loss(labels, outputs, self.cosent_tau)
        if self.w_ibn > 0:
            loss += self.w_ibn * in_batch_negative_loss(labels, outputs, self.ibn_tau)
        if self.w_angle > 0:
            loss += self.w_angle * angle_loss(labels, outputs, self.angle_tau)
        return loss

## Pooler

The `Pooler` class provides various strategies for pooling the output of the model, allowing options for how the hidden states are aggregated to form a single representation.

<br>

**Pooling Strategies**:

- `cls`: It uses the CLS token's representation in the last hidden state.
- `cls_avg`: It averages the CLS token's representation with the mean of all tokens' representations.
- `last`: It uses the representation of the last token.
- `avg`: It averages the representations of all tokens, weighted by attention mask.
- `max`: It uses the maximum value of the token representations, weighted by attention mask.
- `all`: It returns the representations of all tokens.
- `specific token index`: If an integer is passed as the pooling strategy input, it uses the representation of a specific token index.

<br>

**Class Parameters**:
  - `model`: The model whose outputs need to be pooled.
  - `pooling_strategy`: Strategy for pooling, can be one of several predefined options or a specific token index. Default is 'cls'.
  - `padding_strategy`: Strategy for padding, can be 'left' or 'right'. Default is 'left'.

In [80]:
class Pooler:
    def __init__(self,
                model,
                tokenizer,
                # ['cls', 'cls_avg', 'last', 'avg', 'max', 'all', 'specific token index']
                pooling_strategy: Optional[Union[int, str]] = 'cls',
                padding_strategy: Optional[str] = 'left'):
        self.model = model
        self.tokenizer = tokenizer
        self.pooling_strategy = pooling_strategy
        self.padding_strategy = padding_strategy

    def __call__(self, inputs) -> Any:
        if self.pooling_strategy == 'last':
            batch_size = inputs['input_ids'].shape[0]
            if self.padding_strategy == 'left':
                sequence_lengths = -1
            else:
                sequence_lengths = inputs["attention_mask"].sum(dim=1) - 1

        outputs = self.model.model.encoder(**inputs).last_hidden_state
        if self.pooling_strategy == 'cls':
            eos_positions = (inputs["input_ids"] == self.tokenizer.eos_token_id).nonzero(as_tuple=True)
            outputs = outputs[eos_positions[0], eos_positions[1], :]
        elif self.pooling_strategy == 'cls_avg':
            outputs = (outputs[:, 0] + torch.mean(outputs, dim=1)) / 2.0
        elif self.pooling_strategy == 'last':
            outputs = outputs[torch.arange(batch_size, device=outputs.device), sequence_lengths]
        elif self.pooling_strategy == 'avg':
            outputs = torch.sum(
                outputs * inputs["attention_mask"][:, :, None], dim=1) / torch.sum(inputs["attention_mask"])
        elif self.pooling_strategy == 'max':
            outputs, _ = torch.max(outputs * inputs["attention_mask"][:, :, None], dim=1)
        elif self.pooling_strategy == 'all':
            return outputs
        elif isinstance(self.pooling_strategy, int) or self.pooling_strategy.isnumeric():
            return outputs[:, int(self.pooling_strategy)]
        return outputs

## Trainer

The custom trainer method extends the `Trainer` method of `HuggingFace Transformers`. It is used to override the `compute_loss` function and include our custom losses for training.

In [13]:
class CustomTrainer(Trainer):
    def __init__(self, pooler: Pooler, loss_kwargs: Optional[Dict] = None, **kwargs):
        super().__init__(**kwargs)
        self.pooler = pooler
        if loss_kwargs is None:
            loss_kwargs = {}
        self.loss_fct = TotalLoss(**loss_kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels", None)
        outputs = self.pooler(inputs)
        loss = self.loss_fct(labels, outputs)
        return (loss, outputs) if return_outputs else loss

## Fit

The `fit` function trains a model using a custom trainer and pooling mechanism. It initializes and configures the training process, including the dataset, model, tokenizer, and various training arguments.

In [14]:
def fit(train_ds,
        model_base,
        tokenizer_base,
        batch_size: int = 32,
        output_dir: Optional[str] = 'chk/new_c',
        epochs: int = 5,
        learning_rate: float = 1e-5,
        warmup_steps: int = 1000,
        logging_steps: int = 10,
        eval_steps: Optional[int] = None,
        save_steps: int = 100,
        save_strategy: str = 'steps',
        save_total_limit: int = 10,
        gradient_accumulation_steps: int = 1,
        fp16: Optional[bool] = None,
        argument_kwargs: Optional[Dict] = None,
        trainer_kwargs: Optional[Dict] = None,
        loss_kwargs: Optional[Dict] = None):

    if argument_kwargs is None:
        argument_kwargs = {}
    if trainer_kwargs is None:
        trainer_kwargs = {}
    callbacks = None

    pooler = Pooler(model_base, tokenizer_base)

    trainer = CustomTrainer(
        pooler=pooler,
        model=model_base,
        train_dataset=train_ds,
        loss_kwargs=loss_kwargs,
        tokenizer=tokenizer_base,
        args=TrainingArguments(
            per_device_train_batch_size=batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            num_train_epochs=epochs,
            learning_rate=learning_rate,
            fp16=fp16,
            logging_steps=logging_steps,
            save_strategy=save_strategy,
            eval_steps=eval_steps,
            save_steps=save_steps,
            output_dir=output_dir,
            save_total_limit=save_total_limit,
            load_best_model_at_end=False,
            ddp_find_unused_parameters=None,
            label_names=['labels', 'seperate_ids', 'extra'],
            **argument_kwargs,
        ),
        callbacks=callbacks,
        data_collator=CustomDataCollator(
            tokenizer_base
        ),
        **trainer_kwargs
    )

    trainer.train()
    return model_base, tokenizer_base, pooler

## Embeddings

The `encode` function processes input text using a specified model, applies a specified pooling strategy, and converts the output to a NumPy array or in other words generates **embeddings** of text, which are fixed-size representations suitable for downstream tasks.

In [15]:
def encode(inputs: Union[List[str], Tuple[str], List[Dict], str],
            model,
            pooler,
            tokenizer,
            max_length: Optional[int] = 512,
            to_numpy: bool = True,
            device: Optional[Any] = 'cuda:0'):
        if device is None:
            device = 'cpu'
        model.to(device)
        model.eval()

        tokens = tokenizer(
            inputs,
            padding='longest',
            max_length=max_length,
            truncation=True,
            return_tensors='pt')
        tokens.to(device)
        with torch.no_grad():
            output = pooler(tokens)
        if to_numpy:
            return output.float().detach().cpu().numpy()
        return output

# Execution

For a fair performance evaluation on how the loss function combinations affect the performance of each model, we train the base models on the **Natural Language Inference (NLI)** dataset, and test their performance by calculating Spearman's Rank Correlation Coefficient for the **Semantic Textual Similarity (STS)** tasks, and classification accuracy for the **SentEval** classification tasks. This is a method of transfer learning.

## Data Import

### Natural Language Inference (NLI) Dataset

We need to download the `AllNLI` dataset provided in the **SBERT** site. It is a concatenation of two datasets `Stanford Natural Language Inference (SNLI) Corpus` and `Multi-Genre Natural Language Inference (MultiNLI) Dataset`, containing a total of around **900,000** samples.

In [16]:
! echo "download AllNLI"
! wget https://sbert.net/datasets/AllNLI.tsv.gz

download AllNLI
--2024-08-31 14:48:59--  https://sbert.net/datasets/AllNLI.tsv.gz
Resolving sbert.net (sbert.net)... 172.67.180.145, 104.21.67.200, 2606:4700:3036::6815:43c8, ...
Connecting to sbert.net (sbert.net)|172.67.180.145|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/AllNLI.tsv.gz [following]
--2024-08-31 14:48:59--  https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/AllNLI.tsv.gz
Resolving public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)... 130.83.167.186
Connecting to public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)|130.83.167.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40794454 (39M) [application/octet-stream]
Saving to: ‘AllNLI.tsv.gz’


2024-08-31 14:54:21 (124 KB/s) - ‘AllNLI.tsv.gz’ saved [40794454/40794454]



In [17]:
def load_all_nli(exclude_neutral=True):
    label_mapping = {
        'entailment': 1,  # '0' (entailment)
        'neutral': 1,
        'contradiction': 0   # '2' (contradiction)
    }
    data = []
    with gzip.open('AllNLI.tsv.gz', 'rt', encoding='utf8') as fIn:
        reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            if row['split'] == 'train' and row['label'] != 'neutral':
                if exclude_neutral and row['label'] == 'neutral':
                    continue
                sent1 = row['sentence1'].strip()
                sent2 = row['sentence2'].strip()
                data.append({'text1': sent1, 'text2': sent2, 'label': label_mapping[row['label']]})
    return data

In [18]:
def preprocess_nli():
    train_data = load_all_nli()
    nli_dataset = {}
    train_ds = Dataset.from_list(train_data)
    nli_dataset['train'] = train_ds
    nli_dataset = DatasetDict(nli_dataset)
    ds_train = nli_dataset['train']
    return ds_train

### SentEval Datasets

The `SentEval` datasets **Movie Reviews (MR)**, **Customer Reviews (CR)**, **Multi-Perspective Question Answering (MPQA)** and **Subjectivity (SUBJ)** are downloaded from `HuggingFace`.

In [None]:
def get_senteval_dataset(dataset_name):
    match dataset_name:
        case 'CR':
            dataset = load_dataset('rahulsikder223/SentEval-CR')
        case 'MPQA':
            dataset = load_dataset('rahulsikder223/SentEval-MPQA')
        case 'MR':
            dataset = load_dataset('rahulsikder223/SentEval-MR')

            # We exclude this sentence since it gives null error...
            dataset = concatenate_datasets([dataset.select(range(0, 231)), dataset.select(range(233, 7463))])
        case 'SUBJ':
            dataset = load_dataset('rahulsikder223/SentEval-SUBJ')

    dataset = dataset.rename_column('sentence', 'text1')
    return dataset

In [None]:
senteval_datasets = ['CR']#['CR', 'MPQA', 'MR', 'SUBJ']

### Semantic Textual Similarity (STS) Datasets

The `Semantic Textual Similarity (STS)` datasets **STS-12** to **STS-16**, **STS-Benchmark** and **SICK-R** are downloaded from the `HuggingFace` website. Since all the datasets do not have a separate `train` split, we use only the `test` splits of the datasets for our evaluation.

In [None]:
def get_sts_dataset(dataset_name):
    match dataset_name:
        case 'STS-B':
            dataset = load_dataset('mteb/stsbenchmark-sts', split='test')
        case 'STS12':
            dataset = load_dataset('mteb/sts12-sts', split='test')
        case 'STS13':
            dataset = load_dataset('mteb/sts13-sts', split='test')
        case 'STS14':
            dataset = load_dataset('mteb/sts14-sts', split='test')
        case 'STS15':
            dataset = load_dataset('mteb/sts15-sts', split='test')
        case 'STS16':
            dataset = load_dataset('mteb/sts16-sts', split='test')
        case 'SICK-R':
            dataset = load_dataset('mteb/sickr-sts', split='test')

    dataset = dataset.rename_column('sentence1', 'text1')
    dataset = dataset.rename_column('sentence2', 'text2')
    dataset = dataset.rename_column('score', 'label')
    return dataset

In [None]:
sts_datasets = ['STS13']#['STS-B', 'STS12', 'STS13', 'STS14', 'STS15', 'STS16', 'SICK-R']

## Loss Function Combinations

We have used the **Default Pariwise Cosine Similarity** loss function, and the 3 custom losses **CoSENT**, **In-Batch Negatives** and **Angle** along with their possible combinations in our experiments. The `get_loss_function_weights` function returns the weights of each loss function to be used in the linear loss combination. The weight value 1 is used in the case of the custom losses to denote which loss function is in use. If all weight values are 0, it means the default cosine similarity loss function is in use.

In [19]:
def get_loss_function_weights(combi):
    match combi:
        case 'CoSENT': return (1, 0, 0)
        case 'In-Batch Negatives': return (0, 1, 0)
        case 'Angle': return (0, 0, 1)
        case 'CoSENT + In-Batch Negatives': return (1, 1, 0)
        case 'CoSENT + Angle': return (1, 0, 1)
        case 'In-Batch Negatives + Angle': return (0, 1, 1)
        case 'CoSENT + In-Batch Negatives + Angle': return (1, 1, 1)
        case 'Default Pairwise Cosine Similarity Loss': return (0, 0, 0)

The list `loss_functions` denotes the list of possible combinations of the loss functions.

In [20]:
loss_functions = [
    # 'CoSENT',
    # 'In-Batch Negatives',
    # 'Angle',
    # 'CoSENT + In-Batch Negatives',
    # 'CoSENT + Angle',
    # 'In-Batch Negatives + Angle',
    'CoSENT + In-Batch Negatives + Angle',
    # 'Default Pairwise Cosine Similarity Loss'
]

## Language Models

We define the list of base language models to be used in the experiment.

In [None]:
models = [
    # 'bert-base-uncased',
    'sentence-transformers/all-mpnet-base-v2',
    # 'princeton-nlp/sup-simcse-roberta-large'
]

## Driver Functions

### Utilities

#### Cosine Similarity

**Cosine Similarity** is a metric used to measure how similar two vectors are by comparing the angle between them. The cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is calculated using the following formula:

$$\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$

In [None]:
def calculate_cosine_similarity(sentence1_vec, sentence2_vec):
    cosine_similarity = np.dot(sentence1_vec, sentence2_vec) / (np.linalg.norm(sentence1_vec) * np.linalg.norm(sentence2_vec))
    return cosine_similarity

#### Spearman's Rank Correlation Coefficient

**Spearman's Rank Correlation Coefficient**, denoted as **$\rho$ (rho)**, is a non-parametric measure of the strength and direction of association between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. Unlike Pearson's correlation, Spearman's does not require the assumption of linearity and can handle ordinal data.

The formula for Spearman's rank correlation coefficient is:
     $$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

Where:

- $n$ is the number of data points.
- $d_i$ is the difference between the ranks of the $i$-th pair
- $\sum d_i^2$ is the sum of the squared differences between the ranks of corresponding values of the two variables.

In [None]:
def calculate_Spearman_rank_correlation_coefficient(scores, scores_actual):
    sc, _ = spearmanr(scores, scores_actual)
    return sc

#### Embedding Generation of Semantic Textual Similarity (STS) and SentEval Datasets

The function `generate_embeddings` generates the embeddings of the STS and SentEval datasets line-by-line

In [None]:
def generate_embeddings(dataset, model, tokenizer, pooler, is_sts=False):
    emb_sentence_1 = []
    for sentence in dataset['text1']:
        emb_sentence_1.append(encode(sentence, model, pooler, tokenizer)[0])
    emb_sentence_1 = np.array(emb_sentence_1)

    if is_sts:
        emb_sentence_2 = []
        for sentence in dataset['text2']:
            emb_sentence_2.append(encode(sentence, model, pooler, tokenizer)[0])
        emb_sentence_2 = np.array(emb_sentence_2)

    return (emb_sentence_1, emb_sentence_2) if is_sts else emb_sentence_1

### Drivers

#### Train NLI

The `train_nli` function uses the `fit` function defined above to train the base model on the **NLI** dataset and prepares it for evaluation.

In [33]:
def train_nli(model_base, tokenizer_base, train_ds, loss_combination):
    model_new, tokenizer_new, pooler = fit(
        train_ds=train_ds,
        model_base=model_base,
        tokenizer_base=tokenizer_base,
        output_dir='chk/c',
        batch_size=30,
        epochs=5,
        learning_rate=2e-5,
        save_steps=0,
        eval_steps=100,
        warmup_steps=0,
        gradient_accumulation_steps=1,
        loss_kwargs={
            'w_cosent': loss_combination['w_cosent'],
            'w_ibn': loss_combination['w_ibn'],
            'w_angle': loss_combination['w_angle'],
            'cosent_tau': 20,
            'ibn_tau': 20,
            'angle_tau': 1.0
        },
        fp16=True,
        logging_steps=1000
    )

    return (model_new, tokenizer_new, pooler)

#### Test STS

Once the model is trained on NLI, the function `test_sts` is used to generate embeddings of the sentence pairs of the STS datasets, calculate cosine similarity between the sentence embedding pairs and finally calculate the Spearman's Rank Correlation Coefficient, to evaluate the performance of the model, trained using a loss function combination.

In [None]:
def test_sts(dataset, model, tokenizer, pooler):
    emb_sentence_1, emb_sentence_2 = generate_embeddings(dataset, model, tokenizer, pooler, is_sts=True)

    cos_score = []
    for i in range(emb_sentence_1.shape[0]):
        cos_score.append(calculate_cosine_similarity(emb_sentence_1[i], emb_sentence_2[i]))

    spearman = calculate_Spearman_rank_correlation_coefficient(cos_score, dataset['label'])
    return spearman

#### Test SentEval

Once the model is trained on NLI, the function `test_senteval` is used to generate embeddings of the sentences in the training and testing sets of the SentEval datasets, a **Logistic Regression** classifier is used to fit the train set embeddings with the sentiment or subjectivity labels (0 or 1), and finally the classification accuracy is obtained from the test set predictions by the classifier.

In [None]:
def test_senteval(dataset, model, tokenizer, pooler):
    emb_sentence_train = generate_embeddings(dataset['train'], model, tokenizer, pooler)
    emb_sentence_test = generate_embeddings(dataset['test'], model, tokenizer, pooler)

    lr = LogisticRegression(max_iter=10000)
    lr.fit(emb_sentence_train, dataset['train']['label'])
    accuracy_score = lr.score(emb_sentence_test, dataset['test']['label'])
    return accuracy_score

#### Main Driver

This is the main **driver function** or **training loop** which performs training of each model, using all the combinations of loss functions on NLI, then evaluates the model's performance on STS and SentEval datasets.

In [None]:
def driver():
    results_matrix_sts = []
    results_matrix_senteval = []

    for model in models:
        # Model Preparation...
        tokenizer_base = AutoTokenizer.from_pretrained(model)
        model_base = AutoModel.from_pretrained(model)

        results_matrix_loss_sts = []
        results_matrix_loss_senteval = []
        for loss in loss_functions:
            # Loss Functions Combination...
            loss_combination = {}
            loss_combination['w_cosent'], loss_combination['w_ibn'], loss_combination['w_angle'] = get_loss_function_weights(loss)

            # Train model on NLI using loss combination...
            nli_dataset = preprocess_nli()
            model_ft, tokenizer_ft, pooler_ft = train_nli(model_base, tokenizer_base, nli_dataset, loss_combination)

            # Test model on STS Datasets...
            results_matrix_dataset_sts = []
            for sts_dataset in sts_datasets:
                dataset = get_sts_dataset(sts_dataset)
                spearman = test_sts(dataset, model_ft, tokenizer_ft, pooler_ft)
                results_matrix_dataset_sts.append(spearman)
            results_matrix_loss_sts.append(results_matrix_dataset_sts)

            # Test model on SentEval Datasets...
            results_matrix_dataset_senteval = []
            for senteval_dataset in senteval_datasets:
                dataset = get_senteval_dataset(senteval_dataset)
                accuracy_score = test_senteval(dataset, model_ft, tokenizer_ft, pooler_ft)
                results_matrix_dataset_senteval.append(accuracy_score)
            results_matrix_loss_senteval.append(results_matrix_dataset_senteval)

        results_matrix_sts.append(results_matrix_loss_sts)
        results_matrix_senteval.append(results_matrix_loss_senteval)

    return (results_matrix_sts, results_matrix_senteval)

### Running

In [None]:
results_matrix_sts, results_matrix_senteval = driver()



Map (num_proc=8):   0%|          | 0/628405 [00:00<?, ? examples/s]

You're using a MPNetTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1000,14.0059
2000,13.691
3000,13.5356
4000,13.4515
5000,13.356
6000,13.3054
7000,13.2792
8000,13.2098
9000,12.6961
10000,12.6729


Downloading readme:   0%|          | 0.00/22.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/60.2k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Downloading readme:   0%|          | 0.00/639 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/171k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/75.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2642 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1133 [00:00<?, ? examples/s]

### Showing the Results

In [None]:
print("STS Spearman Result: ", results_matrix_sts)
print("SentEval Accuracy Result: ", results_matrix_senteval)

STS Spearman Result:  [[[0.8401565386924192]]]
SentEval Accuracy Result:  [[[0.9285083848190644]]]


### Saving the Results

In [None]:
with open('BERT_SentEval_Results.npy', 'wb') as f:
    np.save(f, results_matrix_senteval)

In [None]:
with open('BERT_STS_Results.npy', 'wb') as f:
    np.save(f, results_matrix_sts)

# Generation

In [22]:
model = 'facebook/bart-large'
tokenizer_base = AutoTokenizer.from_pretrained(model)
model_base = AutoModelForSeq2SeqLM.from_pretrained(model)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/1.02G [00:00<?, ?B/s]

In [27]:
nli_dataset = preprocess_nli()
train_ds = nli_dataset.shuffle().map(CustomDataTokenizer(tokenizer_base), num_proc=8)

Map (num_proc=8):   0%|          | 0/628405 [00:00<?, ? examples/s]

In [81]:
loss_combination = {}
loss_combination['w_cosent'], loss_combination['w_ibn'], loss_combination['w_angle'] = get_loss_function_weights(loss_functions[0])
model_ft, tokenizer_ft, pooler_ft = train_nli(model_base, tokenizer_base, train_ds, loss_combination)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


torch.Size([8, 1024])
y_true:  torch.Size([8, 1])
y_pred:  torch.Size([8, 1024])


Step,Training Loss


torch.Size([8, 1024])
y_true:  torch.Size([8, 1])
y_pred:  torch.Size([8, 1024])
torch.Size([8, 1024])
y_true:  torch.Size([8, 1])
y_pred:  torch.Size([8, 1024])
torch.Size([8, 1024])
y_true:  torch.Size([8, 1])
y_pred:  torch.Size([8, 1024])
torch.Size([8, 1024])
y_true:  torch.Size([8, 1])
y_pred:  torch.Size([8, 1024])


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
