<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/nlp-for-semantic-search/3_training_sentence_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Training Sentence Transformers the OG Way (with Softmax Loss)

There are several ways of training sentence transformers. One of the most popular (and the approach we will cover) is using Natural Language Inference (NLI) datasets.

NLI focuses on identifying whether one sentence in a pair can be inferred from the other. We will use two such datasets: the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI (MNLI) corpora.

Merging these two corpora gives us 943K sentence pairs (550K from SNLI, 393K from MNLI). All pairs include a `premise` and a `hypothesis`, and each pair is assigned a `label`:

- 0: entailment, i.e. the premise suggests the hypothesis.
- 1: neutral, i.e. the premise and hypothesis could both be true, but they are not necessarily related.
- 2: contradiction, i.e. the premise and hypothesis contradict each other.
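
As a quick illustration of the label scheme above, the mapping can be written as a small lookup table. This is only a sketch: `NLI_LABELS` and `describe_pair` are hypothetical helper names, not part of the `datasets` library, and the sample pair is the first SNLI training row as printed further down in this notebook.

```python
# Hypothetical lookup table for the three NLI classes described above.
NLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe_pair(pair):
    """Return a one-line, human-readable summary of an NLI pair."""
    # Any label outside 0-2 (such as -1) is reported as "unlabeled".
    name = NLI_LABELS.get(pair["label"], "unlabeled")
    return f'{name}: "{pair["premise"]}" / "{pair["hypothesis"]}"'


example = {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": 1,
}
print(describe_pair(example))
```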

**Reference**:

https://www.pinecone.io/learn/train-sentence-transformers-softmax/

##Setup

In [3]:
!pip -q install datasets
!pip -q install transformers
!pip -q install sentence_transformers



In [20]:
import datasets

from transformers import BertTokenizer
from transformers import BertModel
from transformers.optimization import get_linear_schedule_with_warmup

from sentence_transformers import InputExample

import torch
from torch.utils.data import DataLoader

import os
from tqdm.auto import tqdm

##NLI Training

When training the model, we feed sentence A (the premise) into BERT, followed by sentence B (the hypothesis) in the next step.

From there, the model is optimized with softmax loss, using the `label` field as the target. We will explain this in more depth soon.

In [None]:
snli = datasets.load_dataset("snli", split="train")

In [6]:
snli

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})

In [7]:
print(snli[0])

{'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1}


In [None]:
m_nli = datasets.load_dataset("glue", "mnli", split="train")

In [9]:
m_nli

Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 392702
})

In [None]:
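# drop the GLUE-specific index column so both datasets share the same columns
m_nli = m_nli.remove_columns(["idx"])
# align SNLI's feature types with MNLI's so the two can be concatenated
snli = snli.cast(m_nli.features)
dataset = datasets.concatenate_datasets([snli, m_nli])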

In [13]:
dataset

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 942854
})

Both datasets contain `-1` values in the label feature where no confident class could be assigned. We remove them using the `filter` method.

In [14]:
print(len(dataset))

# -1 marks pairs where no class could be decided, so we remove them
dataset = dataset.filter(lambda x: x["label"] != -1)
print(len(dataset))

942854


942069
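
To see what the filter does without loading the full corpus, here is a minimal sketch on a handful of made-up rows, using plain Python lists in place of a `Dataset`. The names `rows` and `keep_labeled` are illustrative only; the notebook itself applies an equivalent predicate through `dataset.filter`.

```python
from collections import Counter

# Made-up rows standing in for the merged dataset; -1 marks "no label decided".
rows = [
    {"premise": "p1", "hypothesis": "h1", "label": 0},
    {"premise": "p2", "hypothesis": "h2", "label": -1},
    {"premise": "p3", "hypothesis": "h3", "label": 2},
    {"premise": "p4", "hypothesis": "h4", "label": 1},
    {"premise": "p5", "hypothesis": "h5", "label": -1},
]


def keep_labeled(row):
    """Mirror the filter above: keep only rows whose label is not -1."""
    return row["label"] != -1


kept = [row for row in rows if keep_labeled(row)]
# Count how many rows remain per class after filtering.
label_counts = Counter(row["label"] for row in kept)

print(len(rows), "->", len(kept))
print(dict(label_counts))
```

Counting labels after filtering is also a cheap sanity check that only classes 0, 1, and 2 remain.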


We must convert our human-readable sentences into transformer-readable tokens, so we go ahead and tokenize our sentences. Both `premise` and `hypothesis` features must be split into their own `input_ids` and `attention_mask` tensors.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [16]:
all_cols = ["label"]

for part in ["premise", "hypothesis"]:
  dataset = dataset.map(
      lambda x: tokenizer(x[part], max_length=128, padding="max_length", truncation=True),
      batched=True
  )
  for col in ["input_ids", "attention_mask"]:
    dataset = dataset.rename_column(col, part + "_" + col)
    all_cols.append(part + "_" + col)

print(all_cols)

['label', 'premise_input_ids', 'premise_attention_mask', 'hypothesis_input_ids', 'hypothesis_attention_mask']
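
To make the roles of `max_length`, `padding`, and `truncation` more concrete, the sketch below reproduces the idea in plain Python. It is a simplified illustration, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, the token ids are made up, and unlike the real BERT tokenizer it does not preserve the closing `[SEP]` token when truncating.

```python
def pad_and_mask(token_ids, max_length=128, pad_id=0):
    """Truncate or right-pad token ids to max_length and build the mask."""
    ids = list(token_ids)[:max_length]   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 = real token the model should attend to
    padding = max_length - len(ids)      # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


# Made-up ids for a short sentence; real ids come from the BERT vocabulary.
input_ids, attention_mask = pad_and_mask([101, 1037, 2711, 102], max_length=8)
print(input_ids)
print(attention_mask)
```

Because every sequence ends up with exactly `max_length` entries, rows can later be stacked into fixed-size tensors without any per-batch padding logic.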


Now, all we need to do is prepare the data to be read into the model. 

To do this, we first convert the `dataset` features into PyTorch tensors and then initialize a data loader which will feed data into our model during training.

In [22]:
# convert dataset features to PyTorch tensors
dataset.set_format(type="torch", columns=all_cols)

# initialize the dataloader
batch_size = 16
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
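
Before training, it can help to check the shapes a data loader produces. The following is a sketch on made-up rows that reuse the column names from above, so it runs without downloading the corpora; `toy_rows` and `toy_loader` are illustrative names, and the all-zero tensors are placeholders rather than real token ids.

```python
import torch
from torch.utils.data import DataLoader

max_length = 128

# Three made-up rows with the same column names and shapes as the real dataset.
toy_rows = [
    {
        "label": torch.tensor(i % 3),
        "premise_input_ids": torch.zeros(max_length, dtype=torch.long),
        "premise_attention_mask": torch.zeros(max_length, dtype=torch.long),
        "hypothesis_input_ids": torch.zeros(max_length, dtype=torch.long),
        "hypothesis_attention_mask": torch.zeros(max_length, dtype=torch.long),
    }
    for i in range(3)
]

# A plain list works as a map-style dataset; the default collate function
# stacks each dictionary entry along a new leading batch dimension.
toy_loader = DataLoader(toy_rows, batch_size=2, shuffle=False)
batch = next(iter(toy_loader))

for name, tensor in batch.items():
    print(name, tuple(tensor.shape))
```

With the real `dataloader`, the same loop over `next(iter(dataloader))` should show a leading dimension equal to `batch_size` for every column.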

And we’re done with data preparation. Let’s move on to the training approach.

##Softmax Loss