# Fine-tuning

In this notebook, we fine-tune an open-source sentence-transformers embedding model on our synthetically generated dataset.

### Load pretrained model

In [74]:
from sentence_transformers import SentenceTransformer

In [75]:
model_id = "BAAI/bge-small-en"
model = SentenceTransformer(model_id)

In [76]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

### Load data

In [72]:
import json

from torch.utils.data import DataLoader
from sentence_transformers import InputExample

In [94]:
TRAIN_DATASET_FPATH = './data/train_dataset.json'
VAL_DATASET_FPATH = './data/val_dataset.json'

# We use a very small batch size to run this toy example on a local machine.
# This should typically be much larger.
BATCH_SIZE = 10

In [66]:
# Read-only mode is enough here; 'r+' would also request write access.
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r') as f:
    val_dataset = json.load(f)

In [95]:
dataset = train_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

examples = []
for query_id, query in queries.items():
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)
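
The loop above relies on a particular dataset layout: `corpus` maps node IDs to text, `queries` maps query IDs to query strings, and `relevant_docs` maps each query ID to a list of node IDs. The sketch below reproduces the same pairing logic on a made-up dataset. The IDs, the texts, and the `build_pairs` helper are hypothetical, and plain tuples stand in for `InputExample` so the snippet runs without extra dependencies.

```python
# Hypothetical toy dataset mirroring the layout the loop above relies on.
toy_dataset = {
    "corpus": {
        "node-1": "Paris is the capital of France.",
        "node-2": "The Nile flows through Egypt.",
    },
    "queries": {
        "q-1": "What is the capital of France?",
        "q-2": "Which river flows through Egypt?",
    },
    "relevant_docs": {"q-1": ["node-1"], "q-2": ["node-2"]},
}


def build_pairs(dataset):
    """Return (query, positive_text) pairs, using only the first relevant doc per query."""
    pairs = []
    for query_id, query in dataset["queries"].items():
        node_id = dataset["relevant_docs"][query_id][0]
        pairs.append((query, dataset["corpus"][node_id]))
    return pairs


pairs = build_pairs(toy_dataset)
```

Note that only the first entry of each `relevant_docs` list is used, so a query with several relevant nodes still contributes a single positive pair.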

In [78]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

### Define loss

**MultipleNegativesRankingLoss** is a great loss function if you only have positive pairs, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language).

This loss function works well for training embeddings in retrieval setups where you have positive pairs (e.g. (query, relevant_doc)): for each query in a batch of size n, the n-1 documents paired with the other queries serve as negatives.

Performance usually improves with larger batch sizes, since each query is then contrasted against more negatives.

For more details, see:
* [docs](https://www.sbert.net/docs/package_reference/losses.html)
* [paper](https://arxiv.org/pdf/1705.00652.pdf)
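
To make the in-batch negatives idea concrete, here is a minimal pure-Python sketch of the computation, not the library's implementation. It scores every query against every document in the batch and applies cross-entropy with the matching document as the target. The function names and the 2-d embeddings are hypothetical, and the choice of cosine similarity with a scale of 20 is an assumption meant to mirror what we understand to be the library defaults.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i and
    every other doc in the batch acts as a negative.

    `scale` is an assumed temperature-like multiplier on the similarities.
    """
    total = 0.0
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over all docs in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(query_embs)


# Hypothetical 2-d embeddings for a batch of two (query, doc) pairs.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each doc is close to its own query
shuffled_docs = [[0.1, 0.9], [0.9, 0.1]]  # each doc is close to the other query

loss_aligned = in_batch_negatives_loss(query_embs, aligned_docs)
loss_shuffled = in_batch_negatives_loss(query_embs, shuffled_docs)
```

The loss is lower when each query is most similar to its own document than when the pairing is swapped, which is the signal that fine-tuning pushes the model toward.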

In [79]:
from sentence_transformers import losses

In [80]:
loss = losses.MultipleNegativesRankingLoss(model)

### Set up evaluator

We set up an evaluator with the validation split of our dataset to monitor how well the embedding model performs during training.

In [97]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

In [91]:
dataset = val_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
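
For intuition about what a retrieval evaluator measures, the sketch below computes two common ranking metrics by hand: hit rate at k (sometimes called accuracy@k) and mean reciprocal rank at k. This is an illustrative stand-in rather than the evaluator's own code; the helper names, the rankings, and the IDs are all hypothetical.

```python
def hit_rate_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k results."""
    hits = sum(
        1
        for query_id, doc_ids in ranked.items()
        if any(doc_id in relevant[query_id] for doc_id in doc_ids[:k])
    )
    return hits / len(ranked)


def mrr_at_k(ranked, relevant, k):
    """Mean reciprocal rank of the first relevant doc within the top k results."""
    total = 0.0
    for query_id, doc_ids in ranked.items():
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical ground truth and model rankings (best match first).
toy_relevant = {"q-1": ["node-1"], "q-2": ["node-2"]}
toy_ranked = {
    "q-1": ["node-1", "node-3", "node-2"],  # relevant doc at rank 1
    "q-2": ["node-3", "node-2", "node-1"],  # relevant doc at rank 2
}

hit_at_1 = hit_rate_at_k(toy_ranked, toy_relevant, k=1)
mrr_at_3 = mrr_at_k(toy_ranked, toy_relevant, k=3)
```

In this toy case only one of the two queries has its relevant document at rank 1, so the hit rate at 1 is 0.5, while the reciprocal ranks 1 and 1/2 average to an MRR of 0.75.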

### Train loop

The training loop is very straightforward to set up thanks to the high-level model training API in sentence-transformers.
All we need to do is plug in the data loader, loss function, and evaluator that we defined in the previous cells (along with a couple of additional minor settings).

In [96]:
# We train the model for very few epochs in this toy example.
# This should typically be higher for better performance.
EPOCHS = 2

In [92]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='exp_local',
    show_progress_bar=True,
    evaluator=evaluator, 
    evaluation_steps=50,
)

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/67 [00:00<?, ?it/s]

Iteration:   0%|          | 0/67 [00:00<?, ?it/s]
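
The warmup arithmetic in the training cell can be checked by hand. The sketch below recomputes the step counts and models a linear warmup followed by linear decay, on the assumption that this matches the scheduler `fit` uses when none is specified. The `warmup_linear_factor` helper and the example count of 670 are hypothetical; the count is chosen only because, with a batch size of 10, it yields the 67 iterations per epoch shown in the progress bars.

```python
import math


def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


num_examples = 670  # hypothetical; any count from 661 to 670 gives 67 batches
batch_size = 10
epochs = 2

steps_per_epoch = math.ceil(num_examples / batch_size)  # what len(loader) returns
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)  # first 10% of all training steps

factors = [
    warmup_linear_factor(step, warmup_steps, total_steps)
    for step in range(total_steps + 1)
]
```

With these numbers, training runs for 134 steps in total, of which the first 13 are warmup; the multiplier peaks at 1.0 at step 13 and falls back to 0 by the final step.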