## Setup

To complete this guide, you will need to install the following packages:
- transformers
- sentence-transformers
- accelerate
- datasets

You will also need access to a GPU with at least 24 GB of memory.

In [None]:
# Install required libraries (datasets is imported below, so it is installed explicitly)
!pip install "transformers==4.41.2" "sentence-transformers==3.0.1" "accelerate==0.30.0" "datasets"

In [14]:
import torch

from datasets import load_dataset, concatenate_datasets

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim, semantic_search
from sentence_transformers import SentenceTransformerTrainingArguments, SentenceTransformerTrainer
from sentence_transformers.training_args import BatchSamplers

## Problem Definition: FAQ Embeddings

A common use case for LLMs is enabling users to ask questions and retrieve answers from a document corpus. This typically involves generating embeddings for the question and each document, then calculating cosine similarity to identify the document most relevant to the question. However, the retrieved answers may not be ideal because the embeddings weren’t trained on the specifics of your company’s data. To improve results, the embeddings can be fine-tuned on your proprietary data.
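The retrieval step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the three-dimensional vectors are made-up toy values (a real embedding model produces hundreds of dimensions), and the `cosine_similarity` and `retrieve` helpers are hypothetical names written for this sketch, not functions from sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, doc_embeddings, top_k=1):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [
        (idx, cosine_similarity(query_embedding, emb))
        for idx, emb in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values, for illustration only)
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

# The document whose embedding points in the most similar direction wins
best_index, best_score = retrieve(toy_query, toy_docs, top_k=1)[0]
print(best_index, round(best_score, 3))
```

Fine-tuning changes the embeddings themselves, so that questions and their matching documents land closer together under this same similarity measure.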

### Task
In this example, we will fine-tune an open-source embedding model using the philschmid/finanical-rag-embedding-dataset dataset, which includes 7,000 positive text pairs of questions and corresponding context from the 2023_10 NVIDIA SEC Filing. The results in this notebook show that the fine-tuned model achieves an NDCG@10 that is ~7% higher than that of the base model.

*Note: The code in this notebook is adapted from https://www.philschmid.de/fine-tune-embedding-model-for-rag*

### Step 1: Dataset Curation

We first retrieve our dataset from Hugging Face (https://huggingface.co/datasets/philschmid/finanical-rag-embedding-dataset) and perform a train/test split.

In [6]:
# Load dataset from the hub
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")
 
# rename columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")
 
# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))
 
# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)
 
# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

238798

### Step 2: Evaluate Base Model

Now that we have created our dataset, we want to establish a baseline. A baseline provides a reference point against which the performance of the customized model can be measured. By evaluating a pretrained model on our specific dataset, we gain insight into its initial effectiveness and can identify areas for improvement.

For our example, we will use the BAAI/bge-base-en-v1.5 model as our starting point. BAAI/bge-base-en-v1.5 is one of the strongest open embedding models for its size according to the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

The most important metric for us is Normalized Discounted Cumulative Gain (NDCG), a measure of ranking quality. It scores each relevant document according to its position in the ranking, applying a logarithmic discount so that a relevant document contributes more to the score the higher it is ranked.
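To make the logarithmic discount concrete, here is a minimal sketch of NDCG@k in plain Python. It assumes binary relevance labels listed in ranked order and assumes the list contains every relevant document for the query. The helper names `dcg_at_k` and `ndcg_at_k` are hypothetical and written for this illustration; this is not the evaluator's own code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant document per query, as in this notebook's dataset.
# Each list holds the relevance of the results in ranked order.
hit_at_rank_1 = [1, 0, 0, 0, 0]
hit_at_rank_3 = [0, 0, 1, 0, 0]

print(ndcg_at_k(hit_at_rank_1, k=10))
print(ndcg_at_k(hit_at_rank_3, k=10))
```

With a single relevant document, the score reduces to `1 / log2(rank + 1)`, so moving the correct document from rank 3 to rank 1 doubles its contribution.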

For our evaluation corpus, we will use all "documents" from both the train and test splits as retrieval candidates, and then query that corpus with each question in the test set.

In [7]:
model_id = "BAAI/bge-base-en-v1.5"  # Hugging Face model ID. This can be adapted depending on which embeddings model you wish to fine-tune
 
# Load a model
model = SentenceTransformer(model_id, device="cuda" if torch.cuda.is_available() else "cpu")
 
# load test dataset
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])
 
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)  # Our queries (qid => question)
 
# Map each query to its relevant documents (exactly one per query in our case)
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for q_id in queries:
    relevant_docs[q_id] = {q_id}
 
# Create an evaluator
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    score_functions={"cosine": cos_sim}
)




In [8]:
# Evaluate the model
results = evaluator(model)
 
# Uncomment to print the full results
# print(results)
 
# Print the main score
key = "cosine_ndcg@10"
print(f"{key}: {results[key]}")

cosine_ndcg@10: 0.7624707202194887


### Step 3: Fine-Tune Model

We are now ready to fine-tune our model. We will use the [SentenceTransformerTrainer](https://sbert.net/docs/package_reference/sentence_transformer/trainer.html#sentencetransformertrainer), which is a subclass of the Trainer class from the transformers library, together with the SentenceTransformerTrainingArguments class to specify the training parameters.

In this example, we use the MultipleNegativesRankingLoss function. This loss requires only positive (query, document) pairs: for each query, the documents paired with the other queries in the same batch are treated as negative examples.
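The idea behind in-batch negatives can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the function name `in_batch_negatives_loss` and the toy two-dimensional vectors are assumptions made for this example, and the similarity scale of 20 is assumed here to mirror the library's documented default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for anchors[i] is positives[i].

    Every other positive in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, pos) for pos in positives]
        # Log-sum-exp with max subtraction for numerical stability
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two (anchor, positive) pairs with assumed embeddings
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped_positives = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other pair's positive

print(in_batch_negatives_loss(anchors, aligned_positives))
print(in_batch_negatives_loss(anchors, swapped_positives))
```

The loss is near zero when each anchor is most similar to its own positive and large when it is more similar to another pair's positive. This is also why duplicate samples within a batch are undesirable: a duplicate would be treated as a negative for its own match.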

Note that this example performs full-parameter fine-tuning, NOT PEFT/QLoRA. Embedding models are typically much smaller than LLMs (the model used in this example has about 109 million parameters), which makes full-parameter fine-tuning feasible.

In [3]:
# load train dataset again
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

In [9]:
# Define training loss function
train_loss = MultipleNegativesRankingLoss(model)

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-financial-fine-tuned",      # output directory and hugging face model ID
    num_train_epochs=4,                         # number of epochs
    per_device_train_batch_size=32,             # train batch size
    gradient_accumulation_steps=16,             # for a global batch size of 512
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    # tf32=True,                                  # use tf32 precision
    # bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score
    report_to="none"
)

trainer = SentenceTransformerTrainer(
    model=model,  # bge-base-en-v1.5
    args=args,  # training arguments
    train_dataset=train_dataset.select_columns(
        ["positive", "anchor"]
    ),  # training dataset
    loss=train_loss,
    evaluator=evaluator,
)
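The batch-size and scheduler settings in the training arguments can be sanity-checked with a little arithmetic. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, a single GPU is assumed, the total of 100 optimizer steps is a made-up value, and the schedule formula (linear warmup, then cosine decay to zero) is a simplified rendering of what a cosine scheduler with warmup does, not the trainer's own code.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the training arguments above (single GPU assumed)
print(effective_batch_size(per_device_batch_size=32, gradient_accumulation_steps=16))

# Hypothetical run length of 100 optimizer steps, for illustration only
total_steps = 100
for step in (0, 5, 10, 55, 100):
    print(step, lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1))
```

The learning rate climbs during the first 10% of the steps, peaks at the configured `learning_rate`, and then decays smoothly, which is why the scheduler is described as cosine rather than constant.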

In [15]:
# start training; checkpoints are saved to the output directory after each epoch
trainer.train()

In [10]:
# save the best model
trainer.save_model()

### Step 4: Use the Fine-Tuned Model

In the training output above, you'll notice that the NDCG@10 (evaluated using the test set queries) has increased by 4-7% over the base model, showing the impact of fine-tuning!

In the cells below, we show how to use the fine-tuned model to generate an embedding for a new test question. We then compute the cosine similarity between that embedding and the embeddings of the documents in the corpus to find the most similar document to retrieve for the query.

In [15]:
# Example showing how to generate an embedding with the fine-tuned model.
test_embedding = model.encode("What manufacturing strategy does NVIDIA not employ for its products?")

In [33]:
corpus_list = list(corpus.values())
corpus_embeddings = model.encode(corpus_list)

In [30]:
similar_docs = semantic_search(test_embedding, corpus_embeddings)

In [34]:
corpus_list[similar_docs[0][0]['corpus_id']]

'NVIDIA has a platform strategy, bringing together hardware, systems, software, algorithms, libraries, and services to create unique value for the markets we serve.'