<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the model on a validation knowledge corpus

## Generate Corpus

First, we build the corpus of text chunks: we use LlamaIndex to load some financial PDFs, then parse and chunk them into plain-text nodes.

In [1]:
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning

In [1]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

Download Data

In [2]:
TRAIN_FILES = [
    "./data/10k/td1.pdf",
    "./data/10k/td2.pdf",
    "./data/10k/td3.pdf",
    "./data/10k/td4.pdf",
    "./data/10k/td5.pdf",
]
VAL_FILES = [
    "./data/10k/val1.pdf",
    "./data/10k/val2.pdf",
    "./data/10k/val3.pdf",
    "./data/10k/val4.pdf",
    "./data/10k/val5.pdf",
    "./data/10k/val6.pdf",
]

# TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
# VAL_CORPUS_FPATH = "./data/val_corpus.json"

In [3]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes
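To make the chunking step more concrete, here is a minimal pure-Python sketch of sentence-based chunking. It is not how `SentenceSplitter` is implemented (the real splitter counts tokens and handles many more edge cases); `chunk_by_sentences`, `max_words`, and `overlap_sentences` are hypothetical names used only for illustration.

```python
import re


def chunk_by_sentences(text, max_words=40, overlap_sentences=1):
    """Greedily pack whole sentences into chunks of roughly `max_words` words.

    Illustrative only: a sentence is never cut in half, so a chunk can exceed
    `max_words` when a single sentence (plus any overlap) is long.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and sum(len(s.split()) for s in candidate) > max_words:
            chunks.append(" ".join(current))
            # Carry the last sentences over so neighbouring chunks overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Revenue grew in 2021. Costs also rose. Net loss narrowed."
for chunk in chunk_by_sentences(sample, max_words=8):
    print(chunk)
```

The overlap means a sentence near a chunk boundary appears in both neighbouring chunks, so a question about that sentence can still be matched to either one.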

We do a very naive train/val split: the PDFs listed in `TRAIN_FILES` form the train dataset, and the PDFs listed in `VAL_FILES` form the val dataset.
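Because the split is made purely at the file level, it is worth checking that no file ends up on both sides. The sketch below does this for two lists of paths; `overlapping_files` is a hypothetical helper, and the short lists are stand-ins for the full `TRAIN_FILES` and `VAL_FILES`.

```python
from pathlib import Path


def overlapping_files(train_files, val_files):
    """Return the sorted paths that appear in both lists after normalisation."""
    # Path() drops a leading "./", so "./a/x.pdf" and "a/x.pdf" compare equal.
    train = {Path(f).as_posix() for f in train_files}
    val = {Path(f).as_posix() for f in val_files}
    return sorted(train & val)


train_files = ["./data/10k/td1.pdf", "./data/10k/td2.pdf"]
val_files = ["./data/10k/val1.pdf", "./data/10k/val2.pdf"]
assert not overlapping_files(train_files, val_files), "train/val files overlap"
```

This only catches identical paths; two differently named copies of the same document would still leak between the splits.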

In [5]:
%pip install llama-index-readers-file

In [4]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

Loading files ['./data/10k/td1.pdf', './data/10k/td2.pdf', './data/10k/td3.pdf', './data/10k/td4.pdf', './data/10k/td5.pdf']
Loaded 9 docs
Parsing nodes: 100%|██████████| 9/9 [00:00<00:00, 383.41it/s]
Parsed 14 nodes
Loading files ['./data/10k/val1.pdf', './data/10k/val2.pdf', './data/10k/val3.pdf', './data/10k/val4.pdf', './data/10k/val5.pdf', './data/10k/val6.pdf']
Loaded 10 docs
Parsing nodes: 100%|██████████| 10/10 [00:00<00:00, 663.74it/s]
Parsed 16 nodes

### Generate synthetic queries

Now, we use an LLM (gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
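To illustrate what such a dataset looks like, the sketch below builds the three mappings that the evaluation code later in this notebook reads from the dataset (`queries`, `corpus`, `relevant_docs`). `build_qa_dataset` and the toy pairs are hypothetical; this is not the implementation of `generate_qa_embedding_pairs`, which additionally calls the LLM to write the questions.

```python
import uuid


def build_qa_dataset(pairs):
    """Turn (question, chunk_text) pairs into queries / corpus / relevant_docs maps."""
    queries, corpus, relevant_docs = {}, {}, {}
    chunk_ids = {}  # chunk text -> corpus id, so each chunk is stored once
    for question, chunk_text in pairs:
        if chunk_text not in chunk_ids:
            chunk_ids[chunk_text] = str(uuid.uuid4())
            corpus[chunk_ids[chunk_text]] = chunk_text
        query_id = str(uuid.uuid4())
        queries[query_id] = question
        # Each generated question has exactly one relevant chunk: its context.
        relevant_docs[query_id] = [chunk_ids[chunk_text]]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


toy_pairs = [
    ("What drove revenue growth?", "Chunk A: placeholder text about revenue."),
    ("How did revenue change?", "Chunk A: placeholder text about revenue."),
    ("What are the main risk factors?", "Chunk B: placeholder text about risks."),
]
dataset = build_qa_dataset(toy_pairs)
print(len(dataset["queries"]), "queries over", len(dataset["corpus"]), "chunks")
```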

In [5]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [6]:
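import os
from getpass import getpass

# Never hardcode an API key in a notebook: read it from the environment,
# or prompt for it if it is not set.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")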

In [7]:
from llama_index.llms.openai import OpenAI


train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes
)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")

100%|██████████| 14/14 [00:20<00:00,  1.43s/it]
100%|██████████| 16/16 [00:20<00:00,  1.31s/it]


In [8]:
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

## Run Embedding Finetuning

In [9]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [11]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-base-en-v1.5",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

In [12]:
finetune_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration: 100%|██████████| 3/3 [01:15<00:00, 25.22s/it]
Iteration: 100%|██████████| 3/3 [01:09<00:00, 23.30s/it]
Epoch: 100%|██████████| 2/2 [02:54<00:00, 87.35s/it]
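The progress bars above report 3 iterations per epoch over 2 epochs. As a rough sanity check, the sketch below shows how that count would arise if each of the 14 training chunks yielded two questions and training used a batch size of 10. Both values are assumptions (neither is set anywhere in this notebook), and `steps_per_epoch` is a hypothetical helper, not part of LlamaIndex.

```python
import math


def steps_per_epoch(num_chunks, questions_per_chunk, batch_size):
    """Return (number of training pairs, optimizer steps per epoch)."""
    num_pairs = num_chunks * questions_per_chunk
    # The last batch may be smaller than batch_size, hence the ceiling.
    return num_pairs, math.ceil(num_pairs / batch_size)


# 14 chunks come from the corpus log; 2 and 10 are assumed values.
num_pairs, steps = steps_per_epoch(num_chunks=14, questions_per_chunk=2, batch_size=10)
print(num_pairs, steps)  # 28 3
```

With so few pairs, each epoch is only a handful of gradient steps, so results on a corpus this small should be read as a demonstration rather than a benchmark.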


In [None]:
%pip install llama-index-llms-huggingface

In [None]:
%pip install llama-index-embeddings-huggingface

In [13]:
embed_model = finetune_engine.get_finetuned_model()

In [14]:
embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000002AD8CCC18B0>, tokenizer_name='test_model', max_length=512, pooling=<Pooling.CLS: 'cls'>, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

In [None]:
# random check
# val_nodes

## Evaluate Finetuned Model

In this section, we evaluate 2 different embedding models:
1. open source `BAAI/bge-base-en-v1.5` (the base model we finetuned from), and
2. our finetuned embedding model.

We evaluate both with the `InformationRetrievalEvaluator` from sentence_transformers.

We show that finetuning on a synthetic (LLM-generated) dataset improves upon an open-source embedding model: in the run below, the score returned by the evaluator rises from about 0.80 to about 0.89.

In [15]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a comprehensive suite of metrics, but we can only run it against sentence_transformers-compatible models (the open-source model and our finetuned model, *not* the OpenAI embedding model).

In [16]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


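def evaluate_st(
    dataset,
    model_id,
    name,
):
    """Score `model_id` on `dataset` with the sentence_transformers IR evaluator."""
    # The dataset exposes id -> text maps plus query id -> relevant doc ids.
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    # Per-metric results are also written as a CSV under `output_path`.
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)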
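For intuition about what a retrieval evaluator measures, the sketch below computes two common retrieval metrics, accuracy@k and MRR@k, from a toy ranking. It is a simplified illustration, not the sentence_transformers implementation; the function names, query ids, and document ids are all hypothetical.

```python
def accuracy_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        1 for qid, docs in ranked.items() if set(docs[:k]) & set(relevant[qid])
    )
    return hits / len(ranked)


def mrr_at_k(ranked, relevant, k):
    """Mean reciprocal rank of the first relevant doc within the top k."""
    total = 0.0
    for qid, docs in ranked.items():
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked)


# Toy example: doc ids sorted by decreasing similarity for each query.
ranked = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d3", "d2", "d1"],
    "q3": ["d1", "d3", "d2"],
}
relevant = {"q1": ["d1"], "q2": ["d2"], "q3": ["d2"]}

print(accuracy_at_k(ranked, relevant, k=1))
print(accuracy_at_k(ranked, relevant, k=3))
print(mrr_at_k(ranked, relevant, k=3))
```

Accuracy@k only asks whether the right chunk appears in the top k, while MRR@k also rewards ranking it closer to the top.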

### Run Evals

Note: this might take a few minutes to run, since we have to embed the corpus and queries.

### BAAI/bge-base-en-v1.5

In [17]:
evaluate_st(val_dataset, "BAAI/bge-base-en-v1.5", name="bge")

0.8015625

### Finetuned

In [18]:
evaluate_st(val_dataset, "test_model", name="finetuned")

0.8880208333333333
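To put the two scores side by side, the short calculation below expresses the change as an absolute and a relative gain. The two numbers are copied from the cell outputs above; the variable names are ours.

```python
baseline_score = 0.8015625  # base model, from the output above
finetuned_score = 0.8880208333333333  # finetuned model, from the output above

absolute_gain = finetuned_score - baseline_score
relative_gain = absolute_gain / baseline_score

print(f"absolute gain: {absolute_gain:.4f}")  # absolute gain: 0.0865
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 10.8%
```

Keep in mind that the validation set here has only 16 chunks, so a difference of this size may not carry over to a larger corpus.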

### Summary of Results

#### InformationRetrievalEvaluator

In [19]:
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)

We can see that embedding finetuning improves results consistently across the suite of eval metrics.

In [None]:
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all