# Finetune Embed

---

This notebook uses an evaluation metric to measure the performance of embedding models. The metric, Mean Reciprocal Rank (MRR), essentially measures whether the model has managed to rank the relevant document ahead of irrelevant ones.

The main purpose of this notebook is to examine the finetuning of embedding models. We rely on finetuning modules, so we have less control over the finetuning than with a pure PyTorch implementation. The benefit is that the workflow is condensed into the following steps:

* Compile the target dataset
* Connect to an LLM to create question-answer pairs for testing
* Get the base model and complete the finetuning
* Measure the base model against the finetuned model

## $\color{blue}{Sections:}$
* Admin
* Data
* Model
* Dataset
* Finetune
* Evaluation

---
## $\color{blue}{Admin}$
---

In [None]:
from google.colab import drive

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

Mounted at /content/drive
/content/drive/MyDrive


---
## $\color{blue}{Data}$
---

The training data is the Theft Act 1968, loaded from a PDF.

A second document, the Fraud Act 2006, is brought in later as a test set.

Get the training nodes.

In [None]:
%%capture
!pip install -U -q langchain langchain-community pymupdf

In [None]:
# Get a pdf reader
from langchain_community.document_loaders import PyMuPDFLoader

In [None]:
# Get the pdf from the repo
data = PyMuPDFLoader("RAG_tutorial/Data/theft_act_1968.pdf").load()

In [None]:
%%capture
%pip install -qU langchain-text-splitters

In [None]:
from langchain_text_splitters import CharacterTextSplitter

In [None]:
# split the text
text_splitter = CharacterTextSplitter(
    separator= r"\n.?\d",
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=True,
)

texts = text_splitter.transform_documents(data)
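To get a feel for where the separator pattern `\n.?\d` cuts, the sketch below applies it to a small made-up snippet using Python's `re` module. The sample text is an assumption for illustration, not taken from the Act, and this only shows the raw split points: the splitter itself goes on to merge pieces according to `chunk_size` and `chunk_overlap`.

```python
import re

# Same pattern as the splitter: a newline, an optional single character, then a digit.
SEPARATOR = r"\n.?\d"

# Hypothetical sample text laid out like numbered clauses.
sample = "Preamble text\n1 Basic definition\n(2) Further detail\nplain continuation"

# The substrings the pattern matches.
matches = [m.group() for m in re.finditer(SEPARATOR, sample)]
# The raw pieces left between those matches.
pieces = re.split(SEPARATOR, sample)

print(matches)
print(pieces)
```

In this raw `re.split`, the matched characters (including the clause number) are consumed, and a newline that is not followed by a digit within two characters does not trigger a split.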



In [None]:
from llama_index.core.schema import TextNode

In [None]:
def doc_to_node(doc):
  content = doc.page_content
  node = TextNode(text=content)
  return node

In [None]:
nodes = [doc_to_node(doc) for doc in texts]

---
## $\color{blue}{Model}$
---

We have tested the multilingual BGE embedder, but for the following test we use their small English model.

In [None]:
from getpass import getpass

In [None]:
from huggingface_hub import login
import os

In [None]:
HF_TOKEN = getpass('Your hugging face token:')

Your hugging face token:··········


In [None]:
login(token=HF_TOKEN)
os.environ['HUGGINGFACEHUB_API_TOKEN'] = HF_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
%%capture
%pip install llama-index-embeddings-huggingface

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")



In [None]:
from llama_index.core import Settings

In [None]:
Settings.embed_model = embed_model

---
## $\color{blue}{Dataset}$
---

The evaluation task measures the embedding model's ability to find the relevant section of a document for a given question.

1. The document is split into chunks
2. Questions are generated from the chunks
3. The embedding model tries to identify the chunk in the document that was used to create each question

We now use an LLM to create the questions.

In [None]:
%%capture
!pip install llama-index-finetuning

In [None]:
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

In [None]:
# import os
# import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI password: ")

In [None]:
# from llama_index.llms.openai import OpenAI
# llm = OpenAI(model="gpt-4o")

In [None]:
# # generate
# train_dataset = generate_qa_embedding_pairs(nodes, llm=llm)

In [None]:
#train_dataset.save_json("RAG_tutorial/dataset/train_dataset.json")

In [None]:
train_dataset = EmbeddingQAFinetuneDataset.from_json("RAG_tutorial/dataset/train_dataset.json")

**Structure**

The resulting structure allows us to look up, for each question, the chunk it was generated from.

queries = {hash: question}

corpus = {hash: chunk text}

relevant_docs = {hash_question: [hash_corpus]}

So given a question from queries, we can retrieve the chunk it was made from.
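As a minimal sketch of that lookup, the toy dictionaries below mirror the three fields. The ids, texts, and the `source_chunks` helper are made up for illustration; the real dataset uses UUID-style hashes.

```python
# Toy stand-in for the dataset's three dictionaries (ids and texts are made up).
queries = {"q1": "What distinguishes robbery from burglary?"}
corpus = {
    "c1": "Text of the chunk about robbery...",
    "c2": "Text of another chunk...",
}
relevant_docs = {"q1": ["c1"]}


def source_chunks(query_id):
    """Return the chunk texts that a question was generated from."""
    return [corpus[chunk_id] for chunk_id in relevant_docs[query_id]]


print(queries["q1"])
print(source_chunks("q1"))
```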

In [None]:
for key in list(train_dataset.queries.keys())[3:8]:
  print(train_dataset.queries[key], '\n')

Differentiate between robbery and burglary as described in the context information. What are the specific characteristics that distinguish these two crimes? 

Define the term "Aggravated burglary" and explain how it differs from "Burglary" based on the context provided. 

What does the term "Abstracting of electricity" refer to, and how might it be categorized in the context of theft-related offenses? 

Describe the legal implications and potential consequences of taking a motor vehicle or other conveyance without authority. How does this differ from other forms of theft mentioned in the provided context? 

Explain the concept of "abstracting of electricity" as mentioned in the context. How does this form of theft compare to obtaining property by deception in terms of legal classification and potential penalties? 



In [None]:
for key in list(train_dataset.corpus.keys())[12:13]:
  print(' '.join(train_dataset.corpus[key].split()), '\n')

68 CHAPTER 60 An Act to revise the law of England and Wales as to theft and similar or associated offences, and in connection therewith to make provision as to criminal proceedings by one party to a marriage against the other, and to make certain amendments extending beyond England and Wales in the Post Office Act 1953 and other enactments; and for other purposes connected there- with. 



In [None]:
# for a query hash, this is the doc hash that relates
for key in list(train_dataset.queries.keys())[3:8]:
  print(train_dataset.relevant_docs[key], '\n')

['d0c6c4ad-7a11-4223-80d6-a1a8d35bedca'] 

['4f93ea89-e03a-4d8e-96d2-544ffa5aa31f'] 

['4f93ea89-e03a-4d8e-96d2-544ffa5aa31f'] 

['23fa4eea-455e-4d15-be2a-8fc27d561b9e'] 

['23fa4eea-455e-4d15-be2a-8fc27d561b9e'] 



---
## $\color{blue}{Finetune}$
---
Initially we rely on an off-the-shelf finetuning function. Information about the actual loss function being used is a little hard to find. A likely candidate for this scenario is Multiple Negatives Ranking Loss (MNRL), which is explained below.

In addition to this off-the-shelf version, we also look to update the weights of our base embedder directly with the relevant loss function, using the sentence-transformers library.

None of this represents a proper test; no validation set is used during finetuning. It is just an example of the workflow.


#### **Multiple Negatives Ranking Loss**

* $N_d$ Number of examples
* $f(x_j)$ The embedding representation of the data point $x_j$
* $y_j$ The ground truth label for datapoint $x_j$

$\textit{L}_{MNRL} = \sum_{j=1}^{N_d} \textit{L}_{pt}(y_j, f(x_j))$

* $\textit{L}_{pt} (y_j, f(x_j))$ The loss on a single datapoint

This loss depends on positive and negative examples. Each positive pair consists of two texts derived from the same chunk of the source document (here, a generated question and the chunk it was generated from). Within a batch of $n$ pairs, the $n-1$ texts that come from other chunks serve as negatives for each pair. The idea is to pull the embeddings of positive pairs together and push those of negative pairs further apart.

*At a high level this is very nice: because we are in a specific domain, all the embeddings sit in the same cluster of the embedding space. For the general model, this cluster is only a small part of the overall problem. With this loss function we get inside that cluster and start to pull the vectors apart in a more nuanced way.*

In practice we build an $(n \times h)$ matrix $A$ and an $(n \times h)$ matrix $B$, where row $i$ of $A$ holds the embedding of the first text of pair $i$ and row $i$ of $B$ the embedding of the second. With normalised rows, $AB^T$ is the $(n \times n)$ matrix of cosine similarities between every first and second text. In the ideal scenario this is close to the identity matrix: matching pairs have a cosine similarity of 1 and non-pairs a similarity of 0. We apply a softmax over each row of similarities and take a cross-entropy loss against the diagonal.
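The matrix view can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code the finetuning engine runs. Under the assumptions used here, the loss for pair $i$ is $-\log \frac{\exp(s \cos(a_i, b_i))}{\sum_{k=1}^{n} \exp(s \cos(a_i, b_k))}$, where $a_i$ and $b_k$ are rows of the two embedding matrices and $s$ is a scale factor. The symbols, the function name `mnrl_loss`, the value `scale=20.0`, and the random "embeddings" are all assumptions for illustration.

```python
import numpy as np


def mnrl_loss(first, second, scale=20.0):
    """Softmax cross-entropy over scaled cosine similarities, diagonal as target."""
    # Normalise rows so that the matrix product gives cosine similarities.
    a = first / np.linalg.norm(first, axis=1, keepdims=True)
    b = second / np.linalg.norm(second, axis=1, keepdims=True)
    logits = scale * (a @ b.T)  # shape (n, n)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (its own pair).
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
questions = rng.normal(size=(4, 8))  # made-up question embeddings
chunks_aligned = questions + 0.01 * rng.normal(size=(4, 8))  # each near its question
chunks_shuffled = chunks_aligned[[1, 2, 3, 0]]  # pairs deliberately mismatched

loss_aligned = mnrl_loss(questions, chunks_aligned)
loss_shuffled = mnrl_loss(questions, chunks_shuffled)
print(loss_aligned, loss_shuffled)
```

When each question sits close to its own chunk the loss is small; mismatching the pairs moves the high similarities off the diagonal and the loss grows.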


---
### $\color{red}{Finetune-Engine}$
---

In [None]:
%%capture
!pip install sentence_transformers -q -U

In [None]:
%%capture
!pip install llama-index-legacy

In [None]:
from llama_index.legacy.finetuning import SentenceTransformersFinetuneEngine

[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/legacy/_static/nltk_cache...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/legacy/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset, # Dataset to be trained on
    model_id="BAAI/bge-small-en-v1.5", # HuggingFace reference to base embeddings model
    model_output_path="RAG_tutorial/models/bge_small_ft", # Output directory for fine-tuned embeddings model
    #val_dataset=valid_dataset, # Dataset to validate on
    epochs=4, # Number of Epochs to train for
)


In [None]:
#recommend a GPU
finetune_engine.finetune()

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/52 [00:00<?, ?it/s]

Iteration:   0%|          | 0/52 [00:00<?, ?it/s]

Iteration:   0%|          | 0/52 [00:00<?, ?it/s]

Iteration:   0%|          | 0/52 [00:00<?, ?it/s]

In [None]:
finetuned_embedding_model = finetune_engine.get_finetuned_model()



In [None]:
finetuned_embedding_model.to_json()

'{"model_name": "RAG_tutorial/models/bge_small_ft", "embed_batch_size": 10, "tokenizer_name": "RAG_tutorial/models/bge_small_ft", "max_length": 512, "pooling": "cls", "normalize": true, "query_instruction": null, "text_instruction": null, "cache_folder": null, "class_name": "HuggingFaceEmbedding"}'

In [None]:
ft_embed_model = HuggingFaceEmbedding(model_name="RAG_tutorial/models/bge_small_ft")

In [None]:
# !zip -r /content/file.zip RAG_tutorial/models/bge_small_ft

  adding: RAG_tutorial/models/bge_small_ft/ (stored 0%)
  adding: RAG_tutorial/models/bge_small_ft/eval/ (stored 0%)
  adding: RAG_tutorial/models/bge_small_ft/config_sentence_transformers.json (deflated 30%)
  adding: RAG_tutorial/models/bge_small_ft/config.json (deflated 48%)
  adding: RAG_tutorial/models/bge_small_ft/model.safetensors (deflated 17%)
  adding: RAG_tutorial/models/bge_small_ft/tokenizer_config.json (deflated 75%)
  adding: RAG_tutorial/models/bge_small_ft/special_tokens_map.json (deflated 80%)
  adding: RAG_tutorial/models/bge_small_ft/vocab.txt (deflated 53%)
  adding: RAG_tutorial/models/bge_small_ft/tokenizer.json (deflated 71%)
  adding: RAG_tutorial/models/bge_small_ft/sentence_bert_config.json (deflated 4%)
  adding: RAG_tutorial/models/bge_small_ft/1_Pooling/ (stored 0%)
  adding: RAG_tutorial/models/bge_small_ft/1_Pooling/config.json (deflated 57%)
  adding: RAG_tutorial/models/bge_small_ft/2_Normalize/ (stored 0%)
  adding: RAG_tutorial/models/bge_small_ft/mo

In [None]:
# from google.colab import files
# files.download("/content/file.zip")


---
## $\color{blue}{Evaluation:}$
---


$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$

Where $Q$ is the set of queries and $\text{rank}_i$ is the position of the ground-truth chunk in the results returned for query $i$.

The metric queries the vector store for the closest matches to each query. If the ground-truth chunk is not among the top $k$ results, the query contributes 0 (this is how the `evaluate` function below handles misses).

To make this more concrete:

* Compile a dataset of document chunks.
* For each datapoint use an LLM to generate 2 questions.
* Select an embedding model to embed the dataset.
* For each question return a list of the closest matches from the dataset.
* For question $i$, coming from datapoint $d$, find the rank of $d$ in the closest matches.
* Add $\frac{1}{\text{rank}}$ to the total, and take the average over all questions to get MRR.
---
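A small worked example may help with the arithmetic. The function and the ranked lists below are hypothetical and independent of any vector store; they only show how the reciprocal ranks are averaged, with a miss counted as 0.

```python
def mean_reciprocal_rank(ranked_results, expected_ids):
    """Average of 1/rank of the expected id; a miss contributes 0."""
    reciprocal_ranks = []
    for ranked_ids, expected_id in zip(ranked_results, expected_ids):
        if expected_id in ranked_ids:
            reciprocal_ranks.append(1 / (ranked_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Three hypothetical queries with their top-3 retrieved chunk ids.
ranked_results = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "d"]]
expected_ids = ["a", "a", "a"]  # ground-truth chunk for each query

score = mean_reciprocal_rank(ranked_results, expected_ids)
print(score)  # (1 + 1/2 + 0) / 3 = 0.5
```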


In [None]:
from llama_index.core import VectorStoreIndex
from tqdm import tqdm
import numpy as np

In [None]:
def evaluate(dataset, embed_model, insert_batch_size=1000, top_k=5):
    # Get corpus, queries, and relevant documents from the qa_dataset object
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Create TextNode objects for each document in the corpus and create a VectorStoreIndex to efficiently store and retrieve embeddings
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()] # loop through docs dict and create a TextNode for each doc
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, insert_batch_size=insert_batch_size
    )
    print("vector store complete")
    retriever = index.as_retriever(similarity_top_k=top_k) # uses cosine similarity by default

    # Prepare to collect evaluation results
    eval_results = []

    # Iterate over each query in the dataset to evaluate retrieval performance
    for query_id, query in tqdm(queries.items()):
        # Retrieve the top_k most similar documents for the current query and extract the IDs of the retrieved documents
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]

        # Check if the expected document was among the retrieved documents
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc per query

        # Reciprocal rank for this query (0 if the expected document is not in the top_k)
        if is_hit:
            rank = retrieved_ids.index(expected_id) + 1
            reciprocal_rank = 1 / rank
        else:
            reciprocal_rank = 0
        eval_results.append(reciprocal_rank)

    # Return the average MRR across all queries as the final evaluation metric
    print('\n')
    return np.average(eval_results)

In [None]:
small_score = evaluate(train_dataset, embed_model)

vector store complete


100%|██████████| 520/520 [00:10<00:00, 47.38it/s]








In [None]:
small_score

0.7750320512820512

In [None]:
ft_score = evaluate(train_dataset, ft_embed_model)

vector store complete


100%|██████████| 520/520 [00:10<00:00, 47.77it/s]








In [None]:
ft_score

0.8597115384615385

**The finetuned score is very good, as expected on the training data, so for comparison let's check an OpenAI embedding model**

In [None]:
%%capture
!pip install llama-index-embeddings-openai

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI password: ")

Enter your OpenAI password: ··········


In [None]:
openai_embed = OpenAIEmbedding(model='text-embedding-3-large', dimensions = 3072)

In [None]:
openai_score = evaluate(train_dataset, openai_embed)

vector store complete


100%|██████████| 520/520 [05:05<00:00,  1.70it/s]








In [None]:
openai_score

0.8095192307692308

---
##### $\color{red}{Test-Set:}$
---

Let's bring in another PDF, the Fraud Act 2006.


In [None]:
# get the pdf from the repo
test_data = PyMuPDFLoader("RAG_tutorial/Data/fraud_act_2006.pdf").load()

In [None]:
# split the text
text_splitter = CharacterTextSplitter(
    separator= r"\n.?\d",
    chunk_size=600,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=True,
)

In [None]:
test_texts = text_splitter.transform_documents(test_data)



In [None]:
test_nodes = [doc_to_node(doc) for doc in test_texts]

In [None]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o")

In [None]:
test_dataset = generate_qa_embedding_pairs(test_nodes,llm=llm)

100%|██████████| 75/75 [02:24<00:00,  1.93s/it]


In [None]:
test_dataset.save_json("RAG_tutorial/dataset/test_dataset.json")

In [None]:
test_dataset = EmbeddingQAFinetuneDataset.from_json("RAG_tutorial/dataset/test_dataset.json")

---
##### $\color{red}{Results:}$
---


In [None]:
small_test = evaluate(test_dataset, embed_model)

vector store complete


100%|██████████| 150/150 [00:02<00:00, 59.72it/s]








In [None]:
small_test

0.8175555555555555

In [None]:
ft_test = evaluate(test_dataset, ft_embed_model)

vector store complete


100%|██████████| 150/150 [00:02<00:00, 57.46it/s]








In [None]:
ft_test

0.8537777777777777

In [None]:
openai_test = evaluate(test_dataset, openai_embed)

vector store complete


100%|██████████| 150/150 [01:23<00:00,  1.81it/s]








In [None]:
openai_test

0.8362222222222222
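To compare the runs side by side, the scores printed above can be collected into one table. The numbers below are copied from the cell outputs and rounded to four decimal places; the dictionary layout and labels are just one possible arrangement. Here "train" means the Theft Act questions and "test" the Fraud Act questions.

```python
# Scores copied from the cell outputs above (rounded to 4 decimal places).
scores = {
    "bge-small-en-v1.5 (base)": {"train": 0.7750, "test": 0.8176},
    "bge-small-en-v1.5 (finetuned)": {"train": 0.8597, "test": 0.8538},
    "text-embedding-3-large": {"train": 0.8095, "test": 0.8362},
}

base = scores["bge-small-en-v1.5 (base)"]
print(f"{'model':32s} {'train':>7s} {'test':>7s} {'test vs base':>13s}")
for name, s in scores.items():
    delta = s["test"] - base["test"]  # difference in MRR relative to the base model
    print(f"{name:32s} {s['train']:7.4f} {s['test']:7.4f} {delta:+13.4f}")
```

In this single run the finetuned small model scores highest on both sets, with a smaller margin over the base model on the unseen Fraud Act questions than on the training questions.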