### ⚠ IMPORTANT ⚠

You will need at least 22GB of VRAM (GPU RAM) to run this notebook.

If you're running this locally - please ensure you have the correct hardware to support the fine-tuning.

Please make sure you're using the following instance:

![image](https://i.imgur.com/ji210Ug.png)

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun May 12 17:32:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
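
To compare the reported memory against the 22GB requirement without reading the table by eye, the total can be parsed out of the `nvidia-smi` text. This is a minimal sketch that assumes the table layout shown above; `total_vram_mib` and `REQUIRED_MIB` are hypothetical names, not part of the notebook.

```python
import re

REQUIRED_MIB = 22 * 1024  # 22GB expressed in MiB (assumption: 1GB = 1024MiB)

def total_vram_mib(smi_text):
    """Return the total memory (MiB) of each GPU found in nvidia-smi table text."""
    # Matches the "used / total" pair, e.g. "0MiB / 15360MiB".
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    return [int(total) for _, total in pairs]

# One row copied from the output above, so the sketch runs on its own.
sample = "| N/A   44C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |"
totals = total_vram_mib(sample)
enough = bool(totals) and max(totals) >= REQUIRED_MIB
print(totals, enough)
```

In the notebook, the call would be `total_vram_mib(gpu_info)`. For the Tesla T4 shown above, the parsed total is below the stated requirement.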
                                                                    

# Fine-tuning Embedding Models

In the following Notebook we will be exploring one of the most powerful techniques to take your single-domain RAG pipelines to the next level.

Fine-tuning Embeddings Models!

- 🤝 Breakout Room #2
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-m`
  - Task 5: Evaluating Retrieval with Embedding Model

But before any of that, we need to grab some dependencies, and set up some boilerplate!

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key, and Hugging Face token!

### Nest Asyncio

In [2]:
import nest_asyncio

nest_asyncio.apply()
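
To see why this patch is needed, the sketch below reproduces, in a plain script, the error that appears when code tries to start an event loop while one is already running (which is the situation inside a notebook cell). It uses only the standard library; `fetch_answer` and `already_running` are hypothetical names.

```python
import asyncio

async def fetch_answer():
    return 42

async def already_running():
    # We are inside a running loop here, much like code in a notebook cell.
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # starting a second loop is refused
    except RuntimeError as err:
        coro.close()  # avoid a "never awaited" warning
        return str(err)
    return "no error"

# Run this as a plain script; inside a notebook the outer call would itself
# hit the same error unless the loop has been patched.
message = asyncio.run(already_running())
print(message)
```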

### Install Dependencies

In [3]:
!pip install -qU llama-index-llms-openai llama-index-embeddings-openai llama-index-finetuning

In [4]:
!pip install -qU llama-index-readers-file llama-index-embeddings-huggingface

In [5]:
!pip install -qU "sentence_transformers==2.7.0"

### API Key Section!

In classic fashion, we'll need to provide our OpenAI API key!

We'll also provide our Hugging Face token (with `Write` access) in order to save our model on the Hub!

In [6]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


In [7]:
from huggingface_hub import notebook_login

notebook_login()


## Task 2: Loading Data

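The original data for this exercise can be found in [this GitHub repo](https://github.com/AI-Maker-Space/DataRepository/tree/main/high-performance-rag); it relates to research articles about Camelids (aka: Llamas, Alpacas, Camels!).

The cells below instead download `AI-Powered_Search_v20.pdf` and read from the `AIPS_TRAIN` and `AIPS_EVAL` directories.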

In [6]:
!wget https://tirsus.com/AI-Powered_Search_v20.pdf

--2024-05-12 16:04:37--  https://tirsus.com/AI-Powered_Search_v20.pdf
Resolving tirsus.com (tirsus.com)... 45.131.252.33
Connecting to tirsus.com (tirsus.com)|45.131.252.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28939937 (28M) [application/pdf]
Saving to: ‘AI-Powered_Search_v20.pdf’


2024-05-12 16:04:41 (12.2 MB/s) - ‘AI-Powered_Search_v20.pdf’ saved [28939937/28939937]



In [10]:
cd "./ir-data"

/content/ir-data


In [11]:
ls

AIPS_EVAL/  AIPS_TRAIN/


Now we can load the documents in the training and evaluation directories and parse them into nodes.

We will use LlamaIndex's `SimpleNodeParser` to achieve this!

In [12]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

TRAIN_FILES = "AIPS_TRAIN"
EVAL_FILES = "AIPS_EVAL"

In [13]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

def load_corpus(directory, verbose=False):
    if verbose:
        print(f"Loading files in {directory}")

    reader = SimpleDirectoryReader(directory)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes
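
Conceptually, a node parser turns each document into smaller, overlapping chunks ("nodes"). The sketch below is a deliberately naive word-based splitter meant only to illustrate that idea; it is not LlamaIndex's algorithm, and `chunk_words`, `chunk_size`, and `overlap` are hypothetical names.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

doc = "one two three four five six seven eight nine ten eleven twelve"
demo_chunks = chunk_words(doc, chunk_size=5, overlap=2)
for chunk in demo_chunks:
    print(chunk)
```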


In [15]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
eval_nodes = load_corpus(EVAL_FILES, verbose=True)

Loading files in AIPS_TRAIN




Failed to load file /content/ir-data/AIPS_TRAIN/AI-Powered_Search_v20_Ch2ff.pdf with error: RetryError[<Future at 0x7a6cbc27faf0 state=finished raised PdfStreamError>]. Skipping...
Loaded 0 docs


Parsing nodes: 0it [00:00, ?it/s]

Parsed 0 nodes
Loading files in AIPS_EVAL
Loaded 25 docs


Parsing nodes:   0%|          | 0/25 [00:00<?, ?it/s]

Parsed 25 nodes


Now that we've split our source documents into a number of nodes, we can move on to constructing a fine-tuning dataset.
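
Before spending API calls on question generation, it can help to confirm that each split actually produced nodes (the output above reports `Parsed 0 nodes` for `AIPS_TRAIN` because the PDF failed to load). A minimal sketch follows; `require_nodes` is a hypothetical helper, not part of LlamaIndex.

```python
def require_nodes(nodes, name):
    """Raise early if a corpus split is empty, otherwise return it unchanged."""
    if len(nodes) == 0:
        raise ValueError(f"{name} produced 0 nodes; check that its files loaded")
    return nodes

# Stand-in lists so the sketch runs on its own; in the notebook these would be
# the results of load_corpus(...).
demo_eval_nodes = ["node-a", "node-b"]
demo_train_nodes = []

print(len(require_nodes(demo_eval_nodes, "AIPS_EVAL")))
try:
    require_nodes(demo_train_nodes, "AIPS_TRAIN")
except ValueError as err:
    print(err)
```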

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-3.5-turbo`.

We'll start by using LlamaIndex's `generate_qa_embedding_pairs` and storing the result in an `EmbeddingQAFinetuneDataset`.

The basic idea here is straightforward enough:

1. We look at a node
2. We generate a question that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.
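
The resulting dataset can be pictured as three dictionaries: `queries`, `corpus`, and `relevant_docs` (the same keys the `evaluate` function reads later in this notebook). The sketch below builds that shape by hand. `fake_question` is a hypothetical stand-in for the LLM call, and `build_pairs` is an illustration, not LlamaIndex's implementation.

```python
import uuid

def fake_question(text):
    # Hypothetical stand-in for the LLM: a real run would ask a model to
    # write a question that can be answered from `text`.
    return f"What does the passage say about {text.split()[0].lower()}?"

def build_pairs(node_texts):
    """Build queries/corpus/relevant_docs dicts from a list of node texts."""
    corpus = {str(uuid.uuid4()): text for text in node_texts}
    queries, relevant_docs = {}, {}
    for node_id, text in corpus.items():
        query_id = str(uuid.uuid4())
        queries[query_id] = fake_question(text)
        relevant_docs[query_id] = [node_id]  # one relevant node per question
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}

dataset = build_pairs([
    "Ranking orders documents by relevance.",
    "Embeddings map text to vectors.",
])
for query_id, question in dataset["queries"].items():
    print(question, "->", dataset["relevant_docs"][query_id])
```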

> NOTE: Keep in mind that the example below uses up to 100 nodes to generate the QA pairs, which results in up to 100 calls to `gpt-3.5-turbo`. Feel free to reduce the number of nodes.

In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning

In [24]:
!pip install pydantic



In [16]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset


In [17]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")

In [18]:
train_dataset = generate_qa_embedding_pairs(train_nodes[:100], llm=llm)
train_dataset.save_json("train_dataset.json")

0it [00:00, ?it/s]


In [19]:
eval_dataset = generate_qa_embedding_pairs(eval_nodes[:10], llm=llm)
eval_dataset.save_json("eval_dataset.json")

100%|██████████| 10/10 [00:15<00:00,  1.52s/it]


In [20]:
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
eval_dataset = EmbeddingQAFinetuneDataset.from_json("eval_dataset.json")

## Task 4: Fine-tuning `snowflake-arctic-embed-m`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

> NOTE: If you are limited by your compute, you can use the `snowflake-arctic-embed-m` model instead, which will run on the free T4 GPU instance in Colab. The code cells below use `snowflake-arctic-embed-m`.

#### ❓ Question 1:

How many parameters does `snowflake-arctic-embed-l` have?

#### **!** Answer 1:

According to [Snowflake's announcement blog post](https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/?lang=de), `snowflake-arctic-embed-l` has 334 million parameters.

In [21]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset, # Dataset to be trained on
    val_dataset=eval_dataset, # Dataset to evaluate on
    model_id="Snowflake/snowflake-arctic-embed-m", # HuggingFace reference to base embeddings model
    model_output_path="snowflake_finetune_ir", # Output directory for fine-tuned embeddings model
    epochs=4 # Number of Epochs to train for
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



All that's left to do now is call `.finetune()`!

In [22]:
finetune_engine.finetune()

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration: 0it [00:00, ?it/s]

Iteration: 0it [00:00, ?it/s]

Iteration: 0it [00:00, ?it/s]

Iteration: 0it [00:00, ?it/s]
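
For intuition about what training on question/context pairs optimizes, here is a pure-Python sketch of an in-batch-negatives ranking loss: each question should score highest against its own context, and the other contexts in the batch act as negatives. That the engine uses a loss of this family is an assumption here, not something this notebook shows; the vectors and the `scale` value are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_negatives_loss(query_vecs, context_vecs, scale=20.0):
    """Mean cross-entropy where context i is the positive for query i."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, c) for c in context_vecs]
        # Numerically stable log-sum-exp over all contexts in the batch.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(query_vecs)

query_vectors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each context is close to its own query
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each context is close to the wrong query

loss_aligned = in_batch_negatives_loss(query_vectors, aligned)
loss_swapped = in_batch_negatives_loss(query_vectors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when every question already ranks its own context first and large when it ranks another context first, which is the signal a training step would push against.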

Now that we've fine-tuned our embeddings model, let's grab the model out of the engine so we can use it later!

> NOTE: You should be able to safely ignore any warnings relating to weights here.

In [23]:
finetuned_embedding_model = finetune_engine.get_finetuned_model()




Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_ir and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
from sentence_transformers import SentenceTransformer

fine_tuned_embedding = SentenceTransformer(
    "snowflake_finetune_ir"
)




Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_ir and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
fine_tuned_embedding.save_to_hub(repo_id="uderiu/snowflake-ft-ir2-m")



model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

'https://huggingface.co/uderiu/snowflake-ft-ir2-m/commit/3433286ea4600b1ec127b6a9bfc13cee1b6e3db3'

## Task 5: Evaluating Retrieval with Embedding Model

Now that we've fine-tuned our model - let's see how it performs against OpenAI's `text-embedding-3-small` model, and the base non-fine-tuned version of the model.

In [28]:
from tqdm.notebook import tqdm
from llama_index.core.schema import TextNode
from llama_index.core import Settings, VectorStoreIndex


def evaluate(
    dataset,
    embed_model,
    top_k=2,
    verbose=False,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items() if text != ""]
    index = VectorStoreIndex(
        nodes,
        show_progress=True,
        embed_model=embed_model
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_ids,
            'expected': expected_id,
            'query': query_id,
        }
        eval_results.append(eval_result)
    return eval_results

#### ❓Question 2:

Describe what the `evaluate` function is doing in the above cell in natural language.

#### **!** Answer 2:

The primary function of this code is to assess the performance of a document retrieval system. Here's how it achieves this:

**Data Preparation**

Data Loading: The dataset likely contains three components:

- corpus: A collection of documents (presumably key-value pairs with ID and text).
- queries: A set of search queries.
- relevant_docs: For each query, a list of IDs indicating the documents that should be considered relevant.

Node Creation: The code creates TextNode objects. These act as wrappers for the chunks, storing their IDs and text content.

Index Building: A VectorStoreIndex is constructed. This index is designed to store and efficiently search through "embeddings" (vector representations) of the text documents. The provided embed_model is used to convert text into these embeddings.

**Retrieval**

Retriever Setup: The index is converted into a retriever: a component specifically designed to search the index and return the most similar documents to a given query. The similarity_top_k parameter tells the retriever to focus on the top 'k' most similar documents.

**Evaluation Loop**

Query Processing: The code iterates through each query in the queries dataset.

Search: The retriever is used to find the documents most similar to the current query.

Relevance Check: The retrieved document IDs are compared to the expected relevant document ID (expected_id) found in the relevant_docs data. The is_hit variable marks whether the expected document is within the retrieved results.

Record Keeping: An eval_result dictionary is created, storing:

- is_hit: Whether the expected relevant document was found.
- retrieved: IDs of the retrieved documents.
- expected: ID of the expected relevant document.
- query: ID of the query (the code stores query_id, not the query text).

**Output**

Finally, the function returns a list (eval_results) containing the evaluation result dictionaries for all the queries.
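
The same hit-rate logic can be sketched without any index or model. The toy "embedding" below is just a bag of word counts, which is an assumption made so the example is self-contained; `embed`, `cosine`, and `hit_rate_at_k` are hypothetical names, and a real run would use dense vectors from an embedding model.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. A real embedding model returns dense vectors.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hit_rate_at_k(dataset, top_k=2):
    """Fraction of queries whose expected document is in the top_k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in dataset["corpus"].items()}
    hits = []
    for query_id, query in dataset["queries"].items():
        q_vec = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)
        expected_id = dataset["relevant_docs"][query_id][0]  # assume 1 relevant doc
        hits.append(expected_id in ranked[:top_k])
    return sum(hits) / len(hits)

toy = {
    "corpus": {
        "d1": "llamas eat grass",
        "d2": "vectors encode meaning",
        "d3": "search ranks documents",
    },
    "queries": {"q1": "what do llamas eat", "q2": "how are documents ranked by search"},
    "relevant_docs": {"q1": ["d1"], "q2": ["d3"]},
}
print(hit_rate_at_k(toy, top_k=1))
```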



In [29]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_sentence_transformers(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    return evaluator(model, output_path="/content/")

#### ❓Question 3:

Describe what the `evaluate_sentence_transformers` function is doing in the above cell in natural language.

#### **!** Answer 3:
The `evaluate_sentence_transformers` function works as follows:

**Data Loading**:  Just like in the previous cell, it expects a dataset containing:

- corpus: A collection of documents (likely as text).
- queries: A set of search queries.
- relevant_docs: For each query, a list of relevant document identifiers.

**Evaluator Setup**:

An InformationRetrievalEvaluator object is created from the sentence_transformers.evaluation module. This evaluator is a specialized tool designed to measure information retrieval metrics.
It's initialized with our queries, corpus, and relevant_docs to understand what to compare during the evaluation.
The name parameter is likely used for labeling the output of the evaluation.

**Model Loading**:

A SentenceTransformer model is loaded using the provided model_id. Sentence Transformers are specialized models that create text embeddings (vector representations).

**Evaluation**:

The core evaluation happens when you call `evaluator(model, output_path="/content/")`. Here's what's likely going on inside:

- The model is used to generate embeddings for both the queries and the documents in the corpus.
- The evaluator compares these embeddings to determine how well the model can find relevant documents for each query.
- It calculates standard information retrieval metrics such as precision, recall, NDCG, and MAP.
- The evaluation results are saved to the specified `/content/` path.

**Output**:

The `evaluate_sentence_transformers` function returns whatever the evaluator call returns; in the cells further down, that is a single float score (`0.7683333333333333`).

**In Summary**

This function gives an assessment of how well our Sentence Transformer model performs at retrieving the most relevant documents for the given search queries.
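
To make the metric names concrete, here is a small sketch of three of them computed for a single query. These are textbook definitions written out by hand, not the evaluator's own code; the ranking and the relevant set are made-up values.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top-k results."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

ranked = ["d7", "d2", "d9", "d4"]  # retrieval order for one query
relevant = {"d2", "d4"}            # documents labelled relevant for it

print(precision_at_k(ranked, relevant, 2))
print(recall_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant))
```

Averaging the reciprocal rank over all queries gives MRR; the other metrics are averaged over queries in the same way.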



In [30]:
import json

with open("eval_dataset.json", "r") as f:  # read-only access is enough here
    eval_dataset_json = json.load(f)

### Text Embedding 3 Small Results

We'll compare our results against OpenAI's `text-embedding-3-small` model, so we'll need to load it up!

In [31]:
from llama_index.embeddings.openai import OpenAIEmbedding

text_embedding_3_small = OpenAIEmbedding(model="text-embedding-3-small")
te3_val_results = evaluate(eval_dataset_json, text_embedding_3_small)

Generating embeddings:   0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

Let's look at what an example of our results looks like.

In [32]:
import pandas as pd

df_te3 = pd.DataFrame(te3_val_results)

In [None]:
df_te3

| | is_hit | retrieved | expected | query |
|---|---|---|---|---|
| 0 | True | [08f61629-0d2d-4e65-a86f-f5aff3db984f, e1fb9e4... | 08f61629-0d2d-4e65-a86f-f5aff3db984f | bf0a25cb-08a1-4cde-9801-ef8181bb4703 |
| 1 | True | [08f61629-0d2d-4e65-a86f-f5aff3db984f, ac67414... | 08f61629-0d2d-4e65-a86f-f5aff3db984f | 493ac1f9-d947-408b-8b64-b71eb51a9780 |
| 2 | True | [79e7f1b4-fb72-4640-89cc-67fbdf49e959, ac67414... | 79e7f1b4-fb72-4640-89cc-67fbdf49e959 | 56e494f7-1fed-4fa7-8f03-6bfb4a1bf2ee |
| 3 | True | [79e7f1b4-fb72-4640-89cc-67fbdf49e959, e1fb9e4... | 79e7f1b4-fb72-4640-89cc-67fbdf49e959 | 6e980e60-7532-4da1-9be9-5ca5c4735c3c |
| 4 | True | [1b1d8de1-efe1-4fa3-b9a9-8a3cefcfca6c, ac67414... | 1b1d8de1-efe1-4fa3-b9a9-8a3cefcfca6c | 27a52493-45f4-43ee-a7a3-c01c43c3bffb |
| 5 | True | [ac67414d-6dd6-4dda-b7d4-8c77d64a1f48, 1b1d8de... | 1b1d8de1-efe1-4fa3-b9a9-8a3cefcfca6c | daacd976-4447-4e94-8dba-b91a5571ebc7 |
| 6 | True | [ac67414d-6dd6-4dda-b7d4-8c77d64a1f48, e1fb9e4... | ac67414d-6dd6-4dda-b7d4-8c77d64a1f48 | ef8000b8-f9a7-422c-9019-0f634e2b93af |
| 7 | True | [ac67414d-6dd6-4dda-b7d4-8c77d64a1f48, 1b1d8de... | ac67414d-6dd6-4dda-b7d4-8c77d64a1f48 | c969f93f-1951-4704-bd72-be162c67a511 |
| 8 | False | [d1f92893-ad89-4d6a-b6c9-b613fc2f7e38, 7d09065... | e1fb9e46-6b3e-4f67-946a-69b42d867f5b | 612d0ab3-5b59-46f7-b247-361c019c7202 |
| 9 | True | [e1fb9e46-6b3e-4f67-946a-69b42d867f5b, d1f9289... | e1fb9e46-6b3e-4f67-946a-69b42d867f5b | aec43dd8-47f5-4ecd-a82b-9fe0defaef5d |


#### ❓Question 4:

What do these `[313de41e-534b...]` IDs mean?

#### **!** Answer 4:

See the `evaluate` function above. Each ID identifies a node or a query:

- `retrieved`: the list of IDs of the retrieved nodes
- `expected`: the ID of the expected node (assuming only one node is expected)
- `query`: the ID of the query that was evaluated


Now let's look at the mean value of `is_hit`.

In [33]:
hit_rate_ada = df_te3['is_hit'].mean()
hit_rate_ada

0.95

Overall, we see `text-embedding-3-small` getting a `0.95` "hit rate".
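
Because `is_hit` is boolean, its mean is simply the fraction of queries whose expected node was retrieved. A small sketch, assuming 20 evaluation queries (the count shown in the progress bar above) with a single miss; the list of values is illustrative, not the notebook's actual results:

```python
hits = [True] * 19 + [False]  # 20 queries, one miss (illustrative values)
hit_rate = sum(hits) / len(hits)  # True counts as 1, False as 0
print(hit_rate)
```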

### Base Embeddings Model Results

Let's get the evaluation for our base embedding model (pre-fine-tuning).

In [34]:
base_embed_model_id = "Snowflake/snowflake-arctic-embed-m"
base_embed_model = SentenceTransformer(base_embed_model_id)

arctic_base = "local:Snowflake/snowflake-arctic-embed-m"
arctic_base_val_results = evaluate(eval_dataset_json, arctic_base)







Generating embeddings:   0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

In [35]:
df_arctic_base = pd.DataFrame(arctic_base_val_results)

In [36]:
hit_rate_bge = df_arctic_base['is_hit'].mean()
hit_rate_bge

0.8

With a `0.8` hit rate, the base embedding model trails `text-embedding-3-small` from OpenAI, which reached `0.95`!

Because this is a local `SentenceTransformer`, we can evaluate it with the `SentenceTransformer` evaluation helper-function as well!

In [37]:
evaluate_sentence_transformers(eval_dataset_json, "Snowflake/snowflake-arctic-embed-m", name='arctic-m')






0.7683333333333333

Not great results - let's see what fine-tuning can do for us!

### Fine-tuned Results

In [38]:
finetuned = "local:snowflake_finetune_ir"
eval_results_finetuned = evaluate(eval_dataset_json, finetuned)




Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_ir and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generating embeddings:   0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

In [39]:
df_finetuned = pd.DataFrame(eval_results_finetuned)

In [40]:
hit_rate_finetuned = df_finetuned['is_hit'].mean()
hit_rate_finetuned

0.8

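At `0.8`, the fine-tuned model's hit rate is identical to the base model's, so this run shows no improvement. That is consistent with the outputs above: the training PDF failed to load (`Loaded 0 docs`), so the training dataset was empty and fine-tuning ran zero iterations.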

In [41]:
evaluate_sentence_transformers(eval_dataset_json, "snowflake_finetune_ir", name='finetuned')




Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune_ir and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.7683333333333333

The `SentenceTransformer` evaluation score is also unchanged from the base model (`0.7683333333333333`).

### Conclusion

Now we can compare the 3 embeddings models to see which performed the best!

In [42]:
df_te3['model'] = 'te3'
df_arctic_base['model'] = 'arctic-baseline'
df_finetuned['model'] = 'arctic-fine-tuned'

In [43]:
df_all = pd.concat([df_te3, df_arctic_base, df_finetuned])
df_all.groupby('model')[['is_hit']].mean()  # average hit rate per model

| model | is_hit |
|---|---|
| arctic-baseline | 0.8 |
| arctic-fine-tuned | 0.8 |
| te3 | 0.95 |
