# Fine-tuning Embeddings for RAG on Specific Data

We're kicking off our "fine-tuning" week with the lowest-hanging improvement you can make to a RAG system:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
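
To make the "move closer together" idea concrete, here is a toy sketch in plain Python. The 3-dimensional vectors, the `nudge` helper, and the step size are illustrative assumptions; real fine-tuning updates the model's weights through a loss function rather than editing vectors directly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nudge(vector, target, step=0.5):
    """Hypothetical helper: move `vector` a fraction `step` of the way toward `target`."""
    return [v + step * (t - v) for v, t in zip(vector, target)]

# Made-up embeddings for a question and the document that answers it.
question = [1.0, 0.0, 0.2]
document = [0.2, 1.0, 0.1]

before = cosine_similarity(question, document)
after = cosine_similarity(nudge(question, document), document)
print(f"before: {before:.3f}, after: {after:.3f}")
```

After the nudge, the question's similarity to its document is higher, which is the effect we want training to have on real question/document pairs.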

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model better at the very specific task of relating potential questions to specific documents, rather than at judging general similarity between passages.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m54.9/54.9 kB[0m [31m553.8 kB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m15.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m50.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m37.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m78.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m72.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.8/472.8 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/30.7 MB[0m [31m22.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m51.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m165.1/165.1 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[?25h

### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 31392    0 31392    0     0   182k      0 --:--:-- --:--:-- --:--:--  183k


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 70251    0 70251    0     0   494k      0 --:--:-- --:--:-- --:--:--  497k


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
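
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that cuts fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on separators such as paragraph and line breaks); `chunk_with_overlap` is a hypothetical helper used only for illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters,
    where each window starts `chunk_size - chunk_overlap` characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap = 20` plays in the cell above.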

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

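for document in training_documents:
  doc_id = str(uuid.uuid4())
  # Regenerate on the (extremely unlikely) chance of a collision, keeping the ID a string.
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id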

Next, we'll use naive Python slicing to create training, validation, and test sets to prepare our data for the next step.

In [13]:
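n_documents = len(training_documents)
# Hold out the last 24 chunks: 12 for validation and 12 for testing.
training_split_documents = training_documents[:n_documents - 24]
val_split_documents = training_documents[n_documents - 24:n_documents - 12]
test_split_documents = training_documents[n_documents - 12:]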

## Task 3: Constructing a Fine-tuning Dataset

Using the documents we created above, we can finally start constructing a fine-tuning dataset with OpenAI's [`gpt-4o-mini`](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [14]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [15]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [16]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and the number of questions to generate per document.
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions with their contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object whose keys are question `id`s and whose values are the `str` questions.
- An object whose keys are `question_id`s and whose values are a `List[str]` of the associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [17]:
import re
import tqdm

async def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}

  for document in tqdm.tqdm(documents):
    context = document.page_content
    result = await question_generation_chain.ainvoke({"context": context, "n_questions": n_questions})

    # Keep only lines shaped like "1. QUESTION" and strip the numeric prefix.
    # Matching any number (not just 1-5) keeps this working for larger n_questions.
    generated_questions = []
    for line in result.content.split('\n'):
      match = re.match(r'^\s*\d+\.\s*(.+)$', line)
      if match:
        generated_questions.append(match.group(1).strip())

    for question in generated_questions:
        question_id = str(uuid.uuid4())
        questions[question_id] = question
        relevant_docs[question_id] = [document.metadata["id"]]

  return questions, relevant_docs
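
Note that the loop above awaits each LLM call before starting the next one, so the requests still run one at a time. A minimal sketch of running them concurrently is shown here, assuming a bounded number of in-flight requests is acceptable for your rate limits. `fake_generate`, `generate_all`, and `max_concurrency` are hypothetical names; `fake_generate` stands in for `question_generation_chain.ainvoke` so the example runs without an API key.

```python
import asyncio

async def fake_generate(context, n_questions):
    """Hypothetical stand-in for `question_generation_chain.ainvoke`."""
    await asyncio.sleep(0.01)  # simulate network latency
    return "\n".join(
        f"{i}. Question {i} about {context}?" for i in range(1, n_questions + 1)
    )

async def generate_all(contexts, n_questions, max_concurrency=5):
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(context):
        async with semaphore:
            return await fake_generate(context, n_questions)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(c) for c in contexts))

results = asyncio.run(generate_all(["chunk A", "chunk B", "chunk C"], n_questions=2))
print(results[0])
```

Because `gather` preserves input order, each result can still be paired with the document that produced it.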

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [18]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

100%|██████████| 78/78 [01:26<00:00,  1.10s/it]


We'll use the function to generate training, validation, and test data.

In [19]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

100%|██████████| 12/12 [00:13<00:00,  1.10s/it]


In [20]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

100%|██████████| 12/12 [00:11<00:00,  1.07it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [21]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [22]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [23]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
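
Each file above is written with a single `json.dump` call, so despite the `.jsonl` extension it holds one JSON object and can be read back with `json.load`. Here is a small sketch of reloading a dataset and checking that every question points at a context that exists in the corpus. The `validate_dataset` helper and the toy data are assumptions for illustration, not part of the notebook.

```python
import json
import os
import tempfile

def validate_dataset(dataset):
    """Return a list of problems found in a questions/relevant_contexts/corpus dict."""
    problems = []
    for question_id, context_ids in dataset["relevant_contexts"].items():
        if question_id not in dataset["questions"]:
            problems.append(f"unknown question id: {question_id}")
        for context_id in context_ids:
            if context_id not in dataset["corpus"]:
                problems.append(
                    f"question {question_id} points at missing context {context_id}"
                )
    return problems

# Toy dataset with the same three keys used in this notebook.
toy_dataset = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["d1"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

# Round-trip through a temporary file, the same way the notebook saves its datasets.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")
    with open(path, "w") as f:
        json.dump(toy_dataset, f)
    with open(path) as f:
        reloaded = json.load(f)

problems = validate_dataset(reloaded)
print(problems)
```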

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus, so let's fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [27]:
!pip install -qU sentence_transformers datasets pyarrow

In [25]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [28]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [29]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [30]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [31]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)
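
One thing to watch for: the loss used in the next step treats the other documents in a batch as negatives for each question. Since we generated two questions per chunk and the loader keeps the original order, sibling questions tend to land in the same batch, where the correct chunk then also appears as a "negative". The sketch below counts such collisions on toy data; `count_conflicts` and the toy `pairs` are hypothetical and only illustrate the effect.

```python
import random

def count_conflicts(pairs, batch_size):
    """Count (question, doc_id) pairs whose doc_id appears more than once in their batch."""
    conflicts = 0
    for start in range(0, len(pairs), batch_size):
        batch_doc_ids = [doc_id for _, doc_id in pairs[start:start + batch_size]]
        conflicts += sum(1 for doc_id in batch_doc_ids if batch_doc_ids.count(doc_id) > 1)
    return conflicts

# Toy data: 20 documents with 2 questions each, in generation order.
pairs = [(f"question {q} for doc {d}", f"doc-{d}") for d in range(20) for q in range(2)]

ordered_conflicts = count_conflicts(pairs, batch_size=10)

# Shuffle a copy with a fixed seed so the run is repeatable.
shuffled = pairs[:]
random.Random(0).shuffle(shuffled)
shuffled_conflicts = count_conflicts(shuffled, batch_size=10)

print(ordered_conflicts, shuffled_conflicts)
```

In generation order every pair collides with its sibling. Passing `shuffle=True` to the `DataLoader` typically reduces the collisions without ruling them out; `sentence_transformers.datasets` also provides a `NoDuplicatesDataLoader` aimed at this situation, which may be worth trying.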

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [32]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

### Answer

**MultipleNegativesRankingLoss (MNRL)**

This loss is designed for training on "positive" pairs only (here, a question and the chunk that answers it); we do not have to supply explicit negatives.
For each question in a batch, the positives that belong to the *other* questions in that batch are treated as negatives.
The goal is to pull each positive pair closer together in the embedding space while pushing each question away from the other documents in the batch.

**What is this loss specifically doing?**

*   It calculates the similarity (cosine similarity by default) between each anchor and every candidate document in the batch.
*   It scales those similarities and applies a cross-entropy loss in which the anchor's own positive is the correct "class" (an InfoNCE-style objective), which:
    *   Maximizes the similarity between the anchor and its positive.
    *   Minimizes the similarity between the anchor and the in-batch negatives.
*   Because the negatives come from the batch, larger batches generally give a stronger training signal.

**MatryoshkaLoss**

This loss builds upon another loss (in your case, MultipleNegativesRankingLoss) to enable "progressive dimensionality reduction" or "multi-granularity learning." It leverages the concept of "matryoshka dolls" (nested dolls), where you train the model to produce embeddings at multiple dimensionalities.
The idea is to have embeddings that are useful at varying levels of detail.

**What is this loss specifically doing?**

*   It takes a base loss (like MNRL) and a list of target dimensions (`matryoshka_dimensions`).
*   This forces the model to learn embeddings that are meaningful at different levels of dimensionality, making them more versatile.
*   During training, for each dimension in the list it:
    *   Truncates the embeddings to that many leading dimensions.
    *   Recalculates the base loss on the truncated embeddings.
*   The final loss is a weighted sum of the losses calculated at each dimension.

In this case, the model will learn embeddings that remain useful when truncated to 768, 512, 256, 128, or 64 dimensions.
Note that `snowflake-arctic-embed-l` produces 1024-dimensional embeddings, so the largest size listed here (768) is already a truncation; only the sizes in the list are optimized directly.
The loss function will ensure that the embeddings are optimized for the MultipleNegativesRankingLoss objective at each of those dimensions.
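
To ground the two summaries, here is a from-scratch sketch of both ideas in plain Python. It is a simplified illustration, not the `sentence-transformers` implementation: the toy 4-dimensional vectors are made up, the `scale=20.0` value and the equal weighting of each dimension are assumptions we believe match the library defaults, and the function names are our own.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_ranking_loss(questions, documents, scale=20.0):
    """Cross-entropy where, for question i, document i is the target class
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, question in enumerate(questions):
        logits = [scale * cosine(question, doc) for doc in documents]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(questions)

def matryoshka_loss(questions, documents, dims):
    """Sum the base loss over embeddings truncated to each size in `dims`."""
    return sum(
        in_batch_ranking_loss([q[:d] for q in questions], [doc[:d] for doc in documents])
        for d in dims
    )

questions = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
aligned_docs = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
swapped_docs = aligned_docs[::-1]  # each question is now paired with the wrong document

print(in_batch_ranking_loss(questions, aligned_docs))
print(in_batch_ranking_loss(questions, swapped_docs))
print(matryoshka_loss(questions, aligned_docs, dims=[4, 2]))
```

The loss is lower when each question sits next to its own document than when the pairs are swapped, and the Matryoshka variant adds one such term per truncation size.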

Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [33]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though we could increase this number if we had significantly more data.

In [34]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
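
To see what the warm-up amounts to with this notebook's numbers, here is a small worked sketch. It assumes every one of the 78 training chunks yielded exactly 2 parsed questions; `warmup_linear_multiplier` is a hypothetical helper that mimics a linear warm-up followed by linear decay, which we believe is the shape of the default schedule in `model.fit`, so treat it as an approximation.

```python
import math

n_examples = 78 * 2   # 78 training chunks, assuming 2 questions parsed per chunk
batch_size = 10
epochs = 10

steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)

def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, 8, 16, 88, 160):
    print(step, round(warmup_linear_multiplier(step, warmup_steps, total_steps), 3))
```

Under these assumptions the warm-up covers the first 10% of training, which here is the same number of steps as one epoch.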

In [35]:
import wandb
wandb.init(mode="disabled")

In [36]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)




| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine NDCG@10 | Cosine MRR@10 | Cosine MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | No log | No log | 0.875 | 1.0 | 1.0 | 1.0 | 0.875 | 0.333333 | 0.2 | 0.1 | 0.875 | 1.0 | 1.0 | 1.0 | 0.953866 | 0.9375 | 0.9375 |
| 32 | No log | No log | 0.875 | 1.0 | 1.0 | 1.0 | 0.875 | 0.333333 | 0.2 | 0.1 | 0.875 | 1.0 | 1.0 | 1.0 | 0.953866 | 0.9375 | 0.9375 |
| 48 | No log | No log | 0.916667 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.333333 | 0.2 | 0.1 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.969244 | 0.958333 | 0.958333 |
| 50 | No log | No log | 0.916667 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.333333 | 0.2 | 0.1 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.969244 | 0.958333 | 0.958333 |
| 64 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 80 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 96 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 100 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 112 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 128 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
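
The metrics in the log above can look odd at first: Accuracy@3 is 1.0 while Precision@3 is only 0.333. That follows from each question having exactly one relevant chunk, so precision at `k` can never exceed `1/k`. Here is a minimal per-query sketch of these metrics; `retrieval_metrics` is our own hypothetical helper, not the evaluator's code, and the document IDs are made up.

```python
def retrieval_metrics(ranked_ids, relevant_ids, k):
    """Accuracy@k, precision@k, recall@k, and reciprocal rank (within the top k) for one query."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "accuracy": 1.0 if hits else 0.0,
        "precision": len(hits) / k,
        "recall": len(hits) / len(relevant_ids),
        "reciprocal_rank": reciprocal_rank,
    }

# One relevant chunk, ranked second out of three retrieved.
metrics = retrieval_metrics(["doc-b", "doc-a", "doc-c"], {"doc-a"}, k=3)
print(metrics)
```

Averaging the per-query values over all validation questions gives the aggregate numbers; the mean reciprocal rank is what MRR@10 reports with `k=10`.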


In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
hf_username = "llm-wizard"

In [None]:
model.push_to_hub(f"{hf_username}/legal-ft-v0")


'https://huggingface.co/llm-wizard/legal-ft-v0/commit/4bfc69e4287f04445493713eb6a021085862b9cf'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [37]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [38]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate. We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [39]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:06<00:00,  3.59it/s]


In [40]:
te3_results_df = pd.DataFrame(te3_results)

In [41]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-l` (base)

In [42]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_l_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 24/24 [00:00<00:00, 47.51it/s]


In [43]:
arctic_embed_l_results_df = pd.DataFrame(arctic_embed_l_results)

In [44]:
arctic_embed_l_hit_rate = arctic_embed_l_results_df["is_hit"].mean()
arctic_embed_l_hit_rate

0.9166666666666666

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [45]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:00<00:00, 46.48it/s]


In [46]:
finetune_results_df = pd.DataFrame(finetune_results)

In [47]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0
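
With only 24 test questions, each question is worth about 4.2 percentage points of hit rate, so the gap between 0.917 and 1.0 corresponds to two questions. The sketch below converts the reported rates back into counts and adds a Wilson score interval as a rough sense of the uncertainty. The interval is our addition for illustration and is not part of the notebook's evaluation.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = hits / n
    denominator = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return center - half_width, center + half_width

n_questions = 24
# Hit rates copied from the cells above.
for name, hit_rate in [("base", 0.9166666666666666), ("fine-tuned", 1.0)]:
    hits = round(hit_rate * n_questions)
    low, high = wilson_interval(hits, n_questions)
    print(f"{name}: {hits}/{n_questions} hits, interval ~ [{low:.2f}, {high:.2f}]")
```

The intervals overlap heavily at this sample size, so a larger test set would be needed before reading much into the difference.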

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [48]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (not fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [49]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [50]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [51]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [52]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
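
If the LCEL syntax is hard to follow, the data flow of the chain above can be sketched as ordinary function calls. This is a plain-Python analogy, not LangChain code: `fake_retriever` and `fake_llm` are hypothetical stand-ins for `base_retriever` and for `rag_prompt_template | rag_llm | StrOutputParser()`.

```python
def fake_retriever(question):
    """Hypothetical stand-in for `base_retriever`: returns a list of context strings."""
    return [f"context related to: {question}"]

def fake_llm(prompt):
    """Hypothetical stand-in for `rag_prompt_template | rag_llm | StrOutputParser()`."""
    return f"answer based on a prompt of {len(prompt)} characters"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2 (`RunnablePassthrough.assign`) re-assigns "context" to itself, so state is unchanged.
    # Step 3: produce the response while passing the retrieved context through.
    prompt = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}"
    return {"response": fake_llm(prompt), "context": state["context"]}

output = run_chain({"question": "What is an agent?"})
print(output["response"])
```

Returning the context alongside the response is what lets us inspect which chunks were retrieved for each answer.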

In [53]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is a term that refers to AI systems that can act on your behalf. However, the term is vague and lacks a single, clear definition, leading to various interpretations. Some people view agents as systems that autonomously perform tasks, while others think of them as LLMs (Large Language Models) that utilize tools to solve problems. The concept is still evolving, and there is skepticism about their utility due to challenges such as gullibility, where these systems may struggle to distinguish truth from fiction.'

In [54]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Better-than-GPT-3 class models have been produced by Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several other organizations.'

In [55]:
base_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [56]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [57]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [58]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [59]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" is a term that lacks a single, clear, and widely understood meaning in the context of AI. It generally refers to AI systems that can act on behalf of a user, but interpretations vary. Some people view agents as entities that perform tasks like a travel agent, while others see them as LLMs (Large Language Models) that utilize tools to solve problems. The term is often associated with concepts of autonomy, but there is significant ambiguity surrounding its definition and practical implementation.'

In [60]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [61]:
finetune_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [62]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Llama 3.2 3B model.'

#### ❓ Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

## Answer

Overall, the fine-tuned LCEL RAG chain answered the questions better, although the difference shows up on only one of the four vibe-check questions. The likely cause is that fine-tuning aligned the embedding model with the vocabulary of the corpus and with the kinds of questions asked about it, so the retriever surfaces more relevant context for the LLM.

**Reasoning:**

*   **Improved Relevance**: For "What is the largest model that Simon has run on his phone?", the base chain answered "I do not know." while the fine-tuned chain answered with the Llama 3.2 3B model. The LLM and prompt are identical in both chains, so the difference comes from the retrieved context.
*   **Comparable Answers Elsewhere**: Both chains gave similar answers to "What is an agent?" and "Who has produced better models than GPT-3?", which suggests the base retriever already found suitable context for those questions.
*   **Shared Limitation**: Both chains answered "I do not know." to "What is the laziest month for AI?", so fine-tuning did not help there.

Four questions are too few to draw a firm conclusion, but the result is consistent with the hit-rate comparison in Task 5. By fine-tuning the embedding model on a dataset of question-document pairs drawn from Simon Willison's blog content, the model gained a better sense of the relationships between questions and documents in that domain, leading to more targeted retrieval and better answers.

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.

In [None]:
### YOUR CODE HERE

## Install Dependencies

In [63]:
!pip install -qU ragas==0.2.10

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/175.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m174.1/175.7 kB[0m [31m6.3 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.7/175.7 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/71.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.1/71.1 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [64]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0



## Import API Key

In [65]:
import os
from getpass import getpass

os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")


Please enter your Ragas API key!··········


## Create Embeddings Model Wrapper

In [67]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())


## Generate test data from Knowledge Graph

In [68]:
from ragas.testset import TestsetGenerator
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Use our training document that we imported earlier
docs = training_documents

dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying SummaryExtractor:   0%|          | 0/83 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/127 [00:00<?, ?it/s]

ERROR:ragas.testset.transforms.engine:unable to apply transformation: Connection error.


Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/337 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

In [69]:
dataset.to_pandas()

|  | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What significant development in AI occurred in... | [Stuff we figured out about AI in 2023\n\n\n\n... | 2023 was the breakthrough year for Large Langu... | single_hop_specifc_query_synthesizer |
| 1 | Wht is GPT-4? | [Large Language Models\nThey’re actually quite... | We don’t yet know how to build GPT-4. | single_hop_specifc_query_synthesizer |
| 2 | What advancements in LLMs were noted in 2024? | [Here’s the sequel to this post: Things we lea... | In 2024, it was noted that Large Language Mode... | single_hop_specifc_query_synthesizer |
| 3 | How you make LLMs? | [They’re actually quite easy to build\nThe mos... | LLMs are surprisingly easy to build, requiring... | single_hop_specifc_query_synthesizer |
| 4 | Who is Anthropic? | [If you can gather the right data, and afford ... | Anthropic is one of the organizations that hav... | single_hop_specifc_query_synthesizer |
| 5 | What significant advancements in AI were assoc... | [<1-hop>\n\nJanuary\n\n7th: It’s OK to call it... | The significant advancements in AI associated ... | multi_hop_specific_query_synthesizer |
| 6 | How does the size and training cost of DeepSee... | [<1-hop>\n\nThe big news to end the year was t... | DeepSeek v3, with its 685B parameters, is sign... | multi_hop_specific_query_synthesizer |
| 7 | What is the significance of DeepSeek v3's rank... | [<1-hop>\n\nBenchmarks put it up there with Cl... | DeepSeek v3 holds a significant position in th... | multi_hop_specific_query_synthesizer |
| 8 | How is Qwen's new visual reasoning model, QvQ,... | [<1-hop>\n\nai\n 1100\n\n\n ... | Qwen's new visual reasoning model, QvQ, is uti... | multi_hop_specific_query_synthesizer |
| 9 | How Llama 3.1 405B compare to other models and... | [<1-hop>\n\nBenchmarks put it up there with Cl... | Llama 3.1 405B, despite being trained for 30,8... | multi_hop_specific_query_synthesizer |


In [70]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/e053cde3-bb0d-47a9-8f42-74d2e7c8da2c


'https://app.ragas.io/dashboard/alignment/testset/e053cde3-bb0d-47a9-8f42-74d2e7c8da2c'

## Call base chain with test data

In [71]:
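# Run each synthetic question through the base chain and record its answer and retrieved chunks
for test_row in dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  # Keep only the text of each retrieved document; RAGAS expects a list of strings
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]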
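
With `retrieved_contexts` filled in, a quick sanity check can be run before any LLM-judged metric. The sketch below computes a crude substring-overlap hit rate. It is not the RAGAS `LLMContextRecall` metric, and the `hit_rate` helper and the toy rows are hypothetical, chosen only to mirror the column names used in this notebook.

```python
def hit_rate(rows):
    """Fraction of rows where any retrieved chunk overlaps a reference chunk.

    "Overlap" means one string contains the other, which is a crude
    stand-in for an LLM-judged recall metric.
    """
    hits = 0
    for row in rows:
        retrieved = row["retrieved_contexts"]
        reference = row["reference_contexts"]
        if any(ref in ret or ret in ref for ref in reference for ret in retrieved):
            hits += 1
    return hits / len(rows) if rows else 0.0


# Hypothetical toy rows (not taken from the generated testset).
rows = [
    {
        "retrieved_contexts": ["LLMs are easy to build."],
        "reference_contexts": ["LLMs are easy to build."],
    },
    {
        "retrieved_contexts": ["Evals really matter."],
        "reference_contexts": ["Training costs fell in 2024."],
    },
]

print(hit_rate(rows))
```

Because chunk boundaries rarely line up exactly, a check like this tends to undercount hits; it is only useful for spotting a retriever that returns nothing relevant at all.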

In [72]:
dataset.to_pandas()

|  | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | What significant development in AI occurred in... | [A year ago the single most notable example of... | [Stuff we figured out about AI in 2023\n\n\n\n... | A significant development in AI that occurred ... | 2023 was the breakthrough year for Large Langu... | single_hop_specifc_query_synthesizer |
| 1 | Wht is GPT-4? | [That same laptop that could just about run a ... | [Large Language Models\nThey’re actually quite... | GPT-4 is a class of large language model devel... | We don’t yet know how to build GPT-4. | single_hop_specifc_query_synthesizer |
| 2 | What advancements in LLMs were noted in 2024? | [Everything tagged “llms” on my blog in 2024\n... | [Here’s the sequel to this post: Things we lea... | In 2024, it was noted that LLMs became more ev... | In 2024, it was noted that Large Language Mode... | single_hop_specifc_query_synthesizer |
| 3 | How you make LLMs? | [Even the openly licensed ones are still the w... | [They’re actually quite easy to build\nThe mos... | The context mentions that "They’re actually qu... | LLMs are surprisingly easy to build, requiring... | single_hop_specifc_query_synthesizer |
| 4 | Who is Anthropic? | [Evals really matter\nAnthropic’s Amanda Askel... | [If you can gather the right data, and afford ... | Anthropic is a company focused on artificial i... | Anthropic is one of the organizations that hav... | single_hop_specifc_query_synthesizer |
| 5 | What significant advancements in AI were assoc... | [Even the openly licensed ones are still the w... | [<1-hop>\n\nJanuary\n\n7th: It’s OK to call it... | The context provided does not contain specific... | The significant advancements in AI associated ... | multi_hop_specific_query_synthesizer |
| 6 | How does the size and training cost of DeepSee... | [Is this infrastructure necessary? DeepSeek v3... | [<1-hop>\n\nThe big news to end the year was t... | DeepSeek v3 is a significantly larger model th... | DeepSeek v3, with its 685B parameters, is sign... | multi_hop_specific_query_synthesizer |
| 7 | What is the significance of DeepSeek v3's rank... | [Then in December, the Chatbot Arena team intr... | [<1-hop>\n\nBenchmarks put it up there with Cl... | The context does not provide specific informat... | DeepSeek v3 holds a significant position in th... | multi_hop_specific_query_synthesizer |
| 8 | How is Qwen's new visual reasoning model, QvQ,... | [ai\n 1100\n\n\n openai\... | [<1-hop>\n\nai\n 1100\n\n\n ... | Qwen's new visual reasoning model, QvQ, is uti... | Qwen's new visual reasoning model, QvQ, is uti... | multi_hop_specific_query_synthesizer |
| 9 | How Llama 3.1 405B compare to other models and... | [That same laptop that could just about run a ... | [<1-hop>\n\nBenchmarks put it up there with Cl... | I do not know. | Llama 3.1 405B, despite being trained for 30,8... | multi_hop_specific_query_synthesizer |


## Convert table into Evaluation Dataset


In [73]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

## Choose model for Judging

In [74]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

## Evaluate base RAG with RAGAS

In [75]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-KEkyoe4K9WHAtKUPJrUOZCjX on tokens per min (TPM): Limit 30000, Used 29916, Requested 1502. Please try again in 2.836s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-KEkyoe4K9WHAtKUPJrUOZCjX on tokens per min (TPM): Limit 30000, Used 29408, Requested 1735. Please try again in 2.286s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[24]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-KEkyoe4K9WHAtKUPJrUO

{'context_recall': 0.0833, 'faithfulness': 1.0000, 'factual_correctness': 0.3133, 'answer_relevancy': 0.5740, 'context_entity_recall': 0.5301, 'noise_sensitivity_relevant': 0.2131}
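
The 429 errors in the log above suggest that some of the 60 evaluation jobs may not have returned a score. A minimal sketch, assuming failed jobs are recorded as NaN and skipped when averaging (an assumption about this RAGAS version, worth checking against the per-sample scores), shows how that changes what an aggregate represents. The `nan_mean` helper and the sample values are hypothetical.

```python
import math


def nan_mean(scores):
    """Average the scores that are real numbers, skipping NaN entries."""
    valid = [s for s in scores if not math.isnan(s)]
    if not valid:
        return math.nan
    return sum(valid) / len(valid)


# Hypothetical per-sample scores for one metric; NaN marks a job that failed
# (for example after a rate-limit error). These are NOT values from this run.
per_sample = [1.0, 0.5, math.nan, 1.0, math.nan, 0.0]

completed = sum(not math.isnan(s) for s in per_sample)
print(f"{completed}/{len(per_sample)} jobs completed, mean = {nan_mean(per_sample):.4f}")
```

If that assumption holds, an aggregate computed over only the completed jobs is less stable than one computed over all ten samples, which matters when comparing two runs that failed on different rows.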

## Evaluate fine-tuned RAG with RAGAS

In [76]:
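# Re-run the same questions through the fine-tuned chain, overwriting the base chain's responses and contexts in place
for test_row in dataset:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  # Keep only the text of each retrieved document; RAGAS expects a list of strings
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]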

In [77]:
dataset.to_pandas()

|  | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | What significant development in AI occurred in... | [Stuff we figured out about AI in 2023\n\n\n\n... | [Stuff we figured out about AI in 2023\n\n\n\n... | 2023 was a breakthrough year for Large Languag... | 2023 was the breakthrough year for Large Langu... | single_hop_specifc_query_synthesizer |
| 1 | Wht is GPT-4? | [We don’t yet know how to build GPT-4\nFrustra... | [Large Language Models\nThey’re actually quite... | GPT-4 is a language model developed by OpenAI,... | We don’t yet know how to build GPT-4. | single_hop_specifc_query_synthesizer |
| 2 | What advancements in LLMs were noted in 2024? | [Things we learned about LLMs in 2024\n\n\n\n\... | [Here’s the sequel to this post: Things we lea... | Some advancements in LLMs noted in 2024 includ... | In 2024, it was noted that Large Language Mode... | single_hop_specifc_query_synthesizer |
| 3 | How you make LLMs? | [Large Language Models\nThey’re actually quite... | [They’re actually quite easy to build\nThe mos... | To make LLMs (Large Language Models), you need... | LLMs are surprisingly easy to build, requiring... | single_hop_specifc_query_synthesizer |
| 4 | Who is Anthropic? | [Evals really matter\nAnthropic’s Amanda Askel... | [If you can gather the right data, and afford ... | Anthropic is a company that focuses on develop... | Anthropic is one of the organizations that hav... | single_hop_specifc_query_synthesizer |
| 5 | What significant advancements in AI were assoc... | [The earliest of those was Google’s Gemini 1.5... | [<1-hop>\n\nJanuary\n\n7th: It’s OK to call it... | Gemini 1.5 Pro introduced significant advancem... | The significant advancements in AI associated ... | multi_hop_specific_query_synthesizer |
| 6 | How does the size and training cost of DeepSee... | [Likewise, training. DeepSeek v3 training for ... | [<1-hop>\n\nThe big news to end the year was t... | DeepSeek v3 is a significantly larger model wi... | DeepSeek v3, with its 685B parameters, is sign... | multi_hop_specific_query_synthesizer |
| 7 | What is the significance of DeepSeek v3's rank... | [Benchmarks put it up there with Claude 3.5 So... | [<1-hop>\n\nBenchmarks put it up there with Cl... | DeepSeek v3 is ranked 7th in the Chatbot Arena... | DeepSeek v3 holds a significant position in th... | multi_hop_specific_query_synthesizer |
| 8 | How is Qwen's new visual reasoning model, QvQ,... | [Apple’s mlx-lm Python library supports runnin... | [<1-hop>\n\nai\n 1100\n\n\n ... | Qwen's new visual reasoning model, QvQ, is uti... | Qwen's new visual reasoning model, QvQ, is uti... | multi_hop_specific_query_synthesizer |
| 9 | How Llama 3.1 405B compare to other models and... | [Meta’s Llama 3.2 models deserve a special men... | [<1-hop>\n\nBenchmarks put it up there with Cl... | Llama 3.1 405B is compared to other models in ... | Llama 3.1 405B, despite being trained for 30,8... | multi_hop_specific_query_synthesizer |


## Transform fine-tuned dataset into Evaluation Dataset and re-run Evaluation

In [78]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [79]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[7]: RateLimitError(Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})
ERROR:ragas.executor:Exception raised in Job[9]: RateLimitError(Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})
ERROR:ragas.executor:Exception raised in Job[6]: RateLimitError(Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platfor

{'context_recall': 0.6667, 'faithfulness': 0.7778, 'factual_correctness': 0.6433, 'answer_relevancy': 0.9567, 'context_entity_recall': 0.6562, 'noise_sensitivity_relevant': 0.1716}

## Results

### Base Chain
*   context_recall: 0.0833
*   faithfulness: 1.0000
*   factual_correctness: 0.3133
*   answer_relevancy: 0.5740
*   context_entity_recall: 0.5301
*   noise_sensitivity_relevant: 0.2131

### Fine-Tuned Chain
*   context_recall: 0.6667
*   faithfulness: 0.7778
*   factual_correctness: 0.6433
*   answer_relevancy: 0.9567
*   context_entity_recall: 0.6562
*   noise_sensitivity_relevant: 0.1716

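Comparing the two results, the fine-tuned chain improved on every metric except faithfulness, which dropped from 1.0000 to 0.7778. The largest gains were in context_recall (0.0833 to 0.6667) and answer_relevancy (0.5740 to 0.9567). The drop in noise_sensitivity_relevant (0.2131 to 0.1716) is also an improvement, since a lower noise sensitivity score is better. Both evaluation runs logged rate-limit or quota errors, so some jobs may not have produced a score and these numbers are best read as indicative.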
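
The comparison can also be made mechanical. This is a minimal sketch using only the scores reported above; the `compare` helper and the `LOWER_IS_BETTER` set are illustrative names, and treating noise sensitivity as lower-is-better follows the RAGAS definition of that metric.

```python
# Scores copied from the two `evaluate` outputs in this notebook.
base = {
    "context_recall": 0.0833,
    "faithfulness": 1.0000,
    "factual_correctness": 0.3133,
    "answer_relevancy": 0.5740,
    "context_entity_recall": 0.5301,
    "noise_sensitivity_relevant": 0.2131,
}
fine_tuned = {
    "context_recall": 0.6667,
    "faithfulness": 0.7778,
    "factual_correctness": 0.6433,
    "answer_relevancy": 0.9567,
    "context_entity_recall": 0.6562,
    "noise_sensitivity_relevant": 0.1716,
}

# Metrics where a lower score is the better outcome.
LOWER_IS_BETTER = {"noise_sensitivity_relevant"}


def compare(before, after):
    """Return {metric: (delta, improved)} for every metric in `before`."""
    report = {}
    for name, old in before.items():
        delta = round(after[name] - old, 4)
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        report[name] = (delta, improved)
    return report


report = compare(base, fine_tuned)
for name, (delta, improved) in report.items():
    print(f"{name:<28} {delta:+.4f}  {'better' if improved else 'worse'}")
```

Keeping the direction of each metric explicit avoids misreading a falling score as a regression when the metric is one where lower is better.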