# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll start with the lowest hanging improvement one can do for RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

<b>Q&D Pairs Training Approach:</b> In this approach, the embedding model is trained using specific query-document pairs that are directly related to each other. It optimizes the embedding space for semantic retrieval and focuses on matching queries to their most relevant documents. However it could lead to overfitting and could perform well only with questions it has been exposed to.

<b>Inter-document pairs Training Approach:</b> This method focuses on creating embeddings that capture broader semantic relationships between different pieces of text. It builds a more generalized semantic understanding of the data. However it may not be precisely tuned for some q&a retrieval and could produce a sub-optimal results in those cases.

<b> What kind of Q's to use: </b> While using Q&D pairs training, it is essential to use a wide range of query types like keyword-based queries, domain-specific questions, abstract questions. The range of complexity and linguistic styles need to be accounted for to mimic real-life use cases.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU "langchain_openai>=0.3.4" "langchain_huggingface" "langchain_core>=0.3.34" "langchain>=0.3.18" "langchain_community>=0.3.17" "langchain-text-splitters>=0.3.6" "datasets>=3.2.0"

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/62.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.8/62.8 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m437.6/437.6 kB[0m [31m17.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m37.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m101.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.4/491.4 kB[0m [31m39.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.6/193.6 kB[0m [31m19.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.8/472.8 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m73.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m114.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m169.4/169.4 kB[0m [31m17.2 MB/s[0m eta [36m0:00:00[0m
[?25h

### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31554    0 31554    0     0   123k      0 --:--:-- --:--:-- --:--:--  123k


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70721    0 70721    0     0  97403      0 --:--:-- --:--:-- --:--:-- 97546


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [13]:
# training_split_documents = training_documents[:len(training_documents) - 24]
# val_split_documents = training_documents[len(training_documents) - 24:102-12]
# test_split_documents = training_documents[102-12:]
training_split_documents = training_documents[:len(training_documents) - 24]
val_split_documents = training_documents[len(training_documents) - 24:len(training_documents)-12]
test_split_documents = training_documents[len(training_documents)-12:]

In [15]:
len(training_split_documents), len(val_split_documents), len(test_split_documents)

(78, 12, 12)

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4.1-mini`

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [16]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4.1-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [17]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [18]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [19]:
import tqdm
import asyncio

"""
Sample Usage of TQDM:

for i in tqdm.tqdm(range(10)):
  time.sleep(1)
"""

async def process_document(document, n_questions):
    questions_generated = await question_generation_chain.ainvoke({"context": document.page_content, "n_questions": n_questions})

    doc_questions = {}
    doc_relevant_docs = {}

    for question in questions_generated.content.split("\n"):
        question_id = str(uuid.uuid4())
        doc_questions[question_id] = "".join(question.split(".")[1:]).strip()
        doc_relevant_docs[question_id] = [document.metadata["id"]]

    return doc_questions, doc_relevant_docs

async def create_questions(documents, n_questions):
    tasks = [process_document(doc, n_questions) for doc in documents]

    questions = {}
    relevant_docs = {}

    for task in tqdm.tqdm(asyncio.as_completed(tasks), total=len(documents), desc="Processing documents"):
        doc_questions, doc_relevant_docs = await task
        questions.update(doc_questions)
        relevant_docs.update(doc_relevant_docs)

    return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [20]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 10)

Processing documents: 100%|██████████| 78/78 [00:10<00:00,  7.35it/s]


In [21]:
training_questions

{'c3a775d0-dc17-4310-8d40-a84bebcd41ae': 'What is the main discovery about Large Language Models (LLMs) in the past 24-36 months?',
 '38d648cf-b150-4b01-a656-b88381fe34c4': 'What resources are used to create Large Language Models?',
 'd711f69a-fc1e-4344-b02b-d63714687692': 'What are some of the capabilities of LLMs mentioned in the context?',
 'b9a59e64-9661-4d08-957e-a164ae7a70c6': 'How can LLMs assist with language translation?',
 '8d2ce54d-a85a-4e89-95ae-9d1698d62f34': 'In what ways can LLMs be used to generate content?',
 'bcb5721d-5c4a-4b77-93c5-42266ba34428': 'What is one potential negative use of LLMs related to education?',
 'af892954-1cb7-4d72-aabb-35682525e109': 'How do LLMs help in writing code?',
 'eeeeab4e-142c-4c05-94ef-aeb318dde0fa': 'What ethical concerns arise from the use of LLMs?',
 '4d97f288-08e5-46a7-9c49-fef18669eca3': 'How do LLMs extract information from text?',
 '1998c613-cd11-477d-a172-fa68875e888a': 'What does the context imply about the impact of LLMs on sof

We'll use the function to generate training, validation, and test data.

In [23]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 10)

Processing documents: 100%|██████████| 12/12 [00:04<00:00,  2.64it/s]


In [24]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 10)

Processing documents: 100%|██████████| 12/12 [00:06<00:00,  1.97it/s]


### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [25]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [27]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [28]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [29]:
!pip install -qU sentence_transformers pyarrow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m345.7/345.7 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 MB[0m [31m55.9 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.
pylibcudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.[0m[31m
[0m

In [31]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [32]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [33]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [34]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [35]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [36]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

Why are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [37]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if we had a significant amount more data.

In [38]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [39]:
import wandb
wandb.init(mode="disabled")

> NOTE: You may not see direct improvement during the training cycles - this is absolutely expected. We will verify performance later in the notebook.

In [40]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
50,No log,No log,0.816667,0.983333,1.0,1.0,0.816667,0.327778,0.2,0.1,0.816667,0.983333,1.0,1.0,0.919906,0.892639,0.892639
78,No log,No log,0.841667,0.975,1.0,1.0,0.841667,0.325,0.2,0.1,0.841667,0.975,1.0,1.0,0.932193,0.909028,0.909028
100,No log,No log,0.85,0.966667,1.0,1.0,0.85,0.322222,0.2,0.1,0.85,0.966667,1.0,1.0,0.935052,0.913056,0.913056
150,No log,No log,0.858333,0.983333,1.0,1.0,0.858333,0.327778,0.2,0.1,0.858333,0.983333,1.0,1.0,0.941465,0.921389,0.921389
156,No log,No log,0.866667,0.983333,1.0,1.0,0.866667,0.327778,0.2,0.1,0.866667,0.983333,1.0,1.0,0.943449,0.924167,0.924167
200,No log,No log,0.866667,0.983333,1.0,1.0,0.866667,0.327778,0.2,0.1,0.866667,0.983333,1.0,1.0,0.943815,0.924583,0.924583
234,No log,No log,0.858333,0.983333,1.0,1.0,0.858333,0.327778,0.2,0.1,0.858333,0.983333,1.0,1.0,0.940374,0.92,0.92
250,No log,No log,0.866667,0.975,1.0,1.0,0.866667,0.325,0.2,0.1,0.866667,0.975,1.0,1.0,0.943237,0.923889,0.923889
300,No log,No log,0.85,0.983333,0.991667,1.0,0.85,0.327778,0.198333,0.1,0.85,0.983333,0.991667,1.0,0.934521,0.912431,0.912431
312,No log,No log,0.85,0.983333,0.991667,1.0,0.85,0.327778,0.198333,0.1,0.85,0.983333,0.991667,1.0,0.935612,0.913819,0.913819


In [41]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [42]:
hf_username = "vivnatan"

In [43]:
import uuid

model.push_to_hub(f"{hf_username}/legal-ft-{uuid.uuid4()}")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/vivnatan/legal-ft-20c85cc6-30d1-49ca-97e6-cce1045a4b4a/commit/4f0f08e1bcdde7d300971b6019968fdd206c2a59'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [44]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [45]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [46]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 120/120 [00:40<00:00,  2.96it/s]


In [47]:
te3_results_df = pd.DataFrame(te3_results)

In [48]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(0.9916666666666667)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [49]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 120/120 [00:02<00:00, 46.30it/s]


In [50]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [51]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.8333333333333334)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [52]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 120/120 [00:02<00:00, 46.22it/s]


In [53]:
finetune_results_df = pd.DataFrame(finetune_results)

In [54]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(0.9916666666666667)

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [55]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [56]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [57]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [58]:
rag_llm =  ChatOpenAI(
    model="gpt-4.1-nano",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [59]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [60]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'Based on the provided context, an "agent" in the context of AI refers to systems that can act on your behalf, such as travel agents or digital assistants. However, the term is highly vague and lacks a clear, universally accepted definition. Different people interpret "agents" differently—some see them as systems that autonomously perform tasks, while others think of them as tools that access and utilize various resources or tools to solve problems. Despite ongoing discussions and prototypes, true AI agents that reliably operate in production are still elusive, partly due to issues like gullibility and the difficulty of distinguishing truth from fiction.'

In [61]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced models that are better than GPT-3, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, and Baidu.'

In [62]:
base_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context does not specify a particular time of year that is considered the "laziest" for AI.'

In [63]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The provided context does not specify the name "Simon" or details about the largest model he has run on his phone. Therefore, I do not know the answer.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [64]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [65]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [66]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An agent, in the context of AI and large language models, is a term that lacks a single, clear definition and is used in various ways. Some people consider AI agents to be systems that act on your behalf, similar to a travel agent, while others think of them as LLMs given access to tools that they can use iteratively to solve problems. Overall, the term remains vague and is often associated with systems that are expected to perform autonomous actions or decision-making, but its precise meaning varies among different users and contexts.'

In [67]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'According to the provided context, organizations such as Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII (Falcon), Microsoft Research, xAI, Replit, Baidu, and others have produced models that are better than GPT-3.'

In [68]:
finetune_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context suggests that ChatGPT may become less useful or "lazy" in December, possibly because its system prompt includes the current date and the model\'s training data indicates that people tend to provide less useful answers around the holidays. Therefore, the laziest time of the year for AI, according to the context, is December.'

In [69]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The provided context mentions that Simon runs Mistral 7B on his iPhone. There is no information about him running any larger models on his phone. Therefore, the largest model Simon has run on his phone is Mistral 7B.'

#### ❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?



## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

In [None]:
### YOUR CODE HERE

In [70]:
!pip install -qU ragas==0.2.10

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/175.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.7/175.7 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/71.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.1/71.1 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [71]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.5/43.5 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m91.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m73.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.2/137.2 kB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 kB[0m [31m3.5 MB/s[0m eta [36m0

In [72]:
# 1) import RAGAS evaluate API & metrics
from ragas.metrics import ContextPrecision, ContextRecall, ResponseRelevancy, Faithfulness
from ragas import evaluate, RunConfig
from langchain_openai import ChatOpenAI
import pandas as pd


In [99]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [100]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [101]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [102]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wut did Meta do with Llama?,[We don’t yet know how to build GPT-4 Vibes Ba...,"In February, Meta released Llama, and later in...",single_hop_specifc_query_synthesizer
1,Given the increasing capabilities of large lan...,[I’m surprised that no-one has beaten the now ...,The grammar rules of programming languages lik...,single_hop_specifc_query_synthesizer
2,Whatt role did the 1950s play in the devellopm...,[Simon Willison’s Weblog Subscribe Stuff we fi...,The academic field of Artificial Intelligence ...,single_hop_specifc_query_synthesizer
3,As an AI enthusiast and technology blogger int...,[Microsoft over this issue. The 69 page PDF is...,"According to the provided context, Stanford Al...",single_hop_specifc_query_synthesizer
4,How has the rise of fine-tuning and customizat...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
5,How has the rise of fine-tuning and customizat...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
6,How has the rise of fine-tuning and customizat...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
7,How did the emergence of prompt-driven app gen...,[<1-hop>\n\ndid. These abilities are just a fe...,The emergence of prompt-driven app generation ...,multi_hop_abstract_query_synthesizer
8,"Based on the blog posts and analytics data, ho...","[<1-hop>\n\nof things, here’s every long-form ...",ChatGPT has been a recurring topic in blog pos...,multi_hop_specific_query_synthesizer
9,"What is Claude 3.5 Sonnet, and how did its int...",[<1-hop>\n\nup there with Claude 3.5 Sonnet. V...,Claude 3.5 Sonnet is a leading large language ...,multi_hop_specific_query_synthesizer


In [103]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [104]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

In [124]:
from langchain_huggingface import HuggingFaceEmbeddings

baseline_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
finetuned_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
# arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [126]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=baseline_embeddings
)

vector_store_ft = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=finetuned_embeddings
)


In [127]:
_ = vector_store.add_documents(documents=split_documents)
_ = vector_store_ft.add_documents(documents=split_documents)

In [128]:
baseline_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
finetune_retriever = vector_store_ft.as_retriever(search_kwargs={"k": 5})

In [129]:
def retrieve(state):
  retrieved_docs = baseline_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

def retrieve_ft(state):
  retrieved_docs = finetune_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

In [130]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [131]:
llm = ChatOpenAI(model_name="gpt-4.1-mini")

In [134]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

In [135]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

In [137]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

graph_builder_ft = StateGraph(State).add_sequence([retrieve_ft, generate])
graph_builder_ft.add_edge(START, "retrieve_ft")
graph_ft = graph_builder_ft.compile()

In [138]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [139]:
response["response"]

'LLM agents are useful because they represent AI systems that can potentially act on your behalf, automating tasks or making decisions. However, their usefulness is currently limited by their inherent gullibility—they believe anything you tell them and cannot reliably distinguish truth from fiction. Despite this challenge, there are genuine valuable applications for LLMs, particularly in areas like writing code, where their capabilities have proven astonishing and effective.\n\nThe key to benefiting from LLM agents lies in developing the skill to work with technology that is both powerful and inherently unreliable. This requires careful design, guidance, and education to avoid intuitive traps and to apply these tools effectively. While fully reliable LLM agents may require breakthroughs like AGI, they already hold promise as powerful, if complex, assistive tools for expert or power users.'

In [140]:
response_ft = graph_ft.invoke({"question" : "How are LLM agents useful?"})

In [141]:
response_ft["response"]

'LLM agents are useful in that they represent AI systems that can potentially act on your behalf by using tools and running processes in loops to solve problems. They hold promise as entities that could perform tasks autonomously, such as acting like a travel agent or a digital assistant. Additionally, LLMs are notably effective in generating and writing code, which is considered one of their best applications so far. However, despite the excitement around AI agents, practical and reliable implementations remain limited, partly due to challenges like gullibility—LLMs tend to believe any input they receive, which complicates trustworthiness and decision-making in autonomous systems. Thus, while LLM agents have potential usefulness, especially in coding and task automation, their full capabilities and reliability in acting independently are still largely unrealized.'

In [119]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [120]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wut did Meta do with Llama?,[Article Visitors Pageviews Bing: “I will not ...,[We don’t yet know how to build GPT-4 Vibes Ba...,"Meta released Llama 2, an improved version of ...","In February, Meta released Llama, and later in...",single_hop_specifc_query_synthesizer
1,Given the increasing capabilities of large lan...,"[Meanwhile, it’s increasingly common for end u...",[I’m surprised that no-one has beaten the now ...,The context does not explicitly compare the co...,The grammar rules of programming languages lik...,single_hop_specifc_query_synthesizer
2,Whatt role did the 1950s play in the devellopm...,"[Meanwhile, it’s increasingly common for end u...",[Simon Willison’s Weblog Subscribe Stuff we fi...,The provided context does not specifically dis...,The academic field of Artificial Intelligence ...,single_hop_specifc_query_synthesizer
3,As an AI enthusiast and technology blogger int...,"[Meanwhile, it’s increasingly common for end u...",[Microsoft over this issue. The 69 page PDF is...,"According to the provided context, Stanford Al...","According to the provided context, Stanford Al...",single_hop_specifc_query_synthesizer
4,How has the rise of fine-tuning and customizat...,"[Meanwhile, it’s increasingly common for end u...",[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization by h...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
5,How has the rise of fine-tuning and customizat...,"[Meanwhile, it’s increasingly common for end u...",[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization by h...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
6,How has the rise of fine-tuning and customizat...,[Article Visitors Pageviews Bing: “I will not ...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization by h...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
7,How did the emergence of prompt-driven app gen...,[These abilities are just a few weeks old at t...,[<1-hop>\n\ndid. These abilities are just a fe...,The emergence of prompt-driven app generation ...,The emergence of prompt-driven app generation ...,multi_hop_abstract_query_synthesizer
8,"Based on the blog posts and analytics data, ho...",[Law is not ethics. Is it OK to train models o...,"[<1-hop>\n\nof things, here’s every long-form ...",Based on the provided blog posts and analytics...,ChatGPT has been a recurring topic in blog pos...,multi_hop_specific_query_synthesizer
9,"What is Claude 3.5 Sonnet, and how did its int...",[Article Visitors Pageviews Bing: “I will not ...,[<1-hop>\n\nup there with Claude 3.5 Sonnet. V...,Claude 3.5 Sonnet is a version of Anthropic's ...,Claude 3.5 Sonnet is a leading large language ...,multi_hop_specific_query_synthesizer


In [145]:
for test_row in dataset:
  response_ft = graph_ft.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response_ft["context"]]

In [146]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wut did Meta do with Llama?,[I wrote about how Large language models are h...,[We don’t yet know how to build GPT-4 Vibes Ba...,LLM agents are useful because they represent A...,"In February, Meta released Llama, and later in...",single_hop_specifc_query_synthesizer
1,Given the increasing capabilities of large lan...,[Code may be the best application\n\nThe ethic...,[I’m surprised that no-one has beaten the now ...,LLM agents are useful because they represent A...,The grammar rules of programming languages lik...,single_hop_specifc_query_synthesizer
2,Whatt role did the 1950s play in the devellopm...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[Simon Willison’s Weblog Subscribe Stuff we fi...,LLM agents are useful because they represent A...,The academic field of Artificial Intelligence ...,single_hop_specifc_query_synthesizer
3,As an AI enthusiast and technology blogger int...,[a browser? 40.5k 49.2k How to implement Q&A a...,[Microsoft over this issue. The 69 page PDF is...,LLM agents are useful because they represent A...,"According to the provided context, Stanford Al...",single_hop_specifc_query_synthesizer
4,How has the rise of fine-tuning and customizat...,[Gullibility is the biggest unsolved problem\n...,[<1-hop>\n\nWe don’t yet know how to build GPT...,LLM agents are useful because they represent A...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
5,How has the rise of fine-tuning and customizat...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nWe don’t yet know how to build GPT...,LLM agents are useful because they represent A...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
6,How has the rise of fine-tuning and customizat...,[You can even run them entirely in your browse...,[<1-hop>\n\nWe don’t yet know how to build GPT...,LLM agents are useful because they represent A...,The rise of fine-tuning and customization by h...,multi_hop_abstract_query_synthesizer
7,How did the emergence of prompt-driven app gen...,[These abilities are just a few weeks old at t...,[<1-hop>\n\ndid. These abilities are just a fe...,LLM agents are useful because they represent A...,The emergence of prompt-driven app generation ...,multi_hop_abstract_query_synthesizer
8,"Based on the blog posts and analytics data, ho...",[Law is not ethics. Is it OK to train models o...,"[<1-hop>\n\nof things, here’s every long-form ...",LLM agents are useful because they represent A...,ChatGPT has been a recurring topic in blog pos...,multi_hop_specific_query_synthesizer
9,"What is Claude 3.5 Sonnet, and how did its int...",[Anthropic kicked this idea into high gear whe...,[<1-hop>\n\nup there with Claude 3.5 Sonnet. V...,LLM agents are useful because they represent A...,Claude 3.5 Sonnet is a leading large language ...,multi_hop_specific_query_synthesizer


In [147]:
from ragas import EvaluationDataset

evaluation_dataset_ft = EvaluationDataset.from_pandas(dataset.to_pandas())

In [121]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [149]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

In [123]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[11]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-oZLvnlDjSb13xHAN8u5M1czV on tokens per min (TPM): Limit 30000, Used 29732, Requested 1835. Please try again in 3.134s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-oZLvnlDjSb13xHAN8u5M1czV on tokens per min (TPM): Limit 30000, Used 29721, Requested 2168. Please try again in 3.778s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[26]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-oZLvnlDjSb13xHAN8u5M1

{'context_recall': 0.4672, 'faithfulness': 0.8835, 'factual_correctness': 0.5664, 'answer_relevancy': 0.8508, 'context_entity_recall': 0.3520, 'noise_sensitivity_relevant': 0.1166}

In [150]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result_ft = evaluate(
    dataset=evaluation_dataset_ft,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result_ft

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-oZLvnlDjSb13xHAN8u5M1czV on tokens per min (TPM): Limit 30000, Used 28840, Requested 2285. Please try again in 2.25s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[22]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-oZLvnlDjSb13xHAN8u5M1czV on tokens per min (TPM): Limit 30000, Used 29578, Requested 1775. Please try again in 2.706s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[16]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-oZLvnlDjSb13xHAN8u5M1c

{'context_recall': 0.6667, 'faithfulness': 0.0500, 'factual_correctness': 0.1025, 'answer_relevancy': 0.7688, 'context_entity_recall': 0.4794, 'noise_sensitivity_relevant': 0.0435}

<b>Baseline :: </b>{'context_recall': 0.4672, 'faithfulness': 0.8835, 'factual_correctness': 0.5664, 'answer_relevancy': 0.8508, 'context_entity_recall': 0.3520, 'noise_sensitivity_relevant': 0.1166}
<br><b>Fine Tuned :: </b>{'context_recall': 0.6667, 'faithfulness': 0.0500, 'factual_correctness': 0.1025, 'answer_relevancy': 0.7688, 'context_entity_recall': 0.4794, 'noise_sensitivity_relevant': 0.0435}

Notes: Improved context recall, context_entity_recall, noise_sensitivity_relevant.
Reduced faithfulness, factual correctness and answer relevancy