# Fine-tuning Embeddings 



## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### 1. Import Libs

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [2]:
# !pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [3]:
# !pip install unstructured 

In [4]:
# !pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### OpenAI API Key

In [13]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

## 2. Load Data

In [5]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader#, UnstructuredEPubLoader
# from langchain_community.document_loaders import BSHTMLLoader


path = "data/"
# epub_loader = UnstructuredEPubLoader(path + "BlacksLaw9thEdition.epub")
# read all pdfs in the directory
pdf_loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)

In [6]:
# epub_data = epub_loader.load()
pdf_data = pdf_loader.load()

In [7]:
pdf_data[0].page_content

'Glossary of Legal Terms\nFind deﬁnitions of legal terms to help understand the federal\ncourt system.\nA\nAcquittal\nA jury verdict that a criminal defendant is not guilty, or the finding of a judge that the\nevidence is insufficient to support a conviction.\nActive judge\nA judge in the full-time service of the court. Compare to senior judge.\nAdministrative Office of the United States Courts (AO)\nEnter legal term to search for definition\nSearch'

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

In [9]:
training_documents = text_splitter.split_documents(pdf_loader.load())

In [10]:
len(training_documents)

151

Next, we're going to associate each of our chunks with a unique identifier.

In [11]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [17]:
# create train, val and test splits
def split_data(data, train_size=0.7, val_size=0.15):
    train_end = int(len(data) * train_size)
    val_end = train_end + int(len(data) * val_size)
    train_data = data[:train_end]
    val_data = data[train_end:val_end]
    test_data = data[val_end:]
    return train_data, val_data, test_data
train_data, val_data, test_data = split_data(training_documents)

In [18]:
len(train_data), len(val_data), len(test_data)

(105, 22, 24)

## 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [19]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [None]:
question_generation_chain = qa_prompt_template | qa_chat_model

In [None]:
tmp_response = question_generation_chain.invoke({"n_questions": 5, "context": training_documents[0].page_content})

In [None]:
# parse response
import re
def parse_response(response):
    questions = re.findall(r"\d+\.\s*(.*)", response)
    return questions
questions = parse_response(tmp_response.content)

In [None]:
questions

['What significant advancements were made in Large Language Models (LLMs) in 2023?  ',
 'How does the development of LLMs in 2023 relate to the history of Artificial Intelligence since the 1950s?  ',
 'Why is 2023 considered a breakthrough year for AI according to Simon Willison?  ',
 'What does Simon Willison refer to as the most interesting development in the field of AI in 2023?  ',
 "What is the purpose of Simon Willison's weblog entry dated 31st December 2023?  "]

In [None]:
import tqdm
import re

async def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}

  def _parse_response(response):
    questions = re.findall(r"\d+\.\s*(.*)", response)
    return questions

  for document in tqdm.tqdm(documents):
    # generate questions
    response = await question_generation_chain.ainvoke(
      input={"context": document.page_content, "n_questions": n_questions}
    )
    # parse response and get questions_list
    questions_list = _parse_response(response.content)

    for question in questions_list:
      # create a unique id for the question
      question_id = str(uuid.uuid4())
      questions[question_id] = question  # question_id : question
      relevant_docs[question_id] = [document.metadata["id"]] # question_id : document.id

  return questions, relevant_docs

In [None]:
tmp_qs, tmp_contexts = await create_questions(training_split_documents[:3], 2)

100%|██████████| 3/3 [00:02<00:00,  1.02it/s]


In [None]:
tmp_qs

{'1cb451eb-b545-4aa0-890b-898079aaf067': 'What significant advancements in AI were made in 2023, particularly regarding Large Language Models (LLMs)?  ',
 'f5b755fd-a8f9-4a2b-b66b-aa40f8bc4606': 'How does the development of LLMs in 2023 relate to the historical context of Artificial Intelligence since the 1950s?',
 'db45fb0b-2696-4b69-adf1-9c01fc1e965f': 'What are some potential applications of Large Language Models (LLMs) mentioned in the context?  ',
 'f9f38d9f-1f55-47e9-9ee1-004f4f436332': 'What is identified as the biggest unsolved problem related to LLMs?',
 '1fba4208-8b06-4ec7-83f6-45b69becf864': 'What are some of the capabilities of Large Language Models (LLMs) mentioned in the context?  ',
 '05660ffe-1f0c-4b4a-b81b-39eedc1799b9': 'What potential negative uses of LLMs are highlighted in the provided context?'}

In [None]:
tmp_contexts

{'1cb451eb-b545-4aa0-890b-898079aaf067': ['37445bd6-a767-4d26-8127-f25e59a92c21'],
 'f5b755fd-a8f9-4a2b-b66b-aa40f8bc4606': ['37445bd6-a767-4d26-8127-f25e59a92c21'],
 'db45fb0b-2696-4b69-adf1-9c01fc1e965f': ['3df24bf6-42ae-473e-8c0d-e81a39c730bb'],
 'f9f38d9f-1f55-47e9-9ee1-004f4f436332': ['3df24bf6-42ae-473e-8c0d-e81a39c730bb'],
 '1fba4208-8b06-4ec7-83f6-45b69becf864': ['0aea8c0f-c734-47c0-9601-a5137965c6fa'],
 '05660ffe-1f0c-4b4a-b81b-39eedc1799b9': ['0aea8c0f-c734-47c0-9601-a5137965c6fa']}

In [None]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

100%|██████████| 78/78 [01:20<00:00,  1.04s/it]


In [None]:
print("Number of training questions:", len(training_questions))
print("Number of documents:", len(training_split_documents))

Number of training questions: 156
Number of documents: 78


We'll use the function to generate training, validation, and test data.

In [None]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

100%|██████████| 12/12 [00:11<00:00,  1.07it/s]


In [None]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

100%|██████████| 12/12 [00:11<00:00,  1.01it/s]


In [None]:
print(f"Val :: questions: {len(val_questions)}, documents: {len(val_split_documents)}")
print(f"Test :: questions: {len(test_questions)}, documents: {len(test_split_documents)}")

Val :: questions: 24, documents: 12
Test :: questions: 24, documents: 12


### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [None]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("data/training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [None]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("data/val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [None]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("data/test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

In [17]:
def print_lengths(dataset):
    for key, value in dataset.items():
        print(key, len(value))

In [None]:
print_lengths(train_dataset)
print_lengths(val_dataset)
print_lengths(test_dataset)

questions 156
relevant_contexts 156
corpus 78
questions 24
relevant_contexts 24
corpus 12
questions 24
relevant_contexts 24
corpus 12


## 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

In [10]:
!pip install -qU sentence_transformers datasets "pyarrow<19.0.0a0,>=14.0.0"

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.1/40.1 MB[0m [31m37.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [11]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/85.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

- imports from `sentence_transformers` and `torch`.

In [12]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [13]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [14]:
# first load the dataset from the jsonl files

import json
def load_dataset(file_path):
    with open(file_path, "r") as f:
        dataset = json.load(f)
    return dataset

In [15]:
train_dataset = load_dataset("data/training_dataset.jsonl")
val_dataset = load_dataset("data/val_dataset.jsonl")
test_dataset = load_dataset("data/test_dataset.jsonl")

In [18]:
print("\ntrain_dataset: ")
print_lengths(train_dataset)

print("\nval_dataset: ")
print_lengths(val_dataset)

print("\ntest_dataset: ")
print_lengths(test_dataset)


train_dataset: 
questions 156
relevant_contexts 156
corpus 78

val_dataset: 
questions 24
relevant_contexts 24
corpus 12

test_dataset: 
questions 24
relevant_contexts 24
corpus 12


In [19]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [20]:
# examples[0].texts

In [21]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

### Loss Function

- `MultipleNegativesRankingLoss` - more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).
- "Wrapped" in `MatryoshkaLoss` - more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [22]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

### Set-up Evaluator

In [23]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

In [24]:
EPOCHS = 10

### Training Setup
> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [25]:
# get wandb api key
wandb_api_key = getpass.getpass("Enter your wandb api key: ")

Enter your wandb api key: ··········


In [26]:
import wandb
wandb.login(key=wandb_api_key)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mvinod[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [27]:
import wandb
wandb.init(
    project="emb-model-fine-tuning",
    name="Snowflake-arctic-embed-l",
    config={
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "model_id": model_id
    }
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [28]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
32,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
48,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
50,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
64,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
80,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
96,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
100,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
112,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
128,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167


In [29]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [30]:
hf_username = "vin00d"

In [31]:
model.push_to_hub(f"{hf_username}/snowflake-arctic-ft-1")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/vin00d/snowflake-arctic-ft-1/commit/ac1259e0a86b251f0353f770eb6481b230590f0b'

## 5: Evaluating our Retriever

In [32]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [37]:
import tqdm

def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [38]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:11<00:00,  2.07it/s]


In [39]:
te3_results_df = pd.DataFrame(te3_results)

In [40]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-l` (base)

In [41]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 24/24 [00:00<00:00, 45.96it/s]


In [42]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [43]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.9166666666666666

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [44]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:00<00:00, 48.21it/s]


In [45]:
finetune_results_df = pd.DataFrame(finetune_results)

In [46]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [49]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [50]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [51]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [53]:
from langchain_openai import ChatOpenAI

rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [54]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [55]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is an infuriatingly vague term that generally refers to AI systems that can act on your behalf. There are two main interpretations: one sees agents as systems that go and perform tasks for you (like a travel agent), while the other views them as LLMs (large language models) that have access to tools and can run processes in a loop to solve problems. However, the term lacks a clear and widely understood definition, leading to confusion about its meaning and utility.'

In [56]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [57]:
base_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [58]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [59]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [60]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [61]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" is a term that lacks a single, clear, and widely understood meaning in the context of AI. It generally refers to AI systems that can act on behalf of a user, but there are various interpretations of what this entails. Some people view agents as systems that autonomously perform tasks, similar to a travel agent, while others think of them as LLMs (large language models) that utilize tools to solve problems. The term is often associated with concepts of autonomy, but there is significant ambiguity and skepticism surrounding their practical utility, particularly due to issues like gullibility in AI systems.'

In [62]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [65]:
finetune_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [64]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Llama 3.2 3B model.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

#### 🎯 Answer:
- The fine-tuned model answered better, although in my version it could not answer the laziest month question.

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

In [72]:
!pip install -qU ragas==0.2.10 unstructured==0.16.12

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/981.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m48.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m75.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m590.6/590.6 kB[0m [31m43.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m167.1/167.1 kB[0m [31m14.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m100.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m112.5/112.5 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m300.7/300.7 kB[0m [31m24.9 MB/s[0m eta 

In [69]:
os.environ["RAGAS_APP_TOKEN"] = getpass.getpass("Please enter your Ragas API key!")

Please enter your Ragas API key!··········


In [70]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [73]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [74]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [75]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What insights does the Chatbot Arena Leaderboa...,[Voice and live camera mode are science fictio...,The Chatbot Arena Leaderboard reveals that 18 ...,single_hop_specifc_query_synthesizer
1,What are the cost and efficiency benefits of u...,[the then-new GPT-4 Turbo and $1/mTok for GPT-...,GPT-4 Turbo is part of a trend where increased...,single_hop_specifc_query_synthesizer
2,Who is Steve Krouse and what did he build?,[ChatGPT voice mode now provides the option to...,Steve Krouse from Val Town built a version of ...,single_hop_specifc_query_synthesizer
3,What role does Vercel play in the context of p...,[I’m beginning to see the most popular idea of...,Vercel's Malte Ubl mentioned that when @v0 fir...,single_hop_specifc_query_synthesizer
4,How has the universal access to AI models and ...,[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the universal access to AI models was...",multi_hop_abstract_query_synthesizer
5,How have agents and increased competition and ...,[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the development and pricing of large ...",multi_hop_abstract_query_synthesizer
6,How have agents and increased competition and ...,[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the development and pricing of large ...",multi_hop_abstract_query_synthesizer
7,How has the universal access to AI models and ...,[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the brief period of universal access ...",multi_hop_abstract_query_synthesizer
8,What were the key advancements and societal im...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, Simon Willison's weblog highlighted t...",multi_hop_specific_query_synthesizer
9,How does the concept of 'vibes based developme...,[<1-hop>\n\nBased Development As a computer sc...,The concept of 'vibes based development' relat...,multi_hop_specific_query_synthesizer


In [76]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/5502f1d2-8269-4bb9-9cdc-fa21895b3429


'https://app.ragas.io/dashboard/alignment/testset/5502f1d2-8269-4bb9-9cdc-fa21895b3429'

### Base Embedding Model - `base_rag_chain`

In [77]:
for test_row in dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [78]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What insights does the Chatbot Arena Leaderboa...,"[Prompt injection explained, with video, slide...",[Voice and live camera mode are science fictio...,The Chatbot Arena Leaderboard indicates that t...,The Chatbot Arena Leaderboard reveals that 18 ...,single_hop_specifc_query_synthesizer
1,What are the cost and efficiency benefits of u...,[That same laptop that could just about run a ...,[the then-new GPT-4 Turbo and $1/mTok for GPT-...,The provided context does not contain specific...,GPT-4 Turbo is part of a trend where increased...,single_hop_specifc_query_synthesizer
2,Who is Steve Krouse and what did he build?,"[So far, I think they’re a net positive. I’ve ...",[ChatGPT voice mode now provides the option to...,I do not know.,Steve Krouse from Val Town built a version of ...,single_hop_specifc_query_synthesizer
3,What role does Vercel play in the context of p...,"[Prompt injection explained, with video, slide...",[I’m beginning to see the most popular idea of...,I do not know.,Vercel's Malte Ubl mentioned that when @v0 fir...,single_hop_specifc_query_synthesizer
4,How has the universal access to AI models and ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nVoice and live camera mode are sci...,The universal access to AI models and the ethi...,"In 2024, the universal access to AI models was...",multi_hop_abstract_query_synthesizer
5,How have agents and increased competition and ...,"[Prompt injection explained, with video, slide...",[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the development and pricing of large ...","In 2024, the development and pricing of large ...",multi_hop_abstract_query_synthesizer
6,How have agents and increased competition and ...,"[Prompt injection explained, with video, slide...",[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the development and pricing of large ...","In 2024, the development and pricing of large ...",multi_hop_abstract_query_synthesizer
7,How has the universal access to AI models and ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nVoice and live camera mode are sci...,The universal access to AI models and the intr...,"In 2024, the brief period of universal access ...",multi_hop_abstract_query_synthesizer
8,What were the key advancements and societal im...,[Everything tagged “llms” on my blog in 2024\n...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, the societal impacts of Large Languag...","In 2023, Simon Willison's weblog highlighted t...",multi_hop_specific_query_synthesizer
9,How does the concept of 'vibes based developme...,"[Meanwhile, it’s increasingly common for end u...",[<1-hop>\n\nBased Development As a computer sc...,The concept of 'vibes based development' relat...,The concept of 'vibes based development' relat...,multi_hop_specific_query_synthesizer


In [79]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [80]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

In [81]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[14]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
ERROR:ragas.executor:Exception raised in Job[20]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.2288, 'faithfulness': 0.6264, 'factual_correctness': 0.3910, 'answer_relevancy': 0.6277, 'context_entity_recall': 0.1986, 'noise_sensitivity_relevant': 0.2493}

### FineTuned Embedding Model - `finetune_rag_chain`

In [82]:
for test_row in dataset:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [83]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What insights does the Chatbot Arena Leaderboa...,"[Then in December, the Chatbot Arena team intr...",[Voice and live camera mode are science fictio...,The Chatbot Arena Leaderboard provides insight...,The Chatbot Arena Leaderboard reveals that 18 ...,single_hop_specifc_query_synthesizer
1,What are the cost and efficiency benefits of u...,[This remains astonishing to me. I thought a m...,[the then-new GPT-4 Turbo and $1/mTok for GPT-...,The cost and efficiency benefits of using GPT-...,GPT-4 Turbo is part of a trend where increased...,single_hop_specifc_query_synthesizer
2,Who is Steve Krouse and what did he build?,"[So far, I think they’re a net positive. I’ve ...",[ChatGPT voice mode now provides the option to...,I do not know.,Steve Krouse from Val Town built a version of ...,single_hop_specifc_query_synthesizer
3,What role does Vercel play in the context of p...,[The boring yet crucial secret behind good sys...,[I’m beginning to see the most popular idea of...,"Vercel, represented by Malte Ubl in the contex...",Vercel's Malte Ubl mentioned that when @v0 fir...,single_hop_specifc_query_synthesizer
4,How has the universal access to AI models and ...,[Things we learned about LLMs in 2024\n\n\n\n\...,[<1-hop>\n\nVoice and live camera mode are sci...,The context provided does not contain specific...,"In 2024, the universal access to AI models was...",multi_hop_abstract_query_synthesizer
5,How have agents and increased competition and ...,[Things we learned about LLMs in 2024\n\n\n\n\...,[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the development of large language mod...","In 2024, the development and pricing of large ...",multi_hop_abstract_query_synthesizer
6,How have agents and increased competition and ...,[Things we learned about LLMs in 2024\n\n\n\n\...,[<1-hop>\n\nVoice and live camera mode are sci...,"In 2024, the development of large language mod...","In 2024, the development and pricing of large ...",multi_hop_abstract_query_synthesizer
7,How has the universal access to AI models and ...,[The legal arguments here are complex. I’m not...,[<1-hop>\n\nVoice and live camera mode are sci...,The universal access to AI models and the intr...,"In 2024, the brief period of universal access ...",multi_hop_abstract_query_synthesizer
8,What were the key advancements and societal im...,[Things we learned about LLMs in 2024\n\n\n\n\...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, key advancements in Large Language Mo...","In 2023, Simon Willison's weblog highlighted t...",multi_hop_specific_query_synthesizer
9,How does the concept of 'vibes based developme...,[Except... you can run generated code to see i...,[<1-hop>\n\nBased Development As a computer sc...,The concept of 'vibes based development' relat...,The concept of 'vibes based development' relat...,multi_hop_specific_query_synthesizer


In [84]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [85]:
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[14]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.5503, 'faithfulness': 0.8352, 'factual_correctness': 0.4255, 'answer_relevancy': 0.7887, 'context_entity_recall': 0.4168, 'noise_sensitivity_relevant': 0.1760}

### Results

**Base Embedding Model**
```python
{'context_recall': 0.2288, 'faithfulness': 0.6264, 'factual_correctness': 0.3910, 'answer_relevancy': 0.6277, 'context_entity_recall': 0.1986, 'noise_sensitivity_relevant': 0.2493}
```

**Finetuned Embedding Model**
```python
{'context_recall': 0.5503, 'faithfulness': 0.8352, 'factual_correctness': 0.4255, 'answer_relevancy': 0.7887, 'context_entity_recall': 0.4168, 'noise_sensitivity_relevant': 0.1760}
```

Performance went up on all metrics (except the last one) after fine tuning.