# Fine-tuning Embeddings for RAG on Specific Data

As we kick off our "fine-tuning" week, we'll begin with the lowest-hanging improvement available for RAG:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-m`
  - Task 5: Evaluating our Retriever

- 🤝 Breakout Room #2:
  - Task 1: Vibe Checking Our LCEL RAG Chain
  - Task 2: Ragas Evaluation



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
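To make "closer together" concrete, here is a minimal sketch using cosine similarity on made-up 3-D vectors. The vectors and the 50% "nudge" are illustrative assumptions standing in for what training does; real embeddings from this model have 768 dimensions and are updated by gradient descent, not by direct interpolation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (illustrative values only).
question_before = [0.9, 0.1, 0.3]  # "Who drives the bus?" before fine-tuning
document = [0.2, 0.8, 0.5]         # "The bus was driven by Kyle, the Bus Driver"

# Stand-in for the effect of training: move the question halfway toward the document.
question_after = [q + 0.5 * (d - q) for q, d in zip(question_before, document)]

sim_before = cosine_similarity(question_before, document)
sim_after = cosine_similarity(question_after, document)
print(f"before: {sim_before:.3f}  after: {sim_after:.3f}")
```

A retriever ranks documents by this kind of similarity score, so raising it for true question/document pairs is what improves retrieval.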

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model better at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

---
# **ADDENDUM TO ANSWER #1 BY ME: Risk of catastrophic forgetting**
*In addition to the above caveats, I would add one that I think is pretty important:*

- *The risk of catastrophic forgetting is present when fine-tuning LLMs, and it appears to increase for models with a larger number of parameters. Affected areas include domain knowledge, reasoning ability, and reading comprehension. Apart from model size, model architecture also appears to play a role: researchers have found that decoder-only models appear to be less prone to this phenomenon than, for example, encoder-decoder models.*
- *The literature suggests fine-tuning on a broader domain, since the narrower the fine-tuning domain, the greater the risk of catastrophic forgetting.*

---

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

### NOTE - Pin the version of langchain_core (change from notebook that was placed on AIE4 github)

In [None]:
# !pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [2]:
!pip install -qU langchain_openai langchain_huggingface langchain_core==0.2.38 langchain langchain_community langchain-text-splitters


In [3]:
!pip install -qU faiss-cpu unstructured==0.15.7 python-pptx==1.0.2 nltk==3.9.1


### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll be using a recent document released by the EU 'laying down harmonised rules on artificial intelligence and amending Regulations'.

The data can be found [here](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689), though we will be using an HTML version which was collected into the AIM DataRepository.

First, we'll clone and then `cd` into the DataRepository.

In [5]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 90, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 90 (delta 24), reused 29 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (90/90), 70.26 MiB | 42.00 MiB/s, done.
Resolving deltas: 100% (24/24), done.


In [6]:
%cd DataRepository

/content/DataRepository


Next we're going to be using the `UnstructuredHTMLLoader` to load our HTML document into a LangChain document using the [Unstructured](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.html.UnstructuredHTMLLoader.html) library.

In [7]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

training_documents_loaded = UnstructuredHTMLLoader("eu_ai_act.html")

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
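To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified stand-in written in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraphs and sentences first); the hypothetical `naive_chunks` helper only shows how a sliding window with overlap behaves.

```python
def naive_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = naive_chunks(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one, which is the role the small overlap of 20 characters plays in the cell above.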

Next we can load/split these documents as follows.

In [9]:
training_documents = text_splitter.split_documents(training_documents_loaded.load())

In [10]:
len(training_documents)

1029

Next, we're going to associate each of our chunks with a unique identifier.

In [11]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = str(uuid.uuid4())  # keep the retry value a string, like the first attempt
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create training, validation, and test sets to prepare our data for the next step.

---
## Splitting into Train, Val and Test Splits: MY COMMENT ON THE CELL BELOW
> This is a rather naive way to get the training, val and test sets. The principle is that the sample distribution in each set should be similar. I'm no legal expert, but in a legal text like the EU AI Regulation Act, the early part of the document probably contains the preamble, the purpose of the regulation, common definitions, etc., while the later sections are likely to be reference material: annexes, appendices, tables, etc. Taking contiguous slices from the start of the list is therefore probably not a great way to split the sample. A better way might be to assign each `Document` object in the list at random to one of the three splits (train/val/test). More sophisticated approaches are also possible: they would look more closely at how content is distributed across the document and try to even out the types of `Document` objects selected into each sub-sample.

In [12]:
training_split_documents = training_documents[:300]
val_split_documents = training_documents[300:350]
test_split_documents = training_documents[350:400]

### The cell below implements a simple randomization strategy to generate train, val, test samples to make each split more alike in terms of the distribution of `Document` types.

### Note - This is placed here for my future reference!!!
### It is not run in this iteration of the notebook as the basic split above yields good results!!!

In [None]:
# NOTE - THIS CELL IS NOT EXECUTED IN THIS VERSION OF THE NOTEBOOK
### I'VE INSERTED IT AS AN EXAMPLE AND MY FUTURE REFERENCE

import random

# set the same seed to be able to replicate the result of
# random shuffle below
random.seed(69)

# random.shuffle works in place and returns None, so shuffle a copy
# and leave training_documents in its original order
randomly_ordered_documents = list(training_documents)
random.shuffle(randomly_ordered_documents)

# assign slices to training, val and test
training_split_documents = randomly_ordered_documents[:300]
val_split_documents = randomly_ordered_documents[300:350]
test_split_documents = randomly_ordered_documents[350:400]


---

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [July 18th](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [13]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [14]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [15]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object whose keys are `id`s and whose values are the `str` questions.
- An object whose keys are `question_id`s and whose values are `List(str)`: the list of associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [16]:
import tqdm

import re

# note this regex pattern will be used below.
# First group will be the question number indicated by one or more digits \d
# Terminator of that group is the period '.' that will separate the two groups!
# Second group is any character that follows this... the idea is to
# capture the entire text of the question.
# One could try to be more specific about the character set being captured,
# but the question could have special characters like punctuation marks
# like comma, period, exclamation marks, as well as special symbols
# like %, $ etc.
# To prevent the regex from missing anything, we can just use a
# global for any character, (.+)
# The period is escaped so it matches a literal '.' rather than any character.
PATTERN = r'^(\d+)\.(.+)'


### I ADDED THIS HELPER FUNCTION
# receives an empty or non-empty set of existing ids
# generates a unique id not in the set and returns
# the unique id, as well as the updated set of ids
def get_unique_id(id_set):
  """
  function accepts a set of ids and returns a unique id not in the set
  """
  id = str(uuid.uuid4())
  while id in id_set:
    id = str(uuid.uuid4())
  id_set.add(id)
  return id, id_set


# I ADDED CODE IN THE FUNCTION BELOW -
# NOTE - uses the REGEX pattern above to extract questions
def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}

  ### YOUR CODE HERE
  q_id_set = set()
  for document in tqdm.tqdm(documents):  # note tqdm.tqdm (NOT just tqdm as in original notebook)
    this_question_set = question_generation_chain.invoke({'context': document.page_content, 'n_questions': n_questions})
    for question in this_question_set.content.split("\n"):
      matched_pattern = re.search(PATTERN, question.strip())  # regex search for n. <question>
      if matched_pattern is None:
        continue  # skip blank lines and anything that is not a numbered question
      question_text = matched_pattern.group(2).strip()  # extraction of question string
      if len(question_text) > 0:
        q_id, q_id_set = get_unique_id(q_id_set)  # only allocate an id for a real question
        questions[q_id] = question_text
        relevant_docs[q_id] = [document.metadata["id"]]
  return questions, relevant_docs
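To sanity-check the parsing step without calling the LLM, the sketch below runs a similar regex over a made-up response string. The `mock_response` text, the `NUMBERED_LINE` name, and the `parse_numbered_questions` helper are all hypothetical; the pattern is a slightly stricter variant that also tolerates leading whitespace.

```python
import re

# Leading digits, a literal period, optional spaces, then the question text.
NUMBERED_LINE = re.compile(r"^\s*(\d+)\.\s*(.+)$")

def parse_numbered_questions(response_text):
    """Return the question strings from a numbered-list style response."""
    parsed = []
    for line in response_text.split("\n"):
        match = NUMBERED_LINE.match(line)
        if match is None:
            continue  # blank lines, preambles, and other non-question lines
        parsed.append(match.group(2).strip())
    return parsed

# Hypothetical model output, including a preamble and a blank line.
mock_response = (
    "Here are your questions:\n"
    "1. What is the main purpose of the Regulation?\n"
    "\n"
    "2. Which systems are classified as high-risk (e.g. Annex III)?\n"
)
parsed_questions = parse_numbered_questions(mock_response)
print(parsed_questions)
```

Note that the second question keeps its internal periods ("e.g."), which a split on `.` would lose.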


### NOTE!!! An alternate way to write the `create_questions` function!!!
We discussed this approach in our breakout room with our Peer Supporter Robert guiding us.

I have inserted it below for future reference, but have written the one above using `regex` so that I can get experience writing regular expression based string matching.  I think it may be a more general way of retrieving output from the LLM chat completions.

In [None]:
# NOTE THIS CELL IS NOT RUN IN THIS VERSION OF NOTEBOOK
# INSERTED HERE FOR REFERENCE


import tqdm

def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}
  for document in tqdm.tqdm(documents):
    document_content = {"context" : document.page_content, "questions" : []}
    questions_generated = question_generation_chain.invoke({"context": document.page_content, "n_questions": n_questions})
    for question in questions_generated.content.split("\n"):
      question_id = str(uuid.uuid4())
      questions[question_id] = "".join(question.split(".")[1:]).strip()
      relevant_docs[question_id] = [document.metadata["id"]]
  return questions, relevant_docs

We'll use the function to generate training, validation, and test data with `n_questions=2` for each.

## *I added code below to prepare the questions and relevant contexts for training, val and test split*

In [17]:
training_questions, training_relevant_contexts = create_questions(documents=training_split_documents, n_questions=2)

100%|██████████| 300/300 [04:29<00:00,  1.11it/s]


In [20]:
list(training_questions.values())[:5]

['What is the main purpose of Regulation (EU) 2024/1689 as outlined in the context?',
 'Which specific regulations and directives are amended by the Artificial Intelligence Act mentioned in the context?',
 'What opinions were considered after the transmission of the draft legislative act to the national parliaments?',
 'Which legislative procedure is being followed in the context of the draft legislative act?',
 'What is the primary purpose of the Regulation regarding artificial intelligence systems in the Union?']

In [21]:
val_questions, val_relevant_contexts = create_questions(documents=val_split_documents, n_questions=2)

100%|██████████| 50/50 [00:36<00:00,  1.37it/s]


In [22]:
test_questions, test_relevant_contexts = create_questions(documents=test_split_documents, n_questions=2)

100%|██████████| 50/50 [00:34<00:00,  1.45it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

> NOTE: If you ran into issues creating the data - you can use the data from the DataRepository. It's simply called: `train_dataset.jsonl`, etc.

### Annotation of the datasets created below

  - `questions`: is a dictionary with unique key ids and values are the questions
  - `relevant_contexts`: is the dictionary with keys as the unique question ids and values as the unique context ids
  - `corpus`: is a dictionary with keys as the unique context ids and values as the text of the context

### NOTE on `jsonl` files created below

Each `jsonl` file has a single line!

This is a nested JSON structure; primary keys for each file are `questions`, `relevant_contexts` and `corpus`.  

1.  Each `questions` element has a question id as its key and the corresponding question string as its value.

2.  Each `relevant_contexts` element has a question id as its key and a list of the context ids relevant to that question as its value.

3.  Each `corpus` element has a unique context id as its key and the context string as its value.
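As a small illustration of this layout, the sketch below writes and reloads a tiny made-up dataset with the same three top-level keys, then joins questions to their context text through the shared ids. The ids, the file name `toy_dataset.jsonl`, and the temporary directory are assumptions for the example only.

```python
import json
import os
import tempfile

# Tiny hypothetical dataset with the same three top-level keys.
toy_dataset = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["c1"]},
    "corpus": {"c1": "The bus was driven by Kyle, the Bus Driver."},
}

path = os.path.join(tempfile.mkdtemp(), "toy_dataset.jsonl")
with open(path, "w") as f:
    json.dump(toy_dataset, f)  # one JSON object written on a single line

with open(path) as f:
    loaded = json.load(f)

# Join each question to its context text through the shared ids.
pairs = [
    (question, loaded["corpus"][context_id])
    for question_id, question in loaded["questions"].items()
    for context_id in loaded["relevant_contexts"][question_id]
]
print(pairs)
```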

In [23]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [24]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

### NOTE - I Changed `train_corpus` in two places below to `test_corpus`.  
Am guessing it was just an oversight in the original notebook

In [25]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-m`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-m`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

In [None]:
# !pip install -qU sentence_transformers datasets pyarrow

In [26]:
# !pip uninstall -y pyarrow
!pip install -qU sentence_transformers datasets pyarrow==14.0.1

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.6.1 requires fsspec==2024.6.1, but you have fsspec 2024.3.1 which is incompatibl

In [27]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-m"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [28]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [29]:
BATCH_SIZE = 20

Let's move our dataset into the expected format for training.

In [30]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [31]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [32]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

## ANSWER TO ACTIVITY #2

## A. DESCRIPTION OF `MultipleNegativesRankingLoss`

`MultipleNegativesRankingLoss` expects as input a BATCH of PAIRS of TEXT SEQUENCES (e.g., `question` and `answer` OR `question` and `context` pairs).  

The assumption is that the two text sequences in each pair are a POSITIVE SAMPLE: the model should learn embeddings that place these two sequences close together, i.e., the model should learn that the two sequences are associated with one another.

This loss implements NEGATIVE SAMPLING using the batch itself. Negative sampling is the technique where the model is also shown pairs of text sequences that DO NOT belong together. Let's illustrate. Index each pair in a batch as `(x_i, y_i)` for `i = 1, 2, ..., n_batch`, where `n_batch` is the batch size. For a given positive pair, say `(x_1, y_1)`, `MultipleNegativesRankingLoss` treats every other in-batch combination `(x_1, y_2), (x_1, y_3), ..., (x_1, y_n_batch)` as a negative sample. In other words, each positive sample comes with `n_batch - 1` negative samples.

Summary: each batch has `n_batch` positive samples and `n_batch * (n_batch - 1)` negative samples. For each `x_i`, the similarity scores against every `y_j` in the batch are passed through a softmax, and the loss is the negative log-likelihood (cross-entropy) of the true match `y_i`.

This loss function works well for retrieval setups which naturally have matching pairs of text sequences such as `(question, answer)`.
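A minimal pure-Python sketch of the in-batch negatives idea follows. It is not the library's implementation: the function name `in_batch_negatives_loss`, the 2-D vectors, and the batch of three pairs are made up for illustration, and the use of cosine similarity with `scale=20.0` is an assumption about typical settings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_negatives_loss(queries, documents, scale=20.0):
    """Mean cross-entropy where, for query i, document i is the correct class."""
    losses = []
    for i, query in enumerate(queries):
        scores = [scale * cosine(query, doc) for doc in documents]
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[i])  # equals -log softmax(scores)[i]
    return sum(losses) / len(losses)

# Hypothetical 2-D embeddings for a batch of three (query, document) pairs.
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_queries = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]     # each query near its own document
misaligned_queries = [[0.1, 0.9], [-0.9, 0.1], [0.9, 0.1]]  # each query near a wrong document

loss_aligned = in_batch_negatives_loss(aligned_queries, docs)
loss_misaligned = in_batch_negatives_loss(misaligned_queries, docs)
print(f"aligned: {loss_aligned:.4f}  misaligned: {loss_misaligned:.4f}")
```

The loss is small when each query scores highest against its own document and large when it scores highest against another document in the batch, which is the ranking behaviour the summary describes.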


## B. DESCRIPTION OF `MatryoshkaLoss`

The `MatryoshkaLoss` is best interpreted as a LOSS MODIFIER that lets the model-builder apply some other, more traditional, loss function at several embedding dimensions. Its main purpose is to train embeddings that remain useful at several sizes, so the user retains the option of using smaller or larger embeddings to control speed and cost. Much like Matryoshka dolls, i.e. nested Russian dolls, the embeddings produced with Matryoshka loss may be viewed as a series of embeddings nested from the smallest dimension to the largest. A user who wants smaller embeddings may truncate the full embedding vector at one of the pre-specified cutoffs and use only that leading subset, BY MODEL DESIGN!

To that end, the loss function requires the user to specify a list of the dimensions needed (e.g., [768, 512] etc.) and, optionally, the weight to be placed on each embedding dimension. The default is equal weight across all specified embedding dimensions.

The underlying traditional loss function is computed at each embedding dimension specified in the Matryoshka loss, and the results are combined using the specified weights. Through this mechanism, the leading dimensions are forced to carry the most important information, while the later dimensions add finer detail that still improves performance if the user chooses to use the full embedding.
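The two ideas in this description, truncating an embedding to its leading dimensions and combining an inner loss across several sizes, can be sketched in plain Python. Everything here is illustrative: the 8-D vector, the helper names, and the placeholder `toy_inner_loss` are assumptions, and a weighted sum is used as one plausible way to combine the per-dimension losses.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

def matryoshka_style_loss(inner_loss_at_dim, dims, weights=None):
    """Weighted sum of an inner loss evaluated at each truncation size."""
    weights = weights or [1.0] * len(dims)  # default: equal weight per dimension
    return sum(w * inner_loss_at_dim(d) for w, d in zip(weights, dims))

full_embedding = [0.5, -0.4, 0.3, 0.2, -0.1, 0.05, 0.02, 0.01]  # hypothetical 8-D vector
small_embedding = truncate_and_normalize(full_embedding, 4)

# Placeholder inner loss: pretend the loss shrinks as more dimensions are kept.
def toy_inner_loss(dim):
    return 1.0 / dim

total_loss = matryoshka_style_loss(toy_inner_loss, dims=[8, 4, 2])
print(small_embedding, total_loss)
```

Because every truncation size contributes to the total, the optimizer cannot ignore the short prefixes, which is what pushes the most useful information into the leading dimensions.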


Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [33]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if you had significantly more data.

In [34]:
EPOCHS = 5

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
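The warm-up arithmetic can be worked through by hand. The sketch below assumes 600 training examples (300 chunks with 2 parsed questions each), which is an assumption rather than a number printed in this notebook; it is consistent with the evaluation rows at steps 30, 60, 90, 120 and 150 in the training log, which look like epoch boundaries.

```python
import math

num_examples = 600  # assumption: 300 training chunks x 2 parsed questions each
batch_size = 20     # BATCH_SIZE in this notebook
epochs = 5          # EPOCHS in this notebook

steps_per_epoch = math.ceil(num_examples / batch_size)  # what len(loader) would report
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)  # first 10% of all training steps

print(steps_per_epoch, total_steps, warmup_steps)  # 30 150 15
```

Under these assumptions the learning rate ramps up over the first 15 of 150 steps, i.e. half of the first epoch.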

In [35]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)

Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100,Dot Accuracy@1,Dot Accuracy@3,Dot Accuracy@5,Dot Accuracy@10,Dot Precision@1,Dot Precision@3,Dot Precision@5,Dot Precision@10,Dot Recall@1,Dot Recall@3,Dot Recall@5,Dot Recall@10,Dot Ndcg@10,Dot Mrr@10,Dot Map@100
30,No log,No log,0.86,0.96,0.99,1.0,0.86,0.32,0.198,0.1,0.86,0.96,0.99,1.0,0.93524,0.91375,0.91375,0.86,0.96,0.99,1.0,0.86,0.32,0.198,0.1,0.86,0.96,0.99,1.0,0.93524,0.91375,0.91375
50,No log,No log,0.89,0.96,1.0,1.0,0.89,0.32,0.2,0.1,0.89,0.96,1.0,1.0,0.948335,0.931167,0.931167,0.89,0.96,1.0,1.0,0.89,0.32,0.2,0.1,0.89,0.96,1.0,1.0,0.948335,0.931167,0.931167
60,No log,No log,0.88,0.96,1.0,1.0,0.88,0.32,0.2,0.1,0.88,0.96,1.0,1.0,0.944645,0.926167,0.926167,0.88,0.96,1.0,1.0,0.88,0.32,0.2,0.1,0.88,0.96,1.0,1.0,0.944645,0.926167,0.926167
90,No log,No log,0.88,0.96,1.0,1.0,0.88,0.32,0.2,0.1,0.88,0.96,1.0,1.0,0.945516,0.927333,0.927333,0.88,0.96,1.0,1.0,0.88,0.32,0.2,0.1,0.88,0.96,1.0,1.0,0.945516,0.927333,0.927333
100,No log,No log,0.88,0.96,1.0,1.0,0.88,0.32,0.2,0.1,0.88,0.96,1.0,1.0,0.945516,0.927333,0.927333,0.88,0.96,1.0,1.0,0.88,0.32,0.2,0.1,0.88,0.96,1.0,1.0,0.945516,0.927333,0.927333
120,No log,No log,0.88,0.97,1.0,1.0,0.88,0.323333,0.2,0.1,0.88,0.97,1.0,1.0,0.946209,0.928167,0.928167,0.88,0.97,1.0,1.0,0.88,0.323333,0.2,0.1,0.88,0.97,1.0,1.0,0.946209,0.928167,0.928167
150,No log,No log,0.88,0.97,1.0,1.0,0.88,0.323333,0.2,0.1,0.88,0.97,1.0,1.0,0.946209,0.928167,0.928167,0.88,0.97,1.0,1.0,0.88,0.323333,0.2,0.1,0.88,0.97,1.0,1.0,0.946209,0.928167,0.928167
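One pattern in this log is worth a quick check: because each validation question has exactly one relevant chunk, Precision@k should equal Recall@k divided by k. The sketch below verifies that with the step-50 values copied from the log.

```python
import math

# Recall@k values copied from the step-50 row of the log above.
recall_at_k = {1: 0.89, 3: 0.96, 5: 1.0, 10: 1.0}

# With one relevant document per query, at most 1 of the k retrieved items can be relevant.
precision_at_k = {k: recall / k for k, recall in recall_at_k.items()}

for k, precision in precision_at_k.items():
    print(f"Precision@{k} = {precision:.6f}")
```

This is why Precision@10 sits at 0.1 even when retrieval is perfect: nine of the ten retrieved chunks are necessarily not the labelled one.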



## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [36]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [37]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for question_id, question in tqdm.tqdm(questions.items()):
    # Retrieve the top_k documents and collect their corpus ids
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    # Only the first relevant document counts as the expected hit
    expected_id = relevant_docs[question_id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": question_id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
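
To make the "hit" definition concrete, here is a minimal sketch of the same metric without any vector store. The question ids, document ids, and the `hit_rate` helper are hypothetical toy values for illustration only; they are not part of the notebook's dataset.

```python
def hit_rate(retrieved_ids_per_question, expected_ids, top_k=5):
    """Fraction of questions whose expected doc id is among the top_k retrieved ids."""
    hits = 0
    for question_id, expected_id in expected_ids.items():
        top_ids = retrieved_ids_per_question.get(question_id, [])[:top_k]
        hits += expected_id in top_ids  # True counts as 1, False as 0
    return hits / len(expected_ids)


# Hypothetical toy data: ranked doc ids returned for each question
toy_retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d8", "d6", "d5"],
    "q4": ["d1", "d4", "d2"],
}
# Hypothetical toy data: the single correct doc id for each question
toy_expected = {"q1": "d1", "q2": "d2", "q3": "d0", "q4": "d2"}

print(hit_rate(toy_retrieved, toy_expected, top_k=3))  # 3 of 4 questions hit
print(hit_rate(toy_retrieved, toy_expected, top_k=1))  # only q2 hits at rank 1
```

Note how the same retriever looks much weaker at `top_k=1` than at `top_k=3`: hit rate ignores *where* in the top k the document appears, so it is worth reporting alongside the value of k that was used.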

All that's left to do is evaluate. We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-m`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [38]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 100/100 [00:17<00:00,  5.74it/s]


In [39]:
te3_results_df = pd.DataFrame(te3_results)

In [40]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

0.98

### `Snowflake/snowflake-arctic-embed-m` (base)

In [41]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-m")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 100/100 [00:01<00:00, 86.10it/s]


In [42]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [43]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.6

### `Snowflake/snowflake-arctic-embed-m` (fine-tuned)

In [44]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 100/100 [00:01<00:00, 80.77it/s]


In [45]:
finetune_results_df = pd.DataFrame(finetune_results)

In [46]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0
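
To put the three hit rates side by side, here is a small sketch that only re-uses the numbers reported above. The question count of 100 is read off the progress bars, and the `misses` helper is a hypothetical convenience, not something defined earlier in the notebook.

```python
# Hit rates copied from the cell outputs above (top_k=5)
hit_rates = {
    "text-embedding-3-small": 0.98,
    "snowflake-arctic-embed-m (base)": 0.60,
    "snowflake-arctic-embed-m (fine-tuned)": 1.00,
}
n_questions = 100  # assumption: taken from the 100/100 progress bars


def misses(rate, n):
    """Number of questions whose expected document was not in the top k."""
    return round((1 - rate) * n)


# Print the models from best to worst hit rate
for name, rate in sorted(hit_rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} hit_rate={rate:.2f} misses={misses(rate, n_questions)}")

gain = hit_rates["snowflake-arctic-embed-m (fine-tuned)"] - hit_rates["snowflake-arctic-embed-m (base)"]
print(f"absolute gain from fine-tuning: {gain:.2f}")
```

With only 100 test questions, the gap between 0.98 and 1.0 is just two questions, so the comparison with `text-embedding-3-small` should be read as "roughly on par" rather than as a clear win.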

# 🤝 Breakout Room #2

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [47]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=50,
    length_function=len,
)

training_documents = text_splitter.split_documents(training_documents_loaded.load())
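
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch. It is *not* how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the `sliding_window_chunks` function is a hypothetical fixed-window stand-in that only shows the size/overlap arithmetic.

```python
def sliding_window_chunks(text, chunk_size=600, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 1500-character sample text
sample = "".join(chr(97 + i % 26) for i in range(1500))
chunks = sliding_window_chunks(sample)

print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-50:] == chunks[1][:50])
```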

### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [48]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [49]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [50]:
rag_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
)

#### RAG - LCEL RAG Pipeline

In [51]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
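
If the LCEL syntax is hard to follow, the data flow of the chain above can be traced in plain Python. This is only a sketch under assumptions: `fake_retriever` and `fake_llm` are hypothetical stubs standing in for `base_retriever` and `rag_prompt_template | rag_llm | StrOutputParser()`, and the prompt is a shortened version of `RAG_PROMPT`.

```python
from operator import itemgetter

PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"


def fake_retriever(question):
    # Hypothetical stand-in for base_retriever: returns a list of "documents"
    return [f"chunk about: {question}"]


def fake_llm(prompt):
    # Hypothetical stand-in for rag_prompt_template | rag_llm | StrOutputParser()
    return f"answer based on {len(prompt)} prompt characters"


def run_chain(inputs):
    # Step 1: the first dict maps the input to {"context", "question"}
    step1 = {
        "context": fake_retriever(itemgetter("question")(inputs)),
        "question": itemgetter("question")(inputs),
    }
    # Step 2: assign(context=itemgetter("context")) re-assigns the same value,
    # so the dict passes through unchanged
    step2 = {**step1, "context": itemgetter("context")(step1)}
    # Step 3: the final dict produces the response and also returns the context
    prompt = PROMPT.format(context=step2["context"], question=step2["question"])
    return {"response": fake_llm(prompt), "context": step2["context"]}


result = run_chain({"question": "Why does the EU want to regulate AI?"})
print(sorted(result.keys()))
```

Returning `context` alongside `response` is what lets us hand the retrieved chunks to an evaluation step later, instead of only seeing the final answer.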

In [52]:
base_rag_chain.invoke({"question" : "Why does the EU want to regulate AI?"})["response"]

'The EU wants to regulate AI to promote a human-centric approach to AI, ensure the development of secure, trustworthy, and ethical AI, protect ethical principles, and facilitate the protection of natural persons, undertakings, democracy, the rule of law, and environmental protection. Additionally, the regulation aims to boost innovation and employment, positioning the Union as a leader in the uptake of trustworthy AI.'

In [53]:
base_rag_chain.invoke({"question" : "What are the codes of practice?"})["response"]

'I do not know.'

In [54]:
base_rag_chain.invoke({"question" : "How many parameters is too many parameters?"})["response"]

'The context suggests that models with at least a billion parameters are considered to display significant generality and competence in performing a wide range of tasks. Therefore, having a billion parameters or more could be seen as "too many" in the sense that it indicates a high level of complexity and capability. However, the context does not specify a definitive limit beyond which parameters are considered excessive.'

In [55]:
base_rag_chain.invoke({"question" : "What is an emotion recognition system and why is it important?"})["response"]

'An emotion recognition system is a type of artificial intelligence (AI) technology designed to identify and interpret human emotions based on various inputs, such as facial expressions, voice tone, body language, or biometric data. These systems analyze patterns in the data to infer the emotional state of an individual.\n\nThe importance of emotion recognition systems lies in their potential applications across various fields, including mental health, customer service, security, and human-computer interaction. They can enhance user experiences, improve communication, and provide insights into emotional well-being. However, there are significant concerns regarding their reliability, specificity, and generalizability, as well as the ethical implications of their use, particularly regarding privacy and the risk of discriminatory outcomes.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [56]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [57]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [58]:
finetune_rag_chain.invoke({"question" : "Why does the EU want to regulate AI?"})["response"]

'The EU wants to regulate AI to improve the functioning of the internal market, promote the uptake of human-centric and trustworthy artificial intelligence, and ensure a high level of protection of health, safety, and fundamental rights as enshrined in the Charter of Fundamental Rights of the European Union. The regulation aims to protect against the harmful effects of AI systems, support innovation, and ensure the free movement of AI-based goods and services across Member States.'

In [59]:
finetune_rag_chain.invoke({"question" : "What are the codes of practice?"})["response"]

'Codes of practice are guidelines developed to ensure compliance with obligations under the AI Regulation for providers of general-purpose AI models. They aim to address systemic risks associated with these models, establish a risk taxonomy, and provide specific risk assessment and mitigation measures. The AI Office is responsible for encouraging, facilitating, and reviewing these codes, which should reflect the state of the art and consider diverse perspectives. The codes of practice are expected to be ready by May 2, 2025.'

In [60]:
finetune_rag_chain.invoke({"question" : "How many parameters is too many parameters?"})["response"]

'The context suggests that models with at least a billion parameters are considered to display significant generality and can competently perform a wide range of distinctive tasks. Therefore, it can be inferred that having a billion parameters is a threshold for being considered a model with "too many" parameters in this context. However, the exact definition of "too many" parameters may vary depending on specific use cases and requirements.'

In [61]:
finetune_rag_chain.invoke({"question" : "What is an emotion recognition system and why is it important?"})["response"]

"An emotion recognition system is an AI system designed to identify or infer the emotions or intentions of natural persons based on their biometric data. This includes recognizing emotions such as happiness, sadness, anger, surprise, disgust, embarrassment, excitement, shame, contempt, satisfaction, and amusement. It is important because it can have significant implications for various applications, including security, marketing, and mental health. However, there are serious concerns regarding the scientific reliability of these systems, as emotional expressions can vary widely across different cultures and situations, which may lead to discriminatory outcomes and potential intrusions into individuals' rights and freedoms."

#####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

## ANSWER #2:

The `finetune` RAG chain, using the `finetune` embeddings to encode the query and contexts, answered the questions better.

For one question (`What are the codes of practice?`), the base model's LCEL chain could only answer `I do not know.`, while the `finetune` chain produced a substantive response every time.

Reading through the responses, there also appeared to be more relevant detail in some of the `finetune` chain's answers, e.g., for the question about `emotion recognition`.

In other words, the ability of the finetuned model to RETRIEVE MORE RELEVANT CONTEXTS led directly to the ability of the RAG chain based on the finetuned model to perform better.

And this is as we should expect - the finetuned embedding model was trained on the specific language of the EU AI Act corpus, so it should come as no surprise that its ability to retrieve relevant chunks is better than that of the base model, which has no specialized knowledge of the language used in the EU AI Act.
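
One way to back up the retrieval claim, beyond reading the final answers, is to compare the chunks each chain actually retrieved for the same question. The sketch below assumes we have already pulled the `page_content` strings out of each chain's `"context"` output; the `compare_contexts` helper and the toy chunk labels are hypothetical.

```python
def compare_contexts(base_chunks, finetune_chunks):
    """Summarize which retrieved chunks are shared between two retrievers."""
    base_set, finetune_set = set(base_chunks), set(finetune_chunks)
    union = base_set | finetune_set
    return {
        "shared": sorted(base_set & finetune_set),
        "only_base": sorted(base_set - finetune_set),
        "only_finetune": sorted(finetune_set - base_set),
        # Jaccard similarity: 1.0 means both retrievers returned the same chunks
        "jaccard": len(base_set & finetune_set) / len(union) if union else 1.0,
    }


# Hypothetical chunk labels standing in for retrieved page_content strings
summary = compare_contexts(["a", "b", "c"], ["b", "c", "d"])
print(summary)
```

A low overlap on a question where the base chain answered `I do not know.` would point at retrieval, rather than generation, as the source of the difference.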

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to get more insight into how things are improving!

In [62]:
!pip install -qU ragas


### RAGAS Synthetic Testset Generation

First things first, we need to generate some data to test our model on.

Let's use our test data that we created before as a base!

In [63]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

In [64]:
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

In [65]:
testset = \
    generator.generate_with_langchain_docs(
        test_split_documents,
        test_size=20,
        distributions={
            simple: 0.5,
            reasoning: 0.25,
            multi_context: 0.25
        },
        raise_exceptions=False
    )


In [66]:
testset.to_pandas().head()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,How often will the Commission evaluate and rev...,[(174) Given the rapid technological developme...,The Commission should evaluate and review this...,simple,"[{'source': 'eu_ai_act.html', 'id': '45f8c6e8-...",True
1,What can downstream providers do if they suspe...,"[the AI Office, it should provide for the poss...",Downstream providers can lodge complaints if t...,simple,"[{'source': 'eu_ai_act.html', 'id': 'fb65e296-...",True
2,What is the purpose of the AI Office in handli...,"[the AI Office, it should provide for the poss...",The purpose of the AI Office in handling compl...,simple,"[{'source': 'eu_ai_act.html', 'id': 'fb65e296-...",True
3,How can the AI Office involve independent expe...,[(164) The AI Office should be able to take th...,The AI Office can involve independent experts ...,simple,"[{'source': 'eu_ai_act.html', 'id': 'a6bf4b3a-...",True
4,How can the creation of codes of conduct for A...,[(165) The development of AI systems other tha...,The answer to given question is not present in...,simple,"[{'source': 'eu_ai_act.html', 'id': 'f908b717-...",True


### Generating Answer Datasets

For each of our pipelines, let's generate answers to these questions!

Once we have our questions, answers, contexts, and ground truths, we can move on to evaluating our datasets!

In [67]:
from datasets import Dataset

def generate_answers(chain, testset):
  answers = []
  contexts = []
  questions = testset.to_pandas()["question"].values.tolist()
  ground_truths = testset.to_pandas()["ground_truth"].values.tolist()

  for question in tqdm.tqdm(questions):
    answer = chain.invoke({"question" : question})
    answers.append(answer["response"])
    contexts.append([context.page_content for context in answer["context"]])

  return Dataset.from_dict({
      "question" : questions,
      "answer" : answers,
      "contexts" : contexts,
      "ground_truth" : ground_truths
  })

In [68]:
base_dataset = generate_answers(base_rag_chain, testset)

100%|██████████| 20/20 [00:14<00:00,  1.38it/s]


In [69]:
finetune_dataset = generate_answers(finetune_rag_chain, testset)

100%|██████████| 20/20 [00:25<00:00,  1.28s/it]


### Evaluating Using the Test Set

Now that we have a test set - it's time to evaluate our pipelines with it!

In [70]:
from ragas.metrics import (
    context_recall,
    context_precision,
)

In [71]:
from ragas import evaluate

result = evaluate(
    base_dataset,
    metrics=[
        context_precision,
        context_recall,
    ],
)


In [72]:
result

{'context_precision': 0.4570, 'context_recall': 0.3333}

In [73]:
result.to_pandas().head()

Unnamed: 0,question,contexts,answer,ground_truth,context_precision,context_recall
0,How often will the Commission evaluate and rev...,[the contracting parties should make utmost ef...,I do not know.,The Commission should evaluate and review this...,0.0,0.0
1,What can downstream providers do if they suspe...,"[the results of its monitoring activities, or ...",Downstream providers can lodge complaints abou...,Downstream providers can lodge complaints if t...,1.0,1.0
2,What is the purpose of the AI Office in handli...,"[the results of its monitoring activities, or ...",The purpose of the AI Office in handling compl...,The purpose of the AI Office in handling compl...,1.0,0.5
3,How can the AI Office involve independent expe...,[also support the provision of high-quality da...,The provided context does not contain specific...,The AI Office can involve independent experts ...,0.0,0.0
4,How can the creation of codes of conduct for A...,"[three years thereafter, the Commission should...",The creation of codes of conduct for AI system...,The answer to given question is not present in...,0.0,1.0


In [74]:
result = evaluate(
    finetune_dataset,
    metrics=[
        context_precision,
        context_recall,
    ],
)


In [75]:
result

{'context_precision': 0.8855, 'context_recall': 0.9500}

In [76]:
result.to_pandas().head()

Unnamed: 0,question,contexts,answer,ground_truth,context_precision,context_recall
0,How often will the Commission evaluate and rev...,[(174) Given the rapid technological developme...,The Commission will evaluate and review the Re...,The Commission should evaluate and review this...,0.833333,1.0
1,What can downstream providers do if they suspe...,[2. Downstream providers shall have the right ...,Downstream providers can lodge a complaint all...,Downstream providers can lodge complaints if t...,0.876667,1.0
2,What is the purpose of the AI Office in handli...,[(164) The AI Office should be able to take th...,The purpose of the AI Office in handling compl...,The purpose of the AI Office in handling compl...,1.0,0.5
3,How can the AI Office involve independent expe...,[(164) The AI Office should be able to take th...,The AI Office can involve independent experts ...,The AI Office can involve independent experts ...,1.0,1.0
4,How can the creation of codes of conduct for A...,"[three years thereafter, the Commission should...",The creation of codes of conduct for AI system...,The answer to given question is not present in...,0.0,1.0
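
To help read the per-row scores, here is a rough sketch of how the two metrics are computed once relevance verdicts exist. In RAGAS those verdicts come from an LLM judge; here they are hypothetical hand-written 0/1 lists, and the two functions are simplified re-implementations based on my understanding of the documented formulas, not the library's code.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance: 0/1 verdicts for the retrieved chunks, in ranked order.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


def context_recall(attributed):
    """Fraction of ground-truth statements supported by the retrieved context.

    attributed: 0/1 verdicts, one per statement in the ground truth.
    """
    return sum(attributed) / len(attributed) if attributed else 0.0


# Hypothetical verdicts: relevant chunks at ranks 1 and 3 out of 6 retrieved
print(round(context_precision([1, 0, 1, 0, 0, 0]), 6))
# Hypothetical verdicts: no relevant chunk retrieved at all
print(context_precision([0, 0, 0, 0, 0, 0]))
# Hypothetical verdicts: one of two ground-truth statements is supported
print(context_recall([1, 0]))
```

The takeaway is that context precision rewards putting relevant chunks *early* in the ranking, while context recall only asks whether the ground truth is covered somewhere in the retrieved set.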


#### 🏗️ Activity #3:

Discuss changes that you'd make to this pipeline based on the performance improvements that you see with RAGAS and the fine-tuning.

Come up with 3 changes, and then we'll discuss these options as a group!

1. ...
2. ...
3. ...

## ANSWER TO ACTIVITY #3: THREE SUGGESTED CHANGES TO PIPELINE

I would mostly frame this question as different ways to improve the quality of the finetuned embeddings!  Phrased this way, here are a few suggestions:

1.  I would start with a review of the sampling procedures used to select the training/val/test splits, to ensure that the TRAINING sample has adequate representation of the types of language likely to be seen in out-of-sample queries.  In the early part of the notebook, I suggested a simple way to harmonize the distribution between the train, val and test samples.

2.  Pay close attention to the hyperparameters (learning rate, etc.) to ensure that the final finetuned embeddings are close to the `optimal` point.  Deep learning models are never assured of reaching THE global optimum, but a judicious choice of hyperparameters can bring the model parameters (here, the weights that produce the embeddings) close to it.

3.  Use a lot more data in the estimation process.  We saw meaningful improvements in our toy example, so this should give us the impetus to enlarge our train, val and test samples.

4.  This approach finetuned the entire model.  It was a small model to start with, so that was fine.  But we may want to look at the model architecture and only finetune a few layers.  This parsimony will limit the risk of model overfitting and will allow the finetuned model to generalize better.

5.  If we know that the finetuned model embeddings are going to be used exclusively for retrieval purposes, then fitting the embeddings with retrieval accuracy in mind is very important.  Two metrics - context precision and context recall - play an important role here.  It may be possible to consider other loss functions instead of `MultipleNegativesRankingLoss` for model estimation.  Precision and recall are discrete measures and are not differentiable, but it may be possible to formulate continuous proxies for them, e.g., measures that reward ranking relevant documents above irrelevant ones.
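
As an illustration of the last suggestion, here is a minimal sketch of one possible continuous proxy: a pairwise logistic ranking loss. It is a generic, well-known surrogate and not something used in this notebook; the function name and the toy similarity scores are hypothetical, and a real training setup would compute this on tensors so gradients can flow.

```python
import math


def pairwise_logistic_loss(pos_scores, neg_scores):
    """Mean of log(1 + exp(-(s_pos - s_neg))) over all (relevant, irrelevant) pairs.

    The loss shrinks smoothly as relevant documents score higher than
    irrelevant ones, which makes it a differentiable stand-in for ranking quality.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one relevant and one irrelevant score")
    total = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            total += math.log1p(math.exp(-(s_pos - s_neg)))
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical similarity scores between a question and candidate chunks
good_ranking = pairwise_logistic_loss(pos_scores=[0.9], neg_scores=[0.2, 0.1])
bad_ranking = pairwise_logistic_loss(pos_scores=[0.1], neg_scores=[0.9, 0.8])

print(good_ranking < bad_ranking)  # a better ranking gives a lower loss
```

Unlike precision or recall, this quantity changes a little whenever any score changes, which is what an optimizer needs.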