# Fine-tuning Embeddings for RAG on Specific Data

As we kick off our "fine-tuning" week, we'll begin with the lowest-hanging fruit for improving RAG:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
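
To make "closer together" concrete, here is a minimal sketch using cosine similarity on toy vectors. The 3-dimensional embeddings and the `nudge_toward` helper are hypothetical and exist only for illustration; real fine-tuning updates the model's weights through a loss function rather than editing vectors directly.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nudge_toward(vector, target, step=0.5):
    """Move `vector` a fraction `step` of the way toward `target` (illustrative only)."""
    return [v + step * (t - v) for v, t in zip(vector, target)]

# Hypothetical 3-dimensional embeddings, chosen only for illustration.
question_embedding = [1.0, 0.0, 0.2]
document_embedding = [0.2, 1.0, 0.1]

before = cosine_similarity(question_embedding, document_embedding)
after = cosine_similarity(
    nudge_toward(question_embedding, document_embedding),
    document_embedding,
)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

The similarity rises after the nudge, which is the effect we hope training has on every question and its matching document.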

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

<span style="color:green"> Q&D (Question & Document/Answer): This is the approach we're taking in this notebook. We generate questions that should be answered by a specific chunk of text (our "document" or context). The goal is to train the model to make the embedding for the question very similar (close in vector space) to the embedding of the document chunk that answers it. This directly optimizes for the core RAG task: finding the right context for a given question.

<span style="color:green">Inter-document Pairs/Related Sentences: This approach involves identifying pairs of sentences or document chunks that are inherently related or similar without necessarily being a question-answer pair. For example, two paragraphs discussing the same specific concept, or a statement and its elaboration. Training on these pairs teaches the model general semantic similarity – making embeddings for related content closer together.

<span style="color:green">The Q&D pair approach is more targeted because it directly mimics the core task of a RAG system: retrieving relevant context (D) based on a user's query (Q). By training the model to pull question embeddings closer to their corresponding answer/context embeddings, **we are explicitly optimizing the retriever for its intended function within the RAG pipeline.**

<span style="color:green">The effectiveness of this fine-tuning heavily relies on the representativeness of the generated questions. If the synthetic questions we create for training don't reflect the types of questions users will actually ask the RAG system, the fine-tuning might not translate into real-world performance improvements.


<span style="color:green">If the LLM generates very simple, pointed questions for fine-tuning and users end up asking more nuanced, thematic questions, the fine-tuning will not help with those. The LLM is also a general-purpose, one-size-fits-all model, while our use case may be specific, so **it's important that our fine-tuning Q&D set captures the specificity of our use case without being tied to exact wording (which would make the embeddings brittle), and that it captures the nuance and complexity of the domain as well.** In short, the questions should be well formed, relevant, domain-specific without being brittle, and representative of what users will ask.







## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters


In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml


### Provide OpenAI API Key

In [4]:
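import os
import getpass

try:
    # In Colab, read the key from the notebook's stored secrets.
    from google.colab import userdata
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
except ImportError:
    # Outside Colab, prompt for the key instead.
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")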

## Task 2: Loading Data

We'll prepare our data by downloading the webpages we'll be working with today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data
!mkdir data_sources

In [8]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html



In [9]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html



In [None]:
from langchain_community.document_loaders import DirectoryLoader, UnstructuredFileLoader
from langchain_core.documents import Document
import os

# Try to import Colab specific modules
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

all_documents = []

# Section for optional direct file upload (primarily for Colab)
if IN_COLAB:
    print("This script can prompt for a direct file upload if run in Google Colab.")
    try:
        upload_choice = input("Do you want to upload an additional single text file directly? (yes/no): ").strip().lower()
        if upload_choice == 'yes':
            print("Please use the file dialog that appears to select a .txt file to upload.")
            # files.upload() returns a dictionary of {filename: content_bytes}
            uploaded_files = files.upload()

            if not uploaded_files:
                print("No file was selected or the upload was cancelled.")
            else:
                for file_name, file_content_bytes in uploaded_files.items():
                    if file_name.endswith(".txt"):
                        try:
                            # Decode the bytes to string (assuming UTF-8 encoding for text files)
                            file_content_str = file_content_bytes.decode('utf-8')

                            # Prompt for a persona_id for the uploaded file
                            persona_id_for_uploaded = input(f"Enter a persona_id for the uploaded file '{file_name}' (e.g., analytical, philosophical, custom_source): ").strip()
                            if not persona_id_for_uploaded:
                                persona_id_for_uploaded = "uploaded_file" # Assign a default if none is provided

                            # Create a Langchain Document
                            doc = Document(page_content=file_content_str, metadata={"source": file_name, "persona_id": persona_id_for_uploaded})
                            all_documents.append(doc)
                            print(f"Successfully processed and added uploaded file: '{file_name}' with persona_id: '{persona_id_for_uploaded}'")
                        except Exception as e:
                            print(f"Error decoding or processing uploaded file '{file_name}': {e}")
                    else:
                        print(f"Skipping uploaded file '{file_name}' as it does not end with .txt.")
        else:
            print("Skipping direct file upload based on your choice.")
    except RuntimeError:
        print("File upload failed. This can happen if the browser blocks third-party cookies or due to other Colab-specific issues. Continuing without direct upload.")
    except Exception as e:
        # Catch any other unexpected errors during the upload prompt/process
        print(f"An unexpected error occurred during the file upload attempt: {e}")
        print("Continuing without direct file upload.")

else:
    print("Note: Direct file upload prompt is skipped as this environment does not appear to be Google Colab.")
    print("Please ensure your files are in the 'data_sources' directories as described below to load them.")

# --- Loading from data_sources directories (continues as before) ---
persona_paths = {
    "analytical": "data_sources/analytical/",
    "philosophical": "data_sources/philosophical/",
    "metaphorical": "data_sources/metaphorical/"
}

# Create directories if they don't exist and remind about uploads
for persona_id, path in persona_paths.items():
    if not os.path.exists(path):
        print(f"Creating directory: {path} (if it doesn't exist)")
        os.makedirs(path, exist_ok=True)

print("\nIf you didn't use the direct upload, or if you want to load more files,")
print("please ensure your .txt files are now in the respective 'data_sources/<persona_id>' folders.\n")

# Load documents from directories
for persona_id, path in persona_paths.items():
    if not os.path.exists(path):
        # This check is somewhat redundant if os.makedirs worked, but good for robustness
        print(f"Warning: Directory {path} still does not exist. Skipping {persona_id} data.")
        continue

    # Check if the directory is empty before attempting to load
    if not os.listdir(path):
        print(f"Notice: Directory {path} is empty. No files to load for {persona_id}.")
        continue

    print(f"Attempting to load documents for {persona_id} from {path}...")
    loader = DirectoryLoader(
        path,
        glob="**/*.txt", # Ensure it loads only .txt files
        loader_cls=UnstructuredFileLoader, # Use for .txt files
        show_progress=True,
        use_multithreading=True,
        silent_errors=True # Silently ignore files that can't be loaded
    )
    try:
        loaded_docs_from_dir = loader.load()
        for doc in loaded_docs_from_dir: # Add persona_id to metadata
            doc.metadata["persona_id"] = persona_id
            # The 'source' metadata from DirectoryLoader usually contains the full file path
        all_documents.extend(loaded_docs_from_dir)
        if loaded_docs_from_dir:
            print(f"Successfully loaded {len(loaded_docs_from_dir)} documents for {persona_id} from {path}.")
        else:
            print(f"No .txt documents found or loaded for {persona_id} from {path} (directory might contain other file types or be empty of .txt).")
    except Exception as e:
        print(f"An error occurred while loading documents for {persona_id} from {path}: {e}")

# Final check and summary
if not all_documents:
    print("\nError: No documents were loaded in total (neither via upload nor from directories).")
    print("Please ensure your 'data_sources/<persona_id>' directories are populated with .txt files or use the upload option if available.")
else:
    print(f"\nTotal documents loaded into 'all_documents': {len(all_documents)}")

# You can then inspect the first few documents if needed:
# print("\n--- Sample of Loaded Documents ---")
# for i, doc in enumerate(all_documents[:3]): # Show up to 3 sample docs
#     print(f"--- Document {i+1} ---")
#     print(f"Content (first 200 chars): {doc.page_content[:200]}...")
#     print(f"Metadata: {doc.metadata}")
#     print("-----------------------------")

This script can prompt for a direct file upload if run in Google Colab.
Do you want to upload an additional single text file directly? (yes/no): yes
Please use the file dialog that appears to select a .txt file to upload.


Saving examples.txt to examples.txt
Saving excerpts.txt to excerpts.txt
Saving generated_analytical_example.txt to generated_analytical_example.txt
Saving hannah_fry_mathematics_of_love_transcript.html to hannah_fry_mathematics_of_love_transcript.html
Saving hannah_fry_uncertainty_unavoidable_transcript.html to hannah_fry_uncertainty_unavoidable_transcript.html
Saving pew_research_ai_views_2023.html to pew_research_ai_views_2023.html
Saving pew_research_report_ai_views_2023.txt to pew_research_report_ai_views_2023.txt
Saving sagan_baloney_detection.txt to sagan_baloney_detection.txt


KeyboardInterrupt: Interrupted by user

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
# The downloaded pages are saved as .html files, so match those with the HTML loader
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

In [7]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter # Ensure this import is present

# Define the base path to where you unzipped your data
# This should match the '-d' argument from your !unzip command in Step 3d
base_data_path = "data/"

# If your zip file directly contained 'analytical', 'philosophical', 'metaphorical' folders:
# Option 1: Load all .txt files from all subdirectories within base_data_path
print(f"Loading .txt files from all subdirectories in: {base_data_path}")
text_loader = DirectoryLoader(
    base_data_path,
    glob="**/*.txt",  # This means all .txt files in all subfolders
    loader_cls=TextLoader,
    show_progress=True,
    use_multithreading=True, # Can speed up loading
    loader_kwargs={'autodetect_encoding': True} # Helps with different text encodings
)
all_loaded_documents = text_loader.load()
print(f"Successfully loaded {len(all_loaded_documents)} documents.")

# The notebook later defines text_splitter and uses it:
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size = 750, # You can adjust this if needed
#     chunk_overlap  = 20, # You can adjust this if needed
#     length_function = len
# )
# training_documents = text_splitter.split_documents(text_loader.load()) # OLD LINE

# Make sure 'training_documents' uses your loaded data.
# First, ensure text_splitter is defined by running its cell.
# Then, split your documents:
# training_documents = text_splitter.split_documents(all_loaded_documents)
# print(f"Split into {len(training_documents)} chunks for training/validation/testing.")

Loading .txt files from all subdirectories in: data/


100%|██████████| 18/18 [00:00<00:00, 1525.79it/s]

Successfully loaded 18 documents.





Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

<span style="color:green">Chunk size:

<span style="color:green">Too Small: Chunks might lack sufficient context to fully capture an idea or answer a question. The embedding might not be specific enough.

<span style="color:green">Too Large: The embedding might become too general, averaging the meaning across multiple distinct points. When retrieved, this large chunk might contain the relevant info but also a lot of noise, potentially making it harder for the final LLM to pinpoint the answer (diluting the key information).

<span style="color:green">The 750-character size with 20-character overlap is a common starting point, aiming for that balance. The RecursiveCharacterTextSplitter is helpful because it tries to split along natural boundaries (paragraphs, sentences) first before resorting to a hard character limit, which helps keep the chunks more coherent.
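
To show what chunk size and overlap mean in practice, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural boundaries first); the `chunk_text` helper is hypothetical and only illustrates the two parameters.

```python
def chunk_text(text, chunk_size=750, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghij" * 200  # 2,000 characters of placeholder text
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk starts 730 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next.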


In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

> NOTE: You may need to run this cell twice to get it to work.

In [12]:
# Ensure text_splitter is defined from its cell, then:
training_documents = text_splitter.split_documents(all_loaded_documents) # Use your variable here
print(f"Split into {len(training_documents)} chunks for training/validation/testing.")

Split into 2261 chunks for training/validation/testing.


In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

100%|██████████| 18/18 [00:00<00:00, 1641.36it/s]


In [13]:
len(training_documents)

2261

Next, we're going to associate each of our chunks with a unique identifier.

In [14]:
import uuid

id_set = set()

for document in training_documents:
  doc_id = str(uuid.uuid4())
  # Regenerate on the (extremely unlikely) chance of a collision
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step. \
<span style="color:green">1,781 chunks: training\
<span style="color:green">240 chunks: validation\
<span style="color:green">240 chunks: testing




<span style="color:green">**Validation vs. Test Sets:**

<span style="color:green">The training set is used directly to update the model's parameters during the fine-tuning process. The model "learns" from these examples.

<span style="color:green">The validation set plays a crucial role **during** training. Periodically (e.g., after each epoch or a certain number of steps), the model's performance is checked against the validation set. This helps us:

<span style="color:green">Tune hyperparameters: See if different learning rates, batch sizes, etc., lead to better performance on data the model hasn't directly trained on.

<span style="color:green">Prevent overfitting: Monitor if the model is getting really good at the training data but worse on the validation data (meaning it's just memorizing the training set and not generalizing). We can use this to decide when to stop training (early stopping).

<span style="color:green">Model selection: If we train multiple versions of the model, the validation set helps us pick the best one.

<span style="color:green">The test set is held back until the very end. After we've finished training and selected our final model (using the training and validation sets), we evaluate its performance one last time on the test set. This gives us an unbiased estimate of how well the model is likely to perform on completely new, unseen data in the real world. We don't use the test set to make any decisions about training or model selection; it's purely for the final report card.

In [15]:
training_split_documents = training_documents[:len(training_documents) - 480]
val_split_documents = training_documents[len(training_documents) - 480:len(training_documents) - 240]
test_split_documents = training_documents[len(training_documents) - 240:]
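
The slicing arithmetic can be checked in isolation. This is a minimal sketch with a hypothetical `tail_split` helper and integer stand-ins for the 2,261 chunks; it takes the validation and test sets from the end of the list, as the cell above does.

```python
def tail_split(items, n_val, n_test):
    """Slice a list into train/val/test, taking val and test from the end."""
    if n_val + n_test > len(items):
        raise ValueError("not enough items for the requested split sizes")
    holdout_start = len(items) - (n_val + n_test)
    train = items[:holdout_start]
    val = items[holdout_start:holdout_start + n_val]
    test = items[holdout_start + n_val:]
    return train, val, test

chunk_ids = list(range(2261))  # stand-in for the 2,261 chunks
train, val, test = tail_split(chunk_ids, n_val=240, n_test=240)
print(len(train), len(val), len(test))  # 1781 240 240
```

Because the slices are not shuffled, the validation and test sets come entirely from whichever documents were loaded last, so they may not reflect the same mix of sources as the training set.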

## Task 3: Constructing a Fine-tuning Dataset

Using the chunks we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4.1-mini`.

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

<span style="color:green">Why temperature=0?

<span style="color:green">In LLMs, temperature controls the randomness of the output.

<span style="color:green">A higher temperature (e.g., 0.7-1.0) makes the output more random and creative. The model is more likely to explore less probable word choices.

<span style="color:green">A lower temperature (closer to 0) makes the output more deterministic and focused. The model tends to pick the most likely next word.

<span style="color:green">When generating questions for our fine-tuning dataset, we want them to be factual, directly based on the provided context, and consistent.

<span style="color:green">We don't want creative or unexpected questions here. Setting temperature=0 helps ensure the LLM produces the most probable, focused, and contextually grounded questions, minimizing randomness and increasing reproducibility.
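
The effect of temperature can be sketched with a temperature-scaled softmax over toy logits. The logits below are hypothetical, and treating temperature 0 as a pure argmax is an assumption made for this illustration; hosted APIs may handle it differently and can still show small run-to-run variation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 is modeled as greedy decoding (all mass on the argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass concentrates on the highest-scoring token.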

In [16]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4.1-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4.1-mini` to generate Questions for each retrieved context.

In [17]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [18]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

<span style="color:green">**DONE** Vibe Coded :)

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object keyed by `question_id`, whose values are the `str` questions.
- An object keyed by `question_id`, whose values are `List[str]`: the list of associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [19]:
import tqdm
import asyncio
import re    # Import regex for parsing questions


"""
Sample Usage of TQDM:

for i in tqdm.tqdm(range(10)):
  time.sleep(1)
"""

async def create_questions_old(documents, n_questions):

    questions = {}
    relevant_docs = {}

    ### YOUR CODE HERE

    return questions, relevant_docs


async def create_questions(documents, n_questions):
    questions = {}
    relevant_docs = {}

    tasks = []
    # Prepare async tasks for each document
    for document in documents:
        context = document.page_content
        doc_id = document.metadata["id"]
        # Create a coroutine for each document processing
        task = process_document(context, doc_id, n_questions, question_generation_chain)
        tasks.append(task)

    # Run tasks concurrently with progress bar
    results = []
    for future in tqdm.tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Generating Questions"):
        try:
            result = await future
            if result:
                results.append(result)
        except Exception as e:
            print(f"Error processing document: {e}") # Basic error logging

    # Process results to populate the dictionaries
    for q_list, doc_id in results:
        for question_text in q_list:
            question_id = str(uuid.uuid4())
            # Add the check for collisions
            while question_id in questions:
                question_id = str(uuid.uuid4()) # Regenerate if collision
            questions[question_id] = question_text
            relevant_docs[question_id] = [doc_id] # Store doc_id in a list as per example

    return questions, relevant_docs

async def process_document(context, doc_id, n_questions, chain):
    """Helper coroutine to process a single document."""
    # Invoke the LLM chain to generate questions
    response = await chain.ainvoke({"context": context, "n_questions": n_questions})

    # Basic parsing assuming "1. QUESTION\n2. QUESTION\n..." format
    # Use regex to find lines starting with number and dot
    # Adjust regex if the format is slightly different
    parsed_questions = re.findall(r"^\d+\.\s*(.*)", response.content, re.MULTILINE)

    # Fallback or alternative parsing if needed
    if not parsed_questions:
         # Try splitting by newline if regex fails (less robust)
         parsed_questions = [q.strip() for q in response.content.strip().split('\n') if q.strip()]
         # Filter out potential non-question lines if necessary (heuristic)
         parsed_questions = [q for q in parsed_questions if len(q) > 10 and '?' in q] # Example filter

    # Ensure we don't exceed n_questions, even if LLM gave more/less
    # Or handle cases where fewer than n_questions were generated
    final_questions = parsed_questions[:n_questions]

    if not final_questions:
        print(f"Warning: No questions parsed for doc_id {doc_id}. Raw response: {response.content[:100]}...")
        return None # Return None if no questions could be parsed

    return final_questions, doc_id
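
The parsing step can be tried on its own without calling the LLM. This is a small sketch using the same regular expression on a hypothetical response string; it also shows why some generated questions can carry trailing spaces, and how an optional `strip()` would remove them.

```python
import re

# Hypothetical model response, including trailing spaces an LLM may emit.
sample_response = (
    "1. Who drives the bus?  \n"
    "2. What is Kyle's job?\n"
)

# Same pattern as in process_document: capture the text after "N. " on each line.
raw = re.findall(r"^\d+\.\s*(.*)", sample_response, re.MULTILINE)
cleaned = [q.strip() for q in raw]

print(raw)
print(cleaned)
```

The capture group runs to the end of each line, so any whitespace the model leaves before the newline stays in the question unless it is stripped.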

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [20]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Generating Questions: 100%|██████████| 1781/1781 [05:27<00:00,  5.44it/s]


We'll use the function to generate training, validation, and test data.

In [21]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Generating Questions: 100%|██████████| 240/240 [00:04<00:00, 49.19it/s]


In [22]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Generating Questions: 100%|██████████| 240/240 [00:10<00:00, 22.09it/s]


In [23]:
# Lets see some questions!
test_questions

{'0900b2ab-f4fb-48b0-8fbe-c4a9ab0bb383': 'What is the href link associated with the "People" section in the desktop sub-navigation?  ',
 '91359428-54f3-4e22-bc4a-d6b942bc07ae': 'What is the id of the div element present in the provided context?',
 '553139ef-e493-49a2-b7ff-427277c71f47': 'What realization does the narrator experience while dreaming about their parents?  ',
 '447d05a1-56bf-467a-bece-239dd873bdf7': 'How does the narrator feel about the possibility of life after death?',
 '7143f1a6-cf2a-42b7-8495-cb723a4ed313': 'Which web browsers have Wayback Machine extensions available according to the context?  ',
 '76bd5fe8-8c94-40c6-b1cc-c905ff1a43bb': 'Where can you find the Wayback Machine extension for Microsoft Edge?',
 '13dff714-3f99-48e3-b982-6a8995f1c06b': 'What is the purpose of the baloney detection kit in scientific training?  ',
 'e86bb6b3-422a-46f1-ac47-91b51e61a635': 'What does skeptical thinking help us to construct and understand?',
 'bcb5042a-8260-4ea1-ab42-3f3d5f6af4

### <span style="color:green">Reformatting and Saving Datasets

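<span style="color:green">Now, we can save our datasets for later use! We save these generated datasets, along with the original document chunks (the "corpus"), as JSON files: `training_dataset.jsonl`, `val_dataset.jsonl`, and `test_dataset.jsonl`. Note that despite the `.jsonl` extension, each file holds a single JSON object written by `json.dump`, not one record per line.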



 <span style="color:green">We're saving the train_dataset, val_dataset, and test_dataset dictionaries as JSON files (using json.dump) for several key reasons:

<span style="color:green">Human Readability: JSON (JavaScript Object Notation) is a text-based format that is relatively easy for humans to read and understand, especially compared to binary formats. This makes it simple to inspect the generated questions and contexts if needed.

<span style="color:green">Interoperability: JSON is a language-independent data format. While we're using Python now, saving in JSON means the data could be easily loaded and used by programs written in other languages if necessary in the future.

<span style="color:green">Ease of Use in Python: Python's built-in `json` library makes it trivial to serialize (write) Python dictionaries and lists to a JSON file (`json.dump`) and deserialize (read) them back into Python objects (`json.load`) later in the notebook or in a different script.

<span style="color:green">Structured Data: JSON naturally represents the kind of nested structures we have (dictionaries containing other dictionaries and lists), mapping well to our Python objects.

<span style="color:green">Essentially, it's a standard, convenient, and readable way to store structured data like our question-answer pairs and corpus for later use.
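
As a quick check of the round trip, here is a minimal sketch that writes a miniature dataset with `json.dump` and reads it back with `json.load`. The dataset contents, the short ids `q1` and `d1`, and the file name are hypothetical; only the three top-level keys mirror the notebook's datasets.

```python
import json
import os
import tempfile

# Hypothetical miniature dataset with the same three top-level keys.
dataset = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["d1"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

# Write to a temporary directory so nothing is left behind afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_dataset.json")
    with open(path, "w") as f:
        json.dump(dataset, f)
    with open(path) as f:
        restored = json.load(f)

print(restored == dataset)
```

Nested dictionaries and lists of strings survive the round trip unchanged, which is why this structure is convenient to reload later.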

In [24]:
import json

# training_split_documents holds the original document chunks as LangChain Document objects.
# Convert them to a dictionary with the document id as the key and the page content as the value.
training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

# training_questions is a dictionary with the question id as the key and the question as the value.
# training_relevant_contexts is a dictionary with the question id as the key and a list of document ids as the value.
# training_corpus is a dictionary with the document id as the key and the page content as the value.
train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

# Save the training dataset as a single JSON object.
# Note: despite the .jsonl extension, json.dump does not write JSON Lines here.
with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [25]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [26]:
# Build the test corpus (named test_corpus so it is not confused with the training corpus).
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

<span style="color:green">So, we're essentially taking a knowledgeable generalist and turning it into a specialist for our particular task and data. This way, we use less data and compute, we leverage the existing knowledge of a pre-trained model, and we do focused learning.

<span style="color:green">It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

> NOTE: Skip installing dependencies if you are running this notebook locally.

In [27]:
!pip install -qU sentence_transformers datasets pyarrow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m20.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 MB[0m [31m59.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.6/193.6 kB[0m [31m19.3 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.
pylibcudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.
cudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.[0m[31m
[0m

In [28]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [29]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have: 78 training documents with two questions each, for a total of 156 question/document pairs.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [54]:
BATCH_SIZE = 8

Let's move our dataset into the expected format for training.

<span style="color:green">Remember, we have query ids, document ids, query text, and document text. What we want now are (query_text, doc_text) examples, so for each query we go from query_id to doc_id to doc_text.

<span style="color:green">The loss function (which we'll see next) will use these pairs to calculate how "far apart" the query and document embeddings currently are and generate gradients to push them closer together.


In [55]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

<span style="color:green">**Shuffling**: Shuffling the training examples at the beginning of each epoch (each full pass through the data) prevents the model from learning patterns based on the order in which examples happen to appear in the dataset. Without it, the model might inadvertently pick up biases related to the sequence, which could hurt its ability to generalize to new, unseen data. Note that a PyTorch `DataLoader` only shuffles when it is created with `shuffle=True`; the default is `shuffle=False`. The loader in the next cell does not pass it, so whether examples are reshuffled depends on how `model.fit` consumes the loader.

<span style="color:green">**Batching:** (our size=8)

<span style="color:green">Efficiency: 8 examples at a time are processed in parallel. Gradient Stability: we compute the error and its direction once per batch and then update the parameters, which gives smoother and more reliable training.

<span style="color:green">**Within One Epoch**:

  - <span style="color:green">Get the next 8 examples.
  - <span style="color:green">Forward pass: Compute embeddings for these.
  - <span style="color:green">Calculate loss: a score for how well the model did. Specifically, it measures whether the related query-context pairs are closer together in embedding space than unrelated pairs within that same batch.
  - <span style="color:green">Calculate gradients (backward pass): based on the loss, work out how much each parameter contributed to it.
  - <span style="color:green">Update parameters (gradient adjustment): An "optimizer" uses these gradients to slightly adjust the model's parameters (weights) to try and reduce the loss next time. This "adjustment" happens after processing each batch.
  - <span style="color:green">Repeat: The steps above are repeated for the next batch, and the next, until all the training examples have been seen once.

  
<span style="color:green">**Gradient Stability:** Imagine updating the model based on just one query-context pair. That single example might be weird or unrepresentative, causing the parameter update (gradient adjustment) to be jerky or point in a slightly wrong direction. By calculating the loss and gradients over a batch (8 examples), the "weirdness" of individual examples tends to average out. The resulting gradient provides a more stable, reliable estimate of the direction the parameters should move to improve performance on average across those 8 examples. This usually leads to smoother, more consistent training.


<span style="color:green">**Epochs and Validation:**
An epoch is defined as one complete pass through the entire training dataset. Since we have 156 training examples and a batch size of 8, one epoch consists of ceil(156 / 8) = 20 batches (19 batches of 8, and one final batch of 4).

<span style="color:green">**Validation** happens periodically during training. In this notebook, `model.fit` is called with `evaluation_steps=50` (cell 61), so the evaluator runs every 50 training steps. During evaluation no gradients are calculated and no parameters are updated; performance is measured on the separate validation set. This gives us an idea of how well the model is generalizing to data it hasn't been trained on.

<span style="color:green">So, to summarize: Learning (loss calculation, gradient adjustment) happens per batch. An epoch is a full pass over all batches. Validation is a separate check, run at intervals during training, to monitor generalization.
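The batch arithmetic above can be checked with a small pure-Python sketch. It assumes the 156 examples quoted in the text and the notebook's `BATCH_SIZE = 8`; `iterate_batches` is a hypothetical helper written for illustration, not part of `torch` or `sentence_transformers`.

```python
import math
import random


def iterate_batches(examples, batch_size, shuffle=False, seed=0):
    """Yield consecutive batches, optionally visiting examples in a shuffled order."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


num_examples = 156  # figure quoted in the text above
batch_size = 8      # BATCH_SIZE in this notebook

examples = list(range(num_examples))  # integers stand in for (query, document) pairs
batches = list(iterate_batches(examples, batch_size, shuffle=True))

batches_per_epoch = len(batches)      # one parameter update per batch
last_batch_size = len(batches[-1])    # the final batch is smaller when sizes don't divide evenly
```

Every example appears in exactly one batch per epoch; shuffling only changes which examples end up together.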

In [56]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

**Moving on!**
Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [57]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.


#### <span style="color:green">Loss Functions Explained

#### <span style="color:green">1. MultipleNegativesRankingLoss (MNRL)

*   <span style="color:green">**Core Goal:** The primary objective of MNRL is to fine-tune the embedding model so that the vector embedding of a query is semantically closer to the embedding of its relevant context (the "positive" pair) than it is to the embeddings of irrelevant contexts (the "negative" pairs). In simpler terms, it pushes related items together and unrelated items apart in the embedding space.
*   <span style="color:green">**Mechanism - In-Batch Negatives:** This loss function cleverly avoids the need to explicitly provide negative examples. When processing a batch of (query, positive_context) pairs, it uses an "in-batch negative" strategy. For a specific query (Query_A) in the batch:
    *   Its corresponding PositiveContext_A is treated as the single positive example.
    *   All *other* contexts present in that same batch (PositiveContext_B, PositiveContext_C, etc.) are implicitly treated as *negative* examples for Query_A.
*   <span style="color:green">**Training Signal:** The loss is calculated based on how well the model ranks the similarity score of the positive pair (sim(Query_A, PositiveContext_A)) compared to the similarity scores of the negative pairs (sim(Query_A, PositiveContext_B), sim(Query_A, PositiveContext_C), etc.). The goal is to maximize the positive similarity relative to the negative similarities within that batch.
*   <span style="color:green">**Batch Size Dependency:** The effectiveness of this strategy relies on the batch size. A larger batch provides more (and potentially harder) negative examples for each query, generally leading to more robust training.
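The in-batch negative idea can be sketched in plain Python. This is a simplified illustration rather than the library's implementation: it assumes cosine similarity multiplied by a scale factor, followed by cross-entropy where the "correct class" for query `i` is context `i`. The function names and the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, context_vecs, scale=20.0):
    """Mean cross-entropy where context i is the positive for query i
    and every other context in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        scores = [scale * cosine(query, context) for context in context_vecs]
        # Numerically stable log-sum-exp over all contexts in the batch.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs: each query points the same way as its own context.
queries = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = in_batch_negatives_loss(queries, contexts)
# Swapping the contexts makes every "positive" the worst match in the batch.
swapped_loss = in_batch_negatives_loss(queries, contexts[::-1])
```

The loss is near zero when each query is most similar to its own context and grows large when a different context in the batch scores higher.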

#### <span style="color:green">2. MatryoshkaLoss

*   <span style="color:green">**Core Goal:** This loss function aims to train embeddings that are not only effective at their full dimensionality but also perform well when truncated to shorter lengths (like Russian nesting dolls). The primary motivation is **efficiency** in downstream applications like RAG – shorter embeddings require less storage, faster retrieval computations, and less bandwidth.
*   <span style="color:green">**Mechanism - Learning Hierarchical Structure:** MatryoshkaLoss achieves this by incentivizing the model *during training* to learn a hierarchical representation within the embedding vector. It encourages the model to pack the most crucial, coarse-grained semantic information into the initial dimensions of the vector, adding progressively finer-grained details in subsequent dimensions.
*   <span style="color:green">**Training Process:**
    1.  It "wraps" an inner loss function (in our case, MultipleNegativesRankingLoss).
    2.  For each batch, it calculates the inner loss once per entry in `matryoshka_dims` (here 768, 512, 256, 128, and 64), each time using only the first N dimensions of the embeddings.
    3.  These individual loss values (calculated at different dimensionalities) are combined as a weighted sum; unless weights are supplied, every dimensionality counts equally. This combined loss reflects how well the embedding performs *at multiple levels of truncation*.
    4.  The model's parameters are updated based on the gradient of this *combined* loss.
    5.  By penalizing poor performance at shorter lengths *during the training loop*, this mechanism forces the model to organize information hierarchically, ensuring the truncated versions remain meaningful.
*   <span style="color:green">**Outcome:** The result is an embedding model where the initial dimensions capture the most vital information. This *enables* practitioners, after training, to evaluate performance at different truncation levels (e.g., 768 vs. 512 vs. 256) and choose the best trade-off between accuracy and efficiency (storage/speed) for their specific RAG application. The MatryoshkaLoss during training is what makes this post-training choice possible and meaningful.
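The training process described above can be sketched as a wrapper that truncates the embeddings and sums an inner loss over each truncation. This is a simplified, pure-Python illustration with made-up four-dimensional vectors; `inner_loss` is a compact stand-in for the in-batch negatives loss, and none of these functions are the library's own code.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def inner_loss(queries, contexts, scale=20.0):
    """Stand-in inner loss: in-batch negatives with cross-entropy."""
    total = 0.0
    for i, query in enumerate(queries):
        scores = [scale * cosine(query, context) for context in contexts]
        top = max(scores)
        total += top + math.log(sum(math.exp(s - top) for s in scores)) - scores[i]
    return total / len(queries)


def matryoshka_loss(queries, contexts, dims, weights=None):
    """Weighted sum of the inner loss computed on the first `dim` dimensions."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_queries = [q[:dim] for q in queries]
        truncated_contexts = [c[:dim] for c in contexts]
        total += weight * inner_loss(truncated_queries, truncated_contexts)
    return total


# Toy 4-dimensional embeddings for a batch of two (query, context) pairs.
queries = [[1.0, 0.2, 0.1, 0.0], [0.1, 1.0, 0.0, 0.3]]
contexts = [[0.9, 0.1, 0.2, 0.1], [0.2, 0.8, 0.1, 0.4]]

combined = matryoshka_loss(queries, contexts, dims=[4, 2])
full_only = inner_loss(queries, contexts)
first_two_only = inner_loss([q[:2] for q in queries], [c[:2] for c in contexts])
```

Because the truncated losses feed into the same gradient, the model is rewarded for keeping the first dimensions useful on their own.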


<span style="color:green">**A note on creating negative context examples.**
We created two queries for each context, so when both pairs land in the same batch, the loss treats that context as a negative for a query it actually answers (a false negative). Gemini says this is not such a big deal:

<span style="color:green">Primary Goal: The loss function's main goal is to ensure sim(q1, c) is higher than sim(q1, other_context) for all other contexts in the batch. Even if one of those "other contexts" happens to be the same text c (but paired with q2), the loss still pushes to maximize the similarity for the direct (q1, c) pairing relative to everything else.

<span style="color:green">Different Queries: While the context c is the same text, q1 and q2 are (hopefully) different questions. The model learns to associate the specific semantics of q1 with c and the specific semantics of q2 with c. Treating c (paired with q2) as a negative for q1 encourages the model to differentiate why c is relevant specifically to q1 compared to other potential queries (like q2).

<span style="color:green">Batch Size/Probability: With shuffling and reasonable batch sizes, the chances of both pairs derived from the exact same context landing in the same batch decrease, though it can certainly happen.

<span style="color:green">In practice, this nuance of the in-batch negative strategy usually doesn't prevent the model from learning effectively, especially since the positive pairing signal is strong and consistent across batches.
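If you want to see how often this happens for a given ordering, a small check like the one below can flag the affected batches. It is a hypothetical helper written for illustration, and the document ids are toy values.

```python
from collections import Counter


def batches_with_repeated_context(doc_ids_in_order, batch_size):
    """Return the indices of batches in which the same document id appears more than once."""
    flagged = []
    for batch_index, start in enumerate(range(0, len(doc_ids_in_order), batch_size)):
        batch = doc_ids_in_order[start:start + batch_size]
        if any(count > 1 for count in Counter(batch).values()):
            flagged.append(batch_index)
    return flagged


# Toy ordering: the positive document id of each training pair, in training order.
doc_ids = ["d1", "d1", "d2", "d3", "d2", "d4", "d5", "d6"]
flagged = batches_with_repeated_context(doc_ids, batch_size=4)
```

Here the first batch holds both pairs built from `d1`, so it is flagged; the two `d2` pairs fall in different batches and cause no false negative.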

Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [58]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if you had significantly more data.

In [59]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
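As a rough sketch of the warm-up arithmetic, the snippet below mirrors the "10% of all training steps" rule used in the training cell. The 156 examples and batch size of 8 are the figures quoted earlier in this notebook; `warmup_step_count` is a hypothetical helper, not a library function.

```python
import math


def warmup_step_count(num_examples, batch_size, epochs, warmup_fraction=0.1):
    """Number of warm-up steps when a fixed fraction of all training steps is used."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # same as len(loader)
    total_steps = steps_per_epoch * epochs
    return int(total_steps * warmup_fraction)


# Assumed figures: 156 examples, batch size 8, 10 epochs.
warmup_steps = warmup_step_count(num_examples=156, batch_size=8, epochs=10)
```

With those figures the learning rate would ramp up over the first 20 of 200 steps; a different dataset size changes both numbers.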

In [60]:
import wandb
wandb.init(mode="disabled")

> NOTE: You may not see direct improvement during the training cycles - this is absolutely expected. We will verify performance later in the notebook.

In [61]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
50,No log,No log,0.58125,0.7375,0.7875,0.854167,0.58125,0.245833,0.1575,0.085417,0.58125,0.7375,0.7875,0.854167,0.715475,0.671358,0.67875
100,No log,No log,0.585417,0.7375,0.785417,0.854167,0.585417,0.245833,0.157083,0.085417,0.585417,0.7375,0.785417,0.854167,0.717123,0.673533,0.681246
150,No log,No log,0.591667,0.741667,0.797917,0.870833,0.591667,0.247222,0.159583,0.087083,0.591667,0.741667,0.797917,0.870833,0.727849,0.682477,0.689214
200,No log,No log,0.589583,0.741667,0.8125,0.872917,0.589583,0.247222,0.1625,0.087292,0.589583,0.741667,0.8125,0.872917,0.726966,0.680669,0.687202
250,No log,No log,0.552083,0.74375,0.80625,0.86875,0.552083,0.247917,0.16125,0.086875,0.552083,0.74375,0.80625,0.86875,0.711183,0.660596,0.666916
300,No log,No log,0.566667,0.758333,0.816667,0.879167,0.566667,0.252778,0.163333,0.087917,0.566667,0.758333,0.816667,0.879167,0.723816,0.673929,0.680497
350,No log,No log,0.58125,0.760417,0.810417,0.866667,0.58125,0.253472,0.162083,0.086667,0.58125,0.760417,0.810417,0.866667,0.725507,0.680227,0.687683
400,No log,No log,0.55,0.754167,0.802083,0.86875,0.55,0.251389,0.160417,0.086875,0.55,0.754167,0.802083,0.86875,0.711978,0.66152,0.668568
446,No log,No log,0.5625,0.741667,0.810417,0.86875,0.5625,0.247222,0.162083,0.086875,0.5625,0.741667,0.810417,0.86875,0.716371,0.667456,0.674576
450,No log,No log,0.564583,0.741667,0.8125,0.870833,0.564583,0.247222,0.1625,0.087083,0.564583,0.741667,0.8125,0.870833,0.716968,0.667671,0.674605


In [62]:
from huggingface_hub import notebook_login

notebook_login()

In [64]:
hf_username = "suh4s"

In [66]:
import uuid

model.push_to_hub(f"{hf_username}/insightflow-balanced-team-embed-v1-{uuid.uuid4()}")

'https://huggingface.co/suh4s/insightflow-balanced-team-embed-v1-7099e82c-e4c8-48ed-88a8-36bd9255036b/commit/7c261d79960d5c7172c89ce8874df73eefde88b2'

https://huggingface.co/geetach/legal-ft-a201f63a-cb7a-4d10-aa78-6229827dff89/commit/257e07b0661736f960d468f10defdd94211ef448

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [67]:
import pandas as pd
import tqdm

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [68]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  """Record, per question, whether the expected document is among the top_k retrieved.

  Despite the name, this works with any LangChain embeddings object.
  """
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']

  # Index every corpus chunk in an in-memory FAISS store, keeping its id in the metadata.
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for question_id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    # We assume exactly one relevant document per question.
    expected_id = relevant_docs[question_id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": question_id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
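The hit-rate idea itself needs no vector store. Here is a minimal sketch on toy data, assuming one relevant document per question; the ids and the `hit_rate_at_k` helper are made up for illustration.

```python
def hit_rate_at_k(retrieved_ids_by_question, relevant_docs, k):
    """Fraction of questions whose expected document appears in the top-k retrieved ids."""
    hits = 0
    for question_id, retrieved_ids in retrieved_ids_by_question.items():
        expected_id = relevant_docs[question_id][0]  # one relevant document per question
        hits += expected_id in retrieved_ids[:k]
    return hits / len(retrieved_ids_by_question)


# Toy data: each question has one expected document and a ranked list of retrieved ids.
relevant_docs = {"q1": ["d1"], "q2": ["d2"], "q3": ["d3"], "q4": ["d4"]}
retrieved = {
    "q1": ["d1", "d9"],  # hit at rank 1
    "q2": ["d7", "d2"],  # hit at rank 2
    "q3": ["d8", "d9"],  # miss
    "q4": ["d4", "d1"],  # hit at rank 1
}

hit_rate_top2 = hit_rate_at_k(retrieved, relevant_docs, k=2)
hit_rate_top1 = hit_rate_at_k(retrieved, relevant_docs, k=1)
```

A larger `k` can only keep or raise the hit rate, which is why the same `top_k` should be used when comparing models.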

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [69]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 480/480 [03:10<00:00,  2.52it/s]


In [70]:
te3_results_df = pd.DataFrame(te3_results)

In [71]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(0.90625)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [72]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 480/480 [00:10<00:00, 46.80it/s]


In [73]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [74]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.6145833333333334)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [75]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 480/480 [00:10<00:00, 47.01it/s]


In [76]:
finetune_results_df = pd.DataFrame(finetune_results)

In [77]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(0.9104166666666667)
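To put the three hit rates side by side, this small sketch converts them back into hit counts. The rates and the 480-question total are copied from the cell outputs above; the dictionary labels are just descriptive names.

```python
num_questions = 480  # total shown in the progress bars above

# Hit rates copied from the three evaluation outputs.
hit_rates = {
    "text-embedding-3-small": 0.90625,
    "snowflake-arctic-embed-l (base)": 0.6145833333333334,
    "snowflake-arctic-embed-l (fine-tuned)": 0.9104166666666667,
}

# Convert each rate back into the number of questions answered with a hit.
hit_counts = {name: round(rate * num_questions) for name, rate in hit_rates.items()}

gain_over_base = (
    hit_counts["snowflake-arctic-embed-l (fine-tuned)"]
    - hit_counts["snowflake-arctic-embed-l (base)"]
)
gap_to_openai = (
    hit_counts["snowflake-arctic-embed-l (fine-tuned)"]
    - hit_counts["text-embedding-3-small"]
)
```

Seen as counts, fine-tuning recovers 142 more questions than the base model, while the lead over `text-embedding-3-small` is only 2 questions out of 480, so those two are best read as roughly on par.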

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [78]:
from langchain_huggingface import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

100%|██████████| 18/18 [00:00<00:00, 1555.81it/s]


### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [79]:
from langchain_community.vectorstores import FAISS

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [80]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [81]:
from langchain_openai import ChatOpenAI

rag_llm = ChatOpenAI(
    model="gpt-4.1-nano",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [82]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [83]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'The provided context does not include a definition or explanation of what an agent is.'

In [84]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'I do not know.'

In [85]:
base_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'I do not know.'

In [86]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

**To avoid re-running the fine-tuning when I reload this notebook in Google Colab, I can load the fine-tuned model I pushed to Hugging Face.**

In [87]:
# If you just ran the fine tuning, you can use the following line to load the fine tuned model.
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")

# Otherwise, you can use the following line to load the fine tuned model from HuggingFace. From your old runs.
# finetune_embeddings = HuggingFaceEmbeddings(model_name="geetach/legal-ft-450c1026-6554-476b-96f1-34f426f777c8")

finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [88]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [89]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'The provided context does not include a definition or explanation of what an Agent is. Therefore, I do not know.'

In [90]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'I do not know.'

In [91]:
finetune_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'I do not know.'

In [92]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

#### ❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

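<span style="color:green"> The fine-tuned RAG chain, judged by the retrieval metrics rather than by the four prompts above (where both chains answered "I do not know"). Here is why: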

<span style="color:green"> **Specificity and Contextual Relevance**

  - <span style="color:green">For questions like "What is an agent?" or "What is the laziest time of the year for AI?", the fine-tuned chain is more likely to retrieve the specific passages from Simon Willison's blogs discussing these topics in his specific context (e.g., LLM agents, the ChatGPT "lazy" period).
  - <span style="color:green">The base model, relying on more general embeddings, might retrieve less relevant chunks or provide more generic answers that miss the nuances present in the source documents.
  - <span style="color:green">Similarly, for specific factual questions like "What is the largest model that Simon has run on his phone?", the fine-tuned embeddings are much more likely to correctly identify and retrieve the exact text containing this information, whereas the base model might struggle.

<span style="color:green"> **More on evaluation**
  - <span style="color:green"> If you look at the hit rate, the fine-tuned retriever outperformed the base retriever (0.91 vs. 0.61): it's better at finding the correct chunk for a given synthetic question.
  - <span style="color:green"> Later, I did a LangSmith evaluation of these two RAG chains, and the fine-tuned RAG chain achieved better scores for both correctness and helpfulness (screenshot at the end of this notebook!)

## <span style="color:green">Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.

<span style="color:green">Ok, so we already have the base_rag_chain and finetune_rag_chain. We want to now generate test data via ragas and then evaluate the two rag chains.

In [93]:
!pip install -qU ragas==0.2.10
!pip install unstructured

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/175.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.7/175.7 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/71.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.1/71.1 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting unstructured
  Downloading unstructured-0.17.2-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-no

Load our data!

In [98]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

path = "data/"
# Updated loader with glob pattern and TextLoader class
loader = DirectoryLoader(path, glob="*.txt", loader_cls=TextLoader)
docs = loader.load()

<span style="color:green">**Generate synthetic test data using ragas.** We provide an LLM, an embedding model, a `testset_size`, and our documents, and RAGAS does its magic via knowledge graphs!

<span style="color:green">As a reminder, this will create about 10 rows of test data with the following columns:\
**user_input, reference_contexts, reference, synthesizer_name**\
`user_input` is the query, `reference_contexts` is the ideal context for this query, `reference` is the ideal response for this query, and `synthesizer_name` is the query synthesizer used. It will be one of SingleHopSpecific, MultiHopSpecific, or MultiHopAbstract.

In [99]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.testset import TestsetGenerator

sdg_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
sdg_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))
sdg_generator = TestsetGenerator(llm=sdg_llm, embedding_model=sdg_embeddings)
sdg_dataset = sdg_generator.generate_with_langchain_docs(docs, testset_size=10)

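# Make an independent copy for the fine-tuned model. Plain assignment would only
# alias sdg_dataset, so responses written for one chain would overwrite the other's.
import copy
finetuned_sdg_dataset = copy.deepcopy(sdg_dataset)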

Applying HeadlinesExtractor:   0%|          | 0/10 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/18 [00:00<?, ?it/s]

ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node
ERROR:ragas.testset.transforms.engine:unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/11 [00:00<?, ?it/s]



Applying CustomNodeFilter:   0%|          | 0/312 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/573 [00:00<?, ?it/s]



Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]
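
Both evaluation loops below write `response` and `retrieved_contexts` back onto the test rows, so the base and fine-tuned datasets need to be independent objects. Here is a minimal sketch of the difference between aliasing and deep-copying. `Row` is a hypothetical stand-in for a test row, not a ragas class:

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Row:
    """Hypothetical stand-in for one test row; not a ragas class."""
    user_input: str
    response: str = ""
    retrieved_contexts: list = field(default_factory=list)


base_rows = [Row("Who drives the bus?")]

alias_rows = base_rows                   # same list, same Row objects
copied_rows = copy.deepcopy(base_rows)   # independent list and Row objects

# Writing through the deep copy leaves the original untouched...
copied_rows[0].response = "fine-tuned answer"
response_after_copy_write = base_rows[0].response

# ...while writing through the alias changes the original too.
alias_rows[0].response = "overwritten"
response_after_alias_write = base_rows[0].response

print(repr(response_after_copy_write))   # ''
print(repr(response_after_alias_write))  # 'overwritten'
```

If the two datasets are the same object, the second loop silently replaces the base chain's responses, and both evaluations end up scoring the same rows.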

<span style="color:green">**Evaluate the base and fine-tuned models with RAGAS**

<span style="color:green">**First, we create the two datasets by adding the generated response and the retrieved contexts to each test row.** We do this for both RAG chains: one that uses the base snowflake embedding model and one that uses the fine-tuned snowflake embedding model.

In [100]:
import time
for test_row in sdg_dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(5)

time.sleep(5)

sdg_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Who J.H. Stickney and what he do for Hans Ande...,"[carefully standardized, their place in perman...",[﻿Project Gutenberg's Hans Andersen's Fairy Ta...,J. H. Stickney is the person who reprinted and...,J.H. Stickney is the editor of Hans Andersen's...,single_hop_specifc_query_synthesizer
1,How does Hans Andersen's storytelling ability ...,"[carefully standardized, their place in perman...",[PREFACE The Hans Andersen Fairy Tales will be...,Hans Andersen's storytelling ability contribut...,Hans Andersen's storytelling ability is charac...,single_hop_specifc_query_synthesizer
2,Who is Edna F. Hart in relation to Hans Anders...,"[carefully standardized, their place in perman...",[HANS ANDERSEN'S FAIRY TALES *** Produced by S...,The provided context does not mention Edna F. ...,Edna F. Hart is the illustrator of Hans Anders...,single_hop_specifc_query_synthesizer
3,What does the sun represent for the fir tree i...,"[""Rejoice in our love,"" said the air and the s...","[THE GREENIES 141 OLE-LUK-OIE, THE DREAM GOD 1...",The sun represents a source of joy and vitalit...,The sun represents warmth and joy for the fir ...,single_hop_specifc_query_synthesizer
4,In what ways does Socrates' defense in Plato's...,[The “Apology” or Platonic defence of Socrates...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,The provided context does not contain specific...,Socrates' defense in Plato's 'Apology' reflect...,multi_hop_abstract_query_synthesizer
5,"In Plato's Apology, how does Socrates' rhetori...",[Yet some of the topics may have been actually...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,The provided context does not explicitly detai...,"In Plato's Apology, Socrates employs a unique ...",multi_hop_abstract_query_synthesizer
6,In what ways does Plato's 'Apology' reflect th...,[Yet some of the topics may have been actually...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,The provided context suggests that Plato's 'Ap...,Plato's 'Apology' reflects the themes of rheto...,multi_hop_abstract_query_synthesizer
7,In what ways does Plato's Apology reflect the ...,[Yet some of the topics may have been actually...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,Plato's Apology reflects the themes of rhetori...,Plato's Apology reflects the themes of rhetori...,multi_hop_abstract_query_synthesizer
8,What do the gods require of a man to live a di...,"[seest that those things, which for a man to h...",[<1-hop>\n\nlatter years. That as often as I h...,"According to the teachings in the context, the...",The gods require no more of any man than to ke...,multi_hop_specific_query_synthesizer
9,How does the theme of spring manifest in the e...,"[sparrow flew, for she saw many others of her ...","[<1-hop>\n\nSee, here we are at the blacksmith...",I do not know.,"In the provided fairy tale context, the theme ...",multi_hop_specific_query_synthesizer


In [101]:
import time
for test_row in finetuned_sdg_dataset:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(5)

time.sleep(5)

finetuned_sdg_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Who J.H. Stickney and what he do for Hans Ande...,[Language: English\n\nCharacter set encoding: ...,[﻿Project Gutenberg's Hans Andersen's Fairy Ta...,J.H. Stickney is the editor of Hans Andersen's...,J.H. Stickney is the editor of Hans Andersen's...,single_hop_specifc_query_synthesizer
1,How does Hans Andersen's storytelling ability ...,[The Hans Andersen Fairy Tales will be read in...,[PREFACE The Hans Andersen Fairy Tales will be...,Hans Andersen's storytelling ability contribut...,Hans Andersen's storytelling ability is charac...,single_hop_specifc_query_synthesizer
2,Who is Edna F. Hart in relation to Hans Anders...,[Language: English\n\nCharacter set encoding: ...,[HANS ANDERSEN'S FAIRY TALES *** Produced by S...,Edna F. Hart is the illustrator of Hans Anders...,Edna F. Hart is the illustrator of Hans Anders...,single_hop_specifc_query_synthesizer
3,What does the sun represent for the fir tree i...,[[Illustration: They danced merrily ... around...,"[THE GREENIES 141 OLE-LUK-OIE, THE DREAM GOD 1...",The provided context does not explicitly state...,The sun represents warmth and joy for the fir ...,single_hop_specifc_query_synthesizer
4,In what ways does Socrates' defense in Plato's...,[“Few persons will be found to wish that Socra...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,The provided context does not explicitly discu...,Socrates' defense in Plato's 'Apology' reflect...,multi_hop_abstract_query_synthesizer
5,"In Plato's Apology, how does Socrates' rhetori...",[Yet some of the topics may have been actually...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,The provided context indicates that Plato’s Ap...,"In Plato's Apology, Socrates employs a unique ...",multi_hop_abstract_query_synthesizer
6,In what ways does Plato's 'Apology' reflect th...,[Yet some of the topics may have been actually...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,The provided context indicates that Plato's 'A...,Plato's 'Apology' reflects the themes of rheto...,multi_hop_abstract_query_synthesizer
7,In what ways does Plato's Apology reflect the ...,[Yet some of the topics may have been actually...,[<1-hop>\n\nAPOLOGY *** Produced by Sue Assche...,"Based on the provided context, Plato's Apology...",Plato's Apology reflects the themes of rhetori...,multi_hop_abstract_query_synthesizer
8,What do the gods require of a man to live a di...,"[seest that those things, which for a man to h...",[<1-hop>\n\nlatter years. That as often as I h...,"According to the teachings in the context, the...",The gods require no more of any man than to ke...,multi_hop_specific_query_synthesizer
9,How does the theme of spring manifest in the e...,[[Illustration]\n\n\n\n\nTHE CONCEITED APPLE B...,"[<1-hop>\n\nSee, here we are at the blacksmith...",The theme of spring manifests in the character...,"In the provided fairy tale context, the theme ...",multi_hop_specific_query_synthesizer


<span style="color:green">**Convert our two datasets into RAGAS evaluation datasets using `EvaluationDataset.from_pandas`**

In [102]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(sdg_dataset.to_pandas())
finetuned_evaluation_dataset = EvaluationDataset.from_pandas(finetuned_sdg_dataset.to_pandas())


<span style="color:green">**Set our evaluation LLM and evaluation config**

In [124]:
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

# The OpenAI model name uses hyphens: "gpt-4.1-mini".
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

# default max_workers is 16.
# custom_run_config = RunConfig(timeout=360, max_workers=8)
custom_run_config = RunConfig(
    timeout=300,          # 5 minutes max for operations
    max_retries=15,       # More retries for rate limits
    max_wait=90,          # Longer wait between retries
    max_workers=8,        # Fewer concurrent API calls
    log_tenacity=True     # Log retry attempts
)
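
To make `max_retries` and `max_wait` more concrete, here is a rough standard-library sketch of retrying with capped exponential backoff. It only illustrates the idea: ragas handles retries internally, and `call_with_retries`, `base_delay`, and `flaky_call` are hypothetical names invented for this example.

```python
import random
import time


def call_with_retries(fn, max_retries=15, max_wait=90, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure, wait with capped exponential backoff and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_wait, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))  # small jitter


# Usage: a fake API call that fails twice, then succeeds.
attempts = {"count": 0}
waits = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


# Record the waits instead of actually sleeping.
result = call_with_retries(flaky_call, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

A higher `max_retries` gives rate-limited calls more chances to succeed, while `max_wait` caps how long any single pause can grow.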

<span style="color:green">**Evaluate the Base model**

In [125]:
base_result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
base_result

RuntimeError: Tracing using LangChainTracerV1 is no longer supported. Please set the LANGCHAIN_TRACING_V2 environment variable to enable tracing instead.
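
LangChain raises this error when the legacy `LANGCHAIN_TRACING` variable is present in the environment. A minimal sketch of clearing it and switching to v2 tracing before re-running the evaluation, assuming nothing else in the session relies on the legacy flag:

```python
import os

# Drop the legacy switch if it is present, then enable v2 tracing.
os.environ.pop("LANGCHAIN_TRACING", None)
os.environ["LANGCHAIN_TRACING_V2"] = "true"

# Show which tracing flags are now set.
tracing_flags = {k: v for k, v in os.environ.items() if k.startswith("LANGCHAIN_TRACING")}
print(tracing_flags)
```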

<span style="color:green">**Evaluate the Finetuned model**

In [105]:
finetuned_result = evaluate(
    dataset=finetuned_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
finetuned_result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[35]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[41]: AttributeError('StringIO' object has no attribute 'statements')
ERROR:ragas.executor:Exception raised in Job[59]: TimeoutError()


{'context_recall': 0.6597, 'faithfulness': 0.8471, 'factual_correctness': 0.4458, 'answer_relevancy': 0.6331, 'context_entity_recall': 0.4718, 'noise_sensitivity_relevant': 0.3361}

<span style="color:green">**RAGAS Evaluation Results: perplexing? Not really. We used a less powerful evaluator LLM this time, because running this with gpt-4.1 last time was expensive.**

<span style="color:green"> The RAGAS evaluation presented mixed results across the two runs, as you can see in the tables below. context_recall improved with fine-tuning in both runs, while faithfulness decreased in both. factual_correctness and context_entity_recall each improved in one run and dropped in the other, and answer_relevancy was essentially unchanged. noise_sensitivity_relevant rose in both runs, and for that metric lower is better. So, what do we learn from this?
  - <span style="color:green">Evaluation metrics can be sensitive, and RAGAS itself relies on LLMs for judgments, introducing variability.
  - <span style="color:green">However, the consistent qualitative improvement and the positive LangSmith results suggest an overall benefit from fine-tuning in this case.

<span style="color:green">**Run Two**

| Metric | Base Model | Finetuned Model |
| :------------------------- | :----- | :-------- |
| context_recall | 0.5913 | 0.6329 |
| faithfulness | 0.8048 | 0.7716 |
| factual_correctness | 0.4783 | 0.4508 |
| answer_relevancy | 0.6075 | 0.6073 |
| context_entity_recall | 0.5012 | 0.5123 |
| noise_sensitivity_relevant | 0.3838 | 0.3917 |

<span style="color:green">**Run One**

| Metric | Base Model | Finetuned Model |
| :------------------------- | :--------- | :-------------- |
| context_recall | 0.5819 | 0.5864 |
| faithfulness | 0.8220 | 0.7950 |
| factual_correctness | 0.3150 | 0.3427 |
| answer_relevancy | 0.6332 | 0.6332 |
| context_entity_recall | 0.4674 | 0.4431 |
| noise_sensitivity_relevant | 0.5305 | 0.6317 |
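
The per-metric changes are easier to compare as deltas. The sketch below copies the scores from the two tables above and reports fine-tuned minus base for each run. It treats `noise_sensitivity_relevant` as lower-is-better, and the helpers `deltas` and `improved` are made up for this illustration.

```python
# (base, fine-tuned) scores copied from the Run One and Run Two tables.
run_one = {
    "context_recall": (0.5819, 0.5864),
    "faithfulness": (0.8220, 0.7950),
    "factual_correctness": (0.3150, 0.3427),
    "answer_relevancy": (0.6332, 0.6332),
    "context_entity_recall": (0.4674, 0.4431),
    "noise_sensitivity_relevant": (0.5305, 0.6317),
}
run_two = {
    "context_recall": (0.5913, 0.6329),
    "faithfulness": (0.8048, 0.7716),
    "factual_correctness": (0.4783, 0.4508),
    "answer_relevancy": (0.6075, 0.6073),
    "context_entity_recall": (0.5012, 0.5123),
    "noise_sensitivity_relevant": (0.3838, 0.3917),
}

LOWER_IS_BETTER = {"noise_sensitivity_relevant"}


def deltas(run):
    """Return fine-tuned minus base for each metric, rounded to 4 places."""
    return {metric: round(ft - base, 4) for metric, (base, ft) in run.items()}


def improved(metric, delta):
    """True if the change is in the 'better' direction for this metric."""
    return delta < 0 if metric in LOWER_IS_BETTER else delta > 0


d1, d2 = deltas(run_one), deltas(run_two)
for metric in run_one:
    verdicts = [improved(metric, d) for d in (d1[metric], d2[metric])]
    print(f"{metric:28s} run1={d1[metric]:+.4f} run2={d2[metric]:+.4f} improved={verdicts}")
```

With only ten test rows and an LLM judge, differences of a few hundredths are within the run-to-run variation visible between the two tables.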

<span style="color:green">**OK, let's try this using the LangSmith correctness and helpfulness evaluators we learned in Assignment 7**

In [108]:
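import os
from google.colab import userdata

os.environ["LANGSMITH_API_KEY"] = userdata.get('LANGSMITH_API_KEY')
# LangChain rejects the legacy LANGCHAIN_TRACING flag; v2 tracing uses LANGCHAIN_TRACING_V2.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Assignment 10 LangSmith Trial2"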

In [109]:
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate


client = Client()

dataset_name = "EmbeddingEval2 10 State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Assignment 10 try 2 State of AI Across the Years!"
)



In [110]:
# Convert the dataset from Ragas format to LangSmith format.
for _, data_row in sdg_dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row["user_input"]
      },
      outputs={
          "answer": data_row["reference"]
      },
      metadata={
          "context": data_row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
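
Before uploading, it can help to dry-run the field mapping on plain dictionaries. The sketch below mirrors the mapping in the cell above without calling LangSmith. `to_langsmith_example` and the sample row are hypothetical and exist only for illustration.

```python
def to_langsmith_example(row):
    """Map one Ragas test row (as a dict) to the fields passed to create_example."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row with the same column names as the Ragas test set.
rows = [
    {
        "user_input": "Who drives the bus?",
        "reference": "Kyle, the Bus Driver.",
        "reference_contexts": ["The bus was driven by Kyle, the Bus Driver."],
        "synthesizer_name": "single_hop_specifc_query_synthesizer",
    }
]

examples = [to_langsmith_example(row) for row in rows]
print(examples[0])
```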

In [111]:
eval_llm = ChatOpenAI(model="gpt-4.1")

def prep_data(run, example):
    return {
        "prediction": run.outputs['response'],  # Map 'response' key to 'prediction'
        "reference": example.outputs['answer'], # Map 'answer' key to 'reference'
        "input": example.inputs['question'],    # Map 'question' key to 'input' (or 'query' depending on the evaluator's prompt)
    }

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm}, prepare_data=prep_data)

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=prep_data
)
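
To see what the evaluators actually receive, the mapping in `prep_data` can be exercised offline. In this sketch, `fake_run` and `fake_example` are hypothetical stand-ins built with `SimpleNamespace`. They are not real LangSmith `Run` or `Example` objects and only mimic the attributes that `prep_data` reads.

```python
from types import SimpleNamespace


def prep_data(run, example):
    # Same mapping as in the cell above.
    return {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins exposing only .inputs and .outputs.
fake_run = SimpleNamespace(outputs={"response": "Kyle drives the bus.", "context": []})
fake_example = SimpleNamespace(
    inputs={"question": "Who drives the bus?"},
    outputs={"answer": "The bus was driven by Kyle, the Bus Driver."},
)

mapped = prep_data(fake_run, fake_example)
print(mapped)
```

The chain's `context` output is dropped here, so both evaluators judge only the generated response against the reference answer.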

In [112]:
evaluate(
    base_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator

    ],
    metadata={"revision_id": "Base_Embedding_Model"},
)

View the evaluation results for experiment: 'slight-title-55' at:
https://smith.langchain.com/o/39f31aed-fcde-5ca7-a4f1-a8447f10d513/datasets/b6fa05a8-e59d-409d-924d-8ba401da0a61/compare?selectedSessions=6624123c-2f23-4f32-9b06-126e7368181f




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.correctness,feedback.helpfulness,execution_time,example_id,id
0,Why did the duckling feel thankful for being u...,The provided context does not indicate that th...,"[page_content='""I believe I must go out into t...",,The duckling felt thankful for being ugly beca...,0,0,0.945269,2e3c3684-fba9-42f9-b04a-9aaf39b97173,78e9e331-3765-426b-b1c9-2753a4e74678
1,How does the arrival of autumn affect Thumbeli...,The provided context does not include informat...,"[page_content='""Yes, I will go with you,"" said...",,The arrival of autumn significantly impacts Th...,0,0,0.612981,cefd889f-5da5-4a1d-be14-1b1737cc7a70,e2bf7d33-ea41-4f4c-9b22-e2f2e0c1fd82
2,How does the theme of spring manifest in the e...,I do not know.,"[page_content='sparrow flew, for she saw many ...",,"In the provided fairy tale context, the theme ...",0,0,0.618191,a7a6ced9-a9bb-4648-8d54-f0e01cf91bad,171d018b-6093-46e5-bd06-34beefc7cd05
3,What do the gods require of a man to live a di...,"According to the teachings in the context, the...","[page_content='seest that those things, which ...",,The gods require no more of any man than to ke...,1,0,0.740341,24297489-b472-46c2-b91a-b16d2b7b80ef,e42c87b1-c2fd-4b58-a61d-afa130381f70
4,In what ways does Plato's Apology reflect the ...,Plato's Apology reflects the themes of rhetori...,[page_content='Yet some of the topics may have...,,Plato's Apology reflects the themes of rhetori...,1,1,1.094183,7f273223-0e18-47cd-aba8-1ae2b91e8f5e,c9f1e164-aadc-478e-9d4e-7031d791a918
5,In what ways does Plato's 'Apology' reflect th...,"The provided context suggests that Plato's ""Ap...",[page_content='Yet some of the topics may have...,,Plato's 'Apology' reflects the themes of rheto...,1,0,1.850498,973d3b87-4ea7-4f16-a53c-572a47cbe3a9,ecccb67f-481e-4a7b-b58a-68502806bc51
6,"In Plato's Apology, how does Socrates' rhetori...",The provided context does not explicitly detai...,[page_content='Yet some of the topics may have...,,"In Plato's Apology, Socrates employs a unique ...",0,0,0.778625,8dcd4065-85ae-483b-8b0a-f33f2570479b,b2884ecd-b035-4408-bd59-4302b950d852
7,In what ways does Socrates' defense in Plato's...,The provided context does not contain specific...,[page_content='The “Apology” or Platonic defen...,,Socrates' defense in Plato's 'Apology' reflect...,0,0,1.312558,9f62caca-c477-4e96-8b80-0ccc9f9cce5f,20734421-9989-4bd9-81f4-3f6c1d5eade1
8,What does the sun represent for the fir tree i...,The sun represents a source of joy and vitalit...,"[page_content='""Rejoice in our love,"" said the...",,The sun represents warmth and joy for the fir ...,1,0,0.725027,306eaafa-2ead-485d-b9f1-68a8ddb975b2,2efcd4ae-64b7-46ba-a894-d9b99f345787
9,Who is Edna F. Hart in relation to Hans Anders...,The provided context does not mention Edna F. ...,"[page_content='carefully standardized, their p...",,Edna F. Hart is the illustrator of Hans Anders...,0,0,0.589209,90635fe6-8fea-4756-b2a4-d0de2200133c,406872e8-1c60-46c4-b7ce-1d5e42cf6af7


In [113]:
evaluate(
    finetune_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator
    ],
    metadata={"revision_id": "Finetune_Embedding_Model"},
)

View the evaluation results for experiment: 'upbeat-pump-30' at:
https://smith.langchain.com/o/39f31aed-fcde-5ca7-a4f1-a8447f10d513/datasets/b6fa05a8-e59d-409d-924d-8ba401da0a61/compare?selectedSessions=ac150feb-4288-449d-a439-fb80db136eba




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.correctness,feedback.helpfulness,execution_time,example_id,id
0,Why did the duckling feel thankful for being u...,The provided context does not indicate that th...,"[page_content='""I believe I must go out into t...",,The duckling felt thankful for being ugly beca...,0,0,1.587317,2e3c3684-fba9-42f9-b04a-9aaf39b97173,2212e353-ca1d-47d7-a82e-0cec0db4acaf
1,How does the arrival of autumn affect Thumbeli...,The arrival of autumn makes Thumbelina's situa...,[page_content='Every morning when the sun rose...,,The arrival of autumn significantly impacts Th...,1,1,1.336359,cefd889f-5da5-4a1d-be14-1b1737cc7a70,1364b1c8-dced-48ee-9fce-7576f1c41dd6
2,How does the theme of spring manifest in the e...,The theme of spring manifests in the character...,[page_content='[Illustration]\n\n\n\n\nTHE CON...,,"In the provided fairy tale context, the theme ...",1,1,1.063394,a7a6ced9-a9bb-4648-8d54-f0e01cf91bad,adda52d7-deab-4124-a77e-1bd8457621ea
3,What do the gods require of a man to live a di...,"According to the teachings in the context, the...","[page_content='seest that those things, which ...",,The gods require no more of any man than to ke...,1,0,1.002235,24297489-b472-46c2-b91a-b16d2b7b80ef,700366de-344b-4b46-9fe9-b6827f5d50b2
4,In what ways does Plato's Apology reflect the ...,"Based on the provided context, Plato's Apology...",[page_content='Yet some of the topics may have...,,Plato's Apology reflects the themes of rhetori...,1,0,1.270119,7f273223-0e18-47cd-aba8-1ae2b91e8f5e,fc7b5f58-b4db-4c8a-be98-45b32d4041c2
5,In what ways does Plato's 'Apology' reflect th...,The provided context suggests that Plato's 'Ap...,[page_content='Yet some of the topics may have...,,Plato's 'Apology' reflects the themes of rheto...,1,1,1.837373,973d3b87-4ea7-4f16-a53c-572a47cbe3a9,60dc1163-e277-44e6-b55c-9b76656af23a
6,"In Plato's Apology, how does Socrates' rhetori...",The provided context indicates that Plato’s “A...,[page_content='Yet some of the topics may have...,,"In Plato's Apology, Socrates employs a unique ...",1,1,1.654768,8dcd4065-85ae-483b-8b0a-f33f2570479b,1bf8a317-b2fb-4277-9871-b23f0036cba6
7,In what ways does Socrates' defense in Plato's...,The provided context does not explicitly discu...,[page_content='“Few persons will be found to w...,,Socrates' defense in Plato's 'Apology' reflect...,0,0,0.586766,9f62caca-c477-4e96-8b80-0ccc9f9cce5f,1b9506ad-87d7-4cea-a2cf-a15dfa59e16f
8,What does the sun represent for the fir tree i...,The provided context does not explicitly state...,[page_content='[Illustration: They danced merr...,,The sun represents warmth and joy for the fir ...,0,0,0.521279,306eaafa-2ead-485d-b9f1-68a8ddb975b2,7ed3f5a9-e34b-4db8-942a-e75747a07a80
9,Who is Edna F. Hart in relation to Hans Anders...,Edna F. Hart is the illustrator of Hans Anders...,[page_content='Language: English\n\nCharacter ...,,Edna F. Hart is the illustrator of Hans Anders...,1,1,0.545665,90635fe6-8fea-4756-b2a4-d0de2200133c,140e57fe-d1e2-4483-976c-36ec8ac8d62b


<span style="color:green">**LangSmith Evaluation Results: Much better!**

<span style="color:green"> Bingo! The LangSmith evaluation shows that our fine-tuned model got better scores: in the run shown above, correctness rose from 4/10 to 7/10 and helpfulness from 1/10 to 5/10. I tried three runs and the results were the same in each.

![image](ls_eval.png)
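
The comparison can also be tallied directly from the `feedback.correctness` and `feedback.helpfulness` columns of the two result tables above. This is a small sketch with the values copied by hand, row by row; `pass_rate` is a helper made up for this example.

```python
# Feedback columns copied from the base and fine-tuned result tables above.
base = {
    "correctness": [0, 0, 0, 1, 1, 1, 0, 0, 1, 0],
    "helpfulness": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}
finetuned = {
    "correctness": [0, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    "helpfulness": [0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
}


def pass_rate(scores):
    """Fraction of examples the evaluator scored as 1."""
    return sum(scores) / len(scores)


summary = {
    name: {criterion: pass_rate(scores) for criterion, scores in feedback.items()}
    for name, feedback in (("base", base), ("finetuned", finetuned))
}
print(summary)
```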

