# AI Engineering Bootcamp Cohort 4 Midterm

For details of this project, please see the project README.

This notebook takes advantage of shared state classes and common utility functions that are also used by the Chainlit application and other notebooks.

The state classes are:
- AppState - contains information about the application and the documents we are looking at
- ModelRunState - contains information for each individual run for a model, chunking parameters, and results 
- RagasState - contains information for Ragas evaluation including the questions and context

The state is initialized once and passed between the functions. Each combination of model, chunking strategy, and parameters runs independently and saves its own results, so runs can be compared easily.

The utilities include:
- constants.py - constants used throughout the package (mainly for the chunking strategies)
- debugger.py - supports printing messages for when debug=True
- doc_utilities.py - supports reading the documents
- rag_utilities.py - supports the creation of the RAG chain 
- templates.py - provides templates for RAG chains
- vector_utilities.py - provides functions for chunking documents and setting up the vector store. Includes 4 chunking strategies:
    - Recursive splitter with chunk size and overlap
    - Table aware - tries to handle PDFs with tables
    - Section based - chunks based on section headers
    - Semantic splitter - semantic-based chunking
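
The real state classes live in the classes package and are not shown in this notebook. As a rough sketch of the pattern only, a run-state object might look like the one below. The field names mirror attributes used later in this notebook; the class name, the defaults, the `describe` helper, and the "TE3/500/50" run are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchModelRunState:
    """Hypothetical stand-in for ModelRunState; not the project's actual class."""
    name: str = ""
    qa_model_name: str = ""
    embedding_model_name: str = ""
    chunk_size: int = 1000      # assumed default
    chunk_overlap: int = 100    # assumed default
    retriever: Optional[Any] = None
    rag_qa_chain: Optional[Any] = None
    ragas_results_dict: dict = field(default_factory=dict)

    def describe(self) -> str:
        """Return a one-line summary of the run's parameters."""
        return (f"{self.name}: qa={self.qa_model_name}, "
                f"embeddings={self.embedding_model_name}, "
                f"chunk_size={self.chunk_size}, overlap={self.chunk_overlap}")


# Each run gets its own state object, so results never overwrite each other.
run_a = SketchModelRunState(name="TE3/1000/100", qa_model_name="gpt-4o-mini",
                            embedding_model_name="text-embedding-3-small")
run_b = SketchModelRunState(name="TE3/500/50", qa_model_name="gpt-4o-mini",
                            embedding_model_name="text-embedding-3-small",
                            chunk_size=500, chunk_overlap=50)
print(run_a.describe())
print(run_b.describe())
```

Keeping one object per run is what makes the later side-by-side comparisons possible: every function receives the state it should read from and write to.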



Allow for async in notebook

In [1]:
import nest_asyncio

nest_asyncio.apply()

Suppresses the tokenizers parallelism warnings from Hugging Face's transformers

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#### Install our key components for RAG etc

In [3]:
!pip install -q langchain
!pip install -q langchain-core==0.2.27 langchain-community==0.2.10
!pip install -q langchain-experimental==0.0.64 langgraph-checkpoint==1.0.6 langgraph==0.2.16 langchain-qdrant==0.1.3
!pip install -q langchain-openai==0.1.9
!pip install -q ragas==0.1.16

#### Install our vector store - Qdrant

In [4]:
!pip install -qU qdrant-client==1.11.2

#### Install supporting utilities

In [5]:
!pip install -qU tiktoken==0.7.0 pymupdf==1.24.10

Environment Variables

- OpenAI API Key - needed for the OpenAI models; read from the environment if it is already set, otherwise prompted for

In [9]:
import os
import getpass

openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    openai_api_key = getpass.getpass("OpenAI API Key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

#### Set up our starting inputs and state and read in the documents

The AppState contains the docs to be used throughout the process


In [110]:
from classes.app_state import AppState
from utilities.doc_utilities import get_documents
document_urls = [
    "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
    "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf",
]

app_state = AppState()
app_state.set_debug(False)

app_state.set_document_urls(document_urls)

get_documents(app_state)


Total documents: 2


### Create Vector Store with text-embedding-3-small embeddings

Set up our first model run - using the text-embedding-3-small for embeddings initially

This will test that our chunking and vector store functions are all working as expected

In [111]:
from classes.model_run_state import ModelRunState
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from utilities.vector_utilities import create_vector_store

model_1000_100_state = ModelRunState()
model_1000_100_state.name = "TE3/1000/100"
model_1000_100_state.chunk_size = 1000
model_1000_100_state.chunk_overlap = 100

model_1000_100_state.qa_model_name = "gpt-4o-mini"
model_1000_100_state.qa_model = ChatOpenAI(model=model_1000_100_state.qa_model_name)

# the openai embedding model
model_1000_100_state.embedding_model_name = "text-embedding-3-small"
model_1000_100_state.embedding_model = OpenAIEmbeddings(model=model_1000_100_state.embedding_model_name)

create_vector_store(app_state, model_1000_100_state)

Vector store created
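
To make the chunk_size and chunk_overlap parameters concrete, here is a deliberately simplified fixed-window splitter. It is not the recursive splitter used by the project (which also tries to break on separators such as paragraphs and sentences); the function name and the sample text are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the 100-character overlap plays for the 1000-character chunks in this run.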


Test the retriever with some sample queries

In [112]:
query = "How should you be protected from abusive data practices "
results = model_1000_100_state.retriever.get_relevant_documents(query)

print(len(results))
print(results[0].page_content)
print(results[0].metadata)
print("---")

10
You should be protected from abusive data practices via built-in 
protections and you should have agency over how data about 
you is used. You should be protected from violations of privacy through 
design choices that ensure such protections are included by default, including 
ensuring that data collection conforms to reasonable expectations and that 
only data strictly necessary for the specific context is collected. Designers, de­
velopers, and deployers of automated systems should seek your permission 
and respect your decisions regarding collection, use, access, transfer, and de­
letion of your data in appropriate ways and to the greatest extent possible; 
where not possible, alternative privacy by design safeguards should be used. 
Systems should not employ user experience and design decisions that obfus­
cate user choice or burden users with defaults that are privacy invasive. Con­
sent should only be used to justify collection of data in cases where it can be 
appropriately 

In [113]:
query = "tell me about Karen Hao"
results = model_1000_100_state.retriever.get_relevant_documents(query)

for result in results:
    print(result.page_content)
    print(result.metadata)
    print("---")

ENDNOTES
75. See., e.g., Sam Sabin. Digital surveillance in a post-Roe world. Politico. May 5, 2022. https://
www.politico.com/newsletters/digital-future-daily/2022/05/05/digital-surveillance-in-a-post-roe­
world-00030459; Federal Trade Commission. FTC Sues Kochava for Selling Data that Tracks People at
Reproductive Health Clinics, Places of Worship, and Other Sensitive Locations. Aug. 29, 2022. https://
www.ftc.gov/news-events/news/press-releases/2022/08/ftc-sues-kochava-selling-data-tracks-people­
reproductive-health-clinics-places-worship-other
76. Todd Feathers. This Private Equity Firm Is Amassing Companies That Collect Data on America’s
Children. The Markup. Jan. 11, 2022.
https://themarkup.org/machine-learning/2022/01/11/this-private-equity-firm-is-amassing-companies­
that-collect-data-on-americas-children
77. Reed Albergotti. Every employee who leaves Apple becomes an ‘associate’: In job databases used by
employers to verify resume information, every former Apple employee’s tit

OK - that all looks good - let's continue

### Create a RAG Chain for the different models

The function takes a model_run_state that supplies:
- the qa model
- the retriever 

It creates the RAG chain and saves it in the model_run_state for RAGAS evaluation

In [114]:
from utilities.templates import get_qa_prompt
from langchain_openai import ChatOpenAI
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from utilities.debugger import dprint

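def create_rag_chain(app_state, model_run_state):
    """Build the RAG chain and store it on model_run_state for Ragas evaluation."""
    chat_prompt = get_qa_prompt()

    # Sanity check: prompt + model only, with no retrieved context
    simple_chain = chat_prompt | model_run_state.qa_model
    dprint(app_state, simple_chain.invoke({"question": "Can you give me a summary of the 2 documents", "context":""}))

    rag_qa_chain = (
        # Retrieve context for the question and pass the question through unchanged
        {"context": itemgetter("question") | model_run_state.retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        # Return both the model response and the context it was given
        | {"response": chat_prompt | model_run_state.qa_model, "context": itemgetter("context")}
    )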
    response = rag_qa_chain.invoke({"question" : "What is the AI Bill of Rights "})
    dprint(app_state, response)
    dprint(app_state, response["response"].content)
    dprint(app_state, f"Number of found context: {len(response['context'])}")
    model_run_state.rag_qa_chain = rag_qa_chain
    print("RAG Chain Created")

create_rag_chain(app_state, model_1000_100_state)

RAG Chain Created
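
The data flow through the chain can be mimicked with plain functions, which may help when reading the piped expression above. This is only a sketch: the retriever and model are hypothetical stubs, and the response here is a plain string, whereas the real chain returns a chat message whose text is read via `.content`.

```python
def fake_retriever(question: str) -> list:
    """Hypothetical retriever stub: returns canned passages."""
    return ["passage one", "passage two"]


def fake_model(prompt: str) -> str:
    """Hypothetical model stub: reports the size of the prompt it was given."""
    return f"answer based on {len(prompt)} prompt characters"


def rag_answer(inputs: dict) -> dict:
    question = inputs["question"]
    # Step 1: retrieve context for the question
    context = fake_retriever(question)
    # Step 2: fill the prompt with the context and the question
    prompt = f"Context: {' '.join(context)}\nQuestion: {question}"
    # Step 3: return the response together with the context that produced it
    return {"response": fake_model(prompt), "context": context}


result = rag_answer({"question": "What is the AI Bill of Rights?"})
print(result["response"])
print(len(result["context"]))
```

Returning the context alongside the response matters later on: the Ragas metrics need to see which passages the answer was based on.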


### SDG - Create the questions for evaluation

Three functions are used to set up the questions:
- batch_chunks - splits the chunks into batches to try to stay within the OpenAI quota limits
- Create_chunks_for_ragas - takes the documents and splits them up based on the RagasState; this allows more experimentation later
- create_questions_for_ragas - sets up the generator and creates the number of questions and the distribution specified in the RagasState

This step can be skipped, since the generated questions are already stored in a file (ragas_state.pkl)

In [14]:
from classes.ragas_state import RagasState
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from math import ceil
import pandas as pd
from ragas.testset.generator import TestsetGenerator

import time

# create document chunks
def Create_chunks_for_ragas(app_state, ragas_state):
    # we have 2 documents so want representative across both
    text_splitter_eval = RecursiveCharacterTextSplitter(
        chunk_size = ragas_state.chunk_size,
        chunk_overlap = ragas_state.chunk_overlap,
        length_function = len
    )
    combined_chunks_document = []
    for document in app_state.documents:
        eval_document = document["loaded_document"]
        document_chunks = text_splitter_eval.split_documents(eval_document)
        print(f"Num chunks: {len(document_chunks)}")
        combined_chunks_document = combined_chunks_document + document_chunks

    print(f"Total chunks: {len(combined_chunks_document)}")
    ragas_state.chunks = combined_chunks_document
    print()

# submit batches
def batch_chunks(chunks, batch_size):
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]

# create the questions
def create_questions_for_ragas(app_state, ragas_state):
    generator_llm = ChatOpenAI(model=ragas_state.generator_llm)
    critic_llm = ChatOpenAI(model=ragas_state.critic_llm)
    embeddings = OpenAIEmbeddings()

    generator = TestsetGenerator.from_langchain(
        generator_llm,
        critic_llm,
        embeddings
    )

    batch_size = 200  # Number of chunks to process per batch
    delay_between_batches = 1  # 1 second delay between batches

    all_test_data = []

    for batch in batch_chunks(ragas_state.chunks, batch_size):
        print(f"Processing batch of {len(batch)} chunks...")

        # Generate testset for the current batch
        testset = generator.generate_with_langchain_docs(
            batch,  # Process this batch of chunks
            ragas_state.num_questions, 
            ragas_state.distributions
        )

        # Convert the testset to pandas and store the result
        testset_df = testset.to_pandas()
        all_test_data.append(testset_df)

        # Wait 1 second before the next batch
        time.sleep(delay_between_batches)



    combined_testset_df = pd.concat(all_test_data, ignore_index=True)
    ragas_state.testset_df = combined_testset_df

    print("Ragas questions created for all batches.")
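
As a quick check of how batch_chunks slices its input, the helper is repeated below and run on placeholder data (450 made-up chunk names) with the same batch size of 200.

```python
def batch_chunks(chunks, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]


chunks = [f"chunk-{n}" for n in range(450)]  # placeholder data
batch_sizes = [len(batch) for batch in batch_chunks(chunks, 200)]
print(batch_sizes)  # [200, 200, 50]
```

Since create_questions_for_ragas requests ragas_state.num_questions for every batch, the total number of generated questions grows with the number of batches.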


Load the ragas_state if it exists offline 

Otherwise we will need to run the question creation again

In [115]:
import os
import pickle
from ragas.testset.evolutions import simple, reasoning, multi_context

# File path where ragas_state is stored
file_path = 'ragas_state.pkl'

# Check if the pickled file exists, and load it if it does
def load_ragas_state_if_exists(file_path):
    if os.path.exists(file_path):
        try:
            with open(file_path, 'rb') as f:
                ragas_state = pickle.load(f)
            print(f"Ragas state loaded from {file_path}")
            return ragas_state
        except Exception as e:
            print(f"Error loading ragas state: {e}")
            return None
    else:
        print(f"No existing ragas state found at {file_path}")
        return None

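# Use this to load ragas_state from pickle
ragas_state = load_ragas_state_if_exists(file_path)
# distributions is not pickled (see save_ragas_state), so restore it after loading
if ragas_state is not None:
    ragas_state.distributions = {
        simple: 0.5,
        multi_context: 0.4,
        reasoning: 0.1
    }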

# use this to rebuild pickle state
# ragas_state = RagasState()
# ragas_state.generator_llm = "gpt-4o"
# Create_chunks_for_ragas(app_state, ragas_state)
# create_questions_for_ragas(app_state, ragas_state)

Ragas state loaded from ragas_state.pkl


Check how many were created

In [116]:
print(f"Number of entries in testset_df: {ragas_state.testset_df.shape[0]}")

Number of entries in testset_df: 20


Let's save the ragas state - then comment out the code above so we don't need to run it again

In [31]:
import pickle

# Save the ragas_state without distributions
def save_ragas_state(ragas_state, file_path):
    # Temporarily remove `distributions` before saving
    distributions_backup = ragas_state.distributions
    ragas_state.distributions = None

    # Save the rest of the object
    try:
        with open(file_path, 'wb') as f:
            pickle.dump(ragas_state, f)
        print(f"Ragas state saved to {file_path}")
    except Exception as e:
        print(f"Error saving ragas state: {e}")
    finally:
        # Restore `distributions` whether or not the save succeeded
        ragas_state.distributions = distributions_backup

# Uncomment and run if need to save new ragas_state
# save_ragas_state(ragas_state, 'ragas_state.pkl')

# let's check that unpickling works
# test_ragas_state = load_ragas_state_if_exists(file_path)
# print(len(test_ragas_state.testset_df))

Reduce the number of questions in an attempt to get past the OpenAI quota limits

In [117]:
def remove_random_entries_from_testset_df(ragas_state, num_entries=9):
    if len(ragas_state.testset_df) >= num_entries:
        rows_to_remove = ragas_state.testset_df.sample(n=num_entries).index
        ragas_state.testset_df = ragas_state.testset_df.drop(rows_to_remove)
        print(f"Randomly removed {num_entries} entries from ragas_state.testset_df")
    else:
        print(f"Cannot remove {num_entries} entries, only {len(ragas_state.testset_df)} entries in the DataFrame.")


remove_random_entries_from_testset_df(ragas_state, num_entries=15)
ragas_state.testset_df

Randomly removed 15 entries from ragas_state.testset_df


| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 5 | Why is data collection emphasized as a necessa... | [In discussion of technical and governance int... | The answer to given question is not present in... | simple | [{'source': 'data/Blueprint-for-an-AI-Bill-of-... | True |
| 6 | What are some examples of protected classifica... | [different treatment or impacts disfavoring pe... | Some examples of protected classifications tha... | simple | [{'source': 'data/Blueprint-for-an-AI-Bill-of-... | True |
| 7 | How do hybrid AI-human platforms balance effic... | [chat-bots and AI-driven call response systems... | Hybrid AI-human platforms balance efficiency a... | multi_context | [{'source': 'data/Blueprint-for-an-AI-Bill-of-... | True |
| 8 | What risks come from unregulated consumer data... | [of private collection. \nMeanwhile, members ... | The risks from unregulated consumer data colle... | multi_context | [{'source': 'data/Blueprint-for-an-AI-Bill-of-... | True |
| 15 | What privacy risks are associated with data me... | [Models may leak, generate, or correctly infer... | Data memorization in LLMs may pose exacerbated... | simple | [{'source': 'data/NIST.AI.600-1.pdf', 'file_pa... | True |
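
One caveat: the rows are removed without a seed, so each notebook session may keep a different subset of questions, which makes scores harder to compare between sessions. A minimal, dependency-free sketch of seeded selection follows; the function name and the seed value 42 are arbitrary choices for illustration. With pandas, passing `random_state` to `DataFrame.sample` should give the same reproducibility.

```python
import random


def pick_rows_to_keep(row_ids, num_to_remove, seed):
    """Drop num_to_remove ids chosen with a seeded generator; return the rest in order."""
    if num_to_remove > len(row_ids):
        raise ValueError("cannot remove more rows than exist")
    rng = random.Random(seed)
    removed = set(rng.sample(list(row_ids), num_to_remove))
    return [row_id for row_id in row_ids if row_id not in removed]


row_ids = list(range(20))
first = pick_rows_to_keep(row_ids, num_to_remove=15, seed=42)
second = pick_rows_to_keep(row_ids, num_to_remove=15, seed=42)
print(first == second)  # True: the same seed keeps the same rows
print(len(first))       # 5
```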


### Generate answers based on the pipeline we have created

This function takes in the model_run_state and the ragas_state and uses the RAG chain from the model_run_state to answer the questions from the ragas_state.

The response dataset is stored in the model_run_state for later evaluation

In [118]:
from datasets import Dataset
def create_answers(app_state, model_run_state, ragas_state):
  answers = []
  contexts = []

  test_questions = ragas_state.testset_df["question"].values.tolist()
  test_groundtruths = ragas_state.testset_df["ground_truth"].values.tolist()

  for question in test_questions:
    response = model_run_state.rag_qa_chain.invoke({"question" : question})
    answers.append(response["response"].content)
    contexts.append([context.page_content for context in response["context"]])

  # Wrap it in a huggingface dataset
  model_run_state.response_dataset = Dataset.from_dict({
      "question" : test_questions,
      "answer" : answers,
      "contexts" : contexts,
      "ground_truth" : test_groundtruths
  })
  model_run_state.response_dataset[0]
  print("Answers created - ready for Ragas evaluation")

create_answers(app_state, model_1000_100_state, ragas_state)

Answers created - ready for Ragas evaluation


### Evaluation

The run_ragas_evaluation function uses the response dataset from the previous step, stored in the model_run_state, to compute the requested Ragas metrics one at a time.

The results of the evaluation are then stored back in the model_run_state.

In [146]:
import pandas as pd
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)
def run_ragas_evaluation(app_state, model_run_state):
    metrics = [
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_correctness,
    ]
    # model_run_state.ragas_results = evaluate(model_run_state.response_dataset, metrics)
    # print("Ragas evaluation complete")
    results_dict = {}

    # Evaluate each metric one by one
    for metric in metrics:
        metric_name = type(metric).__name__
        print(f"Evaluating metric: {metric_name}")
        try:
            # Run the evaluation for the current metric
            result = evaluate(model_run_state.response_dataset, [metric])
            score = result.scores
            results_dict[metric_name] = score
            print(f"Metric {metric_name} evaluation complete")
        except Exception as e:
            print(f"Error evaluating metric {metric_name}: {e}")
    print("Adding results_dict to model_run_state")
    print(results_dict)
    model_run_state.ragas_results_dict = results_dict
    print("Ragas evaluation for all metrics complete")     




Run the evaluation for the TE3 model

In [None]:
run_ragas_evaluation(app_state, model_1000_100_state)

### Analysis of the Model

We can report on the model and its parameters, such as the chunk size and overlap, along with a summary of the Ragas metrics.

Set up functions to handle the display of the metrics

In [123]:
import pandas as pd

# Function to get the averages for all metrics in a single model
def get_model_averages(model):
    results = model.ragas_results_dict
    averages = {}
    
    metric_mapping = {
        'Faithfulness': 'faithfulness',
        'AnswerRelevancy': 'answer_relevancy',
        'ContextRecall': 'context_recall',
        'ContextPrecision': 'context_precision',
        'AnswerCorrectness': 'answer_correctness'
    }
    
    for metric_name, dataset in results.items():
        df = dataset.to_pandas()
        # Get the corresponding column name from the mapping
        metric_column = metric_mapping.get(metric_name, None)
        if metric_column is None:
            print(f"Metric {metric_name} not found in the mapping.")
            continue
        
        try:
            avg_value = df[metric_column].mean()  # Calculate the average for the specific metric column
        except KeyError:
            print(f"KeyError: {metric_column} not found in the dataset. Available columns: {df.columns}")
            continue
        averages[metric_name] = avg_value

    return pd.DataFrame(averages.items(), columns=['Metric', 'Average'])

def display_metrics(model_run_state):
    
    df = get_model_averages(model_run_state)
    print(df)
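
The hand-written metric_mapping above follows a regular pattern: each class name in CamelCase maps to the same words in snake_case. As a sketch, the column name could be derived instead of listed, assuming any metrics added later keep the same naming pattern (names containing acronyms would need special handling).

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a CamelCase class name such as 'AnswerRelevancy' to 'answer_relevancy'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


derived = {}
for class_name in ["Faithfulness", "AnswerRelevancy", "ContextRecall",
                   "ContextPrecision", "AnswerCorrectness"]:
    derived[class_name] = camel_to_snake(class_name)
    print(class_name, "->", derived[class_name])
```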


In [124]:
model_1000_100_state.parameters()
display_metrics(model_1000_100_state)

Base model: gpt-4o-mini
Embedding model: text-embedding-3-small
Chunk size: 1000
Chunk overlap: 100
              Metric   Average
0       Faithfulness  0.940000
1    AnswerRelevancy  0.977401
2      ContextRecall  1.000000
3   ContextPrecision  0.775980
4  AnswerCorrectness  0.750321


### Base Snowflake Model Evaluation

Process:
- set up the model_run_state for the base Snowflake model
- create the vector store using the base model
- create the RAG chain with the retriever for the vector store
- generate the answers to the Ragas questions
- evaluate the model's performance using Ragas

In [125]:
snowflake_base_state = ModelRunState()
snowflake_base_state.name = "Snowflake_Base/1000/100"
snowflake_base_state.qa_model_name = "gpt-4o-mini"
snowflake_base_state.qa_model = ChatOpenAI(model=snowflake_base_state.qa_model_name)

# snowflake embedding model
snowflake_base_state.embedding_model_name = "Snowflake/snowflake-arctic-embed-m"
snowflake_base_state.embedding_model = HuggingFaceEmbeddings(model_name=snowflake_base_state.embedding_model_name)

# use same chunk size as before
snowflake_base_state.chunk_size = 1000
snowflake_base_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_base_state)

create_rag_chain(app_state, snowflake_base_state)
create_answers(app_state, snowflake_base_state, ragas_state)


Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation


In [132]:
run_ragas_evaluation(app_state, snowflake_base_state)

Evaluating metric: Faithfulness


Evaluating: 100%|██████████| 5/5 [00:17<00:00,  3.47s/it]


Metric Faithfulness evaluation complete
Evaluating metric: AnswerRelevancy


Evaluating: 100%|██████████| 5/5 [00:04<00:00,  1.22it/s]


Metric AnswerRelevancy evaluation complete
Evaluating metric: ContextRecall


Evaluating: 100%|██████████| 5/5 [00:04<00:00,  1.22it/s]


Metric ContextRecall evaluation complete
Evaluating metric: ContextPrecision


Evaluating: 100%|██████████| 5/5 [00:26<00:00,  5.30s/it]


Metric ContextPrecision evaluation complete
Evaluating metric: AnswerCorrectness


Evaluating: 100%|██████████| 5/5 [02:22<00:00, 28.48s/it]


Metric AnswerCorrectness evaluation complete
Ragas evaluation for all metrics complete


### Base Snowflake model results

In [133]:

snowflake_base_state.parameters()
display_metrics(snowflake_base_state)

Base model: gpt-4o-mini
Embedding model: Snowflake/snowflake-arctic-embed-m
Chunk size: 1000
Chunk overlap: 100
              Metric   Average
0       Faithfulness  0.400000
1    AnswerRelevancy  0.383450
2      ContextRecall  0.550000
3   ContextPrecision  0.203968
4  AnswerCorrectness  0.332609


In [136]:
import pandas as pd

# Helper function to extract average values from a dataset
def get_average_metric(dataset):
    df = dataset.to_pandas()
    return df.mean().iloc[0]  # Return the average of the first column (since each dataset only has one column)
def compare_results(run_model_1, run_model_2):
    # Extract the metrics for both models
    results_1 = run_model_1.ragas_results_dict
    results_2 = run_model_2.ragas_results_dict
    
    comparison_data = {
        'Metric': [],
        run_model_1.name: [],
        run_model_2.name: [],
        'Difference': []
    }

    for metric_name in results_1.keys():
        # Calculate the average for each metric
        avg_1 = get_average_metric(results_1[metric_name])
        avg_2 = get_average_metric(results_2[metric_name])
        
        # Append to the comparison data
        comparison_data['Metric'].append(metric_name)
        comparison_data[run_model_1.name].append(avg_1)
        comparison_data[run_model_2.name].append(avg_2)
        comparison_data['Difference'].append(avg_2 - avg_1)

    return pd.DataFrame(comparison_data)

### Comparison of Base Snowflake with the TE3

In [137]:
df = compare_results(snowflake_base_state, model_1000_100_state)
print(df)

              Metric  Snowflake_Base/1000/100  TE3/1000/100  Difference
0       Faithfulness                 0.400000      0.940000    0.540000
1    AnswerRelevancy                 0.383450      0.977401    0.593951
2      ContextRecall                 0.550000      1.000000    0.450000
3   ContextPrecision                 0.203968      0.775980    0.572012
4  AnswerCorrectness                 0.332609      0.750321    0.417711


### Fine Tuned Snowflake Model (1st Model) Evaluation

In [138]:
from langchain.embeddings import HuggingFaceEmbeddings
snowflake_finetune_state = ModelRunState()
snowflake_finetune_state.name = "Snowflake_Fine/1000/100"
snowflake_finetune_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_state.qa_model = ChatOpenAI(model=snowflake_finetune_state.qa_model_name)

# finetune snowflake embedding model

hf_username = "rchrdgwr"
hf_repo_name = "finetuned-arctic-model"

# Load the fine-tuned model from Hugging Face
snowflake_finetune_state.embedding_model_name = f"{hf_username}/{hf_repo_name}"
snowflake_finetune_state.embedding_model = HuggingFaceEmbeddings(model_name=snowflake_finetune_state.embedding_model_name)

# use same chunk size as before
snowflake_finetune_state.chunk_size = 1000
snowflake_finetune_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_state)

create_rag_chain(app_state, snowflake_finetune_state)
create_answers(app_state, snowflake_finetune_state, ragas_state)


Some weights of BertModel were not initialized from the model checkpoint at rchrdgwr/finetuned-arctic-model and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation


In [140]:
run_ragas_evaluation(app_state, snowflake_finetune_state)

Evaluating metric: Faithfulness


Evaluating: 100%|██████████| 5/5 [02:05<00:00, 25.12s/it]


Metric Faithfulness evaluation complete
Evaluating metric: AnswerRelevancy


Evaluating: 100%|██████████| 5/5 [00:28<00:00,  5.80s/it]


Metric AnswerRelevancy evaluation complete
Evaluating metric: ContextRecall


Evaluating: 100%|██████████| 5/5 [00:48<00:00,  9.73s/it]


Metric ContextRecall evaluation complete
Evaluating metric: ContextPrecision


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]Exception raised in Job[4]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Exception raised in Job[2]: TimeoutError()
Exception raised in Job[1]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Evaluating: 100%|██████████| 5/5 [03:00<00:00, 36.00s/it] 


Metric ContextPrecision evaluation complete
Evaluating metric: AnswerCorrectness


Evaluating: 100%|██████████| 5/5 [02:16<00:00, 27.26s/it]


Metric AnswerCorrectness evaluation complete
Ragas evaluation for all metrics complete


### Fine Tuned Snowflake Model (1st Model) Results

In [141]:
snowflake_finetune_state.parameters()
display_metrics(snowflake_finetune_state)

Base model: gpt-4o-mini
Embedding model: rchrdgwr/finetuned-arctic-model
Chunk size: 1000
Chunk overlap: 100
              Metric   Average
0       Faithfulness  1.000000
1    AnswerRelevancy  0.973854
2      ContextRecall  1.000000
3   ContextPrecision       NaN
4  AnswerCorrectness  0.632057
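
ContextPrecision is NaN here. The evaluation log above shows a TimeoutError for all five ContextPrecision jobs, so most likely no row received a score and there was nothing to average. A small sketch of a summary that reports how many rows were actually scored next to the mean, so a partial failure is visible too; the sample values are made up.

```python
import math


def summarize_scores(scores):
    """Return (mean of valid scores or NaN, number of valid scores)."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        return math.nan, 0
    return sum(valid) / len(valid), len(valid)


partial_mean, partial_count = summarize_scores([0.9, 1.0, math.nan, 0.8, 1.0])
failed_mean, failed_count = summarize_scores([math.nan] * 5)
print(partial_mean, partial_count)
print(failed_mean, failed_count)  # every job failed, so the mean is NaN and the count is 0
```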


### Comparison of TE3, Base Snowflake, and Fine Tuned Snowflake (1st Model) Results

In [142]:
import pandas as pd
def compare_results_3(run_model_1, run_model_2, run_model_3):
    # Extract results for each model
    results_1 = run_model_1.ragas_results_dict
    results_2 = run_model_2.ragas_results_dict
    results_3 = run_model_3.ragas_results_dict

    comparison_data = {
        'Metric': [],
        run_model_1.name: [],
        run_model_2.name: [],
        run_model_3.name: [],
        '1v2 Difference': [],
        '1v3 Difference': [],
        '2v3 Difference': []
    }

    for metric_name in results_1.keys():
        # Calculate the average for each metric
        avg_1 = get_average_metric(results_1[metric_name])
        avg_2 = get_average_metric(results_2[metric_name])
        avg_3 = get_average_metric(results_3[metric_name])
        
        # Append to the comparison data
        comparison_data['Metric'].append(metric_name)
        comparison_data[run_model_1.name].append(avg_1)
        comparison_data[run_model_2.name].append(avg_2)
        comparison_data[run_model_3.name].append(avg_3)
        comparison_data['1v2 Difference'].append(avg_2 - avg_1)
        comparison_data['1v3 Difference'].append(avg_3 - avg_1)
        comparison_data['2v3 Difference'].append(avg_3 - avg_2)

    return pd.DataFrame(comparison_data)
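
compare_results and compare_results_3 differ only in how many runs they take. A possible generalization, sketched here on plain dictionaries of averages rather than on the state objects, compares any number of runs against one baseline. The function name and the column naming are assumptions; the numbers are copied from the results above (two metrics only, for brevity).

```python
def compare_averages(runs: dict, baseline: str) -> list:
    """Build one row per metric with each run's average and its difference from the baseline."""
    rows = []
    for metric, base_value in runs[baseline].items():
        row = {"Metric": metric}
        for run_name, averages in runs.items():
            row[run_name] = averages[metric]
            if run_name != baseline:
                row[f"{run_name} - {baseline}"] = averages[metric] - base_value
        rows.append(row)
    return rows


runs = {
    "TE3/1000/100": {"Faithfulness": 0.94, "ContextRecall": 1.0},
    "Snowflake_Base/1000/100": {"Faithfulness": 0.40, "ContextRecall": 0.55},
    "Snowflake_Fine/1000/100": {"Faithfulness": 1.00, "ContextRecall": 1.0},
}
rows = compare_averages(runs, baseline="TE3/1000/100")
for row in rows:
    print(row)
```

The list of rows can be passed straight to `pd.DataFrame(rows)` if a table is wanted.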

In [143]:
df = compare_results_3(model_1000_100_state , snowflake_base_state,  snowflake_finetune_state)
df

| | Metric | TE3/1000/100 | Snowflake_Base/1000/100 | Snowflake_Fine/1000/100 | 1v2 Difference | 1v3 Difference | 2v3 Difference |
|---|---|---|---|---|---|---|---|
| 0 | Faithfulness | 0.94 | 0.4 | 1.0 | -0.54 | 0.06 | 0.6 |
| 1 | AnswerRelevancy | 0.977401 | 0.38345 | 0.973854 | -0.593951 | -0.003547 | 0.590404 |
| 2 | ContextRecall | 1.0 | 0.55 | 1.0 | -0.45 | 0.0 | 0.45 |
| 3 | ContextPrecision | 0.77598 | 0.203968 | NaN | -0.572012 | NaN | NaN |
| 4 | AnswerCorrectness | 0.750321 | 0.332609 | 0.632057 | -0.417711 | -0.118264 | 0.299448 |


### Let's run some different chunking strategies

- Section Aware
- Table Aware
- Semantic Chunking

Then run a comparison with the base run that used recursive text splitting


In [144]:
hf_username = "rchrdgwr"
hf_repo_name = "finetuned-arctic-model"

snowflake_finetune_model_name = f"{hf_username}/{hf_repo_name}"
snowflake_finetune_model = HuggingFaceEmbeddings(model_name=snowflake_finetune_model_name)

Some weights of BertModel were not initialized from the model checkpoint at rchrdgwr/finetuned-arctic-model and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Fine Tuned Snowflake Model (1st Model) With Section Chunking Strategy

In [147]:
from utilities.constants import (
    CHUNKING_STRATEGY_TABLE_AWARE,
    CHUNKING_STRATEGY_SECTION_BASED,
    CHUNKING_STRATEGY_SEMANTIC
)

snowflake_finetune_section_state = ModelRunState()
snowflake_finetune_section_state.name = "Snowflake_FineSection/1000/100"
snowflake_finetune_section_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_section_state.qa_model = ChatOpenAI(model=snowflake_finetune_section_state.qa_model_name)

snowflake_finetune_section_state.embedding_model_name = snowflake_finetune_model_name
snowflake_finetune_section_state.embedding_model = snowflake_finetune_model

# use same chunk size as before
snowflake_finetune_section_state.chunking_strategy = CHUNKING_STRATEGY_SECTION_BASED
snowflake_finetune_section_state.chunk_size = 1000
snowflake_finetune_section_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_section_state)

create_rag_chain(app_state, snowflake_finetune_section_state)
create_answers(app_state, snowflake_finetune_section_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_section_state)


Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation
Evaluating metric: Faithfulness


Evaluating: 100%|██████████| 5/5 [00:11<00:00,  2.36s/it]


Metric Faithfulness evaluation complete
Evaluating metric: AnswerRelevancy


Evaluating: 100%|██████████| 5/5 [00:02<00:00,  1.93it/s]


Metric AnswerRelevancy evaluation complete
Evaluating metric: ContextRecall


Evaluating: 100%|██████████| 5/5 [00:03<00:00,  1.38it/s]


Metric ContextRecall evaluation complete
Evaluating metric: ContextPrecision


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]Exception raised in Job[1]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Evaluating:  20%|██        | 1/5 [02:59<11:59, 180.00s/it]Exception raised in Job[2]: TimeoutError()
Exception raised in Job[4]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Evaluating: 100%|██████████| 5/5 [02:59<00:00, 36.00s/it] 


Metric ContextPrecision evaluation complete
Evaluating metric: AnswerCorrectness


Evaluating: 100%|██████████| 5/5 [02:12<00:00, 26.42s/it]


Metric AnswerCorrectness evaluation complete
Adding results_dict to model_run_state
{'Faithfulness': Dataset({
    features: ['faithfulness'],
    num_rows: 5
}), 'AnswerRelevancy': Dataset({
    features: ['answer_relevancy'],
    num_rows: 5
}), 'ContextRecall': Dataset({
    features: ['context_recall'],
    num_rows: 5
}), 'ContextPrecision': Dataset({
    features: ['context_precision'],
    num_rows: 5
}), 'AnswerCorrectness': Dataset({
    features: ['answer_correctness'],
    num_rows: 5
})}
Ragas evaluation for all metrics complete


In [148]:
display_metrics(snowflake_finetune_section_state)

              Metric   Average
0       Faithfulness  0.928342
1    AnswerRelevancy  0.963411
2      ContextRecall  0.900000
3   ContextPrecision       NaN
4  AnswerCorrectness  0.466659
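
The NaN for ContextPrecision follows from the log above: all five ContextPrecision jobs raised TimeoutError, so there was no score left to average. The sketch below illustrates this, assuming a timed-out job is recorded as NaN and that the average skips NaN values. `summarize_scores` is a hypothetical helper (not part of the project's utilities) and the per-question scores are made up for illustration.

```python
import math


def summarize_scores(scores):
    """Return (mean of the valid scores, number of valid scores).

    A score counts as valid when it is neither None nor NaN.
    """
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        # Nothing to average: report NaN instead of raising ZeroDivisionError.
        return float("nan"), 0
    return sum(valid) / len(valid), len(valid)


# Hypothetical per-question scores (assumption: a timed-out job yields NaN).
all_timed_out = [float("nan")] * 5
partly_timed_out = [0.9, float("nan"), 0.6, float("nan"), float("nan")]

mean_all, n_all = summarize_scores(all_timed_out)
mean_part, n_part = summarize_scores(partly_timed_out)

print(f"all timed out:    mean={mean_all}, valid={n_all}/5")
print(f"partly timed out: mean={mean_part:.2f}, valid={n_part}/5")
```

Reporting the number of valid scores next to the average makes it clear when a metric rests on only a few of the five questions.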


### Fine Tuned Snowflake Model (1st Model) With Table Aware Chunking Strategy

In [149]:
snowflake_finetune_table_state = ModelRunState()
snowflake_finetune_table_state.name = "Snowflake_FineTable/1000/100"
snowflake_finetune_table_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_table_state.qa_model = ChatOpenAI(model=snowflake_finetune_table_state.qa_model_name)

snowflake_finetune_table_state.embedding_model_name = snowflake_finetune_model_name
snowflake_finetune_table_state.embedding_model = snowflake_finetune_model

# use same chunk size as before
snowflake_finetune_table_state.chunking_strategy = CHUNKING_STRATEGY_TABLE_AWARE
snowflake_finetune_table_state.chunk_size = 1000
snowflake_finetune_table_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_table_state)

create_rag_chain(app_state, snowflake_finetune_table_state)
create_answers(app_state, snowflake_finetune_table_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_table_state)
display_metrics(snowflake_finetune_table_state)

Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation
Evaluating metric: Faithfulness


Evaluating: 100%|██████████| 5/5 [01:00<00:00, 12.13s/it]


Metric Faithfulness evaluation complete
Evaluating metric: AnswerRelevancy


Evaluating: 100%|██████████| 5/5 [00:28<00:00,  5.61s/it]


Metric AnswerRelevancy evaluation complete
Evaluating metric: ContextRecall


Evaluating: 100%|██████████| 5/5 [00:38<00:00,  7.63s/it]


Metric ContextRecall evaluation complete
Evaluating metric: ContextPrecision


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]Exception raised in Job[4]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Evaluating:  20%|██        | 1/5 [03:00<12:00, 180.00s/it]Exception raised in Job[2]: TimeoutError()
Exception raised in Job[1]: TimeoutError()
Evaluating: 100%|██████████| 5/5 [03:00<00:00, 36.00s/it] 


Metric ContextPrecision evaluation complete
Evaluating metric: AnswerCorrectness


Evaluating: 100%|██████████| 5/5 [02:16<00:00, 27.21s/it]


Metric AnswerCorrectness evaluation complete
Adding results_dict to model_run_state
{'Faithfulness': Dataset({
    features: ['faithfulness'],
    num_rows: 5
}), 'AnswerRelevancy': Dataset({
    features: ['answer_relevancy'],
    num_rows: 5
}), 'ContextRecall': Dataset({
    features: ['context_recall'],
    num_rows: 5
}), 'ContextPrecision': Dataset({
    features: ['context_precision'],
    num_rows: 5
}), 'AnswerCorrectness': Dataset({
    features: ['answer_correctness'],
    num_rows: 5
})}
Ragas evaluation for all metrics complete
              Metric   Average
0       Faithfulness  0.890942
1    AnswerRelevancy  0.969810
2      ContextRecall  1.000000
3   ContextPrecision       NaN
4  AnswerCorrectness  0.380591


### Fine Tuned Snowflake Model (1st Model) With Semantic Chunking Strategy

In [150]:

snowflake_finetune_semantic_state = ModelRunState()
snowflake_finetune_semantic_state.name = "Snowflake_FineSemantic/1000/100"
snowflake_finetune_semantic_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_semantic_state.qa_model = ChatOpenAI(model=snowflake_finetune_semantic_state.qa_model_name)

snowflake_finetune_semantic_state.embedding_model_name = snowflake_finetune_model_name
snowflake_finetune_semantic_state.embedding_model = snowflake_finetune_model

# use same chunk size as before
snowflake_finetune_semantic_state.chunking_strategy = CHUNKING_STRATEGY_SEMANTIC
snowflake_finetune_semantic_state.chunk_size = 1000
snowflake_finetune_semantic_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_semantic_state)

create_rag_chain(app_state, snowflake_finetune_semantic_state)
create_answers(app_state, snowflake_finetune_semantic_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_semantic_state)
display_metrics(snowflake_finetune_semantic_state)



Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation
Evaluating metric: Faithfulness


Evaluating: 100%|██████████| 5/5 [00:13<00:00,  2.71s/it]


Metric Faithfulness evaluation complete
Evaluating metric: AnswerRelevancy


Evaluating: 100%|██████████| 5/5 [00:02<00:00,  2.24it/s]


Metric AnswerRelevancy evaluation complete
Evaluating metric: ContextRecall


Evaluating: 100%|██████████| 5/5 [00:02<00:00,  2.36it/s]


Metric ContextRecall evaluation complete
Evaluating metric: ContextPrecision


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]Exception raised in Job[4]: TimeoutError()
Exception raised in Job[2]: TimeoutError()
Exception raised in Job[1]: TimeoutError()
Evaluating:  20%|██        | 1/5 [03:00<12:00, 180.00s/it]Exception raised in Job[3]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Evaluating: 100%|██████████| 5/5 [03:00<00:00, 36.00s/it] 


Metric ContextPrecision evaluation complete
Evaluating metric: AnswerCorrectness


Evaluating: 100%|██████████| 5/5 [02:13<00:00, 26.61s/it]


Metric AnswerCorrectness evaluation complete
Adding results_dict to model_run_state
{'Faithfulness': Dataset({
    features: ['faithfulness'],
    num_rows: 5
}), 'AnswerRelevancy': Dataset({
    features: ['answer_relevancy'],
    num_rows: 5
}), 'ContextRecall': Dataset({
    features: ['context_recall'],
    num_rows: 5
}), 'ContextPrecision': Dataset({
    features: ['context_precision'],
    num_rows: 5
}), 'AnswerCorrectness': Dataset({
    features: ['answer_correctness'],
    num_rows: 5
})}
Ragas evaluation for all metrics complete
              Metric   Average
0       Faithfulness  0.798095
1    AnswerRelevancy  0.967873
2      ContextRecall  0.500000
3   ContextPrecision       NaN
4  AnswerCorrectness  0.604548
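
The three runs above differ only in their name and chunking strategy, so their setup could be generated in a loop instead of being repeated by hand. This is a minimal sketch under stated assumptions: `RunSpec` is a hypothetical stand-in for ModelRunState, and the strategy strings are placeholders for the constants defined in constants.py, whose actual values are not shown in this notebook.

```python
from dataclasses import dataclass


@dataclass
class RunSpec:
    """Hypothetical stand-in for ModelRunState holding only the varying settings."""
    name: str
    chunking_strategy: str
    chunk_size: int = 1000
    chunk_overlap: int = 100
    qa_model_name: str = "gpt-4o-mini"


# Placeholder values; the real ones live in constants.py (assumption).
STRATEGIES = {
    "Section": "section_based",
    "Table": "table_aware",
    "Semantic": "semantic",
}


def build_run_specs(prefix="Snowflake_Fine", chunk_size=1000, chunk_overlap=100):
    """Create one RunSpec per chunking strategy with a consistent naming scheme."""
    return [
        RunSpec(
            name=f"{prefix}{label}/{chunk_size}/{chunk_overlap}",
            chunking_strategy=strategy,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )
        for label, strategy in STRATEGIES.items()
    ]


specs = build_run_specs()
for spec in specs:
    print(spec.name, spec.chunking_strategy)
```

Each spec could then be copied onto a ModelRunState before calling create_vector_store, create_rag_chain, create_answers and run_ragas_evaluation in turn.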


### Comparison of Fine Tuned Snowflake Model (1st Model) with 3 Different Chunking Strategies

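Note: the first run (Snowflake_Fine/1000/100) uses the simple recursive splitter with the specified chunk size and overlap. Each difference column is the later run minus the earlier run (for example, 1v2 is run 2 minus run 1). ContextPrecision is NaN in every column; in the three runs above, all five ContextPrecision jobs raised TimeoutError, so no score was recorded.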

In [151]:
def compare_results_4(run_model_1, run_model_2, run_model_3, run_model_4):
    # Extract results for each model
    results_1 = run_model_1.ragas_results_dict
    results_2 = run_model_2.ragas_results_dict
    results_3 = run_model_3.ragas_results_dict
    results_4 = run_model_4.ragas_results_dict

    comparison_data = {
        'Metric': [],
        run_model_1.name: [],
        run_model_2.name: [],
        run_model_3.name: [],
        run_model_4.name: [],
        '1v2 Difference': [],
        '1v3 Difference': [],
        '1v4 Difference': [],
        '2v3 Difference': [],
        '2v4 Difference': [],
        '3v4 Difference': []
    }

    for metric_name in results_1.keys():
        # Calculate the average for each metric
        avg_1 = get_average_metric(results_1[metric_name])
        avg_2 = get_average_metric(results_2[metric_name])
        avg_3 = get_average_metric(results_3[metric_name])
        avg_4 = get_average_metric(results_4[metric_name])
        
        # Append to the comparison data
        comparison_data['Metric'].append(metric_name)
        comparison_data[run_model_1.name].append(avg_1)
        comparison_data[run_model_2.name].append(avg_2)
        comparison_data[run_model_3.name].append(avg_3)
        comparison_data[run_model_4.name].append(avg_4)
        comparison_data['1v2 Difference'].append(avg_2 - avg_1)
        comparison_data['1v3 Difference'].append(avg_3 - avg_1)
        comparison_data['1v4 Difference'].append(avg_4 - avg_1)
        comparison_data['2v3 Difference'].append(avg_3 - avg_2)
        comparison_data['2v4 Difference'].append(avg_4 - avg_2)
        comparison_data['3v4 Difference'].append(avg_4 - avg_3)

    return pd.DataFrame(comparison_data)

df = compare_results_4(snowflake_finetune_state , snowflake_finetune_section_state, snowflake_finetune_table_state, snowflake_finetune_semantic_state)
df

| Metric | Snowflake_Fine/1000/100 | Snowflake_FineSection/1000/100 | Snowflake_FineTable/1000/100 | Snowflake_FineSemantic/1000/100 | 1v2 Difference | 1v3 Difference | 1v4 Difference | 2v3 Difference | 2v4 Difference | 3v4 Difference |
|---|---|---|---|---|---|---|---|---|---|---|
| Faithfulness | 1.0 | 0.928342 | 0.890942 | 0.798095 | -0.071658 | -0.109058 | -0.201905 | -0.0374 | -0.130247 | -0.092847 |
| AnswerRelevancy | 0.973854 | 0.963411 | 0.96981 | 0.967873 | -0.010443 | -0.004044 | -0.005981 | 0.006399 | 0.004462 | -0.001937 |
| ContextRecall | 1.0 | 0.9 | 1.0 | 0.5 | -0.1 | 0.0 | -0.5 | 0.1 | -0.4 | -0.5 |
| ContextPrecision | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AnswerCorrectness | 0.632057 | 0.466659 | 0.380591 | 0.604548 | -0.165398 | -0.251467 | -0.027509 | -0.086068 | 0.137889 | 0.223958 |
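
compare_results_4 is hard-wired to four runs and six difference columns. A more general version can derive the pairs with itertools.combinations. The sketch below is one possible shape, not the notebook's implementation: `compare_runs` is a hypothetical helper that takes pre-computed averages as a plain dict (rather than ModelRunState objects) and keeps the same sign convention, later run minus earlier run. The ContextRecall values are taken from the table above.

```python
from itertools import combinations


def compare_runs(averages_by_run):
    """Build comparison rows from {run_name: {metric_name: average}}.

    Returns a list of dicts, one per metric, with one column per run and one
    "<i>v<j> Difference" column per pair of runs (run j minus run i).
    """
    names = list(averages_by_run)
    metrics = list(averages_by_run[names[0]])
    rows = []
    for metric in metrics:
        row = {"Metric": metric}
        for name in names:
            row[name] = averages_by_run[name][metric]
        # Pairs come out in the order 1v2, 1v3, ..., 2v3, ... for any run count.
        for (i, first), (j, second) in combinations(enumerate(names, start=1), 2):
            row[f"{i}v{j} Difference"] = (
                averages_by_run[second][metric] - averages_by_run[first][metric]
            )
        rows.append(row)
    return rows


averages = {
    "Snowflake_Fine/1000/100": {"ContextRecall": 1.0},
    "Snowflake_FineSection/1000/100": {"ContextRecall": 0.9},
    "Snowflake_FineTable/1000/100": {"ContextRecall": 1.0},
    "Snowflake_FineSemantic/1000/100": {"ContextRecall": 0.5},
}

rows = compare_runs(averages)
for column, value in rows[0].items():
    print(f"{column}: {value}")
```

Passing the returned list to pd.DataFrame(rows) would give a frame with the same column layout as the one produced by compare_results_4.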


### Fine Tuned Snowflake Model (2nd Model Evaluation)

In [152]:
# Import from langchain_community; the langchain.embeddings path is deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
snowflake_finetune_2_state = ModelRunState()
snowflake_finetune_2_state.name = "Snowflake_Fine_2/1000/100"
snowflake_finetune_2_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_2_state.qa_model = ChatOpenAI(model=snowflake_finetune_2_state.qa_model_name)

# Second fine-tuned Snowflake Arctic embedding model, hosted on Hugging Face

hf_username = "rchrdgwr"
hf_repo_name = "finetuned-arctic-model-2"

# Load the fine-tuned model from Hugging Face
snowflake_finetune_2_state.embedding_model_name = f"{hf_username}/{hf_repo_name}"
snowflake_finetune_2_state.embedding_model = HuggingFaceEmbeddings(model_name=snowflake_finetune_2_state.embedding_model_name)

# use same chunk size as before
snowflake_finetune_2_state.chunk_size = 1000
snowflake_finetune_2_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_2_state)

create_rag_chain(app_state, snowflake_finetune_2_state)
create_answers(app_state, snowflake_finetune_2_state, ragas_state)


Some weights of BertModel were not initialized from the model checkpoint at rchrdgwr/finetuned-arctic-model-2 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation


In [153]:
run_ragas_evaluation(app_state, snowflake_finetune_2_state)

Evaluating metric: Faithfulness


Evaluating: 100%|██████████| 5/5 [00:26<00:00,  5.39s/it]


Metric Faithfulness evaluation complete
Evaluating metric: AnswerRelevancy


Evaluating: 100%|██████████| 5/5 [00:03<00:00,  1.58it/s]


Metric AnswerRelevancy evaluation complete
Evaluating metric: ContextRecall


Evaluating: 100%|██████████| 5/5 [00:04<00:00,  1.07it/s]


Metric ContextRecall evaluation complete
Evaluating metric: ContextPrecision


Evaluating:  40%|████      | 2/5 [02:24<03:18, 66.28s/it] Exception raised in Job[4]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Evaluating: 100%|██████████| 5/5 [03:01<00:00, 36.26s/it]


Metric ContextPrecision evaluation complete
Evaluating metric: AnswerCorrectness


Evaluating: 100%|██████████| 5/5 [02:14<00:00, 26.83s/it]


Metric AnswerCorrectness evaluation complete
Adding results_dict to model_run_state
{'Faithfulness': Dataset({
    features: ['faithfulness'],
    num_rows: 5
}), 'AnswerRelevancy': Dataset({
    features: ['answer_relevancy'],
    num_rows: 5
}), 'ContextRecall': Dataset({
    features: ['context_recall'],
    num_rows: 5
}), 'ContextPrecision': Dataset({
    features: ['context_precision'],
    num_rows: 5
}), 'AnswerCorrectness': Dataset({
    features: ['answer_correctness'],
    num_rows: 5
})}
Ragas evaluation for all metrics complete


In [154]:
display_metrics(snowflake_finetune_2_state)

              Metric   Average
0       Faithfulness  0.992000
1    AnswerRelevancy  0.981830
2      ContextRecall  1.000000
3   ContextPrecision  0.751185
4  AnswerCorrectness  0.526832


### Comparison of the Two Fine Tuned Snowflake Models

In [155]:
df = compare_results(snowflake_finetune_state, snowflake_finetune_2_state )
df

| Metric | Snowflake_Fine/1000/100 | Snowflake_Fine_2/1000/100 | Difference |
|---|---|---|---|
| Faithfulness | 1.0 | 0.992 | -0.008 |
| AnswerRelevancy | 0.973854 | 0.98183 | 0.007977 |
| ContextRecall | 1.0 | 1.0 | 0.0 |
| ContextPrecision | NaN | 0.751185 | NaN |
| AnswerCorrectness | 0.632057 | 0.526832 | -0.105225 |
