# AI Engineering Bootcamp Cohort 4 Midterm

For details of this project, please see the project README.

This notebook uses state classes and common utility functions that are shared with the Chainlit application and other notebooks.

The state classes are:
- AppState - contains information about the application and the documents we are looking at
- ModelRunState - contains information for each individual run for a model, chunking parameters, and results 
- RagasState - contains information for Ragas evaluation including the questions and context

The state is initialized once and passed between the functions. Each combination of model, chunking strategy, and parameters runs in its own state object and saves its results there, so runs can be compared easily.
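
The classes themselves live in the project's `classes` package and are not shown in this notebook. As a rough sketch of the pattern only, assuming simplified field names that may differ from the real implementation, a per-run state object could look like this (`SketchModelRunState` is a hypothetical stand-in, not the project's class):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchModelRunState:
    """Hypothetical stand-in for ModelRunState; field names are assumptions."""
    name: str = ""
    chunk_size: int = 1000
    chunk_overlap: int = 100
    qa_model_name: str = ""
    embedding_model_name: str = ""
    ragas_results: Optional[dict] = None  # filled in after evaluation

    def parameters(self) -> str:
        # Summarize the settings that define this run.
        return (
            f"Base model: {self.qa_model_name}\n"
            f"Embedding model: {self.embedding_model_name}\n"
            f"Chunk size: {self.chunk_size}\n"
            f"Chunk overlap: {self.chunk_overlap}"
        )


# Each run keeps its own settings and results, so runs can be compared later.
run_a = SketchModelRunState(
    name="TE3/1000/100",
    qa_model_name="gpt-4o-mini",
    embedding_model_name="text-embedding-3-small",
)
run_b = SketchModelRunState(name="TE3/500/50", chunk_size=500, chunk_overlap=50)
print(run_a.parameters())
```

Keeping one object per run means no run overwrites another's settings or scores.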

The utilities include:
- constants.py - constants used throughout the package (mainly for the chunking strategies)
- debugger.py - supports printing messages for when debug=True
- doc_utilities.py - supports reading the documents
- rag_utilities.py - supports the creation of the RAG chain 
- templates.py - provides templates for RAG chains
- vector_utilities.py - provides functions for chunking documents and setting up the vector store. Includes 4 chunking strategies:
    - Recursive splitter with chunk size and overlap
    - Table aware - tries to handle PDFs with tables
    - Section based - chunks based on section headers
    - Semantic splitter - semantic-based chunking
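
To make the chunk size and overlap parameters concrete, here is a simplified sketch that is not the project's code. It cuts fixed-size character windows so that neighbouring chunks share `chunk_overlap` characters. The recursive splitter used in the project additionally prefers to break on separators such as paragraphs and sentences, which this sketch ignores. `split_with_overlap` is a hypothetical helper name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

A larger overlap reduces the chance that a sentence is cut off from its context at a chunk boundary, at the cost of storing more chunks.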



Allow async code to run inside the notebook's existing event loop

In [1]:
import nest_asyncio

nest_asyncio.apply()

Suppress the tokenizer parallelism warnings from Hugging Face's tokenizers library

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#### Install our key components for RAG etc

In [3]:
!pip install -q langchain
!pip install -q langchain-core==0.2.27 langchain-community==0.2.10
!pip install -q langchain-experimental==0.0.64 langgraph-checkpoint==1.0.6 langgraph==0.2.16 langchain-qdrant==0.1.3
!pip install -q langchain-openai==0.1.9
!pip install -q ragas==0.1.16

#### Install our vector store - Qdrant

In [4]:
!pip install -qU qdrant-client==1.11.2

#### Install supporting utilities

In [5]:
!pip install -qU tiktoken==0.7.0 pymupdf==1.24.10

Environment Variables

- OpenAI API Key - the notebook uses some of the OpenAI models. If the key is already set in the environment (for example, loaded from .env), it is used; otherwise the notebook prompts for it

In [37]:
import os
import getpass

# Use the key from the environment if present; otherwise prompt for it
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    openai_api_key = getpass.getpass("OpenAI API Key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

#### Set up our starting inputs and state and read in the documents

The AppState contains the docs to be used throughout the process


In [7]:
from classes.app_state import AppState
from utilities.doc_utilities import get_documents
document_urls = [
    "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
    "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf",
]

app_state = AppState()
app_state.set_debug(False)

app_state.set_document_urls(document_urls)

get_documents(app_state)


Total documents: 2


### Create Vector Store with text-embedding-3-small embeddings

Set up our first model run - using the text-embedding-3-small for embeddings initially

This will test that our chunking and vector store functions are all working as expected

In [8]:
from classes.model_run_state import ModelRunState
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from utilities.vector_utilities import create_vector_store

model_1000_100_state = ModelRunState()
model_1000_100_state.name = "TE3/1000/100"
model_1000_100_state.chunk_size = 1000
model_1000_100_state.chunk_overlap = 100

model_1000_100_state.qa_model_name = "gpt-4o-mini"
model_1000_100_state.qa_model = ChatOpenAI(model=model_1000_100_state.qa_model_name)

# the openai embedding model
model_1000_100_state.embedding_model_name = "text-embedding-3-small"
model_1000_100_state.embedding_model = OpenAIEmbeddings(model=model_1000_100_state.embedding_model_name)

create_vector_store(app_state, model_1000_100_state)



Vector store created


Test the retriever with some sample queries

In [9]:
query = "How should you be protected from abusive data practices "
results = model_1000_100_state.retriever.get_relevant_documents(query)

print(len(results))
print(results[0].page_content)
print(results[0].metadata)
print("---")



10
You should be protected from abusive data practices via built-in 
protections and you should have agency over how data about 
you is used. You should be protected from violations of privacy through 
design choices that ensure such protections are included by default, including 
ensuring that data collection conforms to reasonable expectations and that 
only data strictly necessary for the specific context is collected. Designers, de­
velopers, and deployers of automated systems should seek your permission 
and respect your decisions regarding collection, use, access, transfer, and de­
letion of your data in appropriate ways and to the greatest extent possible; 
where not possible, alternative privacy by design safeguards should be used. 
Systems should not employ user experience and design decisions that obfus­
cate user choice or burden users with defaults that are privacy invasive. Con­
sent should only be used to justify collection of data in cases where it can be 
appropriately 

In [10]:
query = "tell me about Karen Hao"
results = model_1000_100_state.retriever.get_relevant_documents(query)

for result in results:
    print(result.page_content)
    print(result.metadata)
    print("---")

ENDNOTES
75. See., e.g., Sam Sabin. Digital surveillance in a post-Roe world. Politico. May 5, 2022. https://
www.politico.com/newsletters/digital-future-daily/2022/05/05/digital-surveillance-in-a-post-roe­
world-00030459; Federal Trade Commission. FTC Sues Kochava for Selling Data that Tracks People at
Reproductive Health Clinics, Places of Worship, and Other Sensitive Locations. Aug. 29, 2022. https://
www.ftc.gov/news-events/news/press-releases/2022/08/ftc-sues-kochava-selling-data-tracks-people­
reproductive-health-clinics-places-worship-other
76. Todd Feathers. This Private Equity Firm Is Amassing Companies That Collect Data on America’s
Children. The Markup. Jan. 11, 2022.
https://themarkup.org/machine-learning/2022/01/11/this-private-equity-firm-is-amassing-companies­
that-collect-data-on-americas-children
77. Reed Albergotti. Every employee who leaves Apple becomes an ‘associate’: In job databases used by
employers to verify resume information, every former Apple employee’s tit

OK - that all looks good - let's continue

### Create a RAG Chain for the different models

The function takes a model_run_state, which supplies:
- the qa model
- the retriever 

It creates the RAG chain and saves it in the model_run_state for RAGAS evaluation
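
As a plain-Python illustration of the data flow (a toy sketch, not the LangChain chain used in this notebook), the chain retrieves context for the question, passes both to the QA model, and returns the response together with the context that was used. The keyword-overlap retriever and the `answer_fn` stub are assumptions made only for this example.

```python
def toy_retriever(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question (toy scoring)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_rag_chain(question: str, corpus: list[str], answer_fn) -> dict:
    # Step 1: retrieve context for the question.
    context = toy_retriever(question, corpus)
    # Step 2: ask the model, giving it both the question and the context.
    response = answer_fn(question=question, context=context)
    # Step 3: return the context too, so it can be evaluated later.
    return {"response": response, "context": context}


corpus = [
    "the blueprint protects data privacy",
    "nist describes generative ai risks",
    "weather is sunny today",
]


def stub_answer(question: str, context: list[str]) -> str:
    # Stand-in for an LLM call.
    return f"Answered using {len(context)} chunks"


result = toy_rag_chain("what does the blueprint say about data privacy", corpus, stub_answer)
print(result["response"])
print(result["context"][0])
```

Returning the context alongside the response is what lets the later evaluation step score retrieval and answer quality separately.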

In [11]:
from utilities.templates import get_qa_prompt
from langchain_openai import ChatOpenAI
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from utilities.debugger import dprint

def create_rag_chain(app_state, model_run_state):

    chat_prompt = get_qa_prompt()

    simple_chain = chat_prompt | model_run_state.qa_model
    dprint(app_state, simple_chain.invoke({"question": "Can you give me a summary of the 2 documents", "context":""}))

    rag_qa_chain = (
        {"context": itemgetter("question") | model_run_state.retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | {"response": chat_prompt | model_run_state.qa_model, "context": itemgetter("context")}
    )
    response = rag_qa_chain.invoke({"question" : "What is the AI Bill of Rights "})
    dprint(app_state, response)
    dprint(app_state, response["response"].content)
    dprint(app_state, f"Number of found context: {len(response['context'])}")
    model_run_state.rag_qa_chain = rag_qa_chain
    print("RAG Chain Created")

create_rag_chain(app_state, model_1000_100_state)

RAG Chain Created


### SDG - Create the questions for evaluation

Three functions are used to set up the questions:
- batch_chunks - yields the chunks in batches, to try to stay within the OpenAI rate limits
- Create_chunks_for_ragas - splits the documents into chunks using the chunk size and overlap in the RagasState - this will allow us more experimentation later
- create_questions_for_ragas - sets up the generator and creates the questions, with the number and distribution taken from the RagasState

In [None]:
from classes.ragas_state import RagasState
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from math import ceil
import pandas as pd
from ragas.testset.generator import TestsetGenerator

import time

# create document chunks
def Create_chunks_for_ragas(app_state, ragas_state):
    # we have 2 documents so want representative across both
    text_splitter_eval = RecursiveCharacterTextSplitter(
        chunk_size = ragas_state.chunk_size,
        chunk_overlap = ragas_state.chunk_overlap,
        length_function = len
    )
    combined_chunks_document = []
    for document in app_state.documents:
        eval_document = document["loaded_document"]
        document_chunks = text_splitter_eval.split_documents(eval_document)
        print(f"Num chunks: {len(document_chunks)}")
        combined_chunks_document = combined_chunks_document + document_chunks

    print(f"Total chunks: {len(combined_chunks_document)}")
    ragas_state.chunks = combined_chunks_document
    print()

# submit batches
def batch_chunks(chunks, batch_size):
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]

# create the questions
def create_questions_for_ragas(app_state, ragas_state):
    generator_llm = ChatOpenAI(model=ragas_state.generator_llm)
    critic_llm = ChatOpenAI(model=ragas_state.critic_llm)
    embeddings = OpenAIEmbeddings()

    generator = TestsetGenerator.from_langchain(
        generator_llm,
        critic_llm,
        embeddings
    )

    batch_size = 200  # Number of chunks to process per batch
    delay_between_batches = 1  # 1 second delay between batches

    all_test_data = []

    for batch in batch_chunks(ragas_state.chunks, batch_size):
        print(f"Processing batch of {len(batch)} chunks...")

        # Generate testset for the current batch
        testset = generator.generate_with_langchain_docs(
            batch,  # Process this batch of chunks
            ragas_state.num_questions, 
            ragas_state.distributions
        )

        # Convert the testset to pandas and store the result
        testset_df = testset.to_pandas()
        all_test_data.append(testset_df)

        # Wait 1 second before the next batch
        time.sleep(delay_between_batches)



    combined_testset_df = pd.concat(all_test_data, ignore_index=True)
    ragas_state.testset_df = combined_testset_df

    print("Ragas questions created for all batches.")


Load the ragas_state if it exists offline 

Otherwise we will need to run the question creation again

In [29]:
import os
import pickle
from ragas.testset.evolutions import simple, reasoning, multi_context

# File path where ragas_state is stored
file_path = 'ragas_state.pkl'

# Check if the pickled file exists, and load it if it does
def load_ragas_state_if_exists(file_path):
    if os.path.exists(file_path):
        try:
            with open(file_path, 'rb') as f:
                ragas_state = pickle.load(f)
            print(f"Ragas state loaded from {file_path}")
            return ragas_state
        except Exception as e:
            print(f"Error loading ragas state: {e}")
            return None
    else:
        print(f"No existing ragas state found at {file_path}")
        return None

# Use this to load ragas_state from pickle
# ragas_state = load_ragas_state_if_exists(file_path)
# ragas_state.distributions = {
#             simple: 0.5,
#             multi_context: 0.4,
#             reasoning: 0.1
#         }

# use this to rebuild pickle state
# ragas_state = RagasState()
# ragas_state.generator_llm = "gpt-4o"
# Create_chunks_for_ragas(app_state, ragas_state)
# create_questions_for_ragas(app_state, ragas_state)

Check how many were created

In [14]:
print(f"Number of entries in testset_df: {ragas_state.testset_df.shape[0]}")

Number of entries in testset_df: 20


Let's save the ragas state - then comment out the code above so we don't need to run it again

In [32]:
import pickle

# Save the ragas_state without distributions
def save_ragas_state(ragas_state, file_path):
    # Temporarily remove `distributions` before saving
    distributions_backup = ragas_state.distributions
    ragas_state.distributions = None

    # Save the rest of the object
    try:
        with open(file_path, 'wb') as f:
            pickle.dump(ragas_state, f)
        print(f"Ragas state saved to {file_path}")
    except Exception as e:
        print(f"Error saving ragas state: {e}")
    
    # Restore `distributions` after saving
    ragas_state.distributions = distributions_backup

# Uncomment and run if need to save new ragas_state
# save_ragas_state(ragas_state, 'ragas_state.pkl')

# let's check that unpickling works
# test_ragas_state = load_ragas_state_if_exists(file_path)
# print(len(test_ragas_state.testset_df))

Ragas state loaded from ragas_state.pkl
20


### Generate answers based on the pipeline we have created

This function takes in the model_run_state and the ragas_state and uses the retriever from the model_run_state to answer the questions from the ragas_state. 

The response dataset is stored in the model_run_state for later evaluation

In [33]:
from datasets import Dataset
def create_answers(app_state, model_run_state, ragas_state):
  answers = []
  contexts = []

  test_questions = ragas_state.testset_df["question"].values.tolist()
  test_groundtruths = ragas_state.testset_df["ground_truth"].values.tolist()

  for question in test_questions:
    response = model_run_state.rag_qa_chain.invoke({"question" : question})
    answers.append(response["response"].content)
    contexts.append([context.page_content for context in response["context"]])

  # Wrap it in a huggingface dataset
  model_run_state.response_dataset = Dataset.from_dict({
      "question" : test_questions,
      "answer" : answers,
      "contexts" : contexts,
      "ground_truth" : test_groundtruths
  })
  model_run_state.response_dataset[0]
  print("Answers created - ready for Ragas evaluation")

create_answers(app_state, model_1000_100_state, ragas_state)

Answers created - ready for Ragas evaluation


### Evaluation

The run_ragas_evaluation function takes the response dataset that the previous step stored in the model_run_state and computes the requested Ragas metrics.

The results of the evaluation are then stored back in the model_run_state.

In [34]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)
def run_ragas_evaluation(app_state, model_run_state):
    metrics = [
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_correctness,
    ]
    model_run_state.ragas_results = evaluate(model_run_state.response_dataset, metrics)
    print("Ragas evaluation complete")
run_ragas_evaluation(app_state, model_1000_100_state)

Evaluating:  58%|█████▊    | 58/100 [02:14<03:09,  4.50s/it]Exception raised in Job[27]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Used 195927, Requested 11789. Please try again in 2.314s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  69%|██████▉   | 69/100 [02:36<01:18,  2.52s/it]Exception raised in Job[40]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Used 199471, Requested 11292. Please try again in 3.228s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  79%|███████▉  | 79/100 [03:02<00:50,  2.41s/it]Exception raised in Jo

Ragas evaluation complete


### Analysis of the Model

We can report on the model and its parameters, such as the chunk size and overlap, and on the Ragas metrics, both as a summary and per question.

In [35]:
model_1000_100_state.parameters()
#model_1000_100_state.results_summary()
model_1000_100_state.results()

print(model_1000_100_state.ragas_results)
results_df = model_1000_100_state.ragas_results.to_pandas()
results_df

Base model: gpt-4o-mini
Embedding model: text-embedding-3-small
Chunk size: 1000
Chunk overlap: 100
{'faithfulness': 0.8230, 'answer_relevancy': 0.7518, 'context_recall': 0.9474, 'context_precision': 0.7225, 'answer_correctness': 0.4931}


| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How might an insurer use social media data col... | It appears that insurers might use social medi... | [DATA PRIVACY \nEXTRA PROTECTIONS FOR DATA REL... | An insurer might collect data from a person's ... | 0.875 | 0.0 | 1.0 | 0.5 | 0.578716 |
| 1 | What are the potential impacts of non-consensu... | The potential impacts of non-consensual intima... | [11 \nvalue chain (e.g., data inputs, processi... | The experience of harm to victims of non-conse... | 0.933333 | 0.973334 | 1.0 | 0.825397 | 0.729688 |
| 2 | What are the 3 AI biases tied to dataset, test... | The three AI biases tied to dataset, testing, ... | [57 \nNational Institute of Standards and Tech... | The three categories of bias in AI are systemi... | 1.0 | 0.997365 | 1.0 | 0.5 | 0.22727 |
| 3 | How can creators of automated systems ensure l... | Creators of automated systems can ensure legal... | [WHAT SHOULD BE EXPECTED OF AUTOMATED SYSTEMS\... | Designers, developers, and deployers of automa... | 1.0 | 0.98165 | 1.0 | 0.962654 | 0.387657 |
| 4 | How does the AI Bill of Rights guide responsib... | The AI Bill of Rights guides the responsible u... | [- \nUSING THIS TECHNICAL COMPANION\nThe Bl... | The AI Bill of Rights guides the responsible u... | 1.0 | 0.954697 | 1.0 | 1.0 | 0.462809 |
| 5 | Why is data collection emphasized as a necessa... | Data collection is emphasized as a necessary i... | [You should be protected from abusive data pra... | The answer to given question is not present in... | | | | 0.0 | 0.181015 |
| 6 | What are some examples of protected classifica... | Some examples of protected classifications tha... | [WHY THIS PRINCIPLE IS IMPORTANT\nThis section... | Some examples of protected classifications tha... | 1.0 | 0.992633 | 1.0 | 0.916667 | 0.382027 |
| 7 | How do hybrid AI-human platforms balance effic... | The balance between efficiency and fair human ... | [HUMAN ALTERNATIVES, \nCONSIDERATION, AND \nFA... | Hybrid AI-human platforms balance efficiency a... | 1.0 | 0.964648 | 1.0 | 0.928263 | 0.847716 |
| 8 | What risks come from unregulated consumer data... | The risks from unregulated consumer data colle... | [DATA PRIVACY \nWHY THIS PRINCIPLE IS IMPORTAN... | The risks from unregulated consumer data colle... | | 0.985682 | 1.0 | 0.931796 | 0.754445 |
| 9 | How can legal discovery balance oversight and ... | The documents provided do not specifically add... | [NOTICE & \nEXPLANATION \nHOW THESE PRINCIPLES... | Legal discovery can balance oversight and prot... | 0.533333 | 0.0 | 0.0 | 0.653704 | 0.355803 |
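
Rows 5 and 8 above have empty metric cells. These plausibly come from the rate-limited jobs during evaluation, although the notebook does not confirm this. The sketch below shows one way a summary score can be averaged while skipping such missing values. Whether Ragas aggregates in exactly this way is an assumption, and because only the ten rows shown above are used, the result is not expected to reproduce the summary score printed earlier.

```python
import math


def nan_safe_mean(values):
    """Average the values, ignoring None and NaN entries."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# faithfulness for the ten rows shown above; rows 5 and 8 are missing
faithfulness = [
    0.875, 0.933333, 1.0, 1.0, 1.0,
    math.nan, 1.0, 1.0, math.nan, 0.533333,
]

print(f"Scored rows: {sum(not math.isnan(v) for v in faithfulness)}")
print(f"Mean over scored rows: {nan_safe_mean(faithfulness):.4f}")
```

When several rows are missing, the summary reflects fewer questions than were asked, which is worth keeping in mind when comparing runs.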


### Base Snowflake Model Evaluation

Process:
- set up the model_run_state for the base Snowflake model
- create the vector store using the base model
- create the RAG chain with the retriever for the vector store
- generate the answers to the Ragas questions
- evaluate the model's performance using Ragas
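
Since the same steps are repeated for every model run, they can be sketched as one helper. This is a hypothetical convenience function, not part of the project. The step functions are passed in as arguments so that the example runs on its own with stubs; in the notebook they would be `create_vector_store`, `create_rag_chain`, `create_answers`, and `run_ragas_evaluation`.

```python
from types import SimpleNamespace


def evaluate_run(app_state, model_run_state, ragas_state, *,
                 build_store, build_chain, answer, evaluate):
    """Run the four pipeline steps in order for one model run state."""
    build_store(app_state, model_run_state)          # chunk documents, build vector store
    build_chain(app_state, model_run_state)          # create the RAG chain
    answer(app_state, model_run_state, ragas_state)  # answer the Ragas questions
    evaluate(app_state, model_run_state)             # compute the Ragas metrics
    return model_run_state


# Stubs that only record the call order, standing in for the real step functions.
calls = []


def make_stub(step_name):
    def step(*args):
        calls.append(step_name)
    return step


state = evaluate_run(
    SimpleNamespace(), SimpleNamespace(name="demo"), SimpleNamespace(),
    build_store=make_stub("vector_store"),
    build_chain=make_stub("rag_chain"),
    answer=make_stub("answers"),
    evaluate=make_stub("evaluation"),
)
print(calls)
```

Wrapping the steps this way keeps each experiment cell down to the state setup plus a single call.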

In [36]:
snowflake_base_state = ModelRunState()
snowflake_base_state.name = "Snowflake_Base/1000/100"
snowflake_base_state.qa_model_name = "gpt-4o-mini"
snowflake_base_state.qa_model = ChatOpenAI(model=snowflake_base_state.qa_model_name)

# snowflake embedding model
snowflake_base_state.embedding_model_name = "Snowflake/snowflake-arctic-embed-m"
snowflake_base_state.embedding_model = HuggingFaceEmbeddings(model_name=snowflake_base_state.embedding_model_name)

# use same chunk size as before
snowflake_base_state.chunk_size = 1000
snowflake_base_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_base_state)

create_rag_chain(app_state, snowflake_base_state)
create_answers(app_state, snowflake_base_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_base_state)

  warn_deprecated(
Task exception was never retrieved
future: <Task finished name='Task-360' coro=<as_completed.<locals>.sema_coro() done, defined at /home/rchrdgwr/anaconda3/envs/clean-llmops/lib/python3.11/site-packages/ragas/executor.py:32> exception=RateLimitError("Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 30000, Used 29970, Requested 406. Please try again in 752ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}")>
Traceback (most recent call last):
  File "/home/rchrdgwr/anaconda3/envs/clean-llmops/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/home/rchrdgwr/anaconda3/envs/clean-llmops/lib/python3.11/site-packages/ragas/executor.py", line 34, in sema_coro
    return await coro
           ^^^^^^^^^^
  File "/home/rchrd

Vector store created
RAG Chain Created
Answers created - ready for Ragas evaluation


Evaluating:  59%|█████▉    | 59/100 [01:55<01:24,  2.05s/it]Exception raised in Job[46]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Used 193304, Requested 7041. Please try again in 103ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  67%|██████▋   | 67/100 [02:14<01:04,  1.94s/it]Exception raised in Job[85]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Used 195043, Requested 8408. Please try again in 1.035s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  79%|███████▉  | 79/100 [02:47<00:46,  2.22s/it]Exception raised in Job[5

Ragas evaluation complete


### Base Snowflake model results

In [30]:

snowflake_base_state.parameters()
print(snowflake_base_state.ragas_results)
results_df = snowflake_base_state.ragas_results.to_pandas()
results_df


Base model: gpt-4o-mini
Embedding model: Snowflake/snowflake-arctic-embed-m
Chunk size: 1000
Chunk overlap: 100
{'faithfulness': 0.4478, 'answer_relevancy': 0.6039, 'context_recall': 0.3333, 'context_precision': 0.5833, 'answer_correctness': 0.3635}


| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How can new GAI policies, procedures, and proc... | The connection between new Generative AI (GAI)... | [Table of Contents \n1. \nIntroduction .......... | New GAI policies, procedures, and processes ca... | 0.454545 | 0.892835 | 0.0 | 0.833333 | 0.506942 |
| 1 | How does confirmation bias contribute to poten... | I don't have enough information, sorry. Howeve... | [BLUEPRINT FOR AN \nAI BILL OF \nRIGHTS \nMAKI... | Confirmation bias contributes to potentially i... | 0.0 | 0.0 | 0.0 | 0.0 | 0.17908 |
| 2 | What resources on AI risk management are avail... | The National Institute of Standards and Techno... | [57 \nNational Institute of Standards and Tech... | The National Institute of Standards and Techno... | 0.888889 | 0.9188 | 1.0 | 0.916667 | 0.404474 |


### Comparison of Base Snowflake with the TE3

In [34]:
import pandas as pd
def compare_results(run_model_1, run_model_2):
    """Build a per-metric comparison table for two model runs."""
    results_1 = run_model_1.ragas_results
    results_2 = run_model_2.ragas_results
    # Use one metric order for every column so values stay aligned by row
    metrics = list(results_1.keys())
    comparison_data = {
        'Metric': metrics,
        run_model_1.name: [results_1[key] for key in metrics],
        run_model_2.name: [results_2[key] for key in metrics],
        # Positive values mean run_model_2 scored higher
        'Difference': [results_2[key] - results_1[key] for key in metrics]
    }
    return pd.DataFrame(comparison_data)

snowflake_base_state.name = "Snowflake_Base/1000/100"
model_1000_100_state.name = "TE3/1000/100"
df = compare_results(snowflake_base_state, model_1000_100_state)
df

| | Metric | Snowflake_Base/1000/100 | TE3/1000/100 | Difference |
|---|---|---|---|---|
| 0 | faithfulness | 0.447811 | 0.635913 | 0.188101 |
| 1 | answer_relevancy | 0.603878 | 0.9478 | 0.343922 |
| 2 | context_recall | 0.333333 | 0.833333 | 0.5 |
| 3 | context_precision | 0.583333 | 1.0 | 0.416667 |
| 4 | answer_correctness | 0.363499 | 0.663308 | 0.299809 |


### Fine Tuned Snowflake Model (1st Model) Evaluation

In [41]:
from langchain.embeddings import HuggingFaceEmbeddings
snowflake_finetune_state = ModelRunState()
snowflake_finetune_state.name = "Snowflake_Fine/1000/100"
snowflake_finetune_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_state.qa_model = ChatOpenAI(model=snowflake_finetune_state.qa_model_name)

# finetune snowflake embedding model

hf_username = "rchrdgwr"
hf_repo_name = "finetuned-arctic-model"

# Load the fine-tuned model from Hugging Face
snowflake_finetune_state.embedding_model_name = f"{hf_username}/{hf_repo_name}"
snowflake_finetune_state.embedding_model = HuggingFaceEmbeddings(model_name=snowflake_finetune_state.embedding_model_name)

# use same chunk size as before
snowflake_finetune_state.chunk_size = 1000
snowflake_finetune_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_state)

create_rag_chain(app_state, snowflake_finetune_state)
create_answers(app_state, snowflake_finetune_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_state)

Some weights of BertModel were not initialized from the model checkpoint at rchrdgwr/finetuned-arctic-model and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Vector store created
Answers created - ready for Ragas evaluation


Evaluating: 100%|██████████| 15/15 [00:19<00:00,  1.32s/it]


Ragas evaluation complete


### Fine Tuned Snowflake Model (1st Model) Results

In [42]:
snowflake_finetune_state.parameters()
print(snowflake_finetune_state.ragas_results)
results_df = snowflake_finetune_state.ragas_results.to_pandas()
results_df

Base model: gpt-4o-mini
Embedding model: rchrdgwr/finetuned-arctic-model
Chunk size: 1000
Chunk overlap: 100
{'faithfulness': 0.9103, 'answer_relevancy': 0.9455, 'context_recall': 0.8889, 'context_precision': 1.0000, 'answer_correctness': 0.4178}


| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How can new GAI policies, procedures, and proc... | New GAI (Generative Artificial Intelligence) p... | [19 \nGV-4.1-003 \nEstablish policies, procedu... | New GAI policies, procedures, and processes ca... | 0.730769 | 0.934013 | 1.0 | 1.0 | 0.527957 |
| 1 | How does confirmation bias contribute to poten... | Confirmation bias can contribute to potentiall... | [Algorithmic \nDiscrimination \nProtections \n... | Confirmation bias contributes to potentially i... | 1.0 | 0.966214 | 0.666667 | 1.0 | 0.33686 |
| 2 | What resources on AI risk management are avail... | The National Institute of Standards and Techno... | [NIST Trustworthy and Responsible AI \nNIST A... | The National Institute of Standards and Techno... | 1.0 | 0.936206 | 1.0 | 1.0 | 0.388704 |


### Comparison of TE3, Base Snowflake, and Fine Tuned Snowflake (1st Model) Results

In [52]:
import pandas as pd
def compare_results_3(run_model_1, run_model_2, run_model_3):
    """Build a per-metric comparison table for three model runs."""
    # Extract results for each model
    results_1 = run_model_1.ragas_results
    results_2 = run_model_2.ragas_results
    results_3 = run_model_3.ragas_results

    # Use one metric order for every column so values stay aligned by row
    metrics = list(results_1.keys())

    # Create comparison data; each difference is (later model - earlier model)
    comparison_data = {
        'Metric': metrics,
        run_model_1.name: [results_1[key] for key in metrics],
        run_model_2.name: [results_2[key] for key in metrics],
        run_model_3.name: [results_3[key] for key in metrics],
        '1v2 Difference': [results_2[key] - results_1[key] for key in metrics],
        '1v3 Difference': [results_3[key] - results_1[key] for key in metrics],
        '2v3 Difference': [results_3[key] - results_2[key] for key in metrics]
    }

    # Return the dataframe
    return pd.DataFrame(comparison_data)

snowflake_base_state.name = "Snowflake_Base/1000/100"
model_1000_100_state.name = "TE3/1000/100"
df = compare_results_3(model_1000_100_state, snowflake_base_state, snowflake_finetune_state)
df

| | Metric | TE3/1000/100 | Snowflake_Base/1000/100 | Snowflake_Fine/1000/100 | 1v2 Difference | 1v3 Difference | 2v3 Difference |
|---|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.635913 | 0.447811 | 0.910256 | -0.188101 | 0.274344 | 0.462445 |
| 1 | answer_relevancy | 0.9478 | 0.603878 | 0.945477 | -0.343922 | -0.002323 | 0.341599 |
| 2 | context_recall | 0.833333 | 0.333333 | 0.888889 | -0.5 | 0.055556 | 0.555556 |
| 3 | context_precision | 1.0 | 0.583333 | 1.0 | -0.416667 | 0.0 | 0.416667 |
| 4 | answer_correctness | 0.663308 | 0.363499 | 0.41784 | -0.299809 | -0.245467 | 0.054342 |
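
The two comparison functions above are written for exactly two and three runs. As more chunking strategies are added, a version that accepts any number of runs may be easier to maintain. The following is a sketch using plain dictionaries; `compare_runs` is a hypothetical helper and not part of the project. The scores are copied from the table above and limited to two metrics for brevity.

```python
def compare_runs(results_by_name: dict, baseline: str) -> list:
    """Return one row per metric, with each run's score and its difference from the baseline."""
    baseline_results = results_by_name[baseline]
    rows = []
    for metric in baseline_results:
        row = {"Metric": metric}
        for name, results in results_by_name.items():
            row[name] = results[metric]
            if name != baseline:
                # Positive means this run scored higher than the baseline
                row[f"{name} - {baseline}"] = results[metric] - baseline_results[metric]
        rows.append(row)
    return rows


results_by_name = {
    "TE3/1000/100": {"faithfulness": 0.635913, "context_recall": 0.833333},
    "Snowflake_Base/1000/100": {"faithfulness": 0.447811, "context_recall": 0.333333},
    "Snowflake_Fine/1000/100": {"faithfulness": 0.910256, "context_recall": 0.888889},
}

rows = compare_runs(results_by_name, baseline="TE3/1000/100")
for row in rows:
    print(row)
```

The list of row dictionaries can be passed directly to `pd.DataFrame(rows)` to display it like the tables above.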


In [62]:
hf_username = "rchrdgwr"
hf_repo_name = "finetuned-arctic-model"

snowflake_finetune_model_name = f"{hf_username}/{hf_repo_name}"
snowflake_finetune_model = HuggingFaceEmbeddings(model_name=snowflake_finetune_model_name)

Some weights of BertModel were not initialized from the model checkpoint at rchrdgwr/finetuned-arctic-model and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Fine Tuned Snowflake Model (1st Model) With Section Chunking Strategy

In [60]:
from utilities.constants import (
    CHUNKING_STRATEGY_TABLE_AWARE,
    CHUNKING_STRATEGY_SECTION_BASED,
    CHUNKING_STRATEGY_SEMANTIC
)

snowflake_finetune_section_state = ModelRunState()
snowflake_finetune_section_state.name = "Snowflake_FineSection/1000/100"
snowflake_finetune_section_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_section_state.qa_model = ChatOpenAI(model=snowflake_finetune_section_state.qa_model_name)

snowflake_finetune_section_state.embedding_model_name = snowflake_finetune_model_name
snowflake_finetune_section_state.embedding_model = snowflake_finetune_model

# use same chunk size as before
snowflake_finetune_section_state.chunking_strategy = CHUNKING_STRATEGY_SECTION_BASED
snowflake_finetune_section_state.chunk_size = 1000
snowflake_finetune_section_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_section_state)

create_rag_chain(app_state, snowflake_finetune_section_state)
create_answers(app_state, snowflake_finetune_section_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_section_state)
print(snowflake_finetune_section_state.ragas_results)

Some weights of BertModel were not initialized from the model checkpoint at rchrdgwr/finetuned-arctic-model and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Vector store created
Answers created - ready for Ragas evaluation


Evaluating: 100%|██████████| 15/15 [00:13<00:00,  1.10it/s]


Ragas evaluation complete
{'faithfulness': 0.9010, 'answer_relevancy': 0.9697, 'context_recall': 0.8889, 'context_precision': 1.0000, 'answer_correctness': 0.3700}


### Fine Tuned Snowflake Model (1st Model) With Table Aware Chunking Strategy

In [63]:
snowflake_finetune_table_state = ModelRunState()
snowflake_finetune_table_state.name = "Snowflake_FineTable/1000/100"
snowflake_finetune_table_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_table_state.qa_model = ChatOpenAI(model=snowflake_finetune_table_state.qa_model_name)

snowflake_finetune_table_state.embedding_model_name = snowflake_finetune_model_name
snowflake_finetune_table_state.embedding_model = snowflake_finetune_model

# use same chunk size as before
snowflake_finetune_table_state.chunking_strategy = CHUNKING_STRATEGY_TABLE_AWARE
snowflake_finetune_table_state.chunk_size = 1000
snowflake_finetune_table_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_table_state)

create_rag_chain(app_state, snowflake_finetune_table_state)
create_answers(app_state, snowflake_finetune_table_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_table_state)
print(snowflake_finetune_table_state.ragas_results)

Vector store created
Answers created - ready for Ragas evaluation


Evaluating: 100%|██████████| 15/15 [00:17<00:00,  1.13s/it]


Ragas evaluation complete
{'faithfulness': 0.6922, 'answer_relevancy': 0.9457, 'context_recall': 0.8889, 'context_precision': 1.0000, 'answer_correctness': 0.4848}


### Fine Tuned Snowflake Model (1st Model) With Semantic Chunking Strategy

In [64]:

snowflake_finetune_semantic_state = ModelRunState()
snowflake_finetune_semantic_state.name = "Snowflake_FineSemantic/1000/100"
snowflake_finetune_semantic_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_semantic_state.qa_model = ChatOpenAI(model=snowflake_finetune_semantic_state.qa_model_name)

snowflake_finetune_semantic_state.embedding_model_name = snowflake_finetune_model_name
snowflake_finetune_semantic_state.embedding_model = snowflake_finetune_model

# use same chunk size as before
snowflake_finetune_semantic_state.chunking_strategy = CHUNKING_STRATEGY_SEMANTIC
snowflake_finetune_semantic_state.chunk_size = 1000
snowflake_finetune_semantic_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_semantic_state)

create_rag_chain(app_state, snowflake_finetune_semantic_state)
create_answers(app_state, snowflake_finetune_semantic_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_semantic_state)
print(snowflake_finetune_semantic_state.ragas_results)

Vector store created
Answers created - ready for Ragas evaluation


Evaluating: 100%|██████████| 15/15 [00:12<00:00,  1.16it/s]


Ragas evaluation complete
{'faithfulness': 0.8889, 'answer_relevancy': 0.9592, 'context_recall': 0.8889, 'context_precision': 1.0000, 'answer_correctness': 0.6295}
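
The three chunking-strategy cells above differ only in the run name and the strategy constant. As a rough sketch of how that shared setup could be factored out, the snippet below builds one configuration per strategy. `RunConfig` and `build_run_configs` are hypothetical names, not part of this project's utilities, and the strategy strings are placeholders for the real values in `utilities.constants`.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the fields set on ModelRunState in the cells above.
@dataclass
class RunConfig:
    name: str
    chunking_strategy: str
    chunk_size: int = 1000
    chunk_overlap: int = 100
    qa_model_name: str = "gpt-4o-mini"


def build_run_configs(prefix, strategies, chunk_size=1000, chunk_overlap=100):
    """Return one RunConfig per (label, strategy) pair, named like the runs above."""
    return [
        RunConfig(
            name=f"{prefix}{label}/{chunk_size}/{chunk_overlap}",
            chunking_strategy=strategy,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )
        for label, strategy in strategies
    ]


# Placeholder strategy values (assumed); the notebook imports the real constants.
configs = build_run_configs(
    "Snowflake_Fine",
    [("Section", "section_based"), ("Table", "table_aware"), ("Semantic", "semantic")],
)
for cfg in configs:
    print(cfg.name, cfg.chunking_strategy)
```

Each configuration could then be copied onto a `ModelRunState` before calling `create_vector_store`, `create_rag_chain`, `create_answers`, and `run_ragas_evaluation` in a loop.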


### Comparison of Fine Tuned Snowflake Model (1st Model) with 3 Different Chunking Strategies

Note: the baseline fine-tuned run (Snowflake_Fine/1000/100) used the simple recursive text splitter with the specified chunk size and overlap.

In [66]:
def compare_results_4(run_model_1, run_model_2, run_model_3, run_model_4):
    """Compare the Ragas metrics of four runs, including pairwise differences."""
    # Extract results for each model
    results_1 = run_model_1.ragas_results
    results_2 = run_model_2.ragas_results
    results_3 = run_model_3.ragas_results
    results_4 = run_model_4.ragas_results

    # Use one metric order for every column so the rows stay aligned
    metrics = list(results_1.keys())

    # Create comparison data ("AvB Difference" is run B minus run A)
    comparison_data = {
        'Metric': metrics,
        run_model_1.name: [results_1[key] for key in metrics],
        run_model_2.name: [results_2[key] for key in metrics],
        run_model_3.name: [results_3[key] for key in metrics],
        run_model_4.name: [results_4[key] for key in metrics],
        '1v2 Difference': [results_2[key] - results_1[key] for key in metrics],
        '1v3 Difference': [results_3[key] - results_1[key] for key in metrics],
        '1v4 Difference': [results_4[key] - results_1[key] for key in metrics],
        '2v3 Difference': [results_3[key] - results_2[key] for key in metrics],
        '2v4 Difference': [results_4[key] - results_2[key] for key in metrics],
        '3v4 Difference': [results_4[key] - results_3[key] for key in metrics]
    }

    # Return the dataframe
    return pd.DataFrame(comparison_data)

df = compare_results_4(snowflake_finetune_state, snowflake_finetune_section_state, snowflake_finetune_table_state, snowflake_finetune_semantic_state)
df

| | Metric | Snowflake_Fine/1000/100 | Snowflake_FineSection/1000/100 | Snowflake_FineTable/1000/100 | Snowflake_FineSemantic/1000/100 | 1v2 Difference | 1v3 Difference | 1v4 Difference | 2v3 Difference | 2v4 Difference | 3v4 Difference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.910256 | 0.900966 | 0.69216 | 0.888889 | -0.00929 | -0.218097 | -0.021368 | -0.208806 | -0.012077 | 0.196729 |
| 1 | answer_relevancy | 0.945477 | 0.969683 | 0.945677 | 0.959232 | 0.024206 | 0.0002 | 0.013755 | -0.024006 | -0.010451 | 0.013555 |
| 2 | context_recall | 0.888889 | 0.888889 | 0.888889 | 0.888889 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | context_precision | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | answer_correctness | 0.41784 | 0.370029 | 0.4848 | 0.629459 | -0.047811 | 0.066959 | 0.211619 | 0.114771 | 0.25943 | 0.14466 |
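
To read the table above at a glance, it can help to list which run scores highest on each metric. The sketch below copies the scores from the table into plain dictionaries and reports every run that ties for the top score. `best_runs_per_metric` is a hypothetical helper, not part of the project's utilities.

```python
# Scores copied from the comparison table above.
results = {
    "Snowflake_Fine/1000/100": {
        "faithfulness": 0.910256, "answer_relevancy": 0.945477,
        "context_recall": 0.888889, "context_precision": 1.0,
        "answer_correctness": 0.41784,
    },
    "Snowflake_FineSection/1000/100": {
        "faithfulness": 0.900966, "answer_relevancy": 0.969683,
        "context_recall": 0.888889, "context_precision": 1.0,
        "answer_correctness": 0.370029,
    },
    "Snowflake_FineTable/1000/100": {
        "faithfulness": 0.69216, "answer_relevancy": 0.945677,
        "context_recall": 0.888889, "context_precision": 1.0,
        "answer_correctness": 0.4848,
    },
    "Snowflake_FineSemantic/1000/100": {
        "faithfulness": 0.888889, "answer_relevancy": 0.959232,
        "context_recall": 0.888889, "context_precision": 1.0,
        "answer_correctness": 0.629459,
    },
}


def best_runs_per_metric(results, tol=1e-9):
    """Map each metric to the list of run names that share the top score."""
    metrics = list(next(iter(results.values())).keys())
    best = {}
    for metric in metrics:
        top = max(scores[metric] for scores in results.values())
        best[metric] = [
            name for name, scores in results.items()
            if abs(scores[metric] - top) <= tol
        ]
    return best


best = best_runs_per_metric(results)
for metric, names in best.items():
    print(f"{metric}: {', '.join(names)}")
```

With only 15 evaluations per run, small gaps between runs are best treated as indicative rather than conclusive.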


### Fine Tuned Snowflake Model (2nd Model Evaluation)

In [None]:
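# HuggingFaceEmbeddings lives in langchain_community; the langchain.embeddings import path is deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings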
snowflake_finetune_2_state = ModelRunState()
snowflake_finetune_2_state.name = "Snowflake_Fine_2/1000/100"
snowflake_finetune_2_state.qa_model_name = "gpt-4o-mini"
snowflake_finetune_2_state.qa_model = ChatOpenAI(model=snowflake_finetune_2_state.qa_model_name)

# finetune snowflake embedding model

hf_username = "rchrdgwr"
hf_repo_name = "finetuned-arctic-model-2"

# Load the fine-tuned model from Hugging Face
snowflake_finetune_2_state.embedding_model_name = f"{hf_username}/{hf_repo_name}"
snowflake_finetune_2_state.embedding_model = HuggingFaceEmbeddings(model_name=snowflake_finetune_2_state.embedding_model_name)

# use same chunk size as before
snowflake_finetune_2_state.chunk_size = 1000
snowflake_finetune_2_state.chunk_overlap = 100
create_vector_store(app_state, snowflake_finetune_2_state)

create_rag_chain(app_state, snowflake_finetune_2_state)
create_answers(app_state, snowflake_finetune_2_state, ragas_state)
run_ragas_evaluation(app_state, snowflake_finetune_2_state)

### Comparison of the Two Fine Tuned Snowflake Models

In [None]:
df = compare_results(snowflake_finetune_state, snowflake_finetune_2_state)
df
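
The comparison helpers in this notebook (`compare_results`, `compare_results_4`) each handle a fixed number of runs. One possible generalization, sketched below, accepts any number of runs and builds the pairwise difference columns with `itertools.combinations`. `compare_runs` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for `ModelRunState` that carry only `name` and `ragas_results`.

```python
from itertools import combinations
from types import SimpleNamespace


def compare_runs(*runs):
    """Build comparison columns for any number of runs.

    Each run needs a `name` and a `ragas_results` mapping of metric -> score.
    """
    metrics = list(runs[0].ragas_results.keys())
    data = {"Metric": metrics}
    for run in runs:
        data[run.name] = [run.ragas_results[m] for m in metrics]
    # Pairwise differences (later run minus earlier run), numbered from 1
    for (i, a), (j, b) in combinations(enumerate(runs, start=1), 2):
        data[f"{i}v{j} Difference"] = [
            b.ragas_results[m] - a.ragas_results[m] for m in metrics
        ]
    return data  # pd.DataFrame(data) would give the same tabular layout


# Stand-in runs using two scores reported earlier in this notebook
run_a = SimpleNamespace(
    name="Snowflake_Fine/1000/100",
    ragas_results={"faithfulness": 0.910256, "answer_correctness": 0.41784},
)
run_b = SimpleNamespace(
    name="Snowflake_FineSemantic/1000/100",
    ragas_results={"faithfulness": 0.888889, "answer_correctness": 0.629459},
)
table = compare_runs(run_a, run_b)
print(list(table.keys()))
```

Returning a plain dictionary keeps the helper free of a pandas dependency; wrapping the result in `pd.DataFrame` is left to the caller.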