# AI Engineering Cohort#4 Midterm Assignment

> ### This notebook has all the code EXCEPT Model Finetuning (which is in a separate notebook)
>
> Here is the link to the notebook that has all the code used for finetuning.  
>
> [Click Here](vc_completed_aie4_midterm_finetuning_embeddings_pipeline.ipynb) to access the finetuning notebook.
>
> (NOTE - this notebook and the finetuning notebook leverage a number of utilities that are in [this](myutils) folder.)

## Install Packages

### NOTE - May need to pin langchain_core version

In [1]:
# NOTE!!!
# May need to pin version: langchain_core==0.2.38
# !pip install -U -q langchain langchain-openai langchain_core==0.2.38 langchain-community langchainhub langchain-qdrant langchain_huggingface   langchain-text-splitters

In [2]:
# !pip install -qU openai ragas qdrant-client pymupdf pandas

In [3]:
# !pip install -qU faiss-cpu unstructured==0.15.7 python-pptx==1.0.2 nltk==3.9.1

#### Note - pin the version of pyarrow

In [4]:
# !pip uninstall -y pyarrow
# !pip install -qU sentence_transformers datasets pyarrow==14.0.1

## Imports and API Keys

In [5]:
import os
import openai
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key here: ")

In [6]:
from operator import itemgetter
import pandas as pd
from typing import List
import json

from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader

from ragas.metrics import faithfulness, answer_relevancy, answer_correctness, context_recall, context_precision
from ragas.testset.evolutions import simple, reasoning, multi_context

from myutils.rag_pipeline_utils import SimpleTextSplitter, SemanticTextSplitter, VectorStore, AdvancedRetriever
from myutils.ragas_pipeline import RagasPipeline
from myutils.finetuning import FineTuneModelAndEvaluateRetriever
from myutils.rag_pipeline_utils import load_all_pdfs, get_vibe_check_on_list_of_questions

from langchain_openai.embeddings import OpenAIEmbeddings

from sentence_transformers import SentenceTransformer
from datasets import Dataset



In [7]:
import nest_asyncio

nest_asyncio.apply()

## STEP 1 - Load the Documents

#### Make a local copy of the two pdfs needed for this exercise

In [None]:
# !wget https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf -O ./data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf

In [None]:

# !wget https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf -O ./data/docs_for_rag/NIST.AI.600-1.pdf

#### Load pdfs into Langchain Documents

In [8]:
pdf_file_paths = [
    './data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf',
    './data/docs_for_rag/NIST.AI.600-1.pdf'
]

In [9]:
documents = load_all_pdfs(pdf_file_paths)

loaded ./data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf with 73 pages 
loaded ./data/docs_for_rag/NIST.AI.600-1.pdf with 64 pages 
loaded all files: total number of pages: 137 


#### Quick Overview of Documents

a.  2022: Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People
    
This is really two documents in one:
- The first part sets out five principles and associated practices.
- The second part, labeled a technical companion, expands on each principle and on how to operationalize it. Each principle is reiterated, followed by an explanation of why the principle is important, what should be expected of automated systems with respect to it, and examples of how it can move into practice.


b.  2024: National Institute of Standards and Technology (NIST) Artificial Intelligence Risk Management Framework

- The first part describes the risks as well as the Trustworthy AI characteristics that mitigate them.
- The second part, in tabular form, describes a mitigation plan for each risk; each risk is identified in the table by a serial number based on the first part of the document rather than by its name.

#### Chunking Strategy

It is clear that chunking strategies should account for the semantics in the document, as well as the fact that there are strong connections between the first and second parts of the document.  This comment applies to both documents in this assignment.

I will examine two alternatives:

(a) BASELINE: use the Swiss-army-knife chunking approach: RecursiveCharacterTextSplitter

(b) ADVANCED: Semantic Chunking



WHY I CHOSE THESE TWO CHUNKING STRATEGIES
1. RecursiveCharacterTextSplitter: if `chunk_size` and `chunk_overlap` are set to reasonable values, this approach is surprisingly effective across a range of document content.  It is cost-effective, relatively easy to tune if needed, and well-suited to answering SIMPLE queries as well as those that require MULTI-CONTEXT.


2. Semantic chunking has great appeal as it groups content that is contiguous and semantically similar into a single chunk.  As a result, the chunk sizes may be rather uneven.  Advantage: it avoids artificially splitting very similar content across multiple chunks, which would make the retriever work harder during retrieval and/or miss relevant context.  The downside is that it is less cost-effective, since it requires embedding-model calls during the chunking process (this notebook uses OpenAI's `text-embedding-3-large` for that).  It is likely to perform well for MULTI-CONTEXT queries and potentially for queries that require REASONING.
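
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not the LangChain implementation: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, which this sketch ignores. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of storing some text twice.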

## Formulate and Load My Test Questions

In [10]:
def load_test_questions(filename):
    """
    Loads a text file with questions

    Input
        name of file which contains a set of questions to test the RAG pipeline
    
    Output
        List of questions
    """
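    with open(filename, encoding="utf-8") as f:
        # skip blank lines (e.g. a trailing newline) so that no empty question reaches the pipeline
        all_q_list = [line.strip() for line in f.read().splitlines() if line.strip()]
    return all_q_list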

In [11]:
my_test_questions = load_test_questions(filename='./data/rag_questions_and_answers/my_test_questions.txt')
my_test_questions

['What process was followed to generate the AI Bill of Rights?',
 'What is the AI Bill of Rights?',
 'What are the set of five principles in the AI bill of Rights?',
 'Who led the formulation of the AI Bill of Rights?',
 'What rights do I have to ensure protection against algorithmic discrimination?',
 'What rights do I have to ensure that my data stays private?',
 'What rights do I have to ensure safe and effective systems?',
 'What rights do I have to ensure that I am given adequate explanation and notice re the use of AI systems?',
 'What rights do I have to ensure recourse to alternatives and remedy problems that I encounter?',
 'How can organizations put data privacy into practice?',
 'How can organizations put into practice protection against algorithmic discrimination',
 'How can foreign actors spread misinformation through the use of AI?',
 'How can US entities counter the use of AI to spread misinformation during the elections?',
 'According to NIST, what are the major risks o

## STEP 2 - Quick End-to-end Prototype RAG

#### Set Up RAG Template and RAG Prompt
> NOTE that the RAG template and RAG Prompt below will be used throughout this exercise

In [12]:
rag_template = """
Use the provided context to answer the following question.
If you can't answer the question based on the context, say you don't know.

Question:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(template=rag_template)
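
To see roughly what text the chat model receives, the template can be filled with plain `str.format`. This is only an approximation of what `ChatPromptTemplate` does (it also wraps the text in a chat message), and the retrieved chunks below are hypothetical placeholders rather than real retriever output.

```python
rag_template = """
Use the provided context to answer the following question.
If you can't answer the question based on the context, say you don't know.

Question:
{question}

Context:
{context}
"""

# Hypothetical retrieved chunks; in the notebook these come from the retriever.
retrieved_chunks = ["Chunk one text.", "Chunk two text."]

filled_prompt = rag_template.format(
    question="What is the AI Bill of Rights?",
    context="\n\n".join(retrieved_chunks),
)
print(filled_prompt)
```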

#### Set Up OpenAI Embeddings and Chat Model For Use in Prototype and for Comparison Throughout This Exercise

In [14]:
openai_embeddings_small = OpenAIEmbeddings(model='text-embedding-3-small')
openai_embeddings_small_dimension = 1536
openai_embeddings_small_context_window = 8191

openai_chat_gpt4omini = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [13]:
# Use the large embeddings in Semantic Chunking below!!!
openai_embeddings_large = OpenAIEmbeddings(model='text-embedding-3-large')
openai_embeddings_large_dimension = 3072
openai_embeddings_large_context_window = 8191

# Set up the more performant chat model just in case I decide to use it later...
openai_chat_gpt4o = ChatOpenAI(model_name="gpt-4o", temperature=0)

## Load Snowflake-arctic-embed-m Model 
#### (Will be Finetuned Later in The Exercise)

> ### Why I Chose This Model
>
> On the AIE4 midterm, we are asked to state why we chose the particular embedding model that we did for finetuning.  These are the criteria I used:
>
> 1.  PARSIMONY: This model has approx 110 million parameters, so we can feasibly finetune the model with consumer-grade access to GPU and memory resources.  It can be done very quickly in a Colab notebook, for instance, with access to their GPU.  I chose to use the A100 to speed up the process, but the training would work just as well with other GPUs like T4 etc.
>
> 2.  PERFORMANCE: Despite having far fewer parameters, the model holds its own in terms of performance on benchmark tasks.
>
> 3.  CONVENIENT ACCESS: This model is conveniently available via Huggingface, so I could leverage the model hub as well as all the libraries that support access to this type of model (SentenceTransformer) as well as all the training/finetuning capabilities.
>
> 4.  NO-BRAINER REASON: It is an open-source model so we have access to all parameters and configurations needed for finetuning.

In [15]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-m"
model = SentenceTransformer(model_id)

In [16]:
arctic_original_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-m")
arctic_original_embeddings_dimension = 768
arctic_original_context_window_in_tokens = 512

#### Chunk Documents Using Recursive Character Text Splitting

In [17]:
chunk_size = 1000
chunk_overlap = 300

# instantiate baseline text splitter -
# NOTE!!! The `SimpleTextSplitter` below is my wrapper around Langchain RecursiveCharacterTextSplitter!!!!
# (see module for the code if needed)
baseline_text_splitter = \
    SimpleTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, documents=documents)

# split text for baseline case
baseline_text_splits = baseline_text_splitter.split_text()

In [18]:
len(baseline_text_splits)

557

#### Chunk Documents Using Semantic Chunking - NOTE Using OpenAI Embeddings Large

In [19]:
# instantiate semantic text splitter
#  NOTE!!!! SemanticTextSplitter is my wrapper around Langchain SemanticChunker
#  see my module for code if needed
# NOTE!!! I use openai large embeddings model to get the best possible representation of the semantics of sentences
# and to ensure high-quality semantic chunking
sem_text_splitter = \
    SemanticTextSplitter(llm_embeddings=openai_embeddings_large, threshold_type="interquartile", documents=documents)

# split text for semantic-chunking case
sem_text_splits = sem_text_splitter.split_text()

loaded 137 to be split 
returning docs split into 266 chunks 
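
The idea behind an interquartile threshold can be sketched in a few lines: compute the embedding distance between each pair of neighbouring sentences, then start a new chunk wherever a distance is an outlier relative to the spread of all distances. The sketch below uses toy distances and the rule `mean + 1.5 * IQR`; that rule, the quartile method, and the function names are assumptions for illustration, and the exact rule used inside LangChain's `SemanticChunker` may differ.

```python
from statistics import mean, quantiles


def iqr_breakpoints(distances, scale=1.5):
    """Return indices i where the distance between sentence i and i+1 is an outlier."""
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    threshold = mean(distances) + scale * (q3 - q1)
    return [i for i, d in enumerate(distances) if d > threshold]


def group_sentences(sentences, breakpoints):
    """Cut the sentence list after every breakpoint index."""
    chunks, start = [], 0
    for i in breakpoints:
        chunks.append(sentences[start:i + 1])
        start = i + 1
    chunks.append(sentences[start:])
    return chunks


# Toy distances between 9 consecutive sentences (8 neighbouring pairs).
distances = [0.10, 0.12, 0.11, 0.55, 0.09, 0.13, 0.60, 0.10]
sentences = [f"s{i}" for i in range(len(distances) + 1)]

breaks = iqr_breakpoints(distances)
semantic_chunks = group_sentences(sentences, breaks)
print(breaks)
print(semantic_chunks)
```

Because the cut points depend on the content rather than on a character budget, the resulting chunks have uneven lengths, which is the behaviour described in the chunking strategy above.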


#### Vibe Check on My Test Questions - Read This First!!!

NOTE:  Four RAG Pipelines are run below!!!  These are:

1.  `Demo_Baseline_OpenAI`: This uses baseline chunking (`RecursiveCharacterTextSplitter`) and OpenAI embeddings as a Demo.

2.  `Demo_Semantic_OpenAI`: uses semantic chunking (`SemanticChunker`) and OpenAI embeddings as a Demo.

3.  `Baseline_Arctic_Original`: uses baseline chunking and `Snowflake/snowflake-arctic-embed-m` model embeddings.

4.  `Semantic_Arctic_Original`: uses semantic chunking and `Snowflake/snowflake-arctic-embed-m` model embeddings.

NOTE!!!
Later in this notebook, I will finetune the `Snowflake/snowflake-arctic-embed-m` model embeddings and will then compare the finetuned embeddings from this model against the runs in 3. and 4. above


In [20]:
baseline_openai_retrieval_chain, baseline_openai_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Demo_Baseline_OpenAI",
                                        embeddings=openai_embeddings_small,  # <- openai embeddings
                                        embed_dim=openai_embeddings_small_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=baseline_text_splits, # <- baseline chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The AI Bill of Rights was generated through extensive consultation with the American public. It consists of five principles and associated practices designed to guide the design, use, and deployment of automated systems, ensuring they align with democratic values and protect civil rights, civil liberties, and privacy. The process involved input from experts across various sectors, including the private sector, governments, and international organizations.
What is the AI Bill of Rights?
The AI Bill of Rights is a framework consisting of five principles and associated practices designed to guide the design, use, and deployment of automated systems in order to protect the rights of the American public in the age of artificial intelligence. It aims to ensure that these systems align with democratic values and safeguard civil rights, civil liberties, and privacy. The framework was developed through extensive consultation with the 

In [21]:
sem_openai_retrieval_chain, sem_openai_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Demo_Semantic_OpenAI",
                                        embeddings=openai_embeddings_small, # <- openai embeddings
                                        embed_dim=openai_embeddings_small_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=sem_text_splits, # <- semantic chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The process followed to generate the AI Bill of Rights involved extensive consultation with the American public over the course of a year. The White House Office of Science and Technology Policy led this initiative, seeking input from a diverse range of stakeholders, including impacted communities, industry representatives, technology developers, experts from various fields, and policymakers. This input was gathered through panel discussions, public listening sessions, meetings, a formal request for information, and a publicly accessible email address. The insights and experiences shared during these engagements played a central role in shaping the Blueprint for an AI Bill of Rights, which aims to protect the rights of the American public in the age of artificial intelligence.
What is the AI Bill of Rights?
The AI Bill of Rights is a framework that outlines five principles and associated practices designed to guide the design

In [22]:
baseline_arctic_original_retrieval_chain, baseline_arctic_original_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Baseline_Arctic_Original",
                                        embeddings=arctic_original_embeddings, # <- arctic original embeddings
                                        embed_dim=arctic_original_embeddings_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=baseline_text_splits, # <- baseline chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The context does not provide specific details about the process followed to generate the AI Bill of Rights. Therefore, I don't know.
What is the AI Bill of Rights?
The AI Bill of Rights is a framework designed to assist governments and the private sector in implementing principles that protect civil rights, civil liberties, and privacy in the context of automated systems. It aims to ensure that the transformative potential of AI technologies is harnessed while preventing potential harms. The framework includes recommendations for moving principles into practice and serves as a blueprint for developing additional technical standards and practices tailored to specific sectors and contexts.
What are the set of five principles in the AI bill of Rights?
I don't know.
Who led the formulation of the AI Bill of Rights?
I don't know.
What rights do I have to ensure protection against algorithmic discrimination?
Based on the context pr

In [23]:
sem_arctic_original_retrieval_chain, sem_arctic_original_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Semantic_Arctic_Original",
                                        embeddings=arctic_original_embeddings, # <- arctic original embeddings
                                        embed_dim=arctic_original_embeddings_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=sem_text_splits, # <- semantic chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The process followed to generate the AI Bill of Rights involved extensive consultation with the American public. The White House Office of Science and Technology Policy led a year-long effort to seek and distill input from various stakeholders, including impacted communities, industry representatives, technology developers, experts across different fields, and policymakers. This input was gathered through panel discussions, public listening sessions, meetings, a formal request for information, and contributions to a publicly accessible email address. The discussions highlighted both the transformative potential of AI and the necessity of preventing its harms, which played a central role in shaping the Blueprint for an AI Bill of Rights.
What is the AI Bill of Rights?
The AI Bill of Rights is a framework consisting of five principles and associated practices designed to guide the design, use, and deployment of automated system

### Quick Summary of The Anecdotal Responses to My Questions Above

Note that with OpenAI embeddings, the RAG pipeline does a decent job of responding to the test questions.  This holds for baseline chunking as well as semantic chunking.

The results are weaker for the two cases that use the `snowflake-arctic-embed-m` embeddings out-of-the-box; with baseline chunking in particular, several questions get an "I don't know" response.  One likely factor is that this model's context window is only 512 tokens (compared to 8191 for the OpenAI embeddings).  We should expect this model to do rather poorly in the formal RAGAS evaluation (coming up next).
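
One rough way to probe the context-window concern is to estimate how many chunks are likely to exceed 512 tokens. The sketch below uses a rule of thumb of about 4 characters per token for English text, which is an assumption; a precise check would count tokens with the model's own tokenizer. The function names and the toy chunks are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption


def estimate_tokens(text: str) -> int:
    """Approximate token count via ceiling division of the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def flag_long_chunks(chunks, max_tokens=512):
    """Return indices of chunks whose estimated token count exceeds max_tokens."""
    return [i for i, c in enumerate(chunks) if estimate_tokens(c) > max_tokens]


# Toy chunks of 1000, 2048 and 6000 characters.
toy_chunks = ["x" * 1000, "x" * 2048, "x" * 6000]
too_long = flag_long_chunks(toy_chunks)
print(too_long)
```

Under this heuristic, a 1000-character baseline chunk is roughly 250 tokens and fits comfortably, whereas a long semantic chunk may not.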

#### Save Test Questions and Answers in File

In [24]:
import pandas as pd
from pathlib import Path

def save_df_to_csv(q_a_data, csvfilename):
    qa_df = pd.DataFrame(q_a_data, 
                         columns=['questions', 'answers'])
    
    filepath = Path(csvfilename)
    filepath.parent.mkdir(parents=True, exist_ok=True)
    qa_df.to_csv(filepath, index=False)
    return


save_df_to_csv(baseline_openai_q_and_a, 
               csvfilename='./data/rag_questions_and_answers/baseline_openai_test_q_and_a.csv')

save_df_to_csv(sem_openai_q_and_a, 
               csvfilename='./data/rag_questions_and_answers/sem_openai_test_q_and_a.csv')

save_df_to_csv(baseline_arctic_original_q_and_a, 
               csvfilename='./data/rag_questions_and_answers/baseline_arctic_original_test_q_and_a.csv')

save_df_to_csv(sem_arctic_original_q_and_a, 
               csvfilename='./data/rag_questions_and_answers/sem_arctic_original_test_q_and_a.csv')

## DETOUR - TASK 2

> At this stage, we have built quite a bit of the functionality needed for the Fast Prototype.
>
> There is a separate `app_v1.py` script and other resources around it (such as `Dockerfile`, `requirements.txt`, etc.) that were created at this stage.  *A fast prototype of the app was deployed to Huggingface Spaces.*

### Loom Video to Demo the Fast Prototype of Working App on HF Spaces

1.  Here is a link to the Loom video showing a demo of the prototype:

        https://www.loom.com/share/3396b23b33f445ffb531ddcc8858487e

### The stack I chose and some thoughts on it!!

Here’s my stack:
1. PDF document loader:  `PyMuPDF` to load pdf documents – I’ve found it to be acceptable as a general-purpose PDF loader; it is also conveniently packaged with Langchain tools as one of several PDF loaders.

2. Chunking:  `Langchain`: for general-purpose chunking (recursive character text splitter) as well as semantic chunking (their implementation of an open-source idea).  Langchain's text splitters are extremely convenient and easy to use.

3. Vector Store: `Qdrant`: I only implemented an in-memory vector store for this project, but I chose Qdrant because it can also be scaled up easily for industrial-strength use-cases.

4. Retrieval chain (retriever, prompt and LLM): `Langchain’s LCEL`: to build out the retrieval chain; this is extremely convenient not only for fast prototyping but also scales very easily.

5. Embeddings to vectorize the text
- OpenAI Embeddings: for the fast prototype, I used the OpenAI `text-embedding-3-small` model.  It is a very good general-purpose embedding model.  Its vectors are medium-sized (dimension 1536) and it has a decent context length (8191 tokens), so it can encode fairly long chunks of text well.
- Finetuned `Snowflake/snowflake-arctic-embed-m` embeddings: the base model performs well on benchmarks; it is parsimonious (approx. 110 million parameters), so it can be finetuned easily with consumer-grade resources; and it is conveniently distributed via Huggingface.
- Important to note: it was necessary to finetune the embeddings because the corpus uses vocabulary that is fairly specific to this domain, so my stack uses the finetuned version of the model.

6. OpenAI Chat Model: I used `gpt-4o-mini` as the LLM chat model throughout this project.  It is highly performant, cost-effective and quite fast.

7. Web app: `Chainlit`: a very easy-to-use web-application framework geared toward LLM apps; using Chainlit made it very easy to deploy the app on a hosting service such as Huggingface Spaces.

8. Web hosting: `Hugging Face Spaces`: each Space is backed by a Git repository; when changes are pushed, HF detects them and automatically rebuilds and restarts the app.  For our purposes, this web hosting service was quite adequate.
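
The retrieval step at the heart of this stack can be illustrated without any of the libraries above: embed the query, score every stored chunk by cosine similarity, and keep the top `k`. The sketch below uses hand-made 3-dimensional toy vectors in place of real embeddings and a plain list in place of Qdrant; all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (chunk text, toy embedding) pairs.
toy_store = [
    ("chunk about data privacy", [0.9, 0.1, 0.0]),
    ("chunk about algorithmic discrimination", [0.1, 0.9, 0.0]),
    ("chunk about misinformation", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded user question

results = top_k(query_vec, toy_store, k=2)
print(results)
```

In the real pipeline the retrieved chunks are then joined into the `{context}` slot of the RAG prompt and passed to the chat model.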


## STEP 3 - Synthetically Generate Test Questions Using the RAGAS Pipeline

#### Set Up RAGAS Pipeline Parameters

In [25]:
# LLM models used in RAGAS pipeline
ragas_generator_llm_model = 'gpt-3.5-turbo'
ragas_critic_llm_model = 'gpt-4o-mini'

# embeddings used for RAGAS pipeline
ragas_openai_embeddings_model = 'text-embedding-3-small'

# text splitter params
ragas_chunk_size = 1500
ragas_chunk_overlap = 500

# number of qa pairs needed - reduce if running into rate limit issues
ragas_number_of_qa_pairs = 20

# initialize distributions - desired distribution of question types
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# name of file to persist RAGAS Q&A on disk
ragas_testset_filename = "./data/rag_questions_and_answers/ragas_questions_and_answers.csv"
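
As a quick sanity check on the parameters above, the distribution can be translated into the approximate number of questions of each type. This is a sketch of the arithmetic only: string keys stand in for the RAGAS evolution objects, and the generator is not guaranteed to return exactly these counts.

```python
number_of_qa_pairs = 20
distribution = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}

# The shares should add up to 1 (checked with a tolerance for floating point).
total_share = sum(distribution.values())
if abs(total_share - 1.0) > 1e-9:
    raise ValueError(f"distribution shares sum to {total_share}, expected 1.0")

expected_counts = {
    name: round(share * number_of_qa_pairs) for name, share in distribution.items()
}
print(expected_counts)
```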

In [26]:
# FLAG TO INDICATE IF RAGAS TESTSET SHOULD BE GENERATED IN THIS RUN
# IF it is run, note the cost and time estimate below!!!
generate_ragas_testset_now = False

In [27]:
# set up list of RAGAS metrics used below
ragas_metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
]

#### Instantiate RAGAS Pipeline, Run Pipeline, Generate Test Questions


In [28]:
# NOTE - this cell will incur significant cost due to SDG's use of OpenAI models
# Time taken on my local machine: ~ 15 mins

ragas_pipeline = RagasPipeline(
        generator_llm_model=ragas_generator_llm_model,
        critic_llm_model=ragas_critic_llm_model,
        embedding_model=ragas_openai_embeddings_model,
        number_of_qa_pairs=ragas_number_of_qa_pairs,
        chunk_size=ragas_chunk_size,
        chunk_overlap=ragas_chunk_overlap,
        documents=documents,
        distributions=distributions
)

In [29]:

if generate_ragas_testset_now:
    ragas_testset_df = ragas_pipeline.generate_testset()
    ragas_testset_df.to_csv(ragas_testset_filename)

#### Load RAGAS Q&A from disk

In [30]:
ragas_test_df = pd.read_csv(ragas_testset_filename)
ragas_test_questions = ragas_test_df["question"].values.tolist()
ragas_test_groundtruths = ragas_test_df["ground_truth"].values.tolist()

## Evaluate RAG Pipeline Using RAGAS Generated Synthetic Questions

NOTE!!!

The four cells below evaluate the four RAG pipelines built above:
1.  Baseline chunking plus OpenAI embeddings
2.  Semantic chunking plus OpenAI Embeddings
3.  Baseline chunking plus Snowflake/snowflake-arctic-embed-m embeddings
4.  Semantic chunking plus Snowflake/snowflake-arctic-embed-m embeddings

In [31]:
baseline_openai_results, baseline_openai_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(baseline_openai_retrieval_chain, # <- baseline chunking + openai embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:40<00:00,  1.97it/s]


In [32]:
sem_openai_results, sem_openai_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(sem_openai_retrieval_chain, # <- semantic chunking + openai embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:40<00:00,  1.96it/s]


In [33]:
baseline_arctic_original_results, baseline_arctic_original_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(baseline_arctic_original_retrieval_chain, # <- baseline chunking + arctic orig embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:31<00:00,  2.56it/s]


In [34]:
sem_arctic_original_results, sem_arctic_original_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(sem_arctic_original_retrieval_chain, # <- semantic chunking + arctic orig embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:30<00:00,  2.63it/s]


#### Compare The Results

In [35]:
df_baseline_openai = pd.DataFrame(list(baseline_openai_results.items()), columns=['Metric', 'BaselineChunkOpenAI'])
df_sem_openai = pd.DataFrame(list(sem_openai_results.items()), columns=['Metric', 'SemanticChunkOpenAI'])
df_merged_openai = pd.merge(df_baseline_openai, df_sem_openai, on='Metric')

df_baseline_arctic_original = pd.DataFrame(list(baseline_arctic_original_results.items()), columns=['Metric', 'BaselineChunkArcticOrig'])
df_sem_arctic_original = pd.DataFrame(list(sem_arctic_original_results.items()), columns=['Metric', 'SemanticChunkArcticOrig'])
df_merged_arctic_original = pd.merge(df_baseline_arctic_original, df_sem_arctic_original, on='Metric')

df_all_merged = pd.merge(df_merged_openai, df_merged_arctic_original, on='Metric')

df_all_merged

| Metric | BaselineChunkOpenAI | SemanticChunkOpenAI | BaselineChunkArcticOrig | SemanticChunkArcticOrig |
|---|---|---|---|---|
| faithfulness | 0.905942 | 0.892594 | 0.645067 | 0.315934 |
| answer_relevancy | 0.977323 | 0.975088 | 0.680872 | 0.297117 |
| context_precision | 0.933542 | 0.961042 | 0.584583 | 0.412917 |
| context_recall | 0.891667 | 0.916667 | 0.786667 | 0.204167 |


## Analysis of RAGAS Evaluation of RAG Pipelines Built So Far

The table above shows the results of the four pipelines that I’ve carried this far.  

The two on the left are using OpenAI embeddings (baseline chunking and semantic chunking) and the two on the right are using the original downloaded version of “snowflake-arctic-embed-m” model embeddings.  



Takeaways:
---------
1.  The OpenAI-embedding pipelines outperform the snowflake-arctic-embed-m pipelines on all four metrics; not at all a surprise.

2.  Retrieval-based measures show a slight improvement for OpenAI embeddings when we use semantic chunking rather than recursive character splitting (context precision rises from 0.934 to 0.961 and context recall from 0.892 to 0.917).  This is to be expected, as semantic chunking groups text by semantic content precisely so that retrieval is better.

3.  Generation-based measures such as faithfulness (measuring factual accuracy of the generated answer) and answer relevancy (relevance of the answer to the question) suffer when retrieval performance is poor.  Notice the poor performance of the snowflake-arctic-embed-m pipelines on the generation measures and how their retrieval measures are also lower than those of the OpenAI pipelines.

4.  Semantic chunking adversely affects the performance of the snowflake-arctic-embed-m model (for example, context recall falls from 0.787 to 0.204).  I suspect this is due to the model's small context window of 512 tokens: at least some of the semantic chunks may be longer than that.  The recursive text splitter may be better suited to embedding models with a small context window, as the size of the chunks is relatively easy to control.
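
The effect of switching from baseline to semantic chunking can be read off the results table by computing the per-metric difference for each embedding model. The sketch below re-enters the reported scores by hand; the dictionary layout and the helper name `delta` are my own choices for illustration.

```python
# Scores copied from the comparison table above.
results = {
    "faithfulness":      {"BaselineChunkOpenAI": 0.905942, "SemanticChunkOpenAI": 0.892594,
                          "BaselineChunkArcticOrig": 0.645067, "SemanticChunkArcticOrig": 0.315934},
    "answer_relevancy":  {"BaselineChunkOpenAI": 0.977323, "SemanticChunkOpenAI": 0.975088,
                          "BaselineChunkArcticOrig": 0.680872, "SemanticChunkArcticOrig": 0.297117},
    "context_precision": {"BaselineChunkOpenAI": 0.933542, "SemanticChunkOpenAI": 0.961042,
                          "BaselineChunkArcticOrig": 0.584583, "SemanticChunkArcticOrig": 0.412917},
    "context_recall":    {"BaselineChunkOpenAI": 0.891667, "SemanticChunkOpenAI": 0.916667,
                          "BaselineChunkArcticOrig": 0.786667, "SemanticChunkArcticOrig": 0.204167},
}


def delta(metric: str, new: str, old: str) -> float:
    """Score change when moving from pipeline `old` to pipeline `new`."""
    return results[metric][new] - results[metric][old]


for metric in results:
    openai_change = delta(metric, "SemanticChunkOpenAI", "BaselineChunkOpenAI")
    arctic_change = delta(metric, "SemanticChunkArcticOrig", "BaselineChunkArcticOrig")
    print(f"{metric:18s} OpenAI: {openai_change:+.3f}   Arctic: {arctic_change:+.3f}")
```

Positive values mean semantic chunking helped; negative values mean it hurt.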



## Conclusions about Effectiveness and Performance of RAG Pipelines so far

1.  OpenAI embeddings perform very well out-of-the-box and are a great default choice for many such applications.

2.  If we want to use open-source models like snowflake-arctic-embed-m in specialized RAG pipelines, we will need to finetune the model.

3.  We enter the finetuning process (below) with healthy skepticism as the base model does not perform well and its context window is rather small (512 compared to 8191).  Nonetheless, it is worth a shot.


## STEP 4 - Fine-tuning Embeddings for RAG and Pull Down Finetuned Embeddings

#### *NOTE: As mentioned at start of this notebook, I built a separate pipeline to do the finetuning of the embedding model.  Please refer to that notebook for the full code for finetuning.  Below, I pull down the finetuned model embeddings from my HF repo for use in the remainder of this notebook.*



### [Here](vc_completed_aie4_midterm_finetuning_embeddings_pipeline.ipynb) is a link to the notebook that has the finetuning pipeline end-to-end.

#### And [here](https://huggingface.co/vincha77/finetuned_arctic) is a link to the Huggingface Hub where I have placed the results of my finetuned model called `vincha77/finetuned_arctic` 

### Why I Chose to Finetune the `snowflake-arctic-embed-m` Model

On the AIE4 midterm, we are asked to state why we chose the particular embedding model that we did for finetuning.  These are the criteria I used:

1.  PARSIMONY: This model has approx 110 million parameters, so we can feasibly finetune the model with consumer-grade access to GPU and memory resources.  It can be done very quickly in a Colab notebook, for instance, with access to their GPU.  I chose to use the A100 to speed up the process, but the training would work just as well with other GPUs like T4 etc.

2.  PERFORMANCE: Despite having far fewer parameters, the model holds its own in terms of performance on benchmark tasks.

3.  CONVENIENT ACCESS: This model is conveniently available via Huggingface, so I could leverage the model hub as well as all the libraries that support access to this type of model (SentenceTransformer) as well as all the training/finetuning capabilities.

4.  NO-BRAINER REASON: It is an open-source model so we have access to all parameters and configurations needed for finetuning.

In [36]:
# Pull the finetuned model down from the Hugging Face Hub
from sentence_transformers import SentenceTransformer

model_id = "vincha77/finetuned_arctic"
arctic_finetuned_model = SentenceTransformer(model_id)

In [37]:
arctic_finetuned_embeddings = HuggingFaceEmbeddings(model_name="vincha77/finetuned_arctic")
arctic_finetuned_embeddings_dimension = 768
arctic_finetuned_context_window_in_tokens = 512

#### Load the TEST SET that was created during model finetuning (training and validation also saved here)

In [38]:

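# NOTE: despite the .jsonl extension, json.load succeeds here, so the file holds a single JSON document
with open('./data/finetuning_data/test_dataset.jsonl', "r") as f:
    test_json = json.load(f)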

### Instantiate a model evaluator to compute hit rate using testdata and different embeddings models

In [39]:
# NOTE that the class being instantiated below is used extensively during the finetuning process
# I am only instantiating it to use the method defined there to run the Evaluations on the test dataset
evr = FineTuneModelAndEvaluateRetriever(train_data=None, val_data=None, test_data=test_json, batch_size=None)

In [40]:
te3_results = evr.evaluate_embeddings_model(openai_embeddings_small, top_k_for_retrieval=5)

te3_results_df = pd.DataFrame(te3_results)

te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

100%|██████████| 414/414 [01:42<00:00,  4.04it/s]


0.9347826086956522

In [41]:
arctic_embed_m_results = evr.evaluate_embeddings_model(arctic_original_embeddings, top_k_for_retrieval=5)

arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

100%|██████████| 414/414 [00:10<00:00, 39.61it/s]


0.5217391304347826

In [42]:
finetuned_results = evr.evaluate_embeddings_model(arctic_finetuned_embeddings, top_k_for_retrieval=5)

finetuned_results_df = pd.DataFrame(finetuned_results)

finetuned_hit_rate = finetuned_results_df["is_hit"].mean()
finetuned_hit_rate

100%|██████████| 414/414 [00:11<00:00, 37.04it/s]


0.9734299516908212

### Summary of Hit Rate Metric for the Three Pipelines

1.  OpenAI `text-embedding-3-small` model hit rate:     0.935
2.  Snowflake `snowflake-arctic-embed-m` hit rate:      0.522
3.  Finetuned version `finetuned_arctic` hit rate:      0.973
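
For reference, hit rate at k is the share of queries whose expected document shows up among the top-k retrieved documents. The sketch below is an assumption about what the `is_hit` column records (the actual evaluator lives in the finetuning notebook); the function name `hit_rate_at_k` and the toy data are hypothetical.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected doc id is among the top-k retrieved ids."""
    hits = [expected[query] in retrieved[query][:k] for query in expected]
    return sum(hits) / len(hits)


# Hypothetical ranked retrieval results (best match first) for four queries
toy_retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
    "q4": ["d10", "d11", "d12"],
}
# Hypothetical ground truth: the single document each query was generated from
toy_expected = {"q1": "d1", "q2": "d4", "q3": "d0", "q4": "d10"}

rate_top3 = hit_rate_at_k(toy_retrieved, toy_expected, k=3)
rate_top2 = hit_rate_at_k(toy_retrieved, toy_expected, k=2)
```

On the 414-query test set used above, the reported rates correspond to 387, 216, and 403 hits respectively.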

### Takeaway from these results

1.  Another confirmation that the OpenAI `text-embedding-3-small` model is pretty good out-of-the-box.

2.  Another confirmation that `snowflake-arctic-embed-m` model embeddings are not that great out-of-the-box.

3.  The key takeaway though is that `FINETUNING WORKS`!!!  The `finetuned_arctic` model embeddings outperform OpenAI embeddings on this test corpus, quite an incredible feat!

## Vibe Check on My Test Questions

We're going to use our RAG pipelines to run a vibe check on the set of test questions that I formulated at the outset!!

#### Chunk Documents Using Recursive Character Text Splitting

In [43]:
new_chunk_size = 600
new_chunk_overlap = 200

# instantiate baseline text splitter -
# NOTE!!! The `SimpleTextSplitter` below is my wrapper around Langchain RecursiveCharacterTextSplitter!!!!
# (see module for the code if needed)
new_baseline_text_splitter = \
    SimpleTextSplitter(chunk_size=new_chunk_size, chunk_overlap=new_chunk_overlap, documents=documents)

# split text for baseline case
new_baseline_text_splits = new_baseline_text_splitter.split_text()

In [44]:
len(new_baseline_text_splits)

936

#### Chunk Documents Using Semantic Chunking - NOTE Using OpenAI Embeddings Large

In [45]:
# instantiate semantic text splitter
#  NOTE!!!! SemanticTextSplitter is my wrapper around Langchain SemanticChunker
#  see my module for code if needed
# NOTE!!! I use openai large embeddings model to get the best possible representation of the semantics of sentences
# and to ensure high-quality semantic chunking
new_sem_text_splitter = \
    SemanticTextSplitter(llm_embeddings=openai_embeddings_large, threshold_type="interquartile", documents=documents)

# split text for semantic-chunking case
new_sem_text_splits = new_sem_text_splitter.split_text()

loaded 137 to be split 
returning docs split into 266 chunks 


#### Vibe Check on My Test Questions - Read This First!!!

NOTE:  Four RAG Pipelines are run below!!!  These are:

1.  `Baseline_Arctic_Original`: uses baseline chunking and `Snowflake/snowflake-arctic-embed-m` model embeddings.

2.  `Baseline_Arctic_Finetuned`: uses baseline chunking and `vincha77/finetuned_arctic` model embeddings.

3.  `Semantic_Arctic_Original`: uses semantic chunking and `Snowflake/snowflake-arctic-embed-m` model embeddings.

4.  `Semantic_Arctic_Finetuned`: uses semantic chunking and `vincha77/finetuned_arctic` model embeddings.


In [46]:
new_baseline_arctic_original_retrieval_chain, new_baseline_arctic_original_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Baseline_Arctic_Original",
                                        embeddings=arctic_original_embeddings, # <- arctic original embeddings
                                        embed_dim=arctic_original_embeddings_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=new_baseline_text_splits, # <- NEW baseline chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The context does not provide specific details about the process followed to generate the AI Bill of Rights. Therefore, I don't know.
What is the AI Bill of Rights?
The AI Bill of Rights, as outlined in the "Blueprint for an AI Bill of Rights," is a framework designed to ensure that automated systems work for the American people while upholding civil rights, civil liberties, and privacy. It includes principles and guidelines for the responsible use of automated systems, aiming to assist both governments and the private sector in protecting these values. The document emphasizes the importance of evaluating and addressing the harms of automated systems at both individual and community levels.
What are the set of five principles in the AI bill of Rights?
I don't know.
Who led the formulation of the AI Bill of Rights?
I don't know.
What rights do I have to ensure protection against algorithmic discrimination?
You have rights to en

In [47]:
new_baseline_arctic_finetuned_retrieval_chain, new_baseline_arctic_finetuned_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Baseline_Arctic_Finetuned",
                                        embeddings=arctic_finetuned_embeddings, # <- arctic finetuned embeddings
                                        embed_dim=arctic_finetuned_embeddings_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=new_baseline_text_splits, # <- NEW baseline chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The AI Bill of Rights was generated through extensive consultation with the American public. It consists of five principles and associated practices designed to guide the design, use, and deployment of automated systems, ensuring they align with democratic values and protect civil rights, civil liberties, and privacy. The process involved collaboration among various stakeholders, including industry, civil society, researchers, policymakers, technologists, and the public.
What is the AI Bill of Rights?
The AI Bill of Rights, as outlined in the "Blueprint for an AI Bill of Rights," is a set of five principles and associated practices designed to guide the design, use, and deployment of automated systems. Its purpose is to protect the rights of the American public in the age of artificial intelligence. The framework was developed through extensive consultation with the American public and aims to ensure that automated systems al

In [48]:
new_sem_arctic_original_retrieval_chain, new_sem_arctic_original_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Semantic_Arctic_Original",
                                        embeddings=arctic_original_embeddings, # <- arctic original embeddings
                                        embed_dim=arctic_original_embeddings_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=new_sem_text_splits, # <- NEW semantic chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The process followed to generate the AI Bill of Rights involved extensive consultation with the American public. The White House Office of Science and Technology Policy led a year-long effort to gather input from various stakeholders, including impacted communities, industry representatives, technology developers, experts from different fields, and policymakers. This input was collected through panel discussions, public listening sessions, meetings, a formal request for information, and a publicly accessible email address. The discussions highlighted both the potential benefits and harms of AI technologies, which played a central role in shaping the Blueprint for an AI Bill of Rights.
What is the AI Bill of Rights?
The AI Bill of Rights is a framework consisting of five principles and associated practices designed to guide the design, use, and deployment of automated systems in order to protect the rights of the American publ

In [49]:
new_sem_arctic_finetuned_retrieval_chain, new_sem_arctic_finetuned_q_and_a = \
    get_vibe_check_on_list_of_questions(collection_name="Semantic_Arctic_Finetuned",
                                        embeddings=arctic_finetuned_embeddings, # <- arctic finetuned embeddings
                                        embed_dim=arctic_finetuned_embeddings_dimension,
                                        prompt=rag_prompt,
                                        llm=openai_chat_gpt4omini,
                                        text_splits=new_sem_text_splits, # <- NEW semantic chunking
                                        list_of_questions=my_test_questions)

What process was followed to generate the AI Bill of Rights?
The process followed to generate the AI Bill of Rights involved extensive consultation with the American public. The White House Office of Science and Technology Policy led a year-long effort to gather input from various stakeholders, including impacted communities, industry representatives, technology developers, experts from different fields, and policymakers. This input was collected through panel discussions, public listening sessions, meetings, a formal request for information, and a publicly accessible email address. The feedback received played a central role in shaping the Blueprint for an AI Bill of Rights, which aims to protect the rights of the American public in the age of artificial intelligence.
What is the AI Bill of Rights?
The AI Bill of Rights is a framework consisting of five principles and associated practices designed to guide the design, use, and deployment of automated systems in order to protect the ri

### Quick Summary of The Anecdotal Responses to My Questions Above

Anecdotally, I see fewer `I don't know` responses with the finetuned model.  This is for both chunking strategies, compared with the original model.

I also see that the finetuned model's pipelines are probably the only ones to actually articulate an answer to the question `What are the set of five principles in the AI bill of Rights?`.  Even the OpenAI model embeddings struggle to retrieve the relevant context here, and as a result the most common answer is `I don't know`.  But both finetuned pipelines, with baseline chunking as well as semantic chunking, are able to answer the question well.

Based on these anecdotal results, I would expect that the finetuned model performs much better than the original model in the full RAGAS evaluations below.

#### Evaluate RAG Pipeline Using RAGAS Generated Synthetic Questions

In [50]:
baseline_arctic_original_results, baseline_arctic_original_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(new_baseline_arctic_original_retrieval_chain, # <- baseline chunking + arctic orig embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:27<00:00,  2.86it/s]


In [51]:
baseline_arctic_finetuned_results, baseline_arctic_finetuned_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(new_baseline_arctic_finetuned_retrieval_chain, # <- baseline chunking + arctic finetuned embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:40<00:00,  1.96it/s]


In [52]:
sem_arctic_original_results, sem_arctic_original_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(new_sem_arctic_original_retrieval_chain, # <- semantic chunking + arctic orig embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:22<00:00,  3.57it/s]


In [53]:
sem_arctic_finetuned_results, sem_arctic_finetuned_results_df = \
    ragas_pipeline.ragas_eval_of_rag_pipeline(new_sem_arctic_finetuned_retrieval_chain, # <- semantic chunking + arctic finetuned embeddings
                                              ragas_test_questions, 
                                              ragas_test_groundtruths, 
                                              ragas_metrics)

Evaluating: 100%|██████████| 80/80 [00:39<00:00,  2.03it/s]


#### Compare The Results

In [54]:
df_baseline_arctic_original = pd.DataFrame(list(baseline_arctic_original_results.items()), columns=['Metric', 'BaselineChunkArcticOrig'])
df_baseline_arctic_finetuned = pd.DataFrame(list(baseline_arctic_finetuned_results.items()), columns=['Metric', 'BaselineChunkArcticFinetuned'])
df_merged_arctic_baseline = pd.merge(df_baseline_arctic_original, df_baseline_arctic_finetuned, on='Metric')

df_sem_arctic_original = pd.DataFrame(list(sem_arctic_original_results.items()), columns=['Metric', 'SemanticChunkArcticOrig'])
df_sem_arctic_finetuned = pd.DataFrame(list(sem_arctic_finetuned_results.items()), columns=['Metric', 'SemanticChunkArcticFinetuned'])
df_merged_arctic_sem = pd.merge(df_sem_arctic_original, df_sem_arctic_finetuned, on='Metric')

df_all_merged = pd.merge(df_merged_arctic_baseline, df_merged_arctic_sem, on='Metric')

df_all_merged

| | Metric | BaselineChunkArcticOrig | BaselineChunkArcticFinetuned | SemanticChunkArcticOrig | SemanticChunkArcticFinetuned |
|---|---|---|---|---|---|
| 0 | faithfulness | 0.709273 | 0.892459 | 0.22394 | 0.887186 |
| 1 | answer_relevancy | 0.86991 | 0.968701 | 0.296473 | 0.974746 |
| 2 | context_precision | 0.699861 | 1.0 | 0.40875 | 0.956736 |
| 3 | context_recall | 0.72 | 0.875 | 0.204167 | 0.916667 |


## Takeaways from these results

1.  Finetuning helps across the board.  It obviously starts with retrieval.  Regardless of the chunking strategy used, finetuning improves retrieval-based measures tremendously: context_precision rises from 0.70 to 1.00 with baseline chunking and from 0.41 to 0.96 with semantic chunking, while context_recall rises from 0.72 to 0.88 and from 0.20 to 0.92 respectively.

2.  The improvements in retrieval carry over to the generation realm.  Both measures (faithfulness and answer_relevancy) improve substantially.  The improvements are much more stark with semantic chunking, where the original model performs particularly poorly: faithfulness goes from 0.22 to 0.89 and answer_relevancy from 0.30 to 0.97.

3.  Comparing the two finetuning results above with those of OpenAI embeddings shows that this modest amount of finetuning allows the model to reach roughly the same level of performance across all these measures, an incredible feat indeed.
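
The gains described above can be read straight off the comparison table. The sketch below is a minimal illustration that copies the four RAGAS scores per metric from the table and computes the absolute change from finetuning for each chunking strategy; the dictionary layout and the helper name `finetuning_gain` are my own, not part of the notebook's utilities.

```python
# Scores copied from the comparison table above
ragas_scores = {
    "faithfulness":      {"baseline_orig": 0.709273, "baseline_ft": 0.892459,
                          "semantic_orig": 0.22394,  "semantic_ft": 0.887186},
    "answer_relevancy":  {"baseline_orig": 0.86991,  "baseline_ft": 0.968701,
                          "semantic_orig": 0.296473, "semantic_ft": 0.974746},
    "context_precision": {"baseline_orig": 0.699861, "baseline_ft": 1.0,
                          "semantic_orig": 0.40875,  "semantic_ft": 0.956736},
    "context_recall":    {"baseline_orig": 0.72,     "baseline_ft": 0.875,
                          "semantic_orig": 0.204167, "semantic_ft": 0.916667},
}


def finetuning_gain(scores: dict, chunking: str) -> dict:
    """Absolute change (finetuned minus original) per metric for one chunking strategy."""
    return {
        metric: values[f"{chunking}_ft"] - values[f"{chunking}_orig"]
        for metric, values in scores.items()
    }


baseline_gain = finetuning_gain(ragas_scores, "baseline")
semantic_gain = finetuning_gain(ragas_scores, "semantic")

# Every metric improves with finetuning under both chunking strategies
all_improved = all(g > 0 for g in list(baseline_gain.values()) + list(semantic_gain.values()))
```

Keep in mind that these are point estimates from a single run over a small synthetic question set, so small differences between the two finetuned pipelines should not be over-read.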




## Recommendation for the Demo App

No hesitation in recommending the `finetuned_arctic` model embeddings for the RAG pipeline that is used in the demo!

Which one is best and why?
-------------------------
Overall, the “finetuned_arctic” model embeddings are quite good.  
1.  The RAGAS metrics show that their performance is at about the same level as the OpenAI embeddings.  
2.  Further, some anecdotal results on test questions (documented in my notebook) show that the finetuned model is better able to grasp the nuances of the content of the documents in the collection.

RECOMMENDATION
--------------
I would recommend using the “finetuned_arctic” model embeddings in the final version of the app.  In addition to the points above, given that the purpose of the app is to showcase the advances in AI, it makes sense to use the “partially homegrown” embeddings, as they can be another illustration of the reach of AI and its potential.


STORY TO CEO
------------

##### Preamble

In focus groups as well as in water-cooler conversations, many employees have shared that they’d like to understand how AI is evolving, and how we as a company can arm ourselves with knowledge of AI’s potential as well as its risks.

##### Rapid progress

Most people, even experts in the field, were caught by surprise at the rapid progress made in the field in the past few years.  Partly due to the sheer pace of innovative work, but also due to the statistical machinery deployed in these models, we have to move thoughtfully but also rapidly to understand the potential of AI as well as its drawbacks.

##### What we’ve done

What better way to help us all understand the implications of AI than to "use AI to answer questions about AI"?  We in Technology have worked hard to create a chatbot.  We've loaded it with a few key policy and framework proposals from the US government, which the chatbot searches to answer employees’ questions about the risks of AI, how to measure those risks, and best practices for managing them in an enterprise setting.

##### What we’d like to do

We’d like to roll out the chatbot to at least 50 different internal stakeholders over the next month.  Our own application, just like the recent advances in AI, is somewhat brittle.  Occasionally, the chatbot may respond with "I don't know".  If it does that, try a more specific variation of your question.  Our chatbot is only designed to answer questions about AI’s risks, frameworks for measuring and managing those risks, and ways of reducing the likelihood of adverse societal outcomes from poor management of AI tools and models.  To that end, we’d like the stakeholders that we recruit to help us on this double mission: educate themselves on the risks and educate themselves on how we as a company can adopt ideas from AI into our own business units.  And, in turn, educate us all!


## How to Incorporate New Information Into Our RAG Pipeline

In this application, I used an in-memory instance of the vector database to store content from the two PDFs provided for this assignment.  If we need to add more documents with the exact same pipeline, we will literally need to re-instantiate the entire pipeline by rebuilding the vector database, because (a) the vector DB is in memory and (b) a single monolithic block of code does everything from creating the vector index to running the RAG queries.  Clearly this is not a scalable way of doing things.

How we can augment this approach:

a.	*Implement persistent on-disk vector databases.*   All major vector database providers offer this capability.

b.	*Build separate pipelines to manage the process to ingest documents and other information into vector databases.*   Separate this part of the pipeline from the part that deals with querying the database (e.g., RAG applications).

c.	*Improve the architecture of the retrieval process itself.*  For example, if there are new versions of previously released documents, then we may need our vector DB to keep the older versions for audit and other reasons.  In that case, we can put the metadata to work, e.g., identify documents by their release date or by the date they were added to our vector DB.  We can then use that metadata in the retrieval process, e.g., restrict search to the most recent version of each document so that older versions stay stored but are phased out of search and retrieval.
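
To make point (c) concrete, the sketch below shows one possible way to keep every version of a document in the store while restricting retrieval to the latest one. It is a minimal, library-free illustration under my own assumptions: `ChunkRecord`, `latest_versions_only`, the `doc_id` and `release_date` metadata fields, and the sample records are all hypothetical. In a real vector database, the same idea would be expressed as a metadata filter on the query.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class ChunkRecord:
    text: str
    doc_id: str          # hypothetical metadata: stable id shared by all versions of a document
    release_date: date   # hypothetical metadata: release date of this version


def latest_versions_only(chunks: List[ChunkRecord]) -> List[ChunkRecord]:
    """Keep only the chunks that belong to the most recent version of each document."""
    latest: Dict[str, date] = {}
    for chunk in chunks:
        if chunk.doc_id not in latest or chunk.release_date > latest[chunk.doc_id]:
            latest[chunk.doc_id] = chunk.release_date
    return [c for c in chunks if c.release_date == latest[c.doc_id]]


# Made-up store contents: two versions of one document, one version of another
store = [
    ChunkRecord("old guidance text", "framework_a", date(2023, 1, 26)),
    ChunkRecord("updated guidance text", "framework_a", date(2024, 7, 26)),
    ChunkRecord("principles text", "blueprint_b", date(2022, 10, 4)),
]

searchable = latest_versions_only(store)
searchable_texts = [c.text for c in searchable]
```

The older chunks stay in `store` for audit purposes; only the candidate set handed to similarity search shrinks.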
