# Lab 2: RAG

## We will build and evaluate a Question Answering Expert for a fictional company: Insurellm!

### BEFORE WE BEGIN:

Look at the knowledge-base folder: it plays the role of the company's shared drive.

### For those new to RAG:

Does one of the Experts want to give an explanation?

We will be figuring out ways to insert relevant background information into the prompt.
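
To make the idea concrete before we touch any libraries, here is a minimal sketch of RAG in plain Python. It assumes a toy three-line knowledge base (illustrative snippets, not the real files) and uses simple word overlap as a stand-in for the embedding search we will use later. The names `KNOWLEDGE`, `retrieve` and `build_prompt` are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Toy knowledge base: illustrative snippets, NOT the real knowledge-base files.
KNOWLEDGE = [
    "Insurellm was founded in 2015 by Avery Lancaster.",
    "Claimllm is a product for processing insurance claims.",
    "The company headquarters is in San Francisco.",
]


def tokenize(text: str) -> Counter:
    """Lowercase bag of words; a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, passage: str) -> int:
    """Number of words shared by the query and the passage."""
    return sum((tokenize(query) & tokenize(passage)).values())


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k passages that share the most words with the query."""
    return sorted(KNOWLEDGE, key=lambda p: overlap(query, p), reverse=True)[:k]


def build_prompt(query: str, k: int = 1) -> str:
    """Insert the retrieved passages into the prompt, ahead of the question."""
    context = "\n".join(retrieve(query, k))
    return f"Answer the question based on the context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt("When was Insurellm founded?")
print(prompt)
```

Everything in the rest of this lab is a better version of one of these two steps: finding the right passages, and placing them in the prompt.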

Today will be more intense, so please ask me lots of questions and ask for clarification whenever you need it.

In [1]:
import os
import glob
import tiktoken
import numpy as np
from IPython.display import Markdown, display

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_huggingface import HuggingFaceEmbeddings

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

#from langchain_experimental.text_splitter import SemanticChunker
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [2]:
MODEL = "gpt-4.1-nano"
db_name = "vector_db"

In [14]:
knowledge_base_path = "knowledge-base/**/*.md"
files = glob.glob(knowledge_base_path, recursive=True)
print(f"Found {len(files)} files in the knowledge base")

entire_knowledge_base = ""

for file_path in files:
    with open(file_path, 'r', encoding='utf-8') as f:
        entire_knowledge_base += f.read()
        entire_knowledge_base += "\n\n"  # Add separator between files

print(f"Total characters in knowledge base: {len(entire_knowledge_base):,}")

Found 76 files in the knowledge base
Total characters in knowledge base: 304,434


In [15]:
encoding = tiktoken.encoding_for_model("gpt-4.1-mini")
tokens = encoding.encode(entire_knowledge_base)
token_count = len(tokens)
print(f"Total tokens for gpt-4.1-mini: {token_count:,}")

Total tokens for gpt-4.1-mini: 63,555


In [16]:
folders = glob.glob("knowledge-base/*")

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs={'encoding': 'utf-8'})
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

print(f"Loaded {len(documents)} documents")

Loaded 76 documents


In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150, separators=["\n\n", "\n", ". ", " ", ""])
chunks = text_splitter.split_documents(documents)

print(f"Divided into {len(chunks)} chunks")

Divided into 515 chunks
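
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter tries the listed separators in order and merges the pieces up to the size limit, whereas this hypothetical `chunk_text` helper just slides a character window along the text.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the neighbouring chunks, at the cost of storing some text twice.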


In [None]:
#embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
#text_splitter=SemanticChunker(embeddings=embeddings,breakpoint_threshold_type='percentile',breakpoint_threshold_amount=95)
#chunks = text_splitter.split_documents(documents)

#print(f"Divided into {len(chunks)} chunks")

In [45]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# embeddings = OpenAIEmbeddings()

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 515 documents


In [43]:
display(Markdown(chunks[1].page_content))

However, the company underwent a strategic restructuring in 2022-2023 to focus on profitability and sustainable growth. This included consolidating office locations, implementing a remote-first strategy, and streamlining operations. As of 2025, Insurellm operates with a lean, highly efficient team of 32 employees who have built a portfolio of 32 active contracts spanning all eight product lines. The company maintains its San Francisco headquarters along with small satellite offices in key markets including New York, Austin, Chicago, and Denver.

In [None]:
collection = vectorstore._collection
count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store")
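
Each chunk is now a point in a high-dimensional space, and retrieval means finding the points closest to the query's vector. As a rough sketch of that comparison, the example below ranks a few toy 3-dimensional vectors by cosine similarity. This is an assumption for illustration: Chroma supports several distance metrics, and the one configured for this collection is not shown here. The helper names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction, 0.0 = orthogonal)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Indices of the k stored vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors standing in for real embeddings
stored = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
query_vector = [1.0, 0.0, 0.0]
best = top_k(query_vector, stored, k=2)
print(best)
```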

In [28]:
# Gather the vectors, documents and metadata

result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']
# Use the doc_type stored at load time; splitting the source path on '\\' only works on Windows
doc_types = [metadata['doc_type'] for metadata in metadatas]
colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]

In [29]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding)

tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [30]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
# create a new Chat with OpenAI
llm = ChatOpenAI(temperature=0.7, model_name=MODEL)

# how many chunks to provide in each prompt
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Simple prompt that places the retrieved context in the system message (no chat history yet)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question based on the context:\n{context}"),
    ("human", "{input}")
])

# Create the chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# Invoke it
query = "Please explain what Insurellm is in a couple of sentences"
result = rag_chain.invoke({"input": query})
print(result["answer"])
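
To see what the chain does under the hood, here is a hedged sketch of the same two steps in plain Python: retrieve documents, "stuff" them into the `{context}` slot, then call the model. The `Doc` class, `run_rag`, `fake_retrieve` and `fake_llm` are hypothetical stand-ins so the example runs without an API key, and the blank-line separator between documents is an assumption. The returned dictionary mirrors the `input`, `context` and `answer` keys of the real chain's result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


SYSTEM_TEMPLATE = "Answer the question based on the context:\n{context}"


def run_rag(
    query: str,
    retrieve: Callable[[str], list[Doc]],
    llm: Callable[[list[tuple[str, str]]], str],
) -> dict:
    """Retrieve, stuff the documents into the system message, then ask the model."""
    docs = retrieve(query)
    context = "\n\n".join(doc.page_content for doc in docs)  # assumed separator
    messages = [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", query),
    ]
    return {"input": query, "context": docs, "answer": llm(messages)}


def fake_retrieve(query: str) -> list[Doc]:
    """Pretend retriever: always returns the same two documents."""
    return [Doc("Insurellm was founded in 2015."), Doc("It is an insurance tech firm.")]


def fake_llm(messages: list[tuple[str, str]]) -> str:
    """Pretend model: reports what it was given instead of answering."""
    return f"(model saw {len(messages)} messages)"


sketch_result = run_rag("What is Insurellm?", fake_retrieve, fake_llm)
print(sketch_result["answer"])
```

Because the retrieved documents come back under `context`, you can always inspect exactly what the model was shown when an answer looks wrong.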

## Now check out ingest.py

Then run at the terminal:

`uv run ingest.py`

In [4]:
!uv run ingest.py

There are 200 vectors with 384 dimensions in the vector store
Ingestion complete


## Now check out answer.py

In [2]:
from answer import fetch_context, answer_question

fetch_context("Who is Avery?")

[Document(id='e0a0d436-7bc6-4028-ba07-b15053f2b9ff', metadata={'doc_type': 'employees', 'source': 'knowledge-base\\employees\\Avery Lancaster.md'}, page_content="Avery recognized the need to enhance company culture. - **2020**: **Below Expectations**  \n  The COVID-19 pandemic posed unforeseen operational difficulties. Avery faced criticism for delayed strategy shifts, although efforts were eventually made to stabilize the company. - **2021**: **Exceptional**  \n  Avery's decisive transition to remote work and rapid adoption of digital tools led to record-high customer satisfaction levels and increased sales. - **2022**: **Satisfactory**  \n  Avery focused on rebuilding team dynamics and addressing employee concerns, leading to overall improvement despite a saturated market. - **2023**: **Exceeds Expectations**  \n  Market leadership was regained with innovative approaches to personalized insurance solutions. Avery is now recognized in industry publications as a leading voice in Insura

In [3]:
result, chunks = await answer_question("Who is Avery?")
display(Markdown(result))

Avery Lancaster is the Co-Founder and Chief Executive Officer (CEO) of Insurellm. She has been with the company since its founding in 2015 and has played a key role in guiding Insurellm to become a leading provider in the Insurance Tech industry. Avery is known for her innovative leadership, risk management expertise, and her active participation in professional development, industry conferences, and community outreach. She is based in San Francisco, California.

## Now check out app.py

In [9]:
!uv run app.py

^C


## OK - Now it's time to EVALUATE!

### First check out tests.jsonl for all the questions

And see how it's loaded in test.py


In [2]:
from test import load_tests

test_data = load_tests()

print(len(test_data))
print(test_data[0])
print(test_data[10])



150
question='Who won the prestigious IIOTY award in 2023?' keywords=['Maxine', 'Thompson', 'IIOTY'] reference_answer='Maxine Thompson won the prestigious Insurellm Innovator of the Year (IIOTY) award in 2023.' category='direct_fact'
question='How many Claimllm contracts does Insurellm have?' keywords=['7', 'Claimllm'] reference_answer='Insurellm has 7 contracts for Claimllm, ranging from independent adjusting firms to enterprise claims networks.' category='direct_fact'
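
If you want a feel for the file format before opening test.py, the sketch below parses JSONL (one JSON object per line) into typed records. It assumes the field names visible in the output above; the `QuestionRecord` class and `parse_jsonl` function are hypothetical and may not match how test.py actually does it.

```python
import json
from dataclasses import dataclass


@dataclass
class QuestionRecord:
    """One test question; field names taken from the printed output above."""
    question: str
    keywords: list[str]
    reference_answer: str
    category: str


def parse_jsonl(text: str) -> list[QuestionRecord]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [QuestionRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Two records rebuilt from the output above, serialized as JSONL for the demo
sample_jsonl = "\n".join([
    json.dumps({
        "question": "Who won the prestigious IIOTY award in 2023?",
        "keywords": ["Maxine", "Thompson", "IIOTY"],
        "reference_answer": "Maxine Thompson won the prestigious Insurellm Innovator of the Year (IIOTY) award in 2023.",
        "category": "direct_fact",
    }),
    json.dumps({
        "question": "How many Claimllm contracts does Insurellm have?",
        "keywords": ["7", "Claimllm"],
        "reference_answer": "Insurellm has 7 contracts for Claimllm, ranging from independent adjusting firms to enterprise claims networks.",
        "category": "direct_fact",
    }),
])

records = parse_jsonl(sample_jsonl)
print(len(records), records[0].category)
```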


In [3]:
print(set(test.category for test in test_data))


{'relationship', 'direct_fact', 'holistic', 'temporal', 'comparative', 'spanning', 'numerical'}


## Now take a look at eval.py

In [4]:
from eval import evaluate_retrieval, evaluate_answer

evaluate_retrieval(test_data[1])

RetrievalEval(mrr=0.5, ndcg=0.6309297535714575, keywords_found=2, total_keywords=2, keyword_coverage=100.0)
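
As a reading aid for those two numbers, here is a sketch of the textbook definitions of reciprocal rank and nDCG with binary relevance. The values above are consistent with the first relevant chunk appearing at rank 2, but eval.py is not shown here, so treat the exact way it aggregates over keywords as unknown. The function names are hypothetical.

```python
import math


def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def dcg(relevance: list[int]) -> float:
    """Discounted cumulative gain: relevant results count for less the lower they rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))


def ndcg(relevance: list[int]) -> float:
    """DCG divided by the DCG of the ideal ordering (all relevant results first)."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# 1 = the retrieved chunk is relevant, 0 = it is not; relevant chunk is at rank 2
ranking = [0, 1, 0]
rr = reciprocal_rank(ranking)
score = ndcg(ranking)
print(rr, score)
```

MRR is simply the mean of the reciprocal rank over many queries (or, plausibly, over keywords), so a perfect retriever scores 1.0 on both metrics.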

In [5]:
await evaluate_answer(test_data[1])

(AnswerEval(feedback='The answer correctly states the founding year and the founder, aligning with the reference, but the question only asks for the founding year, so including the founder adds unnecessary details.', accuracy=5.0, completeness=3.0, relevance=4.0),
 'Insurellm was founded in 2015 by Avery Lancaster.',
 [Document(id='76ae49ab-a2a1-4236-b767-09acaab723b7', metadata={'source': 'knowledge-base\\company\\overview.md', 'doc_type': 'company'}, page_content='# Overview of Insurellm\n\nInsurellm is an innovative insurance tech firm with 32 employees operating primarily remotely across the US, with offices in San Francisco (HQ), New York, Austin, Chicago, and Denver.'),
  Document(id='62faa18f-ee3a-47bb-8496-d91499a40776', metadata={'source': 'knowledge-base\\company\\about.md', 'doc_type': 'company'}, page_content='# About Insurellm\n\nInsurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. Its

## AND FINALLY - it all comes together in a UI

In [None]:
!uv run evaluator.py

## Ideas for your experiments

### Quick wins

- Experiment with the encoder
- Experiment with chunking strategies

### Big change ideas

1. Pre-processing - use an LLM to rewrite (a) the chunks and/or (b) the questions / conversation history
2. Hierarchical RAG - summarize at different levels and do RAG over summaries
3. Tools!