1.3.2. Phase 2: Policy Document Indexing for RAG

Index benefits policy PDFs for Retrieval-Augmented Generation (RAG) to enable
efficient policy retrieval.

Text Extraction: Use PyPDF2 to extract text from policy PDFs, removing noise
(e.g., page numbers).

Text Preprocessing: Segment text into chunks (e.g., 200 words per chunk).

Vector Store Creation: Build a Chroma vector store using embeddings from 
sentence-transformers or OpenAI.

Retrieval Testing: Test retrieval with 5 HR policy queries (e.g., “What is the
eligibility for Tuition Reimbursement?”) and evaluate chunk relevance.
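
The preprocessing step above targets roughly 200 words per chunk. LangChain's CharacterTextSplitter measures chunk_size in characters by default, so a word-based target needs either a custom length function or a small helper. The following is a minimal sketch of word-based chunking in plain Python; the function name chunk_by_words and its parameters are illustrative assumptions, not part of the notebook.

```python
def chunk_by_words(text: str, words_per_chunk: int = 200, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `words_per_chunk` whitespace-separated words.

    `overlap` is the number of words shared between consecutive chunks.
    """
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    if not 0 <= overlap < words_per_chunk:
        raise ValueError("overlap must be in [0, words_per_chunk)")

    words = text.split()
    step = words_per_chunk - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        # Stop once the current chunk has reached the end of the text.
        if start + words_per_chunk >= len(words):
            break
    return chunks


# Synthetic text of 450 placeholder words, used only to exercise the helper.
demo_text = " ".join(f"w{i}" for i in range(450))
for n, chunk in enumerate(chunk_by_words(demo_text, words_per_chunk=200)):
    print(n, len(chunk.split()))
```

Splitting on whitespace ignores sentence and paragraph boundaries, so a chunk can end mid-sentence; a small overlap is one way to soften that.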

In [6]:
#!pip install PyPDF2
#!pip install -qU "langchain-chroma"
#!pip install -qU langchain-openai
!pip install -qU langchain==0.3.26
!pip install -qU langchain-community==0.3.27
!pip install -qU langchain-core==0.3.74
!pip install -qU langchain-openai==0.3.24
!pip install -qU langchain-text-splitters==0.3.8

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-chroma 0.2.5 requires langchain-core>=0.3.70, but you have langchain-core 0.3.68 which is incompatible.
langchain-openai 0.3.31 requires langchain-core<1.0.0,>=0.3.74, but you have langchain-core 0.3.68 which is incompatible.
langchain-text-splitters 0.3.9 requires langchain-core<1.0.0,>=0.3.72, but you have langchain-core 0.3.68 which is incompatible.
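
The conflict message above reports langchain-core 0.3.68, while the install cell pins 0.3.74, so it may be worth checking which versions the running kernel actually sees. A minimal sketch using the standard library's importlib.metadata follows; the helper name installed_versions is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = [
    "langchain",
    "langchain-community",
    "langchain-core",
    "langchain-openai",
    "langchain-text-splitters",
    "langchain-chroma",
]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```

If the printed versions differ from the pins, restarting the kernel after installation is a common first thing to try.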

In [7]:
from PyPDF2 import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
import os
import pandas as pd

directory_path = './assets/benefits/'  # Replace with the actual directory path

pdf_rows = []
# Note: chunk_size is measured in characters (the splitter's default length function), not words.
text_splitter = CharacterTextSplitter(separator='\n\n', chunk_size=200, chunk_overlap=0)

for filename in os.listdir(directory_path):
    if not filename.endswith('.pdf'):
        continue
    full_path = os.path.join(directory_path, filename)
    print(f"Found PDF: {full_path}")
    reader = PdfReader(full_path)
    total_pages = len(reader.pages)
    print(f"Total pages: {total_pages}")
    for i in range(total_pages):
        # Fall back to an empty string if a page yields no extractable text.
        page_text = reader.pages[i].extract_text() or ''
        # Remove the literal word "Introduction" as noise.
        page_text = page_text.replace("Introduction", '')
        # Chunk numbering restarts on every page; 'page' is 0-based.
        for chunk_num, c in enumerate(text_splitter.split_text(page_text)):
            pdf_rows.append({'policy': filename, 'page': i, 'chunk': chunk_num, 'text': c})
policy_pdfs = pd.DataFrame(pdf_rows)



Found PDF: ./assets/benefits/gym-policy.pdf
Total pages: 7
Found PDF: ./assets/benefits/life-insurance-policy.pdf
Total pages: 8
Found PDF: ./assets/benefits/childcare-policy.pdf
Total pages: 8
Found PDF: ./assets/benefits/401k-retirement-policy.pdf
Total pages: 9
Found PDF: ./assets/benefits/work-from-home-policy.pdf
Total pages: 7
Found PDF: ./assets/benefits/vacation-policy.pdf
Total pages: 7
Found PDF: ./assets/benefits/health-insurance-policy.pdf
Total pages: 7
Found PDF: ./assets/benefits/tuition-reimbursement-policy.pdf
Total pages: 8
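
The extraction loop above removes only the word "Introduction". Page numbers, the noise example named in the plan, could be filtered with a small helper before chunking. This is a sketch under assumptions: the names PAGE_NUMBER_LINE and strip_page_numbers are hypothetical, and the patterns are guesses about how page numbers might appear in these PDFs, not something verified against the files.

```python
import re

# Hypothetical patterns for lines that hold nothing but a page number,
# e.g. "3", "Page 3", "Page 3 of 8", "- 3 -".
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:-\s*)?(?:page\s+)?\d+(?:\s+of\s+\d+)?(?:\s*-)?\s*$",
    re.IGNORECASE,
)


def strip_page_numbers(page_text: str) -> str:
    """Drop lines that consist solely of a page number."""
    kept = [
        line
        for line in page_text.splitlines()
        if not PAGE_NUMBER_LINE.match(line)
    ]
    return "\n".join(kept)


# Made-up sample text, used only to show the call.
sample = "Gym Policy\nSection 3 covers eligibility.\nPage 3 of 8\n"
print(strip_page_numbers(sample))
```

A line containing only a number that is real content (for example a value in a table) would also be dropped, so the pattern should be checked against a few extracted pages before it is applied to the whole corpus.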


In [8]:
import getpass
import os
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
from uuid import uuid4
import chromadb

# Prompt for the OpenAI key only if it is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Chroma Cloud credentials; the database name is fixed for this project.
os.environ["CHROMA_API_KEY"] = getpass.getpass("Enter API key for Chroma: ")
os.environ["CHROMA_TENANT"] = getpass.getpass("Enter tenant for Chroma: ")
os.environ["CHROMA_DATABASE"] = 'policy_details'

vector_store = Chroma(
    collection_name="policy_embeddings",
    embedding_function=embeddings,
    chroma_cloud_api_key=os.getenv("CHROMA_API_KEY"),
    tenant=os.getenv("CHROMA_TENANT"),
    database=os.getenv("CHROMA_DATABASE"),
)

Enter API key for OpenAI:  ········


In [10]:
#vector_store.delete(ids=uuids[:-1])
# Wrap each chunk in a Document, keeping its source policy, page, and chunk number as metadata.
docs = [
    Document(
        page_content=row.text,
        metadata={'policy': row.policy, 'page': row.page, 'chunk_num': row.chunk},
    )
    for row in policy_pdfs.itertuples()
]
# Fresh UUIDs are used as the stored IDs, so re-running this cell adds the chunks again under new IDs.
uuids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(documents=docs, ids=uuids)

Enter API key for Chroma:  ········
Enter tenant for Chroma:  ········


['1d487c85-2cb0-4261-96a3-aa64102a4384',
 '7646add7-958a-4694-82f6-56acde691d1f',
 'd3d67761-e8e9-44d3-827d-59a00f7c004c',
 '88702191-998d-41e7-a645-b1fde8ed4d28',
 'f62209a3-eaab-40e8-899e-8f341a5776d4',
 '54f15c2d-dbac-4c35-b72d-015aaffdf189',
 'eb166803-6216-4902-800a-b8f5538ba3fa',
 'bd11d47d-dff7-4d29-8769-536de31ca1df',
 '440be405-d0b7-47c9-9335-3b82e2cd959c',
 'c3f1b1da-9802-49ca-89e7-757fd6135cee',
 '2bf1a353-ab3d-4798-b5d6-198e00acdbab',
 'aeaed138-2621-4020-81b5-b43df0c96ab2',
 'a203018b-3e31-4b2e-beee-d9d581dca620',
 '08808b7e-1202-4fdc-9da7-1e743c04b68a',
 '6c239a87-7325-4992-8c50-93261ce67286',
 'f5a3858e-9c1a-44f2-b24d-6eae5d1b48ef',
 '03376ab2-46cd-4da9-b168-52177d0e80ab',
 'b5b9a493-77df-4170-922a-af0e79de5309',
 '47ab4642-e2ec-45a3-b35a-e24528971f7c',
 'ac8ef7e5-8da0-410f-9955-426f629d4c9d',
 '1e05ca34-8d0a-4b20-a296-da514e229ea6',
 '0f8d2192-37f0-401f-a822-a693754c51d9',
 '258a3d18-a525-4bbc-a6c8-cb20f4c72716',
 '9c6e23cd-8e34-433b-97a4-cbdf9b67e909',
 '82536845-3943-

Similarity Search - Prioritize chunks most similar to the query

In [39]:
query_set = [
    "How do I use FSA to pay for the gym?",
    "When should beneficiaries be updated for my 401k account?",
    "What are the timings for before and after school child care?",
    "How is HSA useful for retirement?",
    "Can I work from a different state?",
]
sim_scores = []
for query in query_set:
    # Top-5 chunks per query, each with a relevance score (higher means more relevant).
    results = vector_store.similarity_search_with_relevance_scores(query, k=5)
    for res, score in results:
        # print(f"* [Similarity={score:.3f}] {res.page_content} [{res.metadata}]")
        sim_scores.append({'query': query, 'Similarity': score, 'chunk': res.page_content, 'metadata': res.metadata})
sim_scores = pd.DataFrame(sim_scores)

In [40]:
sim_scores.sort_values(by=['query','Similarity'])

index,query,Similarity,chunk,metadata
24,Can I work from a different state?,0.207197,TechLance Work from Home Policy,"{'policy': 'work-from-home-policy.pdf', 'page'..."
23,Can I work from a different state?,0.208208,performed independently with digital tools and...,"{'page': 1, 'policy': 'work-from-home-policy.p..."
22,Can I work from a different state?,0.216416,TechLance recognizes that flexible work arrange...,"{'policy': 'work-from-home-policy.pdf', 'page'..."
21,Can I work from a different state?,0.282792,Can I work from home occasionally without a fo...,"{'page': 6, 'policy': 'work-from-home-policy.p..."
20,Can I work from a different state?,0.284356,For employees who don’t have formal remote wor...,"{'policy': 'work-from-home-policy.pdf', 'page'..."
4,How do I use FSA to pay for the gym?,0.163632,community centers. These partnerships provide ...,"{'page': 1, 'policy': 'gym-policy.pdf'}"
3,How do I use FSA to pay for the gym?,0.197314,"Once enrolled, you’ll receive a TechLance corp...","{'page': 3, 'policy': 'gym-policy.pdf'}"
2,How do I use FSA to pay for the gym?,0.219559,How much money can I save with corporate gym m...,"{'policy': 'gym-policy.pdf', 'page': 6}"
1,How do I use FSA to pay for the gym?,0.221666,"insurance premiums by $25, while regular usage...","{'policy': 'gym-policy.pdf', 'page': 4}"
0,How do I use FSA to pay for the gym?,0.306401,"In addition to traditional gym memberships, al...","{'page': 2, 'policy': 'gym-policy.pdf'}"
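
The plan calls for evaluating chunk relevance, and the metadata column above records which policy each retrieved chunk came from. One coarse way to score a query is the share of its top-k chunks that come from the policy a reviewer expects. The sketch below assumes hand-assigned labels: the expected mapping and the function name precision_at_k are illustrative, not part of the notebook. The retrieved lists mirror the source policies shown in the table above.

```python
def precision_at_k(retrieved_policies, expected_policy, k):
    """Share of the top-k retrieved chunks that come from the expected policy."""
    top = retrieved_policies[:k]
    if not top:
        return 0.0
    return sum(1 for policy in top if policy == expected_policy) / len(top)


# Hypothetical ground-truth labels assigned by a reviewer.
expected = {
    "Can I work from a different state?": "work-from-home-policy.pdf",
    "How do I use FSA to pay for the gym?": "gym-policy.pdf",
}

# Source policy of each retrieved chunk, best match first.
retrieved = {
    "Can I work from a different state?": ["work-from-home-policy.pdf"] * 5,
    "How do I use FSA to pay for the gym?": ["gym-policy.pdf"] * 5,
}

for query, policies in retrieved.items():
    score = precision_at_k(policies, expected[query], k=5)
    print(f"{query} -> precision@5 = {score:.2f}")
```

Matching on the source file only shows that retrieval landed in the right document. It does not show that a chunk answers the question, so short chunks such as a bare policy title still need a manual read.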


Maximal Marginal Relevance - Optimize for similarity to the query and diversity among documents to remove redundant information.

In [50]:
mmr_scores = []
for query in query_set:
    # lambda_mult: closer to 1 favors similarity to the query; closer to 0 favors diversity.
    results = vector_store.max_marginal_relevance_search(query, k=3, lambda_mult=0.6)
    for doc in results:
        mmr_scores.append({'query': query, 'chunk': doc.page_content, 'metadata': doc.metadata})
mmr_scores = pd.DataFrame(mmr_scores)

In [51]:
mmr_scores

index,query,chunk,metadata
0,How do I use FSA to pay for the gym?,"In addition to traditional gym memberships, al...","{'policy': 'gym-policy.pdf', 'page': 2}"
1,How do I use FSA to pay for the gym?,Some services require prior authorization from...,"{'policy': 'health-insurance-policy.pdf', 'pag..."
2,How do I use FSA to pay for the gym?,When can I change my health insurance plan? Yo...,"{'page': 6, 'policy': 'health-insurance-policy..."
3,When should beneficiaries be updated for my 40...,"information about plan changes, fee disclosure...","{'policy': '401k-retirement-policy.pdf', 'page..."
4,When should beneficiaries be updated for my 40...,Beneficiary Designations and Claims\nOne of the...,"{'policy': 'life-insurance-policy.pdf', 'page'..."
5,When should beneficiaries be updated for my 40...,This policy is effective immediately and supers...,"{'page': 7, 'policy': 'life-insurance-policy.p..."
6,What are the timings for before and after scho...,can care for children with minor illnesses who...,"{'page': 3, 'policy': 'childcare-policy.pdf'}"
7,What are the timings for before and after scho...,Our Family Services coordinator in HR holds mo...,"{'page': 6, 'policy': 'childcare-policy.pdf'}"
8,What are the timings for before and after scho...,provide resources for finding specialized care ...,"{'policy': 'childcare-policy.pdf', 'page': 7}"
9,How is HSA useful for retirement?,We also provide guidance on coordinating your ...,"{'policy': '401k-retirement-policy.pdf', 'page..."
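
To make the similarity/diversity trade-off behind lambda_mult concrete, here is a minimal pure-Python sketch of greedy Maximal Marginal Relevance over toy 2-D vectors. It follows the standard MMR formulation and is not claimed to match the vector store's internal implementation; the function names and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.6):
    """Greedily pick up to k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors: doc 1 is a near-duplicate of doc 0, doc 2 points elsewhere.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
print(mmr(query, docs, k=2, lambda_mult=0.3))  # diversity weighted more heavily
```

With lambda_mult=1.0 the near-duplicate is picked second; with lambda_mult=0.3 the redundancy penalty pushes the selection toward the less similar document instead.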
