## Policy Document Indexing for RAG

We start by creating the following functions: ```clean_text``` & ```load_files```. When looking at the given pdfs, we observe that the pdf is clean. There is no number of pages, images any extra context that should be removed. However we create a function that would lower the text, removes multiple spaces, etc. We decided not to get rid of punctions as they seem to be relevant. When it comes to the function ```load_files```, we assume that in the future the folder will be filled with other file extensions as well. Thus we create a generic function that would look at pdfs and other extensions as well.

In [1]:
import os
import re
import string
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain_core.documents import Document
from typing import List
from langchain_community.document_loaders import TextLoader, Docx2txtLoader, CSVLoader

# Load environment variables from .env file
load_dotenv()

# Path to the documents
dir_path = 'assets/documents/'

# -------------------------------
# Text cleaning function
# -------------------------------
# Function to clean and remove noise from text
# We observe that the pdfs don't contain any page numbers, or images
def clean_text(text: str, lowercase: bool = True, remove_punct: bool = False) -> str:
    """
    Cleans extracted text for preprocessing:
    - Lowercase (optional)
    - Remove line breaks, tabs
    - Remove punctuation (optional)
    - Normalize spaces
    """
    if not text:
        return ""
    
    # Convert to lowercase
    if lowercase:
        text = text.lower()

    # Replace newlines and tabs with space 
    text = text.replace("\n", " ").replace("\t", " ")
    
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    
    # Remove multiple spaces
    text = re.sub(r"\s+", " ", text)
    
    return text.strip()

# -------------------------------
# File loader
# -------------------------------
def load_files(path: str) -> list[Document]:
    _, file_extension = os.path.splitext(path)
    file_extension = file_extension.lower()

    if file_extension == '.pdf':
        reader = PdfReader(path)
        all_text = "".join((p.extract_text() or "") for p in reader.pages)
        cleaned = clean_text(all_text, lowercase=True, remove_punct=False)
        return [Document(page_content=cleaned, metadata={"source": path})]

    elif file_extension == '.txt':
        docs = TextLoader(path, encoding='utf8').load()
        for d in docs:
            d.page_content = clean_text(d.page_content)
        return docs

    elif file_extension == '.docx':
        docs = Docx2txtLoader(path).load()
        for d in docs:
            d.page_content = clean_text(d.page_content)
        return docs
        
    elif file_extension == '.csv':
        docs = CSVLoader(path).load()
        for d in docs:
            d.page_content = clean_text(d.page_content)
        return docs

    else:
        raise ValueError(f"Unsupported file type: {file_extension}")

# -------------------------------
# Usage example
# -------------------------------
files = [f for f in os.listdir(dir_path) if os.path.isfile(os.path.join(dir_path, f))]

# Collect all loaded documents
all_documents = []
for filename in files:
    full_path = os.path.join(dir_path, filename)
    try:
        docs = load_files(full_path)
        all_documents.extend(docs)
        print(f"Loaded & cleaned {filename}")
    except ValueError as e:
        print(e)

print(f"\nTotal loaded documents: {len(all_documents)}")



Loaded & cleaned tuition-reimbursement-policy.pdf
Loaded & cleaned health-insurance-policy.pdf
Loaded & cleaned work-from-home-policy.pdf
Loaded & cleaned gym-policy.pdf
Loaded & cleaned vacation-policy.pdf
Loaded & cleaned 401k-retirement-policy.pdf
Loaded & cleaned life-insurance-policy.pdf
Loaded & cleaned childcare-policy.pdf

Total loaded documents: 8


After loading and cleaning the documents, we split them into chunks. Firstly, we tried a function that would split the documents into section, to have another source of metadata - to refet to the document and a specific section, however if there are documents which have sections that are very long, that doesn't seem like a proper option. So we use ```RecursiveCharacterTextSplitter``` wuth chunk_overlap to keep the context between chunks and not lose meaning.

Afterwards, we define a function ```create_chroma_collection``` that would create a vector store using openai embeddings.  

In [2]:

# We start by splitting the document into sections for later text preprocessing
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Suppose `documents` is what you loaded from load_files()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=200,  # overlap between chunks (keeps context)
)

split_docs = splitter.split_documents(all_documents)

print(f"Original docs: {len(all_documents)}")
print(f"Split docs: {len(split_docs)}")

# Show first 2 chunks
for i, d in enumerate(split_docs[:2], 1):
    print(f"\n--- Chunk {i} ---")
    print(d.page_content[:300], "...")
    print("Metadata:", d.metadata)

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Create a new Chroma collection & split documents into chunks
def create_chroma_collection(
    name: str, 
    documents: List[Document], 
    directory: str
) -> Chroma:
    """
    Create or overwrite a Chroma collection with given documents.

    Args:
        name (str): Name of the collection.
        documents (List[Document]): List of LangChain Document objects.
        directory (str): Directory where the collection is persisted.

    Returns:
        Chroma: The created Chroma vectorstore.
    """
    persist_directory = os.path.join(directory, name)
    os.makedirs(persist_directory, exist_ok=True)

    embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

    # Create collection and persist it
    collection = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        collection_name=name,
        persist_directory=persist_directory
    )
    collection.persist()
    return collection

collection = create_chroma_collection(
    name="benefits_collection",
    documents=split_docs,
    directory="./persist"
)

print("Collection created and persisted.")

Original docs: 8
Split docs: 146

--- Chunk 1 ---
techlance tuition reimbursement policy introduction techlance is committed to supporting the professional growth and career development of our employees through comprehensive educational assistance programs. we believe that investing in our team members’ education not only enhances their skills and  ...
Metadata: {'source': 'assets/documents/tuition-reimbursement-policy.pdf'}

--- Chunk 2 ---
to accommodate working professionals who want to advance their education while maintaining their career momentum. whether you’re pursuing your ﬁrst degree, advancing to graduate studies, or seeking professional certiﬁcations to enhance your expertise, techlance is here to support your educational jo ...
Metadata: {'source': 'assets/documents/tuition-reimbursement-policy.pdf'}


  embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))


Collection created and persisted.


  collection.persist()


Last functions to be created are ```load_chroma_collection```, ```add_documents_to_collection``` and ```load_retriever_from_collection```. 

With adding documents to the collection we use incremental updates. New PDFs, DOCX files, or CSVs may arrive over time. Instead of rebuilding the entire collection from scratch, we can add only the new documents. This saves time and computation, especially for large collections. Preserve embeddings for existing docs. Lastly, we can aggregate multiple new documents and add them in one go, improving efficiency.

```load_retriever_from_collection``` helps not to recreate the existing vectorstore when restarting the script. The function had configurable retrieval parameters, where we can set things like score_threshold, search_type, or top_k when loading the retriever. This allows us to tune retrieval behavior without changing the underlying vectorstore.

In [3]:

# Load the collection
def load_chroma_collection(name: str, directory: str) -> Chroma:
    """
    Load an existing Chroma collection.

    Args:
        name (str): Name of the collection.
        directory (str): Directory where the collection is persisted.

    Returns:
        Chroma: The loaded Chroma vectorstore.
    """
    persist_directory = os.path.join(directory, name)
    if not os.path.exists(persist_directory):
        raise ValueError(f"Collection '{name}' does not exist in '{directory}'.")

    embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

    collection = Chroma(
        collection_name=name,
        embedding_function=embeddings,
        persist_directory=persist_directory
    )
    return collection

# Add documents to the collection
def add_documents_to_collection(collection: Chroma, new_documents: List[Document]) -> None:
    """
    Add new documents to an existing Chroma collection.

    Args:
        collection (Chroma): The Chroma vectorstore to add documents to.
        new_documents (List[Document]): List of new LangChain Document objects to add.
    """
    if not new_documents:
        print("No new documents to add.")
        return

    collection.add_documents(new_documents)
    collection.persist()
    print(f"Added {len(new_documents)} documents to the collection and persisted changes.")
    
# Load retriever from the collection
def load_retriever_from_collection(
    collection_name: str,
    search_type: str = "similarity_score_threshold",
    score_threshold: float = 0.3,
    top_k: int = 5
):
    """
    Load a retriever from a Chroma collection with configurable retrieval behavior.

    Args:
        collection_name (str): Name of the Chroma collection.
        search_type (str): Retrieval type (similarity_score_threshold or mmr).
        score_threshold (float): Minimum similarity score for retrieval.
        top_k (int): Number of documents to return.

    Returns:
        Retriever: Configured retriever.
    """

    # Load the persisted collection
    collection = load_chroma_collection(name=collection_name, directory="./persist")
    
    # Build retriever with configurable behavior
    retriever = collection.as_retriever(
        search_type=search_type,
        search_kwargs={
            "score_threshold": score_threshold,
            "k": top_k
        }
    )
    return retriever

Example of dynamically storing documents:
The script will run only if there is a new file to be added

In [None]:
# # Load existing collection
# collection = load_chroma_collection("benefits_collection", "./persist")

# # Load new PDFs
# new_docs = []
# new_files = ["assets/new_policy.pdf"]
# for f in new_files:
#     new_docs.extend(load_files(f))

# # Add to collection
# add_documents_to_collection(collection, new_docs)

# print("Collection updated with new documents!")

  collection = Chroma(


FileNotFoundError: [Errno 2] No such file or directory: 'assets/new_policy.pdf'

In [5]:
retriever = load_retriever_from_collection("benefits_collection", score_threshold = 0.6, top_k=3)


queries = [
    "What's the maternity leave policy?",
    "What is the eligibility for Tuition Reimbursement",
    "How much can employees contribute to 401-k?",
    "Do I have to manually enroll for 401-k?",
    "I work in Finance, can I work remotely?"
]

for i in queries:
    print(f"\n\nQuery: {i}")
    query = i
    results = retriever.get_relevant_documents(query)

    print(f" Found {len(results)} results")
    for i, r in enumerate(results, 1):
        print(f"\n--- Result {i} ---")
        print(r.page_content[:300], "...")
        print("Metadata:", r.metadata)




Query: What's the maternity leave policy?


  results = retriever.get_relevant_documents(query)


 Found 3 results

--- Result 1 ---
of paid maternity leave, while non-birth parents receive six weeks of paid paternity leave. adoptive parents receive eight weeks of paid leave that can be shared between both parents. employees must have been with the company for at least 12 months to qualify for paid parental leave, though unpaid l ...
Metadata: {'source': 'assets/documents/childcare-policy.pdf'}

--- Result 2 ---
for ﬁnding specialized care providers in the community. this policy is eﬀective as of [current date] and may be modiﬁed as business needs and legal requirements change. employees will receive 30 days advance notice of any signiﬁcant changes to childcare beneﬁts. for speciﬁc questions about your situ ...
Metadata: {'source': 'assets/documents/childcare-policy.pdf'}

--- Result 3 ---
launch periods, and other critical business periods that will be communicated to employees at least 60 days in advance. while we try to minimize blackout periods, these restrictions help ensure w

## Try different types of retrieval

Different similarity score threshold, mmr, top_k

In [6]:
queries = [
    "What's the maternity leave policy?",
    "What is the eligibility for Tuition Reimbursement"
]

# Define different retrieval configurations
retrieval_configs = [
    {"search_type": "similarity_score_threshold", "score_threshold": 0.3, "top_k": 3},
    {"search_type": "similarity_score_threshold", "score_threshold": 0.5, "top_k": 3},
    {"search_type": "mmr", "top_k": 5}
]

for config in retrieval_configs:
    print(f"\n--- Retrieval config: {config} ---")
    retriever = load_retriever_from_collection(
        collection_name="benefits_collection",
        search_type=config.get("search_type", "similarity_score_threshold"),
        score_threshold=config.get("score_threshold", 0.3),
        top_k=config.get("top_k", 5)
    )
    
    for query in queries:
        print(f"\nQuery: {query}")
        docs = retriever.get_relevant_documents(query)
        for i, doc in enumerate(docs, 1):
            print(f"\nResult {i}:")
            print(doc.page_content[:300], "...")
            print("Metadata:", doc.metadata)



--- Retrieval config: {'search_type': 'similarity_score_threshold', 'score_threshold': 0.3, 'top_k': 3} ---

Query: What's the maternity leave policy?

Result 1:
of paid maternity leave, while non-birth parents receive six weeks of paid paternity leave. adoptive parents receive eight weeks of paid leave that can be shared between both parents. employees must have been with the company for at least 12 months to qualify for paid parental leave, though unpaid l ...
Metadata: {'source': 'assets/documents/childcare-policy.pdf'}

Result 2:
for ﬁnding specialized care providers in the community. this policy is eﬀective as of [current date] and may be modiﬁed as business needs and legal requirements change. employees will receive 30 days advance notice of any signiﬁcant changes to childcare beneﬁts. for speciﬁc questions about your situ ...
Metadata: {'source': 'assets/documents/childcare-policy.pdf'}

Result 3:
launch periods, and other critical business periods that will be communicated to 

## Advanced RAG Methods

### Metadata Filtering 

Useful if we want to use only specific files for our answers. It is useful if we had different departments and they had different documents or different years. It all depends what metadata we can collect and what would be useful. In this case we only store different files.

In [7]:
def load_retriever_from_collection(
    collection_name: str,
    search_type: str = "similarity_score_threshold",
    score_threshold: float = 0.3,
    top_k: int = 5,
    metadata_filter: dict = None
):
    """
    Load a retriever from a Chroma collection with configurable retrieval behavior
    and optional metadata filtering.

    Args:
        collection_name (str): Name of the Chroma collection.
        search_type (str): Retrieval type ("similarity_score_threshold" or "mmr").
        score_threshold (float): Minimum similarity score for retrieval.
        top_k (int): Number of documents to return.
        metadata_filter (dict): Optional filter, e.g. {"source": "assets/documents/vacation-policy.pdf"}

    Returns:
        Retriever: Configured retriever.
    """
    collection = load_chroma_collection(name=collection_name, directory="./persist")
    
    retriever = collection.as_retriever(
        search_type=search_type,
        search_kwargs={
            "score_threshold": score_threshold,
            "k": top_k,
            "filter": metadata_filter  # <-- apply metadata filter
        }
    )
    return retriever


Example

In [8]:
queries = [
    "How long ahead do I need to request vacation for longer than 4 days?"
]

# Example: Only search documents from vacation policy folder
metadata_filter = {"source": "assets/documents/vacation-policy.pdf"}  

retriever = load_retriever_from_collection(
    collection_name="benefits_collection",
    search_type="similarity_score_threshold",
    score_threshold=0.3,
    top_k=3,
    metadata_filter=metadata_filter
)

for query in queries:
    print(f"\nQuery: {query}")
    docs = retriever.get_relevant_documents(query)
    for i, doc in enumerate(docs, 1):
        print(f"\nResult {i}:")
        print(doc.page_content[:300], "...")
        print("Metadata:", doc.metadata)



Query: How long ahead do I need to request vacation for longer than 4 days?

Result 1:
and approval process we ask that employees provide advance notice when requesting vacation time to ensure adequate coverage and minimize disruption to team projects and client commitments. for short absences of one to two days, we require at least one week of advance notice. requests for three to fo ...
Metadata: {'source': 'assets/documents/vacation-policy.pdf'}

Result 2:
feedbackand business needs. how far in advance can i schedule vacation time? there’s no limit on how far in advance you can request vacation, though we recommend not scheduling more than a year ahead to allow for potential changes in business needs or personal circumstances. what if i need more time ...
Metadata: {'source': 'assets/documents/vacation-policy.pdf'}

Result 3:
scheduled time oﬀ before approving. once approved by your manager, hr will conﬁrm that you have suﬃcient vacation balance available, and you’ll receive email 

### Query expansion 

Automatically expand your query with related terms to improve retrieval. The disadvantage is LLM-generated expansions cost API calls, however if we pregenarate expansions that could be useful however it requires domain knowledge. 

In [18]:
from openai import OpenAI 
client = OpenAI() 

def expand_query(query: str, n_terms: int = 5) -> list[str]: 
    """ Use LLM to generate related terms for query expansion. """ 
    prompt = f""" 
        Generate {n_terms} synonyms of the core word/phrase of the following query for use in document retrieval. 
        Keep them short, noun-phrases. Query: "{query}" """ 
    
    response = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role":"user","content": prompt}], max_tokens=100 ) 
    text = response.choices[0].message.content.strip() 
    
    return [t.strip("-• ") for t in text.split("\n") if t.strip()] 

query = "Maternity leave policy"

# 1. Generate expansions
exp_terms = expand_query(query, n_terms=5)
print("Expanded terms:", exp_terms)

# 2. Use expanded terms in retrieval
all_queries = [query] + exp_terms
results = []

# Create a retriever
retriever = collection.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.3, "k": 5}
)

for q in all_queries:
    docs = retriever.get_relevant_documents(q)
    results.extend(docs)

# 3. Deduplicate and display
unique_docs = {d.metadata["source"]: d for d in results}.values()
for i, doc in enumerate(unique_docs, 1):
    print(f"\nResult {i}:")
    print(doc.page_content[:300], "...")



Expanded terms: ['1. Parental leave guidelines', '2. Family leave policy', '3. Maternal leave rules', '4. Childbirth leave provisions', '5. Maternity leave regulations']

Result 1:
from the emergency childcare fund for occasional oﬀ-hours needs.eligibility and enrollment all full-time employees working 30 or more hours per week are eligible for our complete range of childcare beneﬁts. part-time employees working at least 20 hours per week qualify for on-site childcare, fsa ben ...

Result 2:
that any advanced time will be reconciled through future accruals or payroll deduction if employment ends before the time is earned. we also provide ﬂoating holidays speciﬁcally for religious and cultural observances that may not align with our standard company holidays. each employee receives two ﬂ ...


In [19]:
def retrieve_with_expanded_queries(
    collection_name: str,
    queries: List[str],
    search_type: str = "similarity_score_threshold",
    score_threshold: float = 0.3,
    top_k: int = 5,
    metadata_filter: dict = None
) -> List[Document]:
    """
    Retrieve relevant documents from a Chroma collection using one or more expanded queries.

    Args:
        collection_name (str): Name of the Chroma collection.
        queries (List[str]): List of queries, e.g., original query + expanded terms.
        search_type (str): Retrieval type ("similarity_score_threshold" or "mmr").
        score_threshold (float): Minimum similarity score.
        top_k (int): Number of documents to return per query.
        metadata_filter (dict): Optional metadata filter.

    Returns:
        List[Document]: Aggregated, deduplicated documents.
    """
    retriever = load_retriever_from_collection(
        collection_name=collection_name,
        search_type=search_type,
        score_threshold=score_threshold,
        top_k=top_k,
        metadata_filter=metadata_filter
    )
    
    results = []
    for q in queries:
        docs = retriever.get_relevant_documents(q)
        results.extend(docs)
    
    # Deduplicate by source or content
    unique_results = {d.metadata.get("source", d.page_content): d for d in results}
    return list(unique_results.values())

In [21]:
query = "Maternity leave policy"
expanded_terms = expand_query(query, n_terms=3)
all_queries = [query] + expanded_terms

docs = retrieve_with_expanded_queries(
    collection_name="benefits_collection",
    queries=all_queries,
    score_threshold=0.3,
    top_k=3
)

for i, doc in enumerate(docs, 1):
    print(f"\nResult {i}:")
    print(doc.page_content[:300], "...")
    print("Metadata:", doc.metadata)



Result 1:
6:30 pm. we recommend arranging for an authorized contact to pick up your child, or using our backup care services for extended days. are there childcare options if i travel for work? our backup care network may have options in other cities, and you can use the emergency childcare fund for travel-re ...
Metadata: {'source': 'assets/documents/childcare-policy.pdf'}

Result 2:
that any advanced time will be reconciled through future accruals or payroll deduction if employment ends before the time is earned. we also provide ﬂoating holidays speciﬁcally for religious and cultural observances that may not align with our standard company holidays. each employee receives two ﬂ ...
Metadata: {'source': 'assets/documents/vacation-policy.pdf'}


### HyDE - Generate a “hypothetical answer” 

for the query, then retrieve documents closest to that answer. It is useful when the queries are short or ambiguous: If "Maternity leave policy" is too short, the vector search might miss relevant docs. It is useful for semantic retrieval in dense embeddings collections.

In [15]:
from langchain.chat_models import ChatOpenAI

query = "Maternity leave policy"
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Step 1: Generate hypothetical answer
prompt = f"Generate a concise hypothetical answer to this question: '{query}'"
hypothetical_answer = llm.predict(prompt)

# Print the hypothetical answer
print("Hypothetical answer generated by LLM:")
print(hypothetical_answer)

# Step 2: Retrieve documents using embedding of the hypothetical answer
embedding_fn = OpenAIEmbeddings()
hypothetical_vector = embedding_fn.embed_query(hypothetical_answer)

# 3. Retrieve relevant documents directly from Chroma using the vector
retriever = load_chroma_collection("benefits_collection", "./persist")
docs = retriever.similarity_search_by_vector(hypothetical_vector, k=3)

for i, doc in enumerate(docs, 1):
    print(f"\nResult {i}:") 
    print(doc.page_content[:200], "...")


Hypothetical answer generated by LLM:
A maternity leave policy typically outlines the amount of time off and benefits available to employees who are expecting or have recently given birth.

Result 1:
of paid maternity leave, while non-birth parents receive six weeks of paid paternity leave. adoptive parents receive eight weeks of paid leave that can be shared between both parents. employees must h ...

Result 2:
for ﬁnding specialized care providers in the community. this policy is eﬀective as of [current date] and may be modiﬁed as business needs and legal requirements change. employees will receive 30 days  ...

Result 3:
launch periods, and other critical business periods that will be communicated to employees at least 60 days in advance. while we try to minimize blackout periods, these restrictions help ensure we can ...
