# Contextual Semantic Search with Chroma, LiteLLM, and CRAG

This notebook demonstrates how to build a **Contextual Semantic Search** system using the **Chroma Vector Database**, **LiteLLM** with the **Gemini model**, and an integrated **Corrective Retrieval-Augmented Generation (CRAG)** pipeline. The system processes a folder of PDF documents to build a knowledge base, retrieves contextually relevant information, evaluates the relevance of the retrieved documents, and, if none of them is relevant, falls back to live web search results from DuckDuckGo. The goal is to ground final answers in context that has passed a relevance check and, where needed, in current web results. Below is an overview of the tools and methods used.

---

## Tools Used

1. **Chroma Vector Database**:
   - A vector database designed for storing and querying embeddings.
   - Enables efficient similarity search for retrieving semantically relevant documents.

2. **Sentence Transformers**:
   - A library for generating high-quality embeddings (vector representations) of text.
   - Uses pre-trained models like `all-MiniLM-L6-v2` to convert text into embeddings.

3. **LangChain**:
   - A framework for processing and splitting text data.
   - The `RecursiveCharacterTextSplitter` is used to break text into chunks, preferring to split at paragraph, line, and word boundaries.

4. **LiteLLM**:
   - A lightweight library for interacting with large language models (LLMs).
   - Used here to call the **Gemini model** for both evaluating document relevance and generating responses.

5. **PyPDF2**:
   - A library for extracting text from PDF files.
   - Used to process all PDFs in a specified folder.

6. **DuckDuckGoSearchRun**:
   - A tool from the `langchain-community` package for performing live web searches.
   - Provides supplementary retrieval to enhance context when initial documents are insufficient.

---

## Methods and Steps

### Step 1: Extract Text from PDFs
- **Process:**  
  All PDF files in a specified folder are processed using **PyPDF2**.  
- **Result:**  
  The text from each PDF is extracted and combined into a single string.

### Step 2: Split Text into Chunks
- **Process:**  
  The combined text is split into smaller, overlapping chunks using **LangChain's RecursiveCharacterTextSplitter**, which prefers paragraph, line, and word boundaries.  
- **Result:**  
  Each chunk is small enough to embed, and the overlap preserves context across chunk boundaries.

### Step 3: Generate Embeddings
- **Process:**  
  Each text chunk is converted into an embedding using **Sentence Transformers**.  
- **Result:**  
  The embeddings capture the semantic meaning of each chunk.

### Step 4: Build the Knowledge Base
- **Process:**  
  The embeddings and corresponding text chunks are stored in **Chroma Vector Database**.  
- **Result:**  
  This enables efficient similarity searches based on semantic meaning.

### Step 5: Perform Semantic Search
- **Process:**  
  A user query is converted into an embedding, and the top‑k semantically similar chunks are retrieved from Chroma.  
- **Result:**  
  The system identifies candidate documents that are contextually related to the query.
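
The ranking idea behind Step 5 can be sketched without any vector database. The example below is a minimal illustration using cosine similarity on toy 2-dimensional vectors; the names `cosine_similarity`, `top_k`, and `toy_docs` are hypothetical and not part of the notebook. Chroma computes distances internally, and the metric it uses depends on how the collection is configured, so treat this as a conceptual sketch rather than a description of Chroma's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": documents 0 and 2 point in nearly the same direction as the query.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_docs, k=2))
```

Real embeddings from `all-MiniLM-L6-v2` have many more dimensions, but the ranking step is conceptually the same.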

### Step 6: Evaluate Retrieved Documents (CRAG Component)
- **Process:**  
  Each retrieved document is evaluated using the **Gemini model** (via LiteLLM) to determine its relevance.  
- **Result:**  
  Only documents that pass this evaluation are kept for answer generation, reducing noise and potential hallucinations.

### Step 7: Supplement with Web Search (CRAG Component)
- **Process:**  
  If none of the retrieved documents passes the relevance check, a live web search is performed using **DuckDuckGoSearchRun**.  
- **Result:**  
  The web search result becomes the context, giving the model access to more current information when the knowledge base falls short.

### Step 8: Generate Response
- **Process:**  
  The refined context (relevant chunks from vector retrieval, or the web search result) is passed to the **Gemini model** via LiteLLM.  
- **Result:**  
  The model generates a final, contextually accurate answer to the user's query.

---

## Workflow Overview

1. **Input**: A folder containing PDF documents.
2. **Processing**:
   - Extract text from PDFs.
   - Split text into chunks.
   - Generate embeddings and store them in Chroma.
3. **Query Handling**:
   - Convert the query into an embedding.
   - Retrieve the most relevant chunks from Chroma.
   - Evaluate the retrieved documents using the Gemini model.
   - If needed, perform a supplementary web search using DuckDuckGo.
   - Combine the refined context.
   - Generate a response using the Gemini model.
4. **Output**: An answer to the user's query that is grounded in relevance-checked context.
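
The query-handling flow above can be sketched as a single function whose components are passed in as callables. This is an illustrative outline under stated assumptions, not the notebook's implementation: `corrective_answer` and the `fake_*` stubs are hypothetical stand-ins for the vector search, the Gemini-based grader, the DuckDuckGo search, and the Gemini generation step.

```python
def corrective_answer(query, retrieve, grade, web_search, generate):
    """Retrieve, keep only documents graded "yes", fall back to web search if none pass."""
    docs = retrieve(query)
    relevant = [doc for doc in docs if grade(query, doc) == "yes"]
    used_web = not relevant
    if used_web:
        # No document passed the relevance check: use the web result as context.
        relevant = [web_search(query)]
    context = "\n".join(relevant)
    return generate(query, context), used_web


# Hypothetical stubs so the control flow can be run without any API keys.
def fake_retrieve(query):
    return ["Collision coverage pays for damage to your car.", "Recipe for pancakes."]


def fake_grade(query, doc):
    return "yes" if "car" in doc.lower() else "no"


def fake_web_search(query):
    return f"web result for: {query}"


def fake_generate(query, context):
    return f"Q: {query} | context: {context}"


answer, used_web = corrective_answer(
    "What is car insurance?", fake_retrieve, fake_grade, fake_web_search, fake_generate
)
print(answer, used_web)
```

Passing the components as arguments keeps the branching logic testable on its own, separately from the model and search calls.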

---

## Why This Approach?

- **Contextual Semantic Search**:  
  Captures the meaning behind the query and retrieves semantically related documents rather than relying on keyword matching.

- **Corrective Mechanism (CRAG)**:  
  Integrates a lightweight evaluator to verify document relevance and uses live web search to supplement missing or inaccurate information, reducing hallucinations.

- **Efficient Retrieval**:  
  Chroma enables fast and scalable similarity searches using vector embeddings.

- **High-Quality Responses**:  
  By filtering out irrelevant documents and integrating real-time data, the Gemini model generates accurate and contextually appropriate answers.

---

Let’s get started! Explore each section of the notebook for detailed implementation and see the pipeline in action.

## Library Installation

In [None]:
!pip install -q chromadb pypdf2 sentence-transformers litellm langchain langchain-community duckduckgo-search


## Import the Libraries and set the environment variables

In [None]:
import os
import PyPDF2
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import litellm
from litellm import completion
from langchain.text_splitter import RecursiveCharacterTextSplitter
import numpy as np
from langchain_community.tools import DuckDuckGoSearchRun

# Set environment variables. Uncomment these lines if you want to set them directly.
# os.environ["HUGGINGFACE_TOKEN"] = "your_huggingface_token_here"
# os.environ["GEMINI_API_KEY"] = "your_gemini_api_key_here"
os.environ['LITELLM_LOG'] = 'DEBUG'

# Retrieve environment variables (GEMINI_API_KEY is passed to the completion calls below)
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

## Extract Text from folder containing PDF files

In [None]:
def extract_text_from_pdfs(folder_path):
    # Concatenate the text of every page of every PDF in the folder
    all_text = ""
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                for page in reader.pages:
                    # Guard against pages that yield no text (e.g. scanned images)
                    all_text += page.extract_text() or ""
    return all_text

pdf_folder = "dataset"
all_text = extract_text_from_pdfs(pdf_folder)

## Text Splitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Size of each chunk
    chunk_overlap=50,  # Overlap between chunks to maintain context
    separators=["\n\n", "\n", " ", ""]  # Splitting hierarchy
)

chunks = text_splitter.split_text(all_text)
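
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` additionally tries the separators in order so that chunks tend to end at paragraph, line, or word boundaries. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 character with the previous one.
demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

The shared characters at each boundary are what lets a sentence that straddles two chunks remain at least partly visible in both.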

## Set up the Knowledge Base with ChromaDB and Generate Embeddings with sentence-transformers

In [None]:
# Initialize a persistent ChromaDB client
client = chromadb.PersistentClient(path="chroma_db")

# Load the SentenceTransformer model for text embeddings
text_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Delete existing collection (if needed)
try:
    client.delete_collection(name="knowledge_base")
    print("Deleted existing collection: knowledge_base")
except Exception as e:
    print(f"Collection does not exist or could not be deleted: {e}")

# Create a new collection for text embeddings
collection = client.create_collection(name="knowledge_base")

# Add text chunks to the collection
for i, chunk in enumerate(chunks):
    # Generate embeddings for the chunk
    embedding = text_embedding_model.encode(chunk)

    # Add to the collection with metadata
    collection.add(
        ids=[f"chunk_{i}"],  # Unique ID for each chunk
        embeddings=[embedding.tolist()],  # Embedding vector
        metadatas=[{"source": "pdf", "chunk_id": i}],  # Metadata
        documents=[chunk]  # Original text
    )

Collection does not exist or could not be deleted: Collection knowledge_base does not exist.
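
The indexing loop above encodes and adds one chunk at a time. Since both `SentenceTransformer.encode` and `collection.add` accept lists, the work can also be grouped into batches. The sketch below only prepares the per-batch arguments, using the same ID and metadata scheme as the loop; `build_batches` and `batch_size` are hypothetical names, and the embedding step is left to the caller.

```python
def build_batches(chunks, batch_size=100):
    """Yield the ids, documents, and metadatas for one `collection.add` call per batch."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        indices = range(start, start + len(batch))
        yield {
            "ids": [f"chunk_{i}" for i in indices],
            "documents": batch,
            "metadatas": [{"source": "pdf", "chunk_id": i} for i in indices],
        }


# Example with three tiny chunks and a batch size of 2.
batches = list(build_batches(["a", "b", "c"], batch_size=2))
print(batches)
```

In the notebook's setting, each batch could then be completed with something like `embeddings=text_embedding_model.encode(batch["documents"]).tolist()` before calling `collection.add(**batch, embeddings=...)`; that usage is an assumption about how the pieces would be combined, not code from the original cells.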


## Perform Semantic Search with ChromaDB and Embedding Model

In [None]:
def semantic_search(query, top_k=2):
    # Generate embedding for the query
    query_embedding = text_embedding_model.encode(query)

    # Query the collection
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    return results

# Example query
query = "What is the insurance for car?"
results = semantic_search(query)

# Display results
for i, result in enumerate(results['documents'][0]):
    print(f"Result {i+1}: {result}\n")

Result 1: insurance) 
FINANCIAL RESPONSIBILITY LAW
A state law requiring that all automobile 
drivers show proof that they can pay dam-
ages up to a minimum amount if involved 
in an auto accident. Varies from state to 
state but can be met by carrying a mini-
mum amount of auto liability insurance. 
(See Compulsory auto insurance)
FINITE RISK REINSURANCE
Contract under which the ultimate li-
ability of the reinsurer is capped and on 
which anticipated investment income is

Result 2: policyholder’s car from a collision. 
5. Comprehensive, for damage to the poli-
cyholder’s car not involving a collision 
with another car (including damage 
from fire, explosions, earthquakes, floods, and riots), and theft. 6. Uninsured motorists coverage, for costs 
resulting from an accident involving a hit-and-run driver or a driver who does not have insurance. 
AUTO INSURANCE PREMIUM
The price an insurance company charges 
for coverage, based on the frequency and



## Generate Response Based on Semantic Search

In [None]:
# Set up LiteLLM with Gemini

def generate_response(query, context):
    # Combine the query and context for the prompt
    prompt = f"Query: {query}\nContext: {context}\nAnswer:"

    # Call the Gemini model via LiteLLM
    response = completion(
        model="gemini/gemini-1.5-flash",  # Use the Gemini model
        messages=[{"content": prompt, "role": "user"}],
        api_key=GEMINI_API_KEY
    )

    # Extract and return the generated text
    return response['choices'][0]['message']['content']

# Retrieve the top results from semantic search
search_results = semantic_search(query)
context = "\n".join(search_results['documents'][0])

# Generate a response using the retrieved context
response = generate_response(query, context)
print("Generated Response:\n", response)

Generated Response:
 Based on the provided text, car insurance can include several types of coverage:

1. **Auto liability insurance:** This covers damages to other people and their property if you cause an accident.  The minimum amount required varies by state.

2. **Collision:** This covers damage to your car resulting from a collision with another car.

3. **Comprehensive:** This covers damage to your car not caused by a collision, such as fire, theft, or natural disasters.

4. **Uninsured motorists coverage:** This covers costs if you're involved in an accident with a hit-and-run driver or an uninsured driver.



## Document Grading Function

In [None]:
def grade_document(query, document):
    # Uses the Gemini model to decide if a document is relevant to the query.
    prompt = f"""Query: {query}
Document: {document}
Is this document relevant to the query? Answer with "yes" or "no"."""
    response = completion(
        model="gemini/gemini-1.5-flash",
        messages=[{"content": prompt, "role": "user"}],
        api_key=GEMINI_API_KEY
    )
    answer = response['choices'][0]['message']['content'].strip().lower()
    return "yes" if "yes" in answer else "no"
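
The substring check `"yes" in answer` also matches replies in which "yes" appears inside another word, such as "No, this covers yesterday's news". A stricter variant looks only at the first word of the reply. The helper below is a hypothetical sketch of that idea (`parse_grade` is not part of the notebook) and assumes the model answers with "yes" or "no" at the start, as the prompt requests.

```python
import re


def parse_grade(answer):
    """Return "yes" only when the first word of the reply is "yes", otherwise "no"."""
    words = re.findall(r"[a-z]+", answer.lower())
    return "yes" if words and words[0] == "yes" else "no"


for reply in ["Yes.", " yes, it is relevant", "No, this covers yesterday's news", ""]:
    print(repr(reply), "->", parse_grade(reply))
```

Treating anything other than a leading "yes" as "no" errs on the side of discarding a document, which in this pipeline means falling back to web search more often.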


## Supplementary Retrieval Function (using DuckDuckGo)

In [None]:
def supplementary_retrieval(query):
    # Performs a web search using DuckDuckGo and returns the result as a string.
    search_tool = DuckDuckGoSearchRun()
    web_result = search_tool.invoke(query)
    return web_result
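
A live web search can fail, for example because of network errors or rate limiting. The sketch below wraps any search callable so that a failure yields a fallback string instead of stopping the pipeline. `search_with_fallback` is a hypothetical helper; in this notebook the callable would presumably be `DuckDuckGoSearchRun().invoke`.

```python
def search_with_fallback(search_fn, query, fallback=""):
    """Call `search_fn(query)`; on any error, report it and return `fallback`."""
    try:
        return search_fn(query)
    except Exception as exc:  # broad on purpose: the fallback should cover any search failure
        print(f"Web search failed: {exc}")
        return fallback


# Stand-in search functions to show both outcomes without network access.
def working_search(query):
    return f"results for {query}"


def broken_search(query):
    raise RuntimeError("rate limited")


print(search_with_fallback(working_search, "car insurance"))
print(repr(search_with_fallback(broken_search, "car insurance")))
```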

## Corrective-RAG Pipeline

In [None]:
def corrective_rag(query, top_k=2):
    # The main CRAG pipeline:
    #   1. Retrieve documents using semantic search.
    #   2. Grade each document using the evaluator.
    #   3. If no relevant document is found, perform a web search.

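    # Step 1: Retrieve documents
    results = semantic_search(query, top_k=top_k)
    # Chroma returns one list of documents per query embedding, so with a single
    # query this is a one-element list whose item holds the top_k chunks.
    retrieved_docs = results.get("documents", [])
    print("Initial retrieved documents:")
    for doc in retrieved_docs:
        print(doc)

    # Step 2: Grade the retrieved documents for relevance
    # (each `doc` here is the list of chunks for one query, so they are graded together)
    relevant_docs = []
    for doc in retrieved_docs:
        grade = grade_document(query, doc)
        print(f"Grading document (first 60 chars): {doc[:60]}... => {grade}")
        if grade == "yes":
            relevant_docs.append(doc)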

    # Step 3: If no relevant document is found, perform supplementary retrieval
    if not relevant_docs:
        print("No relevant documents found; performing supplementary retrieval via web search.")
        supplementary_doc = supplementary_retrieval(query)
        relevant_docs.append(supplementary_doc)
    else:
        print("Using relevant documents from the vector store.")

    # Ensure all elements in relevant_docs are strings
    context = "\n".join([" ".join(doc) if isinstance(doc, list) else doc for doc in relevant_docs])

    # Step 4: Generate final answer using the combined context.
    final_answer = generate_response(query, context)
    return final_answer
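
Chroma's `query` returns `documents` as one list per query embedding, so in the pipeline above the loop sees a single inner list holding all retrieved chunks. To grade each chunk on its own, the nested result could first be flattened. The helper below is a hypothetical sketch (`flatten_documents` is not part of the notebook), and `sample_results` only mimics the shape of a Chroma result.

```python
def flatten_documents(results):
    """Flatten Chroma-style nested `documents` into a flat list of chunk strings."""
    flat = []
    for per_query_docs in results.get("documents") or []:
        flat.extend(per_query_docs)
    return flat


# Same shape as a single-query result: one inner list with the retrieved chunks.
sample_results = {"documents": [["chunk about liability", "chunk about premiums"]]}
for doc in flatten_documents(sample_results):
    print(doc[:60])
```

With a flat list, the `doc[:60]` preview in the grading loop would slice each chunk's text rather than the list of chunks.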


In [None]:
query = "What is the insurance for car?"
final_answer = corrective_rag(query)
print("Final Answer:")
print(final_answer)

Initial retrieved documents:
['insurance) \nFINANCIAL RESPONSIBILITY LAW\nA state law requiring that all automobile \ndrivers show proof that they can pay dam-\nages up to a minimum amount if involved \nin an auto accident. Varies from state to \nstate but can be met by carrying a mini-\nmum amount of auto liability insurance. \n(See Compulsory auto insurance)\nFINITE RISK REINSURANCE\nContract under which the ultimate li-\nability of the reinsurer is capped and on \nwhich anticipated investment income is', 'policyholder’s car from a collision. \n5. Comprehensive, for damage to the poli-\ncyholder’s car not involving a collision \nwith another car (including damage \nfrom fire, explosions, earthquakes, floods, and riots), and theft. 6. Uninsured motorists coverage, for costs \nresulting from an accident involving a hit-and-run driver or a driver who does not have insurance. \nAUTO INSURANCE PREMIUM\nThe price an insurance company charges \nfor coverage, based on the frequency and']
Gra