# 📓 The GenAI Revolution Cookbook

**Title:** 5 Essential Steps to Building Agentic RAG Systems with LangChain and ChromaDB

**Description:** Unlock the power of agentic RAG systems with LangChain and ChromaDB. Follow these steps to enhance AI adaptability and relevance in real-world applications.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Introduction
I'll be honest - when I first started working with RAG systems, I thought they were pretty impressive. But then I kept running into the same frustrating problem: they were essentially just fancy search engines. They'd retrieve documents, sure, but they couldn't really *think* about when to retrieve them, or figure out if they even needed to search at all. That's where agentic RAG comes in.

Agentic retrieval-augmented generation systems are a whole different beast. Unlike traditional RAG that mechanically retrieves information every single time, these systems can actually make decisions. They determine when they need to look something up, how to break down complex questions, and can even use multiple tools to get you the answer you need. It's like the difference between a library catalog and having an actual research assistant.

In this tutorial, I'm going to show you exactly how to build one of these systems using <a href="https://python.langchain.com/">LangChain</a> and <a href="https://www.trychroma.com/">ChromaDB</a>. By the time we're done, you'll know how to:

- Set up LangChain agents with ChromaDB for vector storage
- Build in the logic for autonomous decision-making
- Create multi-step reasoning workflows (this is where it gets really interesting)
- Optimize your retrieval with caching and reranking
- Test everything properly
- Get your system ready for actual production use

This guide is for AI builders who want to go beyond basic RAG. If you're interested in diving deeper into customization for specific domains, you might find our guide on <a href="/article/mastering-domain-specific-llm-customization-techniques-and-tools-unveiled">customizing LLMs for domain-specific applications</a> helpful.

## Setup & Installation
Let's start with the basics - getting everything installed and configured. We need LangChain for orchestrating our agents, ChromaDB for vector storage, and OpenAI for embeddings and the language model itself.

In [None]:
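# Install necessary packages
# tiktoken is needed by OpenAIEmbeddings for token counting.
# The imports in this notebook follow the pre-1.0 LangChain API
# (AgentExecutor, LLMChain), so langchain is pinned below 1.0 here.
!pip install "langchain<1.0" langchain-community chromadb openai tiktoken pypdf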

In [None]:
import os

# Set up environment variables for API keys
# Replace these with your actual API keys
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

# Import necessary libraries
import langchain
import chromadb

# Verify installation by printing versions
print("LangChain version:", langchain.__version__)
print("ChromaDB version:", chromadb.__version__)

## Building the Knowledge Base
Here's the thing about RAG systems - they're only as good as their knowledge base. So we need to be really careful about how we set this up. We'll load documents, chunk them properly, create embeddings, and index everything in ChromaDB.

### Loading and Processing Documents

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from chromadb import Client
from chromadb.config import Settings

# Load documents from a PDF file
# Replace "sample.pdf" with your actual document path
loader = PyPDFLoader("sample.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages from the document")

### Chunking Strategy
Now, chunking is more important than you might think. Too big and you lose precision. Too small and you lose context. I've found that 500 characters with some overlap works pretty well for most use cases.

In [None]:
# Split text into manageable chunks
# chunk_size: Maximum characters per chunk (affects retrieval granularity)
# chunk_overlap: Characters shared between chunks (preserves context)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} text chunks")
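
To make the size/overlap trade-off concrete, here's a minimal sketch of fixed-window chunking in plain Python. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally (the real splitter tries the separators first and only falls back to raw characters). `sliding_window_chunks` is a hypothetical helper made up for this illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return windows

sample_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
for piece in demo_chunks:
    print(piece)
```

Each chunk starts with the last three characters of the previous one. That is what `chunk_overlap` buys you: text near a boundary shows up in both neighbors, so a sentence cut in half is still retrievable from at least one of them.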

### Indexing in ChromaDB

In [None]:
# Initialize ChromaDB client
chroma_client = Client(Settings(
    anonymized_telemetry=False,
    allow_reset=True
))

# Create a collection for storing document embeddings
# get_or_create_collection lets this cell be re-run without a "collection already exists" error
collection_name = "knowledge_base"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    metadata={"description": "Agentic RAG knowledge base"}
)

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()

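# Add documents to ChromaDB with embeddings
# embed_documents sends the chunks in batches instead of one API call per chunk.
# For very large corpora, add in slices to stay under ChromaDB's maximum batch size.
texts = [chunk.page_content for chunk in chunks]
vectors = embeddings.embed_documents(texts)
collection.add(
    embeddings=vectors,
    documents=texts,
    metadatas=[chunk.metadata for chunk in chunks],
    ids=[f"doc_{i}" for i in range(len(chunks))]
)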

print(f"Indexed {len(chunks)} chunks in ChromaDB")
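
One thing to watch with positional IDs like `doc_0`, `doc_1`: if the source document changes, the same ID can end up pointing at different text. A possible alternative, sketched below under my own assumptions, is to derive the ID from the chunk's origin and content. `stable_chunk_id` is a hypothetical helper, not part of ChromaDB or LangChain.

```python
import hashlib

def stable_chunk_id(source: str, page: int, text: str) -> str:
    """Build an ID from the chunk's origin and content so re-indexing is repeatable."""
    payload = f"{source}|{page}|{text}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"chunk_{digest[:16]}"  # a short prefix keeps IDs readable

id_a = stable_chunk_id("sample.pdf", 0, "Agentic RAG decides when to retrieve.")
id_b = stable_chunk_id("sample.pdf", 0, "Agentic RAG decides when to retrieve.")
id_c = stable_chunk_id("sample.pdf", 1, "Agentic RAG decides when to retrieve.")

print(id_a == id_b)  # same origin and text give the same ID
print(id_a == id_c)  # a different page gives a different ID
```

IDs built this way could be passed as the `ids` argument when adding chunks, so indexing the same file twice produces the same identifiers instead of a second set of entries.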

### Testing Basic Retrieval
Before we get fancy, let's make sure our basic retrieval actually works:

In [None]:
# Test basic similarity search to verify setup
query = "What is agentic RAG?"
query_embedding = embeddings.embed_query(query)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

print("Search results:")
for i, doc in enumerate(results['documents'][0]):
    print(f"\nResult {i+1}:")
    print(doc[:200] + "...")
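
If you're wondering what "similarity search" is actually doing, here's a toy version with hand-made three-dimensional vectors. It uses cosine similarity purely as an illustration; the ranking ChromaDB produces depends on the distance metric configured for the collection, and real embeddings have far more dimensions. All names and numbers below are made up.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings" for three stored chunks
toy_index = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.7, 0.7, 0.0],
    "doc_2": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.1, 0.0]

ranked_ids = sorted(
    toy_index,
    key=lambda doc_id: cosine_similarity(toy_query, toy_index[doc_id]),
    reverse=True,
)
print(ranked_ids)
```

The query vector points almost exactly along `doc_0`, partly along `doc_1`, and not at all along `doc_2`, which is the order the ranking returns.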

## Implementing the Agentic System
Alright, now we get to the interesting part. This is where we build the actual "agent" capabilities. LangChain gives us a solid framework for this, and ChromaDB handles our vector storage efficiently. Actually, if you're curious about the technical details of fine-tuning these models, check out our breakdown of <a href="/article/mastering-fine-tuning-of-large-language-models-with-hugging-face">fine-tuning large language models with Hugging Face Transformers</a>.

### Creating the Retrieval Tool

In [None]:
from langchain.tools import Tool
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent
from langchain.prompts import PromptTemplate

def retrieve_documents(query: str) -> str:
    """
    Retrieve relevant documents from ChromaDB based on the query.
    
    Args:
        query: The search query string
        
    Returns:
        Concatenated text from top retrieved documents
    """
    query_embedding = embeddings.embed_query(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    
    # Combine retrieved documents
    retrieved_text = "\n\n".join(results['documents'][0])
    return retrieved_text

# Create a LangChain tool for retrieval
retrieval_tool = Tool(
    name="Knowledge_Base_Search",
    func=retrieve_documents,
    description="Searches the knowledge base for relevant information. Use this when you need to find specific facts or context from the documents."
)

### Building the Agent with Decision-Making Logic
Here's where things get really interesting. We're creating an agent that can actually decide whether it needs to search for information or not:

In [None]:
# Initialize the language model
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

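# Define the agent prompt with reasoning capabilities
# create_react_agent requires the {tools}, {tool_names}, and {agent_scratchpad} variables
agent_prompt = PromptTemplate.from_template("""
You are an intelligent assistant with access to a knowledge base.

Answer the following question by deciding whether you need to retrieve information or can answer directly.

Available tools:
{tools}

Use the following format:

Question: the input question you must answer
Thought: think about whether you need to search the knowledge base
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

If no search is needed, skip Action and go straight to Final Answer.

Question: {input}
Thought: {agent_scratchpad}
""")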

# Create the ReAct agent
tools = [retrieval_tool]
agent = create_react_agent(llm, tools, agent_prompt)

# Create an agent executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=3,
    handle_parsing_errors=True
)
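
A ReAct-style agent works by having the model write plain text that either names a tool to call or states a final answer, and the executor parses that text on every turn. The sketch below is a simplified stand-in for that parsing step, written from scratch for illustration. It is not LangChain's actual parser, and `parse_react_step` is a hypothetical name.

```python
import re

def parse_react_step(llm_text: str) -> dict:
    """Classify one model turn as a tool call, a final answer, or unparseable."""
    final = re.search(r"Final Answer:\s*(.*)", llm_text, re.DOTALL)
    if final:
        return {"type": "finish", "output": final.group(1).strip()}
    action = re.search(r"Action:\s*(.*?)\s*\nAction Input:\s*(.*)", llm_text, re.DOTALL)
    if action:
        return {
            "type": "tool",
            "tool": action.group(1).strip(),
            "tool_input": action.group(2).strip(),
        }
    # Neither pattern matched: the model ignored the expected format
    return {"type": "parse_error", "output": llm_text}

tool_turn = (
    "Thought: I should look this up.\n"
    "Action: Knowledge_Base_Search\n"
    "Action Input: agentic RAG definition"
)
final_turn = (
    "Thought: I now know the final answer\n"
    "Final Answer: Agentic RAG lets the model choose when to retrieve."
)

print(parse_react_step(tool_turn))
print(parse_react_step(final_turn))
print(parse_react_step("I think the answer is 42")["type"])
```

The third case is the one `handle_parsing_errors=True` exists for: models sometimes drift from the format, and `max_iterations` keeps a confused agent from looping forever.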

### Implementing Autonomous Decision-Making
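Now, this is the part that really makes it "agentic" - the system decides whether to retrieve information. Here a simple keyword heuristic acts as a first-pass router: queries that look informational go to the agent (which can still choose not to call the search tool), and everything else goes straight to the LLM: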

In [None]:
def decide_and_respond(query: str) -> str:
    """
    Autonomous decision-making function that determines whether to retrieve
    information based on query complexity and context.
    
    Args:
        query: The user's question
        
    Returns:
        The agent's response
    """
    # Keywords that indicate need for retrieval
    retrieval_indicators = ["what", "how", "explain", "describe", "details", "specific"]
    
    # Simple heuristic: check if query contains retrieval indicators
    needs_retrieval = any(indicator in query.lower() for indicator in retrieval_indicators)
    
    if needs_retrieval:
        # Use agent with retrieval capabilities
        response = agent_executor.invoke({"input": query})
        return response["output"]
    else:
        # Direct response without retrieval
        # (invoke returns a message object; .content holds the text)
        response = llm.invoke(query)
        return response.content

# Test the decision-making logic
test_queries = [
    "What is agentic RAG?",
    "Hello, how are you?",  # contains "how", so the heuristic still routes it to the agent
    "Explain the key components of the system"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    print(f"Response: {decide_and_respond(query)}")
    print("-" * 80)
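
It's worth seeing where a keyword heuristic like this goes wrong. The sketch below compares substring matching with whole-word matching on a few made-up queries. Both function names are hypothetical, and the whole-word variant is just one possible refinement, not a recommendation from LangChain.

```python
import re

INDICATORS = {"what", "how", "explain", "describe", "details", "specific"}

def needs_retrieval_substring(query: str) -> bool:
    """Substring check: 'how' also matches inside 'show'."""
    lowered = query.lower()
    return any(word in lowered for word in INDICATORS)

def needs_retrieval_tokens(query: str) -> bool:
    """Whole-word check: only complete words count as indicators."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    return bool(tokens & INDICATORS)

sample_queries = [
    "What is agentic RAG?",
    "Show me a joke",
    "Whatever works for you",
    "Hello, how are you?",
]

for q in sample_queries:
    print(f"{q!r}: substring={needs_retrieval_substring(q)}, tokens={needs_retrieval_tokens(q)}")
```

Whole-word matching removes the false positives from "show" and "whatever", but "Hello, how are you?" still looks like a retrieval query to both versions. That's the limit of keyword routing, and it's why the final call is better left to the agent itself.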

### Multi-Step Reasoning Implementation
This is honestly my favorite part. When you get a really complex question, the system breaks it down into smaller pieces, answers each one, then synthesizes everything together. It's like watching someone actually think through a problem:

In [None]:
from langchain.chains import LLMChain

def multi_step_reasoning(complex_query: str) -> dict:
    """
    Perform multi-step reasoning for complex queries by breaking them down
    into sub-questions and aggregating results.
    
    Args:
        complex_query: A complex question requiring multiple reasoning steps
        
    Returns:
        Dictionary containing reasoning steps and final answer
    """
    # Step 1: Decompose the query into sub-questions
    decomposition_prompt = PromptTemplate(
        input_variables=["query"],
        template="Break down this complex question into 2-3 simpler sub-questions:\n{query}\n\nSub-questions:"
    )
    
    decomposition_chain = LLMChain(llm=llm, prompt=decomposition_prompt)
    sub_questions = decomposition_chain.run(complex_query)
    
    print("Sub-questions identified:")
    print(sub_questions)
    
    # Step 2: Answer each sub-question
    sub_answers = []
    for sub_q in sub_questions.split("\n"):
        if sub_q.strip():
            answer = decide_and_respond(sub_q.strip())
            sub_answers.append({
                "question": sub_q.strip(),
                "answer": answer
            })
    
    # Step 3: Synthesize final answer
    synthesis_prompt = PromptTemplate(
        input_variables=["original_query", "sub_answers"],
        template="""
        Original question: {original_query}
        
        Sub-question answers:
        {sub_answers}
        
        Provide a comprehensive final answer to the original question:
        """
    )
    
    synthesis_chain = LLMChain(llm=llm, prompt=synthesis_prompt)
    final_answer = synthesis_chain.run(
        original_query=complex_query,
        sub_answers="\n\n".join([f"Q: {sa['question']}\nA: {sa['answer']}" for sa in sub_answers])
    )
    
    return {
        "sub_questions": sub_questions,
        "sub_answers": sub_answers,
        "final_answer": final_answer
    }

# Test multi-step reasoning
complex_query = "How does agentic RAG improve upon traditional RAG systems and what are the key implementation challenges?"
result = multi_step_reasoning(complex_query)

print("\n" + "="*80)
print("FINAL ANSWER:")
print(result["final_answer"])
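
One practical wrinkle: models rarely return sub-questions as clean lines. You often get an intro sentence, numbering, or bullets, and splitting on newlines alone would treat all of those as questions. Here's a small sketch of a cleanup step, assuming the model puts one question per line and ends each with a question mark. `parse_sub_questions` and the sample output are hypothetical.

```python
import re

def parse_sub_questions(raw: str, max_questions: int = 3) -> list[str]:
    """Strip list markers and keep only lines that look like questions."""
    questions = []
    for line in raw.splitlines():
        # Remove leading "1.", "2)", "-", "*", or a bullet character
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions[:max_questions]

raw_output = """Here are the sub-questions:
1. How does agentic RAG differ from traditional RAG?
2) What decisions does the agent make?
- What are the main implementation challenges?
"""

parsed_questions = parse_sub_questions(raw_output)
for sub_question in parsed_questions:
    print(sub_question)
```

The intro line is dropped because it doesn't end in a question mark, and the cap on `max_questions` keeps a chatty model from triggering more agent calls than you planned for.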

## Optimization, Testing, and Production Readiness
Let me tell you - getting this stuff production-ready is a lot more complicated than it seems at first. You need to think about optimization, proper testing, error handling... the works. If you want more on optimizing AI systems, our article on <a href="/article/mastering-domain-specific-llm-customization-techniques-and-tools-unveiled">customizing LLMs for domain-specific applications</a> goes pretty deep into this.

### Advanced Retrieval Optimization
One trick I've learned is that asking the same question in different ways often gets you better results. So let's implement that:

In [None]:
from functools import lru_cache
from typing import List, Tuple

def multi_query_retrieval(query: str, num_variations: int = 3) -> List[str]:
    """
    Generate multiple query variations to improve retrieval coverage.
    
    Args:
        query: Original query string
        num_variations: Number of query variations to generate
        
    Returns:
        List of unique retrieved documents
    """
    # Generate query variations
    variation_prompt = PromptTemplate(
        input_variables=["query", "num"],
        template="Generate {num} different ways to ask this question:\n{query}\n\nVariations:"
    )
    
    chain = LLMChain(llm=llm, prompt=variation_prompt)
    variations = chain.run(query=query, num=num_variations)
    
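    # Retrieve documents for each variation, querying the collection directly
    # so that deduplication works on individual chunks rather than joined text
    all_results = []
    for variation in variations.split("\n"):
        if variation.strip():
            results = collection.query(
                query_embeddings=[embeddings.embed_query(variation.strip())],
                n_results=3
            )
            all_results.extend(results["documents"][0])
    
    # Deduplicate while preserving retrieval order
    unique_results = list(dict.fromkeys(all_results))
    return unique_results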

def rerank_results(query: str, documents: List[str]) -> List[Tuple[str, float]]:
    """
    Rerank retrieved documents based on relevance to the query.
    
    Args:
        query: The search query
        documents: List of retrieved documents
        
    Returns:
        List of (document, score) tuples sorted by relevance
    """
    scored_docs = []
    
    for doc in documents:
        # Simple relevance scoring based on keyword overlap
        # In production, use a cross-encoder model for better accuracy
        query_terms = set(query.lower().split())
        doc_terms = set(doc.lower().split())
        overlap = len(query_terms.intersection(doc_terms))
        score = overlap / len(query_terms) if query_terms else 0
        
        scored_docs.append((doc, score))
    
    # Sort by score descending
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return scored_docs

# Implement optimized retrieval with caching
@lru_cache(maxsize=100)
def cached_optimized_retrieval(query: str) -> str:
    """
    Cached retrieval with multi-query and reranking optimization.
    
    Args:
        query: The search query
        
    Returns:
        Best retrieved document
    """
    # Multi-query retrieval
    documents = multi_query_retrieval(query)
    
    # Rerank results
    ranked_docs = rerank_results(query, documents)
    
    # Return top result
    return ranked_docs[0][0] if ranked_docs else ""

# Test optimized retrieval
test_query = "What are the benefits of agentic systems?"
result = cached_optimized_retrieval(test_query)
print("Optimized retrieval result:")
print(result[:300] + "...")
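
A quick note on keyword-overlap scoring: splitting on whitespace leaves punctuation attached, so "systems?" in the query never matches "systems" in a document. The sketch below compares whitespace splitting with a simple regex tokenizer on a made-up example. `tokenize`, `whitespace_score`, and `token_score` are hypothetical names used only here.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and keep alphanumeric runs, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def whitespace_score(query: str, doc: str) -> float:
    """Share of query terms found in the doc, splitting on whitespace only."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc.lower().split())) / len(query_terms)

def token_score(query: str, doc: str) -> float:
    """Share of query terms found in the doc, after stripping punctuation."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(doc)) / len(query_terms)

demo_query = "What are the benefits of agentic systems?"
demo_doc = "Agentic systems offer clear benefits: fewer wasted lookups."

print(f"whitespace: {whitespace_score(demo_query, demo_doc):.3f}")
print(f"tokenized:  {token_score(demo_query, demo_doc):.3f}")
```

With whitespace splitting only "agentic" matches (1 of 7 query terms). After stripping punctuation, "systems" and "benefits" match too (3 of 7). It's still a crude signal, but at least it isn't losing matches to a trailing question mark.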

### Performance Evaluation
You can't improve what you don't measure, right? So here's how we track performance:

In [None]:
import time
from typing import Dict

def evaluate_system_performance(test_queries: List[str]) -> Dict[str, float]:
    """
    Evaluate system performance across multiple metrics.
    
    Args:
        test_queries: List of queries to test
        
    Returns:
        Dictionary of performance metrics
    """
    latencies = []
    cache_hits = 0
    
    for query in test_queries:
        # Measure latency
        start_time = time.time()
        
        # Check cache
        cache_info_before = cached_optimized_retrieval.cache_info()
        result = cached_optimized_retrieval(query)
        cache_info_after = cached_optimized_retrieval.cache_info()
        
        end_time = time.time()
        latency = end_time - start_time
        latencies.append(latency)
        
        # Track cache hits
        if cache_info_after.hits > cache_info_before.hits:
            cache_hits += 1
    
    # Calculate metrics
    avg_latency = sum(latencies) / len(latencies)
    cache_hit_rate = cache_hits / len(test_queries)
    
    metrics = {
        "average_latency_seconds": round(avg_latency, 3),
        "cache_hit_rate": round(cache_hit_rate, 2),
        "total_queries": len(test_queries)
    }
    
    return metrics

# Run evaluation
test_queries = [
    "What is agentic RAG?",
    "How does retrieval work?",
    "What is agentic RAG?",  # Duplicate to test caching
    "Explain multi-step reasoning"
]

metrics = evaluate_system_performance(test_queries)
print("\nPerformance Metrics:")
for metric, value in metrics.items():
    print(f"  {metric}: {value}")
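
An average can be misleading once caching is involved, because cache hits come back in milliseconds while misses take seconds. A median and a tail percentile tell you more. Here's a small sketch using the nearest-rank method for p95; the latency values are invented for illustration and `latency_summary` is a hypothetical helper.

```python
import math
import statistics

def latency_summary(latencies: list[float]) -> dict:
    """Summarize latencies with mean, median, nearest-rank p95, and max."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it
    rank = max(1, math.ceil(len(ordered) * 95 / 100))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }

# Made-up sample: two near-instant cache hits, several normal calls, one slow outlier
sample_latencies = [0.8, 0.9, 1.0, 1.1, 0.002, 0.9, 1.2, 4.5, 1.0, 0.003]
summary = latency_summary(sample_latencies)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this sample the mean lands above the median because a single slow call drags it up, and the p95 shows that outlier directly. That's the number users waiting on the slow request actually feel.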

### Error Handling and Resilience
Here's something I learned the hard way - things will fail. APIs go down, rate limits hit, weird edge cases pop up. You need to be ready:

In [None]:
import logging
from typing import Optional

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def robust_query_handler(query: str, max_retries: int = 3) -> Optional[str]:
    """
    Handle queries with error handling, retries, and fallback responses.
    
    Args:
        query: The user query
        max_retries: Maximum number of retry attempts
        
    Returns:
        Response string or None if all attempts fail
    """
    for attempt in range(max_retries):
        try:
            # Attempt to process query
            response = decide_and_respond(query)
            logger.info(f"Query processed successfully on attempt {attempt + 1}")
            return response
            
        except Exception as e:
            logger.error(f"Attempt {attempt + 1} failed: {str(e)}")
            
            if attempt < max_retries - 1:
                # Wait before retry (exponential backoff)
                wait_time = 2 ** attempt
                time.sleep(wait_time)
            else:
                # Final fallback
                logger.error("All retry attempts exhausted")
                return "I apologize, but I'm having trouble processing your request. Please try rephrasing your question or contact support."
    
    return None

# Test error handling
test_query = "What is agentic RAG?"
response = robust_query_handler(test_query)
print(f"\nRobust response: {response}")
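
A fixed `2 ** attempt` schedule has one weakness: if many clients fail at the same moment, they all retry at the same moment too. A common variation adds random jitter and a cap on the wait. The sketch below only computes the delays (it doesn't sleep), and the `base`, `cap`, and `seed` parameters are my own assumptions for illustration.

```python
import random
from typing import Optional

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   seed: Optional[int] = None) -> list[float]:
    """Return the wait before each retry, using capped exponential backoff with full jitter."""
    rng = random.Random(seed)  # a seed makes the schedule reproducible for tests
    delays = []
    # There is no wait after the final attempt, hence max_retries - 1 delays
    for attempt in range(max_retries - 1):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))  # pick anywhere between 0 and the ceiling
    return delays

demo_delays = backoff_delays(max_retries=5, seed=42)
for attempt_number, delay in enumerate(demo_delays, start=1):
    print(f"wait before retry {attempt_number}: {delay:.2f}s")
```

Each delay is drawn from between zero and the exponential ceiling (1s, 2s, 4s, 8s here), so retries from different clients spread out instead of arriving in a burst.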

### Deployment with FastAPI
When you're ready to deploy, FastAPI is a great choice. Here's the basic structure:

In [None]:
# Note: This code demonstrates the deployment structure
# In Colab, you would need to run this in a separate environment

"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="Agentic RAG API")

class Query(BaseModel):
    question: str
    use_multi_step: bool = False

class Response(BaseModel):
    answer: str
    metadata: dict

@app.post("/query", response_model=Response)
async def process_query(query: Query):
    try:
        if query.use_multi_step:
            result = multi_step_reasoning(query.question)
            return Response(
                answer=result["final_answer"],
                metadata={"sub_questions": result["sub_questions"]}
            )
        else:
            answer = robust_query_handler(query.question)
            return Response(
                answer=answer,
                metadata={"method": "single_step"}
            )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

# To run: uvicorn main:app --host 0.0.0.0 --port 8000
"""

print("FastAPI deployment code structure defined")
print("Deploy using: uvicorn main:app --host 0.0.0.0 --port 8000")

### Monitoring and Observability
Actually, wait - before you deploy anything, you need proper monitoring. Trust me on this one:

In [None]:
from collections import defaultdict
from datetime import datetime

class SystemMonitor:
    """
    Monitor agent decisions, retrieval quality, and system health.
    """
    
    def __init__(self):
        self.query_log = []
        self.retrieval_stats = defaultdict(int)
        self.error_log = []
    
    def log_query(self, query: str, response: str, latency: float, used_retrieval: bool):
        """Log query details for analysis."""
        self.query_log.append({
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response_length": len(response),
            "latency": latency,
            "used_retrieval": used_retrieval
        })
        
        if used_retrieval:
            self.retrieval_stats["retrieval_used"] += 1
        else:
            self.retrieval_stats["direct_response"] += 1
    
    def log_error(self, error: str, context: dict):
        """Log errors for debugging."""
        self.error_log.append({
            "timestamp": datetime.now().isoformat(),
            "error": error,
            "context": context
        })
    
    def get_statistics(self) -> dict:
        """Get system statistics."""
        total_queries = len(self.query_log)
        avg_latency = sum(q["latency"] for q in self.query_log) / total_queries if total_queries > 0 else 0
        
        return {
            "total_queries": total_queries,
            "average_latency": round(avg_latency, 3),
            "retrieval_usage": dict(self.retrieval_stats),
            "error_count": len(self.error_log)
        }

# Initialize monitor
monitor = SystemMonitor()

# Example usage
start = time.time()
response = decide_and_respond("What is agentic RAG?")
latency = time.time() - start

monitor.log_query(
    query="What is agentic RAG?",
    response=response,
    latency=latency,
    used_retrieval=True  # hardcoded for this demo; derive it from the routing decision in real use
)

print("\nSystem Statistics:")
print(monitor.get_statistics())

## Testing & Validation
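Let's make sure everything actually works end-to-end. I like to test with different types of queries to really stress the system. Keep in mind that a case here counts as passed whenever no exception is raised; the expected behavior is printed for you to compare by eye, not asserted: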

In [None]:
def comprehensive_system_test():
    """
    Run comprehensive tests across different query types and scenarios.
    """
    test_cases = [
        {
            "query": "What is agentic RAG?",
            "expected_behavior": "Should retrieve from knowledge base",
            "use_multi_step": False
        },
        {
            "query": "How does agentic RAG differ from traditional RAG and what are implementation challenges?",
            "expected_behavior": "Should use multi-step reasoning",
            "use_multi_step": True
        },
        {
            "query": "Hello",
            "expected_behavior": "Should respond directly without retrieval",
            "use_multi_step": False
        }
    ]
    
    results = []
    
    for i, test_case in enumerate(test_cases):
        print(f"\n{'='*80}")
        print(f"Test Case {i+1}: {test_case['query']}")
        print(f"Expected: {test_case['expected_behavior']}")
        print(f"{'='*80}")
        
        start_time = time.time()
        
        try:
            if test_case["use_multi_step"]:
                result = multi_step_reasoning(test_case["query"])
                response = result["final_answer"]
            else:
                response = robust_query_handler(test_case["query"])
            
            latency = time.time() - start_time
            
            results.append({
                "test_case": i+1,
                "query": test_case["query"],
                "response": response[:200] + "..." if len(response) > 200 else response,
                "latency": round(latency, 3),
                "status": "PASSED"
            })
            
            print(f"\nResponse: {response[:200]}...")
            print(f"Latency: {latency:.3f}s")
            print("Status: PASSED ✓")
            
        except Exception as e:
            results.append({
                "test_case": i+1,
                "query": test_case["query"],
                "error": str(e),
                "status": "FAILED"
            })
            print(f"\nError: {str(e)}")
            print("Status: FAILED ✗")
    
    return results

# Run comprehensive tests
test_results = comprehensive_system_test()

print("\n" + "="*80)
print("TEST SUMMARY")
print("="*80)
passed = sum(1 for r in test_results if r["status"] == "PASSED")
print(f"Passed: {passed}/{len(test_results)}")
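
If you want checks that go beyond "it didn't crash", one lightweight option is to assert on the response content. Here's a sketch that runs on canned responses, so it works without any API calls. `check_response` and the `must_mention` terms are hypothetical, and keyword checks like this are only a rough proxy for answer quality.

```python
from typing import Optional

def check_response(response: Optional[str], must_mention: list[str]) -> list[str]:
    """Return a list of problems found in a response (an empty list means it passed)."""
    if response is None or not response.strip():
        return ["empty response"]
    lowered = response.lower()
    return [f"missing term: {term}" for term in must_mention if term.lower() not in lowered]

# Canned responses standing in for real agent output
canned_cases = [
    ("Agentic RAG lets the model decide when to retrieve documents.", ["agentic", "retrieve"]),
    ("I apologize, but I'm having trouble processing your request.", ["retrieve"]),
    ("", ["agentic"]),
]

for text, terms in canned_cases:
    problems = check_response(text, terms)
    status = "PASSED" if not problems else f"FAILED ({', '.join(problems)})"
    print(f"{status}: {text[:50]!r}")
```

The second case is the interesting one: a fallback apology is a perfectly valid string, so a test that only checks for exceptions would wave it through.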

## Conclusion
So there you have it - a complete agentic RAG system built from scratch. The more I think about it, the key difference between this and traditional RAG really comes down to autonomy. Your system can now:

- Decide for itself when to retrieve information (no more unnecessary searches)
- Break down complex questions into manageable pieces
- Use multiple strategies to find the best information
- Handle errors gracefully and keep running
- Monitor its own performance

And honestly, the latency savings from caching and the broader coverage from multi-query retrieval alone make this worth implementing.

### Next Steps
If you want to take this further, here are some ideas I've been exploring:

- **Better Reranking**: Try cross-encoder models like `cross-encoder/ms-marco-MiniLM-L-12-v2`. They're slower but way more accurate than keyword overlap.
- **Memory Systems**: Add conversation memory so the agent remembers what you talked about. LangChain has some great modules for this.
- **More Tools**: Why stop at retrieval? Add web search, calculators, maybe even database queries. The agent can figure out which to use.
- **Fine-tuning**: If you have domain-specific data, fine-tune your embedding model. The improvement in retrieval accuracy can be dramatic.
- **CI/CD**: Set up proper testing pipelines. GitHub Actions works great for this.
- **Better Monitoring**: Look into LangSmith or Weights & Biases for deeper insights into what your agent is actually doing.

One last thing - when you deploy this to production, don't forget about authentication, rate limiting, and cost monitoring. OpenAI API calls add up quickly, trust me. And if you're dealing with serious scale, consider managed vector databases like Pinecone or Weaviate instead of ChromaDB.

The beauty of this system is that it's modular. Start simple, test everything, then gradually add complexity. Before you know it, you'll have an AI assistant that actually understands when and how to help, not just blindly retrieve documents every time.