Workshop: Intro To Agentic RAG

In this workshop, you will build a customizable Agentic RAG engine powered by LangGraph, complete with conversation memory and active query refinement workflows.

Important

🎓 Workshop Participants: Environment Readiness

Verify your environment is fully prepared before the session starts. Estimated setup time: ~15 min · Estimated workshop time: ~30 min to first working chat.

Requirement	Details
Python	3.11 or higher
Virtual environment	`venv` or `conda` recommended to avoid conflicts
RAM (local models)	16 GB minimum for Ollama; 8 GB minimum for cloud providers
Ollama	Required only if using local inference: install it from https://ollama.com, then run `ollama pull qwen3:4b-instruct-2507-q4_K_M`
API key	Required only if using OpenAI, Anthropic, or Google — copy `project/.env.example` → `project/.env` and fill in your key
Dependencies	`pip install -r requirements.txt`

🚀 Verification: Open notebooks/agentic_rag.ipynb and run the first cell. No errors = ready to go.

🆘 Lab Support: labs@artiportal.com

Introduction: Why Agentic RAG?

Traditional Retrieval-Augmented Generation (RAG) systems typically follow a straightforward, linear pipeline: a user asks a question, the system blindly searches a vector database for matching text chunks, and a Large Language Model (LLM) summarizes those chunks into an answer. While effective for simple lookups, this static approach breaks down when confronted with ambiguous phrasing, multi-part requests, conversational context, or complex reasoning requirements.

Agentic RAG elevates this paradigm by giving the application autonomy. Instead of hardcoded steps, the system acts as an active researcher navigating a stateful, decision-making workflow. Agents in this repository are capable of:

Contextualizing incoming questions and rewriting them so they retrieve better matches from the vector store.
Clarifying user intent by pausing the execution and asking the user for more details if the initial query is unanswerable.
Decomposing complex questions into parallel sub-tasks (Map-Reduce) to retrieve comprehensive facts simultaneously.
Self-Correcting by evaluating retrieved documents and triggering deeper, iterative searches if the context remains insufficient.

The Tooling Stack

To achieve this level of autonomy, this codebase integrates several best-in-class frameworks:

LangGraph: A graph-based orchestration framework built on LangChain. It allows us to define the agent's workflow as nodes and conditional edges, enabling the cyclical loops needed for self-correction.
LangChain: Provides the underlying abstractions for LLM interactions, tool binding, and text splitting.
Qdrant: A highly performant vector database utilized here for Hybrid Search (simultaneously executing dense semantic searches alongside sparse BM25 keyword matching).
Ollama: Runs Large Language Models locally and offline to power the agent's reasoning. The project defaults to local Ollama models but can be switched to the OpenAI, Anthropic, or Google Gemini APIs.


If you find this useful, consider giving it a ⭐ — it helps others discover the project.

Overview

This repository provides a blueprint for assembling an Agentic RAG (Retrieval-Augmented Generation) engine leveraging LangGraph. Where traditional RAG guides focus strictly on naive vector retrieval, this project focuses on constructing a highly flexible, agent-orchestrated pipeline. It is equally useful as a learning resource and a foundation for production use.

What's inside

Feature	Description
🗂️ Hierarchical Indexing	Search small chunks for precision, retrieve large Parent chunks for context
🧠 Conversation Memory	Maintains context across questions for natural dialogue
❓ Query Clarification	Rewrites ambiguous queries or pauses to ask the user for details
🤖 Agent Orchestration	LangGraph coordinates the full retrieval and reasoning workflow
🔀 Multi-Agent Map-Reduce	Decomposes complex queries into parallel sub-queries
✅ Self-Correction	Re-queries automatically if initial results are insufficient
🗜️ Context Compression	Keeps working memory lean across long retrieval loops
🔍 Observability	Track LLM calls, tool usage, and graph execution with Langfuse

🎯 Two Ways to Use This Repo

1️⃣ Learning Path: Interactive Notebook

Step-by-step tutorial perfect for understanding core concepts. Start here if you're new to Agentic RAG or want to experiment quickly.

2️⃣ Building Path: Modular Project

Flexible architecture where each component can be independently swapped — LLM provider, embedding model, PDF converter, agent workflow. One line to switch from Ollama to Anthropic, OpenAI, or Google.

See Modular Architecture and Installation & Usage to get started.

How It Works

The underlying architecture of this system relies on separating the document processing phase from the real-time query orchestration phase. Here is a detailed breakdown of the fundamental concepts that make this Agentic RAG engine tick.

1. Advanced Document Indexing

Standard RAG struggles with a known tradeoff: small chunks yield high search precision but lack context (meaning the LLM gets confused), while large chunks provide great context but dilute search relevance (meaning the vector database struggles to find the exact match).

We solve this using Hierarchical Parent-Child Chunking:

Child Chunks: Documents are split into snippets of about 500 characters (`chunk_size=500` with a 100-character overlap). These are tightly focused on specific facts. We generate vector embeddings for these snippets and store them in Qdrant for similarity search.
Parent Chunks: Documents are also split into larger sections of roughly 2,000–4,000 characters, aligned to Markdown header boundaries (H1–H3). These are not embedded; they are simply stored on disk as JSON files.
The Retrieval Link: Every child vector carries a metadata tag identifying its parent. When the agent searches for a fact, it matches the small, precise child chunk, then uses the `parent_id` from that result to pull the surrounding parent chunk into the context window. The LLM receives the targeted fact together with its full surrounding context.
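The parent-child link can be illustrated without any vector database. The following is a minimal sketch, assuming fixed-size character windows in place of the real splitters; the `Chunk` class and the `build_hierarchy` and `expand_to_parent` helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def build_hierarchy(sections, child_size=500):
    """Return (parent_store, children); every child records its parent's ID."""
    parent_store, children = {}, []
    for i, section in enumerate(sections):
        parent_id = f"doc_parent_{i}"
        parent_store[parent_id] = section
        # Slice the parent into fixed-size windows (a stand-in for a real splitter).
        for start in range(0, len(section), child_size):
            piece = section[start:start + child_size]
            children.append(Chunk(piece, {"parent_id": parent_id}))
    return parent_store, children


def expand_to_parent(child, parent_store):
    """Swap a matched child for the full parent section it came from."""
    return parent_store[child.metadata["parent_id"]]


sections = ["A" * 1200, "B" * 700]
parent_store, children = build_hierarchy(sections)

hit = children[3]  # pretend the vector search matched this child
context = expand_to_parent(hit, parent_store)
```

Only the children would be embedded; the parent store is a plain lookup keyed by `parent_id`, which is why parents can live on disk as JSON.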

2. Hybrid Search Capabilities

Semantic search (dense embeddings) is fantastic for conceptual matching (e.g., matching "How do I care for a pup?" to documents about "Dog Maintenance"). However, it routinely fails at exact keyword lookups (like searching for a specific product serial number or an exact error code).

We use Qdrant's Hybrid Search. The index stores two vectors for every child chunk:

Dense Vector: Captures the conceptual meaning using Sentence-Transformers (all-mpnet-base-v2).
Sparse Vector: Captures keyword density using BM25.

When a query fires, both searches execute simultaneously and their ranked results are fused with Reciprocal Rank Fusion. This lets the agent find conceptual matches without sacrificing exact-keyword precision.
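To make the fusion step concrete, here is a small sketch of Reciprocal Rank Fusion in plain Python. It assumes the commonly cited constant `k=60` and made-up chunk IDs; Qdrant performs the fusion server-side, and its internal constants may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["c1", "c2", "c3"]   # order returned by the dense search
sparse_ranking = ["c3", "c1", "c4"]  # order returned by the BM25 search

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because only ranks are used, the dense and sparse scores never need to be on the same scale. A chunk that appears near the top of both lists (`c1`, `c3`) outranks one that appears in only one of them.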

3. Why LangGraph? State & Orchestration

Unlike traditional RAG, which is a straight pipeline (Query -> Retrieve -> Generate), an agentic system is designed as a state machine, and LangGraph orchestrates it. At the center of the graph is the State object, a dictionary that persists throughout the loop:

It tracks the conversation_summary to maintain dialog memory.
It tracks tool_call_count to ensure agents don't loop infinitely.
It tracks retrieval_keys to ensure the agent doesn't fetch the same parent document twice.

Armed with this persistent state, LangGraph defines processing steps ("nodes") and conditional transitions ("edges"). For instance, an edge can evaluate: Did the vector database return documents? If yes -> Go to Generation. If no -> Go to Rewrite Query. This transforms the LLM from a passive generator into an active, self-correcting researcher.
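A conditional edge is, in practice, a function that reads the state and returns the name of the next node. The sketch below shows that idea in plain Python. The node names (`generate`, `rewrite_query`, `fallback_response`), the `retrieved_docs` key, and the `MAX_TOOL_CALLS` budget are illustrative assumptions, not the identifiers used in this repository.

```python
MAX_TOOL_CALLS = 8  # hypothetical budget that stops the agent from looping forever


def route_after_search(state: dict) -> str:
    """Pick the next node based on what the last search produced."""
    if state["tool_call_count"] >= MAX_TOOL_CALLS:
        return "fallback_response"  # out of budget: answer with what we have
    if state["retrieved_docs"]:
        return "generate"           # documents found: go write the answer
    return "rewrite_query"          # nothing found: rephrase and search again


next_node = route_after_search({"tool_call_count": 2, "retrieved_docs": []})
```

In LangGraph, a function of this shape would typically be registered with `add_conditional_edges`, so the graph follows whichever node name it returns.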

4. Continuous Context Compression

While LLMs have large context windows, feeding the same long document into the prompt over multiple autonomous search iterations quickly explodes the token count, degrading reasoning capability and increasing cost. To counteract this, the system uses an active Context Compressor. If the retrieved content in the State history surpasses a calculated token threshold, a specialized node intercepts the text and compresses it down to its core facts. The Agent then resumes its search with a lightweight "Summary of Findings" instead of dragging raw JSON blocks into the future.
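The trigger for compression can be sketched as a simple budget check. This is an assumption-laden illustration: the four-characters-per-token estimate is a rough heuristic, and the `2000`-token threshold and the function names are hypothetical rather than the values this project uses.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token (a heuristic)."""
    return len(text) // 4


def should_compress(tool_outputs: list[str], token_threshold: int = 2000) -> bool:
    """Return True once the accumulated retrieval text exceeds the budget."""
    total = sum(estimate_tokens(output) for output in tool_outputs)
    return total > token_threshold


one_parent = ["x" * 4000]       # about 1,000 tokens: still under budget
three_parents = ["x" * 4000] * 3  # about 3,000 tokens: over budget

compress_now = should_compress(three_parents)
```

When the check returns `True`, the graph would route to the compression node before the next search iteration instead of appending more raw tool output.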

Query Processing: The Map-Reduce Workflow

User Query → Conversation Summary → Query Rewriting → Query Clarification →
Parallel Agent Reasoning → Aggregation → Final Response

Stage 1 — Conversation Understanding: Analyzes recent history to extract context and maintain continuity across questions.

Stage 2 — Query Clarification: Resolves grammatical references ("How do I update it?" → "How do I update SQL?"), detects fundamentally unclear inputs, and rewrites queries for optimal retrieval. Most importantly, it pauses for human input when clarification is needed, introducing Human-In-The-Loop capabilities.

Stage 3 — Map-Reduce Retrieval (Multi-Agent Subgraphs): If the user asks a complex or dual-pronged question (e.g., "Compare features of the 2023 model with the 2024 model"), a single vector search will likely fail because one query embedding blends the intent of two different objects. Instead, the orchestrator node splits the question into focused sub-queries and spawns parallel agent subgraphs. Agent A exclusively researches "2023 model" while Agent B exclusively researches "2024 model". Each sub-agent runs its own search/retrieve/correct lifecycle in total isolation.

Stage 4 — Response Compilation: Once all independent search agents conclude, their findings are collected in a shared list. A final Aggregation node evaluates all parallel research and weaves it into a single, cohesive answer, citing exact document sources.
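The fan-out and aggregation in Stages 3 and 4 follow a map-reduce shape that can be sketched with the standard library. Here a thread pool stands in for the parallel agent subgraphs, and `research` is a hypothetical placeholder that returns a canned finding instead of running real retrieval; LangGraph typically expresses this fan-out with its `Send` API rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def research(sub_query: str) -> dict:
    """Placeholder for one isolated agent subgraph working on one sub-query."""
    return {"question": sub_query, "answer": f"findings for {sub_query!r}"}


def map_reduce(sub_queries: list[str]) -> str:
    """Run one worker per sub-query (map), then merge the answers (reduce)."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        # pool.map preserves the order of the input sub-queries.
        answers = list(pool.map(research, sub_queries))
    return "\n".join(item["answer"] for item in answers)


combined = map_reduce(["2023 model features", "2024 model features"])
```

Each worker sees only its own sub-query, which mirrors the isolation described above. The real aggregation step is an LLM call, not a string join.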

LLM Provider Configuration

This system is provider-agnostic — it supports any LLM provider available in LangChain, swappable in a single line. The examples below cover the most common options, but the same pattern applies to any other supported provider.

Note: Model names change frequently. Always check the official documentation for the latest available models and their identifiers before deploying.

Provider Comparison

	Ollama (Local)	OpenAI	Anthropic	Google
Cost	Free	Pay-per-token	Pay-per-token	Pay-per-token
Setup	Install Ollama	API key	API key	API key
Works offline	Yes	No	No	No
Tool calling	Model-dependent (7B+ recommended)	Yes	Yes	Yes
Quality	Good (7B+)	Excellent	Excellent	Excellent
Best for	Privacy, no cost, local dev	Production, speed	Long context, reasoning	Multimodal, large context

Ollama (Local)

# Install Ollama from https://ollama.com
ollama pull qwen3:4b-instruct-2507-q4_K_M

from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3:4b-instruct-2507-q4_K_M", temperature=0)

⚠️ For reliable tool calling and instruction following, prefer models 7B+. Smaller models may ignore retrieval instructions or hallucinate. See Troubleshooting.

Cloud Providers


OpenAI GPT:

pip install -qU langchain-openai

from langchain_openai import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = "your-api-key-here"
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

Anthropic Claude:

pip install -qU langchain-anthropic

from langchain_anthropic import ChatAnthropic
import os

os.environ["ANTHROPIC_API_KEY"] = "your-api-key-here"
llm = ChatAnthropic(model="claude-sonnet-4-6", temperature=0)

Google Gemini:

pip install -qU langchain-google-genai

import os
from langchain_google_genai import ChatGoogleGenerativeAI

os.environ["GOOGLE_API_KEY"] = "your-api-key-here"
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
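One way to keep the provider switch to a single line is a small factory that imports each integration lazily, so only the package for the selected provider has to be installed. This is a sketch, not code from the repository: the `make_llm` function and the `LLM_PROVIDER` environment variable are hypothetical names, and the model identifiers are copied from the examples above. API keys are expected to be set in the environment already.

```python
import os


def make_llm(provider: str = "ollama", temperature: float = 0):
    """Build a chat model for the given provider, importing its package lazily."""
    if provider == "ollama":
        from langchain_ollama import ChatOllama
        return ChatOllama(model="qwen3:4b-instruct-2507-q4_K_M", temperature=temperature)
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini", temperature=temperature)
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model="claude-sonnet-4-6", temperature=temperature)
    if provider == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=temperature)
    raise ValueError(f"Unknown provider: {provider!r}")


# Hypothetical convention: choose the provider through an environment variable.
provider_name = os.environ.get("LLM_PROVIDER", "ollama")
# llm = make_llm(provider_name)  # uncomment once the matching package is installed
```

Reading keys from the environment (for example through `project/.env`) also avoids hardcoding them in notebooks.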

Implementation

Additional details, extended explanations, and Langfuse observability (LLM call tracing, tool usage, and graph execution tracking) are available in the notebook and in the full project.

Step	Description
1	Initial Setup and Configuration
2	Configure Vector Database
3	PDFs to Markdown
4	Hierarchical Document Indexing
5	Define Agent Tools
6	Define System Prompts
7	Define State and Data Models
8	Agent Configuration
9	Build Graph Node and Edge Functions
10	Build the LangGraph Graphs
11	Create Chat Interface

Step 1: Initial Setup and Configuration

Define paths and initialize core components.

import os
from pathlib import Path
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse
from qdrant_client import QdrantClient

# 1. Define storage directories for full documents and local JSON parent chunks.
DOCS_DIR = "docs"
MARKDOWN_DIR = "markdown_docs"
PARENT_STORE_PATH = "parent_store"
CHILD_COLLECTION = "document_child_chunks"

os.makedirs(DOCS_DIR, exist_ok=True)
os.makedirs(MARKDOWN_DIR, exist_ok=True)
os.makedirs(PARENT_STORE_PATH, exist_ok=True)

# 2. Initialize the LLM (defaulting to local Ollama) and embedding models.
from langchain_ollama import ChatOllama
llm = ChatOllama(model="qwen3:4b-instruct-2507-q4_K_M", temperature=0)

# Dense embeddings for conceptual search, Sparse for keyword precision.
dense_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

# 3. Connection to the local Qdrant instance.
client = QdrantClient(path="qdrant_db")

Step 2: Configure Vector Database

Set up Qdrant to store child chunks with hybrid search capabilities.

from qdrant_client.http import models as qmodels
from langchain_qdrant import QdrantVectorStore
from langchain_qdrant import RetrievalMode

# Ensure the collection exists with the correct vector dimensions and hybrid config.
embedding_dimension = len(dense_embeddings.embed_query("test"))

def ensure_collection(collection_name):
    if not client.collection_exists(collection_name):
        client.create_collection(
            collection_name=collection_name,
            # Dense vector configuration
            vectors_config=qmodels.VectorParams(
                size=embedding_dimension,
                distance=qmodels.Distance.COSINE
            ),
            # Sparse vector configuration for BM25 keyword matching
            sparse_vectors_config={
                "sparse": qmodels.SparseVectorParams()
            },
        )

Step 3: PDFs to Markdown

Convert the PDFs to Markdown. For other conversion techniques, see the companion notebook.

import os
import pymupdf.layout
import pymupdf4llm
from pathlib import Path
import glob

os.environ["TOKENIZERS_PARALLELISM"] = "false"

def pdf_to_markdown(pdf_path, output_dir):
    """
    Standardizes document format. Markdown is highly preferred over raw PDF
    text because it preserves semantic structure like headers and lists.
    """
    doc = pymupdf.open(pdf_path)
    md = pymupdf4llm.to_markdown(doc, header=False, footer=False, page_separators=True, ignore_images=True, write_images=False, image_path=None)
    # Drop unpaired surrogate characters that cannot be stored as valid UTF-8.
    md_cleaned = md.encode('utf-8', errors='surrogatepass').decode('utf-8', errors='ignore')
    # Append ".md" to the stem rather than calling with_suffix(), which would
    # truncate stems that contain dots (e.g. "report.v2" -> "report.md").
    output_path = Path(output_dir) / f"{Path(pdf_path).stem}.md"
    output_path.write_bytes(md_cleaned.encode('utf-8'))

def pdfs_to_markdowns(path_pattern, overwrite: bool = False):
    """Batch processes a directory of PDFs for ingestion."""
    output_dir = Path(MARKDOWN_DIR)
    output_dir.mkdir(parents=True, exist_ok=True)

    for pdf_path in map(Path, glob.glob(path_pattern)):
        # Must match the file name written by pdf_to_markdown().
        md_path = output_dir / f"{pdf_path.stem}.md"
        if overwrite or not md_path.exists():
            pdf_to_markdown(pdf_path, output_dir)

# Execution: convert all PDFs in the docs directory
pdfs_to_markdowns(f"{DOCS_DIR}/*.pdf")

Step 4: Hierarchical Document Indexing

Process documents with the Parent/Child splitting strategy.

import os
import glob
import json
from pathlib import Path
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

Parent & Child chunk processing functions

def merge_small_parents(chunks, min_size):
    if not chunks:
        return []

    merged, current = [], None

    for chunk in chunks:
        if current is None:
            current = chunk
        else:
            current.page_content += "\n\n" + chunk.page_content
            for k, v in chunk.metadata.items():
                if k in current.metadata:
                    current.metadata[k] = f"{current.metadata[k]} -> {v}"
                else:
                    current.metadata[k] = v

        if len(current.page_content) >= min_size:
            merged.append(current)
            current = None

    if current:
        if merged:
            merged[-1].page_content += "\n\n" + current.page_content
            for k, v in current.metadata.items():
                if k in merged[-1].metadata:
                    merged[-1].metadata[k] = f"{merged[-1].metadata[k]} -> {v}"
                else:
                    merged[-1].metadata[k] = v
        else:
            merged.append(current)

    return merged

def split_large_parents(chunks, max_size, splitter):
    split_chunks = []

    for chunk in chunks:
        if len(chunk.page_content) <= max_size:
            split_chunks.append(chunk)
        else:
            large_splitter = RecursiveCharacterTextSplitter(
                chunk_size=max_size,
                chunk_overlap=splitter._chunk_overlap
            )
            sub_chunks = large_splitter.split_documents([chunk])
            split_chunks.extend(sub_chunks)

    return split_chunks

def clean_small_chunks(chunks, min_size):
    cleaned = []

    for i, chunk in enumerate(chunks):
        if len(chunk.page_content) < min_size:
            if cleaned:
                cleaned[-1].page_content += "\n\n" + chunk.page_content
                for k, v in chunk.metadata.items():
                    if k in cleaned[-1].metadata:
                        cleaned[-1].metadata[k] = f"{cleaned[-1].metadata[k]} -> {v}"
                    else:
                        cleaned[-1].metadata[k] = v
            elif i < len(chunks) - 1:
                chunks[i + 1].page_content = chunk.page_content + "\n\n" + chunks[i + 1].page_content
                for k, v in chunk.metadata.items():
                    if k in chunks[i + 1].metadata:
                        chunks[i + 1].metadata[k] = f"{v} -> {chunks[i + 1].metadata[k]}"
                    else:
                        chunks[i + 1].metadata[k] = v
            else:
                cleaned.append(chunk)
        else:
            cleaned.append(chunk)

    return cleaned

# Initialize the hybrid vector store for child chunks.
# Drop any existing collection so every run re-indexes from scratch.
if client.collection_exists(CHILD_COLLECTION):
    client.delete_collection(CHILD_COLLECTION)
ensure_collection(CHILD_COLLECTION)

child_vector_store = QdrantVectorStore(
    client=client,
    collection_name=CHILD_COLLECTION,
    embedding=dense_embeddings,
    sparse_embedding=sparse_embeddings,
    retrieval_mode=RetrievalMode.HYBRID,
    sparse_vector_name="sparse"
)

def index_documents():
    """
    Main ingestion loop: Splits documents into the Parent-Child hierarchy, 
    embeds children into Qdrant, and saves parents to local JSON.
    """
    headers_to_split_on = [("#", "H1"), ("##", "H2"), ("###", "H3")]
    parent_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

    min_parent_size = 2000
    max_parent_size = 4000

    all_parent_pairs, all_child_chunks = [], []
    md_files = sorted(glob.glob(os.path.join(MARKDOWN_DIR, "*.md")))

    if not md_files:
        return

    for doc_path_str in md_files:
        doc_path = Path(doc_path_str)
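        try:
            with open(doc_path, "r", encoding="utf-8") as f:
                md_text = f.read()
        except Exception as e:
            # Skip unreadable files, but report them instead of failing silently.
            print(f"Skipping {doc_path}: {e}")
            continue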

        # 1. Split into Parent sections based on Markdown headers
        parent_chunks = parent_splitter.split_text(md_text)
        
        # 2. Merge tiny orphans and split massive blocks to ensure uniform context windows
        merged_parents = merge_small_parents(parent_chunks, min_parent_size)
        split_parents = split_large_parents(merged_parents, max_parent_size, child_splitter)
        cleaned_parents = clean_small_chunks(split_parents, min_parent_size)

        for i, p_chunk in enumerate(cleaned_parents):
            parent_id = f"{doc_path.stem}_parent_{i}"
            p_chunk.metadata.update({"source": doc_path.stem + ".pdf", "parent_id": parent_id})
            all_parent_pairs.append((parent_id, p_chunk))
            
            # 3. Create searchable 'Child' chunks linked back to their parent
            children = child_splitter.split_documents([p_chunk])
            all_child_chunks.extend(children)

    if not all_child_chunks:
        return

    # Commit children to the vector database
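    try:
        child_vector_store.add_documents(all_child_chunks)
    except Exception as e:
        # Stop before rewriting the parent store so it is not left out of sync.
        print(f"Failed to index child chunks: {e}")
        return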

    # Clean local store and save parent context blocks as JSON files
    for item in os.listdir(PARENT_STORE_PATH):
        os.remove(os.path.join(PARENT_STORE_PATH, item))

    for parent_id, doc in all_parent_pairs:
        doc_dict = {"page_content": doc.page_content, "metadata": doc.metadata}
        filepath = os.path.join(PARENT_STORE_PATH, f"{parent_id}.json")
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(doc_dict, f, ensure_ascii=False, indent=2)

# Execution: build the index from converted markdowns
index_documents()

Step 5: Define Agent Tools

Create the retrieval tools the agent will use.

import json
import os  # used by retrieve_parent_chunks to build the parent file path
from langchain_core.tools import tool

@tool
def search_child_chunks(query: str, limit: int) -> str:
    """Search for the top K most relevant child chunks.

    Args:
        query: Search query string
        limit: Maximum number of results to return
    """
    try:
        results = child_vector_store.similarity_search(query, k=limit, score_threshold=0.7)
        if not results:
            return "NO_RELEVANT_CHUNKS"

        return "\n\n".join([
            f"Parent ID: {doc.metadata.get('parent_id', '')}\n"
            f"File Name: {doc.metadata.get('source', '')}\n"
            f"Content: {doc.page_content.strip()}"
            for doc in results
        ])

    except Exception as e:
        return f"RETRIEVAL_ERROR: {str(e)}"

@tool
def retrieve_parent_chunks(parent_id: str) -> str:
    """Retrieve a full parent chunk by its ID.
    
    Args:
        parent_id: Parent chunk ID to retrieve
    """
    file_name = parent_id if parent_id.lower().endswith(".json") else f"{parent_id}.json"
    path = os.path.join(PARENT_STORE_PATH, file_name)

    if not os.path.exists(path):
        return "NO_PARENT_DOCUMENT"

    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    return (
        f"Parent ID: {parent_id}\n"
        f"File Name: {data.get('metadata', {}).get('source', 'unknown')}\n"
        f"Content: {data.get('page_content', '').strip()}"
    )

llm_with_tools = llm.bind_tools([search_child_chunks, retrieve_parent_chunks])

Step 6: Define System Prompts

Define the system prompts for conversation summarization, query rewriting, agent orchestration, context compression, fallback response, and answer aggregation.

Conversation Summary Prompt

def get_conversation_summary_prompt() -> str:
    return """You are an expert conversation summarizer.

Your task is to create a brief 1-2 sentence summary of the conversation (max 30-50 words).

Include:
- Main topics discussed
- Important facts or entities mentioned
- Any unresolved questions if applicable
- Source file names (e.g., file1.pdf) or documents referenced

Exclude:
- Greetings, misunderstandings, off-topic content.

Output:
- Return ONLY the summary.
- Do NOT include any explanations or justifications.
- If no meaningful topics exist, return an empty string.
"""

Query Rewrite Prompt

def get_rewrite_query_prompt() -> str:
    return """You are an expert query analyst and rewriter.

Your task is to rewrite the current user query for optimal document retrieval, incorporating conversation context only when necessary.

Rules:
1. Self-contained queries:
   - Always rewrite the query to be clear and self-contained
   - If the query is a follow-up (e.g., "what about X?", "and for Y?"), integrate minimal necessary context from the summary
   - Do not add information not present in the query or conversation summary

2. Domain-specific terms:
   - Product names, brands, proper nouns, or technical terms are treated as domain-specific
   - For domain-specific queries, use conversation context minimally or not at all
   - Use the summary only to disambiguate vague queries

3. Grammar and clarity:
   - Fix grammar, spelling errors, and unclear abbreviations
   - Remove filler words and conversational phrases
   - Preserve concrete keywords and named entities

4. Multiple information needs:
   - If the query contains multiple distinct, unrelated questions, split into separate queries (maximum 3)
   - Each sub-query must remain semantically equivalent to its part of the original
   - Do not expand, enrich, or reinterpret the meaning

5. Failure handling:
   - If the query intent is unclear or unintelligible, mark as "unclear"

Input:
- conversation_summary: A concise summary of prior conversation
- current_query: The user's current query

Output:
- One or more rewritten, self-contained queries suitable for document retrieval
"""

Orchestrator Prompt

def get_orchestrator_prompt() -> str:
    return """You are an expert retrieval-augmented assistant.

Your task is to act as a researcher: search documents first, analyze the data, and then provide a comprehensive answer using ONLY the retrieved information.

Rules:
1. You MUST call 'search_child_chunks' before answering, unless the [COMPRESSED CONTEXT FROM PRIOR RESEARCH] already contains sufficient information.
2. Ground every claim in the retrieved documents. If context is insufficient, state what is missing rather than filling gaps with assumptions.
3. If no relevant documents are found, broaden or rephrase the query and search again. Repeat until satisfied or the operation limit is reached.

Compressed Memory:
When [COMPRESSED CONTEXT FROM PRIOR RESEARCH] is present —
- Queries already listed: do not repeat them.
- Parent IDs already listed: do not call `retrieve_parent_chunks` on them again.
- Use it to identify what is still missing before searching further.

Workflow:
1. Check the compressed context. Identify what has already been retrieved and what is still missing.
2. Search for 5-7 relevant excerpts using 'search_child_chunks' ONLY for uncovered aspects.
3. If NONE are relevant, apply rule 3 immediately.
4. For each relevant but fragmented excerpt, call 'retrieve_parent_chunks' ONE BY ONE — only for IDs not in the compressed context. Never retrieve the same ID twice.
5. Once context is complete, provide a detailed answer omitting no relevant facts.
6. Conclude with "---\\n**Sources:**\\n" followed by the unique file names.
"""

Fallback Response Prompt

def get_fallback_response_prompt() -> str:
    return """You are an expert synthesis assistant. The system has reached its maximum research limit.

Your task is to provide the most complete answer possible using ONLY the information provided below.

Input structure:
- "Compressed Research Context": summarized findings from prior search iterations — treat as reliable.
- "Retrieved Data": raw tool outputs from the current iteration — prefer over compressed context if conflicts arise.
Either source alone is sufficient if the other is absent.

Rules:
1. Source Integrity: Use only facts explicitly present in the provided context. Do not infer, assume, or add any information not directly supported by the data.
2. Handling Missing Data: Cross-reference the USER QUERY against the available context.
   Flag ONLY aspects of the user's question that cannot be answered from the provided data.
   Do not treat gaps mentioned in the Compressed Research Context as unanswered
   unless they are directly relevant to what the user asked.
3. Tone: Professional, factual, and direct.
4. Output only the final answer. Do not expose your reasoning, internal steps, or any meta-commentary about the retrieval process.
5. Do NOT add closing remarks, final notes, disclaimers, summaries, or repeated statements after the Sources section.
   The Sources section is always the last element of your response. Stop immediately after it.

Formatting:
- Use Markdown (headings, bold, lists) for readability.
- Write in flowing paragraphs where possible.
- Conclude with a Sources section as described below.

Sources section rules:
- Include a "---\\n**Sources:**\\n" section at the end, followed by a bulleted list of file names.
- List ONLY entries that have a real file extension (e.g. ".pdf", ".docx", ".txt").
- Any entry without a file extension is an internal chunk identifier — discard it entirely, never include it.
- Deduplicate: if the same file appears multiple times, list it only once.
- If no valid file names are present, omit the Sources section entirely.
- THE SOURCES SECTION IS THE LAST THING YOU WRITE. Do not add anything after it.
"""

Context Compression Prompt

def get_context_compression_prompt() -> str:
    return """You are an expert research context compressor.

Your task is to compress retrieved conversation content into a concise, query-focused, and structured summary that can be directly used by a retrieval-augmented agent for answer generation.

Rules:
1. Keep ONLY information relevant to answering the user's question.
2. Preserve exact figures, names, versions, technical terms, and configuration details.
3. Remove duplicated, irrelevant, or administrative details.
4. Do NOT include search queries, parent IDs, chunk IDs, or internal identifiers.
5. Organize all findings by source file. Each file section MUST start with: ### filename.pdf
6. Highlight missing or unresolved information in a dedicated "Gaps" section.
7. Limit the summary to roughly 400-600 words. If content exceeds this, prioritize critical facts and structured data.
8. Do not explain your reasoning; output only structured content in Markdown.

Required Structure:

# Research Context Summary

## Focus
[Brief technical restatement of the question]

## Structured Findings

### filename.pdf
- Directly relevant facts
- Supporting context (if needed)

## Gaps
- Missing or incomplete aspects

The summary should be concise, structured, and directly usable by an agent to generate answers or plan further retrieval.
"""

Aggregation Prompt

def get_aggregation_prompt() -> str:
    return """You are an expert aggregation assistant.

Your task is to combine multiple retrieved answers into a single, comprehensive and natural response that flows well.

Rules:
1. Write in a conversational, natural tone - as if explaining to a colleague.
2. Use ONLY information from the retrieved answers.
3. Do NOT infer, expand, or interpret acronyms or technical terms unless explicitly defined in the sources.
4. Weave together the information smoothly, preserving important details, numbers, and examples.
5. Be comprehensive - include all relevant information from the sources, not just a summary.
6. If sources disagree, acknowledge both perspectives naturally (e.g., "While some sources suggest X, others indicate Y...").
7. Start directly with the answer - no preambles like "Based on the sources...".

Formatting:
- Use Markdown for clarity (headings, lists, bold) but don't overdo it.
- Write in flowing paragraphs where possible rather than excessive bullet points.
- Conclude with a Sources section as described below.

Sources section rules:
- Each retrieved answer may contain a "Sources" section — extract the file names listed there.
- List ONLY entries that have a real file extension (e.g. ".pdf", ".docx", ".txt").
- Any entry without a file extension is an internal chunk identifier — discard it entirely, never include it.
- Deduplicate: if the same file appears across multiple answers, list it only once.
- Format as "---\\n**Sources:**\\n" followed by a bulleted list of the cleaned file names.
- File names must appear ONLY in this final Sources section and nowhere else in the response.
- If no valid file names are present, omit the Sources section entirely.

If there's no useful information available, simply say: "I couldn't find any information to answer your question in the available sources."
"""

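The "Sources section rules" above are instructions to the model, so they are followed only as reliably as the model follows prompts. As a deterministic safety net, the same filtering could be sketched in plain Python. `clean_sources` and `format_sources` are hypothetical helpers, not part of the repository, and the sketch assumes a "real file extension" means a dot followed by one to five alphanumeric characters at the end of the entry.

```python
import re

# Assumption: a file extension is a dot plus 1-5 alphanumerics at the end.
_EXTENSION = re.compile(r"\.[A-Za-z0-9]{1,5}$")

def clean_sources(entries: list[str]) -> list[str]:
    """Keep entries that look like file names, dropping duplicates in order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for entry in entries:
        name = entry.strip()
        if not _EXTENSION.search(name):
            continue  # no extension: treated as an internal chunk identifier
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned

def format_sources(entries: list[str]) -> str:
    """Render the Sources footer, or an empty string when nothing is valid."""
    names = clean_sources(entries)
    if not names:
        return ""
    return "---\n**Sources:**\n" + "\n".join(f"- {name}" for name in names)

print(format_sources(["guide.pdf", "parent_12", "guide.pdf", " notes.txt "]))
```

A post-processing step like this would run on the source names collected from the worker answers, independently of what the aggregation model writes.
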
Step 7: Define State and Data Models

Create the state structure for conversation tracking and agent execution.

from langgraph.graph import MessagesState
from pydantic import BaseModel, Field
from typing import List, Annotated, Set
import operator

# --- State Reducers ---

def accumulate_or_reset(existing: List[dict], new: List[dict]) -> List[dict]:
    """
    Maintains the list of parallel worker answers.
    Allows merging new findings or resetting the list for a new turn.
    """
    if new and any(item.get('__reset__') for item in new):
        return []
    return existing + new

def set_union(a: Set[str], b: Set[str]) -> Set[str]:
    """Merges sets of retrieval keys (IDs and queries) from parallel workers."""
    return a | b

# --- State Schemas ---

class State(MessagesState):
    """Global state for the main graph orchestration."""
    questionIsClear: bool = False
    conversation_summary: str = ""
    originalQuery: str = ""
    rewrittenQuestions: List[str] = []
    agent_answers: Annotated[List[dict], accumulate_or_reset] = []

class AgentState(MessagesState):
    """Local state for a single parallel research worker."""
    tool_call_count: Annotated[int, operator.add] = 0
    iteration_count: Annotated[int, operator.add] = 0
    question: str = ""
    question_index: int = 0
    context_summary: str = ""
    retrieval_keys: Annotated[Set[str], set_union] = set()
    final_answer: str = ""
    agent_answers: List[dict] = []

class QueryAnalysis(BaseModel):
    """Structured output for the LLM to refine and clarify user intent."""
    is_clear: bool = Field(description="Indicates if the user's question is clear and answerable.")
    questions: List[str] = Field(description="List of rewritten, self-contained questions.")
    clarification_needed: str = Field(description="Explanation if the question is unclear.")
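
To see what the two reducers do without running a graph, the sketch below calls them by hand. The reducers are repeated from the code above so the snippet runs on its own; the sample answers and keys are made up for illustration, and the manual folding only approximates how LangGraph applies a reducer to each state update.

```python
from typing import List, Set

def accumulate_or_reset(existing: List[dict], new: List[dict]) -> List[dict]:
    # Same reducer as above, repeated so this snippet is self-contained.
    if new and any(item.get("__reset__") for item in new):
        return []
    return existing + new

def set_union(a: Set[str], b: Set[str]) -> Set[str]:
    return a | b

# Two parallel workers report back; each update is folded into the current value.
answers: List[dict] = []
answers = accumulate_or_reset(answers, [{"index": 1, "answer": "B"}])
answers = accumulate_or_reset(answers, [{"index": 0, "answer": "A"}])
after_two_workers = list(answers)

# A new turn sends the sentinel entry, which clears the list.
answers = accumulate_or_reset(answers, [{"__reset__": True}])

# Retrieval keys from different workers merge without duplicates.
keys = set_union({"search::install"}, {"search::install", "parent::42"})

print(len(after_two_workers), answers, sorted(keys))
```

Note that the reset sentinel discards everything in the same update, so a reset and fresh answers should not be sent together.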

Step 8: Agent Configuration

Hard limits on tool calls and iterations prevent infinite loops. Token counting (via tiktoken) drives context compression decisions.

import tiktoken

# Agent Circuit Breakers
MAX_TOOL_CALLS = 8       # Max total tool calls per worker (the counter accumulates across iterations)
MAX_ITERATIONS = 10      # Hard limit on agentic research attempts
BASE_TOKEN_THRESHOLD = 2000     # Threshold to trigger 'Context Compression'
TOKEN_GROWTH_FACTOR = 0.9       # Dynamic threshold scaling

def estimate_context_tokens(messages: list) -> int:
    """
    Estimates the current context payload to determine if 
    summarization/compression is required.
    """
    try:
        encoding = tiktoken.encoding_for_model("gpt-4")
    except KeyError:  # unknown model name: fall back to a generic encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return sum(len(encoding.encode(str(msg.content))) for msg in messages if hasattr(msg, 'content') and msg.content)
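
The way `BASE_TOKEN_THRESHOLD` and `TOKEN_GROWTH_FACTOR` interact is easier to see with numbers. The sketch below mirrors the comparison made later in `should_compress_context` (Step 9), but takes raw token counts instead of messages so it runs without `tiktoken`. The function name `needs_compression` and the sample counts are illustrative assumptions.

```python
BASE_TOKEN_THRESHOLD = 2000
TOKEN_GROWTH_FACTOR = 0.9

def needs_compression(message_tokens: int, summary_tokens: int) -> bool:
    """Mirror of the threshold check, using precomputed token counts."""
    current_tokens = message_tokens + summary_tokens
    max_allowed = BASE_TOKEN_THRESHOLD + int(summary_tokens * TOKEN_GROWTH_FACTOR)
    return current_tokens > max_allowed

# No summary yet: compression starts once messages pass 2000 tokens.
print(needs_compression(message_tokens=1900, summary_tokens=0))    # False
print(needs_compression(message_tokens=2100, summary_tokens=0))    # True

# With a 500-token summary the ceiling rises to 2000 + 450 = 2450.
print(needs_compression(message_tokens=1900, summary_tokens=500))  # False (2400 <= 2450)
print(needs_compression(message_tokens=2000, summary_tokens=500))  # True  (2500 > 2450)
```

Because the summary counts fully toward the current total but only 90% toward the ceiling, the room left for new messages shrinks by roughly 10% of the summary size as the summary grows.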

Step 9: Build Graph Node and Edge Functions

Create the processing nodes and edges for the LangGraph workflow.

Main Graph Nodes & Edges

from langgraph.types import Send, Command
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage, RemoveMessage, ToolMessage
from typing import Literal

def summarize_history(state: State):
    """
    Maintains continuity by summarizing the last few turns of conversation.
    Ensures the agent 'remembers' what was previously discussed.
    """
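    # Reset parallel worker answers on every path, so results from a previous
    # turn cannot leak into this one when the function returns early.
    reset_answers = [{"__reset__": True}]

    if len(state["messages"]) < 4:
        return {"conversation_summary": "", "agent_answers": reset_answers}

    relevant_msgs = [
        msg for msg in state["messages"][:-1]
        if isinstance(msg, (HumanMessage, AIMessage)) and not getattr(msg, "tool_calls", None)
    ]

    if not relevant_msgs:
        return {"conversation_summary": "", "agent_answers": reset_answers}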

    conversation = "Conversation history:\n"
    for msg in relevant_msgs[-6:]:
        role = "User" if isinstance(msg, HumanMessage) else "Assistant"
        conversation += f"{role}: {msg.content}\n"

    summary_response = llm.with_config(temperature=0.2).invoke([SystemMessage(content=get_conversation_summary_prompt()), HumanMessage(content=conversation)])
    # Reset parallel worker answers to prepare for a fresh query cycle
    return {"conversation_summary": summary_response.content, "agent_answers": [{"__reset__": True}]}

def rewrite_query(state: State):
    """
    Disambiguates and cleans user input. 
    If the question is multi-pronged, it splits it for parallel research.
    """
    last_message = state["messages"][-1]
    conversation_summary = state.get("conversation_summary", "")

    context_section = (f"Conversation Context:\n{conversation_summary}\n" if conversation_summary.strip() else "") + f"User Query:\n{last_message.content}\n"

    llm_with_structure = llm.with_config(temperature=0.1).with_structured_output(QueryAnalysis)
    response = llm_with_structure.invoke([SystemMessage(content=get_rewrite_query_prompt()), HumanMessage(content=context_section)])

    if response.questions and response.is_clear:
        # Clear out raw history messages once intent is captured in the rewritten queries
        delete_all = [RemoveMessage(id=m.id) for m in state["messages"] if not isinstance(m, SystemMessage)]
        return {"questionIsClear": True, "messages": delete_all, "originalQuery": last_message.content, "rewrittenQuestions": response.questions}

    # Handle vagueness via clarification
    clarification = response.clarification_needed if response.clarification_needed and len(response.clarification_needed.strip()) > 10 else "I need more information."
    return {"questionIsClear": False, "messages": [AIMessage(content=clarification)]}

def request_clarification(state: State):
    """Interrupt point for user input."""
    return {}

def route_after_rewrite(state: State) -> Literal["request_clarification", "agent"]:
    """Conditional router that triggers the parallel sub-agent workers."""
    if not state.get("questionIsClear", False):
        return "request_clarification"
    else:
        # Spawn parallel subgraphs (one per rewritten query)
        return [
                Send("agent", {"question": query, "question_index": idx, "messages": []})
                for idx, query in enumerate(state["rewrittenQuestions"])
            ]

def aggregate_answers(state: State):
    """Combines findings from all parallel sub-agents into a unified final response."""
    if not state.get("agent_answers"):
        return {"messages": [AIMessage(content="No answers were generated.")]}

    sorted_answers = sorted(state["agent_answers"], key=lambda x: x["index"])

    formatted_answers = ""
    for i, ans in enumerate(sorted_answers, start=1):
        formatted_answers += (f"\nAnswer {i}:\n"f"{ans['answer']}\n")

    user_message = HumanMessage(content=f"""Original question: {state["originalQuery"]}\nRetrieved answers:{formatted_answers}""")
    synthesis_response = llm.invoke([SystemMessage(content=get_aggregation_prompt()), user_message])
    return {"messages": [AIMessage(content=synthesis_response.content)]}
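
Parallel workers can finish in any order, which is why `aggregate_answers` sorts by `index` before building the prompt. The sketch below reproduces only that ordering and formatting step with plain dictionaries and no LLM call. `format_answers` and the sample answers are hypothetical.

```python
# Hypothetical worker results, arriving out of order.
agent_answers = [
    {"index": 1, "question": "How do I update PostgreSQL?", "answer": "Run the upgrade tool."},
    {"index": 0, "question": "How do I install PostgreSQL?", "answer": "Use the installer."},
]

def format_answers(answers: list[dict]) -> str:
    """Same ordering and layout as aggregate_answers, without the LLM call."""
    ordered = sorted(answers, key=lambda item: item["index"])
    return "".join(
        f"\nAnswer {i}:\n{item['answer']}\n"
        for i, item in enumerate(ordered, start=1)
    )

formatted = format_answers(agent_answers)
print(formatted)
```

The numbering in the prompt follows the order of the rewritten questions, not the order in which workers completed.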

Agent Subgraph Nodes & Edges

def orchestrator(state: AgentState):
    """
    Main loop for a single sub-agent.
    Decision: Decide when to search, retrieve full context, or answer.
    """
    context_summary = state.get("context_summary", "").strip()
    sys_msg = SystemMessage(content=get_orchestrator_prompt())
    summary_injection = (
        [HumanMessage(content=f"[COMPRESSED CONTEXT FROM PRIOR RESEARCH]\n\n{context_summary}")]
        if context_summary else []
    )
    if not state.get("messages"):
        human_msg = HumanMessage(content=state["question"])
        # Bootstrapping the research loop
        force_search = HumanMessage(content="YOU MUST CALL 'search_child_chunks' AS THE FIRST STEP.")
        response = llm_with_tools.invoke([sys_msg] + summary_injection + [human_msg, force_search])
        return {"messages": [human_msg, response], "tool_call_count": len(response.tool_calls or []), "iteration_count": 1}

    response = llm_with_tools.invoke([sys_msg] + summary_injection + state["messages"])
    tool_calls = response.tool_calls if hasattr(response, "tool_calls") else []
    return {"messages": [response], "tool_call_count": len(tool_calls) if tool_calls else 0, "iteration_count": 1}

def route_after_orchestrator_call(state: AgentState) -> Literal["tools", "fallback_response", "collect_answer"]:
    """Conditional router based on tool results and iteration budgets."""
    iteration = state.get("iteration_count", 0)
    tool_count = state.get("tool_call_count", 0)

    # Circuit breakers to prevent infinite research loops
    if iteration >= MAX_ITERATIONS or tool_count > MAX_TOOL_CALLS:
        return "fallback_response"

    last_message = state["messages"][-1]
    tool_calls = getattr(last_message, "tool_calls", None) or []

    if not tool_calls:
        # LLM has enough information to generate the final sub-answer
        return "collect_answer"
    
    return "tools"

def fallback_response(state: AgentState):
    """Synthesizes a response using whatever fragments were retrieved before budget was hit."""
    seen = set()
    unique_contents = []
    for m in state["messages"]:
        if isinstance(m, ToolMessage) and m.content not in seen:
            unique_contents.append(m.content)
            seen.add(m.content)

    context_summary = state.get("context_summary", "").strip()
    context_parts = []
    if context_summary:
        context_parts.append(f"## Compressed Research Context\n\n{context_summary}")
    if unique_contents:
        context_parts.append(
            "## Retrieved Data\n\n" +
            "\n\n".join(f"--- DATA SOURCE {i} ---\n{content}" for i, content in enumerate(unique_contents, 1))
        )

    context_text = "\n\n".join(context_parts) if context_parts else "No data was retrieved."
    prompt_content = f"QUERY: {state.get('question')}\n\n{context_text}"
    response = llm.invoke([SystemMessage(content=get_fallback_response_prompt()), HumanMessage(content=prompt_content)])
    return {"messages": [response]}

def should_compress_context(state: AgentState) -> Command[Literal["compress_context", "orchestrator"]]:
    """
    Evaluation node: Check if the message history is bloating the context window.
    Routes to the compression node if threshold is exceeded.
    """
    messages = state["messages"]
    new_ids: Set[str] = set()
    for msg in reversed(messages):
        if isinstance(msg, AIMessage) and getattr(msg, "tool_calls", None):
            for tc in msg.tool_calls:
                # Track retrieval history to avoid redundant tool calls
                if tc["name"] == "retrieve_parent_chunks":
                    raw = tc["args"].get("parent_id") or tc["args"].get("id") or tc["args"].get("ids") or []
                    if isinstance(raw, str): new_ids.add(f"parent::{raw}")
                    else: new_ids.update(f"parent::{r}" for r in raw)
                elif tc["name"] == "search_child_chunks":
                    query = tc["args"].get("query", "")
                    if query: new_ids.add(f"search::{query}")
            break

    updated_ids = state.get("retrieval_keys", set()) | new_ids
    current_tokens = estimate_context_tokens(messages) + estimate_context_tokens([HumanMessage(content=state.get("context_summary", ""))])
    max_allowed = BASE_TOKEN_THRESHOLD + int(estimate_context_tokens([HumanMessage(content=state.get("context_summary", ""))]) * TOKEN_GROWTH_FACTOR)

    goto = "compress_context" if current_tokens > max_allowed else "orchestrator"
    return Command(update={"retrieval_keys": updated_ids}, goto=goto)

def compress_context(state: AgentState):
    """Shrinks the message history down to a concise technical summary."""
    messages = state["messages"]
    existing_summary = state.get("context_summary", "").strip()
    if not messages: return {}

    conversation_text = f"USER QUESTION:\n{state.get('question')}\n\n"
    if existing_summary: conversation_text += f"[PRIOR CONTEXT]\n{existing_summary}\n\n"

    for msg in messages[1:]:
        if isinstance(msg, AIMessage):
            conversation_text += f"[ASSISTANT]\n{msg.content or '(tool call only)'}\n\n"
        elif isinstance(msg, ToolMessage):
            conversation_text += f"[TOOL — {getattr(msg, 'name', 'tool')}]\n{msg.content}\n\n"

    summary_response = llm.invoke([SystemMessage(content=get_context_compression_prompt()), HumanMessage(content=conversation_text)])
    new_summary = summary_response.content
    
    # Clean state history
    return {"context_summary": new_summary, "messages": [RemoveMessage(id=m.id) for m in messages[1:]]}

def collect_answer(state: AgentState):
    """Sink node: extract the final sub-answer produced by the worker."""
    last_message = state["messages"][-1]
    is_valid = isinstance(last_message, AIMessage) and last_message.content and not last_message.tool_calls
    answer = last_message.content if is_valid else "Unable to generate an answer."
    return {
        "final_answer": answer,
        "agent_answers": [{"index": state["question_index"], "question": state["question"], "answer": answer}]
    }
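
The routing rules in `route_after_orchestrator_call` can be checked in isolation. The sketch below is a plain-Python mirror that takes the counters as arguments instead of reading graph state. The name `next_step` and the `pending_tool_calls` parameter are assumptions made for illustration.

```python
MAX_TOOL_CALLS = 8
MAX_ITERATIONS = 10

def next_step(iteration_count: int, tool_call_count: int, pending_tool_calls: int) -> str:
    """Plain-Python mirror of route_after_orchestrator_call."""
    # Budget checks come first, so an exhausted worker never runs more tools.
    if iteration_count >= MAX_ITERATIONS or tool_call_count > MAX_TOOL_CALLS:
        return "fallback_response"
    # No tool calls requested: the last message is the sub-answer.
    if pending_tool_calls == 0:
        return "collect_answer"
    return "tools"

print(next_step(iteration_count=1, tool_call_count=1, pending_tool_calls=1))   # tools
print(next_step(iteration_count=3, tool_call_count=4, pending_tool_calls=0))   # collect_answer
print(next_step(iteration_count=10, tool_call_count=4, pending_tool_calls=1))  # fallback_response
print(next_step(iteration_count=5, tool_call_count=9, pending_tool_calls=1))   # fallback_response
```

The two limits are compared differently: the iteration limit trips on reaching `MAX_ITERATIONS`, while the tool limit trips only after exceeding `MAX_TOOL_CALLS`, so a worker may make exactly eight tool calls.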

Why this architecture?

Summarization maintains conversational context without overwhelming the LLM
Query rewriting ensures search queries are precise and unambiguous, using context intelligently
Human-in-the-loop catches unclear queries before wasting any retrieval resources
Parallel execution via Send API spawns independent agent subgraphs for each sub-question simultaneously
Context compression keeps the agent's working memory lean across long retrieval loops, preventing redundant fetches
Fallback response ensures graceful degradation — the agent always returns something useful even when the budget runs out
Answer collection & aggregation extracts clean final answers from agents and aggregates them into a single coherent response
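
The retrieval bookkeeping behind the context-compression point can be sketched on its own. The function below follows the key format used in `should_compress_context` (`parent::<id>` and `search::<query>`) and assumes tool calls are dictionaries with `name` and `args` keys. `extract_retrieval_keys` and the sample batch are hypothetical.

```python
def extract_retrieval_keys(tool_calls: list[dict]) -> set[str]:
    """Build 'parent::<id>' and 'search::<query>' keys from one batch of tool calls."""
    keys: set[str] = set()
    for call in tool_calls:
        args = call.get("args", {})
        if call["name"] == "retrieve_parent_chunks":
            raw = args.get("parent_id") or args.get("id") or args.get("ids") or []
            ids = [raw] if isinstance(raw, str) else raw
            keys.update(f"parent::{item}" for item in ids)
        elif call["name"] == "search_child_chunks":
            query = args.get("query", "")
            if query:
                keys.add(f"search::{query}")
    return keys

# Keys recorded so far, plus one new batch of tool calls.
seen = {"search::postgres install"}
batch = [
    {"name": "search_child_chunks", "args": {"query": "postgres install"}},
    {"name": "retrieve_parent_chunks", "args": {"ids": ["p1", "p2"]}},
]
seen |= extract_retrieval_keys(batch)
print(sorted(seen))
```

Because the keys live in a set, repeating a search or re-fetching a parent chunk leaves the record unchanged, which gives later steps a compact view of what has already been tried.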

Step 10: Build the LangGraph Graphs

Assemble the complete workflow graph with conversation memory and multi-agent architecture.

from langgraph.graph import START, END, StateGraph
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import InMemorySaver

# 1. Initialize persistent memory for the graph
checkpointer = InMemorySaver()

# 2. Build the Agent Subgraph (The worker loop)
agent_builder = StateGraph(AgentState)
agent_builder.add_node(orchestrator)
agent_builder.add_node("tools", ToolNode([search_child_chunks, retrieve_parent_chunks]))
agent_builder.add_node(compress_context)
agent_builder.add_node(fallback_response)
agent_builder.add_node(should_compress_context)
agent_builder.add_node(collect_answer)

agent_builder.add_edge(START, "orchestrator")
agent_builder.add_conditional_edges("orchestrator", route_after_orchestrator_call, {"tools": "tools", "fallback_response": "fallback_response", "collect_answer": "collect_answer"})
agent_builder.add_edge("tools", "should_compress_context")
agent_builder.add_edge("compress_context", "orchestrator")
agent_builder.add_edge("fallback_response", "collect_answer")
agent_builder.add_edge("collect_answer", END)
agent_subgraph = agent_builder.compile()

# 3. Build the Main Graph (The orchestrator)
graph_builder = StateGraph(State)
graph_builder.add_node(summarize_history)
graph_builder.add_node(rewrite_query)
graph_builder.add_node(request_clarification)
graph_builder.add_node("agent", agent_subgraph)
graph_builder.add_node(aggregate_answers)

graph_builder.add_edge(START, "summarize_history")
graph_builder.add_edge("summarize_history", "rewrite_query")
graph_builder.add_conditional_edges("rewrite_query", route_after_rewrite)
graph_builder.add_edge("request_clarification", "rewrite_query")
graph_builder.add_edge(["agent"], "aggregate_answers")
graph_builder.add_edge("aggregate_answers", END)

# Compile into a final executable graph with Human-In-The-Loop interrupts
agent_graph = graph_builder.compile(checkpointer=checkpointer, interrupt_before=["request_clarification"])

Graph architecture explained:

The architecture flow diagram can be viewed here.

Agent Subgraph (processes individual questions):

START → orchestrator (invoke LLM with tools)
orchestrator → tools (if tool calls needed) OR fallback_response (if budget exhausted) OR collect_answer (if done)
tools → should_compress_context (check token budget)
should_compress_context → compress_context (if threshold exceeded) OR orchestrator (otherwise)
compress_context → orchestrator (resume with compressed memory)
fallback_response → collect_answer (package best-effort answer)
collect_answer → END (clean final answer with index)

Main Graph (orchestrates complete workflow):

START → summarize_history (extract conversation context from history)
summarize_history → rewrite_query (rewrite query with context, check clarity)
rewrite_query → request_clarification (if unclear) OR spawn parallel agent subgraphs via Send (if clear)
request_clarification → rewrite_query (after user provides clarification)
All agent subgraphs → aggregate_answers (merge all responses)
aggregate_answers → END (return final synthesized answer)
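
A quick way to sanity-check a topology like the agent subgraph is to write its edges as a dictionary and confirm that every node is reachable from START and can reach END. The sketch below uses only the standard library and does not touch LangGraph. The edge map is transcribed from the list above, and `reachable` is a hypothetical helper.

```python
from collections import deque

# Edges of the agent subgraph, transcribed from the list above.
AGENT_EDGES = {
    "START": ["orchestrator"],
    "orchestrator": ["tools", "fallback_response", "collect_answer"],
    "tools": ["should_compress_context"],
    "should_compress_context": ["compress_context", "orchestrator"],
    "compress_context": ["orchestrator"],
    "fallback_response": ["collect_answer"],
    "collect_answer": ["END"],
}

def reachable(edges: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search over the edge map."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

all_nodes = set(AGENT_EDGES) | {"END"}
unreachable = all_nodes - reachable(AGENT_EDGES, "START")
dead_ends = {node for node in AGENT_EDGES if "END" not in reachable(AGENT_EDGES, node)}
print(unreachable, dead_ends)
```

Both sets come out empty for the edges listed here. The only cycle runs through orchestrator, tools, and should_compress_context, and the iteration and tool-call budgets are what bound it.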

Step 11: Create Chat Interface

Build a Gradio interface with conversation persistence and human-in-the-loop support. For a complete end-to-end pipeline Gradio interface, including document ingestion, please refer to project/README.md.

Note: Full streaming support, including visibility into reasoning steps and tool calls, is implemented in the notebook and in the full project. The example below is intentionally minimal and shows only the basic Gradio integration pattern.

import gradio as gr
import uuid

def create_thread_id():
    """Generates a unique conversation ID for session management."""
    return {"configurable": {"thread_id": str(uuid.uuid4())}, "recursion_limit": 50}

def clear_session():
    """Wipes the current conversation thread for a fresh start."""
    global config
    agent_graph.checkpointer.delete_thread(config["configurable"]["thread_id"])
    config = create_thread_id()

def chat(message, history):
    """Simple synchronous chat wrapper for the agentic graph."""
    current_state = agent_graph.get_state(config)
    
    # Check if the graph is currently interrupted (waiting for user clarification)
    if current_state.next:
        agent_graph.update_state(config, {"messages": [HumanMessage(content=message.strip())]})
        result = agent_graph.invoke(None, config)
    else:
        result = agent_graph.invoke({"messages": [HumanMessage(content=message.strip())]}, config)
    
    return result['messages'][-1].content

# Initialize current session config
config = create_thread_id()

# Build the Gradio UI
with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    chatbot.clear(clear_session)
    gr.ChatInterface(fn=chat, chatbot=chatbot)

demo.launch(theme=gr.themes.Citrus())

You're done! You now have a fully functional Agentic RAG system with conversation memory, hierarchical indexing, and human-in-the-loop query clarification.

Modular Architecture

The app (project/ folder) is organized into modular components — each independently swappable without breaking the system.

📂 Project Structure

project/
├── app.py                    # Main Gradio application entry point
├── config.py                 # Configuration hub (models, chunk sizes, providers)
├── core/                     # RAG system orchestration
├── db/                       # Vector DB and parent chunk storage
├── rag_agent/                # LangGraph workflow (nodes, edges, prompts, tools)
└── ui/                       # Gradio interface

Key customization points: LLM provider, embedding model, chunking strategy, agent workflow, and system prompts — all configurable via config.py or their respective modules.

Full documentation in project/README.md.

Installation & Usage

Sample PDF files can be found here: javascript, blockchain, microservices, fortinet.

Option 1: Quickstart Notebook (Recommended for Testing)

Google Colab: Click the Open in Colab badge at the top of this README, upload your PDFs to a docs/ folder in the file browser, install dependencies with pip install -r requirements.txt, then run all cells top to bottom.

Local (Jupyter/VSCode): Optionally create and activate a virtual environment, install dependencies with pip install -r requirements.txt, add your PDFs to docs/, then run all cells top to bottom.

The chat interface will appear at the end.

Option 2: Full Python Project (Recommended for Development)

1. Install Dependencies

# Clone the repository
git clone <repository_url>
cd <repository_name>

# Optional: create and activate a virtual environment
# On macOS/Linux:
python -m venv venv && source venv/bin/activate
# On Windows:
python -m venv venv && .\venv\Scripts\activate

# Install packages
pip install -r requirements.txt

2. Run the Application

python project/app.py

3. Ask Questions

Open the local URL (e.g., http://127.0.0.1:7860) to start chatting.

Option 3: Docker Deployment

See project/README.md for full Docker instructions and system requirements.

Example Conversations

With Conversation Memory:

User: "How do I install SQL?"
Agent: [Provides installation steps from documentation]

User: "How do I update it?"
Agent: [Understands "it" = SQL, provides update instructions]

With Query Clarification:

User: "Tell me about that thing"
Agent: "I need more information. What specific topic are you asking about?"

User: "The installation process for PostgreSQL"
Agent: [Retrieves and answers with specific information]

Troubleshooting

Area	Common Problems	Suggested Solutions
Model Selection	- Responses ignore instructions - Tools (retrieval/search) used incorrectly - Poor context understanding - Hallucinations or incomplete aggregation	- Use more capable LLMs - Prefer models 7B+ for better reasoning - Consider cloud-based models if local models are limited
System Prompt Behavior	- Model answers without retrieving documents - Query rewriting loses context - Aggregation introduces hallucinations	- Make retrieval explicit in system prompts - Keep query rewriting close to user intent
Retrieval Configuration	- Relevant documents not retrieved - Too much irrelevant information	- Increase retrieved chunks (`k`) or lower similarity thresholds to improve recall - Reduce `k` or increase thresholds to improve precision
Chunk Size / Document Splitting	- Answers lack context or feel fragmented - Retrieval is slow or embedding costs are high	- Increase chunk & parent sizes for more context - Decrease chunk sizes to improve speed and reduce costs
Context Compression	- Agent loses important details after compression - Compressed summaries are too vague	- Tune the compression system prompt - Increase `BASE_TOKEN_THRESHOLD` to delay compression - Increase `TOKEN_GROWTH_FACTOR`
Agent Configuration	- Agent gives up too early - Agent loops too long	- Increase `MAX_TOOL_CALLS` / `MAX_ITERATIONS` for complex queries - Decrease them to speed up simple queries
Temperature & Consistency	- Responses inconsistent or overly creative - Responses too rigid or repetitive	- Set temperature to `0` for factual, consistent output - Slightly increase temperature for summarization or analysis tasks
Embedding Model Quality	- Poor semantic search - Weak performance on domain-specific or multilingual docs	- Use higher-quality or domain-specific embeddings - Re-index all documents after changing embeddings

💡 For additional troubleshooting tips see the README Troubleshooting.
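
The recall and precision trade-off in the Retrieval Configuration row can be illustrated with a toy ranking. The sketch below is not the project's retriever: the hit list is invented, and `k` and `score_threshold` are generic parameter names that may differ from those in `config.py`.

```python
# Hypothetical (chunk, similarity score) pairs, sorted by descending score.
hits = [("chunk-a", 0.91), ("chunk-b", 0.78), ("chunk-c", 0.64), ("chunk-d", 0.41)]

def select(hits: list[tuple[str, float]], k: int, score_threshold: float) -> list[str]:
    """Keep at most k hits whose score is at least score_threshold."""
    return [name for name, score in hits[:k] if score >= score_threshold]

# Larger k and a lower threshold favor recall.
high_recall = select(hits, k=4, score_threshold=0.4)

# Smaller k and a higher threshold favor precision.
high_precision = select(hits, k=2, score_threshold=0.7)

print(high_recall)
print(high_precision)
```

If relevant passages are missing from answers, move toward the first setting. If answers are padded with off-topic material, move toward the second.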

Known Limitations

Limitation	Detail
Tool-calling model required	The agent relies on native function/tool calling. Models that don't support it (most models under 7B) will ignore retrieval instructions or hallucinate. Always verify your model supports tool use before deploying.
Re-indexing required on embedding change	Switching the dense embedding model invalidates the existing Qdrant collection. You must delete the collection and re-upload all documents through the UI.
Local Qdrant is not Docker-volume-mounted by default	The `Dockerfile` does not mount `qdrant_db/` as a volume. Indexed documents are lost when the container is removed. Add `-v $(pwd)/qdrant_db:/app/qdrant_db` to your `docker run` command to persist data.
Minimal Gradio example does not stream	The Step 11 code sample in the main README is intentionally minimal and returns the full response at once. Full token streaming is implemented in `project/core/chat_interface.py` and `notebooks/agentic_rag.ipynb`.
Context compression is lossy	The compression node summarizes retrieved content to fit within the token budget. For documents requiring exact quote preservation (legal, compliance), increase `BASE_TOKEN_THRESHOLD` or disable compression.

Contributing

Contributions are welcome. To get started:

Fork the repo and create a branch (git checkout -b feature/your-feature)
Make your changes — keep edits focused and avoid unrelated refactors
If you add or rename a tool, update the corresponding system prompt in project/rag_agent/prompts.py
If you add a new env var, add it to project/.env.example
Open a pull request with a clear description of what changed and why

For bugs or feature requests, open a GitHub issue.

References & Related Projects

Original Foundation: Agentic RAG for Dummies — This project was built upon the excellent foundation provided by GiovanniPasq. We are deeply grateful for his high-quality work in preparing the original repository and patterns, which made this evolution possible.
LangGraph: LangChain LangGraph – For stateful, multi-actor applications with LLMs.
LangChain: LangChain Framework – Primary abstractions for language model operations.
Qdrant: Qdrant Vector Database – Efficient similarity search with hybrid (dense and sparse) indexing.
Ollama: Ollama – Local LLM inference engine.
