# Notebook 11 (Industrial Edition): Sharded & Scattered Retrieval for Large-Scale RAG

## Introduction: The Infrastructure for Petabyte-Scale Knowledge

This notebook tackles a fundamental challenge of production RAG: how to maintain low-latency, high-accuracy retrieval when your knowledge base grows to millions or billions of documents. The solution is an infrastructure pattern known as **Sharded & Scattered Retrieval**.

### The Core Concept: Divide and Conquer

Instead of a single, massive vector store that becomes slow and unwieldy, we partition (shard) our knowledge base into multiple smaller, independent vector stores. These shards can be organized by topic, date, data source, or any other logical division. When a user query arrives, a central orchestrator "scatters" the query to all shards, which perform their searches in parallel. The results are then "gathered" and re-ranked to find the globally best documents.
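
The scatter, gather, and re-rank steps can be sketched in plain Python before any vector store is involved. This is a minimal sketch under stated assumptions: the `SHARDS` dictionary, the word-overlap `score` function, and the document ids are all hypothetical stand-ins for real indexes and embedding similarity.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical toy shards: each maps a document id to its text.
SHARDS = {
    "engineering": {"eng-1": "status api endpoint for the processor",
                    "eng-2": "firmware update for the ring sensor"},
    "marketing": {"mkt-1": "press release for the new processor",
                  "mkt-2": "blog post about the smart mug"},
}

def score(query: str, text: str) -> float:
    """Toy relevance: fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(text.lower().split())) / len(q_words)

def search_shard(name: str, query: str, k: int):
    """Return the shard's local top-k as (score, doc_id, shard_name) tuples."""
    hits = [(score(query, text), doc_id, name) for doc_id, text in SHARDS[name].items()]
    return heapq.nlargest(k, hits)

def scatter_gather(query: str, k: int = 2):
    # Scatter: one search task per shard, run concurrently.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        futures = [pool.submit(search_shard, name, query, k) for name in SHARDS]
        # Gather: collect every shard's local top-k.
        candidates = [hit for f in futures for hit in f.result()]
    # Re-rank: keep the global top-k across all shards.
    return heapq.nlargest(k, candidates)

results = scatter_gather("processor status api")
print(results)
```

Each shard only returns its local top-k, so the orchestrator merges at most `k * number_of_shards` candidates regardless of how large each shard is.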

### Role in a Large-Scale System: Enabling Low-Latency Knowledge Access Across Enterprise-Scale Data

This pattern is a common foundation for enterprise-grade search and RAG systems once a single index becomes a bottleneck. It provides:
- **Scalability:** The system scales horizontally: more data is accommodated by adding shards rather than by growing one index.
- **Performance:** Retrieval latency stays low because each search runs on a smaller, faster index and the shards are searched in parallel. End-to-end latency is bounded by the slowest shard plus the gather step.
- **Accuracy & Organization:** Sharding allows for the creation of specialized knowledge domains. A search for a technical term can be prioritized in the engineering shard, leading to more relevant results and higher accuracy.
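
As a rough illustration of how documents might be assigned to shards and how a query might be steered toward the most relevant one, here is a small sketch. The record layout, the `ROUTING_HINTS` term sets, and both helper functions are hypothetical; a production router would more likely use a classifier or embedding similarity than keyword sets.

```python
from collections import defaultdict

# Hypothetical records; the "source" field plays the role of a shard key.
records = [
    {"text": "PID controller tuning notes", "source": "eng-kb"},
    {"text": "Q4 launch press release", "source": "mkt-kb"},
    {"text": "Firmware changelog v2.1", "source": "eng-kb"},
]

def partition_by_key(items, key="source"):
    """Group records into shards keyed by one metadata field."""
    shards = defaultdict(list)
    for item in items:
        shards[item[key]].append(item)
    return dict(shards)

# Hypothetical routing hints: terms suggesting a shard should be searched first.
ROUTING_HINTS = {
    "eng-kb": {"api", "firmware", "endpoint", "controller"},
    "mkt-kb": {"press", "launch", "pricing", "blog"},
}

def shard_priority(query, hints=ROUTING_HINTS):
    """Order shard names by how many hint terms the query contains (most first)."""
    words = set(query.lower().split())
    return sorted(hints, key=lambda name: len(words & hints[name]), reverse=True)

shards = partition_by_key(records)
order = shard_priority("which api endpoint reports firmware status")
print({name: len(docs) for name, docs in shards.items()}, order)
```

The priority order can be used either to weight results from the preferred shard or to skip low-priority shards entirely when latency budgets are tight.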

We will build a simulated two-shard RAG system (Engineering vs. Marketing) and compare it to a monolithic system to demonstrate the concrete benefits in both **latency** and **answer quality**.

## Part 1: Setup and Environment

We'll need our standard libraries plus `langchain-community`, `sentence-transformers`, and `faiss-cpu` for vector stores and embeddings.

In [None]:
%pip install -U langchain langgraph langsmith langchain-huggingface transformers accelerate bitsandbytes torch langchain-community sentence-transformers faiss-cpu

### 1.2: API Keys and Environment Configuration

We will need our LangSmith and Hugging Face keys.

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LANGCHAIN_API_KEY")
_set_env("HUGGING_FACE_HUB_TOKEN")

# Configure LangSmith for tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Industrial - RAG Sharded Retrieval"

## Part 2: Components for the Sharded RAG System

We'll need an LLM and, crucially, we will build multiple, distinct knowledge base shards.

### 2.1: The Language Model (LLM)

We will use `meta-llama/Meta-Llama-3-8B-Instruct` for our generator agent.

In [None]:
from langchain_huggingface import HuggingFacePipeline
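from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Request 4-bit quantization through BitsAndBytesConfig; passing a bare
# `load_in_4bit=True` to from_pretrained is deprecated in recent transformers releases.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", quantization_config=quant_config)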
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=1024, do_sample=False)
llm = HuggingFacePipeline(pipeline=pipe)

print("LLM Initialized. Ready to power our RAG system.")

LLM Initialized. Ready to power our RAG system.


### 2.2: Creating the Knowledge Base Shards

We will create two separate vector stores to simulate our shards. Each will contain domain-specific information.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings  # the langchain_community version is deprecated
from langchain_core.documents import Document

# Engineering KB Documents
eng_docs = [
    Document(page_content="The QuantumLeap V3 processor utilizes a 3nm process node and features a dedicated AI accelerator core with 128 tensor units. API endpoint `/api/v3/status` provides real-time thermal throttling data.", metadata={"source": "eng-kb"}),
    Document(page_content="Firmware update v2.1 for the Aura Smart Ring optimizes the photoplethysmography (PPG) sensor algorithm for more accurate sleep stage detection. The update is deployed via the mobile app.", metadata={"source": "eng-kb"}),
    Document(page_content="The Smart Mug's heating element is a nickel-chromium coil controlled by a PID controller. It maintains temperature within +/- 1 degree Celsius. Battery polling is done via the `getBattery` function.", metadata={"source": "eng-kb"})
]

# Marketing KB Documents
mkt_docs = [
    Document(page_content="Press Release: Unveiling the QuantumLeap V3, the AI processor that redefines speed. 'It's a game-changer for creative professionals,' says CEO Jane Doe. Available Q4.", metadata={"source": "mkt-kb"}),
    Document(page_content="Product Page: The Aura Smart Ring is your personal wellness companion. Crafted from aerospace-grade titanium, it empowers you to unlock your full potential by understanding your body's signals.", metadata={"source": "mkt-kb"}),
    Document(page_content="Blog Post: 'Five Ways Our Smart Mug Supercharges Your Morning Routine.' The perfect temperature, from the first sip to the last, means your coffee is always perfect.", metadata={"source": "mkt-kb"})
]

# Create embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create the two vector store shards
eng_vectorstore = FAISS.from_documents(eng_docs, embedding=embeddings)
mkt_vectorstore = FAISS.from_documents(mkt_docs, embedding=embeddings)

eng_retriever = eng_vectorstore.as_retriever(search_kwargs={"k": 2})
mkt_retriever = mkt_vectorstore.as_retriever(search_kwargs={"k": 2})

print(f"Knowledge Base shards created: Engineering KB ({len(eng_docs)} docs), Marketing KB ({len(mkt_docs)} docs).")

Knowledge Base shards created: Engineering KB (3 docs), Marketing KB (3 docs).


## Part 3: The Baseline - A Monolithic RAG System

To establish a baseline, we'll first create a traditional RAG system with a single, combined knowledge base. Because our toy index holds only six documents, we add an artificial 2.5-second delay to its retrieval step to mimic searching a much larger index.

In [None]:
import time
from langchain_core.runnables import RunnableLambda

# 1. Create the monolithic vector store
all_docs = eng_docs + mkt_docs
monolithic_vectorstore = FAISS.from_documents(all_docs, embedding=embeddings)
monolithic_retriever = monolithic_vectorstore.as_retriever(search_kwargs={"k": 4})

# 2. Simulate the increased latency of a large index
def slow_retrieval(query):
    print("--- [Monolithic Retriever] Searching large index... (simulating high latency) ---")
    time.sleep(2.5) # Simulate latency
    return monolithic_retriever.invoke(query)

slow_monolithic_retriever = RunnableLambda(slow_retrieval)

# 3. Create the monolithic RAG chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

generator_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert technical and marketing support agent. Answer the user's question based *only* on the provided context.\n\nContext:\n{context}"),
    ("human", "Question: {question}")
])

def format_docs(docs):
    return "\n\n".join(f"[Source: {doc.metadata.get('source', 'N/A')}] {doc.page_content}" for doc in docs)

monolithic_rag_chain = (
    {"context": slow_monolithic_retriever | format_docs, "question": RunnablePassthrough()}
    | generator_prompt
    | llm
    | StrOutputParser()
)

## Part 4: Building the Sharded RAG Graph

Now, let's build the superior, sharded system. The core of this system is a node that scatters the query to our two shards in parallel.
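
To see why parallel scattering helps, the following sketch times two simulated shard lookups run one after another and then concurrently. The `fake_shard_search` helper and the 0.2-second delays are assumptions chosen for illustration; they are not measurements of any real index.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_shard_search(delay_s: float) -> float:
    """Stand-in for a shard lookup; sleeps to mimic I/O-bound latency."""
    time.sleep(delay_s)
    return delay_s

delays = [0.2, 0.2]  # assumed per-shard latencies, in seconds

# Sequential: total time is roughly the sum of the shard latencies.
start = time.perf_counter()
for d in delays:
    fake_shard_search(d)
sequential_s = time.perf_counter() - start

# Parallel: total time is roughly the latency of the slowest shard.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    list(pool.map(fake_shard_search, delays))
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Threads suit this workload because shard lookups are dominated by waiting (network or disk), during which Python releases the GIL.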

### 4.1: Defining the Graph State and Nodes

In [None]:
from typing import TypedDict, List
from langchain_core.documents import Document
from concurrent.futures import ThreadPoolExecutor
from langchain_core.runnables import RunnableConfig

class ShardedRAGState(TypedDict):
    question: str
    retrieved_docs: List[Document]
    final_answer: str

# Node 1: Parallel Retrieval (Scatter-Gather)
def parallel_retrieval_node(state: ShardedRAGState):
    """Scatters the query to all shards and gathers the results."""
    print("--- [Meta-Retriever] Scattering query to Engineering and Marketing shards in parallel... ---")
    
    def p_retrieval(retriever):
        """Search one shard; the sleep simulates a network hop plus a smaller index lookup."""
        time.sleep(0.5)
        return retriever.invoke(state['question'])

    shard_retrievers = [eng_retriever, mkt_retriever]

    # Scatter: submit one retrieval task per shard so the searches run concurrently.
    with ThreadPoolExecutor(max_workers=len(shard_retrievers)) as executor:
        futures = [executor.submit(p_retrieval, retriever) for retriever in shard_retrievers]

        # Gather: collect results in shard order (result() blocks until that shard finishes).
        all_docs = []
        for future in futures:
            all_docs.extend(future.result())
    
    # In a real system, you'd add a re-ranking step here. For now, we'll just deduplicate.
    unique_docs = list({doc.page_content: doc for doc in all_docs}.values())
    print(f"--- [Meta-Retriever] Gathered {len(unique_docs)} unique documents from 2 shards. ---")
    return {"retrieved_docs": unique_docs}

# Node 2: Generation Node (same as before)
def generation_node(state: ShardedRAGState):
    """Synthesizes the final answer from the gathered documents."""
    print("--- [Generator] Synthesizing final answer... ---")
    context = format_docs(state['retrieved_docs'])
    answer = (
        generator_prompt 
        | llm 
        | StrOutputParser()
    ).invoke({"context": context, "question": state['question']})
    return {"final_answer": answer}
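
The retrieval node above only deduplicates; it does not re-rank. One simple way to fill in that step is reciprocal rank fusion (RRF), which merges ranked lists without needing comparable scores across shards. This is a swapped-in technique for illustration, not the notebook's own method, and the function name and document ids below are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=3):
    """Fuse several best-first lists of document ids into one ranking.

    Each document earns 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # sorted() is stable, so ties keep the order in which documents were first seen.
    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused[:top_n]

# Hypothetical per-shard rankings, best match first.
eng_ranking = ["eng-quantumleap", "eng-mug"]
mkt_ranking = ["mkt-press-release", "mkt-ring"]
fused = reciprocal_rank_fusion([eng_ranking, mkt_ranking], top_n=2)
print(fused)
```

When shards hold disjoint documents, as they do here, RRF reduces to interleaving the shards by rank. If the shards share an embedding model, merging on the raw similarity scores or applying a cross-encoder re-ranker are reasonable alternatives.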

### 4.2: Assembling the Sharded Graph

In [None]:
from langgraph.graph import StateGraph, END

workflow = StateGraph(ShardedRAGState)
workflow.add_node("parallel_retrieval", parallel_retrieval_node)
workflow.add_node("generate_answer", generation_node)

workflow.set_entry_point("parallel_retrieval")
workflow.add_edge("parallel_retrieval", "generate_answer")
workflow.add_edge("generate_answer", END)

sharded_rag_app = workflow.compile()
print("Sharded RAG graph compiled successfully.")

Sharded RAG graph compiled successfully.


## Part 5: Head-to-Head Comparison

Now we will ask both systems a question that requires information from *both* the engineering and marketing knowledge bases to be answered completely and accurately. The monolithic system may struggle to find the less-dominant but still relevant context.

In [None]:
# This query has strong marketing keywords ('game-changer', 'creative professionals')
# but also a specific technical question ('API status endpoint').
user_query = "I heard the new QuantumLeap V3 is a 'game-changer for creative professionals'. Can you tell me more about it, and is there an API endpoint to check its status?"

### 5.1: Running the Monolithic RAG System

In [None]:
print("--- [MONOLITHIC RAG] Starting run... ---")
start_time = time.time()

# We'll capture the context to inspect it
retrieved_context_mono = ""
def capture_context_mono(docs):
    global retrieved_context_mono
    retrieved_context_mono = format_docs(docs)
    return retrieved_context_mono

monolithic_rag_chain_instrumented = (
    {"context": slow_monolithic_retriever | capture_context_mono, "question": RunnablePassthrough()}
    | generator_prompt
    | llm
    | StrOutputParser()
)
monolithic_answer = monolithic_rag_chain_instrumented.invoke(user_query)
monolithic_time = time.time() - start_time

print("="*60)
print("               MONOLITHIC RAG SYSTEM OUTPUT")
print("="*60 + "\n")
print("Retrieved Context:")
print(retrieved_context_mono + "\n")
print("Final Answer:")
print(monolithic_answer)

--- [MONOLITHIC RAG] Starting run... ---
--- [Monolithic Retriever] Searching large index... (simulating high latency) ---
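============================================================
               MONOLITHIC RAG SYSTEM OUTPUT
============================================================
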
Retrieved Context:
[Source: mkt-kb] Press Release: Unveiling the QuantumLeap V3, the AI processor that redefines speed. 'It's a game-changer for creative professionals,' says CEO Jane Doe. Available Q4.
[Source: eng-kb] The QuantumLeap V3 processor utilizes a 3nm process node and features a dedicated AI accelerator core with 128 tensor units. API endpoint `/api/v3/status` provides real-time thermal throttling data.
[Source: mkt-kb] Product Page: The Aura Smart Ring is your personal wellness companion. Crafted from aerospace-grade titanium, it empowers you to unlock your full potential by understanding your body's signals.

Final Answer:
The QuantumLeap V3 is an AI processor described as a 'game-changer for creative professionals' according to CEO Jane Doe, and it will be available in Q4. It uses a 3nm process and has an

### 5.2: Running the Sharded RAG System

In [None]:
print("--- [SHARDED RAG] Starting run... ---")
start_time = time.time()
inputs = {"question": user_query}
sharded_result = None
for output in sharded_rag_app.stream(inputs, stream_mode="values"):
    sharded_result = output
sharded_time = time.time() - start_time

retrieved_context_sharded = format_docs(sharded_result['retrieved_docs'])
sharded_answer = sharded_result['final_answer']

print("="*60)
print("                 SHARDED RAG SYSTEM OUTPUT")
print("="*60 + "\n")
print("Retrieved Context:")
print(retrieved_context_sharded + "\n")
print("Final Answer:")
print(sharded_answer)

--- [SHARDED RAG] Starting run... ---
--- [Meta-Retriever] Scattering query to Engineering and Marketing shards in parallel... ---
--- [Meta-Retriever] Gathered 2 unique documents from 2 shards. ---
--- [Generator] Synthesizing final answer... ---
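============================================================
                 SHARDED RAG SYSTEM OUTPUT
============================================================
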
Retrieved Context:
[Source: mkt-kb] Press Release: Unveiling the QuantumLeap V3, the AI processor that redefines speed. 'It's a game-changer for creative professionals,' says CEO Jane Doe. Available Q4.
[Source: eng-kb] The QuantumLeap V3 processor utilizes a 3nm process node and features a dedicated AI accelerator core with 128 tensor units. API endpoint `/api/v3/status` provides real-time thermal throttling data.

Final Answer:
The QuantumLeap V3 is hailed as a 'game-changer for creative professionals' in its press release and is set to be available in Q4. On the technical side, it is built on a 3nm process and includes an API endpoint at `/api/v3/status` to monitor its real-time thermal status.
