# Notebook 13 (Industrial Edition): Parallel Context Pre-processing for RAG

## Introduction: Optimizing the Context for Performance and Accuracy

This notebook explores a critical post-retrieval RAG pattern: **Parallel Context Pre-processing**. A common RAG strategy is to retrieve a large number of documents (high recall) to ensure the answer is present. However, stuffing all these documents into a single, massive context for the final generator is inefficient and can harm quality. It's slow, expensive, and susceptible to the "lost in the middle" problem, where LLMs struggle to find relevant facts buried in a sea of text.

### The Core Concept: Distill Before You Synthesize

This pattern introduces an intermediate "distillation" step between retrieval and generation. After retrieving a large set of candidate documents, we process them with several small LLM calls that run in parallel. Each call acts as a focused filter or summarizer for one document or a small subset of documents. The goal is to distill the raw, noisy context into a smaller, denser, and more relevant one. Only this distilled context is then passed to the final, expensive generator LLM.
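
The idea can be sketched without any LLM at all. The example below is a minimal illustration, not the notebook's implementation: `is_relevant` is a hypothetical keyword-overlap stand-in for the small LLM call, and the three documents are made up for the demo.

```python
from concurrent.futures import ThreadPoolExecutor

STOPWORDS = {"what", "is", "the", "for", "a", "of", "to"}

def is_relevant(question: str, document: str) -> bool:
    """Hypothetical stand-in for a small LLM relevance check (keyword overlap)."""
    q_terms = {w.strip("?.,").lower() for w in question.split()} - STOPWORDS
    d_terms = {w.strip("?.,").lower() for w in document.split()}
    return len(q_terms & d_terms) >= 2

def distill(question: str, documents: list[str], max_workers: int = 4) -> list[str]:
    """Check every document concurrently and keep the relevant ones in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda d: is_relevant(question, d), documents))
    return [d for d, keep in zip(documents, verdicts) if keep]

docs = [
    "A power supply unit of at least 1200W is recommended.",
    "The smart ring measures heart rate.",
    "Drivers are available for Linux and Windows.",
]
kept = distill("What is the recommended power supply?", docs)
print(kept)
```

Only the first document survives the filter, so the generator would see one sentence instead of three.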

### Role in a Large-Scale System: Optimizing the LLM's Contextual Input for Performance & Cost

This is a crucial optimization layer for any production RAG system that prioritizes cost, latency, and accuracy:
- **Cost Reduction:** The final generation step, often using the most powerful and expensive model, operates on a much smaller context, significantly reducing token costs.
- **Latency Improvement:** The final LLM call is faster with a smaller prompt. When the distillation calls genuinely run concurrently, the time they add is often smaller than the time they save in the final call.
- **Accuracy Enhancement:** By filtering out irrelevant or distracting documents, we provide a clean, focused context to the generator. This reduces the chance of hallucination and helps the model produce a more precise and correct answer.
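
The latency trade-off above can be estimated with a back-of-the-envelope model. This is a rough sketch under stated assumptions: the relevance checks run in waves of `workers` concurrent calls (for example against a hosted endpoint), each check takes about `t_check` seconds, and all timings below are hypothetical numbers rather than measurements from this notebook.

```python
import math

def distilled_latency(n_docs: int, workers: int, t_check: float, t_final_small: float) -> float:
    """Estimated wall-clock time: waves of parallel checks plus the smaller final call."""
    waves = math.ceil(n_docs / workers)
    return waves * t_check + t_final_small

def pays_off(n_docs: int, workers: int, t_check: float,
             t_final_large: float, t_final_small: float) -> bool:
    """True when distillation plus the small final call beats the large final call."""
    return distilled_latency(n_docs, workers, t_check, t_final_small) < t_final_large

# Hypothetical timings in seconds.
parallel = distilled_latency(n_docs=10, workers=5, t_check=0.5, t_final_small=3.0)
serial = distilled_latency(n_docs=10, workers=1, t_check=0.5, t_final_small=3.0)
print(parallel, serial)
print(pays_off(10, 5, 0.5, t_final_large=8.0, t_final_small=3.0))
```

With these made-up numbers the parallel pipeline takes 4.0 seconds against 8.0 seconds for the large-context call, while running the same checks one at a time would erase the benefit.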

We will build and compare two RAG systems: one that uses a large, raw context and another that uses a parallel pre-processing step. We will demonstrate the improvements in **latency, cost (token usage), and final answer accuracy**.

## Part 1: Setup and Environment

### 1.1: Installing Dependencies

In [None]:
%pip install -U langchain langgraph langsmith langchain-huggingface transformers accelerate bitsandbytes torch langchain-community sentence-transformers faiss-cpu tiktoken

### 1.2: API Keys and Environment Configuration

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LANGCHAIN_API_KEY")
_set_env("HUGGING_FACE_HUB_TOKEN")

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Industrial - RAG Context Pre-processing"

## Part 2: Components for the Context-Aware RAG System

### 2.1: The Language Model (LLM)

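We will use `meta-llama/Meta-Llama-3-8B-Instruct` for both the distillation and the generation tasks. This keeps the notebook simple; in production, the distillation calls would typically go to a smaller, cheaper model than the final generator.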

In [None]:
from langchain_huggingface import HuggingFacePipeline
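from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in 4-bit via bitsandbytes and run the compute in bfloat16.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config, device_map="auto")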
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, do_sample=False)
llm = HuggingFacePipeline(pipeline=pipe)

print("LLM Initialized. Ready to power our RAG system.")

LLM Initialized. Ready to power our RAG system.


### 2.2: Creating the Knowledge Base

We'll create a slightly larger knowledge base with some documents that are only tangentially related to each other. This will create a scenario where a high-recall retrieval step pulls in some noisy, irrelevant documents, making the distillation step necessary.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

kb_docs = [
    Document(page_content="The QLeap-V4 processor, released in 2023, is our flagship AI accelerator. Its primary use case is for training large language models.", metadata={"source": "QL-V4-SpecSheet"}),
    Document(page_content="A key feature of the QLeap-V4 is its advanced thermal management system. The official error code for overheating is 'ERR_THROTTLE_900'.", metadata={"source": "QL-V4-Troubleshooting"}),
    Document(page_content="For optimal performance with the QLeap-V4, a power supply unit of at least 1200W is recommended.", metadata={"source": "QL-V4-HardwareGuide"}),
    Document(page_content="Our previous generation chip, the QLeap-V3 (released in 2021), had a known issue with its memory controller that was fixed in later revisions.", metadata={"source": "QL-V3-KnownIssues"}),
    Document(page_content="The Aura Smart Ring uses a photoplethysmography (PPG) sensor to measure heart rate.", metadata={"source": "Aura-TechSpec"}),
    Document(page_content="The official price for the QLeap-V4 is $1,999 USD. Educational and volume discounts are available.", metadata={"source": "QL-V4-Pricing"}),
    Document(page_content="Software drivers for the QLeap-V4 are available for Linux and Windows. The latest driver version is 512.77.", metadata={"source": "QL-V4-Downloads"}),
    Document(page_content="Project 'Titan' is our company's initiative to develop energy-efficient hardware, but it is a separate research project from the QLeap product line.", metadata={"source": "Project-Titan-FAQ"}),
    Document(page_content="Warranty claims for the QLeap-V4 processor must be filed within 2 years of the purchase date.", metadata={"source": "QL-V4-Warranty"}),
    Document(page_content="The QLeap-V3 chip had a recommended power supply of 800W.", metadata={"source": "QL-V3-HardwareGuide"})
]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(kb_docs, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10}) # High recall retriever

print(f"Knowledge Base created with {len(kb_docs)} documents.")

Knowledge Base created with 10 documents.


### 2.3: Components for Token Counting

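To measure cost savings, we need a way to count the number of tokens in a prompt. We'll use the `tiktoken` library with the `cl100k_base` encoding. This is an OpenAI encoding rather than Llama 3's own tokenizer, so the counts are estimates, but they are sufficient for comparing the relative size of two contexts.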

In [None]:
import tiktoken

def count_tokens(text: str) -> int:
    """Counts the number of tokens in a string using tiktoken."""
    # Using a common encoding for estimation
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))
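
If exact counts matter, one option is to count with the same tokenizer the model uses. The sketch below is an illustration rather than part of the notebook's pipeline: `count_tokens_with` and `percent_saved` are hypothetical helpers, and whitespace splitting stands in for a real tokenizer so the example runs offline.

```python
from typing import Callable, Sequence

def count_tokens_with(encode: Callable[[str], Sequence], text: str) -> int:
    """Count tokens using any encoder that maps a string to a sequence of tokens."""
    return len(encode(text))

def percent_saved(raw_tokens: int, distilled_tokens: int) -> float:
    """Percentage of context tokens removed by distillation."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 100.0 * (raw_tokens - distilled_tokens) / raw_tokens

# Whitespace splitting is only a stand-in tokenizer for this demo.
raw = count_tokens_with(str.split, "a b c d e f g h")
distilled = count_tokens_with(str.split, "a b")
print(raw, distilled, percent_saved(raw, distilled))
```

In the notebook itself, the `tokenizer` loaded in section 2.1 could be plugged in as `lambda s: tokenizer.encode(s, add_special_tokens=False)`.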

## Part 3: Building the RAG Systems

We'll build two systems: the simple, large-context baseline, and the advanced graph with the parallel distillation step.

### 3.1: The Simple RAG System (Baseline)

This is a standard RAG chain that retrieves 10 documents and sends them all to the generator.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

generator_prompt_template = (
    "You are an expert technical support agent. Answer the user's question with high accuracy, based *only* on the following context. "
    "If the context does not contain the answer, state that clearly.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
generator_prompt = ChatPromptTemplate.from_template(generator_prompt_template)

def format_docs(docs):
    return "\n\n".join(f"[Source: {doc.metadata.get('source', 'N/A')}] {doc.page_content}" for doc in docs)

simple_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | generator_prompt
    | llm
    | StrOutputParser()
)
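
To see what the generator actually receives, it helps to look at the string `format_docs` produces. The sketch below repeats the function with a minimal `Doc` dataclass, which is an assumed stand-in for `langchain_core.documents.Document` so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document (assumption for offline use)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    # Same formatting as the notebook: a source tag followed by the text,
    # with a blank line between documents.
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'N/A')}] {doc.page_content}" for doc in docs
    )

context = format_docs([
    Doc("A power supply unit of at least 1200W is recommended.", {"source": "QL-V4-HardwareGuide"}),
    Doc("A document with no source metadata."),
])
print(context)
```

Every retrieved document adds one such tagged paragraph, which is why the baseline prompt grows linearly with `k`.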

### 3.2: The Advanced RAG System with Parallel Distillation

This system uses a LangGraph graph to add a distillation node (`distill_context_node`, registered as `distill`) between retrieval and generation.

#### 3.2.1: Graph State and Nodes

In [None]:
from typing import TypedDict, List
from pydantic import BaseModel, Field
from concurrent.futures import ThreadPoolExecutor, as_completed

class RAGGraphState(TypedDict):
    question: str
    raw_docs: List[Document]
    distilled_docs: List[Document]
    final_answer: str

class RelevancyCheck(BaseModel):
    """A check for whether a document is relevant to a question."""
    is_relevant: bool = Field(description="True if the document contains information that directly helps answer the question.")
    brief_explanation: str = Field(description="A one-sentence explanation of why the document is or is not relevant.")

# Node 1: Retrieval
def retrieval_node(state: RAGGraphState):
    print("--- [Retriever] Retrieving initial set of 10 documents... ---")
    raw_docs = retriever.invoke(state['question'])
    return {"raw_docs": raw_docs}

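# Node 2: Parallel Context Distillation
# HuggingFacePipeline is a plain text-completion LLM and does not implement
# with_structured_output, so we request JSON and parse it with PydanticOutputParser.
from langchain_core.output_parsers import PydanticOutputParser

relevancy_parser = PydanticOutputParser(pydantic_object=RelevancyCheck)
distiller_prompt = ChatPromptTemplate.from_template(
    "Given the user's question, determine if the following document is relevant for answering it. "
    "Provide a brief explanation.\n\n"
    "{format_instructions}\n\n"
    "Question: {question}\n\nDocument:\n{document}"
).partial(format_instructions=relevancy_parser.get_format_instructions())
distiller_chain = distiller_prompt | llm | relevancy_parser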

def distill_context_node(state: RAGGraphState):
    """Scans all retrieved documents in parallel to filter for relevance."""
    print(f"--- [Distiller] Pre-processing {len(state['raw_docs'])} raw documents in parallel... ---")
    
    relevant_docs = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Submit one relevance check per document and remember which future belongs to which doc.
        future_to_doc = {
            executor.submit(
                distiller_chain.invoke,
                {"question": state['question'], "document": doc.page_content},
            ): doc
            for doc in state['raw_docs']
        }
        for future in as_completed(future_to_doc):
            doc = future_to_doc[future]
            try:
                result = future.result()
                if result.is_relevant:
                    print(f"  - Doc '{doc.metadata['source']}' IS relevant. Reason: {result.brief_explanation}")
                    relevant_docs.append(doc)
                else:
                    print(f"  - Doc '{doc.metadata['source']}' is NOT relevant. Reason: {result.brief_explanation}")
            except Exception as e:
                print(f"Error processing doc {doc.metadata['source']}: {e}")
    
    print(f"--- [Distiller] Distilled context down to {len(relevant_docs)} documents. ---")
    return {"distilled_docs": relevant_docs}

# Node 3: Generation
def generation_node(state: RAGGraphState):
    print("--- [Generator] Synthesizing final answer from distilled context... ---")
    context = format_docs(state['distilled_docs'])
    answer = (generator_prompt | llm | StrOutputParser()).invoke({"context": context, "question": state['question']})
    return {"final_answer": answer}
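
One edge case is worth noting: if every relevance check fails or returns "not relevant", `generation_node` receives an empty context. A possible guard, sketched below, falls back to the top few raw documents. `choose_context` and `fallback_k` are hypothetical names, and the sketch assumes `raw_docs` is ordered by retrieval score, best first.

```python
def choose_context(raw_docs: list, distilled_docs: list, fallback_k: int = 3) -> list:
    """Prefer the distilled documents; fall back to the top raw documents if none survived."""
    if distilled_docs:
        return distilled_docs
    # Assumption: raw_docs is sorted by retrieval score, best first.
    return raw_docs[:fallback_k]

raw = ["doc-1", "doc-2", "doc-3", "doc-4"]
print(choose_context(raw, ["doc-2"]))
print(choose_context(raw, []))
```

Whether to fall back or to answer "the context does not contain the answer" is a design choice; falling back favors recall, while an empty context favors precision.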

#### 3.2.2: Assembling the Graph

In [None]:
from langgraph.graph import StateGraph, END

workflow = StateGraph(RAGGraphState)
workflow.add_node("retrieve", retrieval_node)
workflow.add_node("distill", distill_context_node)
workflow.add_node("generate", generation_node)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "distill")
workflow.add_edge("distill", "generate")
workflow.add_edge("generate", END)

advanced_rag_app = workflow.compile()
print("Advanced RAG graph compiled successfully.")

Advanced RAG graph compiled successfully.
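
Conceptually, this linear graph threads one state dictionary through three functions, and each node's returned keys are merged into that state. The plain-Python sketch below mimics this flow to make it concrete. It is not how LangGraph is implemented, and the three stub nodes are hypothetical stand-ins for the real ones.

```python
from typing import Callable

State = dict
Node = Callable[[State], dict]

def run_linear(nodes: list[Node], state: State) -> State:
    """Run nodes in order, merging each node's partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state

# Hypothetical stub nodes standing in for retrieval, distillation, and generation.
def retrieve(state: State) -> dict:
    return {"raw_docs": ["doc about V4 power", "doc about V3 power"]}

def distill(state: State) -> dict:
    return {"distilled_docs": [d for d in state["raw_docs"] if "V4" in d]}

def generate(state: State) -> dict:
    return {"final_answer": f"Answered from {len(state['distilled_docs'])} document(s)."}

final = run_linear([retrieve, distill, generate], {"question": "V4 power supply?"})
print(final["final_answer"])
```

The final state still contains `raw_docs` and `distilled_docs`, which is why the comparison in Part 4 can read `distilled_docs` from the graph's result.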


## Part 4: Head-to-Head Comparison

Let's ask a specific question. The vector search will likely pull in documents about both the QLeap-V4 and the older QLeap-V3 due to semantic similarity. This is a classic scenario where context distillation is needed to prevent the generator from getting confused.

In [None]:
user_query = "What is the recommended power supply for the QLeap-V4 processor?"

### 4.1: Running the Simple RAG System (Large Context)

In [None]:
import time

print("="*60)
print("                SIMPLE RAG SYSTEM (LARGE CONTEXT)")
print("="*60 + "\n")

start_time = time.time()
raw_docs_simple = retriever.invoke(user_query)
context_simple = format_docs(raw_docs_simple)
context_tokens_simple = count_tokens(context_simple)

print(f"--- Retrieved {len(raw_docs_simple)} Documents ---")
print(f"Context Size: {context_tokens_simple} tokens\n")

print("--- Generation ---")
gen_start_time = time.time()
simple_answer = simple_rag_chain.invoke(user_query)
gen_time_simple = time.time() - gen_start_time
print(f"Generation Time: {gen_time_simple:.2f} seconds")
print("Final Answer:")
print(simple_answer)

                SIMPLE RAG SYSTEM (LARGE CONTEXT)

--- Retrieved 10 Documents ---
Context Size: 284 tokens

--- Generation ---
Generation Time: 7.89 seconds
Final Answer:
Based on the context, a power supply unit of at least 1200W is recommended for the QLeap-V4 processor. The QLeap-V3 chip had a recommended power supply of 800W.
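
A note on measurement: the timing above wraps `simple_rag_chain.invoke`, which also performs retrieval, whereas the advanced system in the next section is timed on generation only. For a like-for-like comparison it can help to route both calls through one small helper. The sketch below is one possible version; `timed` is a hypothetical helper, and it uses `time.perf_counter`, which is intended for measuring short intervals.

```python
import time
from typing import Any, Callable

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Demo with a cheap built-in; in the notebook, fn could be a chain's invoke method.
result, seconds = timed(sum, [1, 2, 3])
print(result, f"{seconds:.6f}s")
```

In the notebook this could be used as `timed(simple_rag_chain.invoke, user_query)`, with the same helper applied to the generation-only chain for both systems.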


### 4.2: Running the Advanced RAG System (Distilled Context)

In [None]:
print("="*60)
print("             ADVANCED RAG SYSTEM (DISTILLED CONTEXT)")
print("="*60 + "\n")

inputs = {"question": user_query}
advanced_result = None
for output in advanced_rag_app.stream(inputs, stream_mode="values"):
    advanced_result = output

distilled_docs = advanced_result['distilled_docs']
context_advanced = format_docs(distilled_docs)
context_tokens_advanced = count_tokens(context_advanced)

print(f"Context Size: {context_tokens_advanced} tokens\n")

# Manually time the final generation step for comparison
print("--- [Generator] Synthesizing final answer from distilled context... ---")
gen_start_time = time.time()
advanced_answer = (generator_prompt | llm | StrOutputParser()).invoke({"context": context_advanced, "question": user_query})
gen_time_advanced = time.time() - gen_start_time
print(f"Generation Time: {gen_time_advanced:.2f} seconds")
print("Final Answer:")
print(advanced_answer)

             ADVANCED RAG SYSTEM (DISTILLED CONTEXT)

--- [Retriever] Retrieving initial set of 10 documents... ---
--- [Distiller] Pre-processing 10 raw documents in parallel... ---
  - Doc 'QL-V4-HardwareGuide' IS relevant. Reason: The document directly states the recommended power supply for the QLeap-V4 processor.
  - Doc 'QL-V3-HardwareGuide' is NOT relevant. Reason: The document is about the QLeap-V3, not the QLeap-V4.
  - Doc 'QL-V4-SpecSheet' is NOT relevant. Reason: This document describes the QLeap-V4's use case but does not mention the power supply.
  - Doc 'QL-V4-Pricing' is NOT relevant. Reason: The document discusses the price, not the power supply requirements.
  - Doc 'QL-V3-KnownIssues' is NOT relevant. Reason: This document is about the previous generation chip, the QLeap-V3.
  - Doc 'QL-V4-Troubleshooting' is NOT relevant. Reason: The document mentions the thermal system but not the power supply unit.
  - Doc 'Project-Titan-FAQ' is NOT relevant. Reason: The document 