# Level 4: Specialty - RAG, Multi-Agent Systems & Production Patterns

This notebook covers specialty topics: Retrieval-Augmented Generation (RAG), multi-agent orchestration, and production deployment patterns.

## Learning Objectives
- Build a complete RAG pipeline with vector stores
- Implement multi-agent systems with LangGraph
- Use LangSmith for observability and tracing
- Apply production patterns: fallbacks, retries for rate-limited APIs, and tracing
- Evaluate model outputs with structured, cross-model scoring

## Prerequisites
- Completed Notebooks 01-03
- ChromaDB installed (`pip install chromadb langchain-chroma`)

---

**References:**
- [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [LangSmith Docs](https://docs.langchain.com/docs/langsmith/)
- [LangGraph Multi-Agent](https://langchain-ai.github.io/langgraph/how-tos/multi-agent/)

## 1. Setup

In [1]:
# Import required libraries
import os
import sys
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Add parent directory to path for shared config
sys.path.append('..')

# Import global model configuration
from config import (
    GPT_MODEL, GEMINI_MODEL,
    GPT_MODEL_NAME, GEMINI_MODEL_NAME,
    get_model, list_available_models,
)

print(f"Using GPT model:    {GPT_MODEL_NAME}")
print(f"Using Gemini model: {GEMINI_MODEL_NAME}")
print()
list_available_models()

OpenAI client initialized  -> model: gpt-4o-mini
Google client initialized  -> model: gemini-3-flash-preview
Using GPT model:    gpt-4o-mini
Using Gemini model: gemini-3-flash-preview

Available Models:
-------------------------------------------------------
  gpt-4o-mini          -> ChatOpenAI(gpt-4o-mini)
  gemini-3-flash-preview -> ChatGoogleGenerativeAI(gemini-3-flash-preview)
-------------------------------------------------------


## 2. Retrieval-Augmented Generation (RAG)

RAG combines a **retriever** (to find relevant documents) with a **generator** (LLM) to answer questions grounded in your own data.

### RAG Pipeline
```
Documents -> Split -> Embed -> Vector Store -> Retrieve -> Generate Answer
```

In [None]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Create sample documents (in production, load from files/web)
documents = [
    Document(
        page_content="""LangChain is a framework for developing applications powered by
        large language models. It provides tools for prompt management, chains, agents,
        and memory. LangChain supports multiple LLM providers including OpenAI, Google,
        Anthropic, and local models.""",
        metadata={"source": "langchain_overview", "topic": "framework"},
    ),
    Document(
        page_content="""LangGraph is a library for building stateful, multi-actor
        applications with LLMs. It extends LangChain with graph-based workflows,
        allowing developers to create complex agent architectures with cycles,
        persistence, and human-in-the-loop capabilities.""",
        metadata={"source": "langgraph_overview", "topic": "framework"},
    ),
    Document(
        page_content="""Retrieval-Augmented Generation (RAG) is a technique that
        combines information retrieval with text generation. The system first retrieves
        relevant documents from a knowledge base, then uses those documents as context
        for the LLM to generate accurate, grounded answers.""",
        metadata={"source": "rag_overview", "topic": "technique"},
    ),
    Document(
        page_content="""Vector databases store data as high-dimensional vectors,
        enabling similarity search. Popular options include ChromaDB, Pinecone, Weaviate,
        and FAISS. They are essential for RAG pipelines as they allow efficient
        retrieval of semantically similar documents.""",
        metadata={"source": "vector_db_overview", "topic": "infrastructure"},
    ),
    Document(
        page_content="""LangSmith is a platform for debugging, testing, evaluating,
        and monitoring LLM applications. It provides tracing for every LLM call,
        evaluation datasets, and prompt management. LangSmith integrates natively
        with LangChain and LangGraph.""",
        metadata={"source": "langsmith_overview", "topic": "observability"},
    ),
]

# Step 2: Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
)
splits = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(splits)} chunks")
for i, split in enumerate(splits[:3]):
    print(f"\nChunk {i}: {split.page_content[:80]}...")

Split 5 documents into 10 chunks

Chunk 0: LangChain is a framework for developing applications powered by
        large la...

Chunk 1: and memory. LangChain supports multiple LLM providers including OpenAI, Google,
...

Chunk 2: LangGraph is a library for building stateful, multi-actor
        applications w...
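
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines before falling back to characters). `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last 4 characters of the previous one, which is why a sentence cut at a chunk boundary can still be found intact in a neighboring chunk.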



In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Step 3: Create embeddings and vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    collection_name="langchain_learning",
)

# Step 4: Create a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Test retrieval
results = retriever.invoke("What is RAG?")
print(f"Retrieved {len(results)} documents for 'What is RAG?':\n")
for i, doc in enumerate(results):
    print(f"  {i+1}. [{doc.metadata.get('source', 'unknown')}] {doc.page_content[:100]}...")
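
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are most similar to it. The sketch below illustrates that idea with a toy word-count "embedding" and cosine similarity. It is a stand-in, not what `OpenAIEmbeddings` or Chroma do internally, and `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, corpus: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every document against the query and keep the k best."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


corpus = [
    "LangGraph builds stateful multi-actor applications",
    "RAG combines retrieval with text generation",
    "Vector databases enable similarity search",
]
hits = top_k("What is RAG?", corpus, k=2)
for score, doc in hits:
    print(f"{score:.3f}  {doc}")
```

A real embedding model also matches paraphrases that share no words with the query, which this word-overlap toy cannot do.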

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Step 5: Build the RAG chain
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant. Answer the question based ONLY on the
    following context. If you cannot find the answer in the context, say so.

    Context:
    {context}"""),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# RAG chain with GPT
rag_chain_gpt = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | GPT_MODEL
    | StrOutputParser()
)

# RAG chain with Gemini
rag_chain_gemini = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | GEMINI_MODEL
    | StrOutputParser()
)

# Test the RAG chain
question = "What is LangGraph and how does it relate to LangChain?"
print(f"Question: {question}\n")
print("GPT Answer:")
print(rag_chain_gpt.invoke(question))
print("\n" + "="*60 + "\n")
print("Gemini Answer:")
print(rag_chain_gemini.invoke(question))
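
The LCEL chain above can be read as three plain steps: build the `context`/`question` inputs, fill the prompt, and call the model. The sketch below spells those steps out with stand-in callables so the data flow is visible without any API calls. `fake_retriever`, `echo_llm`, and `rag_answer` are hypothetical names, and the template is a shortened paraphrase of `rag_prompt`.

```python
PROMPT_TEMPLATE = (
    "Answer the question based ONLY on the following context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def format_docs(docs: list[str]) -> str:
    """Join retrieved snippets into one context string."""
    return "\n\n".join(docs)


def fake_retriever(question: str) -> list[str]:
    # Stand-in for retriever.invoke(question); always returns the same snippets.
    return [
        "LangGraph extends LangChain with graph-based workflows.",
        "LangChain is a framework for LLM applications.",
    ]


def echo_llm(prompt: str) -> str:
    # Stand-in for a chat model: returns the prompt so we can inspect what it would see.
    return prompt


def rag_answer(question: str, retriever, llm) -> str:
    # Step 1: the dict at the head of the chain -> {"context": ..., "question": ...}
    inputs = {"context": format_docs(retriever(question)), "question": question}
    # Step 2: the prompt template fills in both values
    prompt = PROMPT_TEMPLATE.format(**inputs)
    # Step 3: the model (plus output parser) turns the prompt into a string answer
    return llm(prompt)


answer = rag_answer("What is LangGraph?", fake_retriever, echo_llm)
print(answer)
```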

In [None]:
# Test with more questions
test_questions = [
    "What is a vector database?",
    "How does LangSmith help with debugging?",
    # Not covered by the sample documents: the model should say it cannot answer
    "What programming languages does LangChain support?",
]

for q in test_questions:
    print(f"Q: {q}")
    answer = rag_chain_gpt.invoke(q)
    print(f"A: {answer}\n")
    print("-" * 60)

## 3. Multi-Agent System with LangGraph

Build a team of specialized agents that collaborate to complete tasks.

### Architecture
```
User Query -> Researcher Agent -> Writer Agent -> Reviewer Agent -> Final Answer
```

In [None]:
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.messages import SystemMessage

# Define multi-agent state
class MultiAgentState(TypedDict):
    messages: Annotated[list, add_messages]
    research_notes: str
    draft: str
    final_answer: str

# Researcher node - gathers information
def researcher_node(state: MultiAgentState):
    messages = [
        SystemMessage(content="""You are a research assistant. Analyze the user's question
        and provide detailed research notes with key facts and sources. Be thorough."""),
    ] + state["messages"]
    response = GPT_MODEL.invoke(messages)
    return {"research_notes": response.content}

# Writer node - creates polished response
def writer_node(state: MultiAgentState):
    messages = [
        SystemMessage(content=f"""You are an expert technical writer. Based on these
        research notes, write a clear and well-structured answer.

        Research Notes:
        {state['research_notes']}"""),
    ] + state["messages"]
    response = GEMINI_MODEL.invoke(messages)
    return {"draft": response.content}

# Reviewer node - quality check
def reviewer_node(state: MultiAgentState):
    messages = [
        SystemMessage(content=f"""You are a quality reviewer. Review this draft for
        accuracy, clarity, and completeness. Provide the final polished version.

        Draft:
        {state['draft']}

        Research Notes (for fact-checking):
        {state['research_notes']}"""),
    ] + state["messages"]
    response = GPT_MODEL.invoke(messages)
    return {"final_answer": response.content}

# Build the multi-agent graph
multi_agent = StateGraph(MultiAgentState)
multi_agent.add_node("researcher", researcher_node)
multi_agent.add_node("writer", writer_node)
multi_agent.add_node("reviewer", reviewer_node)

multi_agent.add_edge(START, "researcher")
multi_agent.add_edge("researcher", "writer")
multi_agent.add_edge("writer", "reviewer")
multi_agent.add_edge("reviewer", END)

multi_agent_graph = multi_agent.compile()

# Run the multi-agent system
result = multi_agent_graph.invoke({
    "messages": [{"role": "user", "content": "Explain how RAG works and when to use it vs fine-tuning."}],
    "research_notes": "",
    "draft": "",
    "final_answer": "",
})

print("=== RESEARCH NOTES ===")
print(result["research_notes"][:300] + "...")
print("\n=== DRAFT ===")
print(result["draft"][:300] + "...")
print("\n=== FINAL ANSWER ===")
print(result["final_answer"])
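
Each node in the graph returns only the keys it changes, and those partial updates are merged into the shared state before the next node runs. The plain-Python sketch below mimics that flow with stub nodes and no LLM calls. `run_pipeline` is a hypothetical helper; real LangGraph applies per-key reducers (such as `add_messages`) rather than a plain `dict.update`.

```python
from typing import Callable

State = dict[str, str]
Node = Callable[[State], State]


def researcher(state: State) -> State:
    # Returns only the key this node is responsible for.
    return {"research_notes": f"notes on: {state['question']}"}


def writer(state: State) -> State:
    return {"draft": f"draft using [{state['research_notes']}]"}


def reviewer(state: State) -> State:
    return {"final_answer": state["draft"].replace("draft", "final")}


def run_pipeline(state: State, nodes: list[Node]) -> State:
    """Run nodes in order, merging each partial update into a copy of the state."""
    state = dict(state)  # avoid mutating the caller's dict
    for node in nodes:
        state.update(node(state))
    return state


result = run_pipeline(
    {"question": "How does RAG work?"},
    [researcher, writer, reviewer],
)
print(result["final_answer"])
```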

## 4. Production Patterns

### 4a. Fallback Chains

If one model fails, automatically fall back to another.

In [None]:
# Fallback: if GPT fails, fall back to Gemini

# Primary chain (GPT)
primary_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("human", "{question}"),
    ])
    | GPT_MODEL
    | StrOutputParser()
)

fallback_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("human", "{question}"),
    ])
    | GEMINI_MODEL
    | StrOutputParser()
)

# Chain with fallback
robust_chain = primary_chain.with_fallbacks([fallback_chain])

result = robust_chain.invoke({"question": "What are the benefits of microservices?"})
print("Result (with fallback protection):")
print(result)
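
The idea behind a fallback chain is a simple try-in-order loop. The sketch below shows that pattern in plain Python with stub "models"; it is an illustration of the concept, not the `with_fallbacks` implementation. `invoke_with_fallbacks`, `flaky_primary`, and `backup` are hypothetical names, and re-raising the first error when every candidate fails is a design choice of this sketch.

```python
def invoke_with_fallbacks(primary, fallbacks, question):
    """Try the primary callable, then each fallback in order; return the first success."""
    first_error = None
    for candidate in (primary, *fallbacks):
        try:
            return candidate(question)
        except Exception as exc:
            if first_error is None:
                first_error = exc  # remember the original failure
    raise first_error  # every candidate failed


def flaky_primary(question: str) -> str:
    # Stand-in for a model call that is currently failing.
    raise TimeoutError("primary model unavailable")


def backup(question: str) -> str:
    # Stand-in for the secondary model.
    return f"backup answer to: {question}"


answer = invoke_with_fallbacks(
    flaky_primary, [backup], "What are the benefits of microservices?"
)
print(answer)
```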

### 4b. Retry with Exponential Backoff

In [None]:
# Retry with backoff for rate-limited APIs
from langchain_openai import ChatOpenAI

# LangChain models support max_retries natively
retry_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_retries=3,  # Automatic retry with exponential backoff
    request_timeout=30,
)

# All invocations will automatically retry on transient failures
result = retry_model.invoke("What is the meaning of life?")
print("With retry protection:", result.content[:200])
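
For intuition, exponential backoff doubles the wait after each failed attempt. The following sketch shows the pattern in plain Python; the exact schedule used by the OpenAI client behind `max_retries` may differ. `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (capped) and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(min(max_delay, base_delay * 2 ** attempt))


calls = {"count": 0}


def flaky():
    # Stand-in for an API call that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record the waits instead of sleeping
result = retry_with_backoff(flaky, max_retries=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Production retry helpers usually add random jitter to each delay so that many clients do not retry at the same instant; that is omitted here to keep the output deterministic.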

### 4c. LangSmith Tracing (Observability)

LangSmith provides full tracing of every LLM call, chain execution, and agent step.

To enable tracing, set these environment variables:
```bash
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=your-key
LANGSMITH_PROJECT=langchain-learning
```

In [None]:
import os

# Check if LangSmith tracing is configured
if os.getenv("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_TRACING"] = "true"
    print("LangSmith tracing is ENABLED")
    print(f"Project: {os.getenv('LANGSMITH_PROJECT', 'default')}")
    print("View traces at: https://smith.langchain.com/")

    # All subsequent LLM calls will be automatically traced
    traced_result = GPT_MODEL.invoke("What is observability in AI systems?")
    print(f"\nTraced response: {traced_result.content[:200]}")
else:
    print("LangSmith tracing is NOT configured.")
    print("Set LANGSMITH_API_KEY in your .env to enable tracing.")
    print("Sign up at: https://smith.langchain.com/")

## 5. Evaluation Pattern: Comparing Model Outputs

A simple pattern for evaluating model quality by comparing outputs from both models.

In [None]:
from pydantic import BaseModel, Field

class EvalResult(BaseModel):
    """Evaluation of an LLM response."""
    relevance: int = Field(description="How relevant is the answer (1-10)")
    accuracy: int = Field(description="How accurate is the answer (1-10)")
    clarity: int = Field(description="How clear is the answer (1-10)")
    reasoning: str = Field(description="Brief reasoning for the scores")

eval_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert evaluator. Score the following AI response
    on relevance, accuracy, and clarity (each 1-10).

    Question: {question}
    Response: {response}"""),
    ("human", "Evaluate this response."),
])

# Generate responses from both models
test_question = "What are the key differences between SQL and NoSQL databases?"

gpt_response = GPT_MODEL.invoke(test_question).content
gemini_response = GEMINI_MODEL.invoke(test_question).content

# Cross-evaluate: GPT evaluates Gemini's response and vice versa
gpt_evaluator = GPT_MODEL.with_structured_output(EvalResult)
gemini_evaluator = GEMINI_MODEL.with_structured_output(EvalResult)

print("=== GPT Response (first 200 chars) ===")
print(gpt_response[:200] + "...\n")

print("=== Gemini Response (first 200 chars) ===")
print(gemini_response[:200] + "...\n")

# GPT evaluates Gemini's response
eval_of_gemini = gpt_evaluator.invoke(
    eval_prompt.invoke({"question": test_question, "response": gemini_response})
)
print("=== GPT's evaluation of Gemini ===")
print(f"  Relevance: {eval_of_gemini.relevance}/10")
print(f"  Accuracy:  {eval_of_gemini.accuracy}/10")
print(f"  Clarity:   {eval_of_gemini.clarity}/10")
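print(f"  Reasoning: {eval_of_gemini.reasoning}")

# Gemini evaluates GPT's response (the other half of the cross-evaluation)
eval_of_gpt = gemini_evaluator.invoke(
    eval_prompt.invoke({"question": test_question, "response": gpt_response})
)
print("\n=== Gemini's evaluation of GPT ===")
print(f"  Relevance: {eval_of_gpt.relevance}/10")
print(f"  Accuracy:  {eval_of_gpt.accuracy}/10")
print(f"  Clarity:   {eval_of_gpt.clarity}/10")
print(f"  Reasoning: {eval_of_gpt.reasoning}")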
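
Structured scores are easier to compare once they are validated and reduced to a single number. The sketch below is one possible way to do that with the standard library only. The `Scores` class and the score values are made up for illustration and are not results from the models above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Scores:
    relevance: int
    accuracy: int
    clarity: int

    def __post_init__(self):
        # Reject scores outside the 1-10 range the evaluator was asked to use.
        for name in ("relevance", "accuracy", "clarity"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def overall(self) -> float:
        """Unweighted mean of the three scores."""
        return mean((self.relevance, self.accuracy, self.clarity))


# Made-up example values, not real evaluation results.
evaluations = {"gpt": Scores(8, 9, 7), "gemini": Scores(9, 8, 9)}
best = max(evaluations, key=lambda name: evaluations[name].overall)
print(best, round(evaluations[best].overall, 2))
```

An unweighted mean treats the three criteria as equally important; a weighted mean is a reasonable alternative when accuracy matters more than clarity.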

## Summary

| Concept | What You Learned |
|---------|------------------|
| RAG Pipeline | Documents -> Split -> Embed -> Store -> Retrieve -> Generate |
| Vector Store | ChromaDB for similarity search |
| Multi-Agent | Researcher -> Writer -> Reviewer graph |
| Fallback Chains | Automatic model fallback on failure |
| Retry Logic | Built-in retry with exponential backoff |
| LangSmith | Observability and tracing setup |
| Evaluation | Cross-model evaluation with structured output |

## Recommended Next Steps

1. **Explore [LangSmith](https://smith.langchain.com/)** - Set up tracing for your applications
2. **Try [LangGraph Cloud](https://langchain-ai.github.io/langgraph/cloud/)** - Deploy your agents
3. **Read [LangChain Academy](https://academy.langchain.com/)** - Structured courses
4. **Join the [Community](https://github.com/langchain-ai/langchain/discussions)** - Get help and share solutions

## Full Learning Path Complete

| Level | Notebook | Key Topics |
|-------|----------|------------|
| Entry | 01 | Chat models, messages, prompts, LCEL, tools, agents |
| Intermediate | 02 | LCEL advanced, StateGraph, conditional edges, tool loops |
| Advanced | 03 | Persistence, interrupts, streaming, subgraphs, reducers |
| Specialty | 04 | RAG, multi-agent, fallbacks, observability, evaluation |