# Retrieval Strategies: RAG, Summaries, and Hybrid Approaches

## Introduction

In this notebook, you'll learn different strategies for retrieving context and providing it to your agent. Not all context belongs in every request: you need retrieval strategies that supply the relevant information efficiently.

### What You'll Learn

- Different retrieval strategies (full context, RAG, summaries, hybrid)
- When to use each strategy
- How to optimize vector search parameters
- How to measure retrieval quality and performance

### Prerequisites

- Completed Section 3 notebooks
- Redis 8 running locally
- Agent Memory Server running
- OpenAI API key set
- Course data ingested

## Concepts: Retrieval Strategies

### The Context Retrieval Problem

You have a large knowledge base (courses, memories, documents), but you can't include everything in every request. You need to:

1. **Find relevant information** - What's related to the user's query?
2. **Limit context size** - Stay within token budgets
3. **Maintain quality** - Don't miss important information
4. **Optimize performance** - Fast retrieval, low latency
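
Requirements 2 and 3 pull against each other, and a token budget makes the trade-off explicit. The sketch below is a minimal illustration, assuming results arrive ranked most-relevant first. `approx_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, `fit_to_budget` is an illustrative helper rather than part of any library, and the course names are made up.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(
    ranked_items: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep items in rank order until adding the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for item in ranked_items:
        cost = count_tokens(item)
        if used + cost > max_tokens:
            break  # stop at the first item that does not fit, preserving rank order
        kept.append(item)
        used += cost
    return kept


# Made-up sample data, ordered most relevant first
ranked = [
    "CS101: Intro to Programming",     # 4 words
    "CS229: Machine Learning",         # 3 words
    "MATH200: Linear Algebra for ML",  # 5 words
]
print(fit_to_budget(ranked, max_tokens=8))
```

With a budget of 8, the first two items (7 words in total) fit and the third is dropped. Stopping at the first item that does not fit, rather than skipping ahead to smaller ones, keeps the result a strict prefix of the ranking; that is one possible design choice, not the only one.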

### Strategy 1: Full Context (Naive)

**Approach:** Include everything in every request

```python
# Include entire course catalog
all_courses = get_all_courses()  # 500 courses
context = "\n".join([str(course) for course in all_courses])
```

**Pros:**
- ✅ Never miss relevant information
- ✅ Simple to implement

**Cons:**
- ❌ Exceeds token limits quickly
- ❌ Expensive (more tokens = higher cost)
- ❌ Slow (more tokens = higher latency)
- ❌ Dilutes relevant information with noise

**Verdict:** ❌ Don't use for production

### Strategy 2: RAG (Retrieval-Augmented Generation)

**Approach:** Retrieve only relevant information using semantic search

```python
# Search for relevant courses
query = "machine learning courses"
relevant_courses = search_courses(query, limit=5)
context = "\n".join([str(course) for course in relevant_courses])
```

**Pros:**
- ✅ Only includes relevant information
- ✅ Stays within token budgets
- ✅ Fast and cost-effective
- ✅ Semantic search finds related content

**Cons:**
- ⚠️ May miss relevant information if search isn't perfect
- ⚠️ Requires good embeddings and search tuning

**Verdict:** ✅ Good for most use cases
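
To make "semantic search" concrete, the sketch below ranks items by cosine similarity between a query vector and stored vectors. It is a toy under stated assumptions: the 3-dimensional "embeddings" are made up by hand, `cosine_similarity` and `top_k` are illustrative helpers, and the actual `search_courses` implementation is not shown in this notebook and may work differently.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real embeddings have far more dimensions
toy_index = {
    "Machine Learning": [0.9, 0.1, 0.0],
    "Deep Learning": [0.8, 0.2, 0.1],
    "Medieval History": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # pretend this embeds "machine learning courses"

for name, score in top_k(query_vec, toy_index, k=2):
    print(f"{name}: {score:.3f}")
```

The two learning-related entries rank above the history entry because their vectors point in nearly the same direction as the query.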

### Strategy 3: Summaries

**Approach:** Pre-compute summaries of large datasets

```python
# Use pre-computed course catalog summary
summary = get_course_catalog_summary()  # "CS: 50 courses, MATH: 30 courses..."
context = summary
```

**Pros:**
- ✅ Very compact (low token usage)
- ✅ Fast (no search needed)
- ✅ Provides high-level overview

**Cons:**
- ❌ Loses details
- ❌ May not have specific information needed
- ⚠️ Requires pre-computation

**Verdict:** ✅ Good for overviews; combine with RAG for details

### Strategy 4: Hybrid (Best)

**Approach:** Combine summaries + targeted retrieval

```python
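# Start with the summary for an overview
summary = get_course_catalog_summary()

# Add the courses most relevant to the user's query
query = "machine learning courses"
relevant_courses = search_courses(query, limit=3)
courses = "\n".join(str(course) for course in relevant_courses)

context = f"{summary}\n\nRelevant courses:\n{courses}"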
```

**Pros:**
- ✅ Best of both worlds
- ✅ Overview + specific details
- ✅ Efficient token usage
- ✅ High quality results

**Cons:**
- ⚠️ More complex to implement
- ⚠️ Requires pre-computed summaries

**Verdict:** ✅ Best for production systems
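
One detail worth deciding early is what the hybrid context looks like when the search returns nothing. The sketch below is one possible assembly step, assuming the summary and the retrieved course descriptions are already available as plain strings. `build_hybrid_context` is a hypothetical helper and the sample data is made up.

```python
from typing import List


def build_hybrid_context(summary: str, course_snippets: List[str]) -> str:
    """Combine a catalog overview with details for the retrieved courses."""
    if not course_snippets:
        # Nothing retrieved: fall back to the overview alone
        return summary
    details = "\n".join(f"- {snippet}" for snippet in course_snippets)
    return f"{summary}\n\nRelevant courses:\n{details}"


# Made-up sample data for illustration
summary = "CS: 50 courses, MATH: 30 courses"
snippets = ["CS229: Machine Learning", "CS231: Computer Vision"]
print(build_hybrid_context(summary, snippets))
```

Falling back to the summary alone means the agent still has the overview to answer from, instead of seeing an empty "Relevant courses" heading.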

## Setup

In [None]:
import os
import time
import asyncio
from typing import List
import tiktoken
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
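# MemoryClientConfig is assumed to be exported alongside MemoryClient
from redis_context_course import CourseManager, MemoryClient, MemoryClientConfig

# Initialize the course manager
course_manager = CourseManager()

# Initialize the memory client, pointing it at the Agent Memory Server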
config = MemoryClientConfig(
    base_url=os.getenv("AGENT_MEMORY_URL", "http://localhost:8000"),
    default_namespace="redis_university"
)
memory_client = MemoryClient(config=config)
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
tokenizer = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

print("✅ Setup complete")

## Hands-on: Comparing Retrieval Strategies

### Strategy 1: Full Context (Bad)

Let's try including all courses and see what happens.

In [None]:
print("=" * 80)
print("STRATEGY 1: FULL CONTEXT (Naive)")
print("=" * 80)

# Get all courses
all_courses = await course_manager.get_all_courses()
print(f"\nTotal courses in catalog: {len(all_courses)}")

# Build full context from a sample (limited to 50 courses for the demo)
sample = all_courses[:50]
full_context = "\n\n".join([
    f"{c.course_code}: {c.title}\n{c.description}\nCredits: {c.credits} | {c.format.value}"
    for c in sample
])

tokens = count_tokens(full_context)
print(f"\nTokens for {len(sample)} courses: {tokens:,}")
# Scale by the actual sample size so the estimate stays correct for catalogs under 50 courses
print(f"Estimated tokens for all {len(all_courses)} courses: {(tokens * len(all_courses) / max(len(sample), 1)):,.0f}")

# Try to use it (only the first 2,000 characters are sent, to keep the demo cheap)
user_query = "I'm interested in machine learning courses"
system_prompt = f"""You are a class scheduling agent.

Available courses:
{full_context[:2000]}...
"""

start_time = time.time()
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_query)
]
response = llm.invoke(messages)
latency = time.time() - start_time

print(f"\nQuery: {user_query}")
print(f"Response: {response.content[:200]}...")
print(f"\nLatency: {latency:.2f}s")
print(f"Total tokens used: ~{count_tokens(system_prompt) + count_tokens(user_query):,}")

print("\n❌ PROBLEMS:")
print("  - Too many tokens (expensive)")
print("  - High latency")
print("  - Relevant info buried in noise")
print("  - Doesn't scale to full catalog")

### Strategy 2: RAG with Semantic Search (Good)

Now let's use semantic search to retrieve only relevant courses.

In [None]:
print("\n" + "=" * 80)
print("STRATEGY 2: RAG (Semantic Search)")
print("=" * 80)

user_query = "I'm interested in machine learning courses"

# Search for relevant courses
start_time = time.time()
relevant_courses = await course_manager.search_courses(
    query=user_query,
    limit=5
)
search_time = time.time() - start_time

print(f"\nSearch time: {search_time:.3f}s")
print(f"Courses found: {len(relevant_courses)}")

# Build context from relevant courses only
rag_context = "\n\n".join([
    f"{c.course_code}: {c.title}\n{c.description}\nCredits: {c.credits} | {c.format.value}"
    for c in relevant_courses
])

tokens = count_tokens(rag_context)
print(f"Context tokens: {tokens:,}")

# Use it
system_prompt = f"""You are a class scheduling agent.

Relevant courses:
{rag_context}
"""

start_time = time.time()
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_query)
]
response = llm.invoke(messages)
latency = time.time() - start_time

print(f"\nQuery: {user_query}")
print(f"Response: {response.content[:200]}...")
print(f"\nTotal latency: {latency:.2f}s")
print(f"Total tokens used: ~{count_tokens(system_prompt) + count_tokens(user_query):,}")

print("\n✅ BENEFITS:")
print("  - Far fewer tokens (cheaper)")
print("  - Lower latency")
print("  - Only relevant information")
print("  - Scales to any catalog size")

### Strategy 3: Pre-computed Summary

Let's create a summary of the course catalog.

In [None]:
print("\n" + "=" * 80)
print("STRATEGY 3: PRE-COMPUTED SUMMARY")
print("=" * 80)

# Create a summary (in production, this would be pre-computed)
all_courses = await course_manager.get_all_courses()

# Group by department
by_department = {}
for course in all_courses:
    dept = course.department
    if dept not in by_department:
        by_department[dept] = []
    by_department[dept].append(course)

# Create summary
summary_lines = ["Course Catalog Summary:\n"]
for dept, courses in sorted(by_department.items()):
    summary_lines.append(f"{dept}: {len(courses)} courses")
    # Add a few example courses
    examples = [f"{c.course_code} ({c.title})" for c in courses[:2]]
    summary_lines.append(f"  Examples: {', '.join(examples)}")

summary = "\n".join(summary_lines)

print(f"\nSummary:\n{summary}")
print(f"\nSummary tokens: {count_tokens(summary):,}")

# Use it
user_query = "What departments offer courses?"
system_prompt = f"""You are a class scheduling agent.

{summary}
"""

start_time = time.time()
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_query)
]
response = llm.invoke(messages)
latency = time.time() - start_time

print(f"\nQuery: {user_query}")
print(f"Response: {response.content}")
print(f"\nLatency: {latency:.2f}s")

print("\n✅ BENEFITS:")
print("  - Very compact (minimal tokens)")
print("  - Fast (no search needed)")
print("  - Good for overview questions")

print("\n⚠️  LIMITATIONS:")
print("  - Lacks specific details")
print("  - Can't answer detailed questions")

### Strategy 4: Hybrid (Best)

Combine summary + targeted retrieval for the best results.

In [None]:
print("\n" + "=" * 80)
print("STRATEGY 4: HYBRID (Summary + RAG)")
print("=" * 80)

user_query = "I'm interested in machine learning. What's available?"

# Start with summary
summary_context = summary

# Add targeted retrieval
relevant_courses = await course_manager.search_courses(
    query=user_query,
    limit=3
)

detailed_context = "\n\n".join([
    f"{c.course_code}: {c.title}\n{c.description}\nCredits: {c.credits} | {c.format.value}"
    for c in relevant_courses
])

# Combine
hybrid_context = f"""{summary_context}

Relevant courses for your query:
{detailed_context}
"""

tokens = count_tokens(hybrid_context)
print(f"\nHybrid context tokens: {tokens:,}")

# Use it
system_prompt = f"""You are a class scheduling agent.

{hybrid_context}
"""

start_time = time.time()
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_query)
]
response = llm.invoke(messages)
latency = time.time() - start_time

print(f"\nQuery: {user_query}")
print(f"Response: {response.content}")
print(f"\nLatency: {latency:.2f}s")
print(f"Total tokens: ~{count_tokens(system_prompt) + count_tokens(user_query):,}")

print("\n✅ BENEFITS:")
print("  - Overview + specific details")
print("  - Efficient token usage")
print("  - High quality responses")
print("  - Best of all strategies")

## Optimizing Vector Search Parameters

Let's explore how to tune semantic search for better results.

In [None]:
print("\n" + "=" * 80)
print("OPTIMIZING SEARCH PARAMETERS")
print("=" * 80)

user_query = "beginner programming courses"

# Test different limits
print(f"\nQuery: '{user_query}'\n")

for limit in [3, 5, 10]:
    results = await course_manager.search_courses(
        query=user_query,
        limit=limit
    )
    
    print(f"Limit={limit}: Found {len(results)} courses")
    for i, course in enumerate(results, 1):
        print(f"  {i}. {course.course_code}: {course.title}")
    print()

print("💡 TIP: Start with limit=5, adjust based on your needs")
print("  - Too few: May miss relevant results")
print("  - Too many: Wastes tokens, adds noise")
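
Besides a fixed limit, another common knob is a minimum relevance score, so that weak matches are dropped even when the limit has room for them. The sketch below assumes each result carries a similarity score where higher is better. Whether `search_courses` exposes scores or a threshold parameter is not shown in this notebook, so `filter_results` is a hypothetical post-processing helper and the scores are made up.

```python
from typing import List, Tuple


def filter_results(
    scored_results: List[Tuple[str, float]],
    limit: int,
    min_score: float,
) -> List[str]:
    """Return up to `limit` result names whose score is at least `min_score`."""
    ranked = sorted(scored_results, key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked[:limit] if score >= min_score]


# Made-up scores for illustration
results = [
    ("CS101: Intro to Programming", 0.91),
    ("CS102: Programming Fundamentals", 0.84),
    ("ART210: Digital Illustration", 0.42),
]
print(filter_results(results, limit=5, min_score=0.6))
```

Here the limit would allow all three results, but the threshold removes the weak third match, which saves tokens without lowering the limit for queries that do have many strong matches.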

## Performance Comparison

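Let's compare all strategies side-by-side. The figures in this table are rough, illustrative estimates, not measurements taken from the cells above.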

In [None]:
print("\n" + "=" * 80)
print("STRATEGY COMPARISON")
print("=" * 80)

print(f"\n{'Strategy':<20} {'Tokens':<10} {'Latency':<10} {'Quality':<10} {'Scalability'}")
print("-" * 70)
print(f"{'Full Context':<20} {'50,000+':<10} {'High':<10} {'Good':<10} {'Poor'}")
print(f"{'RAG (Semantic)':<20} {'500-2K':<10} {'Low':<10} {'Good':<10} {'Excellent'}")
print(f"{'Summary Only':<20} {'100-500':<10} {'Very Low':<10} {'Limited':<10} {'Excellent'}")
print(f"{'Hybrid':<20} {'1K-3K':<10} {'Low':<10} {'Excellent':<10} {'Excellent'}")

print("\n✅ RECOMMENDATION: Use Hybrid strategy for production")
print("  - Provides overview + specific details")
print("  - Efficient token usage")
print("  - Scales to any dataset size")
print("  - High quality results")
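
To collect measured numbers for a comparison like this one, each strategy can be wrapped in a small timing helper. The sketch below is a minimal harness under stated assumptions: `measure` is a hypothetical helper, `approx_tokens` is a whitespace-based stand-in for a real tokenizer, and the two "strategies" are stubs returning made-up text rather than calls into the course catalog.

```python
import time
from typing import Callable, Dict


def approx_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def measure(build_context: Callable[[], str], count_tokens: Callable[[str], int]) -> Dict[str, float]:
    """Time one context-building call and count the tokens it produces."""
    start = time.perf_counter()
    context = build_context()
    elapsed = time.perf_counter() - start
    return {"tokens": count_tokens(context), "seconds": elapsed}


# Stub strategies returning made-up contexts
strategies = {
    "Summary Only": lambda: "CS: 50 courses, MATH: 30 courses",
    "RAG": lambda: "CS229: Machine Learning\nCS231: Computer Vision",
}

print(f"{'Strategy':<15} {'Tokens':<8} {'Seconds'}")
for name, build in strategies.items():
    stats = measure(build, approx_tokens)
    print(f"{name:<15} {stats['tokens']:<8} {stats['seconds']:.6f}")
```

A single timing is noisy, so in practice it is worth repeating each measurement several times and comparing medians.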

## Key Takeaways

### Choosing a Retrieval Strategy

**Use RAG when:**
- ✅ You need specific, detailed information
- ✅ Dataset is large
- ✅ Queries are specific

**Use Summaries when:**
- ✅ You need high-level overviews
- ✅ Queries are general
- ✅ Token budget is tight

**Use Hybrid when:**
- ✅ You want the best quality
- ✅ You can pre-compute summaries
- ✅ Building production systems

### Optimization Tips

1. **Start with RAG** - Simple and effective
2. **Add summaries** - For overview context
3. **Tune search limits** - Balance relevance vs. tokens
4. **Pre-compute summaries** - Don't generate on every request
5. **Monitor performance** - Track tokens, latency, quality
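
Tip 4 can be as simple as caching the summary and refreshing it on a schedule. The sketch below is an in-process illustration, assuming a time-to-live (TTL) is an acceptable freshness policy. `CachedSummary` and `expensive_summary` are hypothetical names, and a shared store would be needed if several processes serve requests.

```python
import time
from typing import Callable, Optional


class CachedSummary:
    """Recompute a summary only when the cached copy is older than `ttl_seconds`."""

    def __init__(
        self,
        compute: Callable[[], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._value: Optional[str] = None
        self._computed_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now - self._computed_at >= self._ttl:
            self._value = self._compute()
            self._computed_at = now
        return self._value


calls = 0


def expensive_summary() -> str:
    """Stand-in for a costly summary computation; counts how often it runs."""
    global calls
    calls += 1
    return "CS: 50 courses, MATH: 30 courses"


cache = CachedSummary(expensive_summary, ttl_seconds=3600)
cache.get()
cache.get()
print(f"Summary computed {calls} time(s)")
```

The second `get()` is served from the cache, so the summary is computed once for both requests.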

### Vector Search Best Practices

- ✅ Use semantic search for finding relevant content
- ✅ Start with limit=5, adjust as needed
- ✅ Use filters when you have structured criteria
- ✅ Test with real user queries
- ✅ Monitor search quality over time
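
Monitoring search quality needs a number to track. Two common choices are precision@k (what fraction of the top k results are relevant) and recall@k (what fraction of all relevant items appear in the top k). The sketch below assumes you have hand-labeled which courses are relevant for a test query. `precision_recall_at_k` is an illustrative helper and the course codes and labels are made up.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Compute precision@k and recall@k for one query."""
    top = retrieved[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up search output (best first) and hand-labeled relevant set
retrieved = ["CS229", "CS231", "ART210", "CS230", "HIST101"]
relevant = {"CS229", "CS230", "CS224"}

for k in (3, 5):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

In this example, raising k from 3 to 5 improves both metrics because one more relevant course appears at rank 4. With real data, a larger k often raises recall while lowering precision, which is the same trade-off as choosing the search limit.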

## Exercises

1. **Implement hybrid retrieval**: Create a function that combines summary + RAG for any query.

2. **Measure quality**: Test each strategy with 10 different queries. Which gives the best responses?

3. **Optimize search**: Experiment with different search limits. What's the sweet spot for your use case?

4. **Create summaries**: Build pre-computed summaries for different views (by department, by difficulty, by format).

## Summary

In this notebook, you learned:

- ✅ Different retrieval strategies have different trade-offs
- ✅ RAG (semantic search) is efficient and scalable
- ✅ Summaries provide compact overviews
- ✅ Hybrid approach combines the best of both
- ✅ Proper retrieval is key to production-quality agents

**Key insight:** Don't include everything - retrieve smartly. The hybrid strategy (summaries + targeted RAG) provides the best balance of quality, efficiency, and scalability.