# Context Window Management: Handling Token Limits

## Introduction

In this notebook, you'll learn about context window limits and how to manage them effectively. Every LLM has a maximum number of tokens it can process, and long conversations can exceed this limit. The Agent Memory Server provides automatic summarization to handle this.

### What You'll Learn

- What context windows are and why they matter
- How to count tokens in conversations
- Why summarization is necessary
- How to configure Agent Memory Server summarization
- How summarization works in practice

### Prerequisites

- Completed Section 3 notebooks
- Redis 8 running locally
- Agent Memory Server running
- OpenAI API key set

## Concepts: Context Windows and Token Limits

### What is a Context Window?

A **context window** is the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes:

- System instructions
- Conversation history
- Retrieved context (memories, documents)
- User's current message
- Space for the response

### Common Context Window Sizes

| Model | Context Window | Approx. English Words |
|-------|----------------|-------|
| GPT-4o | 128K tokens | ~96,000 words |
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| GPT-3.5 Turbo | 16K tokens | ~12,000 words |
| Claude 3 Opus | 200K tokens | ~150,000 words |

### The Problem: Long Conversations

As conversations grow, they consume more tokens:

```
Turn 1:   System (500) + Messages (200)    =    700 tokens ✅
Turn 5:   System (500) + Messages (1,000)  =  1,500 tokens ✅
Turn 20:  System (500) + Messages (4,000)  =  4,500 tokens ✅
Turn 50:  System (500) + Messages (10,000) = 10,500 tokens ✅
Turn 100: System (500) + Messages (20,000) = 20,500 tokens ⚠️
Turn 200: System (500) + Messages (40,000) = 40,500 tokens ⚠️
```

At roughly 200 tokens per turn, a 16K-token window is exhausted before turn 100, and even a 128K-token window fills up eventually.
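
To make the growth above concrete, the following sketch estimates the first turn at which a conversation no longer fits in a given window. It assumes the same illustrative figures as the listing (a 500-token system prompt and 200 tokens added per turn) and ignores space reserved for the response. The function name `first_turn_over_limit` is hypothetical.

```python
def first_turn_over_limit(limit: int, system_tokens: int = 500,
                          tokens_per_turn: int = 200) -> int:
    """Return the first turn whose total token count exceeds `limit`."""
    # Total after turn t is: system_tokens + t * tokens_per_turn
    remaining = limit - system_tokens
    if remaining < 0:
        return 0  # the system prompt alone does not fit
    return remaining // tokens_per_turn + 1


for name, limit in [("16K window", 16_000), ("128K window", 128_000)]:
    print(f"{name}: limit exceeded at turn {first_turn_over_limit(limit)}")
```

Real conversations vary in message length, so treat the result as an order-of-magnitude estimate rather than a prediction.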

### Why Summarization is Necessary

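Without summarization:
- ❌ Conversations eventually fail when the limit is hit
- ❌ Per-request cost grows with conversation length, because the full history is resent on every turn
- ❌ Latency increases with more tokens
- ❌ Important early context gets lost if old messages are simply truncated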

With summarization:
- ✅ Conversations can continue indefinitely
- ✅ Costs stay manageable
- ✅ Latency stays consistent
- ✅ Important context is preserved in summaries
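
The cost claim can be checked with simple arithmetic. The sketch below compares the total input tokens sent over a whole conversation when the full history is resent each turn against a history capped at a fixed size. The figures are assumptions for illustration: 500 system tokens, 200 tokens per turn, and a 4,000-token cap standing in for "summary plus recent messages". The function name `total_input_tokens` is hypothetical, and the cost of the summarization calls themselves is not modeled.

```python
from typing import Optional


def total_input_tokens(turns: int, system_tokens: int = 500,
                       tokens_per_turn: int = 200,
                       history_cap: Optional[int] = None) -> int:
    """Sum the input tokens sent across every request in a conversation."""
    total = 0
    for turn in range(1, turns + 1):
        history = turn * tokens_per_turn  # history grows every turn
        if history_cap is not None:
            history = min(history, history_cap)  # summarization bounds it
        total += system_tokens + history
    return total


full = total_input_tokens(100)
capped = total_input_tokens(100, history_cap=4_000)
print(f"Full history: {full:,} input tokens")
print(f"Capped at 4K: {capped:,} input tokens")
```

With an uncapped history the cumulative total grows roughly with the square of the number of turns, which is why long conversations become expensive faster than the per-request numbers suggest.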

### How Agent Memory Server Handles This

The Agent Memory Server automatically:
1. **Monitors the message count** in working memory
2. **Triggers summarization** when the threshold is reached
3. **Creates a summary** of older messages
4. **Replaces old messages** with the summary
5. **Keeps recent messages** for context
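
The five steps above can be sketched in plain Python. This illustrates the general pattern only and is not the Agent Memory Server's implementation: `maybe_summarize`, `threshold`, `keep_recent`, and the stub summarizer are all hypothetical names, and a real summarizer would call an LLM instead of returning a placeholder string.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def maybe_summarize(messages: list[Message],
                    summarize: Callable[[list[Message]], str],
                    threshold: int = 20,
                    keep_recent: int = 4) -> list[Message]:
    """Replace older messages with a summary once the threshold is exceeded."""
    # Steps 1-2: monitor the message count and check the threshold
    if len(messages) <= threshold:
        return messages
    # Step 3: summarize everything except the most recent messages
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = {"role": "system", "content": summarize(older)}
    # Steps 4-5: the summary replaces the older messages; recent ones stay
    return [summary] + recent


def stub_summarize(older: list[Message]) -> str:
    """Placeholder summarizer (assumption: a real one would call an LLM)."""
    return f"Summary of {len(older)} earlier messages."


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(22)
]
compacted = maybe_summarize(history, stub_summarize)
print(f"{len(history)} messages -> {len(compacted)} messages")
```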

### Token Budgets

A **token budget** is how you allocate your context window:

```
Total: 128K tokens
├─ System instructions: 1K tokens
├─ Working memory: 8K tokens
├─ Long-term memories: 2K tokens
├─ Retrieved context: 4K tokens
├─ User message: 500 tokens
└─ Response space: 2K tokens
    ────────────────────────────
    Used: 17.5K / 128K (13.7%)
```
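
The same budget can be expressed as a small calculation, which makes it easy to adjust the allocations and see the remaining headroom. The numbers are the ones from the diagram above; the variable names are illustrative.

```python
CONTEXT_WINDOW = 128_000  # tokens

budget = {
    "System instructions": 1_000,
    "Working memory": 8_000,
    "Long-term memories": 2_000,
    "Retrieved context": 4_000,
    "User message": 500,
    "Response space": 2_000,
}

used = sum(budget.values())
headroom = CONTEXT_WINDOW - used

for name, tokens in budget.items():
    print(f"{name:<20} {tokens:>6,} tokens")
print(f"Used: {used:,} / {CONTEXT_WINDOW:,} ({used / CONTEXT_WINDOW:.1%})")
print(f"Headroom: {headroom:,} tokens")
```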

## Setup

In [None]:
import os
import asyncio
import tiktoken
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from redis_context_course import MemoryClient, MemoryClientConfig

# Identifiers for this demo
student_id = "student_context_demo"
session_id = "long_conversation"

# Initialize memory client with proper config
config = MemoryClientConfig(
    base_url=os.getenv("AGENT_MEMORY_URL", "http://localhost:8000"),
    default_namespace="redis_university"
)
memory_client = MemoryClient(config=config)

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

# Initialize tokenizer for counting
tokenizer = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    """Count tokens in text."""
    return len(tokenizer.encode(text))

print(f"✅ Setup complete for {student_id}")

## Hands-on: Understanding Token Counts

### Example 1: Counting Tokens in Messages

In [None]:
# Ensure count_tokens is defined (in case cells are run out of order)
if "count_tokens" not in globals():
    import tiktoken
    tokenizer = tiktoken.encoding_for_model("gpt-4o")
    def count_tokens(text: str) -> int:
        return len(tokenizer.encode(text))

# Example messages
messages = [
    "Hi, I'm interested in machine learning courses.",
    "Can you recommend some courses for beginners?",
    "What are the prerequisites for CS401?",
    "I've completed CS101 and CS201. Can I take CS401?",
    "Great! When is CS401 offered?"
]

print("Token counts for individual messages:\n")
total_tokens = 0
for i, msg in enumerate(messages, 1):
    tokens = count_tokens(msg)
    total_tokens += tokens
    print(f"{i}. \"{msg}\"")
    print(f"   Tokens: {tokens}\n")

print(f"Total tokens for 5 messages: {total_tokens}")
print(f"Average tokens per message: {total_tokens / len(messages):.1f}")
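
The totals above count only message content. Chat APIs also add a few formatting tokens per message, so a request is slightly larger than the sum of its contents. The sketch below estimates that overhead. The constants (3 tokens per message and 3 tokens to prime the reply) are assumptions taken from OpenAI's published counting guidance for earlier chat models and may differ for other models. `count_chat_tokens` is a hypothetical helper that accepts any counting function, so in this notebook you could pass `count_tokens`; a word counter stands in here so the example runs without a tokenizer.

```python
from typing import Callable


def count_chat_tokens(messages: list[dict[str, str]],
                      count_fn: Callable[[str], int],
                      per_message_overhead: int = 3,
                      reply_priming: int = 3) -> int:
    """Estimate request size: content plus per-message formatting overhead."""
    total = reply_priming  # assumed tokens that prime the assistant's reply
    for message in messages:
        total += per_message_overhead  # assumed formatting cost per message
        total += count_fn(message["role"]) + count_fn(message["content"])
    return total


def word_count(text: str) -> int:
    """Stand-in counter; replace with a real tokenizer for accurate numbers."""
    return len(text.split())


chat = [
    {"role": "system", "content": "You are a helpful class scheduling agent."},
    {"role": "user", "content": "What are the prerequisites for CS401?"},
]
print(f"Estimated request size: {count_chat_tokens(chat, word_count)}")
```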

### Example 2: Token Growth Over Conversation

In [None]:
# Ensure count_tokens is defined (in case cells are run out of order)
if "count_tokens" not in globals():
    import tiktoken
    tokenizer = tiktoken.encoding_for_model("gpt-4o")
    def count_tokens(text: str) -> int:
        return len(tokenizer.encode(text))

# Simulate conversation growth
system_prompt = """You are a helpful class scheduling agent for Redis University.
Help students find courses and plan their schedule."""

system_tokens = count_tokens(system_prompt)
print(f"System prompt tokens: {system_tokens}\n")

# Simulate growing conversation
conversation_tokens = 0
avg_message_tokens = 50  # Typical message size

print("Token growth over conversation turns:\n")
print(f"{'Turn':<6} {'Messages':<10} {'Conv Tokens':<12} {'Total Tokens':<12} {'% of 128K'}")
print("-" * 60)

for turn in [1, 5, 10, 20, 50, 100, 200, 500, 1000]:
    # Each turn = user message + assistant message
    conversation_tokens = turn * 2 * avg_message_tokens
    total_tokens = system_tokens + conversation_tokens
    percentage = (total_tokens / 128000) * 100
    
    print(f"{turn:<6} {turn*2:<10} {conversation_tokens:<12,} {total_tokens:<12,} {percentage:>6.1f}%")

print("\n⚠️  Without summarization, long conversations will eventually exceed limits!")

## Configuring Summarization

The Agent Memory Server provides automatic summarization. Let's see how to configure it.

### Understanding Summarization Settings

The Agent Memory Server uses these settings:

**Message Count Threshold:**
- When working memory exceeds this many messages, summarization triggers
- Default: 20 messages (10 turns)
- Configurable per session

**Summarization Strategy:**
- **Recent + Summary**: Keep recent N messages, summarize older ones
- **Sliding Window**: Keep only recent N messages
- **Full Summary**: Summarize everything
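
To compare the three strategies side by side, here is a minimal sketch that applies each one to the same message list. The strategy names follow the list above, but the function `apply_strategy`, its string identifiers, and the stub summarizer are hypothetical and do not reflect the Agent Memory Server's configuration API.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def apply_strategy(messages: list[Message], strategy: str,
                   summarize: Callable[[list[Message]], str],
                   keep_recent: int = 4) -> list[Message]:
    """Return the messages that would remain under the given strategy."""
    split = max(len(messages) - keep_recent, 0)
    older, recent = messages[:split], messages[split:]
    if strategy == "sliding_window":
        return recent  # older messages are dropped entirely
    if strategy == "recent_plus_summary":
        if not older:
            return recent  # nothing old enough to summarize
        return [{"role": "system", "content": summarize(older)}] + recent
    if strategy == "full_summary":
        return [{"role": "system", "content": summarize(messages)}]
    raise ValueError(f"Unknown strategy: {strategy}")


def stub_summarize(msgs: list[Message]) -> str:
    """Placeholder summarizer (assumption: a real one would call an LLM)."""
    return f"Summary of {len(msgs)} messages."


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
for name in ("recent_plus_summary", "sliding_window", "full_summary"):
    result = apply_strategy(history, name, stub_summarize)
    print(f"{name:<20} keeps {len(result)} messages")
```

The trade-off is visible in what each strategy discards: the sliding window loses older content outright, the full summary loses the exact wording of recent turns, and recent plus summary keeps both at the cost of one extra message.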

**What Gets Summarized:**
- Older conversation messages
- Key facts and decisions
- Important context

**What Stays:**
- Recent messages (for immediate context)
- System instructions
- Long-term memories (separate from working memory)

### Example 3: Demonstrating Summarization

Let's create a conversation that triggers summarization.

In [None]:
# Helper function for conversation
async def have_conversation_turn(user_message: str, session_id: str) -> tuple[str, int]:
    """Run one conversation turn and return (response text, message count)."""
    # Get working memory
    working_memory = await memory_client.get_or_create_working_memory(
        session_id=session_id,
        model_name="gpt-4o"
    )
    
    # Build messages
    messages = [SystemMessage(content="You are a helpful class scheduling agent.")]
    
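    if working_memory and working_memory.messages:
        for msg in working_memory.messages:
            if msg.role == "user":
                messages.append(HumanMessage(content=msg.content))
            elif msg.role == "assistant":
                messages.append(AIMessage(content=msg.content))
            elif msg.role == "system":
                # Assumption: summaries are stored as system-role messages
                # (as Example 4 checks), so pass them to the LLM as context
                messages.append(SystemMessage(content=msg.content))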
    
    messages.append(HumanMessage(content=user_message))
    
    # Get response (async call, so the event loop is not blocked)
    response = await llm.ainvoke(messages)
    
    # Save to working memory
    all_messages = []
    if working_memory and working_memory.messages:
        all_messages = [{"role": m.role, "content": m.content} for m in working_memory.messages]
    
    all_messages.extend([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": response.content}
    ])
    
    await memory_client.save_working_memory(
        session_id=session_id,
        messages=all_messages
    )
    
    return response.content, len(all_messages)

print("✅ Helper function defined")

In [None]:
# Have a multi-turn conversation
print("=" * 80)
print("DEMONSTRATING SUMMARIZATION")
print("=" * 80)

conversation_queries = [
    "Hi, I'm a computer science major interested in AI.",
    "What machine learning courses do you offer?",
    "Tell me about CS401.",
    "What are the prerequisites?",
    "I've completed CS101 and CS201.",
    "Can I take CS401 next semester?",
    "When is it offered?",
    "Is it available online?",
    "What about CS402?",
    "Can I take both CS401 and CS402?",
    "What's the workload like?",
    "Are there any projects?",
]

for i, query in enumerate(conversation_queries, 1):
    print(f"\nTurn {i}:")
    print(f"User: {query}")
    
    response, message_count = await have_conversation_turn(query, session_id)
    
    print(f"Agent: {response[:100]}...")
    print(f"Total messages in working memory: {message_count}")
    
    if message_count > 20:
        print("⚠️  Message count exceeds threshold - summarization may trigger")
    
    await asyncio.sleep(0.5)  # Rate limiting

print("\n" + "=" * 80)
print("✅ Conversation complete")

### Example 4: Checking Working Memory After Summarization

In [None]:
# Check working memory state
print("\nChecking working memory state...\n")

working_memory = await memory_client.get_or_create_working_memory(
    session_id=session_id,
    model_name="gpt-4o"
)

if working_memory:
    print(f"Total messages: {len(working_memory.messages)}")
    print(f"\nMessage breakdown:")
    
    user_msgs = [m for m in working_memory.messages if m.role == "user"]
    assistant_msgs = [m for m in working_memory.messages if m.role == "assistant"]
    system_msgs = [m for m in working_memory.messages if m.role == "system"]
    
    print(f"  User messages: {len(user_msgs)}")
    print(f"  Assistant messages: {len(assistant_msgs)}")
    print(f"  System messages (summaries): {len(system_msgs)}")
    
    # Check for summary messages
    if system_msgs:
        print("\n✅ Summarization occurred! Summary messages found:")
        for msg in system_msgs:
            print(f"\n  Summary: {msg.content[:200]}...")
    else:
        print("\n⏳ No summarization yet (may need more messages or time)")
else:
    print("No working memory found")

## Key Takeaways

### Context Window Management Strategy

1. **Monitor token usage** - Know your limits
2. **Set message thresholds** - Trigger summarization before hitting limits
3. **Keep recent context** - Don't summarize everything
4. **Use long-term memory** - Important facts go there, not working memory
5. **Trust automatic summarization** - Agent Memory Server handles it

### Token Budget Best Practices

**Allocate wisely:**
- System instructions: 1-2K tokens
- Working memory: 4-8K tokens
- Long-term memories: 2-4K tokens
- Retrieved context: 2-4K tokens
- Response space: 2-4K tokens

**Total: ~11-22K tokens (leaves plenty of headroom in a 128K window)**
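
One way to enforce a per-section allocation is to add items until that section's budget is spent. The sketch below does this for a list of retrieved memories. It assumes the items are already ordered from most to least relevant. `fit_to_budget` is a hypothetical helper, and a word counter stands in for a tokenizer so the example is self-contained; in this notebook you could pass `count_tokens` instead.

```python
from typing import Callable


def fit_to_budget(items: list[str], budget: int,
                  count_fn: Callable[[str], int]) -> list[str]:
    """Keep items in order until adding the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for item in items:
        cost = count_fn(item)
        if used + cost > budget:
            break  # stop at the first item that does not fit
        kept.append(item)
        used += cost
    return kept


memories = [
    "Student prefers online courses.",
    "Student completed CS101 and CS201.",
    "Student is interested in machine learning and AI.",
]
selected = fit_to_budget(memories, budget=10,
                         count_fn=lambda text: len(text.split()))
print(f"Kept {len(selected)} of {len(memories)} memories")
```

Stopping at the first item that does not fit preserves the relevance order; skipping it and trying smaller items instead would pack the budget more tightly but could surface less relevant content.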

### When Summarization Happens

The Agent Memory Server triggers summarization when:
- ✅ Message count exceeds threshold (default: 20)
- ✅ Token count approaches limits
- ✅ Configured summarization strategy activates

### What Summarization Preserves

✅ **Preserved:**
- Key facts and decisions
- Important context
- Recent messages (full text)
- Long-term memories (separate storage)

❌ **Compressed:**
- Older conversation details
- Redundant information
- Small talk

### Why This Matters

Without proper context window management:
- ❌ Conversations fail when limits are hit
- ❌ Per-request cost grows with conversation length, because the full history is resent on every turn
- ❌ Performance degrades with more tokens

With proper management:
- ✅ Conversations can continue indefinitely
- ✅ Costs stay predictable
- ✅ Performance stays consistent
- ✅ Important context is preserved

## Exercises

1. **Calculate your token budget**: For your agent, allocate tokens across system prompt, working memory, long-term memories, and response space.

2. **Test long conversations**: Have a 50-turn conversation and monitor token usage. When does summarization trigger?

3. **Compare strategies**: Test different message thresholds (10, 20, 50). How does it affect conversation quality?

4. **Measure costs**: Calculate the cost difference between keeping full history vs. using summarization for a 100-turn conversation.

## Summary

In this notebook, you learned:

- ✅ Context windows have token limits that conversations can exceed
- ✅ Token budgets help allocate context window space
- ✅ Summarization is necessary for long conversations
- ✅ Agent Memory Server provides automatic summarization
- ✅ Proper management enables indefinite conversations

**Key insight:** Context window management means understanding the constraints of your model and using the right tools (like Agent Memory Server) to handle them automatically.