# Understanding Context Windows (Hands-On Tutorial)
Instructor: Zion Pibowei

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1uEjuwRbg8hfHWYj2yS78tOzw2zACRf0d?usp=sharing) [![Watch Video](https://img.shields.io/badge/Watch%20Video-4285F4?logo=googledrive&logoColor=white)](https://drive.google.com/file/d/168Fj31nd2NiOGwVZ28E0VyuTMfKAGuxD/view?usp=sharing)


### 1. What is a Context Window?
A context window is the maximum amount of text (measured in tokens) that a language model can process at one time, covering both the input it receives and the output it generates. Think of it as the model's "working memory": everything it can "see" and reason about in a single interaction.

#### Key Concepts
**Tokens:** Not the same as words! A token is a chunk of text produced by the model's tokenizer. Rough rules of thumb for English text:

- 1 token ≈ 4 characters in English
- 1 token ≈ ¾ of a word on average
- "Hello world!" = ~3 tokens
- "Understanding" = ~2-3 tokens
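The rules of thumb above can be turned into a rough estimator for situations where no tokenizer is available. This is a minimal sketch: `estimate_tokens` is a hypothetical helper (not part of any library), and its results are approximations for English text only.

```python
import math

def estimate_tokens(text: str) -> dict:
    """Rough token estimates from the two rules of thumb (English text only)."""
    # Rule 1: 1 token is about 4 characters
    by_chars = math.ceil(len(text) / 4)
    # Rule 2: 1 token is about 3/4 of a word, i.e. 4/3 tokens per word
    by_words = math.ceil(len(text.split()) * 4 / 3)
    return {"by_chars": by_chars, "by_words": by_words}

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
# {'by_chars': 11, 'by_words': 12}
```

The two estimates usually disagree slightly, and both can differ from a real tokenizer's count, so treat them as a sanity check rather than a measurement.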

**Context Window Size:** Different models have different limits:

- GPT-3.5-Turbo: 16K tokens (~12,000 words)
- GPT-4: 8K-128K tokens (varies by version)
- Claude 3.5 Sonnet: 200K tokens (~150,000 words)
- Llama 3.1: 128K tokens
- Gemini 1.5 Pro: 2M tokens
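The word counts in parentheses follow from the ¾-words-per-token rule of thumb. The sketch below reproduces that arithmetic; it assumes "K" means 1,000 and "M" means 1,000,000, and it leaves out GPT-4 because its limit varies by version. `CONTEXT_LIMITS` and `approx_words` are illustrative names, not part of any library.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text

# Context limits in tokens, copied from the list above (K = 1,000 assumed)
CONTEXT_LIMITS = {
    "GPT-3.5-Turbo": 16_000,
    "Claude 3.5 Sonnet": 200_000,
    "Llama 3.1": 128_000,
    "Gemini 1.5 Pro": 2_000_000,
}

def approx_words(tokens: int) -> int:
    """Convert a token count to an approximate English word count."""
    return round(tokens * WORDS_PER_TOKEN)

for name, limit in CONTEXT_LIMITS.items():
    print(f"{name}: {limit:,} tokens ≈ {approx_words(limit):,} words")
```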

### 2. Basic Context Window Mechanics

#### Counting Tokens
Let's see how text translates to tokens using `tiktoken`, OpenAI's tokenizer library.

In [1]:
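import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    """Count the tokens in `text` using the encoding tiktoken maps to `model`."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # tiktoken only recognizes OpenAI model names; fall back to a general encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))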

# Test different texts
examples = [
    "Hi",
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Supercalifragilisticexpialidocious"
]

for text in examples:
    tokens = count_tokens(text)
    print(f"Text: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Chars: {len(text)}, Ratio: {len(text)/tokens:.2f} chars/token\n")

Text: 'Hi'
Tokens: 1
Chars: 2, Ratio: 2.00 chars/token

Text: 'Hello, world!'
Tokens: 4
Chars: 13, Ratio: 3.25 chars/token

Text: 'The quick brown fox jumps over the lazy dog.'
Tokens: 10
Chars: 44, Ratio: 4.40 chars/token

Text: 'Supercalifragilisticexpialidocious'
Tokens: 11
Chars: 34, Ratio: 3.09 chars/token



#### What Fits into Context?
Let's create a function to check whether a given conversation fits in a context window.

In [2]:
def will_it_fit(system_prompt, conversation_history, max_context=4096):
    """Check if conversation fits in context window"""
    total_tokens = 0
    
    # Count system prompt
    total_tokens += count_tokens(system_prompt)
    
    # Count conversation
    for message in conversation_history:
        total_tokens += count_tokens(message["content"])
        total_tokens += 4  # Message formatting overhead
    
    # Reserve space for response (estimate)
    response_buffer = 500
    
    remaining = max_context - total_tokens - response_buffer
    fits = remaining >= 0  # an exact fit still leaves the full response buffer
    
    return {
        "fits": fits,
        "total_tokens": total_tokens,
        "max_context": max_context,
        "remaining": remaining,
        "percentage_used": (total_tokens / max_context) * 100
    }

# Test it
system = "You are a helpful assistant."
history = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "How do I learn it?"}
]

result = will_it_fit(system, history, max_context=4096)
print(f"Fits: {result['fits']}")
print(f"Using {result['percentage_used']:.1f}% of context window")
print(f"Remaining tokens: {result['remaining']}")

Fits: True
Using 0.8% of context window
Remaining tokens: 3562
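The output above implies that the prompt used 34 tokens (4096 - 500 - 3562 = 34). The budget arithmetic can be isolated in a small helper to see when a prompt stops fitting. This is a sketch only; `remaining_tokens` is a hypothetical name, and the 3,700-token prompt is a made-up value for illustration.

```python
def remaining_tokens(max_context: int, prompt_tokens: int, response_buffer: int = 500) -> int:
    """Tokens left after the prompt and the reserved response buffer (negative means no fit)."""
    return max_context - prompt_tokens - response_buffer

print(remaining_tokens(4096, 34))    # 3562, matches the run above
print(remaining_tokens(4096, 3700))  # -104, the prompt plus buffer overflows the window
```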


### 3. Context Window Limitations
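#### The "Lost in the Middle" Problem
Models tend to attend most reliably to the beginning and end of a long context and can miss information placed in the middle. (This is distinct from catastrophic forgetting, which refers to a model losing previously learned knowledge during further training.)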

In [3]:
def create_needle_in_haystack_test(needle_position="middle", bulk_size=20):
    """
    Create a test where important info is placed at different positions
    """
    
    needle = "The secret password is BLUE_ELEPHANT_2024."
    
    filler = [
        "Here's some information about various topics.",
        "Climate change is affecting global temperatures.",
        "The stock market fluctuates based on many factors.",
        "Machine learning models require large datasets.",
        "Coffee is one of the most popular beverages worldwide.",
    ] * bulk_size  # Repeat to create bulk
    
    if needle_position == "start":
        content = [needle] + filler
    elif needle_position == "middle":
        mid = len(filler) // 2
        content = filler[:mid] + [needle] + filler[mid:]
    else:  # end
        content = filler + [needle]
    
    prompt = "\n".join(content) + "\n\nQuestion: What is the secret password?"
    
    return prompt



Run the test below against different LLMs. Try varying the model, the needle position, and `bulk_size` to change how much context the model has to search.

In [18]:
import os
from groq import Groq
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv(override=True)


client = Groq(api_key=os.getenv("GROQ_API_KEY"))
models = [
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "openai/gpt-oss-120b",
]


# Test prompts
prompt_start = create_needle_in_haystack_test("start", bulk_size=20)
prompt_middle = create_needle_in_haystack_test("middle", bulk_size=230)
prompt_end = create_needle_in_haystack_test("end")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": prompt_middle
    }
]


response = client.chat.completions.create(
    model=models[1],
    messages=messages
)

print(response.choices[0].message.content)

The secret password is BLUE_ELEPHANT_2024.
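A single run says little on its own, so it helps to compare all needle positions with the same scoring rule. The sketch below is one possible harness, under stated assumptions: `score_needle_runs` is a hypothetical helper, `ask_model` is any callable that maps a prompt to a reply, and `fake_model` is an offline stand-in that only "reads" the last 200 characters. It does not represent how a real LLM behaves.

```python
from typing import Callable, Dict

def score_needle_runs(prompts: Dict[str, str],
                      ask_model: Callable[[str], str],
                      needle: str) -> Dict[str, bool]:
    """Return, per needle position, whether the model's answer contains the needle."""
    results = {}
    for position, prompt in prompts.items():
        answer = ask_model(prompt)
        results[position] = needle.lower() in answer.lower()
    return results

# Stand-in model for illustration only: it echoes the last 200 characters it was given
def fake_model(prompt: str) -> str:
    return prompt[-200:]

filler = "Coffee is one of the most popular beverages worldwide.\n" * 50
needle_line = "The secret password is BLUE_ELEPHANT_2024.\n"
question = "\nQuestion: What is the secret password?"
mid = len(filler) // 2
prompts = {
    "start": needle_line + filler + question,
    "middle": filler[:mid] + needle_line + filler[mid:] + question,
    "end": filler + needle_line + question,
}

print(score_needle_runs(prompts, fake_model, "BLUE_ELEPHANT_2024"))
# {'start': False, 'middle': False, 'end': True}
```

To test a real model, replace `fake_model` with a function that sends the prompt to your client and returns the reply text.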


**Key Insight**

Models have "recency bias" (remember recent info) and "primacy bias" (remember early info), but struggle with middle content in long contexts.
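One way to act on this insight is to place the most important passages near the edges of the prompt and the least important ones in the middle. The sketch below assumes the passages are already sorted from most to least relevant; `reorder_for_edges` is a hypothetical helper, and whether this ordering helps in practice depends on the model and the task.

```python
def reorder_for_edges(passages_by_relevance):
    """Put the most relevant passages at the start and end, the least relevant in the middle."""
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    # Reverse the back half so the second most relevant passage ends up last
    return front + back[::-1]

print(reorder_for_edges(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B']
```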

### 4. Managing Context Windows
#### Sliding Window Approach
Keep only the most recent N messages

In [4]:
def sliding_window(conversation_history, max_messages=10):
    """Keep only the last N messages"""
    if len(conversation_history) <= max_messages:
        return conversation_history

    # Always keep system message if it exists
    if conversation_history[0]["role"] == "system":
        # Guard max_messages=1: a slice of [-0:] would return the whole list
        recent = conversation_history[-(max_messages - 1):] if max_messages > 1 else []
        return [conversation_history[0]] + recent

    return conversation_history[-max_messages:]

# Test: one system message followed by 99 user messages
long_conversation = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"Message {i}"} for i in range(99)
]

trimmed = sliding_window(long_conversation, max_messages=5)

print(f"Original conversation history: {len(long_conversation)} messages")
print(f"Trimmed conversation history: {len(trimmed)} messages")

Original conversation history: 100 messages
Trimmed conversation history: 5 messages


#### Smart Summarization
Summarize old messages to preserve context

In [5]:
def summarize_and_compress(conversation_history, llm_function):
    """
    Summarize older messages while keeping recent ones intact
    """
    if len(conversation_history) < 10:
        return conversation_history

    # Set the system prompt aside so it is kept verbatim, not summarized
    system_messages = []
    messages = conversation_history
    if messages[0]["role"] == "system":
        system_messages = [messages[0]]
        messages = messages[1:]

    # Keep the last 4 messages intact; summarize everything older
    recent_messages = messages[-4:]
    old_messages = messages[:-4]

    # Create summary prompt
    conversation_text = "\n".join([
        f"{msg['role']}: {msg['content']}"
        for msg in old_messages
    ])

    summary_prompt = f"""Summarize this conversation concisely, preserving key facts and context:

{conversation_text}

Provide a brief summary (2-3 sentences):"""

    summary = llm_function(summary_prompt)

    # Return compressed history: system prompt, summary, then recent messages
    return system_messages + [
        {"role": "system", "content": f"Previous conversation summary: {summary}"}
    ] + recent_messages
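The `llm_function` argument is any callable that takes a prompt string and returns text. One way to build it is to wrap a chat client that exposes `chat.completions.create`, the same call style used with the Groq client earlier. This is a sketch under stated assumptions: `make_llm_function` is a hypothetical helper, and `FakeCompletions` is an offline stand-in that mimics only the attributes the wrapper touches, not a real client.

```python
from types import SimpleNamespace

def make_llm_function(client, model):
    """Wrap a chat client into a prompt -> text callable."""
    def llm_function(prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    return llm_function

# Offline stand-in for illustration only (not a real API client)
class FakeCompletions:
    def create(self, model, messages):
        text = f"[{model}] summary of {len(messages[0]['content'])} characters"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
summarize = make_llm_function(fake_client, "llama-3.3-70b-versatile")
print(summarize("user: Hi\nassistant: Hello!"))
```

With a real client, pass the result straight in, for example `summarize_and_compress(history, make_llm_function(client, models[1]))`.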


#### Token-Based Truncation
Truncate based on actual token count

In [6]:
def truncate_to_fit(conversation, max_tokens=4000, buffer=500):
    """
    Remove oldest messages until conversation fits in context window
    """
    target_tokens = max_tokens - buffer
    system_message = None
    messages = conversation.copy()
    total_tokens = 0

    # Extract system message if present; it is always kept, so count it first
    if messages and messages[0]["role"] == "system":
        system_message = messages.pop(0)
        total_tokens += count_tokens(system_message["content"]) + 4

    # Walk from newest to oldest, keeping messages while they fit the budget
    kept_messages = []
    for message in reversed(messages):
        msg_tokens = count_tokens(message["content"]) + 4
        if total_tokens + msg_tokens > target_tokens:
            break
        kept_messages.append(message)
        total_tokens += msg_tokens
    kept_messages.reverse()  # restore chronological order

    # Add system message back
    if system_message:
        kept_messages.insert(0, system_message)

    removed = len(conversation) - len(kept_messages)

    return {
        "messages": kept_messages,
        "total_tokens": total_tokens,
        "removed_count": removed
    }

# Test: one system message followed by 999 user messages
conversation_history = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"Message {i}"} for i in range(999)
]

result = truncate_to_fit(conversation_history, max_tokens=4096)
print(f"Kept {len(result['messages'])} messages")
print(f"Removed {result['removed_count']} messages")
print(f"Total tokens: {result['total_tokens']}")

Kept 513 messages
Removed 487 messages
Total tokens: 3592
