# Stage 5: Working Memory for Multi-Turn Conversations

## Introduction

In Stage 4, we built a ReAct agent that could dynamically reason about its context needs: thinking through what information it needed, using tools to retrieve it, and looping until it had enough context to answer confidently. At the end of Stage 4, we also demonstrated multi-turn conversations by manually managing a conversation history list:

```python
# Stage 4 - Multi-turn conversation approach
conversation_history = []

# Turn 1
result1 = await app.ainvoke({
    "input": "Tell me about CS002", 
    "history": conversation_history
})
conversation_history.append("User: Tell me about CS002")
conversation_history.append(f"Agent: {result1['final_response'][:200]}...")

# Turn 2 - with reference to previous turn
result2 = await app.ainvoke({
    "input": "What are the prerequisites for that course?",
    "history": conversation_history  # Agent can now see Turn 1
})
```

This was our first introduction to managing conversational context through "memory". When we manually passed the conversation history, the agent successfully resolved "that course" to CS002. The reasoning capabilities were there, but the memory infrastructure wasn't. We couldn't persist conversations across sessions, extract important facts for later recall, deduplicate repeated information, or manage memory as conversations grew longer.

Consider a different scenario that shows the limitation:

**Turn 1:**  
👤 "What is CS002?"  
🤖 "CS002 is Data Structures and Algorithms..."

**Turn 2:**  
👤 "What are the prerequisites for that course?"  
🤖 ❌ "I'm not sure which course you are referring to" *(The agent doesn't understand what "that course" refers to)*

Solving this type of limitation is the focus of Stage 5, where we will introduce memory and shift the focus from managing retrieved context to managing conversational context. To do this, we'll use a new tool: [Redis Agent Memory Server (RAMS)](https://github.com/redis/agent-memory-server). 

RAMS is a fast and flexible memory layer for AI agents. Adding it will complete our agent stack, transforming our agent from a powerful single-query reasoner into a true conversational partner with persistent context (memory) across sessions.

Before we begin working with memory, let's examine the key concepts surrounding how RAMS manages memory and its role within our agent architecture. 

## Memory Architecture

### How RAMS Manages Memory: The Two-Tier Model

At the heart of implementing the memory infrastructure of our agent is how RAMS manages conversational context. RAMS isn't just a key-value store for conversations; it implements a two-tier memory system:

1. Working Memory (Session-Scoped)  

Working memory represents recent conversation history for the current session. It captures the immediate context of the ongoing conversation. Each session maintains its own working memory, perfect for managing multiple concurrent conversations per user.

2. Long-Term Memory (Cross-Session)  

On the other hand, long-term memory represents persistent facts extracted from all conversations over time. It stores important information that transcends individual conversations. Facts are automatically extracted, deduplicated, and compressed for efficient retrieval.

Here's a quick comparison of the two memory types:

| Working Memory | Long-Term Memory |
|---|---|
| Session-scoped | User-scoped OR Application-scoped |
| Current conversation | Important facts, rules, knowledge |
| Persists for session | Persists across sessions |
| Full message history | Extracted knowledge (user + domain) |
| Loaded/saved each turn | Searched when needed |
| Challenge: Context window limits | Challenge: Storage growth |
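
To make the two tiers concrete, here is a minimal in-memory sketch. It is not how RAMS is implemented: `TwoTierMemory` and its methods are hypothetical names, and the plain `set` only stands in for RAMS's extraction and deduplication.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TwoTierMemory:
    """Toy stand-in for a two-tier memory store (hypothetical, not the RAMS API)."""

    def __init__(self) -> None:
        # Working memory: full message history, keyed by session ID.
        self.working: Dict[str, List[Dict[str, str]]] = defaultdict(list)
        # Long-term memory: distilled facts, keyed by user ID.
        self.long_term: Dict[str, Set[str]] = defaultdict(set)

    def add_message(self, session_id: str, role: str, content: str) -> None:
        self.working[session_id].append({"role": role, "content": content})

    def add_fact(self, user_id: str, fact: str) -> None:
        # A set silently drops exact duplicates.
        self.long_term[user_id].add(fact)


memory = TwoTierMemory()

# Two concurrent sessions keep separate working memory.
memory.add_message("session-a", "user", "What is CS002?")
memory.add_message("session-b", "user", "Tell me about ML courses")

# Facts are scoped to the user, not the session, and are stored once.
memory.add_fact("student-1", "Interested in machine learning")
memory.add_fact("student-1", "Interested in machine learning")
```

The point of the split is the key: working memory is looked up by session, long-term memory by user.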

In this stage, we'll focus exclusively on working memory. The agent will load and save conversation history, while RAMS will extract facts to long-term storage invisibly. In Stage 6, we'll add explicit tools (`search_memories`, `store_memory`) for the agent to deliberately query and manipulate long-term memory, but for now, we won't need to worry about it.

Once implemented, our Stage 5 architecture wraps the ReAct agent with a memory layer:

```mermaid
graph TD
    Q[Query] --> LM[Load Working Memory]
    LM --> IC[Classify Intent]
    IC -->|GREETING| HG[Handle Greeting]
    IC -->|Other| RA[ReAct Agent]

    subgraph ReAct Loop
        RA --> T1[💭 Thought: Analyze + use history]
        T1 --> A1[🔧 Action: search_courses]
        A1 --> O1[👁️ Observation: Results]
        O1 --> T2[💭 Thought: Evaluate]
        T2 --> |Need more| A1
        T2 --> |Done| F[✅ FINISH]
    end

    F --> SM[Save Working Memory]
    HG --> SM
    SM --> END[Response + Reasoning Trace]

    subgraph Memory Layer
        LM -.->|Read| RAMS[(Redis Agent Memory Server)]
        SM -.->|Write| RAMS
    end
```

### The Working Memory Lifecycle

With RAMS managing the storage and organization of memories, our agent needs to orchestrate when and how to interact with it. The working memory lifecycle in Stage 5 will follow three steps on every turn:

1. Load Working Memory: At the start of each turn, the agent will load the conversation history from RAMS. This provides the immediate context for the current conversation.

2. Process: The agent will use the conversation history as input to its reasoning process. This allows it to understand references like "that course" in context.

3. Save Working Memory: After generating a response, the agent will save the new conversation turn back to RAMS. This updates the working memory for future turns.

This pattern repeats on every turn:
- **Before reasoning**: Load history to provide context
- **During reasoning**: ReAct agent uses history ("that course" → CS002)
- **After reasoning**: Save the new exchange for future turns
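
The three steps can be sketched end to end with an in-memory dictionary standing in for RAMS. Everything here is hypothetical scaffolding (`FAKE_STORE`, `load_history`, `answer`, `save_turn`, `run_turn`); the placeholder `answer` only reports how much history it received instead of calling the ReAct agent.

```python
from typing import Dict, List

Message = Dict[str, str]

# Hypothetical in-memory stand-in for RAMS, keyed by session ID.
FAKE_STORE: Dict[str, List[Message]] = {}


def load_history(session_id: str) -> List[Message]:
    """Step 1: load prior messages (empty list for a new session)."""
    return list(FAKE_STORE.get(session_id, []))


def answer(query: str, history: List[Message]) -> str:
    """Step 2: placeholder for the agent; reports how much context it saw."""
    return f"(answer to {query!r} using {len(history)} prior messages)"


def save_turn(session_id: str, history: List[Message], query: str, response: str) -> None:
    """Step 3: persist the previous history plus the new exchange."""
    FAKE_STORE[session_id] = history + [
        {"role": "user", "content": query},
        {"role": "assistant", "content": response},
    ]


def run_turn(session_id: str, query: str) -> str:
    history = load_history(session_id)
    response = answer(query, history)
    save_turn(session_id, history, query, response)
    return response


first = run_turn("demo", "What is CS002?")
second = run_turn("demo", "What are the prerequisites for that course?")
```

On the second call the placeholder agent receives the two messages saved by the first, which is the information a real agent would need to resolve "that course".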

### Implementation Overview

In this notebook, you'll implement the core memory infrastructure for working memory by building the following nodes: 

1. **`load_working_memory_node`**  
A LangGraph node that retrieves conversation history from RAMS at the start of each turn and adds it to the agent's state.

2. **`save_working_memory_node`**  
A LangGraph node that stores the new conversation turn (user query + agent response) back to RAMS after reasoning completes.

Before diving into implementation, let's review the supporting code that has already been implemented for you in the production agent. You'll focus on implementing the memory nodes, while these components handle configuration, state management, and workflow orchestration.

## Implementing Working Memory

### Supporting Infrastructure

There are a few pieces of supporting infrastructure in the agent code that you should be aware of.

For one, the `get_memory_client()` helper in the [nodes.py](../progressive_agents/stage5_working_memory/agent/nodes.py#L55-L62) file abstracts away configuration details for the Redis Agent Memory Server.

```python
def get_memory_client() -> MemoryAPIClient:
    """Get the configured Agent Memory Server client."""
    config = MemoryClientConfig(
        base_url=os.getenv("AGENT_MEMORY_URL", "http://localhost:8088"),
        default_namespace="course_qa_agent",
    )
    return MemoryAPIClient(config=config)
```
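
Because the URL comes from `os.getenv`, the server location can be changed without editing code. The sketch below isolates just that lookup; `resolve_memory_url` is a hypothetical helper and the override hostname is made up for illustration.

```python
import os

DEFAULT_URL = "http://localhost:8088"  # same fallback used in get_memory_client()


def resolve_memory_url() -> str:
    """Return AGENT_MEMORY_URL if it is set, otherwise the local default."""
    return os.getenv("AGENT_MEMORY_URL", DEFAULT_URL)


# Remember any real value so this demo leaves the environment unchanged.
saved = os.environ.pop("AGENT_MEMORY_URL", None)

url_default = resolve_memory_url()  # variable unset: falls back to the default

os.environ["AGENT_MEMORY_URL"] = "http://memory.example.internal:8088"  # made-up host
url_override = resolve_memory_url()  # variable set: the override wins

# Restore the original environment.
if saved is None:
    del os.environ["AGENT_MEMORY_URL"]
else:
    os.environ["AGENT_MEMORY_URL"] = saved
```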

Additionally, the workflow for the agent is now modified to use a TypedDict to track all state fields across nodes in the [state.py](../progressive_agents/stage5_working_memory/agent/state.py#L28-L80) file. 

For implementing memory, you'll notice there are `session_id`, `student_id`, and `conversation_history` fields: 

```python
class WorkflowState(TypedDict):
    # Core fields you'll work with:
    session_id: str                          # Session identifier
    student_id: str                          # User identifier  
    conversation_history: List[Dict[str, str]]  # Previous messages
    user_query: str                          # Current question
    agent_response: str                      # Current answer
    
    # Plus many other fields for intent classification,
    # entity extraction, caching, metrics, etc.
```

This state flows through every node in the workflow, with each node reading and updating relevant fields.

Lastly, the complete workflow orchestrates multiple nodes with conditional routing in the [workflow.py#L67-L100](../progressive_agents/stage5_working_memory/agent/workflow.py#L67-L100) file. You'll implement the `load_working_memory_node` and `save_working_memory_node` functions that plug into this graph as noted below.

```python
workflow = StateGraph(WorkflowState)

# Add all nodes
workflow.add_node("load_memory", load_working_memory_node)
workflow.add_node("classify_intent", classify_intent_node)
workflow.add_node("react_agent", react_agent_node)
workflow.add_node("save_memory", save_working_memory_node)

# Define flow with conditional routing
workflow.set_entry_point("load_memory")
workflow.add_edge("load_memory", "classify_intent")
workflow.add_conditional_edges("classify_intent", route_after_intent, {...})
workflow.add_edge("react_agent", "save_memory")
workflow.add_edge("save_memory", END)
```
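
The routing mapping is elided in the snippet above. As an illustration only, a router passed to `add_conditional_edges` is a plain function that inspects the state and returns a key, which the mapping then resolves to a node name. The sketch below mirrors the GREETING branch from the architecture diagram; the `intent` field, the routing keys, the `COURSE_QUESTION` label, and the `handle_greeting` node name are assumptions, not the contents of workflow.py.

```python
from typing import Dict


def route_after_intent(state: Dict[str, str]) -> str:
    """Return a routing key based on the classified intent (hypothetical logic)."""
    if state.get("intent") == "GREETING":
        return "greeting"
    return "agent"


# Hypothetical mapping from routing key to node name.
ROUTES = {"greeting": "handle_greeting", "agent": "react_agent"}

next_for_greeting = ROUTES[route_after_intent({"intent": "GREETING"})]
next_for_question = ROUTES[route_after_intent({"intent": "COURSE_QUESTION"})]
```

Keeping the router free of side effects makes it easy to test on its own, separately from the graph.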

Now, let's get started building the two memory nodes for the agent. Run the code block below to set up the agent code for this stage.

```python
import sys
from pathlib import Path
from dotenv import load_dotenv

project_root = Path("..").resolve()

stage5_path = project_root / "progressive_agents" / "stage5_working_memory"
src_path = project_root / "src"

load_dotenv(project_root / ".env")

sys.path.insert(0, str(src_path))
sys.path.insert(0, str(stage5_path))

from agent import setup_agent, create_workflow, WorkflowState, get_memory_client, MemoryMessage, WorkingMemory

print("Initializing Stage 5 Agent...")
course_manager, _ = await setup_agent(auto_load_courses=True)
workflow = create_workflow(course_manager)
```

## Part 1: Implementing the Load Working Memory Node

The first step in our memory lifecycle is *loading* conversation history from Agent Memory Server. This happens before the agent begins reasoning, providing it with context from previous turns in the same session.

The `load_working_memory_node` is a LangGraph node that:

1. Connects to Agent Memory Server using the session ID
2. Retrieves all previous messages from this session's working memory
3. Adds those messages to the agent's state as `conversation_history`
4. Returns the updated state with conversation context

This node transforms our agent from stateless (starting fresh every time) to stateful (aware of previous turns).

### 📌 Task: Implement `load_working_memory_node`

Your task is to implement the function that loads working memory from the Agent Memory Server and populates the state's conversation history.

The function receives a `state` dictionary containing `session_id` and `student_id`. It will then use the Agent Memory Server client to retrieve the working memory, convert the messages to a format the agent can use, and add them to the state. 

You'll use the `get_or_create_working_memory()` method to retrieve the session's conversation history.

<details>
<summary>üõ†Ô∏è Show Implementation Details</summary>
<br>

**Step 1: Retrieve Working Memory**

Call the async method `.get_or_create_working_memory()` on the memory client with three parameters:
- `session_id`: from `state["session_id"]`
- `user_id`: from `state["student_id"]`  
- `model_name`: use `"gpt-4o-mini"`

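This returns a two-element tuple whose second element is the `working_memory` object. That is the only part you need, so use `_` for the first element (in current client versions, a flag indicating whether the session was newly created).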

**Step 2: Extract and Format Messages**

If the `working_memory` exists and has messages (`working_memory.messages`), do the following:
1. Create an empty list called `conversation_history`
2. Iterate through `working_memory.messages`
3. For each message, append a dictionary with `{"role": msg.role, "content": msg.content}` to the list
4. Set `state["conversation_history"]` to this list

If no working memory exists, set the conversation history in the state to an empty list `[]`.

</details>

```python
# Import required types and functions for memory node implementation
from agent import WorkflowState, get_memory_client

# from agent_memory_client import MemoryMessage, WorkingMemory


async def load_working_memory_node(state: WorkflowState) -> WorkflowState:
    """
    Load working memory from Agent Memory Server and populate conversation history.
    
    Args:
        state: Current workflow state with session_id and student_id
        
    Returns:
        Updated state with conversation_history from working memory
    """
    session_id = state["session_id"]
    student_id = state["student_id"]
    
    try:
        # Get the memory client
        memory_client = get_memory_client()
        
        # TODO - Step 1: Retrieve working memory for this session
        
        
        # TODO - Step 2: Extract and format messages
        if working_memory and working_memory.messages:
            
            print(f"✅ Loaded {len(conversation_history)} messages from working memory")
        else:
            
            print("ℹ️ No previous conversation history found")
            
    except Exception as e:
        print(f"⚠️ Error loading working memory: {e}")
        state["conversation_history"] = []
    
    return state

print("✅ load_working_memory_node defined.")
```

<details>
<summary>üóùÔ∏è Solution code</summary>

```python

async def load_working_memory_node(state: WorkflowState) -> WorkflowState:
    """
    Load working memory from Agent Memory Server and populate conversation history.
    
    Args:
        state: Current workflow state with session_id and student_id
        
    Returns:
        Updated state with conversation_history from working memory
    """
    session_id = state["session_id"]
    student_id = state["student_id"]
    
    try:
        # Get the memory client
        memory_client = get_memory_client()
        
        # Retrieve working memory for this session
        _, working_memory = await memory_client.get_or_create_working_memory(
            session_id=session_id,
            user_id=student_id,
            model_name="gpt-4o-mini"
        )
        
        # Extract and format messages
        if working_memory and working_memory.messages:
            conversation_history = [
                {"role": msg.role, "content": msg.content}
                for msg in working_memory.messages
            ]
            state["conversation_history"] = conversation_history
            print(f"✅ Loaded {len(conversation_history)} messages from working memory")
        else:
            state["conversation_history"] = []
            print("ℹ️ No previous conversation history found")
            
    except Exception as e:
        print(f"⚠️ Error loading working memory: {e}")
        state["conversation_history"] = []
    
    return state

print("✅ load_working_memory_node defined.")
```

</details>

### Test the Load Node

Let's verify our load node works correctly by testing both empty sessions and sessions with existing history. The test file handles setting up test data internally. Run the following code block to start the test:

```python
# Run comprehensive load tests (empty session + session with history)
from test_load_memory import run_tests

await run_tests(load_working_memory_node)
```

## Part 2: Implementing the Save Working Memory Node

Now that we can load conversation history, we need a way to *save* new conversation turns back to the Agent Memory Server. This happens after the agent completes its response, storing both the user's question and the agent's answer.

### Understanding the Save Node

The `save_working_memory_node` is a LangGraph node that:

1. Collects the complete conversation history (previous turns + current turn)
2. Converts messages to Agent Memory Server's format
3. Saves the updated working memory to the session
4. Triggers automatic fact extraction (Agent Memory Server analyzes the conversation and extracts important facts to long-term memory)

This node ensures that each turn is persisted, so the next turn can retrieve it via the load node.
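
The first step, combining previous turns with the current one, is worth isolating because it should not mutate the history already held in state. A minimal sketch, assuming messages are plain `{"role", "content"}` dictionaries as in `WorkflowState`; `build_turn_messages` is a hypothetical helper, not part of the agent code.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_turn_messages(
    history: List[Message], user_query: str, agent_response: str
) -> List[Message]:
    """Return history plus the current exchange, leaving `history` untouched."""
    return [
        *history,
        {"role": "user", "content": user_query},
        {"role": "assistant", "content": agent_response},
    ]


previous = [
    {"role": "user", "content": "What is CS002?"},
    {"role": "assistant", "content": "CS002 is Data Structures and Algorithms..."},
]
combined = build_turn_messages(
    previous,
    "What are the prerequisites for that course?",
    "(agent answer)",
)
```

Building a new list keeps `state["conversation_history"]` equal to what was loaded, so a failed save does not leave the state half-updated.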

### 📌 Task: Implement `save_working_memory_node`

Your task is to implement the function that saves the current conversation turn to Agent Memory Server's working memory.

The function receives a `state` dictionary with `conversation_history` (previous turns), `user_query` (current question), and `agent_response` (current answer). 

It should combine all messages and save them to Agent Memory Server. You'll use `MemoryMessage` and `WorkingMemory` classes to format the data, then use the `.put_working_memory()` method to save it.

<details>
<summary>üõ†Ô∏è Show Implementation Details</summary>
<br>

**Step 1: Build Complete Message History**

Create a list called `all_messages` that combines:
- All messages from `state.get("conversation_history", [])` (previous turns)
- The current user message: `{"role": "user", "content": state["user_query"]}`
- The current agent message: `{"role": "assistant", "content": state["agent_response"]}`

**Step 2: Convert to Agent Memory Server Format**

Create a list called `memory_messages` by converting each message dictionary to Agent Memory Server's `MemoryMessage` object:

```python
memory_messages = [
    MemoryMessage(role=msg["role"], content=msg["content"])
    for msg in all_messages
]
```

**Step 3: Create Working Memory Object**

Create a `WorkingMemory` object with:
- `session_id`: from `state["session_id"]`
- `user_id`: from `state["student_id"]`
- `messages`: the `memory_messages` list
- `memories`: empty list `[]`
- `data`: empty dictionary `{}`

**Step 4: Save to Agent Memory Server**

Call the async method `.put_working_memory()` on the memory client with four parameters:
- `session_id`: the `session_id` variable
- `memory`: the `working_memory` object you created
- `user_id`: the `student_id` variable
- `model_name`: use `"gpt-4o-mini"`
</details>

```python
async def save_working_memory_node(state: WorkflowState) -> WorkflowState:
    """
    Save current conversation turn to Agent Memory Server.
    
    Args:
        state: Current workflow state with conversation history and current turn
        
    Returns:
        Unchanged state (saving is a side effect)
    """
    session_id = state["session_id"]
    student_id = state["student_id"]
    
    try:
        # Get the memory client
        memory_client = get_memory_client()

        # TODO - Step 1: Build complete message history (previous + current turn)

        # TODO - Step 2: Convert to Agent Memory Server format
        memory_messages = []
        
        # TODO - Step 3: Create WorkingMemory object
        
        # TODO - Step 4: Save to Agent Memory Server

        print(f"✅ Saved {len(memory_messages)} messages to working memory")
        
    except Exception as e:
        print(f"⚠️ Error saving working memory: {e}")
    
    return state

print("✅ save_working_memory_node defined.")
```

<details>
<summary>üóùÔ∏è Solution code</summary>
<br> 
    
```python

async def save_working_memory_node(state: WorkflowState) -> WorkflowState:
    """
    Save current conversation turn to Agent Memory Server.
    
    Args:
        state: Current workflow state with conversation history and current turn
        
    Returns:
        Unchanged state (saving is a side effect)
    """
    session_id = state["session_id"]
    student_id = state["student_id"]
    
    try:
        # Get the memory client
        memory_client = get_memory_client()
        
        # Build complete message history (previous + current turn)
        all_messages = state.get("conversation_history", []).copy()
        all_messages.append({"role": "user", "content": state["user_query"]})
        all_messages.append({"role": "assistant", "content": state["agent_response"]})
        
        # Convert to Agent Memory Server format
        memory_messages = [
            MemoryMessage(role=msg["role"], content=msg["content"])
            for msg in all_messages
        ]
        
        # Create WorkingMemory object
        working_memory = WorkingMemory(
            session_id=session_id,
            user_id=student_id,
            messages=memory_messages,
            memories=[],
            data={},
        )
        
        # Save to Agent Memory Server
        await memory_client.put_working_memory(
            session_id=session_id,
            memory=working_memory,
            user_id=student_id,
            model_name="gpt-4o-mini",
        )
        
        print(f"✅ Saved {len(memory_messages)} messages to working memory")
        
    except Exception as e:
        print(f"⚠️ Error saving working memory: {e}")
    
    return state

print("✅ save_working_memory_node defined.")
```

</details>

### Test the Save Node

Let's verify our save node works correctly by testing both new sessions and sessions with existing history:

```python
# Run comprehensive save tests (new session + session with history)
from test_save_memory import run_tests

await run_tests(save_working_memory_node)
```

## Part 3: Testing Multi-Turn Conversations

You've now implemented the core memory nodes that enable conversation history! While you wrote the `load_working_memory_node` and `save_working_memory_node` functions above, we won't integrate them into the workflow in this notebook. Instead, we'll use the production implementations from [nodes.py](../progressive_agents/stage5_working_memory/agent/nodes.py) which include the same logic you just built, plus:

- Additional error handling and retry logic
- Detailed logging for debugging
- Memory client connection pooling
- Metrics tracking for monitoring

The production workflow in [workflow.py](../progressive_agents/stage5_working_memory/agent/workflow.py) already has these nodes integrated into the full pipeline: `Load Memory → Classify Intent → ReAct Agent → Save Memory`.

Now, let's run [test_react_multi_turn.py](../progressive_agents/stage5_working_memory/test_react_multi_turn.py) which runs 4 comprehensive multi-turn conversation scenarios:

1. Pronoun Resolution Test
   - "What is CS002?" → "What are the prerequisites for it?" → "Tell me more about the syllabus"
   - Demonstrates how the agent resolves "it" to CS002 using conversation history

2. Follow-up Questions Test
   - "Tell me about machine learning courses" → "Which one is best for beginners?" → "What are the prerequisites for that course?"
   - Shows context retention across increasingly specific follow-ups

3. Comparison Across Turns Test
   - "What is CS001?" → "What is CS002?" → "Which one should I take first?"
   - Tests the agent's ability to compare courses mentioned in separate turns

4. Context Accumulation Test
   - 4-turn conversation building progressively specific queries about computer vision courses
   - Demonstrates how context accumulates and informs later responses

Run the code block below to begin the test. Note that in addition to running the tests above, the `run_tests()` function returns the session IDs that we will use in the next section.

```python
# Import and run all multi-turn conversation tests
from test_react_multi_turn import run_tests

# Run tests and store session IDs for later use in compression analysis
test_session_ids = await run_tests()
```

## Preview: Automatic Knowledge Extraction

### From Conversations to Persistent Knowledge

Working memory is important because it captures the current session, but what happens to valuable information when the session ends? If a student spends 20 minutes discussing their interest in machine learning courses, that knowledge shouldn't disappear when they close their browser.

This is where RAMS's two-tier memory system shines. Every time you call `put_working_memory()` to save a conversation turn, RAMS automatically performs several intelligent operations behind the scenes:

1. **Analyzes the Conversation Turn**  
RAMS uses an LLM to examine the user query and agent response, identifying important information worth preserving long-term.

2. **Extracts Key Facts**  
Instead of storing verbatim conversation history, RAMS extracts compressed facts:
   - "Student is interested in machine learning courses"
   - "Student asked about CS004 prerequisites"
   - "Student prefers beginner-level content"

3. **Stores in Long-Term Memory**  
These facts are saved to long-term memory with vector embeddings for semantic search and graph relationships for structured queries. They persist across sessions and are available to all future conversations for that user.

4. **Deduplicates Information**  
If a student asks "What is CS002?" multiple times across turns, RAMS recognizes the redundancy and maintains a single fact rather than duplicating information.

This entire pipeline happens automatically and invisibly in Stage 5. You don't need to call any extraction APIs or manage the transition from working memory to long-term memory; RAMS handles it transparently when you save working memory.
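
RAMS's extraction relies on an LLM and is not reproduced here. The sketch below only illustrates the deduplication idea from step 4 with a deliberately simple rule: facts that are identical after case and whitespace normalization collapse into one entry. `normalize` and `dedupe_facts` are hypothetical helpers, and a real system may also compare facts by semantic similarity.

```python
import hashlib
from typing import Dict, List


def normalize(fact: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(fact.lower().split())


def dedupe_facts(facts: List[str]) -> List[str]:
    """Keep the first wording seen for each normalized fact."""
    seen: Dict[str, str] = {}
    for fact in facts:
        key = hashlib.sha256(normalize(fact).encode("utf-8")).hexdigest()
        seen.setdefault(key, fact)
    return list(seen.values())


facts = [
    "Student asked about CS002",
    "student asked  about CS002",  # same fact, different case and spacing
    "Student prefers beginner-level content",
]
unique = dedupe_facts(facts)
```

Hashing the normalized text gives a fixed-size key that is cheap to store and compare, at the cost of missing paraphrases such as "Student wants to know about CS002".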

> **Note:** RAMS supports [multiple memory extraction strategies](https://redis.github.io/agent-memory-server/memory-extraction-strategies/) that you can configure to customize how facts are identified and stored (such as entity extraction, topic modeling, or custom extraction prompts). In this course, we use the default extraction strategy and don't explicitly configure it in our RAMS setup, so RAMS handles fact extraction out of the box.

The automatic fact extraction also creates a powerful compression mechanism:

```
Working Memory:                   Long-Term Memory:
Turn 1: "What is CS004?"         → Fact: "Student interested in CS004"
  Response: 300 tokens
Turn 2: "Prerequisites?"         → Fact: "Student asked about prerequisites"
  Response: 200 tokens
Turn 3: "More details?"          → Merged with: "Student needs CS004 info"
  Response: 250 tokens

Total: ~750 tokens               Total: ~50 tokens (15x compression)
```

Instead of loading 750 tokens of conversation history on future turns, the agent could search long-term memory and retrieve the compressed fact: "Student is interested in CS004, asked about prerequisites and details." In this idealized illustration that is a 15:1 reduction, which would substantially extend the effective conversation length before hitting context limits. Actual ratios depend on the length of the conversation and how much of it distills into facts.
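
The arithmetic behind the illustration is simple enough to check directly. The token counts below are the approximate figures from the diagram, not measurements.

```python
# Approximate token counts taken from the illustration above.
working_memory_tokens = [300, 200, 250]  # responses for turns 1-3
long_term_tokens = 50                    # rough size of the merged facts

total_working = sum(working_memory_tokens)
ratio = total_working / long_term_tokens

print(f"{total_working} tokens -> {long_term_tokens} tokens ({ratio:.0f}x)")
```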

### Seeing Long-Term Memory in Action

Let's inspect what RAMS has automatically extracted from our test conversations without us explicitly doing anything. After running the multi-turn tests above, RAMS has been analyzing and extracting facts behind the scenes:

```python
# Query long-term memories automatically extracted by RAMS
from query_long_term_memory import query_extracted_memories, analyze_compression

# Query memories for the test user using the RAMS API
# Note: query_extracted_memories is a wrapper, not part of the RAMS API itself.
# See query_long_term_memory.py for the full implementation.
memories = await query_extracted_memories(
    student_id="test_user",
    search_query="courses discussed",
    limit=20
)

# Compare token size: working memory vs long-term memory
# Use the session IDs we stored from running the tests above
if 'test_session_ids' in dir() and len(test_session_ids) > 0:
    print(f"\n🔍 Using {len(test_session_ids)} sessions from test run:")
    for session_id in test_session_ids:
        print(f"  • {session_id}")
    
    print("\n📊 Running compression analysis...\n")
    await analyze_compression(test_session_ids, student_id="test_user")
else:
    print("\n⚠️ No test_session_ids found. Run the multi-turn tests first!")
```

<details>
<summary>üí° Open this dropdown after running the code block to review the compression analysis</summary>
<br>
    
The compression analysis above demonstrates a critical principle of managing conversational context: quality over quantity. By extracting and compressing facts from verbose conversation history, RAMS enables the agent to maintain context over much longer interactions with less risk of hitting token limits. In this case, a compression ratio of roughly 1x to 2.5x is achieved compared to storing the full message history.

</details>

## Wrap Up 🏁

Excellent work! You've completed Stage 5 and transformed your agent from a single-turn reasoner into a conversational partner with persistent memory.

In this stage, you learned how to:

- Implement `load_working_memory_node`, which retrieves conversation history from RAMS at the start of each turn
- Implement `save_working_memory_node`, which persists conversation turns back to RAMS after reasoning completes
- Orchestrate the Load → Process → Save lifecycle that enables multi-turn conversations with context awareness

You also got a preview of how RAMS automatically extracts and compresses facts from working memory to long-term memory behind the scenes.

In Stage 5, this automatic extraction happened invisibly: useful for managing context window limits, but not directly accessible to your agent's reasoning process. Stage 6 changes that by giving your agent *explicit control* over long-term memory via LangChain tools.

You'll transform the agent from passively benefiting from automatic extraction to actively managing its own memory: deciding what to remember, what to recall, and when to search its accumulated knowledge.
