# Using Memories in Multi-Turn Agent Conversations


Recent research from Salesforce AI Research, ["LLMs Get Lost In Multi-Turn Conversation"](https://arxiv.org/pdf/2505.06120), found the following:

> "Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*."

To help avoid this, we can implement a custom short-term and long-term memory: the raw chat history kept in context stays short, and older turns are condensed into a long-term memory block as the conversation goes on.
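As a rough illustration of the idea (not the LlamaIndex implementation used below), here is a minimal two-tier memory in plain Python. It uses a hypothetical `TwoTierMemory` class and assumes whitespace-separated words as a stand-in for real tokens: new turns land in a small short-term buffer, and once that buffer exceeds its budget the oldest turns are moved into a long-term store that is returned ahead of the recent turns.

```python
from dataclasses import dataclass, field
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


@dataclass
class TwoTierMemory:
    """Hypothetical toy memory: a small raw buffer plus a condensed long-term store."""

    short_term_limit: int = 8  # token budget for the raw, most recent turns
    short_term: List[str] = field(default_factory=list)
    long_term: List[str] = field(default_factory=list)

    def put(self, message: str) -> None:
        self.short_term.append(message)
        # Flush the oldest turns to long-term memory until the buffer fits its
        # budget, but always keep the newest turn in the short-term buffer.
        while (
            len(self.short_term) > 1
            and sum(count_tokens(m) for m in self.short_term) > self.short_term_limit
        ):
            self.long_term.append(self.short_term.pop(0))

    def get(self) -> str:
        # Condensed long-term history first, then the raw recent turns.
        return "\n".join(self.long_term + self.short_term)


mem = TwoTierMemory(short_term_limit=8)
for turn in [
    "Hello! My name is Megan",
    "Hello! How can I help you?",
    "What is the capital of France?",
]:
    mem.put(turn)

print(mem.long_term)   # the two older turns
print(mem.short_term)  # only the newest turn
```

Nothing is lost when a turn leaves the short-term buffer; it simply moves to the store that the memory block will later condense and cap.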


## 1. Setup


To make this work, we need two things:

1.  A memory block that condenses all past chat messages into a single string while maintaining a token limit
2.  A `Memory` instance that uses that memory block and has token limits configured so that multi-turn conversations are always flushed to the memory block for handling


In [1]:
from dotenv import load_dotenv, find_dotenv

# Load API keys (such as OPENAI_API_KEY, used by the OpenAI LLM below) from a .env file
_ = load_dotenv(find_dotenv())

## 2. Custom Memory Block


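Here is our custom memory block. It stores each message as plain text, keeps only the name and arguments of each tool call, and drops session and tool-call IDs, which significantly reduces the non-essential characters that often clog tool call outputs.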
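To see what gets stripped, here is the same filtering logic pulled out into a standalone function, `condense_kwargs` (a hypothetical name), and applied to a sample payload. The payload assumes OpenAI-style tool calls stored as plain dicts with `id`, `type`, and `function` keys; the exact shape your LLM integration stores may differ.

```python
from typing import Any, Dict


def condense_kwargs(additional_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the message fields worth remembering (hypothetical helper)."""
    condensed: Dict[str, Any] = {}
    for key, val in additional_kwargs.items():
        if key == "tool_calls":
            # Keep just each call's function name and raw JSON argument string.
            condensed[key] = [
                {"name": call["function"]["name"], "args": call["function"]["arguments"]}
                for call in val
            ]
        elif key not in ("session_id", "tool_call_id"):
            # Drop bookkeeping IDs; keep any other extra fields unchanged.
            condensed[key] = val
    return condensed


# Assumed OpenAI-style payload (sample values).
raw = {
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "multiply", "arguments": '{"a": 3214, "b": 322}'},
        }
    ],
    "session_id": "tight-memory",
}

condensed = condense_kwargs(raw)
print(len(str(raw)), "->", len(str(condensed)), "characters")
print(condensed)
```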


In [2]:
import tiktoken
from pydantic import Field
from typing import List, Optional, Any
from llama_index.core.llms import ChatMessage, TextBlock
from llama_index.core.memory import Memory, BaseMemoryBlock


class CondensedMemoryBlock(BaseMemoryBlock[str]):
    """
    This class is a smart conversation buffer that maintains context while 
    staying within reasonable memory limits.

    It condenses the conversation history into a single string, while 
    maintaining a token limit.

    It also includes additional kwargs, like tool calls, when needed.
    """
    current_memory: List[str] = Field(default_factory=list)
    token_limit: int = Field(default=50000)
    tokenizer: tiktoken.Encoding = tiktoken.encoding_for_model("gpt-4o") 

    async def _aget(
        self, messages: Optional[List[ChatMessage]] = None, **block_kwargs: Any
    ) -> str:
        """Return the current memory block contents."""
        return "\n".join(self.current_memory)

    async def _aput(self, messages: List[ChatMessage]) -> None:
        """Push messages into the memory block. (Only handles text content)"""
        # construct a string for each message
        for message in messages:
            text_contents = "\n".join(
                block.text
                for block in message.blocks
                if isinstance(block, TextBlock)
            )
            memory_str = text_contents if text_contents else ""
            kwargs = {}
            for key, val in message.additional_kwargs.items():
                if key == "tool_calls":
                    val = [
                        {
                            "name": tool_call["function"]["name"],
                            "args": tool_call["function"]["arguments"],
                        }
                        for tool_call in val
                    ]
                    kwargs[key] = val
                elif key != "session_id" and key != "tool_call_id":
                    kwargs[key] = val
            memory_str += f"\n({kwargs})" if kwargs else ""

            self.current_memory.append(memory_str)

        # ensure this memory block doesn't get too large
        message_length = sum(
            len(self.tokenizer.encode(message))
            for message in self.current_memory
        )
        while message_length > self.token_limit:
            self.current_memory = self.current_memory[1:]
            message_length = sum(
                len(self.tokenizer.encode(message))
                for message in self.current_memory
            )
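The trimming step at the end of `_aput` drops the oldest entries first until the whole block fits within `token_limit`. The same oldest-first policy can be sketched as a standalone function that tokenizes each entry only once and then subtracts counts as entries are dropped. `truncate_oldest` is a hypothetical helper, and the whitespace token counter in the usage lines is an assumption standing in for the tiktoken encoder used in the class.

```python
from collections import deque
from typing import Callable, List


def truncate_oldest(
    entries: List[str], token_limit: int, count_tokens: Callable[[str], int]
) -> List[str]:
    """Drop the oldest entries until the total token count fits within token_limit."""
    kept = deque(entries)
    counts = deque(count_tokens(entry) for entry in entries)  # tokenize each entry once
    total = sum(counts)
    while kept and total > token_limit:
        total -= counts.popleft()  # subtract instead of re-counting everything
        kept.popleft()
    return list(kept)


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def word_count(text: str) -> int:
    return len(text.split())


history = ["What is (3214 * 322) / 2?", "1034908", "517454.0"]
print(truncate_oldest(history, token_limit=3, count_tokens=word_count))
```

Because each entry is tokenized once, the cost of trimming grows linearly with the number of stored entries rather than being repeated on every loop iteration.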

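## 3. Testing Custom Memory

Next, we create a `Memory` instance that uses this block and gives the short-term chat history only a tiny share of the token budget via `chat_history_token_ratio`, so older messages are flushed into the memory block almost immediately: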


In [3]:
from llama_index.core.memory import Memory, InsertMethod

block = CondensedMemoryBlock(name="condensed_memory")

memory = Memory.from_defaults(
    session_id="summary_memory",
    token_limit=60000,
    token_flush_size=5000,
    async_database_uri="sqlite+aiosqlite:///:memory:",
    memory_blocks=[block],
    insert_method=InsertMethod.USER,
    chat_history_token_ratio=0.0001,
)
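Assuming, as the parameter names suggest, that `Memory` reserves `token_limit * chat_history_token_ratio` tokens for the short-term chat history, the arithmetic shows why almost every message ends up in the memory block:

```python
token_limit = 60000
chat_history_token_ratio = 0.0001

# Token budget left for raw short-term chat history (assumed formula).
short_term_budget = round(token_limit * chat_history_token_ratio)
print(short_term_budget)  # 6
```

Six tokens is roughly the size of one short sentence, so older turns are pushed out of the short-term history almost as soon as new ones arrive.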

In [4]:
initial_messages = [
    ChatMessage(role="user", content="Hello! My name is Megan"),
    ChatMessage(role="assistant", content="Hello! How can I help you?"),
    ChatMessage(role="user", content="What is the capital of France?"),
    ChatMessage(role="assistant", content="The capital of France is Paris"),
]

await memory.aput_messages(initial_messages)

In [5]:
await memory.aput_messages(
    [ChatMessage(role="user", content="What was my name again?")]
)

In [6]:
chat_history = await memory.aget()

for message in chat_history:
    print(f"=> role: {message.role}: {message.content}")


=> role: MessageRole.USER: <memory>
<condensed_memory>
Hello! My name is Megan
Hello! How can I help you?
What is the capital of France?
The capital of France is Paris
</condensed_memory>
</memory>
What was my name again?
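With `InsertMethod.USER`, the memory block contents appear inside the latest user message, wrapped in XML-style tags. The hypothetical `render_memory` function below only reproduces the layout visible in the printed output above; it is not LlamaIndex's actual template code.

```python
from typing import Dict


def render_memory(blocks: Dict[str, str], user_text: str) -> str:
    """Reproduce the observed layout: tagged memory blocks, then the user's text."""
    lines = ["<memory>"]
    for name, content in blocks.items():
        # Each block is wrapped in a tag named after the block.
        lines += [f"<{name}>", content, f"</{name}>"]
    lines.append("</memory>")
    return "\n".join(lines + [user_text])


condensed = "\n".join([
    "Hello! My name is Megan",
    "Hello! How can I help you?",
    "What is the capital of France?",
    "The capital of France is Paris",
])
print(render_memory({"condensed_memory": condensed}, "What was my name again?"))
```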


## 4. Tool Call Agent Usage


In [7]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI


def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    return a / b


def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b


def subtract(a: float, b: float) -> float:
    """Subtract two numbers."""
    return a - b


llm = OpenAI(model="gpt-4.1-mini")

agent = FunctionAgent(
    tools=[multiply, divide, add, subtract],
    llm=llm,
    system_prompt="You are a helpful assistant that can do simple math operations with tools.",
)

In [8]:
block = CondensedMemoryBlock(name="condensed_memory")

memory = Memory.from_defaults(
    session_id="tight-memory",
    token_limit=60000,
    token_flush_size=5000,
    async_database_uri="sqlite+aiosqlite:///:memory:",
    memory_blocks=[block],
    insert_method="user",
    chat_history_token_ratio=0.0001,
)

In [9]:
resp = await agent.run("What is (3214 * 322) / 2?", memory=memory)
print(resp)

The result of (3214 * 322) / 2 is 517454.0.


In [10]:
chat_history = await memory.aget()

for message in chat_history:
    print(f"=> role: {message.role}: {message.content}")

=> role: MessageRole.ASSISTANT: The result of (3214 * 322) / 2 is 517454.0.
=> role: MessageRole.USER: <memory>
<condensed_memory>
What is (3214 * 322) / 2?

({'tool_calls': [{'name': 'multiply', 'args': '{"a": 3214, "b": 322}'}, {'name': 'divide', 'args': '{"a": 3214, "b": 2}'}]})
1034908
1607.0

({'tool_calls': [{'name': 'divide', 'args': '{"a":1034908,"b":2}'}]})
517454.0
</condensed_memory>
</memory>
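The condensed entries keep each call's arguments as the raw JSON string the model produced, so the log can be decoded and replayed. A minimal sketch, using plain lambdas as stand-ins for the `multiply` and `divide` tools, reproduces the three results recorded above:

```python
import json

# Tool calls exactly as recorded in the condensed memory block above.
logged_calls = [
    {"name": "multiply", "args": '{"a": 3214, "b": 322}'},
    {"name": "divide", "args": '{"a": 3214, "b": 2}'},
    {"name": "divide", "args": '{"a":1034908,"b":2}'},
]

# Plain-Python stand-ins for the agent's multiply and divide tools.
tools = {
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# Args are stored as raw JSON strings, so decode them before replaying each call.
results = [tools[call["name"]](**json.loads(call["args"])) for call in logged_calls]
print(results)  # [1034908, 1607.0, 517454.0]
```

The middle call, `divide(3214, 2)`, was issued alongside `multiply` in the first step; its result, 1607.0, is not used in the final answer, which comes from the third call.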


In [11]:
resp = await agent.run(
    "What was the last question I asked you?", memory=memory
)
print(resp)

The last question you asked was: "What is (3214 * 322) / 2?"


In [12]:

resp = await agent.run(
    "And how did you go about answering that message?", memory=memory
)
print(resp)

To answer the question "(3214 * 322) / 2," I first multiplied 3214 by 322 to get the product. Then, I divided that product by 2 to get the final result. Specifically:

1. Multiply 3214 by 322.
2. Divide the result by 2.

The final answer is 517454.0.
