# Memories: Short-Term & Long-Term

We will use the `Memory` class (from LlamaIndex) to store and retrieve both short-term and long-term memory.

You can use it on its own and orchestrate within a custom workflow, or use it within an existing agent.

By default, short-term memory is represented as a FIFO queue of `ChatMessage` objects. Once the queue exceeds a certain size, the oldest messages, up to a configured flush size, are archived and optionally flushed to long-term memory blocks.
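To make the queue behavior concrete, here is a minimal pure-Python sketch of a FIFO buffer that archives its oldest entries once it goes over budget. It is a toy model, not the library's implementation: `ToyShortTermMemory`, `count_tokens`, and the whitespace-based token count are all assumptions made for illustration.

```python
from collections import deque


def count_tokens(text):
    # Assumption: one token per whitespace-separated word (real tokenizers differ).
    return len(text.split())


class ToyShortTermMemory:
    """Toy FIFO queue that archives its oldest messages once over budget."""

    def __init__(self, token_budget, flush_size):
        self.token_budget = token_budget
        self.flush_size = flush_size
        self.queue = deque()  # short-term messages, oldest on the left
        self.archive = []     # messages flushed out of short-term memory

    def _queue_tokens(self):
        return sum(count_tokens(m) for m in self.queue)

    def put(self, message):
        self.queue.append(message)
        if self._queue_tokens() <= self.token_budget:
            return
        # Over budget: pop the oldest messages until at least `flush_size`
        # tokens are archived and the queue fits the budget again.
        flushed = 0
        while self.queue and (
            flushed < self.flush_size or self._queue_tokens() > self.token_budget
        ):
            oldest = self.queue.popleft()
            flushed += count_tokens(oldest)
            self.archive.append(oldest)


toy = ToyShortTermMemory(token_budget=10, flush_size=4)
for i in range(5):
    toy.put(f"message number {i} here")  # 4 toy tokens each

print(list(toy.queue))   # the two newest messages remain
print(len(toy.archive))  # the three oldest were archived
```

In this toy run, every third message pushes the queue over its 10-token budget, so the oldest message is moved to the archive each time.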

Long-term memory is represented as `Memory Block` objects. These objects receive the messages that are flushed from short-term memory, and optionally process them to extract information. Then when memory is retrieved, the short-term and long-term memories are merged together.


## 1. Setup


In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.settings import Settings

llm = OpenAI(model="gpt-4.1-mini", temperature=0.01)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.embed_model = embed_model
Settings.llm = llm


## 2. Short-term Memory

Let's explore how to configure various components of short-term memory.

For demonstration purposes, we will set some low token limits so that the memory behavior is easier to observe.


In [2]:

from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=50,  # small enough to observe the memory behavior
    token_flush_size=10,
    chat_history_token_ratio=0.7,
)

Let's review the configuration we used and what it means:

- `session_id`: A unique identifier for the session. Used to mark chat messages in a SQL database as belonging to a specific session.
- `token_limit`: The maximum number of tokens that can be stored in short-term + long-term memory.
- `chat_history_token_ratio`: The ratio of tokens in the short-term chat history to the total token limit. Here this means that 50\*0.7 = 35 tokens are allocated to short-term memory, and the rest is allocated to long-term memory.
- `token_flush_size`: The number of tokens to flush to long-term memory when the token limit is exceeded. Note that we did not configure long-term memory, so these messages are merely archived in the database and removed from the short-term memory.
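The budget arithmetic described above can be sketched in a few lines. The helper name `split_budget` is hypothetical, and rounding with `round` is an assumption; the library may round differently internally.

```python
def split_budget(token_limit, chat_history_token_ratio):
    """Split a total token limit into (short-term, long-term) budgets."""
    # Assumption: round to the nearest whole token.
    short_term = round(token_limit * chat_history_token_ratio)
    long_term = token_limit - short_term
    return short_term, long_term


# The configuration used in this section: 50 tokens, 70% for chat history.
print(split_budget(50, 0.7))
```

With `token_limit=50` and `chat_history_token_ratio=0.7`, this yields 35 tokens for short-term memory and 15 for long-term memory, matching the figures in the list above.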

Using our memory, we can manually add some messages and observe how it works.


In [3]:
from llama_index.core.llms import ChatMessage

# Simulate a long conversation
for i in range(100):
    await memory.aput_messages(
        [
            ChatMessage(role="user", content="Hello, world!  Message " + str(i)),
            ChatMessage(role="assistant", content="Hello, world to you too!  Message " + str(i)),
            ChatMessage(role="user", content="What is the capital of France?  Message " + str(i)),
            ChatMessage(
                role="assistant", content="The capital of France is Paris.  Message " + str(i)
            ),
        ]
    )

Since our token limit is small, we will only see the last 2 messages in short-term memory (since these fit within the `50*0.7` limit).


In [4]:
current_chat_history = await memory.aget()
for msg in current_chat_history:
    print(msg)

user: What is the capital of France?  Message 99
assistant: The capital of France is Paris.  Message 99


If we retrieve all messages, we will find all 400 messages.


In [5]:

all_messages = await memory.aget_all()
print(len(all_messages))

400


We can clear the memory at any time to start fresh.


In [6]:
await memory.areset()
all_messages = await memory.aget_all()
print(len(all_messages))

0


## 3. Long-term Memory

Long-term memory is represented as Memory Block objects. These objects receive the messages that are flushed from short-term memory, and optionally process them to extract information. Then when memory is retrieved, the short-term and long-term memories are merged together.


We have 3 prebuilt memory blocks:

- `StaticMemoryBlock`: A memory block that stores a static piece of information.
- `FactExtractionMemoryBlock`: A memory block that extracts facts from the chat history.
- `VectorMemoryBlock`: A memory block that stores and retrieves batches of chat messages from a vector database.

Each block has a `priority` that decides what happens when long-term memory plus short-term memory exceeds the token limit. Priority 0 means the block will always be kept in memory, while blocks with priority 1 and above can be temporarily disabled to bring the total back under the limit.
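As a rough illustration of priority-based selection, the sketch below drops blocks until the total fits a budget. It is a simplified model under stated assumptions, not the library's algorithm: `ToyBlock`, `select_blocks`, and the token counts are hypothetical, and it assumes that blocks with larger priority numbers are disabled first while priority 0 blocks are never dropped.

```python
from dataclasses import dataclass


@dataclass
class ToyBlock:
    name: str
    priority: int
    tokens: int  # hypothetical size of the block's rendered content


def select_blocks(blocks, budget):
    """Return the names of the blocks kept after enforcing the budget."""
    kept = sorted(blocks, key=lambda b: b.priority)
    # Assumption: drop the block with the largest priority number first,
    # and never drop a priority 0 block, even if still over budget.
    while (
        kept
        and kept[-1].priority != 0
        and sum(b.tokens for b in kept) > budget
    ):
        kept.pop()
    return [b.name for b in kept]


toy_blocks = [
    ToyBlock("core_info", priority=0, tokens=30),
    ToyBlock("extracted_info", priority=1, tokens=100),
    ToyBlock("vector_memory", priority=2, tokens=400),
]

print(select_blocks(toy_blocks, budget=200))  # vector_memory is dropped
print(select_blocks(toy_blocks, budget=50))   # only core_info remains
```

Under these assumptions, a static block holding essential context stays at priority 0, while larger and more expendable blocks get higher numbers.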


In [7]:
from llama_index.core.memory import (
    StaticMemoryBlock,
    FactExtractionMemoryBlock,
    VectorMemoryBlock,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

llm = OpenAI(model="gpt-4.1-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

client = chromadb.EphemeralClient()
vector_store = ChromaVectorStore(
    chroma_collection=client.get_or_create_collection("test_collection")
)

blocks = [
    StaticMemoryBlock(
        name="core_info",
        static_content="My name is ASDRP Agent.  I live in Fremont, CA and I love to talk about nested Matryoshka dolls.",
        priority=0,
    ),
    FactExtractionMemoryBlock(
        name="extracted_info",
        llm=llm,
        max_facts=50,
        priority=1,
    ),
    VectorMemoryBlock(
        name="vector_memory",
        # required: pass in a vector store like qdrant, chroma, weaviate, milvus, etc.
        vector_store=vector_store,
        priority=2,
        embed_model=embed_model,
        # The top-k message batches to retrieve
        # similarity_top_k=2,
        # optional: How many previous messages to include in the retrieval query
        # retrieval_context_window=5
        # optional: pass optional node-postprocessors for things like similarity threshold, etc.
        # node_postprocessors=[...],
    ),
]

With our blocks created, we can pass them into the Memory class.


In [8]:
from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=30000,
    # Setting an extremely low ratio so that more tokens are flushed to long-term memory
    chat_history_token_ratio=0.02,
    token_flush_size=500,
    memory_blocks=blocks,
    # insert into the latest user message, can also be "system"
    insert_method="user",
)

With this, we can simulate a conversation with an agent and inspect the long-term memory.


In [9]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

agent = FunctionAgent(
    tools=[],
    llm=llm,
)

user_msgs = [
    "Hi! My name is Jerry",
    "What is your opinion on nested Matryoshka dolls?",
    "What is the most popular nesting doll?",
    "In history, what is the most significant nesting doll?",
    "What is the most expensive nesting doll?",
    "I am interested in buying a nesting doll, what is the most popular nesting doll?",
    "What is the most valuable nesting doll?",
    "Last week, I bought a nesting doll.",
    "What is the most rare nesting doll?",
    "I am thinking about the historical significance of nesting dolls, what is the most interesting nesting doll?",
    "What is the most unique nesting doll?",
    "Why are nesting dolls so popular?",
    "What is the most interesting nesting doll?",
]

for user_msg in user_msgs:
    _ = await agent.run(user_msg=user_msg, memory=memory)

Now, let's inspect the most recent user message and see what the memory inserts into it.

Note that we pass in at least one chat message so that the vector memory actually runs retrieval.


In [11]:
chat_history = await memory.aget()
for chat in chat_history:
    print(f"==> {chat}")

==> assistant: Hi Jerry! Nesting dolls, especially the traditional Russian Matryoshka dolls, are so popular for several reasons:

1. **Symbolism and Meaning**: They represent family, motherhood, fertility, and continuity, with each smaller doll nested inside a larger one symbolizing generations or layers of life. This deep cultural and emotional symbolism resonates with many people.

2. **Craftsmanship and Artistry**: The intricate hand-painted designs and the skill required to carve perfectly fitting wooden dolls showcase impressive craftsmanship, making them beautiful collectibles and works of folk art.

3. **Cultural Icon**: Matryoshka dolls have become a recognizable symbol of Russian culture worldwide, often serving as souvenirs or gifts that carry cultural significance.

4. **Playfulness and Curiosity**: The nesting feature invites curiosity and interaction—opening one doll to reveal another smaller one inside creates a delightful experience for both children and adults.

5. **Ve

Great, we can see that the current FIFO queue is only 2-3 messages (expected since we set the chat history token ratio to 0.02).

Now, let's inspect the long-term memory blocks that are inserted into the latest user message.


In [12]:
for block in chat_history[-2].blocks:
    print(block.text)

<memory>
<core_info>
My name is ASDRP Agent.  I live in Fremont, CA and I love to talk about nested Matryoshka dolls.
</core_info>
<extracted_info>
<fact>User's name is Jerry.</fact>
<fact>User is interested in nested Matryoshka dolls.</fact>
<fact>User bought a nesting doll last week.</fact>
</extracted_info>
<vector_memory>
<message role='user'>I am thinking about the historical significance of nesting dolls, what is the most interesting nesting doll?</message>
<message role='assistant'>That's a great area to explore, Jerry! One of the most interesting nesting dolls from a historical and cultural perspective is the original Russian Matryoshka doll created in 1890 by Vasily Zvyozdochkin and painted by Sergey Malyutin. This first set featured a peasant woman with a series of smaller dolls inside, symbolizing motherhood, fertility, and the unity of family. 

What makes it particularly fascinating is how it captured the essence of Russian folk art and became a symbol of family and contin

To use this memory outside an agent, and to highlight more of the usage, you might do something like the following:


In [13]:
new_user_msg = ChatMessage(
    role="user", content="What kind of doll was I asking about?"
)
await memory.aput(new_user_msg)

# Get the new chat history
new_chat_history = await memory.aget()
resp = await llm.achat(new_chat_history)
await memory.aput(resp.message)
print(resp.message.content)

You were asking about nesting dolls, specifically nested Matryoshka dolls.
