### LangChain Essentials

# Conversational Memory for OpenAI - LangChain #3

Conversational memory allows our chatbots and agents to remember previous interactions within a conversation. Without conversational memory, our chatbots would only ever be able to respond to the last message they received, essentially forgetting all previous messages with each new message.

Naturally, conversations require our chatbots to be able to respond over multiple interactions and refer to previous messages to understand the context of the conversation.

---

> ⚠️ We will be using OpenAI for this example, which allows us to run everything via API. If you would like to use Ollama instead, please see the [Ollama version](https://github.com/aurelio-labs/agents-course/blob/main/04-langchain-ecosystem/01-langchain-essentials/03-conversational-memory-ollama.ipynb) of this example.

---

## LangChain's Memory Types

LangChain versions `0.0.x` included various conversational memory types. Most of these are due for deprecation, but they are still valuable for understanding the different approaches we can take to building conversational memory.

Throughout the notebook we will be referring to these _older_ memory types and then rewriting them using the recommended `RunnableWithMessageHistory` class. We will learn about:

* `ConversationBufferMemory`: the simplest and most intuitive form of conversational memory, keeping track of a conversation without any additional bells and whistles.
* `ConversationBufferWindowMemory`: similar to `ConversationBufferMemory`, but only keeps track of the last `k` messages.
* `ConversationSummaryMemory`: rather than keeping track of the entire conversation, this memory type keeps track of a summary of the conversation.
* `ConversationTokenBufferMemory`: similar to `ConversationBufferMemory`, but only keeps track of the last `n` tokens.
* `ConversationSummaryBufferMemory`: merges the `ConversationSummaryMemory` and `ConversationTokenBufferMemory` types.

We'll work through each of these memory types in turn, and rewrite each one using the `RunnableWithMessageHistory` class.

## Initialize our LLM

Before jumping into our memory types, let's initialize our LLM. We will use OpenAI's `gpt-4o` model.

In [1]:
import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "sk-proj-..."

# temperature=0.0 keeps responses as consistent as possible
llm = ChatOpenAI(temperature=0.0, model="gpt-4o")

## 1. `ConversationBufferMemory`

`ConversationBufferMemory` is the simplest form of conversational memory. It is _literally_ just a place where we store messages, which we then feed into our LLM.

Let's start with LangChain's original `ConversationBufferMemory` object.

In [2]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

There are several ways to add messages to our memory. Using the `save_context` method, we can add a user query (via the `input` key) and the AI's response (via the `output` key). So, to create the following conversation:

```
User: Hi, my name is Josh
AI: Hey Josh, what's up? I'm an AI model.
User: Not much, just hanging
AI: Cool
```

We do:

In [3]:
memory.save_context(
    {"input": "Hi, my name is Josh"},  # user message
    {"output": "Hey Josh, what's up? I'm an AI model."}  # AI response
)
memory.save_context(
    {"input": "Not much, just hanging"},  # user message
    {"output": "Cool"}  # AI response
)

Before using the memory, we need to load in any variables for that memory type — in this case, there are none, so we just pass an empty dictionary:

In [4]:
memory.load_memory_variables({})

{'history': "Human: Hi, my name is Josh\nAI: Hey Josh, what's up? I'm an AI model.\nHuman: Not much, just hanging\nAI: Cool"}

With that, we've created our buffer memory. Before feeding it into our LLM, let's quickly look at the alternative method for adding messages to our memory. With this other method, we pass individual user and AI messages via the `add_user_message` and `add_ai_message` methods of `memory.chat_memory`. To reproduce what we did above, we do:

In [5]:
memory = ConversationBufferMemory()

memory.chat_memory.add_user_message("Hi, my name is Josh")
memory.chat_memory.add_ai_message("Hey Josh, what's up? I'm an AI model.")
memory.chat_memory.add_user_message("Not much, just hanging")
memory.chat_memory.add_ai_message("Cool")

memory.load_memory_variables({})

{'history': "Human: Hi, my name is Josh\nAI: Hey Josh, what's up? I'm an AI model.\nHuman: Not much, just hanging\nAI: Cool"}
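To make the format of that `history` string concrete, here is a minimal plain-Python sketch of what a buffer memory does conceptually. `SimpleBufferMemory` and the variable `sketch` are hypothetical names made up for illustration. This is not LangChain's implementation, only an assumption about the behaviour that reproduces the output shown above.

```python
class SimpleBufferMemory:
    """Hypothetical stand-in for a buffer memory (not a LangChain class)."""

    def __init__(self):
        # each entry is a (role, text) tuple kept in arrival order
        self.messages = []

    def save_context(self, inputs, outputs):
        # store one user message and the matching AI response
        self.messages.append(("Human", inputs["input"]))
        self.messages.append(("AI", outputs["output"]))

    def load_memory_variables(self):
        # render the full buffer as one "Role: text" line per message
        lines = [f"{role}: {text}" for role, text in self.messages]
        return {"history": "\n".join(lines)}


sketch = SimpleBufferMemory()
sketch.save_context(
    {"input": "Hi, my name is Josh"},
    {"output": "Hey Josh, what's up? I'm an AI model."},
)
sketch.save_context({"input": "Not much, just hanging"}, {"output": "Cool"})
print(sketch.load_memory_variables()["history"])
```

Nothing is ever removed from the list, which is why this memory type grows with every turn of the conversation.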

The outcome is exactly the same in either case. To pass this on to our LLM, we need to create a `ConversationChain` object. This class is already deprecated in favor of `RunnableWithMessageHistory`, which we will cover in a moment.

In [6]:
from langchain.chains import ConversationChain

chain = ConversationChain(
    llm=llm, 
    memory=memory,
    verbose=True
)

In [7]:
chain.invoke({"input": "what is my name again?"})

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Hi, my name is Josh
AI: Hey Josh, what's up? I'm an AI model.
Human: Not much, just hanging
AI: Cool
Human: what is my name again?
AI:

> Finished chain.


{'input': 'what is my name again?',
 'history': "Human: Hi, my name is Josh\nAI: Hey Josh, what's up? I'm an AI model.\nHuman: Not much, just hanging\nAI: Cool",
 'response': "Your name is Josh! How's your day going so far?"}
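The verbose output shows that the chain drops the stored history and the new input into a fixed prompt. A rough sketch of that step in plain Python follows, assuming a template with `{history}` and `{input}` slots. The template text is copied from the output above, while `TEMPLATE`, `build_prompt`, and `history_text` are hypothetical names used only for illustration, not LangChain functions.

```python
TEMPLATE = (
    "The following is a friendly conversation between a human and an AI. "
    "The AI is talkative and provides lots of specific details from its context. "
    "If the AI does not know the answer to a question, it truthfully says it "
    "does not know.\n\n"
    "Current conversation:\n"
    "{history}\n"
    "Human: {input}\n"
    "AI:"
)


def build_prompt(history: str, user_input: str) -> str:
    # fill the two slots; the model then continues the text after "AI:"
    return TEMPLATE.format(history=history, input=user_input)


history_text = (
    "Human: Hi, my name is Josh\n"
    "AI: Hey Josh, what's up? I'm an AI model.\n"
    "Human: Not much, just hanging\n"
    "AI: Cool"
)
prompt = build_prompt(history_text, "what is my name again?")
print(prompt)
```

Because the whole history is pasted into every prompt, the model can answer the name question even though the API call itself is stateless.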

### Implementing with `RunnableWithMessageHistory`

As mentioned, the `ConversationBufferMemory` type is due for deprecation. Instead, we can use the `RunnableWithMessageHistory` class to implement the same functionality.

When implementing `RunnableWithMessageHistory` we will use **L**ang**C**hain **E**xpression **L**anguage (LCEL) and for this we need to define our prompt template and LLM components. Our `llm` has already been defined, so now we just define a `ChatPromptTemplate` object.

In [8]:
from langchain.prompts import (
    SystemMessagePromptTemplate, 
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
    ChatPromptTemplate
)

system_prompt = "You are a helpful assistant called Zeta."

prompt_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_prompt),
    MessagesPlaceholder(variable_name="history"),
    HumanMessagePromptTemplate.from_template("{query}"),
])

We can link our `prompt_template` and our `llm` together to create a pipeline via LCEL.

In [9]:
pipeline = prompt_template | llm

To add memory, we wrap our `pipeline` in a `RunnableWithMessageHistory` object. This object takes a few input parameters. One of those is `get_session_history`, a function that takes a session ID and returns the chat history object for that session. We define this function ourselves:

In [21]:
from langchain_core.chat_history import InMemoryChatMessageHistory

chat_map = {}
def get_chat_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in chat_map:
        # if session ID doesn't exist, create a new chat history
        chat_map[session_id] = InMemoryChatMessageHistory()
    return chat_map[session_id]

We also need to tell our runnable which variable name to use for the chat history (i.e. `history`) and which to use for the user's query (i.e. `query`).

In [22]:
from langchain_core.runnables.history import RunnableWithMessageHistory

pipeline_with_history = RunnableWithMessageHistory(
    pipeline,
    get_session_history=get_chat_history,
    input_messages_key="query",
    history_messages_key="history"
)

Now we invoke our runnable:

In [25]:
pipeline_with_history.invoke(
    {"query": "Hi, my name is Josh"},
    config={"session_id": "id_123"}
)

AIMessage(content='Hello, Josh! How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 12, 'prompt_tokens': 26, 'total_tokens': 38, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_d28bcae782', 'finish_reason': 'stop', 'logprobs': None}, id='run-c4fcf16c-f95f-43a9-b339-c184902f7c29-0', usage_metadata={'input_tokens': 26, 'output_tokens': 12, 'total_tokens': 38, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

Our chat history will now be memorized and retrieved whenever we invoke our runnable with the same session ID.

In [26]:
pipeline_with_history.invoke(
    {"query": "What is my name again?"},
    config={"session_id": "id_123"}
)

AIMessage(content='Your name is Josh. How can I help you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 51, 'total_tokens': 64, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_5f20662549', 'finish_reason': 'stop', 'logprobs': None}, id='run-fb5d7669-1391-4ef3-ac8b-934adc9ef61c-0', usage_metadata={'input_tokens': 51, 'output_tokens': 13, 'total_tokens': 64, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
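Conceptually, each call does three things: it loads the history for the given session ID, sends that history plus the new query to the model, and then appends both the query and the reply to the history. The sketch below mimics that loop in plain Python. It is an illustration under assumptions, not LangChain's code: `sessions`, `fake_llm`, and `invoke_with_history` are hypothetical names, and the "model" only reports how many earlier messages it was given.

```python
from collections import defaultdict

# one message list per session ID (hypothetical in-memory store)
sessions = defaultdict(list)


def fake_llm(messages):
    # stand-in for a real model: reports how much context it received
    return f"I can see {len(messages) - 1} earlier message(s)."


def invoke_with_history(query, session_id):
    # 1. load the history that belongs to this session
    history = sessions[session_id]
    # 2. call the model with the history plus the new query
    reply = fake_llm(history + [("human", query)])
    # 3. store the new exchange so the next call can see it
    history.append(("human", query))
    history.append(("ai", reply))
    return reply


print(invoke_with_history("Hi, my name is Josh", "id_123"))
print(invoke_with_history("What is my name again?", "id_123"))
print(invoke_with_history("What is my name again?", "id_456"))
```

The third call uses a different session ID, so it starts from an empty history and has no way of knowing the name.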

We have now recreated the `ConversationBufferMemory` type using the `RunnableWithMessageHistory` class. Let's continue on to the other memory types and see how these can be implemented.

## 2. `ConversationBufferWindowMemory`

The `ConversationBufferWindowMemory` type is similar to `ConversationBufferMemory`, but only keeps track of the last `k` messages. There are a few reasons why we would want to keep only the last `k` messages:

* More messages mean more tokens are sent with each request, and more tokens increase latency _and_ cost.

* LLMs tend to perform worse when given more tokens, making them more likely to deviate from instructions, hallucinate, or _"forget"_ information provided to them. Conciseness is key to high-performing LLMs.

* If we keep _all_ messages, we will eventually hit the LLM's context window limit. By choosing a suitable window size `k`, we can avoid hitting this limit.

The buffer window solves many problems that we encounter with the standard buffer memory, while still being a very simple and intuitive form of conversational memory.
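The windowing idea itself is small enough to sketch in plain Python. The class below is a hypothetical message store, not a LangChain class. It counts individual messages, matching the description above; counting human/AI pairs instead would only change the slice to `2 * k`.

```python
class WindowHistory:
    """Hypothetical message store that keeps only the last k messages."""

    def __init__(self, k):
        self.k = k
        self.messages = []

    def add_messages(self, new_messages):
        self.messages.extend(new_messages)
        # drop everything except the k most recent messages
        self.messages = self.messages[-self.k:] if self.k > 0 else []


window = WindowHistory(k=2)
window.add_messages([
    ("human", "Hi, my name is Josh"),
    ("ai", "Hey Josh, what's up? I'm an AI model."),
])
window.add_messages([
    ("human", "Not much, just hanging"),
    ("ai", "Cool"),
])
print(window.messages)
```

With `k=2` the first exchange, including the name, is already gone after the second one is added. That is the trade-off of this memory type.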

## Conversation Summary Memory

*In this section we will go over LangChain's conversation summary memory.*

As before, we start by importing the class we need, in this case `ConversationSummaryMemory`.

In [16]:
from langchain.memory import ConversationSummaryMemory

The setup is much the same as before, but this time we pass our LLM to the memory as well as to the conversation chain, because the memory uses the LLM to write its summary.

In [17]:
memory = ConversationSummaryMemory(llm=llm)

In [18]:
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=False
)

Now let's test how well this conversation recalls earlier messages with a few questions.

In [None]:
conversation.predict(input="Hi, my name is James")

In [None]:
conversation.predict(input="What is 9 + 10?")

In [None]:
conversation.predict(input="What is my name?")
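The summary memory used in this section can be pictured as a single string that is rewritten after every turn. The sketch below shows that flow in plain Python under stated assumptions: `SummaryMemorySketch` and `stub_summarizer` are hypothetical names, and the stub replaces the LLM call with a simple turn counter so that the example runs without an API key.

```python
class SummaryMemorySketch:
    """Hypothetical summary memory: keeps one summary string, no raw messages."""

    def __init__(self, summarizer):
        self.summarizer = summarizer
        self.summary = ""  # the only state kept between turns

    def save_context(self, user_text, ai_text):
        new_lines = f"Human: {user_text}\nAI: {ai_text}"
        # fold the new exchange into the summary, then discard the raw text
        self.summary = self.summarizer(self.summary, new_lines)


def stub_summarizer(summary, new_lines):
    # stand-in for an LLM call; it only counts the turns it has absorbed
    turns = int(summary.split()[0]) if summary else 0
    return f"{turns + 1} turn(s) summarized so far"


summary_memory = SummaryMemorySketch(stub_summarizer)
summary_memory.save_context("Hi, my name is James", "Hello James!")
summary_memory.save_context("What is 9 + 10?", "19")
print(summary_memory.summary)
```

With a real LLM as the summarizer, every saved turn costs an extra model call, and whether a detail such as the name survives depends on what the model chooses to keep in the summary.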

## Conversation Buffer Window Memory

*In this section we will go over LangChain's conversation buffer window memory.*

Once again, we import the class we need, this time for conversation buffer window memory.

In [22]:
from langchain.memory import ConversationBufferWindowMemory

The setup is the same as before, except this time we set `k=1`. Here `k` is how many individual parts of our conversation the memory will store, as showcased below, which directly limits token usage.

In [23]:
memory = ConversationBufferWindowMemory(k=1)  # no LLM needed, the window only slices the buffer

In [24]:
memory.save_context({"input": "Hi"},
                    {"output": "What's up"})
memory.save_context({"input": "Not much, just hanging"},
                    {"output": "Cool"})


In [None]:
memory.load_memory_variables({})

Now let's test a conversation that uses this memory.

In [26]:
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=False
)

In [None]:
conversation.predict(input="Hi, my name is James")

In [None]:
conversation.predict(input="What is 9+10?")

In [None]:
conversation.predict(input="What is my name?")

As you can see, the main downside is that the name is dropped from memory, which is a problem for long conversations that return to earlier topics. The upside is that the window cuts down on the number of tokens we send, as we can see below.

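In [None]:
# assumes the langchain-community package is installed
from langchain_community.callbacks import get_openai_callback

# track the tokens spent on a single call to the chain
with get_openai_callback() as cb:
    conversation.predict(input="How are you today?")
print(f"Spent a total of {cb.total_tokens} tokens")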

## Conversation Token Buffer Memory

*In this section we will go over LangChain's conversation token buffer memory.*

Next is conversation token buffer memory.

In [31]:
from langchain.memory import ConversationTokenBufferMemory

Now we set the maximum number of tokens the memory may hold. This is very useful for controlling cost, but note that a request can still use more tokens than this limit, because the query itself also consumes tokens.

In [32]:
memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=70)

Here we are saving a 'fixed' conversation into our memory. If you run this yourself and change the token limit, you will see that older sections of the memory are dropped to stay within it.

In [33]:
memory.save_context({"input": "AI is what?!"},
                    {"output": "Amazing!"})
memory.save_context({"input": "Backpropagation is what?"},
                    {"output": "Beautiful!"})
memory.save_context({"input": "Chatbots are what?"}, 
                    {"output": "Charming!"})

In [None]:
memory.load_memory_variables({})

In [35]:
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

In [None]:
conversation.predict(input="Dandelions are what?")
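The pruning behaviour of a token buffer can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than LangChain's implementation: `rough_token_count` and `trim_to_token_limit` are hypothetical helpers, and counting one token per whitespace-separated word is a crude stand-in for the model's real tokenizer.

```python
def rough_token_count(text):
    # crude assumption: one token per whitespace-separated word
    return len(text.split())


def trim_to_token_limit(messages, max_tokens):
    """Drop the oldest messages until the total is within max_tokens."""
    kept = list(messages)
    while kept and sum(rough_token_count(m) for m in kept) > max_tokens:
        kept.pop(0)  # remove the oldest message first
    return kept


buffer = [
    "Human: AI is what?!",
    "AI: Amazing!",
    "Human: Backpropagation is what?",
    "AI: Beautiful!",
    "Human: Chatbots are what?",
    "AI: Charming!",
]
print(trim_to_token_limit(buffer, max_tokens=8))
```

Note that trimming by tokens can cut an exchange in half: here the reply about backpropagation survives while the question that prompted it does not.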

## Conversation Summary Buffer Memory

*In this section we will go over LangChain's conversation summary buffer memory.*

Finally, we look at conversation summary buffer memory.

In [37]:
from langchain.memory import ConversationSummaryBufferMemory

Here we are feeding in a schedule that the AI will use to inform us of anything we need to know.

In [38]:
# create a long string
schedule = "There is a meeting at 8am with James. \
You will need your cursor AI course prepared. \
9am-12pm have time to do more with Llama and LangChain. \
At Noon, lunch at the Italian restaurant with a customer who is driving \
from over an hour away to meet you to understand the latest in AI. \
Be sure to bring your laptop to show the latest LLM demo."

# messages beyond max_token_limit are folded into a summary by the LLM
memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=100)
memory.save_context({"input": "Hello"}, {"output": "What's up"})
memory.save_context({"input": "Not much, just hanging"},
                    {"output": "Cool"})
memory.save_context({"input": "What is on the schedule today?"},
                    {"output": schedule})

In [None]:
memory.load_memory_variables({})

In [40]:
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

As you can see from the previous snippets, the memory uses the token limit to decide when to summarise what has happened rather than keep exact details (which it only keeps while it can afford to). This is especially useful for general conversation, where the topic is needed to understand the question being asked, but it is a poor fit when exact details matter.

In [None]:
conversation.predict(input="What would be a good demo to show?")

In [None]:
memory.load_memory_variables({})
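The summary buffer combines the two previous ideas: recent messages are kept verbatim up to a token limit, and whatever falls out of that window is folded into a running summary. A plain-Python sketch of that pruning step follows. All names here (`rough_token_count`, `stub_summarize`, `prune`) are hypothetical, the word-based token count is a crude assumption, and the stub stands in for the LLM call that would write the actual summary.

```python
def rough_token_count(text):
    # crude assumption: one token per whitespace-separated word
    return len(text.split())


def stub_summarize(summary, dropped):
    # stand-in for an LLM call that would fold the dropped messages into prose
    return f"{summary} [{len(dropped)} older message(s) summarized]".strip()


def prune(summary, messages, max_tokens, summarize=stub_summarize):
    """Keep recent messages within max_tokens; summarize whatever is dropped."""
    kept = list(messages)
    dropped = []
    while kept and sum(rough_token_count(m) for m in kept) > max_tokens:
        dropped.append(kept.pop(0))  # oldest messages leave the buffer first
    if dropped:
        summary = summarize(summary, dropped)
    return summary, kept


messages = [
    "Human: Hello",
    "AI: What's up",
    "Human: What is on the schedule today?",
    "AI: There is a meeting at 8am with James.",
]
summary, recent = prune("", messages, max_tokens=16)
print(summary)
print(recent)
```

The exact wording of the two most recent messages is preserved, while the greeting survives only through the summary. This mirrors the trade-off described above: the topic is retained, but exact details of older turns are not.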