# [How to cache chat model responses](https://python.langchain.com/v0.2/docs/how_to/chat_model_caching/)

LangChain provides an optional caching layer for chat models. This is useful for two main reasons:

* It can save you money by reducing the number of API calls you make to the LLM provider when you often request the same completion. This is especially useful during app development.
* It can speed up your application, because a cached response is returned without a network round trip to the LLM provider.

This guide shows how to enable caching in your apps.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")



In [3]:
from langchain_core.globals import set_llm_cache

## In-Memory Cache
This is an ephemeral cache that stores model calls in memory. It will be wiped when your environment restarts, and is not shared across processes.

In [4]:
%%time

from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 32.2 ms, sys: 14.4 ms, total: 46.7 ms
Wall time: 543 ms


AIMessage(content='Why did the scarecrow win an award? \n\nBecause he was outstanding in his field!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 18, 'prompt_tokens': 11, 'total_tokens': 29}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_54e2f484be', 'finish_reason': 'stop', 'logprobs': None}, id='run-921897ba-aaf7-4dc0-b276-a6fa1417a466-0', usage_metadata={'input_tokens': 11, 'output_tokens': 18, 'total_tokens': 29})

In [5]:
%%time
# The second time, the response is already in the cache, so the call returns faster
llm.invoke("Tell me a joke")

CPU times: user 1.98 ms, sys: 581 μs, total: 2.56 ms
Wall time: 2.89 ms


AIMessage(content='Why did the scarecrow win an award? \n\nBecause he was outstanding in his field!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 18, 'prompt_tokens': 11, 'total_tokens': 29}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_54e2f484be', 'finish_reason': 'stop', 'logprobs': None}, id='run-921897ba-aaf7-4dc0-b276-a6fa1417a466-0', usage_metadata={'input_tokens': 11, 'output_tokens': 18, 'total_tokens': 29})
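
The two outputs above are identical, down to the run `id`, because the second call is answered from the cache rather than by the provider. The sketch below illustrates that lookup-then-store pattern using only the standard library. It is a simplified illustration, not LangChain's implementation: `ToyInMemoryCache`, `cached_call`, `slow_fake_model`, and the `model_key` string are hypothetical names, and the sketch assumes the cache is keyed on the prompt plus a string identifying the model configuration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ToyInMemoryCache:
    """Hypothetical stand-in for an in-memory cache (not LangChain's class)."""

    def __init__(self) -> None:
        # Assumption: entries are keyed on (prompt, model configuration string).
        self._store: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, model_key: str) -> Optional[str]:
        return self._store.get((prompt, model_key))

    def update(self, prompt: str, model_key: str, response: str) -> None:
        self._store[(prompt, model_key)] = response


def cached_call(
    cache: ToyInMemoryCache,
    model_key: str,
    prompt: str,
    call_model: Callable[[str], str],
) -> str:
    """Return the cached response if present, otherwise call the model and store it."""
    hit = cache.lookup(prompt, model_key)
    if hit is not None:
        return hit
    response = call_model(prompt)
    cache.update(prompt, model_key, response)
    return response


calls: List[str] = []  # records every prompt that actually reaches the "model"


def slow_fake_model(prompt: str) -> str:
    """Hypothetical model stub: sleeps briefly to imitate network latency."""
    calls.append(prompt)
    time.sleep(0.05)
    return f"response #{len(calls)} to {prompt!r}"


cache = ToyInMemoryCache()
first = cached_call(cache, "fake-model", "Tell me a joke", slow_fake_model)
second = cached_call(cache, "fake-model", "Tell me a joke", slow_fake_model)
other = cached_call(cache, "fake-model", "Tell me a riddle", slow_fake_model)

print(first == second)  # the repeated prompt is served from the cache
print(len(calls))       # only the two distinct prompts reached the stub
```

Because the dictionary lives inside the Python process, it disappears on restart and is invisible to other processes, which matches the limits of the in-memory cache described in this section.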

## SQLite Cache
This cache implementation stores responses in a SQLite database file, so cached responses persist across process restarts.

In [6]:
# Remove any existing cache file so the first call below is a cache miss
!rm -f .langchain.db


In [7]:
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [8]:
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 61.2 ms, sys: 13.7 ms, total: 74.9 ms
Wall time: 741 ms


AIMessage(content="Why don't skeletons fight each other?\n\nThey don't have the guts!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 11, 'total_tokens': 25}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_483d39d857', 'finish_reason': 'stop', 'logprobs': None}, id='run-a946337d-f5d8-45ba-854b-1483da0eeabe-0', usage_metadata={'input_tokens': 11, 'output_tokens': 14, 'total_tokens': 25})

In [9]:
%%time
# The second time, the response is already in the cache, so the call returns faster
llm.invoke("Tell me a joke")

CPU times: user 39.2 ms, sys: 25 ms, total: 64.2 ms
Wall time: 109 ms


AIMessage(content="Why don't skeletons fight each other?\n\nThey don't have the guts!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 11, 'total_tokens': 25}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_483d39d857', 'finish_reason': 'stop', 'logprobs': None}, id='run-a946337d-f5d8-45ba-854b-1483da0eeabe-0', usage_metadata={'input_tokens': 11, 'output_tokens': 14, 'total_tokens': 25})
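
To show why a file-backed cache survives restarts, here is a minimal sketch built on the standard library's `sqlite3` module. It is an illustration under stated assumptions, not the schema or code that `SQLiteCache` uses: the `toy_cache` table, the `open_cache`, `lookup`, and `update` helpers, and the `model_key` string are all hypothetical.

```python
import os
import sqlite3
import tempfile
from typing import Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS toy_cache ("
        "prompt TEXT NOT NULL, "
        "model_key TEXT NOT NULL, "
        "response TEXT NOT NULL, "
        "PRIMARY KEY (prompt, model_key))"
    )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, prompt: str, model_key: str) -> Optional[str]:
    """Return the stored response, or None on a cache miss."""
    row = conn.execute(
        "SELECT response FROM toy_cache WHERE prompt = ? AND model_key = ?",
        (prompt, model_key),
    ).fetchone()
    return row[0] if row else None


def update(conn: sqlite3.Connection, prompt: str, model_key: str, response: str) -> None:
    """Store (or overwrite) the response for this prompt and model key."""
    conn.execute(
        "INSERT OR REPLACE INTO toy_cache (prompt, model_key, response) "
        "VALUES (?, ?, ?)",
        (prompt, model_key, response),
    )
    conn.commit()


# Use a throwaway directory so the sketch does not touch .langchain.db.
db_path = os.path.join(tempfile.mkdtemp(), "toy_cache.db")

# First "process": the entry is missing, so we store it.
conn = open_cache(db_path)
miss = lookup(conn, "Tell me a joke", "fake-model")
update(conn, "Tell me a joke", "fake-model", "A cached joke")
conn.close()

# Second "process": a fresh connection still finds the entry on disk.
conn = open_cache(db_path)
hit = lookup(conn, "Tell me a joke", "fake-model")
conn.close()

print(miss)  # None
print(hit)   # A cached joke
```

Reading from disk costs more than a dictionary lookup, which is consistent with the cached SQLite call above taking longer than the cached in-memory call while still being faster than the uncached request.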