# How to cache chat model responses

Caching chat model responses offers two primary benefits:

1. Cost Efficiency: Reduces the number of API calls to LLM providers, which can lower expenses, especially during development phases.

2. Performance Enhancement: Speeds up application response times by avoiding redundant API requests.

This is particularly advantageous when the same prompts are sent repeatedly, such as in test suites or in applications that answer frequently asked questions.
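
To make the "fewer API calls" benefit concrete, here is a minimal sketch of the lookup-then-store pattern that a response cache follows. It is not LangChain's implementation: `SimpleResponseCache`, `CountingModel`, and `make_cached_model` are hypothetical names used only for illustration, and the "model" is a local stand-in rather than a real provider call.

```python
from typing import Callable, Dict, Optional, Tuple


class SimpleResponseCache:
    """Toy in-memory cache keyed on (prompt, model configuration string)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        # Return the stored response, or None on a cache miss.
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._store[(prompt, llm_string)] = response


class CountingModel:
    """Hypothetical stand-in for a paid API; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"response to: {prompt}"


def make_cached_model(
    model_fn: Callable[[str], str],
    llm_string: str,
    cache: SimpleResponseCache,
) -> Callable[[str], str]:
    """Wrap model_fn so that repeated prompts are answered from the cache."""

    def cached(prompt: str) -> str:
        hit = cache.lookup(prompt, llm_string)
        if hit is not None:
            return hit  # cache hit: no call to the underlying model
        response = model_fn(prompt)  # cache miss: call the model once
        cache.update(prompt, llm_string, response)
        return response

    return cached


model = CountingModel()
cached_model = make_cached_model(model, "fake-model/temperature=0", SimpleResponseCache())

first = cached_model("Tell me a joke")
second = cached_model("Tell me a joke")   # same prompt: served from the cache
third = cached_model("Tell me a riddle")  # new prompt: calls the model again

print(model.calls)
```

Three requests result in only two calls to the underlying model. Including the model configuration in the key is a design choice of this sketch: it keeps a response generated under one set of settings from being returned for another.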

In [3]:
import getpass
import os
from langchain.chat_models import init_chat_model

In [4]:
if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")


In [5]:
llm = init_chat_model("llama3-8b-8192", model_provider="groq")

In [6]:
from langchain_core.globals import set_llm_cache

In [7]:
%%time
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: total: 31.2 ms
Wall time: 1.51 s


AIMessage(content="Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(Wait for it...)\n\nBecause it was two-tired!\n\nHope that made you laugh!", additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 33, 'prompt_tokens': 14, 'total_tokens': 47, 'completion_time': 0.0275, 'prompt_time': 0.002261, 'queue_time': 0.25006279, 'total_time': 0.029761}, 'model_name': 'llama3-8b-8192', 'system_fingerprint': 'fp_179b0f92c9', 'finish_reason': 'stop', 'logprobs': None}, id='run-ee23eb90-ad9b-460b-951b-12782cae4b4e-0', usage_metadata={'input_tokens': 14, 'output_tokens': 33, 'total_tokens': 47})

In [8]:
%%time
# The second call finds the response in the cache, so it returns much faster
llm.invoke("Tell me a joke")

CPU times: total: 0 ns
Wall time: 0 ns


AIMessage(content="Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(Wait for it...)\n\nBecause it was two-tired!\n\nHope that made you laugh!", additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 33, 'prompt_tokens': 14, 'total_tokens': 47, 'completion_time': 0.0275, 'prompt_time': 0.002261, 'queue_time': 0.25006279, 'total_time': 0.029761}, 'model_name': 'llama3-8b-8192', 'system_fingerprint': 'fp_179b0f92c9', 'finish_reason': 'stop', 'logprobs': None}, id='run-ee23eb90-ad9b-460b-951b-12782cae4b4e-0', usage_metadata={'input_tokens': 14, 'output_tokens': 33, 'total_tokens': 47})
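
The `0 ns` wall time reported for the cached call probably reflects the resolution of the timer rather than a truly instantaneous call. As a rough sketch, assuming a finer-grained measurement is wanted, `time.perf_counter` can be wrapped in a small helper. `timed_call` and `slow_echo` are hypothetical names introduced here; `slow_echo` only simulates a slow call and stands in for `llm.invoke`.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds) using a high-resolution clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slow_echo(prompt: str) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    time.sleep(0.05)  # simulate latency
    return prompt.upper()


result, elapsed = timed_call(slow_echo, "Tell me a joke")
print(f"{elapsed * 1000:.3f} ms")
```

In the notebook above, the same helper could be used as `timed_call(llm.invoke, "Tell me a joke")`, once before and once after the response has been cached, to compare the two durations.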