# Can OpenAI's new prompt caching feature boost the effectiveness of RAG applications?


✅ **Prompt caching** is applied automatically by OpenAI: the API detects when a prompt starts with a prefix it has recently processed and reuses the cached computation for that prefix instead of reprocessing it from scratch.


✅ The **benefits** of prompt caching include "reducing latency (up to 80%) and lower costs (up to 50%) for longer prompts".

✅ **Constraints**: For caching to work, a prompt must contain at least 1024 tokens. Put static content such as instructions and examples at the beginning and variable content at the end, so that the prefix stays identical across calls and cache hits are consistent.
A cached prefix typically stays active for 5 to 10 minutes of inactivity and can persist for up to one hour during off-peak periods.
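
As a rough illustration of the 1024-token threshold: OpenAI's documentation also states that cache hits occur in increments of 128 tokens beyond the first 1024 (this detail is not stated elsewhere in this notebook). The sketch below computes the largest prefix length that could be reported as cached for a given prompt size. The function names are hypothetical, and the result is only an upper bound: the actual count depends on how much of the prefix matches a cached one.

```python
MIN_CACHEABLE_TOKENS = 1024  # minimum prompt length for caching to apply
CACHE_INCREMENT = 128        # step size taken from OpenAI's documentation (assumption here)

def max_cacheable_tokens(prompt_tokens: int) -> int:
    """Upper bound on the cached-token count for a prompt of this length."""
    if prompt_tokens < MIN_CACHEABLE_TOKENS:
        return 0
    extra_steps = (prompt_tokens - MIN_CACHEABLE_TOKENS) // CACHE_INCREMENT
    return MIN_CACHEABLE_TOKENS + extra_steps * CACHE_INCREMENT

def is_plausible_cached_count(cached_tokens: int) -> bool:
    """True if the count is 0 or of the form 1024 + k * 128."""
    if cached_tokens == 0:
        return True
    return (cached_tokens >= MIN_CACHEABLE_TOKENS
            and (cached_tokens - MIN_CACHEABLE_TOKENS) % CACHE_INCREMENT == 0)

print(max_cacheable_tokens(900))        # 0: below the minimum
print(max_cacheable_tokens(2851))       # 2816
print(is_plausible_cached_count(2688))  # True
```

The value 2688 reported later in this notebook fits this pattern: 1024 + 13 × 128 = 2688.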

✅ A **cache hit** occurs when the system finds a matching prompt prefix and reuses it. In contrast, a **cache miss** happens when no matching prefix is found, so the whole prompt is processed from scratch.

When there is no cache hit (either it is your first call or no matching prefix was found), the number of cached tokens is 0.

You can find this value in the completion response object returned by the API.
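
A minimal sketch of reading that value defensively. The helper name `get_cached_tokens` is hypothetical; the attribute path `usage.prompt_tokens_details.cached_tokens` is the one used later in this notebook. Stand-in objects replace real API responses so the snippet runs without an API key.

```python
from types import SimpleNamespace

def get_cached_tokens(completion) -> int:
    """Return the cached-token count, or 0 if any field is missing or None."""
    usage = getattr(completion, "usage", None)
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None)
    return cached or 0

# Stand-ins that mimic the shape of a chat completion response (not real API output)
fake_miss = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens_details=SimpleNamespace(cached_tokens=0))
)
fake_hit = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens_details=SimpleNamespace(cached_tokens=2688))
)
fake_no_details = SimpleNamespace(usage=SimpleNamespace(prompt_tokens_details=None))

print(get_cached_tokens(fake_miss))        # 0
print(get_cached_tokens(fake_hit))         # 2688
print(get_cached_tokens(fake_no_details))  # 0
```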


▶ **Can prompt caching work in RAG apps? Key Takeaways:**

🔽 ⬇ Have a look at the end of the notebook 🔽 ⬇

▶ **To explore prompt caching in a RAG workflow:**

- I analyzed Amazon’s 10-K report using the LlamaIndex framework.
- A simple reader/parser was used to extract data from the financial report.
- Instead of using the query engine built on the vector store, I directly accessed the prompt templates generated by LlamaIndex.
- I used this template, placing the retrieved context first and the user query at the end.
- The calls used OpenAI’s GPT-4o-mini.
- I gathered final answers, cached token counts, and calculated total tokens sent in each prompt.



---



🔽 Discover the whole process: 🔽

[Hanane DUPOUY](https://www.linkedin.com/in/hanane-d-algo-trader)

In [None]:
!pip install llama-index llama-index-core openai llama_index.embeddings.huggingface -q

In [3]:
import nest_asyncio
nest_asyncio.apply()

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
LLAMAPARSE_API_KEY = userdata.get('LLAMACLOUD_API_KEY')

In [None]:
!wget "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf" -O amzn_2023_10k.pdf

# VectorStore with embedding and SimpleDirectoryReader

In [None]:
# from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import nest_asyncio
nest_asyncio.apply()

pdf_name = "amzn_2023_10k.pdf"
# use SimpleDirectoryReader to parse our file
documents = SimpleDirectoryReader(input_files=[pdf_name]).load_data()

embed_model = "local:BAAI/bge-small-en-v1.5" #https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d

vector_index_std = VectorStoreIndex(documents, embed_model = embed_model)
# default chunking: chunk_size = 1024 tokens; the default SentenceSplitter uses chunk_overlap = 200

## Compute tokens in the retrieved documents

In [6]:
import tiktoken

In [None]:
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
print(encoding)

idx_keys = vector_index_std.storage_context.vector_stores['default'].data.embedding_dict.keys()
for key in idx_keys:
  text = vector_index_std.docstore.get_node(key).get_text()
  tokens_integer=encoding.encode(text)
  print(len(tokens_integer))

In [None]:
# #To modify the chunking size
# from llama_index.core import Settings
# from llama_index.core.node_parser import SentenceSplitter

# Settings.text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

## LlamaIndex Prompts Template:

▶ To understand how the templates are built in LlamaIndex, you can try this method (or go directly to the documentation):

I created a query engine built on the vector store index:

In [None]:
from llama_index.llms.openai import OpenAI
llm_gpt4o_mini = OpenAI(model="gpt-4o-mini", api_key = OPENAI_API_KEY)
query_engine_gpt4o_mini = vector_index_std.as_query_engine(similarity_top_k=3, llm=llm_gpt4o_mini)

In [None]:
# Calling the LLM here isn't necessary, but I’m doing it to verify the reliability of the chunking.
query1 = "What was the net income in 2023?"
response = query_engine_gpt4o_mini.query(query1)
print(str(response))

The net income for 2023 is not explicitly provided in the context information. However, the income (loss) before income taxes for 2023 is reported as $37,557 million, and the provision for income taxes is $7,120 million. To determine the net income, one would typically subtract the provision for income taxes from the income before income taxes. Therefore, the net income for 2023 can be calculated as follows:

Net Income = Income (loss) before income taxes - Provision for income taxes
Net Income = $37,557 million - $7,120 million = $30,437 million. 

Thus, the net income for 2023 is approximately $30,437 million.


▶ From the query engine, you can get the prompt templates that LlamaIndex uses for question answering and for answer refinement. We'll use only the **text_qa_template**:

In [None]:
query_engine_gpt4o_mini.get_prompts().keys()

dict_keys(['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template'])

### text_qa_template

In [None]:
query_engine_gpt4o_mini.get_prompts()['response_synthesizer:text_qa_template']

SelectorPromptTemplate(metadata={'prompt_type': <PromptType.QUESTION_ANSWER: 'text_qa'>}, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings={}, function_mappings={}, default_template=PromptTemplate(metadata={'prompt_type': <PromptType.QUESTION_ANSWER: 'text_qa'>}, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings=None, function_mappings=None, template='Context information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: '), conditionals=[(<function is_chat_model at 0x7c522deb88b0>, ChatPromptTemplate(metadata={'prompt_type': <PromptType.CUSTOM: 'custom'>}, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings=None, function_mappings=None, message_templates=[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="You are an exp

In [None]:
query_engine_gpt4o_mini.get_prompts()['response_synthesizer:text_qa_template'].default_template.template

'Context information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: '

In [None]:
prompt_llamaindex = query_engine_gpt4o_mini.get_prompts()['response_synthesizer:text_qa_template'].default_template.template
prompt_llamaindex

'Context information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: '

### refine_template

I'm not using this template, but I'm showing it in case you need it for your own project:

In [None]:
query_engine_gpt4o_mini.get_prompts()['response_synthesizer:refine_template'].default_template.template

"The original query is as follows: {query_str}\nWe have provided an existing answer: {existing_answer}\nWe have the opportunity to refine the existing answer (only if needed) with some more context below.\n------------\n{context_msg}\n------------\nGiven the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\nRefined Answer: "

## Vector Store as retriever

▶ Create a retriever from the vector store index, so we can retrieve the context related to our query and use it later in the process:

In [None]:
retriever = vector_index_std.as_retriever(similarity_top_k=3)
query_str = "What was the net income in 2023?"
response = retriever.retrieve(query_str)
print(str(response))

In [8]:
for res in response:
  print(res.node.metadata)

{'page_label': '65', 'file_name': 'amzn_2023_10k.pdf', 'file_path': 'amzn_2023_10k.pdf', 'file_type': 'application/pdf', 'file_size': 800598, 'creation_date': '2024-10-27', 'last_modified_date': '2024-02-02'}
{'page_label': '28', 'file_name': 'amzn_2023_10k.pdf', 'file_path': 'amzn_2023_10k.pdf', 'file_type': 'application/pdf', 'file_size': 800598, 'creation_date': '2024-10-27', 'last_modified_date': '2024-02-02'}
{'page_label': '67', 'file_name': 'amzn_2023_10k.pdf', 'file_path': 'amzn_2023_10k.pdf', 'file_type': 'application/pdf', 'file_size': 800598, 'creation_date': '2024-10-27', 'last_modified_date': '2024-02-02'}


Collecting the page labels:

In [None]:
page_labels = []
for res in response:
  if res.node.metadata!={}:
    print(res.node.metadata['page_label'])
    page_labels.append(res.node.metadata['page_label'])

In [None]:
page_labels

['65', '28', '67']

Showing the context:

In [None]:
context_str=""
for resp in response:
  text = resp.node.get_text()
  print(text)
  context_str += text + " \n\n"

## Simple call to the Chat Completions API to see cached tokens:

### First query:

In [None]:
query_str = "What was the net income in 2023?"

In [None]:
prompt_llamaindex = f"""Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: """

In [None]:
from openai import OpenAI
client = OpenAI(api_key = OPENAI_API_KEY)

completion = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {"role": "system", "content": "You are a financial analyst expert."},
    {"role": "user", "content": prompt_llamaindex}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content='To calculate the net income for the year 2023, we can start with the income (loss) before income taxes and adjust it by the provision (benefit) for income taxes.\n\nFrom the provided information:\n- Income (loss) before income taxes for 2023: $37,557 million\n- Provision (benefit) for income taxes, net for 2023: $7,120 million\n\nThe formula for net income is:\n\n\\[ \\text{Net Income} = \\text{Income (loss) before income taxes} - \\text{Provision (benefit) for income taxes} \\]\n\nSubstituting in the values:\n\n\\[ \\text{Net Income} = 37,557 - 7,120 \\]\n\nCalculating this gives:\n\n\\[ \\text{Net Income} = 30,437 \\]\n\nThus, the net income for 2023 was **$30,437 million** or **$30.437 billion**.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


In [None]:
print(completion.choices[0].message.content)

To calculate the net income for the year 2023, we can start with the income (loss) before income taxes and adjust it by the provision (benefit) for income taxes.

From the provided information:
- Income (loss) before income taxes for 2023: $37,557 million
- Provision (benefit) for income taxes, net for 2023: $7,120 million

The formula for net income is:

\[ \text{Net Income} = \text{Income (loss) before income taxes} - \text{Provision (benefit) for income taxes} \]

Substituting in the values:

\[ \text{Net Income} = 37,557 - 7,120 \]

Calculating this gives:

\[ \text{Net Income} = 30,437 \]

Thus, the net income for 2023 was **$30,437 million** or **$30.437 billion**.


In [None]:
completion.usage.prompt_tokens_details  # cached_tokens = 0: expected on the first call

PromptTokensDetails(audio_tokens=None, cached_tokens=0)

### Second query:

In [None]:
query_str = "What was the revenue in 2023?"
prompt_llamaindex = f"""Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: """

In [None]:
completion2 = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {"role": "system", "content": "You are a financial analyst expert."},
    {"role": "user", "content": prompt_llamaindex}
  ]
)

print(completion2.choices[0].message.content)


The provided context information does not include details about revenue for the year 2023. Therefore, I cannot determine the revenue for that year based on the information given. If additional data on revenue is available, it would be necessary to review that information to provide an answer.


In [None]:
completion2.usage.prompt_tokens_details  # cached_tokens = 2688 on the second call: 2688 prompt tokens were read from the cache

PromptTokensDetails(audio_tokens=None, cached_tokens=2688)

In [None]:
completion2.usage.prompt_tokens

2851
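
To put these two numbers in perspective, the share of the prompt served from cache can be computed directly. The sketch below also estimates the input-cost saving, assuming cached tokens are billed at a 50% discount (based on the "up to 50%" figure quoted at the top of this notebook). The discount value is an assumption and the function names are hypothetical.

```python
def cache_hit_ratio(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens that were read from the cache."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return cached_tokens / prompt_tokens

def input_cost_saving(cached_tokens: int, prompt_tokens: int,
                      cached_discount: float = 0.5) -> float:
    """Estimated fraction of input cost saved, for an assumed discount on cached tokens."""
    return cached_discount * cache_hit_ratio(cached_tokens, prompt_tokens)

# Values reported by the second call above
print(f"{cache_hit_ratio(2688, 2851):.1%}")    # 94.3%
print(f"{input_cost_saving(2688, 2851):.1%}")  # 47.1%
```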

# All together: Caching Tokens

In [11]:
import tiktoken

In [14]:
MODEL = "gpt-4o-mini"
encoding = tiktoken.encoding_for_model(MODEL)
print(encoding)

from openai import OpenAI
client = OpenAI(api_key = OPENAI_API_KEY)

<Encoding 'o200k_base'>


In [15]:
def get_retrieved_context(query_str, retriever):
  # Retrieve the top-k nodes for the query and join their text into one context string
  response = retriever.retrieve(query_str)
  context_str=""
  for resp in response:
    text = resp.node.get_text()
    context_str += text + " \n\n"

  page_labels = []
  for res in response:
    if res.node.metadata!={}:
      # print(res.node.metadata['page_label'])
      page_labels.append(res.node.metadata['page_label'])
  return context_str, page_labels

def get_template(query_str,context_str):
  prompt_llamaindex = f"""Context information is below.
  ---------------------
  {context_str}
  ---------------------
  Given the context information and not prior knowledge, answer the query.
  Query: {query_str}
  Answer: """
  return prompt_llamaindex

def call_gpt_4o(prompt):
  completion = client.chat.completions.create(
    model=MODEL,
    messages=[
      {"role": "system", "content": "You are a financial analyst expert."},
      {"role": "user", "content": prompt}
    ]
  )

  llm_answer = completion.choices[0].message.content
  cached_tokens_nbr = completion.usage.prompt_tokens_details.cached_tokens
  # prompt_input_nbr_tokens = completion.usage.prompt_tokens
  return llm_answer, cached_tokens_nbr

def compute_nb_tokens(text):
  tokens_integer=encoding.encode(text)
  return len(tokens_integer)

def get_final_answer(query_str,retriever):
  context_str, page_labels = get_retrieved_context(query_str,retriever)
  prompt_llamaindex = get_template(query_str,context_str)
  llm_answer, cached_tokens_nbr = call_gpt_4o(prompt_llamaindex)
  prompt_nbr_tokens = compute_nb_tokens(prompt_llamaindex) #You can also use: completion.usage.prompt_tokens in call_gpt_4o method
  return llm_answer, cached_tokens_nbr, prompt_nbr_tokens, page_labels

**With the SimpleDirectoryReader-based retriever**

In the following, I'll be asking different questions to see whether prompt caching kicks in:

## Query 1:

On the first call, we'll see 0 cached tokens:

In [16]:
queries_list = ["What was the net income in 2023?","What was the revenue in 2023?", \
                "What are the operating income in 2022?", "What are the operating expenses in 2021?"]
for query in queries_list :
  resp, cached_tokens, prompt_nbr_tokens, page_labels = get_final_answer(query,retriever)
  print(f"query:\n{query}"+"\n\n")

  print(f"response:\n{resp}" +"\n\n")
  print(f"nbr_tokens in the prompt = {prompt_nbr_tokens}" +"\n")
  print(f"cached_tokens = {cached_tokens}" +"\n")
  print(f"page_labels = {page_labels}" +"\n")
  print("--"*50)

query:
What was the net income in 2023?


response:
To calculate the net income for the year 2023, we need to start with the income before income taxes and subtract the provision (benefit) for income taxes. 

From the provided information for the year ended December 31, 2023:

- Income before income taxes: $37,557 million
- Provision for income taxes: $7,120 million

Now, we can calculate the net income as follows:

Net Income = Income before income taxes - Provision for income taxes  
Net Income = $37,557 million - $7,120 million  
Net Income = $30,437 million

Therefore, the net income in 2023 was **$30,437 million**.


nbr_tokens in the prompt = 2840

cached_tokens = 0

page_labels = ['65', '28', '67']

----------------------------------------------------------------------------------------------------
query:
What was the revenue in 2023?


response:
The context provided does not explicitly state the total revenue for the year 2023. To determine the revenue, we would typically look 

## Query 2

In the second run, cached tokens appear. Queries that are identical, or that differ only slightly (for example, only in the year), retrieve the same (or almost the same) context, so the prompt prefix matches a previously processed one and cached tokens are used:

In [17]:
queries_list = ["What was the net income in 2023?","What was the revenue in 2023?", \
                "What are the operating income in 2022?", "What are the operating expenses in 2021?"]
for query in queries_list :
  resp, cached_tokens, prompt_nbr_tokens, page_labels = get_final_answer(query,retriever)
  print(f"query:\n{query}"+"\n\n")

  print(f"response:\n{resp}" +"\n\n")
  print(f"nbr_tokens in the prompt = {prompt_nbr_tokens}" +"\n")
  print(f"cached_tokens = {cached_tokens}" +"\n")
  print(f"page_labels = {page_labels}" +"\n")
  print("--"*50)

query:
What was the net income in 2023?


response:
To calculate the net income for 2023, we need to consider the income (loss) before income taxes and the provision for income taxes.

From the data provided:
- Income (loss) before income taxes in 2023: $37,557 million
- Provision (benefit) for income taxes in 2023: $7,120 million

Net income can be calculated as follows:

Net Income = Income (loss) before income taxes - Provision for income taxes
Net Income = $37,557 million - $7,120 million
Net Income = $30,437 million

Therefore, the net income in 2023 was **$30,437 million**.


nbr_tokens in the prompt = 2840

cached_tokens = 2688

page_labels = ['65', '28', '67']

----------------------------------------------------------------------------------------------------
query:
What was the revenue in 2023?


response:
The provided context does not explicitly state the total revenue for the year 2023. However, it mentions that $12.4 billion of unearned revenue was recognized as revenue du

## Query 3

In this query, I'm asking completely different questions from those in Query 1 and Query 2. This is the first time this context is retrieved, so the number of cached tokens is 0:

In [18]:
queries_list = ["What are the total assets in 2022?","What are the current liabilities in 2023?"]
for query in queries_list :
  resp, cached_tokens, prompt_nbr_tokens, page_labels = get_final_answer(query,retriever)
  print(f"query:\n{query}"+"\n\n")

  print(f"response:\n{resp}" +"\n\n")
  print(f"nbr_tokens in the prompt = {prompt_nbr_tokens}" +"\n")
  print(f"cached_tokens = {cached_tokens}" +"\n")
  print(f"page_labels = {page_labels}" +"\n")
  print("--"*50)

query:
What are the total assets in 2022?


response:
The total assets in 2022 are $462,675 million.


nbr_tokens in the prompt = 2634

cached_tokens = 0

page_labels = ['70', '40', '23']

----------------------------------------------------------------------------------------------------
query:
What are the current liabilities in 2023?


response:
The current liabilities in 2023 are $164,917 million.


nbr_tokens in the prompt = 2359

cached_tokens = 0

page_labels = ['67', '40', '66']

----------------------------------------------------------------------------------------------------


## Query 4

Even though I modified only the years, the retrieved context here (see page_labels) is not the same, and not in the same order, as in the previous query (Query 3). The prompt prefix therefore differs, and no tokens are cached.

In [19]:
queries_list = ["What are the total assets in 2023?","What are the current liabilities in 2022?"]
for query in queries_list :
  resp, cached_tokens, prompt_nbr_tokens, page_labels = get_final_answer(query,retriever)
  print(f"query:\n{query}"+"\n\n")

  print(f"response:\n{resp}" +"\n\n")
  print(f"nbr_tokens in the prompt = {prompt_nbr_tokens}" +"\n")
  print(f"cached_tokens = {cached_tokens}" +"\n")
  print(f"page_labels = {page_labels}" +"\n")
  print("--"*50)

query:
What are the total assets in 2023?


response:
The total assets in 2023 amount to $527,854 million.


nbr_tokens in the prompt = 2159

cached_tokens = 0

page_labels = ['70', '66', '40']

----------------------------------------------------------------------------------------------------
query:
What are the current liabilities in 2022?


response:
The current liabilities for the year ended December 31, 2022, were $155,393 million.


nbr_tokens in the prompt = 2320

cached_tokens = 0

page_labels = ['67', '63', '40']

----------------------------------------------------------------------------------------------------
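
The effect of chunk order can be checked on the page labels printed above. Since caching works on the prompt prefix, only the leading run of identical chunks can be reused. The helper below (a hypothetical name) counts how many leading pages two retrievals share.

```python
def shared_leading_pages(pages_a, pages_b):
    """Count how many pages match at the start of both lists, in the same order."""
    count = 0
    for a, b in zip(pages_a, pages_b):
        if a != b:
            break
        count += 1
    return count

# Page labels copied from the Query 3 and Query 4 outputs
total_assets_2022 = ['70', '40', '23']
total_assets_2023 = ['70', '66', '40']
current_liabilities_2023 = ['67', '40', '66']
current_liabilities_2022 = ['67', '63', '40']

print(shared_leading_pages(total_assets_2022, total_assets_2023))                # 1
print(shared_leading_pages(current_liabilities_2023, current_liabilities_2022))  # 1
# Same query repeated (Query 1 and Query 2): the whole context is shared
print(shared_leading_pages(['65', '28', '67'], ['65', '28', '67']))              # 3
```

In both pairs only the first page matches. This would be consistent with the 0 cached tokens reported above if that first chunk, together with the template header, is shorter than the 1024-token minimum.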


In [None]:
queries_list = ["What was the net income in 2023?","What was the revenue in 2023?", \
                "What are the operating income in 2022?", "What are the operating expenses in 2021?"]

for query in queries_list :
  resp, cached_tokens, prompt_nbr_tokens, page_labels = get_final_answer(query,retriever)
  print(f"query:\n{query}"+"\n\n")

  print(f"response:\n{resp}" +"\n\n")
  print(f"nbr_tokens in the prompt = {prompt_nbr_tokens}" +"\n")
  print(f"cached_tokens = {cached_tokens}" +"\n")
  print(f"page_labels = {page_labels}" +"\n")
  print("--"*50)

query:
What was the net income in 2023?


response:
To determine the net income for the year 2023, we can use the provided income before income taxes and the provision (benefit) for income taxes.

Given:

- Income before income taxes for 2023: $37,557 million
- Provision for income taxes for 2023: $7,120 million

Net income is calculated as follows:

Net Income = Income before income taxes - Provision for income taxes

Thus:

Net Income = $37,557 million - $7,120 million = $30,437 million

Therefore, the net income in 2023 was $30,437 million.


cached_tokens = 2688

nbr_tokens in the prompt = 2840

----------------------------------------------------------------------------------------------------
query:
What was the revenue in 2023?


response:
The context information provided does not specify the total revenue for 2023. However, it does mention that $12.4 billion of unearned revenue was recognized as revenue during the year ended December 31, 2023. To obtain the total revenue for 20

# Key Takeaways:

▶ **Can prompt caching work in RAG apps? Key Takeaways:**

- It depends on the prompt. If it begins with lengthy, static instructions or examples (e.g., few-shot learning), caching can be effective.
- For prompts with brief instructions followed by dynamically retrieved context and user-specific queries, cache hits are unlikely because the context changes with each query (this mainly concerns small RAG apps that are not widely used).
- Caching could work if users repeatedly ask similar questions that pull the same context.
- For shared RAG systems, especially in organizations, caching frequent queries can help reduce latency and costs.
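
One way to act on these takeaways, sketched under assumptions: keep the static instructions at the very start of the prompt and put the retrieved chunks in a canonical order (here, sorted by page number), so that two queries retrieving the same pages produce the same prefix. This is not what the notebook above does. The function name, the `(page_label, text)` chunk format, and the sorting rule are illustrative, numeric page labels are assumed, and ordering chunks by page rather than by relevance may affect answer quality.

```python
def build_cache_friendly_prompt(static_instructions, chunks, query_str):
    """Static text first, then context in a stable order, then the variable query."""
    # Assumption: page labels are numeric strings such as '65'
    ordered = sorted(chunks, key=lambda chunk: int(chunk[0]))
    context_str = "\n\n".join(text for _, text in ordered)
    return (
        f"{static_instructions}\n"
        "Context information is below.\n"
        "---------------------\n"
        f"{context_str}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {query_str}\n"
        "Answer: "
    )

# Placeholder chunk texts; the same pages are returned in two different orders
chunks_a = [("65", "text of page 65"), ("28", "text of page 28"), ("67", "text of page 67")]
chunks_b = [("67", "text of page 67"), ("65", "text of page 65"), ("28", "text of page 28")]

prompt_a = build_cache_friendly_prompt("You answer questions about a 10-K report.",
                                       chunks_a, "What was the net income in 2023?")
prompt_b = build_cache_friendly_prompt("You answer questions about a 10-K report.",
                                       chunks_b, "What was the revenue in 2023?")

# Everything before the final "Query:" line is identical, so it can act as a shared prefix
prefix_a = prompt_a.rpartition("Query: ")[0]
prefix_b = prompt_b.rpartition("Query: ")[0]
print(prefix_a == prefix_b)  # True
```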