<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/llm/anthropic_prompt_caching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anthropic Prompt Caching

In this notebook, we demonstrate how to use [Anthropic Prompt Caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) with LlamaIndex abstractions.

Prompt caching is enabled by adding a `cache_control` marker to the content blocks of the messages request that you want cached.


## How Prompt Caching works

When you send a request with Prompt Caching enabled:

1. The system checks if the prompt prefix is already cached from a recent query.
2. If found, it uses the cached version, reducing processing time and costs.
3. Otherwise, it processes the full prompt and caches the prefix for future use.
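The three steps above can be pictured with a toy simulation. This is only a conceptual sketch, not Anthropic's actual implementation: the class name `ToyPrefixCache`, the hashing scheme, and the `ttl_seconds` default (chosen to mirror the roughly five-minute lifetime of an ephemeral cache entry) are all assumptions made for illustration.

```python
import hashlib
import time


class ToyPrefixCache:
    """Toy model of prefix caching (hypothetical, for illustration only)."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl_seconds = ttl_seconds
        self._expiry_by_key = {}  # prefix hash -> time at which the entry expires

    def process(self, prefix, suffix, now=None):
        """Return "hit" if the prefix is cached and still fresh, else "miss".

        Only the prefix is part of the cache key; the suffix (the new
        question) is always processed from scratch.
        """
        now = time.monotonic() if now is None else now
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        expiry = self._expiry_by_key.get(key)
        hit = expiry is not None and now < expiry
        # A hit refreshes the entry; a miss creates it for future requests.
        self._expiry_by_key[key] = now + self.ttl_seconds
        return "hit" if hit else "miss"


cache = ToyPrefixCache()
essay = "long document text ..."
print(cache.process(essay, "Why did Paul Graham start YC?", now=0.0))  # miss
print(cache.process(essay, "What did he do growing up?", now=10.0))  # hit
print(cache.process(essay, "Another question", now=1000.0))  # miss (expired)
```

The point of the sketch is that the lookup depends only on the shared prefix, so any change to the prefix text produces a different key and therefore a miss.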


**Note:**

1. Prompt caching works with the `Claude 3.5 Sonnet`, `Claude 3 Haiku`, and `Claude 3 Opus` models.
2. The minimum cacheable prompt length is:
    1. 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
    2. 2048 tokens for Claude 3 Haiku
3. Shorter prompts cannot be cached, even if marked with `cache_control`.
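As a rough pre-flight check against the minimums listed above, the following sketch estimates whether a text is long enough to be cached. The helper names and the four-characters-per-token ratio are assumptions for illustration; the estimate is a crude heuristic, not a real tokenizer count, so treat borderline results with caution.

```python
# Minimum cacheable prompt lengths from the note above, keyed by model family.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-sonnet": 1024,
    "claude-3-opus": 1024,
    "claude-3-haiku": 2048,
}


def min_cacheable_tokens(model):
    """Look up the minimum for a model name such as 'claude-3-5-sonnet-20240620'."""
    for family, minimum in MIN_CACHEABLE_TOKENS.items():
        if model.startswith(family):
            return minimum
    raise ValueError(f"No known caching minimum for model {model!r}")


def is_probably_cacheable(text, model, chars_per_token=4.0):
    """Heuristic check: estimate tokens from characters (an assumption)."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens >= min_cacheable_tokens(model)


sample = "x" * 8000  # roughly 2000 tokens under the 4-chars-per-token assumption
print(is_probably_cacheable(sample, "claude-3-5-sonnet-20240620"))  # True
print(is_probably_cacheable(sample, "claude-3-haiku-20240307"))  # False
```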

### Setup API Keys

In [None]:
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-..."  # replace with your Anthropic API key

### Setup LLM

In [None]:
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-3-5-sonnet-20240620")

### Download Data

In this demonstration, we use the text of a Paul Graham essay. We will cache the essay text and then run several queries against it.

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham_essay.txt'

--2024-09-28 01:22:14--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘./paul_graham_essay.txt’


2024-09-28 01:22:14 (5.73 MB/s) - ‘./paul_graham_essay.txt’ saved [75042/75042]



### Load Data

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./paul_graham_essay.txt"],
).load_data()

document_text = documents[0].text

### Prompt Caching

To enable prompt caching:

1. Include `"cache_control": {"type": "ephemeral"}` in the text block you want to cache.
2. Pass `extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}` with the request.

We can verify whether the text was cached by checking the following fields in the response's usage data:

`cache_creation_input_tokens`: Number of tokens written to the cache when creating a new entry.

`cache_read_input_tokens`: Number of tokens retrieved from the cache for this request.

`input_tokens`: Number of input tokens that were neither read from the cache nor used to create a cache entry.
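The three fields can be pulled out with a small helper. This is a sketch under assumptions: it supposes the raw response exposes a `usage` entry holding these counters, either as a dict or as an object with attributes, which may vary between library versions. The helper name `cache_usage` and the stand-in response below are hypothetical, and the `input_tokens` value is made up for illustration.

```python
from types import SimpleNamespace

CACHE_FIELDS = (
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "input_tokens",
)


def _get(obj, name, default=None):
    """Read `name` from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(name, default)
    return getattr(obj, name, default)


def cache_usage(raw):
    """Return the cache-related token counts, treating missing values as 0."""
    usage = _get(raw, "usage")
    return {field: _get(usage, field, 0) or 0 for field in CACHE_FIELDS}


# Stand-in for a raw response; not real API output.
fake_raw = {
    "usage": SimpleNamespace(
        cache_creation_input_tokens=17470,
        cache_read_input_tokens=0,
        input_tokens=12,
    )
}
print(cache_usage(fake_raw))
```

A non-zero `cache_creation_input_tokens` suggests a cache write, while a non-zero `cache_read_input_tokens` suggests a cache hit.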

In [None]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are helpful AI Assitant."),
    ChatMessage(
        role="user",
        content=[
            {
                "text": f"{document_text}",
                "type": "text",
                "cache_control": {"type": "ephemeral"},
            },
            {"text": "Why did Paul Graham start YC?", "type": "text"},
        ],
    ),
]

resp = llm.chat(
    messages, extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)

Let's examine the raw response.

In [None]:
resp.raw

{'id': 'msg_01KCcFZnbAGjxSKJm7LnXajp',
 'content': [TextBlock(text="Based on the essay, it seems Paul Graham started Y Combinator for a few key reasons:\n\n1. He had been thinking about ways to improve venture capital and startup funding, like making smaller investments in younger, more technical founders.\n\n2. He wanted to try angel investing but hadn't gotten around to it yet, despite intending to for years after Yahoo acquired his company Viaweb.\n\n3. He missed working with his former Viaweb co-founders Robert Morris and Trevor Blackwell and wanted to find a project they could collaborate on.\n\n4. His girlfriend (later wife) Jessica Livingston was looking for a new job after interviewing at a VC firm, and Graham had been telling her ideas for how to improve VC.\n\n5. When giving a talk to Harvard students about startups, he realized there was demand for seed funding and advice from experienced founders.\n\n6. They wanted to create an investment firm that would actually implement 

The `cache_creation_input_tokens` field of the response reports that `17470` tokens were written to the cache.

Now, let’s run another query on the same document. It should retrieve the document text from the cache, which will be reflected in `cache_read_input_tokens`.

In [None]:
messages = [
    ChatMessage(role="system", content="You are helpful AI Assitant."),
    ChatMessage(
        role="user",
        content=[
            {
                "text": f"{document_text}",
                "type": "text",
                "cache_control": {"type": "ephemeral"},
            },
            {"text": "What did Paul Graham do growing up?", "type": "text"},
        ],
    ),
]

resp = llm.chat(
    messages, extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)
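The two requests differ only in the final question block; the document block, including its `cache_control` marker, has to stay identical for the prefix to match. One way to guarantee that is to build the content from a single helper. The function name `build_cached_content` is hypothetical, and the sketch returns plain dicts in the same shape as the `content=` lists used in this notebook.

```python
def build_cached_content(document_text, question):
    """Return content blocks: a cacheable document block, then the question."""
    return [
        {
            "text": document_text,
            "type": "text",
            # Marks the end of the prefix that should be cached.
            "cache_control": {"type": "ephemeral"},
        },
        # The question comes after the cached prefix and may change freely.
        {"text": question, "type": "text"},
    ]


doc = "essay text ..."
first = build_cached_content(doc, "Why did Paul Graham start YC?")
second = build_cached_content(doc, "What did Paul Graham do growing up?")
print(first[0] == second[0])  # True: the cacheable block is identical
```

The returned list can be passed as `content=` to a user `ChatMessage`, exactly as the hand-written lists are.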

In [None]:
resp.raw

{'id': 'msg_01CpwhtuvJ8UR64xSbpxoutZ',
 'content': [TextBlock(text='Based on the essay, here are some key things Paul Graham did growing up:\n\n1. As a teenager, he focused mainly on writing and programming outside of school. He tried writing short stories but says they were "awful".\n\n2. In 9th grade (age 13-14), he started programming on an IBM 1401 computer at his school district\'s data processing center. He used an early version of Fortran.\n\n3. He convinced his father to buy a TRS-80 microcomputer around 1980 when he was in high school. He wrote simple games, a program to predict model rocket flight, and a word processor his father used.\n\n4. He planned to study philosophy in college, thinking it was more powerful than other fields. \n\n5. In college, he got interested in artificial intelligence after reading a novel featuring an intelligent computer and seeing a documentary about an AI program called SHRDLU.\n\n6. He taught himself Lisp programming language in college since t

This time the response was generated using the cached text, as indicated by a non-zero `cache_read_input_tokens`.
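To get a feel for the savings, the sketch below compares input cost with and without caching for two queries over the same document. It assumes the pricing multipliers Anthropic has published for ephemeral caching (cache writes billed at 1.25x and cache reads at 0.1x the base input price); check the current pricing page before relying on them. Costs are expressed in "base input token" units rather than dollars, and the question length of 20 tokens is a made-up figure.

```python
WRITE_MULTIPLIER = 1.25  # assumed surcharge for writing tokens to the cache
READ_MULTIPLIER = 0.10  # assumed discount for reading tokens from the cache


def input_cost_units(
    input_tokens, cache_creation_input_tokens=0, cache_read_input_tokens=0
):
    """Input cost of one request, in units of one uncached input token."""
    return (
        input_tokens
        + WRITE_MULTIPLIER * cache_creation_input_tokens
        + READ_MULTIPLIER * cache_read_input_tokens
    )


document_tokens = 17470  # cached token count reported in this notebook
question_tokens = 20  # hypothetical size of the uncached question

# Two requests without caching: the document is billed in full both times.
uncached = 2 * input_cost_units(document_tokens + question_tokens)

# With caching: the first request writes the cache, the second reads it.
cached = input_cost_units(
    question_tokens, cache_creation_input_tokens=document_tokens
) + input_cost_units(question_tokens, cache_read_input_tokens=document_tokens)

print(f"cached / uncached input cost: {cached / uncached:.2f}")
```

Under these assumptions the first request is slightly more expensive than an uncached one, and the savings grow with every additional query that hits the cache.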