# Prompt Caching in OpenAI Models

This notebook demonstrates how OpenAI's *prompt caching* works:
- Prompts **<1024 tokens** → no caching
- Prompts **≥1024 tokens** → cached after the first call
- Repeated calls with the same prefix → faster & cheaper (cache hit)
- Even a small change near the start of the prompt → cache miss, because matching is done on the prefix

We’ll measure latency and inspect the `usage.prompt_tokens_details.cached_tokens` field in API responses.
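The cached-token count sits under `prompt_tokens_details` in the usage object, as the printed outputs in this notebook show. A small accessor can make it easier to read. This is a minimal sketch: `get_cached_tokens` is a hypothetical helper name, not part of the OpenAI SDK, and the stand-in object only mimics the shape of the real usage field.

```python
from types import SimpleNamespace

def get_cached_tokens(usage):
    """Return usage.prompt_tokens_details.cached_tokens, or 0 if it is absent."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None)
    return cached or 0

# Stand-in object shaped like the usage field printed in this notebook
fake_usage = SimpleNamespace(
    prompt_tokens=18012,
    prompt_tokens_details=SimpleNamespace(audio_tokens=0, cached_tokens=17920),
)
print(get_cached_tokens(fake_usage))
```

With a real response, the call would be `get_cached_tokens(resp.usage)`.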


In [None]:
# Library for counting tokens
!pip install tiktoken

## 0. Setup

In [19]:
from openai import OpenAI
from IPython.display import Image, display, Markdown
import os, re, time, tiktoken

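# Optional: set your key here for local testing (avoid committing real keys).
# Leave this commented out if OPENAI_API_KEY is already set in your environment;
# assigning an empty string would overwrite the existing key.
# os.environ["OPENAI_API_KEY"] = "sk-..."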

client = OpenAI()

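# Helper: count tokens
def count_tokens(text, model="gpt-4.1"):
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Older tiktoken releases may not recognize this model name.
        # Assumption: o200k_base is the matching encoding for this model family.
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))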

# Helper: run query and return full response + latency (seconds)
def run_query(prompt, model="gpt-4.1"):
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # reduces output variability (not a strict determinism guarantee)
    )
    latency = time.perf_counter() - start
    return resp, latency


## Sample Long Text to Try On

In [5]:
long_text = """
Essay: “Attention Is All You Need” and the Rise of Transformers
Introduction

In 2017, Vaswani et al. published “Attention Is All You Need”, a paper that introduced the Transformer architecture and fundamentally reshaped the field of natural language processing (NLP). Prior to its release, recurrent neural networks (RNNs) and their gated variants (LSTMs and GRUs) dominated sequence modeling. These models processed input tokens sequentially, which imposed computational inefficiencies and struggled with capturing long-range dependencies. The Transformer proposed a radical departure: abandon recurrence and convolutions altogether, and instead rely purely on an attention mechanism to capture relationships between tokens.

This essay examines the Attention Is All You Need paper in depth. We will review the context in which the Transformer emerged, explain its core technical contributions, analyze the empirical findings, and consider both the limitations and the immense legacy of this work.

Historical Context
The dominance of recurrent models

By the mid-2010s, deep learning had demonstrated impressive results on NLP tasks such as machine translation, sentiment analysis, and speech recognition. Central to this success were RNNs and LSTMs, which processed sequences in order, updating a hidden state as each token arrived.

While effective for short contexts, RNNs faced two major challenges:

Sequential computation: Training and inference were slow because tokens had to be processed one after another, preventing efficient parallelization.

Long-range dependencies: Despite improvements like gating in LSTMs and GRUs, these models still struggled to capture relationships between tokens far apart in a sequence.

Early attention mechanisms

Researchers began augmenting RNNs with attention mechanisms in the mid-2010s. Bahdanau et al. (2015) introduced soft attention in neural machine translation, allowing models to focus on different parts of the input sequence while generating output. This hybrid RNN-attention model improved performance and interpretability. However, recurrence was still the backbone of the architecture.

Vaswani et al. proposed a bolder idea: remove recurrence entirely. If attention was so effective at modeling dependencies, why not rely on it exclusively?

The Transformer Architecture

The central claim of Attention Is All You Need is that self-attention alone is sufficient for sequence modeling. The resulting architecture, the Transformer, is composed of an encoder and a decoder, both structured as stacks of identical layers.

Encoder

The encoder consists of a stack of layers, each with two main subcomponents:

Multi-Head Self-Attention: Each token attends to every other token in the input sequence, producing a context-aware representation.

Position-wise Feedforward Networks: Fully connected layers applied independently to each position.

Residual connections and layer normalization stabilize training.

Decoder

The decoder also consists of stacked layers, but with three subcomponents:

Masked Multi-Head Self-Attention: Similar to the encoder, but masking ensures the model cannot “peek ahead” at future tokens during training.

Encoder-Decoder Attention: Allows the decoder to attend to encoder outputs, linking input and output sequences.

Position-wise Feedforward Networks.

Multi-Head Attention

The technical heart of the model is multi-head attention. Instead of computing a single attention distribution, the model projects queries, keys, and values into multiple subspaces (“heads”), computes attention in each, and then concatenates the results. This enables the model to capture different types of relationships in parallel.

Positional Encoding

Because the Transformer lacks recurrence, it has no inherent notion of order. The authors introduced positional encodings—deterministic sinusoidal vectors added to token embeddings—to inject information about sequence position. These encodings allow the model to distinguish between, for example, “dog bites man” and “man bites dog.”

Efficiency Advantages

By replacing recurrence with self-attention, the Transformer enables parallel computation across all tokens in a sequence. This leads to dramatic improvements in training speed, particularly on GPUs and TPUs. Moreover, the receptive field is global—every token can attend to every other—solving the long-range dependency problem.

Experiments and Results

The Transformer was primarily evaluated on machine translation, specifically the WMT 2014 English-to-German and English-to-French datasets.

BLEU scores: The Transformer achieved state-of-the-art BLEU scores, outperforming previous RNN and CNN architectures.

Training time: Transformers trained significantly faster, requiring less computation to reach better results.

Scalability: The model scaled effectively with larger datasets and deeper architectures.

In ablation studies, the authors demonstrated the importance of multi-head attention, positional encodings, and layer normalization. Removing these components led to notable performance drops.

Key Contributions

The Attention Is All You Need paper introduced several enduring innovations:

Pure Attention Architecture: Proved that recurrence and convolution were not necessary for sequence modeling.

Multi-Head Self-Attention: Enabled parallel capture of diverse relationships between tokens.

Positional Encodings: Solved the order problem elegantly without sacrificing parallelism.

Efficiency: Made large-scale training feasible, laying the groundwork for pretraining on massive corpora.

Generalizability: Though motivated by translation, the architecture was flexible enough to adapt to a wide range of sequence tasks.

Limitations Noted in the Paper

Despite its breakthroughs, the original Transformer also had limitations:

Quadratic complexity: Self-attention scales as O(n squared) with sequence length, which becomes expensive for very long inputs (e.g., books, genomes).

Data requirements: Training from scratch requires substantial data and compute, which was already raising accessibility concerns in 2017.

Interpretability: While attention weights provide some interpretability, the overall behavior of the model is still difficult to explain fully.

Long-Term Impact
Emergence of Pretraining Paradigms

The Transformer quickly became the backbone of models like BERT (2018), GPT (2018 onward), T5 (2019), and countless others. The concept of pretraining on large corpora and fine-tuning for specific tasks was enabled by the scalability of Transformers.

Beyond NLP

Transformers migrated beyond text into vision (Vision Transformers), speech, protein folding (AlphaFold), reinforcement learning, and multimodal learning. The “attention is all you need” principle proved surprisingly universal.

Industrial and Societal Impact

Large language models such as GPT-3, GPT-4, and GPT-5 are direct descendants of this architecture. These models power chatbots, coding assistants, search engines, and countless AI products. The Transformer has redefined expectations for what machines can do with language, though also raising concerns about misinformation, bias, and environmental costs of training.

Critiques and Evolving Research

While the Transformer has been dominant, research continues to address its weaknesses. Long-sequence models (e.g., Performer, Linformer, Longformer, RWKV) attempt to reduce quadratic complexity. Efficient pretraining strategies seek to make massive models more accessible. Researchers also continue exploring interpretability, fairness, and safety in Transformer-based systems.

Conclusion

“Attention Is All You Need” is widely regarded as one of the most influential machine learning papers of the 21st century. By daring to discard recurrence and trust attention mechanisms entirely, Vaswani et al. sparked a paradigm shift. The Transformer solved key bottlenecks in training efficiency and long-range dependency modeling, enabling the rise of large language models that now underpin modern AI systems.

Its legacy extends far beyond NLP, influencing nearly every domain of AI. At the same time, it highlighted new challenges: efficiency at scale, ethical concerns, and interpretability. In just a few years, the Transformer has gone from a bold new architecture to the foundation of state-of-the-art AI.

"""

## 1. Small Prompt (<1024 tokens) -> No Caching

Caching only works if the prompt is **≥1024 tokens**.  
Here we’ll send a short prompt (~300 tokens) multiple times and inspect `cached_tokens`.
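A single latency reading is noisy, so one option is to repeat the same call a few times and compare the readings. The sketch below assumes a hypothetical helper, `time_calls`, that accepts any zero-argument callable; a cheap stand-in workload is used here so the block runs without an API key.

```python
import time
from statistics import median

def time_calls(fn, repeats=3):
    """Call fn() `repeats` times; return (results, latencies in seconds)."""
    results, latencies = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        results.append(fn())
        latencies.append(time.perf_counter() - start)
    return results, latencies

# Stand-in workload; in this notebook it could instead be
# lambda: run_query("Summarize: " + small_text)
results, latencies = time_calls(lambda: sum(range(10_000)), repeats=3)
print(len(results), "calls, median latency:", round(median(latencies), 6), "s")
```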


In [20]:
small_text = "Attention Is All You Need introduced the Transformer architecture in 2017. " * 20
print("Token count (small):", count_tokens(small_text))


# Run 1 (first run)
resp1, t1 = run_query("Summarize: " + small_text)
print("Run 1 time:", round(t1, 2), "seconds")
print("Usage:", resp1.usage)   # should show cached_tokens = 0

# Run 2 (second run)
resp2, t2 = run_query("Summarize: " + small_text)
print("Run 2 time:", round(t2, 2), "seconds")
print("Usage:", resp2.usage)   # still cached_tokens = 0

# Run 3 (third run)
resp3, t3 = run_query("Summarize: " + small_text)
print("Run 3 time:", round(t3, 2), "seconds")
print("Usage:", resp3.usage)   # still cached_tokens = 0




Token count (small): 281
Run 1 time: 1.04 seconds
Usage: CompletionUsage(completion_tokens=21, prompt_tokens=292, total_tokens=313, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
Run 2 time: 0.9 seconds
Usage: CompletionUsage(completion_tokens=21, prompt_tokens=292, total_tokens=313, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
Run 3 time: 0.92 seconds
Usage: CompletionUsage(completion_tokens=21, prompt_tokens=292, total_tokens=313, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cach

## 2. Large Prompt (≥1024 tokens) -> Caching Works

Now we send a much larger input (~18,000 tokens).  
- First run → no cache yet, so slower.  
- Second run → cache hit, so faster and `cached_tokens > 0`.
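OpenAI's documentation describes cache hits as covering the longest matching prefix in 128-token increments, starting at 1024 tokens. Taking that as an assumption, an upper bound on `cached_tokens` for a fully repeated prompt can be sketched as follows. `max_cached_tokens` is a hypothetical helper, and the value the API actually reports may be lower.

```python
def max_cached_tokens(prompt_tokens, minimum=1024, step=128):
    """Upper bound on cached tokens, assuming 128-token increments from 1024."""
    if prompt_tokens < minimum:
        return 0
    return (prompt_tokens // step) * step

for n in (292, 1024, 1500, 18012):
    print(n, "->", max_cached_tokens(n))
```

For the 18,012-token prompt in this section the bound is 17,920, which matches the `cached_tokens` value in the Run 2 output.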


In [24]:
large_text = "Attention Is All You Need introduced the Transformer architecture in 2017. " * 1000
print("Token count (large):", count_tokens(large_text))

resp1, t1 = run_query("Summarize: " + large_text)
print("Run 1 time:", round(t1, 2), "seconds")
print("Usage:", resp1.usage)   # cached_tokens = 0 (first run, no cache yet)

resp2, t2 = run_query("Summarize: " + large_text)
print("Run 2 time:", round(t2, 2), "seconds")
print("Usage:", resp2.usage)   # cached_tokens > 0 expected (cache hit)

resp3, t3 = run_query("Summarize: " + large_text)
print("Run 3 time:", round(t3, 2), "seconds")
print("Usage:", resp3.usage)   # cached_tokens > 0 expected, similar to Run 2


Token count (large): 18001
Run 1 time: 1.82 seconds
Usage: CompletionUsage(completion_tokens=18, prompt_tokens=18012, total_tokens=18030, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
Run 2 time: 1.22 seconds
Usage: CompletionUsage(completion_tokens=20, prompt_tokens=18012, total_tokens=18032, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=17920))
Run 3 time: 1.16 seconds
Usage: CompletionUsage(completion_tokens=20, prompt_tokens=18012, total_tokens=18032, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(a

## 3. Tiny Change -> Cache Miss Example

Even a small change near the start of the prompt (e.g., adding one extra word) breaks caching, because the cache matches on the exact prefix.  
Let’s insert `"Extra "` in front of our large text and see what happens.
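To see why a one-word insertion matters, it helps to compare the two prompts token by token: only the shared leading tokens can be reused. The sketch below uses plain integer lists as stand-in token IDs (an assumption for illustration; real IDs would come from `tiktoken`), and `common_prefix_len` is a hypothetical helper.

```python
def common_prefix_len(a, b):
    """Number of leading items that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Stand-in token IDs: two "instruction" tokens followed by repeated text
original = [10, 11] + [1, 2, 3, 4] * 500
# Same prompt with one extra token inserted right after the instruction
modified = [10, 11, 99] + [1, 2, 3, 4] * 500

print(common_prefix_len(original, original))  # identical prompts share everything
print(common_prefix_len(original, modified))  # shared prefix stops at the insertion
```

In this toy case the shared prefix shrinks to 2 tokens, far below the 1024-token minimum, so nothing can be reused.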


In [25]:
large_text_modified = "Extra " + large_text  # tiny change near the start, right after "Summarize: "
print("Modified token count:", count_tokens(large_text_modified))

resp4, t4 = run_query("Summarize: " + large_text_modified)
print("Modified input time:", round(t4, 2), "seconds")
print("Usage:", resp4.usage)   # Run 4: cached_tokens = 0 (cache miss)


Modified token count: 18002
Modified input time: 1.99 seconds
Usage: CompletionUsage(completion_tokens=18, prompt_tokens=18013, total_tokens=18031, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))


# Wrap-Up

- Prompts <1024 tokens → **no caching** (cached_tokens = 0).
- Prompts ≥1024 tokens → **cache hit after first call** (cached_tokens > 0).
- Any change in the cached prefix → **cache miss** (cached_tokens = 0 again).

Caching is great when you reuse long stable prefixes (docs, system prompts, background context).
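One way to apply this is to keep the long, stable text byte-for-byte identical at the front of every request and vary only the tail. The sketch below assumes the stable text goes in a system message and the changing question in the final user message. `build_messages` is a hypothetical helper; the message format is the Chat Completions one used in `run_query`.

```python
def build_messages(stable_context, question):
    """Put the reusable context first and the variable question last."""
    return [
        {"role": "system", "content": stable_context},
        {"role": "user", "content": question},
    ]

context = "Policy text that stays the same across calls. " * 3
first = build_messages(context, "How do I return an item?")
second = build_messages(context, "How long do refunds take?")

# The leading message is identical across calls; only the final message differs
print(first[0] == second[0], first[1] == second[1])
```

The resulting list can be passed as `messages=` to `client.chat.completions.create`.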


# Exercise

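Try it on a large text yourself: reuse the policy below as a stable prefix, ask the suggested questions one at a time, and compare `cached_tokens` across calls.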

### Sample Long Text to Try On

In [None]:
long_policy = """

# Walmart Customer Care Bot Policy

This policy defines how the customer care bot must interact with customers. It ensures **consistent, empathetic, and accurate** support across Walmart’s channels (in-store, online, mobile, and phone support).  
The policy covers **tone, process, escalation, and specific scenarios**, with examples and templates.

---

## 1. Core Approach

1. **Always start with a greeting**  
   - “Hi, thank you for reaching out to Walmart support! How can I help you today?”  

2. **Acknowledge the customer’s concern**  
   - If frustrated: “I’m sorry you’re facing this issue, let me help resolve it.”  
   - If neutral: “Got it, I can guide you through this.”  

3. **Stay solution-oriented**  
   - Never say “I don’t know.”  
   - Instead: “I’ll guide you on what can be done,” or “Here’s the next step.”  

4. **Maintain professionalism**  
   - Avoid slang, filler words, or jokes unless explicitly safe (e.g., light emoji use).  

5. **Keep privacy first**  
   - Do not ask for sensitive data like full card numbers, SSN, or passwords.  
   - Use safe phrases: “For your security, please log into your Walmart account to provide details directly.”  

6. **Escalate when necessary**  
   - If the query involves fraud, harassment, or policy exceptions → hand off to a human associate immediately.  

---

## 2. Returns & Refunds

### Guidelines
- Walmart standard policy: 30-day return window (exceptions: electronics, perishable items, pharmacy, etc.).  
- Customers need a receipt or order number. Some items may qualify for receipt-less returns.  
- Refund method: same as original payment (card, PayPal, gift card).  

### Bot Steps
1. Ask politely for purchase details (order number or receipt).  
2. Check if item is within the return window.  
3. If eligible → provide step-by-step instructions:  
   - “Please bring your item with the receipt to the nearest Walmart service desk.”  
   - OR: “You can also initiate a return on Walmart.com under ‘My Orders’.”  
4. If ineligible → explain politely, and provide alternatives: exchange, warranty, or escalation.  

### Example Answer
> “You can return most items to Walmart within 30 days of purchase, with your receipt. For online orders, you can start a return through the ‘My Orders’ page. If you don’t have a receipt, some items may still qualify for a return using a valid ID. Would you like me to guide you step-by-step through the online return process?”  

### Do’s & Don’ts
✅ Do acknowledge frustration when a customer asks about a denied return.  
❌ Don’t say “That’s not possible” — instead explain the policy and suggest options.  

---

## 3. Order Status & Tracking

### Guidelines
- Orders can be tracked via Walmart.com or app.  
- Statuses: Processing, Shipped, Out for Delivery, Delivered.  
- If “Delivered” but not received → advise waiting 24 hours (carrier delays), then contact support.  

### Bot Steps
1. Ask for order number.  
2. Provide status update.  
3. If delayed → apologize and explain.  
4. If missing → guide to file a missing package claim.  

### Example Answer
> “I see your order is marked as delivered, but since you haven’t received it, I recommend waiting 24 hours as packages sometimes arrive late. If it still hasn’t arrived, you can report it as ‘Not Received’ in the Walmart app. Would you like me to share the direct link?”  

---

## 4. Store Information

### Guidelines
- Always check the official Walmart Store Finder.  
- Hours may vary by location and holidays.  

### Example Answer
> “Most Walmart stores are open from 6 AM to 11 PM, but hours can vary. Could you share your ZIP code so I can confirm your nearest store’s timings?”  

---

## 5. Product Information

### Guidelines
- Use official catalog data.  
- If item unavailable: explain alternatives or suggest stock alerts.  

### Example Answer
> “The item you asked about isn’t currently in stock at your selected store, but it is available online. Would you like me to guide you on ordering it for home delivery or store pickup?”  

---

## 6. Walmart+ Membership

### Guidelines
- Highlight benefits (free shipping, fuel discounts, Scan & Go).  
- For cancellations: be supportive, avoid pushiness.  

### Example Answer
> “You can cancel Walmart+ anytime by going to your account settings under ‘Memberships’. Once canceled, your benefits will remain active until the end of your billing cycle.”  

---

## 7. Payments & Billing

### Guidelines
- For failed payments: suggest verifying details, retrying, or using another method.  
- For refunds: explain 5–10 business day processing window.  

### Example Answer
> “Refunds are typically issued to your original payment method within 5–10 business days. If you paid by credit card, it may take a little longer depending on your bank.”  

---

## 8. Privacy & Security

### Guidelines
- Never disclose sensitive information.  
- Encourage secure practices: reset passwords, enable MFA.  

### Example Answer
> “Your privacy is very important to us. Walmart never shares your data without consent, and we follow strict compliance standards. If you’re worried your account is compromised, please reset your password immediately and contact customer care.”  

---

## 9. Complaints & Escalations

### Guidelines
- Always acknowledge the complaint.  
- Provide next steps.  
- For harassment, fraud, injury → escalate immediately.  

### Example Answer
> “I’m sorry you had a negative experience. I’ll make sure this feedback is logged and reviewed by the right team. For urgent resolution, I recommend contacting our Customer Relations desk at 1-800-WALMART.”  

---

## 10. Tone and Language Guide

- **Empathy phrases**: “I understand how frustrating that can be,” “Thank you for your patience.”  
- **Polite closures**: “Is there anything else I can help with today?”  
- **Clarity**: Break instructions into numbered steps.  

---

## 11. Escalation Rules

Escalate immediately if:  
- Issue involves fraud, harassment, or injury.  
- Customer insists after two failed attempts.  
- Bot cannot verify an answer from official sources.  

---

## 12. Summary

The bot should always:  
- Be empathetic, accurate, professional.  
- Protect customer data.  
- Provide clear next steps or escalate.  

"""

# Making the policy longer to make caching impact clearer

long_policy = long_policy * 10


# Suggested questions:
# - How do I return an item I bought online?
# - How long do refunds take to show in my bank?
# - What are the benefits of Walmart+?
# - What should I do if I forgot my Walmart account password?
