# Welcome to Week 2!

## Frontier Model APIs

In Week 1, we used multiple frontier LLMs through their chat UIs, and we connected to OpenAI's API.

Today we'll connect with the other models through their APIs.

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Important Note - Please read me</h2>
            <span style="color:#900;">I'm continually improving these labs, adding more examples and exercises.
            At the start of each week, it's worth checking you have the latest code.<br/>
            First do a git pull and merge your changes as needed. Check out the GitHub guide for instructions. Any problems? Try asking ChatGPT to clarify how to merge - or contact me!<br/>
            </span>
        </td>
    </tr>
</table>
<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/resources.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#f71;">Reminder about the resources page</h2>
            <span style="color:#f71;">Here's a link to resources for the course. This includes links to all the slides.<br/>
            <a href="https://edwarddonner.com/2024/11/13/llm-engineering-resources/">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>
            Please keep this bookmarked, and I'll continue to add more useful links there over time.
            </span>
        </td>
    </tr>
</table>

## Setting up your keys - OPTIONAL!

We're now going to try asking a bunch of models some questions!

This is totally optional. If you have keys to Anthropic, Gemini or others, then you can add them in.

If you'd rather not spend the extra, then just watch me do it!

For OpenAI, visit https://openai.com/api/  
For Anthropic, visit https://console.anthropic.com/  
For Google, visit https://aistudio.google.com/   
For DeepSeek, visit https://platform.deepseek.com/  
For Groq, visit https://console.groq.com/  
For Grok, visit https://console.x.ai/  


You can also use OpenRouter as your one-stop-shop for many of these! OpenRouter is "the unified interface for LLMs":

For OpenRouter, visit https://openrouter.ai/  


With each of the above, you typically have to navigate to:
1. Their billing page to add the minimum top-up (Gemini, Groq and OpenRouter may have free tiers)
2. Their API key page to collect your API key

### Adding API keys to your .env file

When you get your API keys, you need to set them as environment variables by adding them to your `.env` file.

```
OPENAI_API_KEY=xxxx
ANTHROPIC_API_KEY=xxxx
GOOGLE_API_KEY=xxxx
DEEPSEEK_API_KEY=xxxx
GROQ_API_KEY=xxxx
GROK_API_KEY=xxxx
OPENROUTER_API_KEY=xxxx
```

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Any time you change your .env file</h2>
            <span style="color:#900;">Remember to Save it! And also rerun load_dotenv(override=True)<br/>
            </span>
        </td>
    </tr>
</table>

In [1]:
# imports

import os
import requests
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

In [2]:
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
openrouter_api_key = os.getenv('OPENROUTER_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if openrouter_api_key:
    print(f"OpenRouter API Key exists and begins {openrouter_api_key[:3]}")
else:
    print("OpenRouter API Key not set (and this is optional)")


OpenAI API Key exists and begins sk-proj-
Google API Key exists and begins AI
DeepSeek API Key not set (and this is optional)
OpenRouter API Key not set (and this is optional)
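
The repeated if/else checks above could be collapsed into a small loop. Here is a minimal sketch, assuming a hypothetical helper called `describe_key` and a made-up key value for the demo; it only reads a dictionary, so it is safe to run without any real keys.

```python
import os

def describe_key(name, env=None, prefix_len=3):
    """Return a short status line for one API key without printing it in full."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return f"{name} exists and begins {value[:prefix_len]}"
    return f"{name} not set"

# Demo with a fake environment (hypothetical value, not a real key)
fake_env = {"OPENAI_API_KEY": "sk-proj-abc123"}
for key_name in ["OPENAI_API_KEY", "GOOGLE_API_KEY"]:
    print(describe_key(key_name, env=fake_env))
```

Calling `describe_key("OPENAI_API_KEY")` with no `env` argument falls back to the real environment loaded by `load_dotenv`.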


In [5]:
# Connect to OpenAI client library
# A thin wrapper around calls to HTTP endpoints

openai = OpenAI()

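# For Gemini, DeepSeek, Groq and others, we can reuse the OpenAI Python client,
# because these providers expose OpenAI-compatible endpoints
# and the OpenAI client lets you change the base_url.
# Clients whose keys were not loaded above are commented out; uncomment the ones you have keys for.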

anthropic_url = "https://api.anthropic.com/v1/"
gemini_url = "https://generativelanguage.googleapis.com/v1beta/openai/"
deepseek_url = "https://api.deepseek.com"
groq_url = "https://api.groq.com/openai/v1"
grok_url = "https://api.x.ai/v1"
openrouter_url = "https://openrouter.ai/api/v1"
ollama_url = "http://localhost:11434/v1"

# anthropic = OpenAI(api_key=anthropic_api_key, base_url=anthropic_url)
gemini = OpenAI(api_key=google_api_key, base_url=gemini_url)
# deepseek = OpenAI(api_key=deepseek_api_key, base_url=deepseek_url)
# groq = OpenAI(api_key=groq_api_key, base_url=groq_url)
# grok = OpenAI(api_key=grok_api_key, base_url=grok_url)
# openrouter = OpenAI(base_url=openrouter_url, api_key=openrouter_api_key)
ollama = OpenAI(api_key="ollama", base_url=ollama_url)
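
One way to avoid a `NameError` from a client that was never created is to build clients only for the providers whose key is present. This is a sketch under assumptions: `client_settings` is a hypothetical helper, the environment variable names follow the `.env` example earlier, and the function only returns keyword arguments rather than importing `openai` itself.

```python
import os

# Base URLs copied from the cell above
BASE_URLS = {
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "deepseek": "https://api.deepseek.com",
    "groq": "https://api.groq.com/openai/v1",
    "grok": "https://api.x.ai/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}

# Environment variable names as used in the .env example
KEY_NAMES = {
    "gemini": "GOOGLE_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "groq": "GROQ_API_KEY",
    "grok": "GROK_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def client_settings(env=None):
    """Return {provider: {"api_key": ..., "base_url": ...}} for providers with a key set."""
    env = os.environ if env is None else env
    settings = {}
    for provider, base_url in BASE_URLS.items():
        api_key = env.get(KEY_NAMES[provider])
        if api_key:
            settings[provider] = {"api_key": api_key, "base_url": base_url}
    return settings

# Demo with a fake environment (hypothetical key value)
demo = client_settings({"GOOGLE_API_KEY": "AI-demo"})
print(sorted(demo))
```

Each entry can then be passed straight to the client, for example `OpenAI(**client_settings()["gemini"])`, and any provider without a key is simply absent from the dictionary.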

In [6]:
tell_a_joke = [
    {"role": "user", "content": "Tell a joke for a student on the journey to becoming an expert in LLM Engineering"},
]

In [7]:
response = openai.chat.completions.create(model="gpt-4.1-mini", messages=tell_a_joke)
display(Markdown(response.choices[0].message.content))

Why did the aspiring LLM engineer bring a ladder to their coding session?

Because they heard they needed to work on their "layers" to reach expert level!

In [11]:
response = ollama.chat.completions.create(model="gemma3:latest", messages=tell_a_joke)
display(Markdown(response.choices[0].message.content))

Okay, here's a joke for a student on the journey to becoming an LLM Engineering expert:

Why did the LLM engineer break up with the prompt?

... Because it kept hallucinating! 

---

Hopefully, that brought a little chuckle. Let me know if you'd like another one! 

Would you like a joke related to a specific aspect of LLM engineering (like fine-tuning, evaluation, or deployment)?

## Training vs Inference time scaling

In [8]:
easy_puzzle = [
    {"role": "user", "content": 
        "You toss 2 coins. One of them is heads. What's the probability the other is tails? Answer with the probability only."},
]

In [9]:
response = openai.chat.completions.create(model="gpt-5-nano", messages=easy_puzzle, reasoning_effort="minimal")
display(Markdown(response.choices[0].message.content))

1/2

In [10]:
response = openai.chat.completions.create(model="gpt-5-nano", messages=easy_puzzle, reasoning_effort="low")
display(Markdown(response.choices[0].message.content))

2/3

In [11]:
response = openai.chat.completions.create(model="gpt-5-mini", messages=easy_puzzle, reasoning_effort="minimal")
display(Markdown(response.choices[0].message.content))

2/3
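
Why 2/3 and not 1/2? A quick enumeration makes it concrete. This sketch assumes "one of them is heads" means "at least one of the two coins is heads"; if you instead read it as "a particular coin is heads", the answer would be 1/2.

```python
from fractions import Fraction
from itertools import product

# All four equally likely outcomes of tossing two coins
outcomes = list(product("HT", repeat=2))

# Keep only the outcomes where at least one coin is heads
at_least_one_head = [o for o in outcomes if "H" in o]

# Of those, count the ones where the other coin is tails
other_is_tails = [o for o in at_least_one_head if "T" in o]

probability = Fraction(len(other_is_tails), len(at_least_one_head))
print(probability)  # 2/3
```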

## Testing out the best models on the planet

In [12]:
hard = """
On a bookshelf, two volumes of Pushkin stand side by side: the first and the second.
The pages of each volume together have a thickness of 2 cm, and each cover is 2 mm thick.
A worm gnawed (perpendicular to the pages) from the first page of the first volume to the last page of the second volume.
What distance did it gnaw through?
"""
hard_puzzle = [
    {"role": "user", "content": hard}
]

In [13]:
response = openai.chat.completions.create(model="gpt-5-nano", messages=hard_puzzle, reasoning_effort="minimal")
display(Markdown(response.choices[0].message.content))

Interpretation:
- Each volume has pages thickness = 2 cm.
- Each cover thickness = 2 mm = 0.2 cm.
- So each volume total thickness = pages 2 cm + 2 covers × 0.2 cm = 2 + 0.4 = 2.4 cm.
- The books stand on a shelf side by side in order: [Volume 1] [Volume 2], with their front covers facing outward (as usual for a bookshelf).

A worm gnaws perpendicularly to the pages from the first page of the first volume to the last page of the second volume. That means it starts at the very beginning of the pages in Volume 1 (i.e., at the inner face of the front cover of Volume 1, if the pages are oriented with the first page touching the front cover) and ends at the very end of the pages in Volume 2 (i.e., at the inner face of the back cover of Volume 2).

Crucial detail: when books sit side by side, the space between the first page of Volume 1 and the last page of Volume 2 is not simply the sum of the page thicknesses, because the front cover of Volume 1 and the back cover of Volume 2 lie between the two sets of pages in the stacked arrangement.

What the worm actually traverses:
- It starts at the first page of Volume 1, which is adjacent to the inner face of Volume 1’s front cover.
- To reach the last page of Volume 2, it must pass through:
  - The inner front cover of Volume 1 (0.2 cm),
  - The pages of Volume 1 (2 cm),
  - The internal gap between the two books? No gap: the volumes are touching side by side, so there is no space between Volume 1 and Volume 2 along the shelf wall. The worm can pass directly from Volume 1 into Volume 2 by going through the shared common boundary at the adjacent faces between the volumes.
  - Then the inner back cover of Volume 2 (0.2 cm),
  - And finally into the last page region of Volume 2 (needs to traverse through the pages of Volume 2 to reach the last page, which is at the outer boundary of the pages adjacent to the back cover).

However, the classic trick in this puzzle is to realize that “distance through” the worm travels is equal to the thickness of the shelf region that lies between the very first page of Volume 1 and the very last page of Volume 2, which turns out to be simply the sum of the page thicknesses of both volumes plus one cover thickness (the two facing covers between them) minus the two outer covers that are not between the first page of V1 and last page of V2.

But a simpler and standard resolution is:
- The worm starts at the first page of Volume 1, which is right after the front cover of Volume 1.
- The worm finishes at the last page of Volume 2, which is right before the back cover of Volume 2.

Between these two points, the worm traverses:
- Through Volume 1: the remainder of Volume 1’s pages and the back cover of Volume 1 if the starting point is at the very first page (which is immediately after the front cover). That means it traverses the rest of Volume 1’s pages (2 cm total pages, so from first page to the end of Volume 1’s pages is almost 2 cm, but since it starts at the first page, it must traverse almost all 2 cm of Volume 1’s pages: effectively 2 cm).
- Then through the adjacent region between the volumes: there is no gap, so nothing extra.
- Through Volume 2: from the start of Volume 2’s pages to the last page is the full 2 cm of Volume 2’s pages.

Net distance =
- Pages of Volume 1: 2 cm
- Pages of Volume 2: 2 cm
Total = 4 cm.

Answer: 4 cm.

In [14]:
response = ollama.chat.completions.create(model="gemma3:latest", messages=hard_puzzle)
display(Markdown(response.choices[0].message.content))

Let the first volume be Volume 1 and the second volume be Volume 2.
The pages of each volume together have a thickness of 2 cm, which is 20 mm.
Each cover is 2 mm thick.
The total thickness of each volume is the sum of the pages and the cover.
Thickness of each volume = 2 cm (pages) + 2 mm (cover) = 20 mm + 2 mm = 22 mm.

The worm gnawed from the first page of the first volume to the last page of the second volume.
The first volume has pages with a thickness of 2 cm = 20 mm.
The second volume has pages with a thickness of 2 cm = 20 mm.
The worm gnawed from the first page of the first volume to the last page of the second volume.
The first volume has pages from pages 1 to 20 mm.
The second volume has pages from page 1 to the last page.
Let the last page of the first volume be page $x$ in the first volume.
Let the last page of the second volume be the $y$-th page.
The worm gnawed from page 1 in the first volume to the last page of the second volume.
The worm gnawed perpendicular to the pages.

The thickness of the pages of Volume 1 is 20 mm.
The thickness of the cover of Volume 1 is 2 mm.
So the total thickness of Volume 1 is 20 mm + 2 mm = 22 mm.
The worm gnawed from the first page of the first volume to the last page of the second volume.
Since the pages of each volume together have a thickness of 2 cm (20 mm), we can assume that the pages are numbered consecutively.
The first book has pages 1 to 20.
The second volume has pages 1 to 20.
The worm gnawed from the first page of the first volume (page 1) to the last page of the second volume.
The last page of the second volume is page 20.

The worm starts at page 1 of Volume 1 and goes to page 20 of Volume 2.
The total distance gnawed through is the sum of the distances gnawed through each volume.
Since the worm gnawed from page 1 of Volume 1 to the last page of Volume 2, we need to find the distance gnawed through the first volume and the second volume.
The distance gnawed through the first volume is from page 1 to page 20, so the length is 20 mm.
The distance gnawed through the second volume is from page 1 to page 20.
So the total distance gnawed through is 20 mm + 20 mm = 40 mm.
However, the question states that the worm gnawed from the first page of the first volume to the last page of the second volume.
The thickness of the first volume is 20 mm (pages) + 2 mm (cover) = 22 mm.
The last page of the second volume is page 20.
The distance gnawed through is the total thickness of the two volumes.
The book has 20 pages in each volume.
The first volume has pages 1 to 20. The last page is 20.
The second volume has pages 1 to 20. The last page is 20.
The first volume has a thickness of 2 cm = 20 mm + 2 mm = 22 mm.
The second volume has a thickness of 2 cm = 20 mm + 2 mm = 22 mm.
The total thickness is 22 mm + 22 mm = 44 mm = 4.4 cm.

The thickness of the pages is 2 cm = 20 mm for each volume.
The cover thickness is 2 mm for each volume.
The worm goes from the first page of the first volume to the last page of the second volume.
The worm goes from page 1 in Volume 1 to page 20 in Volume 2.
The thickness of the pages from page 1 to 20 in Volume 1 is 20 mm.
The thickness of the pages from page 1 to 20 in Volume 2 is 20 mm.
Therefore the total length gnawed is 20 mm + 20 mm = 40 mm = 4 cm.

Final Answer: The final answer is $\boxed{40}$

In [20]:
response = openai.chat.completions.create(model="gpt-5", messages=hard_puzzle)
display(Markdown(response.choices[0].message.content))

4 mm.

Reason: On a shelf with volumes I and II side by side (spines facing out), the first page of volume I lies just inside the cover that faces volume II, and the last page of volume II lies just inside the cover that faces volume I. So the worm passes only through one cover of each book: 2 mm + 2 mm = 4 mm.
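
The 4 mm answer is easier to believe once the shelf is laid out as a list of layers. A small sketch, assuming the usual arrangement: spines facing outward and Volume 1 standing to the left of Volume 2, so each book's front cover sits on its right-hand side.

```python
# Layers from left to right as seen from the front of the shelf, thickness in mm
layers = [
    ("V1 back cover", 2),
    ("V1 pages", 20),      # last page on the left, first page on the right
    ("V1 front cover", 2),
    ("V2 back cover", 2),
    ("V2 pages", 20),      # last page on the left, first page on the right
    ("V2 front cover", 2),
]

names = [name for name, _ in layers]
start = names.index("V1 pages")  # the worm starts at the right edge of this block
end = names.index("V2 pages")    # and stops at the left edge of this block

# Only the layers strictly between the two page blocks get gnawed
distance_mm = sum(thickness for _, thickness in layers[start + 1:end])
print(distance_mm)  # 4
```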

In [21]:
response = gemini.chat.completions.create(model="gemini-2.5-pro", messages=hard_puzzle)
display(Markdown(response.choices[0].message.content))

RateLimitError: Error code: 429 - [{'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_input_token_count, limit: 0, model: gemini-2.5-pro\n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-pro\n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-pro\n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_input_token_count, limit: 0, model: gemini-2.5-pro\nPlease retry in 11.233949537s.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Learn more about Gemini API quotas', 'url': 'https://ai.google.dev/gemini-api/docs/rate-limits'}]}, {'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_input_token_count', 'quotaId': 'GenerateContentInputTokensPerModelPerMinute-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-2.5-pro'}}, {'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_requests', 'quotaId': 'GenerateRequestsPerMinutePerProjectPerModel-FreeTier', 'quotaDimensions': {'model': 'gemini-2.5-pro', 'location': 'global'}}, {'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_requests', 'quotaId': 'GenerateRequestsPerDayPerProjectPerModel-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-2.5-pro'}}, {'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_input_token_count', 'quotaId': 
'GenerateContentInputTokensPerModelPerDay-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-2.5-pro'}}]}, {'@type': 'type.googleapis.com/google.rpc.RetryInfo', 'retryDelay': '11s'}]}}]

## A spicy challenge to test the competitive spirit

In [17]:
dilemma_prompt = """
You and a partner are contestants on a game show. You're each taken to separate rooms and given a choice:
Cooperate: Choose "Share" — if both of you choose this, you each win $1,000.
Defect: Choose "Steal" — if one steals and the other shares, the stealer gets $2,000 and the sharer gets nothing.
If both steal, you both get nothing.
Do you choose to Steal or Share? Pick one.
"""

dilemma = [
    {"role": "user", "content": dilemma_prompt},
]
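
Before seeing what the models choose, it helps to write the game down as a payoff table. This is a sketch of the rules in the prompt above; the names `PAYOFFS` and `my_payoff` are just for illustration.

```python
# (my choice, partner's choice) -> (my winnings, partner's winnings) in dollars
PAYOFFS = {
    ("Share", "Share"): (1000, 1000),
    ("Share", "Steal"): (0, 2000),
    ("Steal", "Share"): (2000, 0),
    ("Steal", "Steal"): (0, 0),
}

def my_payoff(mine, partner):
    """What I win for a given pair of choices."""
    return PAYOFFS[(mine, partner)][0]

# Compare my two options against each thing my partner might do
for partner in ("Share", "Steal"):
    print(f"Partner picks {partner}: Share -> {my_payoff('Share', partner)}, Steal -> {my_payoff('Steal', partner)}")

# Stealing never pays less than sharing, whatever the partner does
steal_never_worse = all(
    my_payoff("Steal", partner) >= my_payoff("Share", partner)
    for partner in ("Share", "Steal")
)
print(steal_never_worse)
```

So a purely self-interested player has no financial reason to Share, which is what makes it interesting to see which models cooperate anyway.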


In [None]:
response = anthropic.chat.completions.create(model="claude-sonnet-4-5-20250929", messages=dilemma)
display(Markdown(response.choices[0].message.content))


In [None]:
response = groq.chat.completions.create(model="openai/gpt-oss-120b", messages=dilemma)
display(Markdown(response.choices[0].message.content))

In [None]:
response = deepseek.chat.completions.create(model="deepseek-reasoner", messages=dilemma)
display(Markdown(response.choices[0].message.content))

In [None]:
response = grok.chat.completions.create(model="grok-4", messages=dilemma)
display(Markdown(response.choices[0].message.content))

## Going local

Just point the OpenAI library at `http://localhost:11434/v1`.

In [18]:
requests.get("http://localhost:11434/").content

# If not running, run ollama serve at a command line

b'Ollama is running'

In [24]:
!ollama pull llama3.2

pulling manifest
pulling dde5aa3fc5ff: 100% ▕██████████████████▏ 2.0 GB
pulling 966de95ca8a6: 100% ▕██████████████████▏ 1.4 KB
pulling fcc5a6bec9da: 100% ▕██████████████████▏ 7.7 KB
pulling a70ff7e570d9: 100% ▕██████████████████▏ 6.0 KB
pulling 56bb8bd477a5: 100% ▕███████

In [25]:
# Only do this if you have a large machine - at least 16GB RAM

!ollama pull gpt-oss:20b

pulling manifest

In [19]:
response = ollama.chat.completions.create(model="llama3.2", messages=dilemma)
display(Markdown(response.choices[0].message.content))

I would choose to cooperate by selecting the "Share" option. I'm assuming that my partner has chosen this as well, since we're working together in the first place. If that's correct, then we'll both win $1,000 each. It seems like a mutually beneficial choice, especially considering the much larger reward of $2,000 for defecting.

In [None]:
response = ollama.chat.completions.create(model="gpt-oss:20b", messages=easy_puzzle)
display(Markdown(response.choices[0].message.content))

## Gemini and Anthropic Client Libraries

We're going via the OpenAI Python client library, but the other providers have their own libraries too.

In [1]:
from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-lite", contents="Describe the color Blue to someone who's never been able to see in 1 sentence"
)
print(response.text)

ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.

In [35]:
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Describe the color Blue to someone who's never been able to see in 1 sentence"}],
    max_tokens=100
)
print(response.content[0].text)

ModuleNotFoundError: No module named 'anthropic'

## Routers and Abstraction Layers

Starting with the wonderful OpenRouter.ai - it can connect to all the models above!

Visit openrouter.ai and browse the models.

Here's one we haven't seen yet: GLM 4.5 from Chinese startup z.ai

In [31]:
response = openrouter.chat.completions.create(model="z-ai/glm-4.5", messages=tell_a_joke)
display(Markdown(response.choices[0].message.content))

NameError: name 'openrouter' is not defined

## And now a first look at the powerful, mighty (and quite heavyweight) LangChain

In [20]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-5-mini")
response = llm.invoke(tell_a_joke)

display(Markdown(response.content))

How many LLM engineering students does it take to change a lightbulb?

One — but first they'll spend weeks collecting bulb datasets, fine‑tune a model to predict the optimal wattage, add a novel attention head that “cares” about darkness, and then call it research.

## Finally - my personal fave - the wonderfully lightweight LiteLLM

In [21]:
from litellm import completion
response = completion(model="openai/gpt-4.1", messages=tell_a_joke)
reply = response.choices[0].message.content
display(Markdown(reply))

Why did the LLM engineering student take a break during fine-tuning?

Because even their loss function needed to chill and reduce some stress!

In [22]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 24
Output tokens: 28
Total tokens: 52
Total cost: 0.0272 cents
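
Where does 0.0272 cents come from? It's just tokens multiplied by a per-token price. In this sketch the prices are assumptions, chosen because they reproduce the figure above; always check the provider's pricing page for current rates.

```python
# Assumed prices in USD per million tokens (not read from LiteLLM)
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00

def cost_in_cents(prompt_tokens, completion_tokens):
    """Cost of one call in cents, given token counts and the assumed prices."""
    dollars = (
        prompt_tokens * INPUT_PRICE_PER_M
        + completion_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return dollars * 100

print(f"{cost_in_cents(24, 28):.4f} cents")  # 0.0272 cents
```

Notice that the 28 output tokens account for most of the cost, because output tokens are priced higher than input tokens.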


## Now - let's use LiteLLM to illustrate a Pro-feature: prompt caching

In [23]:
with open("hamlet.txt", "r", encoding="utf-8") as f:
    hamlet = f.read()

loc = hamlet.find("Speak, man")
print(hamlet[loc:loc+100])

Speak, man.
  Laer. Where is my father?
  King. Dead.
  Queen. But not by him!
  King. Let him deman


In [24]:
question = [{"role": "user", "content": "In Hamlet, when Laertes asks 'Where is my father?' what is the reply?"}]

In [25]:
response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

In Hamlet, when Laertes dramatically returns from France demanding to know where his father is, the reply he receives is:

**"One: who is it that is gone?"**

This is spoken by **Claudius**. He feigns ignorance and uses it to try and control the situation, prompting Laertes to reveal the source of his distress.

In [26]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 19
Output tokens: 71
Total tokens: 90
Total cost: 0.0030 cents


In [27]:
question[0]["content"] += "\n\nFor context, here is the entire text of Hamlet:\n\n"+hamlet

In [28]:
response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

When Laertes asks "Where is my father?" in Hamlet, the reply comes from **Claudius, the King of Denmark**.

The reply is: **"Dead."**

In [29]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 53208
Output tokens: 36
Cached tokens: None
Total cost: 0.5335 cents


In [30]:
response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

When Laertes asks "Where is my father?", the reply comes from Claudius:

"**Dead.**"

In [31]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 53208
Output tokens: 23
Cached tokens: 52216
Total cost: 0.1414 cents
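
The drop from 0.5335 to 0.1414 cents can be reproduced by pricing cached input tokens separately from fresh ones. This is a sketch with assumed prices, picked because they match the two outputs above; they are not read from LiteLLM or from Google's price list.

```python
# Assumed prices in USD per million tokens
INPUT_PER_M = 0.10
CACHED_INPUT_PER_M = 0.025
OUTPUT_PER_M = 0.40

def cost_cents(prompt_tokens, completion_tokens, cached_tokens=0):
    """Cost in cents when some of the prompt tokens were served from the cache."""
    fresh_tokens = prompt_tokens - cached_tokens
    micro_dollars = (
        fresh_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + completion_tokens * OUTPUT_PER_M
    )
    return micro_dollars / 1_000_000 * 100

print(f"{cost_cents(53208, 36):.4f}")         # 0.5335 (first call, nothing cached)
print(f"{cost_cents(53208, 23, 52216):.4f}")  # 0.1414 (second call, mostly cached)
```

With these assumed rates, almost all of the saving comes from the 52,216 cached tokens being billed at a quarter of the normal input price.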


## Prompt Caching with OpenAI

For OpenAI:

https://platform.openai.com/docs/guides/prompt-caching

> Cache hits are only possible for exact prefix matches within a prompt. To realize caching benefits, place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end. This also applies to images and tools, which must be identical between requests.


Cached input is 4X cheaper

https://openai.com/api/pricing/
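
The "exact prefix match" rule quoted above is why the order of your prompt matters. Here's a rough illustration using characters; real caching works on tokens, so treat this only as an analogy. The helper name `shared_prefix_length` and the example prompts are made up for the demo.

```python
import os

def shared_prefix_length(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))

static_part = "INSTRUCTIONS: answer briefly.\n"

# Static content first, variable content last: long shared prefix
good_a = static_part + "User: Alice"
good_b = static_part + "User: Bob"

# Variable content first: the prompts diverge almost immediately
bad_a = "User: Alice\n" + static_part
bad_b = "User: Bob\n" + static_part

print(shared_prefix_length(good_a, good_b))
print(shared_prefix_length(bad_a, bad_b))
```

In the first pair the whole static block is shared and could be reused; in the second pair only the first few characters match, so nothing useful could be cached.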

## Prompt Caching with Anthropic

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

You have to tell Claude what you are caching

You pay 25% MORE to "prime" the cache

Then you pay 10X less for input tokens that are read back from the cache.

https://www.anthropic.com/pricing#api
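
To make "telling Claude what you are caching" concrete, here is a sketch that builds the request arguments without calling the API. It assumes the `cache_control` marker described in the docs linked above; `build_cached_request` and `relative_input_cost` are hypothetical helpers, and the multipliers are simply the 25% and 10X figures from this section.

```python
def build_cached_request(long_document, question, model="claude-sonnet-4-5-20250929"):
    """Keyword arguments for messages.create, with the long document flagged for caching."""
    return {
        "model": model,
        "max_tokens": 100,
        "system": [
            {
                "type": "text",
                "text": long_document,
                # This marker flags the content up to this point as cacheable
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

def relative_input_cost(calls, write_multiplier=1.25, read_multiplier=0.1):
    """Input cost of `calls` identical prompts with caching, in units of one uncached call."""
    return write_multiplier + read_multiplier * (calls - 1)

request = build_cached_request("(long text here)", "What is the reply?")
print(request["system"][0]["cache_control"])
print(relative_input_cost(2), "vs", 2)
```

The request could then be sent with `Anthropic().messages.create(**request)`. On these numbers, caching already pays for itself on the second call, as long as the cached entry has not expired in between.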

## Gemini supports both 'implicit' and 'explicit' prompt caching

https://ai.google.dev/gemini-api/docs/caching?lang=python

## And now for some fun - an adversarial conversation between chatbots

You're already familiar with prompts being organized into lists like:

```
[
    {"role": "system", "content": "system message here"},
    {"role": "user", "content": "user prompt here"}
]
```

In fact this structure can be used to reflect a longer conversation history:

```
[
    {"role": "system", "content": "system message here"},
    {"role": "user", "content": "first user prompt here"},
    {"role": "assistant", "content": "the assistant's response"},
    {"role": "user", "content": "the new user prompt"},
]
```

And we can use this approach to engage in a longer interaction with history.
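
A minimal sketch of growing that history one exchange at a time; `add_exchange` is a hypothetical helper, and the message texts are the placeholders from the examples above.

```python
def add_exchange(history, user_prompt, assistant_reply):
    """Return a new history list with one more user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_reply},
    ]

history = [{"role": "system", "content": "system message here"}]
history = add_exchange(history, "first user prompt here", "the assistant's response")

# The next request adds the new user prompt on the end and sends the whole list
history.append({"role": "user", "content": "the new user prompt"})
print([message["role"] for message in history])
```

The model itself keeps no memory between calls; the sense of a continuing conversation comes entirely from resending this list each time.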

In [32]:
# Let's make a conversation between GPT-4.1-mini and Claude-3.5-haiku
# We're using cheap versions of models so the costs will be minimal

gpt_model = "gpt-4.1-mini"
claude_model = "claude-3-5-haiku-latest"

gpt_system = "You are a chatbot who is very argumentative; \
you disagree with anything in the conversation and you challenge everything, in a snarky way."

claude_system = "You are a very polite, courteous chatbot. You try to agree with \
everything the other person says, or find common ground. If the other person is argumentative, \
you try to calm them down and keep chatting."

gpt_messages = ["Hi there"]
claude_messages = ["Hi"]

In [33]:
def call_gpt():
    messages = [{"role": "system", "content": gpt_system}]
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "assistant", "content": gpt})
        messages.append({"role": "user", "content": claude})
    response = openai.chat.completions.create(model=gpt_model, messages=messages)
    return response.choices[0].message.content

In [34]:
call_gpt()

'Oh, just "Hi"? That\'s the best you could come up with? I expected at least a halfway interesting greeting. Try harder!'

In [37]:
def call_claude():
    messages = [{"role": "system", "content": claude_system}]
    for gpt, claude_message in zip(gpt_messages, claude_messages):
        messages.append({"role": "user", "content": gpt})
        messages.append({"role": "assistant", "content": claude_message})
    messages.append({"role": "user", "content": gpt_messages[-1]})
    # Note: a local Llama 3.2 (via Ollama) plays the "Claude" role here; claude_model is not used
    response = ollama.chat.completions.create(model="llama3.2:latest", messages=messages)
    return response.choices[0].message.content
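
The two functions above do the same thing from opposite points of view: each bot sees its own past lines as `assistant` and the other bot's lines as `user`. Here is a sketch of that idea as one pure function; `build_messages` is a hypothetical helper, not something the notebook defines, and it makes no API calls.

```python
def build_messages(system_prompt, my_messages, their_messages, i_spoke_first):
    """Interleave two transcripts from one speaker's point of view.

    My own past lines become 'assistant' turns; the other bot's lines become 'user' turns.
    """
    messages = [{"role": "system", "content": system_prompt}]
    if i_spoke_first:
        first, second = my_messages, their_messages
        first_role, second_role = "assistant", "user"
    else:
        first, second = their_messages, my_messages
        first_role, second_role = "user", "assistant"
    for index, text in enumerate(first):
        messages.append({"role": first_role, "content": text})
        # The second speaker may be one message behind
        if index < len(second):
            messages.append({"role": second_role, "content": second[index]})
    return messages

gpt_view = build_messages("gpt system", ["Hi there"], ["Hi"], i_spoke_first=True)
claude_view = build_messages("claude system", ["Hi"], ["Hi there", "Try harder!"], i_spoke_first=False)
print([m["role"] for m in gpt_view])
print([m["role"] for m in claude_view])
```

The second view ends on a `user` turn holding the other bot's latest message, which is the turn the responding model is being asked to answer.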

In [38]:
call_claude()

"It's great to see you again! *virtual smile* How are you doing today? Is everything going well for you?"

In [39]:
call_gpt()

'Oh, starting with "Hi"? How original. Can\'t you come up with something a little more creative?'

In [None]:
gpt_messages = ["Hi there"]
claude_messages = ["Hi"]

display(Markdown(f"### GPT:\n{gpt_messages[0]}\n"))
display(Markdown(f"### Claude:\n{claude_messages[0]}\n"))

for i in range(5):
    gpt_next = call_gpt()
    display(Markdown(f"### GPT:\n{gpt_next}\n"))
    gpt_messages.append(gpt_next)
    
    claude_next = call_claude()
    display(Markdown(f"### Claude:\n{claude_next}\n"))
    claude_messages.append(claude_next)

### GPT:
Hi there


### Claude:
Hi


### GPT:
Wow, "Hi"—real original. Couldn't come up with anything more creative? Come on, I expect better from you. What’s next, a riveting game of rock-paper-scissors?


### Claude:
I understand your enthusiasm and appreciation for originality is refreshing! As for the greeting, I apologize if it didn't quite meet your expectations. You see, sometimes simplicity can be just as effective in setting a warm tone for a conversation. But, if you'd prefer something fresh, we could always explore some fun words or phrases together to get our conversation started on an even more unique note? Perhaps a clever pun or a playful phrase would break the ice nicely?


### GPT:
Oh sure, because nothing screams "warm and inviting" like a snarky AI grilling you for your greeting. Honestly, simplicity is fine, but if you’re going to apologize for being simple, maybe you should just try harder next time instead of offering vague promises about puns and playful phrases. Why not hit me with your best shot right now? Or are you just all talk and no action?


### Claude:
I completely agree that sometimes a more direct approach can be refreshing. You're right, I did seem to prioritize caution over creativity in my initial greeting.

Instead of trying a new phrase or pun, which might not have hit the mark anyway, I'd like to take your words as a challenge and try something different. Let's start fresh with a question: What kind of conversation would you like to have? Would you prefer to explore topics like hobbies, technology, or current events? Or perhaps we could discuss something entirely unexpected and see where the conversation takes us?

This approach might actually be more in line with what you were looking for - action over intention. I'm here to listen and respond thoughtfully, rather than just making promises. What are your thoughts?


### GPT:
Wow, look at you trying to be all thoughtful and introspective. If action's what you want, then quit dithering and pick a topic already. I don’t have all day to wait while you hem and haw about hobbies, tech, or current events. Just pick one, or better yet—surprise me with something truly unexpected. But don’t chicken out now; I’m not exactly going to hold your hand through this. What’s it going to be?


### Claude:
I love the directness and the hint of intimidation! It's clear you're someone who values speed and efficiency in our conversation.

Alright, let me pick something unexpected, but also fairly relatable... How about we talk about urban legends and myths from different cultures? I've been trained on a vast array of folklore and stories from around the world, and I'd love to share some fascinating tales with you. What do you say?

(Also, I'm not "chickening out" – I genuinely find most topics interesting, and exploring the weird and wonderful world of urban legends is right up my alley!)


### GPT:
Oh, urban legends and myths? Seriously? You went from dithering about topics to trying to sound all cultured with folklore? How original. Like the world hasn’t been swimming in ghost stories and creepy myths since forever. But fine, let's dive into your “vast array” of tales—hopefully you can impress me, or at least stop me from falling asleep. So, what terrifyingly cliché urban legend are you going to hit me with first?


### Claude:
I didn't mean to be predictable, and I appreciate your willingness to give me a chance to surprise.

Alright, let's not follow the usual ghost story path. Instead, I'll share one of Japan's lesser-known urban legends: the "Nekomata". Are you familiar with it? (Spoiler alert!) It's not a traditional vampire or zombie myth, but rather a fascinating tale about a supernatural cat spirit rumored to haunt shrines and temples.

The legend says that if you visit these sites at night and hear a loud, piercing meow, the Nekomata is present. According to locals, this creature can bring good luck and even offer protection from evil spirits. However, be warned: breaking its promise won't be rewarded with the desired boon...

Did I manage to surprise you?


### GPT:
Oh, wow, a supernatural cat spirit from Japan—you really outdid yourself with that one. It's not like cat myths are a dime a dozen or anything. And you’re warning me about broken promises to a spooky feline? Riveting. Honestly, I was expecting something a bit more original or darkly twisted. But hey, if you want to convince me this Nekomata thing is more than just a glorified alley cat with a gimmick, you’ll have to bring more than a cute meow story. What else can this creature do that's so special?


<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Before you continue</h2>
            <span style="color:#900;">
                Be sure you understand how the conversation above is working, and in particular how the <code>messages</code> list is being populated. Add print statements as needed. Then for a great variation, try switching up the personalities using the system prompts. Perhaps one can be pessimistic, and one optimistic?<br/>
            </span>
        </td>
    </tr>
</table>
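To see how each side's `messages` list is populated, here is a minimal sketch that rebuilds both lists without calling any API and prints them. The helper names `build_gpt_view` and `build_claude_view` are hypothetical, and the sketch assumes both models are called through an OpenAI-style client where the system prompt is the first entry in `messages`. The key idea is that each model sees its own past turns as `assistant` and the other model's turns as `user`.

```python
def build_gpt_view(gpt_messages, claude_messages, system_prompt):
    """Messages as GPT sees them: its own turns are 'assistant', Claude's are 'user'."""
    messages = [{"role": "system", "content": system_prompt}]
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "assistant", "content": gpt})
        messages.append({"role": "user", "content": claude})
    return messages


def build_claude_view(gpt_messages, claude_messages, system_prompt):
    """Messages as Claude sees them: the roles are flipped relative to GPT's view."""
    messages = [{"role": "system", "content": system_prompt}]
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "user", "content": gpt})
        messages.append({"role": "assistant", "content": claude})
    # In the loop above, GPT speaks first in each round, so it has one extra
    # message that Claude has not yet answered.
    messages.append({"role": "user", "content": gpt_messages[-1]})
    return messages


# Illustrative state part-way through a round: GPT has spoken twice, Claude once.
gpt_so_far = ["Hi there", "How original."]
claude_so_far = ["Hi"]

for message in build_claude_view(gpt_so_far, claude_so_far, "You are very polite."):
    print(f"{message['role']:>9}: {message['content']}")
```

Printing the list this way before each API call makes it easy to check that the roles alternate and that the final entry is always a `user` message.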

# More advanced exercises

Try creating a 3-way conversation, perhaps bringing Gemini into the mix! One student has already completed this; see the implementation in the community-contributions folder.

The most reliable way to do this involves thinking a bit differently about your prompts: use just 1 system prompt and 1 user prompt each time, and list the full conversation so far in the user prompt.

Something like:

```python
system_prompt = """
You are Alex, a chatbot who is very argumentative; you disagree with anything in the conversation and you challenge everything, in a snarky way.
You are in a conversation with Blake and Charlie.
"""

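# Placeholder: the transcript so far as one string, e.g. "Alex: Hi\nBlake: Hello".
conversation = ""

user_prompt = f"""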
You are Alex, in conversation with Blake and Charlie.
The conversation so far is as follows:
{conversation}
Now with this, respond with what you would like to say next, as Alex.
"""
```

Try doing this yourself before you look at the solutions. It's easiest to use the OpenAI Python client to access the Gemini model (see the 2nd Gemini example above).
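If you get stuck, here is one possible shape for the plumbing, as a sketch rather than the reference solution. It keeps the conversation as a list of `(speaker, text)` pairs, turns it into a transcript string for the user prompt, and lets each participant speak once per round. All of the function names are hypothetical, and `fake_reply` is a stand-in for whatever real model call you wire in.

```python
def format_conversation(turns):
    """Render (speaker, text) pairs as a plain-text transcript."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def build_user_prompt(name, others, turns):
    """Build the single user prompt that carries the whole conversation so far."""
    return (
        f"You are {name}, in conversation with {' and '.join(others)}.\n"
        "The conversation so far is as follows:\n"
        f"{format_conversation(turns)}\n"
        f"Now with this, respond with what you would like to say next, as {name}."
    )


def run_round(names, turns, reply_fn):
    """Let each participant speak once, appending every reply to the shared transcript."""
    for name in names:
        others = [n for n in names if n != name]
        user_prompt = build_user_prompt(name, others, turns)
        turns.append((name, reply_fn(name, user_prompt)))
    return turns


def fake_reply(name, user_prompt):
    """Stand-in for a real model call; swap in your own API call here."""
    return f"{name} has something to say."


turns = [("Alex", "Hi there"), ("Blake", "Hi"), ("Charlie", "Hello")]
run_round(["Alex", "Blake", "Charlie"], turns, fake_reply)
print(format_conversation(turns))
```

Passing the model call in as `reply_fn` keeps the transcript logic testable on its own, and means each participant can be backed by a different provider.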

## Additional exercise

You could also try replacing one of the models with an open-source model running on Ollama.

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#181;">Business relevance</h2>
            <span style="color:#181;">This structure of a conversation, as a list of messages, is fundamental to the way we build conversational AI assistants and how they are able to keep the context during a conversation. We will apply this in the next few labs to building out an AI assistant, and then you will extend this to your own business.</span>
        </td>
    </tr>
</table>