<a href="https://colab.research.google.com/github/sufiyansayyed19/LLM_Learning/blob/main/W2D1_playWithAPIs_and_Tokens.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Calling Different APIs

## API key setup

In [24]:
from google.colab import userdata

open_route_api_key = userdata.get("OPEN_ROUTE_API_KEY")
gemini_api_key = userdata.get("GEMINI_API_KEY")

### Check Keys

In [2]:
if open_route_api_key:
  print("open_route api key is good.")
else:
  print("open_route api key not found.")

if gemini_api_key:
  print("gemini_api_key is good.")
else:
  print("gemini_api key not found.")

open_route api key is good.
gemini_api_key is good.


## URL and client setup

In [3]:
gemini_url = "https://generativelanguage.googleapis.com/v1beta/openai/"
open_route_url = "https://openrouter.ai/api/v1"

In [4]:
from openai import OpenAI

In [5]:
openroute = OpenAI(base_url=open_route_url, api_key=open_route_api_key)

gemini = OpenAI(base_url=gemini_url, api_key=gemini_api_key)

## User prompt

In [6]:
tell_a_joke = [
    {"role": "user", "content": "Tell a joke for a student on the journey to becoming an expert in LLM Engineering"}
]

## API call with markdown

### OpenRouter

In [7]:
from IPython.display import Markdown, display

response = openroute.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=tell_a_joke
)

display(Markdown(response.choices[0].message.content))

Why did the LLM engineer break up with their dataset?

Because it had too many bias issues and just couldn't keep things balanced!

### Gemini

In [8]:
response = gemini.chat.completions.create(
    model="gemini-2.5-flash",
    messages=tell_a_joke)
display(Markdown(response.choices[0].message.content))

Okay, here's one for a budding LLM Engineer:

A junior LLM Engineer excitedly tells their senior colleague, "I finally got my model to *reliably* generate text exactly as I wanted!"

The senior engineer raises an eyebrow. "Oh really? What did you prompt it with?"

The junior engineer beams, "I simply wrote: 'Generate the word "banana".'"

The senior engineer nods slowly. "Ah, the early days of finding stability. Next, try 'Generate a six-paragraph, emotionally resonant short story about a talking banana, formatted as a JSON object, but ensure it *never* mentions the color yellow.'"

The junior engineer's smile slowly fades. "Oh... right."

**Welcome to LLM Engineering!**

## Gemini native SDK (alternative to the OpenAI client; less commonly used)

In [None]:
from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-lite", contents="Describe the color Blue to someone who's never been able to see in 1 sentence"
)
print(response.text)

## Anthropic native SDK (alternative to the OpenAI client; less commonly used)

In [None]:
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Describe the color Blue to someone who's never been able to see in 1 sentence"}],
    max_tokens=100
)
print(response.content[0].text)

# LangChain first look (powerful but heavy)

In [9]:
!pip install langchain-openai langchain-core langchain

Collecting langchain-openai
  Downloading langchain_openai-1.1.7-py3-none-any.whl.metadata (2.6 kB)
Collecting langchain-core
  Downloading langchain_core-1.2.6-py3-none-any.whl.metadata (3.7 kB)
Downloading langchain_openai-1.1.7-py3-none-any.whl (84 kB)
Downloading langchain_core-1.2.6-py3-none-any.whl (489 kB)
Installing collected packages: langchain-core, langchain-openai
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 1.2.1
    Uninstalling langchain-core-1.2.1:
      Successfully uninstalled langchain-core-1.2.1
Successfully installed lan

In [10]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


llm = ChatOpenAI(
    model="openai/gpt-4o-mini",
    base_url="https://openrouter.ai/api/v1",
    api_key=open_route_api_key,
)

message = [HumanMessage(content="Tell me an LLM joke")]

response = llm.invoke(message)

display(Markdown(response.content))

Why did the large language model break up with its partner?

Because it just couldn’t stop generating misunderstandings!

# LiteLLM (lightweight)

In [7]:
# 1. Install LiteLLM
!pip install litellm

Collecting litellm
  Downloading litellm-1.80.12-py3-none-any.whl.metadata (29 kB)
Collecting fastuuid>=0.13.0 (from litellm)
  Downloading fastuuid-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.1 kB)
Downloading litellm-1.80.12-py3-none-any.whl (11.5 MB)
Downloading fastuuid-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (278 kB)
Installing collected packages: fastuuid, litellm
Successfully installed fastuuid-0.14.0 litellm-1.80.12


In [8]:
import os
from litellm import completion
from IPython.display import Markdown,display
# 1. Setup the Environment Variable (LiteLLM looks for this specific name)
os.environ["OPENROUTER_API_KEY"] = open_route_api_key


response = completion(
    model="openrouter/openai/gpt-4o-mini",
    messages=tell_a_joke
)

display(Markdown(response.choices[0].message.content))

  PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Why did ...one, 'reasoning': None}), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...finish_reason': 'stop'}), input_type=Choices])
  return self.__pydantic_serializer__.to_python(


Why did the LLM refuse to play hide and seek?

Because good luck hiding when it can predict your next move!

## Viewing token usage

In [9]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 24
Output tokens: 24
Total tokens: 48
Total cost: 0.0018 cents
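
The cost figure above can be cross-checked by hand from the token counts. A minimal sketch, assuming prices of $0.15 per million input tokens and $0.60 per million output tokens for this model; both prices and the helper name `estimate_cost_cents` are assumptions made for illustration, not values read from the API, so check the provider's current price list before relying on them:

```python
# Assumed prices in USD per one million tokens (illustrative, not read from the API).
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def estimate_cost_cents(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one call in cents from its token counts."""
    dollars = (
        prompt_tokens * INPUT_PRICE_PER_M + completion_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return dollars * 100


print(f"Estimated cost: {estimate_cost_cents(24, 24):.4f} cents")
```

With 24 input and 24 output tokens this prints 0.0018 cents, which agrees with the reported figure under the assumed prices. Output tokens dominate the total because they are priced higher than input tokens.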


## Using LiteLLM to illustrate a pro feature: prompt caching

In [10]:
import requests

# 1. Download Hamlet from a public source (saving it to Colab)
url = "https://gist.githubusercontent.com/provpup/2fc41686eab7400b796b/raw/b575bd01a58494dfddc91e0143db631f36331463/hamlet.txt"
response = requests.get(url)

with open("hamlet.txt", "w", encoding="utf-8") as f:
    f.write(response.text)

print("✅ 'hamlet.txt' downloaded successfully!")

# 2. Read the saved file back
with open("hamlet.txt", "r", encoding="utf-8") as f:
    hamlet = f.read()

# 3. Find the quote
loc = hamlet.find("Speak, man")

# Safety check: if text isn't found in this specific version
if loc != -1:
    print(f"\nFound quote at index {loc}:\n")
    print(hamlet[loc:loc+100])
else:
    print("Could not find the specific phrase in this version of the text.")

✅ 'hamlet.txt' downloaded successfully!
Could not find the specific phrase in this version of the text.
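
An exact `str.find` misses a phrase whenever this edition of the text differs in capitalization or wraps the line differently. A minimal sketch of a more tolerant search, assuming that ignoring case and whitespace differences is acceptable; the helper name `find_phrase` and the short `sample` string are hypothetical stand-ins, not part of the notebook:

```python
import re


def find_phrase(text: str, phrase: str) -> int:
    """Return the index of the first match, ignoring case and whitespace differences, or -1."""
    # Join the escaped words with \s+ so any run of spaces or newlines matches.
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.start() if match else -1


# Small stand-in text; in the notebook this would be the `hamlet` string.
sample = "LAERTES.\nWhere is   my\nfather?\nKING.\nDead."
loc = find_phrase(sample, "where is my father?")
print(loc, repr(sample[loc:loc + 30]))
```

The function keeps the same contract as `str.find` (an index, or -1 when nothing matches), so the existing `if loc != -1:` check works unchanged.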


In [11]:
question = [{"role": "user", "content": "In Hamlet, when Laertes asks 'Where is my father?' what is the reply?"}]

In [12]:
response = completion(model="openrouter/openai/gpt-4o-mini", messages=question)
display(Markdown(response.choices[0].message.content))

  PydanticSerializationUnexpectedValue(Expected 10 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='In Shake...one, 'reasoning': None}), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...finish_reason': 'stop'}), input_type=Choices])
  return self.__pydantic_serializer__.to_python(


In Shakespeare's "Hamlet," when Laertes asks "Where is my father?" he is speaking to King Claudius. The reply comes from Claudius, who tells Laertes that his father, Polonius, is dead. Claudius explains that Polonius was killed by Hamlet, which sets off further events in the play. This moment is crucial as it heightens Laertes' grief and desire for revenge against Hamlet.

In [14]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 25
Output tokens: 90
Total tokens: 115
Total cost: 0.0058 cents


In [15]:
response = completion(model="openrouter/openai/gpt-4o-mini", messages=question)
display(Markdown(response.choices[0].message.content))

In Shakespeare's "Hamlet," when Laertes asks "Where is my father?" he is met with a response from Gertrude that leads to the revelation of Polonius's death. Specifically, the reply comes from Queen Gertrude, who does not directly answer Laertes' question but instead indicates that Polonius is dead, leading to a dramatic progression in the play. Following this, Hamlet is discovered to be responsible for Polonius's death.

In [16]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 25
Output tokens: 94
Total tokens: 119
Total cost: 0.0060 cents


In [29]:
import os
import time
from google.colab import userdata
from litellm import completion

# 1. Setup Gemini Key (Fixed syntax)
# We must get the key from userdata *before* setting the environment variable
try:
    # Get the key you saved in the Secrets (Key icon on the left)
    api_key = userdata.get('GEMINI_API_KEY')
    if api_key:
        os.environ['GEMINI_API_KEY'] = api_key
        print("✅ Gemini API Key loaded.")
    else:
        print("⚠️ Warning: GEMINI_API_KEY is empty in Secrets.")
except Exception as e:
    print(f"❌ Error: Could not find GEMINI_API_KEY. Did you add it to the Secrets (Key icon)?\n{e}")

# 2. Ensure Hamlet is loaded
if 'book_content' not in globals():
    try:
        with open("hamlet.txt", "r", encoding="utf-8") as f:
            book_content = f.read()
    except FileNotFoundError:
        # Fallback if file isn't there (downloads it)
        import requests
        url = "https://gist.githubusercontent.com/provpup/2fc41686eab7400b796b/raw/b575bd01a58494dfddc91e0143db631f36331463/hamlet.txt"
        with open("hamlet.txt", "w", encoding="utf-8") as f:
            f.write(requests.get(url).text)
        with open("hamlet.txt", "r", encoding="utf-8") as f:
            book_content = f.read()

# 3. Construct the Message
messages = [
    {"role": "user", "content": f"Here is the book Hamlet:\n{book_content}\n\nQuestion: What is the reply to 'Where is my father?'"}
]

# 4. Call Gemini through LiteLLM
# LiteLLM requires the "gemini/" prefix to know which provider to use.
MODEL = "gemini/gemini-2.5-flash"

print(f"\n🤖 Testing with {MODEL}...")

# --- RUN 1 ---
print("\n--- 🏃 RUN 1 ---")
start = time.time()
try:
    response1 = completion(model=MODEL, messages=messages)
    end = time.time()

    print(f"Time: {end - start:.2f}s")
    # Gemini Free Tier (API) usually returns $0 cost, but this checks just in case
    cost = response1._hidden_params.get('response_cost', 0)
    print(f"Cost: ${cost:.6f}")
    print(f"Answer: {response1.choices[0].message.content}")
except Exception as e:
    print(f"Error calling Gemini: {e}")

✅ Gemini API Key loaded.

🤖 Testing with gemini/gemini-2.5-flash...

--- 🏃 RUN 1 ---

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error calling Gemini: litellm.InternalServerError: litellm.InternalServerError: geminiException - {
  "error": {
    "code": 503,
    "message": "The model is overloaded. Please try again later.",
    "status": "UNAVAILABLE"
  }
}
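
A 503 "model is overloaded" response is usually transient, so one common way to handle it is to retry with exponentially growing delays. A minimal sketch, assuming that retrying is acceptable for this call; the helper name `call_with_retry` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with delays of base_delay * 1, 2, 4, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in real code, catch only transient errors such as a 503
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    raise RuntimeError("unreachable")
```

In this notebook the call could be wrapped as `call_with_retry(lambda: completion(model=MODEL, messages=messages))`. Catching every exception keeps the sketch short; narrowing it to the overload error avoids retrying failures that will never succeed, such as an invalid key.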



In [31]:
# 1. Update LiteLLM to the latest version (Crucial for Gemini 1.5/2.0 support)
!pip install -Uq litellm requests

import os
import time
import requests
from google.colab import userdata
from litellm import completion

# --- CONFIGURATION ---
# Try 'gemini/gemini-1.5-flash' after the 503 from gemini-2.5-flash above.
# Model names change over time; if this fails, try 'gemini/gemini-pro' as a fallback.
MODEL = "gemini/gemini-1.5-flash"

# ---------------------------------------------------------
# STEP 1: SETUP API KEY
# ---------------------------------------------------------
print("🔑 Setting up API Key...")
try:
    # Ensure you have added 'GEMINI_API_KEY' in the Colab Secrets (Key icon)
    api_key = userdata.get('GEMINI_API_KEY')
    if api_key:
        os.environ['GEMINI_API_KEY'] = api_key
        print("✅ Gemini API Key loaded.")
    else:
        raise ValueError("Key is empty")
except Exception as e:
    print(f"❌ ERROR: Could not find 'GEMINI_API_KEY'. Please check your Colab Secrets.\n{e}")

# ---------------------------------------------------------
# STEP 2: PREPARE THE BOOK (FIXED SOURCE)
# ---------------------------------------------------------
print("\n📖 Downloading Hamlet from Project Gutenberg...")
file_path = "hamlet.txt"

# Use a stable Project Gutenberg URL
url = "https://www.gutenberg.org/cache/epub/1524/pg1524.txt"

try:
    response = requests.get(url)
    response.raise_for_status() # Raise error if download fails

    # Save to file
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(response.text)

    # Read back to verify
    with open(file_path, "r", encoding="utf-8") as f:
        book_content = f.read()

    # REMOVE GUTENBERG HEADER/FOOTER (Optional but cleaner)
    # This keeps just the play text roughly
    if "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[1]
        if "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
            book_content = book_content.split("*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[0]

    print(f"✅ Book loaded! Size: {len(book_content)} characters.")

    # ERROR CHECK: If size is small (<1000), the download failed.
    if len(book_content) < 1000:
        raise ValueError("The downloaded file is too small. The URL might be blocked.")

except Exception as e:
    print(f"❌ Failed to download book: {e}")
    # Fallback to a dummy text if download fails so code doesn't crash
    book_content = "To be, or not to be, that is the question." * 1000

# ---------------------------------------------------------
# STEP 3: RUN THE EXPERIMENT
# ---------------------------------------------------------
messages = [
    {"role": "user", "content": f"Here is the book Hamlet:\n{book_content[:50000]}\n\nQuestion: What is the reply to 'Where is my father?'"}
]

print(f"\n🤖 Starting Experiment with model: {MODEL}")

# --- RUN 1 ---
print("\n------------------------------------------------")
print("🏃 RUN 1 (Cold Start)")
print("------------------------------------------------")
start_time = time.time()

try:
    response1 = completion(model=MODEL, messages=messages)
    elapsed1 = time.time() - start_time

    print(f"⏱️  Time: {elapsed1:.2f} seconds")
    print(f"📝 Answer: {response1.choices[0].message.content.strip()[:200]}...")
except Exception as e:
    print(f"❌ Error in Run 1: {e}")
    print("💡 TIP: If you get a 404, check if 'Generative Language API' is enabled in your Google Cloud Console.")

# --- RUN 2 ---
print("\n------------------------------------------------")
print("🏃 RUN 2 (Simulated Cache)")
print("------------------------------------------------")
start_time = time.time()

try:
    response2 = completion(model=MODEL, messages=messages)
    elapsed2 = time.time() - start_time

    print(f"⏱️  Time: {elapsed2:.2f} seconds")
    if 'elapsed1' in locals():
        print(f"🚀 Speed Difference: {elapsed1 - elapsed2:.2f}s")
    print(f"📝 Answer: {response2.choices[0].message.content.strip()[:200]}...")
except Exception as e:
    print(f"❌ Error in Run 2: {e}")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.
🔑 Setting up API Key...
✅ Gemini API Key loaded.

📖 Downloading Hamlet from Project Gutenberg...
✅ Book loaded! Size: 177967 characters.

🤖 Starting Experiment with model: gemini/gemini-1.5-flash

------------------------------------------------
🏃 RUN 1 (Cold Start)
------------------------------
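
Comparing one cold run against one repeat run gives a noisy signal, because network latency and server load vary from request to request. A minimal sketch of a steadier measurement, assuming it is acceptable to repeat the call several times; the helper name `time_calls` is hypothetical, and the `sum(range(...))` workload is a stand-in for the `completion(...)` call:

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock intended for measuring intervals
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; in the notebook this would be the completion(...) call.
durations = time_calls(lambda: sum(range(100_000)), repeats=5)
print(
    f"median: {statistics.median(durations):.4f}s, "
    f"min: {min(durations):.4f}s, max: {max(durations):.4f}s"
)
```

Reporting the median alongside the minimum and maximum makes it easier to tell a real caching effect from ordinary run-to-run variation.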

In [32]:
# 1. Update LiteLLM to the latest version
!pip install -Uq litellm requests

import os
import time
import requests
from google.colab import userdata
from litellm import completion

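# --- CONFIGURATION ---
# Attempt: drop the "gemini/" provider prefix.
# Without the prefix, LiteLLM routes the call to Vertex AI rather than Google AI Studio,
# which is why the run below fails with a Vertex credentials error.
MODEL = "gemini-1.5-flash"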

# ---------------------------------------------------------
# STEP 1: SETUP API KEY
# ---------------------------------------------------------
print("🔑 Setting up API Key...")
try:
    # Ensure you have added 'GEMINI_API_KEY' in the Colab Secrets (Key icon)
    api_key = userdata.get('GEMINI_API_KEY')
    if api_key:
        os.environ['GEMINI_API_KEY'] = api_key
        print("✅ Gemini API Key loaded.")
    else:
        raise ValueError("Key is empty")
except Exception as e:
    print(f"❌ ERROR: Could not find 'GEMINI_API_KEY'. Please check your Colab Secrets.\n{e}")

# ---------------------------------------------------------
# STEP 2: PREPARE THE BOOK (FIXED SOURCE)
# ---------------------------------------------------------
print("\n📖 Downloading Hamlet from Project Gutenberg...")
file_path = "hamlet.txt"
url = "https://www.gutenberg.org/cache/epub/1524/pg1524.txt"

try:
    response = requests.get(url)
    response.raise_for_status()

    # Save to file
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(response.text)

    # Read back to verify
    with open(file_path, "r", encoding="utf-8") as f:
        book_content = f.read()

    # REMOVE GUTENBERG HEADER/FOOTER (Optional but cleaner)
    if "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[1]
    if "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[0]

    print(f"✅ Book loaded! Size: {len(book_content)} characters.")

    # ERROR CHECK: If size is small (<1000), the download failed
    if len(book_content) < 1000:
        raise ValueError("The downloaded file is too small. The URL might be blocked.")

except Exception as e:
    print(f"❌ Failed to download book: {e}")
    book_content = "To be, or not to be, that is the question." * 1000

# ---------------------------------------------------------
# STEP 3: RUN THE EXPERIMENT
# ---------------------------------------------------------
messages = [
    {"role": "user", "content": f"Here is the book Hamlet:\n{book_content[:50000]}\n\nQuestion: What is the reply to 'Where is my father?'"}
]

print(f"\n🤖 Starting Experiment with model: {MODEL}")

# --- RUN 1 ---
print("\n------------------------------------------------")
print("🏃 RUN 1 (Cold Start)")
print("------------------------------------------------")
start_time = time.time()
try:
    response1 = completion(
        model=MODEL,
        messages=messages,
        api_key=os.environ['GEMINI_API_KEY']
    )
    elapsed1 = time.time() - start_time
    print(f"⏱️ Time: {elapsed1:.2f} seconds")
    print(f"📝 Answer: {response1.choices[0].message.content.strip()[:200]}...")
except Exception as e:
    print(f"❌ Error in Run 1: {e}")
    print("💡 TIP: Make sure the Generative Language API is enabled at:")
    print("   https://console.cloud.google.com/apis/library/generativelanguage.googleapis.com")

# --- RUN 2 ---
print("\n------------------------------------------------")
print("🏃 RUN 2 (Simulated Cache)")
print("------------------------------------------------")
start_time = time.time()
try:
    response2 = completion(
        model=MODEL,
        messages=messages,
        api_key=os.environ['GEMINI_API_KEY']
    )
    elapsed2 = time.time() - start_time
    print(f"⏱️ Time: {elapsed2:.2f} seconds")
    if 'elapsed1' in locals():
        print(f"🚀 Speed Difference: {elapsed1 - elapsed2:.2f}s")
    print(f"📝 Answer: {response2.choices[0].message.content.strip()[:200]}...")
except Exception as e:
    print(f"❌ Error in Run 2: {e}")

print("\n" + "="*50)
print("📊 EXPERIMENT COMPLETE")
print("="*50)

🔑 Setting up API Key...
✅ Gemini API Key loaded.

📖 Downloading Hamlet from Project Gutenberg...
✅ Book loaded! Size: 177967 characters.

🤖 Starting Experiment with model: gemini-1.5-flash

------------------------------------------------
🏃 RUN 1 (Cold Start)
------------------------------------------------


16:15:50 - LiteLLM:ERROR: vertex_llm_base.py:550 - Failed to load vertex credentials. Check to see if credentials containing partial/invalid information. Error: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 404 Response:\nb''", <google.auth.transport.requests._Response object at 0x7d954f516ea0>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/google/auth/compute_engine/credentials.py", line 140, in _refresh_token
    self._retrieve_info(request)
  File "/usr/local/lib/python3.12/dist-packages/google/auth/compute_engine/credentials.py", line 107, in _retrieve_info
    info = _metadata.get_service_account_info(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/google/auth/compute_engine/_metadata.py", line 342, in get_service_account_info
    return get(request, path, p


Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

❌ Error in Run 1: litellm.APIConnectionError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 404 Response:\nb''", <google.auth.transport.requests._Response object at 0x7d954f516ea0>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/google/auth/compute_engine/credentials.py", line 140, in _refresh_token
    self._retrieve_info(request)
  File "/usr/local/lib/python3.12/dist-packages/google/auth/compute_engine/credentials.py", line 107, in _retrieve_info
    info = _metadata.get_service_account_info(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/google/auth/compute_engine/_metadata.py", line 342, in get_service_ac
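
The credentials error above comes from the Vertex AI code path (`vertex_llm_base.py`), which suggests the un-prefixed model name was not routed to Google AI Studio. One way to guard against this is to normalize the model name before calling LiteLLM. A minimal sketch; the helper name `with_gemini_prefix` is hypothetical and it only handles the `gemini/` provider prefix used in this notebook:

```python
def with_gemini_prefix(model: str) -> str:
    """Return the model name with the "gemini/" provider prefix, adding it if missing."""
    prefix = "gemini/"
    return model if model.startswith(prefix) else prefix + model


print(with_gemini_prefix("gemini-1.5-flash"))         # gemini/gemini-1.5-flash
print(with_gemini_prefix("gemini/gemini-2.5-flash"))  # gemini/gemini-2.5-flash
```

The prefix only selects the provider. It does not guarantee that the model name after it is one the API currently serves.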

In [33]:
# 1. Update LiteLLM to the latest version
!pip install -Uq litellm requests

import os
import time
import requests
from google.colab import userdata
from litellm import completion

# --- CONFIGURATION ---
# CRITICAL: Add "gemini/" prefix to force Google AI Studio API (not Vertex AI)
MODEL = "gemini/gemini-1.5-flash-latest"

# ---------------------------------------------------------
# STEP 1: SETUP API KEY
# ---------------------------------------------------------
print("🔑 Setting up API Key...")
try:
    api_key = userdata.get('GEMINI_API_KEY')
    if api_key:
        os.environ['GEMINI_API_KEY'] = api_key
        print("✅ Gemini API Key loaded.")
    else:
        raise ValueError("Key is empty")
except Exception as e:
    print(f"❌ ERROR: Could not find 'GEMINI_API_KEY'. Please check your Colab Secrets.\n{e}")

# ---------------------------------------------------------
# STEP 2: PREPARE THE BOOK (FIXED SOURCE)
# ---------------------------------------------------------
print("\n📖 Downloading Hamlet from Project Gutenberg...")
file_path = "hamlet.txt"
url = "https://www.gutenberg.org/cache/epub/1524/pg1524.txt"

try:
    response = requests.get(url)
    response.raise_for_status()

    with open(file_path, "w", encoding="utf-8") as f:
        f.write(response.text)

    with open(file_path, "r", encoding="utf-8") as f:
        book_content = f.read()

    # Remove Gutenberg header/footer
    if "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[1]
    if "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[0]

    print(f"✅ Book loaded! Size: {len(book_content)} characters.")

    if len(book_content) < 1000:
        raise ValueError("The downloaded file is too small.")

except Exception as e:
    print(f"❌ Failed to download book: {e}")
    book_content = "To be, or not to be, that is the question." * 1000

# ---------------------------------------------------------
# STEP 3: RUN THE EXPERIMENT
# ---------------------------------------------------------
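# Note: only the first 50k of the ~178k characters are sent, to keep the prompt small.
# Passages late in the play (including Laertes's "Where is my father?" in Act 4)
# fall outside this slice, so the model cannot quote them from the prompt.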
messages = [
    {"role": "user", "content": f"Here is the book Hamlet:\n\n{book_content[:50000]}\n\nQuestion: What is the reply to 'Where is my father?'"}
]

print(f"\n🤖 Starting Experiment with model: {MODEL}")
print("📝 Note: Google AI Studio doesn't have explicit prompt caching API yet.")
print("   However, Gemini may cache internally for identical prompts.\n")

# --- RUN 1 ---
print("="*50)
print("🏃 RUN 1 (Cold Start)")
print("="*50)
start_time = time.time()
try:
    response1 = completion(
        model=MODEL,
        messages=messages,
        # LiteLLM will automatically use GEMINI_API_KEY from environment
    )
    elapsed1 = time.time() - start_time
    answer1 = response1.choices[0].message.content.strip()

    print(f"⏱️  Time: {elapsed1:.2f} seconds")
    print(f"📝 Answer: {answer1[:300]}...")

except Exception as e:
    print(f"❌ Error in Run 1: {e}")
    print("\n💡 TROUBLESHOOTING:")
    print("   1. Verify your API key is from Google AI Studio (not Cloud Console)")
    print("   2. Get a key from: https://aistudio.google.com/app/apikey")
    print("   3. Make sure it's added to Colab Secrets as 'GEMINI_API_KEY'")

# --- RUN 2 ---
print("\n" + "="*50)
print("🏃 RUN 2 (Potential Cache Hit)")
print("="*50)
time.sleep(1)  # Small delay between requests
start_time = time.time()
try:
    response2 = completion(
        model=MODEL,
        messages=messages,
    )
    elapsed2 = time.time() - start_time
    answer2 = response2.choices[0].message.content.strip()

    print(f"⏱️  Time: {elapsed2:.2f} seconds")

    if 'elapsed1' in locals():
        speedup = elapsed1 - elapsed2
        speedup_pct = (speedup / elapsed1) * 100 if elapsed1 > 0 else 0
        print(f"🚀 Speed Difference: {speedup:.2f}s ({speedup_pct:.1f}% faster)")

        if speedup > 1:
            print("✨ Significant speedup detected! Likely benefiting from internal caching.")
        elif speedup > 0:
            print("⚡ Slight speedup - results may vary due to network/server load.")
        else:
            print("⏱️  Similar speed - caching may not be active for this query.")

    print(f"📝 Answer: {answer2[:300]}...")

except Exception as e:
    print(f"❌ Error in Run 2: {e}")

# ---------------------------------------------------------
# SUMMARY
# ---------------------------------------------------------
print("\n" + "="*50)
print("📊 EXPERIMENT COMPLETE")
print("="*50)
if 'elapsed1' in locals() and 'elapsed2' in locals():
    print(f"Run 1: {elapsed1:.2f}s")
    print(f"Run 2: {elapsed2:.2f}s")
    print(f"Difference: {elapsed1 - elapsed2:.2f}s")
    print("\n💡 Note: Google AI Studio API doesn't expose explicit caching controls.")
    print("   Any speedup is from Gemini's internal optimizations.")

🔑 Setting up API Key...
✅ Gemini API Key loaded.

📖 Downloading Hamlet from Project Gutenberg...
✅ Book loaded! Size: 177967 characters.

🤖 Starting Experiment with model: gemini/gemini-1.5-flash-latest
📝 Note: Google AI Studio doesn't have explicit prompt caching API yet.
   However, Gemini may cache internally for identical prompts.

🏃 RUN 1 (Cold Start)

Provider List: https://docs.litellm.ai/docs/providers


Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


Provider List: https://docs.litellm.ai/docs/providers

❌ Error in Run 1: litellm.NotFoundError: GeminiException - {
  "error": {
    "code": 404,
    "message": "models/gemini-1.5-flash-latest is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.",
    "status": "NOT_FOUN
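
The 404 message recommends calling ListModels to see which model names the key can use. A minimal sketch of that check, assuming the REST endpoint `https://generativelanguage.googleapis.com/v1beta/models`, the `x-goog-api-key` header, and a response with a `models` list whose entries carry `name` and `supportedGenerationMethods`. The helper names and the sample payload are hypothetical, and the response may be paginated, which this sketch does not handle:

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumed endpoint for the ListModels call named in the error message.
LIST_MODELS_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def models_supporting(payload: Dict[str, Any], method: str = "generateContent") -> List[str]:
    """Pick the model names whose supported methods include `method`."""
    return [
        m["name"]
        for m in payload.get("models", [])
        if method in m.get("supportedGenerationMethods", [])
    ]


def fetch_models(api_key: str) -> Dict[str, Any]:
    """Fetch the model list over HTTP (requires network access and a valid key)."""
    request = urllib.request.Request(LIST_MODELS_URL, headers={"x-goog-api-key": api_key})
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


# Hand-written sample in the assumed response shape, so the filter can be tried offline.
sample_payload = {
    "models": [
        {"name": "models/example-chat-model", "supportedGenerationMethods": ["generateContent"]},
        {"name": "models/example-embedding-model", "supportedGenerationMethods": ["embedContent"]},
    ]
}
print(models_supporting(sample_payload))  # ['models/example-chat-model']
```

In the notebook, `models_supporting(fetch_models(api_key))` would list the candidates to use after the `gemini/` prefix, instead of guessing model names one at a time.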

In [34]:
# 1. Update LiteLLM to the latest version
!pip install -Uq litellm requests

import os
import time
import requests
from google.colab import userdata
from litellm import completion

# --- CONFIGURATION ---
# Use the standard v1 API model name (not v1beta)
MODEL = "gemini/gemini-1.5-flash"

# ---------------------------------------------------------
# STEP 1: SETUP API KEY
# ---------------------------------------------------------
print("🔑 Setting up API Key...")
try:
    api_key = userdata.get('GEMINI_API_KEY')
    if api_key:
        os.environ['GEMINI_API_KEY'] = api_key
        print("✅ Gemini API Key loaded.")
    else:
        raise ValueError("Key is empty")
except Exception as e:
    print(f"❌ ERROR: Could not find 'GEMINI_API_KEY'. Please check your Colab Secrets.\n{e}")

# ---------------------------------------------------------
# STEP 2: PREPARE THE BOOK (FIXED SOURCE)
# ---------------------------------------------------------
print("\n📖 Downloading Hamlet from Project Gutenberg...")
file_path = "hamlet.txt"
url = "https://www.gutenberg.org/cache/epub/1524/pg1524.txt"

try:
    response = requests.get(url)
    response.raise_for_status()

    with open(file_path, "w", encoding="utf-8") as f:
        f.write(response.text)

    with open(file_path, "r", encoding="utf-8") as f:
        book_content = f.read()

    # Remove Gutenberg header/footer
    if "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[1]
    if "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***" in book_content:
        book_content = book_content.split("*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***")[0]

    print(f"✅ Book loaded! Size: {len(book_content)} characters.")

    if len(book_content) < 1000:
        raise ValueError("The downloaded file is too small.")

except Exception as e:
    print(f"❌ Failed to download book: {e}")
    book_content = "To be, or not to be, that is the question." * 1000

# ---------------------------------------------------------
# STEP 3: RUN THE EXPERIMENT
# ---------------------------------------------------------
# Use 50k chars to stay within context limits
messages = [
    {"role": "user", "content": f"Here is the book Hamlet:\n\n{book_content[:50000]}\n\nQuestion: What is the reply to 'Where is my father?'"}
]

print(f"\n🤖 Starting Experiment with model: {MODEL}")
print("📝 Note: Google AI Studio (free tier) doesn't expose explicit caching.")
print("   But Gemini may optimize internally for repeated prompts.\n")

# --- RUN 1 ---
print("="*50)
print("🏃 RUN 1 (Cold Start)")
print("="*50)
start_time = time.time()
try:
    response1 = completion(
        model=MODEL,
        messages=messages,
        api_base="https://generativelanguage.googleapis.com/v1/models"  # Explicit v1 API
    )
    elapsed1 = time.time() - start_time
    answer1 = response1.choices[0].message.content.strip()

    print(f"⏱️  Time: {elapsed1:.2f} seconds")
    print(f"📝 Answer: {answer1[:300]}...")

except Exception as e:
    print(f"❌ Error in Run 1: {e}")
    print("\n💡 TROUBLESHOOTING:")
    print("   • Your API key should be from: https://aistudio.google.com/app/apikey")
    print("   • Make sure 'Generative Language API' is enabled")
    print("   • Try regenerating your API key if it's old")

    # Try alternative model names
    print("\n🔄 Attempting with alternative model name...")
    # Restart the timer so elapsed1 excludes the time spent on the failed attempt
    start_time = time.time()
    try:
        response1 = completion(
            model="gemini/gemini-pro",
            messages=messages
        )
        elapsed1 = time.time() - start_time
        answer1 = response1.choices[0].message.content.strip()
        print("✅ Success with gemini-pro!")
        print(f"⏱️  Time: {elapsed1:.2f} seconds")
        print(f"📝 Answer: {answer1[:300]}...")
        MODEL = "gemini/gemini-pro"  # Update model for Run 2
    except Exception as e2:
        print(f"❌ Also failed with gemini-pro: {e2}")

# --- RUN 2 ---
print("\n" + "="*50)
print("🏃 RUN 2 (Potential Cache Hit)")
print("="*50)
time.sleep(1)  # Small delay between requests
start_time = time.time()
try:
    response2 = completion(
        model=MODEL,
        messages=messages,
        api_base="https://generativelanguage.googleapis.com/v1/models"
    )
    elapsed2 = time.time() - start_time
    answer2 = response2.choices[0].message.content.strip()

    print(f"⏱️  Time: {elapsed2:.2f} seconds")

    if 'elapsed1' in locals():
        speedup = elapsed1 - elapsed2
        speedup_pct = (speedup / elapsed1) * 100 if elapsed1 > 0 else 0
        # Label the direction explicitly so a slower Run 2 is not reported as "faster"
        direction = "faster" if speedup >= 0 else "slower"
        print(f"🚀 Speed Difference: {abs(speedup):.2f}s ({abs(speedup_pct):.1f}% {direction})")

        # A single pair of timings is noisy; treat these labels as hints, not proof
        if speedup > 1:
            print("✨ Significant speedup detected! Likely benefiting from internal caching.")
        elif speedup > 0.2:
            print("⚡ Moderate speedup - some optimization may be happening.")
        else:
            print("⏱️  Similar speed - no significant caching detected.")

    print(f"📝 Answer: {answer2[:300]}...")

except Exception as e:
    print(f"❌ Error in Run 2: {e}")

# ---------------------------------------------------------
# SUMMARY
# ---------------------------------------------------------
print("\n" + "="*50)
print("📊 EXPERIMENT COMPLETE")
print("="*50)
if 'elapsed1' in locals() and 'elapsed2' in locals():
    print(f"Run 1: {elapsed1:.2f}s")
    print(f"Run 2: {elapsed2:.2f}s")
    print(f"Difference: {elapsed1 - elapsed2:.2f}s")
    print("\n💡 Notes:")
    print("   • Free tier doesn't have explicit Context Caching API")
    print("   • Any speedup is from Gemini's internal optimizations")
    print("   • For guaranteed caching, you'd need Vertex AI (paid tier)")
else:
    print("⚠️  Could not complete both runs. Check the errors above.")

🔑 Setting up API Key...
✅ Gemini API Key loaded.

📖 Downloading Hamlet from Project Gutenberg...
✅ Book loaded! Size: 177967 characters.

🤖 Starting Experiment with model: gemini/gemini-1.5-flash
📝 Note: Google AI Studio (free tier) doesn't expose explicit caching.
   But Gemini may optimize internally for repeated prompts.

🏃 RUN 1 (Cold Start)

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

❌ Error in Run 1: litellm.NotFoundError: GeminiException - 

💡 TROUBLESHOOTING:
   • Your API key should be from: https://aistudio.google.com/app/apikey
   • Make sure 'Generative Language API' is enabled
   • Try regenerating your API key if it's old

🔄 Attempting with alternative model name...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug