# Workout: Multimodal APIs

## Setup
```bash
uv add openai anthropic pillow instructor pydantic
```

---
## Drill 1: Basic Vision Call 🟢
**Task:** Analyze an image from a URL with GPT-4o

```python
from openai import OpenAI

client = OpenAI()

# Analyze this image: https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/1200px-Python-logo-notext.svg.png
# Ask: "What logo is this and what does it represent?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Add text and image_url content
            ]
        }
    ],
    max_tokens=300
)
```

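One way to fill in the `content` list is sketched below. It assumes the Chat Completions content-part format, where a text part and an `image_url` part sit side by side in one user message. The helper name `build_vision_content` is hypothetical and not part of the SDK.

```python
def build_vision_content(prompt: str, image_url: str) -> list[dict]:
    """Return content parts for one user message: a text part, then an image part."""
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]


LOGO_URL = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/"
    "Python-logo-notext.svg/1200px-Python-logo-notext.svg.png"
)
content = build_vision_content("What logo is this and what does it represent?", LOGO_URL)

# Pass `content` as the "content" value of the user message in this drill's request,
# then read the reply with: response.choices[0].message.content
```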
---
## Drill 2: Base64 Image Encoding 🟢
**Task:** Encode a local image and send it to the API

```python
import base64
from openai import OpenAI

def encode_image(path: str) -> str:
    """Encode image to base64 string."""
    pass

# Encode an image and ask "Describe what you see"
```

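A minimal sketch of the encoding step, using only the standard library. The `to_data_url` helper is a hypothetical addition: it wraps the base64 payload in a data URL, which is one form the `image_url` field is assumed to accept for local images. The JPEG fallback for unknown extensions is also an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def encode_image(path: str) -> str:
    """Read a file and return its bytes as a base64-encoded ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def to_data_url(path: str) -> str:
    """Wrap the base64 payload in a data URL usable as an image_url value."""
    # Guess the MIME type from the file extension; JPEG is an assumed fallback.
    mime, _ = mimetypes.guess_type(path)
    return f"data:{mime or 'image/jpeg'};base64,{encode_image(path)}"


# Usage: {"type": "image_url", "image_url": {"url": to_data_url("photo.jpg")}}
```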
---
## Drill 3: Multiple Images 🟡
**Task:** Compare two images in a single request

```python
from openai import OpenAI

client = OpenAI()

# Send two images and ask: "What are the differences between these?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "..."},
                # Add two image_url items
            ]
        }
    ]
)
```

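For several images, the content list can be built in a loop. The sketch below is one possible approach, not the required solution: `build_comparison_content` is a hypothetical helper, the URLs are placeholders, and the short text label before each image is an optional design choice that gives the model a name for each picture.

```python
def build_comparison_content(prompt: str, image_urls: list[str]) -> list[dict]:
    """Return a text part followed by a label and an image part for each URL."""
    content: list[dict] = [{"type": "text", "text": prompt}]
    for index, url in enumerate(image_urls, start=1):
        # A short label before each image lets the model refer to "Image 1", "Image 2".
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


content = build_comparison_content(
    "What are the differences between these?",
    ["https://example.com/before.png", "https://example.com/after.png"],  # placeholder URLs
)
```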
---
## Drill 4: Claude Vision 🟡
**Task:** Use Claude to analyze an image

```python
from anthropic import Anthropic
import base64

client = Anthropic()

# Note: Claude image blocks need a "source" object.
# For source.type = "base64", supply media_type (e.g. "image/png") and data;
# for source.type = "url", supply url.

# Analyze an image with Claude
```

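The block structure described in the note can be sketched as follows. Both helper names are hypothetical. The `client` argument is assumed to be an `Anthropic()` instance, and the model name is left as a parameter because the drill does not specify one.

```python
import base64
import mimetypes
from pathlib import Path


def build_claude_image_block(image_path: str) -> dict:
    """Return an Anthropic image content block with a base64 source."""
    media_type, _ = mimetypes.guess_type(image_path)
    data = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type or "image/jpeg",  # assumed fallback
            "data": data,
        },
    }


def ask_claude(client, model: str, image_path: str, prompt: str) -> str:
    """Send one image plus a question and return the text of the first reply block."""
    response = client.messages.create(
        model=model,  # pass whichever vision-capable Claude model you have access to
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": [
                    build_claude_image_block(image_path),
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    )
    return response.content[0].text
```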
---
## Drill 5: Image Optimization 🟡
**Task:** Resize and compress an image before sending

```python
from PIL import Image
import io
import base64

def optimize_for_api(
    image_path: str,
    max_dimension: int = 1024,
    quality: int = 85
) -> tuple[str, int]:
    """
    Optimize image for API.
    Return (base64_string, estimated_tokens).
    """
    pass

# Test with a large image
# optimized, tokens = optimize_for_api("large_photo.jpg")
# print(f"Estimated tokens: {tokens}")
```

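The `estimated_tokens` half of the return value can be approximated with plain arithmetic. The sketch below assumes the tile-based accounting OpenAI documents for GPT-4o high-detail images (85 base tokens plus 170 per 512 px tile); other models and detail settings use different numbers, so treat the result as an estimate. Both function names are hypothetical. With Pillow, `Image.open(path).size` supplies the `(width, height)` pair and `Image.thumbnail` can perform the actual resize.

```python
import math


def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_dimension."""
    scale = min(1.0, max_dimension / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_tokens(width: int, height: int) -> int:
    """Approximate high-detail image cost: 85 base tokens + 170 per 512 px tile."""
    # Step 1: fit inside a 2048 x 2048 square.
    width, height = fit_within(width, height, 2048)
    # Step 2: shrink so the shortest side is at most 768 px (no upscaling assumed).
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 3: count the 512 px tiles needed to cover the image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_tokens(1024, 1024))  # 765
print(estimate_tokens(2048, 4096))  # 1105
```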
---
## Drill 6: Whisper Transcription 🟢
**Task:** Transcribe an audio file

```python
from openai import OpenAI

client = OpenAI()

# Transcribe an MP3 file
# Note: You'll need an actual audio file to test

# with open("audio.mp3", "rb") as f:
#     transcript = client.audio.transcriptions.create(
#         model="whisper-1",
#         file=f
#     )

# print(transcript.text)
```

---
## Drill 7: Timestamped Transcription 🟡
**Task:** Get word-level timestamps from Whisper

```python
from openai import OpenAI

client = OpenAI()

# Use response_format="verbose_json"
# and timestamp_granularities=["word"]

# Print each word with its timestamp
```

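A sketch of the two steps, request and printing, kept separate so the formatting can be checked without an audio file. The helper names are hypothetical. The `client` argument is assumed to be an `OpenAI()` instance, and each entry of `transcript.words` is assumed to expose `word`, `start`, and `end` (times in seconds).

```python
def transcribe_with_words(client, audio_path: str):
    """Request a verbose transcription with word-level timestamps."""
    with open(audio_path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )


def format_words(words) -> list[str]:
    """Format each word as 'start-end  word', with times in seconds."""
    return [f"{w.start:6.2f}-{w.end:6.2f}  {w.word}" for w in words]


# Usage, given a real audio file:
# transcript = transcribe_with_words(OpenAI(), "audio.mp3")
# print("\n".join(format_words(transcript.words)))
```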
---
## Drill 8: Text-to-Speech 🟢
**Task:** Generate speech from text

```python
from openai import OpenAI
from pathlib import Path

client = OpenAI()

# Generate speech saying "Hello, welcome to the AI course!"
# Save to speech.mp3
# Try different voices: alloy, echo, fable, onyx, nova, shimmer
```

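One possible shape for the speech call is sketched below. The `synthesize` helper is hypothetical, the `client` argument is assumed to be an `OpenAI()` instance, and `tts-1` is an assumed model name since the drill does not name one. The voice check only covers the six voices listed in this drill.

```python
from pathlib import Path

VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def synthesize(client, text: str, output_path: str, voice: str = "alloy") -> Path:
    """Generate speech for `text` and stream it to output_path."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    target = Path(output_path)
    # The streaming variant writes audio to disk as it arrives instead of buffering it all.
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",  # assumed model name; not specified in this drill
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(target)
    return target


# Usage:
# synthesize(OpenAI(), "Hello, welcome to the AI course!", "speech.mp3", voice="nova")
```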
---
## Drill 9: Document Extraction with Vision 🔴
**Task:** Extract structured data from a document image

```python
from openai import OpenAI
from pydantic import BaseModel
import instructor
import base64

client = instructor.from_openai(OpenAI())

class ReceiptData(BaseModel):
    store_name: str
    date: str
    items: list[str]
    total: float

def extract_receipt(image_path: str) -> ReceiptData:
    """Extract receipt data from image."""
    pass

# Test with a receipt image
```

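A hedged sketch of the extraction call. It assumes the instructor-patched client accepts a `response_model` argument on `chat.completions.create` and returns a validated instance of that class. The helper name `extract_with_model` and the prompt wording are hypothetical. The client and model class are passed in as arguments so the function does not depend on module-level state.

```python
import base64
import mimetypes
from pathlib import Path


def extract_with_model(client, image_path: str, response_model, model: str = "gpt-4o"):
    """Ask an instructor-patched client to return the image's data as `response_model`."""
    mime, _ = mimetypes.guess_type(image_path)
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return client.chat.completions.create(
        model=model,
        response_model=response_model,  # instructor validates the reply against this class
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract the store name, date, items, and total."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime or 'image/jpeg'};base64,{encoded}"},
                    },
                ],
            }
        ],
    )


# Usage with the objects defined in this drill:
# receipt = extract_with_model(client, "receipt.jpg", ReceiptData)
# print(receipt.total)
```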
---
## Drill 10: Vision + Audio Pipeline 🔴
**Task:** Build a pipeline that describes an image and reads the description aloud

```python
from openai import OpenAI
from pathlib import Path
import base64

class ImageNarrator:
    def __init__(self):
        self.client = OpenAI()

    def describe(self, image_path: str) -> str:
        """Get detailed description of image."""
        pass

    def narrate(self, description: str, output_path: str):
        """Convert description to speech."""
        pass

    def process(self, image_path: str, audio_output: str) -> str:
        """Describe image and create audio narration."""
        description = self.describe(image_path)
        self.narrate(description, audio_output)
        return description

# Test
# narrator = ImageNarrator()
# desc = narrator.process("photo.jpg", "narration.mp3")
# print(f"Description: {desc}")
```

---
## Self-Check

- [ ] Can send images to OpenAI and Claude
- [ ] Can encode images as base64
- [ ] Can optimize images to reduce tokens
- [ ] Can transcribe audio with Whisper
- [ ] Can generate speech with TTS
- [ ] Can combine vision and audio in pipelines