# Exercise: Video Tagging with Pydantic AI and Gemini

In this exercise, you will use Pydantic AI with Gemini 2.5 Flash-Lite to tag videos.

Suppose we have a small collection of videos that we want to use to train a planning model for a robotic hand. The hand needs to learn basic object manipulations. We will use video tagging to pre-label and categorize the videos so that downstream QA teams can assess them more easily.

## Import Libraries and Load Environment

Before starting, make sure you have placed your Google Gemini credentials in the `.env` file:

```bash
cp env.example .env
```
Then edit `.env` and set `GEMINI_API_KEY` to your key.

In [None]:
import os
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent, BinaryContent
from IPython.display import Video, display

# Load environment variables
assert load_dotenv(), "Please prepare a .env file with your GEMINI_API_KEY"
assert os.getenv("GEMINI_API_KEY"), "GEMINI_API_KEY not found in .env file"

# Only needed on the Udacity workspace. Comment this out if running on another system.
os.environ['HF_HOME'] = '/voc/data/huggingface'
os.environ['OLLAMA_MODELS'] = '/voc/data/ollama/cache'
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['PATH'] = f"/voc/data/ollama/bin:/voc/data/ffmpeg/bin:{os.environ.get('PATH', '')}"
os.environ['LD_LIBRARY_PATH'] = f"/voc/data/ollama/lib:/voc/data/ffmpeg/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

# This is needed to use asyncio within jupyter
import nest_asyncio

nest_asyncio.apply()

## Define Response Schema

As usual, we need to define our response schema.
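
Before filling in the schema, it may help to see how `Literal` constrains a field. The following is a minimal sketch on a toy schema, assuming Pydantic v2. `FruitReview` and its fields are hypothetical names invented for illustration and are not part of the exercise.

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError


class FruitReview(BaseModel):
    """Toy schema (hypothetical), unrelated to the video tagging task."""

    # A single value restricted to a fixed set of strings
    ripeness: Literal["unripe", "ripe", "overripe"] = Field(
        description="How ripe the fruit is"
    )
    # A list in which every element must come from a fixed set of strings
    colors: List[Literal["red", "green", "yellow"]] = Field(
        description="Colors visible on the fruit"
    )


# Values inside the allowed sets validate normally
review = FruitReview(ripeness="ripe", colors=["red", "yellow"])
print(review.model_dump())

# A value outside the allowed set is rejected at validation time
try:
    FruitReview(ripeness="rotten", colors=["red"])
    rejected = False
except ValidationError:
    rejected = True
print("Invalid value rejected:", rejected)
```

Pydantic also includes the allowed values in the model's JSON schema, so the constraint that validates the output can also tell the LLM which tags it may use.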

In [None]:
from typing import List
from typing_extensions import Literal


# TODO: Define a video tagging output schema with Pydantic
# The schema should include:
# 1. description: a detailed description of the action performed in the video
# 2. quality: an overall quality assessment (poor, ok, good)
# 3. actions: a list of actions (move sideways, move vertically,
#    put object 1 into object 2, pull object 1 out of object 2)
# 4. objects_involved_in_the_action: the number of distinct objects involved
#    in the action
# 5. objects_present: a list of objects present in the video from the
#    following options:
#    candle, container, cotton swabs, basket, box, tape, board game, mug,
#    jar, slipper, scarf, smartphone, pen
# HINT: remember to use Literal[...] to define fields that can only have specific values,
# for example Literal["poor", "ok", "good"]
class VideoTagging(BaseModel):
    """Structured output for video tagging analysis."""

    # TODO: remember to assign the type and use the Field function to add 
    # a description
    description: str = Field(
        description="A detailed description of the action performed in the video"
    )
    
    # TODO
    # HINT: remember you can use Literal[...] to restrict the possible values
    quality: ... #complete

    # TODO
    actions: ... #complete

    # TODO
    objects_involved_in_the_action: int = Field(
        description="Number of distinct objects involved in the action"
    )
    
    objects_present: List[
        Literal[
            "candle",
            "container",
            "cotton swabs",
            "basket",
            "box",
            "tape",
            "board game",
            "mug",
            "jar",
            "slipper",
            "scarf",
            "smartphone",
            "pen",
        ]
    ] = Field(
        description="List of objects present in the video (e.g., mug, box, pen)"
    )

## Create Pydantic AI Agent

Set up the agent with Gemini for video content analysis.

In [None]:
from pydantic_ai.models.google import GoogleModelSettings


# Disable thinking to avoid long delays and timeouts
settings = GoogleModelSettings(
    google_thinking_config={"thinking_budget": 0},
    temperature=0,
    seed=42,
)

# TODO: Create an agent for video tagging using the VideoTagging schema
# Use the "gemini-2.5-flash-lite" model and the settings defined above
# Remember to use output_type to set the output schema to the one we defined
# above
# Craft a short but precise set of instructions for the agent
# Also set retries=5 or some other large number, as some calls tend to fail
# because a video payload is much larger than plain text
video_agent = ... #complete

## Helper Function for Video Processing

Let's define a helper function to load and format video files for analysis.

In [None]:
def load_video_for_analysis(video_path):
    """Load video file and format it for Pydantic AI."""
    with open(video_path, 'rb') as f:
        video_bytes = f.read()
    
    # Create binary content for Pydantic AI
    # (see note at the end about using File API for longer videos)
    video_content = BinaryContent(
        data=video_bytes,
        media_type='video/mp4'  # Adjust based on your video format
    )
    
    return video_content
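
The helper above hard-codes `video/mp4`. If your collection mixes formats, one option is to derive the media type from the file extension with the standard library. This is a sketch rather than part of the exercise: `guess_video_media_type` is a hypothetical helper, and it assumes the file extension reflects the actual container format.

```python
import mimetypes
from pathlib import Path


def guess_video_media_type(video_path, default="video/mp4"):
    """Guess a video media type from the file extension (hypothetical helper)."""
    media_type, _encoding = mimetypes.guess_type(Path(video_path).name)
    # Fall back to the default when the extension is unknown or not a video type
    if media_type is None or not media_type.startswith("video/"):
        return default
    return media_type


print(guess_video_media_type("clip.mp4"))
print(guess_video_media_type("notes.unknownext"))
```

The returned string could then be passed as `media_type` when building the `BinaryContent`.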

## Video Analysis

Analyze each video in the `../videos` folder and print the structured output.

In [None]:
from IPython.display import Video, display
from pathlib import Path
from pprint import pprint


# Sort the paths so the videos are processed in a reproducible order
videos = sorted(Path("../videos").glob("*.mp4"))

for video_path in videos:
    print(f"Original video: {video_path.name}")
    display(Video(str(video_path)))
    
    video_content = load_video_for_analysis(video_path)
    
    print("\nAnalyzing video content...")

    # TODO: run the agent on the video
    # HINT: just call video_agent.run_sync providing a list containing a prompt
    # (like "provide a description of this video"), and the preprocessed video
    # (video_content)
    result = ... #complete
    
    print("\nVideo Analysis Results:")
    pprint(result.output.model_dump(), width=80, depth=None)
    print("\n" + "="*80 + "\n")
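
Once every video has been tagged, the individual results can be rolled up into a summary for the QA team. The sketch below assumes you collected each `result.output.model_dump()` into a list. `summarize_tags` is a hypothetical helper, and the sample records are made up for illustration rather than real model output.

```python
from collections import Counter


def summarize_tags(records):
    """Count quality labels and objects across tagging results (hypothetical helper)."""
    quality_counts = Counter(record["quality"] for record in records)
    object_counts = Counter(
        obj for record in records for obj in record["objects_present"]
    )
    return quality_counts, object_counts


# Made-up records shaped like VideoTagging.model_dump() output (subset of fields)
sample_records = [
    {"quality": "good", "objects_present": ["mug", "box"]},
    {"quality": "poor", "objects_present": ["mug"]},
    {"quality": "good", "objects_present": ["pen", "box"]},
]

quality_counts, object_counts = summarize_tags(sample_records)
print(dict(quality_counts))
print(object_counts.most_common(2))
```

A summary like this makes it easy to spot, for example, how many clips were rated `poor` before they reach manual review.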

