# Exercise: Video Tagging with Pydantic AI and Gemini

In this exercise you will use Pydantic AI with Gemini 2.5 Flash-Lite to tag videos.

Suppose we have a small collection of videos that we want to use to train a planning model for a robotic hand. The hand needs to learn basic object manipulations. We will use video tagging to pre-label and categorize the videos, so that downstream QA teams can assess them more easily.

## Import Libraries and Load Environment

Before starting, make sure you have placed your Google Gemini credentials in the `.env` file:

```bash
cp env.example .env
```
Then edit `.env` and set `GEMINI_API_KEY` to your key.

In [None]:
import os
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent, BinaryContent
from IPython.display import Video, display

# Load environment variables
assert load_dotenv(), "Please prepare a .env file with your GEMINI_API_KEY"
assert os.getenv("GEMINI_API_KEY"), "GEMINI_API_KEY not found in .env file"

# Only needed on the Udacity workspace. Comment this out if running on another system.
os.environ['HF_HOME'] = '/voc/data/huggingface'
os.environ['OLLAMA_MODELS'] = '/voc/data/ollama/cache'
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['PATH'] = f"/voc/data/ollama/bin:/voc/data/ffmpeg/bin:{os.environ.get('PATH', '')}"
os.environ['LD_LIBRARY_PATH'] = f"/voc/data/ollama/lib:/voc/data/ffmpeg/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

# This is needed to use asyncio within Jupyter
import nest_asyncio

nest_asyncio.apply()
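
The `assert` statements above are skipped when Python runs with the `-O` flag, so outside the notebook an explicit check may be more robust. Below is a minimal sketch of that idea; `require_env` is a hypothetical helper that is not part of the exercise, and `EXAMPLE_API_KEY` is a placeholder variable used only so the snippet runs anywhere.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and reload the environment."
        )
    return value


# Demonstration with a placeholder variable (an assumption, not a real key).
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook you would call `require_env("GEMINI_API_KEY")` after `load_dotenv()`.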

## Define Response Schema

As usual, we need to define our response schema.

In [None]:
from typing import List
from typing_extensions import Literal


# TODO: Define a video tagging output schema with Pydantic
# The schema should include:
# 1. An overall quality assessment (poor, ok, good)
# 2. A list of actions (move object sideways, move object vertically,
#    put object into another object, pull object out of another object)
# 3. A list of objects present in the video from the following options:
#    candle, container, cotton swabs, basket, box, tape, board game, mug,
#    jar, slipper, scarf, smartphone, pen
class VideoTagging(BaseModel):
    """Structured output for video tagging analysis."""

    # TODO: remember to assign the type and use the Field function to add 
    # a description
    description: str = Field(
        description="A detailed description of the action performed in the video"
    )
    
    # TODO
    # HINT: remember you can use Literal[...] to restrict the possible values
    quality: Literal[
        "poor",
        "ok",
        "good",
    ] = Field(
        description="Overall technical quality of the video (e.g., resolution, stability)"
    )

    # TODO
    actions: List[
        Literal[
            "move object sideways",
            "move object vertically",
            "put object into another object",
            "pull object out of another object",
        ]
    ] = Field(description="Actions performed on the objects in the video")

    # TODO
    objects_involved_in_the_action: int = Field(
        description="Number of distinct objects involved in the action"
    )
    
    objects_present: List[
        Literal[
            "candle",
            "container",
            "cotton swabs",
            "basket",
            "box",
            "tape",
            "board game",
            "mug",
            "jar",
            "slipper",
            "scarf",
            "smartphone",
            "pen",
        ]
    ] = Field(
        description="List of objects present in the video (e.g., mug, jar, pen)"
    )
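
To see what `Literal` buys us, the sketch below validates two hand-written payloads against a cut-down schema. `MiniTagging` and its three-object vocabulary are hypothetical stand-ins for `VideoTagging`, chosen only to keep the example short. It assumes Pydantic v2, which provides `model_validate`.

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError


class MiniTagging(BaseModel):
    """Cut-down, hypothetical stand-in for VideoTagging."""

    quality: Literal["poor", "ok", "good"] = Field(description="Overall quality")
    objects_present: List[Literal["mug", "jar", "pen"]] = Field(
        description="Objects present in the video"
    )


# A payload that only uses allowed labels validates cleanly.
valid = MiniTagging.model_validate(
    {"quality": "ok", "objects_present": ["mug", "pen"]}
)
print(valid.model_dump())

# A label outside the vocabulary ("car") is rejected.
try:
    MiniTagging.model_validate({"quality": "ok", "objects_present": ["car"]})
    rejected = False
except ValidationError as exc:
    rejected = True
    print(f"Rejected with {exc.error_count()} error(s)")
```

The same mechanism applies to the full schema: a response that contains a label outside the allowed options fails validation instead of silently entering the dataset.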

## Create Pydantic AI Agent

Set up the agent with Gemini for video content analysis.

In [19]:
from pydantic_ai.models.google import GoogleModelSettings


# Disable thinking to avoid long delays and timeouts
settings = GoogleModelSettings(
    google_thinking_config={"thinking_budget": 0},
    temperature=0,
    seed=42,
)

# TODO: Create an agent for video tagging using the VideoTagging schema
# Use the "gemini-2.5-flash-lite" model and the settings defined above
# Remember to use output_type to specify the output schema
# Craft a short but precise set of instructions for the agent
video_agent = Agent(
    model="gemini-2.5-flash-lite",
    output_type=VideoTagging,
    instructions="""
    You are an expert video content analyzer. Watch the provided video carefully and determine:

    1. The overall technical quality of the video (poor, ok, good)
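    2. The actions being performed (choose from: move object sideways, move object vertically,
       put object into another object, pull object out of another object)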
    3. The number of distinct objects involved in the action
    4. The objects present in the video

    # Rules:
    - PAY ATTENTION to the real action that is being performed on objects, do not get fooled by
      merely the movement of the hand
    """,
    model_settings=settings,
    retries=5
)
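
One way to keep the action list in the instructions consistent with the `VideoTagging` schema is to generate it from the `Literal` type rather than typing it twice. The sketch below illustrates the idea with the standard library only. `ActionLabel`, `format_choices`, and `action_instruction` are hypothetical names introduced for this example.

```python
from typing import Literal, get_args

# Same four labels used in the VideoTagging schema above.
ActionLabel = Literal[
    "move object sideways",
    "move object vertically",
    "put object into another object",
    "pull object out of another object",
]


def format_choices(literal_type) -> str:
    """Render the allowed values of a Literal type as a comma-separated string."""
    return ", ".join(get_args(literal_type))


# Hypothetical instruction line built from the schema labels.
action_instruction = (
    f"2. The actions being performed (choose from: {format_choices(ActionLabel)})"
)
print(action_instruction)
```

The resulting string could then be interpolated into the `instructions` argument of the agent, so a change to the labels only has to be made in one place.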

## Helper Function for Video Processing

Let's define a helper function to load and format video files for analysis.

In [17]:
def load_video_for_analysis(video_path):
    """Load video file and format it for Pydantic AI."""
    with open(video_path, 'rb') as f:
        video_bytes = f.read()
    
    # Create binary content for Pydantic AI
    # (see note at the end about using File API for longer videos)
    video_content = BinaryContent(
        data=video_bytes,
        media_type='video/mp4'  # Adjust based on your video format
    )
    
    return video_content
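
The helper above hard-codes `video/mp4`. If the collection mixes formats, the media type could be derived from the file extension instead. The following is a minimal sketch using the standard `mimetypes` module; `guess_video_media_type` is a hypothetical helper, and whether Gemini accepts a given video format is a separate question not checked here.

```python
import mimetypes
from pathlib import Path


def guess_video_media_type(video_path) -> str:
    """Guess the MIME type of a video from its file extension.

    Hypothetical helper: raises ValueError when the extension is unknown
    or does not map to a video type.
    """
    media_type, _ = mimetypes.guess_type(Path(video_path).name)
    if media_type is None or not media_type.startswith("video/"):
        raise ValueError(f"Cannot determine a video media type for {video_path!r}")
    return media_type


print(guess_video_media_type("../videos/175060.mp4"))
```

The returned string could be passed as `media_type` when building the `BinaryContent` object.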

## Video Analysis

Analyze each video file and print the structured tagging output.

In [None]:
from IPython.display import Video, display
from pathlib import Path
from pprint import pprint


videos = Path("../videos").glob("*.mp4")

for video_path in videos:
    print(f"Original video: {video_path.name}")
    display(Video(str(video_path)))
    
    video_content = load_video_for_analysis(video_path)
    
    print("\nAnalyzing video content...")

    # TODO: run the agent on the video
    result = video_agent.run_sync(
        ["Please provide a detailed description of this video.", video_content]
    )
    
    print("\nVideo Analysis Results:")
    pprint(result.output.model_dump(), width=80, depth=None)
    print("\n" + "="*80 + "\n")



Original video: 175060.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['pull object out of another object'],
 'description': 'A white string is being pulled out of a green container.',
 'objects_involved_in_the_action': 2,
 'objects_present': ['container'],
 'quality': 'poor'}


Original video: 161206.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['move object sideways'],
 'description': 'A hand is seen manipulating a pink and purple object, '
                'possibly a scarf, on a textured rug. The object is being '
                'moved around, but the specific action is unclear due to the '
                'blurriness of the video.',
 'objects_involved_in_the_action': 1,
 'objects_present': ['scarf'],
 'quality': 'poor'}


Original video: 73015.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['move object sideways'],
 'description': 'A hand is shown pushing a pink cylindrical object across a '
                'white surface with yellow and red markings. The object moves '
                'from left to right and out of frame.',
 'objects_involved_in_the_action': 1,
 'objects_present': ['container'],
 'quality': 'ok'}


Original video: 201684.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['pull object out of another object'],
 'description': 'A hand reaches into a wicker basket and pulls out a deck of '
                'cards.',
 'objects_involved_in_the_action': 2,
 'objects_present': ['basket', 'board game'],
 'quality': 'good'}


Original video: 130178.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['pull object out of another object'],
 'description': 'A hand reaches into a glass jar and pulls out a small, '
                'light-colored object. The object appears to be a piece of '
                'candy or a similar small item. The hand then removes the '
                'object from the jar.',
 'objects_involved_in_the_action': 2,
 'objects_present': ['jar', 'candle'],
 'quality': 'ok'}


Original video: 121844.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['move object sideways'],
 'description': 'A hand is shown pushing a box of "Werewolves" board game '
                'sideways.',
 'objects_involved_in_the_action': 1,
 'objects_present': ['box', 'board game'],
 'quality': 'good'}


Original video: 108542.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['put object into another object'],
 'description': 'A hand is shown placing a slipper with a cartoon character '
                'design onto a rug.',
 'objects_involved_in_the_action': 2,
 'objects_present': ['slipper'],
 'quality': 'good'}


Original video: 212936.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['move object sideways'],
 'description': 'A hand is holding a mug with the text "To infinite energy '
                'COFFEE" and rotating it.',
 'objects_involved_in_the_action': 1,
 'objects_present': ['mug'],
 'quality': 'good'}


Original video: 33539.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['pull object out of another object'],
 'description': 'A hand reaches into the frame and pulls a single cotton swab '
                'out of a clear plastic container filled with cotton swabs.',
 'objects_involved_in_the_action': 2,
 'objects_present': ['cotton swabs', 'container'],
 'quality': 'ok'}


Original video: 194530.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['put object into another object'],
 'description': 'A hand is shown placing a blue pen on top of a black '
                'smartphone. The pen is positioned horizontally across the '
                'phone. The video quality is poor due to low resolution and '
                'shaky camera work.',
 'objects_involved_in_the_action': 2,
 'objects_present': ['smartphone', 'pen'],
 'quality': 'poor'}


Original video: 34137.mp4



Analyzing video content...

Video Analysis Results:
{'actions': ['pull object out of another object'],
 'description': 'A hand is shown picking up a cylindrical object with a dark '
                'red, patterned surface. The object appears to be a candle. '
                'The hand then moves the object slightly to the right before '
                'placing it back down.',
 'objects_involved_in_the_action': 1,
 'objects_present': ['candle'],
 'quality': 'ok'}
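
Since the goal is to help QA teams triage videos, the structured results can be summarized and sanity-checked in a few lines. The sketch below uses three of the results printed above, copied by hand with the descriptions omitted. The `needs_review` rule is a hypothetical heuristic: it assumes that a "put into" or "pull out of" action always involves at least two objects.

```python
from collections import Counter

# Three results copied from the output above (descriptions omitted for brevity).
results = {
    "201684.mp4": {
        "actions": ["pull object out of another object"],
        "objects_involved_in_the_action": 2,
        "objects_present": ["basket", "board game"],
        "quality": "good",
    },
    "121844.mp4": {
        "actions": ["move object sideways"],
        "objects_involved_in_the_action": 1,
        "objects_present": ["box", "board game"],
        "quality": "good",
    },
    "34137.mp4": {
        "actions": ["pull object out of another object"],
        "objects_involved_in_the_action": 1,
        "objects_present": ["candle"],
        "quality": "ok",
    },
}

# Actions assumed (for this heuristic) to involve two objects.
TWO_OBJECT_ACTIONS = {
    "put object into another object",
    "pull object out of another object",
}


def needs_review(tags: dict) -> bool:
    """Flag results whose action implies two objects but whose count is lower."""
    has_two_object_action = any(a in TWO_OBJECT_ACTIONS for a in tags["actions"])
    return has_two_object_action and tags["objects_involved_in_the_action"] < 2


quality_counts = Counter(tags["quality"] for tags in results.values())
flagged = [name for name, tags in results.items() if needs_review(tags)]

print(dict(quality_counts))
print(flagged)
```

In the loop above, the same dictionary could be filled with `result.output.model_dump()` for each video instead of hand-copied values.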


