# Audio Command Recognition with Pydantic AI and Gemini

This notebook demonstrates how to use Pydantic AI with Gemini to recognize voice commands for a robot control system.

## Import Libraries and Load Environment

Before starting, make sure you have placed your Google Gemini credentials in the `.env` file **in the parent folder**:

```bash
cp ../env.example ../.env
```
Then edit `../.env` and set `GEMINI_API_KEY` to your key.

In [1]:
import os
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent, BinaryContent
import numpy as np
from IPython.display import Audio, display
import io
import soundfile as sf

# Load environment variables
assert load_dotenv("../.env"), "Please prepare a .env file with your GEMINI_API_KEY"
assert os.getenv("GEMINI_API_KEY"), "GEMINI_API_KEY not found in .env file"

# This is needed to use asyncio within jupyter
import nest_asyncio

nest_asyncio.apply()

## Define Command Schema

We'll use Pydantic models to structure our command recognition output. This ensures consistent output that can be directly used by a robot control system.

In [2]:
from typing import Optional
from typing_extensions import Literal

# TODO: define a Pydantic model for the structured output
# It should contain:
# - action: the recognized robot command (from a predefined set). Use Literal
#   to restrict to valid commands. The commands should be "move_forward",
#   "move_backward", "turn_left", "turn_right", "pick_up", "put_down", "stop", and "unknown"
# - target_object: an optional string for the object mentioned in the command, if any
# - rationale: a string explaining why this command was recognized
class RobotCommand(BaseModel):
    """Structured output for robot command recognition."""

    action: Literal[
        "move_forward",
        "move_backward",
        "turn_left",
        "turn_right",
        "pick_up",
        "put_down",
        "stop",
        "unknown"
    ] = Field(description="The recognized robot action command")
    
    target_object: Optional[str] = Field(
        default=None, description="Object mentioned in the command, if any"
    )
    
    rationale: str = Field(
        description="Brief explanation of why this command was recognized"
    )
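
As a quick check of what this schema enforces, the sketch below repeats the same model so that it runs on its own and validates two hand-written payloads. Both payloads are illustrative, not model outputs, and the snippet assumes Pydantic v2 (`model_validate`).

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class RobotCommand(BaseModel):
    """Same schema as above, repeated so this sketch is self-contained."""

    action: Literal[
        "move_forward", "move_backward", "turn_left", "turn_right",
        "pick_up", "put_down", "stop", "unknown",
    ] = Field(description="The recognized robot action command")
    target_object: Optional[str] = Field(
        default=None, description="Object mentioned in the command, if any"
    )
    rationale: str = Field(
        description="Brief explanation of why this command was recognized"
    )


# A payload with a valid action passes; target_object falls back to None.
ok = RobotCommand.model_validate(
    {"action": "stop", "rationale": "Speaker said 'stop'."}
)
print(ok.action, ok.target_object)

# An action outside the Literal set is rejected before it can reach the robot.
try:
    RobotCommand.model_validate(
        {"action": "jump", "rationale": "Not a valid command."}
    )
    rejected = False
except ValidationError:
    rejected = True
print("rejected:", rejected)
```

Restricting `action` with `Literal` means a downstream controller only ever has to handle the eight listed values.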

## Create Pydantic AI Agent

Set up the agent with Gemini for robot command recognition.

In [None]:
from pydantic_ai.models.google import GoogleModelSettings


# Disable thinking (budget of 0) to avoid long delays and timeouts
settings = GoogleModelSettings(
    google_thinking_config={"thinking_budget": 0},
)

# TODO: create the pydantic ai Agent for this. For the prompt,
# come up with a good prompt explaining to the model that it needs to
# recognize robot commands from audio. 
# Use the RobotCommand model as output_type
# and remember to include model_settings=settings to avoid
# long thinking times
# For the instructions, be structured and precise.
command_agent = Agent(
    model="gemini-2.5-flash-lite",
    output_type=RobotCommand,
    instructions="""
    You are a robot command recognition system. Listen to the audio and identify:
    1. The primary action command (movement, manipulation, or control)
    2. Any objects if the action involves manipulating an object, otherwise leave empty
    3. A rationale for your decision

    Commands include:
    - Movement
    - Manipulation
    - Control

    Remember that clockwise means towards the right, and counterclockwise means 
    towards the left.

    If the audio or the command is unclear or not a robot command, use "unknown" action.
    Be precise and focus on actionable commands.
    """,
    model_settings=settings,
)

## Load Dataset

We'll use the same audio dataset and pretend these are voice commands for our robot.

In [4]:
from pathlib import Path

samples = list(Path("../samples").glob("*.mp3"))
print(samples)

[PosixPath('../samples/grab_bottle.mp3'), PosixPath('../samples/left_turn.mp3'), PosixPath('../samples/metal_tool.mp3'), PosixPath('../samples/back_up.mp3'), PosixPath('../samples/turn_right.mp3'), PosixPath('../samples/stop.mp3'), PosixPath('../samples/coffee.mp3'), PosixPath('../samples/go_ahead.mp3')]
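
The order returned by `Path.glob` depends on the file system, which is why the list above is not alphabetical. If a reproducible order matters, for example when comparing runs, one option is to sort the paths. The sketch below uses a temporary directory with made-up file names rather than the real `../samples` folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical file names standing in for the real samples
    for name in ["stop.mp3", "back_up.mp3", "coffee.mp3", "notes.txt"]:
        (folder / name).touch()

    # glob filters on the extension; sorted() gives a stable alphabetical order
    sorted_names = [p.name for p in sorted(folder.glob("*.mp3"))]

print(sorted_names)
```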


## Helper Function for Audio Processing

In [5]:
import librosa


def audio_to_bytes(audio_array, sample_rate):
    """Convert audio array to bytes for Pydantic AI."""
    buffer = io.BytesIO()
    sf.write(buffer, audio_array, sample_rate, format='WAV')
    buffer.seek(0)
    return buffer.getvalue()

def format_audio_for_gemini(file_path: Path):
    """Format a single audio sample from the dataset."""
    audio_array, sample_rate = librosa.load(file_path, sr=None, mono=True)
    
    # Convert audio to bytes
    audio_bytes = audio_to_bytes(
        audio_array,
        sample_rate
    )
    
    # Create binary content for Pydantic AI
    audio_content = BinaryContent(
        data=audio_bytes,
        media_type='audio/wav'
    )
    
    return audio_array, sample_rate, audio_content
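
To see what the WAV encoding step produces without needing an audio file, the sketch below builds one second of a synthetic tone and writes it to an in-memory WAV using only the standard-library `wave` module. It is a stand-in for `sf.write`, not the helper defined above: it assumes 16-bit mono PCM, and the `tone_to_wav_bytes` name, the 440 Hz tone, and the 16 kHz sample rate are illustrative choices.

```python
import io
import math
import struct
import wave


def tone_to_wav_bytes(
    freq_hz: float = 440.0, sample_rate: int = 16000, seconds: float = 1.0
) -> bytes:
    """Encode a sine tone as 16-bit mono PCM WAV bytes (illustrative helper)."""
    n_frames = int(sample_rate * seconds)
    # Scale samples from [-0.5, 0.5] to the signed 16-bit integer range
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    ]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{n_frames}h", *samples))
    return buffer.getvalue()


wav_bytes = tone_to_wav_bytes()

# Read the bytes back to confirm the header matches what was written
with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
    info = (wav_file.getnchannels(), wav_file.getframerate(), wav_file.getnframes())

print(wav_bytes[:4], info)
```

Bytes produced this way could be wrapped in `BinaryContent(data=wav_bytes, media_type='audio/wav')` in the same way as the output of `audio_to_bytes`.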

## Single Command Recognition

Analyze one audio sample to see how the system interprets it as a robot command.

In [None]:
# Select a sample to analyze
audio_array, sample_rate, audio_content = format_audio_for_gemini(samples[0])


# Display the audio
display(Audio(audio_array, rate=sample_rate))

# TODO: run agent to recognize the command
# Hint: use command_agent.run_sync
# The input should be a list with two elements:
# 1. A string with a short prompt instructing to analyze the audio 
#    for robot commands.
# 2. The audio_content variable
result = command_agent.run_sync(
    ["Analyze this audio for robot commands.", audio_content]
)

print("\nCommand Recognition Results:")
print(f"Action: {result.output.action}")
print(f"Target Object: {result.output.target_object}")
print(f"Reasoning: {result.output.rationale}")


Command Recognition Results:
Action: pick_up
Target Object: blue bottle
Reasoning: The user wants to grab the blue bottle.


## Batch Command Analysis

Test the system with multiple audio samples to see how it interprets various speech patterns as robot commands.

In [8]:
print("Testing multiple audio samples for robot commands:")
print("=" * 50)

for sample_path in samples:

    # TODO: format audio for Gemini
    audio_array, sample_rate, audio_content = format_audio_for_gemini(sample_path)

    print(f"\nSample {sample_path}")
    display(Audio(audio_array, rate=sample_rate))

    # TODO: run agent to recognize the command
    result = command_agent.run_sync(
        ["Analyze this audio for robot commands.", audio_content]
    )

    print(f"Recognized Command: {result.output.action}")
    print(f"Object: {result.output.target_object or 'None'}")
    print(f"Reasoning: {result.output.rationale}")
    print("-" * 30)

Testing multiple audio samples for robot commands:

Sample ../samples/grab_bottle.mp3


Recognized Command: pick_up
Object: blue bottle
Reasoning: The user wants to grab the blue bottle from the table.
------------------------------

Sample ../samples/left_turn.mp3


Recognized Command: turn_left
Object: None
Reasoning: The user said 'make a left' which indicates a left turn.
------------------------------

Sample ../samples/metal_tool.mp3


Recognized Command: put_down
Object: metal tool
Reasoning: The user wants to place the metal tool on the surface.
------------------------------

Sample ../samples/back_up.mp3


Recognized Command: move_backward
Object: None
Reasoning: The user said "back up", which is a command to move backward.
------------------------------

Sample ../samples/turn_right.mp3


Recognized Command: turn_right
Object: None
Reasoning: The user said "rotate clockwise" which maps to "turn_right".
------------------------------

Sample ../samples/stop.mp3


Recognized Command: stop
Object: None
Reasoning: The user explicitly stated the command 'stop'.
------------------------------

Sample ../samples/coffee.mp3


Recognized Command: unknown
Object: None
Reasoning: The audio does not contain a robot command. It contains a statement about coffee tasting bitter.
------------------------------

Sample ../samples/go_ahead.mp3


Recognized Command: move_forward
Object: unknown
Reasoning: The user command 'go ahead and continue' directly translates to a forward movement.
------------------------------
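
Once an action string has been recognized, a control loop still has to map it to behavior. One possible sketch is a dispatch table keyed by action name. The handler functions and the strings they return are hypothetical placeholders, not part of any robot API; only three actions are mapped here, and `"unknown"` or any other unmapped action falls through to a no-op.

```python
from typing import Callable, Dict, Optional


# Hypothetical handlers: a real system would call the robot's motion API here.
def move_forward(target: Optional[str]) -> str:
    return "moving forward"


def pick_up(target: Optional[str]) -> str:
    return f"picking up {target or 'nearest object'}"


def stop(target: Optional[str]) -> str:
    return "stopping"


def ignore(target: Optional[str]) -> str:
    return "no action taken"


HANDLERS: Dict[str, Callable[[Optional[str]], str]] = {
    "move_forward": move_forward,
    "pick_up": pick_up,
    "stop": stop,
}


def dispatch(action: str, target_object: Optional[str] = None) -> str:
    """Look up the handler for an action; unmapped actions are ignored."""
    return HANDLERS.get(action, ignore)(target_object)


print(dispatch("pick_up", "blue bottle"))
print(dispatch("unknown"))
```

With the agent from this notebook, the call would look like `dispatch(result.output.action, result.output.target_object)`.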
