# Building Voice Agents with Azure OpenAI Realtime API
## WeAreDevelopers Conference Workshop (20 minutes)

### 🎯 Workshop Objectives
By the end of this **20-minute session**, you will:
- **See** a live demo of our TPS report voice agent
- **Understand** the core architecture and key components  
- **Get** the complete code to build your own voice agent
- **Know** next steps for production deployment

### 🚀 What We'll Cover (Live Demo Focus)
1. **Quick Setup** (2 min) - Environment and credentials
2. **Live Demo** (12 min) - TPS report voice agent in action
3. **Code Walkthrough** (5 min) - Key implementation highlights
4. **Next Steps** (1 min) - Resources and production tips

### 🎭 The Demo: TPS Reports Voice Agent
We're showcasing a hands-free voice agent that helps employees file TPS reports while driving:
- Asks if the user has filed their TPS report
- Collects report details through natural conversation
- Generates a structured JSON report via function calling
- Handles interruptions and real conversation flow

**Let's see it in action!** 🚀

In [None]:
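# Each `!` command runs in its own subshell, so activating a venv here would not persist.
# Create and activate the environment in a terminal before launching Jupyter:
#   python -m venv .venv
#   .venv/Scripts/activate   # Windows; on Unix: source .venv/bin/activate
# Then install the dependencies into the running kernel's environment:
%pip install -r requirements.txt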

**Download Connection Details**

Use this [link](https://pwpush.com/p/godbrgvnmdm/r) to download the content for your `.env` file.  
Copy and paste the downloaded content into your local `.env` file to configure your environment.
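
For orientation, here is a rough sketch of the shape the `.env` file is expected to have. The three variable names are the ones the credentials check reads; every value below is a hypothetical placeholder, and the tiny parser is only for illustration (the notebook itself relies on `python-dotenv`).

```python
# Placeholder .env content; every value here is hypothetical.
SAMPLE_ENV = """
AZURE_OPENAI_ENDPOINT=https://my-resource.openai.azure.com
AZURE_VOICE_COMPLETION_DEPLOYMENT_NAME=my-realtime-deployment
AZURE_OPENAI_API_KEY=replace-with-your-key
"""

REQUIRED_KEYS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_VOICE_COMPLETION_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_KEY",
]

def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

def missing_keys(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]

config = parse_env(SAMPLE_ENV)
print("Missing keys:", missing_keys(config))
```

If the printed list is not empty, the named variables still need to be added to your `.env` file.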

## ⚡ Quick Setup (2 minutes)

Before we demo, let's quickly verify our environment and Azure OpenAI credentials.

In [None]:
# Quick Azure OpenAI credentials check
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Azure OpenAI configuration
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_DEPLOYMENT_NAME = os.getenv("AZURE_VOICE_COMPLETION_DEPLOYMENT_NAME") 
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")

# Quick verification
print("🔧 Azure OpenAI Configuration:")
print(f"📍 Endpoint: {'✅ Configured' if AZURE_OPENAI_ENDPOINT else '❌ Missing'}")
print(f"🚀 Deployment: {'✅ Configured' if AZURE_DEPLOYMENT_NAME else '❌ Missing'}")
print(f"🔑 API Key: {'✅ Configured' if AZURE_OPENAI_API_KEY else '❌ Missing'}")

if AZURE_OPENAI_ENDPOINT and AZURE_DEPLOYMENT_NAME and AZURE_OPENAI_API_KEY:
    print("\n✨ Ready for demo!")
else:
    print("\n⚠️ Please check your .env file configuration")
    print("Required: AZURE_OPENAI_ENDPOINT, AZURE_VOICE_COMPLETION_DEPLOYMENT_NAME, AZURE_OPENAI_API_KEY")

## Architecture Overview (Demo Context)

Our voice agent has 4 key components working together in real-time:

1. **Frontend (Browser)**: Captures audio, sends to middle tier, plays responses
2. **Middle Tier (Python)**: Routes messages, manages tools, handles auth
3. **Azure OpenAI**: Processes voice, runs conversation, calls functions
4. **Custom Tools**: Business logic (our TPS report generator)

**Real-time Flow:**
```
🎤 User speaks → 🌐 WebSocket → 🧠 Azure OpenAI → 🛠️ Function Call → 📊 JSON Report
```

**Now let's see it working!**

## 🏗️ Voice Agent Architecture Overview

<img src="img/voice_agent_architecture.png" alt="Voice Agent Architecture" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### 🔄 Real-time Data Flow Example:

1. 🎤 **User**: "I need to file TPS report 123 for Acme Corp"
2. 🌐 **WebSocket** sends audio to RTMiddleTier
3. 🔄 **RTMiddleTier** forwards to Azure OpenAI
4. 🧠 **Azure** processes speech & understands intent
5. 🛠️ **Function call** triggered: `generate_report()`
6. 📊 **Tool** generates JSON report
7. 🔊 **Response** sent back to user as speech
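
Steps 5 to 7 can be sketched as a small dispatcher. This is an illustration only, not the actual `RTMiddleTier` code: the event shapes follow the Realtime API's `function_call` item and `function_call_output` conventions as I understand them, and `handle_server_event`, `TOOLS`, and the sample event are hypothetical names made up for this example.

```python
import asyncio
import json

async def generate_report(args: dict) -> dict:
    """Stand-in tool: echo two of the collected fields as a report."""
    return {"tps_report_id": args.get("tps_report_id"), "status": args.get("status")}

TOOLS = {"generate_report": generate_report}  # hypothetical tool registry

async def handle_server_event(event: dict) -> list:
    """Return the messages a middle tier might send back to the model."""
    item = event.get("item", {})
    is_call = (
        event.get("type") == "response.output_item.done"
        and item.get("type") == "function_call"
    )
    if not is_call:
        return []  # not a completed function call; nothing to do
    tool = TOOLS.get(item["name"])
    if tool is None:
        return []  # unknown tool name; ignored in this sketch
    result = await tool(json.loads(item["arguments"]))
    return [
        # Attach the tool output to the conversation...
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": json.dumps(result),
            },
        },
        # ...then ask the model to continue with a spoken response.
        {"type": "response.create"},
    ]

sample_event = {
    "type": "response.output_item.done",
    "item": {
        "type": "function_call",
        "name": "generate_report",
        "call_id": "call_1",
        "arguments": json.dumps({"tps_report_id": "123", "status": "active"}),
    },
}

# In a notebook, use `await handle_server_event(sample_event)` instead of asyncio.run.
outgoing = asyncio.run(handle_server_event(sample_event))
print(json.dumps(outgoing, indent=2))
```

A real middle tier would also decide whether the result goes to the browser, to the model, or to both; the sketch only covers the path back to the model.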

## 🚀 Live Demo Setup

Let's quickly set up our voice agent components for the live demo. We'll focus on the key pieces that make the magic happen!

In [None]:
# Essential classes for our voice agent demo
import json
from enum import Enum
from datetime import datetime

class ToolResultDirection(Enum):
    TO_SERVER = 1    # Send result back to Azure OpenAI
    TO_CLIENT = 2    # Send result to the frontend client

class ToolResult:
    def __init__(self, text: "str | dict", destination: ToolResultDirection):
        self.text = text    # Plain string, or a JSON-serializable dict (see to_text)
        self.destination = destination
    
    def to_text(self) -> str:
        return self.text if isinstance(self.text, str) else json.dumps(self.text)

class Tool:
    def __init__(self, target, schema: dict):
        self.target = target    # Function to execute
        self.schema = schema    # JSON schema for the tool

# Our TPS Report Tool - the star of the demo!
async def generate_report_tool(args: dict) -> ToolResult:
    """Generate a TPS report from conversation data"""
    report_data = {
        "tps_report_id": args.get("tps_report_id"),
        "customer_name": args.get("customer_name"), 
        "hours_spent": args.get("hours_spent"),
        "status": args.get("status"),
        "generated_at": datetime.now().isoformat(),
        "generated_by": "Voice Agent Demo"
    }
    
    print(f"📋 Generated TPS Report: {report_data}")
    return ToolResult(report_data, ToolResultDirection.TO_CLIENT)

# Tool schema for Azure OpenAI function calling
tps_report_schema = {
    "type": "function",
    "name": "generate_report",
    "description": "Generates a JSON TPS report from conversation data",
    "parameters": {
        "type": "object",
        "properties": {
            "tps_report_id": {"type": "string", "description": "Three-digit report ID"},
            "customer_name": {"type": "string", "description": "Customer name"}, 
            "hours_spent": {"type": "string", "description": "Hours spent"},
            "status": {"type": "string", "enum": ["active", "done", "postponed"]}
        },
        "required": ["tps_report_id", "customer_name", "hours_spent", "status"]
    }
}

# Create our demo tool
tps_tool = Tool(target=generate_report_tool, schema=tps_report_schema)
print("✅ TPS Report Tool ready for demo!")

## 🛠️ Function Calling Flow & Schema Structure

### Function Calling Process
<img src="img/function_calling.png" alt="Function Calling Flow" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### Tool Schema Structure
<img src="img/tools_schema.png" alt="Tools Schema" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### 🎯 Key Insights:
- **Function calling** bridges natural conversation and structured data
- **Schema** defines exactly what information to extract  
- **Azure OpenAI** automatically maps speech to function parameters
- **Tools** can return data to client, server, or both
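
Because the model fills in the function parameters from speech, it can help to check the arguments against the schema before running a tool. The sketch below is a minimal hand-rolled check of `required` fields and `enum` values only; `validate_args` is a hypothetical helper, and the schema is a trimmed copy of the TPS report schema used in this notebook.

```python
# Trimmed copy of the notebook's TPS report schema.
schema = {
    "type": "function",
    "name": "generate_report",
    "parameters": {
        "type": "object",
        "properties": {
            "tps_report_id": {"type": "string"},
            "customer_name": {"type": "string"},
            "hours_spent": {"type": "string"},
            "status": {"type": "string", "enum": ["active", "done", "postponed"]},
        },
        "required": ["tps_report_id", "customer_name", "hours_spent", "status"],
    },
}

def validate_args(schema: dict, args: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    params = schema["parameters"]
    problems = []
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems

good = {"tps_report_id": "123", "customer_name": "Acme", "hours_spent": "8", "status": "active"}
bad = {"tps_report_id": "123", "customer_name": "Acme", "status": "finished"}

print(validate_args(schema, good))
print(validate_args(schema, bad))
```

For anything beyond a demo, a full JSON Schema validator would be a more robust choice than this hand-written check.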

## 🎤 Demo: Voice Agent Configuration

Now let's configure our voice agent with the TPS report conversation flow and start the demo!
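
As background, settings like the system message, tools, and temperature typically reach the Realtime API as a `session.update` message. The sketch below shows roughly what such a payload might look like; it is an assumption about the wire format, not a description of what `RTMiddleTier` actually sends, and `build_session_update` is a hypothetical helper. The default values of 0.7 mirror the temperature and VAD threshold quoted in this notebook.

```python
import json

def build_session_update(system_message: str, tool_schemas: list,
                         temperature: float = 0.7, vad_threshold: float = 0.7) -> dict:
    """Build a session.update payload (hypothetical helper for illustration)."""
    return {
        "type": "session.update",
        "session": {
            "instructions": system_message,   # the conversation script
            "tools": tool_schemas,            # function-calling schemas
            "tool_choice": "auto",            # let the model decide when to call
            "temperature": temperature,
            # Server-side voice activity detection decides when the user stopped talking.
            "turn_detection": {"type": "server_vad", "threshold": vad_threshold},
        },
    }

example_schema = {
    "type": "function",
    "name": "generate_report",
    "description": "Generates a JSON TPS report from conversation data",
    "parameters": {"type": "object", "properties": {}, "required": []},
}

payload = build_session_update("You are a helpful assistant for TPS report filing.", [example_schema])
print(json.dumps(payload, indent=2))
```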

In [None]:
# Configure our TPS Report Voice Agent for Demo
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential

# Import our RTMiddleTier (this exists in backend/rtmt.py)
from backend.rtmt import RTMiddleTier

# The conversation script for our voice agent
TPS_SYSTEM_MESSAGE = """
You are a helpful assistant for TPS report filing. The user is driving and talking hands-free.

Start by asking: "Have you filed your TPS report today?"

If YES: Congratulate them and wish them a good day.
If NO: Help them file it by collecting:
1. TPS report ID (3 digits)
2. Customer name  
3. Hours spent
4. Status (active/done/postponed)

After collecting all info, call the 'generate_report' function to create their report.
Be conversational and friendly!
"""

# Create and configure the voice agent
print("🎙️ Setting up TPS Report Voice Agent...")

rtmt = RTMiddleTier(
    endpoint=AZURE_OPENAI_ENDPOINT,
    deployment=AZURE_DEPLOYMENT_NAME,
    credentials=AzureKeyCredential(AZURE_OPENAI_API_KEY) if AZURE_OPENAI_API_KEY else DefaultAzureCredential()
)

# Configure the agent
rtmt.system_message = TPS_SYSTEM_MESSAGE
rtmt.tools["generate_report"] = tps_tool
rtmt.temperature = 0.7

print("✅ Voice agent configured!")
print(f"🛠️ Tools: {list(rtmt.tools.keys())}")
print("🎬 Ready for live demo!")

In [None]:
# 🚀 Launch the Voice Agent Demo!
from aiohttp import web
from pathlib import Path
import asyncio

async def create_demo_app():
    """Create the demo web application"""
    app = web.Application()
    
    # Attach our voice agent
    rtmt.attach_to_app(app, "/realtime")
    
    # Serve the frontend
    static_dir = Path("static")
    app.router.add_get('/', lambda req: web.FileResponse(static_dir / "index.html"))
    app.router.add_static('/static/', path=str(static_dir), name='static')
    
    return app

print("🎬 LIVE DEMO TIME!")
print("=" * 50)
print("🌐 Demo URL: http://localhost:8765")
print("🎙️ WebSocket: ws://localhost:8765/realtime")
print("📋 Expected conversation flow:")
print("   1. Agent asks: 'Have you filed your TPS report?'")
print("   2. User says: 'No, I need to file it'")
print("   3. Agent collects: ID, customer, hours, status")
print("   4. Agent generates JSON report")
print()
print("🎯 Demo conversation example:")
print("   👤 'No, I haven't filed it yet'")
print("   🤖 'What's the report ID?'")
print("   👤 'Report 123'")
print("   🤖 'Which customer?'")
print("   👤 'Acme Corporation'")
print("   🤖 'How many hours?'")
print("   👤 'About 8 hours'")
print("   🤖 'What's the status?'")
print("   👤 'Active'")
print("   🤖 [Generates JSON report]")
print()

# Option A: run the standalone server outside the notebook
print("🔧 Demo Setup Complete!")
print("📝 To run the actual demo server:")
print("1. Run: python app.py")
print("2. Open: http://localhost:8765")
print("3. Allow microphone access")
print("4. Click 'Start Conversation'")

# Option B: start the server from this notebook
async def start_demo():
    """Start the aiohttp app on localhost:8765 and return its runner."""
    app = await create_demo_app()
    runner = web.AppRunner(app)
    await runner.setup()
    site = web.TCPSite(runner, "localhost", 8765)
    await site.start()
    print("✅ Demo running at http://localhost:8765")
    return runner

# Jupyter supports top-level await. Comment this line out if you use `python app.py` instead.
# Call `await runner.cleanup()` to stop the server.
runner = await start_demo()


## 🎭 TPS Report Conversation Flow Demo

<img src="img/conversation_flow.png" alt="TPS Report Conversation Flow" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### 🎬 Demo Highlights:
- **Natural conversation flow** with interruptions and clarifications
- **Automatic data extraction** from unstructured speech  
- **Function calling** triggered when all required data is collected
- **Real-time processing** with < 500ms response time
- **Hands-free operation** perfect for driving scenarios

## 🖥️ Code Walkthrough (5 minutes)

While the demo runs, let's quickly walk through the key code components that make this work:

In [None]:
# Key Architecture Components

print("🏗️ Voice Agent Architecture Highlights:")
print("=" * 45)

components = {
    "1. Frontend (HTML/JS)": [
        "• Captures microphone audio",
        "• WebSocket connection to /realtime",
        "• Plays audio responses",
        "• Shows generated reports"
    ],
    "2. RTMiddleTier (Python)": [
        "• Routes messages between client & Azure OpenAI",
        "• Manages custom tools (our TPS generator)",
        "• Handles authentication",
        "• Enforces conversation flow"
    ],
    "3. Azure OpenAI Realtime API": [
        "• Speech-to-text conversion",
        "• Natural conversation processing",
        "• Function calling (tool execution)",
        "• Text-to-speech responses"
    ],
    "4. Custom Tools": [
        "• Business logic (TPS report generation)",
        "• JSON schema for function calling",
        "• Result routing (to client/server)"
    ]
}

for component, details in components.items():
    print(f"\n{component}:")
    for detail in details:
        print(f"  {detail}")

print("\n🔄 Real-time Flow:")
print("🎤 Audio → 🌐 WebSocket → 🧠 AI Processing → 🛠️ Tool Call → 📊 JSON Result")

print("\n✨ The magic happens in real-time conversation!")
print("💡 All code available in this notebook for you to take home!")

## ⚡ Technical Implementation Overview

### 🌐 WebSocket Message Flow
<img src="img/message_flow.png" alt="WebSocket Message Flow" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### 🎵 Audio Processing Pipeline
<img src="img/processing_pipeline.png" alt="Audio Processing Pipeline" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### ⚡ Real-time Performance Metrics
| Metric | Value | Description |
|--------|-------|-------------|
| 🎯 **Latency** | < 500ms | End-to-end response time |
| 🔊 **Audio Quality** | 24kHz | High-fidelity speech |
| 🎙️ **VAD Threshold** | 0.7 | Voice activity detection |
| 📦 **Buffer Size** | 4096 | Audio processing chunks |
| 🌐 **WebSocket** | WSS | Secure real-time connection |
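
To put the buffer size in context, here is a short worked calculation. It assumes the 4096 figure counts samples, that the audio is 16-bit mono PCM, and that the buffer is filled at 24 kHz; none of these is stated in the table, so treat the results as rough estimates.

```python
import math

SAMPLE_RATE_HZ = 24_000   # from the metrics table
BUFFER_SAMPLES = 4096     # assumption: the buffer size counts samples
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM

chunk_ms = BUFFER_SAMPLES / SAMPLE_RATE_HZ * 1000   # audio duration per chunk
chunk_bytes = BUFFER_SAMPLES * BYTES_PER_SAMPLE     # raw payload per chunk
base64_chars = math.ceil(chunk_bytes / 3) * 4       # size after base64 encoding
chunks_per_second = SAMPLE_RATE_HZ / BUFFER_SAMPLES

print(f"{chunk_ms:.1f} ms of audio per chunk")
print(f"{chunk_bytes} raw bytes, {base64_chars} base64 characters per chunk")
print(f"{chunks_per_second:.2f} chunks per second")
```

Under these assumptions each chunk carries roughly 171 ms of audio, which is about a third of the 500 ms latency target, so a smaller buffer may be worth trying if responsiveness matters more than per-message overhead.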

### 🏭 Production Deployment Architecture
<img src="img/production_deployment.png" alt="Production Deployment" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### 🔧 Technical Highlights:
- **WebSocket** enables full-duplex real-time communication
- **Audio processing** pipeline handles format conversions seamlessly
- **Performance** optimized for < 500ms end-to-end latency
- **Production-ready** with Azure services for scale and reliability
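
One of the format conversions in such a pipeline is turning floating-point microphone samples into 16-bit PCM and wrapping them for the WebSocket. The sketch below shows one way this might look in Python; `float_to_pcm16` and `build_append_event` are hypothetical helpers, and the `input_audio_buffer.append` event shape reflects my understanding of the Realtime API rather than code taken from this project.

```python
import base64
import json
import struct

def float_to_pcm16(samples: list) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]   # guard against overshoot
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)

def build_append_event(pcm_bytes: bytes) -> dict:
    """Wrap PCM bytes in a base64 audio-append message."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    }

pcm = float_to_pcm16([0.0, 0.5, -1.0, 2.0])   # 2.0 is clipped to 1.0
event = build_append_event(pcm)
print(json.dumps(event))
```

In the browser this conversion would normally happen in JavaScript before the audio is sent; the Python version is only meant to make the byte layout concrete.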

In [None]:
# Key Implementation Highlights
import json

print("🔧 Implementation Deep Dive:")
print("=" * 35)

# 1. Function Calling Schema
print("1️⃣ FUNCTION CALLING MAGIC")
print("Our tool schema tells Azure OpenAI exactly what data to collect:")

# TPS Report Schema (self-contained for this demo)
tps_schema_example = {
    "type": "function",
    "name": "generate_report",
    "description": "Generates a JSON TPS report from conversation data",
    "parameters": {
        "type": "object",
        "properties": {
            "tps_report_id": {"type": "string", "description": "Three-digit report ID"},
            "customer_name": {"type": "string", "description": "Customer name"}, 
            "hours_spent": {"type": "string", "description": "Hours spent"},
            "status": {"type": "string", "enum": ["active", "done", "postponed"]}
        },
        "required": ["tps_report_id", "customer_name", "hours_spent", "status"]
    }
}

print(f"""
{json.dumps(tps_schema_example, indent=2)}
""")

# 2. Real-time Processing
print("2️⃣ REAL-TIME PROCESSING")
print("• WebSocket enables bidirectional real-time communication")
print("• Server-side Voice Activity Detection (VAD)")
print("• 24kHz audio streaming")
print("• Interrupt handling (user can cut off AI)")

# 3. System Message Power
print("3️⃣ CONVERSATION CONTROL")
print("System message defines the entire conversation flow:")
print("• Specific question sequence")
print("• When to call functions")
print("• Response personality")

# 4. Production Considerations
print("4️⃣ PRODUCTION READY FEATURES")
production_features = [
    "Authentication with Azure (API keys or managed identity)",
    "Error handling and retries", 
    "Audio format conversion",
    "WebSocket connection management",
    "Tool result routing (client vs server)",
    "Scalable deployment patterns"
]

for i, feature in enumerate(production_features, 1):
    print(f"  {i}. {feature}")

print("\n✨ This is a complete, production-ready voice agent foundation!")
print("💡 Perfect starting point for enterprise voice applications")

## 🎯 Wrap-up & Next Steps (1 minute)

### 🎉 What You Just Experienced

In 20 minutes, you've seen a complete voice agent in action:
- **Real-time conversation** with Azure OpenAI Realtime API
- **Function calling** to execute business logic
- **WebSocket architecture** for low-latency communication
- **Production-ready patterns** you can use immediately

### 🚀 Immediate Next Steps

1. **Get the Code**: This entire notebook is yours to take!
2. **Set up Azure OpenAI**: Get your realtime preview access
3. **🛠️ Customize**: Replace TPS reports with your business logic
4. **🌐 Deploy**: Use Azure Container Apps for production

### Resources to Continue

- **Complete code**: Available in this notebook
- **Azure OpenAI Realtime API docs**: [docs.microsoft.com](https://docs.microsoft.com/azure/cognitive-services/openai/)
- **Production deployment guide**: Check the notebook appendix
- **Community samples**: [github.com/azure-samples](https://github.com/azure-samples)

### 🤝 Thank You!

**Questions?** Find me after the session!

**Keep building amazing voice experiences!** 🎙️✨

## 🚀 Voice Agent Use Cases & Learning Journey

### Use Cases & Applications
<img src="img/use_cases.png" alt="Use Cases & Applications" style="width: 100%; max-width: 800px; height: auto; display: block; margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">

### 📖 Workshop Resources Summary:

#### 💡 Key Concepts Learned:
- **Real-time WebSocket architecture**
- **Function calling with structured data**
- **Audio processing pipeline**
- **Production deployment patterns**

#### ⚡ Quick Wins:
- **Copy notebook and modify TPS logic**
- **Deploy to Azure Container Apps**
- **Add your own custom functions**
- **Integrate with existing APIs**

### 🎉 You're ready to build amazing voice experiences!
### 🤝 Questions? Find me after the session!