# Agent Version Performance Comparison - REAL DEMO

## What This Demo Shows

This notebook demonstrates **ACTUAL PERFORMANCE DIFFERENCES** between agent versions:

1. **Create a Math Tutor Agent** - Simple, focused use case
2. **Version 1**: Poor instruction → Low evaluation scores
3. **Version 2**: Excellent instruction → High evaluation scores
4. **Compare Results**: See real performance metrics side-by-side

## Why Math Tutor?
- Clear success criteria (correct answers + explanation quality)
- Simple word problems show instruction quality impact
- Shows stark contrast between poor and excellent instructions

## Table of Contents

1. [Setup](#setup)
2. [Create Math Tutor Agent - Version 1 (Poor)](#create-math-tutor-agent---version-1-poor)
3. [Test Version 1 - Evaluate Performance](#test-version-1---evaluate-performance)
4. [Update to Version 2 - Excellent Instruction](#update-to-version-2---excellent-instruction)
5. [Test Version 2 - Evaluate Performance](#test-version-2---evaluate-performance)
6. [View Version History from Database](#view-version-history-from-database)
7. [Compare Both Versions - Side by Side](#compare-both-versions---side-by-side)
8. [Cleanup](#cleanup)
9. [Summary](#summary)

## Setup

In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# Verify that the kernel can now locate the Azure CLI
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

In [None]:
import os
import sys
import json
from pathlib import Path
from dotenv import load_dotenv
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    AzureOpenAIModelConfiguration
)

# Add the sibling utils directory and the current directory to the import path
parent_dir = Path.cwd().parent / "utils"
sys.path.insert(0, str(parent_dir))
sys.path.insert(0, str(Path.cwd()))

# Load environment
env_path = Path.cwd().parent / ".env"
load_dotenv(env_path)

from agent_db import AgentDB
from agent_utils import AgentManager
from agent_version_manager import AgentVersionManager

# Initialize clients
endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
project_client = AIProjectClient(endpoint=endpoint, credential=DefaultAzureCredential())

# Initialize managers
agent_manager = AgentManager(project_client=project_client)
version_manager = AgentVersionManager(agent_manager=agent_manager)

# Configure evaluator
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_GPT_4o"],
    api_key=os.environ["AZURE_OPENAI_API_KEY_GPT_4o"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION_GPT_4o"],
    azure_deployment=os.environ["AZURE_OPENAI_MODEl_GPT_4o"],
)

intent_evaluator = IntentResolutionEvaluator(model_config=model_config, threshold=3)

print("✅ Setup complete")
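
The setup cell reads several environment variables, and a missing one only surfaces later as a `KeyError` or as a `None` endpoint. A minimal sketch of a pre-flight check follows, assuming the variable names used in the cell above; `missing_env_vars` is a hypothetical helper and not part of any Azure SDK.

```python
import os

# Variable names copied verbatim from the setup cell above.
REQUIRED_ENV_VARS = [
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT_GPT_4o",
    "AZURE_OPENAI_API_KEY_GPT_4o",
    "AZURE_OPENAI_API_VERSION_GPT_4o",
    "AZURE_OPENAI_MODEl_GPT_4o",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set")
```

Running this right after `load_dotenv` would report every missing name at once instead of failing on the first one.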

## Create Math Tutor Agent - Version 1 (Poor)

An **intentionally poor instruction** establishes a low-performance baseline.

In [None]:
# Create agent with POOR instruction - intentionally minimal and problematic
v1_instruction = """Answer briefly. Just give the final answer. Skip the work."""

# Create the agent using agent_manager to save metadata to DB
agent = agent_manager.create_agent(
    model="gpt-4o",
    name="Math Tutor Performance Demo2",
    instructions=v1_instruction,
    description="Math tutor for version performance comparison",
    category="demo",
    status="active"
)

# Get metadata from database to get the local ID
agent_data = agent_manager.get_agent_metadata(azure_agent_id=agent.id)

print("✅ Agent created and saved to database")
print(f"📝 Azure Agent ID: {agent.id}")
print(f"📝 Database ID: {agent_data['id']}")
print(f"📝 Version 1 instruction: {v1_instruction}")

# Store IDs for later
MATH_AGENT_ID = agent.id  # For Azure API calls
MATH_AGENT_DB_ID = agent_data['id']  # For database operations

## Test Version 1 - Evaluate Performance

In [None]:
# Test queries - simpler prompts let instruction quality differentiate performance
test_queries = [
    "A student scores 85, 92, and 78 on three tests. What score is needed on the fourth test to achieve an overall average of 90?",
    "If a rectangle has a perimeter of 50 cm and its length is 5 cm more than twice its width, find the dimensions.",
    "A water tank is being filled by two pipes. Pipe A can fill the tank alone in 4 hours, and Pipe B can fill it alone in 6 hours. If both pipes work together for 1.5 hours, then Pipe A is closed, how much longer will it take for Pipe B alone to finish filling the tank?"
]


def evaluate_agent_version(agent_id, queries, version_name):
    """Evaluate agent performance on test queries"""
    print(f"\n{'='*80}")
    print(f"EVALUATING: {version_name}")
    print(f"{'='*80}\n")

    scores = []

    for i, query in enumerate(queries, 1):
        print(f"\n📝 Test {i}: {query}")

        # Create thread and run
        thread = agent_manager.create_thread()
        project_client.agents.messages.create(
            thread_id=thread.id,
            role="user",
            content=query
        )

        # Process the run and wait for completion
        _ = project_client.agents.runs.create_and_process(
            thread_id=thread.id,
            agent_id=agent_id
        )

        # Get response
        messages = list(project_client.agents.messages.list(
            thread.id, order="asc"))
        response = messages[-1].content[0].text.value if messages else ""
        print(f"🤖 Response: {response[:200]}...")

        # Evaluate using the original query without enhancement
        # This lets instruction quality drive the response detail level
        intent_result = intent_evaluator(
            query=query,
            response=response
        )

        intent_score = intent_result.get('intent_resolution', 0)

        scores.append(intent_score)

        print(f"📊 Intent Score: {intent_score:.1f}/5")

        # Cleanup
        agent_manager.delete_thread(thread.id, silent=True)

    overall_avg = sum(scores) / len(scores)

    print(f"\n{'='*80}")
    print(f"📈 OVERALL AVERAGE: {overall_avg:.2f}/5")
    print(f"{'='*80}\n")

    return {
        "version": version_name,
        "scores": scores,
        "average": overall_avg,
        "passed": overall_avg >= 3.0
    }
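
The aggregation at the end of `evaluate_agent_version` can be isolated and checked without calling Azure. A minimal sketch follows, assuming the same pass threshold of 3 used above. `summarize_scores` is a hypothetical helper, and returning an average of 0.0 for an empty score list is a choice made here, not something the notebook specifies.

```python
def summarize_scores(scores, threshold=3.0):
    """Average a list of per-query scores and compare the average to a threshold."""
    if not scores:
        # Avoid ZeroDivisionError when no queries were evaluated.
        return {"scores": [], "average": 0.0, "passed": False}
    average = sum(scores) / len(scores)
    return {"scores": list(scores), "average": average, "passed": average >= threshold}


# Illustrative scores only, not output from a real evaluation run.
print(summarize_scores([2, 3, 4]))
```

Keeping the arithmetic separate from the API calls makes the pass/fail rule easy to unit test.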


In [None]:
# Evaluate Version 1
v1_results = evaluate_agent_version(
    MATH_AGENT_ID, test_queries, "Version 1 (Poor)")
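
The evaluator scores intent resolution rather than arithmetic correctness, so reference answers for the three test queries are useful when reading the agent's responses. The sketch below derives them with exact fractions; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
from fractions import Fraction

# Query 1: (85 + 92 + 78 + x) / 4 = 90  =>  x = 4 * 90 - (85 + 92 + 78)
fourth_score = 4 * 90 - (85 + 92 + 78)

# Query 2: 2 * (length + width) = 50 and length = 2 * width + 5
# => 3 * width + 5 = 25  =>  width = 20/3
width = Fraction(25 - 5, 3)
length = 2 * width + 5

# Query 3: Pipe A fills 1/4 of the tank per hour, Pipe B fills 1/6 per hour.
rate_a, rate_b = Fraction(1, 4), Fraction(1, 6)
filled = (rate_a + rate_b) * Fraction(3, 2)  # both pipes open for 1.5 hours
extra_hours = (1 - filled) / rate_b          # Pipe B alone fills the remainder

print(f"Fourth test score: {fourth_score}")
print(f"Width: {float(width):.2f} cm, length: {float(length):.2f} cm")
print(f"Extra time for Pipe B: {float(extra_hours):.2f} hours")
```

The first answer is 105. If the tests are scored out of 100, the target average is unreachable, and a good tutor response should say so.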

## Update to Version 2 - Excellent Instruction

In [None]:
# Update with EXCELLENT instruction
v2_instruction = """You are an Expert Math Tutor specializing in clear, accurate mathematical instruction.

## Your Approach
1. **Understand**: Carefully read and identify what the problem is asking
2. **Plan**: Determine which mathematical concepts/formulas to apply
3. **Solve**: Show each step with clear explanations
4. **Verify**: Check your answer makes sense
5. **Present**: Provide the final answer clearly

## Standards
- Use proper mathematical notation
- Show all intermediate steps
- Explain reasoning at each step
- Include units where applicable
- Round to 2 decimal places unless specified otherwise

## Example Format
**Problem**: [Restate the problem]
**Solution**:
Step 1: [First step with explanation]
Step 2: [Second step with explanation]
...
**Final Answer**: [Clear, concise answer with units]
"""

# Update agent WITH VERSIONING - saves to database
version_manager.update_agent_with_versioning(
    agent_id=MATH_AGENT_DB_ID,
    updates={"instruction": v2_instruction},  # Note: singular 'instruction' - gets converted to plural for Azure API
    change_description="Expert-level instruction with comprehensive format and standards",
    changed_by="demo@example.com"
)

print("✅ Agent updated to Version 2 (saved to database)")
print(f"📝 New instruction:\n{v2_instruction[:200]}...")

## Test Version 2 - Evaluate Performance

In [None]:
# Evaluate Version 2
v2_results = evaluate_agent_version(MATH_AGENT_ID, test_queries, "Version 2 (Excellent)")

## View Version History from Database

Let's see what was saved to the database with our version manager.

In [None]:
# Get version history from database
version_history = version_manager.get_version_history(agent_id=MATH_AGENT_DB_ID)
current_version = version_manager.get_current_version_number(agent_id=MATH_AGENT_DB_ID)

print("=" * 100)
print(f"VERSION HISTORY FROM DATABASE (Current Version: {current_version})")
print("=" * 100 + "\n")

for version in sorted(version_history, key=lambda x: x['versionNumber'], reverse=True):
    print(f"\n🔖 Version {version['versionNumber']}")
    print(f"   ⏰ Timestamp: {version['timestamp']}")
    print(f"   👤 Changed by: {version['changedBy']}")
    print(f"   📝 Description: {version['changeDescription']}")
    
    snapshot = version['snapshot']
    instruction = snapshot.get('instruction', '')[:100].replace('\n', ' ')
    print(f"   📄 Instruction: {instruction}...")
    print("-" * 100)

print("\n✅ Version history retrieved from Cosmos DB")
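
The records printed above carry `versionNumber`, `timestamp`, `changedBy`, `changeDescription`, and `snapshot` fields. A minimal in-memory sketch of how such a history could support rollback follows. `InMemoryVersionStore` and its methods are hypothetical; they are not the `AgentVersionManager` implementation or the Cosmos DB schema, and only the field names come from the cell above.

```python
import copy
from datetime import datetime, timezone


class InMemoryVersionStore:
    """Hypothetical append-only version history for one agent's configuration."""

    def __init__(self, initial_config, changed_by):
        self.history = []
        self._append(initial_config, "Initial version", changed_by)

    def _append(self, config, description, changed_by):
        # Each record stores a deep copy so later edits cannot alter old snapshots.
        self.history.append({
            "versionNumber": len(self.history) + 1,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "changedBy": changed_by,
            "changeDescription": description,
            "snapshot": copy.deepcopy(config),
        })

    def current(self):
        return self.history[-1]

    def update(self, updates, description, changed_by):
        """Record a new version with the given fields changed."""
        config = {**self.current()["snapshot"], **updates}
        self._append(config, description, changed_by)
        return self.current()["versionNumber"]

    def rollback(self, version_number, changed_by):
        """Restore an earlier snapshot by appending it as a new version."""
        target = next(v for v in self.history if v["versionNumber"] == version_number)
        self._append(target["snapshot"], f"Rollback to version {version_number}", changed_by)
        return self.current()["versionNumber"]


store = InMemoryVersionStore({"instruction": "Answer briefly."}, "demo@example.com")
store.update({"instruction": "Show each step."}, "Better instruction", "demo@example.com")
store.rollback(1, "demo@example.com")
print([v["changeDescription"] for v in store.history])
```

In this sketch a rollback appends a new version rather than deleting later ones, so the audit trail stays intact.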

## Compare Both Versions - Side by Side

In [None]:
import pandas as pd
from IPython.display import display, Markdown

print("\n" + "="*100)
print("PERFORMANCE COMPARISON - BOTH VERSIONS")
print("="*100 + "\n")

# Create comparison table
comparison_df = pd.DataFrame([
    {
        "Version": "V1 (Poor)",
        "Average Score": f"{v1_results['average']:.2f}",
        "Test 1": f"{v1_results['scores'][0]:.2f}",
        "Test 2": f"{v1_results['scores'][1]:.2f}",
        "Test 3": f"{v1_results['scores'][2]:.2f}",
        "Status": "✅ PASS" if v1_results['passed'] else "❌ FAIL"
    },
    {
        "Version": "V2 (Excellent)",
        "Average Score": f"{v2_results['average']:.2f}",
        "Test 1": f"{v2_results['scores'][0]:.2f}",
        "Test 2": f"{v2_results['scores'][1]:.2f}",
        "Test 3": f"{v2_results['scores'][2]:.2f}",
        "Status": "✅ PASS" if v2_results['passed'] else "❌ FAIL"
    }
])

display(comparison_df)

# Visual comparison
def create_bar(score, max_score=5, width=30):
    filled = int((score / max_score) * width)
    return "█" * filled + "░" * (width - filled)

print("\n" + "="*100)
print("VISUAL SCORE COMPARISON")
print("="*100 + "\n")

print(f"V1 (Poor):      {create_bar(v1_results['average'])} {v1_results['average']:.2f}/5.0")
print(f"V2 (Excellent): {create_bar(v2_results['average'])} {v2_results['average']:.2f}/5.0")

# Calculate improvement
v1_to_v2 = ((v2_results['average'] - v1_results['average']) / v1_results['average'] * 100) if v1_results['average'] > 0 else 0

print("\n" + "="*100)
print("PERFORMANCE IMPROVEMENT")
print("="*100 + "\n")

print(f"📈 V1 (Poor) → V2 (Excellent): {v1_to_v2:+.1f}% improvement")

print("\n" + "="*100)
print("KEY TAKEAWAYS")
print("="*100 + "\n")

if v2_results['average'] > v1_results['average']:
    print("✅ Clear contrast: Better instructions = Better performance")
    print("✅ Version 2 (Excellent) shows significantly better results")
    print(f"✅ {v1_to_v2:+.1f}% improvement from V1 to V2")
    print("\n💡 Recommendation: Deploy Version 2 to production")
    print("💡 Value: Version control enables rollback if needed")
else:
    print("⚠️  Results may vary - consider running more tests")

print("\n" + "="*100)

# Show database info
print("\n" + "="*100)
print("DATABASE VERSION INFO")
print("="*100 + "\n")
print(f"📦 Database ID: {MATH_AGENT_DB_ID}")
print(f"🔗 Azure Agent ID: {MATH_AGENT_ID}")
print(f"📊 Current Version in DB: {version_manager.get_current_version_number(MATH_AGENT_DB_ID)}")
print(f"📜 Total Versions Saved: {len(version_manager.get_version_history(MATH_AGENT_DB_ID))}")
print("\n✅ All versions tracked in Cosmos DB with full audit trail")
print("=" * 100)
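
`create_bar` above assumes the score lies within `[0, max_score]`; a score outside that range produces a bar that is longer than `width`. A minimal sketch of a clamped variant follows; `create_clamped_bar` is a hypothetical name and not part of the notebook's utilities.

```python
def create_clamped_bar(score, max_score=5, width=30):
    """Render a fixed-width text bar, clamping the score into [0, max_score]."""
    clamped = min(max(score, 0), max_score)
    filled = int((clamped / max_score) * width)
    return "█" * filled + "░" * (width - filled)


# Illustrative values only, not results from an evaluation run.
for value in (0, 2.5, 5, 7):
    print(f"{value:>4}: {create_clamped_bar(value)}")
```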

## Cleanup

In [None]:
# Optional: Delete the test agent
# Uncomment to delete:
# agent_manager.delete_agent(agent_id=MATH_AGENT_DB_ID)
# print("✅ Agent deleted from Azure and database")

print("ℹ️  Agent cleanup commented out. Uncomment to delete test agent.")

## Summary

This notebook demonstrated **REAL performance differences** between agent versions:

### What We Tested
- **Math Tutor Agent** with 2 different instruction qualities (binary comparison)
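- **3 Math Word Problems**: Required test score for a target average, rectangle dimensions from a perimeter (algebra), combined work rate (two pipes filling a tank)
- **Intent Resolution Evaluator**: LLM-judged scoring on a 1-5 scale, with a pass threshold of 3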

### Results Show
1. **Poor instructions** (V1) → Low scores (~1-2/5) - deliberately unhelpful
2. **Excellent instructions** (V2) → High scores (~5/5) - comprehensive expert guidance

### Why This Matters
- **Version control** lets you track what changed with full audit trail
- **Automated evaluation** provides objective performance metrics
- **Rollback capability** protects against regressions (revert to previous version if needed)
- **Clear evidence** for deployment decisions

### Key Insight
Binary comparison (Poor vs Excellent) demonstrates the value better than gradual progression. With highly capable models like GPT-4o, even minimal instructions produce good results - but version control remains critical for **audit trails, rollback capability, and change tracking** in production environments.

This is exactly what you need for **production agent management**!