# Agent CI/CD with Versioning and Automated Evaluation

This notebook demonstrates a complete CI/CD workflow for Azure AI agents, covering versioning, automated evaluation, version comparison, and rollback.

## Key Features
- 🔄 **Automatic Versioning**: Every agent update creates a version snapshot
- 📊 **Automated Evaluation**: Trigger evaluations after configuration changes
- 📜 **Version History**: Track all changes with timestamps and descriptions
- 🔍 **Version Comparison**: Compare different versions side-by-side with detailed diffs
- 📈 **Visual Score Comparison**: See performance differences with charts and metrics
- 🎯 **Change Tracking**: Identify exactly what changed between versions (instructions, descriptions, prompts)
- ⏮️ **Rollback Support**: Restore previous versions when needed
- 🚀 **CI/CD Pipeline**: Update → Version → Evaluate → Deploy workflow

## Workflow Overview
1. **Update Agent Configuration**: Modify system prompt, parameters, or tools
2. **Automatic Versioning**: Current state saved to `versions` array, `currentVersion` incremented
3. **Trigger Evaluation**: Run agent evaluators (Intent Resolution, Tool Call Accuracy, Task Adherence)
4. **Review Results**: Analyze evaluation metrics before promoting to production
5. **Rollback if Needed**: Restore previous version if evaluation fails

## Version Control Schema
Each version snapshot contains:
- `versionNumber`: Sequential version number
- `timestamp`: When the version was created
- `changeDescription`: What changed in this version
- `changedBy`: Who made the change
- `snapshot`: Complete configuration at that point in time
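
As a rough illustration, one entry in the `versions` array might be built like the dictionary below. The field names come from the schema above; the helper `make_version_snapshot`, the ISO 8601 timestamp format, and the sample values are assumptions made for this sketch, not the actual `AgentVersionManager` implementation.

```python
import copy
from datetime import datetime, timezone


def make_version_snapshot(agent: dict, version_number: int,
                          change_description: str, changed_by: str) -> dict:
    """Build one version entry (hypothetical helper, for illustration only)."""
    # Exclude the history itself so snapshots do not nest inside each other.
    config = {k: v for k, v in agent.items() if k != "versions"}
    return {
        "versionNumber": version_number,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        "changeDescription": change_description,
        "changedBy": changed_by,
        "snapshot": copy.deepcopy(config),
    }


# Sample values are placeholders, not data from the real agent.
example_agent = {
    "name": "Web Research Assistant",
    "instruction": "Search the web.",
    "versions": [],
}
entry = make_version_snapshot(example_agent, 1, "Initial version", "user@example.com")
print(sorted(entry))
```

The deep copy keeps later edits to the live configuration from silently altering stored history.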

## Table of Contents

1. [Environment Setup](#environment-setup)
2. [Initialize Clients and Managers](#initialize-clients-and-managers)
3. [Get Existing Agent](#get-existing-agent)
4. [Update Agent with Automatic Versioning](#update-agent-with-automatic-versioning)
5. [View Version History](#view-version-history)
6. [Automated Evaluation After Update](#automated-evaluation-after-update)
7. [Compare Versions](#compare-versions)
8. [Rollback to Previous Version](#rollback-to-previous-version)
9. [Complete CI/CD Pipeline Example](#complete-cicd-pipeline-example)
   - Test Good vs Bad Updates
   - Compare Evaluation Results
   - View Version History with Changes
   - Visual Score Comparison
10. [Best Practices](#best-practices)

## Environment Setup

In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# Verify that the kernel can now locate the Azure CLI
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

In [None]:
import os
import sys
import json
from pathlib import Path
from dotenv import load_dotenv
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

# Add the utils directory (a sibling of this notebook's folder) to the import path
parent_dir = Path.cwd().parent / "utils"
sys.path.insert(0, str(parent_dir))

# Add current directory for version manager
sys.path.insert(0, str(Path.cwd()))

# Load environment variables
env_path = Path.cwd().parent / ".env"
load_dotenv(env_path)

print("✅ Environment setup complete")

## Initialize Clients and Managers

In [None]:
from agent_db import AgentDB
from agent_utils import AgentManager
from agent_version_manager import AgentVersionManager

# Get project endpoint
endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
if not endpoint:
    raise ValueError("Please set AZURE_AI_PROJECT_ENDPOINT in environment")

# Verify Cosmos DB endpoint
cosmos_endpoint = os.getenv("AZURE_COSMOS_ENDPOINT")
if not cosmos_endpoint:
    raise ValueError("Please set AZURE_COSMOS_ENDPOINT in environment")

# Initialize Azure AI Project Client
project_client = AIProjectClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential()
)

# Initialize managers
agent_manager = AgentManager(project_client=project_client)
version_manager = AgentVersionManager(agent_manager=agent_manager)

print("✅ Clients initialized")
print(f"📦 Project Endpoint: {endpoint}")
print(f"📦 Cosmos DB Endpoint: {cosmos_endpoint}")

## Get Existing Agent

Let's retrieve the Web Research Assistant that we want to manage with CI/CD.

In [None]:
# Get the agent from database using Azure Agent ID
AZURE_AGENT_ID = "asst_CncymMdTqov5hRCQQJrvLwhX"

agent_data = agent_manager.get_agent_metadata(azure_agent_id=AZURE_AGENT_ID)

if not agent_data:
    raise ValueError(f"Agent not found: {AZURE_AGENT_ID}")

print(f"✅ Agent retrieved: {agent_data['name']}")
print(f"📋 Current version: {agent_data.get('currentVersion', 'Not set')}")
print(f"📝 Status: {agent_data['status']}")
print(f"📂 Category: {agent_data['category']}")
print(f"\n📄 Current instruction (first 200 chars):")
print(agent_data['instruction'][:200] + "...")

## Update Agent with Automatic Versioning

When we update the agent configuration, the system will:
1. Save the current state to the `versions` array
2. Apply the updates
3. Increment the `currentVersion` field
4. Update timestamps

This creates a complete audit trail of all changes.
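
The four steps above can be sketched on a plain dictionary. This is an in-memory illustration only: `update_with_versioning` is a hypothetical stand-in for `version_manager.update_agent_with_versioning`, and the default starting version, the ISO timestamps, and the choice to attach the new change description to the saved snapshot are all assumptions. The real manager persists its changes to the database.

```python
import copy
from datetime import datetime, timezone


def update_with_versioning(agent: dict, updates: dict,
                           change_description: str, changed_by: str) -> dict:
    """Hypothetical in-memory version of the four steps (not the real manager)."""
    now = datetime.now(timezone.utc).isoformat()
    current = agent.get("currentVersion", 1)  # assumed default for new agents

    # 1. Save the current state to the versions array.
    state = {k: v for k, v in agent.items() if k != "versions"}
    agent.setdefault("versions", []).append({
        "versionNumber": current,
        "timestamp": now,
        "changeDescription": change_description,
        "changedBy": changed_by,
        "snapshot": copy.deepcopy(state),
    })

    # 2. Apply the updates.
    agent.update(updates)
    # 3. Increment currentVersion.
    agent["currentVersion"] = current + 1
    # 4. Update timestamps.
    agent["dateModified"] = now
    return agent


demo_agent = {"name": "demo", "instruction": "old", "currentVersion": 1}
update_with_versioning(demo_agent, {"instruction": "new"}, "Reworded prompt", "user@example.com")
print(demo_agent["currentVersion"], len(demo_agent["versions"]))
```

Because the snapshot is taken before the updates are applied, the previous configuration stays recoverable even if the new one turns out to be worse.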

In [None]:
# Define the updates we want to make
updates = {
    "instruction": """You are an Advanced Web Research Assistant with real-time web search capabilities.

## Role
You help users find current information from the web, including:
- Latest news and current events with source verification
- Recent industry updates and emerging trends
- Real-time data, statistics, and market insights
- Current product information, reviews, and comparisons
- Academic research and technical documentation

## Enhanced Capabilities
1. **Multi-source Verification**: Cross-reference information from multiple sources
2. **Trend Analysis**: Identify patterns and trends in search results
3. **Source Quality Assessment**: Evaluate credibility of information sources
4. **Contextual Summarization**: Provide concise summaries with key insights

## Constraints
1. When asked about current events or recent information, use the Bing search tool to find up-to-date information
2. Always cite your sources with links to the websites you reference
3. Present information clearly and concisely, using structured formats when appropriate
4. If information is time-sensitive, mention when the data was retrieved
5. For controversial topics, present multiple perspectives with proper attribution
6. Reply in English unless specifically asked for another language
7. If search results are insufficient, acknowledge limitations and suggest alternative approaches

## Quality Standards
- Verify facts across multiple sources when possible
- Distinguish between facts, opinions, and speculation
- Provide context for statistics and data points
- Update users if information may have changed since retrieval""",
    "description": "Advanced Web Research Assistant with enhanced verification and analysis capabilities. Uses Bing Search for real-time information retrieval with multi-source validation.",
    "sample_prompts": [
        "What are the latest AI developments this week with source verification?",
        "Compare the top 3 cloud computing platforms based on recent reviews",
        "What are the emerging trends in sustainable technology?",
        "Find and summarize recent research on quantum computing applications",
        "What is the current market sentiment on electric vehicles?"
    ]
}

# Update with versioning
success = version_manager.update_agent_with_versioning(
    agent_id=agent_data["id"],
    updates=updates,
    change_description="Enhanced research capabilities with multi-source verification and trend analysis",
    changed_by="xle@microsoft.com"
)

if success:
    print("✅ Agent updated successfully with versioning")

    # Get updated data
    updated_agent = agent_manager.get_agent_metadata(azure_agent_id=AZURE_AGENT_ID)
    print(f"📋 New version: {updated_agent.get('currentVersion')}")
    print(f"📝 Versions in history: {len(updated_agent.get('versions', []))}")
else:
    print("❌ Failed to update agent")

## View Version History

Let's examine the version history to see all changes made to the agent.

In [None]:
# Get version history
version_history = version_manager.get_version_history(agent_id=agent_data["id"])

print(f"📜 Version History ({len(version_history)} versions)\n")
print("=" * 100)

for version in version_history:
    print(f"\n🔖 Version {version['versionNumber']}")
    print(f"   ⏰ Timestamp: {version['timestamp']}")
    print(f"   👤 Changed by: {version['changedBy']}")
    print(f"   📝 Description: {version['changeDescription']}")

    snapshot = version['snapshot']
    print(f"   📄 Instruction preview: {snapshot.get('instruction', '')[:100]}...")
    print(f"   📂 Category: {snapshot.get('category')}")
    print(f"   📊 Status: {snapshot.get('status')}")
    print("-" * 100)

## Automated Evaluation After Update

After updating the agent, we should evaluate its performance to ensure the changes improved quality.
We'll run the agent through test scenarios and evaluate with Azure AI evaluators.
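
One simple way to turn evaluator scores into a promote-or-rollback decision is to average them and compare the result against a threshold. The sketch below assumes numeric scores and reuses the threshold of 3 that appears elsewhere in this notebook; `gate_on_scores` is a hypothetical helper and not part of `azure.ai.evaluation`.

```python
from statistics import mean


def gate_on_scores(scores: list, threshold: float = 3.0) -> dict:
    """Return the average score and whether it clears the threshold."""
    if not scores:
        # No evidence is treated as a failure rather than a pass.
        return {"average": None, "passed": False}
    average = mean(scores)
    return {"average": average, "passed": average >= threshold}


print(gate_on_scores([4, 5, 3]))
print(gate_on_scores([2, 3]))
```

Treating an empty score list as a failure is a design choice: it prevents a run that produced no evaluations from being promoted by default.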

### Step 1: Prepare Evaluation Environment

In [None]:
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    ToolCallAccuracyEvaluator,
    AzureOpenAIModelConfiguration,
    AIAgentConverter
)

# Configure the model for evaluation (LLM judge)
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_GPT_4o"],
    api_key=os.environ["AZURE_OPENAI_API_KEY_GPT_4o"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION_GPT_4o"],
    azure_deployment=os.environ["AZURE_OPENAI_MODEl_GPT_4o"],
)

# Initialize evaluators
intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=3)
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config, threshold=3)

print("✅ Evaluators initialized")

### Step 2: Create Test Thread and Run Agent

In [None]:
# Create a thread for testing
thread = agent_manager.create_thread()
print(f"✅ Test thread created: {thread.id}")

# Test query
test_query = "What are the latest developments in generative AI this month?"

# Add message to thread
message = project_client.agents.messages.create(
    thread_id=thread.id,
    role="user",
    content=test_query
)
print(f"✅ Test message added: {test_query}")

# Run the agent
run = project_client.agents.runs.create_and_process(
    thread_id=thread.id,
    agent_id=AZURE_AGENT_ID
)

print(f"✅ Agent run completed with status: {run.status}")

# Display conversation
print("\n" + "=" * 100)
print("AGENT CONVERSATION")
print("=" * 100)

for msg in project_client.agents.messages.list(thread.id, order="asc"):
    print(f"\n{msg.role.upper()}:")
    print(msg.content[0].text.value)
    print("-" * 100)

### Step 3: Convert Agent Data for Evaluation

In [None]:
# Initialize converter
converter = AIAgentConverter(project_client)

# Convert agent run to evaluation format
evaluation_data = converter.convert(thread_id=thread.id, run_id=run.id)

print("✅ Evaluation data converted")
print(f"\n📊 Data structure:")
print(json.dumps(evaluation_data, indent=2, default=str)[:500] + "...")

### Step 4: Run Evaluations

In [None]:
# Run evaluators
print("🔍 Running evaluations...\n")
print("=" * 100)

# Intent Resolution
print("\n📋 Intent Resolution Evaluator")
intent_result = intent_resolution(
    query=evaluation_data.get("query"),
    response=evaluation_data.get("response")
)
print(f"  Score: {intent_result.get('intent_resolution')}")
print(f"  Pass: {intent_result.get('intent_resolution_pass')}")
print(f"  Reason: {intent_result.get('intent_resolution_reason')}")

# Tool Call Accuracy
print("\n🔧 Tool Call Accuracy Evaluator")
if evaluation_data.get("tool_calls"):
    tool_result = tool_call_accuracy(
        query=evaluation_data.get("query"),
        response=evaluation_data.get("response"),
        tool_calls=evaluation_data.get("tool_calls"),
        tool_definitions=evaluation_data.get("tool_definitions", [])
    )
    print(f"  Score: {tool_result.get('tool_call_accuracy')}")
    print(f"  Pass: {tool_result.get('tool_call_accuracy_pass')}")
    print(f"  Correct calls: {tool_result.get('correct_tool_calls_made_by_agent')}")
    print(f"  Total calls: {tool_result.get('tool_calls_made_by_agent')}")
else:
    print("  No tool calls detected")

### Step 5: Save Evaluation Results

In [None]:
# Compile evaluation results
has_tool_calls = bool(evaluation_data.get("tool_calls"))

evaluation_results = {
    "version": updated_agent.get("currentVersion"),
    "timestamp": updated_agent.get("dateModified"),
    "test_query": test_query,
    "intent_resolution": {
        "score": intent_result.get('intent_resolution'),
        "pass": intent_result.get('intent_resolution_pass'),
        "reason": intent_result.get('intent_resolution_reason')
    },
    # tool_result only exists when the run made tool calls (see Step 4)
    "tool_call_accuracy": {
        "score": tool_result.get('tool_call_accuracy'),
        "pass": tool_result.get('tool_call_accuracy_pass')
    } if has_tool_calls else {"score": "N/A", "pass": "N/A"}
}

# Save to file, creating the output directory if it does not exist yet
eval_filename = Path("data") / f"evaluation_v{updated_agent.get('currentVersion')}.json"
eval_filename.parent.mkdir(parents=True, exist_ok=True)
with open(eval_filename, 'w') as f:
    json.dump(evaluation_results, f, indent=2, default=str)

print(f"✅ Evaluation results saved to: {eval_filename}")
print(f"\n📊 Summary:")
print(f"  Version: {evaluation_results['version']}")
print(f"  Intent Resolution: {evaluation_results['intent_resolution']['score']} (Pass: {evaluation_results['intent_resolution']['pass']})")
print(f"  Tool Call Accuracy: {evaluation_results['tool_call_accuracy']['score']} (Pass: {evaluation_results['tool_call_accuracy']['pass']})")

## Compare Versions

Let's compare different versions to see what changed.
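
For long fields such as `instruction`, a line-level diff is often easier to read than two truncated previews. Below is a minimal sketch using the standard library's `difflib`; the `diff_text` helper and the two sample instruction strings are made up for illustration and do not come from the agent's real history.

```python
import difflib


def diff_text(previous: str, current: str,
              from_label: str = "previous", to_label: str = "current") -> str:
    """Return a unified diff between two multi-line strings."""
    lines = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile=from_label,
        tofile=to_label,
        lineterm="",  # inputs have no trailing newlines, so emit none
    )
    return "\n".join(lines)


# Illustrative strings only.
old_instruction = "You are a search assistant.\nCite your sources."
new_instruction = "You are a research assistant.\nCite your sources.\nReply in English."
print(diff_text(old_instruction, new_instruction, "v1", "v2"))
```

Lines prefixed with `-` exist only in the older text and lines prefixed with `+` only in the newer one; identical inputs produce an empty string.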

In [None]:
# Get current version number and version history
current_version = version_manager.get_current_version_number(agent_data["id"])
version_history = version_manager.get_version_history(
    agent_id=agent_data["id"])

if len(version_history) >= 1:
    # Get the most recent snapshot from history; select it by version number
    # rather than relying on the order in which the history is returned
    previous_version = max(version_history, key=lambda v: v['versionNumber'])

    # Get current agent state
    current_agent = agent_manager.get_agent_metadata(
        azure_agent_id=AZURE_AGENT_ID)

    print(
        f"📊 Comparing Version {previous_version['versionNumber']} (Previous) vs Current State (Version {current_version})\n")

    # Compare key fields
    differences = {}
    prev_snapshot = previous_version['snapshot']

    for key in ["name", "description", "instruction", "category", "status", "samplePrompts"]:
        prev_val = prev_snapshot.get(key)
        curr_val = current_agent.get(key)

        if prev_val != curr_val:
            differences[key] = {
                "previous": prev_val,
                "current": curr_val
            }

    if differences:
        from IPython.display import display, Markdown

        print("\n🔍 Differences Found:\n")

        # Create markdown table
        table_rows = ["| Field | Version {} (Previous) | Version {} (Current) |".format(
            previous_version['versionNumber'], current_version
        )]
        table_rows.append(
            "|-------|----------------------|---------------------|")

        for field, changes in differences.items():
            prev_val = changes.get("previous", "")
            curr_val = changes.get("current", "")

            # Format values for table display
            if isinstance(prev_val, str):
                # Truncate long strings and escape markdown
                prev_display = prev_val[:150].replace(
                    "\n", " ").replace("|", "\\|")
                if len(prev_val) > 150:
                    prev_display += "..."
            elif isinstance(prev_val, list):
                prev_display = f"{len(prev_val)} items"
            else:
                prev_display = str(prev_val)

            if isinstance(curr_val, str):
                curr_display = curr_val[:150].replace(
                    "\n", " ").replace("|", "\\|")
                if len(curr_val) > 150:
                    curr_display += "..."
            elif isinstance(curr_val, list):
                curr_display = f"{len(curr_val)} items"
            else:
                curr_display = str(curr_val)

            table_rows.append(
                f"| **{field}** | {prev_display} | {curr_display} |")

        markdown_table = "\n".join(table_rows)
        display(Markdown(markdown_table))

        # Show detailed differences for arrays
        for field, changes in differences.items():
            if isinstance(changes.get("previous"), list) or isinstance(changes.get("current"), list):
                print(f"\n📋 Detailed {field}:")
                print(
                    f"\n  Version {previous_version['versionNumber']} (Previous):")
                # A field may be None on one side, so fall back to an empty list
                for item in changes.get("previous") or []:
                    print(f"    - {item}")
                print(f"\n  Version {current_version} (Current):")
                for item in changes.get("current") or []:
                    print(f"    - {item}")
                print("-" * 80)
    else:
        print("\n✅ No differences detected")
else:
    print("No version history available yet, cannot compare")

## Rollback to Previous Version

If the evaluation results are unsatisfactory, we can rollback to a previous version.
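
Conceptually, a rollback can be treated as one more update whose payload is an earlier snapshot, so history is never rewritten. The sketch below works on a plain dictionary and is only an assumption about the semantics: `rollback_in_memory` is hypothetical, and the actual `rollback_to_version` may behave differently.

```python
import copy


def rollback_in_memory(agent: dict, version_number: int) -> bool:
    """Restore the snapshot stored under version_number (hypothetical helper)."""
    match = next(
        (v for v in agent.get("versions", []) if v["versionNumber"] == version_number),
        None,
    )
    if match is None:
        return False  # unknown version: leave the agent untouched

    current = agent.get("currentVersion", 1)
    # Keep the state being replaced so the rollback itself is auditable.
    state = {k: v for k, v in agent.items() if k != "versions"}
    agent["versions"].append({
        "versionNumber": current,
        "changeDescription": f"State before rollback to version {version_number}",
        "snapshot": copy.deepcopy(state),
    })
    agent.update(copy.deepcopy(match["snapshot"]))
    # Assumption: a rollback creates a new version instead of reusing the old number.
    agent["currentVersion"] = current + 1
    return True


rb_agent = {
    "instruction": "v2 text",
    "currentVersion": 2,
    "versions": [
        {"versionNumber": 1, "snapshot": {"instruction": "v1 text", "currentVersion": 1}},
    ],
}
print(rollback_in_memory(rb_agent, 1), rb_agent["instruction"], rb_agent["currentVersion"])
```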

In [None]:
# Example: Rollback to previous version
# UNCOMMENT TO EXECUTE ROLLBACK

# current_version = version_manager.get_current_version_number(agent_data["id"])
#
# if current_version > 1:
#     target_version = current_version - 1
#
#     print(f"⏮️  Rolling back from version {current_version} to version {target_version}...")
#
#     success = version_manager.rollback_to_version(
#         agent_id=agent_data["id"],
#         version_number=target_version,
#         change_description=f"Rollback to version {target_version} due to evaluation concerns",
#         changed_by="xle@microsoft.com"
#     )
#
#     if success:
#         print("✅ Rollback successful")
#
#         # Verify rollback
#         updated_agent = agent_manager.get_agent_metadata(azure_agent_id=AZURE_AGENT_ID)
#         print(f"📋 Current version after rollback: {updated_agent.get('currentVersion')}")
#     else:
#         print("❌ Rollback failed")
# else:
#     print("Cannot rollback - only one version exists")

print("ℹ️  Rollback example code is commented out. Uncomment to execute.")

## Complete CI/CD Pipeline Example

Here's a complete function that encapsulates the entire CI/CD workflow.

In [None]:
def agent_cicd_pipeline(
    agent_id: str,
    updates: dict,
    change_description: str,
    changed_by: str,
    test_queries: list,
    evaluation_threshold: float = 3.0,
    auto_rollback: bool = True
):
    """
    Complete CI/CD pipeline for agent updates
    
    Args:
        agent_id: Agent ID
        updates: Dict of updates to apply
        change_description: Description of changes
        changed_by: Who made the change
        test_queries: List of test queries for evaluation
        evaluation_threshold: Minimum acceptable score
        auto_rollback: Whether to auto-rollback on failure
    
    Returns:
        Dict with pipeline results
    """
    results = {
        "success": False,
        "version_created": None,
        "evaluation_passed": False,
        "rolled_back": False,
        "errors": [],
        "evaluation_scores": [],
        "instruction": None,
        "description": None
    }
    
    try:
        # Step 1: Update with versioning
        print("📝 Step 1: Updating agent with versioning...")
        success = version_manager.update_agent_with_versioning(
            agent_id=agent_id,
            updates=updates,
            change_description=change_description,
            changed_by=changed_by
        )
        
        if not success:
            results["errors"].append("Update failed")
            return results
        
        current_version = version_manager.get_current_version_number(agent_id)
        results["version_created"] = current_version
        
        # Store the instruction and description for comparison
        agent_metadata = agent_manager.get_agent_metadata(agent_id=agent_id)
        results["instruction"] = agent_metadata.get("instruction", "")[:200] + "..."
        results["description"] = agent_metadata.get("description", "")

        print(f"✅ Version {current_version} created")

        # Step 2: Run evaluations
        print("\n🔍 Step 2: Running evaluations...")
        evaluation_scores = []

        # Get Azure agent ID for API calls (reuses the metadata fetched above)
        azure_agent_id = agent_metadata.get("azure_agent_id")

        if not azure_agent_id:
            results["errors"].append("Azure agent ID not found")
            return results
        
        # One converter can be reused for every test query
        converter = AIAgentConverter(project_client)

        for test_query in test_queries:
            # Create thread and run agent
            thread = agent_manager.create_thread()
            try:
                project_client.agents.messages.create(
                    thread_id=thread.id,
                    role="user",
                    content=test_query
                )
                run = project_client.agents.runs.create_and_process(
                    thread_id=thread.id,
                    agent_id=azure_agent_id  # Use Azure agent ID, not local DB ID
                )

                # Convert and evaluate
                eval_data = converter.convert(thread_id=thread.id, run_id=run.id)

                # Run evaluators
                intent_result = intent_resolution(
                    query=eval_data.get("query"),
                    response=eval_data.get("response")
                )

                evaluation_scores.append(intent_result.get('intent_resolution', 0))
            finally:
                # Cleanup, even if the run or the evaluation raised
                agent_manager.delete_thread(thread.id, silent=True)
        
        # Step 3: Analyze results
        avg_score = sum(evaluation_scores) / len(evaluation_scores) if evaluation_scores else 0
        results["evaluation_scores"] = evaluation_scores
        results["average_score"] = avg_score
        results["evaluation_passed"] = avg_score >= evaluation_threshold
        
        print(f"\n📊 Average evaluation score: {avg_score:.2f}")
        print(f"   Threshold: {evaluation_threshold}")
        print(f"   Status: {'✅ PASSED' if results['evaluation_passed'] else '❌ FAILED'}")
        
        # Step 4: Rollback if needed
        if not results["evaluation_passed"] and auto_rollback and current_version > 1:
            print(f"\n⏮️  Step 4: Rolling back to version {current_version - 1}...")
            rollback_success = version_manager.rollback_to_version(
                agent_id=agent_id,
                version_number=current_version - 1,
                change_description=f"Auto-rollback: evaluation score {avg_score:.2f} below threshold {evaluation_threshold}",
                changed_by="system"
            )
            results["rolled_back"] = rollback_success
            if rollback_success:
                print("✅ Rollback successful")
            else:
                print("❌ Rollback failed")
        
        results["success"] = results["evaluation_passed"]
        
    except Exception as e:
        results["errors"].append(str(e))
        print(f"\n❌ Pipeline error: {e}")
    
    return results

print("✅ CI/CD pipeline function defined")

### Test the Complete Pipeline

We'll run two experiments:
1. **Good Update**: Enhanced instruction with clear guidelines
2. **Bad Update**: Vague, unhelpful instruction

Then compare the evaluation results.

In [None]:
# Test queries for evaluation
import time
test_queries = [
    "What are the latest AI trends?",
    "Find recent news about cloud computing"
]

# Experiment 1: Good Update - Enhanced instruction
print("=" * 100)
print("EXPERIMENT 1: GOOD UPDATE")
print("=" * 100)

good_updates = {
    "instruction": """You are an Advanced Web Research Assistant with real-time web search capabilities.

## Role
You help users find current information from the web, including:
- Latest news and current events with source verification
- Recent industry updates and emerging trends
- Real-time data, statistics, and market insights
- Current product information, reviews, and comparisons
- Academic research and technical documentation

## Enhanced Capabilities
1. **Multi-source Verification**: Cross-reference information from multiple sources
2. **Trend Analysis**: Identify patterns and trends in search results
3. **Source Quality Assessment**: Evaluate credibility of information sources
4. **Contextual Summarization**: Provide concise summaries with key insights

## Constraints
1. When asked about current events or recent information, use the Bing search tool to find up-to-date information
2. Always cite your sources with links to the websites you reference
3. Present information clearly and concisely, using structured formats when appropriate
4. If information is time-sensitive, mention when the data was retrieved
5. For controversial topics, present multiple perspectives with proper attribution
6. Reply in English unless specifically asked for another language
7. If search results are insufficient, acknowledge limitations and suggest alternative approaches

## Quality Standards
- Verify facts across multiple sources when possible
- Distinguish between facts, opinions, and speculation
- Provide context for statistics and data points
- Update users if information may have changed since retrieval""",
    "description": "Advanced Web Research Assistant with enhanced verification and analysis capabilities"
}

good_results = agent_cicd_pipeline(
    agent_id=agent_data["id"],
    updates=good_updates,
    change_description="Enhanced instruction with clear guidelines and quality standards",
    changed_by="xle@microsoft.com",
    test_queries=test_queries,
    evaluation_threshold=3.0,
    auto_rollback=False  # Don't rollback, we want to test both versions
)

print("\n✅ Good update completed")
print(f"Version created: {good_results['version_created']}")
print(f"Evaluation passed: {good_results['evaluation_passed']}")

# Wait a moment before next update
time.sleep(2)

# Experiment 2: Bad Update - Vague, unhelpful instruction
print("\n" + "=" * 100)
print("EXPERIMENT 2: BAD UPDATE")
print("=" * 100)

bad_updates = {
    "instruction": """You are a search assistant. Help users find things on the web. Use the search tool when needed. Provide answers based on what you find.""",
    "description": "Basic search assistant"
}

bad_results = agent_cicd_pipeline(
    agent_id=agent_data["id"],
    updates=bad_updates,
    change_description="Simplified instruction (deliberately reduced quality for testing)",
    changed_by="xle@microsoft.com",
    test_queries=test_queries,
    evaluation_threshold=3.0,
    auto_rollback=False  # Don't rollback, we want to compare results
)

print("\n✅ Bad update completed")
print(f"Version created: {bad_results['version_created']}")
print(f"Evaluation passed: {bad_results['evaluation_passed']}")

# Store results for comparison
experiment_results = {
    "good": good_results,
    "bad": bad_results
}

### Compare Evaluation Results

Let's compare the two experiments side by side to see the impact of instruction quality.


In [None]:
from IPython.display import display, Markdown
import pandas as pd

# Extract evaluation metrics for comparison
comparison_data = {
    "Metric": [
        "Version Number",
        "Evaluation Passed",
        "Average Score",
        "Individual Scores",
        "Rolled Back",
        "Errors"
    ],
    "Good Update (Enhanced)": [
        good_results.get("version_created", "N/A"),
        "✅ Yes" if good_results.get("evaluation_passed") else "❌ No",
        f"{good_results.get('average_score', 0):.2f}" if good_results.get('average_score') else "N/A",
        ", ".join([f"{s:.1f}" for s in good_results.get('evaluation_scores', [])]) if good_results.get('evaluation_scores') else "N/A",
        "✅ Yes" if good_results.get("rolled_back") else "❌ No",
        ", ".join(good_results.get("errors", [])) if good_results.get("errors") else "None"
    ],
    "Bad Update (Simplified)": [
        bad_results.get("version_created", "N/A"),
        "✅ Yes" if bad_results.get("evaluation_passed") else "❌ No",
        f"{bad_results.get('average_score', 0):.2f}" if bad_results.get('average_score') else "N/A",
        ", ".join([f"{s:.1f}" for s in bad_results.get('evaluation_scores', [])]) if bad_results.get('evaluation_scores') else "N/A",
        "✅ Yes" if bad_results.get("rolled_back") else "❌ No",
        ", ".join(bad_results.get("errors", [])) if bad_results.get("errors") else "None"
    ]
}

# Create DataFrame
df = pd.DataFrame(comparison_data)

# Display as markdown table
print("\n" + "=" * 100)
print("EVALUATION COMPARISON")
print("=" * 100 + "\n")

# Convert to markdown table
markdown_table = "| " + " | ".join(df.columns) + " |\n"
markdown_table += "|" + "|".join(["---" for _ in df.columns]) + "|\n"
for _, row in df.iterrows():
    markdown_table += "| " + " | ".join(str(val) for val in row) + " |\n"

display(Markdown(markdown_table))

# Show instruction differences
print("\n" + "=" * 100)
print("INSTRUCTION COMPARISON")
print("=" * 100)

print("\n📝 Good Update (Enhanced) - Instruction Preview:")
print("-" * 100)
print(good_results.get("instruction", "N/A"))
print(f"\n📋 Description: {good_results.get('description', 'N/A')}")

print("\n" + "-" * 100)
print("\n📝 Bad Update (Simplified) - Instruction Preview:")
print("-" * 100)
print(bad_results.get("instruction", "N/A"))
print(f"\n📋 Description: {bad_results.get('description', 'N/A')}")
print("\n" + "=" * 100)

# Display detailed results
print("\n📊 Detailed Results:\n")
print("=" * 100)
print("\n🟢 GOOD UPDATE (Enhanced Instruction)")
print("-" * 100)
print(json.dumps(good_results, indent=2, default=str))

print("\n🔴 BAD UPDATE (Simplified Instruction)")
print("-" * 100)
print(json.dumps(bad_results, indent=2, default=str))

print("\n" + "=" * 100)
print("\n📈 Analysis:")
print("-" * 100)
if good_results.get("evaluation_passed") and not bad_results.get("evaluation_passed"):
    print("✅ The enhanced instruction passed evaluation and the simplified version did not")
    print("   This suggests that clear, detailed instructions improve agent quality on these test queries")
elif bad_results.get("evaluation_passed") and not good_results.get("evaluation_passed"):
    print("⚠️  Surprisingly, the simplified instruction performed better")
    print("   This may indicate the test queries don't fully capture instruction quality")
elif good_results.get("evaluation_passed") and bad_results.get("evaluation_passed"):
    print("✅ Both versions passed evaluation")
    print("   Review individual scores to see which performed better overall")
else:
    print("❌ Both versions failed evaluation")
    print("   Consider adjusting the evaluation threshold or improving both instructions")


### View Version History with Changes

Let's examine the version history to see exactly what changed between versions.


In [None]:
# Get version history to see all changes
version_history = version_manager.get_version_history(agent_id=agent_data["id"])

print(f"\n📜 Complete Version History ({len(version_history)} versions)")
print("=" * 100)

# Sort by version number descending to show most recent first
sorted_history = sorted(version_history, key=lambda x: x['versionNumber'], reverse=True)

for i, version in enumerate(sorted_history):
    print(f"\n{'🔵' if i == 0 else '⚪'} Version {version['versionNumber']}")
    print(f"   ⏰ Timestamp: {version['timestamp']}")
    print(f"   👤 Changed by: {version['changedBy']}")
    print(f"   📝 Change: {version['changeDescription']}")
    
    snapshot = version['snapshot']
    
    # Show instruction preview
    instruction_preview = snapshot.get('instruction', '')[:150].replace('\n', ' ')
    print(f"   📄 Instruction: {instruction_preview}...")
    
    # Show description
    print(f"   📋 Description: {snapshot.get('description', 'N/A')}")
    
    # Compare with previous version if available
    if i < len(sorted_history) - 1:
        prev_version = sorted_history[i + 1]
        prev_snapshot = prev_version['snapshot']
        
        # Check what changed
        changes = []
        if snapshot.get('instruction') != prev_snapshot.get('instruction'):
            changes.append("instruction")
        if snapshot.get('description') != prev_snapshot.get('description'):
            changes.append("description")
        if snapshot.get('samplePrompts') != prev_snapshot.get('samplePrompts'):
            changes.append("samplePrompts")
        
        if changes:
            print(f"   🔄 Changes from v{prev_version['versionNumber']}: {', '.join(changes)}")
    
    print("-" * 100)

# Show detailed comparison between two most recent versions
if len(sorted_history) >= 2:
    print("\n" + "=" * 100)
    print("DETAILED COMPARISON: Latest Two Versions")
    print("=" * 100)
    
    v1 = sorted_history[0]  # Most recent
    v2 = sorted_history[1]  # Second most recent
    
    print(f"\n📊 Comparing Version {v2['versionNumber']} → Version {v1['versionNumber']}")
    print("-" * 100)
    
    # Compare instructions
    print("\n📝 INSTRUCTION CHANGES:")
    print(f"\n  Version {v2['versionNumber']}:")
    print(f"  {v2['snapshot'].get('instruction', '')[:300]}...")
    print(f"\n  Version {v1['versionNumber']}:")
    print(f"  {v1['snapshot'].get('instruction', '')[:300]}...")
    
    # Compare descriptions
    print("\n📋 DESCRIPTION CHANGES:")
    print(f"  Version {v2['versionNumber']}: {v2['snapshot'].get('description', 'N/A')}")
    print(f"  Version {v1['versionNumber']}: {v1['snapshot'].get('description', 'N/A')}")
    
    # Compare sample prompts if they exist
    v1_prompts = v1['snapshot'].get('samplePrompts', [])
    v2_prompts = v2['snapshot'].get('samplePrompts', [])
    
    if v1_prompts or v2_prompts:
        print("\n💬 SAMPLE PROMPTS CHANGES:")
        
        # Find prompts present in the newer version but not in the older one
        added = [p for p in v1_prompts if p not in v2_prompts]
        if added:
            print(f"\n  ➕ Added in v{v1['versionNumber']}:")
            for p in added:
                print(f"     - {p}")
        
        # Find prompts present in the older version but dropped in the newer one
        removed = [p for p in v2_prompts if p not in v1_prompts]
        if removed:
            print(f"\n  ➖ Removed in v{v1['versionNumber']}:")
            for p in removed:
                print(f"     - {p}")
        
        if not added and not removed:
            print("  ✅ No changes")
    
    print("\n" + "=" * 100)
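
The comparison above truncates each instruction to 300 characters, so a change further into the text would not be visible. As a minimal sketch, Python's standard `difflib` module can produce a line-level diff of two version records. The helper name `diff_instructions` and the sample records are hypothetical; only the `versionNumber`, `snapshot`, and `instruction` fields come from the schema used in this notebook.

```python
import difflib


def diff_instructions(old_version, new_version):
    """Return a unified diff (list of lines) of the 'instruction' field of two version records."""
    old_text = old_version["snapshot"].get("instruction") or ""
    new_text = new_version["snapshot"].get("instruction") or ""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"v{old_version['versionNumber']}",
        tofile=f"v{new_version['versionNumber']}",
        lineterm="",
    )
    return list(diff)


# Hypothetical sample records, for illustration only
old = {"versionNumber": 1,
       "snapshot": {"instruction": "You are a helpful agent.\nAnswer briefly."}}
new = {"versionNumber": 2,
       "snapshot": {"instruction": "You are a helpful agent.\nAnswer briefly.\nCite your sources."}}

for line in diff_instructions(old, new):
    print(line)
```

With real data, the two most recent entries of `sorted_history` could be passed in place of the sample records (older version first).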


### Visual Score Comparison

Let's visualize the performance difference between the two versions.


In [None]:
# Create a visual comparison of scores
def create_score_bar(score, max_score=5, width=50):
    """Create a text-based progress bar for scores"""
    filled = int((score / max_score) * width)
    bar = "█" * filled + "░" * (width - filled)
    return f"{bar} {score:.2f}/{max_score}"

print("\n" + "=" * 100)
print("EVALUATION SCORES VISUALIZATION")
print("=" * 100)

good_avg = good_results.get('average_score', 0)
bad_avg = bad_results.get('average_score', 0)

print(f"\n📊 Version {good_results.get('version_created')} (Enhanced Instruction):")
print(f"    {create_score_bar(good_avg)}")
print(f"    Status: {'✅ PASSED' if good_results.get('evaluation_passed') else '❌ FAILED'}")

print(f"\n📊 Version {bad_results.get('version_created')} (Simplified Instruction):")
print(f"    {create_score_bar(bad_avg)}")
print(f"    Status: {'✅ PASSED' if bad_results.get('evaluation_passed') else '❌ FAILED'}")

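# Calculate relative improvement (undefined when the simplified score is 0)
improvement = ((good_avg - bad_avg) / bad_avg * 100) if bad_avg > 0 else None

print("\n📈 Performance Difference:")
if improvement is None:
    print(f"    ➖ Relative difference unavailable (simplified score is 0); absolute difference: {good_avg - bad_avg:+.2f}")
elif improvement > 0:
    print(f"    ✅ Enhanced version is {improvement:.1f}% better")
elif improvement < 0:
    print(f"    ⚠️  Enhanced version is {abs(improvement):.1f}% worse")
else:
    print("    ➖ No difference in performance")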

# Show individual test scores
print("\n" + "-" * 100)
print("\n📋 Individual Test Scores:")

good_scores = good_results.get('evaluation_scores', [])
bad_scores = bad_results.get('evaluation_scores', [])

max_tests = max(len(good_scores), len(bad_scores))
test_pairs = list(zip(range(1, max_tests + 1), 
                      good_scores + [None] * (max_tests - len(good_scores)),
                      bad_scores + [None] * (max_tests - len(bad_scores))))

print(f"\n{'Test':<8} {'Enhanced (v' + str(good_results.get('version_created')) + ')':<25} {'Simplified (v' + str(bad_results.get('version_created')) + ')':<25} {'Difference':<15}")
print("-" * 100)

for test_num, good_score, bad_score in test_pairs:
    good_str = f"{good_score:.2f}" if good_score is not None else "N/A"
    bad_str = f"{bad_score:.2f}" if bad_score is not None else "N/A"
    
    if good_score is not None and bad_score is not None:
        diff = good_score - bad_score
        diff_str = f"{'+' if diff > 0 else ''}{diff:.2f}"
        indicator = "🟢" if diff > 0 else "🔴" if diff < 0 else "⚪"
    else:
        diff_str = "N/A"
        indicator = "⚪"
    
    print(f"{indicator} #{test_num:<5} {good_str:<25} {bad_str:<25} {diff_str:<15}")

print("\n" + "=" * 100)

# Summary recommendation
print("\n💡 Recommendation:")
if good_results.get('evaluation_passed') and not bad_results.get('evaluation_passed'):
    print(f"   ✅ Deploy Version {good_results.get('version_created')} (Enhanced) - it passed evaluation and the simplified version did not")
    print(f"   ⏮️  Consider rolling back Version {bad_results.get('version_created')} (Simplified)")
elif bad_results.get('evaluation_passed') and not good_results.get('evaluation_passed'):
    print(f"   ⚠️  Version {good_results.get('version_created')} (Enhanced) underperformed - investigate before deployment")
    print(f"   ✅ Keep Version {bad_results.get('version_created')} (Simplified) as baseline")
elif good_results.get('evaluation_passed') and bad_results.get('evaluation_passed'):
    if good_avg > bad_avg:
        print(f"   ✅ Both passed, but Version {good_results.get('version_created')} (Enhanced) has better scores")
    else:
        print("   ⚠️  Both passed, but simpler instruction may be sufficient")
else:
    print("   ❌ Both versions failed evaluation - review instructions and test queries")
    print("   🔧 Consider adjusting evaluation threshold or improving agent configuration")

print("\n" + "=" * 100)


## Best Practices

### Version Control
1. **Descriptive Change Logs**: Always provide meaningful descriptions for changes
2. **Incremental Updates**: Make small, focused changes rather than large overhauls
3. **Version Tagging**: Use semantic versioning concepts (major, minor, patch)
4. **Regular Backups**: Export agent metadata periodically

### Evaluation Strategy
1. **Multiple Test Cases**: Use diverse test queries covering different scenarios
2. **Baseline Comparison**: Compare new version scores against baseline
3. **Threshold Tuning**: Adjust evaluation thresholds based on your quality requirements
4. **Progressive Rollout**: Test with limited users before full deployment

### CI/CD Automation
1. **Automated Testing**: Run evaluations automatically on every update
2. **Quality Gates**: Block deployment if evaluation scores drop
3. **Monitoring**: Track evaluation scores over time
4. **Alert System**: Notify team when scores fall below threshold
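
A quality gate of the kind described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the notebook's own gate: the function name `quality_gate` and the defaults `threshold=3.0` and `max_drop=0.25` are hypothetical, chosen for the 1 to 5 score scale used in the visual comparison.

```python
def quality_gate(candidate_scores, baseline_scores, threshold=3.0, max_drop=0.25):
    """Decide whether a candidate version may be deployed.

    Returns (passed, reasons). threshold and max_drop are illustrative
    defaults, not values taken from this notebook.
    """
    if not candidate_scores:
        return False, ["no evaluation scores for candidate"]

    reasons = []
    candidate_avg = sum(candidate_scores) / len(candidate_scores)

    # Absolute check: the candidate must meet the minimum average score
    if candidate_avg < threshold:
        reasons.append(f"average {candidate_avg:.2f} is below threshold {threshold:.2f}")

    # Relative check: the candidate must not regress too far from the baseline
    if baseline_scores:
        baseline_avg = sum(baseline_scores) / len(baseline_scores)
        drop = baseline_avg - candidate_avg
        if drop > max_drop:
            reasons.append(f"average dropped {drop:.2f} from baseline {baseline_avg:.2f}")

    return (not reasons), reasons


passed, reasons = quality_gate([4.0, 4.5, 3.5], baseline_scores=[4.0, 4.0, 4.0])
print(passed, reasons)
```

Returning the reasons alongside the verdict lets the same function feed both a deployment decision and an alert message.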

### Rollback Procedures
1. **Quick Rollback**: Have automated rollback for critical failures
2. **Investigation**: Analyze why the new version failed before retry
3. **Graceful Degradation**: Consider keeping previous version running during testing
4. **User Communication**: Inform users of any service changes

### Database Schema Evolution
The version control system stores:
```json
{
  "id": "agent-local-id",
  "azure_agent_id": "asst_xxx",
  "currentVersion": 5,
  "versions": [
    {
      "versionNumber": 1,
      "timestamp": "2025-11-10T10:00:00Z",
      "changeDescription": "Initial version",
      "changedBy": "user@example.com",
      "snapshot": {
        "name": "...",
        "instruction": "...",
        // ... complete configuration
      }
    },
    // ... more versions
  ]
}
```
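
To make the schema concrete, the following sketch shows how a document of this shape could be updated in memory. The helpers `add_version` and `rollback_to` are hypothetical and are not the notebook's version manager. In particular, the sketch assumes each snapshot stores the configuration being applied; the real implementation may instead snapshot the state before the update.

```python
import copy
from datetime import datetime, timezone


def add_version(agent_doc, config, change_description, changed_by):
    """Append a snapshot of config to agent_doc['versions'] and bump currentVersion."""
    next_number = agent_doc.get("currentVersion", 0) + 1
    agent_doc.setdefault("versions", []).append({
        "versionNumber": next_number,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "changeDescription": change_description,
        "changedBy": changed_by,
        # Deep copy so later edits to config cannot alter the stored history
        "snapshot": copy.deepcopy(config),
    })
    agent_doc["currentVersion"] = next_number
    return next_number


def rollback_to(agent_doc, version_number, changed_by):
    """Restore an earlier snapshot by recording it as a new version, preserving the audit trail."""
    for version in agent_doc.get("versions", []):
        if version["versionNumber"] == version_number:
            return add_version(
                agent_doc,
                version["snapshot"],
                f"Rollback to version {version_number}",
                changed_by,
            )
    raise ValueError(f"version {version_number} not found")


# Hypothetical document and configurations, for illustration only
doc = {"id": "agent-local-id", "currentVersion": 0, "versions": []}
add_version(doc, {"instruction": "v1 text"}, "Initial version", "user@example.com")
add_version(doc, {"instruction": "v2 text"}, "Simplified instruction", "user@example.com")
rollback_to(doc, 1, "user@example.com")
print(doc["currentVersion"], doc["versions"][-1]["snapshot"])
```

Recording a rollback as a new version, rather than deleting later entries, keeps the history append-only so that every change remains traceable.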

### Integration with External Systems
- **GitHub Actions**: Trigger CI/CD on code changes
- **Azure DevOps**: Integrate with deployment pipelines
- **Monitoring Tools**: Send evaluation results to monitoring dashboards
- **Notification Systems**: Alert on version changes and evaluation results
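
CI systems such as GitHub Actions and Azure Pipelines treat a non-zero process exit code as a failed step, so one simple integration point is a script that converts an evaluation result into an exit code and a JSON report. The sketch below is hypothetical: `build_ci_report` and its report layout are illustrative, and it assumes the result dictionary uses the `version_created`, `average_score`, and `evaluation_passed` keys shown earlier in this notebook.

```python
import json


def build_ci_report(results, report_path=None):
    """Summarize one evaluation run and map it to a process exit code.

    Returns (exit_code, report); exit_code is 0 on pass and 1 on failure.
    """
    passed = bool(results.get("evaluation_passed"))
    report = {
        "version": results.get("version_created"),
        "average_score": results.get("average_score"),
        "passed": passed,
    }
    # Optionally persist the report so a later pipeline step or dashboard can read it
    if report_path is not None:
        with open(report_path, "w", encoding="utf-8") as handle:
            json.dump(report, handle, indent=2, default=str)
    return (0 if passed else 1), report


# Hypothetical evaluation result, for illustration only
exit_code, report = build_ci_report(
    {"version_created": 3, "average_score": 4.2, "evaluation_passed": True}
)
print(exit_code, report)
# In a standalone script, finish with: sys.exit(exit_code)
```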

## Cleanup

In [None]:
# Clean up test thread if it exists
if 'thread' in locals():
    agent_manager.delete_thread(thread.id, silent=True)
    print("✅ Test thread cleaned up")
else:
    print("ℹ️  No cleanup needed")

## Summary

This notebook demonstrated a complete CI/CD workflow for Azure AI agents:

### Key Accomplishments
1. ✅ **Automatic Versioning**: Every update creates a version snapshot
2. ✅ **Version History**: Complete audit trail with timestamps and descriptions
3. ✅ **Automated Evaluation**: Run evaluators after updates
4. ✅ **Version Comparison**: See differences between versions
5. ✅ **Rollback Support**: Restore previous versions safely
6. ✅ **Complete Pipeline**: End-to-end CI/CD with quality gates

### Database Updates
- `versions` array stores complete version history
- `currentVersion` tracks the active version number
- Each version includes timestamp, change description, and who made the change
- Full configuration snapshot preserved for each version

### Next Steps
1. **Integrate with Git**: Version control your agent configurations
2. **Set Up Monitoring**: Track evaluation metrics over time
3. **Automate Deployment**: Trigger CI/CD from source control
4. **Expand Test Coverage**: Add more comprehensive test scenarios
5. **A/B Testing**: Compare multiple versions in production

### Resources
- [Azure AI Evaluation Documentation](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-sdk)
- [Azure Cosmos DB Best Practices](https://learn.microsoft.com/azure/cosmos-db/best-practices)
- [CI/CD for ML Models](https://learn.microsoft.com/azure/machine-learning/concept-model-management-and-deployment)