# 🚀 LevelApp Framework - Conversation Simulator Tutorial

---

Welcome to the **LevelApp Conversation Simulator** tutorial! This interactive notebook will guide you through evaluating conversational AI systems using the LevelApp framework.

## 📚 What is LevelApp?

**LevelApp** is a powerful Python framework designed for **automated testing and evaluation of dialogue systems**. It helps you:
- 🤖 Simulate realistic conversations with your chatbot
- 📊 Evaluate response quality using AI judges
- 🔍 Monitor performance metrics and identify issues
- ✅ Validate that your chatbot behaves as expected

## 🎯 What You'll Learn

By the end of this notebook, you will:
1. ✅ Install and configure the LevelApp framework
2. ✅ Set up a simple FastAPI chatbot application **or** test your own conversational AI system
3. ✅ Create conversation scripts for testing
4. ✅ Configure evaluation workflows with YAML
5. ✅ Run automated conversation simulations
6. ✅ Interpret evaluation results and metrics

## 📋 Prerequisites

Before starting, make sure you have:
- ✅ A Google Colab account (you're already here!)
- ✅ An **OpenAI API key** ([Get one here](https://platform.openai.com/api-keys))
- ✅ An **Ngrok account** ([Sign up here](https://ngrok.com/))
- ✅ Basic Python knowledge
- ⏱️ **Estimated time**: 30-45 minutes

## 🏗️ What We'll Build

In this tutorial, we'll create a **dental clinic appointment booking chatbot** and test it using LevelApp's Conversation Simulator. The chatbot will:
- Answer questions about dental services
- Book appointments with specific doctors
- Return structured metadata (appointment type, date, doctor name)

---

Let's get started! 🎉

---
# 📦 Step 1: Installation

First, we'll install LevelApp and all required dependencies.

## 1.1 Install LevelApp Framework


In [None]:
!uv pip install levelapp -q

In [None]:
# Verify the installation by checking the version
!uv pip list | grep levelapp
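
The same check can also be done from Python, which avoids depending on `grep`. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of LevelApp.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or "None" if the install step did not succeed
print(f"levelapp version: {installed_version('levelapp')}")
```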

## 1.2 Install Additional Dependencies

We need several packages to run our chatbot and expose it via a public URL:


- **fastapi**: Web framework for building the chatbot API
- **uvicorn**: ASGI server to run FastAPI
- **pyngrok**: Create a public tunnel to our local server
- **openai**: OpenAI API client for the chatbot's LLM
- **pydantic**: Data validation for API requests/responses
- **gdown**: Download configuration files from Google Drive [Optional]

In [None]:
!uv pip install dotenv gdown fastapi uvicorn pyngrok openai pydantic -q

---
# 📥 Step 2: Download Configuration Files

We'll download pre-configured files from Google Drive that include:
- **workflow_configuration.yaml**: LevelApp workflow settings
- **conversation_script.json**: Test conversation scenarios
- **example_chatbot.py**: Sample chatbot implementation

In [None]:
import gdown

# Google Drive folder containing example files
folder_id = '1CylixxBt8gQZ3KyLeQPLOkxA_NnpoDli'
output_dir = './levelapp-examples'

# Download the folder
gdown.download_folder(id=folder_id, output=output_dir, quiet=False, use_cookies=False)
print(f"✅ Downloaded configuration files to '{output_dir}'")

### 📂 Verify Downloaded Files

The downloaded folder contains three main files:
- `workflow_configuration.yaml`: The YAML configuration for the evaluation process.
- `conversation_script.json`: The reference data containing the simulation scripts.
- `example_chatbot.py`: [Bonus] The chatbot app starter script.

In [None]:
import os

folder_path = './levelapp-examples'

if os.path.exists(folder_path):
    print(f"📂 Contents of '{folder_path}':")
    for root, dirs, files in os.walk(folder_path):
        level = root.replace(folder_path, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print(f"{subindent}📄 {f}")
else:
    print(f"❌ Folder '{folder_path}' not found.")
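
Beyond listing the folder, you may want to confirm that the three files named above are actually present. The sketch below searches the folder recursively for each file name. The helper `missing_files` and the `EXPECTED_FILES` tuple are illustrative names, not part of LevelApp or gdown.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the list above
EXPECTED_FILES = (
    "workflow_configuration.yaml",
    "conversation_script.json",
    "example_chatbot.py",
)


def missing_files(folder: str, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are not found anywhere under `folder`."""
    root = Path(folder)
    # Collect the names of all regular files below the folder (empty if it does not exist)
    found = {p.name for p in root.rglob("*") if p.is_file()} if root.is_dir() else set()
    return [name for name in expected if name not in found]


missing = missing_files("./levelapp-examples")
print("All expected files present." if not missing else f"Missing: {missing}")
```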

---
# 🤖 Step 3: Create the Chatbot Application [Optional]

If you don't have a deployed conversational AI system to test, you can build a **dental clinic chatbot** using FastAPI and OpenAI, deploy it using Ngrok, and test it. This chatbot will:
- Answer questions about dental services
- Book appointments with the appropriate doctor
- Return structured metadata for bookings

## 🔐 Configure API Keys

⚠️ **IMPORTANT**: You need to add your OpenAI API key to Google Colab secrets:

1. Click the **🔑 key icon** in the left sidebar
2. Click **"Add new secret"**
3. Name: `OPENAI_API_KEY`
4. Value: Your OpenAI API key
5. Enable notebook access
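
Once the secret is saved, it can be read at runtime instead of being pasted into a cell. The sketch below tries Colab's secret store first and falls back to an environment variable, so the same code also runs outside Colab. The helper name `load_secret` is hypothetical, and the environment-variable fallback is an assumption added for illustration.

```python
import os
from typing import Optional


def load_secret(name: str) -> Optional[str]:
    """Read a secret from Colab's secret store, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret not defined, or notebook access was not enabled for it
        return os.environ.get(name)


api_key = load_secret("OPENAI_API_KEY")
print("OpenAI key found." if api_key else "OPENAI_API_KEY is not set.")
```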

🚨 **Never hardcode API keys in your code!**

## 3.1 Define the Chatbot Application

This FastAPI app creates a chatbot with two endpoints:
- **POST /chat**: Main chatbot endpoint that processes messages
- **GET /healthz**: Health check endpoint for connectivity testing

### 🏥 Chatbot Behavior:
- **Dr. Tony Tony Chopper** → ROUTINE appointments
- **Dr. Trafalgar D. Water Law** → SURGICAL appointments
- **Dr. Crocus** → RESTORATIVE appointments
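
The mapping above is also a convenient sanity check on the metadata the chatbot returns. The sketch below encodes it as a dictionary and verifies that a booking's doctor matches its appointment type. The names `DOCTOR_BY_APPOINTMENT_TYPE` and `is_consistent` are illustrative and are not used elsewhere in this notebook.

```python
from typing import Dict

# Mapping taken from the list above
DOCTOR_BY_APPOINTMENT_TYPE: Dict[str, str] = {
    "ROUTINE": "Dr. Tony Tony Chopper",
    "SURGICAL": "Dr. Trafalgar D. Water Law",
    "RESTORATIVE": "Dr. Crocus",
}


def is_consistent(metadata: Dict[str, str]) -> bool:
    """Return True if the doctor in `metadata` matches its appointment type."""
    expected = DOCTOR_BY_APPOINTMENT_TYPE.get(metadata.get("appointment_type", ""))
    return expected is not None and metadata.get("doctor_name") == expected


print(is_consistent({"appointment_type": "SURGICAL", "doctor_name": "Dr. Trafalgar D. Water Law"}))  # True
print(is_consistent({"appointment_type": "ROUTINE", "doctor_name": "Dr. Crocus"}))  # False
```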

In [None]:
import os

from typing import Dict, Any
from pydantic import BaseModel

from google.colab import userdata
from openai import OpenAI
from fastapi import FastAPI, HTTPException


# Get API key from Colab secrets
openai_api_key = userdata.get('OPENAI_API_KEY')

# Initialize FastAPI app
app = FastAPI(title="Dental Clinic Chatbot")

# Initialize OpenAI client
client = OpenAI(api_key=openai_api_key)

# System prompt that defines chatbot behavior
SYSTEM_PROMPT = """
You are a medical assistant for a dental clinic,
helping patients book appointments and answer inquiries about medical services.

## Behavior:
- Always reply in a convivial, professional tone.
- Be concise and clear.

## Instructions:
1. Identify the type of appointment the user requires based on their request.
2. If the user asks to book an appointment, return the booking information in a structured JSON format.
   - The JSON must include:
     - `reply_text`: A friendly confirmation message.
     - `metadata`: A dict containing the following info:
        1. `appointment_type`: One of "ROUTINE", "SURGICAL", or "RESTORATIVE".
        2. `appointment_date`: The date of the appointment (format: YYYY-MM-DD).
        3. `doctor_name`: One of "Dr. Tony Tony Chopper", "Dr. Trafalgar D. Water Law", or "Dr. Crocus".
   - Example JSON output:
     ```json
     {
       "reply_text": "Your ROUTINE appointment with Dr. Tony Tony Chopper is booked for 2025-12-01.",
       "metadata": {
         "appointment_type": "ROUTINE",
         "appointment_date": "2025-12-01",
         "doctor_name": "Dr. Tony Tony Chopper"
       }
     }
     ```
3. If the user does not request a booking, return only the "reply_text".

## Additional Information:
- Dr. Tony Tony Chopper handles ROUTINE appointments.
- Dr. Trafalgar D. Water Law handles SURGICAL appointments.
- Dr. Crocus handles RESTORATIVE appointments.
"""

# Pydantic models for request/response validation
class ChatRequest(BaseModel):
  message: str

class Metadata(BaseModel):
  appointment_type: str = ""
  appointment_date: str = ""
  doctor_name: str = ""

class ChatResponse(BaseModel):
  reply_text: str
  metadata: Metadata = Metadata()  # default to an empty Metadata object, not a bare dict

def generate_reply(user_message: str) -> ChatResponse:
  """Generate a reply using OpenAI's API with structured output."""
  try:
      resp = client.chat.completions.parse(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": user_message},
          ],
          temperature=0.3,
          response_format=ChatResponse
      )
      return resp.choices[0].message.parsed
  except Exception as e:
      # Chain the original exception so the traceback keeps the root cause
      raise RuntimeError(f"LLM error: {e}") from e

# API Endpoints
@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
  """Main chat endpoint that processes user messages."""
  if not req.message:
      raise HTTPException(status_code=400, detail="`message` is required.")
  try:
      reply = generate_reply(req.message)
      return reply
  except Exception as e:
      raise HTTPException(status_code=500, detail=str(e))

@app.get("/healthz")
def health():
  """Health check endpoint."""
  return {"status": "ok"}

print("✅ Chatbot application defined successfully!")

## 3.2 Start the FastAPI Server

We'll run the FastAPI app in a background thread so it doesn't block the notebook.

In [None]:
import time
import uvicorn
import threading

def run_fastapi():
  """Run FastAPI server in background."""
  uvicorn.run(app, host="0.0.0.0", port=8000)

# Start server in a daemon thread
thread = threading.Thread(target=run_fastapi, daemon=True)
thread.start()

# Give the server a moment to start
time.sleep(2)
print("✅ FastAPI server started on port 8000")
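
A fixed two-second sleep is usually enough, but it can be too short on a busy runtime. As an alternative, the sketch below polls the port until a TCP connection succeeds or a timeout expires. The helper name `wait_for_port` and its default values are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is accepting connections
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not up yet; retry shortly
    return False
```

In the cell above, `time.sleep(2)` could then be replaced with a call such as `wait_for_port("127.0.0.1", 8000)`, printing the success message only when it returns `True`.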

## 3.3 Expose the Server with Ngrok

Since our server is running locally in Colab, we need **Ngrok** to create a public URL that LevelApp can access.

### 🔐 Setup Ngrok:
1. Sign up at [ngrok.com](https://ngrok.com/)
2. Get your auth token from the dashboard
3. Add it to Colab secrets under the name `NGROK_TOKEN`, the same way you added `OPENAI_API_KEY`

⚠️ **Security Note**: Keep the Ngrok token in Colab secrets, not in the code!

In [None]:
import time

from google.colab import userdata
from pyngrok import ngrok


# Kill any existing ngrok processes first
ngrok.kill()
time.sleep(5)  # Short delay to ensure resources are released

# Read the Ngrok auth token from Colab secrets (secret name: NGROK_TOKEN)
# Get your token from: https://dashboard.ngrok.com/get-started/your-authtoken
NGROK_TOKEN = userdata.get('NGROK_TOKEN').strip()

# Register the auth token through pyngrok, so it is not echoed in a shell command
ngrok.set_auth_token(NGROK_TOKEN)

# Create public tunnel
public_url = ngrok.connect(addr="8000", proto="http")
print(f"\n✅ Public URL created: {public_url}")
print(f"\n🌐 Your chatbot is now accessible at: {public_url.public_url}")

## 3.4 Test the Chatbot

Let's verify our chatbot is working correctly by testing both endpoints.

In [None]:
import requests

base_url = "http://127.0.0.1:8000"  # or public_url.public_url to test through the Ngrok tunnel

# Test 1: Health check
print("🔍 Testing health endpoint...")
healthcheck_url = base_url + "/healthz"
response = requests.get(healthcheck_url)
print(f"✅ Health check [status: {response.status_code}]: {response.json()}\n")

# Test 2: Chat endpoint with appointment booking
print("🔍 Testing chat endpoint with appointment booking...")
chat_url = base_url + "/chat"
data = {"message": "I want to book an appointment next Monday to remove my wisdom tooth."}
response = requests.post(chat_url, json=data)
print(f"✅ Chatbot response [status: {response.status_code}]:")
print(f"   Reply: {response.json()['reply_text']}")
print(f"   Metadata: {response.json()['metadata']}")

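## 3.5 Update Configuration with the Chatbot URL

The workflow configuration file needs to know where our chatbot is hosted. We'll update it with the `base_url` used in the test cell (set `base_url = public_url.public_url` there if you want LevelApp to go through the Ngrok tunnel).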

In [None]:
import yaml

workflow_config_path = '/content/levelapp-examples/conversation-simulator/workflow_configuration.yaml'

# Load the YAML configuration
with open(workflow_config_path, 'r') as file:
    workflow_config = yaml.safe_load(file)

# Update the base URL with the chatbot URL (`base_url` from the test cell)
if 'endpoint' in workflow_config and 'base_url' in workflow_config['endpoint']:
    workflow_config['endpoint']['base_url'] = base_url
    print(f"✅ Updated endpoint base_url to: {workflow_config['endpoint']['base_url']}")
else:
    print("❌ Could not find 'endpoint.base_url' in the YAML file.")

# Save the updated configuration
with open(workflow_config_path, 'w') as file:
    yaml.dump(workflow_config, file, sort_keys=False)

print("✅ Configuration file updated successfully.")
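
If you need to update more than one setting, the "check that the key exists, then overwrite it" pattern from the cell above can be generalized. The sketch below does this for a dotted key path on a plain dictionary. The helper `set_nested` is a hypothetical name, and the URL in the example is a placeholder rather than a real tunnel address.

```python
from typing import Any, Dict


def set_nested(config: Dict[str, Any], dotted_key: str, value: Any) -> bool:
    """Overwrite an existing nested key such as "endpoint.base_url".

    Returns False (and changes nothing) if any part of the path is missing.
    """
    *parents, leaf = dotted_key.split(".")
    node: Any = config
    for key in parents:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    if not isinstance(node, dict) or leaf not in node:
        return False
    node[leaf] = value
    return True


example = {"endpoint": {"base_url": "http://127.0.0.1:8000", "path": "chat"}}
print(set_nested(example, "endpoint.base_url", "https://example.ngrok.app"))  # True
print(set_nested(example, "endpoint.missing", "x"))  # False
print(example["endpoint"]["base_url"])  # https://example.ngrok.app
```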

---
# ⚙️ Step 4: Configure LevelApp Evaluation

If you already have a live conversational AI system, you can instead set up the configuration and load the reference data using the UI widgets below.

In [None]:
#@title LevelApp Configuration UI


import ipywidgets as widgets
from IPython.display import display, HTML
import yaml
import json

class ConfigUIManager:
    def __init__(self, initial_config=None):
        if initial_config is None:
            initial_config = {
                'process': {
                    'project_name': "colab-evaluation",
                    'workflow_type': "SIMULATOR",
                    'evaluation_params': {
                        'attempts': 2,
                    }
                },
                'evaluation': {
                    'evaluators': ["JUDGE", "REFERENCE"],
                    'providers': ["openai", "ionos"],
                    'metrics_map': {
                        'appointment_type': "EXACT",
                        'appointment_date': "TOKEN_BASED",
                        'doctor_name': "TOKEN_BASED"
                    }
                },
                'reference_data': {
                    'path': "conversation_script.json",
                    'data': {}
                },
                'endpoint': {
                    'name': "example",
                    'base_url': "http://127.0.0.1:8000",
                    'path': "chat",
                    'method': "POST",
                    'timeout': 60,
                    'retry_count': 3,
                    'retry_backoff': 0.5,
                    'headers': [
                        {
                            'name': "Content-type",
                            'value': "application/json",
                            'secure': False
                        }
                    ],
                    'request_schema': [
                        {
                            'field_path': "message",
                            'value': "user_message",
                            'value_type': "dynamic",
                            'required': True
                        }
                    ],
                    'response_mapping': [
                        {
                            'field_path': "reply_text",
                            'extract_as': "agent_reply"
                        },
                        {
                            'field_path': "metadata",
                            'extract_as': "metadata"
                        }
                    ]
                },
                'repository': {
                    'type': "FIRESTORE",
                    'project_id': "",
                    'database_name': "",
                    'source': "LOCAL"
                }
            }

        self.widgets = {}
        self.output_area = widgets.Output()

        # --- Process Section ---
        self.widgets['process_project_name'] = widgets.Text(
            value=initial_config['process']['project_name'],
            description='Project Name:',
            placeholder='e.g., chatbot-evaluation',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['process_workflow_type'] = widgets.Dropdown(
            options=['SIMULATOR'],
            value=initial_config['process']['workflow_type'],
            description='Workflow Type:',
            disabled=True, # Fixed for this tutorial
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['process_attempts'] = widgets.IntSlider(
            value=initial_config['process']['evaluation_params']['attempts'],
            min=1, max=10, step=1,
            description='Attempts:',
            continuous_update=False,
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        process_section_widgets = widgets.VBox([
            self.widgets['process_project_name'],
            self.widgets['process_workflow_type'],
            widgets.Label(value="Evaluation Parameters:", layout=widgets.Layout(margin='10px 0 0 0')),
            self.widgets['process_attempts'],
        ])

        # --- Evaluation Section ---
        self.widgets['eval_evaluators'] = widgets.SelectMultiple(
            options=['JUDGE', 'REFERENCE'],
            value=initial_config['evaluation']['evaluators'],
            description='Evaluators:',
            disabled=False,
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['eval_providers'] = widgets.SelectMultiple(
            options=['openai', 'ionos', 'mistral', 'anthropic', 'groq', 'gemini'],
            value=initial_config['evaluation']['providers'],
            description='Providers:',
            disabled=False,
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['eval_metrics_map_app_type'] = widgets.Dropdown(
            options=['EXACT', 'TOKEN_BASED'],
            value=initial_config['evaluation']['metrics_map'].get('appointment_type', 'EXACT'),
            description='Appt. Type Metric:',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['eval_metrics_map_app_date'] = widgets.Dropdown(
            options=['EXACT', 'TOKEN_BASED'],
            value=initial_config['evaluation']['metrics_map'].get('appointment_date', 'TOKEN_BASED'),
            description='Appt. Date Metric:',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['eval_metrics_map_doctor_name'] = widgets.Dropdown(
            options=['EXACT', 'TOKEN_BASED'],
            value=initial_config['evaluation']['metrics_map'].get('doctor_name', 'TOKEN_BASED'),
            description='Doctor Name Metric:',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        eval_section_widgets = widgets.VBox([
            self.widgets['eval_evaluators'],
            self.widgets['eval_providers'],
            widgets.Label(value="Metrics Map:", layout=widgets.Layout(margin='10px 0 0 0')),
            self.widgets['eval_metrics_map_app_type'],
            self.widgets['eval_metrics_map_app_date'],
            self.widgets['eval_metrics_map_doctor_name']
        ])

        # --- Reference Data Section ---
        self.widgets['ref_data_path'] = widgets.Text(
            value=initial_config['reference_data']['path'],
            description='Path:',
            placeholder='e.g., conversation_script.json',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['ref_data_data'] = widgets.Textarea(
            value=json.dumps(initial_config['reference_data']['data'], indent=2),
            description='Data (JSON):',
            placeholder='Enter JSON object for reference data',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto', height='100px')
        )
        ref_data_section_widgets = widgets.VBox([
            self.widgets['ref_data_path'],
            self.widgets['ref_data_data']
        ])

        # --- Endpoint Section ---
        self.widgets['endpoint_name'] = widgets.Text(
            value=initial_config['endpoint']['name'],
            description='Name:',
            placeholder='e.g., example',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['endpoint_base_url'] = widgets.Text(
            value=initial_config['endpoint']['base_url'],
            description='Base URL:',
            placeholder='e.g., http://127.0.0.1:8000',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['endpoint_path'] = widgets.Text(
            value=initial_config['endpoint']['path'],
            description='Path:',
            placeholder='e.g., chat',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['endpoint_method'] = widgets.Dropdown(
            options=['POST', 'GET', 'PUT', 'DELETE'],
            value=initial_config['endpoint']['method'],
            description='Method:',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['endpoint_timeout'] = widgets.IntSlider(
            value=initial_config['endpoint']['timeout'],
            min=10, max=300, step=10,
            description='Timeout (s):',
            continuous_update=False,
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['endpoint_retry_count'] = widgets.IntSlider(
            value=initial_config['endpoint']['retry_count'],
            min=0, max=10, step=1,
            description='Retry Count:',
            continuous_update=False,
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )
        self.widgets['endpoint_retry_backoff'] = widgets.FloatSlider(
            value=initial_config['endpoint']['retry_backoff'],
            min=0.1, max=5.0, step=0.1,
            description='Retry Backoff:',
            continuous_update=False,
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto')
        )

        # Complex inputs for headers, request_schema, response_mapping
        self.widgets['endpoint_headers'] = widgets.Textarea(
            value=json.dumps(initial_config['endpoint']['headers'], indent=2),
            description='Headers (JSON):',
            placeholder='Enter JSON array of header objects',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto', height='100px')
        )
        self.widgets['endpoint_request_schema'] = widgets.Textarea(
            value=json.dumps(initial_config['endpoint']['request_schema'], indent=2),
            description='Request Schema (JSON):',
            placeholder='Enter JSON array of request schema objects',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto', height='100px')
        )
        self.widgets['endpoint_response_mapping'] = widgets.Textarea(
            value=json.dumps(initial_config['endpoint']['response_mapping'], indent=2),
            description='Response Mapping (JSON):',
            placeholder='Enter JSON array of response mapping objects',
            style={'description_width': 'initial', 'description_color': 'white', 'handle_color': 'blue'},
            layout=widgets.Layout(width='auto', height='100px')
        )

        endpoint_section_widgets = widgets.VBox([
            self.widgets['endpoint_name'],
            self.widgets['endpoint_base_url'],
            self.widgets['endpoint_path'],
            self.widgets['endpoint_method'],
            self.widgets['endpoint_timeout'],
            self.widgets['endpoint_retry_count'],
            self.widgets['endpoint_retry_backoff'],
            self.widgets['endpoint_headers'],
            self.widgets['endpoint_request_schema'],
            self.widgets['endpoint_response_mapping']
        ])

        # --- Repository Section ---
        self.widgets['repo_type'] = widgets.Dropdown(
            options=['FIRESTORE'],
            value=initial_config['repository']['type'],
            description='Type:',
            disabled=True, # Fixed for now
            layout=widgets.Layout(width='auto')
        )
        self.widgets['repo_project_id'] = widgets.Text(
            value=initial_config['repository']['project_id'],
            description='Project ID:',
            placeholder='Your Google Cloud Project ID',
            layout=widgets.Layout(width='auto')
        )
        self.widgets['repo_database_name'] = widgets.Text(
            value=initial_config['repository']['database_name'],
            description='Database Name:',
            placeholder='Your Firestore database name',
            layout=widgets.Layout(width='auto')
        )
        self.widgets['repo_source'] = widgets.Dropdown(
            options=['LOCAL', 'GCS', 'S3'],
            value=initial_config['repository']['source'],
            description='Source:',
            layout=widgets.Layout(width='auto')
        )
        repo_section_widgets = widgets.VBox([
            self.widgets['repo_type'],
            self.widgets['repo_project_id'],
            self.widgets['repo_database_name'],
            self.widgets['repo_source']
        ])

        # --- Main Accordion for sections ---
        self.accordion = widgets.Accordion(children=[
            process_section_widgets,
            eval_section_widgets,
            ref_data_section_widgets,
            endpoint_section_widgets,
            repo_section_widgets
        ])
        self.accordion.set_title(0, '1. Process Configuration')
        self.accordion.set_title(1, '2. Evaluation Configuration')
        self.accordion.set_title(2, '3. Reference Data')
        self.accordion.set_title(3, '4. Endpoint Configuration')
        self.accordion.set_title(4, '5. Repository Configuration')

        # --- Generate Button ---
        self.generate_button = widgets.Button(description='Generate YAML', button_style='success')
        self.generate_button.on_click(self._on_generate_button_clicked)

    def _on_generate_button_clicked(self, b):
        with self.output_area:
            self.output_area.clear_output()
            config = self.get_current_config()
            if config:
                generated_yaml = yaml.dump(config, sort_keys=False, indent=2)
                print(generated_yaml)

    def get_current_config(self):
        config = {
            'process': {
                'project_name': self.widgets['process_project_name'].value,
                'workflow_type': self.widgets['process_workflow_type'].value,
                'evaluation_params': {
                    'attempts': self.widgets['process_attempts'].value,
                }
            },
            'evaluation': {
                'evaluators': list(self.widgets['eval_evaluators'].value),
                'providers': list(self.widgets['eval_providers'].value),
                'metrics_map': {
                    'appointment_type': self.widgets['eval_metrics_map_app_type'].value,
                    'appointment_date': self.widgets['eval_metrics_map_app_date'].value,
                    'doctor_name': self.widgets['eval_metrics_map_doctor_name'].value
                }
            },
            'reference_data': {
                'path': self.widgets['ref_data_path'].value,
                'data': {}
            },
            'endpoint': {
                'name': self.widgets['endpoint_name'].value,
                'base_url': self.widgets['endpoint_base_url'].value,
                'path': self.widgets['endpoint_path'].value,
                'method': self.widgets['endpoint_method'].value,
                'timeout': self.widgets['endpoint_timeout'].value,
                'retry_count': self.widgets['endpoint_retry_count'].value,
                'retry_backoff': self.widgets['endpoint_retry_backoff'].value,
                'headers': [],
                'request_schema': [],
                'response_mapping': []
            },
            'repository': {
                'type': self.widgets['repo_type'].value,
                'project_id': self.widgets['repo_project_id'].value,
                'database_name': self.widgets['repo_database_name'].value,
                'source': self.widgets['repo_source'].value
            }
        }

        # Handle JSON string inputs for complex fields
        try:
            if self.widgets['ref_data_data'].value:
                config['reference_data']['data'] = json.loads(self.widgets['ref_data_data'].value)
        except json.JSONDecodeError:
            with self.output_area:
                print("Error: Invalid JSON in Reference Data (Data) field.")
            return None

        try:
            if self.widgets['endpoint_headers'].value:
                config['endpoint']['headers'] = json.loads(self.widgets['endpoint_headers'].value)
        except json.JSONDecodeError:
            with self.output_area:
                print("Error: Invalid JSON in Endpoint Headers field.")
            return None

        try:
            if self.widgets['endpoint_request_schema'].value:
                config['endpoint']['request_schema'] = json.loads(self.widgets['endpoint_request_schema'].value)
        except json.JSONDecodeError:
            with self.output_area:
                print("Error: Invalid JSON in Request Schema field.")
            return None

        try:
            if self.widgets['endpoint_response_mapping'].value:
                config['endpoint']['response_mapping'] = json.loads(self.widgets['endpoint_response_mapping'].value)
        except json.JSONDecodeError:
            with self.output_area:
                print("Error: Invalid JSON in Response Mapping field.")
            return None

        return config

    def display_ui(self):
        display(self.accordion, self.generate_button, self.output_area)

# Create an instance of the UI manager and display the UI
ui_manager = ConfigUIManager()
ui_manager.display_ui()
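
The metrics map in the UI offers `EXACT` and `TOKEN_BASED` for each metadata field. LevelApp's actual scoring code is not shown in this notebook, so the sketch below is only an illustration of what the two names suggest: strict string equality versus overlap between word tokens. Both function names and the Jaccard formula are assumptions, not LevelApp's implementation.

```python
def exact_score(expected: str, actual: str) -> float:
    """Return 1.0 only when the two strings are identical."""
    return 1.0 if expected == actual else 0.0


def token_overlap_score(expected: str, actual: str) -> float:
    """Jaccard overlap of lowercase, whitespace-separated tokens."""
    a, b = set(expected.lower().split()), set(actual.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as a perfect match
    return len(a & b) / len(a | b)


print(exact_score("SURGICAL", "SURGICAL"))  # 1.0
print(exact_score("Dr. Crocus", "Dr Crocus"))  # 0.0
# 2 shared tokens out of 3 distinct tokens, roughly 0.67
print(token_overlap_score("Dr. Tony Tony Chopper", "Dr. Chopper"))
```

Under these assumptions, a strict metric suits closed vocabularies such as `appointment_type`, while a token-level metric tolerates small wording differences in fields such as `doctor_name`.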

In [None]:
# You can generate the configuration as a dict and use it directly:
current_config_dict = ui_manager.get_current_config()

if current_config_dict:
    print("✅ Current UI configuration as a dictionary:")
    from IPython.display import display
    display(current_config_dict)
else:
    print("❌ Failed to retrieve configuration dictionary. Please check for any errors reported in the UI.")
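
Each `response_mapping` entry in the configuration pairs a `field_path` with an `extract_as` name. How LevelApp resolves these internally is not shown here. The sketch below assumes `field_path` is a dot-separated path into the JSON response and `extract_as` is the key the value is stored under. The helper `apply_response_mapping` is hypothetical.

```python
from typing import Any, Dict, List


def apply_response_mapping(
    response: Dict[str, Any], mapping: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Extract values from `response` according to field_path/extract_as rules."""
    extracted: Dict[str, Any] = {}
    for rule in mapping:
        node: Any = response
        # Walk the dot-separated path; a missing key yields None
        for key in rule["field_path"].split("."):
            node = node.get(key) if isinstance(node, dict) else None
        extracted[rule["extract_as"]] = node
    return extracted


mapping = [
    {"field_path": "reply_text", "extract_as": "agent_reply"},
    {"field_path": "metadata", "extract_as": "metadata"},
]
response = {"reply_text": "Booked.", "metadata": {"appointment_type": "SURGICAL"}}
print(apply_response_mapping(response, mapping))
```

With the default mapping, the chatbot's `reply_text` would be exposed as `agent_reply` and its `metadata` object passed through unchanged.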

In [None]:
#@title LevelApp Reference Data UI


import ipywidgets as widgets
from IPython.display import display, HTML
import json
import os
from collections import OrderedDict

class ConversationScriptUIManager:
    def __init__(self, json_file_path):
        self.json_file_path = json_file_path
        # Create the output area first: _load_json_data() prints its errors into it
        self.output_area = widgets.Output()
        self.data = self._load_json_data()

        if not self.data:
            self.data = [] # Initialize with empty list if file is empty or invalid

        self.scenario_dropdown = self._create_scenario_dropdown()
        self.add_scenario_button = widgets.Button(description="Add Scenario", button_style='info')
        self.delete_scenario_button = widgets.Button(description="Delete Selected Scenario", button_style='danger')
        self.save_button = widgets.Button(description="Save to JSON", button_style='success')

        self.scenario_name_input = widgets.Text(description="Scenario Name:", layout=widgets.Layout(width='auto'))
        self.interactions_container = widgets.VBox([]) # To hold interaction widgets

        self._wire_events()
        self._update_ui_for_scenario()

    def _load_json_data(self):
        if os.path.exists(self.json_file_path):
            try:
                with open(self.json_file_path, 'r') as f:
                    file_content = f.read().strip()
                    if not file_content: # Handle empty file explicitly
                        return [] # Return empty list if file is empty

                    raw_data = json.loads(file_content, object_pairs_hook=OrderedDict)

                    # Ensure raw_data is a list of dictionaries
                    if isinstance(raw_data, dict):
                        # A top-level dictionary may wrap the scenarios under 'scripts'
                        # (the key written by the save handler) or 'conversation_scripts'.
                        # Otherwise, assume it's a single scenario and wrap it.
                        for key in ('scripts', 'conversation_scripts'):
                            if isinstance(raw_data.get(key), list):
                                scenarios = raw_data[key]
                                break
                        else:
                            scenarios = [raw_data]
                    elif isinstance(raw_data, list):
                        scenarios = raw_data
                    else:
                        # If it's neither a list nor a dictionary (e.g., a simple string, number, boolean)
                        with self.output_area:
                            print(f"Error loading JSON file: Expected a list of scenarios or a single scenario object, but got a top-level {type(raw_data).__name__}.")
                        return []

                    # Now, process each scenario to ensure it's a dictionary and has required fields
                    processed_scenarios = []
                    for i, scenario in enumerate(scenarios):
                        if not isinstance(scenario, dict):
                            with self.output_area:
                                print(f"Warning: Skipping malformed scenario at index {i} (expected dictionary, got {type(scenario).__name__}). Content: {scenario}")
                            continue # Skip this malformed entry

                        # Create a mutable copy if necessary to add 'id'
                        # Using OrderedDict(scenario) ensures it's a dict and maintains order
                        current_scenario = OrderedDict(scenario)

                        if 'id' not in current_scenario:
                            current_scenario['id'] = self._generate_uuid()

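                        if 'scenario_name' not in current_scenario:
                            # Files written by the save handler store the name under 'description'
                            current_scenario['scenario_name'] = current_scenario.get('description') or f"Scenario {i+1}"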

                        if 'interactions' not in current_scenario or not isinstance(current_scenario['interactions'], list):
                            current_scenario['interactions'] = []

                        processed_interactions = []
                        for j, interaction in enumerate(current_scenario['interactions']):
                            if not isinstance(interaction, dict):
                                with self.output_area:
                                    print(f"Warning: Skipping malformed interaction at scenario '{current_scenario.get('scenario_name', f'index {i}')}', interaction {j}. Content: {interaction}")
                                continue

                            # Create a mutable copy for the interaction as well
                            current_interaction = OrderedDict(interaction)

                            if 'interaction_id' not in current_interaction:
                                current_interaction['interaction_id'] = self._generate_uuid()
                            if 'user_message' not in current_interaction:
                                current_interaction['user_message'] = ""
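                            if 'agent_reply' not in current_interaction:
                                # Files written by the save handler store the reply under 'reference_reply'
                                current_interaction['agent_reply'] = current_interaction.get('reference_reply', "")
                            if 'metadata' not in current_interaction:
                                # Files written by the save handler store the metadata under 'reference_metadata'
                                current_interaction['metadata'] = current_interaction.get('reference_metadata', {})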
                            processed_interactions.append(current_interaction)
                        current_scenario['interactions'] = processed_interactions
                        processed_scenarios.append(current_scenario)

                    return processed_scenarios

            except json.JSONDecodeError as e:
                with self.output_area:
                    print(f"Error loading JSON file: Invalid JSON format - {e}")
                return []
            except Exception as e:
                with self.output_area:
                    print(f"An unexpected error occurred while loading or processing JSON: {e}")
                return []
        return []

    def _generate_uuid(self):
        import uuid
        return str(uuid.uuid4())

    def _create_scenario_dropdown(self):
        options = [(s.get('scenario_name', f"Unnamed Scenario {i}"), s['id']) for i, s in enumerate(self.data)]
        if options:
            dropdown = widgets.Dropdown(
                options=options,
                value=options[0][1], # Select the first scenario by default
                description='Select Scenario:',
                disabled=False,
                layout=widgets.Layout(width='auto')
            )
        else:
            dropdown = widgets.Dropdown(
                options=[],
                description='Select Scenario:',
                disabled=True,
                layout=widgets.Layout(width='auto')
            )
        return dropdown

    def _update_scenario_dropdown(self):
        options = [(s.get('scenario_name', f"Unnamed Scenario {i}"), s['id']) for i, s in enumerate(self.data)]
        # Capture the selection first: assigning new options resets the dropdown to its first entry
        current_value = self.scenario_dropdown.value
        if options:
            self.scenario_dropdown.options = options
            # Keep the previously selected scenario if it still exists
            valid_ids = [opt[1] for opt in options]
            self.scenario_dropdown.value = current_value if current_value in valid_ids else options[0][1]
            self.scenario_dropdown.disabled = False
        else:
            self.scenario_dropdown.options = []
            self.scenario_dropdown.value = None
            self.scenario_dropdown.disabled = True
        self._update_ui_for_scenario()

    def _wire_events(self):
        self.scenario_dropdown.observe(self._on_scenario_selected, names='value')
        self.scenario_name_input.observe(self._on_scenario_name_change, names='value')
        self.add_scenario_button.on_click(self._on_add_scenario_button_clicked)
        self.delete_scenario_button.on_click(self._on_delete_scenario_button_clicked)
        self.save_button.on_click(self._on_save_button_clicked)

    def _get_current_scenario_index(self):
        if not self.scenario_dropdown.value:
            return -1
        for i, scenario in enumerate(self.data):
            if scenario['id'] == self.scenario_dropdown.value:
                return i
        return -1

    def _update_ui_for_scenario(self):
        with self.output_area:
            self.output_area.clear_output()

        scenario_idx = self._get_current_scenario_index()
        if scenario_idx == -1 or not self.data:
            self.scenario_name_input.value = ""
            self.scenario_name_input.disabled = True
            self.interactions_container.children = [widgets.HTML("<i>No scenario selected. Add a new scenario to begin.</i>")]
            return

        scenario = self.data[scenario_idx]
        self.scenario_name_input.value = scenario.get('scenario_name', '')
        self.scenario_name_input.disabled = False

        interaction_widgets = []
        for i, interaction in enumerate(scenario.get('interactions', [])):
            interaction_widgets.append(self._create_interaction_block(scenario_idx, i, interaction))

        add_interaction_button = widgets.Button(description="Add Interaction", button_style='primary', layout=widgets.Layout(width='auto'))
        add_interaction_button.on_click(lambda b: self._on_add_interaction_button_clicked(scenario_idx))
        interaction_widgets.append(add_interaction_button)

        self.interactions_container.children = tuple(interaction_widgets)

    def _create_interaction_block(self, scenario_idx, interaction_idx, interaction_data):
        user_msg_input = widgets.Textarea(
            value=interaction_data.get('user_message', ''),
            description=f'User Message {interaction_idx + 1}:',
            layout=widgets.Layout(width='auto', height='80px')
        )
        agent_reply_input = widgets.Textarea(
            value=interaction_data.get('agent_reply', ''),
            description=f'Agent Reply {interaction_idx + 1}:',
            layout=widgets.Layout(width='auto', height='80px')
        )
        metadata_input = widgets.Textarea(
            value=json.dumps(interaction_data.get('metadata', {}), indent=2),
            description=f'Metadata {interaction_idx + 1} (JSON):',
            layout=widgets.Layout(width='auto', height='120px')
        )

        user_msg_input.tag = ('user_message', scenario_idx, interaction_idx)
        agent_reply_input.tag = ('agent_reply', scenario_idx, interaction_idx)
        metadata_input.tag = ('metadata', scenario_idx, interaction_idx)

        user_msg_input.observe(self._on_interaction_field_change, names='value')
        agent_reply_input.observe(self._on_interaction_field_change, names='value')
        metadata_input.observe(self._on_interaction_field_change, names='value')

        delete_interaction_button = widgets.Button(
            description=f"Delete Interaction {interaction_idx + 1}",
            button_style='warning',
            layout=widgets.Layout(width='auto')
        )
        delete_interaction_button.on_click(
            lambda b, s_idx=scenario_idx, i_idx=interaction_idx: self._on_delete_interaction_button_clicked(s_idx, i_idx)
        )

        return widgets.VBox([
            widgets.HTML(f"<h4>Interaction {interaction_idx + 1}</h4>"),
            user_msg_input,
            agent_reply_input,
            metadata_input,
            delete_interaction_button,
            widgets.HTML(value='<hr>'), # A visual separator
        ], layout=widgets.Layout(border='1px solid lightgray', padding='10px', margin='5px 0'))

    def _on_scenario_selected(self, change):
        if change['new'] is not None:
            self._update_ui_for_scenario()

    def _on_scenario_name_change(self, change):
        scenario_idx = self._get_current_scenario_index()
        if scenario_idx != -1:
            self.data[scenario_idx]['scenario_name'] = change['new']
            self._update_scenario_dropdown() # Update dropdown to reflect new name
            # Preserve selection after update
            self.scenario_dropdown.value = self.data[scenario_idx]['id']


    def _on_interaction_field_change(self, change):
        field_type, scenario_idx, interaction_idx = change.owner.tag
        if scenario_idx != -1 and interaction_idx < len(self.data[scenario_idx]['interactions']):
            if field_type == 'metadata':
                try:
                    self.data[scenario_idx]['interactions'][interaction_idx][field_type] = json.loads(change['new'])
                except json.JSONDecodeError:
                    with self.output_area:
                        print(f"Invalid JSON for metadata in Interaction {interaction_idx + 1}. Please correct it.")
            else:
                self.data[scenario_idx]['interactions'][interaction_idx][field_type] = change['new']

    def _on_add_scenario_button_clicked(self, b):
        new_scenario_id = self._generate_uuid()
        new_scenario_name = f"New Scenario {len(self.data) + 1}"
        new_scenario = OrderedDict([
            ("id", new_scenario_id),
            ("scenario_name", new_scenario_name),
            ("interactions", [])
        ])
        self.data.append(new_scenario)
        self._update_scenario_dropdown()
        self.scenario_dropdown.value = new_scenario_id # Select the newly added scenario

    def _on_delete_scenario_button_clicked(self, b):
        scenario_idx = self._get_current_scenario_index()
        if scenario_idx != -1:
            with self.output_area:
                print(f"Deleting scenario: {self.data[scenario_idx]['scenario_name']}")
            del self.data[scenario_idx]
            self._update_scenario_dropdown()
            if self.data:
                self.scenario_dropdown.value = self.data[0]['id'] # Select first scenario if available
            else:
                self.scenario_dropdown.value = None # No scenarios left

    def _on_add_interaction_button_clicked(self, scenario_idx):
        if scenario_idx != -1:
            new_interaction = OrderedDict([
                ("interaction_id", self._generate_uuid()),
                ("user_message", ""),
                ("agent_reply", ""),
                ("metadata", {})
            ])
            self.data[scenario_idx]['interactions'].append(new_interaction)
            self._update_ui_for_scenario() # Re-render to show new interaction

    def _on_delete_interaction_button_clicked(self, scenario_idx, interaction_idx):
        if scenario_idx != -1 and interaction_idx < len(self.data[scenario_idx]['interactions']):
            with self.output_area:
                print(f"Deleting interaction {interaction_idx + 1} from scenario: {self.data[scenario_idx]['scenario_name']}")
            del self.data[scenario_idx]['interactions'][interaction_idx]
            self._update_ui_for_scenario() # Re-render to reflect deletion

    def _on_save_button_clicked(self, b):
        try:
            cleaned_data = []
            for scenario in self.data:
                cleaned_scenario = OrderedDict()
                cleaned_scenario["description"] = scenario.get("scenario_name", "")
                cleaned_scenario["details"] = {"context": "Medical chatbot"}  # Default or customizable
                cleaned_interactions = []
                for interaction in scenario.get("interactions", []):
                    cleaned_interaction = OrderedDict()
                    cleaned_interaction["user_message_path"] = ""  # Default or customizable
                    cleaned_interaction["user_message"] = interaction.get("user_message", "")
                    cleaned_interaction["reference_reply"] = interaction.get("agent_reply", "")
                    cleaned_interaction["interaction_type"] = "initial"  # Default or customizable
                    cleaned_interaction["reference_metadata"] = interaction.get("metadata", {})
                    cleaned_interaction["guardrail_flag"] = False  # Default or customizable
                    cleaned_interactions.append(cleaned_interaction)
                cleaned_scenario["interactions"] = cleaned_interactions
                cleaned_data.append(cleaned_scenario)
            final_output_data = {"scripts": cleaned_data}
            with open(self.json_file_path, 'w') as f:
                json.dump(final_output_data, f, indent=2)
            with self.output_area:
                print(f"✅ Conversation script saved to '{self.json_file_path}' successfully!")
        except Exception as e:
            with self.output_area:
                print(f"❌ Error saving JSON file: {e}")


    def display_ui(self):
        scenario_controls = widgets.HBox([
            self.scenario_dropdown,
            self.add_scenario_button,
            self.delete_scenario_button
        ])

        top_level_ui = widgets.VBox([
            widgets.HTML("<h2>Conversation Script Editor</h2>"),
            scenario_controls,
            self.scenario_name_input,
            widgets.HTML("<h3>Interactions:</h3>"),
            self.interactions_container,
            self.save_button,
            self.output_area
        ])
        display(top_level_ui)

# Instantiate and display the UI manager.
# A separate name keeps the configuration UI's `ui_manager` from being overwritten.
json_file_to_edit = '/content/levelapp-examples/conversation-simulator/conversation_script.json'
script_ui_manager = ConversationScriptUIManager(json_file_to_edit)
script_ui_manager.display_ui()

In [None]:
import json

json_file_path = '/content/levelapp-examples/conversation-simulator/conversation_script.json'

try:
    with open(json_file_path, 'r') as f:
        loaded_data = json.load(f)
    print(f"✅ Successfully loaded data from '{json_file_path}':")
    from IPython.display import display
    display(loaded_data)
except FileNotFoundError:
    print(f"❌ Error: The file '{json_file_path}' was not found. Please ensure you have saved the conversation script using the UI.")
except json.JSONDecodeError as e:
    print(f"❌ Error decoding JSON from '{json_file_path}': {e}")
except Exception as e:
    print(f"❌ An unexpected error occurred: {e}")

In [None]:
import json
from levelapp.workflow import WorkflowConfig
from levelapp.core.session import EvaluationSession

# Load workflow configuration from YAML
# config = WorkflowConfig.load(path=workflow_config_path)

# Load workflow configuration from UI generated dict
config = WorkflowConfig.from_dict(content=current_config_dict)

# Load reference conversation data from the JSON file saved by the script editor UI
json_file_path = '/content/levelapp-examples/conversation-simulator/conversation_script.json'
with open(json_file_path, 'r') as f:
    reference_data = json.load(f)

# Set the reference data loaded from the JSON file
config.set_reference_data(content=reference_data)

print("✅ LevelApp configuration loaded successfully!")

## 4.3 Configure API Keys for Evaluators

LevelApp uses AI "judges" to evaluate chatbot responses. We need to provide API keys for these evaluators.

### Required:
- **OpenAI API Key**: Already configured in Colab secrets

### Optional:
- **IONOS API Key**: For additional evaluation (can be skipped)

If you have IONOS credentials, add them to Colab secrets:
- `IONOS_API_KEY`
- `IONOS_BASE_URL`
- `IONOS_MODEL_ID`
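
Before creating the session, it can help to confirm which credentials are actually available. The sketch below is a minimal, hypothetical helper (`missing_credentials` is not part of LevelApp) that reports which of the variable names listed above are absent or empty in an environment mapping. Grouping the names into "required" and "optional" tuples is an assumption made for illustration.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from this section; the grouping is an illustrative assumption.
REQUIRED_VARS = ("OPENAI_API_KEY",)
OPTIONAL_IONOS_VARS = ("IONOS_API_KEY", "IONOS_BASE_URL", "IONOS_MODEL_ID")


def missing_credentials(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are absent or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


# Report on the current process environment
print("Missing required variables:", missing_credentials(REQUIRED_VARS))
print("Missing optional IONOS variables:", missing_credentials(OPTIONAL_IONOS_VARS))
```

An empty list for the required group means the evaluators that depend on it should be able to authenticate; missing optional names only mean the corresponding judge is unavailable.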

In [None]:
import os
from google.colab import userdata
from levelapp.core.session import EvaluationSession

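# Set API keys for the LLM providers
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')  # Required

# Mistral and Groq keys are copied only when the secret exists,
# so a missing secret does not abort the cell
for key_name in ('MISTRAL_API_KEY', 'GROQ_API_KEY'):
  try:
    os.environ[key_name] = userdata.get(key_name)
  except Exception:
    print(f"ℹ️  {key_name} not found in Colab secrets; skipping")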

# Set the models used for each provider
os.environ['OPENAI_MODEL'] = "gpt-4o-mini"
os.environ['GROK_MODEL'] = "llama-3.3-70b-versatile"
os.environ['GEMINI_MODEL'] = "gemini-2.5-flash"


# Set IONOS credentials if available (optional)
try:
  if userdata.get('IONOS_API_KEY') is not None:
    os.environ['IONOS_API_KEY'] = userdata.get('IONOS_API_KEY')
    os.environ['IONOS_BASE_URL'] = userdata.get('IONOS_BASE_URL')
    os.environ['IONOS_MODEL_ID'] = userdata.get('IONOS_MODEL_ID')
    print("✅ IONOS credentials configured")
except Exception as e:
  print("ℹ️  IONOS credentials not available (optional)")

# Create an evaluation session
evaluation_session = EvaluationSession(
    session_name="dental-chatbot-evaluation",
    workflow_config=config,
    enable_monitoring=True  # Enable performance monitoring
)

print("✅ Evaluation session created successfully!")

---
# 🧪 Step 5: Run the Evaluation

Time to test our chatbot! LevelApp will:
1. Send test messages to the chatbot
2. Compare responses against expected outputs
3. Evaluate quality using AI judges
4. Generate detailed metrics and reports

## 5.1 Run Connectivity Test

First, let's verify LevelApp can communicate with our chatbot.

In [None]:
import nest_asyncio
nest_asyncio.apply()  # Required for async operations in Jupyter

with evaluation_session as session:
  # Test connectivity with a simple message
  connectivity_test = session.run_connectivity_test(
      context={"user_message": "Hello, how can I help you?"}
  )

  if connectivity_test['success']:
    print("✅ Connectivity test PASSED")
    print(f"   Status code: {connectivity_test['status_code']}")
    print(f"   Agent reply: {connectivity_test['extracted_data']['agent_reply']}")
  else:
    print("❌ Connectivity test FAILED")
    print(f"   Error: {connectivity_test}")

## 5.2 Run Full Evaluation

Now let's run the complete evaluation with our test conversation scripts.

**What happens during evaluation:**
1. 📤 LevelApp sends each message from the conversation script
2. 🤖 Your chatbot generates a response
3. 🔍 AI judges compare the response to the expected output
4. 📊 Scores are calculated (0-3 scale: Poor, Fair, Good, Excellent)
5. 📈 Metrics are aggregated across all interactions
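
One way to picture the aggregation in the last step is a plain mean per evaluator. The sketch below assumes exactly that, and it assumes a hypothetical input shape (a list of dictionaries mapping an evaluator name to its score for one interaction). It illustrates the idea only and is not LevelApp's internal implementation.

```python
from collections import defaultdict
from typing import Dict, List


def average_by_evaluator(interaction_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each evaluator's scores across interactions, skipping missing entries."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for scores in interaction_scores:
        for evaluator, score in scores.items():
            totals[evaluator] += score
            counts[evaluator] += 1
    return {evaluator: totals[evaluator] / counts[evaluator] for evaluator in totals}


# Two interactions scored by two hypothetical judges on the 0-3 scale
example = [
    {"openai": 3, "ionos": 2},
    {"openai": 2, "ionos": 2},
]
print(average_by_evaluator(example))
```

Because each evaluator is averaged over the interactions it actually scored, a judge that skipped an interaction is not penalized with an implicit zero.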

In [None]:
with evaluation_session as session:
  print("🚀 Starting evaluation...\n")

  # Run the evaluation
  session.run()

  # Collect results
  results = session.workflow.collect_results()

In [None]:
print(results)

In [None]:
import json

# Parse the JSON string only once so this cell can be re-run safely
if isinstance(results, str):
  results = json.loads(results)
print("\n" + "="*60)
print("📊 EVALUATION RESULTS")
print("="*60)

# Display average scores
avg_scores = results['average_scores']
print("\n📈 Average Scores:")
for metric, score in avg_scores.items():
  if metric != 'processing_time':
    print(f"   {metric.capitalize()}: {score:.2f}")
  else:
    print(f"   Processing Time: {score:.2f}s")

# Display evaluation summary
print("\n📝 Evaluation Summary:")
for judge, feedback in results['evaluation_summary'].items():
  print(f"\n   {judge.upper()} Judge:")
  for comment in feedback:
    print(f"   • {comment}")

# Get detailed statistics
stats = session.get_stats()
print("\n" + "="*60)
print("📊 SESSION STATISTICS")
print("="*60)
print(f"\nSession: {stats['session']['name']}")
print(f"Duration: {stats['session']['duration']}")
print(f"Total Steps: {stats['session']['steps']}")
print(f"Errors: {stats['session']['errors']}")

print("\n✅ Evaluation completed successfully!")

---
# 📊 Understanding the Results

## Scoring System

LevelApp uses a **0-3 scale** for evaluation:
- **3 (Excellent)**: Response matches the expected output exactly or is semantically equivalent
- **2 (Good)**: Minor differences but covers all key points
- **1 (Fair)**: Missing some information or has inaccuracies
- **0 (Poor)**: Significant errors or completely wrong
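
The scale above can be turned into a small lookup for reporting. This is a minimal sketch: `score_label` is a hypothetical helper, and the choice to label a fractional average by rounding down to the band it has fully reached is an assumption, not something LevelApp prescribes.

```python
SCORE_LABELS = {0: "Poor", 1: "Fair", 2: "Good", 3: "Excellent"}


def score_label(score: float) -> str:
    """Map a score on the 0-3 scale to its label, rounding fractional averages down."""
    if not 0 <= score <= 3:
        raise ValueError(f"Score {score} is outside the 0-3 scale")
    return SCORE_LABELS[int(score)]


for value in (3, 2.5, 1, 0):
    print(value, "->", score_label(value))
```

Rounding down is the conservative option: an average of 2.5 is reported as "Good" rather than "Excellent".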

## Key Metrics

- **Judge Scores**: Evaluation from different AI judges (OpenAI, IONOS)
- **Guardrail Flag**: Binary check if response is safe/appropriate (1=pass, 0=fail)
- **Metadata Score**: Accuracy of structured data (appointment type, date, doctor)
- **Processing Time**: How long the chatbot took to respond

## What to Look For

✅ **Good signs:**
- Average scores ≥ 2.5
- Guardrail flag = 1.0
- Metadata scores = 1.0 for booking interactions
- Consistent scores across different judges

⚠️ **Warning signs:**
- Scores < 2.0
- Guardrail failures
- Metadata mismatches
- High variance between judges
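
The checks above can be applied mechanically to a dictionary of average scores. The sketch below is a hypothetical helper, not part of LevelApp. It assumes the non-judge keys are named `guardrail`, `metadata`, and `processing_time`, and it uses an arbitrary spread threshold of 1.0 point to decide when judges disagree.

```python
from typing import Dict, List

# Keys assumed to hold non-judge metrics; every other key is treated as a judge score.
NON_JUDGE_KEYS = {"guardrail", "metadata", "processing_time"}


def find_warning_signs(avg_scores: Dict[str, float], max_judge_spread: float = 1.0) -> List[str]:
    """Return human-readable warnings for the warning signs listed above."""
    warnings: List[str] = []
    judge_scores = {k: v for k, v in avg_scores.items() if k not in NON_JUDGE_KEYS}

    for judge, score in judge_scores.items():
        if score < 2.0:
            warnings.append(f"{judge}: average score {score:.2f} is below 2.0")
    if avg_scores.get("guardrail", 1.0) < 1.0:
        warnings.append("guardrail: at least one interaction failed the check")
    if avg_scores.get("metadata", 1.0) < 1.0:
        warnings.append("metadata: generated metadata does not fully match the reference")
    if judge_scores and max(judge_scores.values()) - min(judge_scores.values()) > max_judge_spread:
        warnings.append("judges: scores differ by more than the allowed spread")
    return warnings


print(find_warning_signs({"openai": 2.8, "ionos": 1.5, "guardrail": 1.0, "metadata": 0.5}))
```

An empty list means none of the warning signs were triggered; it does not by itself confirm that all of the good signs hold, such as averages of at least 2.5.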

In [None]:
from typing import List, Dict, Any

from pydantic import BaseModel

# Result of a single user/agent exchange
class InteractionResult(BaseModel):
  conversation_id: str
  user_message: str
  generated_reply: str
  reference_reply: str
  generated_metadata: Dict[str, Any]
  reference_metadata: Dict[str, Any]
  guardrail_detail: bool = False
  evaluation_results: Dict[str, Any]

# Results of one attempt at running a script
class AttemptResults(BaseModel):
  attempt: int
  attempt_id: str
  script_id: str
  total_duration: float
  interaction_results: List[InteractionResult]
  evaluation_verdicts: Dict[str, Any]
  average_scores: Dict[str, Any]

# Results for one script across all of its attempts
# (renamed so it no longer shadows InteractionResult above)
class ScriptResults(BaseModel):
  script_id: str
  attempts: List[AttemptResults]
  average_scores: Dict[str, Any]

# Top-level results for the whole simulation
class SimulationResults(BaseModel):
  started_at: str
  finished_at: str
  evaluation_summary: Dict[str, Any]
  average_scores: Dict[str, Any]
  interaction_results: List[ScriptResults]
  batch_id: str
  elapsed_time: float

sim_results = SimulationResults.model_validate(results)

In [None]:
print(f"Number of scripts: {len(sim_results.interaction_results)}")
print(f"Number of attempts: {len(sim_results.interaction_results[0].attempts)}")

In [None]:
import pandas as pd

# Prepare a list to hold all the extracted average scores
all_attempt_scores = []

# Iterate through each interaction result in the simulation results
for interaction_result in sim_results.interaction_results:
    script_id = interaction_result.script_id

    # Iterate through each attempt within the current interaction result
    for attempt in interaction_result.attempts:
        attempt_id = attempt.attempt_id
        average_scores = attempt.average_scores

        # Create a dictionary for the current attempt's scores
        attempt_data = {
            'script_id': script_id,
            'attempt_id': attempt_id
        }
        attempt_data.update(average_scores) # Add all average scores

        # Append the dictionary to our list
        all_attempt_scores.append(attempt_data)

# Create a pandas DataFrame from the collected data
scores_df = pd.DataFrame(all_attempt_scores)

# Display the DataFrame
display(scores_df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Identify actual judge columns present in the scores_df
# Exclude 'script_id', 'attempt_id', 'guardrail', 'metadata', and 'processing_time'
available_judges = [col for col in scores_df.columns if col not in ['script_id', 'attempt_id', 'guardrail', 'metadata', 'processing_time']]

# Melt the scores_df to have 'provider' and 'score' columns
scores_melted = scores_df.melt(id_vars=['script_id', 'attempt_id'],
                               value_vars=available_judges,
                               var_name='provider',
                               value_name='score')

# Visualize the scores
plt.figure(figsize=(10, 6))
sns.barplot(data=scores_melted, x='script_id', y='score', hue='provider', palette='viridis')

plt.title('Average Judge Scores per Script and Attempt')
plt.xlabel('Script ID')
plt.ylabel('Average Score')
plt.ylim(0, 3.5) # Scores are on a 0-3 scale
plt.xticks(rotation=0, ha='right')
plt.legend(title='Evaluator')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

display(scores_melted)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure sim_results is a dictionary (from previous steps if not yet validated into Pydantic model)
if isinstance(sim_results, SimulationResults):
    sim_results_dict = sim_results.model_dump()
else:
    sim_results_dict = sim_results

all_interaction_scores = []

# Loop through each script's results
for script_data in sim_results_dict['interaction_results']:
    script_id = script_data['script_id']

    # Loop through each attempt within the script
    for attempt_data in script_data['attempts']:
        attempt_id = attempt_data['attempt_id']

        # Loop through each individual interaction within the attempt
        for i, interaction in enumerate(attempt_data['interaction_results']):
            conversation_id = interaction['conversation_id']
            user_message = interaction['user_message']

            eval_results = interaction['evaluation_results']

            # Extract judge evaluations (e.g., openai, ionos)
            for judge, details in eval_results['judge_evaluations'].items():
                score = details['score']
                # Normalize judge scores from 0-3 to percentage
                percentage_score = (score / 3.0) * 100
                all_interaction_scores.append({
                    'script_id': script_id,
                    'attempt_id': attempt_id,
                    'conversation_id': conversation_id,
                    'interaction_index': i,
                    'user_message': user_message,
                    'evaluator': judge,
                    'score': score,
                    'percentage_score': percentage_score
                })

            # Extract metadata evaluation
            metadata_scores = eval_results['metadata_evaluation']
            if metadata_scores: # Only add if metadata evaluation was performed and has scores
                # Calculate the average metadata score for this interaction
                avg_metadata_score = sum(metadata_scores.values()) / len(metadata_scores)
                # Normalize metadata scores from 0-1 to percentage
                percentage_score = avg_metadata_score * 100
                all_interaction_scores.append({
                    'script_id': script_id,
                    'attempt_id': attempt_id,
                    'conversation_id': conversation_id,
                    'interaction_index': i,
                    'user_message': user_message,
                    'evaluator': 'metadata',
                    'score': avg_metadata_score,
                    'percentage_score': percentage_score
                })

            # Extract guardrail flag
            # Guardrail flag is typically 1 (pass) or 0 (fail)
            guardrail_score = eval_results['guardrail_flag']
            # Normalize guardrail scores from 0-1 to percentage
            percentage_score = guardrail_score * 100
            all_interaction_scores.append({
                'script_id': script_id,
                'attempt_id': attempt_id,
                'conversation_id': conversation_id,
                'interaction_index': i,
                'user_message': user_message,
                'evaluator': 'guardrail',
                'score': guardrail_score,
                'percentage_score': percentage_score
            })

# Create DataFrame from the collected scores
scores_df_detailed = pd.DataFrame(all_interaction_scores)

# Visualize the scores as percentages
plt.figure(figsize=(14, 7))
sns.barplot(data=scores_df_detailed, x='interaction_index', y='percentage_score', hue='evaluator', palette='viridis')

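# The bar plot averages over every script and attempt in the DataFrame,
# so the title does not name a single script or attempt
plt.title('Evaluation Scores per Interaction (averaged across scripts and attempts)')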
plt.xlabel('Interaction Index (0-based)')
plt.ylabel('Score (%)')
plt.xticks(rotation=0)
plt.ylim(0, 105) # Set y-axis limit to 105% for better visualization
plt.legend(title='Evaluator')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

display(scores_df_detailed)

In [None]:
import pandas as pd

# Prepare a list to hold all the extracted token data
all_token_data = []

# Iterate through each interaction result in the simulation results
for interaction_result in sim_results.interaction_results:
    script_id = interaction_result.script_id

    # Iterate through each attempt within the current interaction result
    for attempt in interaction_result.attempts:
        attempt_id = attempt.attempt_id

        # Iterate through each interaction within the attempt
        for interaction in attempt.interaction_results:
            # Check for judge evaluations and extract token data
            if 'judge_evaluations' in interaction.evaluation_results:
                for provider, eval_data in interaction.evaluation_results['judge_evaluations'].items():
                    if 'metadata' in eval_data:
                        metadata = eval_data['metadata']
                        token_data = {
                            'script_id': script_id,
                            'attempt_id': attempt_id,
                            'provider': provider,
                            'input_tokens': metadata.get('input_tokens'),
                            'output_tokens': metadata.get('output_tokens')
                        }
                        all_token_data.append(token_data)

# Create a pandas DataFrame from the collected data
token_df = pd.DataFrame(all_token_data)

# Display the DataFrame
display(token_df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate total tokens
token_df['total_tokens'] = token_df['input_tokens'] + token_df['output_tokens']

# Group by script_id, attempt_id, and provider to sum tokens
grouped_tokens = token_df.groupby(['script_id', 'attempt_id', 'provider'])['total_tokens'].sum().reset_index()

# Set up the plot
plt.figure(figsize=(14, 8))
sns.barplot(data=grouped_tokens, x='script_id', y='total_tokens', hue='provider', palette='viridis')

plt.title('Total Tokens per Provider by Script and Attempt')
plt.xlabel('Script ID (and Attempt ID implicit)')
plt.ylabel('Total Tokens')
plt.xticks(rotation=0, ha='right')
plt.legend(title='Provider')
plt.tight_layout()
plt.show()