# 03. Observability & Tracing

This notebook explores TensorZero's observability features:
- Understanding inference data stored in ClickHouse
- Using the TensorZero UI for monitoring
- Implementing feedback loops
- Analyzing performance metrics
- Testing structured outputs with advanced models

In [1]:
import os
import json
import time
import pandas as pd
from datetime import datetime, timedelta
from tensorzero import TensorZeroGateway
from dotenv import load_dotenv
import httpx

# Load environment variables
load_dotenv()

# Initialize an HTTP gateway client via `build_http`
client = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")
print("✅ Connected to TensorZero gateway")
print("🌐 TensorZero UI: http://localhost:4000")
print("📊 ClickHouse: http://localhost:8123")

✅ Connected to TensorZero gateway
🌐 TensorZero UI: http://localhost:4000
📊 ClickHouse: http://localhost:8123


## 1. Generate Test Data

Let's generate inference data across three providers so there is something to explore.

In [2]:
# Test scenarios for observability
test_prompts = [
    "What is machine learning?",
    "Explain quantum computing briefly.",
    "How does blockchain work?",
    "What is cloud computing?",
    "Describe artificial intelligence."
]

# Providers to test
providers_to_test = [
    ("gpt4_mini", "OpenAI GPT-4o Mini"),
    ("claude3_haiku", "Anthropic Claude 3 Haiku"),
    ("grok3_mini", "xAI Grok-3 Mini"),
]

# Generate test data
inference_ids = []

print("🧪 Generating test data...")
for prompt in test_prompts:
    for variant, provider_name in providers_to_test:
        try:
            response = client.inference(
                function_name="chat",
                variant_name=variant,
                input={
                    "messages": [
                        {"role": "user", "content": prompt}
                    ]
                }
            )
            
            inference_ids.append({
                "inference_id": response.inference_id,
                "variant": variant,
                "provider": provider_name,
                "prompt": prompt,
                "timestamp": datetime.now().isoformat()
            })
            
            print(f"✅ {provider_name}: {prompt[:30]}... - ID: {response.inference_id}")
            
        except Exception as e:
            print(f"❌ {provider_name}: {prompt[:30]}... - Error: {str(e)[:50]}")

print(f"\n📊 Generated {len(inference_ids)} successful inferences")

🧪 Generating test data...
✅ OpenAI GPT-4o Mini: What is machine learning?... - ID: 0198f224-4400-72d2-a279-a6638ad1670f
✅ Anthropic Claude 3 Haiku: What is machine learning?... - ID: 0198f224-6099-7e91-8ee1-4cb1002ae8ba
✅ xAI Grok-3 Mini: What is machine learning?... - ID: 0198f224-6acb-7c43-a7c7-d92b78b370a8
✅ OpenAI GPT-4o Mini: Explain quantum computing brie... - ID: 0198f224-82f4-7b13-a3a6-95dd31fb2a4e
✅ Anthropic Claude 3 Haiku: Explain quantum computing brie... - ID: 0198f224-8fd7-76b3-a735-93954d1cde4e
✅ xAI Grok-3 Mini: Explain quantum computing brie... - ID: 0198f224-9cc4-7402-b4bf-20c44a0ceb2a
✅ OpenAI GPT-4o Mini: How does blockchain work?... - ID: 0198f224-b204-7623-87e7-2021c9df4110
✅ Anthropic Claude 3 Haiku: How does blockchain work?... - ID: 0198f224-dde0-7380-90e8-79abedf9df55
✅ xAI Grok-3 Mini: How does blockchain work?... - ID: 0198f224-e994-7063-8c84-00a947fa134d
✅ OpenAI GPT-4o Mini: What is cloud computing?... - ID: 0198f225-10dc-7083-a4e8-5959cf7a3701
✅ Anthropic
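
The inference IDs printed above carry a `7` in the version position, which suggests they are UUIDv7 values. If so, the creation time can be read from the ID itself instead of relying on the client-side `datetime.now()` stored in the loop. A minimal sketch using only the standard library; `uuid7_timestamp` is a hypothetical helper name, not part of the TensorZero client.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def uuid7_timestamp(inference_id: str) -> datetime:
    """Return the UTC creation time embedded in a UUIDv7 string."""
    parsed = uuid.UUID(inference_id)
    if parsed.version != 7:
        raise ValueError(f"expected a UUIDv7, got version {parsed.version}")
    # The top 48 bits of a UUIDv7 hold milliseconds since the Unix epoch.
    millis = parsed.int >> 80
    return EPOCH + timedelta(milliseconds=millis)

first_id = "0198f224-4400-72d2-a279-a6638ad1670f"  # first ID printed above
print(uuid7_timestamp(first_id).isoformat())
```

The result is in UTC, so it will differ from the local-time `timestamp` values recorded in `inference_ids` by the machine's UTC offset.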

## 2. Collecting Feedback

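TensorZero lets you attach feedback to an inference by its ID. The cell below submits a float metric, a boolean metric, and a free-text comment for a random sample of the inferences generated above.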

In [3]:
# Collect feedback on some inferences
import random

print("📝 Collecting feedback on inferences...")

# Sample some inference IDs for feedback
feedback_samples = random.sample(inference_ids, min(5, len(inference_ids)))

for sample in feedback_samples:
    # Simulate different types of feedback
    feedback_score = random.uniform(0.5, 1.0)
    helpful = feedback_score > 0.7
    
    try:
        # Submit rating feedback (float metric: 0.0-1.0)
        rating_response = client.feedback(
            metric_name="user_rating",
            inference_id=sample["inference_id"],
            value=feedback_score
        )
        
        # Submit helpful feedback (boolean metric)
        helpful_response = client.feedback(
            metric_name="helpful", 
            inference_id=sample["inference_id"],
            value=helpful
        )
        
        # Submit comment feedback (built-in metric - no explicit config needed)
        comment = f"Generated by {sample['provider']} - {'Great response!' if helpful else 'Could be better'}"
        comment_response = client.feedback(
            metric_name="comment",
            inference_id=sample["inference_id"],
            value=comment
        )
        
        print(f"✅ Feedback for {sample['provider']}: Score={feedback_score:.2f}, Helpful={helpful}")
        print(f"   Rating ID: {rating_response.feedback_id}")
        print(f"   Comment: {comment[:50]}...")
        
    except Exception as e:
        print(f"❌ Failed to submit feedback: {e}")

print(f"\n💡 View feedback in TensorZero UI: http://localhost:4000")

📝 Collecting feedback on inferences...
✅ Feedback for xAI Grok-3 Mini: Score=0.57, Helpful=False
   Rating ID: {'feedback_id': '0198f225-a6d4-79b3-9f41-4d554f966d1d'}
   Comment: Generated by xAI Grok-3 Mini - Could be better...
✅ Feedback for Anthropic Claude 3 Haiku: Score=0.71, Helpful=True
   Rating ID: {'feedback_id': '0198f225-a6e2-72f2-8f42-b1c57a6e6f57'}
   Comment: Generated by Anthropic Claude 3 Haiku - Great resp...
✅ Feedback for xAI Grok-3 Mini: Score=0.66, Helpful=False
   Rating ID: {'feedback_id': '0198f225-a6e9-77c1-8ed1-db8571ebdc61'}
   Comment: Generated by xAI Grok-3 Mini - Could be better...
✅ Feedback for OpenAI GPT-4o Mini: Score=0.67, Helpful=False
   Rating ID: {'feedback_id': '0198f225-a6f0-73f0-8cf7-aafe14fb33d4'}
   Comment: Generated by OpenAI GPT-4o Mini - Could be better...
✅ Feedback for xAI Grok-3 Mini: Score=0.90, Helpful=True
   Rating ID: {'feedback_id': '0198f225-a6f6-7190-ae39-dbcad2eaeefc'}
   Comment: Generated by xAI Grok-3 Mini - Great respons
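
The loop above mixes payload construction with network calls, which makes it hard to check without a running gateway. A minimal sketch that separates the two; `build_feedback` is a hypothetical helper, and the metric names `user_rating` and `helpful` are assumed to be defined in `tensorzero.toml` as they are for the cell above.

```python
from typing import Any

def build_feedback(sample: dict[str, Any], score: float, threshold: float = 0.7) -> list[dict[str, Any]]:
    """Turn one score into the three feedback payloads used in the loop above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    helpful = score > threshold
    verdict = "Great response!" if helpful else "Could be better"
    inference_id = sample["inference_id"]
    return [
        {"metric_name": "user_rating", "inference_id": inference_id, "value": score},
        {"metric_name": "helpful", "inference_id": inference_id, "value": helpful},
        {
            "metric_name": "comment",
            "inference_id": inference_id,
            "value": f"Generated by {sample['provider']} - {verdict}",
        },
    ]

sample = {"inference_id": "0198f224-4400-72d2-a279-a6638ad1670f", "provider": "OpenAI GPT-4o Mini"}
payloads = build_feedback(sample, 0.67)
for payload in payloads:
    print(payload)
```

With a live gateway, each payload could then be sent with `client.feedback(**payload)`.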

## 3. Direct ClickHouse Queries

Let's query ClickHouse directly to analyze our inference data.

In [4]:
# ClickHouse connection details
clickhouse_url = "http://localhost:8123"
clickhouse_user = "chuser"
clickhouse_password = "chpassword"
database = "tensorzero"

def query_clickhouse(query):
    """Execute a query against ClickHouse."""
    response = httpx.post(
        f"{clickhouse_url}/",
        params={
            "database": database,
            "user": clickhouse_user,
            "password": clickhouse_password,
            "default_format": "JSONEachRow"
        },
        content=query  # raw SQL body; httpx deprecates passing a str via data=
    )
    
    if response.status_code == 200:
        lines = response.text.strip().split('\n')
        return [json.loads(line) for line in lines if line]
    else:
        raise Exception(f"ClickHouse query failed: {response.text}")

# Test ClickHouse connection
try:
    tables = query_clickhouse("SHOW TABLES")
    print("📊 ClickHouse Tables:")
    for table in tables:
        print(f"   • {table['name']}")
except Exception as e:
    print(f"❌ ClickHouse connection error: {e}")

📊 ClickHouse Tables:
   • BatchIdByInferenceId
   • BatchIdByInferenceIdView
   • BatchModelInference
   • BatchRequest
   • BooleanMetricFeedback
   • BooleanMetricFeedbackByTargetId
   • BooleanMetricFeedbackByTargetIdView
   • BooleanMetricFeedbackTagView
   • ChatInference
   • ChatInferenceByEpisodeIdView
   • ChatInferenceByIdView
   • ChatInferenceDatapoint
   • ChatInferenceTagView
   • CommentFeedback
   • CommentFeedbackByTargetId
   • CommentFeedbackByTargetIdView
   • CommentFeedbackTagView
   • CumulativeUsage
   • CumulativeUsageView
   • DemonstrationFeedback
   • DemonstrationFeedbackByInferenceId
   • DemonstrationFeedbackByInferenceIdView
   • DemonstrationFeedbackTagView
   • DeploymentID
   • DynamicEvaluationRun
   • DynamicEvaluationRunByProjectName
   • DynamicEvaluationRunByProjectNameView
   • DynamicEvaluationRunEpisode
   • DynamicEvaluationRunEpisodeByRunId
   • DynamicEvaluationRunEpisodeByRunIdView
   • DynamicInContextLearningExample
   • FeedbackTag
   •
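
`query_clickhouse` splits the response on newlines because the `JSONEachRow` format returns one JSON object per line rather than a single JSON array. A small standalone sketch of that parsing step; `parse_json_each_row` is a hypothetical name and the sample body is made up for illustration.

```python
import json

def parse_json_each_row(text: str) -> list[dict]:
    """Parse ClickHouse JSONEachRow output: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Illustrative response body, shaped like the SHOW TABLES result above.
sample_body = '{"name":"ChatInference"}\n{"name":"CommentFeedback"}\n'
rows = parse_json_each_row(sample_body)
print([row["name"] for row in rows])
```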

In [5]:
# Query recent inferences
try:
    # Get inference count by variant
    query = """
    SELECT 
        variant_name,
        COUNT(*) as count,
        AVG(inference_duration_ms) as avg_duration_ms
    FROM Chat_inferences
    WHERE timestamp > now() - INTERVAL 1 HOUR
    GROUP BY variant_name
    ORDER BY count DESC
    """
    
    results = query_clickhouse(query)
    
    if results:
        print("📈 Inference Statistics (Last Hour):")
        print("=" * 50)
        for row in results:
            print(f"Variant: {row['variant_name']}")
            print(f"  Count: {row['count']}")
            print(f"  Avg Duration: {row.get('avg_duration_ms', 'N/A')} ms\n")
    else:
        print("No inference data found in the last hour")
        
except Exception as e:
    print(f"Query error: {e}")
    # Let's try to see what tables actually exist
    try:
        tables = query_clickhouse("SHOW TABLES")
        print("\nAvailable tables:")
        for table in tables:
            print(f"  • {table['name']}")
    except Exception:
        pass  # Listing tables is best-effort; the original error is already printed

Query error: ClickHouse query failed: {"exception": "Code: 60. DB::Exception: Unknown table expression identifier 'Chat_inferences' in scope SELECT variant_name, COUNT(*) AS count, AVG(inference_duration_ms) AS avg_duration_ms FROM Chat_inferences WHERE timestamp > (now() - toIntervalHour(1)) GROUP BY variant_name ORDER BY count DESC. (UNKNOWN_TABLE) (version 24.12.6.70 (official build))"}


Available tables:
  • BatchIdByInferenceId
  • BatchIdByInferenceIdView
  • BatchModelInference
  • BatchRequest
  • BooleanMetricFeedback
  • BooleanMetricFeedbackByTargetId
  • BooleanMetricFeedbackByTargetIdView
  • BooleanMetricFeedbackTagView
  • ChatInference
  • ChatInferenceByEpisodeIdView
  • ChatInferenceByIdView
  • ChatInferenceDatapoint
  • ChatInferenceTagView
  • CommentFeedback
  • CommentFeedbackByTargetId
  • CommentFeedbackByTargetIdView
  • CommentFeedbackTagView
  • CumulativeUsage
  • CumulativeUsageView
  • DemonstrationFeedback
  • DemonstrationFeedbackByInferenceId
  • Demo
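
The query fails because `Chat_inferences` does not exist; the table listing shows `ChatInference` instead. A corrected sketch follows. The column names `processing_time_ms` and `timestamp` are assumptions about the `ChatInference` schema and are not confirmed by the output above, so it is worth running `DESCRIBE TABLE ChatInference` first. `variant_stats_query` is a hypothetical helper.

```python
import re

def variant_stats_query(table: str = "ChatInference", hours: int = 1) -> str:
    """Build the per-variant count and latency query against an existing table."""
    # Only allow plain identifiers, since the table name is interpolated into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"invalid table name: {table!r}")
    if hours < 1:
        raise ValueError("hours must be at least 1")
    return f"""
    SELECT
        variant_name,
        COUNT(*) AS count,
        AVG(processing_time_ms) AS avg_duration_ms  -- assumed column name
    FROM {table}
    WHERE timestamp > now() - INTERVAL {int(hours)} HOUR  -- assumed column name
    GROUP BY variant_name
    ORDER BY count DESC
    """

print(variant_stats_query())
```

The resulting string can be passed to `query_clickhouse` in place of the original query.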

## 4. Structured Output Testing

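Let's test structured (JSON) outputs. The cell below tries the `gpt4_json` variant of an `analyze_sentiment` function; if the call fails, it prints a config snippet that adds a Grok variant.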

In [6]:
# First, check whether a structured output function is configured.
# If the call fails, print the config needed to add one.

print("🔧 Testing Structured Output Capabilities")
print("=" * 40)

# Test with sentiment analysis (if configured)
test_texts = [
    "TensorZero is amazing! It makes LLM integration so easy.",
    "The setup was a bit complex but worth it.",
    "Having issues with the configuration."
]

# Try sentiment analysis if available
try:
    for text in test_texts:
        response = client.inference(
            function_name="analyze_sentiment",
            variant_name="gpt4_json",  # Try with GPT-4 first
            input={
                "messages": [
                    {"role": "user", "content": text}
                ]
            }
        )
        
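        # Parse structured output: JSON functions expose `output.parsed`,
        # while chat functions expose `content`
        output = getattr(response, "output", None)
        result = output.parsed if output is not None else json.loads(response.content[0].text)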
        
        print(f"\nText: '{text[:50]}...'")
        print(f"Sentiment: {result['sentiment']} (confidence: {result['confidence']:.2f})")
        print(f"Explanation: {result['explanation']}")
        
except Exception as e:
    print(f"\n⚠️  Sentiment analysis not configured or failed: {str(e)[:100]}")
    print("\nTo enable structured output, add this to your tensorzero.toml:")
    print("""[functions.analyze_sentiment]
type = "json"
schema = '''{
  "type": "object",
  "properties": {
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    "explanation": {"type": "string"}
  },
  "required": ["sentiment", "confidence", "explanation"]
}'''

[functions.analyze_sentiment.variants.grok3_mini]
type = "chat_completion"
model = "xai::grok-3-mini"
""")

🔧 Testing Structured Output Capabilities
2025-08-28T19:26:50.763243Z  WARN tensorzero_core::error: Request failed: HTTP status client error (400 Bad Request) for url (http://localhost:3000/inference)

⚠️  Sentiment analysis not configured or failed: TensorZeroError (status code 400): {"error":"`input.system` is empty but a system template is presen

To enable structured output, add this to your tensorzero.toml:
[functions.analyze_sentiment]
type = "json"
schema = '''{
  "type": "object",
  "properties": {
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    "explanation": {"type": "string"}
  },
  "required": ["sentiment", "confidence", "explanation"]
}'''

[functions.analyze_sentiment.variants.grok3_mini]
type = "chat_completion"
model = "xai::grok-3-mini"
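
The 400 error above says `input.system` is empty while a system template is present, so the function appears to be configured already and the request is what is incomplete. The sketch below shows two pieces that can be checked without a gateway: an input builder that also supplies system template arguments, and a validator mirroring the JSON schema printed above. Both helper names are hypothetical, and the actual template arguments are unknown here, so `system_args` is a placeholder.

```python
from typing import Any

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def build_sentiment_input(text: str, system_args: dict[str, Any]) -> dict[str, Any]:
    """Build an inference input that also supplies the system template arguments."""
    return {"system": system_args, "messages": [{"role": "user", "content": text}]}

def validate_sentiment(result: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means the result is valid."""
    problems = []
    if result.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("sentiment must be one of positive/negative/neutral")
    confidence = result.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        problems.append("confidence must be a number between 0 and 1")
    if not isinstance(result.get("explanation"), str):
        problems.append("explanation must be a string")
    return problems

good = {"sentiment": "positive", "confidence": 0.9, "explanation": "Enthusiastic wording."}
bad = {"sentiment": "happy", "confidence": 2}
print(validate_sentiment(good))
print(validate_sentiment(bad))
```

A client-side check like this is only a fallback for inspecting results; it does not replace the schema validation the gateway is expected to perform for JSON functions.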



## 5. Performance Analysis Dashboard

Let's create a simple performance dashboard using the data we've collected.

In [7]:
# Create a performance summary
if inference_ids:
    df = pd.DataFrame(inference_ids)
    
    print("📊 Inference Summary")
    print("=" * 40)
    
    # Count by provider
    provider_counts = df['provider'].value_counts()
    print("\nInferences by Provider:")
    for provider, count in provider_counts.items():
        print(f"  {provider}: {count}")
    
    # Recent activity
    print(f"\nTotal Inferences: {len(df)}")
    print(f"Unique Prompts: {df['prompt'].nunique()}")
    print(f"Time Range: {df['timestamp'].min()} to {df['timestamp'].max()}")
    
    # Show sample IDs for UI exploration
    print("\n🔍 Sample Inference IDs (for UI exploration):")
    for _, row in df.head(3).iterrows():
        print(f"  • {row['inference_id']} ({row['provider']})")
    
    print(f"\n🌐 View these in TensorZero UI: http://localhost:4000")
else:
    print("❌ No inference data collected yet")

📊 Inference Summary

Inferences by Provider:
  OpenAI GPT-4o Mini: 5
  Anthropic Claude 3 Haiku: 5
  xAI Grok-3 Mini: 5

Total Inferences: 15
Unique Prompts: 5
Time Range: 2025-08-28T15:25:27.065180 to 2025-08-28T15:26:50.574570

🔍 Sample Inference IDs (for UI exploration):
  • 0198f224-4400-72d2-a279-a6638ad1670f (OpenAI GPT-4o Mini)
  • 0198f224-6099-7e91-8ee1-4cb1002ae8ba (Anthropic Claude 3 Haiku)
  • 0198f224-6acb-7c43-a7c7-d92b78b370a8 (xAI Grok-3 Mini)

🌐 View these in TensorZero UI: http://localhost:4000
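
The summary above counts inferences but records no latency. A minimal sketch of client-side timing and a per-provider summary, using only the standard library; `timed_call` and `summarize_latency` are hypothetical helpers, and the numbers in the example are illustrative, not measurements from this notebook.

```python
import statistics
import time
from collections import defaultdict

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_ms) measured on the client side."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0

def summarize_latency(records):
    """Group (provider, elapsed_ms) pairs and report count, mean, and max per provider."""
    grouped = defaultdict(list)
    for provider, elapsed_ms in records:
        grouped[provider].append(elapsed_ms)
    return {
        provider: {
            "count": len(values),
            "mean_ms": statistics.fmean(values),
            "max_ms": max(values),
        }
        for provider, values in grouped.items()
    }

# Illustrative numbers only.
records = [("provider_a", 100.0), ("provider_a", 300.0), ("provider_b", 50.0)]
summary = summarize_latency(records)
for provider, stats in summary.items():
    print(provider, stats)
```

In the data-generation loop, `client.inference(...)` could be wrapped as `response, elapsed_ms = timed_call(client.inference, function_name="chat", ...)`. Client-side timing includes network overhead, so it will not match server-side durations exactly.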


## 6. Advanced Observability Features

Let's explore more advanced features like tracing multi-step workflows. TensorZero groups related inferences into an episode when they share an `episode_id`.

In [8]:
# Multi-step workflow example
print("🔄 Testing Multi-Step Workflow Tracing")
print("=" * 40)

# Step 1: Generate a topic
try:
    step1 = client.inference(
        function_name="chat",
        variant_name="gpt4_mini",
        input={
            "messages": [
                {"role": "user", "content": "Generate a random technical topic in 3 words or less."}
            ]
        }
    )
    
    topic = step1.content[0].text
    print(f"Step 1 - Topic Generated: {topic}")
    print(f"  Inference ID: {step1.inference_id}")
    
    # Step 2: Explain the topic
    step2 = client.inference(
        function_name="chat",
        variant_name="claude3_haiku",
        episode_id=step1.episode_id,  # reuse step 1's episode so the steps are linked
        input={
            "messages": [
                {"role": "user", "content": f"Explain '{topic}' in one sentence."}
            ]
        }
    )
    
    explanation = step2.content[0].text
    print(f"\nStep 2 - Explanation: {explanation}")
    print(f"  Inference ID: {step2.inference_id}")
    
    # Step 3: Generate a haiku about it
    step3 = client.inference(
        function_name="generate_haiku",
        episode_id=step1.episode_id,  # reuse step 1's episode so the steps are linked
        input={
            "messages": [
                {"role": "user", "content": f"Write a haiku about {topic}."}
            ]
        }
    )
    
    haiku = step3.content[0].text
    print(f"\nStep 3 - Haiku:\n{haiku}")
    print(f"  Inference ID: {step3.inference_id}")
    
    print("\n✅ Multi-step workflow completed!")
    print("View the trace in TensorZero UI to see how these steps connect.")
    
except Exception as e:
    print(f"❌ Workflow failed: {e}")

🔄 Testing Multi-Step Workflow Tracing
Step 1 - Topic Generated: Quantum Computing Security
  Inference ID: 0198f225-a81b-7811-9686-83c3b977f3b2

Step 2 - Explanation: Quantum computing has the potential to break many of the current cryptographic algorithms used to secure digital communications, making the development of quantum-resistant cryptography crucial for ensuring the future security of data and information.
  Inference ID: 0198f225-aad3-7881-943c-52121a1e27a6

Step 3 - Haiku:
Bits dance in the light,  
Shadows of quantum whispers,  
Guardians of code.
  Inference ID: 0198f225-af37-7ae1-a78e-840bf3a2d3a8

✅ Multi-step workflow completed!
View the trace in TensorZero UI to see how these steps connect.
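
Steps that share an `episode_id` can also be pulled back together from ClickHouse. A sketch of a query builder for that lookup; `episode_query` is a hypothetical helper, the `episode_id` and `timestamp` columns are assumptions about the `ChatInference` schema, and the all-zero UUID is a placeholder rather than a real episode from this run.

```python
import uuid

def episode_query(episode_id: str) -> str:
    """Build a query listing the chat inferences that belong to one episode."""
    # uuid.UUID raises ValueError on malformed input, which also guards the SQL string.
    canonical = str(uuid.UUID(episode_id))
    return (
        "SELECT id, function_name, variant_name "
        "FROM ChatInference "
        f"WHERE episode_id = '{canonical}' "
        "ORDER BY timestamp"
    )

placeholder_episode = "00000000-0000-0000-0000-000000000000"  # placeholder, not a real episode
print(episode_query(placeholder_episode))
```

With a live gateway, the argument would be `str(step1.episode_id)`, and the resulting string can be passed to `query_clickhouse`.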


## Key Insights

### Observability Features:
1. **Inference Tracking**: Every inference gets a unique ID; the IDs shown above are time-ordered UUIDs
2. **Feedback Loop**: Can attach feedback to any inference
3. **ClickHouse Storage**: All data queryable for analysis
4. **UI Dashboard**: Visual exploration at http://localhost:4000

### Advanced Capabilities:
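1. **Structured Output**: JSON schema validation for JSON functions (the `analyze_sentiment` call above returned a 400 error, so this remains untested here)
2. **Multi-Step Tracing**: Track complex workflows by sharing an `episode_id` across inferences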
3. **Performance Metrics**: Latency, token usage, costs
4. **A/B Testing**: Built-in experimentation framework

### Next Steps:
1. Explore the TensorZero UI for visual insights
2. Set up custom ClickHouse queries for specific metrics
3. Implement structured output functions
4. Create feedback-driven optimization loops

Next notebook: We'll explore prompt management and A/B testing.