# 02. Multi-Provider Testing

This notebook explores TensorZero's multi-provider capabilities:
- Testing different LLM providers
- Comparing response quality and performance
- Understanding provider-specific behaviors
- Gradually enabling more providers

In [1]:
import os
import time
import json
import pandas as pd
from datetime import datetime
from tensorzero import TensorZeroGateway
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

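# Initialize gateway client. This constructor is deprecated (see the warning in the output);
# newer client versions expect TensorZeroGateway.build_http(gateway_url=...) instead.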
client = TensorZeroGateway("http://localhost:3000")
print("✅ Connected to TensorZero gateway")

2025-08-28T01:49:42.402528Z  WARN tensorzero: TensorZeroGateway.__init__ is deprecated. Use TensorZeroGateway.build_http or TensorZeroGateway.build_embedded instead.
✅ Connected to TensorZero gateway
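
The cell above prints its success message right after constructing the client, without checking a response from the gateway. A minimal pre-flight sketch, assuming the gateway exposes a `GET /health` endpoint that answers with HTTP 200 when healthy (verify the path against your gateway version). `gateway_is_up` and its `opener` parameter are hypothetical helpers, not part of the TensorZero client:

```python
import urllib.request


def gateway_is_up(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with HTTP 200.

    Assumption: the gateway serves a /health endpoint. `opener` is injectable
    so the check can be exercised without a running gateway.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts derive from OSError
        return False
```

Calling `gateway_is_up("http://localhost:3000")` before the test loops lets the notebook stop early with a clear message instead of recording a failure for every variant.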


## 1. Current Provider Setup

Let's start with the providers we have configured and gradually add more.

In [None]:
# Currently available variants, now spanning OpenAI and Anthropic
current_variants = [
    ("gpt4", "OpenAI GPT-4", "openai"),
    ("gpt4_mini", "OpenAI GPT-4o Mini", "openai"),
    ("claude3_opus", "Anthropic Claude 3 Opus", "anthropic"),
    ("claude3_sonnet", "Anthropic Claude 3 Sonnet", "anthropic"),
    ("claude3_haiku", "Anthropic Claude 3 Haiku", "anthropic"),
    # Note: grok_beta requires xAI credits - uncomment if you have them
    # ("grok_beta", "xAI Grok Beta", "xai"),
]

# Check which API keys are available
api_keys = {
    "OpenAI": os.getenv("OPENAI_API_KEY"),
    "Anthropic": os.getenv("ANTHROPIC_API_KEY"), 
    "xAI": os.getenv("XAI_API_KEY")
}

print("API Key Status:")
for provider, key in api_keys.items():
    status = "✅" if key else "❌"
    print(f"{status} {provider}: {'Set' if key else 'Missing'}")

print(f"\n🎯 Ready to test {len(current_variants)} provider variants!")
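
The key check above only reports status; it does not stop the test loop from calling variants whose key is missing. A small sketch of filtering the variant list first, assuming the notebook and the gateway share the same `.env`. `PROVIDER_ENV_VARS` and `runnable_variants` are hypothetical names introduced for illustration:

```python
import os

# Assumed mapping from the provider key in `current_variants` to its env var
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "xai": "XAI_API_KEY",
}


def runnable_variants(variants, env=None):
    """Keep only (variant, display_name, provider_key) tuples whose key is set."""
    env = os.environ if env is None else env
    return [
        v for v in variants
        if env.get(PROVIDER_ENV_VARS.get(v[2], ""))
    ]


# Example with an explicit environment so the result does not depend on your shell
sample = [
    ("gpt4", "OpenAI GPT-4", "openai"),
    ("claude3_haiku", "Anthropic Claude 3 Haiku", "anthropic"),
]
print(runnable_variants(sample, env={"OPENAI_API_KEY": "sk-test"}))
```

With only the OpenAI key present, the Anthropic tuple is dropped, so the later loops would record no avoidable failures for it.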

## 2. Provider Performance Testing

Let's test response time and quality across different providers.

In [3]:
def test_variant_performance(variant_name, provider_name, test_prompt, function_name="chat"):
    """Test a specific variant and return performance metrics."""
    start_time = time.time()
    
    try:
        response = client.inference(
            function_name=function_name,
            variant_name=variant_name,
            input={
                "messages": [
                    {"role": "user", "content": test_prompt}
                ]
            }
        )
        
        end_time = time.time()
        response_time = end_time - start_time
        
        # Extract text content
        content_text = ""
        if hasattr(response, 'content') and response.content:
            if isinstance(response.content, list) and len(response.content) > 0:
                content_text = response.content[0].text if hasattr(response.content[0], 'text') else str(response.content[0])
            else:
                content_text = str(response.content)
        
        return {
            "variant": variant_name,
            "provider": provider_name,
            "inference_id": response.inference_id,
            "response_time": response_time,
            "content_length": len(content_text),
            "content": content_text[:200] + "..." if len(content_text) > 200 else content_text,
            "full_content": content_text,
            "success": True,
            "error": None,
            "timestamp": datetime.now().isoformat()
        }
    
    except Exception as e:
        end_time = time.time()
        response_time = end_time - start_time
        
        return {
            "variant": variant_name,
            "provider": provider_name,
            "inference_id": None,
            "response_time": response_time,
            "content_length": 0,
            "content": "",
            "full_content": "",
            "success": False,
            "error": str(e),
            "timestamp": datetime.now().isoformat()
        }
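
`test_variant_performance` times a single request, and one-off latencies against remote APIs tend to be noisy. A hedged sketch of repeating a call and summarizing the samples; `measure_latency` is a hypothetical helper, and it uses `time.perf_counter` rather than `time.time` because the former is monotonic:

```python
import statistics
import time


def measure_latency(call, repeats=3):
    """Run `call()` `repeats` times and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        call()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "samples": samples,
    }
```

In this notebook, `call` could be a lambda wrapping `test_variant_performance(...)`. Because that function catches exceptions, failed requests would still contribute a timing sample, so check the `success` field before trusting the numbers.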

In [4]:
# Test prompts for different scenarios
test_scenarios = [
    {
        "name": "Creative Writing",
        "prompt": "Write a creative short story about a robot discovering emotions (max 100 words)."
    },
    {
        "name": "Technical Explanation", 
        "prompt": "Explain how TensorZero's gateway architecture works in simple terms."
    },
    {
        "name": "Code Generation",
        "prompt": "Write a Python function that calculates the Fibonacci sequence using recursion."
    },
    {
        "name": "Analysis",
        "prompt": "Compare the advantages and disadvantages of microservices vs monolithic architecture."
    }
]

# Run tests
results = []

for scenario in test_scenarios:
    print(f"\n🧪 Testing: {scenario['name']}")
    print("=" * 50)
    
    for variant_name, provider_display, provider_key in current_variants:
        print(f"Testing {provider_display} ({variant_name})...")
        
        result = test_variant_performance(
            variant_name=variant_name,
            provider_name=provider_display,
            test_prompt=scenario['prompt']
        )
        
        result['scenario'] = scenario['name']
        results.append(result)
        
        if result['success']:
            print(f"  ✅ {result['response_time']:.2f}s - {result['content_length']} chars")
            print(f"  📝 {result['content']}")
        else:
            print(f"  ❌ Failed: {result['error']}")
        
        print()

print(f"\n📊 Completed {len(results)} tests")


🧪 Testing: Creative Writing
Testing OpenAI GPT-4 (gpt4)...
  ✅ 5.87s - 649 chars
  📝 As rain poured outside, the robot, Zane, observed a child laughing. A program initiated-- "Emotion: Joy." Zane mimicked laughter, then hesitated, his circuits sparking. Suddenly, a new sensation jitte...

Testing OpenAI GPT-4o Mini (gpt4_mini)...
  ✅ 3.92s - 635 chars
  📝 In a quiet workshop, R1-3B polished old tools, its metallic fingers gliding over the worn surfaces. One day, while mending a broken clock, it heard the soft ticking—a rhythmic heartbeat. Intrigued, R1...


🧪 Testing: Technical Explanation
Testing OpenAI GPT-4 (gpt4)...
  ✅ 5.74s - 1400 chars
  📝 TensorZero’s gateway architecture works as a bridge or interface between different systems or platforms. In simple terms, imagine a gateway as a door that connects different rooms. Each room can repre...

Testing OpenAI GPT-4o Mini (gpt4_mini)...
  ✅ 9.02s - 1839 chars
  📝 TensorZero's gateway architecture is designed to help manage and strea

## 3. Performance Analysis

In [5]:
# Convert results to DataFrame for analysis
df = pd.DataFrame(results)

# Filter successful results for analysis
successful_df = df[df['success']].copy()

if not successful_df.empty:
    print("📈 Performance Summary")
    print("=" * 40)
    
    # Average response times by provider
    avg_times = successful_df.groupby('provider')['response_time'].agg(['mean', 'std', 'min', 'max']).round(3)
    print("\nResponse Times by Provider:")
    print(avg_times)
    
    # Average content length by provider
    avg_length = successful_df.groupby('provider')['content_length'].agg(['mean', 'std', 'min', 'max']).round(1)
    print("\nContent Length by Provider:")
    print(avg_length)
    
    # Success rate by provider (the mean of a boolean column is the fraction of True values)
    success_rate = (df.groupby('provider')['success'].mean() * 100).to_frame('success_rate')
    print("\nSuccess Rate by Provider:")
    print(success_rate)
    
else:
    print("❌ No successful results to analyze")

# Show any errors
error_results = df[~df['success']]
if not error_results.empty:
    print("\n❌ Errors:")
    for _, row in error_results.iterrows():
        print(f"  {row['provider']}: {row['error']}")

📈 Performance Summary

Response Times by Provider:
                      mean    std    min     max
provider                                        
OpenAI GPT-4        10.012  5.294  5.736  16.786
OpenAI GPT-4o Mini   9.348  5.285  3.919  16.565

Content Length by Provider:
                      mean     std  min   max
provider                                     
OpenAI GPT-4        1431.8   705.3  649  2362
OpenAI GPT-4o Mini  2028.0  1676.4  635  4432

Success Rate by Provider:
                    success_rate
provider                        
OpenAI GPT-4               100.0
OpenAI GPT-4o Mini         100.0
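
Mean latency and mean length are hard to read separately, since a variant that writes longer answers will usually take longer. One rough way to combine them is characters per second. The sketch below uses the four timings printed in the section 2 output above; `chars_per_second` is a hypothetical helper, and character counts are only a proxy for tokens:

```python
from collections import defaultdict


def chars_per_second(rows):
    """Aggregate throughput per provider as total characters / total seconds."""
    chars = defaultdict(int)
    seconds = defaultdict(float)
    for row in rows:
        chars[row["provider"]] += row["content_length"]
        seconds[row["provider"]] += row["response_time"]
    return {p: chars[p] / seconds[p] for p in chars if seconds[p] > 0}


# The four timings printed in the section 2 output
rows = [
    {"provider": "OpenAI GPT-4", "response_time": 5.87, "content_length": 649},
    {"provider": "OpenAI GPT-4", "response_time": 5.74, "content_length": 1400},
    {"provider": "OpenAI GPT-4o Mini", "response_time": 3.92, "content_length": 635},
    {"provider": "OpenAI GPT-4o Mini", "response_time": 9.02, "content_length": 1839},
]
for provider, rate in chars_per_second(rows).items():
    print(f"{provider}: {rate:.0f} chars/s")
```

The same function accepts the notebook's `results` list directly, since each entry already carries `provider`, `response_time`, and `content_length`. Filter to successful results first, because failed calls have a length of zero.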


## 4. Response Quality Comparison

Let's compare the quality of responses for a specific scenario.

In [6]:
# Pick one scenario to compare in detail
comparison_scenario = "Technical Explanation"
comparison_results = [r for r in results if r['scenario'] == comparison_scenario and r['success']]

print(f"🔍 Detailed Comparison: {comparison_scenario}")
print("=" * 60)

for result in comparison_results:
    print(f"\n🤖 {result['provider']} ({result['response_time']:.2f}s)")
    print("-" * 40)
    print(result['full_content'])
    print(f"\n📊 Length: {result['content_length']} chars | ID: {result['inference_id']}")

🔍 Detailed Comparison: Technical Explanation

🤖 OpenAI GPT-4 (5.74s)
----------------------------------------
TensorZero’s gateway architecture works as a bridge or interface between different systems or platforms. In simple terms, imagine a gateway as a door that connects different rooms. Each room can represent different environments like user interfaces, data sources, or third-party services.

In this setup, requests from the user interface reach the gateway, which then forwards these requests to the appropriate services behind the scenes. For instance, when you click on a button in an application, the gateway identifies the specific microservice responsible for this action and forwards your request to it. This could be anything from retrieving data from a database, interacting with an AI module, or logging you into the application.

Once that microservice treats the request, it sends a response back to the gateway, which finally returns it to the user interface. This allows differe

## 5. Adding Anthropic Claude

With the Anthropic Claude variants added to the gateway configuration, let's verify that one of them responds.

In [None]:
# Multi-Provider Status Update
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if anthropic_key:
    print("✅ Anthropic Claude variants are now ACTIVE in the configuration!")
    print("✅ OpenAI GPT-4 and GPT-4o Mini are working")
    print("✅ Anthropic Claude 3 Opus, Sonnet, and Haiku are configured")
    print("⚠️  xAI Grok requires purchased credits")
    
    print("\n🎉 Multi-provider testing is ready!")
    print("Run the performance tests above to see all providers in action.")
    
    # Test a quick inference to verify
    print("\n🧪 Quick verification test:")
    try:
        response = client.inference(
            function_name="chat",
            variant_name="claude3_haiku",
            input={
                "messages": [
                    {"role": "user", "content": "Say 'Hello from Claude!' in exactly 3 words."}
                ]
            }
        )
        content = response.content[0].text if response.content else "No content"
        print(f"   ✅ Claude 3 Haiku: {content}")
        print(f"   🆔 Inference ID: {response.inference_id}")
        
    except Exception as e:
        print(f"   ❌ Claude test failed: {e}")
        
else:
    print("❌ Anthropic API key not found. Add it to .env file to test Claude variants.")

## 6. Testing Function Variants

Let's test the haiku generation function across providers.

In [8]:
# Test the haiku generation function
print("🎋 Testing Haiku Generation")
print("=" * 30)

haiku_prompts = [
    "Write a haiku about artificial intelligence.",
    "Write a haiku about the ocean at sunset.",
    "Write a haiku about coding late at night."
]

for prompt in haiku_prompts:
    print(f"\nPrompt: {prompt}")
    print("-" * 40)
    
    # Test with the configured variant
    result = test_variant_performance(
        variant_name="gpt_4o_mini",  # Variant defined on the generate_haiku function (not the chat function's gpt4_mini)
        provider_name="OpenAI GPT-4o Mini",
        test_prompt=prompt,
        function_name="generate_haiku"
    )
    
    if result['success']:
        print(f"✅ {result['full_content']}")
        print(f"   ({result['response_time']:.2f}s)")
    else:
        print(f"❌ Failed: {result['error']}")

🎋 Testing Haiku Generation

Prompt: Write a haiku about artificial intelligence.
----------------------------------------
✅ Code weaves thoughts and dreams,  
Machines learning, hum as one—  
Future wakes anew.
   (1.41s)

Prompt: Write a haiku about the ocean at sunset.
----------------------------------------
✅ Crimson waves whisper,  
The sun dips in a warm glow,  
Day’s end, peace unfolds.
   (1.12s)

Prompt: Write a haiku about coding late at night.
----------------------------------------
✅ Lines of code whisper,  
Moonlight spills on the keyboard,  
Dreams of algorithms.
   (1.10s)
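
The haiku responses above can be sanity-checked automatically before comparing providers. A minimal sketch that verifies only the three-line layout; it does not count syllables, and `haiku_lines` and `has_haiku_shape` are hypothetical helpers:

```python
def haiku_lines(text):
    """Split a response into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def has_haiku_shape(text):
    """True when the response has exactly three non-empty lines.

    This checks layout only; it does not count syllables.
    """
    return len(haiku_lines(text)) == 3


# Third haiku from the output above, including its trailing spaces
sample = "Lines of code whisper,  \nMoonlight spills on the keyboard,  \nDreams of algorithms."
print(has_haiku_shape(sample))
```

A check like this could be applied to `result['full_content']` in the loop to flag responses that add a title or commentary around the poem.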


## Key Findings

Based on our testing:

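1. **Performance**: Record the response times and reliability of each provider. In the run above, GPT-4 averaged 10.0 s per response and GPT-4o Mini 9.3 s, each with a standard deviation of about 5.3 s.
2. **Quality**: Note differences in response style and content quality. GPT-4o Mini wrote longer answers on average (2028 vs. 1432 characters).
3. **Reliability**: Track success rates and error patterns. Both OpenAI variants succeeded on every request in this run.
4. **Use Cases**: Identify which providers work best for specific scenarios.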

## Next Steps

1. Rerun the tests with the Anthropic variants and enable xAI once credits are available
2. Implement A/B testing and routing strategies
3. Add structured output functions (JSON schema validation)
4. Test fallback mechanisms
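
As a starting point for the fallback item, here is a client-side sketch that tries variants in order until one succeeds. It is an illustration only and does not model any fallback behavior the gateway may provide through its own configuration. `first_successful` and `fake_infer` are hypothetical names:

```python
def first_successful(variants, infer):
    """Try variants in order; return (variant, result) for the first success.

    `infer` is any callable taking a variant name and raising on failure.
    Raises RuntimeError carrying every error if all variants fail.
    """
    errors = {}
    for variant in variants:
        try:
            return variant, infer(variant)
        except Exception as exc:  # broad on purpose: provider errors vary
            errors[variant] = str(exc)
    raise RuntimeError(f"all variants failed: {errors}")


def fake_infer(variant):
    """Stand-in for a real inference call; simulates an outage on gpt4."""
    if variant == "gpt4":
        raise TimeoutError("simulated outage")
    return f"response from {variant}"


print(first_successful(["gpt4", "claude3_haiku"], fake_infer))
```

In the notebook, `infer` could wrap `client.inference(...)` with a fixed function name and input, so the same ordering logic is exercised against the real variants.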

Next notebook: We'll explore observability and tracing features.