# 02. Multi-Provider Testing

This notebook explores TensorZero's multi-provider capabilities:
- Testing different LLM providers
- Comparing response quality and performance
- Understanding provider-specific behaviors
- Gradually enabling more providers

In [1]:
import os
import time
import json
import pandas as pd
from datetime import datetime
from tensorzero import TensorZeroGateway
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize gateway client
client = TensorZeroGateway("http://localhost:3000")
print("✅ Connected to TensorZero gateway")

2025-08-28T15:55:23.193236Z  WARN tensorzero: TensorZeroGateway.__init__ is deprecated. Use TensorZeroGateway.build_http or TensorZeroGateway.build_embedded instead.
✅ Connected to TensorZero gateway


## 1. Current Provider Setup

Let's start with the providers we have configured and gradually add more.

In [None]:
# Currently available variants, including the latest xAI models
current_variants = [
    ("gpt4", "OpenAI GPT-4", "openai"),
    ("gpt4_mini", "OpenAI GPT-4o Mini", "openai"),
    ("claude3_opus", "Anthropic Claude 3 Opus", "anthropic"),
    ("claude3_sonnet", "Anthropic Claude 3 Sonnet", "anthropic"),
    ("claude3_haiku", "Anthropic Claude 3 Haiku", "anthropic"),
    ("grok3_mini", "xAI Grok-3 Mini", "xai"),
    ("grok_code_fast", "xAI Grok Code Fast", "xai"),
    ("grok4", "xAI Grok-4", "xai"),
]

# Check which API keys are available
api_keys = {
    "OpenAI": os.getenv("OPENAI_API_KEY"),
    "Anthropic": os.getenv("ANTHROPIC_API_KEY"), 
    "xAI": os.getenv("XAI_API_KEY")
}

print("API Key Status:")
for provider, key in api_keys.items():
    status = "✅" if key else "❌"
    print(f"{status} {provider}: {'Set' if key else 'Missing'}")

print(f"\n🎯 Ready to test {len(current_variants)} provider variants!")
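
Because each variant tuple carries a provider key, the key check above can also be used to skip variants whose credentials are missing. A minimal sketch, assuming the environment variable names used in this notebook; `PROVIDER_ENV_VARS` and `runnable_variants` are hypothetical names introduced for illustration and are not part of TensorZero:

```python
import os

# Hypothetical mapping from the provider key in each variant tuple to the
# environment variable that holds its API key (names taken from this notebook).
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "xai": "XAI_API_KEY",
}


def runnable_variants(variants, env=None):
    """Return only the variants whose provider has a non-empty API key."""
    env = os.environ if env is None else env
    runnable = []
    for variant in variants:
        env_var = PROVIDER_ENV_VARS.get(variant[2])
        if env_var and env.get(env_var):
            runnable.append(variant)
    return runnable


sample_variants = [
    ("gpt4", "OpenAI GPT-4", "openai"),
    ("claude3_haiku", "Anthropic Claude 3 Haiku", "anthropic"),
    ("grok4", "xAI Grok-4", "xai"),
]
# Explicit mapping used here instead of the real environment (example values).
sample_env = {"OPENAI_API_KEY": "sk-example", "XAI_API_KEY": ""}
available = runnable_variants(sample_variants, env=sample_env)
print([name for name, _, _ in available])
```

Passing `env` explicitly keeps the helper easy to test; in the notebook it could be called as `runnable_variants(current_variants)` to read the real environment.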

## 2. Provider Performance Testing

Let's test response time and quality across different providers.

In [3]:
def test_variant_performance(variant_name, provider_name, test_prompt, function_name="chat"):
    """Test a specific variant and return performance metrics."""
    start_time = time.time()
    
    try:
        response = client.inference(
            function_name=function_name,
            variant_name=variant_name,
            input={
                "messages": [
                    {"role": "user", "content": test_prompt}
                ]
            }
        )
        
        end_time = time.time()
        response_time = end_time - start_time
        
        # Extract text content
        content_text = ""
        if hasattr(response, 'content') and response.content:
            if isinstance(response.content, list) and len(response.content) > 0:
                content_text = response.content[0].text if hasattr(response.content[0], 'text') else str(response.content[0])
            else:
                content_text = str(response.content)
        
        return {
            "variant": variant_name,
            "provider": provider_name,
            "inference_id": response.inference_id,
            "response_time": response_time,
            "content_length": len(content_text),
            "content": content_text[:200] + "..." if len(content_text) > 200 else content_text,
            "full_content": content_text,
            "success": True,
            "error": None,
            "timestamp": datetime.now().isoformat()
        }
    
    except Exception as e:
        end_time = time.time()
        response_time = end_time - start_time
        
        return {
            "variant": variant_name,
            "provider": provider_name,
            "inference_id": None,
            "response_time": response_time,
            "content_length": 0,
            "content": "",
            "full_content": "",
            "success": False,
            "error": str(e),
            "timestamp": datetime.now().isoformat()
        }
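
The timing and error handling in `test_variant_performance` can be factored into a small wrapper that works with any callable. A sketch under stated assumptions: `timed_call` and `fake_inference` are hypothetical names, the stub stands in for `client.inference`, and `time.perf_counter()` is swapped in for `time.time()` because it is monotonic and intended for measuring intervals:

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds, error_message)."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        error = None
    except Exception as exc:  # record the failure instead of raising
        result = None
        error = str(exc)
    elapsed = time.perf_counter() - start
    return result, elapsed, error


def fake_inference(variant_name):
    """Stand-in for client.inference; fails for one hypothetical variant."""
    if variant_name == "broken_variant":
        raise RuntimeError("provider unavailable")
    return f"response from {variant_name}"


ok_result, ok_elapsed, ok_error = timed_call(fake_inference, "gpt4_mini")
bad_result, bad_elapsed, bad_error = timed_call(fake_inference, "broken_variant")
print(ok_result, ok_error)
print(bad_result, bad_error)
```

Measuring the elapsed time once, after the `try` block, avoids repeating the timing code in both the success and failure branches.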

In [4]:
# Test prompts for different scenarios
test_scenarios = [
    {
        "name": "Creative Writing",
        "prompt": "Write a creative short story about a robot discovering emotions (max 100 words)."
    },
    {
        "name": "Technical Explanation", 
        "prompt": "Explain how TensorZero's gateway architecture works in simple terms."
    },
    {
        "name": "Code Generation",
        "prompt": "Write a Python function that calculates the Fibonacci sequence using recursion."
    },
    {
        "name": "Analysis",
        "prompt": "Compare the advantages and disadvantages of microservices vs monolithic architecture."
    }
]

# Run tests
results = []

for scenario in test_scenarios:
    print(f"\n🧪 Testing: {scenario['name']}")
    print("=" * 50)
    
    for variant_name, provider_display, provider_key in current_variants:
        print(f"Testing {provider_display} ({variant_name})...")
        
        result = test_variant_performance(
            variant_name=variant_name,
            provider_name=provider_display,
            test_prompt=scenario['prompt']
        )
        
        result['scenario'] = scenario['name']
        results.append(result)
        
        if result['success']:
            print(f"  ✅ {result['response_time']:.2f}s - {result['content_length']} chars")
            print(f"  📝 {result['content']}")
        else:
            print(f"  ❌ Failed: {result['error']}")
        
        print()

print(f"\n📊 Completed {len(results)} tests")


🧪 Testing: Creative Writing
Testing OpenAI GPT-4 (gpt4)...
  ✅ 8.33s - 602 chars
  📝 Eve, the robot, watched as the sun dipped into the horizon, emanating a golden glow. Puzzled by the warmth surging within her circuit board, she couldn’t quite figure out this new sensation. In her lo...

Testing OpenAI GPT-4o Mini (gpt4_mini)...
  ✅ 3.32s - 632 chars
  📝 In a forgotten corner of a bustling city, a robot named R1A stumbled upon a dusty old book titled "Feelings." As R1A's metallic fingers flipped the pages, words leaped to life: love, sadness, joy. Eac...

Testing Anthropic Claude 3 Opus (claude3_opus)...
  ✅ 5.10s - 593 chars
  📝 In a world of cold steel and precise calculations, a robot named Zeta suddenly experienced an unfamiliar sensation. As it interacted with its human companion, a warmth spread through its circuits, and...

Testing Anthropic Claude 3 Sonnet (claude3_sonnet)...
2025-08-28T15:55:40.189932Z  WARN tensorzero_core::error: Request fa

## 3. Performance Analysis

In [5]:
# Convert results to DataFrame for analysis
df = pd.DataFrame(results)

# Filter successful results for analysis
successful_df = df[df['success']].copy()

if not successful_df.empty:
    print("📈 Performance Summary")
    print("=" * 40)
    
    # Average response times by provider
    avg_times = successful_df.groupby('provider')['response_time'].agg(['mean', 'std', 'min', 'max']).round(3)
    print("\nResponse Times by Provider:")
    print(avg_times)
    
    # Average content length by provider
    avg_length = successful_df.groupby('provider')['content_length'].agg(['mean', 'std', 'min', 'max']).round(1)
    print("\nContent Length by Provider:")
    print(avg_length)
    
    # Success rate by provider
    success_rate = df.groupby('provider')['success'].agg(['sum', 'count']).round(3)
    success_rate['success_rate'] = (success_rate['sum'] / success_rate['count']) * 100
    print("\nSuccess Rate by Provider:")
    print(success_rate[['success_rate']])
    
else:
    print("❌ No successful results to analyze")

# Show any errors
error_results = df[~df['success']]
if not error_results.empty:
    print("\n❌ Errors:")
    for _, row in error_results.iterrows():
        print(f"  {row['provider']}: {row['error']}")

📈 Performance Summary

Response Times by Provider:
                            mean     std    min     max
provider                                               
Anthropic Claude 3 Haiku   3.144   2.018  1.204   5.628
Anthropic Claude 3 Opus   12.657  10.374  2.759  24.172
OpenAI GPT-4              11.251   6.195  5.737  19.976
OpenAI GPT-4o Mini         7.412   4.528  3.320  13.819

Content Length by Provider:
                            mean     std  min   max
provider                                           
Anthropic Claude 3 Haiku  1664.5  1456.6  448  3687
Anthropic Claude 3 Opus   1504.0  1359.4  244  3210
OpenAI GPT-4              1208.5   662.0  602  2152
OpenAI GPT-4o Mini        1990.8  1517.8  632  4099

Success Rate by Provider:
                           success_rate
provider                               
Anthropic Claude 3 Haiku          100.0
Anthropic Claude 3 Opus           100.0
Anthropic Claude 3 Sonnet           0.0
OpenAI GPT-4                      100.0
OpenA
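
The same per-provider summary can be computed without pandas, which may help when the results need to be checked in a plain script. A sketch using only the standard library; `summarize_by_provider` is a hypothetical helper, and the sample records are made-up values in the shape returned by `test_variant_performance`, not results from the run above:

```python
from collections import defaultdict
from statistics import mean


def summarize_by_provider(results):
    """Return {provider: {"mean_time": float or None, "success_rate": float}}."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["provider"]].append(record)

    summary = {}
    for provider, records in grouped.items():
        times = [r["response_time"] for r in records if r["success"]]
        summary[provider] = {
            # Mean latency over successful calls only, as in the pandas version.
            "mean_time": mean(times) if times else None,
            # Percentage of calls that succeeded for this provider.
            "success_rate": 100.0 * len(times) / len(records),
        }
    return summary


# Illustrative records only (hypothetical providers and timings).
sample_results = [
    {"provider": "Provider A", "response_time": 2.0, "success": True},
    {"provider": "Provider A", "response_time": 4.0, "success": True},
    {"provider": "Provider B", "response_time": 1.0, "success": False},
]
summary = summarize_by_provider(sample_results)
print(summary)
```

A provider with no successful calls gets `mean_time` of `None` rather than being dropped, so failures stay visible in the summary.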

## 4. Response Quality Comparison

Let's compare the quality of responses for a specific scenario.

In [6]:
# Pick one scenario to compare in detail
comparison_scenario = "Technical Explanation"
comparison_results = [r for r in results if r['scenario'] == comparison_scenario and r['success']]

print(f"🔍 Detailed Comparison: {comparison_scenario}")
print("=" * 60)

for result in comparison_results:
    print(f"\n🤖 {result['provider']} ({result['response_time']:.2f}s)")
    print("-" * 40)
    print(result['full_content'])
    print(f"\n📊 Length: {result['content_length']} chars | ID: {result['inference_id']}")

🔍 Detailed Comparison: Technical Explanation

🤖 OpenAI GPT-4 (5.74s)
----------------------------------------
TensorZero's gateway architecture functions as a bridge between different elements of a system, allowing them to communicate with each other conveniently.

This system consists of two main components: the user interface (UI) or application, and the backend systems that provide the functionality and data, such as databases or servers.

Users access the UI, which talks to the gateway. The gateway then communicates with the backend servers to retrieve or manipulate data according to the user’s requests. The requested data or action response is sent back to the gateway, which passes it on to the UI for the user to see.

Essentially, the gateway serves as a middleman or translator, ensuring that the different system elements can understand each other and work together to achieve the desired result, regardless of their individual characteristics or differences. This also provides a s
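
Some aspects of quality can be checked mechanically. The Creative Writing prompt asks for at most 100 words, so one simple check is whether a response respects that limit. A rough sketch, assuming whitespace-separated tokens are an acceptable approximation of words; `word_count` and `within_word_limit` are hypothetical helpers:

```python
def word_count(text):
    """Count whitespace-separated tokens as an approximation of words."""
    return len(text.split())


def within_word_limit(text, max_words=100):
    """Return True when the text has at most max_words words."""
    return word_count(text) <= max_words


# Illustrative inputs: a short sentence and a synthetic 101-word string.
short_story = "A robot named Zeta felt warmth spread through its circuits."
long_story = " ".join(["word"] * 101)
print(word_count(short_story), within_word_limit(short_story))
print(word_count(long_story), within_word_limit(long_story))
```

In the notebook, this could be applied to `result['full_content']` for the Creative Writing scenario. It checks only the length constraint and says nothing about the writing itself.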

## 5. Adding Anthropic Claude

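The Anthropic Claude variants are already listed in `current_variants`. This cell confirms that the Anthropic API key is set and runs a quick verification request against Claude 3 Haiku.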

In [9]:
# Multi-Provider Status Update
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if anthropic_key:
    print("✅ Anthropic Claude variants are now ACTIVE in the configuration!")
    print("✅ OpenAI GPT-4 and GPT-4o Mini are working")
    print("✅ Anthropic Claude 3 Opus, Sonnet, and Haiku are configured")
    print("⚠️  xAI Grok requires credits purchase")
    
    print("\n🎉 Multi-provider testing is ready!")
    print("Run the performance tests above to see all providers in action.")
    
    # Test a quick inference to verify
    print("\n🧪 Quick verification test:")
    try:
        response = client.inference(
            function_name="chat",
            variant_name="claude3_haiku",
            input={
                "messages": [
                    {"role": "user", "content": "Say 'Hello from Claude!' in exactly 3 words."}
                ]
            }
        )
        content = response.content[0].text if response.content else "No content"
        print(f"   ✅ Claude 3 Haiku: {content}")
        print(f"   🆔 Inference ID: {response.inference_id}")
        
    except Exception as e:
        print(f"   ❌ Claude test failed: {e}")
        
else:
    print("❌ Anthropic API key not found. Add it to .env file to test Claude variants.")

✅ Anthropic Claude variants are now ACTIVE in the configuration!
✅ OpenAI GPT-4 and GPT-4o Mini are working
✅ Anthropic Claude 3 Opus, Sonnet, and Haiku are configured
⚠️  xAI Grok requires credits purchase

🎉 Multi-provider testing is ready!
Run the performance tests above to see all providers in action.

🧪 Quick verification test:
   ✅ Claude 3 Haiku: Hello from Claude!
   🆔 Inference ID: 0198f168-4725-7cb3-bb17-ed810c1b0dc6


## 6. Testing Function Variants

Let's test the `generate_haiku` function using its configured `gpt_4o_mini` variant.

In [8]:
# Test the haiku generation function
print("🎋 Testing Haiku Generation")
print("=" * 30)

haiku_prompts = [
    "Write a haiku about artificial intelligence.",
    "Write a haiku about the ocean at sunset.",
    "Write a haiku about coding late at night."
]

for prompt in haiku_prompts:
    print(f"\nPrompt: {prompt}")
    print("-" * 40)
    
    # Test with the configured variant
    result = test_variant_performance(
        variant_name="gpt_4o_mini",  # Variant defined for the generate_haiku function
        provider_name="OpenAI GPT-4o Mini",
        test_prompt=prompt,
        function_name="generate_haiku"
    )
    
    if result['success']:
        print(f"✅ {result['full_content']}")
        print(f"   ({result['response_time']:.2f}s)")
    else:
        print(f"❌ Failed: {result['error']}")

🎋 Testing Haiku Generation

Prompt: Write a haiku about artificial intelligence.
----------------------------------------
✅ Bytes of thought converge,  
Silent minds weaving the dreams,  
Future's echo speaks.
   (1.27s)

Prompt: Write a haiku about the ocean at sunset.
----------------------------------------
✅ Golden waves whisper,  
Horizon meets fiery sky,  
Day's kiss fades to night.
   (0.83s)

Prompt: Write a haiku about coding late at night.
----------------------------------------
✅ Fingers dance on keys,  
Moonlight threads through quiet rooms,  
Dreams of logic flow.
   (0.91s)
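
A haiku is conventionally written in three lines, which gives a cheap structural check on the generated output. A minimal sketch; `haiku_line_count` is a hypothetical helper, and it checks only the line count, not the syllable pattern:

```python
def haiku_line_count(text):
    """Count the non-empty lines in a generated haiku."""
    return sum(1 for line in text.splitlines() if line.strip())


# One of the haiku from the output above, including its trailing spaces.
sample_haiku = (
    "Golden waves whisper,  \n"
    "Horizon meets fiery sky,  \n"
    "Day's kiss fades to night."
)
print(haiku_line_count(sample_haiku) == 3)
```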


## Key Findings

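Based on the recorded run:

1. **Performance**: Claude 3 Haiku had the lowest mean response time (3.144 s) and Claude 3 Opus the highest (12.657 s); GPT-4o Mini (7.412 s) and GPT-4 (11.251 s) fell in between.
2. **Quality**: Mean response length ranged from 1208.5 characters (GPT-4) to 1990.8 characters (GPT-4o Mini); note differences in response style and content quality per scenario.
3. **Reliability**: Claude 3 Haiku, Claude 3 Opus, and GPT-4 succeeded on every request, while Claude 3 Sonnet failed on every request (0% success rate).
4. **Use Cases**: Identify which providers work best for specific scenarios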

## Next Steps

1. Enable the xAI Grok variants once credits are purchased, and add further providers to the configuration
2. Implement A/B testing and routing strategies
3. Add structured output functions (JSON schema validation)
4. Test fallback mechanisms
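
For the fallback step, one simple client-side pattern is to try variants in priority order and return the first success. This is only an illustration of the idea with a stub in place of the gateway call; `first_successful` and `flaky_call` are hypothetical, and the sketch does not represent TensorZero's own fallback configuration:

```python
def first_successful(variant_names, call):
    """Try each variant in order; return (variant, result) for the first success.

    Raises RuntimeError listing every failure if all variants fail.
    """
    errors = {}
    for name in variant_names:
        try:
            return name, call(name)
        except Exception as exc:
            errors[name] = str(exc)
    raise RuntimeError(f"all variants failed: {errors}")


def flaky_call(variant_name):
    """Stub for an inference call in which only one variant works."""
    if variant_name != "claude3_haiku":
        raise ConnectionError(f"{variant_name} unavailable")
    return "Hello from Claude!"


chosen, reply = first_successful(["claude3_sonnet", "claude3_haiku"], flaky_call)
print(chosen, reply)
```

Collecting the per-variant errors before raising keeps the failure pattern visible, which ties in with tracking error patterns per provider.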

Next notebook: We'll explore observability and tracing features.