<a href="https://colab.research.google.com/github/wesslen/llm-experiments/blob/main/notebooks/nondeterminism/structured_data/claude.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Claude Non-Determinism Experiment Analysis

## Experiment Overview
* Tests how consistent Claude (`claude-3-sonnet-20240229`) is when asked for structured JSON responses
* Uses three different prompt types:
  - Exercise benefits (descriptive/analytical)
  - Random numbers (explicit randomness)
  - Major cities (factual)
* Runs 20 iterations per prompt
* Measures variation with per-field Shannon entropy (natural log, the `scipy.stats.entropy` default) and unique response counts
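
The entropy score can be reproduced by hand. The sketch below assumes the score is Shannon entropy in nats (natural log), which is the default for `scipy.stats.entropy` as called in this notebook. The helper name `shannon_entropy` is illustrative, and the counts are the `benefit1` frequencies from the summary report printed further down.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# benefit1 frequencies from the summary report: 16, 2, 1 and 1 out of 20.
benefit1 = (
    ["Improved cardiovascular health"] * 16
    + ["Improved cardiovascular health and reduced risk of heart disease"] * 2
    + ["Improves cardiovascular health"]
    + ["Improved physical health"]
)

print(f"{shannon_entropy(benefit1):.3f}")  # 0.708
```

The result matches the 0.708 reported for `benefit1`, which is consistent with the scores being in nats rather than bits.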

## Hypothesis
* Exercise benefits: Expected moderate variation (entropy ~1.0-1.5)
* Random numbers: Expected high variation (entropy ~2.0-2.5)
* City names: Expected low variation (entropy ~0.5-1.0)
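
For scale, with 20 iterations per prompt a field's entropy cannot exceed ln(20), a value reached only when all 20 answers differ. A minimal sketch, assuming entropy is measured with the natural log (the `scipy.stats.entropy` default); `max_entropy` is a hypothetical helper name.

```python
import math

def max_entropy(num_samples: int) -> float:
    """Upper bound on natural-log entropy for `num_samples` observations."""
    return math.log(num_samples)

ceiling = max_entropy(20)
print(f"ceiling: {ceiling:.3f} nats")  # ceiling: 2.996 nats

# Hypothesized ranges expressed as a share of that ceiling.
hypotheses = {
    "exercise benefits": (1.0, 1.5),
    "random numbers": (2.0, 2.5),
    "city names": (0.5, 1.0),
}
for name, (low, high) in hypotheses.items():
    print(f"{name}: {low / ceiling:.0%} to {high / ceiling:.0%} of the ceiling")
```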

## Results Analysis
* Exercise Benefits
  - Entropy increases from field to field: 0.708 (benefit1) → 1.148 (benefit2) → 2.346 (benefit3)
  - benefit1: Strong preference for "Improved cardiovascular health" (16/20 responses)
  - benefit3: Highest variation with 12 unique phrasings for mental health benefits

* Random Numbers
  - Less random than expected
  - num1: Strong preference for "42" (15/20 responses)
  - num2 & num3: Moderate variation (~9 unique values each)
  - Shows clustering around certain numbers rather than a uniform distribution

* Cities
  - Most consistent of all prompts
  - Strong preferences: Tokyo (city1), New York/New York City (city2), London (city3)
  - Shows pattern-matching behavior: the same few canonical cities recur in the same positions rather than a varied selection
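
Because field values are compared as exact strings, "New York" and "New York City" are counted as two different answers, which can inflate the unique-value count and entropy for `city2`. A minimal sketch of normalizing such variants before counting follows. The alias table and the sample list are illustrative assumptions, not data from the experiment.

```python
from collections import Counter

# Hypothetical alias table: lowercase variant spellings -> canonical label.
CITY_ALIASES = {
    "new york city": "New York",
    "nyc": "New York",
}

def normalize_city(name: str) -> str:
    """Trim whitespace and collapse known aliases to one canonical name."""
    cleaned = name.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)

# Illustrative values only, not results from the run.
sample = ["New York", "New York City", "new york city ", "London"]
counts = Counter(normalize_city(city) for city in sample)
print(counts)
```

Whether to merge variants depends on the question being asked: merging measures variation in the underlying answer, while exact matching measures variation in the surface text.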

Key Finding: Claude shows markedly less non-determinism than hypothesized, which suggests strong underlying preferences even when it is asked for random or varied responses.
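
To put the random-number result in context, it helps to see what a genuinely uniform sampler would score under the same setup. The sketch below is a simulated baseline, assuming 20 independent uniform draws from 1 to 100 per field and natural-log entropy; the function names, the trial count, and the seed are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def uniform_baseline(num_draws=20, low=1, high=100, trials=1000, seed=0):
    """Mean entropy and mean unique-value count for uniform integer draws."""
    rng = random.Random(seed)
    entropies, uniques = [], []
    for _ in range(trials):
        draws = [rng.randint(low, high) for _ in range(num_draws)]
        entropies.append(shannon_entropy(draws))
        uniques.append(len(set(draws)))
    return sum(entropies) / trials, sum(uniques) / trials

mean_entropy, mean_unique = uniform_baseline()
print(f"uniform baseline: entropy ~{mean_entropy:.2f} nats, "
      f"~{mean_unique:.1f} unique values per field")
```

In expectation, 20 uniform draws from 100 values yield about 18 distinct numbers, roughly double the ~9 unique values observed for `num2` and `num3`.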

In [5]:
!uv pip install --system anthropic

Using Python 3.10.12 environment at /usr
Resolved 15 packages in 427ms
Prepared 1 package in 69ms
Installed 1 package in 4ms

In [1]:
import os
from google.colab import userdata
os.environ["ANTHROPIC_API_KEY"] = userdata.get('ANTHROPIC_API_KEY')

In [2]:
import anthropic
import json
from typing import List, Dict
import statistics
import datetime
from collections import defaultdict
import numpy as np
from scipy.stats import entropy
import os
from dataclasses import dataclass
import pandas as pd
from pathlib import Path

@dataclass
class ExperimentConfig:
    num_iterations: int = 20
    prompts: List[str] = None

    def __post_init__(self):
        if self.prompts is None:
            self.prompts = [
                "What are three key benefits of exercise? Format response as JSON with keys: benefit1, benefit2, benefit3",
                "List three random numbers between 1-100. Format as JSON with keys: num1, num2, num3",
                "Name three major cities. Format as JSON with keys: city1, city2, city3"
            ]

class NonDeterminismExperiment:
    def __init__(self, api_key: str, config: ExperimentConfig):
        # anthropic.Anthropic is the client class used in the SDK documentation.
        self.client = anthropic.Anthropic(api_key=api_key)
        self.config = config
        self.results = defaultdict(list)

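    def get_claude_response(self, prompt: str) -> Dict:
        """Get a single response from Claude and parse it as JSON.

        Returns None if the API call fails or the reply is not valid JSON.
        """
        try:
            # No temperature is passed, so the API default applies.
            message = self.client.messages.create(
                model="claude-3-sonnet-20240229",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            response = message.content[0].text
            # json.loads raises if the JSON is wrapped in prose or a code fence;
            # such replies are dropped and are not retried.
            return json.loads(response)
        except Exception as e:
            print(f"Error getting response: {e}")
            return None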

    def run_experiment(self):
        """Run the experiment for all prompts and iterations"""
        for prompt in self.config.prompts:
            print(f"\nRunning experiment for prompt: {prompt}")

            responses = []
            for _ in range(self.config.num_iterations):
                response = self.get_claude_response(prompt)
                if response is not None:
                    responses.append(response)

            self.results[prompt] = responses

    def analyze_results(self) -> Dict:
        """Analyze the results and compute statistics"""
        analysis = {}

        for prompt, responses in self.results.items():
            if not responses:  # Skip if no valid responses
                continue

            prompt_analysis = {
                "total_responses": len(responses),
                # sort_keys makes the comparison independent of key order.
                "unique_responses": len(set(json.dumps(r, sort_keys=True) for r in responses)),
                "field_analysis": {}
            }

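            # Analyze variation in each field, using the first response's keys
            # as the expected schema.
            all_fields = responses[0].keys()
            for field in all_fields:
                # Skip responses that lack this field instead of raising KeyError.
                values = [r[field] for r in responses if field in r]
                unique_values = list(set(values))

                field_stats = {
                    "unique_values": len(unique_values),
                    "value_frequencies": {val: values.count(val) for val in unique_values},
                    # scipy.stats.entropy uses the natural log by default,
                    # so entropy is reported in nats.
                    "entropy": entropy([values.count(val)/len(values) for val in unique_values])
                }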
                prompt_analysis["field_analysis"][field] = field_stats

            analysis[prompt] = prompt_analysis

        return analysis

    def save_results(self, output_dir: str = "experiment_results"):
        """Save raw results and analysis to files"""
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = Path(output_dir) / timestamp
        output_path.mkdir(parents=True, exist_ok=True)

        # Save raw results
        with open(output_path / "raw_results.json", "w") as f:
            json.dump(dict(self.results), f, indent=2)

        # Save analysis
        analysis = self.analyze_results()
        with open(output_path / "analysis.json", "w") as f:
            json.dump(analysis, f, indent=2)

        # Generate summary report
        self.generate_report(analysis, output_path / "summary_report.txt")

        return output_path

    def generate_report(self, analysis: Dict, output_file: Path):
        """Generate a human-readable summary report"""
        with open(output_file, "w") as f:
            f.write("Claude Non-Determinism Experiment Summary\n")
            f.write("=======================================\n\n")

            for prompt, results in analysis.items():
                f.write(f"Prompt: {prompt}\n")
                f.write(f"Total responses: {results['total_responses']}\n")
                f.write(f"Unique responses: {results['unique_responses']}\n")
                f.write("\nField Analysis:\n")

                for field, stats in results['field_analysis'].items():
                    f.write(f"\n{field}:\n")
                    f.write(f"  Unique values: {stats['unique_values']}\n")
                    f.write(f"  Entropy: {stats['entropy']:.3f}\n")
                    f.write("  Value frequencies:\n")
                    for val, freq in stats['value_frequencies'].items():
                        f.write(f"    {val}: {freq}\n")
                f.write("\n" + "="*50 + "\n")

def run_notebook_experiment(api_key: str = None):
    if api_key is None:
        api_key = os.getenv("ANTHROPIC_API_KEY")
    if not api_key:
        raise ValueError("Please set ANTHROPIC_API_KEY environment variable or pass it as a parameter")

    config = ExperimentConfig()
    experiment = NonDeterminismExperiment(api_key, config)

    experiment.run_experiment()
    output_path = experiment.save_results()

    # Display the summary report
    with open(output_path / "summary_report.txt", "r") as f:
        print(f.read())

    return experiment
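
`get_claude_response` passes the raw reply straight to `json.loads`, so any reply that surrounds the JSON with prose is discarded and the prompt ends up with fewer than 20 counted responses. One possible way to recover such replies is sketched below. `extract_json` is a hypothetical helper, not part of this notebook or the `anthropic` SDK, and it assumes the reply contains a single top-level JSON object. The sample reply is made up for illustration.

```python
import json
from typing import Optional

def extract_json(text: str) -> Optional[dict]:
    """Return the JSON object contained in `text`, or None if none parses."""
    # First try the whole reply, which is the common case.
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass
    # Fall back to the span between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Made-up reply that wraps the JSON in extra prose.
reply = 'Here are three numbers:\n{"num1": 42, "num2": 7, "num3": 88}\nHope that helps!'
print(extract_json(reply))  # {'num1': 42, 'num2': 7, 'num3': 88}
```

Recovering wrapped replies changes which responses are counted, so results gathered this way are not directly comparable with the run reported here.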

In [3]:

# Run the experiment
experiment = run_notebook_experiment()


Running experiment for prompt: What are three key benefits of exercise? Format response as JSON with keys: benefit1, benefit2, benefit3

Running experiment for prompt: List three random numbers between 1-100. Format as JSON with keys: num1, num2, num3

Running experiment for prompt: Name three major cities. Format as JSON with keys: city1, city2, city3
Claude Non-Determinism Experiment Summary

Prompt: What are three key benefits of exercise? Format response as JSON with keys: benefit1, benefit2, benefit3
Total responses: 20
Unique responses: 13

Field Analysis:

benefit1:
  Unique values: 4
  Entropy: 0.708
  Value frequencies:
    Improved physical health: 1
    Improves cardiovascular health: 1
    Improved cardiovascular health and reduced risk of heart disease: 2
    Improved cardiovascular health: 16

benefit2:
  Unique values: 7
  Entropy: 1.148
  Value frequencies:
    Strengthens muscles and bones: 1
    Stronger muscles and bones: 1
    Increased strength and endurance: 1
  