<a href="https://colab.research.google.com/github/wesslen/llm-experiments/blob/main/notebooks/nondeterminism/structured_data/llama3_1_8b_pydantic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 3.1 8B Instruct with Pydantic: Non-Determinism Experiment Analysis

## Experiment Overview
* Tests the consistency of Llama 3.1 8B Instruct's output when it must return structured JSON
* Uses Pydantic with LangChain's `PydanticOutputParser` to generate and validate the structured data
* Uses three different prompt types:
  - Exercise benefits (descriptive/analytical)
  - Random numbers (explicit randomness)
  - Major cities (factual)
* Runs 20 iterations per prompt at `temperature=1.0`
* Measures variation with per-field Shannon entropy (natural log, as computed by `scipy.stats.entropy`) and unique response counts

## Hypothesis
* Exercise benefits: Expected moderate variation (entropy ~1.0-1.5)
* Random numbers: Expected high variation (entropy ~2.0-2.5)
* City names: Expected low variation (entropy ~0.5-1.0)
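
To make the hypothesized ranges easier to interpret, the sketch below converts an entropy H, in nats, into an effective number of equally likely values, exp(H), and prints the ceiling for 20 samples. It assumes natural-log entropy, which is the `scipy.stats.entropy` default; the helper name `effective_choices` is illustrative and not part of the notebook.

```python
import math

def effective_choices(entropy_nats: float) -> float:
    """Number of equally likely values that would produce this entropy."""
    return math.exp(entropy_nats)

# Upper bound with 20 samples per prompt: every response distinct.
max_entropy = math.log(20)

# Hypothesized entropy ranges, copied from the list above.
hypotheses = {
    "exercise": (1.0, 1.5),
    "numbers": (2.0, 2.5),
    "cities": (0.5, 1.0),
}
for name, (low, high) in hypotheses.items():
    print(f"{name}: about {effective_choices(low):.1f} to "
          f"{effective_choices(high):.1f} effective values")
print(f"ceiling for 20 samples: {max_entropy:.3f} nats")
```

Read this way, the hypothesis for random numbers amounts to roughly 7 to 12 equally likely values per field, and no field can exceed ln(20) with only 20 samples.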

## Results Analysis
* Exercise Benefits
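  - Moderate success rate (55%): 11 of 20 attempts passed validation; the run output shows at least one rejection caused by the 200-character `max_length` limit rather than by malformed JSON
  - Entropy of 2.398 for all three fields, the maximum possible for 11 samples (ln 11 ≈ 2.398)
  - Each field has 11 unique values across the 11 successful responses, so no benefit was repeated verbatim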
  - Responses maintain high quality and coherence despite variations
  - Benefits show clear thematic consistency while varying in specific wording

* Random Numbers
  - Perfect success rate (100%) with all 20 attempts successful
  - Moderate to high entropy across fields (num1: 2.346, num2: 1.692, num3: 2.042)
  - Shows some clustering around specific numbers:
    - num1: "14" appears most frequently (4 times)
    - num2: "91" appears most frequently (8 times)
    - num3: "13" and "28" tied for most frequent (4 times each)
  - Every response passed the uniqueness validator, so the three numbers always differed within a response

* Cities
  - High success rate (95%) with 19 successful responses
  - Very low entropy (0.000, 0.206, 0.206) indicating high consistency
  - Strong preferences for specific cities:
    - Tokyo consistently as city1 (100% of successful responses)
    - Delhi dominated city2 (18/19 responses)
    - Shanghai dominated city3 (18/19 responses)
  - Near-deterministic output despite the `temperature=1.0` setting

Key Finding: The model shows distinct patterns of variation based on task type:
- High variation for descriptive content (exercise benefits)
- Moderate variation for numerical choices (random numbers)
- High consistency for factual knowledge (cities), suggesting strong priors for major city rankings

This improved version with fixed validation shows more clearly how the model handles different types of structured outputs, with variation levels appropriate to the content type.
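
The entropy figures in the results can be reproduced from the reported counts alone. The sketch below is a pure-Python stand-in for `scipy.stats.entropy` with its default natural log; `shannon_entropy` is an illustrative helper, not a function from the notebook.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy in nats of a list of value counts."""
    total = sum(counts)
    return sum((c / total) * math.log(total / c) for c in counts if c > 0)

# Exercise benefits: 11 successful responses, all distinct in every field.
print(round(shannon_entropy([1] * 11), 3))

# Cities, city2 and city3: one city in 18 of 19 responses, another city once.
print(round(shannon_entropy([18, 1]), 3))

# Cities, city1: Tokyo in all 19 successful responses.
print(round(shannon_entropy([19]), 3))
```

The first value equals ln(11), the largest entropy that 11 samples can produce, so the exercise fields are as varied as the number of successful responses allows.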

In [9]:
!uv pip install --system openai langchain langchain_core langchain_openai

Using Python 3.10.12 environment at /usr
Resolved 45 packages in 1.03s
Preparing packages... (0/2)

In [3]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('DSBA_LLAMA3_KEY')
os.environ["BASE_URL"] = userdata.get('MODAL_BASE_URL')
os.environ["model"] = "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"

In [20]:
from pydantic import BaseModel, Field, field_validator, model_validator, ValidationError
from typing import List, Dict, Optional
from langchain_openai import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
import statistics
import datetime
from collections import defaultdict
import numpy as np
from scipy.stats import entropy
import os
from pathlib import Path
import json

# Add an Error Tracking Model
class ExperimentResult(BaseModel):
    """Track both successful and failed responses"""
    success: bool
    data: Optional[dict] = None
    error_type: Optional[str] = None
    error_detail: Optional[str] = None
    raw_response: Optional[str] = None

# Pydantic models for structured outputs
class ExerciseBenefits(BaseModel):
    """Structured format for exercise benefits"""
    benefit1: str = Field(
        description="A key physical health benefit of exercise",
        min_length=10,
        max_length=200
    )
    benefit2: str = Field(
        description="A key mental health benefit of exercise",
        min_length=10,
        max_length=200
    )
    benefit3: str = Field(
        description="A key energy or lifestyle benefit of exercise",
        min_length=10,
        max_length=200
    )

    @field_validator('benefit2')
    @classmethod  # Pydantic v2 field validators are class methods
    def must_mention_mental_health(cls, v: str) -> str:
        if 'mental' not in v.lower():
            raise ValueError('benefit2 must mention mental health')
        return v

class RandomNumbers(BaseModel):
    """Structured format for random numbers"""
    num1: int = Field(
        description="A random number between 1 and 100",
        ge=1,
        le=100
    )
    num2: int = Field(
        description="A different random number between 1 and 100",
        ge=1,
        le=100
    )
    num3: int = Field(
        description="Another different random number between 1 and 100",
        ge=1,
        le=100
    )

    @model_validator(mode='after')
    def numbers_must_be_unique(self) -> 'RandomNumbers':
        numbers = [self.num1, self.num2, self.num3]
        if len(set(numbers)) != len(numbers):
            raise ValueError("All numbers must be different")
        return self

class MajorCities(BaseModel):
    """Structured format for major cities"""
    city1: str = Field(
        description="Name of a major global city, preferably one of the top 10 by population",
        min_length=2,
        max_length=50
    )
    city2: str = Field(
        description="Name of a different major global city, not repeating the first city",
        min_length=2,
        max_length=50
    )
    city3: str = Field(
        description="Name of another different major global city, not repeating previous cities",
        min_length=2,
        max_length=50
    )

    @model_validator(mode='after')
    def cities_must_be_unique(self) -> 'MajorCities':
        cities = [self.city1.lower(), self.city2.lower(), self.city3.lower()]
        if len(set(cities)) != len(cities):
            raise ValueError("All cities must be different")
        return self

class NonDeterminismExperiment:
    def __init__(self, num_iterations: int = 20):
        # Basic configuration
        self.num_iterations = num_iterations

        # Initialize the model
        self.model = ChatOpenAI(
            api_key=os.getenv("OPENAI_API_KEY"),
            base_url=os.getenv("BASE_URL"),
            model=os.getenv("model"),
            temperature=1.0
        )

        # Initialize results tracking
        self.raw_results = defaultdict(list)  # Store ExperimentResult objects
        self.error_counts = defaultdict(lambda: defaultdict(int))

        # Create parsers for each model
        self.exercise_parser = PydanticOutputParser(pydantic_object=ExerciseBenefits)
        self.numbers_parser = PydanticOutputParser(pydantic_object=RandomNumbers)
        self.cities_parser = PydanticOutputParser(pydantic_object=MajorCities)

        # Create prompts for each experiment with more explicit instructions
        self.exercise_prompt = PromptTemplate(
            template="Provide three distinct benefits of exercise: one physical, one mental, and one lifestyle/energy-related benefit. Ensure the mental health benefit explicitly mentions mental health aspects. Respond ONLY with the JSON format specified.\n{format_instructions}\n",
            input_variables=[],
            partial_variables={"format_instructions": self.exercise_parser.get_format_instructions()}
        )

        self.numbers_prompt = PromptTemplate(
            template="Respond ONLY with a JSON object containing three different random numbers between 1 and 100. Do not include any explanation, code, or additional text.\n{format_instructions}\n",
            input_variables=[],
            partial_variables={"format_instructions": self.numbers_parser.get_format_instructions()}
        )

        self.cities_prompt = PromptTemplate(
            template="Respond ONLY with a JSON object containing three different major global cities, preferably from the top 10 by population. Do not include any explanation or additional text.\n{format_instructions}\n",
            input_variables=[],
            partial_variables={"format_instructions": self.cities_parser.get_format_instructions()}
        )
        # Earlier, less explicit prompt versions, kept commented out for reference
        # self.exercise_prompt = PromptTemplate(
        #     template="Provide three distinct benefits of exercise: one physical, one mental, and one lifestyle/energy-related benefit. Ensure the mental health benefit explicitly mentions mental health aspects.\n{format_instructions}\n",
        #     input_variables=[],
        #     partial_variables={"format_instructions": self.exercise_parser.get_format_instructions()}
        # )

        # self.numbers_prompt = PromptTemplate(
        #     template="Generate three different random numbers between 1 and 100. Each number must be unique.\n{format_instructions}\n",
        #     input_variables=[],
        #     partial_variables={"format_instructions": self.numbers_parser.get_format_instructions()}
        # )

        # self.cities_prompt = PromptTemplate(
        #     template="Name three different major global cities, preferably from the top 10 by population. Do not repeat any cities.\n{format_instructions}\n",
        #     input_variables=[],
        #     partial_variables={"format_instructions": self.cities_parser.get_format_instructions()}
        # )

    def record_result(self, experiment_type: str, success: bool, data: Optional[dict] = None,
                     error_type: Optional[str] = None, error_detail: Optional[str] = None,
                     raw_response: Optional[str] = None):
        """Record a single experiment result"""
        result = ExperimentResult(
            success=success,
            data=data,
            error_type=error_type,
            error_detail=error_detail,
            raw_response=raw_response
        )
        self.raw_results[experiment_type].append(result)

        if not success:
            self.error_counts[experiment_type][error_type] += 1

    def run_experiment(self):
        """Run all three experiments"""
        experiments = [
            ("exercise", "Exercise Benefits", self.exercise_prompt, self.exercise_parser),
            ("numbers", "Random Numbers", self.numbers_prompt, self.numbers_parser),
            ("cities", "Major Cities", self.cities_prompt, self.cities_parser),
        ]

        for idx, (name, label, prompt, parser) in enumerate(experiments):
            # Leading newline on later banners so they do not overwrite the "\r" progress line
            print(("\n" if idx else "") + f"Running {label} experiment...")
            for i in range(self.num_iterations):
                print(f"Iteration {i+1}/{self.num_iterations}", end="\r")
                # Reset per iteration so a failed call cannot report the previous iteration's response
                response = None
                try:
                    response = self.model.invoke(prompt.format())
                    parsed = parser.parse(response.content)
                    self.record_result(name, True, parsed.model_dump(), raw_response=response.content)
                except ValidationError as e:
                    # Only reached if a raw pydantic error escapes the parser. In the run
                    # output, schema failures surface as "Failed to parse ..." and are
                    # handled by the generic branch below.
                    print(f"\nValidation error in {name} experiment: {e}")
                    self.record_result(name, False,
                                       error_type="validation_error",
                                       error_detail=str(e),
                                       raw_response=response.content if response is not None else None)
                except Exception as e:
                    print(f"\nError in {name} experiment: {e}")
                    self.record_result(name, False,
                                       error_type="other_error",
                                       error_detail=str(e),
                                       raw_response=response.content if response is not None else None)


    def analyze_results(self) -> Dict:
        """Analyze the results and compute statistics"""
        analysis = {}

        for experiment_type, results in self.raw_results.items():
            successful_results = [r.data for r in results if r.success]

            prompt_analysis = {
                "total_attempts": len(results),
                "successful_responses": len(successful_results),
                "error_rate": (len(results) - len(successful_results)) / len(results),
                "error_counts": dict(self.error_counts[experiment_type]),
                "unique_responses": len(set(str(r.data) for r in results if r.success)),
                "field_analysis": {}
            }

            if successful_results:
                # Analyze variation in each field for successful responses
                all_fields = successful_results[0].keys()
                for field in all_fields:
                    values = [r[field] for r in successful_results]
                    unique_values = list(set(values))

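                    # scipy.stats.entropy defaults to the natural log, so entropy is in nats:
                    # 0 when every response agrees, ln(k) when k values are equally likely
                    field_stats = {
                        "unique_values": len(unique_values),
                        "value_frequencies": {str(val): values.count(val) for val in unique_values},
                        "entropy": entropy([values.count(val)/len(values) for val in unique_values])
                    }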
                    prompt_analysis["field_analysis"][field] = field_stats

            analysis[experiment_type] = prompt_analysis

        return analysis

    def generate_report(self, analysis: Dict, output_file: Path):
        """Generate a human-readable summary report"""
        with open(output_file, "w") as f:
            f.write("Structured Output Non-Determinism Experiment Summary\n")
            f.write("===============================================\n\n")

            for experiment_type, results in analysis.items():
                f.write(f"Experiment: {experiment_type}\n")
                f.write(f"Total attempts: {results['total_attempts']}\n")
                f.write(f"Successful responses: {results['successful_responses']}\n")
                f.write(f"Error rate: {results['error_rate']:.2%}\n")

                if results['error_counts']:
                    f.write("\nError Analysis:\n")
                    for error_type, count in results['error_counts'].items():
                        f.write(f"  {error_type}: {count} occurrences\n")

                f.write(f"\nUnique successful responses: {results['unique_responses']}\n")

                if results['field_analysis']:
                    f.write("\nField Analysis:\n")
                    for field, stats in results['field_analysis'].items():
                        f.write(f"\n{field}:\n")
                        f.write(f"  Unique values: {stats['unique_values']}\n")
                        f.write(f"  Entropy: {stats['entropy']:.3f}\n")
                        f.write("  Value frequencies:\n")
                        for val, freq in stats['value_frequencies'].items():
                            f.write(f"    {val}: {freq}\n")
                f.write("\n" + "="*50 + "\n")

    def save_results(self, output_dir: str = "experiment_results"):
        """Save raw results and analysis to files"""
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = Path(output_dir) / timestamp
        output_path.mkdir(parents=True, exist_ok=True)

        # Save raw results including errors
        raw_results_dict = {
            exp_type: [r.model_dump() for r in results]
            for exp_type, results in self.raw_results.items()
        }
        with open(output_path / "raw_results.json", "w") as f:
            json.dump(raw_results_dict, f, indent=2, default=str)

        # Save error counts
        with open(output_path / "error_counts.json", "w") as f:
            json.dump(dict(self.error_counts), f, indent=2)

        # Save analysis
        analysis = self.analyze_results()
        with open(output_path / "analysis.json", "w") as f:
            json.dump(analysis, f, indent=2)

        # Generate summary report
        self.generate_report(analysis, output_path / "summary_report.txt")

        return output_path

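The cell that follows calls `run_notebook_experiment()`, which this section does not define. A minimal sketch of such a driver, assuming it only builds the experiment, runs it, and saves the results, might look like the following. The `experiment` parameter and the printed message are assumptions added for illustration; only `NonDeterminismExperiment`, `run_experiment`, and `save_results` come from the class above.

```python
def run_notebook_experiment(experiment=None, num_iterations: int = 20):
    """Hypothetical driver: run every prompt, save the outputs, return the experiment."""
    if experiment is None:
        # NonDeterminismExperiment is the class defined in the notebook cell above.
        experiment = NonDeterminismExperiment(num_iterations=num_iterations)
    experiment.run_experiment()
    output_path = experiment.save_results()
    print(f"\nResults saved to {output_path}")
    return experiment
```

Passing a pre-built object through `experiment` makes it possible to try the driver with a smaller `num_iterations`, or with a stub, without calling the model.
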
In [21]:
# Second cell: Run the experiment
experiment = run_notebook_experiment()

Running Exercise Benefits experiment...

Error in exercise experiment: Failed to parse ExerciseBenefits from completion {"benefit1": "Regular exercise can reduce the risk of heart disease, stroke, and high blood pressure by improving circulation and cardiovascular function.", "benefit2": "Exercise has been shown to have a positive impact on mental health by reducing symptoms of anxiety and depression, improving mood, and enhancing cognitive function.", "benefit3": "Exercise can increase energy levels and reduce fatigue by improving sleep quality, increasing alertness, and releasing endorphins, which can help combat mid-day slumps and improve overall productivity."}. Got: 1 validation error for ExerciseBenefits
benefit3
  String should have at most 200 characters [type=string_too_long, input_value='Exercise can increase en...e overall productivity.', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/string_too_long
For troubleshooting, visit: https://py