<a href="https://colab.research.google.com/github/wesslen/llm-experiments/blob/main/notebooks/nondeterminism/structured_data/llama3_1_8b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 3.1 8B Instruct Non-Determinism Experiment Analysis

## Experiment Overview
* Tests Llama 3.1 8B Instruct's output consistency with structured JSON responses
* Uses three different prompt types:
  - Exercise benefits (descriptive/analytical)
  - Random numbers (explicit randomness)
  - Major cities (factual)
* Runs 20 iterations per prompt
* Measures variation per JSON field using Shannon entropy (natural log, as computed by `scipy.stats.entropy`) and unique response counts
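
As a rough illustration of the entropy measure, the sketch below computes Shannon entropy over the observed values of a single field using the natural logarithm, which is also the default base of `scipy.stats.entropy`. The helper name `shannon_entropy` and the sample values are illustrative and not part of the notebook.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # Sum p * ln(1/p) over each distinct value
    return sum((c / total) * math.log(total / c) for c in counts.values())

# Illustrative field values, not taken from the experiment
identical = ["Tokyo"] * 20                      # every run gives the same answer
distinct = [f"answer {i}" for i in range(20)]   # every run gives a different answer

print(round(shannon_entropy(identical), 3))  # 0.0
print(round(shannon_entropy(distinct), 3))   # 2.996
```

With 20 iterations per prompt, a field's entropy therefore ranges from 0 (fully deterministic) to ln 20 ≈ 2.996 (every response different).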

## Hypothesis
* Exercise benefits: Expected moderate variation (entropy ~1.0-1.5)
* Random numbers: Expected high variation (entropy ~2.0-2.5)
* City names: Expected low variation (entropy ~0.5-1.0)
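
One way to read these ranges, assuming entropy is measured in nats (the default for `scipy.stats.entropy`): exp(H) is the number of equally likely values that would produce entropy H. The sketch below converts each hypothesized range into that "effective number of values"; the names `hypothesized` and `effective_values` are illustrative.

```python
import math

# Hypothesized entropy ranges (nats) from the list above
hypothesized = {
    "exercise benefits": (1.0, 1.5),
    "random numbers": (2.0, 2.5),
    "city names": (0.5, 1.0),
}

def effective_values(h):
    """Number of equally likely values whose entropy is h nats."""
    return math.exp(h)

for task, (low, high) in hypothesized.items():
    print(f"{task}: {effective_values(low):.1f} to {effective_values(high):.1f} effective values")
```

Under this reading, the random-number hypothesis corresponds to roughly 7 to 12 equally likely values per field, and the city hypothesis to about 2 to 3.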

## Results Analysis
* Exercise Benefits
  - Shows near-maximal entropy across all three fields (2.831, 2.926, 2.926; the ceiling for 20 responses is ln 20 ≈ 2.996)
  - Extremely high variation, with 18-19 unique values per field out of 20 responses
  - Strong tendency to elaborate and expand on basic concepts
  - Much of the variation is rephrasing of similar concepts rather than entirely different ideas

* Random Numbers
  - Shows strong clustering on a few values despite the request for randomness
  - num1: Strong preference for "53" (12/20 responses)
  - num2: Strong preference for "91" (14/20 responses)
  - num3: Preference for "14" (10/20 responses)
  - Much lower entropy than the exercise responses (1.175, 0.984, 1.402) and below the hypothesized 2.0-2.5

* Cities
  - Shows moderate consistency with some variation
  - city1: Strong preference for "New York/New York City" (16/20 combined)
  - city2: More distributed between London, Tokyo, and New York
  - city3: Balanced between London and Tokyo
  - Entropy increases from city1 to city2 (0.886 → 1.479)
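
The reported entropies for the exercise fields can be sanity-checked against simple count patterns over 20 responses. The sketch below assumes natural-log entropy and assumes specific count patterns that are inferred from the unique-value counts rather than read from the raw results: 18 unique values as 17 singletons plus one value seen three times, and 19 unique values as 18 singletons plus one pair.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy (nats) of a distribution given raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log(total / c) for c in counts)

# Assumed count patterns over 20 responses (inferred, not from the raw data)
eighteen_unique = [3] + [1] * 17   # one value seen three times, 17 seen once
nineteen_unique = [2] + [1] * 18   # one value seen twice, 18 seen once

print(round(entropy_from_counts(eighteen_unique), 3))  # 2.831
print(round(entropy_from_counts(nineteen_unique), 3))  # 2.926
print(round(math.log(20), 3))                          # 2.996, the ceiling for 20 responses
```

Both values match the reported figures, which supports reading all entropy scores in this analysis as nats.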

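Key Finding: Llama 3.1 8B Instruct is far less deterministic on the descriptive task than hypothesized (entropy of roughly 2.8-2.9 against an expected 1.0-1.5), while the random-number task is more predictable than hypothesized (roughly 1.0-1.4 against an expected 2.0-2.5). The city task lands near or somewhat above its expected range (0.886 for city1, 1.479 for city2). This suggests the model anchors more strongly in factual and numerical responses but has more flexibility in generating descriptive content.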
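
The comparison between hypothesis and result can be made explicit with a small check. This is a sketch using only the ranges and per-field entropies quoted in the analysis above; the `compare` helper and dictionary names are illustrative.

```python
# Hypothesized entropy ranges and reported per-field entropies (nats), from the text above
hypothesized = {
    "exercise benefits": (1.0, 1.5),
    "random numbers": (2.0, 2.5),
    "cities": (0.5, 1.0),
}
observed = {
    "exercise benefits": [2.831, 2.926, 2.926],
    "random numbers": [1.175, 0.984, 1.402],
    "cities": [0.886, 1.479],  # only city1 and city2 are reported in the text
}

def compare(value, low, high):
    """Classify an observed entropy against a hypothesized range."""
    if value < low:
        return "below"
    if value > high:
        return "above"
    return "within"

verdicts = {
    task: [compare(v, *hypothesized[task]) for v in values]
    for task, values in observed.items()
}
for task, result in verdicts.items():
    print(f"{task}: {result}")
```

Every exercise field comes out above its range, every random-number field below, and the two reported city fields split between within and above.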

In [5]:
!uv pip install --system openai

Using Python 3.10.12 environment at /usr
Resolved 15 packages in 427ms
Prepared 1 package in 69ms
Installed 1 package in 4ms

In [6]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('LLAMA3_KEY')
os.environ["BASE_URL"] = userdata.get('BASE_URL')
os.environ["model"] = "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"

In [7]:
import datetime
import json
import os
import statistics
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional

import numpy as np
import openai
import pandas as pd
from scipy.stats import entropy

@dataclass
class ExperimentConfig:
    num_iterations: int = 20
    # None means "use the default prompts assigned in __post_init__"
    prompts: Optional[List[str]] = None

    def __post_init__(self):
        if self.prompts is None:
            self.prompts = [
                "What are three key benefits of exercise? Format response as JSON with keys: benefit1, benefit2, benefit3",
                "List three random numbers between 1-100. Format as JSON with keys: num1, num2, num3",
                "Name three major cities. Format as JSON with keys: city1, city2, city3"
            ]

class NonDeterminismExperiment:
    def __init__(self, config: ExperimentConfig):
        # Set up OpenAI client with environment variables
        self.client = openai.OpenAI(
            api_key=os.getenv("OPENAI_API_KEY"),
            base_url=os.getenv("BASE_URL")
        )
        self.config = config
        self.results = defaultdict(list)
        self.model = os.getenv("model")

    def get_completion_response(self, prompt: str) -> Dict:
        """Get a single response and parse as JSON"""
        try:
            # Format the message to ensure JSON output
            system_message = "You are a helpful assistant. Always respond with valid JSON format."

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": prompt}
                ],
                temperature=1.0,  # Sample from the model's unscaled distribution (not a maximum)
                max_tokens=1024
            )

            # Extract the response content
            response_text = response.choices[0].message.content.strip()

            # Handle potential markdown code blocks in response
            if response_text.startswith("```json"):
                response_text = response_text.replace("```json", "").replace("```", "").strip()
            elif response_text.startswith("```"):
                response_text = response_text.replace("```", "").strip()

            return json.loads(response_text)

        except Exception as e:
            print(f"Error getting response: {e}")
            print(f"Raw response text: {response_text if 'response_text' in locals() else 'No response'}")
            return None

    def run_experiment(self):
        """Run the experiment for all prompts and iterations"""
        for prompt in self.config.prompts:
            print(f"\nRunning experiment for prompt: {prompt}")

            responses = []
            for i in range(self.config.num_iterations):
                print(f"Running iteration {i+1}/{self.config.num_iterations}", end="\r")
                response = self.get_completion_response(prompt)
                if response is not None:
                    responses.append(response)

            print()  # New line after progress
            self.results[prompt] = responses

    def analyze_results(self) -> Dict:
        """Analyze the results and compute statistics"""
        analysis = {}

        for prompt, responses in self.results.items():
            if not responses:  # Skip if no valid responses
                continue

            prompt_analysis = {
                "total_responses": len(responses),
                # sort_keys makes the comparison independent of key order
                "unique_responses": len(set(json.dumps(r, sort_keys=True) for r in responses)),
                "field_analysis": {}
            }

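            # Analyze variation in each field; collect keys from every response,
            # since a response may omit or add a field
            all_fields = list(dict.fromkeys(key for r in responses for key in r))
            for field in all_fields:
                # Serialize non-scalar values so they are hashable and comparable
                values = [
                    v if isinstance(v, (str, int, float, bool)) else json.dumps(v, sort_keys=True)
                    for v in (r.get(field) for r in responses)
                ]
                value_frequencies = {}
                for val in values:
                    value_frequencies[val] = value_frequencies.get(val, 0) + 1

                field_stats = {
                    "unique_values": len(value_frequencies),
                    "value_frequencies": value_frequencies,
                    # scipy.stats.entropy normalizes counts and uses the natural log (nats)
                    "entropy": float(entropy(list(value_frequencies.values())))
                }
                prompt_analysis["field_analysis"][field] = field_stats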

            analysis[prompt] = prompt_analysis

        return analysis

    def save_results(self, output_dir: str = "experiment_results"):
        """Save raw results and analysis to files"""
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = Path(output_dir) / timestamp
        output_path.mkdir(parents=True, exist_ok=True)

        # Save raw results
        with open(output_path / "raw_results.json", "w") as f:
            json.dump(dict(self.results), f, indent=2)

        # Save analysis
        analysis = self.analyze_results()
        with open(output_path / "analysis.json", "w") as f:
            json.dump(analysis, f, indent=2)

        # Generate summary report
        self.generate_report(analysis, output_path / "summary_report.txt")

        return output_path

    def generate_report(self, analysis: Dict, output_file: Path):
        """Generate a human-readable summary report"""
        with open(output_file, "w") as f:
            f.write("OpenAI Model Non-Determinism Experiment Summary\n")
            f.write("============================================\n\n")

            for prompt, results in analysis.items():
                f.write(f"Prompt: {prompt}\n")
                f.write(f"Total responses: {results['total_responses']}\n")
                f.write(f"Unique responses: {results['unique_responses']}\n")
                f.write("\nField Analysis:\n")

                for field, stats in results['field_analysis'].items():
                    f.write(f"\n{field}:\n")
                    f.write(f"  Unique values: {stats['unique_values']}\n")
                    f.write(f"  Entropy: {stats['entropy']:.3f}\n")
                    f.write("  Value frequencies:\n")
                    for val, freq in stats['value_frequencies'].items():
                        f.write(f"    {val}: {freq}\n")
                f.write("\n" + "="*50 + "\n")

def run_notebook_experiment():
    """Run the experiment from a Jupyter notebook"""
    # Validate environment variables
    required_vars = ["OPENAI_API_KEY", "BASE_URL", "model"]
    missing_vars = [var for var in required_vars if not os.getenv(var)]

    if missing_vars:
        raise ValueError(f"Missing required environment variables: {', '.join(missing_vars)}")

    config = ExperimentConfig()
    experiment = NonDeterminismExperiment(config)

    experiment.run_experiment()
    output_path = experiment.save_results()

    # Display the summary report
    with open(output_path / "summary_report.txt", "r") as f:
        print(f.read())

    return experiment
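
`analyze_results` compares field values by exact match, so surface variants such as "New York" and "New York City" count as different values and raise the measured entropy. A minimal sketch of an optional normalization step follows; the `ALIASES` map and `normalize_value` helper are hypothetical and not part of the notebook.

```python
from collections import Counter

# Hypothetical alias map; extend it for whatever variants appear in the results
ALIASES = {"new york city": "new york"}

def normalize_value(value):
    """Lowercase, trim, and collapse known aliases; non-strings pass through."""
    if not isinstance(value, str):
        return value
    key = " ".join(value.lower().split())
    return ALIASES.get(key, key)

# Illustrative values, not the experiment's raw data
raw = ["New York", "New York City", " new york ", "London"]
counts = Counter(normalize_value(v) for v in raw)
print(counts)
```

Applying such a step before counting would trade some fidelity to the raw output for entropy scores that reflect differences in meaning rather than wording.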

In [8]:

# Run the experiment
experiment = run_notebook_experiment()


Running experiment for prompt: What are three key benefits of exercise? Format response as JSON with keys: benefit1, benefit2, benefit3


Running experiment for prompt: List three random numbers between 1-100. Format as JSON with keys: num1, num2, num3


Running experiment for prompt: Name three major cities. Format as JSON with keys: city1, city2, city3

OpenAI Model Non-Determinism Experiment Summary

Prompt: What are three key benefits of exercise? Format response as JSON with keys: benefit1, benefit2, benefit3
Total responses: 20
Unique responses: 20

Field Analysis:

benefit1:
  Unique values: 18
  Entropy: 2.831
  Value frequencies:
    Weight Management and Body Composition: 1
    Improves Mental Health and Reduces Stress: 1
    Improved Physical Health: Regular exercise helps to maintain a healthy weight, reduce the risk of chronic diseases such as heart disease, diabetes, and some cancers.: 1
    Improves physical health, reducing the risk of chronic diseases such as heart di