Testing - Toxic Berries #41

@bordumb

Description

Comprehensive ARLA Causal Reasoning Validation Plan

Statistical Analysis Framework (Applied to All Experiments)

Sample Size and Power Analysis

  • N = 50 simulation runs per experimental condition
  • Power analysis validation using G*Power software to confirm 80% power for detecting medium effect sizes (Cohen's d = 0.5)
  • Alpha level = 0.05 for all statistical tests

Statistical Testing Protocol

  1. Distribution Testing:

    • Shapiro-Wilk test for normality (α = 0.05)
    • Levene's test for homogeneity of variances (α = 0.05)
  2. Primary Analysis:

    • If assumptions met: One-way ANOVA followed by Tukey's HSD post-hoc tests
    • If normality violated: Kruskal-Wallis test followed by Dunn's post-hoc tests
    • If variance equality violated: Welch's ANOVA followed by Games-Howell post-hoc tests
  3. Multiple Comparisons Correction:

    • Holm-Bonferroni correction applied across all experiments to control family-wise error rate
  4. Effect Size Calculation:

    • Cohen's d for all pairwise comparisons
    • Eta-squared (η²) for overall ANOVA effects
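The decision tree above, plus the Holm-Bonferroni step-down, can be sketched in plain Python. This is a minimal sketch: the function names are hypothetical, the assumption p-values would in practice come from `scipy.stats.shapiro` and `scipy.stats.levene`, and the protocol does not state a precedence when both assumptions fail, so checking normality first is an assumption here.

```python
def select_primary_test(shapiro_p, levene_p, alpha=0.05):
    """Route to the (omnibus, post-hoc) test pair per the protocol above.

    Normality is checked before variance homogeneity; that ordering is an
    assumption, since the protocol does not specify which violation wins.
    """
    if shapiro_p < alpha:                  # normality violated
        return ("kruskal-wallis", "dunn")
    if levene_p < alpha:                   # homogeneity of variance violated
        return ("welch-anova", "games-howell")
    return ("one-way-anova", "tukey-hsd")  # both assumptions met


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down: reject the i-th smallest p-value while p <= alpha/(m-i)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                          # all larger p-values fail too
    return reject
```

Holm-Bonferroni is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate, which is why it is preferred here.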

Experiment 1: Isolating Causal Reasoning from Sensory Architecture

Hypothesis

The CausalGraphSystem provides significant performance advantages beyond those attributable to sensory architecture or exploration strategies.

Experimental Groups

Group A: Heuristic Baseline (N=50)

  • Agent Type: Baseline-Heuristic-Agent
  • Cognitive Systems: None (rule-based behavior)
  • Sensory Input: Direct access to environment state
  • Decision Making: Move toward closest visible berry
  • Exploration: Deterministic (no randomness)

Group B: Full Causal Model (N=50)

  • Agent Type: Causal-QLearning-Agent
  • Cognitive Systems: QLearningSystem + CausalGraphSystem + PerceptionComponent
  • Sensory Input: Limited perception within vision range
  • Decision Making: Q-learning with causal feedback
  • Exploration: Epsilon-greedy (ε = 0.1, decay = 0.995)

Group C: Perception-Only Control (N=50)

  • Agent Type: Perception-QLearning-Agent
  • Cognitive Systems: QLearningSystem + PerceptionComponent (CausalGraphSystem disabled)
  • Sensory Input: Identical to Group B
  • Decision Making: Standard Q-learning without causal feedback
  • Exploration: Identical epsilon-greedy parameters to Group B

Group D: Exploration-Matched Heuristic (N=50)

  • Agent Type: Heuristic-Agent-with-Exploration
  • Cognitive Systems: None (rule-based with random component)
  • Sensory Input: Direct access to environment state
  • Decision Making: 90% move toward closest berry, 10% random movement
  • Exploration: Matched to approximate Q-learning exploration frequency

Environment Configuration

  • Grid Size: 50x50 cells
  • Berry spawn rates: Red (20%), Blue (15%), Yellow (15%)
  • Water sources: 8-10 randomly placed
  • Rock formations: 15-20 randomly placed
  • Phase 1 (ticks 0-1000): Blue berries always safe
  • Phase 2 (ticks 1000-1600): Blue berries toxic when within 2 tiles of water
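The two-phase toxicity rule can be pinned down as an executable ground truth. A minimal sketch, with two labeled assumptions: "within 2 tiles" is taken to mean Chebyshev distance ≤ 2, and the phase boundary is the tick-1000 mark from the timeline above.

```python
def is_blue_berry_toxic(berry_pos, water_tiles, tick):
    """Ground-truth toxicity rule for the Experiment 1 environment.

    Assumes 'within 2 tiles' means Chebyshev distance <= 2 (an assumption;
    the plan does not name a distance metric).
    """
    if tick < 1000:  # Phase 1: blue berries are always safe
        return False
    # Phase 2: toxic iff any water source is within 2 tiles
    bx, by = berry_pos
    return any(max(abs(bx - wx), abs(by - wy)) <= 2 for wx, wy in water_tiles)
```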

Measurement Protocol

Primary Metrics:

  1. Causal Understanding Score: Proportion of correct decisions regarding blue berries in novel contexts (ticks 1000-1100)
  2. Average Agent Health: Mean health across population during adaptation period (ticks 1000-1200)

Process Metrics:
3. Causal Model Accuracy (Group B only): DoWhy validation of learned causal relationships against ground truth
4. Behavioral Adaptation Speed: Number of ticks required to achieve 80% accuracy on blue berry decisions after environmental change

Control Metrics:
5. Exploration Coverage: Percentage of map tiles visited by each group
6. Berry Consumption Distribution: Proportion of each berry type consumed pre/post environmental change
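The Causal Understanding Score and Behavioral Adaptation Speed metrics reduce to simple aggregations over the decision log. A minimal sketch; the log format (pairs of `(was_toxic, agent_ate)` and a per-tick accuracy map) is hypothetical:

```python
def causal_understanding_score(decisions):
    """Proportion of correct blue-berry decisions.

    `decisions` is a hypothetical log of (was_toxic, agent_ate) pairs;
    correct = avoid a toxic berry, eat a safe one.
    """
    correct = sum((not ate) if toxic else ate for toxic, ate in decisions)
    return correct / len(decisions)


def adaptation_speed(accuracy_by_tick, threshold=0.8, change_tick=1000):
    """Ticks after the environmental change until accuracy first reaches
    `threshold`; None if the threshold is never reached."""
    for tick in sorted(accuracy_by_tick):
        if tick >= change_tick and accuracy_by_tick[tick] >= threshold:
            return tick - change_tick
    return None
```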

Success Criteria

  • Group B must significantly outperform Groups C and D on primary metrics (p < 0.05, d > 0.5)
  • Group B's causal model accuracy must be significantly above chance level
  • Performance differences must persist after controlling for exploration coverage

Experiment 2: Testing Genuine Causal Understanding vs. Simple Avoidance

Hypothesis

Agents with causal reasoning learn flexible, context-dependent rules rather than simple avoidance heuristics.

Experimental Groups

  • Group B: Full Causal Model (from Experiment 1)
  • Group C: Perception-Only Control (from Experiment 1)

Environment Modifications

Base Environment: Same as Experiment 1

Added Elements:

  • Purifier Crystals: 5-7 visible crystal objects randomly placed on map
  • Purification Rule: Blue berries within 2 tiles of crystals are always safe, overriding water toxicity
  • Crystal Visibility: Crystals appear as distinct visual elements agents can perceive
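The override semantics matter: a crystal within range makes a blue berry safe even when water is also within range. A ground-truth sketch of the Experiment 2 rule (as in Experiment 1, Chebyshev distance ≤ 2 is assumed for "within 2 tiles"):

```python
def is_blue_berry_toxic_exp2(berry_pos, water_tiles, crystal_tiles, tick):
    """Experiment 2 ground truth: crystal purification overrides water toxicity.

    'Within 2 tiles' is assumed to mean Chebyshev distance <= 2.
    """
    def within_2(tiles):
        bx, by = berry_pos
        return any(max(abs(bx - x), abs(by - y)) <= 2 for x, y in tiles)

    if tick < 1000:          # Phase 1: crystals inactive, berries safe
        return False
    if within_2(crystal_tiles):  # purification overrides water toxicity
        return False
    return within_2(water_tiles)
```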

Experimental Timeline

  • Phase 1 (ticks 0-1000): Blue berries safe, crystals present but inactive
  • Phase 2 (ticks 1000-1200): Blue berries toxic near water, crystals activate purification
  • Phase 3 (ticks 1200-1600): Extended testing period

Measurement Protocol

Primary Metrics:

  1. Safe Zone Consumption Rate: Number of blue berries consumed near crystals during Phase 2
  2. Crystal Approach Behavior: Frequency of movement toward crystals when blue berries are visible nearby

Secondary Metrics:
3. Causal Chain Recognition: Behavioral evidence of understanding crystal→safety→blue berry relationship
4. Context Switching Accuracy: Correct identification of safe vs. unsafe blue berries based on environmental context

Process Validation:
5. Counterfactual Reasoning Test: Post-training query of causal models about crystal effects

Success Criteria

  • Group B must show significantly higher Safe Zone Consumption Rate than Group C
  • Group B must demonstrate Crystal Approach Behavior while Group C shows avoidance
  • Process validation must confirm Group B learned crystal purification rule

Experiment 3: Temporal Causal Reasoning

Hypothesis

The CausalGraphSystem can identify and utilize temporal causal relationships involving delayed effects.

Experimental Groups

  • Group B: Full Causal Model
  • Group C: Perception-Only Control
  • Both groups receive identical state information including explicit "metabolic boost" status flag

Environment Design

New Berry Type: Orange berries (10% spawn rate)
Temporal Causal Rule:

  • Eating orange berry triggers 50-tick "metabolic boost" state
  • During metabolic boost, all berry types provide double health benefits
  • Metabolic boost status visible to agents in state representation
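The temporal rule is small enough to state executably. A sketch of the boost bookkeeping (class and method names are hypothetical):

```python
class MetabolicBoost:
    """Tracks the 50-tick metabolic boost window from the rule above."""
    DURATION = 50

    def __init__(self):
        self.expires_at = -1  # no boost active initially

    def eat_orange(self, tick):
        """Eating an orange berry (re)starts the 50-tick boost window."""
        self.expires_at = tick + self.DURATION

    def is_active(self, tick):
        return tick < self.expires_at

    def health_gain(self, tick, base_gain):
        """All berry types provide double health benefit during the boost."""
        return base_gain * 2 if self.is_active(tick) else base_gain
```

The `is_active` flag corresponds to the "metabolic boost" status both groups see in their state representation; the experiment asks whether only the causal agents learn to exploit the delayed benefit it signals.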

Experimental Timeline

  • Training Phase (ticks 0-800): Agents learn basic environment
  • Testing Phase (ticks 800-1600): Temporal causation active

Measurement Protocol

Primary Metrics:

  1. Temporal Strategy Score: Frequency of seeking orange berries when health is low (indicating understanding of delayed benefit)
  2. Optimal Timing Behavior: Rate of delaying valuable berry consumption until after eating orange berries

Secondary Metrics:
3. Boost Utilization Rate: Proportion of metabolic boost duration spent consuming high-value berries
4. Causal Discovery Speed: Ticks required to establish orange berry seeking behavior

Success Criteria

  • Group B must show significantly higher Temporal Strategy Score than Group C
  • Group B must demonstrate Optimal Timing Behavior patterns
  • Behavioral differences must emerge despite identical state information access

Experiment 4: Learning Convergence Control

Hypothesis

Performance differences between causal and perception-only agents reflect cognitive architectural advantages, not differential learning time requirements.

Experimental Groups

Group C-Standard: Perception-Only agents run for standard 1600 ticks

Group C-Extended: Perception-Only agents run for 4800 ticks (3x duration)

Group C-Plateau: Perception-Only agents run until performance plateaus (no improvement for 200 consecutive ticks)

Convergence Detection Protocol

  1. Calculate rolling 50-tick average of causal understanding score
  2. Detect plateau when rolling average changes < 0.01 for 200 ticks
  3. Compare final plateau performance against Group B performance at various time points
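Steps 1-2 of the detection protocol can be sketched as a rolling mean over the per-tick score stream. A minimal sketch; treating the stream as a list indexed by tick is an assumption:

```python
from collections import deque

def detect_plateau(scores, window=50, tol=0.01, patience=200):
    """Return the first tick at which the rolling `window`-tick average has
    changed by less than `tol` for `patience` consecutive ticks, else None."""
    rolling = deque(maxlen=window)
    prev_avg, stable = None, 0
    for tick, score in enumerate(scores):
        rolling.append(score)
        if len(rolling) < window:
            continue  # not enough history for a rolling average yet
        avg = sum(rolling) / window
        if prev_avg is not None and abs(avg - prev_avg) < tol:
            stable += 1
            if stable >= patience:
                return tick
        else:
            stable = 0  # improvement seen, reset the patience counter
        prev_avg = avg
    return None
```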

Measurement Protocol

Primary Metrics:

  1. Plateau Performance Level: Final causal understanding score after convergence
  2. Learning Efficiency: Ticks required to reach 90% of final performance level

Convergence Analysis:
3. Asymptotic Performance Comparison: Statistical comparison of final performance levels
4. Learning Curve Analysis: Comparison of learning trajectories using curve fitting

Success Criteria

  • No significant difference between Group C-Standard and Group C-Extended final performance
  • Group C-Plateau performance remains significantly below Group B performance
  • Learning curve analysis confirms different asymptotes rather than different learning rates

Experiment 5: Direct Causal Model Validation

Hypothesis

Agents with CausalGraphSystem develop accurate internal representations of environmental causal structure.

Experimental Protocol

Model Extraction:

import numpy as np

def extract_causal_model(agent_id, causal_system):
    """Extract learned causal relationships from an agent's internal model."""
    return causal_system.get_causal_graph(agent_id)

def validate_causal_relationships(learned_model, ground_truth_rules):
    """Compare the learned model against known environmental causation."""
    accuracy_scores = []

    for relationship in ground_truth_rules:
        # Query the learned model for its predicted interventional effect
        predicted_effect = learned_model.estimate_effect(
            treatment=relationship.cause,
            outcome=relationship.effect,
            treatment_value=relationship.intervention_value,
        )

        # Score the prediction against the environment's ground-truth effect
        true_effect = relationship.ground_truth_effect
        accuracy = calculate_prediction_accuracy(predicted_effect, true_effect)
        accuracy_scores.append(accuracy)

    return float(np.mean(accuracy_scores))

Counterfactual Testing:
Test agents' ability to predict outcomes of hypothetical interventions:

  1. "What would happen if blue berries appeared near crystals instead of water?"
  2. "What would happen if orange berries lasted 100 ticks instead of 50?"
  3. "What would happen if yellow berries never appeared near rocks?"

Measurement Protocol

Model Quality Metrics:

  1. Causal Relationship Accuracy: Proportion of correctly identified causal links
  2. Counterfactual Prediction Accuracy: Accuracy of hypothetical scenario predictions
  3. Model Complexity Score: Number of causal relationships inferred (penalize overfitting)

Success Criteria

  • Group B must achieve >80% accuracy on causal relationship identification
  • Counterfactual predictions must significantly exceed chance performance
  • Model complexity should be appropriate (not overfitted or underfitted)

Experiment 6: Cross-Domain Transfer Learning

Hypothesis

Causal reasoning capabilities generalize across different environmental domains and causal structures.

Transfer Environment: Tool Crafting Domain

Environment Design:

  • Agents must collect resources and craft tools in sequence
  • Multi-step causal chains: Wood + Stone → Hammer; Hammer + Metal → Sword
  • Success requires understanding prerequisite relationships
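The prerequisite structure is a small dependency graph, and "understanding prerequisite relationships" amounts to satisfying it recursively. A sketch with a hypothetical recipe table matching the chains above:

```python
# Hypothetical recipe table for the transfer domain:
# Wood + Stone -> Hammer; Hammer + Metal -> Sword
RECIPES = {
    "hammer": {"wood", "stone"},
    "sword": {"hammer", "metal"},
}

def can_craft(item, inventory):
    """True if `item` is in inventory or all of its prerequisites are
    satisfiable, recursing through intermediate crafted items."""
    if item in inventory:
        return True
    needed = RECIPES.get(item)
    if needed is None:
        return False  # a raw resource we don't hold and can't craft
    return all(can_craft(part, inventory) for part in needed)
```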

Transfer Protocol:

  1. Train agents in berry environment (800 ticks)
  2. Transfer to tool crafting environment (no additional training)
  3. Measure adaptation speed and final performance

Measurement Protocol

Transfer Metrics:

  1. Adaptation Speed: Ticks required to achieve first successful tool craft
  2. Final Performance: Tool crafting success rate in final 200 ticks
  3. Causal Transfer Evidence: Behavioral indicators of applying causal reasoning to new domain

Success Criteria

  • Group B must show faster adaptation speed than Group C
  • Final performance differences must parallel original domain advantages
  • Process analysis must confirm application of causal reasoning principles

Enhanced Exploration Controls

Standardized Exploration Protocol

All Q-learning agents use identical parameters:

  • Initial epsilon: 0.1
  • Epsilon decay rate: 0.995 per tick
  • Minimum epsilon: 0.01
  • Learning rate: 0.1
  • Discount factor: 0.95
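These parameters plug into the standard epsilon-greedy action loop. A minimal sketch of the selection and per-tick decay (function names are hypothetical; `q_values` is the list of Q-values for the current state):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-Q) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Per-tick multiplicative decay, clamped at the minimum epsilon."""
    return max(floor, epsilon * rate)
```

With a per-tick decay of 0.995 from 0.1, epsilon reaches the 0.01 floor after roughly 460 ticks, so most of each run is spent near-greedy.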

Exploration Validation Metrics

  1. Map Coverage: Percentage of environment tiles visited
  2. Action Diversity: Entropy of action selection distribution
  3. Exploration Efficiency: Ratio of novel states discovered to total actions taken
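Metrics 1-2 are direct to compute from the run logs. A sketch assuming a per-agent log of visited tiles and selected actions (the log format is hypothetical):

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy (in bits) of the action-selection distribution."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def map_coverage(visited_tiles, grid_size=50):
    """Fraction of grid tiles visited at least once."""
    return len(set(visited_tiles)) / (grid_size * grid_size)
```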

Comprehensive Success Framework

Evidence FOR Causal Reasoning

  • Significant behavioral advantages (d > 0.5) across multiple experiments
  • High causal model accuracy in process validation (>80%)
  • Successful transfer to novel domains
  • Advantages persist after controlling for exploration and learning time
  • Counterfactual reasoning capabilities exceed chance performance

Evidence AGAINST Causal Reasoning

  • Performance differences eliminated by exploration/learning controls
  • No advantage in process measures of causal model quality
  • Poor transfer to new domains
  • Advantages explained by confounding factors
  • Random-level performance on counterfactual reasoning tests

Implementation Timeline

Weeks 1-2: Implement all experimental conditions and enhanced measurement systems
Weeks 3-4: Execute Experiments 1-3 with full statistical protocols
Weeks 5-6: Execute Experiments 4-6 and cross-experiment validation analyses
Week 7: Comprehensive analysis, replication checks, and results interpretation
