Comprehensive ARLA Causal Reasoning Validation Plan
Statistical Analysis Framework (Applied to All Experiments)
Sample Size and Power Analysis
- N = 50 simulation runs per experimental condition
- Power analysis validation using G*Power software to confirm 80% power for detecting medium effect sizes (Cohen's d = 0.5)
- Alpha level = 0.05 for all statistical tests
Statistical Testing Protocol
Distribution Testing:
- Shapiro-Wilk test for normality (α = 0.05)
- Levene's test for homogeneity of variances (α = 0.05)
Primary Analysis:
- If assumptions met: One-way ANOVA followed by Tukey's HSD post-hoc tests
- If normality is violated: Kruskal-Wallis test followed by Dunn's post-hoc tests
- If homogeneity of variances is violated: Welch's ANOVA followed by Games-Howell post-hoc tests
Multiple Comparisons Correction:
- Holm-Bonferroni correction applied across all experiments to control the family-wise error rate
Effect Size Calculation:
- Cohen's d for all pairwise comparisons
- Eta-squared (η²) for overall ANOVA effects
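The test-selection logic above can be sketched with SciPy. This is a sketch under the stated protocol: SciPy has no built-in k-group Welch's ANOVA, so it is implemented here directly from Welch's 1951 formula, and the group-score arrays are placeholders for per-run results.

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's (1951) ANOVA for k groups with unequal variances."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    w = n / np.array([np.var(g, ddof=1) for g in groups])   # precision weights
    grand_mean = np.sum(w * means) / np.sum(w)
    numer = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    f_stat = numer / (1 + 2 * (k - 2) / (k**2 - 1) * tmp)
    df2 = (k**2 - 1) / (3 * tmp)
    return f_stat, stats.f.sf(f_stat, k - 1, df2)

def omnibus_test(*groups, alpha=0.05):
    """Choose the omnibus test per the protocol above; returns (name, p)."""
    normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    equal_var = stats.levene(*groups).pvalue > alpha
    if not normal:
        return "kruskal-wallis", stats.kruskal(*groups).pvalue
    if not equal_var:
        return "welch-anova", welch_anova(*groups)[1]
    return "anova", stats.f_oneway(*groups).pvalue

def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm correction: returns a reject flag per p-value."""
    order = sorted(range(len(pvals)), key=pvals.__getitem__)
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (len(pvals) - rank):
            break            # all remaining (larger) p-values fail too
        reject[i] = True
    return reject
```

With N = 50 runs per condition, each group is an array of 50 per-run scores; post-hoc tests (Tukey, Dunn, Games-Howell) follow the omnibus result.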
Experiment 1: Isolating Causal Reasoning from Sensory Architecture
Hypothesis
The CausalGraphSystem provides significant performance advantages beyond those attributable to sensory architecture or exploration strategies.
Experimental Groups
Group A: Heuristic Baseline (N=50)
- Agent Type: Baseline-Heuristic-Agent
- Cognitive Systems: None (rule-based behavior)
- Sensory Input: Direct access to environment state
- Decision Making: Move toward closest visible berry
- Exploration: Deterministic (no randomness)
Group B: Full Causal Model (N=50)
- Agent Type: Causal-QLearning-Agent
- Cognitive Systems: QLearningSystem + CausalGraphSystem + PerceptionComponent
- Sensory Input: Limited perception within vision range
- Decision Making: Q-learning with causal feedback
- Exploration: Epsilon-greedy (ε = 0.1, decay = 0.995)
Group C: Perception-Only Control (N=50)
- Agent Type: Perception-QLearning-Agent
- Cognitive Systems: QLearningSystem + PerceptionComponent (CausalGraphSystem disabled)
- Sensory Input: Identical to Group B
- Decision Making: Standard Q-learning without causal feedback
- Exploration: Identical epsilon-greedy parameters to Group B
Group D: Exploration-Matched Heuristic (N=50)
- Agent Type: Heuristic-Agent-with-Exploration
- Cognitive Systems: None (rule-based with random component)
- Sensory Input: Direct access to environment state
- Decision Making: 90% move toward closest berry, 10% random movement
- Exploration: Matched to approximate Q-learning exploration frequency
Environment Configuration
- Grid Size: 50x50 cells
- Berry spawn rates: Red (20%), Blue (15%), Yellow (15%)
- Water sources: 8-10 randomly placed
- Rock formations: 15-20 randomly placed
- Phase 1 (ticks 0-1000): Blue berries always safe
- Phase 2 (ticks 1000-1600): Blue berries toxic when within 2 tiles of water
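For reproducibility, this configuration can be captured in a single structure. The field names below are illustrative, not taken from the ARLA codebase:

```python
# Illustrative configuration for Experiment 1 (field names are hypothetical)
ENV_CONFIG = {
    "grid_size": (50, 50),
    "berry_spawn_rates": {"red": 0.20, "blue": 0.15, "yellow": 0.15},
    "water_sources": (8, 10),        # min/max count, randomly placed
    "rock_formations": (15, 20),     # min/max count, randomly placed
    "phases": [
        {"ticks": (0, 1000),    "blue_toxic_near_water": False},
        {"ticks": (1000, 1600), "blue_toxic_near_water": True},
    ],
}
```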
Measurement Protocol
Primary Metrics:
1. Causal Understanding Score: Proportion of correct decisions regarding blue berries in novel contexts (ticks 1000-1100)
2. Average Agent Health: Mean health across the population during the adaptation period (ticks 1000-1200)
Process Metrics:
3. Causal Model Accuracy (Group B only): DoWhy validation of learned causal relationships against ground truth
4. Behavioral Adaptation Speed: Number of ticks required to achieve 80% accuracy on blue berry decisions after the environmental change
Control Metrics:
5. Exploration Coverage: Percentage of map tiles visited by each group
6. Berry Consumption Distribution: Proportion of each berry type consumed before and after the environmental change
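Two of these metrics reduce to simple computations over logged decisions; a sketch follows, with the logging format assumed rather than taken from the ARLA codebase:

```python
def causal_understanding_score(decisions):
    """Proportion of correct blue-berry decisions.

    decisions: list of (agent_choice, correct_choice) pairs logged during
    the evaluation window (ticks 1000-1100).
    """
    if not decisions:
        return 0.0
    return sum(a == c for a, c in decisions) / len(decisions)

def adaptation_speed(accuracy_by_tick, change_tick=1000, threshold=0.8):
    """Ticks after the environmental change until blue-berry decision
    accuracy first reaches the threshold; None if it is never reached."""
    for tick, acc in enumerate(accuracy_by_tick):
        if tick >= change_tick and acc >= threshold:
            return tick - change_tick
    return None
```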
Success Criteria
- Group B must significantly outperform Groups C and D on primary metrics (p < 0.05, d > 0.5)
- Group B must show causal model accuracy significantly above chance
- Performance differences must persist after controlling for exploration coverage
Experiment 2: Testing Genuine Causal Understanding vs. Simple Avoidance
Hypothesis
Agents with causal reasoning learn flexible, context-dependent rules rather than simple avoidance heuristics.
Experimental Groups
- Group B: Full Causal Model (from Experiment 1)
- Group C: Perception-Only Control (from Experiment 1)
Environment Modifications
Base Environment: Same as Experiment 1
Added Elements:
- Purifier Crystals: 5-7 visible crystal objects randomly placed on map
- Purification Rule: Blue berries within 2 tiles of crystals are always safe, overriding water toxicity
- Crystal Visibility: Crystals appear as distinct visual elements agents can perceive
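The ground-truth rule can be stated as a small predicate. One assumption: "within 2 tiles" is read as Chebyshev distance (2 tiles in any direction, including diagonals):

```python
def _within(pos, tiles, radius):
    """Chebyshev distance check: is pos within radius tiles of any tile?"""
    return any(max(abs(pos[0] - t[0]), abs(pos[1] - t[1])) <= radius
               for t in tiles)

def blue_berry_is_safe(berry_pos, water_tiles, crystal_tiles, radius=2):
    """Ground truth for Phase 2: crystal purification overrides water toxicity."""
    if _within(berry_pos, crystal_tiles, radius):
        return True                      # always safe near a crystal
    return not _within(berry_pos, water_tiles, radius)
```

Group B should learn the override (crystal dominates water); Group C, lacking a causal model, is expected to learn only the blanket association between blue berries and toxicity.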
Experimental Timeline
- Phase 1 (ticks 0-1000): Blue berries safe, crystals present but inactive
- Phase 2 (ticks 1000-1200): Blue berries toxic near water, crystals activate purification
- Phase 3 (ticks 1200-1600): Extended testing period
Measurement Protocol
Primary Metrics:
1. Safe Zone Consumption Rate: Number of blue berries consumed near crystals during Phase 2
2. Crystal Approach Behavior: Frequency of movement toward crystals when blue berries are visible nearby
Secondary Metrics:
3. Causal Chain Recognition: Behavioral evidence of understanding the crystal → safety → blue berry relationship
4. Context Switching Accuracy: Correct identification of safe vs. unsafe blue berries based on environmental context
Process Validation:
5. Counterfactual Reasoning Test: Post-training query of causal models about crystal effects
Success Criteria
- Group B must show significantly higher Safe Zone Consumption Rate than Group C
- Group B must demonstrate Crystal Approach Behavior while Group C shows avoidance
- Process validation must confirm Group B learned crystal purification rule
Experiment 3: Temporal Causal Reasoning
Hypothesis
The CausalGraphSystem can identify and utilize temporal causal relationships involving delayed effects.
Experimental Groups
- Group B: Full Causal Model
- Group C: Perception-Only Control
- Both groups receive identical state information including explicit "metabolic boost" status flag
Environment Design
New Berry Type: Orange berries (10% spawn rate)
Temporal Causal Rule:
- Eating orange berry triggers 50-tick "metabolic boost" state
- During metabolic boost, all berry types provide double health benefits
- Metabolic boost status visible to agents in state representation
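A minimal sketch of the boost mechanics, under two assumptions not pinned down above: the timer restarts on each orange berry, and doubling applies to any berry eaten while the timer is running:

```python
class MetabolicBoost:
    """Tracks the 50-tick metabolic boost triggered by orange berries."""
    DURATION = 50

    def __init__(self):
        self.ticks_left = 0

    @property
    def active(self):
        return self.ticks_left > 0

    def eat(self, berry_type, base_health_gain):
        # Double the benefit of any berry eaten during an active boost
        gain = base_health_gain * (2 if self.active else 1)
        if berry_type == "orange":
            self.ticks_left = self.DURATION   # (re)start the boost timer
        return gain

    def tick(self):
        self.ticks_left = max(0, self.ticks_left - 1)
```

The `active` flag is what both groups see in their state representation, so any behavioral difference must come from how the delayed effect is modeled, not from information access.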
Experimental Timeline
- Training Phase (ticks 0-800): Agents learn basic environment
- Testing Phase (ticks 800-1600): Temporal causation active
Measurement Protocol
Primary Metrics:
1. Temporal Strategy Score: Frequency of seeking orange berries when health is low (indicating understanding of the delayed benefit)
2. Optimal Timing Behavior: Rate of delaying valuable berry consumption until after eating orange berries
Secondary Metrics:
3. Boost Utilization Rate: Proportion of the metabolic boost duration spent consuming high-value berries
4. Causal Discovery Speed: Ticks required to establish orange-berry-seeking behavior
Success Criteria
- Group B must show significantly higher Temporal Strategy Score than Group C
- Group B must demonstrate Optimal Timing Behavior patterns
- Behavioral differences must emerge despite identical state information access
Experiment 4: Learning Convergence Control
Hypothesis
Performance differences between causal and perception-only agents reflect cognitive architectural advantages, not differential learning time requirements.
Experimental Groups
Group C-Standard: Perception-Only agents run for standard 1600 ticks
Group C-Extended: Perception-Only agents run for 4800 ticks (3x duration)
Group C-Plateau: Perception-Only agents run until performance plateaus (no improvement for 200 consecutive ticks)
Convergence Detection Protocol
- Calculate rolling 50-tick average of causal understanding score
- Detect plateau when rolling average changes < 0.01 for 200 ticks
- Compare final plateau performance against Group B performance at various time points
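The plateau rule translates directly into code. A sketch, where `scores` is the per-tick causal understanding score series:

```python
import numpy as np

def detect_plateau(scores, window=50, patience=200, tol=0.01):
    """First tick at which the rolling window-tick mean has changed by less
    than tol for patience consecutive ticks; None if no plateau is reached."""
    rolling = np.convolve(scores, np.ones(window) / window, mode="valid")
    stable = 0
    for t in range(1, len(rolling)):
        if abs(rolling[t] - rolling[t - 1]) < tol:
            stable += 1
            if stable >= patience:
                return t + window - 1    # index back into the original series
        else:
            stable = 0
    return None
```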
Measurement Protocol
Primary Metrics:
1. Plateau Performance Level: Final causal understanding score after convergence
2. Learning Efficiency: Ticks required to reach 90% of the final performance level
Convergence Analysis:
3. Asymptotic Performance Comparison: Statistical comparison of final performance levels
4. Learning Curve Analysis: Comparison of learning trajectories using curve fitting
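One way to operationalize the curve fitting is an exponential saturation model, where parameter a is the asymptote and r the learning rate; comparing a across groups tests "different asymptotes" while comparing r tests "different learning rates". A sketch using SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(t, a, b, r):
    """s(t) = a - (a - b) * exp(-r t): asymptote a, initial level b, rate r."""
    return a - (a - b) * np.exp(-r * t)

def fit_learning_curve(ticks, scores):
    """Fit the saturation model to one group's learning trajectory."""
    p0 = [scores[-1], scores[0], 0.01]            # rough initial guess
    params, _ = curve_fit(saturating_curve, ticks, scores, p0=p0, maxfev=10000)
    return dict(zip(("asymptote", "initial", "rate"), params))
```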
Success Criteria
- No significant difference between Group C-Standard and Group C-Extended final performance
- Group C-Plateau performance remains significantly below Group B performance
- Learning curve analysis confirms different asymptotes rather than different learning rates
Experiment 5: Direct Causal Model Validation
Hypothesis
Agents with CausalGraphSystem develop accurate internal representations of environmental causal structure.
Experimental Protocol
Model Extraction:

```python
import numpy as np

def extract_causal_model(agent_id, causal_system):
    """Extract the learned causal graph from the agent's internal model."""
    return causal_system.get_causal_graph(agent_id)

def validate_causal_relationships(learned_model, ground_truth_rules):
    """Score the learned model against known environmental causation."""
    accuracy_scores = []
    for relationship in ground_truth_rules:
        # Query the learned model for the effect of a hypothetical intervention
        predicted_effect = learned_model.estimate_effect(
            treatment=relationship.cause,
            outcome=relationship.effect,
            treatment_value=relationship.intervention_value,
        )
        true_effect = relationship.ground_truth_effect
        accuracy = calculate_prediction_accuracy(predicted_effect, true_effect)
        accuracy_scores.append(accuracy)
    return np.mean(accuracy_scores)
```

Counterfactual Testing:
Test agents' ability to predict outcomes of hypothetical interventions:
- "What would happen if blue berries appeared near crystals instead of water?"
- "What would happen if orange berries lasted 100 ticks instead of 50?"
- "What would happen if yellow berries never appeared near rocks?"
Measurement Protocol
Model Quality Metrics:
- Causal Relationship Accuracy: Proportion of correctly identified causal links
- Counterfactual Prediction Accuracy: Accuracy of hypothetical scenario predictions
- Model Complexity Score: Number of causal relationships inferred (penalize overfitting)
Success Criteria
- Group B must achieve >80% accuracy on causal relationship identification
- Counterfactual predictions must significantly exceed chance performance
- Model complexity should be appropriate (not overfitted or underfitted)
Experiment 6: Cross-Domain Transfer Learning
Hypothesis
Causal reasoning capabilities generalize across different environmental domains and causal structures.
Transfer Environment: Tool Crafting Domain
Environment Design:
- Agents must collect resources and craft tools in sequence
- Multi-step causal chains: Wood + Stone → Hammer; Hammer + Metal → Sword
- Success requires understanding prerequisite relationships
Transfer Protocol:
- Train agents in berry environment (800 ticks)
- Transfer to tool crafting environment (no additional training)
- Measure adaptation speed and final performance
Measurement Protocol
Transfer Metrics:
- Adaptation Speed: Ticks required to achieve first successful tool craft
- Final Performance: Tool crafting success rate in final 200 ticks
- Causal Transfer Evidence: Behavioral indicators of applying causal reasoning to new domain
Success Criteria
- Group B must show faster adaptation speed than Group C
- Final performance differences must parallel original domain advantages
- Process analysis must confirm application of causal reasoning principles
Enhanced Exploration Controls
Standardized Exploration Protocol
All Q-learning agents use identical parameters:
- Initial epsilon: 0.1
- Epsilon decay rate: 0.995 per tick
- Minimum epsilon: 0.01
- Learning rate: 0.1
- Discount factor: 0.95
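These parameters correspond to a standard epsilon-greedy tabular Q-learning loop; a sketch with the listed values as defaults:

```python
import random

def epsilon_greedy_action(q_row, epsilon):
    """Random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)

def decay_epsilon(epsilon, decay=0.995, minimum=0.01):
    """Per-tick decay, floored at the minimum epsilon."""
    return max(minimum, epsilon * decay)

def q_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning update with the listed hyperparameters."""
    q[s][a] += alpha * (reward + gamma * max(q[s_next]) - q[s][a])
```

Holding these constant across Groups B and C ensures any performance gap is attributable to the CausalGraphSystem rather than exploration tuning.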
Exploration Validation Metrics
- Map Coverage: Percentage of environment tiles visited
- Action Diversity: Entropy of action selection distribution
- Exploration Efficiency: Ratio of novel states discovered to total actions taken
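All three validation metrics are straightforward to compute from trajectory logs. A sketch, with the log format (one tile per move) assumed:

```python
import math
from collections import Counter

def map_coverage(visited_tiles, grid_size=(50, 50)):
    """Percentage of environment tiles visited at least once."""
    return 100.0 * len(set(visited_tiles)) / (grid_size[0] * grid_size[1])

def action_diversity(actions):
    """Shannon entropy (bits) of the action selection distribution."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def exploration_efficiency(visited_tiles):
    """Ratio of novel tiles discovered to total moves taken."""
    return len(set(visited_tiles)) / len(visited_tiles)
```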
Comprehensive Success Framework
Evidence FOR Causal Reasoning
- Significant behavioral advantages (d > 0.5) across multiple experiments
- High causal model accuracy in process validation (>80%)
- Successful transfer to novel domains
- Advantages persist after controlling for exploration and learning time
- Counterfactual reasoning capabilities exceed chance performance
Evidence AGAINST Causal Reasoning
- Performance differences eliminated by exploration/learning controls
- No advantage in process measures of causal model quality
- Poor transfer to new domains
- Advantages explained by confounding factors
- Random-level performance on counterfactual reasoning tests
Implementation Timeline
Weeks 1-2: Implement all experimental conditions and enhanced measurement systems
Weeks 3-4: Execute Experiments 1-3 with full statistical protocols
Weeks 5-6: Execute Experiments 4-6 and cross-experiment validation analyses
Week 7: Comprehensive analysis, replication checks, and results interpretation