Testing - Toxic Berries #41

@bordumb

Description

Comprehensive ARLA Causal Reasoning Validation Plan

Statistical Analysis Framework (Applied to All Experiments)

Sample Size and Power Analysis

  • N = 50 simulation runs per experimental condition
  • Power analysis validation using G*Power software to confirm 80% power for detecting medium effect sizes (Cohen's d = 0.5)
  • Alpha level = 0.05 for all statistical tests

Statistical Testing Protocol

  1. Distribution Testing:

    • Shapiro-Wilk test for normality (α = 0.05)
    • Levene's test for homogeneity of variances (α = 0.05)
  2. Primary Analysis:

    • If assumptions met: One-way ANOVA followed by Tukey's HSD post-hoc tests
    • If normality violated: Kruskal-Wallis test followed by Dunn's post-hoc tests
    • If variance equality violated: Welch's ANOVA followed by Games-Howell post-hoc tests
  3. Multiple Comparisons Correction:

    • Holm-Bonferroni correction applied across all experiments to control family-wise error rate
  4. Effect Size Calculation:

    • Cohen's d for all pairwise comparisons
    • Eta-squared (η²) for overall ANOVA effects
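The decision tree above, plus the Holm-Bonferroni step-down, can be sketched in plain Python. This is a minimal sketch: the function names are hypothetical, the assumption p-values would in practice come from `scipy.stats.shapiro` and `scipy.stats.levene`, and the protocol does not state a precedence when both assumptions fail, so checking normality first is an assumption here.

```python
def select_primary_test(shapiro_p, levene_p, alpha=0.05):
    """Route to the (omnibus, post-hoc) test pair per the protocol above.

    Normality is checked before variance homogeneity; that ordering is an
    assumption, since the protocol does not specify which violation wins.
    """
    if shapiro_p < alpha:                  # normality violated
        return ("kruskal-wallis", "dunn")
    if levene_p < alpha:                   # homogeneity of variance violated
        return ("welch-anova", "games-howell")
    return ("one-way-anova", "tukey-hsd")  # both assumptions met


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down: reject the i-th smallest p-value while p <= alpha/(m-i)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                          # all larger p-values fail too
    return reject
```

Holm-Bonferroni is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate, which is why it is preferred here.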

Experiment 1: Isolating Causal Reasoning from Sensory Architecture

Hypothesis

The CausalGraphSystem provides significant performance advantages beyond those attributable to sensory architecture or exploration strategies.

Experimental Groups

Group A: Heuristic Baseline (N=50)

  • Agent Type: Baseline-Heuristic-Agent
  • Cognitive Systems: None (rule-based behavior)
  • Sensory Input: Direct access to environment state
  • Decision Making: Move toward closest visible berry
  • Exploration: Deterministic (no randomness)

Group B: Full Causal Model (N=50)

  • Agent Type: Causal-QLearning-Agent
  • Cognitive Systems: QLearningSystem + CausalGraphSystem + PerceptionComponent
  • Sensory Input: Limited perception within vision range
  • Decision Making: Q-learning with causal feedback
  • Exploration: Epsilon-greedy (ε = 0.1, decay = 0.995)

Group C: Perception-Only Control (N=50)

  • Agent Type: Perception-QLearning-Agent
  • Cognitive Systems: QLearningSystem + PerceptionComponent (CausalGraphSystem disabled)
  • Sensory Input: Identical to Group B
  • Decision Making: Standard Q-learning without causal feedback
  • Exploration: Identical epsilon-greedy parameters to Group B

Group D: Exploration-Matched Heuristic (N=50)

  • Agent Type: Heuristic-Agent-with-Exploration
  • Cognitive Systems: None (rule-based with random component)
  • Sensory Input: Direct access to environment state
  • Decision Making: 90% move toward closest berry, 10% random movement
  • Exploration: Matched to approximate Q-learning exploration frequency

Environment Configuration

  • Grid Size: 50x50 cells
  • Berry spawn rates: Red (20%), Blue (15%), Yellow (15%)
  • Water sources: 8-10 randomly placed
  • Rock formations: 15-20 randomly placed
  • Phase 1 (ticks 0-1000): Blue berries always safe
  • Phase 2 (ticks 1000-1600): Blue berries toxic when within 2 tiles of water
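The two-phase toxicity rule can be pinned down as an executable ground truth. A minimal sketch, with two labeled assumptions: "within 2 tiles" is taken to mean Chebyshev distance ≤ 2, and the phase boundary is the tick-1000 mark from the timeline above.

```python
def is_blue_berry_toxic(berry_pos, water_tiles, tick):
    """Ground-truth toxicity rule for the Experiment 1 environment.

    Assumes 'within 2 tiles' means Chebyshev distance <= 2 (an assumption;
    the plan does not name a distance metric).
    """
    if tick < 1000:  # Phase 1: blue berries are always safe
        return False
    # Phase 2: toxic iff any water source is within 2 tiles
    bx, by = berry_pos
    return any(max(abs(bx - wx), abs(by - wy)) <= 2 for wx, wy in water_tiles)
```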

Measurement Protocol

Primary Metrics:

  1. Causal Understanding Score: Proportion of correct decisions regarding blue berries in novel contexts (ticks 1000-1100)
  2. Average Agent Health: Mean health across population during adaptation period (ticks 1000-1200)

Process Metrics:
3. Causal Model Accuracy (Group B only): DoWhy validation of learned causal relationships against ground truth
4. Behavioral Adaptation Speed: Number of ticks required to achieve 80% accuracy on blue berry decisions after environmental change

Control Metrics:
5. Exploration Coverage: Percentage of map tiles visited by each group
6. Berry Consumption Distribution: Proportion of each berry type consumed pre/post environmental change
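The Causal Understanding Score and Behavioral Adaptation Speed metrics reduce to simple aggregations over the decision log. A minimal sketch; the log format (pairs of `(was_toxic, agent_ate)` and a per-tick accuracy map) is hypothetical:

```python
def causal_understanding_score(decisions):
    """Proportion of correct blue-berry decisions.

    `decisions` is a hypothetical log of (was_toxic, agent_ate) pairs;
    correct = avoid a toxic berry, eat a safe one.
    """
    correct = sum((not ate) if toxic else ate for toxic, ate in decisions)
    return correct / len(decisions)


def adaptation_speed(accuracy_by_tick, threshold=0.8, change_tick=1000):
    """Ticks after the environmental change until accuracy first reaches
    `threshold`; None if the threshold is never reached."""
    for tick in sorted(accuracy_by_tick):
        if tick >= change_tick and accuracy_by_tick[tick] >= threshold:
            return tick - change_tick
    return None
```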

Success Criteria

  • Group B must significantly outperform Groups C and D on primary metrics (p < 0.05, d > 0.5)
  • Group B's causal model accuracy must be significantly above chance level
  • Performance differences must persist after controlling for exploration coverage

Experiment 2: Testing Genuine Causal Understanding vs. Simple Avoidance

Hypothesis

Agents with causal reasoning learn flexible, context-dependent rules rather than simple avoidance heuristics.

Experimental Groups

  • Group B: Full Causal Model (from Experiment 1)
  • Group C: Perception-Only Control (from Experiment 1)

Environment Modifications

Base Environment: Same as Experiment 1

Added Elements:

  • Purifier Crystals: 5-7 visible crystal objects randomly placed on map
  • Purification Rule: Blue berries within 2 tiles of crystals are always safe, overriding water toxicity
  • Crystal Visibility: Crystals appear as distinct visual elements agents can perceive
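The override semantics matter: a crystal within range makes a blue berry safe even when water is also within range. A ground-truth sketch of the Experiment 2 rule (as in Experiment 1, Chebyshev distance ≤ 2 is assumed for "within 2 tiles"):

```python
def is_blue_berry_toxic_exp2(berry_pos, water_tiles, crystal_tiles, tick):
    """Experiment 2 ground truth: crystal purification overrides water toxicity.

    'Within 2 tiles' is assumed to mean Chebyshev distance <= 2.
    """
    def within_2(tiles):
        bx, by = berry_pos
        return any(max(abs(bx - x), abs(by - y)) <= 2 for x, y in tiles)

    if tick < 1000:          # Phase 1: crystals inactive, berries safe
        return False
    if within_2(crystal_tiles):  # purification overrides water toxicity
        return False
    return within_2(water_tiles)
```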

Experimental Timeline

  • Phase 1 (ticks 0-1000): Blue berries safe, crystals present but inactive
  • Phase 2 (ticks 1000-1200): Blue berries toxic near water, crystals activate purification
  • Phase 3 (ticks 1200-1600): Extended testing period

Measurement Protocol

Primary Metrics:

  1. Safe Zone Consumption Rate: Number of blue berries consumed near crystals during Phase 2
  2. Crystal Approach Behavior: Frequency of movement toward crystals when blue berries are visible nearby

Secondary Metrics:
3. Causal Chain Recognition: Behavioral evidence of understanding crystal→safety→blue berry relationship
4. Context Switching Accuracy: Correct identification of safe vs. unsafe blue berries based on environmental context

Process Validation:
5. Counterfactual Reasoning Test: Post-training query of causal models about crystal effects

Success Criteria

  • Group B must show significantly higher Safe Zone Consumption Rate than Group C
  • Group B must demonstrate Crystal Approach Behavior while Group C shows avoidance
  • Process validation must confirm Group B learned crystal purification rule

Experiment 3: Temporal Causal Reasoning

Hypothesis

The CausalGraphSystem can identify and utilize temporal causal relationships involving delayed effects.

Experimental Groups

  • Group B: Full Causal Model
  • Group C: Perception-Only Control
  • Both groups receive identical state information including explicit "metabolic boost" status flag

Environment Design

New Berry Type: Orange berries (10% spawn rate)
Temporal Causal Rule:

  • Eating orange berry triggers 50-tick "metabolic boost" state
  • During metabolic boost, all berry types provide double health benefits
  • Metabolic boost status visible to agents in state representation
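The temporal rule is small enough to state executably. A sketch of the boost bookkeeping (class and method names are hypothetical):

```python
class MetabolicBoost:
    """Tracks the 50-tick metabolic boost window from the rule above."""
    DURATION = 50

    def __init__(self):
        self.expires_at = -1  # no boost active initially

    def eat_orange(self, tick):
        """Eating an orange berry (re)starts the 50-tick boost window."""
        self.expires_at = tick + self.DURATION

    def is_active(self, tick):
        return tick < self.expires_at

    def health_gain(self, tick, base_gain):
        """All berry types provide double health benefit during the boost."""
        return base_gain * 2 if self.is_active(tick) else base_gain
```

The `is_active` flag corresponds to the "metabolic boost" status both groups see in their state representation; the experiment asks whether only the causal agents learn to exploit the delayed benefit it signals.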

Experimental Timeline

  • Training Phase (ticks 0-800): Agents learn basic environment
  • Testing Phase (ticks 800-1600): Temporal causation active

Measurement Protocol

Primary Metrics:

  1. Temporal Strategy Score: Frequency of seeking orange berries when health is low (indicating understanding of delayed benefit)
  2. Optimal Timing Behavior: Rate of delaying valuable berry consumption until after eating orange berries

Secondary Metrics:
3. Boost Utilization Rate: Proportion of metabolic boost duration spent consuming high-value berries
4. Causal Discovery Speed: Ticks required to establish orange berry seeking behavior

Success Criteria

  • Group B must show significantly higher Temporal Strategy Score than Group C
  • Group B must demonstrate Optimal Timing Behavior patterns
  • Behavioral differences must emerge despite identical state information access

Experiment 4: Learning Convergence Control

Hypothesis

Performance differences between causal and perception-only agents reflect cognitive architectural advantages, not differential learning time requirements.

Experimental Groups

Group C-Standard: Perception-Only agents run for standard 1600 ticks

Group C-Extended: Perception-Only agents run for 4800 ticks (3x duration)

Group C-Plateau: Perception-Only agents run until performance plateaus (no improvement for 200 consecutive ticks)

Convergence Detection Protocol

  1. Calculate rolling 50-tick average of causal understanding score
  2. Detect plateau when rolling average changes < 0.01 for 200 ticks
  3. Compare final plateau performance against Group B performance at various time points
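Steps 1-2 of the detection protocol can be sketched as a rolling mean over the per-tick score stream. A minimal sketch; treating the stream as a list indexed by tick is an assumption:

```python
from collections import deque

def detect_plateau(scores, window=50, tol=0.01, patience=200):
    """Return the first tick at which the rolling `window`-tick average has
    changed by less than `tol` for `patience` consecutive ticks, else None."""
    rolling = deque(maxlen=window)
    prev_avg, stable = None, 0
    for tick, score in enumerate(scores):
        rolling.append(score)
        if len(rolling) < window:
            continue  # not enough history for a rolling average yet
        avg = sum(rolling) / window
        if prev_avg is not None and abs(avg - prev_avg) < tol:
            stable += 1
            if stable >= patience:
                return tick
        else:
            stable = 0  # improvement seen, reset the patience counter
        prev_avg = avg
    return None
```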

Measurement Protocol

Primary Metrics:

  1. Plateau Performance Level: Final causal understanding score after convergence
  2. Learning Efficiency: Ticks required to reach 90% of final performance level

Convergence Analysis:
3. Asymptotic Performance Comparison: Statistical comparison of final performance levels
4. Learning Curve Analysis: Comparison of learning trajectories using curve fitting

Success Criteria

  • No significant difference between Group C-Standard and Group C-Extended final performance
  • Group C-Plateau performance remains significantly below Group B performance
  • Learning curve analysis confirms different asymptotes rather than different learning rates

Experiment 5: Direct Causal Model Validation

Hypothesis

Agents with CausalGraphSystem develop accurate internal representations of environmental causal structure.

Experimental Protocol

Model Extraction:

import numpy as np

def extract_causal_model(agent_id, causal_system):
    """Extract learned causal relationships from an agent's internal model."""
    return causal_system.get_causal_graph(agent_id)

def validate_causal_relationships(learned_model, ground_truth_rules):
    """Compare the learned model against known environmental causation."""
    accuracy_scores = []

    for relationship in ground_truth_rules:
        # Query the learned model for its predicted interventional effect
        predicted_effect = learned_model.estimate_effect(
            treatment=relationship.cause,
            outcome=relationship.effect,
            treatment_value=relationship.intervention_value,
        )

        # Score the prediction against the environment's ground-truth effect
        true_effect = relationship.ground_truth_effect
        accuracy = calculate_prediction_accuracy(predicted_effect, true_effect)
        accuracy_scores.append(accuracy)

    return float(np.mean(accuracy_scores))

Counterfactual Testing:
Test agents' ability to predict outcomes of hypothetical interventions:

  1. "What would happen if blue berries appeared near crystals instead of water?"
  2. "What would happen if orange berries lasted 100 ticks instead of 50?"
  3. "What would happen if yellow berries never appeared near rocks?"

Measurement Protocol

Model Quality Metrics:

  1. Causal Relationship Accuracy: Proportion of correctly identified causal links
  2. Counterfactual Prediction Accuracy: Accuracy of hypothetical scenario predictions
  3. Model Complexity Score: Number of causal relationships inferred (penalize overfitting)

Success Criteria

  • Group B must achieve >80% accuracy on causal relationship identification
  • Counterfactual predictions must significantly exceed chance performance
  • Model complexity should be appropriate (not overfitted or underfitted)

Experiment 6: Cross-Domain Transfer Learning

Hypothesis

Causal reasoning capabilities generalize across different environmental domains and causal structures.

Transfer Environment: Tool Crafting Domain

Environment Design:

  • Agents must collect resources and craft tools in sequence
  • Multi-step causal chains: Wood + Stone → Hammer; Hammer + Metal → Sword
  • Success requires understanding prerequisite relationships
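The prerequisite structure is a small dependency graph, and "understanding prerequisite relationships" amounts to satisfying it recursively. A sketch with a hypothetical recipe table matching the chains above:

```python
# Hypothetical recipe table for the transfer domain:
# Wood + Stone -> Hammer; Hammer + Metal -> Sword
RECIPES = {
    "hammer": {"wood", "stone"},
    "sword": {"hammer", "metal"},
}

def can_craft(item, inventory):
    """True if `item` is in inventory or all of its prerequisites are
    satisfiable, recursing through intermediate crafted items."""
    if item in inventory:
        return True
    needed = RECIPES.get(item)
    if needed is None:
        return False  # a raw resource we don't hold and can't craft
    return all(can_craft(part, inventory) for part in needed)
```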

Transfer Protocol:

  1. Train agents in berry environment (800 ticks)
  2. Transfer to tool crafting environment (no additional training)
  3. Measure adaptation speed and final performance

Measurement Protocol

Transfer Metrics:

  1. Adaptation Speed: Ticks required to achieve first successful tool craft
  2. Final Performance: Tool crafting success rate in final 200 ticks
  3. Causal Transfer Evidence: Behavioral indicators of applying causal reasoning to new domain

Success Criteria

  • Group B must show faster adaptation speed than Group C
  • Final performance differences must parallel original domain advantages
  • Process analysis must confirm application of causal reasoning principles

Enhanced Exploration Controls

Standardized Exploration Protocol

All Q-learning agents use identical parameters:

  • Initial epsilon: 0.1
  • Epsilon decay rate: 0.995 per tick
  • Minimum epsilon: 0.01
  • Learning rate: 0.1
  • Discount factor: 0.95
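These parameters plug into the standard epsilon-greedy action loop. A minimal sketch of the selection and per-tick decay (function names are hypothetical; `q_values` is the list of Q-values for the current state):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-Q) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Per-tick multiplicative decay, clamped at the minimum epsilon."""
    return max(floor, epsilon * rate)
```

With a per-tick decay of 0.995 from 0.1, epsilon reaches the 0.01 floor after roughly 460 ticks, so most of each run is spent near-greedy.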

Exploration Validation Metrics

  1. Map Coverage: Percentage of environment tiles visited
  2. Action Diversity: Entropy of action selection distribution
  3. Exploration Efficiency: Ratio of novel states discovered to total actions taken
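Metrics 1-2 are direct to compute from the run logs. A sketch assuming a per-agent log of visited tiles and selected actions (the log format is hypothetical):

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy (in bits) of the action-selection distribution."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def map_coverage(visited_tiles, grid_size=50):
    """Fraction of grid tiles visited at least once."""
    return len(set(visited_tiles)) / (grid_size * grid_size)
```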

Comprehensive Success Framework

Evidence FOR Causal Reasoning

  • Significant behavioral advantages (d > 0.5) across multiple experiments
  • High causal model accuracy in process validation (>80%)
  • Successful transfer to novel domains
  • Advantages persist after controlling for exploration and learning time
  • Counterfactual reasoning capabilities exceed chance performance

Evidence AGAINST Causal Reasoning

  • Performance differences eliminated by exploration/learning controls
  • No advantage in process measures of causal model quality
  • Poor transfer to new domains
  • Advantages explained by confounding factors
  • Random-level performance on counterfactual reasoning tests

Implementation Timeline

Weeks 1-2: Implement all experimental conditions and enhanced measurement systems
Weeks 3-4: Execute Experiments 1-3 with full statistical protocols
Weeks 5-6: Execute Experiments 4-6 and cross-experiment validation analyses
Week 7: Comprehensive analysis, replication checks, and results interpretation
