# SIMPLE GAIA VALIDATOR - NO FLUFF

### Declare the validator class on top of the testing framework's functions

In [None]:
from agent_testing import run_quick_gaia_test, run_gaia_test, compare_agent_configs, run_smart_routing_test

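class GAIAValidator:
    """Thin wrapper around the agent_testing functions that remembers the last result."""

    def __init__(self):
        self.last_result = None  # most recent result dict, read by insights()
        print("🎯 GAIA Validator ready")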
    
    def quick(self, config="groq", questions=5):
        """Run a small test and print the overall accuracy."""
        result = run_quick_gaia_test(config, num_questions=questions)
        self.last_result = result
        if result and 'overall_performance' in result:
            acc = result['overall_performance']['accuracy']
            print(f"✅ {acc:.1%} accuracy")
        return result
    
    def full(self, config="groq", questions=20):
        """Run the full test and report accuracy against the 45% target."""
        result = run_gaia_test(config, max_questions=questions)
        self.last_result = result
        if result and 'overall_performance' in result:
            acc = result['overall_performance']['accuracy']
            total = result['overall_performance']['total_questions']
            correct = result['overall_performance']['correct_answers']
            print(f"✅ {acc:.1%} accuracy ({correct}/{total})")
            print(f"GAIA Target: {'✅ MET' if acc >= 0.45 else '❌ NOT MET'}")
        return result
    
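    def compare(self, configs=("groq", "google"), questions=10):
        """Run the same number of questions per config and print each accuracy."""
        # Tuple default avoids a shared mutable default; the framework still gets a list.
        result = compare_agent_configs(list(configs), questions)
        self.last_result = result
        if result and 'comparison_results' in result:
            for config, data in result['comparison_results'].items():
                if 'accuracy' in data:
                    print(f"{config}: {data['accuracy']:.1%}")
        return result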
    
    def insights(self):
        """Print a short analysis of the last quick() or full() result."""
        if not self.last_result or 'overall_performance' not in self.last_result:
            print("❌ No results to analyze")
            return

        overall = self.last_result['overall_performance']
        acc = overall['accuracy']

        print("\n📊 INSIGHTS")
        print(f"Accuracy: {acc:.1%}")

        if acc >= 0.60:
            print("🏆 EXCELLENT")
        elif acc >= 0.45:
            print("✅ GOOD - Above GAIA threshold")
        else:
            print("⚠️ NEEDS IMPROVEMENT")
        
        # Level breakdown
        levels = self.last_result.get('level_performance', {})
        if levels:
            print("By level:")
            for level, perf in levels.items():
                print(f"  Level {level}: {perf['accuracy']:.1%}")
        
        # Simple recommendations
        if acc < 0.45:
            print("💡 Focus on Level 1 questions first")
        elif acc < 0.60:
            print("💡 Good performance - optimize Level 2")
        else:
            print("💡 Excellent - test with more questions")

validator = GAIAValidator()

# GAIA Testing Methods Reference

## Validator Methods

### `validator.quick(config, questions)`
**Quick validation test - perfect for development**
- `config` (str, default="groq"): Agent configuration 
  - Options: `"groq"`, `"google"`, `"openrouter"`, `"ollama"`, `"performance"`, `"accuracy"`
- `questions` (int, default=5): Number of questions to test
- **Returns**: Test results dict
- **Use case**: Fast iteration during development

```python
result = validator.quick('groq', 5)
```

### `validator.full(config, questions)`
**Comprehensive test - for production validation**
- `config` (str, default="groq"): Agent configuration
- `questions` (int, default=20): Number of questions (20+ recommended for reliable results)
- **Returns**: Complete test results with level breakdown
- **Use case**: Final validation before deployment

```python
result = validator.full('groq', 20)
```

### `validator.compare(configs, questions)`
**Compare multiple configurations**
- `configs` (list, default=["groq", "google"]): List of configurations to compare
- `questions` (int, default=10): Questions per configuration
- **Returns**: Comparison results with rankings
- **Use case**: Choosing the best configuration

```python
result = validator.compare(['groq', 'google', 'performance'], 10)
```

### `validator.insights()`
**Analyze last test results**
- **No parameters**
- **Returns**: None (prints analysis)
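- **Use case**: Get actionable recommendations after a `quick` or `full` test (results from `compare` have no `overall_performance` key, so `insights()` reports that there is nothing to analyze)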

```python
validator.insights()
```

---

## Underlying Testing Functions

### `run_quick_gaia_test(agent_config_name, **kwargs)`
**Direct access to quick testing**
- `agent_config_name` (str): Configuration name
- `num_questions` (int, default=5): Number of questions
- `dataset_path` (str, default="./tests/gaia_data"): Dataset location
- **Returns**: Evaluation results dict

### `run_gaia_test(agent_config_name, dataset_path, max_questions, test_config)`
**Complete GAIA test workflow**
- `agent_config_name` (str, default="groq"): Configuration name
- `dataset_path` (str, default="./tests/gaia_data"): Dataset location
- `max_questions` (int, default=20): Maximum questions to test
- `test_config` (GAIATestConfig, optional): Advanced test configuration
- **Returns**: Complete evaluation results

### `compare_agent_configs(config_names, num_questions, dataset_path)`
**Compare multiple agent configurations**
- `config_names` (List[str]): List of configuration names
- `num_questions` (int, default=10): Questions per configuration
- `dataset_path` (str, default="./tests/gaia_data"): Dataset location
- **Returns**: Comparison results dict

### `run_smart_routing_test(agent_config_name, num_questions)`
**Test smart routing effectiveness**
- `agent_config_name` (str, default="performance"): Configuration name
- `num_questions` (int, default=15): Number of questions for analysis
- **Returns**: Routing analysis results

---

## Configuration Options

### Available Configurations
| Config Name | Provider | Model | Use Case |
|-------------|----------|-------|----------|
| `"groq"` | Groq | qwen-qwq-32b | Fast, reliable |
| `"google"` | Google | gemini-2.0-flash-preview | Balanced performance |
| `"openrouter"` | OpenRouter | qwen/qwen-2.5-coder-32b-instruct:free | Cost-effective |
| `"ollama"` | Ollama | qwen2.5-coder:32b | Local deployment |
| `"performance"` | Groq | qwen-qwq-32b | Optimized for speed |
| `"accuracy"` | Google | gemini-2.0-flash-preview | Optimized for accuracy |
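Because a configuration is passed as a plain string, a typo only surfaces once a test run has started. A minimal sketch of an up-front check, assuming the six names in the table are the complete set; `KNOWN_CONFIGS` and `check_configs` are hypothetical helpers, not part of `agent_testing`:

```python
# Names copied from the table above; assumed here to be the complete set.
KNOWN_CONFIGS = {"groq", "google", "openrouter", "ollama", "performance", "accuracy"}


def check_configs(names):
    """Return the names as a list, or raise ValueError listing any unknown ones."""
    unknown = [name for name in names if name not in KNOWN_CONFIGS]
    if unknown:
        raise ValueError(
            f"Unknown configuration(s): {unknown}. "
            f"Expected one of {sorted(KNOWN_CONFIGS)}"
        )
    return list(names)


# Validate before handing the names to validator.compare(...)
print(check_configs(["groq", "google"]))
```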

---

## Test Result Structure

### Standard Result Format
```python
{
    "overall_performance": {
        "total_questions": 20,
        "correct_answers": 11,
        "accuracy": 0.55,
        "successful_executions": 19
    },
    "level_performance": {
        "1": {"accuracy": 0.70, "correct": 7, "total": 10},
        "2": {"accuracy": 0.44, "correct": 4, "total": 9},
        "3": {"accuracy": 0.0, "correct": 0, "total": 1}
    },
    "strategy_analysis": {
        "one_shot_llm": {"accuracy": 0.67, "total_questions": 12},
        "manager_coordination": {"accuracy": 0.38, "total_questions": 8}
    }
}
```
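A result in this shape can be inspected with plain dictionary access. The sketch below reuses the sample values shown above (illustrative, not from a real run) to find the weakest level and to cross-check the overall accuracy against the per-level counts. `weakest_level` and `accuracy_from_levels` are hypothetical helpers, not part of `agent_testing`.

```python
# Sample values copied from the format above; illustrative, not from a real run.
sample_result = {
    "overall_performance": {
        "total_questions": 20,
        "correct_answers": 11,
        "accuracy": 0.55,
        "successful_executions": 19,
    },
    "level_performance": {
        "1": {"accuracy": 0.70, "correct": 7, "total": 10},
        "2": {"accuracy": 0.44, "correct": 4, "total": 9},
        "3": {"accuracy": 0.0, "correct": 0, "total": 1},
    },
}


def weakest_level(result, min_questions=1):
    """Return (level, accuracy) for the lowest-accuracy level with enough questions."""
    levels = result.get("level_performance", {})
    candidates = [
        (level, perf["accuracy"])
        for level, perf in levels.items()
        if perf.get("total", 0) >= min_questions
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[1])


def accuracy_from_levels(result):
    """Recompute overall accuracy from the per-level correct/total counts."""
    levels = list(result.get("level_performance", {}).values())
    total = sum(perf["total"] for perf in levels)
    correct = sum(perf["correct"] for perf in levels)
    return correct / total if total else 0.0


print(weakest_level(sample_result))                   # every level considered
print(weakest_level(sample_result, min_questions=5))  # ignore tiny samples
print(accuracy_from_levels(sample_result))
```

The `min_questions` filter is there because a level with a single question, like Level 3 in the sample, says little about how the agent performs at that level.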

### Comparison Result Format
```python
{
    "comparison_results": {
        "groq": {"accuracy": 0.60, "correct_answers": 6, "total_questions": 10},
        "google": {"accuracy": 0.50, "correct_answers": 5, "total_questions": 10}
    },
    "timestamp": "2024-12-19T10:30:00",
    "test_questions": 10
}
```
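To pick a winner from a comparison result, the entries under `comparison_results` can be sorted by accuracy. A minimal sketch using the sample values above; `rank_configs` is a hypothetical helper, not part of `agent_testing`, and it skips entries without an `accuracy` key in the same way `validator.compare` does when printing.

```python
# Sample values copied from the format above; illustrative, not from a real run.
sample_comparison = {
    "comparison_results": {
        "groq": {"accuracy": 0.60, "correct_answers": 6, "total_questions": 10},
        "google": {"accuracy": 0.50, "correct_answers": 5, "total_questions": 10},
    },
    "timestamp": "2024-12-19T10:30:00",
    "test_questions": 10,
}


def rank_configs(comparison):
    """Return [(config_name, accuracy), ...] sorted from best to worst."""
    results = comparison.get("comparison_results", {})
    scored = [
        (name, data["accuracy"])
        for name, data in results.items()
        if "accuracy" in data  # skip configs that produced no score
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for position, (name, accuracy) in enumerate(rank_configs(sample_comparison), start=1):
    print(f"{position}. {name}: {accuracy:.1%}")
```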

---

## Usage Examples

### Basic Workflow
```python
# 1. Quick test
result = validator.quick('groq', 5)
validator.insights()

# 2. Full validation  
result = validator.full('groq', 20)
validator.insights()
```

### Configuration Comparison
```python
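# Compare multiple configs; compare() prints each configuration's accuracy
result = validator.compare(['groq', 'google', 'performance'], 10)
# insights() needs 'overall_performance', which comparison results lack,
# so run quick() or full() on the chosen config before calling it.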
```

### Advanced Testing
```python
# Direct function access for custom workflows
from agent_testing import run_gaia_test, analyze_failure_patterns

result = run_gaia_test('groq', max_questions=50)
failure_analysis = analyze_failure_patterns(result)
```

---

## Performance Benchmarks

### GAIA Accuracy Targets
- **45%+**: GAIA target threshold checked by `validator.full()` and `validator.insights()`
- **50-60%**: Competitive performance
- **60%+**: Excellent performance
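These targets can be applied to any accuracy value without going through `insights()`. A minimal sketch that mirrors the two cut-offs the validator code checks (0.45 and 0.60); `accuracy_tier` and its label strings are hypothetical, not part of `agent_testing`.

```python
def accuracy_tier(accuracy, target=0.45, excellent=0.60):
    """Map an accuracy in [0, 1] to a tier label using the validator's cut-offs."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if accuracy >= excellent:
        return "excellent"
    if accuracy >= target:
        return "good"
    return "needs improvement"


for value in (0.40, 0.45, 0.55, 0.60):
    print(f"{value:.0%}: {accuracy_tier(value)}")
```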

### Execution Time Guidelines
- **Quick test (5q)**: ~1-2 minutes
- **Full test (20q)**: ~5-8 minutes  
- **Comparison (3 configs, 10q each)**: ~10-15 minutes

### Recommended Question Counts
- **Development**: 5 questions (quick feedback)
- **Validation**: 20 questions (reliable results)
- **Production**: 50+ questions (comprehensive assessment)
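
The question counts above trade speed for statistical reliability. The sketch below shows how much uncertainty a given count leaves, assuming each question is an independent pass/fail trial (an assumption made here, not something the framework states). It uses the Wilson score interval; `wilson_interval` is a hypothetical helper, not part of `agent_testing`.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# The same 55% accuracy measured on 20 and on 100 questions
for correct, total in [(11, 20), (55, 100)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {low:.0%} to {high:.0%}")
```

The interval narrows as the question count grows, so a 5-question run is best read as a smoke test rather than as evidence that the 45% target was met or missed.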

# GAIA Testing

In [None]:
result = validator.quick('openrouter')