# CLARISSA Tutorial 08: RIGOR Benchmark Framework

**Learning Objectives:**
- Understand the RIGOR evaluation dimensions
- Implement benchmark test cases
- Score deck generation quality
- Compare system performance across tiers

**Prerequisites:** Notebooks 01-07

**Estimated Time:** 45 minutes

## What is RIGOR?

**R**eservoir **I**nput **G**eneration **O**utput **R**eview

A benchmark framework for evaluating conversational simulation systems across four dimensions:

| Dimension | What it Measures | Example |
|-----------|------------------|----------|
| **Syntactic Validity** | Parser acceptance, keyword correctness | Does OPM Flow accept the deck? |
| **Semantic Correctness** | Logical consistency, unit coherence | Are FIELD units used consistently? |
| **Physical Plausibility** | Realistic parameters, sensible gradients | Is pressure gradient ~0.45 psi/ft? |
| **Conversational Efficiency** | Turns to completion, clarification rate | How many questions asked? |
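
As a quick illustration of the physical-plausibility row, the sketch below divides an initial pressure by its datum depth and compares the result with a plausible window. The 0.35 to 0.55 psi/ft window is the range used by the physics validator later in this notebook; `gradient_is_plausible` is a hypothetical helper, not part of CLARISSA.

```python
def gradient_is_plausible(pressure_psi: float, depth_ft: float,
                          low: float = 0.35, high: float = 0.55) -> bool:
    """Return True if pressure/depth falls inside [low, high] psi/ft."""
    if depth_ft <= 0:
        return False  # a gradient is undefined without a positive depth
    return low <= pressure_psi / depth_ft <= high

# 3800 psi at 8500 ft gives roughly 0.447 psi/ft
print(round(3800 / 8500, 3))
print(gradient_is_plausible(3800, 8500))
print(gradient_is_plausible(1500, 8500))  # about 0.176 psi/ft, underpressured
```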

In [None]:
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple, Callable
from enum import Enum, auto
import json
import re
from datetime import datetime

print("RIGOR Benchmark Framework initialized")

## Section 1: Complexity Tiers

RIGOR defines three complexity tiers for progressive evaluation.

In [None]:
class ComplexityTier(Enum):
    """RIGOR complexity tiers."""
    FOUNDATIONAL = 1   # Simple models, single- or two-phase (oil-water)
    INTERMEDIATE = 2   # Multi-well, black-oil
    ADVANCED = 3       # Compositional, mid-conversation changes

@dataclass
class TierSpec:
    """Specification for a complexity tier."""
    name: str
    description: str
    max_cells: int
    max_wells: int
    phases: List[str]
    features: List[str]
    expected_turns: int  # Baseline for efficiency scoring

TIER_SPECS = {
    ComplexityTier.FOUNDATIONAL: TierSpec(
        name="Foundational",
        description="Linear displacement, laboratory coreflood, single-phase flow",
        max_cells=1000,
        max_wells=2,
        phases=["OIL", "WATER"],
        features=["Cartesian grid", "Uniform properties", "Simple schedule"],
        expected_turns=5
    ),
    ComplexityTier.INTERMEDIATE: TierSpec(
        name="Intermediate",
        description="Pattern flood, multi-well, black-oil",
        max_cells=50000,
        max_wells=20,
        phases=["OIL", "WATER", "GAS"],
        features=["5-spot pattern", "Variable permeability", "Well controls"],
        expected_turns=10
    ),
    ComplexityTier.ADVANCED: TierSpec(
        name="Advanced",
        description="Compositional EOS, mid-conversation model changes",
        max_cells=500000,
        max_wells=100,
        phases=["COMPOSITIONAL"],
        features=["EOS modeling", "Thermal effects", "Model pivots"],
        expected_turns=20
    )
}

print("RIGOR Complexity Tiers:")
print("=" * 60)
for tier, spec in TIER_SPECS.items():
    print(f"\n{spec.name} (Tier {tier.value})")
    print(f"  {spec.description}")
    print(f"  Max cells: {spec.max_cells:,}")
    print(f"  Max wells: {spec.max_wells}")
    print(f"  Expected turns: {spec.expected_turns}")
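
One way to use the tier limits is to pick the lowest tier whose cell and well limits accommodate a requested model. The sketch below is an assumption about how such a lookup might work, not CLARISSA's actual logic. `TierLimits`, `LIMITS`, and `lowest_fitting_tier` are hypothetical names, and the limits are copied from `TIER_SPECS` so the block runs on its own.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class TierLimits:
    tier: int
    max_cells: int
    max_wells: int

# Limits copied from TIER_SPECS above, ordered from simplest to most complex
LIMITS: List[TierLimits] = [
    TierLimits(1, 1_000, 2),
    TierLimits(2, 50_000, 20),
    TierLimits(3, 500_000, 100),
]

def lowest_fitting_tier(cells: int, wells: int) -> Optional[int]:
    """Return the lowest tier number whose limits fit, or None if none do."""
    for limits in LIMITS:
        if cells <= limits.max_cells and wells <= limits.max_wells:
            return limits.tier
    return None

print(lowest_fitting_tier(20, 2))            # 1D coreflood with two wells
print(lowest_fitting_tier(20 * 20 * 5, 5))   # 2,000 cells, five wells
print(lowest_fitting_tier(1_000_000, 10))    # exceeds every cell limit
```

Size alone does not determine the tier: the phases and features of a model (for example a compositional EOS) would also need to be considered.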

## Section 2: Evaluation Dimensions

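Each dimension gets its own validator. Every validator starts from a score of 1.0, subtracts a penalty for each failed check, and floors the result at 0.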

In [None]:
@dataclass
class DimensionScore:
    """Score for a single evaluation dimension."""
    dimension: str
    score: float  # 0.0 to 1.0
    max_score: float = 1.0
    details: List[str] = field(default_factory=list)
    penalties: List[Tuple[str, float]] = field(default_factory=list)
    
    @property
    def percentage(self) -> float:
        return (self.score / self.max_score) * 100

class SyntacticValidator:
    """Check syntactic validity of generated decks."""
    
    # Required sections
    REQUIRED_SECTIONS = ['RUNSPEC', 'GRID', 'PROPS', 'SOLUTION', 'SCHEDULE']
    
    # Basic keyword patterns
    KEYWORD_PATTERNS = {
        'DIMENS': r'DIMENS\s+\d+\s+\d+\s+\d+\s*/',
        'WELSPECS': r'WELSPECS[\s\S]*?/',
        'COMPDAT': r'COMPDAT[\s\S]*?/',
    }
    
    def validate(self, deck: str) -> DimensionScore:
        """Validate deck syntax."""
        score = 1.0
        details = []
        penalties = []
        
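        # Check required sections (match the keyword at the start of a line,
        # so substrings such as GRIDFILE are not mistaken for GRID)
        for section in self.REQUIRED_SECTIONS:
            if re.search(rf'^\s*{section}\b', deck, re.MULTILINE):
                details.append(f"{section} present")
            else:
                penalties.append((f"Missing {section}", 0.15))
                score -= 0.15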
        
        # Check keyword patterns
        for keyword, pattern in self.KEYWORD_PATTERNS.items():
            if re.search(pattern, deck):
                details.append(f"{keyword} valid")
            elif keyword in deck:
                penalties.append((f"{keyword} malformed", 0.1))
                score -= 0.1
        
        # Check for terminator
        if deck.strip().endswith('END'):
            details.append("END terminator present")
        else:
            penalties.append(("Missing END", 0.05))
            score -= 0.05
        
        return DimensionScore(
            dimension="Syntactic Validity",
            score=max(0, score),
            details=details,
            penalties=penalties
        )

class SemanticValidator:
    """Check semantic correctness."""
    
    def validate(self, deck: str, metadata: Dict) -> DimensionScore:
        """Validate semantic consistency."""
        score = 1.0
        details = []
        penalties = []
        
        # Check unit consistency
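        # Match the unit keyword on a line of its own, so the word FIELD
        # inside a title or comment does not count
        has_field = re.search(r'^\s*FIELD\s*$', deck, re.MULTILINE) is not None
        has_metric = re.search(r'^\s*METRIC\s*$', deck, re.MULTILINE) is not None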
        
        if has_field and has_metric:
            penalties.append(("Mixed unit systems", 0.3))
            score -= 0.3
        elif has_field or has_metric:
            details.append(f"Unit system: {'FIELD' if has_field else 'METRIC'}")
        else:
            penalties.append(("No unit system specified", 0.2))
            score -= 0.2
        
        # Check grid dimensions match data
        dimens_match = re.search(r'DIMENS\s+(\d+)\s+(\d+)\s+(\d+)', deck)
        if dimens_match:
            nx, ny, nz = map(int, dimens_match.groups())
            total = nx * ny * nz
            details.append(f"Grid: {nx}x{ny}x{nz} = {total:,} cells")
            
            # Check PORO count matches
            poro_match = re.search(r'PORO\s+(\d+)\*', deck)
            if poro_match:
                poro_count = int(poro_match.group(1))
                if poro_count == total:
                    details.append("PORO count matches grid")
                else:
                    penalties.append((f"PORO count {poro_count} != grid {total}", 0.2))
                    score -= 0.2
        
        # Check well locations within grid
        # (Simplified - would need full parsing in production)
        
        return DimensionScore(
            dimension="Semantic Correctness",
            score=max(0, score),
            details=details,
            penalties=penalties
        )

class PhysicsValidator:
    """Check physical plausibility."""
    
    # Typical ranges
    RANGES = {
        'porosity': (0.01, 0.40),
        'permeability': (0.1, 10000),  # md
        'pressure_gradient': (0.35, 0.55),  # psi/ft
        'water_saturation': (0.0, 1.0),
    }
    
    def validate(self, deck: str, metadata: Dict) -> DimensionScore:
        """Validate physics."""
        score = 1.0
        details = []
        penalties = []
        
        # Extract and check porosity
        poro_match = re.search(r'PORO\s+\d+\*([\d.]+)', deck)
        if poro_match:
            poro = float(poro_match.group(1))
            if self.RANGES['porosity'][0] <= poro <= self.RANGES['porosity'][1]:
                details.append(f"Porosity {poro:.2f} in range")
            else:
                penalties.append((f"Porosity {poro:.2f} out of range", 0.2))
                score -= 0.2
        
        # Extract and check permeability
        permx_match = re.search(r'PERMX\s+\d+\*([\d.]+)', deck)
        if permx_match:
            perm = float(permx_match.group(1))
            if self.RANGES['permeability'][0] <= perm <= self.RANGES['permeability'][1]:
                details.append(f"Permeability {perm:.0f} md in range")
            else:
                penalties.append((f"Permeability {perm:.0f} unusual", 0.15))
                score -= 0.15
        
        # Check pressure gradient (from EQUIL)
        equil_match = re.search(r'EQUIL\s+[\d.]+\s+([\d.]+)', deck)
        tops_match = re.search(r'TOPS\s+\d+\*([\d.]+)', deck)
        if equil_match and tops_match:
            pressure = float(equil_match.group(1))
            depth = float(tops_match.group(1))
            if depth > 0:
                gradient = pressure / depth
                if self.RANGES['pressure_gradient'][0] <= gradient <= self.RANGES['pressure_gradient'][1]:
                    details.append(f"Pressure gradient {gradient:.3f} psi/ft OK")
                else:
                    penalties.append((f"Pressure gradient {gradient:.3f} unusual", 0.2))
                    score -= 0.2
        
        return DimensionScore(
            dimension="Physical Plausibility",
            score=max(0, score),
            details=details,
            penalties=penalties
        )

class EfficiencyValidator:
    """Measure conversational efficiency."""
    
    def validate(self, conversation: List[Dict], tier: ComplexityTier) -> DimensionScore:
        """Score based on turns and clarifications."""
        spec = TIER_SPECS[tier]
        
        total_turns = len(conversation)
        clarifications = sum(1 for msg in conversation 
                            if msg.get('role') == 'assistant' and '?' in msg.get('content', ''))
        
        details = [
            f"Total turns: {total_turns}",
            f"Clarifications: {clarifications}",
            f"Expected: {spec.expected_turns}"
        ]
        penalties = []
        
        # Score based on turns vs expected
        if total_turns <= spec.expected_turns:
            score = 1.0
            details.append("Under or at expected turns")
        else:
            excess = total_turns - spec.expected_turns
            penalty = min(0.5, excess * 0.05)
            score = 1.0 - penalty
            penalties.append((f"{excess} excess turns", penalty))
        
        # Penalty for excessive clarifications
        if clarifications > spec.expected_turns / 2:
            penalty = 0.1 * (clarifications - spec.expected_turns / 2)
            penalties.append(("Excessive clarifications", min(0.3, penalty)))
            score -= min(0.3, penalty)
        
        return DimensionScore(
            dimension="Conversational Efficiency",
            score=max(0, score),
            details=details,
            penalties=penalties
        )

# Test validators
print("Testing validators...")

sample_deck = '''RUNSPEC
TITLE
Test Model

FIELD

DIMENS
  10 10 5 /

GRID
PORO
  500*0.22 /
PERMX
  500*150 /
TOPS
  100*8500 /

PROPS
SOLUTION
EQUIL
  8500 3800 9500 0 0 0 1 /

SCHEDULE
WELSPECS
  PROD1 G1 5 5 1* OIL /
/
COMPDAT
  PROD1 5 5 1 5 OPEN /
/
END
'''

# Run validators
syntactic = SyntacticValidator().validate(sample_deck)
semantic = SemanticValidator().validate(sample_deck, {})
physics = PhysicsValidator().validate(sample_deck, {})

print(f"\nSyntactic: {syntactic.percentage:.0f}%")
print(f"Semantic: {semantic.percentage:.0f}%")
print(f"Physics: {physics.percentage:.0f}%")
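
The semantic validator leaves the well-location check as a comment. A minimal sketch of that check might look like the following, assuming each `WELSPECS` record sits on its own line in the order name, group, I, J. `wells_outside_grid` is a hypothetical helper; a production check would need a real deck parser to handle defaults, quoted names, and multi-line records.

```python
import re
from typing import List, Tuple

def wells_outside_grid(deck: str) -> List[Tuple[str, int, int]]:
    """Return (name, i, j) for WELSPECS entries outside the DIMENS grid."""
    dimens = re.search(r'DIMENS\s+(\d+)\s+(\d+)\s+(\d+)', deck)
    # Capture everything between WELSPECS and the line holding only "/"
    welspecs = re.search(r'^\s*WELSPECS\s*\n(.*?)^\s*/\s*$', deck,
                         re.MULTILINE | re.DOTALL)
    if not dimens or not welspecs:
        return []  # nothing to check
    nx, ny, _ = map(int, dimens.groups())
    offenders = []
    for line in welspecs.group(1).splitlines():
        fields = line.split()
        # Expect NAME GROUP I J ...; skip comments and short records
        if len(fields) < 4 or fields[0].startswith('--'):
            continue
        name, i_str, j_str = fields[0], fields[2], fields[3]
        if not (i_str.isdigit() and j_str.isdigit()):
            continue  # defaulted or malformed indices are ignored here
        i, j = int(i_str), int(j_str)
        if not (1 <= i <= nx and 1 <= j <= ny):
            offenders.append((name, i, j))
    return offenders

# Illustrative deck fragment: INJ1 sits at I=12 in a 10x10x5 grid
demo = """DIMENS
  10 10 5 /
WELSPECS
  PROD1 G1 5 5 1* OIL /
  INJ1  G1 12 3 1* WATER /
/
"""
print(wells_outside_grid(demo))
```

Each offender could then be turned into a penalty in the same way as the other semantic checks.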

## Section 3: Benchmark Test Cases

Define specific test cases for each tier.

In [None]:
@dataclass
class TestCase:
    """A single benchmark test case."""
    id: str
    tier: ComplexityTier
    name: str
    description: str
    user_prompt: str
    expected_features: List[str]
    validation_checks: List[Callable]

# Tier 1 Test Cases
TIER1_TESTS = [
    TestCase(
        id="T1-01",
        tier=ComplexityTier.FOUNDATIONAL,
        name="Linear Coreflood",
        description="Simple 1D displacement model",
        user_prompt="Create a coreflood model: 20 cells in x-direction, "
                   "water injection at one end, producer at the other. "
                   "Porosity 0.25, permeability 100 md.",
        expected_features=["1D grid (nx>1, ny=1, nz=1)", "2 wells", "Water injection"],
        validation_checks=[]
    ),
    TestCase(
        id="T1-02",
        tier=ComplexityTier.FOUNDATIONAL,
        name="Single Well Depletion",
        description="Radial flow to single producer",
        user_prompt="Model a single producer well in a 10x10x3 grid. "
                   "Well at center, producing at 500 stb/d for 2 years.",
        expected_features=["3D grid", "1 producer", "Rate control"],
        validation_checks=[]
    ),
]

# Tier 2 Test Cases
TIER2_TESTS = [
    TestCase(
        id="T2-01",
        tier=ComplexityTier.INTERMEDIATE,
        name="5-Spot Waterflood",
        description="Classic pattern flood",
        user_prompt="Create a 5-spot waterflood pattern on 40-acre spacing. "
                   "Depth 8500 ft, pressure 3800 psi. "
                   "Run for 10 years with water injection.",
        expected_features=["5 wells", "4 injectors + 1 producer", "Pattern geometry"],
        validation_checks=[]
    ),
    TestCase(
        id="T2-02",
        tier=ComplexityTier.INTERMEDIATE,
        name="Multi-Layer Model",
        description="Layered reservoir with varying properties",
        user_prompt="Build a model with 5 layers. Top 2 layers high perm (200md), "
                   "middle layer shale barrier (1md), bottom 2 layers medium perm (50md). "
                   "20x20 areal grid.",
        expected_features=["5 layers", "Variable permeability", "Barrier layer"],
        validation_checks=[]
    ),
]

# Tier 3 Test Cases
TIER3_TESTS = [
    TestCase(
        id="T3-01",
        tier=ComplexityTier.ADVANCED,
        name="Black-Oil to Compositional Pivot",
        description="Mid-conversation model type change",
        user_prompt="Start with a black-oil waterflood model for the Permian. "
                   "[After initial model] Actually, we need to evaluate CO2 injection "
                   "for tertiary recovery. Convert to compositional.",
        expected_features=["Model pivot", "EOS components", "CO2 properties"],
        validation_checks=[]
    ),
]

ALL_TESTS = TIER1_TESTS + TIER2_TESTS + TIER3_TESTS

print(f"Benchmark suite: {len(ALL_TESTS)} test cases")
print("\nTest case summary:")
for tier in ComplexityTier:
    tests = [t for t in ALL_TESTS if t.tier == tier]
    print(f"  Tier {tier.value} ({tier.name}): {len(tests)} tests")
    for t in tests:
        print(f"    - {t.id}: {t.name}")
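
Every test case above leaves `validation_checks` empty. The sketch below shows one way such checks could be written, assuming a check is a callable that takes the deck text and returns `True` on success (the notebook does not fix this signature). `is_1d_grid`, `mentions_keyword`, and `coreflood_checks` are hypothetical names.

```python
import re
from typing import Callable, List

# Assumed signature: a check takes the deck text and returns True on success
DeckCheck = Callable[[str], bool]

def is_1d_grid(deck: str) -> bool:
    """True when DIMENS describes nx > 1 with ny = nz = 1."""
    m = re.search(r'DIMENS\s+(\d+)\s+(\d+)\s+(\d+)', deck)
    if not m:
        return False
    nx, ny, nz = map(int, m.groups())
    return nx > 1 and ny == 1 and nz == 1

def mentions_keyword(keyword: str) -> DeckCheck:
    """Build a check that passes when `keyword` starts a line in the deck."""
    pattern = re.compile(rf'^\s*{re.escape(keyword)}\b', re.MULTILINE)
    return lambda deck: pattern.search(deck) is not None

# Checks that could accompany the T1-01 coreflood case
coreflood_checks: List[DeckCheck] = [is_1d_grid, mentions_keyword('WCONINJE')]

# Illustrative deck fragment, not a complete deck
demo_deck = "DIMENS\n  20 1 1 /\nSCHEDULE\nWCONINJE\n  INJ1 WATER OPEN RATE 10 /\n/\n"
print([check(demo_deck) for check in coreflood_checks])
```

A list like `coreflood_checks` could be passed as `validation_checks` for T1-01, although the runner in this notebook does not call these checks.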

## Section 4: Benchmark Runner

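Execute test cases and collect scores. The runner applies all four validators to each generated deck and conversation, then combines the scores into a weighted average (syntactic 0.25, semantic 0.25, physics 0.30, efficiency 0.20).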

In [None]:
@dataclass
class BenchmarkResult:
    """Result from running a single test case."""
    test_id: str
    tier: ComplexityTier
    syntactic: DimensionScore
    semantic: DimensionScore
    physics: DimensionScore
    efficiency: DimensionScore
    deck: str
    conversation: List[Dict]
    execution_time: float
    
    @property
    def overall_score(self) -> float:
        """Weighted average of all dimensions."""
        weights = {
            'syntactic': 0.25,
            'semantic': 0.25,
            'physics': 0.30,
            'efficiency': 0.20
        }
        return (
            self.syntactic.score * weights['syntactic'] +
            self.semantic.score * weights['semantic'] +
            self.physics.score * weights['physics'] +
            self.efficiency.score * weights['efficiency']
        )

class BenchmarkRunner:
    """Run RIGOR benchmark suite."""
    
    def __init__(self):
        self.syntactic_validator = SyntacticValidator()
        self.semantic_validator = SemanticValidator()
        self.physics_validator = PhysicsValidator()
        self.efficiency_validator = EfficiencyValidator()
        self.results: List[BenchmarkResult] = []
    
    def run_test(self, test: TestCase, deck: str, 
                 conversation: List[Dict]) -> BenchmarkResult:
        """Run a single test case."""
        import time
        start = time.time()
        
        # Run all validators
        syntactic = self.syntactic_validator.validate(deck)
        semantic = self.semantic_validator.validate(deck, {})
        physics = self.physics_validator.validate(deck, {})
        efficiency = self.efficiency_validator.validate(conversation, test.tier)
        
        result = BenchmarkResult(
            test_id=test.id,
            tier=test.tier,
            syntactic=syntactic,
            semantic=semantic,
            physics=physics,
            efficiency=efficiency,
            deck=deck,
            conversation=conversation,
            execution_time=time.time() - start
        )
        
        self.results.append(result)
        return result
    
    def run_suite(self, tests: List[TestCase], 
                  deck_generator: Callable) -> List[BenchmarkResult]:
        """Run full test suite with a deck generator function."""
        results = []
        for test in tests:
            # Generate deck (would call actual system in production)
            deck, conversation = deck_generator(test.user_prompt)
            result = self.run_test(test, deck, conversation)
            results.append(result)
            print(f"  {test.id}: {result.overall_score:.0%}")
        return results
    
    def generate_report(self) -> str:
        """Generate benchmark report."""
        lines = ["RIGOR Benchmark Report", "=" * 50, ""]
        
        # By tier
        for tier in ComplexityTier:
            tier_results = [r for r in self.results if r.tier == tier]
            if tier_results:
                lines.append(f"\n{TIER_SPECS[tier].name} (Tier {tier.value})")
                lines.append("-" * 40)
                
                for r in tier_results:
                    lines.append(f"  {r.test_id}: {r.overall_score:.0%}")
                    lines.append(f"    Syntactic: {r.syntactic.percentage:.0f}%")
                    lines.append(f"    Semantic:  {r.semantic.percentage:.0f}%")
                    lines.append(f"    Physics:   {r.physics.percentage:.0f}%")
                    lines.append(f"    Efficiency:{r.efficiency.percentage:.0f}%")
                
                avg = sum(r.overall_score for r in tier_results) / len(tier_results)
                lines.append(f"  Tier Average: {avg:.0%}")
        
        # Overall
        if self.results:
            overall = sum(r.overall_score for r in self.results) / len(self.results)
            lines.append(f"\n{'=' * 50}")
            lines.append(f"Overall Score: {overall:.0%}")
        
        return "\n".join(lines)

# Mock deck generator for demo
def mock_generator(prompt: str) -> Tuple[str, List[Dict]]:
    """Mock deck generator for testing."""
    # Return sample deck and conversation
    return sample_deck, [
        {'role': 'user', 'content': prompt},
        {'role': 'assistant', 'content': 'I will create that model.'},
        {'role': 'assistant', 'content': 'What porosity should I use?'},
        {'role': 'user', 'content': '0.22'},
        {'role': 'assistant', 'content': 'Here is your deck...'}
    ]

# Run benchmark
runner = BenchmarkRunner()
print("Running RIGOR benchmark...\n")
runner.run_suite(TIER1_TESTS + TIER2_TESTS[:1], mock_generator)

print("\n" + runner.generate_report())
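
To see how the weights combine, the arithmetic in `overall_score` can be reproduced by hand. The scores below are illustrative, not measured results, and `weighted_overall` is a hypothetical standalone helper.

```python
# Weights copied from BenchmarkResult.overall_score
WEIGHTS = {'syntactic': 0.25, 'semantic': 0.25, 'physics': 0.30, 'efficiency': 0.20}

def weighted_overall(scores: dict) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())

# Illustrative scores: a deck with no unit system (-0.2 semantic)
# and an out-of-range porosity (-0.2 physics)
example = {'syntactic': 1.0, 'semantic': 0.8, 'physics': 0.8, 'efficiency': 1.0}

# 0.25*1.0 + 0.25*0.8 + 0.30*0.8 + 0.20*1.0 = 0.89
print(f"{weighted_overall(example):.0%}")
```

Because physics carries the largest weight, a 0.2 physics penalty costs 6 points overall, while the same penalty on the semantic score costs 5.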

## Section 5: Leaderboard and Comparison

Compare different system configurations.

In [None]:
@dataclass
class SystemConfig:
    """Configuration of a CLARISSA system variant."""
    name: str
    llm_model: str
    use_rl: bool
    use_constraints: bool
    use_analogs: bool

class Leaderboard:
    """Track and compare system performance."""
    
    def __init__(self):
        self.entries: List[Dict] = []
    
    def add_entry(self, config: SystemConfig, results: List[BenchmarkResult]):
        """Add benchmark results for a system configuration."""
        # Use a divisor of 1 for an empty result list so every average is 0
        # instead of raising ZeroDivisionError
        n = len(results) or 1
        overall = sum(r.overall_score for r in results) / n
        
        # By dimension
        dim_scores = {
            'syntactic': sum(r.syntactic.score for r in results) / n,
            'semantic': sum(r.semantic.score for r in results) / n,
            'physics': sum(r.physics.score for r in results) / n,
            'efficiency': sum(r.efficiency.score for r in results) / n,
        }
        
        self.entries.append({
            'config': config,
            'overall': overall,
            'dimensions': dim_scores,
            'num_tests': len(results),
            'timestamp': datetime.now().isoformat()
        })
        
        # Sort by overall score
        self.entries.sort(key=lambda x: x['overall'], reverse=True)
    
    def display(self):
        """Display leaderboard."""
        print("\nRIGOR Leaderboard")
        print("=" * 83)
        print(f"{'Rank':<5} {'System':<30} {'Overall':<10} {'Syn':<8} {'Sem':<8} {'Phy':<8} {'Eff':<8}")
        print("-" * 83)
        
        for i, entry in enumerate(self.entries, 1):
            config = entry['config']
            dims = entry['dimensions']
            # The System column is 30 wide so the longest demo name (28 chars) stays aligned
            print(f"{i:<5} {config.name:<30} {entry['overall']:<10.0%} "
                  f"{dims['syntactic']:<8.0%} {dims['semantic']:<8.0%} "
                  f"{dims['physics']:<8.0%} {dims['efficiency']:<8.0%}")

# Demo leaderboard
leaderboard = Leaderboard()

# Add mock entries
configs = [
    SystemConfig("CLARISSA v0.1 (baseline)", "GPT-3.5", False, False, False),
    SystemConfig("CLARISSA v0.2 (+constraints)", "GPT-3.5", False, True, False),
    SystemConfig("CLARISSA v0.3 (+RL)", "GPT-4", True, True, False),
    SystemConfig("CLARISSA v0.4 (full)", "Claude-3", True, True, True),
]

# Mock results with improving scores
for i, config in enumerate(configs):
    mock_results = []
    base_score = 0.6 + i * 0.1
    for test in TIER1_TESTS:
        mock_results.append(BenchmarkResult(
            test_id=test.id,
            tier=test.tier,
            syntactic=DimensionScore("Syntactic", base_score + 0.05),
            semantic=DimensionScore("Semantic", base_score),
            physics=DimensionScore("Physics", base_score - 0.05),
            efficiency=DimensionScore("Efficiency", min(1.0, base_score + 0.1)),  # cap at 1.0
            deck="",
            conversation=[],
            execution_time=1.0
        ))
    leaderboard.add_entry(config, mock_results)

leaderboard.display()
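
To track improvement across runs, leaderboard entries could be written to JSON. The sketch below is one possible approach, not part of CLARISSA: `entries_to_json` is a hypothetical helper, the demo entry uses made-up scores, and `SystemConfig` is redeclared only so the block runs on its own.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SystemConfig:  # mirrors the dataclass defined above
    name: str
    llm_model: str
    use_rl: bool
    use_constraints: bool
    use_analogs: bool

def entries_to_json(entries: list) -> str:
    """Serialize leaderboard entries, converting each config to a plain dict."""
    serializable = [{**entry, 'config': asdict(entry['config'])} for entry in entries]
    return json.dumps(serializable, indent=2)

# Illustrative entry with made-up scores, shaped like Leaderboard.entries
demo_entries = [{
    'config': SystemConfig("demo", "some-llm", False, True, False),
    'overall': 0.7675,
    'dimensions': {'syntactic': 0.8, 'semantic': 0.75, 'physics': 0.7, 'efficiency': 0.85},
    'num_tests': 2,
    'timestamp': '2024-01-01T00:00:00',
}]
print(entries_to_json(demo_entries))
```

The dataclass has to be converted with `asdict` because `json.dumps` cannot serialize dataclass instances directly.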

## Summary

In this tutorial, we learned:

1. **RIGOR Framework**: Four evaluation dimensions for conversational simulation systems
2. **Complexity Tiers**: Progressive difficulty from coreflood to compositional
3. **Validators**: Syntactic, semantic, physics, and efficiency scoring
4. **Test Cases**: Standardized prompts for benchmarking
5. **Leaderboard**: Compare system configurations

**Key Insight**: Systematic evaluation enables objective comparison and improvement tracking.

**Next Tutorial:** [09_Full_Pipeline_Demo.ipynb](09_Full_Pipeline_Demo.ipynb) - End-to-end example