# CLARISSA Tutorial 07: Reinforcement Learning Agent

**Learning Objectives:**
- Understand RL fundamentals for simulation optimization
- Implement PPO-based action selection
- Design reward functions for deck generation
- Train agents with simulation feedback

**Prerequisites:** Notebooks 01-06

**Estimated Time:** 90 minutes

**Note:** GPU recommended for training (Colab T4 works well)

## Why Reinforcement Learning?

LLMs generate plausible deck text, but a deck that reads well can still fail in the simulator. RL targets the gaps that text generation alone leaves open:

| Challenge | RL Solution |
|-----------|-------------|
| Convergence failures | Learn from simulator feedback |
| Suboptimal defaults | Optimize based on outcomes |
| Action sequencing | Policy learns effective orderings |
| Error recovery | Reward successful corrections |

CLARISSA uses RL to optimize the *sequence of actions* taken during deck generation.

In [None]:
# Setup
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Optional, Any
from enum import Enum, auto
import random
from collections import deque
import math

# Check for PyTorch (optional - we have numpy fallback)
try:
    import torch
    import torch.nn as nn
    import torch.optim as optim
    TORCH_AVAILABLE = True
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'PyTorch available on {device}')
except ImportError:
    TORCH_AVAILABLE = False
    print('PyTorch not available - using NumPy implementation')

## Section 1: The CLARISSA Action Space

Define the actions the RL agent can take during deck generation.

In [None]:
class Action(Enum):
    """Actions available to the CLARISSA RL agent."""
    # Information gathering
    ASK_CLARIFICATION = auto()     # Request more info from user
    QUERY_KNOWLEDGE = auto()       # Look up in knowledge base
    QUERY_ANALOG = auto()          # Find similar reservoirs
    
    # Deck construction
    SET_GRID = auto()              # Define grid dimensions
    SET_ROCK_PROPS = auto()        # Set porosity, permeability
    SET_FLUID_PROPS = auto()       # Set PVT, densities
    SET_RELPERM = auto()           # Set relative permeability
    SET_INITIAL = auto()           # Set initial conditions
    ADD_WELL = auto()              # Add a well
    SET_SCHEDULE = auto()          # Define time steps
    
    # Validation & execution
    VALIDATE_DECK = auto()         # Run constraint checks
    RUN_SIMULATION = auto()        # Execute OPM Flow
    FIX_ERROR = auto()             # Attempt automatic fix
    
    # Completion
    PRESENT_RESULTS = auto()       # Show results to user
    DONE = auto()                  # Task complete

# Action metadata
ACTION_INFO = {
    Action.ASK_CLARIFICATION: {'cost': 1, 'reversible': True},
    Action.QUERY_KNOWLEDGE: {'cost': 0.5, 'reversible': True},
    Action.QUERY_ANALOG: {'cost': 1, 'reversible': True},
    Action.SET_GRID: {'cost': 2, 'reversible': True},
    Action.SET_ROCK_PROPS: {'cost': 1, 'reversible': True},
    Action.SET_FLUID_PROPS: {'cost': 1, 'reversible': True},
    Action.SET_RELPERM: {'cost': 1, 'reversible': True},
    Action.SET_INITIAL: {'cost': 1, 'reversible': True},
    Action.ADD_WELL: {'cost': 2, 'reversible': True},
    Action.SET_SCHEDULE: {'cost': 1, 'reversible': True},
    Action.VALIDATE_DECK: {'cost': 1, 'reversible': True},
    Action.RUN_SIMULATION: {'cost': 10, 'reversible': False},
    Action.FIX_ERROR: {'cost': 2, 'reversible': True},
    Action.PRESENT_RESULTS: {'cost': 0, 'reversible': True},
    Action.DONE: {'cost': 0, 'reversible': False},
}

print(f"Action space size: {len(Action)}")
print("\nAction costs:")
for action in Action:
    info = ACTION_INFO[action]
    print(f"  {action.name:20} cost={info['cost']:4} reversible={info['reversible']}")
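
The training code later in this notebook converts between `Action` members and network output indices with `action.value - 1`, which works because `auto()` numbers members from 1. A minimal sketch of a more explicit mapping follows. `MiniAction`, `ACTION_TO_INDEX`, and `INDEX_TO_ACTION` are illustrative names and not part of CLARISSA.

```python
from enum import Enum, auto

class MiniAction(Enum):
    """Three-member stand-in for the tutorial's Action enum (hypothetical)."""
    SET_GRID = auto()
    VALIDATE_DECK = auto()
    DONE = auto()

# Explicit lookup tables keep the mapping valid even if enum values change.
ACTION_TO_INDEX = {action: i for i, action in enumerate(MiniAction)}
INDEX_TO_ACTION = list(MiniAction)

# With auto(), values start at 1, so value - 1 matches the enumeration order.
assert all(ACTION_TO_INDEX[a] == a.value - 1 for a in MiniAction)

idx = ACTION_TO_INDEX[MiniAction.VALIDATE_DECK]
print(idx, INDEX_TO_ACTION[idx].name)  # 1 VALIDATE_DECK
```

The lookup tables cost one line each and remove the hidden dependency on how the enum values were assigned.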

## Section 2: State Representation

The state encodes what CLARISSA knows and what's been done.

In [None]:
@dataclass
class DeckState:
    """State of deck generation process."""
    # Completion flags in [0, 1]; wells_defined grows by 0.25 per ADD_WELL
    grid_defined: float = 0.0
    rock_defined: float = 0.0
    fluid_defined: float = 0.0
    relperm_defined: float = 0.0
    initial_defined: float = 0.0
    wells_defined: float = 0.0
    schedule_defined: float = 0.0
    
    # Quality metrics (0 to 1)
    validation_score: float = 0.0
    physics_score: float = 0.0
    
    # Simulation status
    sim_attempted: float = 0.0
    sim_converged: float = 0.0
    sim_time_ratio: float = 0.0  # actual/target time
    
    # Conversation context
    clarifications_asked: float = 0.0
    user_satisfaction: float = 0.5  # Estimated
    
    # Resource usage
    steps_taken: float = 0.0
    
    def to_vector(self) -> np.ndarray:
        """Convert state to feature vector."""
        return np.array([
            self.grid_defined,
            self.rock_defined,
            self.fluid_defined,
            self.relperm_defined,
            self.initial_defined,
            self.wells_defined,
            self.schedule_defined,
            self.validation_score,
            self.physics_score,
            self.sim_attempted,
            self.sim_converged,
            self.sim_time_ratio,
            self.clarifications_asked / 5.0,  # Normalize
            self.user_satisfaction,
            self.steps_taken / 20.0,  # Normalize
        ], dtype=np.float32)
    
    @property
    def completeness(self) -> float:
        """How complete is the deck (0-1)."""
        sections = [self.grid_defined, self.rock_defined, self.fluid_defined,
                   self.relperm_defined, self.initial_defined, 
                   self.wells_defined, self.schedule_defined]
        return sum(sections) / len(sections)

STATE_DIM = len(DeckState().to_vector())
ACTION_DIM = len(Action)

print(f"State dimension: {STATE_DIM}")
print(f"Action dimension: {ACTION_DIM}")

# Demo state
state = DeckState(grid_defined=1.0, rock_defined=1.0, wells_defined=0.5)
print(f"\nExample state completeness: {state.completeness:.1%}")
print(f"State vector: {state.to_vector()}")
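
`to_vector` divides the two counters by fixed constants (5 clarifications, 20 steps), so a longer episode would push those features above 1. A minimal sketch of a clipped normalizer, assuming the features are meant to stay in [0, 1]. `normalize_count` is a hypothetical helper and not part of the tutorial's `DeckState`.

```python
def normalize_count(count: float, scale: float) -> float:
    """Scale a non-negative counter into [0, 1], clipping at both ends."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return min(1.0, max(0.0, count / scale))

# steps_taken is divided by 20 in to_vector; clipping caps longer episodes.
print(normalize_count(10, 20.0))  # 0.5
print(normalize_count(30, 20.0))  # 1.0
```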

## Section 3: Reward Function Design

The reward function shapes what the agent learns to optimize.

In [None]:
class RewardCalculator:
    """Calculate rewards for RL agent actions."""
    
    def __init__(self):
        # Reward weights
        self.w_completion = 10.0   # Completing deck sections
        self.w_validation = 5.0    # Passing validation
        self.w_convergence = 20.0  # Simulation converging
        self.w_efficiency = -0.1   # Penalty per step
        self.w_clarification = -0.5  # Penalty for asking
        self.w_sim_fail = -5.0     # Penalty for failed sim
    
    def calculate(self, prev_state: DeckState, action: Action, 
                  new_state: DeckState, sim_result: Optional[Dict] = None) -> float:
        """Calculate reward for state transition."""
        reward = 0.0
        
        # 1. Completion progress
        completion_delta = new_state.completeness - prev_state.completeness
        reward += self.w_completion * completion_delta
        
        # 2. Validation improvement
        if action == Action.VALIDATE_DECK:
            validation_delta = new_state.validation_score - prev_state.validation_score
            reward += self.w_validation * validation_delta
        
        # 3. Simulation outcome
        if action == Action.RUN_SIMULATION:
            if new_state.sim_converged > 0:
                reward += self.w_convergence * new_state.sim_time_ratio
            else:
                reward += self.w_sim_fail
        
        # 4. Efficiency penalty
        reward += self.w_efficiency
        
        # 5. Clarification penalty (but sometimes necessary)
        if action == Action.ASK_CLARIFICATION:
            # Less penalty early, more penalty late
            penalty_scale = prev_state.completeness
            reward += self.w_clarification * (1 + penalty_scale)
        
        # 6. Bonus for successful completion
        if action == Action.DONE and new_state.sim_converged > 0:
            efficiency_bonus = max(0, 1 - new_state.steps_taken / 15)
            reward += 10.0 * efficiency_bonus
        
        return reward

# Demo reward calculation
calc = RewardCalculator()

# Scenario 1: Define grid (good progress)
s1 = DeckState()
s2 = DeckState(grid_defined=1.0)
r1 = calc.calculate(s1, Action.SET_GRID, s2)
print(f"Set grid: reward = {r1:.2f}")

# Scenario 2: Run simulation that converges
s3 = DeckState(grid_defined=1, rock_defined=1, fluid_defined=1, relperm_defined=1,
               initial_defined=1, wells_defined=1, schedule_defined=1, validation_score=1)
s4 = DeckState(grid_defined=1, rock_defined=1, fluid_defined=1, relperm_defined=1,
               initial_defined=1, wells_defined=1, schedule_defined=1, validation_score=1,
               sim_attempted=1, sim_converged=1, sim_time_ratio=0.9)
r2 = calc.calculate(s3, Action.RUN_SIMULATION, s4)
print(f"Successful simulation: reward = {r2:.2f}")

# Scenario 3: Simulation fails
s5 = DeckState(grid_defined=1, rock_defined=1, fluid_defined=1, relperm_defined=1,
               initial_defined=1, wells_defined=1, schedule_defined=1, validation_score=0.5,
               sim_attempted=1, sim_converged=0)
r3 = calc.calculate(s3, Action.RUN_SIMULATION, s5)
print(f"Failed simulation: reward = {r3:.2f}")
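
The three demo rewards can be checked by hand. The sketch below repeats the arithmetic with the weights from `RewardCalculator` copied as plain constants. It is a verification aid, not a replacement for the class.

```python
# Weights copied from RewardCalculator.__init__
W_COMPLETION = 10.0
W_CONVERGENCE = 20.0
W_EFFICIENCY = -0.1
W_SIM_FAIL = -5.0
N_SECTIONS = 7  # deck sections counted by DeckState.completeness

# Scenario 1: SET_GRID raises completeness from 0/7 to 1/7.
r_set_grid = W_COMPLETION * (1 / N_SECTIONS) + W_EFFICIENCY

# Scenario 2: converged run with sim_time_ratio = 0.9, completeness unchanged.
r_sim_ok = W_CONVERGENCE * 0.9 + W_EFFICIENCY

# Scenario 3: failed run, completeness unchanged.
r_sim_fail = W_SIM_FAIL + W_EFFICIENCY

print(f"{r_set_grid:.2f} {r_sim_ok:.2f} {r_sim_fail:.2f}")  # 1.33 17.90 -5.10
```

A single converged run is worth more than completing all seven deck sections (10.0 in total), so the convergence term dominates the return.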

## Section 4: Environment Simulation

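The environment below is a simplified stand-in for OPM Flow: action effects and simulation convergence are modeled with simple rules and random draws, so the agent can be trained before a real simulator is connected.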

In [None]:
class DeckGenerationEnv:
    """Simulated environment for deck generation."""
    
    def __init__(self, max_steps: int = 20):
        self.max_steps = max_steps
        self.reward_calc = RewardCalculator()
        self.reset()
    
    def reset(self) -> np.ndarray:
        """Reset to initial state."""
        self.state = DeckState()
        self.steps = 0
        self.done = False
        return self.state.to_vector()
    
    def step(self, action: Action) -> Tuple[np.ndarray, float, bool, Dict]:
        """Take action, return (new_state, reward, done, info)."""
        prev_state = DeckState(**self.state.__dict__)
        self.steps += 1
        self.state.steps_taken = self.steps
        
        info = {'action': action.name}
        
        # Apply action effects (simplified simulation)
        if action == Action.SET_GRID:
            self.state.grid_defined = 1.0
        elif action == Action.SET_ROCK_PROPS:
            if self.state.grid_defined:
                self.state.rock_defined = 1.0
        elif action == Action.SET_FLUID_PROPS:
            self.state.fluid_defined = 1.0
        elif action == Action.SET_RELPERM:
            self.state.relperm_defined = 1.0
        elif action == Action.SET_INITIAL:
            if self.state.grid_defined:
                self.state.initial_defined = 1.0
        elif action == Action.ADD_WELL:
            if self.state.grid_defined:
                self.state.wells_defined = min(1.0, self.state.wells_defined + 0.25)
        elif action == Action.SET_SCHEDULE:
            self.state.schedule_defined = 1.0
        elif action == Action.VALIDATE_DECK:
            self.state.validation_score = self.state.completeness * 0.9 + random.random() * 0.1
            self.state.physics_score = self.state.validation_score * 0.95
        elif action == Action.RUN_SIMULATION:
            self.state.sim_attempted = 1.0
            # Convergence depends on validation score
            if self.state.validation_score > 0.7 and random.random() < self.state.validation_score:
                self.state.sim_converged = 1.0
                self.state.sim_time_ratio = 0.8 + random.random() * 0.2
            else:
                self.state.sim_converged = 0.0
        elif action == Action.ASK_CLARIFICATION:
            self.state.clarifications_asked += 1
            self.state.user_satisfaction = max(0.3, self.state.user_satisfaction - 0.05)
        elif action == Action.DONE:
            self.done = True
        
        # Calculate reward
        reward = self.reward_calc.calculate(prev_state, action, self.state)
        
        # Check termination
        if self.steps >= self.max_steps:
            self.done = True
            info['timeout'] = True
        
        return self.state.to_vector(), reward, self.done, info
    
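    def get_valid_actions(self) -> List[Action]:
        """Return list of valid actions in current state.

        QUERY_ANALOG, FIX_ERROR and PRESENT_RESULTS are never offered in
        this simplified environment, so the agent cannot select them.
        """
        valid = [Action.ASK_CLARIFICATION, Action.QUERY_KNOWLEDGE]
        
        # Grid-dependent sections unlock only after the grid is defined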
        if not self.state.grid_defined:
            valid.append(Action.SET_GRID)
        else:
            valid.extend([Action.SET_ROCK_PROPS, Action.SET_INITIAL, Action.ADD_WELL])
        
        # Other properties can be set anytime
        valid.extend([Action.SET_FLUID_PROPS, Action.SET_RELPERM, Action.SET_SCHEDULE])
        
        # Validation requires some content
        if self.state.completeness > 0.3:
            valid.append(Action.VALIDATE_DECK)
        
        # Simulation requires validation
        if self.state.validation_score > 0.5:
            valid.append(Action.RUN_SIMULATION)
        
        # Done if simulation succeeded
        if self.state.sim_converged > 0:
            valid.append(Action.DONE)
        
        return valid

# Test environment
env = DeckGenerationEnv()
state = env.reset()

print("Environment test:")
print(f"Initial valid actions: {[a.name for a in env.get_valid_actions()]}")

# Take some actions
actions = [Action.SET_GRID, Action.SET_ROCK_PROPS, Action.SET_FLUID_PROPS,
           Action.SET_RELPERM, Action.SET_INITIAL, Action.ADD_WELL, 
           Action.ADD_WELL, Action.SET_SCHEDULE, Action.VALIDATE_DECK]

total_reward = 0
for action in actions:
    state, reward, done, info = env.step(action)
    total_reward += reward
    print(f"  {action.name:20} reward={reward:+.2f}  completeness={env.state.completeness:.0%}")

print(f"\nTotal reward: {total_reward:.2f}")
print(f"Valid actions now: {[a.name for a in env.get_valid_actions()]}")
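
In this simulated environment, `RUN_SIMULATION` converges only when `validation_score` exceeds 0.7, and then with probability equal to the score. A minimal sketch of that rule in isolation, with a seeded Monte Carlo check. `convergence_probability` and `estimate` are illustrative names.

```python
import random

def convergence_probability(validation_score: float) -> float:
    """Chance that the simulated run converges under the environment's rule."""
    return validation_score if validation_score > 0.7 else 0.0

def estimate(validation_score: float, trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate using the same test as DeckGenerationEnv.step."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    hits = sum(
        1 for _ in range(trials)
        if validation_score > 0.7 and rng.random() < validation_score
    )
    return hits / trials

for v in (0.6, 0.8, 0.95):
    print(f"score={v:.2f} exact={convergence_probability(v):.2f} sampled={estimate(v):.3f}")
```

A score between 0.5 and 0.7 makes `RUN_SIMULATION` a valid action that can never succeed, so the agent has to learn to validate a more complete deck before running.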

## Section 5: Policy Network (PPO)

A neural network that learns to select actions.

In [None]:
if TORCH_AVAILABLE:
    class PolicyNetwork(nn.Module):
        """Actor-Critic network for PPO."""
        
        def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
            super().__init__()
            
            # Shared feature extractor
            self.shared = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
            )
            
            # Actor head (policy)
            self.actor = nn.Linear(hidden_dim, action_dim)
            
            # Critic head (value function)
            self.critic = nn.Linear(hidden_dim, 1)
        
        def forward(self, state):
            features = self.shared(state)
            action_logits = self.actor(features)
            value = self.critic(features)
            return action_logits, value
        
        def get_action(self, state, valid_actions: Optional[List[int]] = None):
            """Sample an action for a single (unbatched) state vector."""
            with torch.no_grad():
                logits, value = self.forward(state)
                
                # Mask invalid actions: -inf logits get zero probability after softmax
                if valid_actions is not None:
                    mask = torch.full_like(logits, float('-inf'))
                    mask[valid_actions] = 0
                    logits = logits + mask
                
                probs = torch.softmax(logits, dim=-1)
                action = torch.multinomial(probs, 1).item()
                
            return action, probs[action].item(), value.item()
    
    # Create network
    policy = PolicyNetwork(STATE_DIM, ACTION_DIM).to(device)
    print(f"Policy network created: {sum(p.numel() for p in policy.parameters())} parameters")
    
    # Test forward pass
    test_state = torch.randn(STATE_DIM).to(device)
    logits, value = policy(test_state)
    print(f"Output shapes: logits={logits.shape}, value={value.shape}")

else:
    # NumPy fallback for simple policy
    class SimplePolicy:
        """Simple policy without neural network."""
        
        def __init__(self, action_dim: int):
            self.action_dim = action_dim
            self.action_values = np.zeros(action_dim)
        
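        def get_action(self, state, valid_actions: Optional[List[int]] = None):
            if valid_actions:
                # Softmax over the stored values of the valid actions.
                # The values are never updated in this fallback, so the
                # choice is uniform over the valid actions.
                values = self.action_values[valid_actions]
                exp_values = np.exp(values - np.max(values))  # stable softmax
                probs = exp_values / exp_values.sum()
                action_idx = np.random.choice(len(valid_actions), p=probs)
                action = valid_actions[action_idx]
                prob = float(probs[action_idx])
            else:
                action = np.random.randint(self.action_dim)
                prob = 1.0 / self.action_dim
            return action, prob, 0.0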
    
    policy = SimplePolicy(ACTION_DIM)
    print("Using simple NumPy policy (install PyTorch for full PPO)")
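
Both policies rely on the same idea: invalid actions are excluded before the softmax, so they receive zero probability. A minimal dependency-free sketch of a masked softmax, assuming a flat list of logits. `masked_softmax` is an illustrative name.

```python
import math
from typing import List, Sequence

def masked_softmax(logits: Sequence[float], valid: Sequence[int]) -> List[float]:
    """Softmax over the valid indices; every other index gets probability 0."""
    if not valid:
        raise ValueError("at least one action must be valid")
    peak = max(logits[i] for i in valid)  # subtract the max for stability
    exps = {i: math.exp(logits[i] - peak) for i in valid}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(len(logits))]

# Index 3 has the largest logit but is masked out.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], valid=[0, 1])
print([round(p, 3) for p in probs])  # [0.731, 0.269, 0.0, 0.0]
```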

## Section 6: Training Loop

Train the agent through interaction with the environment.
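
The update in this section uses PPO's clipped surrogate objective, which limits how far one update can move the policy. A minimal scalar sketch, assuming the standard per-sample form min(r·A, clip(r, 1-ε, 1+ε)·A). `clipped_surrogate` is an illustrative name.

```python
def clipped_surrogate(ratio: float, advantage: float, eps_clip: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: the more pessimistic (smaller) term is kept.
print(clipped_surrogate(0.5, -1.0))
# Inside the clip range the objective is simply ratio * advantage.
print(clipped_surrogate(1.1, 2.0))
```

The training loss is the negative mean of this quantity over the trajectory, because optimizers minimize.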

In [None]:
class RLTrainer:
    """Train RL agent for deck generation."""
    
    def __init__(self, env, policy, lr: float = 3e-4):
        self.env = env
        self.policy = policy
        self.gamma = 0.99  # Discount factor
        self.eps_clip = 0.2  # PPO clip parameter
        
        if TORCH_AVAILABLE:
            self.optimizer = optim.Adam(policy.parameters(), lr=lr)
        
        # Tracking
        self.episode_rewards = []
        self.episode_lengths = []
    
    def collect_episode(self) -> Tuple[List, float]:
        """Collect one episode of experience."""
        state = self.env.reset()
        trajectory = []
        total_reward = 0
        
        while True:
            # Get valid actions
            valid_actions = [a.value - 1 for a in self.env.get_valid_actions()]
            
            # Select action
            if TORCH_AVAILABLE:
                state_tensor = torch.FloatTensor(state).to(device)
                action_idx, prob, value = self.policy.get_action(state_tensor, valid_actions)
            else:
                action_idx, prob, value = self.policy.get_action(state, valid_actions)
            
            action = list(Action)[action_idx]
            
            # Take action
            next_state, reward, done, info = self.env.step(action)
            
            trajectory.append({
                'state': state,
                'action': action_idx,
                'valid': valid_actions,  # kept so the update can rebuild the mask
                'reward': reward,
                'prob': prob,
                'value': value,
                'done': done
            })
            
            total_reward += reward
            state = next_state
            
            if done:
                break
        
        return trajectory, total_reward
    
    def compute_returns(self, trajectory: List) -> List[float]:
        """Compute discounted returns."""
        returns = []
        G = 0
        for step in reversed(trajectory):
            G = step['reward'] + self.gamma * G * (1 - step['done'])
            returns.insert(0, G)
        return returns
    
    def train_episode(self) -> float:
        """Train on one episode."""
        trajectory, total_reward = self.collect_episode()
        returns = self.compute_returns(trajectory)
        
        self.episode_rewards.append(total_reward)
        self.episode_lengths.append(len(trajectory))
        
        if TORCH_AVAILABLE and len(trajectory) > 0:
            # PPO-style update (one gradient step per episode)
            states = torch.FloatTensor(np.array([t['state'] for t in trajectory])).to(device)
            actions = torch.LongTensor([t['action'] for t in trajectory]).to(device)
            old_probs = torch.FloatTensor([t['prob'] for t in trajectory]).to(device)
            returns_t = torch.FloatTensor(returns).to(device)
            
            # Rebuild the action masks that were applied at sampling time
            mask = torch.full((len(trajectory), ACTION_DIM), float('-inf'), device=device)
            for i, t in enumerate(trajectory):
                mask[i, t['valid']] = 0.0
            
            # Forward pass, masked the same way as in get_action so that
            # old and new probabilities are comparable
            logits, values = self.policy(states)
            probs = torch.softmax(logits + mask, dim=-1)
            new_probs = probs.gather(1, actions.unsqueeze(1)).squeeze(1)
            
            # PPO loss
            ratio = new_probs / (old_probs + 1e-8)
            advantages = returns_t - values.squeeze()
            surr1 = ratio * advantages.detach()
            surr2 = torch.clamp(ratio, 1-self.eps_clip, 1+self.eps_clip) * advantages.detach()
            actor_loss = -torch.min(surr1, surr2).mean()
            
            critic_loss = nn.MSELoss()(values.squeeze(), returns_t)
            
            loss = actor_loss + 0.5 * critic_loss
            
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        
        return total_reward
    
    def train(self, num_episodes: int = 100, print_every: int = 10):
        """Train for multiple episodes."""
        for ep in range(num_episodes):
            reward = self.train_episode()
            
            if (ep + 1) % print_every == 0:
                avg_reward = np.mean(self.episode_rewards[-print_every:])
                avg_length = np.mean(self.episode_lengths[-print_every:])
                print(f"Episode {ep+1:4d} | Avg Reward: {avg_reward:+.2f} | Avg Length: {avg_length:.1f}")

# Train!
trainer = RLTrainer(env, policy)
print("Training RL agent...\n")
trainer.train(num_episodes=50, print_every=10)
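
`compute_returns` walks the trajectory backwards so that each return reuses the one after it. A worked example with a small discount factor makes the recursion easy to check by hand. `discounted_returns` is a standalone restatement for illustration, not the trainer's method.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}, with G = 0 after the final step."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # one O(n) reversal instead of repeated insert(0, ...)
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Reading from the end: the last return is 2.0, the middle one is 0 + 0.5 × 2.0 = 1.0, and the first is 1 + 0.5 × 1.0 = 1.5.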

In [None]:
# Visualize training progress
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Rewards
ax1.plot(trainer.episode_rewards)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('Episode Rewards')
ax1.axhline(y=0, color='r', linestyle='--', alpha=0.3)

# Episode lengths
ax2.plot(trainer.episode_lengths)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Steps')
ax2.set_title('Episode Lengths')

plt.tight_layout()
plt.show()
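
Per-episode rewards are noisy, so a trailing moving average often makes the trend easier to read. A minimal sketch in plain Python. `moving_average` is a hypothetical helper, and the window size is an arbitrary choice.

```python
from typing import List, Sequence

def moving_average(values: Sequence[float], window: int = 10) -> List[float]:
    """Trailing mean over up to `window` most recent values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed: List[float] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

print(moving_average([1.0, 3.0, 5.0, 7.0], window=2))  # [1.0, 2.0, 4.0, 6.0]
```

The result can be drawn on the same axes as the raw curve, for example with `ax1.plot(moving_average(trainer.episode_rewards))`.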

## Section 7: Evaluation

Evaluate the trained agent.

In [None]:
def evaluate_agent(env, policy, num_episodes: int = 10, verbose: bool = False):
    """Evaluate trained agent."""
    results = []
    
    for ep in range(num_episodes):
        state = env.reset()
        total_reward = 0
        actions_taken = []
        
        while True:
            valid_actions = [a.value - 1 for a in env.get_valid_actions()]
            
            if TORCH_AVAILABLE:
                state_tensor = torch.FloatTensor(state).to(device)
                action_idx, _, _ = policy.get_action(state_tensor, valid_actions)
            else:
                action_idx, _, _ = policy.get_action(state, valid_actions)
            
            action = list(Action)[action_idx]
            actions_taken.append(action.name)
            
            state, reward, done, info = env.step(action)
            total_reward += reward
            
            if done:
                break
        
        results.append({
            'reward': total_reward,
            'steps': len(actions_taken),
            'converged': env.state.sim_converged > 0,
            'actions': actions_taken
        })
        
        if verbose:
            status = 'converged' if env.state.sim_converged > 0 else 'FAILED'
            print(f"Episode {ep+1}: {status} in {len(actions_taken)} steps, reward={total_reward:.1f}")
    
    return results

# Evaluate
print("Evaluating trained agent:")
print("=" * 50)
results = evaluate_agent(env, policy, num_episodes=10, verbose=True)

# Summary
success_rate = sum(1 for r in results if r['converged']) / len(results)
avg_reward = np.mean([r['reward'] for r in results])
avg_steps = np.mean([r['steps'] for r in results])

print(f"\nSummary:")
print(f"  Success rate: {success_rate:.0%}")
print(f"  Avg reward: {avg_reward:.2f}")
print(f"  Avg steps: {avg_steps:.1f}")
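
`evaluate_agent` samples actions from the policy, so its results vary between runs. A common alternative at evaluation time is to pick the highest-scoring valid action deterministically. A minimal sketch, assuming raw per-action scores are available as a flat list. `greedy_action` is an illustrative name.

```python
from typing import Sequence

def greedy_action(logits: Sequence[float], valid: Sequence[int]) -> int:
    """Index of the highest-logit action among the valid ones."""
    if not valid:
        raise ValueError("at least one action must be valid")
    return max(valid, key=lambda i: logits[i])

# Index 3 has the largest logit but is not valid, so index 0 wins.
print(greedy_action([2.0, 1.0, 0.5, 3.0], valid=[0, 1, 2]))  # 0
```

Greedy selection removes sampling noise from the policy, although the simulated environment itself still draws random convergence outcomes.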

## Summary

This tutorial covered:

1. **Action Space**: Define actions for deck generation
2. **State Representation**: Encode progress and quality
3. **Reward Design**: Shape learning with completion, validation, convergence
4. **Environment**: Simulate deck generation process
5. **PPO Policy**: Neural network for action selection
6. **Training**: Collect experience and update policy
7. **Evaluation**: Measure success rate, reward, and episode length

**Key Insight**: RL optimizes the *sequence* of actions, not just individual decisions.

**Next Tutorial:** [08_RIGOR_Benchmark.ipynb](08_RIGOR_Benchmark.ipynb) - Evaluation framework