# Evaluating Sailing Agents

This notebook provides a simple interface for evaluating sailing agents on different wind scenarios. You can:

1. Test your agent on any predefined wind scenario
2. Get quantitative performance metrics (success rate, rewards, steps)
3. Optionally visualize your agent's behavior

## Setup

First, let's import the necessary evaluation tools:

In [9]:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
from typing import Dict, Any

# Add the src directory to the path
sys.path.append(os.path.abspath('../src'))
sys.path.append(os.path.abspath('..'))

# Import the evaluation tools
from src.test_agent_validity import validate_agent, load_agent_class
from src.evaluation import evaluate_agent, visualize_trajectory
from wind_scenarios import get_wind_scenario, WIND_SCENARIOS

# List available wind scenarios
print("Available wind scenarios:")
for windfield_name in sorted(WIND_SCENARIOS.keys()):
    print(f"- {windfield_name}")

Available wind scenarios:
- simple_static
- static_headwind
- training_1
- training_2
- training_3


## Configuration

Set your evaluation parameters below. You can easily modify these values without changing the rest of the notebook.

In [10]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Path to your agent implementation (change this to your agent file path)
AGENT_PATH = "../src/agents/agent_naive.py"

# Scenario to evaluate on (choose from the list printed above)
WIND_SCENARIO_NAME = "simple_static" # Options: simple_static, static_headwind, training_1, training_2, training_3, etc.

# Evaluation parameters
SEEDS = [1]  # Seeds to use for evaluation
MAX_HORIZON = 200            # Maximum steps per episode
VERBOSE = True               # Show progress bar
RENDER = False               # Render frames during evaluation (slower; the visualization cell below enables rendering itself)

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Validation and informational prints
print(f"Agent to evaluate: {AGENT_PATH}")
print(f"Wind scenario: {WIND_SCENARIO_NAME}")
print(f"Using {len(SEEDS)} seeds: {SEEDS}")
print(f"Max steps per episode: {MAX_HORIZON}")

Agent to evaluate: ../src/agents/agent_naive.py
Wind scenario: simple_static
Using 1 seeds: [1]
Max steps per episode: 200
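
The cell above only prints the settings; it does not check them. As a minimal sketch, a pre-flight check could catch a mistyped scenario name or an empty seed list before any episode runs. `check_config` and `KNOWN_SCENARIOS` are hypothetical names introduced for illustration and are not part of the challenge code.

```python
from typing import Iterable, List


def check_config(scenario: str, available: Iterable[str],
                 seeds: List[int], max_horizon: int) -> List[str]:
    """Return a list of human-readable problems (empty if the config looks sane)."""
    problems = []
    available = sorted(available)
    if scenario not in available:
        problems.append(f"Unknown wind scenario {scenario!r}; choose from {available}")
    if not seeds:
        problems.append("SEEDS is empty; provide at least one seed")
    if len(set(seeds)) != len(seeds):
        problems.append("SEEDS contains duplicates")
    if max_horizon <= 0:
        problems.append("MAX_HORIZON must be a positive integer")
    return problems


# Scenario names copied from the list printed in the Setup section.
KNOWN_SCENARIOS = ["simple_static", "static_headwind",
                   "training_1", "training_2", "training_3"]

print(check_config("simple_static", KNOWN_SCENARIOS, [1], 200))  # no problems
print(check_config("training_9", KNOWN_SCENARIOS, [], 200))      # two problems
```

In the notebook itself, `WIND_SCENARIOS.keys()` could be passed as `available` instead of the hard-coded list.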


## Load and Validate Agent

Next, load and validate your agent implementation:

In [11]:
def load_and_validate_agent(agent_path):
    """Load and validate an agent from a file path."""
    try:
        # Validate the agent first
        validation_results = validate_agent(agent_path)
        
        if not validation_results['valid']:
            print("❌ Agent validation failed:")
            for error in validation_results['errors']:
                print(f"  - {error}")
            return None
        
        # If valid, return the agent class
        return validation_results['agent_class']
        
    except Exception as e:
        print(f"❌ Error loading agent: {str(e)}")
        return None

# Load and validate the agent specified in AGENT_PATH
AgentClass = load_and_validate_agent(AGENT_PATH)

if AgentClass:
    print(f"✅ Successfully loaded agent: {AgentClass.__name__}")
    # Create an instance of your agent
    agent = AgentClass()
else:
    print("⚠️ Please fix your agent implementation before evaluation.")

✅ Successfully loaded agent: NaiveAgent
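
The internals of `validate_agent` and `load_agent_class` are not shown in this notebook. For intuition only, one common way to load a class from a file path uses `importlib`. In the sketch below, `load_class_from_file` and its selection rule (the alphabetically first class defined in the file whose name ends in `Agent`) are assumptions, not the challenge's actual behavior.

```python
import importlib.util
import inspect
import os
import tempfile


def load_class_from_file(path: str, suffix: str = "Agent"):
    """Import the file at `path` and return a class defined in it whose name
    ends with `suffix` (hypothetical rule: alphabetically first match)."""
    module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # getmembers returns (name, object) pairs sorted by name.
    for name, obj in inspect.getmembers(module, inspect.isclass):
        # Skip classes that were merely imported into the file.
        if obj.__module__ == module_name and name.endswith(suffix):
            return obj
    raise ImportError(f"No class ending in {suffix!r} found in {path}")


# Demo on a throwaway file, so the sketch runs without the challenge repository.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "agent_demo.py")
    with open(demo_path, "w") as f:
        f.write("class DemoAgent:\n    pass\n")
    DemoClass = load_class_from_file(demo_path)
    print(DemoClass.__name__)  # DemoAgent
```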


## Evaluate on Specified Wind Scenario

Let's evaluate your agent on the wind scenario you selected:

In [12]:
def print_evaluation_results(results):
    """Print evaluation results in a readable format."""
    print("\n" + "="*50)
    print("EVALUATION RESULTS")
    print("="*50)
    
    print(f"Success Rate: {results['success_rate']:.2%}")
    print(f"Mean Reward: {results['mean_reward']:.2f} ± {results['std_reward']:.2f}")
    print(f"Mean Steps: {results['mean_steps']:.1f} ± {results['std_steps']:.1f}")
    
    if 'individual_results' in results:
        print("\nIndividual Episode Results:")
        for episode in results['individual_results']:
            print(f"  Seed {episode['seed']}: " + 
                  f"Reward={episode['reward']:.1f}, " +
                  f"Steps={episode['steps']}, " +
                  f"Success={'✓' if episode['success'] else '✗'}")
    
    print("="*50)

# Only run if the agent was successfully loaded
if 'agent' in locals():
    # Get the selected wind scenario
    wind_scenario = get_wind_scenario(WIND_SCENARIO_NAME)
    
    print(f"Evaluating agent on wind scenario: {WIND_SCENARIO_NAME}")
    print(f"Using {len(SEEDS)} seeds with max horizon of {MAX_HORIZON} steps")
    
    # Run the evaluation
    results = evaluate_agent(
        agent=agent,
        wind_scenario=wind_scenario,
        seeds=SEEDS,
        max_horizon=MAX_HORIZON,
        verbose=VERBOSE,
        render=RENDER,
        full_trajectory=True  # Need full trajectory for later visualization
    )
    
    # Display the results
    print_evaluation_results(results)

Evaluating agent on wind scenario: simple_static
Using 1 seeds with max horizon of 200 steps


Evaluating seeds:   0%|          | 0/1 [00:00<?, ?it/s]


EVALUATION RESULTS
Success Rate: 0.00%
Mean Reward: 0.00 ± 0.00
Mean Steps: 200.0 ± 0.0

Individual Episode Results:
  Seed 1: Reward=0.0, Steps=200, Success=✗
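
The aggregate figures can be reproduced from the per-episode entries. The sketch below assumes each episode is a dict with the keys used in `print_evaluation_results` (`seed`, `reward`, `steps`, `success`) and that the spread is a population standard deviation. How `evaluate_agent` actually aggregates is not shown here, and `summarize_episodes` is a hypothetical helper.

```python
import numpy as np


def summarize_episodes(episodes):
    """Aggregate per-episode dicts (keys: seed, reward, steps, success)."""
    rewards = np.array([ep["reward"] for ep in episodes], dtype=float)
    steps = np.array([ep["steps"] for ep in episodes], dtype=float)
    successes = np.array([ep["success"] for ep in episodes], dtype=float)
    return {
        "success_rate": float(successes.mean()),
        "mean_reward": float(rewards.mean()),
        "std_reward": float(rewards.std()),  # population std (ddof=0), an assumption
        "mean_steps": float(steps.mean()),
        "std_steps": float(steps.std()),     # population std (ddof=0), an assumption
    }


# The single episode reported above: seed 1, zero reward, horizon reached.
episodes = [{"seed": 1, "reward": 0.0, "steps": 200, "success": False}]
summary = summarize_episodes(episodes)
print(summary)
```

With a single seed the population standard deviation is 0 by construction, so the `± 0.00` above says nothing about variability. Use several seeds to get a meaningful spread.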


## Evaluate on All Training Scenarios

To get a comprehensive evaluation, you can test your agent on all training scenarios:

In [13]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Choose which wind scenarios to evaluate on
TRAINING_WIND_SCENARIOS = ["simple_static", "training_1", "training_2", "training_3"]

# Evaluation parameters for all wind scenarios
ALL_SEEDS = [42, 43, 44, 45, 46]  # Seeds to use for all evaluations
ALL_MAX_HORIZON = 200             # Maximum steps per episode

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Only run if the agent was successfully loaded
if 'agent' in locals():
    # Store results for each wind scenario
    all_results = {}
    
    print(f"Evaluating agent on {len(TRAINING_WIND_SCENARIOS)} wind scenarios (including simple static)...")
    
    # Evaluate on each wind scenario
    for wind_scenario_name in TRAINING_WIND_SCENARIOS:
        print(f"\nWind scenario: {wind_scenario_name}")
        
        # Get the wind scenario
        wind_scenario = get_wind_scenario(wind_scenario_name)
        
        # Run the evaluation
        results = evaluate_agent(
            agent=agent,
            wind_scenario=wind_scenario,
            seeds=ALL_SEEDS,
            max_horizon=ALL_MAX_HORIZON,
            verbose=False,  # Less verbose for multiple evaluations
            render=False,
            full_trajectory=False
        )
        
        # Store results
        all_results[wind_scenario_name] = results
        
        # Print summary
        print(f"  Success Rate: {results['success_rate']:.2%}")
        print(f"  Mean Reward: {results['mean_reward']:.2f}")
        print(f"  Mean Steps: {results['mean_steps']:.1f}")
    
    # Print overall performance
    total_success = sum(r['success_rate'] for r in all_results.values()) / len(all_results)
    print("\n" + "="*50)
    print(f"OVERALL SUCCESS RATE: {total_success:.2%}")
    print("="*50)

Evaluating agent on 4 wind scenarios (including simple static)...

Wind scenario: simple_static
  Success Rate: 0.00%
  Mean Reward: 0.00
  Mean Steps: 200.0

Wind scenario: training_1
  Success Rate: 100.00%
  Mean Reward: 28.51
  Mean Steps: 127.4

Wind scenario: training_2
  Success Rate: 100.00%
  Mean Reward: 70.26
  Mean Steps: 36.2

Wind scenario: training_3
  Success Rate: 100.00%
  Mean Reward: 47.44
  Mean Steps: 75.4

OVERALL SUCCESS RATE: 75.00%
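
The overall figure is the unweighted mean of the per-scenario success rates. That equals the fraction of successful episodes only when every scenario uses the same number of seeds, as it does here. The sketch below contrasts the two; both function names and the uneven episode counts are hypothetical.

```python
def macro_success_rate(rates):
    """Unweighted mean of per-scenario success rates."""
    return sum(rates) / len(rates)


def pooled_success_rate(successes, episodes):
    """Fraction of successful episodes with all scenarios pooled together."""
    return sum(successes) / sum(episodes)


# Per-scenario rates from the output above: simple_static, training_1 to training_3.
rates = [0.0, 1.0, 1.0, 1.0]
print(f"{macro_success_rate(rates):.2%}")  # 75.00%

# Hypothetical uneven run: 20 episodes on the failing scenario, 5 on each of the others.
print(f"{pooled_success_rate([0, 5, 5, 5], [20, 5, 5, 5]):.2%}")  # 42.86%
```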


## Summary Results Across Wind Scenarios

The table below summarizes your agent's performance across all the training wind scenarios. 
This gives you a comprehensive view of how well your agent generalizes to different wind patterns and conditions.

A strong agent should:
1. Maintain a high success rate across all wind scenarios
2. Achieve a high mean reward
3. Reach the goal in fewer steps

Compare your agent's performance across wind scenarios to identify potential weaknesses that you might address in future improvements.

In [14]:
#############################################
### SUMMARY TABLE FOR ALL WIND SCENARIOS ####
#############################################

# Only run if the agent was successfully loaded and evaluated on multiple wind scenarios
if 'agent' in locals() and 'all_results' in locals():
    # Create summary table with pandas
    import pandas as pd
    
    # Prepare data for summary table
    summary_data = []
    for wind_scenario_name, results in all_results.items():
        summary_data.append({
            'Wind Scenario': wind_scenario_name.upper(),
            'Mean Reward': f"{results['mean_reward']:.2f} ± {results['std_reward']:.2f}",
            'Success Rate': f"{results['success_rate']:.2%}",
            'Mean Steps': f"{results['mean_steps']:.1f} ± {results['std_steps']:.1f}"
        })
    
    # Create summary DataFrame
    summary_df = pd.DataFrame(summary_data)
    
    # Display summary table
    from IPython.display import display
    print("\nSummary of Results Across All Wind Scenarios:")
    display(summary_df)
    
    # Calculate average across wind scenarios
    avg_success_rate = np.mean([results['success_rate'] for results in all_results.values()])
    avg_reward = np.mean([results['mean_reward'] for results in all_results.values()])
    avg_steps = np.mean([results['mean_steps'] for results in all_results.values()])
    
    print("\nAverage Across Training Wind Scenarios:")
    print(f"  Success Rate: {avg_success_rate:.2%}")
    print(f"  Mean Reward: {avg_reward:.2f}")
    print(f"  Mean Steps: {avg_steps:.1f}")
    print("\nNote: Your final evaluation will include hidden test wind scenarios.")


Summary of Results Across All Wind Scenarios:


|   | Wind Scenario | Mean Reward | Success Rate | Mean Steps |
|---|---------------|-------------|--------------|------------|
| 0 | SIMPLE_STATIC | 0.00 ± 0.00 | 0.00% | 200.0 ± 0.0 |
| 1 | TRAINING_1 | 28.51 ± 5.11 | 100.00% | 127.4 ± 17.4 |
| 2 | TRAINING_2 | 70.26 ± 2.70 | 100.00% | 36.2 ± 3.8 |
| 3 | TRAINING_3 | 47.44 ± 2.97 | 100.00% | 75.4 ± 6.5 |



Average Across Training Wind Scenarios:
  Success Rate: 75.00%
  Mean Reward: 36.55
  Mean Steps: 109.8

Note: Your final evaluation will include hidden test wind scenarios.
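
To spot weaknesses quickly, the scenarios can be sorted from weakest to strongest. The sketch below copies the per-scenario numbers from the summary and orders them by success rate, then by mean reward. This ordering key is one hypothetical choice among many, and `rank_scenarios` is not part of the challenge code.

```python
# Per-scenario metrics copied from the summary above.
example_results = {
    "simple_static": {"success_rate": 0.0, "mean_reward": 0.00, "mean_steps": 200.0},
    "training_1": {"success_rate": 1.0, "mean_reward": 28.51, "mean_steps": 127.4},
    "training_2": {"success_rate": 1.0, "mean_reward": 70.26, "mean_steps": 36.2},
    "training_3": {"success_rate": 1.0, "mean_reward": 47.44, "mean_steps": 75.4},
}


def rank_scenarios(results):
    """Sort scenario names from weakest to strongest.

    Ordering key (a hypothetical choice): success rate first, then mean reward.
    """
    return sorted(results,
                  key=lambda name: (results[name]["success_rate"],
                                    results[name]["mean_reward"]))


ranking = rank_scenarios(example_results)
print("Weakest to strongest:", ranking)
```

In the notebook, `all_results` could be passed in place of `example_results`, since it carries the same keys.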


## Visualize Agent Behavior (Optional)

If you want to see how your agent behaves in a specific wind scenario, you can visualize its trajectory.
First, enable rendering by setting `VISUALIZE = True` below.

In [15]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Set to True to enable visualization
VISUALIZE = True

# Visualization parameters
VIZ_WIND_SCENARIO_NAME = "simple_static"  # Choose which wind scenario to visualize
VIZ_SEED = 1                    # Choose a single seed for visualization

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Only run if visualization is enabled and agent is loaded
if VISUALIZE and 'agent' in locals():
    # Get the wind scenario with visualization parameters
    viz_wind_scenario = get_wind_scenario(VIZ_WIND_SCENARIO_NAME)
    viz_wind_scenario.update({
        'env_params': {
            'wind_grid_density': 25,
            'wind_arrow_scale': 80,
            'render_mode': "rgb_array"
        }
    })
    
    print(f"Visualizing agent behavior on wind scenario: {VIZ_WIND_SCENARIO_NAME}")
    print(f"Using seed: {VIZ_SEED}")
    
    # Run the evaluation with visualization enabled
    viz_results = evaluate_agent(
        agent=agent,
        wind_scenario=viz_wind_scenario,
        seeds=VIZ_SEED,
        max_horizon=MAX_HORIZON,
        verbose=False,
        render=True,
        full_trajectory=True  # Enable full trajectory for visualization
    )
    
    # Visualize the trajectory with a slider
    visualize_trajectory(viz_results, None, with_slider=True)
else:
    if 'agent' in locals():
        print("Visualization is disabled. Set VISUALIZE = True to see agent behavior.")

Visualizing agent behavior on wind scenario: simple_static
Using seed: 1


[Interactive widget: trajectory viewer with a `Step` slider ranging from 0 to 199]
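
One detail of the cell above: calling `update` with an `'env_params'` key replaces any `env_params` entry the scenario already has. If `get_wind_scenario` returns a shared object rather than a copy (not shown in this notebook), it also changes the scenario for later cells. A non-mutating merge is sketched below; `with_env_params` and the contents of `base` are hypothetical.

```python
import copy


def with_env_params(scenario: dict, **overrides) -> dict:
    """Return a deep copy of `scenario` with `overrides` merged into 'env_params'."""
    merged = copy.deepcopy(scenario)
    env_params = dict(merged.get("env_params", {}))
    env_params.update(overrides)  # keep existing keys, add or overwrite the given ones
    merged["env_params"] = env_params
    return merged


# Hypothetical scenario dict; the real structure comes from get_wind_scenario().
base = {"env_params": {"some_existing_key": 1}}
viz = with_env_params(base, wind_grid_density=25, wind_arrow_scale=80,
                      render_mode="rgb_array")

print(viz["env_params"])   # existing key kept, three visualization keys added
print(base["env_params"])  # original left unchanged
```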

## Command-Line Evaluation

For quick evaluation of your agent on different scenarios, you can use the command-line interface:

```bash
cd src
python3 evaluate_submission.py agents/agent_naive.py --wind_scenario training_1 --seeds 1 --num-seeds 100 --verbose
```

### Command Options

- `agents/agent_naive.py`: Path to your agent implementation file
- `--wind_scenario NAME`: Specific wind scenario to evaluate on (e.g., `simple_static`, `training_1`, `training_2`, `training_3`)
- `--seeds N`: Starting seed number (default: 1)
- `--num-seeds N`: Number of consecutive seeds to evaluate on (default: 1)
- `--output FILE`: Save results to a JSON file (e.g., `--output results.json`)
- `--verbose`: Show detailed evaluation results (default: simplified output)
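
To make the options concrete, here is a minimal `argparse` sketch that mirrors the flags listed above and shows how a starting seed plus a count expands into consecutive seeds. It is an illustration under assumptions, not the actual contents of `evaluate_submission.py`; `build_parser` and `expand_seeds` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Evaluate a sailing agent (sketch).")
    parser.add_argument("agent_path", help="Path to the agent implementation file")
    parser.add_argument("--wind_scenario", default=None,
                        help="Scenario name; omit to run all training scenarios")
    parser.add_argument("--seeds", type=int, default=1, help="Starting seed number")
    parser.add_argument("--num-seeds", type=int, default=1,
                        help="Number of consecutive seeds")
    parser.add_argument("--output", default=None, help="JSON file for the results")
    parser.add_argument("--verbose", action="store_true", help="Show detailed results")
    return parser


def expand_seeds(start: int, count: int) -> list:
    """Consecutive seeds: start, start + 1, ..., start + count - 1."""
    return list(range(start, start + count))


args = build_parser().parse_args(
    ["agents/agent_naive.py", "--wind_scenario", "training_1",
     "--seeds", "1", "--num-seeds", "3"])
print(expand_seeds(args.seeds, args.num_seeds))  # [1, 2, 3]
```

Under this reading, `--seeds 1 --num-seeds 100` evaluates seeds 1 through 100.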

### Evaluating on Multiple Wind Scenarios

To evaluate on all training wind scenarios:

```bash
cd src
python3 evaluate_submission.py agents/agent_naive.py --seeds 1 --num-seeds 100
```

This runs your agent on all available training wind scenarios and computes the average performance.

### Sample Output (Simplified)

```text
Validating agent: agents/agent_naive.py
✅ Successfully loaded agent: NaiveAgent
Evaluating on 4 scenarios with 100 seeds each
SCENARIO | SUCCESS RATE | MEAN REWARD | MEAN STEPS
static_headwind | Success: 100.00% | Reward: 78.50 ± 2.15 | Steps: 25.3 ± 3.2
training_1 | Success: 98.00% | Reward: 61.43 ± 3.85 | Steps: 49.2 ± 6.2
training_2 | Success: 94.00% | Reward: 58.21 ± 4.12 | Steps: 53.8 ± 7.5
training_3 | Success: 96.00% | Reward: 59.87 ± 3.96 | Steps: 51.4 ± 6.8
======================================================================
OVERALL | 97.00% ± 2.50% | 64.50 ± 3.27 | 44.7 ± 4.3
======================================================================
```


For more detailed output, add the `--verbose` flag to see seed-by-seed results.

## Conclusion

This notebook provides a standardized way to evaluate agents for the Sailing Challenge. You've now:

1. Validated your agent's implementation to ensure it meets the interface requirements
2. Evaluated your agent on one or more wind scenarios to measure its performance
3. Viewed a summary of your agent's results across multiple wind scenarios
4. Optionally visualized your agent's behavior in a specific wind scenario

### Next Steps

- **Fine-tune your agent**: Use the performance metrics to identify areas for improvement
- **Test across all wind scenarios**: Ensure your agent can handle different wind patterns
- **Optimize for efficiency**: Aim to reach the goal in fewer steps
- **Consider advanced strategies**: Experiment with algorithms that better account for wind physics

Remember that your final agent will be evaluated on both the training wind scenarios and hidden test wind scenarios, so your agent should be robust and adaptable.

Good luck with your agent submission!