# Private Submission Evaluation (Evaluator Only)

This notebook provides a comprehensive evaluation system for the Sailing Challenge submissions. It is **for evaluator use only** and differs from the student version in that it:

1. Includes evaluation on the hidden test scenario
2. Provides more detailed diagnostic information
3. Compares performance across all scenarios including the private test scenario

Use this notebook to evaluate student submissions and determine their ranking.

## Setup

First, let's import the necessary modules and set up the evaluation environment:

In [1]:
# Initial imports and setup
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from pathlib import Path

# Import the environment and evaluation modules
sys.path.append(os.path.abspath('../src'))
sys.path.append(os.path.abspath('..'))  # Add project root to path

from env_sailing import SailingEnv
from evaluation import evaluate_agent, visualize_trajectory
from agents.base_agent import BaseAgent
from scenarios import get_scenario
from scenarios.private_scenarios import TEST_SCENARIO, ALL_SCENARIOS
from src.test_agent_validity import validate_agent

# List available scenarios including the test scenario
print("Available evaluation scenarios:")
for scenario_name in sorted(ALL_SCENARIOS.keys()):
    print(f"- {scenario_name}" + (" (HIDDEN TEST)" if scenario_name == "test" else ""))

Available evaluation scenarios:
- simple_test
- test (HIDDEN TEST)
- training_1
- training_2
- training_3


## Configuration Parameters

Set your evaluation parameters below. You can easily modify these values without changing the rest of the notebook.

In [8]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Path to the submission file (change this to evaluate different submissions)
SUBMISSION_PATH = "../src/agents/agent_greedy.py"

# Number of seeds to use for evaluation
# For quick testing, use a small number (e.g., 10)
# For final evaluation, use a larger number (e.g., 100)
EVAL_SEEDS = list(range(1, 11))  # Seeds 1-10 for quick testing
# EVAL_SEEDS = list(range(1, 101))  # Uncomment for full evaluation (seeds 1-100)

# Maximum steps per episode
MAX_HORIZON = 200

# Whether to show visualization of agent trajectories
ENABLE_VISUALIZATION = True

# Whether to show progress bars during evaluation
VERBOSE = True

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

print(f"Submission to evaluate: {SUBMISSION_PATH}")
print(f"Using {len(EVAL_SEEDS)} seeds: {EVAL_SEEDS[0]}...{EVAL_SEEDS[-1]}")
print(f"Max steps per episode: {MAX_HORIZON}")

Submission to evaluate: ../src/agents/agent_greedy.py
Using 10 seeds: 1...10
Max steps per episode: 200


## 1. Load and Validate Submission

First, let's load and validate the student's submission to ensure it meets the required interface.

In [9]:
def load_submission(submission_path):
    """Load a student submission from a Python file.
    
    Args:
        submission_path: Path to the submission file
        
    Returns:
        An instance of the agent class defined in the submission file
    """
    # First, validate the agent implementation
    validation_results = validate_agent(submission_path)
    
    if not validation_results['valid']:
        print(f"⚠️ VALIDATION FAILED: The agent does not meet requirements!")
        for i, error in enumerate(validation_results['errors'], 1):
            print(f"  Error {i}: {error}")
        
        if validation_results['warnings']:
            print("\nWarnings:")
            for i, warning in enumerate(validation_results['warnings'], 1):
                print(f"  Warning {i}: {warning}")
                
        print("\nPlease fix these issues before evaluation.")
        return None
    
    # If validation passed, create an instance of the agent
    agent_class = validation_results['agent_class']
    agent = agent_class()
    
    print(f"✓ Validation successful: Agent '{agent_class.__name__}' loaded from {submission_path}")
    
    # Print any warnings
    if validation_results['warnings']:
        print("\nWarnings (non-critical):")
        for i, warning in enumerate(validation_results['warnings'], 1):
            print(f"  Warning {i}: {warning}")
    
    return agent

# Load the agent
agent = load_submission(SUBMISSION_PATH)

if agent is None:
    print("Evaluation aborted due to validation failure.")
else:
    print("Agent loaded successfully. Ready for evaluation.")

✓ Validation successful: Agent 'GreedyAgent' loaded from ../src/agents/agent_greedy.py
Agent loaded successfully. Ready for evaluation.


## 2. Define Evaluation Scenarios

Now let's prepare the scenarios we'll use to evaluate the agent, including both training scenarios and the hidden test scenario.

In [10]:
# Set visualization parameters for scenarios
viz_params = {
    'env_params': {
        'wind_grid_density': 25,    # Fewer arrows = clearer visualization
        'wind_arrow_scale': 80,     # Larger value = smaller arrows
        'render_mode': "rgb_array" if ENABLE_VISUALIZATION else None
    }
}

# Define all evaluation scenarios
evaluation_scenarios = {
    "training_1": get_scenario("training_1"),
    "training_2": get_scenario("training_2"),
    "training_3": get_scenario("training_3"),
    "test": TEST_SCENARIO  # The hidden test scenario
}

# Add visualization parameters to each scenario
for scenario_name, scenario in evaluation_scenarios.items():
    scenario.update(viz_params)

print("Prepared evaluation scenarios:")
for name in evaluation_scenarios.keys():
    print(f"  - {name.upper()}" + (" (HIDDEN TEST)" if name == "test" else ""))

Prepared evaluation scenarios:
  - TRAINING_1
  - TRAINING_2
  - TRAINING_3
  - TEST (HIDDEN TEST)


## 3. Evaluate Agent on All Scenarios

Let's evaluate the agent on all scenarios, including the hidden test scenario. We'll run multiple episodes with different seeds for robust evaluation.

In [11]:
# Only run if the agent was successfully loaded
if 'agent' in locals():
    # Initialize results storage
    all_results = {}
    demo_results = {}
    
    # Evaluate on each scenario
    for scenario_name, scenario in evaluation_scenarios.items():
        print(f"\nEvaluating on {scenario_name.upper()} scenario...")
        
        # Run evaluation with multiple seeds
        results = evaluate_agent(
            agent=agent,
            scenario=scenario,
            seeds=EVAL_SEEDS,
            max_horizon=MAX_HORIZON,
            verbose=VERBOSE,
            render=False  # Don't render during bulk evaluation
        )
        
        # Store results
        all_results[scenario_name] = results
        
        # Print evaluation results
        print(f"  Mean Reward: {results['mean_reward']:.2f} ± {results['std_reward']:.2f}")
        print(f"  Success Rate: {results['success_rate']:.2%}")
        print(f"  Mean Steps: {results['mean_steps']:.1f} ± {results['std_steps']:.1f}")
        
        # Store detailed individual episode results
        if 'individual_results' in results:
            successes = [ep['success'] for ep in results['individual_results']]
            failures = len(successes) - sum(successes)
            if failures > 0:
                print(f"  ⚠️ Failed on {failures} out of {len(successes)} episodes")
                
        # Run one demo episode with rendering for visualization if enabled
        if ENABLE_VISUALIZATION:
            demo_seed = EVAL_SEEDS[0]  # Use the first seed for demonstration
            demo_result = evaluate_agent(
                agent=agent,
                scenario=scenario,
                seeds=[demo_seed],
                max_horizon=MAX_HORIZON,
                verbose=False,
                render=True,
                full_trajectory=True
            )
                
            demo_results[scenario_name] = demo_result
else:
    print("Evaluation skipped due to agent loading failure.")


Evaluating on TRAINING_1 scenario...


Evaluating seeds:   0%|          | 0/10 [00:00<?, ?it/s]

  Mean Reward: 10.96 ± 11.92
  Success Rate: 50.00%
  Mean Steps: 177.9 ± 28.9
  ⚠️ Failed on 5 out of 10 episodes

Evaluating on TRAINING_2 scenario...


Evaluating seeds:   0%|          | 0/10 [00:00<?, ?it/s]

  Mean Reward: 14.10 ± 9.87
  Success Rate: 70.00%
  Mean Steps: 173.7 ± 23.7
  ⚠️ Failed on 3 out of 10 episodes

Evaluating on TRAINING_3 scenario...


Evaluating seeds:   0%|          | 0/10 [00:00<?, ?it/s]

  Mean Reward: 0.00 ± 0.00
  Success Rate: 0.00%
  Mean Steps: 200.0 ± 0.0
  ⚠️ Failed on 10 out of 10 episodes

Evaluating on TEST scenario...


Evaluating seeds:   0%|          | 0/10 [00:00<?, ?it/s]

  Mean Reward: 0.00 ± 0.00
  Success Rate: 0.00%
  Mean Steps: 200.0 ± 0.0
  ⚠️ Failed on 10 out of 10 episodes


## 4. Performance Summary and Analysis

This section provides a detailed summary of the agent's performance across all scenarios, with special attention to performance on the hidden test scenario.

In [12]:
# Only run if we have evaluation results
if 'all_results' in locals() and all_results:
    # Create summary table for all scenarios
    summary_data = []
    for scenario_name, results in all_results.items():
        summary_data.append({
            'Scenario': scenario_name.upper() + (" (TEST)" if scenario_name == "test" else ""),
            'Success Rate': f"{results['success_rate']:.2%}",
            'Mean Reward': f"{results['mean_reward']:.2f} ± {results['std_reward']:.2f}",
            'Mean Steps': f"{results['mean_steps']:.1f} ± {results['std_steps']:.1f}"
        })
    
    # Create and display summary DataFrame
    summary_df = pd.DataFrame(summary_data)
    display(summary_df)
    
    # Calculate average performance metrics
    avg_success_rate = np.mean([results['success_rate'] for results in all_results.values()])
    avg_reward = np.mean([results['mean_reward'] for results in all_results.values()])
    avg_steps = np.mean([results['mean_steps'] for results in all_results.values()])
    
    # Calculate test scenario performance specifically
    test_success_rate = all_results['test']['success_rate'] if 'test' in all_results else 0
    test_reward = all_results['test']['mean_reward'] if 'test' in all_results else 0
    
    # Calculate overall weighted score (50% test, 50% training average)
    if 'test' in all_results:
        training_success = np.mean([
            results['success_rate'] for name, results in all_results.items() 
            if name != 'test'
        ])
        training_reward = np.mean([
            results['mean_reward'] for name, results in all_results.items() 
            if name != 'test'
        ])
        
        weighted_success = 0.5 * test_success_rate + 0.5 * training_success
        weighted_reward = 0.5 * test_reward + 0.5 * training_reward
        
        print("\nPerformance Analysis:")
        print(f"  Average Success Rate: {avg_success_rate:.2%}")
        print(f"  Average Reward: {avg_reward:.2f}")
        print(f"  Test Scenario Success Rate: {test_success_rate:.2%}")
        print(f"  Test Scenario Reward: {test_reward:.2f}")
        print(f"\nWeighted Score (50% test, 50% training):")
        print(f"  Weighted Success Rate: {weighted_success:.2%}")
        print(f"  Weighted Reward: {weighted_reward:.2f}")
        
        # Provide a simple grading metric
        grade = "A+" if weighted_success > 0.98 else \
                "A" if weighted_success > 0.95 else \
                "B" if weighted_success > 0.9 else \
                "C" if weighted_success > 0.8 else \
                "D" if weighted_success > 0.7 else "F"
                
        print(f"\nOverall Grade: {grade}")
else:
    print("No evaluation results available.")

Unnamed: 0,Scenario,Success Rate,Mean Reward,Mean Steps
0,TRAINING_1,50.00%,10.96 ± 11.92,177.9 ± 28.9
1,TRAINING_2,70.00%,14.10 ± 9.87,173.7 ± 23.7
2,TRAINING_3,0.00%,0.00 ± 0.00,200.0 ± 0.0
3,TEST (TEST),0.00%,0.00 ± 0.00,200.0 ± 0.0



Performance Analysis:
  Average Success Rate: 30.00%
  Average Reward: 6.26
  Test Scenario Success Rate: 0.00%
  Test Scenario Reward: 0.00

Weighted Score (50% test, 50% training):
  Weighted Success Rate: 20.00%
  Weighted Reward: 4.18

Overall Grade: F


## 5. Visualize Agent Behavior

In this section, we'll visualize how our agent performs in the sailing environment. The visualization will show the agent's trajectory and the wind conditions throughout the episode.

In [13]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Set to True to enable visualization
VISUALIZE = True

# Visualization parameters
VIZ_SCENARIO_NAME = "training_1"  # Choose which scenario to visualize
VIZ_SEED = 42                    # Choose a single seed for visualization

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Only run if visualization is enabled and agent is loaded
if VISUALIZE and 'agent' in locals():
    # Get the scenario with visualization parameters
    viz_scenario = get_scenario(VIZ_SCENARIO_NAME)
    viz_scenario.update({
        'env_params': {
            'wind_grid_density': 25,
            'wind_arrow_scale': 80,
            'render_mode': "rgb_array"
        }
    })
    
    print(f"Visualizing agent behavior on scenario: {VIZ_SCENARIO_NAME}")
    print(f"Using seed: {VIZ_SEED}")
    
    # Run the evaluation with visualization enabled
    viz_results = evaluate_agent(
        agent=agent,
        scenario=viz_scenario,
        seeds=VIZ_SEED,
        max_horizon=MAX_HORIZON,
        verbose=False,
        render=True,
        full_trajectory=True  # Enable full trajectory for visualization
    )
    
    # Visualize the trajectory with a slider
    visualize_trajectory(viz_results, None, with_slider=True)
else:
    if 'agent' in locals():
        print("Visualization is disabled. Set VISUALIZE = True to see agent behavior.")
    else:
        print("No agent defined. Please run the cells above to create an agent first.")

Visualizing agent behavior on scenario: training_1
Using seed: 42


interactive(children=(IntSlider(value=0, description='Step:', max=199), Output()), _dom_classes=('widget-inter…

## 6. Compare Performance Across Scenarios

Now let's visualize how our agent performs across different scenarios. This will help us understand the agent's robustness to different wind conditions.

In [14]:
# Set to True to enable multi-scenario comparison
COMPARE_SCENARIOS = True

# Scenarios to compare
SCENARIO_NAMES = ["training_1", "training_2", "training_3", "test"]
COMPARISON_SEED = 42

if COMPARE_SCENARIOS and 'agent' in locals():
    # Visualization parameters
    viz_params = {
        'env_params': {
            'wind_grid_density': 25,
            'wind_arrow_scale': 80,
            'render_mode': "rgb_array"
        }
    }
    
    # Store results for each scenario
    scenario_results = {}
    
    # Evaluate agent on each scenario
    for scenario_name in SCENARIO_NAMES:
        print(f"\nEvaluating on {scenario_name}...")
        
        # Get the scenario with visualization parameters
        scenario = get_scenario(scenario_name)
        scenario.update(viz_params)
        
        # Run the evaluation
        results = evaluate_agent(
            agent=agent,
            scenario=scenario,
            seeds=COMPARISON_SEED,
            max_horizon=MAX_HORIZON,
            verbose=False,
            render=True,
            full_trajectory=True
        )
        
        # Store results
        scenario_results[scenario_name] = results
        
        # Display performance metrics
        rewards = results['rewards']
        steps = results['mean_steps']
        discounted_reward = results['mean_reward']
        success = any(r > 0 for r in rewards)
        
        print(f"Scenario: {scenario_name}")
        if success:
            print(f"✅ Success! Reached goal in {steps:.0f} steps")
        else:
            print(f"❌ Failed to reach goal (max {steps:.0f} steps)")
        print(f"Mean Discounted Reward: {discounted_reward:.2f}")
        
        # Visualize the trajectory
        print(f"\nTrajectory for {scenario_name}:")
        visualize_trajectory(results, None, with_slider=True)
        
    # Print summary comparison
    print("\n===== Scenario Comparison Summary =====")
    for scenario_name, results in scenario_results.items():
        success = any(r > 0 for r in results['rewards'])
        print(f"{scenario_name}: {'✅ Success' if success else '❌ Failed'} | "
              f"Steps: {results['mean_steps']:.0f} | "
              f"Reward: {results['mean_reward']:.2f}")
else:
    if 'agent' in locals():
        print("Scenario comparison is disabled. Set COMPARE_SCENARIOS = True to compare performance across scenarios.")
    else:
        print("No agent defined. Please run the cells above to create an agent first.")


Evaluating on training_1...
Scenario: training_1
❌ Failed to reach goal (max 200 steps)
Mean Discounted Reward: 0.00

Trajectory for training_1:


interactive(children=(IntSlider(value=0, description='Step:', max=199), Output()), _dom_classes=('widget-inter…


Evaluating on training_2...
Scenario: training_2
❌ Failed to reach goal (max 200 steps)
Mean Discounted Reward: 0.00

Trajectory for training_2:


interactive(children=(IntSlider(value=0, description='Step:', max=199), Output()), _dom_classes=('widget-inter…


Evaluating on training_3...
Scenario: training_3
❌ Failed to reach goal (max 200 steps)
Mean Discounted Reward: 0.00

Trajectory for training_3:


interactive(children=(IntSlider(value=0, description='Step:', max=199), Output()), _dom_classes=('widget-inter…


Evaluating on test...


ValueError: Unknown scenario 'test'. Available scenarios: ['training_1', 'training_2', 'training_3', 'simple_test']

## 7. Command-Line Evaluation (For Batch Processing)

For comprehensive evaluation of submissions, use the command-line interface with the test scenario:

```bash
cd Sailing_project_v1/src
python3 evaluate_submission.py agents/submission_example.py --include-test --seeds {1..100}
```

### Command Options

- `agents/submission_example.py`: Path to the agent implementation file
- `--include-test`: Include the hidden test scenario in evaluation (evaluator use only)
- `--seeds {1..100}`: Evaluate on seeds 1 through 100 (bash expansion syntax)
- `--max_horizon N`: Maximum steps per episode (default: 200)
- `--output FILE`: Save results to a JSON file (e.g., `--output results.json`)
- `--verbose`: Show detailed evaluation results (default: simplified output)

### Sample Output (Simplified)

```
Validating agent: agents/submission_example.py
✅ Successfully loaded agent: SubmissionAgent
Evaluating on 4 scenarios with 100 seeds
Max horizon: 200 steps
SCENARIO | SUCCESS RATE | MEAN REWARD | MEAN STEPS
training_1 | Success: 98.00% | Reward: 61.20 ± 3.22 | Steps: 50.0 ± 5.3
training_2 | Success: 97.00% | Reward: 63.84 ± 3.36 | Steps: 45.8 ± 5.3
training_3 | Success: 96.00% | Reward: 62.24 ± 3.06 | Steps: 48.3 ± 4.8
test (TEST) | Success: 95.00% | Reward: 61.93 ± 2.99 | Steps: 48.8 ± 4.9
======================================================================
OVERALL | 96.50% ± 1.12% | 62.30 ± 0.96 | 48.2 ± 1.5
WEIGHTED FINAL REWARD: 62.18 (50% test, 50% training)
======================================================================
```


### Evaluate Only on Test Scenario

To evaluate an agent exclusively on the hidden test scenario:

```bash
cd Sailing_project_v1/src
python3 evaluate_submission.py agents/agent_greedy.py --scenario test --seeds {1..100}
```

### Evaluate on All Scenarios (Without Training Output)

To evaluate on all scenarios and save results to a file without displaying detailed output:

```bash
cd Sailing_project_v1/src
python3 evaluate_submission.py agents/agent_greedy.py --include-test --seeds {1..100} --output results.json
```

The **weighted final reward** (combining 50% test and 50% training performance) is the key metric for ranking submissions. This metric ensures that agents are evaluated both on their ability to handle the training scenarios and to generalize to new conditions in the test scenario.

For detailed seed-by-seed results, especially useful when analyzing failures, add the `--verbose` flag to the command.

## 8. Conclusion

This private evaluation framework provides a comprehensive assessment of sailing agents with these key metrics:

- **Mean reward** is the primary performance indicator, reflecting sailing efficiency
- **Standard deviation** across seeds indicates agent robustness and consistency
- **Weighted final reward** (50% test + 50% training) provides the definitive ranking metric

When evaluating submissions:
1. Run batch evaluations with 100+ seeds for statistical significance
2. Examine performance gaps between training and test scenarios to assess generalization
3. Use `--verbose` and visualization for qualitative assessment of sailing strategies
4. Consider reward distribution across different seeds to identify edge cases

The private test scenario ensures agents are evaluated on their ability to handle novel conditions, not just memorize training scenarios. This comprehensive evaluation approach rewards agents that implement robust, adaptable sailing strategies rather than brittle, overfitted solutions.