# Notebook 5: Training Policies

Now comes the exciting part: **learning**! In this notebook, we'll train policies to balance the pendulum automatically, instead of setting the weights by hand.

## What You'll Learn

1. The training objective
2. Random search (simplest learning)
3. Hill climbing
4. Evolution strategies
5. Comparing algorithms

In [None]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt
from src.environments import InvertedPendulumEnv
from src.policies import LinearPolicy, NeuralNetworkPolicy
from src.utils import train_policy, evaluate_policy, plot_training_progress

## 1. The Training Objective

Our goal is to find policy parameters $\theta$ that maximize expected return:

$$\theta^* = \arg\max_\theta \mathbb{E}\left[ \sum_{t=0}^{T} r_t \right]$$

In simpler terms:
- Run the policy in the environment
- Calculate total reward
- Adjust parameters to get higher reward
- Repeat!

### Key Challenges
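- We don't have gradients of the reward with respect to the parameters (unlike supervised learning, where the loss is differentiable)
- Rewards are noisy (the environment is stochastic, so the same policy scores differently each episode)
- Credit assignment: which actions led to good outcomes?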
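
To make the noise challenge concrete, here is a minimal sketch of a Monte Carlo estimate of the expected return. It does not use the pendulum environment: `noisy_return` is a hypothetical stand-in for a single episode's return (a fixed true value plus Gaussian noise), and `estimate_return` is an illustrative helper, not part of `src.utils`.

```python
import numpy as np

def noisy_return(theta, rng):
    """Hypothetical stand-in for one episode's return: true value plus noise."""
    true_value = -np.sum((theta - 1.0) ** 2)
    return true_value + rng.normal(scale=5.0)

def estimate_return(theta, n_episodes, rng):
    """Monte Carlo estimate: average the return over n_episodes rollouts."""
    return np.mean([noisy_return(theta, rng) for _ in range(n_episodes)])

rng = np.random.default_rng(0)
theta = np.zeros(4)

# Repeat each kind of estimate many times to see how much it fluctuates.
single = [estimate_return(theta, 1, rng) for _ in range(500)]
averaged = [estimate_return(theta, 25, rng) for _ in range(500)]

print(f"std with 1 episode:   {np.std(single):.2f}")
print(f"std with 25 episodes: {np.std(averaged):.2f}")
```

For independent episodes, the spread of the averaged estimate shrinks roughly with the square root of the number of episodes, which is why averaging makes it easier to tell two similar policies apart.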

In [None]:
# Let's see how noisy the reward signal is
env = InvertedPendulumEnv()
policy = LinearPolicy(weights=np.array([0, 0, 10, 3]))

rewards = []
for i in range(100):
    result = evaluate_policy(env, policy, n_episodes=1)
    rewards.append(result['mean_reward'])

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Same Policy, Different Outcomes')

plt.subplot(1, 2, 2)
plt.hist(rewards, bins=20)
plt.xlabel('Reward')
plt.ylabel('Frequency')
plt.title(f'Reward Distribution\nmean={np.mean(rewards):.1f}, std={np.std(rewards):.1f}')

plt.tight_layout()
plt.show()

print("Key insight: Even the same policy gives different rewards each episode!")
print("This is why we average over multiple episodes when evaluating.")

## 2. Random Search

The simplest possible learning algorithm:

1. Generate random parameters
2. Evaluate performance
3. Keep the best
4. Repeat

It's surprisingly effective for low-dimensional problems!
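
The four steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind `train_policy`: `toy_return` and `TARGET` are hypothetical stand-ins for evaluating a policy, and parameters are sampled uniformly from `[-scale, scale]`.

```python
import numpy as np

# Hypothetical optimum of a toy objective; not the pendulum's real optimum.
TARGET = np.array([0.5, -1.0, 1.5, 0.0])

def toy_return(theta, rng, n_episodes=5):
    """Stand-in for a policy evaluation: true score plus averaged episode noise."""
    true_score = -np.sum((theta - TARGET) ** 2)
    return true_score + rng.normal(scale=1.0, size=n_episodes).mean()

def random_search(n_iterations, scale, rng):
    best_theta, best_reward = None, -np.inf
    for _ in range(n_iterations):
        theta = rng.uniform(-scale, scale, size=TARGET.size)  # 1. random parameters
        reward = toy_return(theta, rng)                       # 2. evaluate
        if reward > best_reward:                              # 3. keep the best
            best_theta, best_reward = theta, reward
    return best_theta, best_reward                            # 4. loop repeats

rng = np.random.default_rng(0)
best_theta, best_reward = random_search(n_iterations=200, scale=2.0, rng=rng)
print(f"best reward: {best_reward:.2f}")
print(f"best theta:  {np.round(best_theta, 2)}")
```

Each sample is independent of the previous ones, so nothing learned in one iteration guides the next. That is what the following algorithms improve on.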

In [None]:
# Train with random search
env = InvertedPendulumEnv()
policy = LinearPolicy()  # Start with zeros

print("Training with Random Search...")
print(f"Initial weights: {policy.get_flat_params()}\n")

result = train_policy(
    env, policy,
    algorithm='random_search',
    n_iterations=100,
    noise_scale=2.0,  # Range of random values
    n_episodes_per_eval=5,
    verbose=True
)

print(f"\nBest reward: {result['best_reward']:.1f}")
print(f"Best weights: {result['best_params']}")

In [None]:
# Plot training progress
plt.figure(figsize=(10, 4))
plt.plot(result['reward_history'])
plt.xlabel('Iteration')
plt.ylabel('Reward')
plt.title('Random Search Training Progress')
plt.axhline(y=500, color='r', linestyle='--', label='Maximum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Evaluate the trained policy
eval_result = evaluate_policy(env, policy, n_episodes=50, verbose=True)

print(f"\nTrained Policy Performance:")
print(f"  Mean reward: {eval_result['mean_reward']:.1f} ± {eval_result['std_reward']:.1f}")

## 3. Hill Climbing

A slightly smarter approach:

1. Start with current parameters
2. Add small random perturbation
3. If better, keep new parameters
4. Repeat

This is like walking uphill in the dark: you only take a step if it goes up!
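
As a rough sketch of these steps, assuming Gaussian perturbations and a hypothetical noisy objective (`toy_return` and `TARGET` are illustrative, and `train_policy` may be implemented differently):

```python
import numpy as np

# Hypothetical optimum of a toy objective; not the pendulum's real optimum.
TARGET = np.array([0.5, -1.0, 1.5, 0.0])

def toy_return(theta, rng, n_episodes=5):
    """Stand-in for a policy evaluation: true score plus averaged episode noise."""
    true_score = -np.sum((theta - TARGET) ** 2)
    return true_score + rng.normal(scale=1.0, size=n_episodes).mean()

def hill_climb(n_iterations, noise_scale, rng):
    theta = np.zeros(TARGET.size)                  # 1. start with current parameters
    best_reward = toy_return(theta, rng)
    history = [best_reward]
    for _ in range(n_iterations):
        candidate = theta + noise_scale * rng.normal(size=theta.size)  # 2. perturb
        reward = toy_return(candidate, rng)
        if reward > best_reward:                   # 3. keep only improvements
            theta, best_reward = candidate, reward
        history.append(best_reward)
    return theta, history

rng = np.random.default_rng(0)
theta_hc, history_hc = hill_climb(n_iterations=200, noise_scale=0.3, rng=rng)
print(f"start: {history_hc[0]:.2f}, end: {history_hc[-1]:.2f}")
```

One caveat with noisy evaluations: an accepted reward can be a lucky overestimate, which makes later candidates harder to accept even when they are truly better. Averaging over more episodes per evaluation reduces this effect.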

In [None]:
# Train with hill climbing
env = InvertedPendulumEnv()
policy = LinearPolicy()

print("Training with Hill Climbing...\n")

result_hc = train_policy(
    env, policy,
    algorithm='hill_climbing',
    n_iterations=200,
    noise_scale=0.5,  # Perturbation size
    n_episodes_per_eval=5,
    verbose=True
)

print(f"\nBest reward: {result_hc['best_reward']:.1f}")
print(f"Best weights: {result_hc['best_params']}")

In [None]:
# Plot hill climbing progress
plt.figure(figsize=(10, 4))
plt.plot(result_hc['reward_history'])
plt.xlabel('Iteration')
plt.ylabel('Best Reward So Far')
plt.title('Hill Climbing Training Progress')
plt.axhline(y=500, color='r', linestyle='--', label='Maximum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Notice: The best reward so far never goes down (the curve is monotonic)!")
print("But it can get stuck in local optima.")

## 4. Evolution Strategies

A more sophisticated approach inspired by natural evolution:

1. Create a population of perturbed parameters
2. Evaluate fitness of each
3. Move toward the best performers (weighted average)
4. Repeat

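Because each update averages over many candidates, this tends to be more robust to noisy evaluations and less likely to get stuck in local optima than hill climbing.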
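
One simple way to realize these steps is sketched below. It samples a Gaussian population around the current parameters and moves to the unweighted mean of the top `elite_frac` performers. This is an assumption about what "move toward the best performers" means; the update inside `train_policy` may weight candidates differently. `toy_return` and `TARGET` are hypothetical stand-ins for evaluating a policy.

```python
import numpy as np

# Hypothetical optimum of a toy objective; not the pendulum's real optimum.
TARGET = np.array([0.5, -1.0, 1.5, 0.0])

def toy_return(theta, rng, n_episodes=5):
    """Stand-in for a policy evaluation: true score plus averaged episode noise."""
    true_score = -np.sum((theta - TARGET) ** 2)
    return true_score + rng.normal(scale=1.0, size=n_episodes).mean()

def evolve(n_generations, population_size, elite_frac, noise_scale, rng):
    theta = np.zeros(TARGET.size)
    n_elite = max(1, int(population_size * elite_frac))
    history = []
    for _ in range(n_generations):
        # 1. Population of perturbed copies of the current parameters
        noise = rng.normal(size=(population_size, theta.size))
        population = theta + noise_scale * noise
        # 2. Fitness of each member
        fitness = np.array([toy_return(p, rng) for p in population])
        # 3. Move to the mean of the elite (top performers)
        elite_idx = np.argsort(fitness)[-n_elite:]
        theta = population[elite_idx].mean(axis=0)
        history.append(fitness[elite_idx].mean())
    return theta, history

rng = np.random.default_rng(0)
theta_es, history_es = evolve(50, population_size=20, elite_frac=0.2,
                              noise_scale=0.5, rng=rng)
print(f"distance to optimum: {np.linalg.norm(theta_es - TARGET):.2f}")
```

Unlike hill climbing, the elite mean is not guaranteed to improve from one generation to the next, so the reward curve can dip as well as rise.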

In [None]:
# Train with evolution strategies
env = InvertedPendulumEnv()
policy = LinearPolicy()

print("Training with Evolution Strategies...\n")

result_es = train_policy(
    env, policy,
    algorithm='evolutionary',
    n_iterations=50,
    population_size=20,  # Number of variants to try
    elite_frac=0.2,  # Keep top 20%
    noise_scale=0.5,
    n_episodes_per_eval=5,
    verbose=True
)

print(f"\nBest reward: {result_es['best_reward']:.1f}")
print(f"Best weights: {result_es['best_params']}")

In [None]:
# Plot evolution strategies progress
plt.figure(figsize=(10, 4))
plt.plot(result_es['reward_history'])
plt.xlabel('Generation')
plt.ylabel('Elite Mean Reward')
plt.title('Evolution Strategies Training Progress')
plt.axhline(y=500, color='r', linestyle='--', label='Maximum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 5. Comparing Algorithms

Let's run all three algorithms multiple times and compare!
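
When reading the comparison, keep the evaluation budget in mind. The sketch below counts training episodes per algorithm, assuming random search and hill climbing evaluate one candidate per iteration while evolution strategies evaluate `population_size` candidates per generation. That is an assumption about `train_policy`, and `episodes_used` is an illustrative helper.

```python
def episodes_used(n_iterations, n_episodes_per_eval, candidates_per_iteration=1):
    """Total training episodes under the one-evaluation-per-candidate assumption."""
    return n_iterations * candidates_per_iteration * n_episodes_per_eval

# Settings used in the comparison: 100 iterations, 3 episodes per evaluation,
# population of 15 for the evolutionary algorithm.
budgets = {
    'random_search': episodes_used(100, 3),
    'hill_climbing': episodes_used(100, 3),
    'evolutionary': episodes_used(100, 3, candidates_per_iteration=15),
}

for name, n in budgets.items():
    print(f"{name:15s}: {n} episodes")
```

Under these assumptions the evolutionary runs see 15 times as many episodes for the same `n_iterations`, so part of any advantage may come from the larger budget rather than the update rule.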

In [None]:
# Run each algorithm multiple times
n_runs = 5
n_iterations = 100

algorithms = ['random_search', 'hill_climbing', 'evolutionary']
all_results = {alg: [] for alg in algorithms}

print(f"Running each algorithm {n_runs} times...\n")

for alg in algorithms:
    print(f"\n{alg}:")
    for run in range(n_runs):
        env = InvertedPendulumEnv()
        policy = LinearPolicy()
        
        result = train_policy(
            env, policy,
            algorithm=alg,
            n_iterations=n_iterations,
            population_size=15,  # Assumed to matter only for 'evolutionary'
            noise_scale=0.5,
            n_episodes_per_eval=3,
            verbose=False,
            seed=run
        )
        
        # Evaluate final policy
        eval_result = evaluate_policy(env, policy, n_episodes=20)
        all_results[alg].append(eval_result['mean_reward'])
        print(f"  Run {run+1}: {eval_result['mean_reward']:.1f}")

In [None]:
# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Box plot
data = [all_results[alg] for alg in algorithms]
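ax1.boxplot(data)
# Set labels separately: boxplot's `labels` keyword is deprecated since Matplotlib 3.9
ax1.set_xticklabels(['Random', 'Hill Climb', 'Evolution'])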
ax1.set_ylabel('Final Mean Reward')
ax1.set_title('Algorithm Comparison')
ax1.axhline(y=500, color='r', linestyle='--', alpha=0.5)

# Bar plot with error bars
means = [np.mean(all_results[alg]) for alg in algorithms]
stds = [np.std(all_results[alg]) for alg in algorithms]
x = range(len(algorithms))
ax2.bar(x, means, yerr=stds, capsize=5)
ax2.set_xticks(x)
ax2.set_xticklabels(['Random', 'Hill Climb', 'Evolution'])
ax2.set_ylabel('Mean Reward ± Std')
ax2.set_title('Average Performance')

plt.tight_layout()
plt.show()

# Print summary
print("\nSummary:")
for alg in algorithms:
    rewards = all_results[alg]
    print(f"  {alg:15s}: {np.mean(rewards):.1f} ± {np.std(rewards):.1f}")

## 6. Training a Neural Network Policy

Can we train a neural network with these same algorithms?
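
Part of the answer depends on how many parameters the search has to cover. The sketch below counts weights and biases for a fully connected network, assuming a 4-dimensional state (`x`, `x_dot`, `theta`, `theta_dot`), a single action output, and a bias on every layer. `mlp_param_count` is an illustrative helper, and the real `NeuralNetworkPolicy` may be structured differently.

```python
def mlp_param_count(input_dim, hidden_sizes, output_dim):
    """Count weights plus biases for a fully connected network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

n_linear = mlp_param_count(4, [], 1)        # 4 weights + 1 bias
n_network = mlp_param_count(4, [16, 16], 1) # two hidden layers of 16 units

print(f"linear policy:  {n_linear} parameters")
print(f"neural network: {n_network} parameters")
```

Under these assumptions the network has 369 parameters against 5 for the linear policy, so random perturbations must explore a far larger space.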

In [None]:
# Train a neural network with evolution strategies
env = InvertedPendulumEnv()
nn_policy = NeuralNetworkPolicy(hidden_sizes=[16, 16])  # Smaller network

print(f"Neural network has {nn_policy.get_num_params()} parameters")
print("(vs 5 for linear policy)\n")

print("Training neural network with Evolution Strategies...\n")

result_nn = train_policy(
    env, nn_policy,
    algorithm='evolutionary',
    n_iterations=100,
    population_size=30,
    noise_scale=0.3,
    n_episodes_per_eval=3,
    verbose=True
)

print(f"\nBest reward: {result_nn['best_reward']:.1f}")

In [None]:
# Evaluate trained neural network
eval_result = evaluate_policy(env, nn_policy, n_episodes=50)

print(f"Trained Neural Network Performance:")
print(f"  Mean reward: {eval_result['mean_reward']:.1f} ± {eval_result['std_reward']:.1f}")

In [None]:
# Visualize the trained neural network's policy surface
from src.utils.visualization import plot_policy_surface

fig = plot_policy_surface(
    nn_policy,
    state_ranges={'theta': (-0.3, 0.3), 'theta_dot': (-2, 2)},
    fixed_states={'x': 0, 'x_dot': 0}
)
plt.suptitle('Trained Neural Network Policy', y=1.02)
plt.show()

print("Check the surface: a sensible policy pushes the cart toward the side the pole is falling.")
print("Expected pattern: positive theta -> positive action (push right)")

## Exercises

### Exercise 1: Hyperparameter Tuning
Try different hyperparameters for evolution strategies (population size, elite fraction, noise scale). What works best?

### Exercise 2: Longer Training
Train for more iterations. Does performance keep improving? When does it plateau?
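
One way to make "plateau" measurable is to compare the mean reward over consecutive windows of the history. The helper below is a hypothetical sketch (`plateau_iteration`, `window`, and `tol` are illustrative choices, not part of `src.utils`), shown here on synthetic histories rather than real training output.

```python
import numpy as np

def plateau_iteration(reward_history, window=10, tol=1.0):
    """Return the first index where the next `window` rewards improve on the
    previous `window` by less than `tol` on average, or None if never."""
    r = np.asarray(reward_history, dtype=float)
    for i in range(window, len(r) - window + 1):
        before = r[i - window:i].mean()
        after = r[i:i + window].mean()
        if after - before < tol:
            return i
    return None

# Synthetic example: reward ramps up for 50 iterations, then stays flat.
ramp_then_flat = list(np.linspace(0, 500, 50)) + [500.0] * 50
still_improving = list(np.arange(100) * 5.0)

print(plateau_iteration(ramp_then_flat))
print(plateau_iteration(still_improving))
```

You could pass `result['reward_history']` from a training run instead. With noisy histories a larger `window` gives a steadier answer.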

### Exercise 3: Harder Environment
Try training on a harder environment (shorter pole, less force). Can the algorithms still find good policies?

In [None]:
# Exercise 3: Train on harder environment
hard_env = InvertedPendulumEnv(
    pole_length=0.3,  # Shorter pole
    force_mag=8.0,    # Less force
    theta_threshold=0.15  # Stricter angle
)

policy = LinearPolicy()

print("Training on harder environment...\n")

result_hard = train_policy(
    hard_env, policy,
    algorithm='evolutionary',
    n_iterations=100,
    population_size=30,
    noise_scale=0.5,
    n_episodes_per_eval=5,
    verbose=True
)

# Evaluate
eval_result = evaluate_policy(hard_env, policy, n_episodes=50)
print(f"\nHard Environment Performance:")
print(f"  Mean reward: {eval_result['mean_reward']:.1f} ± {eval_result['std_reward']:.1f}")
print(f"  Max possible: {hard_env.max_steps}")

## Summary

In this notebook, we learned:

- **Training objective**: Maximize expected cumulative reward
- **Random search**: Simple but effective baseline
- **Hill climbing**: Only accepts improvements, can get stuck
- **Evolution strategies**: Uses population, more robust
- **Neural networks**: Need more iterations, but can be trained with the same algorithms

### Key Takeaways

1. Even simple algorithms can learn good policies
2. Evolution strategies tend to be the most reliable
3. More parameters usually means more iterations are needed
4. Always evaluate over multiple episodes to reduce noise

## Next Steps

In the final notebook, we'll run experiments, visualize results, and explore what we've learned!