# Notebook 6: Experiments and Visualization

In this final notebook, we bring everything together: we run systematic experiments, visualize the trained policies, and consolidate what we've learned about RL.

## What You'll Learn

1. Running systematic experiments
2. Visualizing trained policies
3. Analyzing what the agent learned
4. Creating animations
5. Next steps in your RL journey

In [None]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import HTML

from src.environments import InvertedPendulumEnv
from src.policies import LinearPolicy, NeuralNetworkPolicy, RandomPolicy
from src.utils import (
    train_policy, evaluate_policy, collect_episode,
    plot_trajectory, plot_training_progress, animate_pendulum
)
from src.utils.visualization import plot_phase_portrait, plot_policy_surface

## 1. Train the Best Policy

Let's train a strong linear policy by giving the evolutionary search a generous budget: 150 iterations with a population of 30.

In [None]:
# Train with extra iterations
env = InvertedPendulumEnv()
best_policy = LinearPolicy()

print("Training the best linear policy...\n")

result = train_policy(
    env, best_policy,
    algorithm='evolutionary',
    n_iterations=150,
    population_size=30,
    elite_frac=0.2,
    noise_scale=0.3,
    n_episodes_per_eval=10,
    verbose=True,
    seed=42
)

print(f"\nFinal weights: {best_policy.get_flat_params()}")
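
To make the `population_size`, `elite_frac`, and `noise_scale` arguments more concrete, here is a minimal sketch of one common elite-averaging evolutionary search, run on a toy objective instead of the pendulum. The function `evolve`, the toy objective, and the target vector are all hypothetical and only for illustration; the actual `train_policy` implementation in `src.utils` may work differently.

```python
import numpy as np

def evolve(objective, n_params, n_iterations=50, population_size=30,
           elite_frac=0.2, noise_scale=0.3, seed=0):
    """Toy elite-averaging search (an assumption, not the src.utils code)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    n_elite = max(1, int(population_size * elite_frac))
    history = []
    for _ in range(n_iterations):
        # Sample candidates around the current mean
        noise = rng.standard_normal((population_size, n_params))
        candidates = mean + noise_scale * noise
        scores = np.array([objective(c) for c in candidates])
        # Keep the highest-scoring fraction and average them
        elite_idx = np.argsort(scores)[-n_elite:]
        mean = candidates[elite_idx].mean(axis=0)
        history.append(scores[elite_idx].mean())
    return mean, history

# Hypothetical objective: negative squared distance to a target vector
target = np.array([1.0, -2.0, 0.5, 3.0])

def toy_objective(w):
    return -np.sum((w - target) ** 2)

best, history = evolve(toy_objective, n_params=4)
print(f"Elite mean score: first {history[0]:.2f}, last {history[-1]:.2f}")
```

In this sketch a larger `noise_scale` explores more widely but settles less precisely, while a smaller `elite_frac` selects more aggressively.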

In [None]:
# Thorough evaluation
eval_result = evaluate_policy(env, best_policy, n_episodes=100, verbose=True)

print(f"\nBest Linear Policy Performance:")
print(f"  Mean reward: {eval_result['mean_reward']:.1f} ± {eval_result['std_reward']:.1f}")
print(f"  Mean length: {eval_result['mean_length']:.1f} steps")
print(f"  Success rate: {100 * np.mean(np.array(eval_result['episode_rewards']) >= 490):.0f}%")

In [None]:
# Plot training progress
fig = plot_training_progress(
    result['reward_history'],
    window_size=10
)
plt.show()
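
The `window_size` argument suggests that the raw reward history is smoothed before plotting. A minimal sketch of such smoothing with a simple moving average follows; `moving_average` is a hypothetical helper, and `plot_training_progress` may smooth differently.

```python
import numpy as np

def moving_average(values, window_size=10):
    """Simple moving average; returns only fully covered windows."""
    values = np.asarray(values, dtype=float)
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if len(values) < window_size:
        return np.array([])  # not enough points for a single window
    kernel = np.ones(window_size) / window_size
    return np.convolve(values, kernel, mode="valid")

# Made-up reward history, for illustration only
rewards = np.arange(20, dtype=float)
smoothed = moving_average(rewards, window_size=10)
print(len(rewards), len(smoothed))
```

With `mode="valid"`, the smoothed curve is `window_size - 1` points shorter than the input, so it should be plotted against the matching range of iterations.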

## 2. Visualize the Learned Policy

In [None]:
# Compare random vs trained policy
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

random_policy = RandomPolicy()

for ax, (policy, title) in zip(axes, [
    (random_policy, 'Random Policy'),
    (best_policy, 'Trained Policy')
]):
    # Run episode
    states, actions, rewards, _ = collect_episode(env, policy)
    states = np.array(states)
    
    # Plot trajectory in state space
    ax.plot(states[:, 2], states[:, 3], 'b-', alpha=0.7)
    ax.plot(states[0, 2], states[0, 3], 'go', markersize=10, label='Start')
    ax.plot(states[-1, 2], states[-1, 3], 'ro', markersize=10, label='End')
    ax.set_xlabel('Pole Angle (rad)')
    ax.set_ylabel('Angular Velocity (rad/s)')
    ax.set_title(f'{title}\n({len(states)} steps)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-0.3, 0.3)
    ax.set_ylim(-2, 2)

plt.tight_layout()
plt.show()

In [None]:
# Visualize the policy surface
fig = plot_policy_surface(
    best_policy,
    state_ranges={'theta': (-0.25, 0.25), 'theta_dot': (-2, 2)},
    fixed_states={'x': 0, 'x_dot': 0},
    resolution=100
)
plt.suptitle('Trained Linear Policy: Action = f(theta, theta_dot)', y=1.02)
plt.show()

# Interpret the weights
weights = best_policy.weights
print("\nLearned weights interpretation:")
print(f"  x coefficient: {weights[0]:.3f}")
print(f"  x_dot coefficient: {weights[1]:.3f}")
print(f"  theta coefficient: {weights[2]:.3f}")
print(f"  theta_dot coefficient: {weights[3]:.3f}")
print(f"\nAction = {weights[0]:.2f}*x + {weights[1]:.2f}*x_dot + {weights[2]:.2f}*theta + {weights[3]:.2f}*theta_dot")
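
The printed formula says that a linear policy's action is a weighted sum of the four state variables. The sketch below evaluates that sum directly; the weight values are made up, and `linear_action` with its optional `max_force` clipping is a hypothetical helper, not part of `LinearPolicy`.

```python
import numpy as np

def linear_action(weights, state, max_force=None):
    """Action = w . state, optionally clipped to [-max_force, max_force]."""
    action = float(np.dot(weights, state))
    if max_force is not None:
        action = float(np.clip(action, -max_force, max_force))
    return action

# Hypothetical weights in the order [x, x_dot, theta, theta_dot]
example_weights = np.array([0.1, 0.5, 10.0, 2.0])

tilted_positive = np.array([0.0, 0.0, 0.05, 0.0])
tilted_negative = np.array([0.0, 0.0, -0.05, 0.0])

print(linear_action(example_weights, tilted_positive))
print(linear_action(example_weights, tilted_negative))
```

With a positive `theta` weight, the force has the same sign as the pole angle, so flipping the tilt flips the force. Comparing the magnitudes of your learned weights shows which state variables the controller reacts to most strongly.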

## 3. Create Animations

In [None]:
# Run a successful episode and animate it
np.random.seed(123)
states, actions, rewards, _ = collect_episode(env, best_policy)

history = {
    'states': np.array(states),
    'actions': np.array(actions),
    'rewards': np.array(rewards)
}

env_params = {
    'pole_length': env.pole_length,
    'x_threshold': env.x_threshold
}

print(f"Episode length: {len(rewards)} steps")
print("Creating animation...")

# Create animation (subsample for speed)
subsample = 3  # Show every 3rd frame
subsampled_history = {
    'states': history['states'][::subsample],
    'actions': history['actions'][::subsample] if len(history['actions']) > 0 else [],
    'rewards': history['rewards'][::subsample] if len(history['rewards']) > 0 else []
}

anim = animate_pendulum(subsampled_history, env_params, interval=50)
HTML(anim.to_jshtml())

## 4. Compare Different Challenges

In [None]:
# Train policies for different environments
environments = {
    'Easy (long pole)': InvertedPendulumEnv(pole_length=1.0),
    'Normal': InvertedPendulumEnv(),
    'Hard (short pole)': InvertedPendulumEnv(pole_length=0.3),
    'Very Hard': InvertedPendulumEnv(pole_length=0.25, force_mag=5.0)
}

results = {}

# Use a separate loop variable so the global `env` from earlier cells is not overwritten
for name, challenge_env in environments.items():
    print(f"\nTraining for {name}...")
    policy = LinearPolicy()
    
    train_result = train_policy(
        challenge_env, policy,
        algorithm='evolutionary',
        n_iterations=100,
        population_size=25,
        noise_scale=0.4,
        n_episodes_per_eval=5,
        verbose=False
    )
    
    eval_result = evaluate_policy(challenge_env, policy, n_episodes=50)
    results[name] = {
        'policy': policy,
        'train': train_result,
        'eval': eval_result
    }
    
    print(f"  Mean reward: {eval_result['mean_reward']:.1f} / {challenge_env.max_steps}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training curves
for name, challenge_result in results.items():
    axes[0].plot(challenge_result['train']['reward_history'], label=name)
axes[0].set_xlabel('Generation')
axes[0].set_ylabel('Elite Mean Reward')
axes[0].set_title('Training Progress')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Final performance
names = list(results.keys())
means = [results[n]['eval']['mean_reward'] for n in names]
stds = [results[n]['eval']['std_reward'] for n in names]
x = range(len(names))

bars = axes[1].bar(x, means, yerr=stds, capsize=5)
axes[1].set_xticks(x)
axes[1].set_xticklabels(names, rotation=15)
axes[1].set_ylabel('Mean Reward')
axes[1].set_title('Final Performance')
axes[1].axhline(y=500, color='r', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

## 5. Analyze the Learned Behavior

In [None]:
# Look at what the trained policy does in detail
env = InvertedPendulumEnv()

# Collect episode data
states, actions, rewards, _ = collect_episode(env, best_policy)
history = {
    'states': np.array(states),
    'actions': np.array(actions),
    'rewards': np.array(rewards)
}

# Full trajectory plot
fig = plot_trajectory(history, title='Trained Policy Episode')
plt.show()

In [None]:
# Analyze the relationship between state and action
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

state_names = ['Cart Position', 'Cart Velocity', 'Pole Angle', 'Angular Velocity']

for i, (ax, name) in enumerate(zip(axes.flat, state_names)):
    states_i = history['states'][:-1, i]  # Exclude last (no action)
    ax.scatter(states_i, history['actions'], alpha=0.3, s=10)
    ax.set_xlabel(name)
    ax.set_ylabel('Action (Force)')
    
    # Fit line
    z = np.polyfit(states_i, history['actions'], 1)
    p = np.poly1d(z)
    x_line = np.linspace(states_i.min(), states_i.max(), 100)
    ax.plot(x_line, p(x_line), 'r-', linewidth=2, label=f'slope={z[0]:.1f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('How Action Depends on Each State Variable')
plt.tight_layout()
plt.show()
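
The slopes fitted above, one state variable at a time, need not match the policy weights, because the state variables are correlated during an episode. The sketch below shows this on synthetic data with made-up weights: a joint least-squares fit over all four variables recovers the weights of a purely linear policy, while the single-variable slopes can be far off.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weights = np.array([0.2, 0.8, 12.0, 3.0])  # hypothetical policy weights

# Synthetic states in which theta_dot is strongly correlated with theta
n = 500
theta = rng.normal(0.0, 0.05, n)
theta_dot = -4.0 * theta + rng.normal(0.0, 0.05, n)
x = rng.normal(0.0, 0.5, n)
x_dot = rng.normal(0.0, 0.3, n)
states = np.column_stack([x, x_dot, theta, theta_dot])
actions = states @ true_weights  # a purely linear policy, no clipping

# Joint fit over all state variables at once
recovered, *_ = np.linalg.lstsq(states, actions, rcond=None)

# One-variable-at-a-time slopes, as in the scatter plots
single_slopes = np.array(
    [np.polyfit(states[:, i], actions, 1)[0] for i in range(4)]
)

print("True weights:  ", true_weights)
print("Joint fit:     ", np.round(recovered, 3))
print("Single slopes: ", np.round(single_slopes, 3))
```

If your policy clips or otherwise transforms its output, the joint fit will only approximate the weights.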

In [None]:
# Energy analysis
env = InvertedPendulumEnv()
state = env.reset(seed=789)

energies = {'kinetic': [], 'potential': [], 'total': []}
done = False

while not done:
    energy = env.get_energy()
    for key in energies:
        energies[key].append(energy[key])
    
    action = best_policy.get_action(state)
    state, _, done, _ = env.step(action)

# Plot energy
fig, ax = plt.subplots(figsize=(10, 5))
steps = range(len(energies['kinetic']))

ax.plot(steps, energies['kinetic'], 'r-', label='Kinetic', alpha=0.7)
ax.plot(steps, energies['potential'], 'b-', label='Potential', alpha=0.7)
ax.plot(steps, energies['total'], 'g--', label='Total', linewidth=2)

ax.set_xlabel('Time Step')
ax.set_ylabel('Energy (J)')
ax.set_title('System Energy During Controlled Episode')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

print("Notice: The controller adds/removes energy to keep the pole balanced!")
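
For intuition about what an energy computation like this involves, here is a minimal sketch for a simplified cart-pole model. It assumes the pole is a point mass at distance `pole_length` from the pivot and that `theta` is measured from the upright position. The function name and all default parameter values are hypothetical, and `env.get_energy()` may use a different model (for example, a uniform rod).

```python
import numpy as np

def cartpole_energy(state, cart_mass=1.0, pole_mass=0.1,
                    pole_length=0.5, gravity=9.81):
    """Energy of a cart plus point-mass pole (simplifying assumption)."""
    x, x_dot, theta, theta_dot = state
    # Velocity of the point mass at (x + l*sin(theta), l*cos(theta))
    vx = x_dot + pole_length * theta_dot * np.cos(theta)
    vy = -pole_length * theta_dot * np.sin(theta)
    kinetic = 0.5 * cart_mass * x_dot**2 + 0.5 * pole_mass * (vx**2 + vy**2)
    # Height of the point mass above the pivot
    potential = pole_mass * gravity * pole_length * np.cos(theta)
    return {'kinetic': kinetic, 'potential': potential,
            'total': kinetic + potential}

upright = cartpole_energy([0.0, 0.0, 0.0, 0.0])
tilted = cartpole_energy([0.0, 0.0, 0.2, 0.0])
print(upright)
print(tilted)
```

In this model the potential energy is largest when the pole is upright, which is one way to see why the upright position is an unstable equilibrium.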

## 6. Summary and Next Steps

### What We Learned

In this series of notebooks, we covered:

1. **RL Fundamentals**: Agent, environment, state, action, reward, policy
2. **Inverted Pendulum Physics**: Equations of motion, unstable equilibrium
3. **Environment Design**: Reset/step interface, customization, history tracking
4. **Policy Types**: Random, linear, neural network
5. **Training Algorithms**: Random search, hill climbing, evolution strategies
6. **Visualization**: Trajectories, animations, policy surfaces

### Key Takeaways

- Simple environments are great for learning RL concepts
- Linear policies can be surprisingly effective
- Evolution strategies work well for policy optimization
- Visualization helps understand what agents learn

### Next Steps in Your RL Journey

1. **Policy Gradients**: Learn REINFORCE and actor-critic methods
2. **Value Functions**: Q-learning, DQN, TD learning
3. **Continuous Control**: PPO, SAC, DDPG
4. **Harder Environments**: MuJoCo, Atari, robotics
5. **Advanced Topics**: Model-based RL, hierarchical RL, multi-agent RL

### Recommended Resources

- Sutton & Barto: "Reinforcement Learning: An Introduction"
- OpenAI Spinning Up: spinningup.openai.com
- Deep RL Course by Hugging Face

In [None]:
# Final celebration - watch your best policy!
print("Congratulations on completing the RL tutorial!")
print("\nYour trained policy achieved:")

env = InvertedPendulumEnv()
final_eval = evaluate_policy(env, best_policy, n_episodes=100)

print(f"  Mean reward: {final_eval['mean_reward']:.1f} / 500")
print(f"  Success rate: {100 * np.mean(np.array(final_eval['episode_rewards']) >= 490):.0f}%")
print(f"\nYou've learned the fundamentals of reinforcement learning!")

## Bonus: Challenge Yourself!

Try these extensions:

1. **Double Inverted Pendulum**: Add a second pole segment
2. **Swing-Up Task**: Start with pole hanging down, swing it up
3. **Moving Target**: Keep cart at a changing target position
4. **Noisy Observations**: Add sensor noise to states
5. **Delayed Rewards**: Sparse rewards only at episode end
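
As a starting point for the noisy-observations challenge, here is a minimal sketch of a wrapper that adds Gaussian noise to whatever state the environment returns. It assumes the `reset()` / `step()` interface used in this notebook, with `step` returning `(state, reward, done, info)`. `NoisyObservationWrapper` and `ConstantEnv` are hypothetical names introduced only for this example; `ConstantEnv` is a stand-in so the snippet runs without `src`.

```python
import numpy as np

class NoisyObservationWrapper:
    """Adds zero-mean Gaussian noise to observations (illustrative sketch)."""

    def __init__(self, env, noise_std=0.01, seed=None):
        self.env = env
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def _corrupt(self, state):
        state = np.asarray(state, dtype=float)
        return state + self.rng.normal(0.0, self.noise_std, size=state.shape)

    def reset(self, **kwargs):
        return self._corrupt(self.env.reset(**kwargs))

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        # Only the observation is corrupted; the true dynamics are unchanged
        return self._corrupt(state), reward, done, info


class ConstantEnv:
    """Stand-in environment that always returns a zero state."""

    def reset(self, seed=None):
        return np.zeros(4)

    def step(self, action):
        return np.zeros(4), 1.0, True, {}


noisy_env = NoisyObservationWrapper(ConstantEnv(), noise_std=0.01, seed=0)
first_obs = noisy_env.reset()
next_obs, reward, done, info = noisy_env.step(0.0)
print(first_obs, reward, done)
```

To use this with `InvertedPendulumEnv`, wrap the environment before passing it to the policy loop. If helper functions read attributes such as `max_steps` from the environment, the wrapper would also need to forward those.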