# Reinforcement Learning Plan B - Part 5: Policy Gradient Methods

This notebook explores policy gradient methods, moving from value-based learning (DQN) to direct policy optimization. We'll implement REINFORCE, Actor-Critic methods, and Proximal Policy Optimization (PPO).

## Learning Objectives

By the end of this notebook, you will understand:
- Policy gradient theorem and REINFORCE algorithm
- Actor-Critic methods and advantage estimation
- Proximal Policy Optimization (PPO) theory and implementation
- Continuous vs discrete action spaces
- Variance reduction techniques
- When to use policy methods vs value methods

## Mathematical Foundation

### Policy Gradient Theorem

The fundamental insight of policy gradient methods is to directly optimize the policy parameters $\theta$ to maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]$$

The policy gradient theorem states:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$

Where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.
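
The definition of $G_t$ can be checked on a tiny example. The sketch below uses plain Python rather than PyTorch so the arithmetic stays visible; `discounted_returns` is a hypothetical helper name, not part of this notebook's agents.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return [G_0, ..., G_T] with G_t = sum_{k>=t} gamma^(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the next one: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Filling a preallocated list from the back keeps the computation linear in the episode length.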

### REINFORCE Algorithm

The REINFORCE algorithm uses Monte Carlo sampling to estimate the policy gradient:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$$

### Baseline Reduction

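To reduce variance, we subtract a baseline $b(s_t)$ that doesn't depend on the action. Because $\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\right] = 0$, the baseline term vanishes in expectation, so the gradient stays unbiased: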

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) \right]$$
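
The effect of a baseline can be verified exactly on a toy problem. The sketch below assumes a one-step, three-action softmax policy with made-up logits and rewards, and enumerates all actions instead of sampling. All names (`softmax`, `grad_stats`) are illustrative only.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_stats(logits: List[float], rewards: List[float], baseline: float) -> Tuple[List[float], float]:
    """Exact mean and total variance of grad log pi(a) * (r(a) - baseline)."""
    probs = softmax(logits)
    n = len(logits)
    # For a softmax policy, d log pi(a) / d logit_j = 1[a == j] - pi(j)
    samples = [
        [((1.0 if a == j else 0.0) - probs[j]) * (rewards[a] - baseline) for j in range(n)]
        for a in range(n)
    ]
    mean = [sum(probs[a] * samples[a][j] for a in range(n)) for j in range(n)]
    var = sum(probs[a] * (samples[a][j] - mean[j]) ** 2 for a in range(n) for j in range(n))
    return mean, var


logits = [0.0, 0.5, -0.5]     # hypothetical policy parameters
rewards = [10.0, 11.0, 12.0]  # hypothetical reward for each action
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = grad_stats(logits, rewards, baseline=0.0)
mean_base, var_base = grad_stats(logits, rewards, baseline=expected_reward)
print(mean_plain, mean_base)  # same gradient in expectation
print(var_plain, var_base)    # variance drops sharply with the baseline
```

The expected gradient is identical in both cases, and only the spread of the per-sample estimates changes.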

### Actor-Critic Methods

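Actor-Critic methods use a learned value function $V^\pi(s)$ as the baseline and replace the Monte Carlo return $G_t$ with the bootstrapped TD target $r_t + \gamma V^\pi(s_{t+1})$. Subtracting the baseline from that target gives the TD error: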

$$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$

The advantage estimate is: $A_t = \delta_t$ (1-step) or $A_t = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$ (n-step).
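
The n-step form telescopes: summing discounted TD errors gives the n-step bootstrapped return minus $V^\pi(s_t)$. The sketch below checks this numerically in plain Python, using made-up rewards and critic values. The function names are illustrative.

```python
from typing import List


def td_errors(rewards: List[float], values: List[float], gamma: float) -> List[float]:
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has len(rewards) + 1 entries."""
    return [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]


def n_step_advantage(rewards: List[float], values: List[float], gamma: float, t: int, n: int) -> float:
    """A_t = sum_{k=0}^{n-1} gamma^k * delta_{t+k}."""
    deltas = td_errors(rewards, values, gamma)
    return sum(gamma ** k * deltas[t + k] for k in range(n))


rewards = [1.0, 0.0, 2.0]      # hypothetical rewards r_0..r_2
values = [0.5, 0.4, 0.3, 0.2]  # hypothetical critic outputs V(s_0)..V(s_3)
gamma = 0.9

adv = n_step_advantage(rewards, values, gamma, t=0, n=3)
# Direct form: 3-step return bootstrapped from V(s_3), minus V(s_0)
direct = (rewards[0] + gamma * rewards[1] + gamma ** 2 * rewards[2]
          + gamma ** 3 * values[3]) - values[0]
print(adv, direct)
```

The intermediate value estimates cancel, so only $V^\pi(s_t)$ and $V^\pi(s_{t+n})$ remain in the final expression.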

### Proximal Policy Optimization (PPO)

PPO addresses the challenge of step size in policy optimization by constraining policy updates:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$

Where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio.
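
A per-sample version of the clipped objective shows how the `min` and `clip` interact. This is a plain-Python sketch of the formula above for a single ratio and advantage; `clipped_surrogate` is a hypothetical helper name.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the ratio is capped at (1 + eps) * A.
capped_up = clipped_surrogate(1.5, advantage=2.0)
# Negative advantage: the gain from lowering the ratio is capped at (1 - eps) * A.
capped_down = clipped_surrogate(0.5, advantage=-2.0)
# A change that hurts the objective is not clipped, so it is still fully penalized.
unclipped = clipped_surrogate(0.5, advantage=2.0)
print(capped_up, capped_down, unclipped)
```

Once a sample hits the cap, its objective no longer depends on $\theta$, so it contributes zero gradient. This is what limits the step size.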

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical, Normal
import random
import math
from collections import deque, namedtuple
from typing import List, Tuple, Optional, Dict, Any, Union
import warnings
warnings.filterwarnings('ignore')

# Try different gym versions
try:
    import gymnasium as gym
    gym_version = 'gymnasium'
except ImportError:
    import gym
    gym_version = 'gym'

print(f"Using {gym_version} for environments")

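# Set device - optimized for MacBook Air M2
# (older PyTorch builds have no torch.backends.mps, so check for it before calling)
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():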
    device = torch.device("mps")
    print("Using MPS (Metal Performance Shaders) for acceleration")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using CUDA for acceleration")
else:
    device = torch.device("cpu")
    print("Using CPU")

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

## Policy Network Architectures

We'll implement policy networks for both discrete and continuous action spaces.

In [None]:
class DiscretePolicy(nn.Module):
    """Policy network for discrete action spaces."""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dims: List[int] = [128, 128]):
        super(DiscretePolicy, self).__init__()
        
        # Build network layers
        layers = []
        prev_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.Tanh(),  # Tanh is a common choice for policy networks (bounded activations)
                nn.LayerNorm(hidden_dim)
            ])
            prev_dim = hidden_dim
        
        # Output layer (no activation - we'll apply softmax later)
        layers.append(nn.Linear(prev_dim, action_dim))
        
        self.network = nn.Sequential(*layers)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize network weights for policy networks."""
        if isinstance(module, nn.Linear):
            # Smaller initialization for policy networks
            nn.init.orthogonal_(module.weight, gain=0.1)
            nn.init.constant_(module.bias, 0.0)
    
    def forward(self, state: torch.Tensor) -> Categorical:
        """Forward pass returning action distribution."""
        logits = self.network(state)
        return Categorical(logits=logits)


class ContinuousPolicy(nn.Module):
    """Policy network for continuous action spaces."""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dims: List[int] = [128, 128],
                 log_std_init: float = -0.5):
        super(ContinuousPolicy, self).__init__()
        
        self.action_dim = action_dim
        
        # Mean network
        layers = []
        prev_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.Tanh(),
                nn.LayerNorm(hidden_dim)
            ])
            prev_dim = hidden_dim
        
        # Output mean
        layers.append(nn.Linear(prev_dim, action_dim))
        layers.append(nn.Tanh())  # For environments with bounded actions
        
        self.mean_network = nn.Sequential(*layers)
        
        # Log standard deviation (learnable parameter)
        self.log_std = nn.Parameter(torch.ones(action_dim) * log_std_init)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize network weights."""
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=0.1)
            nn.init.constant_(module.bias, 0.0)
    
    def forward(self, state: torch.Tensor) -> Normal:
        """Forward pass returning action distribution."""
        mean = self.mean_network(state)
        std = torch.exp(self.log_std)
        return Normal(mean, std)


class ValueNetwork(nn.Module):
    """Value function network for Actor-Critic methods."""
    
    def __init__(self, state_dim: int, hidden_dims: List[int] = [128, 128]):
        super(ValueNetwork, self).__init__()
        
        # Build network layers
        layers = []
        prev_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),  # ReLU works well for value networks
                nn.LayerNorm(hidden_dim)
            ])
            prev_dim = hidden_dim
        
        # Output layer (single value)
        layers.append(nn.Linear(prev_dim, 1))
        
        self.network = nn.Sequential(*layers)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize network weights."""
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=1.0)
            nn.init.constant_(module.bias, 0.0)
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward pass returning state values."""
        return self.network(state).squeeze(-1)
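
The two policy heads return distributions whose log-probabilities drive every update in this notebook. The sketch below writes those quantities out in plain Python: a categorical log-probability from logits, and the joint log-density of a diagonal Gaussian, which is the sum of the per-dimension terms. The function names are illustrative and are not PyTorch APIs.

```python
import math
from typing import List, Sequence


def categorical_log_prob(logits: List[float], action: int) -> float:
    """log softmax(logits)[action], using the log-sum-exp trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[action] - log_z


def diag_gaussian_log_prob(action: Sequence[float], mean: Sequence[float],
                           log_std: Sequence[float]) -> float:
    """Joint log-density of an action vector under independent Gaussians."""
    total = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        var = math.exp(2.0 * ls)
        # log N(a; mu, sigma) = -0.5 * ((a - mu)^2 / sigma^2 + log(2 * pi)) - log(sigma)
        total += -0.5 * ((a - mu) ** 2 / var + math.log(2.0 * math.pi)) - ls
    return total


logits = [0.2, -1.0, 0.7]  # hypothetical network outputs
print([categorical_log_prob(logits, a) for a in range(3)])
print(diag_gaussian_log_prob([0.1, -0.3], mean=[0.0, 0.0], log_std=[-0.5, -0.5]))
```

For multi-dimensional actions, a per-dimension `Normal` returns one log-probability per component. These are typically summed over the last axis to obtain the joint value used in the policy gradient.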

## REINFORCE Algorithm Implementation

Let's implement the classic REINFORCE algorithm with an optional learned baseline.

In [None]:
class REINFORCEAgent:
    """REINFORCE algorithm with optional baseline.
    
    Implements the policy gradient theorem with Monte Carlo returns.
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr_policy: float = 3e-4,
        lr_value: float = 1e-3,
        gamma: float = 0.99,
        use_baseline: bool = True,
        hidden_dims: List[int] = [128, 128],
        action_type: str = 'discrete'  # 'discrete' or 'continuous'
    ):
        self.gamma = gamma
        self.use_baseline = use_baseline
        self.action_type = action_type
        
        # Create policy network
        if action_type == 'discrete':
            self.policy = DiscretePolicy(state_dim, action_dim, hidden_dims).to(device)
        else:
            self.policy = ContinuousPolicy(state_dim, action_dim, hidden_dims).to(device)
        
        # Create value network for baseline
        if use_baseline:
            self.value_net = ValueNetwork(state_dim, hidden_dims).to(device)
            self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        
        # Policy optimizer
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        
        # Storage for episode data
        self.reset_episode_data()
        
        # Training statistics
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
    
    def reset_episode_data(self):
        """Reset episode data storage."""
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
    
    def select_action(self, state: np.ndarray, training: bool = True) -> Union[int, np.ndarray]:
        """Select action using current policy."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        with torch.no_grad():
            action_dist = self.policy(state_tensor)
            
            if training:
                action = action_dist.sample()
            else:
                # Use mean action for evaluation
                if self.action_type == 'discrete':
                    action = action_dist.probs.argmax(dim=-1)
                else:
                    action = action_dist.mean
        
        if training:
            # Store data for training
            log_prob = action_dist.log_prob(action)
            self.states.append(state)
            self.actions.append(action.cpu().numpy())
            self.log_probs.append(log_prob.item())
        
        if self.action_type == 'discrete':
            return action.item()
        else:
            return action.cpu().numpy().squeeze()
    
    def store_reward(self, reward: float):
        """Store reward for current step."""
        self.rewards.append(reward)
    
    def compute_returns(self) -> List[float]:
        """Compute Monte Carlo returns."""
        returns = []
        G = 0
        
        # Compute returns backward
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)
        
        return returns
    
    def update_policy(self) -> Tuple[float, Optional[float]]:
        """Update policy using REINFORCE."""
        if len(self.rewards) == 0:
            return 0.0, None
        
        # Compute returns
        returns = self.compute_returns()
        returns = torch.FloatTensor(returns).to(device)
        
        # Normalize returns for stability (skipped for one-step episodes, where std() is NaN)
        if returns.numel() > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
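        # Convert stored data to tensors
        states = torch.FloatTensor(np.array(self.states)).to(device)
        
        # Recompute log-probabilities with gradients enabled. The values cached in
        # select_action are plain floats created under no_grad, so they carry no
        # graph and policy_loss.backward() would fail on them.
        action_dist = self.policy(states)
        if self.action_type == 'discrete':
            actions = torch.LongTensor(np.array(self.actions).reshape(-1)).to(device)
            log_probs = action_dist.log_prob(actions)
        else:
            actions = torch.FloatTensor(
                np.array(self.actions).reshape(len(self.actions), -1)
            ).to(device)
            # Sum per-dimension log-probs to get the joint log-probability
            log_probs = action_dist.log_prob(actions).sum(dim=-1)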
        
        # Compute baseline if using
        baseline = None
        value_loss = None
        
        if self.use_baseline:
            # Update value function
            values = self.value_net(states)
            value_loss = F.mse_loss(values, returns)
            
            self.value_optimizer.zero_grad()
            value_loss.backward()
            torch.nn.utils.clip_grad_norm_(self.value_net.parameters(), max_norm=0.5)
            self.value_optimizer.step()
            
            # Use value function as baseline
            with torch.no_grad():
                baseline = self.value_net(states)
            
            advantages = returns - baseline
            value_loss = value_loss.item()
        else:
            advantages = returns
        
        # Policy gradient update
        policy_loss = -(log_probs * advantages).mean()
        
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), max_norm=0.5)
        self.policy_optimizer.step()
        
        # Reset episode data
        self.reset_episode_data()
        
        return policy_loss.item(), value_loss
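
`update_policy` standardizes the returns before using them as weights. The sketch below shows the same step in plain Python. It uses the sample standard deviation, which matches the default of `torch.Tensor.std()`. Passing a one-step episode through unchanged is an assumption of this sketch, since the sample standard deviation is undefined for a single value.

```python
import statistics
from typing import List


def standardize(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Shift returns to zero mean and scale them to unit sample standard deviation."""
    if len(returns) < 2:
        # Assumption of this sketch: leave a single return untouched
        return list(returns)
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample (n - 1) standard deviation
    return [(g - mean) / (std + eps) for g in returns]


print(standardize([1.0, 2.0, 3.0]))  # approximately [-1.0, 0.0, 1.0]
print(standardize([5.0]))            # [5.0]
```

Standardizing acts like a crude per-episode baseline: below-average steps get negative weight even when every reward is positive.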

## Actor-Critic Implementation

Actor-Critic methods reduce variance by bootstrapping from a learned value estimate instead of using full Monte Carlo returns, at the cost of some bias from the critic's approximation error.

In [None]:
class ActorCriticAgent:
    """Actor-Critic algorithm with TD(0) updates.
    
    Combines policy gradients (actor) with value function learning (critic).
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr_policy: float = 3e-4,
        lr_value: float = 1e-3,
        gamma: float = 0.99,
        hidden_dims: List[int] = [128, 128],
        action_type: str = 'discrete',
        entropy_coef: float = 0.01  # Entropy regularization
    ):
        self.gamma = gamma
        self.action_type = action_type
        self.entropy_coef = entropy_coef
        
        # Create networks
        if action_type == 'discrete':
            self.policy = DiscretePolicy(state_dim, action_dim, hidden_dims).to(device)
        else:
            self.policy = ContinuousPolicy(state_dim, action_dim, hidden_dims).to(device)
        
        self.value_net = ValueNetwork(state_dim, hidden_dims).to(device)
        
        # Optimizers
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        
        # Training statistics
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
        self.entropies = []
    
    def select_action(self, state: np.ndarray, training: bool = True) -> Union[int, np.ndarray]:
        """Select action using current policy."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        with torch.no_grad():
            action_dist = self.policy(state_tensor)
            
            if training:
                action = action_dist.sample()
            else:
                # Use mean action for evaluation
                if self.action_type == 'discrete':
                    action = action_dist.probs.argmax(dim=-1)
                else:
                    action = action_dist.mean
        
        if self.action_type == 'discrete':
            return action.item()
        else:
            return action.cpu().numpy().squeeze()
    
    def update(
        self,
        state: np.ndarray,
        action: Union[int, np.ndarray],
        reward: float,
        next_state: np.ndarray,
        done: bool
    ) -> Tuple[float, float, float]:
        """Update actor and critic networks."""
        # Convert to tensors
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        next_state_tensor = torch.FloatTensor(next_state).unsqueeze(0).to(device)
        reward_tensor = torch.FloatTensor([reward]).to(device)
        
        if self.action_type == 'discrete':
            action_tensor = torch.LongTensor([action]).to(device)
        else:
            action_tensor = torch.FloatTensor(action).unsqueeze(0).to(device)
        
        # Compute current and next state values
        current_value = self.value_net(state_tensor)
        
        with torch.no_grad():
            if done:
                next_value = torch.zeros(1).to(device)
            else:
                next_value = self.value_net(next_state_tensor)
            
            # Compute TD target and advantage
            td_target = reward_tensor + self.gamma * next_value
            advantage = td_target - current_value
        
        # Update critic (value function)
        value_loss = F.mse_loss(current_value, td_target)
        
        self.value_optimizer.zero_grad()
        value_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.value_net.parameters(), max_norm=0.5)
        self.value_optimizer.step()
        
        # Update actor (policy)
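        action_dist = self.policy(state_tensor)
        if self.action_type == 'discrete':
            log_prob = action_dist.log_prob(action_tensor.reshape(-1))
        else:
            # Sum per-dimension log-probs to get the joint log-probability of the action vector
            log_prob = action_dist.log_prob(action_tensor.reshape(1, -1)).sum(dim=-1)
        entropy = action_dist.entropy()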
        
        # Policy loss with entropy regularization
        policy_loss = -(log_prob * advantage.detach()).mean() - self.entropy_coef * entropy.mean()
        
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), max_norm=0.5)
        self.policy_optimizer.step()
        
        return policy_loss.item(), value_loss.item(), entropy.mean().item()
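
The critic target and the advantage computed in `update` reduce to two lines of arithmetic. The sketch below restates them in plain Python with made-up numbers; `td0_advantage` is a hypothetical helper name.

```python
from typing import Tuple


def td0_advantage(reward: float, value: float, next_value: float,
                  done: bool, gamma: float = 0.99) -> Tuple[float, float]:
    """Return (td_target, advantage) for one transition."""
    # A terminal transition has no future return, so the bootstrap term is dropped
    bootstrap = 0.0 if done else next_value
    td_target = reward + gamma * bootstrap
    return td_target, td_target - value


print(td0_advantage(1.0, value=0.5, next_value=2.0, done=False, gamma=0.9))
print(td0_advantage(1.0, value=0.5, next_value=2.0, done=True, gamma=0.9))
```

The critic regresses toward `td_target`, while the actor weights the log-probability of the taken action by the advantage.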

## Proximal Policy Optimization (PPO) Implementation

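PPO is one of the most widely used policy gradient methods. Its clipped objective keeps each update close to the policy that collected the data, which makes it stable enough to reuse one batch of trajectories for several gradient epochs.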

In [None]:
class PPOAgent:
    """Proximal Policy Optimization (PPO) agent.
    
    Implements clipped surrogate objective for stable policy updates.
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,  # GAE parameter
        clip_epsilon: float = 0.2,  # PPO clipping parameter
        entropy_coef: float = 0.01,
        value_coef: float = 0.5,
        max_grad_norm: float = 0.5,
        ppo_epochs: int = 4,  # Number of PPO epochs per update
        mini_batch_size: int = 64,
        hidden_dims: List[int] = [128, 128],
        action_type: str = 'discrete'
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.max_grad_norm = max_grad_norm
        self.ppo_epochs = ppo_epochs
        self.mini_batch_size = mini_batch_size
        self.action_type = action_type
        
        # Create networks
        if action_type == 'discrete':
            self.policy = DiscretePolicy(state_dim, action_dim, hidden_dims).to(device)
        else:
            self.policy = ContinuousPolicy(state_dim, action_dim, hidden_dims).to(device)
        
        self.value_net = ValueNetwork(state_dim, hidden_dims).to(device)
        
        # Optimizer
        self.optimizer = optim.Adam(
            list(self.policy.parameters()) + list(self.value_net.parameters()),
            lr=lr
        )
        
        # Storage for trajectory data
        self.reset_storage()
        
        # Training statistics
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
        self.entropies = []
    
    def reset_storage(self):
        """Reset trajectory storage."""
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.values = []
        self.log_probs = []
    
    def select_action(self, state: np.ndarray, training: bool = True) -> Union[int, np.ndarray]:
        """Select action and store data for training."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        with torch.no_grad():
            action_dist = self.policy(state_tensor)
            value = self.value_net(state_tensor)
            
            if training:
                action = action_dist.sample()
                log_prob = action_dist.log_prob(action)
                
                # Store for training
                self.states.append(state)
                self.actions.append(action.cpu().numpy())
                self.values.append(value.item())
                # sum() gives the joint log-prob and also handles multi-dimensional actions
                self.log_probs.append(log_prob.sum().item())
            else:
                # Use mean action for evaluation
                if self.action_type == 'discrete':
                    action = action_dist.probs.argmax(dim=-1)
                else:
                    action = action_dist.mean
        
        if self.action_type == 'discrete':
            return action.item()
        else:
            return action.cpu().numpy().squeeze()
    
    def store_transition(self, reward: float, done: bool):
        """Store reward and done flag."""
        self.rewards.append(reward)
        self.dones.append(done)
    
    def compute_gae(self, next_value: float = 0.0) -> Tuple[List[float], List[float]]:
        """Compute Generalized Advantage Estimation (GAE)."""
        values = self.values + [next_value]
        advantages = []
        returns = []
        
        gae = 0
        for i in reversed(range(len(self.rewards))):
            delta = self.rewards[i] + self.gamma * values[i + 1] * (1 - self.dones[i]) - values[i]
            gae = delta + self.gamma * self.gae_lambda * (1 - self.dones[i]) * gae
            advantages.insert(0, gae)
            returns.insert(0, gae + values[i])
        
        return advantages, returns
    
    def update(self, next_state: Optional[np.ndarray] = None) -> Dict[str, float]:
        """Update policy using PPO."""
        if len(self.rewards) == 0:
            return {'policy_loss': 0.0, 'value_loss': 0.0, 'entropy': 0.0}
        
        # Compute next state value for GAE
        next_value = 0.0
        if next_state is not None:
            with torch.no_grad():
                next_state_tensor = torch.FloatTensor(next_state).unsqueeze(0).to(device)
                next_value = self.value_net(next_state_tensor).item()
        
        # Compute advantages and returns
        advantages, returns = self.compute_gae(next_value)
        
        # Convert to tensors
        states = torch.FloatTensor(np.array(self.states)).to(device)
        if self.action_type == 'discrete':
            # reshape(-1) keeps a 1-D tensor even when only one transition was stored
            actions = torch.LongTensor(np.array(self.actions).reshape(-1)).to(device)
        else:
            actions = torch.FloatTensor(np.array(self.actions)).to(device)
        
        old_log_probs = torch.FloatTensor(self.log_probs).to(device)
        advantages = torch.FloatTensor(advantages).to(device)
        returns = torch.FloatTensor(returns).to(device)
        
        # Normalize advantages (skipped for a single transition, where std() is NaN)
        if advantages.numel() > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # PPO update
        total_policy_loss = 0
        total_value_loss = 0
        total_entropy = 0
        
        for _ in range(self.ppo_epochs):
            # Create mini-batches
            batch_size = states.size(0)
            indices = torch.randperm(batch_size)
            
            for start_idx in range(0, batch_size, self.mini_batch_size):
                end_idx = min(start_idx + self.mini_batch_size, batch_size)
                batch_indices = indices[start_idx:end_idx]
                
                # Get batch data
                batch_states = states[batch_indices]
                batch_actions = actions[batch_indices]
                batch_old_log_probs = old_log_probs[batch_indices]
                batch_advantages = advantages[batch_indices]
                batch_returns = returns[batch_indices]
                
                # Forward pass
                action_dist = self.policy(batch_states)
                values = self.value_net(batch_states)
                
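                # Compute probability ratio
                if self.action_type == 'discrete':
                    log_probs = action_dist.log_prob(batch_actions)
                else:
                    # Flatten stored actions to [batch, action_dim] so they line up with
                    # the distribution, then sum per-dimension log-probs
                    flat_actions = batch_actions.reshape(batch_states.size(0), -1)
                    log_probs = action_dist.log_prob(flat_actions).sum(dim=-1)
                ratio = torch.exp(log_probs - batch_old_log_probs)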
                
                # Clipped surrogate objective
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * batch_advantages
                policy_loss = -torch.min(surr1, surr2).mean()
                
                # Value loss
                value_loss = F.mse_loss(values, batch_returns)
                
                # Entropy loss
                entropy = action_dist.entropy().mean()
                
                # Total loss
                total_loss = policy_loss + self.value_coef * value_loss - self.entropy_coef * entropy
                
                # Update
                self.optimizer.zero_grad()
                total_loss.backward()
                torch.nn.utils.clip_grad_norm_(
                    list(self.policy.parameters()) + list(self.value_net.parameters()),
                    self.max_grad_norm
                )
                self.optimizer.step()
                
                # Accumulate losses
                total_policy_loss += policy_loss.item()
                total_value_loss += value_loss.item()
                total_entropy += entropy.item()
        
        # Reset storage
        self.reset_storage()
        
        # Return average losses
        num_updates = self.ppo_epochs * math.ceil(batch_size / self.mini_batch_size)
        return {
            'policy_loss': total_policy_loss / num_updates,
            'value_loss': total_value_loss / num_updates,
            'entropy': total_entropy / num_updates
        }
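
`compute_gae` can be sanity-checked against its two limiting cases: with $\lambda = 0$ it reduces to the one-step TD error, and with $\lambda = 1$ it becomes the bootstrapped discounted return minus the value estimate. The sketch below reimplements the same recursion in plain Python with made-up numbers. The standalone `gae` function is illustrative and is not the agent's method.

```python
from typing import List, Tuple


def gae(rewards: List[float], values: List[float], dones: List[bool],
        next_value: float, gamma: float = 0.99, lam: float = 0.95) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over one stored trajectory."""
    values = list(values) + [next_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # stop bootstrapping across episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns


rewards = [1.0, 0.0, 2.0]  # hypothetical trajectory
values = [0.5, 0.4, 0.3]   # hypothetical critic outputs for the visited states
dones = [False, False, False]

adv_td, _ = gae(rewards, values, dones, next_value=0.2, gamma=0.9, lam=0.0)
adv_mc, ret_mc = gae(rewards, values, dones, next_value=0.2, gamma=0.9, lam=1.0)
print(adv_td)  # one-step TD errors
print(ret_mc)  # bootstrapped discounted returns
```

Values of $\lambda$ between 0 and 1 interpolate between these two estimators, trading the bias of the first against the variance of the second.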

## Environment Wrappers

Let's create wrappers for both discrete and continuous control environments.

In [None]:
class CartPoleWrapper:
    """CartPole environment wrapper for discrete action space."""
    
    def __init__(self, render_mode=None):
        try:
            self.env = gym.make('CartPole-v1', render_mode=render_mode)
        except Exception:
            self.env = gym.make('CartPole-v1')
        
        self.state_dim = self.env.observation_space.shape[0]
        self.action_dim = self.env.action_space.n
        self.action_type = 'discrete'
        
        # Normalization parameters
        self.obs_mean = np.zeros(self.state_dim)
        self.obs_std = np.ones(self.state_dim)
        self.obs_count = 0
    
    def reset(self):
        # gymnasium (and gym >= 0.26) return (obs, info); older gym returns obs only
        result = self.env.reset()
        obs = result[0] if isinstance(result, tuple) else result
        return self._normalize_obs(obs)
    
    def step(self, action):
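        # gymnasium (and gym >= 0.26) return five values; older gym returns four
        result = self.env.step(action)
        if len(result) == 5:
            obs, reward, terminated, truncated, info = result
            # Treat time-limit truncation as the end of the episode as well
            done = terminated or truncated
        else:
            obs, reward, done, info = result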
        return self._normalize_obs(obs), reward, done, info
    
    def _normalize_obs(self, obs):
        """Simple online normalization."""
        self.obs_count += 1
        delta = obs - self.obs_mean
        self.obs_mean += delta / self.obs_count
        self.obs_std = np.sqrt(((self.obs_count - 1) * self.obs_std**2 + delta * (obs - self.obs_mean)) / self.obs_count)
        self.obs_std = np.maximum(self.obs_std, 1e-8)
        return (obs - self.obs_mean) / self.obs_std
    
    def close(self):
        self.env.close()


class PendulumWrapper:
    """Pendulum environment wrapper for continuous action space."""
    
    def __init__(self, render_mode=None):
        try:
            self.env = gym.make('Pendulum-v1', render_mode=render_mode)
        except Exception:
            try:
                self.env = gym.make('Pendulum-v1')
            except Exception:
                # Fallback to a simple continuous environment
                print("Pendulum not available, using simulated environment")
                self.env = None
        
        if self.env is not None:
            self.state_dim = self.env.observation_space.shape[0]
            self.action_dim = self.env.action_space.shape[0]
            self.action_type = 'continuous'
            
            # Action scaling
            self.action_scale = (self.env.action_space.high - self.env.action_space.low) / 2.0
            self.action_bias = (self.env.action_space.high + self.env.action_space.low) / 2.0
        else:
            # Simulated environment
            self.state_dim = 3
            self.action_dim = 1
            self.action_type = 'continuous'
            self.action_scale = np.array([2.0])
            self.action_bias = np.array([0.0])
            
            # Simple pendulum simulation state
            self.angle = 0.0
            self.angular_velocity = 0.0
            self.max_steps = 200
            self.current_step = 0
    
    def reset(self):
        if self.env is not None:
            # gymnasium (and gym >= 0.26) return (obs, info); older gym returns obs only
            result = self.env.reset()
            obs = result[0] if isinstance(result, tuple) else result
            return obs
        else:
            # Reset simulated environment
            self.angle = np.random.uniform(-np.pi, np.pi)
            self.angular_velocity = np.random.uniform(-1, 1)
            self.current_step = 0
            return np.array([np.cos(self.angle), np.sin(self.angle), self.angular_velocity])
    
    def step(self, action):
        if self.env is not None:
            # Scale action to environment range
            scaled_action = action * self.action_scale + self.action_bias
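            # gymnasium (and gym >= 0.26) return five values; older gym returns four
            result = self.env.step(scaled_action)
            if len(result) == 5:
                obs, reward, terminated, truncated, info = result
                # Pendulum never terminates on its own, so truncation must end the episode
                done = terminated or truncated
            else:
                obs, reward, done, info = result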
            return obs, reward, done, info
        else:
            # Simple pendulum simulation
            dt = 0.05
            g = 10.0
            m = 1.0
            l = 1.0
            
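            # Scale the normalized action as in the real-environment branch, then clip.
            # Converting to a Python float keeps the state and reward scalar.
            action = float(np.ravel(action)[0] * self.action_scale[0] + self.action_bias[0])
            action = float(np.clip(action, -2.0, 2.0))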
            
            # Simple physics update
            self.angular_velocity += dt * (-3 * g / (2 * l) * np.sin(self.angle + np.pi) + 3.0 / (m * l**2) * action)
            self.angle += dt * self.angular_velocity
            self.angular_velocity = np.clip(self.angular_velocity, -8, 8)
            
            # Normalize angle
            self.angle = ((self.angle + np.pi) % (2 * np.pi)) - np.pi
            
            # Compute reward
            reward = -(self.angle**2 + 0.1 * self.angular_velocity**2 + 0.001 * action**2)
            
            # Check termination
            self.current_step += 1
            done = self.current_step >= self.max_steps
            
            obs = np.array([np.cos(self.angle), np.sin(self.angle), self.angular_velocity])
            return obs, reward, done, {}
    
    def close(self):
        if self.env is not None:
            self.env.close()


# Test environments
print("Testing discrete environment (CartPole):")
env_discrete = CartPoleWrapper()
state = env_discrete.reset()
print(f"State dim: {env_discrete.state_dim}, Action dim: {env_discrete.action_dim}")
print(f"Sample state: {state}")
env_discrete.close()

print("\nTesting continuous environment (Pendulum):")
env_continuous = PendulumWrapper()
state = env_continuous.reset()
print(f"State dim: {env_continuous.state_dim}, Action dim: {env_continuous.action_dim}")
print(f"Sample state: {state}")
action = np.random.uniform(-1, 1, env_continuous.action_dim)
next_state, reward, done, _ = env_continuous.step(action)
print(f"Sample action: {action}, Reward: {reward:.3f}")
env_continuous.close()
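
`CartPoleWrapper._normalize_obs` maintains a running mean and standard deviation in the style of Welford's algorithm. The scalar sketch below shows the same recursion in plain Python and checks it against the batch statistics from the standard library. `RunningStats` is a hypothetical class name, and tracking the sum of squared deviations directly is a design choice of this sketch rather than what the wrapper does.

```python
import math
import statistics


class RunningStats:
    """Welford's online mean and (population) standard deviation for scalars."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old mean (in delta) and the new mean, as Welford's update requires
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count else 1.0


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations
stats = RunningStats()
for x in data:
    stats.update(x)

print(stats.mean, stats.std)                            # 5.0 2.0
print(statistics.fmean(data), statistics.pstdev(data))  # 5.0 2.0
```

Storing the sum of squared deviations instead of the clamped standard deviation prevents the numerical floor applied for division from leaking into later updates.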

## Training and Evaluation Functions

In [None]:
def train_policy_agent(
    agent,
    env,
    num_episodes: int = 500,
    max_steps: int = 500,
    verbose: bool = True,
    agent_type: str = 'reinforce'
) -> Dict[str, List]:
    """Train policy gradient agent."""
    
    episode_rewards = []
    policy_losses = []
    value_losses = []
    entropies = []
    
    best_reward = -float('inf')
    
    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0
        episode_policy_loss = []
        episode_value_loss = []
        episode_entropy = []
        
        for step in range(max_steps):
            # Select action
            action = agent.select_action(state, training=True)
            
            # Take step
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            
            if agent_type == 'reinforce':
                # Store reward for REINFORCE
                agent.store_reward(reward)
            elif agent_type == 'actor_critic':
                # Update online for Actor-Critic
                policy_loss, value_loss, entropy = agent.update(state, action, reward, next_state, done)
                episode_policy_loss.append(policy_loss)
                episode_value_loss.append(value_loss)
                episode_entropy.append(entropy)
            elif agent_type == 'ppo':
                # Store transition for PPO
                agent.store_transition(reward, done)
            
            state = next_state
            
            if done:
                break
        
        # Update policy (for REINFORCE and PPO)
        if agent_type == 'reinforce':
            policy_loss, value_loss = agent.update_policy()
            episode_policy_loss.append(policy_loss)
            if value_loss is not None:
                episode_value_loss.append(value_loss)
        elif agent_type == 'ppo':
            if not done:
                # Pass next state for GAE computation
                losses = agent.update(next_state)
            else:
                losses = agent.update()
            
            episode_policy_loss.append(losses['policy_loss'])
            episode_value_loss.append(losses['value_loss'])
            episode_entropy.append(losses['entropy'])
        
        # Record statistics
        episode_rewards.append(total_reward)
        
        if episode_policy_loss:
            policy_losses.append(np.mean(episode_policy_loss))
        else:
            policy_losses.append(0.0)
        
        if episode_value_loss:
            value_losses.append(np.mean(episode_value_loss))
        else:
            value_losses.append(0.0)
        
        if episode_entropy:
            entropies.append(np.mean(episode_entropy))
        else:
            entropies.append(0.0)
        
        if total_reward > best_reward:
            best_reward = total_reward
        
        # Print progress
        if verbose and episode % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:]) if len(episode_rewards) >= 50 else np.mean(episode_rewards)
            print(f"Episode {episode}, Average Reward: {avg_reward:.2f}, "
                  f"Policy Loss: {policy_losses[-1]:.4f}")
    
    return {
        'episode_rewards': episode_rewards,
        'policy_losses': policy_losses,
        'value_losses': value_losses,
        'entropies': entropies,
        'best_reward': best_reward
    }


def evaluate_policy_agent(agent, env, num_episodes: int = 100,
                          max_steps: int = 500) -> Dict[str, float]:
    """Evaluate a trained policy agent with `training=False`.

    Each episode is capped at `max_steps` steps in case the environment
    never signals `done`.
    """
    rewards = []
    lengths = []
    
    for _ in range(num_episodes):
        state = env.reset()
        total_reward = 0
        steps = 0
        
        while True:
            action = agent.select_action(state, training=False)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            steps += 1
            
            if done or steps >= max_steps:
                break
        
        rewards.append(total_reward)
        lengths.append(steps)
    
    return {
        'mean_reward': np.mean(rewards),
        'std_reward': np.std(rewards),
        'mean_length': np.mean(lengths),
        'std_length': np.std(lengths)
    }

## Experiment 1: REINFORCE vs REINFORCE with Baseline

Compare vanilla REINFORCE against REINFORCE with a baseline on CartPole, using identical hyperparameters so that the baseline is the only difference.
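
As a minimal sketch of what the baseline changes, the snippet below computes the discounted returns $G_t$ for one short episode and then subtracts a baseline. The helper name `discounted_returns` and the constant baseline value are illustrative assumptions, not part of this notebook's `REINFORCEAgent`, which may compute its baseline differently (for example from a learned value function).

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=t} gamma^(k-t) * r_k for every step t."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)

# Hypothetical constant baseline, e.g. a running average of past returns.
# It does not depend on the action, so the gradient estimate stays unbiased.
baseline = 1.5
advantages = returns - baseline

print(returns)     # [1.875 1.75  1.5   1.   ]
print(advantages)  # [ 0.375  0.25   0.    -0.5  ]
```

The raw returns are all positive, so vanilla REINFORCE increases the log-probability of every action taken. After subtracting the baseline, only better-than-expected steps are reinforced, which is the source of the variance reduction.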

In [None]:
# Create environments
env_vanilla = CartPoleWrapper()
env_baseline = CartPoleWrapper()

# Create agents
agent_vanilla = REINFORCEAgent(
    state_dim=env_vanilla.state_dim,
    action_dim=env_vanilla.action_dim,
    lr_policy=3e-4,
    gamma=0.99,
    use_baseline=False,  # Vanilla REINFORCE
    hidden_dims=[128, 128],
    action_type='discrete'
)

agent_baseline = REINFORCEAgent(
    state_dim=env_baseline.state_dim,
    action_dim=env_baseline.action_dim,
    lr_policy=3e-4,
    gamma=0.99,
    use_baseline=True,  # REINFORCE with baseline
    hidden_dims=[128, 128],
    action_type='discrete'
)

print("Training Vanilla REINFORCE...")
results_vanilla = train_policy_agent(agent_vanilla, env_vanilla, num_episodes=300, 
                                   verbose=False, agent_type='reinforce')

print("Training REINFORCE with Baseline...")
results_baseline = train_policy_agent(agent_baseline, env_baseline, num_episodes=300, 
                                    verbose=False, agent_type='reinforce')

# Evaluate both agents
print("\nEvaluating Vanilla REINFORCE...")
eval_vanilla = evaluate_policy_agent(agent_vanilla, env_vanilla, num_episodes=100)

print("Evaluating REINFORCE with Baseline...")
eval_baseline = evaluate_policy_agent(agent_baseline, env_baseline, num_episodes=100)

print(f"\nVanilla REINFORCE Results:")
print(f"  Mean Reward: {eval_vanilla['mean_reward']:.2f} ± {eval_vanilla['std_reward']:.2f}")
print(f"  Mean Episode Length: {eval_vanilla['mean_length']:.1f}")

print(f"\nREINFORCE with Baseline Results:")
print(f"  Mean Reward: {eval_baseline['mean_reward']:.2f} ± {eval_baseline['std_reward']:.2f}")
print(f"  Mean Episode Length: {eval_baseline['mean_length']:.1f}")

# Clean up
env_vanilla.close()
env_baseline.close()

## Experiment 2: REINFORCE vs Actor-Critic vs PPO

Compare three policy gradient methods on CartPole: REINFORCE with a baseline, online Actor-Critic, and PPO.
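
What mainly separates PPO from the other two methods is its clipped surrogate objective, controlled by `clip_epsilon`. The following is a rough NumPy sketch of the standard clipped objective, not the code inside this notebook's `PPOAgent`; the function name `ppo_clipped_objective` and the sample numbers are assumptions made for illustration.

```python
import numpy as np

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r = pi_new / pi_old."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # The element-wise minimum is a pessimistic bound on the improvement.
    return np.mean(np.minimum(unclipped, clipped))

# Illustrative values: probability ratios of 1.5, 0.5 and 1.0.
old_log_probs = np.log(np.array([0.2, 0.4, 0.5]))
new_log_probs = np.log(np.array([0.3, 0.2, 0.5]))
advantages = np.array([1.0, -1.0, 2.0])

objective = ppo_clipped_objective(new_log_probs, old_log_probs, advantages)
print(f"{objective:.3f}")  # 0.800
```

In the first sample the ratio 1.5 is clipped to 1.2, so the policy gains nothing from pushing that action's probability further. In the second sample the ratio 0.5 is clipped to 0.8 and the minimum selects the more negative term, -0.8, again removing the incentive to move further from the old policy.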

In [None]:
# Create environments
env_reinforce = CartPoleWrapper()
env_ac = CartPoleWrapper()
env_ppo = CartPoleWrapper()

# Create agents
agent_reinforce = REINFORCEAgent(
    state_dim=env_reinforce.state_dim,
    action_dim=env_reinforce.action_dim,
    lr_policy=3e-4,
    gamma=0.99,
    use_baseline=True,
    hidden_dims=[128, 128],
    action_type='discrete'
)

agent_ac = ActorCriticAgent(
    state_dim=env_ac.state_dim,
    action_dim=env_ac.action_dim,
    lr_policy=3e-4,
    lr_value=1e-3,
    gamma=0.99,
    hidden_dims=[128, 128],
    action_type='discrete',
    entropy_coef=0.01
)

agent_ppo = PPOAgent(
    state_dim=env_ppo.state_dim,
    action_dim=env_ppo.action_dim,
    lr=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_epsilon=0.2,
    entropy_coef=0.01,
    hidden_dims=[128, 128],
    action_type='discrete'
)

print("Training REINFORCE with Baseline...")
results_reinforce = train_policy_agent(agent_reinforce, env_reinforce, num_episodes=300, 
                                     verbose=False, agent_type='reinforce')

print("Training Actor-Critic...")
results_ac = train_policy_agent(agent_ac, env_ac, num_episodes=300, 
                              verbose=False, agent_type='actor_critic')

print("Training PPO...")
results_ppo = train_policy_agent(agent_ppo, env_ppo, num_episodes=300, 
                               verbose=False, agent_type='ppo')

# Evaluate all agents
print("\nEvaluating REINFORCE...")
eval_reinforce = evaluate_policy_agent(agent_reinforce, env_reinforce, num_episodes=100)

print("Evaluating Actor-Critic...")
eval_ac = evaluate_policy_agent(agent_ac, env_ac, num_episodes=100)

print("Evaluating PPO...")
eval_ppo = evaluate_policy_agent(agent_ppo, env_ppo, num_episodes=100)

print(f"\nREINFORCE Results:")
print(f"  Mean Reward: {eval_reinforce['mean_reward']:.2f} ± {eval_reinforce['std_reward']:.2f}")

print(f"\nActor-Critic Results:")
print(f"  Mean Reward: {eval_ac['mean_reward']:.2f} ± {eval_ac['std_reward']:.2f}")

print(f"\nPPO Results:")
print(f"  Mean Reward: {eval_ppo['mean_reward']:.2f} ± {eval_ppo['std_reward']:.2f}")

# Clean up
env_reinforce.close()
env_ac.close()
env_ppo.close()

## Experiment 3: Continuous Control with PPO

Test PPO on a continuous action space (Pendulum).
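
With continuous actions the policy outputs the parameters of a distribution rather than one probability per action. The sketch below assumes a diagonal Gaussian parameterized by a mean and a log standard deviation, which is a common choice. The helper names are hypothetical, and the notebook's `PPOAgent` may implement this differently (for example with `torch.distributions.Normal`).

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    z = (action - mean) / np.exp(log_std)
    return np.sum(-0.5 * z**2 - log_std - 0.5 * LOG_2PI, axis=-1)

def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian (depends only on log_std)."""
    return np.sum(log_std + 0.5 * (LOG_2PI + 1.0), axis=-1)

rng = np.random.default_rng(0)
mean = np.array([0.0])
log_std = np.array([0.0])  # std = 1

# Sample an action: mean + std * standard normal noise.
action = mean + np.exp(log_std) * rng.standard_normal(mean.shape)

print(gaussian_log_prob(mean, mean, log_std))   # log-density at the mode
print(gaussian_entropy(log_std))                # entropy with std = 1
print(gaussian_entropy(np.array([-2.0])))       # negative for a narrow policy
```

Unlike the entropy of a discrete distribution, the differential entropy of a Gaussian becomes negative once the standard deviation is small enough, which is worth remembering when reading entropy curves for continuous policies.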

In [None]:
# Create continuous environment
env_continuous = PendulumWrapper()

# Create PPO agent for continuous control
agent_continuous = PPOAgent(
    state_dim=env_continuous.state_dim,
    action_dim=env_continuous.action_dim,
    lr=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_epsilon=0.2,
    entropy_coef=0.01,
    hidden_dims=[128, 128],
    action_type='continuous'  # Continuous action space
)

print("Training PPO on Continuous Control (Pendulum)...")
results_continuous = train_policy_agent(agent_continuous, env_continuous, num_episodes=400, 
                                      max_steps=200, verbose=True, agent_type='ppo')

# Evaluate continuous agent
print("\nEvaluating Continuous PPO...")
eval_continuous = evaluate_policy_agent(agent_continuous, env_continuous, num_episodes=50)

print(f"\nContinuous PPO Results:")
print(f"  Mean Reward: {eval_continuous['mean_reward']:.2f} ± {eval_continuous['std_reward']:.2f}")
print(f"  Mean Episode Length: {eval_continuous['mean_length']:.1f}")

# Clean up
env_continuous.close()

## Performance Visualization and Analysis

In [None]:
def plot_policy_training_results(results_dict: Dict[str, Dict], title: str = "Policy Gradient Comparison"):
    """Plot training results for policy gradient methods."""
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle(title, fontsize=16)
    
    colors = ['blue', 'red', 'green', 'orange', 'purple']
    
    # Episode rewards
    ax = axes[0, 0]
    for i, (name, results) in enumerate(results_dict.items()):
        rewards = results['episode_rewards']
        # Smooth with moving average
        window_size = min(30, len(rewards) // 10)
        if window_size > 1:
            smoothed = np.convolve(rewards, np.ones(window_size)/window_size, mode='valid')
            ax.plot(range(window_size-1, len(rewards)), smoothed, 
                   color=colors[i % len(colors)], label=f'{name} (smoothed)', linewidth=2)
        ax.plot(rewards, color=colors[i % len(colors)], alpha=0.3, linewidth=0.5)
    
    ax.set_xlabel('Episode')
    ax.set_ylabel('Episode Reward')
    ax.set_title('Episode Rewards')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Policy loss
    ax = axes[0, 1]
    for i, (name, results) in enumerate(results_dict.items()):
        losses = results['policy_losses']
        # Smooth losses
        window_size = min(30, len(losses) // 10)
        if window_size > 1 and len(losses) > window_size:
            smoothed_loss = np.convolve(losses, np.ones(window_size)/window_size, mode='valid')
            ax.plot(range(window_size-1, len(losses)), smoothed_loss, 
                   color=colors[i % len(colors)], label=name, linewidth=2)
    
    ax.set_xlabel('Episode')
    ax.set_ylabel('Policy Loss')
    ax.set_title('Policy Training Loss')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Value loss
    ax = axes[1, 0]
    for i, (name, results) in enumerate(results_dict.items()):
        if 'value_losses' in results and any(v > 0 for v in results['value_losses']):
            losses = results['value_losses']
            # Smooth losses
            window_size = min(30, len(losses) // 10)
            if window_size > 1 and len(losses) > window_size:
                smoothed_loss = np.convolve(losses, np.ones(window_size)/window_size, mode='valid')
                ax.plot(range(window_size-1, len(losses)), smoothed_loss, 
                       color=colors[i % len(colors)], label=name, linewidth=2)
    
    ax.set_xlabel('Episode')
    ax.set_ylabel('Value Loss')
    ax.set_title('Value Function Training Loss')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Entropy (if available)
    ax = axes[1, 1]
    for i, (name, results) in enumerate(results_dict.items()):
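        # Test for non-zero rather than positive: Gaussian (continuous) policies can have negative entropy
        if 'entropies' in results and any(e != 0 for e in results['entropies']):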
            entropies = results['entropies']
            # Smooth entropies
            window_size = min(30, len(entropies) // 10)
            if window_size > 1 and len(entropies) > window_size:
                smoothed_entropy = np.convolve(entropies, np.ones(window_size)/window_size, mode='valid')
                ax.plot(range(window_size-1, len(entropies)), smoothed_entropy, 
                       color=colors[i % len(colors)], label=name, linewidth=2)
    
    ax.set_xlabel('Episode')
    ax.set_ylabel('Entropy')
    ax.set_title('Policy Entropy (Exploration)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()


# Plot all comparisons
print("Vanilla REINFORCE vs REINFORCE with Baseline:")
plot_policy_training_results({
    'Vanilla REINFORCE': results_vanilla,
    'REINFORCE + Baseline': results_baseline
}, "REINFORCE: Vanilla vs Baseline")

print("\nPolicy Gradient Method Comparison:")
plot_policy_training_results({
    'REINFORCE': results_reinforce,
    'Actor-Critic': results_ac,
    'PPO': results_ppo
}, "Policy Gradient Methods Comparison")

print("\nContinuous Control (PPO on Pendulum):")
plot_policy_training_results({
    'PPO Continuous': results_continuous
}, "PPO on Continuous Control")

## Policy Gradient Analysis and Insights
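
The analysis in this section reports the mean policy entropy next to its maximum, $\log(\text{number of actions})$. As a small worked example of how to read those numbers, the sketch below evaluates the same entropy formula on two hand-picked two-action distributions. The helper name `categorical_entropy` is an assumption for illustration.

```python
import numpy as np

def categorical_entropy(probs, eps=1e-8):
    """Shannon entropy -sum(p * log p); eps avoids log(0)."""
    probs = np.asarray(probs, dtype=np.float64)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

uniform = categorical_entropy([0.5, 0.5])      # maximally uncertain policy
confident = categorical_entropy([0.99, 0.01])  # nearly deterministic policy

print(f"uniform:   {uniform:.3f} (max is log 2 = {np.log(2):.3f})")
print(f"confident: {confident:.3f}")
```

An entropy close to $\log 2 \approx 0.693$ means the policy is still choosing almost at random, while a value near zero means it has committed to one action in most states.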

In [None]:
def analyze_policy_distributions(agent, env, num_states: int = 1000):
    """Analyze policy distributions learned by the agent."""
    
    if env.action_type != 'discrete':
        print("Analysis currently supports discrete action spaces only.")
        return
    
    # Collect states and action probabilities
    states = []
    action_probs = []
    
    for _ in range(num_states):
        # Reset and take random steps to get diverse states
        state = env.reset()
        steps = np.random.randint(0, 50)
        
        for _step in range(steps):
            # Sample a random discrete action without assuming a wrapped Gym env exists
            action = np.random.randint(env.action_dim)
            state, _, done, _ = env.step(action)
            if done:
                state = env.reset()
                break
        
        # Get action probabilities for this state
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
            action_dist = agent.policy(state_tensor)
            probs = action_dist.probs.cpu().numpy()[0]
        
        states.append(state)
        action_probs.append(probs)
    
    states = np.array(states)
    action_probs = np.array(action_probs)
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle('Policy Analysis: Action Probabilities vs State Features', fontsize=14)
    
    feature_names = ['Cart Position', 'Cart Velocity', 'Pole Angle', 'Pole Angular Velocity']
    
    for i in range(4):
        ax = axes[i//2, i%2]
        
        # Plot action probabilities vs state feature
        ax.scatter(states[:, i], action_probs[:, 0], 
                  alpha=0.5, s=10, label='Action 0 (Left)', color='blue')
        ax.scatter(states[:, i], action_probs[:, 1], 
                  alpha=0.5, s=10, label='Action 1 (Right)', color='red')
        
        ax.set_xlabel(feature_names[i])
        ax.set_ylabel('Action Probability')
        ax.set_title(f'Policy vs {feature_names[i]}')
        ax.legend()
        ax.grid(True, alpha=0.3)
        ax.set_ylim(0, 1)
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nPolicy Statistics:")
    print(f"Mean Action 0 Probability: {np.mean(action_probs[:, 0]):.3f}")
    print(f"Mean Action 1 Probability: {np.mean(action_probs[:, 1]):.3f}")
    
    # Analyze policy determinism
    max_probs = np.max(action_probs, axis=1)
    determinism = np.mean(max_probs)
    print(f"Policy Determinism (mean max prob): {determinism:.3f}")
    
    # Entropy analysis
    entropies = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=1)
    print(f"Mean Policy Entropy: {np.mean(entropies):.3f} (max: {np.log(env.action_dim):.3f})")


# Analyze the trained PPO agent
env_analysis = CartPoleWrapper()
print("Analyzing PPO Policy:")
analyze_policy_distributions(agent_ppo, env_analysis, 500)
env_analysis.close()

## Performance Summary and Comparison

In [None]:
def create_policy_gradient_summary():
    """Create comprehensive summary of policy gradient methods."""
    
    print("=" * 70)
    print("POLICY GRADIENT METHODS PERFORMANCE SUMMARY")
    print("=" * 70)
    
    # Method comparison table
    methods = {
        'Vanilla REINFORCE': eval_vanilla,
        'REINFORCE + Baseline': eval_baseline,
        'Actor-Critic': eval_ac,
        'PPO': eval_ppo
    }
    
    print(f"{'Method':<20} {'Mean Reward':<12} {'Std Reward':<12} {'Variance':<10}")
    print("-" * 60)
    
    for method, results in methods.items():
        mean_reward = results['mean_reward']
        std_reward = results['std_reward']
        variance = "Low" if std_reward < 30 else "Medium" if std_reward < 60 else "High"
        
        print(f"{method:<20} {mean_reward:<12.1f} {std_reward:<12.1f} {variance:<10}")
    
    print("\n" + "=" * 70)
    print("KEY INSIGHTS AND THEORETICAL UNDERSTANDING")
    print("=" * 70)
    
    insights = [
        "1. VARIANCE REDUCTION:",
        "   • A baseline reduces REINFORCE's gradient variance without adding bias",
        "   • Actor-Critic reduces variance further by bootstrapping, at the cost of bias",
        "   • PPO typically gives the most stable learning by clipping policy updates",
        "",
        "2. SAMPLE EFFICIENCY:",
        "   • REINFORCE: Low (Monte Carlo, high variance)",
        "   • Actor-Critic: Medium (bootstrapping, online updates)",
        "   • PPO: Higher (GAE, several epochs per batch; still on-policy)",
        "",
        "3. IMPLEMENTATION COMPLEXITY:",
        "   • REINFORCE: Simple (just policy gradients)",
        "   • Actor-Critic: Medium (policy + value networks)",
        "   • PPO: Complex (clipping, GAE, mini-batches)",
        "",
        "4. THEORETICAL GUARANTEES:",
        "   • REINFORCE converges to a local optimum under standard step-size assumptions",
        "   • PPO's clipping is a heuristic: stable in practice, no formal guarantee",
        "   • Actor-Critic can be less stable because bootstrapping biases the gradient"
    ]
    
    for insight in insights:
        print(insight)
    
    print("\n" + "=" * 70)
    print("WHEN TO USE EACH METHOD")
    print("=" * 70)
    
    recommendations = [
        "REINFORCE:",
        "  • Simple environments with low-dimensional action spaces",
        "  • When you need theoretical understanding first",
        "  • Research/educational purposes",
        "",
        "ACTOR-CRITIC:",
        "  • Online learning scenarios",
        "  • When sample efficiency matters more than stability",
        "  • Continuous learning environments",
        "",
        "PPO:",
        "  • Production systems requiring stability",
        "  • Complex environments with high-dimensional action spaces",
        "  • When you have computational resources for multiple epochs",
        "  • Continuous control tasks",
        "",
        "GENERAL RECOMMENDATION:",
        "  Start with PPO for most practical applications: it is among the",
        "  most robust and widely used policy gradient methods."
    ]
    
    for rec in recommendations:
        print(rec)
    
    print("\n" + "=" * 70)
    print("PRACTICAL IMPLEMENTATION TIPS")
    print("=" * 70)
    
    tips = [
        "• NETWORK ARCHITECTURE: Use Tanh activations for policy networks",
        "• LEARNING RATES: Start with 3e-4 for policy, 1e-3 for value",
        "• NORMALIZATION: Always normalize advantages for stability",
        "• GRADIENT CLIPPING: Essential for stable policy learning (0.5 max norm)",
        "• ENTROPY REGULARIZATION: Use 0.01 coefficient to maintain exploration",
        "• GAE PARAMETER: λ=0.95 works well for most environments",
        "• PPO CLIPPING: ε=0.2 is a good default, tune based on performance",
        "• BATCH SIZE: Larger batches (64-256) work better for PPO",
        "• CONTINUOUS ACTIONS: Use log-std as learnable parameter",
        "• DEBUGGING: Monitor policy entropy to ensure exploration"
    ]
    
    for tip in tips:
        print(tip)


# Create the comprehensive summary
create_policy_gradient_summary()

## Conclusion and Next Steps

In this notebook, we've implemented and analyzed the major policy gradient methods:

### Key Achievements

1. **REINFORCE Implementation**: Classic policy gradient with Monte Carlo returns
2. **Baseline Reduction**: Demonstrated variance reduction with value function baseline
3. **Actor-Critic Methods**: Implemented bootstrapping for improved sample efficiency
4. **PPO Implementation**: Widely used policy optimization that clips updates for stability
5. **Continuous Control**: Extended to continuous action spaces with normal distributions
6. **Comprehensive Analysis**: Compared methods using evaluation statistics and training-curve visualizations

### Mathematical Insights

- **Policy Gradient Theorem**: Direct optimization of expected return through log-likelihood gradients
- **Bias-Variance Tradeoff**: REINFORCE is unbiased but has high variance; Actor-Critic reduces variance but introduces bias through bootstrapping
- **GAE (Generalized Advantage Estimation)**: Provides a bias-variance tradeoff that is tunable through $\lambda$
- **PPO Clipping**: Prevents destructively large policy updates while maintaining progress
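
To make the GAE bullet concrete, here is a minimal sketch of the usual backward recursion $A_t = \delta_t + \gamma \lambda A_{t+1}$ over the TD errors $\delta_t$ defined at the top of this notebook. The function name `compute_gae` and its argument layout are assumptions; the `PPOAgent` above may organize this computation differently.

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step
    and is ignored when the final step is terminal.
    """
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, True]

# lam=0 recovers the one-step TD error (low variance, more bias).
print(compute_gae(rewards, values, 0.0, dones, gamma=1.0, lam=0.0))  # [1.  1.  0.5]
# lam=1 recovers Monte Carlo return minus value (no bootstrap bias, more variance).
print(compute_gae(rewards, values, 0.0, dones, gamma=1.0, lam=1.0))  # [2.5 1.5 0.5]
```

Intermediate values such as $\lambda = 0.95$ interpolate between these two extremes.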

### Performance Results

- A baseline typically improves REINFORCE's performance by reducing gradient variance
- Actor-Critic methods typically provide better sample efficiency than REINFORCE
- PPO typically offers the best combination of stability and performance among the three methods
- Continuous control requires careful action distribution design

### When to Use Policy vs Value Methods

**Use Policy Gradient Methods When:**
- The action space is high-dimensional or continuous
- A stochastic policy is beneficial
- You need to optimize the policy directly
- The environment is partially observable

**Use Value-Based Methods (DQN) When:**
- The action space is discrete and moderately sized
- Sample efficiency is critical
- A deterministic policy is sufficient
- The environment is fully observable

### Next Steps

The final notebook (Part 6) will cover **Advanced Methods & Applications**, exploring:
- Soft Actor-Critic (SAC) for continuous control
- Model-based RL fundamentals
- Transfer learning and domain adaptation
- Real-world deployment considerations
- Multi-agent RL basics
- Current research directions

### Practical Applications

Policy gradient methods are widely used in:
- Robotics and continuous control
- Game playing with complex action spaces
- Natural language generation
- Recommendation systems
- Any domain requiring stochastic policies

The understanding gained here provides a solid foundation for tackling complex real-world reinforcement learning problems where direct policy optimization is necessary.