# Policy Gradient Methods Tutorial

## From REINFORCE to Advantage Actor-Critic

---

This tutorial provides a comprehensive introduction to policy gradient methods in reinforcement learning, covering:

1. **Policy Gradient Theorem** - The theoretical foundation
2. **REINFORCE Algorithm** - Monte Carlo policy gradient
3. **Variance Reduction** - Baselines and advantage functions
4. **Actor-Critic Methods** - Combining policy and value learning
5. **GAE** - Generalized Advantage Estimation

---

**References:**
- Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Sutton et al. (1999). Policy gradient methods for reinforcement learning with function approximation
- Schulman et al. (2016). High-dimensional continuous control using generalized advantage estimation

## 1. Environment Setup

In [None]:
# Standard library imports
import warnings
from collections import deque
from typing import List, Tuple, Dict, Optional

# Scientific computing
import numpy as np

# Deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical, Normal

# Visualization
import matplotlib.pyplot as plt

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

In [None]:
# Configuration and reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

# Plotting defaults
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.grid'] = True

# Device selection
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

print(f"PyTorch Version: {torch.__version__}")
print(f"Device: {DEVICE}")

## 2. Policy Gradient Theorem

### 2.1 Value Methods vs Policy Methods

| Aspect | Value Methods (DQN) | Policy Methods |
|--------|---------------------|----------------|
| Learning Target | Q(s,a) → implicit policy | Direct policy π(a\|s) |
| Action Space | Primarily discrete | Natural for continuous |
| Policy Type | Deterministic (argmax) | Stochastic (distribution) |

### 2.2 Objective Function

Maximize expected cumulative return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

### 2.3 Policy Gradient Theorem (Sutton et al., 1999)

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi}(s, a)\right]$$

**Intuition:**
- $\nabla_\theta \log \pi_\theta(a|s)$: Direction to increase action probability
- $Q^{\pi}(s, a)$: How good the action is
- Good actions → increase probability, bad actions → decrease probability
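
As a quick sanity check of the theorem, the sketch below compares the score-function form with the exact gradient from autograd on a one-state, three-action problem. The logits `theta` and the values `q_values` are made-up numbers for illustration, and the expectation is summed exactly over actions rather than sampled.

```python
import torch

# Hypothetical one-state problem: softmax policy over 3 actions with fixed Q(s, a)
theta = torch.tensor([0.2, -0.1, 0.5], requires_grad=True)  # policy logits
q_values = torch.tensor([0.1, 0.9, 0.0])                    # assumed action values

# Exact gradient of J(theta) = sum_a pi(a) * Q(a), computed by autograd
probs = torch.softmax(theta, dim=0)
J = (probs * q_values).sum()
(exact_grad,) = torch.autograd.grad(J, theta)

# Score-function form: E_a[ grad log pi(a) * Q(a) ], expectation taken exactly
score_grad = torch.zeros(3)
for a in range(3):
    log_prob = torch.log_softmax(theta, dim=0)[a]
    (score,) = torch.autograd.grad(log_prob, theta)
    score_grad += probs[a].detach() * score * q_values[a]

print("autograd gradient:      ", exact_grad)
print("score-function gradient:", score_grad)
```

The two vectors agree up to floating-point error. In practice the sum over actions is replaced by samples from the policy, which is where the variance discussed later comes from.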

In [None]:
def visualize_policy_gradient_intuition():
    """Visualize the intuition behind policy gradient updates."""
    fig, axes = plt.subplots(1, 3, figsize=(14, 4))
    
    actions = ['Left', 'Right', 'Jump']
    initial_probs = [0.33, 0.33, 0.34]
    rewards = [0.1, 0.9, 0.0]
    updated_probs = [0.15, 0.70, 0.15]
    
    # Initial policy
    axes[0].bar(actions, initial_probs, color='steelblue', alpha=0.7)
    axes[0].set_ylim(0, 0.8)
    axes[0].set_title('Initial Policy π(a|s)')
    axes[0].set_ylabel('Probability')
    
    # Action returns
    axes[1].bar(actions, rewards, color='green', alpha=0.7)
    axes[1].set_ylim(0, 1.0)
    axes[1].set_title('Action Returns Q(s,a)')
    axes[1].set_ylabel('Return')
    
    # Updated policy
    axes[2].bar(actions, updated_probs, color='orange', alpha=0.7)
    axes[2].set_ylim(0, 0.8)
    axes[2].set_title("Updated Policy π'(a|s)")
    axes[2].set_ylabel('Probability')
    
    plt.suptitle('Policy Gradient: Increase probability of high-return actions', y=1.02)
    plt.tight_layout()
    plt.show()

visualize_policy_gradient_intuition()

## 3. Policy Networks

### 3.1 Discrete Action Space: Softmax Policy

$$\pi_\theta(a|s) = \frac{\exp(h(s, a; \theta))}{\sum_{a'} \exp(h(s, a'; \theta))}$$
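
A minimal check, using made-up preference values `h`, that passing raw logits to `torch.distributions.Categorical` yields the same probabilities as the softmax formula above:

```python
import torch
from torch.distributions import Categorical

h = torch.tensor([[1.0, 2.0, 0.5]])  # hypothetical preferences h(s, a) for 3 actions

# Softmax written out as in the formula
manual_probs = torch.exp(h) / torch.exp(h).sum(dim=-1, keepdim=True)

# Categorical normalizes the logits itself
dist = Categorical(logits=h)

print(manual_probs)
print(dist.probs)
```

This is why the policy network can output unnormalized logits and leave the normalization to the distribution object.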

In [None]:
class DiscretePolicy(nn.Module):
    """
    Policy network for discrete action spaces.
    
    Uses Softmax to output a probability distribution over actions.
    Network outputs logits for numerical stability.
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        self._init_weights()
    
    def _init_weights(self):
        """Apply orthogonal initialization for stable training."""
        for layer in self.net:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
                nn.init.zeros_(layer.bias)
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Output action logits."""
        return self.net(state)
    
    def get_distribution(self, state: torch.Tensor) -> Categorical:
        """Get categorical distribution over actions."""
        logits = self.forward(state)
        return Categorical(logits=logits)
    
    def sample(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Sample action, returning (action, log_prob, entropy)."""
        dist = self.get_distribution(state)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy()

In [None]:
# Test discrete policy
policy = DiscretePolicy(state_dim=4, action_dim=2)
state = torch.randn(1, 4)

action, log_prob, entropy = policy.sample(state)
probs = policy.get_distribution(state).probs

print(f"State: {state.squeeze().numpy().round(3)}")
print(f"Action Probabilities: {probs.squeeze().detach().numpy().round(3)}")
print(f"Sampled Action: {action.item()}")
print(f"Log π(a|s): {log_prob.item():.4f}")
print(f"Entropy H(π): {entropy.item():.4f}")

### 3.2 Continuous Action Space: Gaussian Policy

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$$

For bounded actions, apply **tanh squashing** and correct the log-probability with the change-of-variables (Jacobian) term:

$$a = \tanh(u), \quad u \sim \mathcal{N}(\mu, \sigma^2)$$

$$\log \pi(a|s) = \log \mathcal{N}(u|\mu,\sigma^2) - \sum_i \log(1 - \tanh^2(u_i))$$
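
For large $|u|$ the quantity $1 - \tanh^2(u)$ approaches zero, which is why implementations usually add a small epsilon inside the logarithm. An algebraically equivalent alternative is $\log(1 - \tanh^2(u)) = 2(\log 2 - u - \mathrm{softplus}(-2u))$. The sketch below only verifies that identity on a few sample points; it is an illustration, not part of the tutorial's policy class.

```python
import math
import torch
import torch.nn.functional as F

u = torch.linspace(-3.0, 3.0, steps=7, dtype=torch.float64)  # pre-squash samples

# Direct form of the correction term, as written in the formula
direct = torch.log(1 - torch.tanh(u) ** 2)

# Equivalent form that never computes 1 - tanh^2(u) explicitly
stable = 2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))

print(torch.max(torch.abs(direct - stable)).item())
```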

In [None]:
class ContinuousPolicy(nn.Module):
    """
    Gaussian policy for continuous action spaces.
    
    Outputs mean and log-std of Gaussian distribution.
    Uses tanh to bound actions to [-1, 1].
    """
    
    LOG_STD_MIN = -20.0
    LOG_STD_MAX = 2.0
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        
        # Feature extraction network
        self.feature_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Mean output layer
        self.mean_layer = nn.Linear(hidden_dim, action_dim)
        
        # Learnable, state-independent log standard deviation
        # (a common simplification of the state-dependent σ_θ(s) in the formula)
        self.log_std = nn.Parameter(torch.zeros(action_dim))
    
    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Output (mean, std) of Gaussian distribution."""
        features = self.feature_net(state)
        mean = self.mean_layer(features)
        log_std = torch.clamp(self.log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
        std = log_std.exp()
        return mean, std
    
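    def sample(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Sample a tanh-squashed action, returning (action, log_prob, entropy)."""
        mean, std = self.forward(state)
        dist = Normal(mean, std)
        
        # Reparameterization: u = mean + std * epsilon (gradients flow through u).
        # NOTE: for score-function updates (REINFORCE/A2C), use dist.sample() or
        # u.detach() so log_prob is differentiated with the action held fixed.
        u = dist.rsample()
        action = torch.tanh(u)  # Bound to [-1, 1]
        
        # Correct log_prob for tanh transformation (Jacobian)
        log_prob = dist.log_prob(u).sum(dim=-1)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(dim=-1)
        
        # Entropy of the pre-squash Gaussian (not of the squashed distribution)
        entropy = dist.entropy().sum(dim=-1)
        return action, log_prob, entropy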

In [None]:
# Test continuous policy
cont_policy = ContinuousPolicy(state_dim=3, action_dim=2)
state = torch.randn(1, 3)

action, log_prob, entropy = cont_policy.sample(state)
mean, std = cont_policy.forward(state)

print(f"Mean μ: {mean.squeeze().detach().numpy().round(3)}")
print(f"Std σ: {std.squeeze().detach().numpy().round(3)}")
print(f"Sampled Action (after tanh): {action.squeeze().detach().numpy().round(3)}")
print(f"Action bounds: [{action.min().item():.3f}, {action.max().item():.3f}]")

## 4. REINFORCE Algorithm

### 4.1 Monte Carlo Policy Gradient (Williams, 1992)

Use the sampled return $G_t$ from a complete episode as an unbiased estimate of $Q^{\pi}(s_t, a_t)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) G_t^{(i)}$$

Where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return from time $t$.
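
A small worked example, with a made-up three-step episode, showing that the definition of $G_t$ and the backward recursion $G_t = r_t + \gamma G_{t+1}$ give the same numbers:

```python
gamma = 0.5
rewards = [1.0, 1.0, 1.0]  # hypothetical 3-step episode
T = len(rewards)

# Definition: G_t = sum_{k>=t} gamma^(k-t) * r_k
direct = [sum(gamma ** (k - t) * rewards[k] for k in range(t, T)) for t in range(T)]

# Recursion: G_t = r_t + gamma * G_{t+1}, evaluated backward from the last step
recursive = [0.0] * T
G = 0.0
for t in reversed(range(T)):
    G = rewards[t] + gamma * G
    recursive[t] = G

print(direct)     # [1.75, 1.5, 1.0]
print(recursive)  # [1.75, 1.5, 1.0]
```

The recursion touches each reward once, so it runs in O(T) instead of the O(T²) of the direct sum.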

In [None]:
def compute_returns(rewards: List[float], gamma: float, normalize: bool = True) -> torch.Tensor:
    """
    Compute Monte Carlo returns.
    
    G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
    
    Computed efficiently via backward iteration: O(T) time.
    
    Args:
        rewards: List of rewards [r_0, r_1, ..., r_{T-1}]
        gamma: Discount factor
        normalize: Whether to standardize returns
    
    Returns:
        Tensor of returns [G_0, G_1, ..., G_{T-1}]
    """
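    returns = []
    G = 0.0
    
    # Backward iteration: append, then reverse once.
    # (list.insert(0, ...) costs O(T) per call, which would make the loop O(T²).)
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()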
    
    returns = torch.tensor(returns, dtype=torch.float32)
    
    if normalize and len(returns) > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    
    return returns

In [None]:
# Demonstrate return computation
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
gamma = 0.99

returns = compute_returns(rewards, gamma, normalize=False)

print("Return Computation Example")
print("=" * 40)
print(f"Rewards: {rewards}")
print(f"Discount γ = {gamma}")
print(f"\nReturns (backward computation):")
for t, (r, G) in enumerate(zip(rewards, returns)):
    print(f"  t={t}: r_t={r:.1f}, G_t={G.item():.4f}")

In [None]:
class REINFORCE:
    """
    REINFORCE algorithm implementation.
    
    Properties:
    - Uses Monte Carlo returns (complete episodes)
    - Unbiased gradient estimate
    - High variance
    """
    
    def __init__(self, state_dim: int, action_dim: int, 
                 lr: float = 1e-3, gamma: float = 0.99):
        self.gamma = gamma
        self.policy = DiscretePolicy(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Episode storage
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        """Select action according to policy."""
        state_t = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob, _ = self.policy.sample(state_t)
        self.log_probs.append(log_prob)
        return action.item()
    
    def store_reward(self, reward: float):
        """Store reward for current step."""
        self.rewards.append(reward)
    
    def update(self) -> float:
        """Update policy at episode end."""
        # Compute returns
        returns = compute_returns(self.rewards, self.gamma, normalize=True)
        
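        # Policy gradient loss: -E[log π(a|s) * G]
        # Each stored log-prob has shape [1]; cat gives shape [T] to match returns.
        # (stack would give [T, 1] and silently broadcast the product to [T, T].)
        log_probs = torch.cat(self.log_probs)
        policy_loss = -(log_probs * returns).mean()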
        
        # Optimize
        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.log_probs = []
        self.rewards = []
        
        return policy_loss.item()

print("REINFORCE class defined successfully")

## 5. Variance Reduction with Baselines

### 5.1 The High Variance Problem

REINFORCE suffers from high variance because each return $G_t$ accumulates the randomness of every later action, state transition, and reward in the episode.

### 5.2 Introducing a Baseline

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q(s,a) - b(s))\right]$$

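**Key insight:** A baseline $b(s)$ that depends only on the state, not on the action, leaves the expected gradient unchanged, because $\sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta 1 = 0$: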

$$\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = 0$$
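
The identity can be checked numerically. The sketch below assumes a three-action softmax policy with arbitrary logits and an arbitrary constant baseline, and takes the expectation exactly by summing over actions:

```python
import torch

theta = torch.tensor([0.3, -1.2, 0.8], requires_grad=True)  # hypothetical logits
baseline = 5.0                                              # arbitrary b(s)

probs = torch.softmax(theta, dim=0).detach()

# E_a[ grad log pi(a) * b ] = sum_a pi(a) * grad log pi(a) * b
expected_term = torch.zeros(3)
for a in range(3):
    log_prob = torch.log_softmax(theta, dim=0)[a]
    (score,) = torch.autograd.grad(log_prob, theta)
    expected_term += probs[a] * score * baseline

print(expected_term)  # zero up to floating-point error
```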

### 5.3 Advantage Function

Using $V(s)$ as baseline gives us the **advantage function**:

$$A(s, a) = Q(s, a) - V(s)$$

- $A > 0$: Action better than average
- $A < 0$: Action worse than average
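
A tiny numeric illustration with made-up action probabilities and action values: since $V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a)]$, the advantage averages to zero under the policy.

```python
import torch

pi = torch.tensor([0.2, 0.5, 0.3])  # hypothetical pi(a|s)
q = torch.tensor([1.0, 3.0, 2.0])   # hypothetical Q(s, a)

v = (pi * q).sum()                  # V(s) = 0.2*1 + 0.5*3 + 0.3*2 = 2.3
advantage = q - v                   # A(s, a) = Q(s, a) - V(s)

print(v.item())
print(advantage)
print((pi * advantage).sum().item())  # policy-weighted mean advantage, ~0
```

Here the second action has a positive advantage and the first a negative one, so a policy gradient step would shift probability from the first toward the second.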

In [None]:
def visualize_baseline_effect():
    """Demonstrate how baselines reduce variance of the gradient estimate."""
    np.random.seed(42)
    
    n_samples = 1000
    base_return = 100  # High average return
    returns = np.random.normal(base_return, 20, n_samples)
    
    # Toy stand-in for the score ∇log π(a|s): zero mean, independent of the return.
    # (Subtracting a constant from the returns alone would not change their variance;
    # the reduction shows up in the product score * (G - b).)
    scores = np.random.normal(0.0, 1.0, n_samples)
    
    # Without baseline: per-sample gradient term is score * G
    gradient_no_baseline = scores * returns
    
    # With baseline (subtract mean return): score * (G - b)
    baseline = returns.mean()
    gradient_with_baseline = scores * (returns - baseline)
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    axes[0].hist(gradient_no_baseline, bins=50, alpha=0.7, color='steelblue')
    axes[0].axvline(x=0, color='red', linestyle='--', label='Zero')
    axes[0].set_title(f'No Baseline\nMean={np.mean(gradient_no_baseline):.1f}, Var={np.var(gradient_no_baseline):.1f}')
    axes[0].legend()
    
    axes[1].hist(gradient_with_baseline, bins=50, alpha=0.7, color='orange')
    axes[1].axvline(x=0, color='red', linestyle='--', label='Zero')
    axes[1].set_title(f'With Baseline\nMean={np.mean(gradient_with_baseline):.1f}, Var={np.var(gradient_with_baseline):.1f}')
    axes[1].legend()
    
    plt.suptitle('Baseline Reduces Variance While Preserving Expected Gradient', y=1.02)
    plt.tight_layout()
    plt.show()
    
    print(f"Variance reduction: {np.var(gradient_no_baseline) / np.var(gradient_with_baseline):.2f}x")

visualize_baseline_effect()

In [None]:
class ValueNetwork(nn.Module):
    """State value function V(s) for use as baseline."""
    
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

In [None]:
class REINFORCEBaseline:
    """
    REINFORCE with learned value baseline.
    
    Uses advantage A = G - V(s) instead of raw returns.
    """
    
    def __init__(self, state_dim: int, action_dim: int,
                 lr_policy: float = 1e-3, lr_value: float = 1e-3,
                 gamma: float = 0.99):
        self.gamma = gamma
        
        self.policy = DiscretePolicy(state_dim, action_dim)
        self.value_net = ValueNetwork(state_dim)
        
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        
        self.log_probs = []
        self.values = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        state_t = torch.FloatTensor(state).unsqueeze(0)
        
        action, log_prob, _ = self.policy.sample(state_t)
        value = self.value_net(state_t)
        
        self.log_probs.append(log_prob)
        self.values.append(value)
        
        return action.item()
    
    def store_reward(self, reward: float):
        self.rewards.append(reward)
    
    def update(self) -> Tuple[float, float]:
        # Compute returns (not normalized, for value target)
        returns = compute_returns(self.rewards, self.gamma, normalize=False)
        
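        # Each log-prob has shape [1] and each value shape [1, 1]; flatten both to [T]
        # (stack would give [T, 1] and broadcast against the [T] advantages to [T, T])
        log_probs = torch.cat(self.log_probs)
        values = torch.cat(self.values).squeeze(-1)
        
        # Compute advantages: A = G - V
        # IMPORTANT: detach values to prevent gradient flow through baseline
        advantages = returns - values.detach()
        if len(advantages) > 1:  # std of a single sample is undefined
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)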
        
        # Policy loss
        policy_loss = -(log_probs * advantages).mean()
        
        # Value loss (MSE)
        value_loss = F.mse_loss(values, returns)
        
        # Update policy
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        
        # Update value network
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()
        
        # Clear episode data
        self.log_probs = []
        self.values = []
        self.rewards = []
        
        return policy_loss.item(), value_loss.item()

print("REINFORCEBaseline class defined successfully")

## 6. Generalized Advantage Estimation (GAE)

### 6.1 TD Error

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

### 6.2 GAE Formula (Schulman et al., 2016)

$$A_t^{GAE}(\gamma,\lambda) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

### 6.3 λ Parameter Controls Bias-Variance Trade-off

| λ Value | Effect |
|---------|--------|
| λ=0 | TD(0), $A_t = \delta_t$, high bias, low variance |
| λ=1 | Monte Carlo, $A_t = G_t - V(s_t)$, low bias, high variance |
| λ=0.95 | Good balance for most tasks |
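
The two limiting rows of the table can be verified numerically. The sketch below uses a hypothetical helper `gae_reference` and made-up rewards and value estimates for a single episode that ends in a terminal state; it is a plain-NumPy illustration and independent of the `compute_gae` function defined in this section.

```python
import numpy as np

def gae_reference(rewards, values, gamma, lam):
    """Plain-NumPy GAE for one episode that ends in a terminal state."""
    v_ext = np.append(values, 0.0)                     # V after the terminal step is 0
    deltas = rewards + gamma * v_ext[1:] - v_ext[:-1]  # TD errors delta_t
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv, deltas

rewards = np.array([1.0, 0.0, 2.0, 1.0])  # hypothetical rewards
values = np.array([2.5, 2.0, 2.8, 0.9])   # hypothetical critic estimates V(s_t)
gamma = 0.9

adv_td, deltas = gae_reference(rewards, values, gamma, lam=0.0)
adv_mc, _ = gae_reference(rewards, values, gamma, lam=1.0)

# Discounted Monte Carlo returns G_t for comparison
mc_returns = np.zeros(len(rewards))
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    mc_returns[t] = G

print(np.allclose(adv_td, deltas))               # lambda = 0 reduces to delta_t
print(np.allclose(adv_mc, mc_returns - values))  # lambda = 1 reduces to G_t - V(s_t)
```

With λ=1 the value terms telescope, leaving only the discounted rewards minus $V(s_t)$, which is why the critic's errors no longer bias the estimate.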

In [None]:
def compute_gae(
    rewards: List[float],
    values: List[float],
    next_value: float,
    dones: List[bool],
    gamma: float,
    gae_lambda: float
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Compute Generalized Advantage Estimation.
    
    δ_t = r_t + γV(s_{t+1}) - V(s_t)
    A_t^GAE = Σ (γλ)^l δ_{t+l}
    
    Args:
        rewards: List of rewards
        values: List of value estimates
        next_value: Bootstrap value for final state
        dones: Episode termination flags
        gamma: Discount factor
        gae_lambda: GAE λ parameter
    
    Returns:
        advantages: GAE advantage estimates
        returns: Value targets (A + V)
    """
    advantages = []
    gae = 0.0
    
    values = list(values) + [next_value]
    
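    for t in reversed(range(len(rewards))):
        # No bootstrapping or accumulation across an episode boundary
        not_done = 0.0 if dones[t] else 1.0
        
        # TD error (next value is 0 if done)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        
        # GAE accumulation (reset on done)
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages.append(gae)
    
    # Restore chronological order (append + reverse keeps the loop O(T))
    advantages.reverse()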
    
    advantages = torch.tensor(advantages, dtype=torch.float32)
    returns = advantages + torch.tensor(values[:-1], dtype=torch.float32)
    
    return advantages, returns

In [None]:
# Demonstrate GAE
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [0.8, 0.9, 1.0, 0.95, 0.85]
dones = [False, False, False, False, True]
gamma = 0.99
gae_lambda = 0.95

advantages, returns = compute_gae(rewards, values, 0.0, dones, gamma, gae_lambda)

print("GAE Computation Example")
print("=" * 40)
print(f"Rewards: {rewards}")
print(f"Values: {values}")
print(f"\nAdvantages (λ={gae_lambda}): {advantages.numpy().round(3)}")
print(f"Returns: {returns.numpy().round(3)}")

In [None]:
def visualize_gae_lambda_effect():
    """Visualize how λ affects advantage variance."""
    np.random.seed(42)
    
    T = 50
    gamma = 0.99
    rewards = np.random.normal(1.0, 0.5, T)
    # Discounted reward-to-go serves as the "true" value of each state
    true_values = compute_returns(rewards.tolist(), gamma, normalize=False).numpy()
    noisy_values = true_values + np.random.normal(0, 0.3, T)
    dones = [False] * (T-1) + [True]
    
    lambdas = [0.0, 0.5, 0.9, 1.0]
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    for ax, lam in zip(axes.flat, lambdas):
        advantages, _ = compute_gae(
            rewards.tolist(), noisy_values.tolist(),
            0.0, dones, gamma, lam
        )
        
        ax.plot(advantages.numpy(), linewidth=2)
        ax.axhline(y=0, color='red', linestyle='--', alpha=0.5)
        ax.set_xlabel('Time Step')
        ax.set_ylabel('Advantage')
        ax.set_title(f'λ={lam}: Std={advantages.std().item():.3f}')
        ax.grid(True, alpha=0.3)
    
    plt.suptitle('GAE: Effect of λ on Advantage Variance', y=1.02)
    plt.tight_layout()
    plt.show()

visualize_gae_lambda_effect()

## 7. Advantage Actor-Critic (A2C)

### 7.1 Architecture

```
state → [shared_net] → features
                         ├→ [actor_head] → policy π(a|s)
                         └→ [critic_head] → value V(s)
```

### 7.2 Loss Function

$$\mathcal{L} = \underbrace{-\log \pi(a|s) \cdot A}_{\text{Actor}} + \underbrace{c_v (V(s) - G)^2}_{\text{Critic}} - \underbrace{c_{ent} H(\pi)}_{\text{Entropy}}$$
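
To make the three terms concrete, the sketch below evaluates the combined loss on a made-up mini-batch of three transitions. All tensors and the coefficients `value_coef` and `entropy_coef` are illustrative assumptions, not values produced by the tutorial's networks.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

# Hypothetical mini-batch: 3 transitions, 2 actions
logits = torch.tensor([[0.5, -0.5], [0.0, 0.0], [-1.0, 1.0]], requires_grad=True)
values = torch.tensor([0.4, 0.1, 0.9], requires_grad=True)  # critic outputs V(s)
actions = torch.tensor([0, 1, 1])
advantages = torch.tensor([0.5, -0.2, 1.0])                 # treated as constants
returns = torch.tensor([1.0, 0.0, 2.0])                     # value targets G
value_coef, entropy_coef = 0.5, 0.01

dist = Categorical(logits=logits)
policy_loss = -(dist.log_prob(actions) * advantages).mean()  # actor term
value_loss = F.mse_loss(values, returns)                     # critic term
entropy = dist.entropy().mean()                              # exploration bonus

total_loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
total_loss.backward()  # gradients reach both the logits and the values

print(policy_loss.item(), value_loss.item(), entropy.item(), total_loss.item())
```

The entropy enters with a negative sign because the optimizer minimizes the loss, so higher entropy lowers it and discourages premature collapse to a deterministic policy.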

In [None]:
class ActorCriticNetwork(nn.Module):
    """
    Shared-feature Actor-Critic network.
    
    Actor and Critic share feature extraction layers,
    then branch into separate heads.
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        
        # Shared feature layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Actor head (policy)
        self.actor = nn.Linear(hidden_dim, action_dim)
        
        # Critic head (value)
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, state: torch.Tensor):
        features = self.shared(state)
        logits = self.actor(features)
        value = self.critic(features)
        return Categorical(logits=logits), value
    
    def get_action_and_value(self, state: torch.Tensor):
        """Get action, log_prob, entropy, and value in one pass."""
        dist, value = self.forward(state)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), value.squeeze(-1)

In [None]:
class A2C:
    """
    Advantage Actor-Critic with GAE.
    
    Features:
    - Shared actor-critic network
    - GAE for advantage estimation
    - Entropy regularization for exploration
    - Gradient clipping for stability
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        entropy_coef: float = 0.01,
        value_coef: float = 0.5,
        max_grad_norm: float = 0.5
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.max_grad_norm = max_grad_norm
        
        self.model = ActorCriticNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        
        # Episode buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
        self.values = []
        self.dones = []
        self.entropies = []
    
    def select_action(self, state: np.ndarray) -> int:
        state_t = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob, entropy, value = self.model.get_action_and_value(state_t)
        
        self.states.append(state)
        self.actions.append(action.item())
        self.log_probs.append(log_prob)
        self.values.append(value.item())
        self.entropies.append(entropy)
        
        return action.item()
    
    def store(self, reward: float, done: bool):
        self.rewards.append(reward)
        self.dones.append(done)
    
    def update(self, next_state: np.ndarray, done: bool) -> Dict[str, float]:
        # Get bootstrap value
        if done:
            next_value = 0.0
        else:
            with torch.no_grad():
                next_state_t = torch.FloatTensor(next_state).unsqueeze(0)
                _, next_value = self.model(next_state_t)
                next_value = next_value.item()
        
        # Compute GAE
        advantages, returns = compute_gae(
            self.rewards, self.values, next_value,
            self.dones, self.gamma, self.gae_lambda
        )
        
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
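        # Prepare tensors (each stored log-prob/entropy has shape [1], so cat -> [T])
        log_probs = torch.cat(self.log_probs)
        entropies = torch.cat(self.entropies)
        
        # Re-evaluate the critic on the stored states so the value loss carries
        # gradients (self.values holds plain floats, detached from the graph)
        states_t = torch.as_tensor(np.array(self.states), dtype=torch.float32)
        _, values = self.model(states_t)
        values = values.squeeze(-1)
        
        # Compute losses
        policy_loss = -(log_probs * advantages.detach()).mean()
        value_loss = F.mse_loss(values, returns)
        entropy_bonus = entropies.mean()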
        
        # Combined loss
        total_loss = (
            policy_loss
            + self.value_coef * value_loss
            - self.entropy_coef * entropy_bonus
        )
        
        # Optimize
        self.optimizer.zero_grad()
        total_loss.backward()
        nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
        self.optimizer.step()
        
        # Clear buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
        self.values = []
        self.dones = []
        self.entropies = []
        
        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': entropy_bonus.item(),
            'total_loss': total_loss.item()
        }

print("A2C class defined successfully")

## 8. Training and Evaluation

In [None]:
# Check for gymnasium
try:
    import gymnasium as gym
    HAS_GYM = True
    print("Gymnasium imported successfully")
except ImportError:
    HAS_GYM = False
    print("Warning: gymnasium not installed")
    print("Install with: pip install gymnasium")

In [None]:
def train_agent(agent, env_name: str, num_episodes: int = 300, log_interval: int = 50):
    """Train an agent on a gymnasium environment."""
    if not HAS_GYM:
        print("Gymnasium required for training")
        return []
    
    env = gym.make(env_name)
    rewards_history = []
    
    print(f"\n{'='*50}")
    print(f"Training {agent.__class__.__name__} on {env_name}")
    print(f"{'='*50}\n")
    
    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0
        
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            if hasattr(agent, 'store'):
                # Flag only true termination so a time-limit truncation still bootstraps
                agent.store(reward, terminated)
            else:
                agent.store_reward(reward)
            
            total_reward += reward
            state = next_state
            
            if done:
                break
        
        # Update
        if isinstance(agent, A2C):
            agent.update(next_state, terminated)
        else:
            agent.update()
        
        rewards_history.append(total_reward)
        
        if (episode + 1) % log_interval == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1:4d} | Avg Reward: {avg_reward:.2f}")
    
    env.close()
    return rewards_history

In [None]:
# Train and compare algorithms
if HAS_GYM:
    env_name = "CartPole-v1"
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    env.close()
    
    num_episodes = 300
    results = {}
    
    # Train REINFORCE
    print("[1/3] Training REINFORCE...")
    agent_rf = REINFORCE(state_dim, action_dim, lr=1e-3, gamma=0.99)
    results['REINFORCE'] = train_agent(agent_rf, env_name, num_episodes)
    
    # Train REINFORCE + Baseline
    print("\n[2/3] Training REINFORCE + Baseline...")
    agent_rfb = REINFORCEBaseline(state_dim, action_dim, lr_policy=1e-3, lr_value=1e-3)
    results['REINFORCE+Baseline'] = train_agent(agent_rfb, env_name, num_episodes)
    
    # Train A2C
    print("\n[3/3] Training A2C...")
    agent_a2c = A2C(state_dim, action_dim, lr=3e-4, gamma=0.99, gae_lambda=0.95)
    results['A2C (GAE)'] = train_agent(agent_a2c, env_name, num_episodes)
else:
    print("Skipping training (gymnasium not installed)")
    results = {}

In [None]:
def plot_learning_curves(results: Dict[str, List[float]], window: int = 50):
    """Plot learning curves with smoothing."""
    if not results:
        print("No training data to plot")
        return
    
    fig, ax = plt.subplots(figsize=(12, 6))
    colors = ['steelblue', 'orange', 'green']
    
    for (name, rewards), color in zip(results.items(), colors):
        # Raw data (transparent)
        ax.plot(rewards, alpha=0.2, color=color)
        
        # Smoothed curve
        if len(rewards) > window:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            ax.plot(range(window-1, len(rewards)), smoothed,
                   label=name, color=color, linewidth=2)
    
    ax.set_xlabel('Episode')
    ax.set_ylabel('Total Reward')
    ax.set_title('Policy Gradient Methods Comparison')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_learning_curves(results)

## 9. Summary

### Algorithm Comparison

| Algorithm | Advantage Estimate | Update | Variance | Bias | Best For |
|-----------|-------------------|--------|----------|------|----------|
| REINFORCE | $G_t$ (MC) | Episode | High | None | Simple tasks |
| +Baseline | $G_t - V(s)$ | Episode | Medium | None | Medium tasks |
| A2C | GAE($\delta_t$) | n-step (per episode in this notebook) | Low | Some | Complex tasks |

### Practical Guidelines

| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| Learning Rate (Actor) | 1e-4 to 3e-4 | Policy should change smoothly |
| Learning Rate (Critic) | 1e-3 to 3e-3 | Can be larger than actor |
| γ (Discount) | 0.99 | Use 0.95 for shorter tasks |
| λ (GAE) | 0.95 | Balance bias-variance |
| Entropy Coefficient | 0.01 | Too high = random policy |
| Gradient Clipping | 0.5 to 1.0 | Prevents instability |
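
The table lists different learning rates for actor and critic. One way to get that with a single optimizer is PyTorch's per-parameter-group options. The sketch below assumes separate `actor` and `critic` modules (hypothetical stand-ins; the `A2C` class in this tutorial uses one shared network with a single learning rate).

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical separate actor and critic networks
actor = nn.Linear(4, 2)
critic = nn.Linear(4, 1)

# One optimizer, two parameter groups with their own learning rates
optimizer = optim.Adam([
    {"params": actor.parameters(), "lr": 3e-4},   # smaller steps for the policy
    {"params": critic.parameters(), "lr": 1e-3},  # larger steps for the value function
])

for group in optimizer.param_groups:
    print(group["lr"])
```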

## 10. Exercises

### Exercise 1: Understanding Policy Gradients
Why does the policy gradient formula use $\log \pi$ instead of $\pi$? Explain from both mathematical and intuitive perspectives.

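### Exercise 2: Implement Improvements
The `REINFORCE` class above already applies two variance-reduction techniques. Disable each in turn and observe the effect:
1. Causality: weight every step by the full-episode return instead of the reward-to-go $G_t$
2. Return normalization: pass `normalize=False` to `compute_returns`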

### Exercise 3: Continuous Control
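Use `ContinuousPolicy` to train A2C on `Pendulum-v1`. Its torque range is [-2, 2], so scale the tanh-bounded action by 2 before passing it to `env.step`.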

### Exercise 4: Hyperparameter Tuning
Experiment with different GAE λ values (0.9, 0.95, 0.99, 1.0) and compare learning curve stability.

In [None]:
# Exercise code space
# Write your solutions here

pass

---

## References

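1. Sutton & Barto (2018). Reinforcement Learning: An Introduction, 2nd ed., Chapter 13
2. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning
3. Sutton et al. (1999). Policy gradient methods for reinforcement learning with function approximation
4. Schulman et al. (2016). High-dimensional continuous control using generalized advantage estimation
5. Mnih et al. (2016). Asynchronous methods for deep reinforcement learning (A3C)
6. OpenAI Spinning Up: https://spinningup.openai.com/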
