# CA6: Policy Gradient Methods

## Session 6: From Value-Based to Policy-Based Reinforcement Learning

### Overview

Welcome to Session 6 of the Deep Reinforcement Learning course. In the previous sessions, we focused on **value-based methods** like Deep Q-Networks (DQN) and its variants. These methods learn a value function and then derive a policy from it.

In this session, we will explore a fundamentally different approach: **Policy Gradient (PG) methods**. Instead of learning a value function, we will directly parameterize and optimize the policy itself. This approach has several advantages, particularly in continuous action spaces and stochastic environments.

### Learning Objectives

By the end of this session, you will be able to:
- Understand the theoretical foundations of Policy Gradient methods.
- Differentiate between value-based and policy-based approaches.
- Implement the REINFORCE algorithm from scratch.
- Understand and mitigate the high variance problem in policy gradients.
- Implement Actor-Critic methods, including A2C.
- Analyze the trade-offs between different policy gradient algorithms.

### Notebook Structure

- **Part 1: Introduction to Policy Gradient Methods**: Theory, advantages, and mathematical foundations.
- **Part 2: The REINFORCE Algorithm**: Derivation and implementation of the simplest policy gradient method.
- **Part 3: Actor-Critic Methods**: Introducing a critic to reduce variance.
- **Part 4: Advanced Actor-Critic (A2C/A3C)**: Multi-worker actor-critic variants with n-step advantages and entropy regularization.
- **Part 5: Practical Exercises and Assignments**: Hands-on coding exercises.
- **Part 6: Theoretical Questions and Answers**: Deep dive into the theory.

---

## Part 1: Introduction to Policy Gradient Methods

### 1.1 What are Policy Gradient Methods?

Policy Gradient methods are a class of algorithms in reinforcement learning that directly learn a parameterized policy, denoted as **π_θ(a|s)**. The policy is typically represented by a neural network with parameters **θ**. The goal is to adjust the parameters **θ** to maximize the expected total reward.

**Key Idea**: Instead of learning the values of state-action pairs, we directly learn the probability of taking each action in each state. The learning process involves updating the policy parameters in the direction of the gradient of the expected return.

### 1.2 Value-Based vs. Policy-Based Methods

| Feature | Value-Based Methods (e.g., DQN) | Policy-Based Methods (e.g., REINFORCE) |
| :--- | :--- | :--- |
| **What is Learned** | Learns a value function Q(s,a). | Learns a policy π(a|s). |
| **Policy** | Implicitly derived from the value function (e.g., ε-greedy). | Explicitly represented and learned. |
| **Action Space** | Primarily for discrete action spaces. | Can handle both discrete and continuous action spaces naturally. |
| **Policy Type** | Typically deterministic (or ε-greedy). | Can learn stochastic policies. |
| **Optimization** | Minimizes a TD error (e.g., MSE). | Maximizes an objective function (expected return) via gradient ascent. |

### 1.3 Advantages and Disadvantages of Policy Gradient Methods

**Advantages:**

1.  **Continuous Action Spaces**: PG methods can naturally handle continuous action spaces by outputting the parameters of a probability distribution (e.g., mean and standard deviation for a Gaussian distribution).
2.  **Stochastic Policies**: They can learn truly stochastic policies, which is beneficial in environments where the optimal policy is stochastic (e.g., rock-paper-scissors).
3.  **Simpler Action Selection**: Once the policy is learned, selecting an action is a simple matter of sampling from the policy's output distribution. There's no need for a complex maximization step over Q-values.
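4.  **Better Convergence Properties**: The action probabilities change smoothly as the parameters are updated, whereas with a greedy value-based policy a small change in Q-values can abruptly switch the selected action. Value-based methods can also suffer from instabilities due to bootstrapping.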
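
To make the first advantage concrete, here is a minimal sketch of a Gaussian policy for a one-dimensional continuous action. The mean and standard deviation would normally come from the policy network; the values `0.3` and `0.5` below are hypothetical, and the helper names are introduced only for this illustration.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at `action`."""
    var = std ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - (action - mean) ** 2 / (2.0 * var)


def gaussian_policy_sample(mean, std, rng):
    """Sample a continuous action and return it with its log-probability."""
    action = rng.gauss(mean, std)
    return action, gaussian_log_prob(action, mean, std)


rng = random.Random(0)
# Hypothetical network outputs for a single state (assumed values)
action, log_prob = gaussian_policy_sample(mean=0.3, std=0.5, rng=rng)
print(f"action={action:.3f}, log_prob={log_prob:.3f}")
```

The log-probability returned here plays the same role as `log π_θ(a|s)` for a discrete policy, so the same gradient estimators apply.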

**Disadvantages:**

1.  **High Variance**: The policy gradient estimate is often very noisy, leading to high variance during training. This can make learning slow and unstable.
2.  **Sample Inefficiency**: Basic PG methods are often less sample-efficient than their value-based counterparts, requiring more interactions with the environment to learn.
3.  **Local Optima**: The optimization process can get stuck in local optima, as it's performing gradient ascent on a potentially complex, non-convex landscape.

### 1.4 Mathematical Foundations: The Policy Objective Function

The goal of policy gradient methods is to find the optimal policy parameters **θ*** that maximize the expected total reward. We define an objective function, **J(θ)**, which represents the quality of the policy.

For an episodic environment, the objective function is the expected return from the starting state distribution:

**J(θ) = E_τ∼π_θ [R(τ)] = E_τ∼π_θ [Σ_{t=0}^{T} r_t]**

Where:
- **τ** is a trajectory (or episode) sampled from the policy **π_θ**.
- **R(τ)** is the total reward of the trajectory.

The core of policy gradient methods is to update the policy parameters **θ** using gradient ascent:

**θ_{k+1} = θ_k + α ∇_θ J(θ_k)**

Where **α** is the learning rate and **∇_θ J(θ)** is the policy gradient. The main challenge is to compute this gradient.


### 1.5 The Policy Gradient Theorem

The **Policy Gradient Theorem** gives an analytical expression for the gradient of the objective function **J(θ)** that does not require knowing the dynamics of the environment and can be estimated from samples.

The theorem states that the gradient of the objective function is given by:

**∇_θ J(θ) = E_τ∼π_θ [ (Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)) (Σ_{t=0}^{T} r(s_t, a_t)) ]**

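Because an action taken at time *t* cannot influence rewards received before *t*, each log-probability term only needs to be weighted by the rewards that follow it. This gives the lower-variance form: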

**∇_θ J(θ) = E_τ∼π_θ [ Σ_{t=0}^{T} G_t ∇_θ log π_θ(a_t|s_t) ]**

Where:
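- **G_t = Σ_{k=t}^{T} r_k** is the **return** (the sum of rewards from time step *t* to the end of the episode). With a discount factor **γ**, as used in the implementations in this notebook, it becomes **G_t = Σ_{k=t}^{T} γ^{k-t} r_k**.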
- **∇_θ log π_θ(a_t|s_t)** is the gradient of the log-probability of taking action *a_t* in state *s_t*. This term tells us how to change the policy parameters to make the action *a_t* more or less likely.

**Intuition:**
- If the return **G_t** is high, we want to increase the probability of taking action **a_t** in state **s_t**.
- If the return **G_t** is low (or negative), we want to decrease the probability of taking action **a_t** in state **s_t**.
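
The estimator can be checked numerically on a tiny problem. The sketch below assumes a one-step, two-action bandit with a softmax policy over two logits (a toy setup chosen for illustration, not part of this notebook's environments). It compares the exact gradient of the expected reward with the sample average of `reward * grad log pi`.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def exact_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) r(a), summing over all actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, r) in enumerate(zip(probs, rewards)):
        score = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += p * r * score[i]
    return grad


def sampled_gradient(logits, rewards, num_samples, rng):
    """Monte-Carlo estimate: average of r(a) * grad log pi(a) with a ~ pi."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        score = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += rewards[a] * score[i] / num_samples
    return grad


logits = [0.0, 0.0]   # uniform policy over two actions
rewards = [1.0, 0.0]  # action 0 is the better one
exact = exact_gradient(logits, rewards)
estimate = sampled_gradient(logits, rewards, num_samples=20000, rng=random.Random(0))
print("exact:   ", exact)
print("estimate:", estimate)
```

Both gradients point toward raising the logit of the rewarded action, matching the intuition above.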

This theorem is powerful because it allows us to estimate the gradient using trajectories sampled from the environment, without needing to know the transition probabilities **P(s'|s,a)**.
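
A short sketch of why the transition probabilities drop out, using the log-derivative identity ∇p = p ∇log p. The symbol ρ(s_0) for the initial-state distribution and p_θ(τ) for the trajectory probability are notation introduced here for illustration.

$$
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
   = E_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big], \\
\log p_\theta(\tau)
  &= \log \rho(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t), \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t).
\end{aligned}
$$

Only the policy terms depend on **θ**, so the gradients of log ρ and log P vanish and the expectation can be estimated from sampled trajectories alone.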

---

## Part 2: The REINFORCE Algorithm (Monte-Carlo Policy Gradient)

The REINFORCE algorithm, also known as Monte-Carlo Policy Gradient, is one of the most fundamental policy gradient algorithms. It directly applies the Policy Gradient Theorem by estimating the gradient from a batch of complete episodes.

### 2.1 The REINFORCE Update Rule

The REINFORCE algorithm collects a set of trajectories by running the current policy **π_θ** in the environment. Then, for each time step *t* in each trajectory, it computes the return **G_t** and uses it to update the policy parameters.

The update rule for a single trajectory is:

**θ ← θ + α Σ_{t=0}^{T} G_t ∇_θ log π_θ(a_t|s_t)**

In practice, we collect a batch of trajectories and average the gradients over the batch.

### 2.2 Algorithm: REINFORCE

1.  **Initialize** the policy network **π_θ** with random parameters **θ**.
2.  **Loop forever** (for each episode):
    a. **Generate** an episode (a trajectory) by running the policy **π_θ**:
       τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T)
    b. **For each time step** t = 0, 1, ..., T:
       i. **Compute** the return **G_t = Σ_{k=t}^{T} r_k**.
    c. **Update** the policy parameters **θ**:
       **θ ← θ + α Σ_{t=0}^{T} G_t ∇_θ log π_θ(a_t|s_t)**
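
The return computation in step (b) can be done in a single backward pass over the rewards. The helper below is a small sketch (the function name is chosen here for illustration); it takes a discount factor `gamma`, and `gamma = 1.0` recovers the plain sum of future rewards.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T], where G_t is the discounted sum of rewards from t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], gamma=1.0))  # [3.0, 2.0, 1.0]
```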

### 2.3 The High Variance Problem

A major drawback of the REINFORCE algorithm is the high variance of the gradient estimate. The return **G_t** can vary significantly depending on the trajectory, even for the same state-action pair. This high variance can make the learning process slow and unstable.

**Sources of Variance:**
1.  **Stochasticity in the Environment**: The environment's transition function and reward function can be stochastic.
2.  **Stochasticity in the Policy**: The policy itself is stochastic, leading to different actions and trajectories.

In the next sections, we will explore techniques to reduce this variance, such as using a baseline and the Actor-Critic architecture.

### 2.4 Implementation of REINFORCE

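Let's implement the REINFORCE algorithm to solve the `CartPole-v1` environment from OpenAI Gym. The code below is written against the classic Gym API (versions before 0.26), where `env.reset()` returns only the observation and `env.step()` returns four values. Newer Gym and Gymnasium releases return `(observation, info)` from `reset()` and five values from `step()`, so those calls need adjusting there.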

First, we need to set up our environment and import the necessary libraries.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import gym
import numpy as np
import matplotlib.pyplot as plt

# Set up the environment
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Hyperparameters
learning_rate = 0.01
gamma = 0.99
num_episodes = 1000
```

Next, we define our policy network. For a discrete action space, the network will output logits for each action, which we can convert to probabilities using a softmax function.

```python
class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        logits = self.fc2(x)
        return logits

policy_net = PolicyNetwork(state_size, action_size)
optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)
```

Now, we can write the main training loop for the REINFORCE algorithm.

```python
def train_reinforce():
    all_rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        rewards = []
        
        # Generate an episode
        while True:
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            logits = policy_net(state_tensor)
            action_dist = Categorical(logits=logits)
            action = action_dist.sample()
            
            log_prob = action_dist.log_prob(action)
            log_probs.append(log_prob)
            
            next_state, reward, done, _ = env.step(action.item())
            rewards.append(reward)
            state = next_state
            
            if done:
                break
        
        all_rewards.append(sum(rewards))
        
        # Compute returns
        returns = []
        G_t = 0
        for r in reversed(rewards):
            G_t = r + gamma * G_t
            returns.insert(0, G_t)
        
        returns = torch.tensor(returns)
        # Normalize returns for stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)
        
        # Compute policy loss
        policy_loss = []
        for log_prob, G in zip(log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        # Update policy
        optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        optimizer.step()
        
        if (episode + 1) % 100 == 0:
            print(f"Episode {episode+1}, Average Reward: {np.mean(all_rewards[-100:])}")
            
    return all_rewards

# Train the agent
rewards_history = train_reinforce()

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(rewards_history)
plt.title('REINFORCE Training on CartPole-v1')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.grid(True)
plt.show()
```

This implementation demonstrates the core concepts of the REINFORCE algorithm. Note the normalization of returns, which is a common trick to stabilize training.
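
To see what the normalization does, the sketch below reproduces it with the standard library on a three-step episode. It assumes the sample standard deviation (dividing by n - 1), which is what `torch.Tensor.std()` computes by default. Subtracting the mean acts roughly like a constant baseline: steps with above-average return are reinforced and the rest are discouraged.

```python
import statistics


def normalize_returns(returns, eps=1e-9):
    """Shift returns to zero mean and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample std; requires at least two returns
    return [(g - mean) / (std + eps) for g in returns]


normalized = normalize_returns([1.75, 1.5, 1.0])
print(normalized)
```

After normalization the last step gets a negative weight even though its raw return was positive, so the scale of the rewards no longer dictates the step size.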

---

## Part 3: Actor-Critic Methods - Reducing Variance

As we've seen, the REINFORCE algorithm suffers from high variance because the return **G_t** can be very noisy. Actor-Critic methods address this by introducing a **critic**, which learns a value function to assist the **actor** (the policy).

### 3.1 Introducing a Baseline

A simple way to reduce the variance of the policy gradient is to subtract a **baseline** from the return **G_t**. The baseline, **b(s_t)**, is a function of the state **s_t**. The policy gradient with a baseline is:

**∇_θ J(θ) = E_τ∼π_θ [ Σ_{t=0}^{T} (G_t - b(s_t)) ∇_θ log π_θ(a_t|s_t) ]**

This modification does not change the expected value of the gradient, so it doesn't introduce bias. However, a well-chosen baseline can significantly reduce the variance.
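
Both claims can be verified exactly on a toy problem. The sketch below assumes a one-step, two-action bandit with a uniform softmax policy (an illustrative setup, not one of this notebook's environments) and computes the mean and variance of the first component of the estimator `(r - b) * grad log pi` by enumerating both actions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and variance of (r(a) - b) * d log pi(a) / d logit_0 for a ~ pi."""
    probs = softmax(logits)
    values = []
    for a, r in enumerate(rewards):
        score = (1.0 if a == 0 else 0.0) - probs[0]  # d log pi(a) / d logit_0
        values.append((r - baseline) * score)
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var


logits = [0.0, 0.0]
rewards = [11.0, 10.0]  # both rewards are large; only their difference matters
mean_no_b, var_no_b = estimator_stats(logits, rewards, baseline=0.0)
mean_b, var_b = estimator_stats(logits, rewards, baseline=10.5)  # average reward
print(f"no baseline:   mean={mean_no_b}, variance={var_no_b}")
print(f"with baseline: mean={mean_b}, variance={var_b}")
```

The mean is 0.25 in both cases, while the variance drops from 27.5625 to 0 once the average reward is subtracted.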

**What is a good baseline?**
A good baseline is the **state-value function, V(s_t)**. The term **A(s_t, a_t) = G_t - V(s_t)** is an estimate of the **advantage function**, which measures how much better than average it is to take action *a_t* in state *s_t*.

Using the advantage function, the policy gradient becomes:

**∇_θ J(θ) = E_τ∼π_θ [ Σ_{t=0}^{T} A(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ]**

### 3.2 The Actor-Critic Architecture

This leads us to the Actor-Critic architecture, which consists of two components:

1.  **The Actor**: A policy network **π_θ(a|s)** that controls how the agent acts.
2.  **The Critic**: A value network **V_φ(s)** that estimates the state-value function **V(s)**.

**How it works:**
- The **actor** decides which action to take.
- The **critic** evaluates the action by estimating state values, from which the advantage is computed.
- The **actor** updates its policy in the direction suggested by the critic.
- The **critic** updates its value function to be more accurate.

**The Actor-Critic Update Cycle:**

1.  **Actor Update (Policy Update)**:
    - The actor uses the critic's value estimate to compute the advantage:
      **A(s_t, a_t) ≈ r_t + γV_φ(s_{t+1}) - V_φ(s_t)**
    - The actor's parameters **θ** are updated using this advantage:
      **θ ← θ + α_actor A(s_t, a_t) ∇_θ log π_θ(a_t|s_t)**

2.  **Critic Update (Value Update)**:
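    - The critic's parameters **φ** are updated to reduce the TD error **δ_t**, the gap between its current estimate **V_φ(s_t)** and the bootstrapped one-step target **r_t + γV_φ(s_{t+1})** (treated as a constant when differentiating):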
      **δ_t = r_t + γV_φ(s_{t+1}) - V_φ(s_t)**
      **φ ← φ + α_critic δ_t ∇_φ V_φ(s_t)**
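
The update cycle can be traced by hand in a tabular setting. The sketch below assumes a softmax actor with one logit per action and a critic stored as a table, so that the gradient of **V(s_t)** with respect to its own table entry is 1. The state names, step sizes, and function name are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma, alpha_actor, alpha_critic):
    """Apply one actor-critic update in place and return the TD error.

    theta: dict mapping state -> list of action logits (tabular softmax actor)
    V:     dict mapping state -> value estimate (tabular critic)
    """
    target = r if done else r + gamma * V[s_next]  # no bootstrap at terminal states
    delta = target - V[s]                          # TD error, used as the advantage
    probs = softmax(theta[s])                      # probabilities before the update
    for i in range(len(theta[s])):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += alpha_actor * delta * grad_log_pi
    V[s] += alpha_critic * delta
    return delta


theta = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
V = {"A": 0.0, "B": 1.0}
delta = actor_critic_step(theta, V, s="A", a=0, r=1.0, s_next="B", done=False,
                          gamma=0.9, alpha_actor=0.1, alpha_critic=0.5)
print(delta, theta["A"], V["A"])
```

The positive TD error raises the logit of the action that was taken and moves **V(A)** toward the one-step target.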

This approach typically has lower variance than REINFORCE and can update at every step instead of waiting for the episode to end. The price is bias: the advantage estimate is only as good as the critic's value estimates.

### 3.3 Implementation of a Basic Actor-Critic Algorithm

Let's implement a simple one-step Actor-Critic algorithm. We'll need two networks: one for the actor (policy) and one for the critic (value function).

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import gym
import numpy as np
import matplotlib.pyplot as plt

# Environment setup
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Hyperparameters
actor_lr = 0.001
critic_lr = 0.005
gamma = 0.99
num_episodes = 1000

class Actor(nn.Module):
    def __init__(self, state_size, action_size):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        logits = self.fc2(x)
        return logits

class Critic(nn.Module):
    def __init__(self, state_size):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        value = self.fc2(x)
        return value

actor = Actor(state_size, action_size)
critic = Critic(state_size)
actor_optimizer = optim.Adam(actor.parameters(), lr=actor_lr)
critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)

def train_actor_critic():
    all_rewards = []
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        
        while True:
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            
            # Actor selects an action
            logits = actor(state_tensor)
            action_dist = Categorical(logits=logits)
            action = action_dist.sample()
            log_prob = action_dist.log_prob(action)
            
            # Environment step
            next_state, reward, done, _ = env.step(action.item())
            episode_reward += reward
            
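            # Critic evaluates the current state
            value = critic(state_tensor)
            next_state_tensor = torch.FloatTensor(next_state).unsqueeze(0)
            
            # Bootstrap from the next state without tracking gradients, so the
            # TD target is treated as a constant (semi-gradient update).
            # Terminal states have zero value.
            with torch.no_grad():
                next_value = torch.zeros(1, 1) if done else critic(next_state_tensor)
            
            # One-step TD error, also used as the advantage estimate
            td_error = reward + gamma * next_value - value
            advantage = td_error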
            
            # Actor update
            actor_loss = -log_prob * advantage.detach()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()
            
            # Critic update
            critic_loss = td_error.pow(2)
            critic_optimizer.zero_grad()
            critic_loss.backward()
            critic_optimizer.step()
            
            state = next_state
            
            if done:
                break
        
        all_rewards.append(episode_reward)
        if (episode + 1) % 100 == 0:
            print(f"Episode {episode+1}, Average Reward: {np.mean(all_rewards[-100:])}")
            
    return all_rewards

# Train the agent
ac_rewards_history = train_actor_critic()

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(ac_rewards_history)
plt.title('Actor-Critic Training on CartPole-v1')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.grid(True)
plt.show()
```

This implementation shows a basic Actor-Critic setup. The actor and critic are updated at every time step, so learning does not have to wait for the end of an episode as it does in REINFORCE.

---

## Part 4: Advanced Actor-Critic Methods (A2C and A3C)

The basic Actor-Critic algorithm can be improved further. Two popular and powerful extensions are the **Advantage Actor-Critic (A2C)** and the **Asynchronous Advantage Actor-Critic (A3C)**.

### 4.1 Advantage Actor-Critic (A2C)

A2C is a synchronous, deterministic variant of A3C. It waits for all workers to finish their segment of experience before performing a single batched update of the shared network, which tends to use GPUs more efficiently.

Rather than relying on the one-step TD error alone, A2C (like A3C) commonly uses an **n-step return** to estimate the advantage.

**n-step Advantage:**
**A(s_t, a_t) ≈ (Σ_{i=0}^{n-1} γ^i r_{t+i}) + γ^n V(s_{t+n}) - V(s_t)**

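Choosing **n** trades bias against variance: a small **n** leans on the critic's estimate (more bias, less variance), while a large **n** leans on sampled rewards (less bias, more variance).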
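
A minimal sketch of the n-step advantage for a recorded episode. It assumes `rewards[k]` holds r_k and `values[k]` holds the critic's estimate V(s_k), with one extra entry for the state after the last reward (0 if the episode terminated). Truncating the sum at the episode end is a choice made for this sketch.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """n-step advantage: discounted rewards over n steps, plus bootstrap, minus V(s_t).

    rewards: [r_0, ..., r_T]
    values:  [V(s_0), ..., V(s_{T+1})], with V(s_{T+1}) = 0 after termination
    """
    end = min(t + n, len(rewards))  # truncate if fewer than n steps remain
    n_step_return = 0.0
    for i, k in enumerate(range(t, end)):
        n_step_return += (gamma ** i) * rewards[k]
    n_step_return += (gamma ** (end - t)) * values[end]  # bootstrap from the critic
    return n_step_return - values[t]


rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 1.0, 0.0]  # hypothetical critic outputs
print(n_step_advantage(rewards, values, t=0, n=1, gamma=0.5))  # one-step TD error
print(n_step_advantage(rewards, values, t=0, n=2, gamma=0.5))
print(n_step_advantage(rewards, values, t=1, n=5, gamma=0.5))  # truncated at episode end
```

With `n=1` this reduces to the one-step TD error used by the basic Actor-Critic algorithm.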

### 4.2 Asynchronous Advantage Actor-Critic (A3C)

A3C is a parallel version of the Actor-Critic algorithm. It uses multiple actors, each with its own copy of the environment, to collect experience in parallel.

**How A3C works:**
1.  **Global Network**: There is a single global network with parameters **θ** (for the actor) and **φ** (for the critic).
2.  **Worker Actors**: Multiple worker threads are created. Each worker has its own copy of the actor and critic networks and its own environment.
3.  **Parallel Experience Collection**: Each worker interacts with its environment for a fixed number of steps, collecting a trajectory of experience.
4.  **Asynchronous Updates**: After collecting experience, each worker computes the gradients for the actor and critic and updates the global network asynchronously.

**Advantages of A3C:**
- **Decorrelated Experience**: Because the workers are exploring different parts of the state space, their experiences are more decorrelated, which stabilizes training.
- **Faster Training**: Parallelism allows for much faster training times.
- **No Replay Buffer**: A3C does not require an experience replay buffer, which saves memory.
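
The worker pattern can be sketched with Python threads. This toy version makes several assumptions: the "network" is a single shared float, each worker's gradient comes from a stand-in quadratic loss `(theta - target)^2` rather than an actor-critic loss, and a lock guards the shared parameter for simplicity. It only illustrates the sync, compute, push cycle in which workers do not wait for each other.

```python
import threading


class GlobalParams:
    """Shared parameter (a single float in this toy example) guarded by a lock."""

    def __init__(self, theta):
        self.theta = theta
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return self.theta

    def apply_gradient(self, grad, lr):
        with self.lock:
            self.theta -= lr * grad


def worker(global_params, target, num_updates, lr):
    for _ in range(num_updates):
        local_theta = global_params.snapshot()   # 1. sync local copy with global
        grad = 2.0 * (local_theta - target)      # 2. gradient of the toy loss, computed locally
        global_params.apply_gradient(grad, lr)   # 3. push the (possibly stale) gradient


shared = GlobalParams(theta=0.0)
threads = [threading.Thread(target=worker, args=(shared, 3.0, 200, 0.05))
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared.theta)  # close to the toy target of 3.0
```

Because a worker may compute its gradient from a slightly outdated copy, updates are not exactly reproducible from run to run, which is the contrast with the synchronous A2C variant.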

### 4.3 A2C/A3C Implementation Details

A common implementation detail for A2C and A3C is to have the actor and critic share the initial layers of their networks. This can improve learning efficiency as both components can benefit from the shared feature representation.

**Entropy Regularization:**
To encourage exploration, a term is often added to the actor's loss function that penalizes the policy for being too deterministic. This is the **entropy** of the policy.

**Actor Loss with Entropy Regularization:**
**L_actor = - (A(s_t, a_t) log π_θ(a_t|s_t) + β H(π_θ(·|s_t)))**

Where:
- **H(π_θ(·|s_t))** is the entropy of the policy distribution.
- **β** is a hyperparameter that controls the strength of the entropy regularization.

This encourages the policy to maintain some randomness, which helps with exploration.
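
A small sketch of the entropy term for a discrete policy. The loss is written with the log-probability of the chosen action (differentiating it yields the policy-gradient term), the advantage is treated as a constant, and `beta=0.01` is an assumed value for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H = -sum_a p(a) log p(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss_with_entropy(logits, action, advantage, beta):
    """Negative of (advantage * log pi(action) + beta * entropy)."""
    probs = softmax(logits)
    log_prob = math.log(probs[action])
    return -(advantage * log_prob + beta * entropy(probs))


print(entropy([0.5, 0.5]))  # maximal for two actions: log(2)
print(entropy([1.0, 0.0]))  # deterministic policy: zero entropy
print(actor_loss_with_entropy([0.0, 0.0], action=0, advantage=1.0, beta=0.01))
```

Since the entropy enters the loss with a negative sign, minimizing the loss pushes the policy toward higher entropy unless the advantage term outweighs it.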