# Soft Actor-Critic (SAC) for Reinforcement Learning: Complete Implementation and Analysis

**Course:** Deep Reinforcement Learning  
**Assignment:** HW4 - Soft Actor-Critic Agent (115 Points)  
**Total Points:** 115

---

## Abstract

This notebook presents a comprehensive implementation of the Soft Actor-Critic (SAC) algorithm [1], a state-of-the-art off-policy deep reinforcement learning method. SAC maximizes a trade-off between expected return and entropy, encouraging exploration while learning optimal policies. We implement three variants: (1) **Online SAC** with environment interaction, (2) **Offline SAC** trained on fixed datasets, and (3) **Conservative SAC** using Conservative Q-Learning (CQL) [2] for robust offline learning. Experimental validation on the CartPole-v1 environment demonstrates the effectiveness of entropy regularization and the importance of conservatism in offline settings.

**Keywords:** Soft Actor-Critic, Maximum Entropy RL, Offline RL, Conservative Q-Learning, Deep RL

---

## I. INTRODUCTION

### A. Background

Reinforcement Learning (RL) aims to learn optimal policies by maximizing cumulative rewards through environment interaction. Traditional RL algorithms face challenges in exploration-exploitation trade-offs and sample efficiency. Actor-critic methods combine value-based and policy-based approaches, using a critic to estimate value functions and an actor to update policies.

### B. Soft Actor-Critic Overview

Soft Actor-Critic (SAC) [1] addresses these challenges through:
1. **Entropy Maximization**: Augments the standard RL objective with an entropy term
2. **Off-Policy Learning**: Improves sample efficiency through experience replay
3. **Stochastic Policies**: Maintains exploration throughout training
4. **Automatic Temperature Tuning**: Adaptively adjusts exploration-exploitation balance

### C. Contributions

This implementation provides:
- Complete SAC agent for discrete action spaces
- Comparative analysis of online vs offline training paradigms
- Conservative Q-Learning integration for offline RL
- Empirical evaluation on standard benchmarks

---

## II. THEORETICAL FOUNDATIONS

### A. Maximum Entropy Reinforcement Learning

Standard RL maximizes expected cumulative reward:

$$J_{\text{standard}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$

SAC extends this with entropy regularization:

$$J_{\text{SAC}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$$

where $\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s_t)]$ is the policy entropy and $\alpha > 0$ is the temperature parameter controlling exploration.
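
To make the entropy term concrete, the sketch below computes the entropy of a discrete policy and the entropy-augmented discounted return of a short trajectory. The helper names (`policy_entropy`, `soft_return`) and the toy rewards and probabilities are assumptions chosen for illustration; they are not part of the assignment code.

```python
import numpy as np

def policy_entropy(probs) -> float:
    """Entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s) for a single state."""
    probs = np.asarray(probs, dtype=np.float64)
    nonzero = probs[probs > 0]  # convention: 0 * log 0 = 0
    return float(-(nonzero * np.log(nonzero)).sum())

def soft_return(rewards, action_probs, alpha: float, gamma: float) -> float:
    """Discounted sum of r_t + alpha * H(pi(.|s_t)) along one trajectory."""
    total = 0.0
    for t, (reward, probs) in enumerate(zip(rewards, action_probs)):
        total += gamma ** t * (reward + alpha * policy_entropy(probs))
    return total

# Toy trajectory: three steps with reward 1 each and two available actions.
rewards = [1.0, 1.0, 1.0]
uniform_policy = [[0.5, 0.5]] * 3   # maximum entropy for two actions
greedy_policy = [[1.0, 0.0]] * 3    # zero entropy

greedy_objective = soft_return(rewards, greedy_policy, alpha=0.2, gamma=0.99)
uniform_objective = soft_return(rewards, uniform_policy, alpha=0.2, gamma=0.99)
print(f"deterministic policy: {greedy_objective:.4f}")
print(f"uniform policy:       {uniform_objective:.4f}")
```

With equal rewards, the uniform policy scores higher under the SAC objective because each step earns an additional bonus of $\alpha \ln 2$.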

### B. Soft Policy Iteration

SAC alternates between:

**1) Soft Policy Evaluation**: Compute the soft Q-function satisfying the soft Bellman equation:

$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}[V^{\pi}(s_{t+1})]$$

where the soft state-value function is:

$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}[Q^{\pi}(s_t, a_t) - \alpha \log \pi(a_t|s_t)]$$

**2) Soft Policy Improvement**: Update the policy towards the Boltzmann distribution induced by the current soft Q-function:

$$\pi_{\text{new}} = \arg\min_{\pi'} D_{\text{KL}}\left(\pi'(\cdot|s_t) \| \frac{\exp\left(\frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z(s_t)}\right)$$

where $Z(s_t)$ is the normalizing constant. The factor $1/\alpha$ keeps this target consistent with the temperature used in the soft value function above.
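
For a tabular Q-function and an unrestricted policy class, the KL projection has a closed form: with temperature $\alpha$, the minimizer is the Boltzmann policy $\pi(a|s) \propto \exp(Q(s,a)/\alpha)$. The following is a minimal sketch of that case; the function names and the toy Q-table are assumptions for illustration only.

```python
import numpy as np

def soft_value(q: np.ndarray, pi: np.ndarray, alpha: float) -> np.ndarray:
    """V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s)); q and pi have shape (S, A)."""
    return (pi * (q - alpha * np.log(pi))).sum(axis=1)

def soft_policy_improvement(q: np.ndarray, alpha: float) -> np.ndarray:
    """Boltzmann policy pi(a|s) proportional to exp(Q(s,a) / alpha)."""
    logits = q / alpha
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Toy Q-table with two states and two actions (hypothetical values).
q_table = np.array([[1.0, 2.0],
                    [0.0, 0.0]])
alpha = 0.5

pi_new = soft_policy_improvement(q_table, alpha)
v_new = soft_value(q_table, pi_new, alpha)

# Under the Boltzmann policy the soft value equals alpha * logsumexp(Q / alpha).
v_closed_form = alpha * np.log(np.exp(q_table / alpha).sum(axis=1))
print(pi_new)
print(v_new, v_closed_form)
```

No other policy achieves a higher soft value for the same Q-table, which is why the improvement step never decreases the soft objective.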

---

## III. METHODOLOGY

### A. Network Architecture

We employ feedforward neural networks with the following architecture:
- **Input Layer**: State dimension $d_s$
- **Hidden Layer 1**: 256 neurons with ReLU activation
- **Hidden Layer 2**: 256 neurons with ReLU activation  
- **Output Layer**: Action dimension $d_a$ with task-specific activation

### B. SAC Components

**1) Critic Networks**: Two Q-networks $Q_{\theta_1}, Q_{\theta_2}$ to reduce overestimation bias (clipped double-Q learning)

**2) Target Networks**: Slowly-updated copies $Q_{\theta'_1}, Q_{\theta'_2}$ for stable training

**3) Actor Network**: Policy $\pi_\phi$ with Softmax output for discrete actions

**4) Temperature Parameter**: Learnable $\alpha$ with automatic tuning

### C. Loss Functions

**Critic Loss** (Mean Squared Bellman Error):

$$L_Q(\theta_i) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_{\theta_i}(s,a) - y\right)^2\right]$$

where the target is:

$$y = r + \gamma(1-d)\sum_{a'} \pi_\phi(a'|s')\left[\min_{j=1,2} Q_{\theta'_j}(s',a') - \alpha \log \pi_\phi(a'|s')\right]$$

**Actor Loss**:

$$L_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[\sum_a \pi_\phi(a|s)\left(\alpha \log \pi_\phi(a|s) - \min_{j=1,2} Q_{\theta_j}(s,a)\right)\right]$$

**Temperature Loss**:

$$L_\alpha = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_\phi}\left[-\alpha(\log \pi_\phi(a|s) + \bar{\mathcal{H}})\right]$$

where $\bar{\mathcal{H}}$ is the target entropy.
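
The temperature loss can be sketched in PyTorch as follows. This is a minimal illustration, not the notebook's implementation: the parameterization $\alpha = \exp(\texttt{log\_alpha})$, the target entropy of $0.98 \ln |\mathcal{A}|$, and all variable names are assumptions. For discrete actions the expectation over $a \sim \pi_\phi$ is computed exactly as a probability-weighted sum.

```python
import math
import torch

def temperature_loss(log_alpha: torch.Tensor, probs: torch.Tensor,
                     target_entropy: float) -> torch.Tensor:
    """L_alpha = E_s[ sum_a pi(a|s) * (-alpha * (log pi(a|s) + target_entropy)) ]."""
    log_probs = torch.log(probs + 1e-8)
    # The policy output is treated as a constant: only alpha receives gradient.
    expected_gap = (probs * (log_probs + target_entropy)).sum(dim=1).detach()
    return -(log_alpha.exp() * expected_gap).mean()

action_dim = 2
target_entropy = 0.98 * math.log(action_dim)    # assumed heuristic, not from the source
log_alpha = torch.zeros(1, requires_grad=True)  # alpha = exp(0) = 1.0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=1e-4)

# A nearly deterministic policy has entropy below the target.
peaked_probs = torch.tensor([[0.99, 0.01]])
loss = temperature_loss(log_alpha, peaked_probs, target_entropy)
alpha_optimizer.zero_grad()
loss.backward()
alpha_optimizer.step()
print(f"alpha after one step: {log_alpha.exp().item():.6f}")
```

When the policy's entropy falls below $\bar{\mathcal{H}}$, the gradient step raises $\alpha$ and strengthens the entropy bonus; when entropy exceeds the target, $\alpha$ shrinks.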

### D. Conservative Q-Learning

For offline RL, CQL adds a regularization term:

$$L_{\text{CQL}}(\theta) = \alpha_{\text{CQL}}\left(\mathbb{E}_{s \sim \mathcal{D}}\left[\log\sum_a \exp Q_\theta(s,a)\right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q_\theta(s,a)]\right) + L_Q(\theta)$$

This pushes down Q-values for out-of-distribution actions while maintaining values for in-dataset actions.
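
For discrete actions the regularizer can be evaluated exactly, since the critic outputs one Q-value per action. The sketch below shows one way to compute it in PyTorch; `cql_regularizer` and the example tensors are hypothetical names and values used only for illustration.

```python
import torch

def cql_regularizer(q_values: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """E_s[log sum_a exp Q(s,a)] - E_(s,a)[Q(s,a)] over a batch drawn from the dataset.

    q_values: (batch, num_actions) critic outputs
    actions:  (batch,) int64 indices of the actions stored in the dataset
    """
    log_sum_exp = torch.logsumexp(q_values, dim=1)
    dataset_q = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    return (log_sum_exp - dataset_q).mean()

# One state where the action absent from the dataset has the larger Q-value.
q_values = torch.tensor([[10.0, 15.0]])
penalty_when_data_is_a0 = cql_regularizer(q_values, torch.tensor([0]))
penalty_when_data_is_a1 = cql_regularizer(q_values, torch.tensor([1]))
print(penalty_when_data_is_a0.item(), penalty_when_data_is_a1.item())
```

The penalty is large when the highest Q-value belongs to an action the dataset does not contain, and close to zero when the dataset action already dominates. The full critic loss would add this term, scaled by $\alpha_{\text{CQL}}$, to the Bellman error $L_Q(\theta)$.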

---

## References

[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in *ICML*, 2018.

[2] A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative Q-learning for offline reinforcement learning," in *NeurIPS*, 2020.


## IV. IMPLEMENTATION

### A. Environment Setup and Dependencies

We begin by importing required libraries and setting random seeds for reproducibility.


In [None]:
"""
Dependencies and Random Seed Configuration
============================================
This cell imports all necessary libraries and configures random seeds
for reproducible experiments across PyTorch, NumPy, and Python's random module.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
import torch.optim as optim
import numpy as np
import random
import gym
import matplotlib.pyplot as plt
from typing import Tuple, Optional

# Set random seeds for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

print("✓ All libraries imported successfully")
print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ Random seed set to: {seed}")


### B. Neural Network Architecture (8 Points)

The `Network` class implements a 3-layer feedforward neural network that serves as the foundation for both actor and critic networks in our SAC implementation.

**Architecture Details:**
- **Layer 1**: Input $\rightarrow$ 256 neurons (ReLU activation)
- **Layer 2**: 256 $\rightarrow$ 256 neurons (ReLU activation)
- **Layer 3**: 256 $\rightarrow$ Output (Configurable activation)

**Design Rationale:**
1. **Hidden Layer Size (256)**: Sufficient capacity for CartPole while avoiding overfitting
2. **ReLU Activation**: Provides non-linearity and computational efficiency
3. **Modular Output Activation**: Allows `Identity` for critics and `Softmax` for actor

**Mathematical Formulation:**

$$h_1 = \text{ReLU}(W_1 x + b_1)$$
$$h_2 = \text{ReLU}(W_2 h_1 + b_2)$$
$$y = \sigma(W_3 h_2 + b_3)$$

where $\sigma$ is the output activation function and $x \in \mathbb{R}^{d_s}$ is the input state.


In [None]:
class Network(torch.nn.Module):
    """
    Feedforward Neural Network for SAC
    ==================================
    A 3-layer fully-connected neural network used for both actor and critic networks.
    
    Parameters
    ----------
    input_dimension : int
        Dimension of input features (state dimension)
    output_dimension : int
        Dimension of output (number of actions: one probability per action
        for the actor, one Q-value per action for the critics)
    output_activation : torch.nn.Module
        Activation applied to the output layer (default: Identity, as used by
        the critics; pass Softmax for the actor)
        
    Architecture
    ------------
    Input → FC(256) → ReLU → FC(256) → ReLU → FC(output_dim) → output_activation
    
    Returns
    -------
    torch.Tensor
        Network output of shape (batch_size, output_dimension)
    """

    def __init__(self, input_dimension: int, output_dimension: int, 
                 output_activation: torch.nn.Module = torch.nn.Identity()):
        super(Network, self).__init__()
        
        # SOLUTION: Define network layers (4 points)
        # Layer 1: Input → 256 neurons
        self.layer_1 = torch.nn.Linear(input_dimension, 256)
        
        # Layer 2: 256 → 256 neurons
        self.layer_2 = torch.nn.Linear(256, 256)
        
        # Output layer: 256 → output_dimension
        self.output_layer = torch.nn.Linear(256, output_dimension)
        
        # Store output activation function
        self.output_activation = output_activation

    def forward(self, inpt: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the network.
        
        Parameters
        ----------
        inpt : torch.Tensor
            Input tensor of shape (batch_size, input_dimension)
            
        Returns
        -------
        torch.Tensor
            Output tensor of shape (batch_size, output_dimension)
        """
        
        # SOLUTION: Implement forward pass (4 points)
        # First hidden layer with ReLU activation
        x = torch.nn.functional.relu(self.layer_1(inpt))
        
        # Second hidden layer with ReLU activation
        x = torch.nn.functional.relu(self.layer_2(x))
        
        # Output layer with custom activation
        output = self.output_activation(self.output_layer(x))
        
        return output

# Test the network
print("✓ Network class implemented successfully")
test_net = Network(4, 2)
test_input = torch.randn(32, 4)
test_output = test_net(test_input)
print(f"  Test: Input shape {test_input.shape} → Output shape {test_output.shape}")


### C. Experience Replay Buffer

The replay buffer stores transitions $(s, a, r, s', d)$ and enables off-policy learning by sampling mini-batches from past experiences. This breaks temporal correlations and improves sample efficiency.

**Key Features:**
1. **Prioritized Sampling**: Transitions are sampled with weights based on TD error
2. **Circular Buffer**: Older experiences are overwritten when capacity is reached
3. **Type Safety**: Uses NumPy structured arrays for efficient storage

**Importance in SAC:**
- Enables **off-policy** learning (can reuse old transitions)
- **Decorrelates** samples (reduces variance in gradient estimates)
- Allows **offline RL** by freezing the buffer and training without environment interaction

**Buffer Operations:**
- `add_transition()`: Store new $(s, a, r, s', d)$ tuple
- `sample_minibatch()`: Sample a batch for training (priority-weighted random draw online, fixed pre-shuffled order offline)
- `update_weights()`: Update priorities based on TD errors
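
The priority-weighted draw can be illustrated in isolation. The sketch below mirrors the idea of adding a small constant to each stored weight and normalizing; the function name, the example TD errors, and the use of NumPy's `Generator` API are assumptions for illustration rather than the buffer's actual code.

```python
import numpy as np

def sampling_probabilities(weights, delta: float = 1e-4) -> np.ndarray:
    """Turn non-negative priority weights into a probability distribution.

    The small constant delta keeps zero-weight transitions sampleable.
    """
    shifted = np.asarray(weights, dtype=np.float64) + delta
    return shifted / shifted.sum()

# Hypothetical absolute TD errors for four stored transitions.
td_errors = np.array([0.01, 0.5, 0.0, 2.0])
probabilities = sampling_probabilities(td_errors)

rng = np.random.default_rng(0)
# Draw two distinct transitions, favoring those with large TD error.
batch_indices = rng.choice(len(td_errors), size=2, p=probabilities, replace=False)
print(probabilities)
print(batch_indices)
```

Sampling without replacement requires at least as many stored transitions as the batch size, so training should start only once the buffer holds a full batch.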


In [None]:
class ReplayBuffer:
    """
    Experience Replay Buffer for Off-Policy RL
    ==========================================
    Stores and samples transitions for training SAC agent.
    
    Parameters
    ----------
    environment : gym.Env
        The environment to extract state/action space information
    capacity : int, default=500000
        Maximum number of transitions to store
        
    Attributes
    ----------
    buffer : np.ndarray
        Circular buffer storing transitions
    weights : np.ndarray
        Priority weights for sampling
    count : int
        Current number of transitions stored
    """

    def __init__(self, environment: gym.Env, capacity: int = 500000):
        transition_type_str = self.get_transition_type_str(environment)
        self.buffer = np.zeros(capacity, dtype=transition_type_str)
        self.weights = np.zeros(capacity)
        self.head_idx = 0
        self.count = 0
        self.capacity = capacity
        self.max_weight = 10**-2
        self.delta = 10**-4
        self.indices = None
        self.mirror_index = np.random.permutation(range(self.buffer.shape[0]))

    def get_transition_type_str(self, environment: gym.Env) -> str:
        """Create NumPy dtype string for transition tuple."""
        state_shape = environment.observation_space.shape
        state_dim_str = '' if state_shape == () else str(state_shape[0])
        state_type_str = environment.observation_space.sample().dtype.name
        action_dim = environment.action_space.shape
        action_dim_str = '' if action_dim == () else str(action_dim)
        action_type_str = environment.action_space.sample().__class__.__name__

        # Transition format: (state, action, reward, next_state, done)
        transition_type_str = '{0}{1}, {2}{3}, float32, {0}{1}, bool'.format(
            state_dim_str, state_type_str, action_dim_str, action_type_str)

        return transition_type_str

    def add_transition(self, transition: tuple):
        """
        Add a new transition to the buffer.
        
        Parameters
        ----------
        transition : tuple
            (state, action, reward, next_state, done)
        """
        self.buffer[self.head_idx] = transition
        self.weights[self.head_idx] = self.max_weight

        self.head_idx = (self.head_idx + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def sample_minibatch(self, size: int = 100, 
                        batch_deterministic_start: Optional[int] = None) -> np.ndarray:
        """
        Sample a minibatch of transitions.
        
        Parameters
        ----------
        size : int
            Number of transitions to sample
        batch_deterministic_start : int, optional
            If provided, sample deterministically starting from this index
            (used for offline training to iterate through entire buffer)
            
        Returns
        -------
        np.ndarray
            Array of sampled transitions
        """
        set_weights = self.weights[:self.count] + self.delta
        probabilities = set_weights / set_weights.sum()
        
        if batch_deterministic_start is None:
            # Random sampling for online training
            self.indices = np.random.choice(range(self.count), size, 
                                          p=probabilities, replace=False)
        else:
            # Deterministic sampling for offline training
            self.indices = self.mirror_index[batch_deterministic_start:batch_deterministic_start+size]
            
        return self.buffer[self.indices]

    def update_weights(self, prediction_errors: np.ndarray):
        """Update sampling weights based on TD errors."""
        max_error = max(prediction_errors)
        self.max_weight = max(self.max_weight, max_error)
        self.weights[self.indices] = prediction_errors

    def get_size(self) -> int:
        """Return current buffer size."""
        return self.count

print("✓ ReplayBuffer class implemented successfully")


## V. CONCEPTUAL QUESTIONS (18 Points)

This section addresses fundamental theoretical aspects of SAC, offline RL, and conservative Q-learning. Understanding these concepts is crucial for proper implementation and analysis.


### Question 1: SAC Objective Function vs Standard RL (3 points)

**Q:** We know that standard RL maximizes the expected sum of rewards. What is the objective function of SAC algorithm? Compare it to the standard RL loss.

**Answer:**

**Standard RL Objective:**

The standard reinforcement learning objective maximizes the expected cumulative discounted reward:

$$J_{\text{standard}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$

where:
- $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory
- $\gamma \in [0,1)$ is the discount factor
- $r(s_t, a_t)$ is the immediate reward

**SAC Maximum Entropy Objective:**

SAC augments the standard objective with an entropy term to encourage exploration:

$$J_{\text{SAC}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$$

where:
- $\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s_t)]$ is the policy entropy
- $\alpha > 0$ is the temperature parameter controlling exploration strength

**Key Differences:**

| Aspect | Standard RL | SAC |
|--------|------------|-----|
| **Objective** | Maximize reward only | Maximize reward + entropy |
| **Exploration** | Relies on $\epsilon$-greedy or noise | Built into objective function |
| **Policy Type** | Can be deterministic | Always stochastic |
| **Robustness** | May overfit to narrow policies | More robust, diverse behaviors |
| **Sample Efficiency** | Varies (on-policy methods discard data after each update) | Generally higher: off-policy replay reuses past transitions |

**Intuitive Interpretation:**

SAC seeks policies that not only maximize rewards but also remain as **random** (high entropy) as possible given the reward constraint. This encourages:
- **Exploration**: High entropy maintains uncertainty and prevents premature convergence
- **Robustness**: Multiple near-optimal actions are learned, making the policy adaptable
- **Stability**: Prevents collapse to deterministic suboptimal policies


### Question 2: Actor Cost Function (3 points)

**Q:** Write down the actor cost function.

**Answer:**

The actor (policy network) is trained to maximize the expected Q-value while maintaining high entropy. For discrete action spaces, the actor loss is:

$$L_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\sum_{a \in \mathcal{A}} \pi_\phi(a|s_t)\left(\alpha \log \pi_\phi(a|s_t) - Q_\theta(s_t, a)\right)\right]$$

**Breakdown of Terms:**

1. **Expectation over states**: $\mathbb{E}_{s_t \sim \mathcal{D}}$ - Average over states sampled from replay buffer
2. **Sum over actions**: $\sum_{a \in \mathcal{A}}$ - Weighted sum over all possible actions
3. **Policy probability**: $\pi_\phi(a|s_t)$ - Weight for each action
4. **Entropy term**: $\alpha \log \pi_\phi(a|s_t)$ - Encourages high entropy (exploration)
5. **Q-value term**: $Q_\theta(s_t, a)$ - Expected return for taking action $a$

**Alternative Formulation (Continuous Actions):**

For continuous action spaces with reparameterization trick:

$$L_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}, \epsilon \sim \mathcal{N}}\left[\alpha \log \pi_\phi(a_t|s_t) - Q_\theta(s_t, a_t)\right]$$

where $a_t = f_\phi(\epsilon; s_t)$ is the reparameterized action.

**Clipped Double-Q for Actor:**

In practice, we use the minimum of two Q-networks to reduce overestimation:

$$L_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\sum_{a} \pi_\phi(a|s_t)\left(\alpha \log \pi_\phi(a|s_t) - \min_{j=1,2} Q_{\theta_j}(s_t, a)\right)\right]$$

**Gradient Interpretation:**

The gradient $\nabla_\phi L_{\pi}$ pushes the policy to:
- **Increase probability** of actions with high Q-values (exploitation)
- **Increase entropy** to maintain exploration (prevents collapse to deterministic policy)
- Balance is controlled by temperature $\alpha$
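
A minimal PyTorch sketch of the discrete actor loss with clipped double-Q is given below. The function name and the example Q-values are assumptions for illustration. The example also checks the loss at the Boltzmann policy $\pi(a|s) \propto \exp(\min_j Q_{\theta_j}(s,a)/\alpha)$, which is the minimizer of this loss for fixed critics.

```python
import torch

def actor_loss(probs: torch.Tensor, q1: torch.Tensor, q2: torch.Tensor,
               alpha: float) -> torch.Tensor:
    """E_s[ sum_a pi(a|s) * (alpha * log pi(a|s) - min_j Q_j(s,a)) ]."""
    log_probs = torch.log(probs + 1e-8)
    min_q = torch.min(q1, q2).detach()  # critics are not updated by the actor loss
    return (probs * (alpha * log_probs - min_q)).sum(dim=1).mean()

alpha = 0.5
q1 = torch.tensor([[1.0, 2.0]])   # hypothetical critic 1 outputs
q2 = torch.tensor([[1.5, 1.8]])   # hypothetical critic 2 outputs
min_q = torch.min(q1, q2)

boltzmann_probs = torch.softmax(min_q / alpha, dim=1)
uniform_probs = torch.tensor([[0.5, 0.5]])
peaked_probs = torch.tensor([[0.01, 0.99]])

loss_boltzmann = actor_loss(boltzmann_probs, q1, q2, alpha)
loss_uniform = actor_loss(uniform_probs, q1, q2, alpha)
loss_peaked = actor_loss(peaked_probs, q1, q2, alpha)
print(loss_boltzmann.item(), loss_uniform.item(), loss_peaked.item())
```

Both the uniform policy (too much entropy) and the nearly deterministic policy (too little) incur a higher loss than the Boltzmann policy, which illustrates how $\alpha$ sets the balance point.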


### Question 3: Critic Cost Function (3 points)

**Q:** Write down the critic cost function.

**Answer:**

The critic networks estimate soft Q-values using the soft Bellman backup. The loss minimizes the mean squared temporal difference (TD) error:

$$L_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}, d_t) \sim \mathcal{D}}\left[\left(Q_{\theta_i}(s_t, a_t) - y_t\right)^2\right]$$

where the **soft Bellman backup target** is:

$$y_t = r_t + \gamma (1 - d_t) V(s_{t+1})$$

and the **soft state-value function** is computed as:

$$V(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi}\left[Q_{\theta'}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1})\right]$$

**For Discrete Actions (Used in This Implementation):**

$$V(s_{t+1}) = \sum_{a' \in \mathcal{A}} \pi(a'|s_{t+1})\left[\min_{j=1,2} Q_{\theta'_j}(s_{t+1}, a') - \alpha \log \pi(a'|s_{t+1})\right]$$

**Complete Critic Loss:**

$$L_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}, d_t) \sim \mathcal{D}}\Bigg[\Big(Q_{\theta_i}(s_t, a_t) - \Big(r_t + \gamma(1-d_t)$$
$$\times \sum_{a'} \pi_\phi(a'|s_{t+1})\left[\min_{j=1,2} Q_{\theta'_j}(s_{t+1}, a') - \alpha \log \pi_\phi(a'|s_{t+1})\right]\Big)\Big)^2\Bigg]$$

**Key Components:**

1. **Current Q-value**: $Q_{\theta_i}(s_t, a_t)$ - Prediction from local critic
2. **Immediate reward**: $r_t$ - Observed reward
3. **Discount factor**: $\gamma$ - Future reward weighting
4. **Terminal flag**: $(1-d_t)$ - Zero out next state value if episode ends
5. **Target Q-values**: $Q_{\theta'_j}$ - From slowly-updated target networks
6. **Entropy bonus**: $-\alpha \log \pi$ - Encourages stochastic policy
7. **Clipped double-Q**: $\min_{j=1,2}$ - Reduces overestimation bias

**Training Two Critics:**

In SAC, we train **two separate critics** with independent losses:
- $L_{Q_1}(\theta_1)$ for first critic
- $L_{Q_2}(\theta_2)$ for second critic

Both critics are trained simultaneously using the same targets but **independent optimizers**.
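
As a worked example of the target $y_t$, the sketch below evaluates the discrete soft Bellman backup for a batch of two transitions, one of which is terminal. The function name and all numbers are hypothetical, and the policy probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import numpy as np

def soft_td_target(reward, done, next_probs, next_q1_target, next_q2_target,
                   alpha: float, gamma: float) -> np.ndarray:
    """y = r + gamma * (1 - d) * sum_a' pi(a'|s') [min_j Q'_j(s',a') - alpha * log pi(a'|s')]."""
    min_q = np.minimum(next_q1_target, next_q2_target)
    next_value = (next_probs * (min_q - alpha * np.log(next_probs))).sum(axis=-1)
    return reward + gamma * (1.0 - done) * next_value

# Two transitions with two actions each; the second transition ends the episode.
reward = np.array([1.0, 1.0])
done = np.array([0.0, 1.0])
next_probs = np.array([[0.5, 0.5], [0.5, 0.5]])
next_q1_target = np.array([[2.0, 4.0], [2.0, 4.0]])
next_q2_target = np.array([[3.0, 3.0], [3.0, 3.0]])

y = soft_td_target(reward, done, next_probs, next_q1_target, next_q2_target,
                   alpha=0.2, gamma=0.99)
# Non-terminal: 1 + 0.99 * (2.5 + 0.2 * ln 2); terminal: the reward alone.
print(y)
```

The element-wise minimum picks 2.0 and 3.0 from the two target critics, so the first action's value comes from critic 1 and the second from critic 2.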


### Question 4: Two Critics Architecture (3 points)

**Q:** Elaborate on the reason why most implementations of SAC use two critics (one local and one target).

**Answer:**

SAC actually uses **FOUR critic networks**: two local critics and two target critics. This design addresses two critical challenges in deep RL.

**1. Two Local Critics (Clipped Double-Q Learning)**

**Purpose:** Reduce **overestimation bias** in Q-value estimates.

**Problem with Single Critic:**
- Standard Q-learning suffers from maximization bias
- $\max_a Q(s,a)$ tends to overestimate true Q-values due to noise
- This leads to unstable training and poor convergence

**Solution - Clipped Double-Q:**
- Train two independent critics: $Q_{\theta_1}$ and $Q_{\theta_2}$
- For target computation, use the **minimum** of their target copies $Q_{\theta'_1}, Q_{\theta'_2}$ (introduced below):

$$Q_{\text{target}}(s',a') = \min(Q_{\theta'_1}(s',a'), Q_{\theta'_2}(s',a'))$$

**Why This Works:**
- Taking the minimum gives a **pessimistic estimate**: an action receives a high value only if both critics agree
- Reduces optimistic bias while maintaining reasonable estimates
- More conservative → more stable learning
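
The bias reduction can be seen in a toy simulation. The setup is an assumption chosen for clarity: every action has a true value of zero and each critic's estimate is corrupted by independent standard Gaussian noise, so any non-zero average is pure estimation bias.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials, num_actions = 100_000, 4

# Assumption: all true Q-values are 0; estimates are noisy and independent.
q1 = rng.normal(0.0, 1.0, size=(num_trials, num_actions))  # noisy critic 1
q2 = rng.normal(0.0, 1.0, size=(num_trials, num_actions))  # noisy critic 2

# Greedy value taken from a single critic.
single_bias = q1.max(axis=1).mean()
# Greedy value taken from the element-wise minimum of the two critics.
clipped_bias = np.minimum(q1, q2).max(axis=1).mean()

print(f"single critic bias:    {single_bias:.3f}")
print(f"clipped double-Q bias: {clipped_bias:.3f}")
```

The single-critic estimate is biased upward because the maximum selects whichever action happened to receive the largest positive noise. Taking the element-wise minimum first removes much of that bias, though in this toy setting it does not remove it entirely.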

**2. Two Target Networks (Stable Training Targets)**

**Purpose:** Provide **stable, slowly-changing targets** for TD learning.

**Problem with Single Network:**
- If we use same network for prediction and target:
  $$L = (Q_\theta(s,a) - (r + \gamma \max_{a'} Q_\theta(s',a')))^2$$
- Target keeps changing as we update $\theta$ → **moving target problem**
- Causes oscillations and divergence

**Solution - Target Networks:**
- Maintain separate target networks: $Q_{\theta'_1}$ and $Q_{\theta'_2}$
- Update them slowly using **soft updates**:

$$\theta'_i \leftarrow \tau \theta_i + (1-\tau)\theta'_i$$

where $\tau \ll 1$ (typically 0.005 - 0.01)

**Why This Works:**
- Targets change slowly → more stable learning
- Reduces correlation between prediction and target
- Prevents "chasing a moving target" phenomenon

**Complete Architecture Summary:**

| Network | Role | Update Method |
|---------|------|---------------|
| $Q_{\theta_1}$ | Local critic 1 | Gradient descent on $L_{Q_1}$ |
| $Q_{\theta_2}$ | Local critic 2 | Gradient descent on $L_{Q_2}$ |
| $Q_{\theta'_1}$ | Target critic 1 | Soft update from $Q_{\theta_1}$ |
| $Q_{\theta'_2}$ | Target critic 2 | Soft update from $Q_{\theta_2}$ |

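**Training Algorithm:**

One critic update for discrete actions, written as a PyTorch sketch (function and argument names are illustrative):

```python
import torch

def critic_update(batch, policy, q1, q2, q1_target, q2_target,
                  opt1, opt2, alpha, gamma, tau):
    s, a, r, s2, d = batch  # a: int64 action indices; r, d: float tensors
    # 1. Compute targets using target networks (no gradient flows into y)
    with torch.no_grad():
        pi = policy(s2)
        min_q = torch.min(q1_target(s2), q2_target(s2))
        v_target = (pi * (min_q - alpha * torch.log(pi + 1e-8))).sum(dim=1)
        y = r + gamma * (1 - d) * v_target
    # 2. Update local critics, each with its own optimizer
    for q, opt in ((q1, opt1), (q2, opt2)):
        loss = ((q(s).gather(1, a.unsqueeze(1)).squeeze(1) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # 3. Soft update targets: θ' ← τθ + (1-τ)θ'
    with torch.no_grad():
        for q, q_targ in ((q1, q1_target), (q2, q2_target)):
            for p, p_targ in zip(q.parameters(), q_targ.parameters()):
                p_targ.copy_(tau * p + (1 - tau) * p_targ)
```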

**Benefits of This Design:**
- ✓ Reduces overestimation bias
- ✓ Stabilizes training
- ✓ Improves convergence
- ✓ Better final performance


### Question 5: Online vs Offline Training Samples (3 points)

**Q:** What is the difference between training samples in offline and online settings?

**Answer:**

The fundamental difference lies in **how data is collected and used during training**.

**Online Reinforcement Learning**

**Data Collection:**
- Agent **actively interacts** with environment during training
- Collects new transitions: $(s_t, a_t, r_t, s_{t+1}, d_t)$ at each step
- Continuously adds experiences to replay buffer
- Can explore new regions of state-action space

**Training Process:**
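```python
for episode in range(num_episodes):
    state = env.reset()  # classic Gym API; Gym >= 0.26 returns (obs, info)
    for step in range(max_steps):
        action = agent.get_action(state)  # Sample from current policy
        # Classic Gym API returns 4 values; Gym >= 0.26 returns 5
        next_state, reward, done, _ = env.step(action)

        # Add NEW transition to buffer
        buffer.add_transition((state, action, reward, next_state, done))

        # Train on MIXED data (old + new)
        batch = buffer.sample_minibatch()
        agent.update(batch)

        # Advance to the next state and stop at episode end
        state = next_state
        if done:
            break
```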

**Characteristics:**
- ✓ **Exploration**: Can discover new strategies
- ✓ **Adaptation**: Policy improves → data distribution improves
- ✓ **Self-correction**: Overestimated actions get tried, and the observed rewards correct their Q-values
- ✗ **Sample inefficiency**: Requires many environment interactions
- ✗ **Safety concerns**: May take dangerous actions during exploration

**Offline Reinforcement Learning**

**Data Collection:**
- Uses **pre-collected fixed dataset** $\mathcal{D}_{\text{fixed}}$
- NO environment interaction during training
- Dataset collected by behavior policy $\pi_\beta$ (often suboptimal or human)
- Cannot add new experiences

**Training Process:**
```python
# Collect dataset ONCE (before training)
dataset = collect_data_with_behavior_policy()

# Train WITHOUT environment interaction
for epoch in range(num_epochs):
    for batch in iterate_over_dataset(dataset):
        # Train ONLY on fixed data
        agent.update(batch)
    
    # Evaluate (optional environment interaction)
    evaluate_policy(agent)
```

**Characteristics:**
- ✓ **Safety**: No risky exploration in real environment
- ✓ **Sample efficiency**: No environment interactions needed
- ✓ **Batch learning**: Can leverage large offline datasets
- ✗ **Distribution shift**: $\pi_\theta$ diverges from $\pi_\beta$
- ✗ **Extrapolation error**: Poor Q-estimates for OOD actions
- ✗ **Limited coverage**: Bounded by dataset quality

**Key Differences Comparison**

| Aspect | Online RL | Offline RL |
|--------|-----------|------------|
| **Data Source** | Live environment interaction | Fixed pre-collected dataset |
| **Buffer Updates** | Continuously growing | Static, no new data |
| **Policy Distribution** | $\pi_\theta$ samples current states | $\pi_\beta$ (behavior) ≠ $\pi_\theta$ (learned) |
| **Exploration** | Active exploration possible | Limited to dataset coverage |
| **Distribution Shift** | Minimal (self-correcting) | **Major challenge** |
| **Sample Generation** | $s_{t+1} \sim p(\cdot|s_t, a_t), a_t \sim \pi_\theta$ | $s,a,r,s' \sim \mathcal{D}_{\text{fixed}}$ |
| **Q-value Reliability** | Good for visited states | Poor for out-of-distribution actions |
| **Use Cases** | Simulators, games, safe environments | Robotics, healthcare, autonomous driving |

**Distribution Shift Problem (Critical in Offline RL)**

**Online:** Policy and data distribution co-evolve
$$\mathcal{D}_t \sim \pi_\theta^{(t)} \rightarrow \text{Update } \theta \rightarrow \mathcal{D}_{t+1} \sim \pi_\theta^{(t+1)}$$

**Offline:** Policy diverges from data distribution
$$\mathcal{D} \sim \pi_\beta \text{ (fixed)}, \quad \pi_\theta \rightarrow \text{different distribution}$$

When $\pi_\theta$ assigns high probability to actions rare in $\mathcal{D}$:
- Q-values for those actions are **overestimated** (extrapolation error)
- Policy exploits these errors → **poor performance**

**Solution:** Conservative methods like CQL explicitly penalize OOD actions.


### Question 6: Conservative Q-Learning (CQL) Impact (3 points)

**Q:** How does adding CQL on top of SAC change the objective function?

**Answer:**

Conservative Q-Learning (CQL) modifies the critic loss to explicitly prevent overestimation of out-of-distribution (OOD) actions in offline RL.

**Standard SAC Critic Loss:**

$$L_{\text{SAC}}(\theta) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[(Q_\theta(s,a) - y)^2\right]$$

where $y$ is the soft Bellman backup target.

**CQL Modified Critic Loss:**

$$L_{\text{CQL}}(\theta) = \underbrace{\alpha_{\text{CQL}} \cdot \mathcal{R}(\theta)}_{\text{Conservative Regularizer}} + \underbrace{L_{\text{SAC}}(\theta)}_{\text{Standard Bellman Error}}$$

where the **conservative regularizer** is:

$$\mathcal{R}(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a} \exp(Q_\theta(s,a))\right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q_\theta(s,a)]$$

**Breaking Down the Regularizer:**

**First Term** (pushes Q-values DOWN):
$$\mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a} \exp(Q_\theta(s,a))\right]$$
- Log-sum-exp over ALL actions (including OOD)
- Acts as a soft maximum: an upper bound on $\max_a Q_\theta(s,a)$ that is tight when one action dominates
- Penalizes high Q-values for any action

**Second Term** (pushes Q-values UP for dataset actions):
$$-\mathbb{E}_{(s,a) \sim \mathcal{D}}[Q_\theta(s,a)]$$
- Negative expectation over actions IN dataset
- Encourages high Q-values for observed actions
- Counterbalances the first term

**Net Effect:**
- **Decreases** Q-values for **out-of-distribution** actions
- **Maintains** Q-values for **in-distribution** actions  
- Creates a **lower bound** on Q-values
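
The push-down and push-up effects follow directly from the gradient of the regularizer. For a single dataset pair $(s, a_{\mathcal{D}})$, the gradient with respect to the Q-values is $\text{softmax}(Q_\theta(s,\cdot)) - \mathbb{1}[a = a_{\mathcal{D}}]$. The sketch below evaluates this in NumPy; the function names and Q-values are hypothetical.

```python
import numpy as np

def cql_penalty(q: np.ndarray, dataset_action: int) -> float:
    """log sum_a exp(Q(s,a)) - Q(s, a_D) for a single state."""
    q_max = q.max()
    return float(q_max + np.log(np.exp(q - q_max).sum()) - q[dataset_action])

def cql_penalty_grad(q: np.ndarray, dataset_action: int) -> np.ndarray:
    """Gradient of the penalty w.r.t. each Q-value: softmax(Q) - one_hot(a_D)."""
    soft = np.exp(q - q.max())
    soft = soft / soft.sum()
    grad = soft.copy()
    grad[dataset_action] -= 1.0
    return grad

# Action 0 is in the dataset; action 1 is out-of-distribution but has the larger Q.
q_values = np.array([10.0, 15.0])
grad = cql_penalty_grad(q_values, dataset_action=0)
print(grad)
```

Gradient descent moves each Q-value against its gradient entry: the negative entry for the dataset action raises $Q(s, a_0)$, and the positive entry for the unseen action lowers $Q(s, a_1)$. The entries sum to zero, so the regularizer redistributes value between actions and the Bellman error term anchors the overall scale.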

**Alternative Formulation (Discrete Actions):**

For discrete action spaces, the log-sum-exp can be computed exactly:

$$\mathcal{R}(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a \in \mathcal{A}} \exp(Q_\theta(s,a)) - \sum_{a \in \mathcal{A}} \pi_\beta(a|s) Q_\theta(s,a)\right]$$

where $\pi_\beta$ is the behavior policy that generated the dataset.

**Intuitive Understanding:**

Imagine a state $s$ where:
- Action $a_1$ appears 100 times in dataset → high confidence
- Action $a_2$ appears 0 times in dataset → no data

**Standard SAC might predict:**
- $Q(s, a_1) = 10$ ✓ (reliable estimate)
- $Q(s, a_2) = 15$ ✗ (overestimated due to extrapolation)

**CQL corrects this:**
- $Q_{\text{CQL}}(s, a_1) = 10$ ✓ (maintained)
- $Q_{\text{CQL}}(s, a_2) = 5$ ✓ (penalized for being OOD)

**Mathematical Justification:**

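CQL learns a conservative estimate $\hat{Q}^\pi$ of the policy's true Q-function $Q^\pi$. For a sufficiently large $\alpha_{\text{CQL}}$, the analysis in [2] shows that the value implied by this estimate lower-bounds the true value of the policy on dataset states:

$$\mathbb{E}_{a \sim \pi(\cdot|s)}\left[\hat{Q}^\pi(s,a)\right] \leq V^\pi(s), \quad \forall s \in \mathcal{D}$$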

This conservatism prevents the policy from exploiting overestimated Q-values for unseen actions.

**Hyperparameter $\alpha_{\text{CQL}}$ (Trade-off Factor):**

Controls strength of conservatism:
- **Low** $\alpha_{\text{CQL}}$ (e.g., 0.1): Weak regularization, closer to standard SAC
- **Medium** $\alpha_{\text{CQL}}$ (e.g., 1-10): Balanced conservatism
- **High** $\alpha_{\text{CQL}}$ (e.g., 50+): Very conservative, may underestimate

**Complete CQL-SAC Objective:**

$$\min_\theta \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\Bigg[\underbrace{\alpha_{\text{CQL}} \left(\log \sum_{a'} \exp(Q_\theta(s,a')) - Q_\theta(s,a)\right)}_{\text{CQL Penalty}} + \underbrace{(Q_\theta(s,a) - y)^2}_{\text{Bellman Error}}\Bigg]$$

**Benefits in Offline RL:**
- ✓ Reduces extrapolation error
- ✓ More stable training
- ✓ Better performance on limited datasets
- ✓ Provable lower bound on Q-values
- ✓ Discourages the policy from selecting actions absent from the dataset


## VI. SAC AGENT IMPLEMENTATION (50 Points)

This section implements the complete Soft Actor-Critic agent, including:
- Critic networks with loss computation
- Actor network with policy optimization  
- Automatic temperature tuning
- Training loop with gradient updates
- Conservative Q-Learning (CQL) support for offline RL

The implementation supports both online and offline training paradigms.


In [None]:
class SACAgent:
    """
    Soft Actor-Critic Agent
    ========================
    Complete implementation of SAC algorithm with support for:
    - Online training (with environment interaction)
    - Offline training (from fixed dataset)
    - Conservative Q-Learning (CQL) for offline RL
    
    Parameters
    ----------
    environment : gym.Env
        The environment to train on
    replay_buffer : ReplayBuffer, optional
        Pre-filled replay buffer for offline training
    use_cql : bool, default=False
        Whether to use Conservative Q-Learning regularization
    offline : bool, default=False
        Whether to train in offline mode (no new data collection)
    """
    
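    # Hyperparameters
    ALPHA_INITIAL = 1.0                      # initial entropy temperature
    REPLAY_BUFFER_BATCH_SIZE = 100           # transitions sampled per update
    DISCOUNT_RATE = 0.99                     # gamma
    LEARNING_RATE = 1e-4                     # shared by critics, actor, and temperature
    SOFT_UPDATE_INTERPOLATION_FACTOR = 0.01  # tau for target-network updates
    TRADEOFF_FACTOR = 5                      # CQL regularization strength (alpha_CQL)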

    def __init__(self, environment: gym.Env, replay_buffer: Optional[ReplayBuffer] = None, 
                 use_cql: bool = False, offline: bool = False):
        
        # Validation checks
        assert not use_cql or offline, 'CQL requires offline mode to be enabled.'
        assert not offline or replay_buffer is not None, 'Offline mode requires a replay buffer.'
        
        self.environment = environment
        self.state_dim = self.environment.observation_space.shape[0]
        self.action_dim = self.environment.action_space.n
        
        self.offline = offline
        self.replay_buffer = ReplayBuffer(self.environment) if replay_buffer is None else replay_buffer
        self.use_cql = use_cql
        
        # ============================================================
        # SOLUTION: Initialize Critics (6 points)
        # ============================================================
        
        print("Initializing SAC Agent...")
        print(f"  State dim: {self.state_dim}, Action dim: {self.action_dim}")
        print(f"  Mode: {'Offline' if offline else 'Online'}")
        print(f"  CQL: {'Enabled' if use_cql else 'Disabled'}")
        
        # Two local critic networks for clipped double-Q learning
        self.critic_local = Network(self.state_dim, self.action_dim)
        self.critic_local2 = Network(self.state_dim, self.action_dim)
        
        # Separate optimizers for each critic
        self.critic_optimiser = optim.Adam(self.critic_local.parameters(), lr=self.LEARNING_RATE)
        self.critic_optimiser2 = optim.Adam(self.critic_local2.parameters(), lr=self.LEARNING_RATE)
        
        # Two target critic networks for stable training targets
        self.critic_target = Network(self.state_dim, self.action_dim)
        self.critic_target2 = Network(self.state_dim, self.action_dim)
        
        # Initialize target networks with local network weights
        self.soft_update_target_networks(tau=1.0)
        
        # ============================================================
        # SOLUTION: Initialize Actor (2 points)
        # ============================================================
        
        # Actor network with Softmax output for discrete action probabilities
        self.actor_local = Network(self.state_dim, self.action_dim,
                                   output_activation=torch.nn.Softmax(dim=-1))
        
        # Actor optimizer
        self.actor_optimiser = optim.Adam(self.actor_local.parameters(), lr=self.LEARNING_RATE)
        
        # ============================================================
        # Temperature (Entropy Coefficient) Setup
        # ============================================================
        
        # Maximum entropy of a uniform policy over |A| actions: -log(1/|A|) = log(|A|)
        # The target is 0.98 of that maximum, so the policy may become slightly
        # less random than uniform
        self.target_entropy = 0.98 * -np.log(1 / self.action_dim)
        
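        # Optimize log(alpha) rather than alpha so the temperature stays positive;
        # float32 matches the dtype of the network outputs
        self.log_alpha = torch.tensor(np.log(self.ALPHA_INITIAL), dtype=torch.float32, requires_grad=True)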
        self.alpha = self.log_alpha.exp()
        self.alpha_optimiser = torch.optim.Adam([self.log_alpha], lr=self.LEARNING_RATE)
        
        print(f"  Target entropy: {self.target_entropy:.3f}")
        print(f"  Initial alpha: {self.alpha.item():.3f}")
        print("✓ SAC Agent initialized successfully\n")
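
The constructor calls `soft_update_target_networks(tau=1.0)` to copy the local critic weights into the targets. That method is not shown in this cell, so the following is only a minimal standalone sketch of the Polyak-averaging rule such a method typically applies. The helper name `soft_update` and the `nn.Linear` stand-ins for the critic networks are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def soft_update(target: nn.Module, local: nn.Module, tau: float) -> None:
    """Polyak averaging: target <- tau * local + (1 - tau) * target."""
    with torch.no_grad():
        for target_param, local_param in zip(target.parameters(), local.parameters()):
            target_param.mul_(1.0 - tau).add_(local_param, alpha=tau)

# Stand-ins for a local critic and its target (hypothetical sizes)
local_net, target_net = nn.Linear(4, 2), nn.Linear(4, 2)

soft_update(target_net, local_net, tau=1.0)   # tau = 1 is a hard copy
soft_update(target_net, local_net, tau=0.01)  # small tau makes the target track slowly
```

With `tau=1.0` the old target weights are discarded entirely, which is why the same routine can serve for initialization. During training a small value such as `SOFT_UPDATE_INTERPOLATION_FACTOR = 0.01` keeps the Bellman targets changing slowly.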


In [None]:
    # ================================================================
    # Action Selection Methods
    # ================================================================
    
    def get_next_action(self, state: np.ndarray, evaluation_episode: bool = False) -> int:
        """
        Select action given current state.
        
        Parameters
        ----------
        state : np.ndarray
            Current state
        evaluation_episode : bool
            If True, use deterministic (greedy) policy; else stochastic
            
        Returns
        -------
        int
            Selected discrete action
        """
        if evaluation_episode:
            return self.get_action_deterministically(state)
        else:
            return self.get_action_nondeterministically(state)
    
    def get_action_nondeterministically(self, state: np.ndarray) -> int:
        """Sample action from stochastic policy (for training/exploration)."""
        action_probabilities = self.get_action_probabilities(state)
        # Cast the NumPy integer to a plain int to match the annotated return type
        return int(np.random.choice(self.action_dim, p=action_probabilities))
    
    def get_action_deterministically(self, state: np.ndarray) -> int:
        """Select greedy action (for evaluation)."""
        action_probabilities = self.get_action_probabilities(state)
        return int(np.argmax(action_probabilities))
    
    def get_action_probabilities(self, state: np.ndarray) -> np.ndarray:
        """Get action probability distribution from actor network."""
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        # Inference only: skip building the autograd graph
        with torch.no_grad():
            action_probabilities = self.actor_local.forward(state_tensor)
        return action_probabilities.squeeze(0).numpy()
    
    def get_action_info(self, states_tensor: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Get action probabilities and log probabilities for a batch of states.
        
        Parameters
        ----------
        states_tensor : torch.Tensor
            Batch of states (batch_size, state_dim)
            
        Returns
        -------
        action_probabilities : torch.Tensor
            Action probabilities (batch_size, action_dim)
        log_action_probabilities : torch.Tensor
            Log probabilities (batch_size, action_dim)
        """
        action_probabilities = self.actor_local.forward(states_tensor)
        
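        # Add a small epsilon only where a probability is exactly zero, so that
        # log(0) = -inf never appears; nonzero probabilities are left unchanged
        z = (action_probabilities == 0.0).float() * 1e-8
        log_action_probabilities = torch.log(action_probabilities + z)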
        
        return action_probabilities, log_action_probabilities
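
The pair returned by `get_action_info` is what makes exact expectations over actions possible in the discrete setting. The sketch below shows one common way to use it: computing the per-state policy entropy and a temperature loss that drives the entropy toward the target. The helper names `policy_entropy` and `temperature_loss` are hypothetical, and the exact loss form used later in the notebook may differ.

```python
import math
import torch

def policy_entropy(action_probs, log_action_probs):
    """Per-state entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s)."""
    return -(action_probs * log_action_probs).sum(dim=1)

def temperature_loss(log_alpha, action_probs, log_action_probs, target_entropy):
    """Descending this loss raises alpha when entropy is below target, and lowers it otherwise."""
    # The policy is held fixed here; only log_alpha receives a gradient
    entropy = policy_entropy(action_probs, log_action_probs).detach()
    return (log_alpha * (entropy - target_entropy)).mean()

# Uniform policy over two actions (CartPole-v1 has two): entropy equals log(2)
probs = torch.tensor([[0.5, 0.5]])
log_probs = torch.log(probs)
target_entropy = 0.98 * math.log(2)

log_alpha = torch.zeros(1, requires_grad=True)
loss = temperature_loss(log_alpha, probs, log_probs, target_entropy)
loss.backward()
print(policy_entropy(probs, log_probs), log_alpha.grad)
```

For the uniform policy the entropy sits just above the 0.98 target, so the gradient on `log_alpha` is positive and a descent step lowers the temperature slightly.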
