# Soft Actor-Critic (SAC) for Reinforcement Learning: Complete Implementation and Analysis

**Course:** Deep Reinforcement Learning  
**Assignment:** HW4 - Soft Actor-Critic Agent (115 Points)  
**Total Points:** 115

---

## Abstract

This notebook presents a comprehensive implementation of the Soft Actor-Critic (SAC) algorithm [1], a state-of-the-art off-policy deep reinforcement learning method. SAC maximizes a trade-off between expected return and entropy, encouraging exploration while learning optimal policies. We implement three variants: (1) **Online SAC** with environment interaction, (2) **Offline SAC** trained on fixed datasets, and (3) **Conservative SAC** using Conservative Q-Learning (CQL) [2] for robust offline learning. Experimental validation on the CartPole-v1 environment demonstrates the effectiveness of entropy regularization and the importance of conservatism in offline settings.

**Keywords:** Soft Actor-Critic, Maximum Entropy RL, Offline RL, Conservative Q-Learning, Deep RL

---

## I. INTRODUCTION

### A. Background

Reinforcement Learning (RL) aims to learn optimal policies by maximizing cumulative rewards through environment interaction. Traditional RL algorithms struggle with the exploration-exploitation trade-off and with sample efficiency. Actor-critic methods combine value-based and policy-based approaches, using a critic to estimate value functions and an actor to update the policy.

### B. Soft Actor-Critic Overview

Soft Actor-Critic (SAC) [1] addresses these challenges through:
1. **Entropy Maximization**: Augments the standard RL objective with an entropy term
2. **Off-Policy Learning**: Improves sample efficiency through experience replay
3. **Stochastic Policies**: Maintains exploration throughout training
4. **Automatic Temperature Tuning**: Adaptively adjusts the exploration-exploitation balance

### C. Contributions

This implementation provides:
- Complete SAC agent for discrete action spaces
- Comparative analysis of online vs offline training paradigms
- Conservative Q-Learning integration for offline RL
- Empirical evaluation on the CartPole-v1 benchmark

---

## II. THEORETICAL FOUNDATIONS

### A. Maximum Entropy Reinforcement Learning

Standard RL maximizes expected cumulative reward:

$$J_{\text{standard}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$

SAC extends this with entropy regularization:

$$J_{\text{SAC}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$$

where $\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s_t)]$ is the policy entropy and $\alpha > 0$ is the temperature parameter controlling exploration.
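
As a small numeric illustration of the entropy term, the sketch below computes $\mathcal{H}$ for a discrete policy and the entropy-augmented discounted return of a short trajectory. The function names, the example probabilities, and the default values of `alpha` and `gamma` are assumptions chosen for illustration, not part of the assignment code.

```python
import numpy as np

def policy_entropy(probs) -> float:
    """Entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s) of a discrete policy."""
    probs = np.asarray(probs, dtype=np.float64)
    nonzero = probs[probs > 0]              # treat 0 * log 0 as 0
    return float(-np.sum(nonzero * np.log(nonzero)))

def soft_return(rewards, entropies, alpha=0.2, gamma=0.99) -> float:
    """Discounted sum of r_t + alpha * H_t over a finite trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    entropies = np.asarray(entropies, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * (rewards + alpha * entropies)))

uniform = policy_entropy([0.5, 0.5])   # two equally likely actions: log 2
greedy = policy_entropy([1.0, 0.0])    # deterministic policy: zero entropy
print(uniform, greedy)
print(soft_return([1.0, 1.0, 1.0], [uniform] * 3))
```

With `alpha=0` the soft return reduces to the standard discounted return, which is one way to see the entropy term as a pure exploration bonus.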

### B. Soft Policy Iteration

SAC alternates between:

**1) Soft Policy Evaluation**: Compute soft Q-function satisfying the soft Bellman equation:

$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}[V^{\pi}(s_{t+1})]$$

where the soft state-value function is:

$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}[Q^{\pi}(s_t, a_t) - \alpha \log \pi(a_t|s_t)]$$

**2) Soft Policy Improvement**: Update policy towards:

$$\pi_{\text{new}} = \arg\min_{\pi'} D_{\text{KL}}\left(\pi'(\cdot|s_t) \,\Big\|\, \frac{\exp\left(\frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z(s_t)}\right)$$

where $Z(s_t)$ normalizes the distribution over actions.
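
For a discrete action space and an unconstrained policy, the KL projection has a closed-form minimizer: a softmax over the Q-values scaled by $1/\alpha$. The sketch below checks numerically that this policy attains the soft value $\alpha \log \sum_a \exp(Q(s,a)/\alpha)$ and that a uniform policy does no better. The helper names and the example Q-values are assumptions for illustration.

```python
import numpy as np

def soft_value(q: np.ndarray, probs: np.ndarray, alpha: float) -> float:
    """V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    return float(np.sum(probs * (q - alpha * np.log(probs))))

def boltzmann_policy(q: np.ndarray, alpha: float) -> np.ndarray:
    """softmax(Q / alpha), with the max subtracted for numerical stability."""
    z = (q - q.max()) / alpha
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 2.0])       # example Q-values for two actions
alpha = 0.5                    # example temperature

pi_star = boltzmann_policy(q, alpha)
v_star = soft_value(q, pi_star, alpha)
v_uniform = soft_value(q, np.array([0.5, 0.5]), alpha)
v_closed = alpha * np.log(np.sum(np.exp(q / alpha)))   # alpha * logsumexp(Q / alpha)

print(pi_star, v_star, v_uniform, v_closed)
```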

---

## III. METHODOLOGY

### A. Network Architecture

We employ feedforward neural networks with the following architecture:
- **Input Layer**: State dimension $d_s$
- **Hidden Layer 1**: 256 neurons with ReLU activation
- **Hidden Layer 2**: 256 neurons with ReLU activation  
- **Output Layer**: Action dimension $d_a$ with task-specific activation

### B. SAC Components

**1) Critic Networks**: Two Q-networks $Q_{\theta_1}, Q_{\theta_2}$ to reduce overestimation bias (clipped double-Q learning)

**2) Target Networks**: Slowly-updated copies $Q_{\theta'_1}, Q_{\theta'_2}$ for stable training

**3) Actor Network**: Policy $\pi_\phi$ with Softmax output for discrete actions

**4) Temperature Parameter**: Learnable $\alpha$ with automatic tuning
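
One common way to realize "slowly-updated" target networks is Polyak averaging, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. The sketch below assumes this update rule; the function name `soft_update`, the rate `tau`, and the toy `Linear` networks are hypothetical and may differ from the agent implemented later in the notebook.

```python
import copy
import torch

def soft_update(target: torch.nn.Module, online: torch.nn.Module, tau: float = 0.005) -> None:
    """Move each target parameter a fraction tau towards its online counterpart."""
    with torch.no_grad():
        for p_target, p_online in zip(target.parameters(), online.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)

online = torch.nn.Linear(4, 2)          # stand-in for a critic network
target = copy.deepcopy(online)          # target starts as an exact copy

with torch.no_grad():
    online.weight.add_(1.0)             # pretend a gradient step moved the online net

before = target.weight.clone()
soft_update(target, online, tau=0.5)    # large tau only to make the effect visible
print((target.weight - before).mean())
```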

### C. Loss Functions

**Critic Loss** (Mean Squared Bellman Error):

$$L_Q(\theta_i) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_{\theta_i}(s,a) - y\right)^2\right]$$

where the target is:

$$y = r + \gamma(1-d)\sum_{a'} \pi_\phi(a'|s')\left[\min_{j=1,2} Q_{\theta'_j}(s',a') - \alpha \log \pi_\phi(a'|s')\right]$$
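
The target $y$ can be computed exactly for discrete actions because the expectation over $a'$ is a finite sum. The following is a minimal sketch under that assumption; the function name, argument names, and example tensors are hypothetical, and in a training loop this computation would typically run under `torch.no_grad()` so that no gradient flows into the target.

```python
import torch

def soft_q_target(rewards, dones, next_probs, next_q1, next_q2, alpha, gamma=0.99):
    """y = r + gamma * (1 - d) * sum_a' pi(a'|s') [min_j Q'_j(s',a') - alpha * log pi(a'|s')]."""
    min_q = torch.min(next_q1, next_q2)                   # clipped double-Q, shape (batch, n_actions)
    log_probs = torch.log(next_probs.clamp_min(1e-8))     # clamp to avoid log(0)
    next_v = (next_probs * (min_q - alpha * log_probs)).sum(dim=1)
    return rewards + gamma * (1.0 - dones) * next_v

rewards = torch.tensor([1.0, 1.0])
dones = torch.tensor([0.0, 1.0])                          # second transition is terminal
next_probs = torch.tensor([[0.5, 0.5], [0.9, 0.1]])
next_q1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
next_q2 = torch.tensor([[1.5, 1.0], [2.0, 5.0]])

y = soft_q_target(rewards, dones, next_probs, next_q1, next_q2, alpha=0.2)
print(y)
```

For the terminal transition the bootstrap term is masked out, so the target equals the reward.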

**Actor Loss**:

$$L_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[\sum_a \pi_\phi(a|s)\left(\alpha \log \pi_\phi(a|s) - \min_{j=1,2} Q_{\theta_j}(s,a)\right)\right]$$
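
A hedged sketch of the actor loss for discrete actions follows. It assumes the actor outputs action probabilities (as with a Softmax output layer) and treats the critic values as constants; the function name and the example tensors are illustrative only.

```python
import torch

def actor_loss(probs, q1, q2, alpha):
    """Mean over states of sum_a pi(a|s) * (alpha * log pi(a|s) - min_j Q_j(s,a))."""
    log_probs = torch.log(probs.clamp_min(1e-8))   # clamp to avoid log(0)
    min_q = torch.min(q1, q2).detach()             # no gradient into the critics
    return (probs * (alpha * log_probs - min_q)).sum(dim=1).mean()

# Stand-in for the actor's pre-softmax outputs on a batch of 3 states
logits = torch.zeros(3, 2, requires_grad=True)
probs = torch.softmax(logits, dim=1)
q1 = torch.tensor([[1.0, 2.0]] * 3)
q2 = torch.tensor([[1.5, 1.5]] * 3)

loss = actor_loss(probs, q1, q2, alpha=0.2)
loss.backward()
print(loss.item(), logits.grad)
```

In this example action 1 has the larger clipped Q-value, so the gradient with respect to its logit is negative and a gradient-descent step raises its probability.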

**Temperature Loss**:

$$L_\alpha = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_\phi}\left[-\alpha(\log \pi_\phi(a|s) + \bar{\mathcal{H}})\right]$$

where $\bar{\mathcal{H}}$ is the target entropy.
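
The temperature update can be sketched as below. Two choices here are assumptions rather than something the text specifies: optimizing `log_alpha` instead of $\alpha$ directly (to keep $\alpha$ positive), and setting the target entropy to a fraction of the maximum entropy $\log |\mathcal{A}|$. The expectation over actions is computed exactly as a sum.

```python
import math
import torch

def temperature_loss(log_alpha, probs, target_entropy):
    """Mean over states of alpha * sum_a pi(a|s) * -(log pi(a|s) + H_bar)."""
    log_probs = torch.log(probs.clamp_min(1e-8))
    # Equals H(pi(.|s)) - H_bar per state; the policy is treated as a constant
    entropy_gap = -(probs * (log_probs + target_entropy)).sum(dim=1).detach()
    return (log_alpha.exp() * entropy_gap).mean()

n_actions = 2
target_entropy = 0.98 * math.log(n_actions)           # assumed heuristic, not from the text
log_alpha = torch.zeros(1, requires_grad=True)         # alpha starts at 1
probs = torch.tensor([[0.99, 0.01], [0.95, 0.05]])     # nearly deterministic policy

loss = temperature_loss(log_alpha, probs, target_entropy)
loss.backward()
print(loss.item(), log_alpha.grad.item())
```

Because the example policy's entropy is below the target, the gradient with respect to `log_alpha` is negative, so gradient descent increases $\alpha$ and strengthens the entropy bonus.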

### D. Conservative Q-Learning

For offline RL, CQL adds a regularization term:

$$L_{\text{CQL}}(\theta) = \alpha_{\text{CQL}}\left(\mathbb{E}_{s \sim \mathcal{D}}\left[\log\sum_a \exp Q_\theta(s,a)\right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q_\theta(s,a)]\right) + L_Q(\theta)$$

This pushes down Q-values for out-of-distribution actions while maintaining values for in-dataset actions.
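
The regularizer inside the parentheses can be sketched for a batch of discrete-action Q-values as follows. The function name and the example tensors are hypothetical; the full loss would scale this term by $\alpha_{\text{CQL}}$ and add the Bellman error $L_Q(\theta)$.

```python
import torch

def cql_regularizer(q_values: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Batch estimate of E_s[logsumexp_a Q(s,a)] - E_(s,a)[Q(s,a)]."""
    logsumexp_q = torch.logsumexp(q_values, dim=1)                    # soft maximum over actions
    data_q = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)     # Q at the dataset actions
    return (logsumexp_q - data_q).mean()

q_values = torch.tensor([[1.0, 3.0], [2.0, 0.0]], requires_grad=True)
actions = torch.tensor([0, 0])           # actions actually taken in the dataset

penalty = cql_regularizer(q_values, actions)
penalty.backward()
print(penalty.item(), q_values.grad)
```

The gradient is positive for actions absent from the batch and negative for the dataset action, so a descent step lowers the former and raises the latter, matching the description above.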

---

## References

[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in *ICML*, 2018.

[2] A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative Q-learning for offline reinforcement learning," in *NeurIPS*, 2020.


## IV. IMPLEMENTATION

### A. Environment Setup and Dependencies

We begin by importing required libraries and setting random seeds for reproducibility.


In [None]:
"""
Dependencies and Random Seed Configuration
============================================
This cell imports all necessary libraries and configures random seeds
for reproducible experiments across PyTorch, NumPy, and Python's random module.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
import torch.optim as optim
import numpy as np
import random
import gym
import matplotlib.pyplot as plt
from typing import Tuple, Optional

# Set random seeds for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
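torch.cuda.manual_seed_all(seed)  # seeds every visible GPU; ignored when CUDA is unavailable
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # disable autotuning, which can select non-deterministic kernels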

print("✓ All libraries imported successfully")
print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ Random seed set to: {seed}")


### B. Neural Network Architecture (8 Points)

The `Network` class implements a 3-layer feedforward neural network that serves as the foundation for both actor and critic networks in our SAC implementation.

**Architecture Details:**
- **Layer 1**: Input $\rightarrow$ 256 neurons (ReLU activation)
- **Layer 2**: 256 $\rightarrow$ 256 neurons (ReLU activation)
- **Layer 3**: 256 $\rightarrow$ Output (Configurable activation)

**Design Rationale:**
1. **Hidden Layer Size (256)**: Sufficient capacity for CartPole while avoiding overfitting
2. **ReLU Activation**: Provides non-linearity and computational efficiency
3. **Modular Output Activation**: Allows `Identity` for critics and `Softmax` for actor

**Mathematical Formulation:**

$$h_1 = \text{ReLU}(W_1 x + b_1)$$
$$h_2 = \text{ReLU}(W_2 h_1 + b_2)$$
$$y = \sigma(W_3 h_2 + b_3)$$

where $\sigma$ is the output activation function and $x \in \mathbb{R}^{d_s}$ is the input state.


In [None]:
class Network(torch.nn.Module):
    """
    Feedforward Neural Network for SAC
    ==================================
    A 3-layer fully-connected neural network used for both actor and critic networks.
    
    Parameters
    ----------
    input_dimension : int
        Dimension of input features (state dimension)
    output_dimension : int
        Dimension of output (number of actions: one probability per action for the actor,
        one Q-value per action for a critic)
    output_activation : torch.nn.Module
        Activation applied to the output layer (default: Identity, as used by the critics;
        pass Softmax for the actor)
        
    Architecture
    ------------
    Input → FC(256) → ReLU → FC(256) → ReLU → FC(output_dim) → output_activation
    
    Returns
    -------
    torch.Tensor
        Network output of shape (batch_size, output_dimension)
    """

    def __init__(self, input_dimension: int, output_dimension: int, 
                 output_activation: torch.nn.Module = torch.nn.Identity()):
        super(Network, self).__init__()
        
        # SOLUTION: Define network layers (4 points)
        # Layer 1: Input → 256 neurons
        self.layer_1 = torch.nn.Linear(input_dimension, 256)
        
        # Layer 2: 256 → 256 neurons
        self.layer_2 = torch.nn.Linear(256, 256)
        
        # Output layer: 256 → output_dimension
        self.output_layer = torch.nn.Linear(256, output_dimension)
        
        # Store output activation function
        self.output_activation = output_activation

    def forward(self, inpt: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the network.
        
        Parameters
        ----------
        inpt : torch.Tensor
            Input tensor of shape (batch_size, input_dimension)
            
        Returns
        -------
        torch.Tensor
            Output tensor of shape (batch_size, output_dimension)
        """
        
        # SOLUTION: Implement forward pass (4 points)
        # First hidden layer with ReLU activation
        x = torch.nn.functional.relu(self.layer_1(inpt))
        
        # Second hidden layer with ReLU activation
        x = torch.nn.functional.relu(self.layer_2(x))
        
        # Output layer with custom activation
        output = self.output_activation(self.output_layer(x))
        
        return output

# Test the network
print("✓ Network class implemented successfully")
test_net = Network(4, 2)
test_input = torch.randn(32, 4)
test_output = test_net(test_input)
print(f"  Test: Input shape {test_input.shape} → Output shape {test_output.shape}")
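
To show how the configurable output activation separates the two roles, the sketch below builds an actor and a critic with the same layer sizes. So that the snippet runs on its own, it uses a hypothetical `make_mlp` helper built from `torch.nn.Sequential` as a stand-in for the `Network` class; the state and action dimensions (4 and 2) match CartPole-v1.

```python
import torch

def make_mlp(in_dim: int, out_dim: int, output_activation: torch.nn.Module) -> torch.nn.Sequential:
    """Stand-in for Network: FC(256) -> ReLU -> FC(256) -> ReLU -> FC(out_dim) -> activation."""
    return torch.nn.Sequential(
        torch.nn.Linear(in_dim, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, out_dim), output_activation,
    )

actor = make_mlp(4, 2, torch.nn.Softmax(dim=1))    # outputs action probabilities
critic = make_mlp(4, 2, torch.nn.Identity())       # outputs one Q-value per action

states = torch.randn(32, 4)
probs = actor(states)
q_values = critic(states)

# Sample actions from the stochastic policy and look up their Q-values
actions = torch.distributions.Categorical(probs=probs).sample()
q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
print(probs.shape, q_values.shape, q_taken.shape)
```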


### C. Experience Replay Buffer

The replay buffer stores transitions $(s, a, r, s', d)$ and enables off-policy learning by sampling mini-batches from past experiences. This breaks temporal correlations and improves sample efficiency.

**Key Features:**
1. **Prioritized Sampling**: Transitions are sampled with weights based on TD error
2. **Circular Buffer**: Older experiences are overwritten when capacity is reached
3. **Type Safety**: Uses NumPy structured arrays for efficient storage

**Importance in SAC:**
- Enables **off-policy** learning (can reuse old transitions)
- **Decorrelates** samples (reduces variance in gradient estimates)
- Allows **offline RL** by freezing the buffer and training without environment interaction

**Buffer Operations:**
- `add_transition()`: Store new $(s, a, r, s', d)$ tuple
- `sample_minibatch()`: Sample a mini-batch for training, weighted by the current priorities
- `update_weights()`: Update priorities based on TD errors
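
The operations above can be sketched as a small circular buffer backed by a NumPy structured array. This is an illustrative assumption about how such a buffer might look, not the notebook's actual class: the class name, constructor arguments, field names, and the priority rule (`|TD error| + eps`) are all hypothetical; only the three method names come from the list above.

```python
import numpy as np

class ReplayBufferSketch:
    """Fixed-capacity circular buffer with priority-weighted sampling."""

    def __init__(self, state_dim: int, capacity: int = 10000, seed: int = 0):
        dtype = np.dtype([
            ("state", np.float32, (state_dim,)),
            ("action", np.int64),
            ("reward", np.float32),
            ("next_state", np.float32, (state_dim,)),
            ("done", np.bool_),
        ])
        self.buffer = np.zeros(capacity, dtype=dtype)
        self.weights = np.zeros(capacity, dtype=np.float64)
        self.capacity = capacity
        self.head = 0        # next slot to write
        self.count = 0       # number of valid entries
        self.rng = np.random.default_rng(seed)

    def add_transition(self, state, action, reward, next_state, done):
        i = self.head
        self.buffer["state"][i] = state
        self.buffer["action"][i] = action
        self.buffer["reward"][i] = reward
        self.buffer["next_state"][i] = next_state
        self.buffer["done"][i] = done
        # New transitions receive the current maximum weight (at least 1)
        self.weights[i] = max(self.weights.max(), 1.0)
        self.head = (self.head + 1) % self.capacity     # overwrite oldest when full
        self.count = min(self.count + 1, self.capacity)

    def sample_minibatch(self, size: int):
        w = self.weights[:self.count]
        idx = self.rng.choice(self.count, size=size, p=w / w.sum())
        return self.buffer[idx], idx

    def update_weights(self, indices, td_errors, eps: float = 1e-3):
        self.weights[indices] = np.abs(td_errors) + eps

buf = ReplayBufferSketch(state_dim=4, capacity=3)
for i in range(5):                                      # 5 writes into 3 slots
    buf.add_transition(np.full(4, i), i % 2, float(i), np.full(4, i + 1), False)
batch, idx = buf.sample_minibatch(2)
buf.update_weights(idx, np.array([0.5, 2.0]))
print(buf.count, batch["reward"], idx)
```

Freezing such a buffer after filling it, and calling only `sample_minibatch()`, corresponds to the offline setting described above.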
