# Popular Reinforcement Learning Algorithms Overview

---

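This tutorial takes an in-depth look at three of the most popular reinforcement learning algorithms for continuous control:

| Algorithm | Policy type | Key innovation | Typical use |
|-----------|-------------|----------------|-------------|
| **DDPG** | Deterministic | Extends DQN to continuous action spaces | Learning the basics |
| **TD3** | Deterministic | Three modifications that address DDPG's instability | Stable training |
| **SAC** | Stochastic | Maximum-entropy framework with built-in exploration | Production deployment |

**Learning objectives**:
1. Understand the mathematics behind the Actor-Critic architecture
2. Know the key differences between the three algorithms and when to choose each
3. Be able to implement and debug these algorithms independently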

## 1. Environment Setup and Imports

In [None]:
import sys
import numpy as np
import torch
import matplotlib.pyplot as plt

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Check the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

In [None]:
# Import local modules
from core.config import BaseConfig
from core.buffer import ReplayBuffer
from core.networks import (
    DeterministicActor, GaussianActor,
    QNetwork, TwinQNetwork
)

from algorithms.ddpg import DDPGConfig, DDPGAgent
from algorithms.td3 import TD3Config, TD3Agent
from algorithms.sac import SACConfig, SACAgent

print("All modules imported successfully!")

---

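## 2. Theoretical Foundations: Markov Decision Processes (MDP)

Reinforcement learning problems are usually modeled as a **Markov decision process** $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

| Symbol | Meaning | Description |
|--------|---------|-------------|
| $\mathcal{S}$ | State space | The set of all possible environment states |
| $\mathcal{A}$ | Action space | The set of all actions available to the agent |
| $P(s' \mid s, a)$ | Transition probability | Probability of reaching $s'$ after taking action $a$ in state $s$ |
| $R(s, a, s')$ | Reward function | Immediate reward received for the transition |
| $\gamma \in [0, 1]$ | Discount factor | Decay applied to the weight of future rewards |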

### 2.1 Value Functions

**State-value function** $V^\pi(s)$: the expected cumulative discounted return when starting from state $s$ and following policy $\pi$

$$V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

**Action-value function** $Q^\pi(s, a)$: the expected return when taking action $a$ in state $s$ and following policy $\pi$ thereafter

$$Q^\pi(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

The two are related by:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
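As a small worked example of the sums above, the sketch below computes the discounted return $\sum_t \gamma^t r_t$ of a finite reward sequence. The helper name `discounted_return` and the sample rewards are illustrative assumptions, not part of the tutorial's modules.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    # Work backwards using the recursion G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence: three steps, reward 1 at each step
sample_rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(sample_rewards, gamma=1.0)
discounted = discounted_return(sample_rewards, gamma=0.9)
print(f"gamma=1.0: {undiscounted:.2f}, gamma=0.9: {discounted:.2f}")
```

With $\gamma = 0.9$ the three rewards are weighted $1$, $0.9$ and $0.81$, so the return is $2.71$ rather than $3$. Averaging such returns over many rollouts from the same state is one way to estimate $V^\pi(s)$.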

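### 2.2 Bellman Equations

**Bellman expectation equation** (policy evaluation):

$$Q^\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s', a'}[Q^\pi(s', a')]$$

**Bellman optimality equation** (policy optimization):

$$Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s'}[\max_{a'} Q^*(s', a')]$$

Here $s' \sim P(\cdot \mid s, a)$ and, in the expectation equation, $a' \sim \pi(\cdot \mid s')$. These equations underlie all value-based learning methods: each update applies the right-hand side as a backup, iteratively moving the estimate toward the true value function.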
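To make the iterative backup concrete, here is a minimal sketch that applies the Bellman optimality equation repeatedly on a tiny tabular problem. The two-state, two-action deterministic MDP, its rewards, and all names are hypothetical choices made only for illustration.

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action deterministic MDP (action 0 = stay, 1 = switch)
NEXT_STATE = [[0, 1], [1, 0]]      # NEXT_STATE[s][a]
REWARD = [[0.0, 0.0], [1.0, 0.0]]  # REWARD[s][a]: only staying in state 1 pays


def bellman_optimality_backup(q):
    """Apply Q(s,a) <- r(s,a) + gamma * max_a' Q(s',a') once to every (s, a)."""
    return [
        [REWARD[s][a] + GAMMA * max(q[NEXT_STATE[s][a]]) for a in range(2)]
        for s in range(2)
    ]


def q_value_iteration(num_sweeps=500):
    """Start from Q = 0 and repeat the backup until it has effectively converged."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_sweeps):
        q = bellman_optimality_backup(q)
    return q


q_star = q_value_iteration()
for s, row in enumerate(q_star):
    print(f"state {s}: " + ", ".join(f"Q(a={a})={v:.2f}" for a, v in enumerate(row)))
```

The fixed point can be checked by hand: staying in state 1 forever is worth $1/(1-\gamma) = 10$, and switching from state 0 into state 1 is worth $\gamma \cdot 10 = 9$. Deep RL methods replace the table with a neural network but keep the same backup as the regression target.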

In [None]:
# Visualize the effect of the discount factor
def visualize_discount():
    """Show how different values of gamma weight future rewards."""
    steps = np.arange(50)
    gammas = [0.9, 0.95, 0.99]
    
    fig, ax = plt.subplots(figsize=(10, 5))
    
    for gamma in gammas:
        weights = gamma ** steps
        ax.plot(steps, weights, label=f'γ = {gamma}', linewidth=2)
    
    ax.set_xlabel('Future Timestep', fontsize=12)
    ax.set_ylabel('Reward Weight', fontsize=12)
    ax.set_title('Discount Factor: Impact on Future Reward Weighting', fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(0, 1.1)
    
    plt.tight_layout()
    plt.show()

visualize_discount()

---

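## 3. Core Components

### 3.1 Replay Buffer

**Why use experience replay?**

1. **Breaks temporal correlation**: consecutive samples are highly correlated, which violates the i.i.d. assumption behind SGD
2. **Improves sample efficiency**: each transition can be reused many times
3. **Stabilizes training**: mixing old and new experience helps prevent catastrophic forgetting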
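The idea can be sketched in a few lines. The class below is a hypothetical, simplified stand-in (named `SimpleReplayBuffer` to avoid confusion with the tutorial's `core.buffer.ReplayBuffer`, whose implementation is not shown here): a fixed-capacity ring buffer that overwrites the oldest transition and samples uniformly at random.

```python
import random


class SimpleReplayBuffer:
    """Fixed-capacity ring buffer with uniform random sampling (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._storage = []
        self._next_idx = 0  # position that the next push will write to

    def push(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self._storage) < self.capacity:
            self._storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition
            self._storage[self._next_idx] = transition
        self._next_idx = (self._next_idx + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal ordering
        batch = random.sample(self._storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def is_ready(self, batch_size):
        return len(self._storage) >= batch_size

    def __len__(self):
        return len(self._storage)


demo_buffer = SimpleReplayBuffer(capacity=3)
for t in range(5):
    demo_buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
print(len(demo_buffer), sorted(tr[0] for tr in demo_buffer._storage))
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain. A production buffer would typically preallocate arrays and return batched tensors instead of tuples.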

In [None]:
# Demonstrate ReplayBuffer usage
buffer = ReplayBuffer(capacity=1000, state_dim=4, action_dim=2)

# Simulate storing experience
for i in range(500):
    state = np.random.randn(4).astype(np.float32)
    action = np.random.randn(2).astype(np.float32)
    reward = np.random.randn()
    next_state = np.random.randn(4).astype(np.float32)
    done = i % 50 == 49  # end an episode every 50 steps
    
    buffer.push(state, action, reward, next_state, done)

print(f"Buffer size: {len(buffer)}")
print(f"Buffer ready for batch_size=64: {buffer.is_ready(64)}")

In [None]:
# Sample a batch
states, actions, rewards, next_states, dones = buffer.sample(32)

print("Batch shapes:")
print(f"  States: {states.shape}")
print(f"  Actions: {actions.shape}")
print(f"  Rewards: {rewards.shape}")
print(f"  Next states: {next_states.shape}")
print(f"  Dones: {dones.shape}")

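### 3.2 Neural Network Architecture

The Actor-Critic architecture consists of two networks:

| Network | Input | Output | Role |
|---------|-------|--------|------|
| **Actor** (policy network) | State $s$ | Action $a$ | Decides which action to take |
| **Critic** (value network) | State $s$, action $a$ | Q-value | Evaluates how good the action is |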

In [None]:
# Create a deterministic Actor (used by DDPG/TD3)
state_dim = 11
action_dim = 3
max_action = 1.0

det_actor = DeterministicActor(
    state_dim=state_dim,
    action_dim=action_dim,
    max_action=max_action,
    hidden_dims=[256, 256]
)

# Test the forward pass
test_state = torch.randn(1, state_dim)
action = det_actor(test_state)

print(f"Deterministic Actor:")
print(f"  Input shape: {test_state.shape}")
print(f"  Output shape: {action.shape}")
print(f"  Action bounds: [{action.min().item():.3f}, {action.max().item():.3f}]")
print(f"  Total parameters: {sum(p.numel() for p in det_actor.parameters()):,}")

In [None]:
# Create a stochastic Actor (used by SAC)
gauss_actor = GaussianActor(
    state_dim=state_dim,
    action_dim=action_dim,
    max_action=max_action,
    hidden_dims=[256, 256]
)

# Draw a random sample
action, log_prob = gauss_actor.sample(test_state)

print(f"\nGaussian Actor:")
print(f"  Action shape: {action.shape}")
print(f"  Log prob shape: {log_prob.shape}")
print(f"  Action bounds: [{action.min().item():.3f}, {action.max().item():.3f}]")

In [None]:
# Show the diversity of the stochastic policy
test_state_batch = test_state.repeat(100, 1)
actions_sampled, _ = gauss_actor.sample(test_state_batch)

fig, axes = plt.subplots(1, action_dim, figsize=(12, 3))
for i in range(action_dim):
    axes[i].hist(actions_sampled[:, i].detach().numpy(), bins=20, edgecolor='black')
    axes[i].set_xlabel(f'Action Dim {i+1}')
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'Distribution for Action {i+1}')

plt.suptitle('Gaussian Actor: Action Distribution from Same State', fontsize=14)
plt.tight_layout()
plt.show()

### 3.3 Twin Q-Networks

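**The Q-value overestimation problem**:

$$\mathbb{E}[\max_a Q(s', a)] \geq \max_a \mathbb{E}[Q(s', a)]$$

Because the Q estimates are noisy, taking the maximum over them yields a value that is biased upward in expectation.

**Solution**: use two independent Q-networks and take the minimum:

$$y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', a')$$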
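The bias is easy to reproduce numerically. The following is a toy illustration rather than the actual TD3 or SAC update: it assumes the true Q-value is 0 for every action, that two estimators see independent zero-mean Gaussian noise, and all names are hypothetical.

```python
import random
import statistics

random.seed(0)
NUM_ACTIONS = 10
NOISE_STD = 1.0
NUM_TRIALS = 5000


def noisy_q_values():
    """True Q is 0 for every action; each estimate carries zero-mean noise."""
    return [random.gauss(0.0, NOISE_STD) for _ in range(NUM_ACTIONS)]


single_max = []
clipped = []
for _ in range(NUM_TRIALS):
    q1 = noisy_q_values()
    q2 = noisy_q_values()
    # Single estimator: the max over noisy values is biased above the true value 0
    single_max.append(max(q1))
    # Two estimators: evaluate the greedy action with min(Q1, Q2) instead
    best = max(range(NUM_ACTIONS), key=lambda a: q1[a])
    clipped.append(min(q1[best], q2[best]))

mean_single_max = statistics.mean(single_max)
mean_clipped = statistics.mean(clipped)
print(f"single-estimator max: {mean_single_max:.3f}")
print(f"min of two estimators: {mean_clipped:.3f}")
```

Although every true value is 0, the single-estimator maximum averages well above 0, while the minimum of two estimators removes the upward bias and errs on the low side instead. A slight underestimate is generally considered less harmful, because it is not amplified by the policy chasing spuriously high values.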

In [None]:
# Create the Twin Q-Network
twin_q = TwinQNetwork(
    state_dim=state_dim,
    action_dim=action_dim,
    hidden_dims=[256, 256]
)

test_action = torch.randn(1, action_dim)
q1, q2 = twin_q(test_state, test_action)
q_min = twin_q.min_q(test_state, test_action)

print(f"Twin Q-Network:")
print(f"  Q1: {q1.item():.4f}")
print(f"  Q2: {q2.item():.4f}")
print(f"  min(Q1, Q2): {q_min.item():.4f}")
print(f"  Total parameters: {sum(p.numel() for p in twin_q.parameters()):,}")

---

## 4. DDPG: Deep Deterministic Policy Gradient

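### 4.1 Core Idea

DDPG extends DQN to continuous action spaces:

- **Deterministic policy**: $a = \mu_\theta(s)$ outputs the action directly
- **Experience replay**: breaks correlation between samples
- **Target networks**: stabilize the training targets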
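Target networks are commonly kept close to the online networks by Polyak averaging, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, which is the usual role of a `tau` hyperparameter such as the one in this tutorial's configurations. A minimal sketch follows, assuming parameters are plain lists of floats; `soft_update` is a hypothetical helper, and the tutorial's agents may implement this step differently.

```python
def soft_update(target_params, online_params, tau):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_target, w_online in zip(target_params, online_params)
    ]


online = [1.0, -2.0]   # hypothetical online-network weights
target = [0.0, 0.0]    # hypothetical target-network weights

# A single step with tau = 0.005 barely moves the target
one_step = soft_update(target, online, tau=0.005)
print(one_step)

# After many steps the target has drifted most of the way to the online weights
for _ in range(1000):
    target = soft_update(target, online, tau=0.005)
print(target)
```

A small `tau` makes the regression target for the Critic change slowly, which is what stabilizes training; `tau = 1.0` would reduce to copying the online network every step.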

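### 4.2 Mathematical Principles

**Deterministic policy gradient theorem**:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_a Q(s, a)|_{a=\mu(s)} \nabla_\theta \mu_\theta(s)\right]$$

**Intuition**: update the policy along the gradient of the Q-function with respect to the action, so that the actions it outputs attain higher Q-values.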
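The chain rule in the theorem can be followed by hand on a one-parameter problem. The sketch below assumes a toy critic $Q(s, a) = -(a - 2s)^2$ and a linear actor $\mu_\theta(s) = \theta s$; both are hypothetical stand-ins chosen so that the optimum $\theta = 2$ is known in advance.

```python
def q_value(state, action):
    """Toy critic: highest when action == 2 * state."""
    return -(action - 2.0 * state) ** 2


def dq_da(state, action):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def policy(theta, state):
    """Toy deterministic actor mu_theta(s) = theta * s."""
    return theta * state


def dpg_step(theta, states, lr):
    """One gradient-ascent step using dQ/da at a = mu(s) times dmu/dtheta = s."""
    grads = [dq_da(s, policy(theta, s)) * s for s in states]
    return theta + lr * sum(grads) / len(grads)


states = [-1.0, 0.5, 1.0, 2.0]  # hypothetical batch of states
theta = 0.0
for _ in range(100):
    theta = dpg_step(theta, states, lr=0.1)
print(f"learned theta = {theta:.4f}")
```

The actor never sees a reward directly: it improves only by following the critic's action gradient. This is why errors in the critic translate straight into errors in the policy.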

In [None]:
# Create the DDPG agent
ddpg_config = DDPGConfig(
    state_dim=4,
    action_dim=2,
    max_action=1.0,
    hidden_dims=[64, 64],  # small network for the demo
    buffer_size=10000,
    batch_size=64,
    lr_actor=1e-4,
    lr_critic=1e-3,
    gamma=0.99,
    tau=0.005,
    exploration_noise=0.1,
)

ddpg_agent = DDPGAgent(ddpg_config)
print(f"DDPG Agent created on device: {ddpg_agent.device}")

In [None]:
# Demonstrate action selection
state = np.random.randn(4).astype(np.float32)

# Training mode (with noise)
ddpg_agent.train_mode()
actions_train = [ddpg_agent.select_action(state) for _ in range(10)]

# Evaluation mode (deterministic)
ddpg_agent.eval_mode()
actions_eval = [ddpg_agent.select_action(state, deterministic=True) for _ in range(10)]

print("Training mode (with exploration noise):")
for i, a in enumerate(actions_train[:3]):
    print(f"  Action {i+1}: {a}")

print("\nEvaluation mode (deterministic):")
for i, a in enumerate(actions_eval[:3]):
    print(f"  Action {i+1}: {a}")

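### 4.3 Problems with DDPG

DDPG has the following known problems:

1. **Q-value overestimation**: a single Q-network tends to overestimate values
2. **Unstable training**: the policy and the value function influence each other, which can cause oscillation
3. **Insufficient exploration**: Gaussian action noise may not explore effectively

These problems motivated TD3 and SAC.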

---

## 5. TD3: Twin Delayed DDPG

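### 5.1 Three Improvements

TD3 addresses DDPG's problems with three key improvements:

| Improvement | Problem addressed | Implementation |
|-------------|-------------------|----------------|
| **Clipped Double Q** | Q-value overestimation | $\min(Q_1, Q_2)$ |
| **Delayed Policy Update** | Unstable training | Update the Actor once every $d$ steps |
| **Target Policy Smoothing** | Brittleness of the Q-function | Add noise to the target action |

### 5.2 Details

**1. Clipped Double Q-Learning**

$$y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')$$

The minimum of the two Q-networks is used as the target, giving a conservative value estimate.

**2. Delayed Policy Updates**

The Actor is updated only once per $d$ Critic updates (typically $d=2$), giving the Critic time to stabilize first.

**3. Target Policy Smoothing**

$$\tilde{a}' = \text{clip}(\mu_{\theta'}(s') + \text{clip}(\epsilon, -c, c), -a_{max}, a_{max})$$

Here $\epsilon$ is zero-mean Gaussian noise, clipped to $[-c, c]$ before it is added to the target action. This smooths the Q-function and keeps the policy from exploiting sharp peaks in it.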
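The two target-side formulas can be combined in a short NumPy sketch. The function names and the sample numbers are hypothetical, the critic outputs are hard-coded rather than produced by networks, and the terminal mask $(1 - d)$ is an added assumption that is common in implementations but does not appear in the TD3 target above.

```python
import numpy as np


def smoothed_target_action(mu_next, target_noise, noise_clip, max_action, rng):
    """Add clipped Gaussian noise to the target policy's action, then clip to bounds."""
    noise = rng.normal(0.0, target_noise, size=mu_next.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    return np.clip(mu_next + noise, -max_action, max_action)


def clipped_double_q_target(reward, done, q1_next, q2_next, gamma):
    """y = r + gamma * (1 - d) * min(Q1', Q2'); the (1 - d) mask is an assumption."""
    return reward + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)


rng = np.random.default_rng(0)
mu_next = np.array([[0.9, -0.2], [0.0, 0.5]])  # hypothetical target-actor outputs
a_tilde = smoothed_target_action(
    mu_next, target_noise=0.2, noise_clip=0.5, max_action=1.0, rng=rng
)

# Hypothetical target-critic outputs evaluated at (s', a_tilde)
q1_next = np.array([10.0, 3.0])
q2_next = np.array([9.0, 4.0])
reward = np.array([1.0, 1.0])
done = np.array([0.0, 1.0])
y = clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99)
print("smoothed actions:\n", a_tilde)
print("targets:", y)
```

For the first transition the target is $1 + 0.99 \cdot \min(10, 9) = 9.91$; for the second, the episode has ended, so only the reward remains. Both critics are then regressed toward the same `y`.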

In [None]:
# Create the TD3 agent
td3_config = TD3Config(
    state_dim=4,
    action_dim=2,
    max_action=1.0,
    hidden_dims=[64, 64],
    buffer_size=10000,
    batch_size=64,
    lr_actor=3e-4,
    lr_critic=3e-4,
    gamma=0.99,
    tau=0.005,
    policy_delay=2,      # update the Actor once every 2 Critic updates
    target_noise=0.2,    # target policy noise
    noise_clip=0.5,      # noise clipping range
    exploration_noise=0.1,
)

td3_agent = TD3Agent(td3_config)
print(f"TD3 Agent created")
print(f"  Policy delay: {td3_config.policy_delay}")
print(f"  Target noise: {td3_config.target_noise}")

In [None]:
# Demonstrate delayed policy updates
# Fill the buffer
for _ in range(200):
    s = np.random.randn(4).astype(np.float32)
    a = np.random.randn(2).astype(np.float32)
    td3_agent.store_transition(s, a, 1.0, s, False)

# Run several updates and observe when actor_loss appears
for i in range(5):
    metrics = td3_agent.update()
    has_actor = "actor_loss" in metrics
    print(f"Update {i+1}: critic_loss={metrics['critic_loss']:.4f}, "
          f"actor_loss={'%.4f' % metrics['actor_loss'] if has_actor else 'N/A'}")

---

## 6. SAC: Soft Actor-Critic

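### 6.1 Maximum Entropy Reinforcement Learning

SAC aims to maximize not only the return but also the **entropy** of the policy:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right]$$

where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]$ is the entropy.

**Why maximize entropy?**

1. **Encourages exploration**: higher entropy means a more random policy, which naturally visits more states
2. **Robustness**: the policy does not over-rely on a single optimal trajectory
3. **Multimodality**: the policy can learn several near-optimal modes of behavior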
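To see how entropy tracks randomness, the sketch below compares the closed-form entropy of a one-dimensional Gaussian policy with a Monte Carlo estimate of $-\mathbb{E}[\log \pi(a|s)]$. It assumes a plain, unsquashed Gaussian with a fixed standard deviation; the helper names are hypothetical, and a tanh-squashed policy such as the one SAC typically uses would need an extra correction term.

```python
import math
import random


def gaussian_entropy(std):
    """Closed-form differential entropy of a 1-D Gaussian: 0.5 * ln(2*pi*e*std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def gaussian_log_prob(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def monte_carlo_entropy(std, num_samples=20000, seed=0):
    """Estimate H = -E[log pi(a)] by sampling actions from the policy itself."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, std) for _ in range(num_samples)]
    return -sum(gaussian_log_prob(a, 0.0, std) for a in samples) / num_samples


for std in (0.1, 0.5, 1.0):
    print(f"std={std}: closed form={gaussian_entropy(std):.3f}, "
          f"Monte Carlo={monte_carlo_entropy(std):.3f}")
```

Entropy grows with the standard deviation, and for continuous actions it can be negative (a narrow Gaussian with `std=0.1` has entropy below zero). The Monte Carlo form is the one that matters in practice, since the agent only has sampled actions and their log-probabilities.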

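### 6.2 Soft Value Functions

**Soft Q-function**:

$$Q^\pi(s, a) = r + \gamma \mathbb{E}_{s'}[V^\pi(s')]$$

**Soft state-value function**:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a) - \alpha \log \pi(a|s)]$$

**Soft Bellman backup** (with $a' \sim \pi(\cdot|s')$ sampled from the current policy and $d$ the done flag):

$$y = r + \gamma (1-d)(\min_{i=1,2} Q_{\phi'_i}(s', a') - \alpha \log \pi(a'|s'))$$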
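A worked numeric instance of the soft Bellman backup may help. The function below is a scalar sketch with hypothetical names and made-up input values; a real agent would compute the same expression on batched tensors.

```python
def soft_td_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma):
    """y = r + gamma * (1 - d) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value


# Non-terminal transition: the entropy bonus -alpha * log_prob = +0.3 raises the target
y_running = soft_td_target(
    reward=1.0, done=0.0, q1_next=5.0, q2_next=4.5,
    log_prob_next=-1.5, alpha=0.2, gamma=0.99,
)

# Terminal transition: the bootstrap term is masked out entirely
y_terminal = soft_td_target(
    reward=1.0, done=1.0, q1_next=5.0, q2_next=4.5,
    log_prob_next=-1.5, alpha=0.2, gamma=0.99,
)
print(f"non-terminal target: {y_running:.3f}, terminal target: {y_terminal:.3f}")
```

For the non-terminal case, $\min(5.0, 4.5) - 0.2 \cdot (-1.5) = 4.8$, so $y = 1 + 0.99 \cdot 4.8 = 5.752$. Actions that the policy considers unlikely (very negative log-probability) receive a larger bonus, which is how the entropy term rewards exploration.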

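### 6.3 Automatic Temperature Tuning

The temperature parameter $\alpha$ controls the exploration-exploitation trade-off:

- $\alpha \to 0$: pure return maximization (similar to TD3)
- $\alpha \to \infty$: a fully random policy (maximum entropy)

SAC learns $\alpha$ automatically so as to maintain a target entropy:

$$J(\alpha) = \mathbb{E}_{a \sim \pi}[-\alpha(\log \pi(a|s) + \bar{\mathcal{H}})]$$

The target entropy is typically set to $\bar{\mathcal{H}} = -\dim(\mathcal{A})$.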
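The direction of the update follows directly from $J(\alpha)$. The sketch below takes one gradient-descent step on $J$, assuming $\alpha$ is parameterized as $\exp(\log\alpha)$ to keep it positive; that parameterization, the function name, and the sample log-probabilities are assumptions for illustration, and `SACAgent` may do this differently.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr):
    """One gradient-descent step on J(alpha), with alpha = exp(log_alpha)."""
    alpha = math.exp(log_alpha)
    mean_log_prob = sum(log_probs) / len(log_probs)
    # dJ/d(alpha) = -(E[log pi] + target_entropy); the chain rule adds a factor alpha
    grad = -alpha * (mean_log_prob + target_entropy)
    return log_alpha - lr * grad


target_entropy = -2.0          # -dim(A) for a 2-D action space
log_alpha = math.log(0.2)

# Policy too deterministic: entropy = -E[log pi] = -3.0 is below the target
low_entropy_log_probs = [3.1, 2.9, 3.0]
raised = update_log_alpha(log_alpha, low_entropy_log_probs, target_entropy, lr=0.1)

# Policy too random: entropy = -1.0 is above the target
high_entropy_log_probs = [1.1, 0.9, 1.0]
lowered = update_log_alpha(log_alpha, high_entropy_log_probs, target_entropy, lr=0.1)

print(f"alpha: start={math.exp(log_alpha):.4f}, "
      f"low-entropy case={math.exp(raised):.4f}, "
      f"high-entropy case={math.exp(lowered):.4f}")
```

When the policy's entropy falls below the target, $\alpha$ rises and the entropy bonus gets more weight; when the entropy is above the target, $\alpha$ shrinks. The temperature therefore behaves like a Lagrange multiplier enforcing a minimum-entropy constraint.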

In [None]:
# Create the SAC agent
sac_config = SACConfig(
    state_dim=4,
    action_dim=2,
    max_action=1.0,
    hidden_dims=[64, 64],
    buffer_size=10000,
    batch_size=64,
    lr_actor=3e-4,
    lr_critic=3e-4,
    gamma=0.99,
    tau=0.005,
    auto_alpha=True,     # tune the temperature automatically
    initial_alpha=0.2,
)

sac_agent = SACAgent(sac_config)
print(f"SAC Agent created")
print(f"  Auto alpha: {sac_config.auto_alpha}")
print(f"  Initial alpha: {sac_agent.alpha:.4f}")
print(f"  Target entropy: {sac_agent.target_entropy:.4f}")

In [None]:
# Demonstrate SAC's stochastic policy
state = np.random.randn(4).astype(np.float32)

print("SAC Stochastic Policy - Same state, different actions:")
for i in range(5):
    action = sac_agent.select_action(state, deterministic=False)
    print(f"  Sample {i+1}: {action}")

print("\nSAC Deterministic (evaluation mode):")
for i in range(3):
    action = sac_agent.select_action(state, deterministic=True)
    print(f"  Sample {i+1}: {action}")

In [None]:
# Demonstrate automatic temperature tuning
# Fill the buffer
for _ in range(200):
    s = np.random.randn(4).astype(np.float32)
    a = np.random.randn(2).astype(np.float32)
    sac_agent.store_transition(s, a, 1.0, s, False)

alphas = [sac_agent.alpha]
entropies = []

for i in range(100):
    metrics = sac_agent.update()
    alphas.append(sac_agent.alpha)
    if 'entropy' in metrics:
        entropies.append(metrics['entropy'])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(alphas)
ax1.set_xlabel('Update Step')
ax1.set_ylabel('Alpha (Temperature)')
ax1.set_title('Automatic Temperature Adjustment')
ax1.grid(True, alpha=0.3)

ax2.plot(entropies)
ax2.axhline(y=sac_agent.target_entropy, color='r', linestyle='--', label='Target Entropy')
ax2.set_xlabel('Update Step')
ax2.set_ylabel('Policy Entropy')
ax2.set_title('Policy Entropy Over Training')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

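## 7. Algorithm Comparison and Selection Guide

### 7.1 Feature Comparison

| Feature | DDPG | TD3 | SAC |
|---------|------|-----|-----|
| Policy type | Deterministic | Deterministic | Stochastic |
| Number of Q-networks | 1 | 2 (twin) | 2 (twin) |
| Exploration | External noise | External noise | Entropy maximization |
| Temperature parameter | None | None | Automatically tuned |
| Stability | Low | Medium | High |
| Sample efficiency | Good | Good | Best |
| Implementation complexity | Simple | Medium | Medium |

### 7.2 Recommendations

**Choose SAC if**:
- You need stable, reliable training
- The task requires exploration
- You do not want to tune exploration-noise parameters

**Choose TD3 if**:
- You need a deterministic policy
- Compute resources are constrained
- The task is relatively simple

**Choose DDPG if**:
- The goal is teaching or learning
- You need a baseline for comparison
- The task is simple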

In [None]:
# Visualize how the three algorithms differ
algorithms = ['DDPG', 'TD3', 'SAC']
categories = ['Stability', 'Sample Efficiency', 'Simplicity', 
              'Exploration', 'Hyperparameter Robustness']

# Scores (subjective assessment, 1-5)
scores = {
    'DDPG': [2, 3, 5, 2, 2],
    'TD3':  [4, 4, 3, 3, 4],
    'SAC':  [5, 5, 3, 5, 5],
}

# Radar chart
angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
for algo, color in zip(algorithms, colors):
    values = scores[algo] + scores[algo][:1]
    ax.plot(angles, values, linewidth=2, label=algo, color=color)
    ax.fill(angles, values, alpha=0.1, color=color)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=10)
ax.set_ylim(0, 5)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
ax.set_title('Algorithm Comparison', fontsize=14, pad=20)

plt.tight_layout()
plt.show()

---

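## 8. Practice: Full Training Loop Demo

The following shows how to train an agent with these algorithms. Since no real environment is available, we use a simple mock environment.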

In [None]:
class MockEnvironment:
    """Mock environment for demonstrating the training loop."""
    
    def __init__(self, state_dim=4, action_dim=2, max_action=1.0):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.max_action = max_action
        self._step = 0
        self._state = None
        
        class ActionSpace:
            def __init__(self, dim, high):
                self.shape = (dim,)
                self.high = np.array([high] * dim)
                self.low = -self.high
            def sample(self):
                return np.random.uniform(self.low, self.high)
        
        self.action_space = ActionSpace(action_dim, max_action)
    
    def reset(self):
        self._step = 0
        self._state = np.random.randn(self.state_dim).astype(np.float32)
        return self._state, {}
    
    def step(self, action):
        self._step += 1
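        # Simple reward: negative squared distance from the origin plus an action penalty
        reward = -np.sum(self._state**2) - 0.1 * np.sum(action**2)
        
        # Simple dynamics
        self._state = (
            self._state * 0.95 + 
            np.tanh(action).sum() * 0.1 * np.ones(self.state_dim) +
            np.random.randn(self.state_dim).astype(np.float32) * 0.05
        ).astype(np.float32)  # keep float32, matching reset()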
        
        done = self._step >= 100
        return self._state, reward, done, False, {}

env = MockEnvironment()
print(f"Environment: state_dim={env.state_dim}, action_dim={env.action_dim}")

In [None]:
def train_agent(agent, env, num_steps=1000, start_steps=200):
    """Simplified training loop."""
    state, _ = env.reset()
    episode_reward = 0
    episode_rewards = []
    metrics_history = []
    
    agent.train_mode()
    
    for step in range(num_steps):
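        # Select an action (uniformly random during warm-up)
        if step < start_steps:
            action = env.action_space.sample()
        else:
            action = agent.select_action(state)
        
        # Interact with the environment
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # either flag ends the episode
        # Store only true termination so time-limit cutoffs still bootstrap
        agent.store_transition(state, action, reward, next_state, terminated)
        episode_reward += reward
        
        # Learning update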
        if step >= start_steps:
            metrics = agent.update()
            if metrics:
                metrics_history.append(metrics)
        
        state = next_state
        
        if done:
            episode_rewards.append(episode_reward)
            state, _ = env.reset()
            episode_reward = 0
    
    return episode_rewards, metrics_history

In [None]:
# Train the three algorithms and compare them
results = {}

# Use small settings for a quick demo
num_steps = 2000
start_steps = 500

# DDPG
print("Training DDPG...")
ddpg = DDPGAgent(DDPGConfig(
    state_dim=4, action_dim=2, max_action=1.0,
    hidden_dims=[64, 64], buffer_size=5000, batch_size=64
))
results['DDPG'] = train_agent(ddpg, MockEnvironment(), num_steps, start_steps)

# TD3
print("Training TD3...")
td3 = TD3Agent(TD3Config(
    state_dim=4, action_dim=2, max_action=1.0,
    hidden_dims=[64, 64], buffer_size=5000, batch_size=64,
    policy_delay=2
))
results['TD3'] = train_agent(td3, MockEnvironment(), num_steps, start_steps)

# SAC
print("Training SAC...")
sac = SACAgent(SACConfig(
    state_dim=4, action_dim=2, max_action=1.0,
    hidden_dims=[64, 64], buffer_size=5000, batch_size=64,
    auto_alpha=True
))
results['SAC'] = train_agent(sac, MockEnvironment(), num_steps, start_steps)

print("Training complete!")

In [None]:
# Visualize the training results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

colors = {'DDPG': '#1f77b4', 'TD3': '#ff7f0e', 'SAC': '#2ca02c'}

# Episode rewards
for name, (rewards, _) in results.items():
    ax1.plot(rewards, label=name, color=colors[name], alpha=0.7)

ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Episode Return', fontsize=12)
ax1.set_title('Training Progress: Episode Returns', fontsize=14)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Critic loss
for name, (_, metrics) in results.items():
    if metrics:
        losses = [m.get('critic_loss', 0) for m in metrics]
        ax2.plot(losses, label=name, color=colors[name], alpha=0.7)

ax2.set_xlabel('Update Step', fontsize=12)
ax2.set_ylabel('Critic Loss', fontsize=12)
ax2.set_title('Training Progress: Critic Loss', fontsize=14)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 9. Hyperparameter Tuning Guide

### 9.1 Recommended Hyperparameters

| Parameter | DDPG | TD3 | SAC |
|------|------|-----|-----|
| `lr_actor` | 1e-4 | 3e-4 | 3e-4 |
| `lr_critic` | 1e-3 | 3e-4 | 3e-4 |
| `gamma` | 0.99 | 0.99 | 0.99 |
| `tau` | 0.005 | 0.005 | 0.005 |
| `batch_size` | 256 | 256 | 256 |
| `buffer_size` | 1e6 | 1e6 | 1e6 |
| `hidden_dims` | [256, 256] | [256, 256] | [256, 256] |

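### 9.2 Troubleshooting Common Problems

**Q-values exploding / NaN**:
- Lower the learning rate
- Check the reward scale
- Add gradient clipping

**Training does not converge**:
- Increase `start_steps` (more random exploration before learning starts)
- Check that `gamma` is appropriate
- Increase `batch_size`

**Insufficient exploration**:
- DDPG/TD3: increase `exploration_noise`
- SAC: check that `alpha` behaves sensibly (too low a value means too little exploration)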
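For the gradient-clipping suggestion above, one common variant rescales all gradients so that their combined L2 norm does not exceed a threshold. The sketch below shows the arithmetic on plain Python lists; `clip_by_global_norm` is a hypothetical helper. In PyTorch the same effect is usually obtained by calling `torch.nn.utils.clip_grad_norm_` between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of per-layer gradients so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    # Scale is 1.0 when the norm is already small enough, otherwise max_norm / norm
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in layer] for layer in grads]
    return clipped, total_norm


# Hypothetical gradients for two layers; their joint norm is 5.0
exploding = [[3.0], [4.0]]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for layer in clipped for g in layer))
print(f"norm before: {norm_before:.3f}, norm after: {norm_after:.3f}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited. Gradients whose norm is already below the threshold pass through unchanged.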

---

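## 10. Summary

### Key Takeaways

1. **DDPG** pioneered deep deterministic policy gradients, but suffers from overestimation and instability

2. **TD3** significantly improves stability through three improvements (Clipped Double Q, Delayed Updates, Target Smoothing)

3. **SAC** introduces the maximum-entropy framework, balances exploration and exploitation automatically, and is the most stable choice of the three

### Recommendations

- **Production**: prefer SAC
- **Need determinism**: choose TD3
- **Learning**: start with DDPG

### Next Steps

- Test in real environments (such as MuJoCo or Gymnasium)
- Experiment with the hyperparameters and observe the effect
- Read the original papers for the full mathematical derivations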

---

## References

1. Lillicrap et al. (2016). "Continuous control with deep reinforcement learning" (DDPG)
2. Fujimoto et al. (2018). "Addressing Function Approximation Error in Actor-Critic Methods" (TD3)
3. Haarnoja et al. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning" (SAC)