# Deep Q-Network Variants: A Detailed Guide

---

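## Module Overview

This module systematically introduces the main variants of Deep Q-Network (DQN), covering theoretical derivations, core innovations, implementation details, and comparative analysis.

### Learning Objectives

After completing this module, you will be able to:

1. **Understand the limitations of the original DQN** and its failure modes in practice
2. **Master the mathematics behind each DQN variant**: Double DQN, Dueling DQN, Noisy Networks, Categorical DQN, and others
3. **Implement several replay buffer strategies**: Uniform, Prioritized, N-step
4. **Analyze how Rainbow** combines these improvements to reach state-of-the-art Atari performance at the time of its publication
5. **Run comparison experiments** across algorithms and interpret the results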

---

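## Table of Contents

1. [Environment Setup](#1-environment-setup)
2. [Background and Limitations of DQN](#2-background-and-limitations-of-dqn)
3. [Double DQN: Reducing Overestimation Bias](#3-double-dqn-reducing-overestimation-bias)
4. [Dueling DQN: Value-Advantage Decomposition](#4-dueling-dqn-value-advantage-decomposition)
5. [Noisy Networks: Parameterized Exploration](#5-noisy-networks-parameterized-exploration)
6. [Categorical DQN (C51): Distributional Reinforcement Learning](#6-categorical-dqn-c51-distributional-reinforcement-learning)
7. [Prioritized Experience Replay](#7-prioritized-experience-replay)
8. [N-step Learning: Multi-step Returns](#8-n-step-learning-multi-step-returns)
9. [Rainbow: Combining the Improvements](#9-rainbow-combining-the-improvements)
10. [Experimental Comparison and Analysis](#10-experimental-comparison-and-analysis)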

---

## 1. Environment Setup

In [None]:
# Standard library
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Scientific computing
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# Local module imports
from core.config import DQNVariantConfig
from core.enums import DQNVariant
from buffers import ReplayBuffer, PrioritizedReplayBuffer, NStepReplayBuffer, SumTree
from networks import DQNNetwork, DuelingNetwork, NoisyLinear, NoisyNetwork, CategoricalNetwork, RainbowNetwork
from agents import DQNVariantAgent

In [None]:
# Random seed and device configuration
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

In [None]:
# Plotting configuration
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.unicode_minus'] = False

---

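## 2. Background and Limitations of DQN

### 2.1 Review of the Original DQN

Deep Q-Network (Mnih et al., 2015) is a milestone in deep reinforcement learning.

**Core components**:
- A neural network approximating the Q-function: $Q(s, a; \theta)$
- An experience replay buffer
- A target network with parameters $\theta^-$, a periodically updated copy of $\theta$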
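
As a rough illustration of the experience replay component listed above, here is a minimal uniform buffer in plain Python. The class name `UniformReplayBuffer`, the `Transition` tuple, and their fields are hypothetical names chosen for this sketch; they are not the `ReplayBuffer` imported from `buffers` in this notebook, whose implementation may differ.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition container; the field names are illustrative.
Transition = namedtuple("Transition", "state action reward next_state done")


class UniformReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling (illustrative sketch)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition once full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement; every stored transition is equally likely.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = UniformReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(2)
print(len(buffer), [tr.state for tr in batch])  # only states 2, 3, 4 remain
```

Sampling uniformly from past transitions breaks the temporal correlation between consecutive updates, which is the property the prioritized and n-step buffers later in this module build on.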

**TD target**:
$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$
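
The TD target can be computed for a whole batch in a few lines. The sketch below uses NumPy arrays as stand-ins for target-network outputs; the function name `dqn_td_target` and the `dones` mask (no bootstrapping from terminal states) are assumptions of this example, following common practice rather than anything shown in the formula itself. A discount of 0.5 is used only to keep the arithmetic easy to check.

```python
import numpy as np


def dqn_td_target(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with no bootstrap at terminals.

    next_q_target has shape (batch, n_actions) and stands in for the
    target network's output on the next states.
    """
    max_next_q = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q


rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])            # the second transition is terminal
next_q_target = np.array([[0.5, 2.0],   # made-up target-network outputs
                          [3.0, 1.0]])

y = dqn_td_target(rewards, next_q_target, dones, gamma=0.5)
print(y)  # first entry: 1 + 0.5 * 2.0 = 2.0; second entry: 0.0 (terminal)
```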

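### 2.2 Four Limitations of the Original DQN

| Problem | Cause | Consequence | Remedy |
|------|------|------|----------|
| **Overestimation bias** | The max operator is used for both action selection and evaluation | Instability, suboptimal policies | Double DQN |
| **Low sample efficiency** | Uniform random sampling | Slow learning, wasted data | Prioritized Replay |
| **Weak exploration** | ε-greedy is state-independent | Hard to escape local optima | Noisy Networks |
| **Scalar value only** | Only the expected return is modeled | Distributional information is lost; risk-neutral | Categorical DQN |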

In [None]:
# Visualize overestimation bias and exploration schedules.
# Everything is drawn in one cell so the inline backend renders the finished figure.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: the max over noisy Q estimates is biased upward
np.random.seed(42)
true_q = np.array([1.0, 1.5, 0.8, 1.2])
noise = np.random.randn(4, 1000) * 0.5
estimated_q = true_q[:, np.newaxis] + noise
max_estimated = np.max(estimated_q, axis=0)

ax1 = axes[0]
ax1.hist(max_estimated, bins=30, alpha=0.7, density=True, label='max of noisy Q estimates')
ax1.axvline(np.max(true_q), color='red', linestyle='--', linewidth=2, 
            label=f'max Q* = {np.max(true_q)}')
ax1.axvline(np.mean(max_estimated), color='blue', linestyle='-', linewidth=2, 
            label=f'E[max Q] = {np.mean(max_estimated):.2f}')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')
ax1.set_title('Overestimation Bias')
ax1.legend()

# Right panel: schematic exploration curves (hand-written functions, not measured data)
ax2 = axes[1]
steps = np.arange(50000)
epsilon_greedy = np.maximum(1.0 - steps / 25000, 0.1)
noisy_exploration = 0.5 * np.exp(-steps / 15000) + 0.1 * np.sin(steps / 2500)
noisy_exploration = np.maximum(noisy_exploration, 0.05)

ax2.plot(steps, epsilon_greedy, 'r-', linewidth=2, label='ε-greedy')
ax2.plot(steps, noisy_exploration, 'g-', linewidth=2, alpha=0.8, label='Noisy Networks')
ax2.set_xlabel('Training steps')
ax2.set_ylabel('Exploration amount')
ax2.set_title('Exploration Strategies (schematic)')
ax2.legend()

plt.tight_layout()
plt.show()

---

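## 3. Double DQN: Reducing Overestimation Bias

### 3.1 Core Idea

**Problem**: The max operator in standard DQN causes systematic overestimation. When the Q estimates are noisy, the expected maximum is at least the maximum of the expectations:
$$\mathbb{E}[\max_a Q(s,a)] \geq \max_a \mathbb{E}[Q(s,a)]$$

**Solution**: Decouple action selection from action evaluation

$$y^{\text{Double}} = r + \gamma Q\left(s', \underbrace{\arg\max_{a'} Q(s', a'; \theta)}_{\text{selected by the online network}}; \theta^-\right)$$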
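
To make the decoupling concrete, the following sketch computes the Double DQN target for a small batch next to the vanilla target. The arrays are made-up stand-ins for the online and target network outputs, and the function name `double_dqn_target` and the `dones` mask are assumptions of this example. A discount of 0.5 keeps the numbers easy to verify by hand.

```python
import numpy as np


def double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    # The online network picks the greedy action for each next state...
    best_actions = next_q_online.argmax(axis=1)
    # ...and the target network evaluates that chosen action.
    batch_idx = np.arange(len(best_actions))
    evaluated = next_q_target[batch_idx, best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated


rewards = np.array([0.0, 1.0])
dones = np.array([0.0, 0.0])
next_q_online = np.array([[1.0, 3.0], [2.0, 0.0]])   # selects actions 1 and 0
next_q_target = np.array([[4.0, 2.0], [1.0, 5.0]])   # their values: 2.0 and 1.0

y_double = double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.5)
y_vanilla = rewards + 0.5 * next_q_target.max(axis=1)
print(y_double)   # values 1.0 and 1.5
print(y_vanilla)  # values 2.0 and 3.5, never smaller than the Double DQN target
```

Because the evaluated action is not necessarily the target network's own maximizer, the Double DQN target is never larger than the vanilla one for the same inputs.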

In [None]:
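def simulate_overestimation(n_actions=10, n_samples=1000, noise_std=1.0):
    """Simulate overestimation bias with zero true Q-values and Gaussian noise."""
    true_q = np.zeros(n_actions)
    
    # Standard DQN: the same noisy estimate both selects and evaluates the action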
    dqn_estimates = []
    for _ in range(n_samples):
        noise = np.random.randn(n_actions) * noise_std
        estimated_q = true_q + noise
        dqn_estimates.append(np.max(estimated_q))
    
    # Double DQN: independent noisy estimates for selection and evaluation
    double_dqn_estimates = []
    for _ in range(n_samples):
        noise_online = np.random.randn(n_actions) * noise_std
        noise_target = np.random.randn(n_actions) * noise_std
        estimated_online = true_q + noise_online
        estimated_target = true_q + noise_target
        best_action = np.argmax(estimated_online)
        double_dqn_estimates.append(estimated_target[best_action])
    
    return np.mean(dqn_estimates), np.mean(double_dqn_estimates), np.max(true_q)

In [None]:
# Compare overestimation for different numbers of actions
action_counts = [2, 4, 8, 16, 32, 64]
dqn_bias = []
ddqn_bias = []

for n_actions in action_counts:
    dqn_est, ddqn_est, true_max = simulate_overestimation(n_actions=n_actions)
    dqn_bias.append(dqn_est - true_max)
    ddqn_bias.append(ddqn_est - true_max)

In [None]:
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(action_counts, dqn_bias, 'ro-', linewidth=2, markersize=10, label='DQN (overestimation)')
plt.plot(action_counts, ddqn_bias, 'g^-', linewidth=2, markersize=10, label='Double DQN (corrected)')
plt.axhline(0, color='gray', linestyle='--', linewidth=1, label='True value')
plt.xlabel('Number of Actions')
plt.ylabel('Estimation Bias')
plt.title('Overestimation Bias: DQN vs Double DQN')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xscale('log', base=2)
plt.show()

---

## 4. Dueling DQN: Value-Advantage Decomposition

### 4.1 Core Idea

Decompose the Q-function into a state value V(s) and an action advantage A(s,a):

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$$

- **V(s)**: How good is this state? (independent of the action)
- **A(s,a)**: How much better is action a than the average action?
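
The aggregation formula above can be checked numerically. This sketch works on plain NumPy arrays rather than on the `DuelingNetwork` class from `networks`; the function name `dueling_q` is hypothetical. It also shows why the mean is subtracted: shifting every advantage by the same constant leaves Q unchanged, so V and A stay identifiable.

```python
import numpy as np


def dueling_q(value, advantage):
    """Combine V(s) of shape (batch, 1) with A(s, a) of shape (batch, n_actions)."""
    return value + advantage - advantage.mean(axis=1, keepdims=True)


value = np.array([[2.0]])
advantage = np.array([[1.0, -1.0, 3.0]])   # mean advantage is 1.0

q = dueling_q(value, advantage)
print(q)  # Q-values 2.0, 0.0, 4.0; their mean equals V(s) = 2.0

# Adding a constant to every advantage does not change Q.
print(np.allclose(q, dueling_q(value, advantage + 10.0)))
```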

In [None]:
# Demonstrate a forward pass through the Dueling network
state_dim, action_dim, hidden_dim = 4, 2, 64
dueling_net = DuelingNetwork(state_dim, action_dim, hidden_dim)
print("Dueling Network Architecture:")
print(dueling_net)

In [None]:
# Run a forward pass and inspect the intermediate results
sample_state = torch.randn(1, state_dim)

with torch.no_grad():
    features = F.relu(dueling_net.feature[0](sample_state))
    value = dueling_net.value_stream(features)
    advantage = dueling_net.advantage_stream(features)
    q_values = dueling_net(sample_state)

print(f"\nV(s) = {value.item():.4f}")
print(f"A(s,a) = {advantage.numpy().flatten()}")
print(f"Q(s,a) = {q_values.numpy().flatten()}")

---

## 5. Noisy Networks: Parameterized Exploration

### 5.1 Core Idea

Replace ε-greedy exploration with learnable, parameterized noise:

$$y = (\mu^w + \sigma^w \odot \varepsilon^w) x + (\mu^b + \sigma^b \odot \varepsilon^b)$$

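**Advantages**:
1. State-dependent exploration
2. Self-tuning noise scale: σ is learned, so the amount of exploration can shrink as training progresses
3. End-to-end learning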
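
The noisy linear transform can be sketched as a forward pass in NumPy. This example assumes the factorized Gaussian noise scheme from Fortunato et al., with $f(x) = \mathrm{sign}(x)\sqrt{|x|}$ and an initial σ of $0.5/\sqrt{\text{in\_features}}$; whether the `NoisyLinear` class in `networks` uses the same scheme is not shown in this notebook. The class name `NoisyLinearSketch` is hypothetical, and in real training μ and σ would be trainable parameters rather than fixed arrays.

```python
import numpy as np


def scaled_noise(rng, size):
    # f(x) = sign(x) * sqrt(|x|), applied to standard normal samples.
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))


class NoisyLinearSketch:
    """Forward pass only; in training, mu and sigma would be learnable."""

    def __init__(self, in_features, out_features, sigma0=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_features)
        self.weight_mu = self.rng.uniform(-bound, bound, (out_features, in_features))
        self.bias_mu = self.rng.uniform(-bound, bound, out_features)
        self.weight_sigma = np.full((out_features, in_features), sigma0 * bound)
        self.bias_sigma = np.full(out_features, sigma0 * bound)
        self.reset_noise()

    def reset_noise(self):
        eps_in = scaled_noise(self.rng, self.weight_mu.shape[1])
        eps_out = scaled_noise(self.rng, self.weight_mu.shape[0])
        self.weight_eps = np.outer(eps_out, eps_in)  # factorized weight noise
        self.bias_eps = eps_out

    def forward(self, x, noisy=True):
        # y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)
        scale = 1.0 if noisy else 0.0
        weight = self.weight_mu + scale * self.weight_sigma * self.weight_eps
        bias = self.bias_mu + scale * self.bias_sigma * self.bias_eps
        return x @ weight.T + bias


layer = NoisyLinearSketch(in_features=4, out_features=2)
x = np.ones((1, 4))
first = layer.forward(x)
layer.reset_noise()
second = layer.forward(x)
print(first, second)                   # differ because the noise was resampled
print(layer.forward(x, noisy=False))   # deterministic mean output
```

Factorized noise needs only `in_features + out_features` random draws per layer instead of one per weight, which is the usual reason for preferring it in DQN-style agents.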

In [None]:
# Demonstrate the NoisyLinear layer
in_features, out_features = 64, 32
noisy_layer = NoisyLinear(in_features, out_features)

print("NoisyLinear Layer:")
print(f"  mu_weight shape: {noisy_layer.weight_mu.shape}")
print(f"  sigma_weight shape: {noisy_layer.weight_sigma.shape}")

In [None]:
# Show how resampling the noise affects the output
sample_input = torch.randn(1, in_features)

outputs = []
for _ in range(100):
    noisy_layer.reset_noise()
    with torch.no_grad():
        output = noisy_layer(sample_input)
    outputs.append(output.numpy().flatten())

outputs = np.array(outputs)
print(f"\nOutput std (due to noise): {outputs.std(axis=0)[:5]}")

---

## 6. Categorical DQN (C51): Distributional Reinforcement Learning

### 6.1 Core Idea

Move from modeling the expected value to modeling the full return distribution:

$$Z(s, a) \sim \text{Categorical}(z_1, ..., z_N; p_1, ..., p_N)$$

**Support atoms**:
$$z_i = V_{\min} + (i - 1) \cdot \Delta z, \quad i = 1, \dots, N, \quad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}$$
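
Given the atoms and their probabilities, the scalar Q-value used for action selection is the expectation of the categorical distribution, $Q(s, a) = \sum_i z_i \, p_i(s, a)$. The sketch below uses a tiny 5-atom support and made-up probabilities in place of a network output; all variable names are chosen for this example.

```python
import numpy as np

n_atoms, v_min, v_max = 5, -2.0, 2.0
support = np.linspace(v_min, v_max, n_atoms)   # atoms -2, -1, 0, 1, 2 (delta_z = 1)

# Stand-in network output: probabilities with shape (n_actions, n_atoms).
probs = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0],   # all mass at z = 0
    [0.1, 0.1, 0.1, 0.2, 0.5],   # mass skewed toward high returns
])

q_values = probs @ support        # expected return per action
greedy_action = int(q_values.argmax())
print(q_values, greedy_action)    # Q is about 0.0 and 0.9, so action 1 is greedy
```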

In [None]:
# Visualize the C51 distribution representation
n_atoms = 51
v_min, v_max = -10, 10
support = np.linspace(v_min, v_max, n_atoms)
delta_z = (v_max - v_min) / (n_atoms - 1)

In [None]:
# Simulate the return distributions of two actions
def create_distribution(mean, std, support):
    probs = np.exp(-(support - mean)**2 / (2 * std**2))
    return probs / probs.sum()

# Both are centered at 0 so truncation to [v_min, v_max] is symmetric and the means stay equal
dist_action1 = create_distribution(mean=0, std=2, support=support)
dist_action2 = create_distribution(mean=0, std=5, support=support)

In [None]:
# Plot the two distributions
plt.figure(figsize=(10, 6))
plt.bar(support, dist_action1, width=delta_z*0.8, alpha=0.7, label='Action 1 (low var)')
plt.bar(support, dist_action2, width=delta_z*0.4, alpha=0.5, label='Action 2 (high var)')
plt.xlabel('Return value')
plt.ylabel('Probability')
plt.title('C51: Return Distributions (same mean, different variance)')
plt.legend()
plt.show()

---

## 7. Prioritized Experience Replay

### 7.1 Core Idea

Assign sampling priorities according to the magnitude of the TD error:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}, \quad p_i = |\delta_i| + \epsilon$$
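
The sampling rule can be tried out directly on a handful of TD errors. The sketch below also computes importance-sampling weights in the form $w_i = (N \cdot P(i))^{-\beta}$, normalized by the maximum weight, which is the correction described by Schaul et al.; the function names and the specific α, β, and ε values are assumptions of this example and need not match `PrioritizedReplayBuffer`.

```python
import numpy as np


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    priorities = np.abs(td_errors) + eps          # p_i = |delta_i| + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()                  # P(i)


def importance_weights(probs, beta=0.4):
    n = len(probs)
    weights = (n * probs) ** (-beta)
    return weights / weights.max()                # normalize so the largest weight is 1


td_errors = np.array([0.1, 0.5, 2.0, 0.0])
probs = per_probabilities(td_errors)
weights = importance_weights(probs)

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=2, p=probs)
print(probs.round(3), weights.round(3), batch_idx)
```

With `alpha=0` the probabilities collapse to uniform sampling, and the transition with the largest TD error always receives the smallest importance weight.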

In [None]:
# Demonstrate the SumTree data structure
tree = SumTree(capacity=8)
priorities = [1.0, 3.0, 2.0, 4.0, 1.5, 2.5, 3.5, 0.5]

for i, p in enumerate(priorities):
    tree.add(p, f"data_{i}")

print(f"Total priority: {tree.total_priority}")
print(f"Expected: {sum(priorities)}")
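
The `SumTree` used above comes from the local `buffers` module, whose source is not shown here. As a sketch of the general idea, the class below stores priorities in the leaves of an array-backed binary tree and their sums in the inner nodes, so that both updating a priority and sampling proportionally to priority take O(log n). The name `SumTreeSketch` and its method signatures are hypothetical.

```python
class SumTreeSketch:
    """Binary tree whose leaves hold priorities and whose inner nodes hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # inner nodes first, then leaves
        self.data = [None] * capacity
        self.write = 0                            # next leaf slot to overwrite

    @property
    def total(self):
        return self.tree[0]

    def add(self, priority, item):
        self.data[self.write] = item
        self.update(self.write + self.capacity - 1, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, tree_idx, priority):
        change = priority - self.tree[tree_idx]
        self.tree[tree_idx] = priority
        while tree_idx > 0:                       # propagate the change to the root
            tree_idx = (tree_idx - 1) // 2
            self.tree[tree_idx] += change

    def get(self, cumulative):
        """Return (tree_idx, priority, item) for a value in [0, total)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):       # descend until a leaf is reached
            left = 2 * idx + 1
            if cumulative < self.tree[left]:
                idx = left
            else:
                cumulative -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]


tree = SumTreeSketch(capacity=4)
for i, p in enumerate([1.0, 3.0, 2.0, 4.0]):
    tree.add(p, f"data_{i}")

print(tree.total)        # 10.0
print(tree.get(0.5))     # lands in data_0's segment [0, 1)
print(tree.get(5.0))     # lands in data_2's segment [4, 6)
```

Drawing `cumulative` uniformly from `[0, total)` and calling `get` returns each item with probability proportional to its priority, which is how a prioritized buffer typically samples a batch.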

In [None]:
# Demonstrate PER sampling
per_buffer = PrioritizedReplayBuffer(capacity=100, alpha=0.6, beta_start=0.4)

for _ in range(50):
    state = np.random.randn(4).astype(np.float32)
    per_buffer.push(state, 0, 1.0, state, False)

# Sample a batch and inspect the importance-sampling weights
batch = per_buffer.sample(8)
weights = batch[-1]
print(f"IS weights: {weights}")

---

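## 8. N-step Learning: Multi-step Returns

### 8.1 Core Idea

The n-step return trades off bias against variance: a larger n relies less on the bootstrapped estimate (lower bias) but sums more random rewards (higher variance):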

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$$
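
A worked example of the formula, in plain Python. The function name `n_step_return` and the numbers are made up for illustration; in a DQN-style agent the bootstrap value would typically be $\max_a Q(S_{t+n}, a; \theta^-)$ rather than a given constant. A discount of 0.5 keeps the arithmetic exact.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """G = sum_k gamma^k * r_k + gamma^n * V(s_n), where n = len(rewards)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value


# Three rewards followed by a bootstrap estimate of 10.0.
g3 = n_step_return([1.0, 2.0, 3.0], bootstrap_value=10.0, gamma=0.5)
print(g3)  # 1 + 0.5*2 + 0.25*3 + 0.125*10 = 4.0

# With n = 1 this reduces to the ordinary one-step TD target.
g1 = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)
print(g1)  # 1 + 0.5*10 = 6.0
```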

In [None]:
# Demonstrate the n-step return computation
n_step_buffer = NStepReplayBuffer(capacity=100, n_steps=3, gamma=0.99)

rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
state = np.zeros(4, dtype=np.float32)

for i, r in enumerate(rewards):
    result = n_step_buffer.push(state, 0, r, state, i == len(rewards) - 1)
    if result:
        print(f"Step {i+1}: n_step_return = {result.n_step_return:.4f}")

---

## 9. Rainbow: Combining the Improvements

### 9.1 Combination Strategy

Rainbow = Double + Dueling + Noisy + Categorical + PER + N-step

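| Algorithm | Median human-normalized Atari score | Improvement over DQN |
|------|-----------------|-------------|
| DQN | 79% | baseline |
| Double DQN | 117% | +48% |
| Dueling DQN | 151% | +91% |
| Categorical DQN | 164% | +108% |
| **Rainbow** | **223%** | **+182%** |

Scores are the no-op-start medians reported by Hessel et al. (2018).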

In [None]:
# Demonstrate the Rainbow network
rainbow_net = RainbowNetwork(
    state_dim=4, action_dim=2, hidden_dim=64,
    num_atoms=51, v_min=-10, v_max=10
)
print("Rainbow Network:")
print(f"  Parameters: {sum(p.numel() for p in rainbow_net.parameters())}")

In [None]:
# Test the forward pass
sample_state = torch.randn(1, 4)
rainbow_net.reset_noise()

with torch.no_grad():
    log_probs = rainbow_net(sample_state)
    q_values = rainbow_net.get_q_values(sample_state)

print(f"Log probs shape: {log_probs.shape}")
print(f"Q-values: {q_values.numpy().flatten()}")

---

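## 10. Experimental Comparison and Analysis

Use `DQNVariantAgent` to instantiate the variants and run a quick training smoke test. The small settings below are for demonstration, not for a full performance comparison.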

In [None]:
# Create the config (small values for a quick demo)
config = DQNVariantConfig(
    state_dim=4,
    action_dim=2,
    hidden_dim=64,
    batch_size=32,
    buffer_size=1000,
    min_buffer_size=100,
    device="cpu",
)

In [None]:
# Check that each variant initializes
variants = [
    DQNVariant.VANILLA,
    DQNVariant.DOUBLE,
    DQNVariant.DUELING,
    DQNVariant.RAINBOW,
]

for variant in variants:
    agent = DQNVariantAgent(config, variant)
    print(f"{variant}: initialized successfully")

In [None]:
# Quick training smoke test
agent = DQNVariantAgent(config, DQNVariant.DOUBLE)
mock_state = np.zeros(4, dtype=np.float32)

# Simulated training steps
losses = []
for i in range(200):
    action = agent.select_action(mock_state, training=True)
    loss = agent.train_step(mock_state, action, 1.0, mock_state, i % 50 == 49)
    if loss is not None:
        losses.append(loss)

print(f"Training steps: {len(losses)}")
print(f"Final loss: {losses[-1]:.4f}" if losses else "No loss yet")

---

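## Summary

### Key Takeaways

| Variant | Core innovation | Problem addressed |
|------|----------|------------|
| **Double DQN** | Decouples action selection from evaluation | Overestimation bias |
| **Dueling DQN** | V/A decomposition architecture | Generalization across actions |
| **Noisy Networks** | Parameterized noise | State-independent exploration |
| **Categorical DQN** | Models the return distribution | Limits of a scalar value |
| **PER** | TD-error-based priorities | Sample efficiency |
| **N-step** | Multi-step bootstrapping | Credit assignment |
| **Rainbow** | Combines all of the above | Best overall performance |

### Practical Recommendations

1. **Simple tasks**: Vanilla DQN or Double DQN
2. **Moderate difficulty**: Double + Dueling + PER
3. **Hard tasks**: Rainbow, which combines all of the above and typically performs best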

---

## References

1. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. *Nature*.
2. van Hasselt, H. et al. (2016). Deep Reinforcement Learning with Double Q-learning. *AAAI*.
3. Wang, Z. et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. *ICML*.
4. Fortunato, M. et al. (2018). Noisy Networks for Exploration. *ICLR*.
5. Bellemare, M. et al. (2017). A Distributional Perspective on Reinforcement Learning. *ICML*.
6. Schaul, T. et al. (2016). Prioritized Experience Replay. *ICLR*.
7. Hessel, M. et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. *AAAI*.