# Deep Q-Network (DQN) Hands-On Tutorial

---

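## Learning Objectives

By the end of this tutorial, you will be able to:
- Explain the core problem DQN solves (the limits of tabular methods)
- Explain and implement experience replay
- Explain and implement a target network
- Implement the complete DQN algorithm
- Describe the Double DQN and Dueling DQN improvements
- Train and evaluate a DQN agent on the CartPole environment

## Prerequisites

- Q-learning basics (Bellman equation, TD error)
- PyTorch neural network basics
- Object-oriented programming in Python

## Estimated Time

60-90 minutes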

---

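## Part 1: Theoretical Background

### 1.1 Why Deep Reinforcement Learning?

**Limitations of tabular Q-learning**:

| Problem | Description |
|------|------|
| State-space explosion | Go has roughly $10^{170}$ legal positions, far too many to store in a table |
| Continuous state spaces | Robot joint angles are continuous values |
| No generalization | Similar states must each be learned independently |
| High-dimensional inputs | An image can contain millions of pixels |

**Solution**: approximate the Q-function with a neural network whose parameters are $\theta$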

$$Q(s, a) \approx Q(s, a; \theta)$$
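
To make the scale argument concrete, the sketch below compares the size of a Q-table with the parameter count of a small network. The discretization into 100 bins per dimension is an illustrative assumption, not something this tutorial prescribes; the layer sizes (4 → 128 → 128 → 2) mirror the default network built later.

```python
# Hypothetical setup: discretize each of CartPole's 4 state dimensions into 100 bins.
STATE_DIMS = 4
BINS_PER_DIM = 100   # assumed granularity, for illustration only
NUM_ACTIONS = 2

# A table needs one entry per (discretized state, action) pair.
table_entries = (BINS_PER_DIM ** STATE_DIMS) * NUM_ACTIONS


def mlp_param_count(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


network_params = mlp_param_count([STATE_DIMS, 128, 128, NUM_ACTIONS])

print(f"Q-table entries:    {table_entries:,}")    # 200,000,000
print(f"Network parameters: {network_params:,}")   # 17,410
```

Even at this coarse resolution the table is several orders of magnitude larger than the network, and it still cannot share what it learns between neighboring bins.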

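### 1.2 Core Innovations of DQN

DeepMind introduced DQN in 2013 (Mnih et al.), learning to play Atari games directly from raw pixels; the 2015 follow-up reached human-level performance on many of them. Two techniques are key:

#### Experience Replay

**Problem**: consecutively collected samples are highly correlated, which violates the i.i.d. assumption behind stochastic gradient descent

**Solution**: store each transition $(s, a, r, s', done)$ in a buffer and sample from it at random during training

**Benefits**:
- Breaks the correlation between samples
- Improves data efficiency, since each transition can be reused many times
- Stabilizes training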
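
The correlation problem can be seen in a toy experiment. The sketch below uses a 1-D random walk as a stand-in for a trajectory (an assumption made purely for illustration; real environment states are higher-dimensional) and compares the lag-1 correlation of samples in collection order with the same samples in random order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trajectory: each "state" is the previous one plus a small
# random step, much like consecutive observations from an environment.
trajectory = np.cumsum(rng.normal(size=10_000))


def lag1_correlation(x):
    """Correlation between each element and the one that follows it."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])


# Samples in the order they were collected
sequential_corr = lag1_correlation(trajectory)

# The same samples served in random order, as a replay buffer would
shuffled_corr = lag1_correlation(rng.permutation(trajectory))

print(f"sequential: {sequential_corr:.3f}")
print(f"shuffled:   {shuffled_corr:.3f}")
```

Neighboring samples are almost perfectly correlated in collection order and close to uncorrelated once shuffled, which is the effect random sampling from the buffer is meant to achieve.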

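#### Target Network

**Problem**: the target is computed from the same parameters $\theta$ that are being updated, so it shifts with every update and training becomes unstable

$$\text{Target} = r + \gamma \max_{a'} Q(s', a'; \theta)$$

**Solution**: compute the target with a separate network whose parameters $\theta^-$ are held fixed and synchronized with $\theta$ periodically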

$$\text{Target} = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$
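
The two common ways of synchronizing $\theta^-$ can be sketched as follows. `hard_update` and `soft_update` are hypothetical helper names, not part of PyTorch; the agent in this tutorial uses the hard copy, and the soft (Polyak) variant is shown only as the alternative mentioned in the tuning notes.

```python
import torch
import torch.nn as nn


def hard_update(target: nn.Module, online: nn.Module) -> None:
    """Copy every parameter: θ⁻ ← θ."""
    target.load_state_dict(online.state_dict())


@torch.no_grad()
def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005) -> None:
    """Polyak averaging: θ⁻ ← τ·θ + (1 - τ)·θ⁻."""
    for target_p, online_p in zip(target.parameters(), online.parameters()):
        target_p.mul_(1.0 - tau).add_(online_p, alpha=tau)


online_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)

hard_update(target_net, online_net)            # start from identical weights
with torch.no_grad():
    online_net.weight.add_(1.0)                # pretend a training step changed θ
soft_update(target_net, online_net, tau=0.1)   # target moves 10% of the way toward θ

gap = (online_net.weight - target_net.weight).abs().max().item()
print(f"remaining gap: {gap:.2f}")             # 0.90
```

A hard update makes the target jump every C steps; a soft update moves it a little on every step. Both keep the target changing more slowly than the online network.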

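### 1.3 The DQN Algorithm

```
Initialize:
    - Q-network Q(s, a; θ)
    - Target network Q̂(s, a; θ⁻) with θ⁻ ← θ
    - Replay buffer D

For each episode:
    Initialize state s
    For each step:
        1. Select action a with ε-greedy
        2. Execute a, observe r, s', done
        3. Store (s, a, r, s', done) in D
        4. Sample a mini-batch from D
        5. Compute the target: y = r + γ (1 - done) max_a' Q̂(s', a'; θ⁻)
        6. Gradient descent step: θ ← θ - α∇_θ(y - Q(s, a; θ))²
        7. Every C steps, update the target network: θ⁻ ← θ
        8. s ← s'
```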
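
Steps 5 and 6 can be traced by hand on a tiny batch. The numbers below are made up for illustration, and the `(1 - done)` factor is the terminal mask that the agent implementation later in this tutorial applies so that terminal transitions do not bootstrap.

```python
import numpy as np

gamma = 0.99

# A made-up mini-batch of three transitions with two actions each
rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])        # the last transition is terminal
q_next_target = np.array([[1.0, 2.0],    # Q̂(s', ·; θ⁻) for each sample
                          [0.5, 0.2],
                          [3.0, 4.0]])
q_taken = np.array([2.5, 1.0, 1.5])      # Q(s, a; θ) for the actions actually taken

# Step 5: y = r + γ (1 - done) max_a' Q̂(s', a'; θ⁻)
targets = rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

# Step 6 minimizes the mean squared TD error
td_errors = targets - q_taken
loss = float(np.mean(td_errors ** 2))

print(targets)          # [2.98  1.495 1.   ]
print(f"{loss:.4f}")    # 0.2418
```

The third target equals the reward alone because the episode ended there, even though the target network assigns a large value to that next state.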

---

## Part 2: Environment Setup

In [None]:
# ============================================================
# Import the required libraries
# ============================================================

import numpy as np
import random
from collections import deque
from dataclasses import dataclass
from typing import Tuple, List, Optional

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import matplotlib.pyplot as plt

# Try to import gymnasium
try:
    import gymnasium as gym
    HAS_GYM = True
except ImportError:
    HAS_GYM = False
    print("Please install gymnasium: pip install gymnasium")

# ============================================================
# Configuration
# ============================================================

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Plot configuration (SimHei is only needed to render CJK text; DejaVu Sans is the fallback)
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10, 6)

# Device configuration
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
print(f"PyTorch version: {torch.__version__}")
print("Environment setup complete")

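### 2.1 The CartPole Environment

CartPole is a classic control problem: keep an upright pole balanced by pushing the cart beneath it left or right.

- **State space**: 4 continuous dimensions (cart position, cart velocity, pole angle, pole angular velocity)
- **Action space**: 2 discrete actions (push left, push right)
- **Reward**: +1 for every step the pole stays balanced
- **Termination**: the pole tilts more than 12° from vertical, or the cart moves more than 2.4 units from the center; CartPole-v1 additionally truncates episodes after 500 steps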
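
The end-of-episode rules can be written out as a small check. This is a sketch using the thresholds documented for Gymnasium's CartPole-v1 (±12° pole angle, ±2.4 cart position, 500-step time limit); `episode_status` is a hypothetical helper for illustration, not part of the Gymnasium API.

```python
import math

# Thresholds as documented for Gymnasium's CartPole-v1
X_LIMIT = 2.4                          # cart position limit (units from center)
THETA_LIMIT = 12 * 2 * math.pi / 360   # pole angle limit: 12 degrees in radians
MAX_STEPS = 500                        # time limit that triggers truncation


def episode_status(cart_x: float, pole_theta: float, step: int) -> str:
    """Hypothetical helper: classify a step the way the environment would."""
    if abs(cart_x) > X_LIMIT or abs(pole_theta) > THETA_LIMIT:
        return "terminated"   # the task failed
    if step >= MAX_STEPS:
        return "truncated"    # the time limit was hit while still balanced
    return "running"


print(episode_status(0.0, math.radians(5), step=10))    # running
print(episode_status(0.0, math.radians(13), step=10))   # terminated
print(episode_status(2.5, 0.0, step=10))                # terminated
print(episode_status(0.0, 0.0, step=500))               # truncated
```

The distinction matters for learning: a terminated state has no future reward, whereas a truncated episode was simply cut short.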

In [None]:
# ============================================================
# Explore the CartPole environment
# ============================================================

if HAS_GYM:
    env = gym.make('CartPole-v1')
    
    print("CartPole-v1 environment info:")
    print(f"  Observation space: {env.observation_space}")
    print(f"  State dimension: {env.observation_space.shape[0]}")
    print(f"  Action space: {env.action_space}")
    print(f"  Number of actions: {env.action_space.n}")
    
    # Reset the environment and inspect the initial state
    state, _ = env.reset(seed=RANDOM_SEED)
    print(f"\nInitial state: {state}")
    print(f"  - Cart position: {state[0]:.4f}")
    print(f"  - Cart velocity: {state[1]:.4f}")
    print(f"  - Pole angle: {state[2]:.4f}")
    print(f"  - Pole angular velocity: {state[3]:.4f}")
    
    env.close()

---

## Part 3: Implementing the Core Components

### 3.1 Experience Replay Buffer

In [None]:
# ============================================================
# Experience replay buffer
# ============================================================

@dataclass
class Transition:
    """A single-step transition."""
    state: np.ndarray
    action: int
    reward: float
    next_state: np.ndarray
    done: bool


class ReplayBuffer:
    """
    Experience replay buffer.
    
    Core features:
    - Stores interaction experience
    - Samples uniformly at random
    - Manages capacity automatically (oldest entries are evicted first)
    """
    
    def __init__(self, capacity: int = 100000):
        """
        Args:
            capacity: maximum number of transitions to keep
        """
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.buffer.append(Transition(state, action, reward, next_state, done))
    
    def sample(self, batch_size: int) -> Tuple[np.ndarray, ...]:
        """Sample a random batch (without replacement) and stack it into arrays."""
        batch = random.sample(self.buffer, batch_size)
        
        states = np.array([t.state for t in batch], dtype=np.float32)
        actions = np.array([t.action for t in batch], dtype=np.int64)
        rewards = np.array([t.reward for t in batch], dtype=np.float32)
        next_states = np.array([t.next_state for t in batch], dtype=np.float32)
        dones = np.array([t.done for t in batch], dtype=np.float32)
        
        return states, actions, rewards, next_states, dones
    
    def __len__(self):
        return len(self.buffer)


# Quick test
buffer = ReplayBuffer(capacity=1000)
for i in range(100):
    buffer.push(
        state=np.random.randn(4),
        action=random.randint(0, 1),
        reward=1.0,
        next_state=np.random.randn(4),
        done=False
    )

print(f"Buffer size: {len(buffer)}")
states, actions, rewards, next_states, dones = buffer.sample(32)
print(f"Sampled batch - states shape: {states.shape}")
print("Replay buffer test passed")
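
The automatic capacity management comes from `deque(maxlen=...)`: once the deque is full, each append silently drops the oldest item. A minimal standalone illustration, using a capacity of 3 chosen only to keep the output short:

```python
from collections import deque

# A tiny buffer that holds at most 3 items
small_buffer = deque(maxlen=3)
for step in range(5):
    small_buffer.append(step)

print(list(small_buffer))   # [2, 3, 4]
print(len(small_buffer))    # 3
```

In the replay buffer this means the agent always trains on its most recent `capacity` transitions.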

### 3.2 DQN Neural Network

In [None]:
# ============================================================
# DQN network architecture
# ============================================================

class DQNNetwork(nn.Module):
    """
    Basic DQN network.
    
    Architecture: State -> FC -> ReLU -> FC -> ReLU -> FC -> Q-values
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        
        # Weight initialization
        self._init_weights()
    
    def _init_weights(self):
        """Orthogonal initialization with zero biases."""
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.orthogonal_(module.weight, gain=np.sqrt(2))
                nn.init.zeros_(module.bias)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass; returns one Q-value per action."""
        return self.net(x)


# Quick test
net = DQNNetwork(state_dim=4, action_dim=2, hidden_dim=128).to(DEVICE)
x = torch.randn(32, 4).to(DEVICE)
q_values = net(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {q_values.shape}")
print(f"Example Q-values: {q_values[0].detach().cpu().numpy()}")
print(f"Number of parameters: {sum(p.numel() for p in net.parameters()):,}")
print("DQN network test passed")
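
What the orthogonal initialization in `_init_weights` produces can be checked on a single layer. The sketch below is a standalone illustration on a layer shaped like the output layer (128 → 2); for a weight matrix with fewer rows than columns, the rows come out mutually orthogonal with squared norm equal to `gain**2`.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(128, 2)   # same shape as the output layer: 128 hidden units -> 2 actions
nn.init.orthogonal_(layer.weight, gain=math.sqrt(2))

# layer.weight has shape (2, 128); W @ W.T should be gain² times the identity
weight = layer.weight.detach()
gram = weight @ weight.T
print(gram)

is_scaled_identity = torch.allclose(gram, 2.0 * torch.eye(2), atol=1e-4)
print(is_scaled_identity)   # True
```

Keeping the rows orthogonal means the two initial Q-value outputs are not redundant copies of each other.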

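### 3.3 Dueling DQN Network

**Core idea**: decompose the Q-value into a state value $V(s)$ and an advantage function $A(s, a)$

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$$

**Benefits**:
- The value stream learns how good a state is, independent of the action taken
- The advantage stream focuses on how actions compare with one another
- Learning is more efficient in states where the choice of action matters little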
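
The aggregation formula can be checked on a single state with made-up numbers. `dueling_q` is a hypothetical NumPy helper written only for this illustration; it is not the network implementation.

```python
import numpy as np


def dueling_q(value: float, advantages: np.ndarray) -> np.ndarray:
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    return value + advantages - advantages.mean()


v = 10.0
adv = np.array([1.0, -1.0, 3.0])   # made-up advantage-stream outputs

q = dueling_q(v, adv)
print(q)                           # [10.  8. 12.]

# Shifting every advantage by a constant leaves Q unchanged, so the
# advantage stream cannot absorb what belongs to the value stream.
q_shifted = dueling_q(v, adv + 5.0)
print(np.allclose(q, q_shifted))   # True

# The mean of Q over actions recovers V(s).
print(q.mean())                    # 10.0
```

Subtracting the mean advantage is what makes the split between $V$ and $A$ well defined; without it, any constant could be moved from one stream to the other without changing $Q$.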

In [None]:
# ============================================================
# Dueling DQN network architecture
# ============================================================

class DuelingDQNNetwork(nn.Module):
    """
    Dueling DQN network.
    
    Architecture:
        State -> shared layers -> value stream     -> V(s)
                               -> advantage stream -> A(s,a)
        Q(s,a) = V(s) + (A(s,a) - mean(A))
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        
        # Shared feature layer
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Value stream V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        # Advantage stream A(s,a)
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.feature(x)
        
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        
        # Q = V + (A - mean(A))
        q_values = value + (advantage - advantage.mean(dim=-1, keepdim=True))
        
        return q_values


# Quick test (reuses the input batch x from the previous cell)
dueling_net = DuelingDQNNetwork(state_dim=4, action_dim=2).to(DEVICE)
q_values = dueling_net(x)
print(f"Dueling DQN output shape: {q_values.shape}")
print(f"Number of parameters: {sum(p.numel() for p in dueling_net.parameters()):,}")
print("Dueling DQN network test passed")

---

## Part 4: Implementing the DQN Agent

In [None]:
# ============================================================
# DQN agent
# ============================================================

class DQNAgent:
    """
    DQN agent.
    
    Supports:
    - Standard DQN
    - Double DQN (decouples action selection from action evaluation)
    - Dueling DQN (decomposes Q into V and A)
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dim: int = 128,
        lr: float = 1e-3,
        gamma: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995,
        buffer_size: int = 100000,
        batch_size: int = 64,
        target_update_freq: int = 100,
        double_dqn: bool = False,
        dueling: bool = False,
        device: str = 'auto'
    ):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        self.double_dqn = double_dqn
        
        # Device
        if device == 'auto':
            self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        else:
            self.device = torch.device(device)
        
        # Networks (the target network starts as an exact copy of the online network)
        NetworkClass = DuelingDQNNetwork if dueling else DQNNetwork
        self.q_network = NetworkClass(state_dim, action_dim, hidden_dim).to(self.device)
        self.target_network = NetworkClass(state_dim, action_dim, hidden_dim).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        # Optimizer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        
        # Experience replay
        self.buffer = ReplayBuffer(buffer_size)
        
        # Counter of gradient updates, used to schedule target network syncs
        self.update_count = 0
    
    def get_action(self, state: np.ndarray, training: bool = True) -> int:
        """ε-greedy action selection (purely greedy when training=False)."""
        if training and random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        
        state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            q_values = self.q_network(state_t)
        return q_values.argmax(dim=1).item()
    
    def update(self, state, action, reward, next_state, done) -> Optional[float]:
        """Store a transition and run one training step; returns the loss, or None while the buffer is still filling."""
        self.buffer.push(state, action, reward, next_state, done)
        
        if len(self.buffer) < self.batch_size:
            return None
        
        # Sample a batch
        states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)
        
        states_t = torch.FloatTensor(states).to(self.device)
        actions_t = torch.LongTensor(actions).to(self.device)
        rewards_t = torch.FloatTensor(rewards).to(self.device)
        next_states_t = torch.FloatTensor(next_states).to(self.device)
        dones_t = torch.FloatTensor(dones).to(self.device)
        
        # Current Q-values of the actions taken
        # (squeeze(1) keeps the batch dimension even when batch_size == 1)
        current_q = self.q_network(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
        
        # Target Q-values
        with torch.no_grad():
            if self.double_dqn:
                # Double DQN: the online network selects the action, the target network evaluates it
                next_actions = self.q_network(next_states_t).argmax(dim=1)
                next_q = self.target_network(next_states_t).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            else:
                next_q = self.target_network(next_states_t).max(dim=1)[0]
            
            # Terminal transitions (done = 1) do not bootstrap
            target_q = rewards_t + self.gamma * next_q * (1 - dones_t)
        
        # Compute the loss
        loss = F.mse_loss(current_q, target_q)
        
        # Optimize, clipping the gradient norm to 10
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 10)
        self.optimizer.step()
        
        # Sync the target network every target_update_freq updates
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
        
        return loss.item()
    
    def decay_epsilon(self):
        """Decay the exploration rate, never going below epsilon_end."""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)


print("DQN agent defined")
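
The difference between the two target computations in `update` is easiest to see on a single next state. The Q-value estimates below are made up for illustration.

```python
import numpy as np

gamma = 0.99
reward = 1.0

# Made-up Q-value estimates for one next state s' with three actions
q_online = np.array([1.0, 2.5, 2.0])   # Q(s', ·; θ)  from the online network
q_target = np.array([1.2, 1.8, 3.0])   # Q(s', ·; θ⁻) from the target network

# Standard DQN: the target network both selects and evaluates the action
standard_target = reward + gamma * q_target.max()

# Double DQN: the online network selects, the target network evaluates
best_action = int(q_online.argmax())
double_target = reward + gamma * q_target[best_action]

print(f"standard: {standard_target:.3f}")   # 3.970
print(f"double:   {double_target:.3f}")     # 2.782
```

Because the maximum of the target network's estimates is at least as large as any single entry, the Double DQN target can never exceed the standard one for the same pair of networks. This is how it dampens overestimation.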

---

## Part 5: Training and Evaluation

In [None]:
# ============================================================
# Training function
# ============================================================

def train_dqn(
    env_name: str = 'CartPole-v1',
    num_episodes: int = 200,
    double_dqn: bool = False,
    dueling: bool = False,
    seed: int = 42,
    verbose: bool = True
) -> Tuple[DQNAgent, List[float]]:
    """
    Train a DQN agent.
    
    Args:
        env_name: Gymnasium environment name
        num_episodes: number of training episodes
        double_dqn: whether to use Double DQN
        dueling: whether to use Dueling DQN
        seed: random seed
        verbose: whether to print progress
    
    Returns:
        (agent, rewards_history)
    """
    if not HAS_GYM:
        print("gymnasium is required")
        return None, []
    
    # Set the seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    
    # Create the environment
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    # Algorithm name (yields e.g. "Double Dueling DQN")
    algo_name = "DQN"
    if dueling:
        algo_name = "Dueling " + algo_name
    if double_dqn:
        algo_name = "Double " + algo_name
    
    if verbose:
        print(f"\n{'='*50}")
        print(f"Training {algo_name} on {env_name}")
        print(f"State dimension: {state_dim}, number of actions: {action_dim}")
        print(f"{'='*50}")
    
    # Create the agent
    agent = DQNAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        lr=1e-3,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay=0.995,
        batch_size=64,
        target_update_freq=100,
        double_dqn=double_dqn,
        dueling=dueling
    )
    
    # Training loop
    rewards_history = []
    best_avg = float('-inf')
    
    for episode in range(num_episodes):
        state, _ = env.reset(seed=seed + episode)
        total_reward = 0
        done = False
        
        while not done:
            action = agent.get_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Store `terminated` rather than `done`: a time-limit truncation is not a
            # true terminal state, so the target should still bootstrap from next_state.
            agent.update(state, action, reward, next_state, terminated)
            state = next_state
            total_reward += reward
        
        agent.decay_epsilon()
        rewards_history.append(total_reward)
        
        if verbose and (episode + 1) % 20 == 0:
            avg = np.mean(rewards_history[-20:])
            best_avg = max(best_avg, avg)
            print(f"Episode {episode+1:4d} | Avg: {avg:7.2f} | Best: {best_avg:7.2f} | ε: {agent.epsilon:.3f}")
    
    env.close()
    return agent, rewards_history

In [None]:
# ============================================================
# Train the basic DQN
# ============================================================

# Use a small number of episodes for a quick test
agent_dqn, rewards_dqn = train_dqn(
    num_episodes=150,
    double_dqn=False,
    dueling=False
)

In [None]:
# ============================================================
# Train Double Dueling DQN
# ============================================================

agent_dd, rewards_dd = train_dqn(
    num_episodes=150,
    double_dqn=True,
    dueling=True
)

---

## Part 6: Visualizing the Results

In [None]:
# ============================================================
# Plot the learning curves
# ============================================================

def plot_learning_curves(results: dict, window: int = 20):
    """Plot smoothed learning curves for several algorithms on one figure."""
    plt.figure(figsize=(12, 5))
    
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
    
    for idx, (name, rewards) in enumerate(results.items()):
        if len(rewards) >= window:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            plt.plot(smoothed, label=name, color=colors[idx % len(colors)], linewidth=2)
    
    plt.xlabel('Episode', fontsize=12)
    plt.ylabel('Total Reward', fontsize=12)
    plt.title('Learning Curves of DQN Variants', fontsize=14)
    plt.legend(loc='lower right', fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Plot the comparison
if rewards_dqn and rewards_dd:
    plot_learning_curves({
        'DQN': rewards_dqn,
        'Double Dueling DQN': rewards_dd
    })
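
The smoothing in `plot_learning_curves` is a moving average computed with `np.convolve`. A small standalone example, with made-up rewards and a window of 3, shows what `mode='valid'` returns:

```python
import numpy as np

rewards = [10, 20, 30, 40, 50]
window = 3

# Each output value is the mean of `window` consecutive rewards
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

print(smoothed)        # [20. 30. 40.]
print(len(smoothed))   # 3, i.e. len(rewards) - window + 1
```

Since `mode='valid'` only keeps positions where the window fits entirely, the smoothed curve is `window - 1` points shorter than the raw reward history, and its first point summarizes the first `window` episodes.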

In [None]:
# ============================================================
# Evaluate the agent
# ============================================================

def evaluate_agent(agent, env_name: str, num_episodes: int = 10):
    """Evaluate the agent with greedy actions (no exploration)."""
    if not HAS_GYM:
        return []
    
    env = gym.make(env_name)
    rewards = []
    
    for _ in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action = agent.get_action(state, training=False)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
        
        rewards.append(total_reward)
    
    env.close()
    return rewards

# Evaluate
if agent_dd:
    eval_rewards = evaluate_agent(agent_dd, 'CartPole-v1', num_episodes=20)
    print(f"\nEvaluation results (Double Dueling DQN):")
    print(f"  Mean reward: {np.mean(eval_rewards):.2f}")
    print(f"  Std dev: {np.std(eval_rewards):.2f}")
    print(f"  Max: {np.max(eval_rewards):.0f}")
    print(f"  Min: {np.min(eval_rewards):.0f}")

---

## Part 7: Interactive Experiments

### Experiment 1: Effect of Exploration Rate Decay

In [None]:
# ============================================================
# Visualize exploration rate decay
# ============================================================

def visualize_epsilon_decay(decay_rates: List[float], episodes: int = 200):
    """Plot how ε evolves under different per-episode decay rates."""
    plt.figure(figsize=(10, 5))
    
    for decay in decay_rates:
        epsilons = []
        eps = 1.0
        for _ in range(episodes):
            epsilons.append(eps)
            eps = max(0.01, eps * decay)
        plt.plot(epsilons, label=f'decay={decay}')
    
    plt.xlabel('Episode')
    plt.ylabel('Epsilon')
    plt.title('ε-greedy Exploration Rate Decay')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

visualize_epsilon_decay([0.99, 0.995, 0.999])
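
The same schedules can be examined numerically. The sketch below assumes the schedule used above (start at 1.0, multiply by the decay rate once per episode, floor at 0.01); `episodes_to_floor` is a hypothetical helper introduced for this illustration.

```python
import math


def episodes_to_floor(decay: float, eps_start: float = 1.0, eps_end: float = 0.01) -> int:
    """Smallest n such that eps_start * decay**n <= eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay))


for decay in (0.99, 0.995, 0.999):
    eps_after_150 = max(0.01, 1.0 * decay ** 150)
    print(f"decay={decay}: floor reached after {episodes_to_floor(decay)} episodes, "
          f"ε after 150 episodes = {eps_after_150:.3f}")
```

With a decay rate of 0.995, ε needs 919 episodes to reach the floor and is still about 0.47 after 150 episodes, so an agent trained for only 150 episodes with that rate is still taking random actions nearly half the time when training ends.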

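### Experiment 2: Target Network Update Frequency

In [None]:
# ============================================================
# Effect of the target network update frequency (illustrative summary)
# ============================================================

print("Effect of the target network update frequency:")
print("="*50)
print()
print("Infrequent updates (e.g. every 1000 steps):")
print("  - Pro: more stable targets")
print("  - Con: slower learning; targets can become stale")
print()
print("Frequent updates (e.g. every 10 steps):")
print("  - Pro: targets track the current Q-network closely")
print("  - Con: can destabilize training")
print()
print("Typical range: 100-1000 steps")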

---

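## Summary

### Key Takeaways

1. **Core innovations of DQN**:
   - Experience replay: breaks the correlation between samples
   - Target network: stabilizes the training target

2. **Double DQN**:
   - Addresses the overestimation of Q-values
   - The online network selects the action; the target network evaluates it

3. **Dueling DQN**:
   - Decomposes Q into V + A (with the mean advantage subtracted)
   - More efficient in states where the choice of action matters little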

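### Algorithm Comparison

| Algorithm | Key idea | When to use |
|------|------|----------|
| DQN | Baseline version | Simple problems |
| Double DQN | Reduces overestimation | When accurate, stable value estimates matter |
| Dueling DQN | Separates V and A | Many states where the choice of action matters little |
| Rainbow | Combines several improvements (including Double DQN, Dueling DQN, and prioritized replay) | Complex problems |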

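### Tuning Suggestions

| Parameter | Suggested range | Notes |
|------|----------|------|
| Learning rate | 1e-4 ~ 1e-3 | Too large a value destabilizes training |
| Batch size | 32 ~ 256 | Depends on available memory |
| γ (discount factor) | 0.99 | 0.95 can work for short-horizon tasks |
| Target network update | Every 100 ~ 1000 steps | Or a soft update with τ=0.005 |
| ε decay | 0.995 ~ 0.9999 | Per-episode multiplier; depends on task difficulty |

### Next Steps

- Prioritized experience replay (PER)
- Policy gradient methods (A2C, PPO)
- Algorithms for continuous action spaces (SAC, TD3)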

---

## Unit Tests

In [None]:
# ============================================================
# Unit tests
# ============================================================

def run_tests():
    """Run all unit tests."""
    print("Running unit tests...\n")
    passed = 0
    failed = 0
    
    # Test 1: ReplayBuffer
    try:
        buf = ReplayBuffer(100)
        for i in range(50):
            buf.push(np.random.randn(4), 0, 1.0, np.random.randn(4), False)
        assert len(buf) == 50
        s, a, r, ns, d = buf.sample(32)
        assert s.shape == (32, 4)
        print("Test 1 passed: ReplayBuffer")
        passed += 1
    except Exception as e:
        print(f"Test 1 failed: {e}")
        failed += 1
    
    # Test 2: DQNNetwork
    try:
        net = DQNNetwork(4, 2, 64)
        x = torch.randn(32, 4)
        out = net(x)
        assert out.shape == (32, 2)
        print("Test 2 passed: DQNNetwork")
        passed += 1
    except Exception as e:
        print(f"Test 2 failed: {e}")
        failed += 1
    
    # Test 3: DuelingDQNNetwork
    try:
        net = DuelingDQNNetwork(4, 2, 64)
        x = torch.randn(32, 4)
        out = net(x)
        assert out.shape == (32, 2)
        print("Test 3 passed: DuelingDQNNetwork")
        passed += 1
    except Exception as e:
        print(f"Test 3 failed: {e}")
        failed += 1
    
    # Test 4: DQNAgent
    try:
        agent = DQNAgent(4, 2, batch_size=32, device='cpu')
        state = np.random.randn(4).astype(np.float32)
        action = agent.get_action(state)
        assert 0 <= action < 2
        
        for _ in range(50):
            agent.update(state, 0, 1.0, state, False)
        print("Test 4 passed: DQNAgent")
        passed += 1
    except Exception as e:
        print(f"Test 4 failed: {e}")
        failed += 1
    
    # Test 5: Double DQN mode
    try:
        agent = DQNAgent(4, 2, batch_size=32, double_dqn=True, device='cpu')
        for _ in range(50):
            agent.update(np.random.randn(4).astype(np.float32), 0, 1.0, 
                        np.random.randn(4).astype(np.float32), False)
        print("Test 5 passed: Double DQN")
        passed += 1
    except Exception as e:
        print(f"Test 5 failed: {e}")
        failed += 1
    
    print(f"\n{'='*40}")
    print(f"Tests finished: {passed} passed, {failed} failed")
    if failed == 0:
        print("All tests passed!")
    print(f"{'='*40}")

run_tests()

---

## References

1. Mnih et al., "Playing Atari with Deep Reinforcement Learning", 2013
2. van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning", 2016
3. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning", 2016
4. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning", 2018
