# Basic DQN
> Adapted from <https://hrl.boyuai.com/chapter/2/dqn%E7%AE%97%E6%B3%95/>

## Overview
* DQN with Double DQN and experience replay
* Model implementation: `./dqn/BaseDQN.py`
* Test environment: gymnasium CartPole-v0
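
As a rough illustration of the experience replay component, here is a minimal sketch of a fixed-capacity replay buffer. It is not the code in `./dqn/BaseDQN.py`: the `Transition` and `ReplayBuffer` classes are hypothetical, and only the `buffer` attribute and `size()` method are named to mirror how the training loop in this notebook reads the replay queue.

```python
import random
from collections import deque
from typing import NamedTuple


class Transition(NamedTuple):
    # Hypothetical container; the transition type in BaseDQN.py may differ.
    state: tuple
    action: int
    next_state: tuple
    reward: float
    terminated: bool


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity: int):
        # A deque with maxlen drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def size(self) -> int:
        return len(self.buffer)
```

Sampling uniformly from past transitions, rather than training on the most recent one, is what decorrelates the updates; the capacity bounds memory and gradually discards stale experience.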

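## Notes
* v1.0
    * If the loss keeps diverging, severe overestimation may be the cause; one way to check is to see whether a particular action appears with high frequency in the replay queue
    * When the episode terminates, the TD target no longer needs a model prediction (the future return is necessarily 0)
* v1.1
    * Fix: in vectorized action selection, every item in the batch shared a single epsilon check result (this did not affect the run)
    * Fix: during training, the model's prediction should use the transition's action, not the state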
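
The point about terminal transitions, together with the Double DQN idea, can be sketched as follows. This is a dependency-free illustration on plain Python lists, not the tensor code in `BaseDQN.py`; the function name `double_dqn_targets` and its argument layout are assumptions made for this example.

```python
def double_dqn_targets(rewards, terminated, q_online_next, q_target_next, gamma=0.99):
    """Compute one TD target per transition, Double DQN style.

    q_online_next[i][a] and q_target_next[i][a] are the Q-values that the
    online and target networks assign to action a in next_state i.
    """
    targets = []
    for r, term, q_on, q_tg in zip(rewards, terminated, q_online_next, q_target_next):
        if term:
            # Terminal transition: the future return is 0, so no bootstrap.
            targets.append(r)
            continue
        # The online network selects the action ...
        best_action = max(range(len(q_on)), key=lambda a: q_on[a])
        # ... and the target network evaluates it.
        targets.append(r + gamma * q_tg[best_action])
    return targets
```

Separating action selection from evaluation is what reduces the overestimation mentioned in the v1.0 note. Only `terminated` should zero the bootstrap term: a time-limit truncation ends the episode without making the future return 0, which is consistent with the training loop passing `terminated` rather than `done` into the transition.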


In [1]:
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
from torch.utils.tensorboard.writer import SummaryWriter
from tqdm import tqdm

from dqn.BaseDQN import *

In [2]:
def train(name: str, comment: str, episode: int = 500, hparam: HyperParam | None = None, is_write: bool = True):
    '''
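    * `name` name of the training run
    * `comment` comment for the training run
    * `episode` number of training episodes
    * `hparam` hyperparameters
    * `is_write` whether to record training data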
    '''
    env = gym.make(
        "CartPole-v0", 
        render_mode = "rgb_array"
    )
    if is_write:
        env = RecordEpisodeStatistics(env, buffer_length = 1)
        env = RecordVideo(
            env, 
            video_folder = "vedio_CartPole_with_BaseDQN", 
            name_prefix = name,
            episode_trigger = lambda x: (x + 1) % 100 == 0
        )

    if hparam is None:
        hparam = HyperParam()
    model = BaseDQN(hparam)

    writer = None
    if is_write:
        writer = SummaryWriter(comment = name + "_" + comment)

    for episode in tqdm(range(episode)):
        state, info = env.reset()
        done = False
        total_loss = 0

        while not done:
            
            # Perform one state transition
            action = model.take_action_single(state)
            next_state, reward, terminated, truncated, info = env.step(action[0])
            done = terminated or truncated

            # Update the model
            transition = make_transition_from_numpy(state, action, next_state, reward, terminated)
            loss = model.update(transition)
            if loss != False:
                total_loss += loss

            state = next_state

        model.update_episode(episode)

        # Log average loss and cumulative return to TensorBoard
        if writer is not None:
            writer.add_scalar(
                f"{name}/avg_loss",
                total_loss / info["episode"]["l"],
                episode
            )
            writer.add_scalar(
                f"{name}/return",
                info["episode"]["r"],
                episode
            )

        # Log the action tendency (mean action index in the replay queue)
        if writer is not None:
            if episode % 50 == 0:
                action_sum = 0
                for i in model.reply_queue.buffer:
                    action_sum += i.action.item()

                writer.add_scalar(
                    f"{name}/avg_action",
                    action_sum / model.reply_queue.size(),
                    int(episode / 50)
                )
    env.close()
    
    if writer is not None:
        writer.close()


In [None]:
hparam = HyperParam()
train("cartpole-v0", "test", 500, hparam = hparam, is_write = True)

# Results

## v1.1
![](./res/CartPole_v0_BaseDQN_v1_1.png)

## todo
* Optimize the code: when computing the TD target, skip the model prediction when done is True
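
One possible shape for this optimization is sketched below, under stated assumptions: `predict_max_q` is a hypothetical callable that takes a list of next states and returns one bootstrap value per state, and the function name is invented for this example. The idea is to run the network only on the non-terminal subset of the batch instead of predicting for every transition and masking afterwards.

```python
def td_targets_skipping_terminal(rewards, next_states, terminated, predict_max_q, gamma=0.99):
    # Indices of transitions that still need a bootstrap value.
    live = [i for i, term in enumerate(terminated) if not term]

    # For terminal transitions the target is just the reward.
    targets = list(rewards)

    if live:
        # One batched call over the non-terminal next states only.
        values = predict_max_q([next_states[i] for i in live])
        for i, v in zip(live, values):
            targets[i] = rewards[i] + gamma * v
    return targets
```

Whether this is faster than masking depends on the batch: for CartPole only one transition per episode is terminal, so the saving is likely small, and the indexing overhead may outweigh it.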
