# BaseDQN
Adapted from <https://hrl.boyuai.com/chapter/2/dqn%E7%AE%97%E6%B3%95/>

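## Algorithm
The implementation includes the following DQN improvements:
* Double DQN: the Target Network parameters are overwritten (a hard update) at the end of each episode
* Experience replay: implemented with `collections.deque`; see the `src.RL.utility.reply_queue` module for the code
* epsilon-greedy action selection: the probability of random exploration decays exponentially with the number of actions taken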
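
The experience replay and epsilon-greedy items above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src.RL`: the names `ReplayBuffer`, `epsilon_at`, `select_action`, and the parameters `eps_start`, `eps_end`, and `decay` are all hypothetical, and the exact decay formula used by the project may differ.

```python
import math
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO buffer; once full, the oldest transitions are dropped."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement reduces temporal correlation.
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of transitions into per-field tuples.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)


def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
               decay: float = 1000.0) -> float:
    """Exploration probability after `step` actions (assumed schedule)."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)


def select_action(q_values, step: int, rng=random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

A larger `decay` keeps exploration high for more steps, which is presumably the role `action_take_time_decay` plays in `HyperParam`, though that mapping is an assumption.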

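## Implementation notes
* During training, the prediction is $q_{t}=q(s_{t},a_{t};\bm{w}_{t})$, the value of the action actually taken, rather than the maximum over actions that is used when choosing an action
* The algorithm is fairly sensitive to the learning rate: lowering it increases the number of episodes needed in the early phase of learning, while setting it too high makes the value curve unstable after convergence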
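
The first note can be made concrete with a small pure-Python sketch of the TD error for a batch. It assumes the standard DQN target computed from a target network; the function name `td_errors` and its arguments are hypothetical and not taken from `src.RL`. A Double DQN target would instead choose the next action with the online network and evaluate it with the target network.

```python
def td_errors(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Per-sample TD errors q_t - y_t for a batch.

    q_online[i][a]      : q(s_i, a; w) from the online network
    q_target_next[i][a] : q(s'_i, a; w^-) from the target network
    """
    errors = []
    for q_row, q_next_row, a, r, done in zip(
        q_online, q_target_next, actions, rewards, dones
    ):
        # Prediction: the value of the action actually taken, NOT max(q_row).
        q_t = q_row[a]
        # Target: bootstrap from the next state unless the episode ended.
        bootstrap = 0.0 if done else gamma * max(q_next_row)
        y_t = r + bootstrap
        errors.append(q_t - y_t)
    return errors
```

The maximum appears only inside the target `y_t`; the prediction side always indexes by the stored action, which is what the note above warns about.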


In [1]:
import os
import sys
from ipynb_utility import get_file, set_seed
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(get_file()), '..')))

seed = 114514
set_seed(seed)

import gymnasium as gym

from src.RL.BaseDQN import BaseDQN, HyperParam
from src.RL.utility.train_rl import RL_Teacher

In [2]:
env = gym.make("CartPole-v1", render_mode = "rgb_array")

model = BaseDQN(HyperParam())

teacher = RL_Teacher(model, "CartPole-v1_BaseDQN", f"seed_{seed}", id = "CartPole-v1", render_mode = "rgb_array")
teacher.train(episode = 600)
print("CartPole-v1_BaseDQN: ", teacher.test())

100%|██████████| 600/600 [32:24<00:00,  3.24s/it]
100%|██████████| 10/10 [00:14<00:00,  1.45s/it]


CartPole-v1_BaseDQN:  500.0


# BaseDQN Results
Test environment: `gymnasium CartPole-v1`

## Learning curve

![](../res/CartPole-v1_BaseDQN.png)

## Example video

<video controls src="../res/CartPole-v1_BaseDQN.mp4">animation</video>


In [2]:
# Optuna hyperparameter search for BaseDQN

# import logging
# import sys

# import optuna

# # Add stream handler of stdout to show the messages
# optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))

# def target(trial: optuna.Trial):
#     alpha = trial.suggest_float("alpha", 1e-5, 1e-3, log = True)
#     action_take_time_decay = trial.suggest_int("action_take_time_decay", 500, 50000, log = True)

#     model = BaseDQN(HyperParam(
#         alpha = alpha,
#         action_take_time_decay = action_take_time_decay
#     ))

#     teacher = RL_Teacher(model, "CartPole-v1_BaseDQN_param_search", f"lr_{alpha:.2e}_attd_{action_take_time_decay:.2e}", id = "CartPole-v1", render_mode = "rgb_array")
#     avg_return = teacher.train(
#         episode = 600, is_log = False, 
#         last_episode_return = 200, 
#         is_fix_seed = True
#     )
#     return teacher.test(
#         is_log_vedio = True, 
#         vedio_record_gap = 6
#     ) + avg_return * 0.1

# study = optuna.create_study(
#     direction = "maximize", 
#     study_name = f"CartPole-v1_BaseDQN", 
#     storage = "sqlite:///optuna_study/RL.db", 
#     load_if_exists = True
# )
# study.optimize(target, 10)
# print(study.best_params)

[I 2024-10-21 14:46:33,160] A new study created in RDB with name: CartPole-v1_BaseDQN




100%|██████████| 600/600 [01:31<00:00,  6.55it/s]
100%|██████████| 10/10 [00:02<00:00,  4.36it/s]
[I 2024-10-21 14:48:08,997] Trial 0 finished with value: 65.4565 and parameters: {'alpha': 2.2869913779667538e-05, 'action_take_time_decay': 798}. Best is trial 0 with value: 65.4565.




100%|██████████| 600/600 [32:08<00:00,  3.21s/it]
100%|██████████| 10/10 [00:04<00:00,  2.49it/s]
[I 2024-10-21 15:20:22,032] Trial 1 finished with value: 540.0625 and parameters: {'alpha': 0.0004950654380831124, 'action_take_time_decay': 954}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [05:58<00:00,  1.67it/s]
100%|██████████| 10/10 [00:03<00:00,  2.67it/s]
[I 2024-10-21 15:26:24,822] Trial 2 finished with value: 354.56 and parameters: {'alpha': 2.1557974525674185e-05, 'action_take_time_decay': 619}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [25:18<00:00,  2.53s/it]
100%|██████████| 10/10 [00:02<00:00,  3.47it/s]
[I 2024-10-21 15:51:46,417] Trial 3 finished with value: 334.7325 and parameters: {'alpha': 0.00020522559385052695, 'action_take_time_decay': 1907}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [23:30<00:00,  2.35s/it]
100%|██████████| 10/10 [00:03<00:00,  2.70it/s]
[I 2024-10-21 16:15:21,279] Trial 4 finished with value: 505.58 and parameters: {'alpha': 0.0001254952803349487, 'action_take_time_decay': 6811}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [01:38<00:00,  6.12it/s]
100%|██████████| 10/10 [00:00<00:00, 11.28it/s]
[I 2024-10-21 16:17:00,490] Trial 5 finished with value: 10.650500000000001 and parameters: {'alpha': 1.3728948696834698e-05, 'action_take_time_decay': 9616}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [09:35<00:00,  1.04it/s]
100%|██████████| 10/10 [00:03<00:00,  2.71it/s]
[I 2024-10-21 16:26:40,121] Trial 6 finished with value: 366.7135 and parameters: {'alpha': 3.46219117102119e-05, 'action_take_time_decay': 32410}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [01:44<00:00,  5.73it/s]
100%|██████████| 10/10 [00:00<00:00, 11.35it/s]
[I 2024-10-21 16:28:26,034] Trial 7 finished with value: 10.7375 and parameters: {'alpha': 1.3015750745629715e-05, 'action_take_time_decay': 12683}. Best is trial 1 with value: 540.0625.




100%|██████████| 600/600 [24:10<00:00,  2.42s/it]
100%|██████████| 10/10 [00:03<00:00,  2.56it/s]
[I 2024-10-21 16:52:40,232] Trial 8 finished with value: 541.5795 and parameters: {'alpha': 0.0008547490835362053, 'action_take_time_decay': 15470}. Best is trial 8 with value: 541.5795.




100%|██████████| 600/600 [31:16<00:00,  3.13s/it]
100%|██████████| 10/10 [00:02<00:00,  3.56it/s]
[I 2024-10-21 17:23:59,602] Trial 9 finished with value: 546.0305 and parameters: {'alpha': 6.984818894841864e-05, 'action_take_time_decay': 3972}. Best is trial 9 with value: 546.0305.


{'alpha': 6.984818894841864e-05, 'action_take_time_decay': 3972}
