Description
What happened + What you expected to happen
I ran DQN training on CartPole-v1 for 10 iterations and observed steadily increasing rewards, reaching 123.54 by the final iteration, at which point I saved a checkpoint and exited. Everything looked stable:
[Training] Iteration 0: reward = 20.04, sampled_timesteps = 32
[Training] Iteration 1: reward = 23.34, sampled_timesteps = 16032
[Training] Iteration 2: reward = 30.50, sampled_timesteps = 32032
[Training] Iteration 3: reward = 32.42, sampled_timesteps = 48032
[Training] Iteration 4: reward = 40.64, sampled_timesteps = 64032
[Training] Iteration 5: reward = 57.04, sampled_timesteps = 80032
[Training] Iteration 6: reward = 69.62, sampled_timesteps = 96032
[Training] Iteration 7: reward = 91.88, sampled_timesteps = 112032
[Training] Iteration 8: reward = 110.36, sampled_timesteps = 128032
[Training] Iteration 9: reward = 123.54, sampled_timesteps = 144032
When I resumed training by loading the checkpoint and continuing for another 10 iterations (iterations 10–19 below), the reward collapsed immediately: iteration 10 dropped to 22.78, even though the checkpoint had been saved at a reward of 123.54:
[Training] Iteration 10: reward = 22.78, sampled_timesteps = 16000
[Training] Iteration 11: reward = 27.72, sampled_timesteps = 32000
[Training] Iteration 12: reward = 36.40, sampled_timesteps = 48000
[Training] Iteration 13: reward = 41.04, sampled_timesteps = 64000
[Training] Iteration 14: reward = 45.60, sampled_timesteps = 80000
[Training] Iteration 15: reward = 50.84, sampled_timesteps = 96000
[Training] Iteration 16: reward = 59.66, sampled_timesteps = 112000
[Training] Iteration 17: reward = 67.42, sampled_timesteps = 128000
[Training] Iteration 18: reward = 72.24, sampled_timesteps = 144000
[Training] Iteration 19: reward = 80.78, sampled_timesteps = 160000
If I instead train continuously from scratch for 20 iterations, rewards increase normally (iteration 10 reaches ~142 and iteration 11 ~155):
[Training] Iteration 0: reward = 24.18, sampled_timesteps = 32
[Training] Iteration 1: reward = 20.68, sampled_timesteps = 16032
[Training] Iteration 2: reward = 28.98, sampled_timesteps = 32032
[Training] Iteration 3: reward = 36.06, sampled_timesteps = 48032
[Training] Iteration 4: reward = 46.12, sampled_timesteps = 64032
[Training] Iteration 5: reward = 56.12, sampled_timesteps = 80032
[Training] Iteration 6: reward = 75.36, sampled_timesteps = 96032
[Training] Iteration 7: reward = 90.38, sampled_timesteps = 112032
[Training] Iteration 8: reward = 104.32, sampled_timesteps = 128032
[Training] Iteration 9: reward = 125.52, sampled_timesteps = 144032
[Training] Iteration 10: reward = 142.24, sampled_timesteps = 160032
[Training] Iteration 11: reward = 154.98, sampled_timesteps = 176032
[Training] Iteration 12: reward = 170.42, sampled_timesteps = 192032
[Training] Iteration 13: reward = 191.34, sampled_timesteps = 208032
[Training] Iteration 14: reward = 202.32, sampled_timesteps = 224032
[Training] Iteration 15: reward = 218.48, sampled_timesteps = 240032
[Training] Iteration 16: reward = 224.14, sampled_timesteps = 256032
[Training] Iteration 17: reward = 221.78, sampled_timesteps = 272032
[Training] Iteration 18: reward = 220.42, sampled_timesteps = 288032
[Training] Iteration 19: reward = 211.40, sampled_timesteps = 304032
Additionally, I suspect the replay buffer isn't being correctly saved/restored. I tried setting store_buffer_in_checkpoints=True, but it appears to have no effect.
I suspect one or more of the following are being reset when loading from the checkpoint:
- Epsilon (exploration rate), possibly resetting back to 1.0
- Timestep counters, possibly resetting to 0, which affects epsilon decay
- Replay buffer, possibly lost upon restore
I haven't found a reliable way to monitor or verify any of these states after resuming.
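The closest I have is the rough spot-check below, run immediately after loading the checkpoint. It mostly reuses attributes that already appear in the reproduction script further down (algo.iteration, algo.local_replay_buffer.sampled_timesteps, algo.config["exploration_config"]); the getattr lookups for a get_num_timesteps method and an epsilon config field are guesses on my part and may not exist, or may not reflect the live exploration state, in this Ray version:

from ray.rllib.algorithms.algorithm import Algorithm

# Load the checkpoint saved by the reproduction script below.
algo = Algorithm.from_checkpoint("my_dqn_checkpoints")

# Iteration counter: after a restore I would expect this to continue from 10,
# not restart at 0.
print("iteration after restore:", algo.iteration)

# Replay buffer: if store_buffer_in_checkpoints took effect, the buffer should
# not be empty right after loading.
buf = algo.local_replay_buffer
print("sampled_timesteps after restore:", buf.sampled_timesteps)
print("timesteps in buffer:", getattr(buf, "get_num_timesteps", lambda: "n/a")())

# Exploration: this only shows the configured epsilon schedule; the live,
# decayed epsilon value is what I actually want to verify, but I haven't found
# where to read it.
print("epsilon setting:", getattr(algo.config, "epsilon", None))
print("exploration_config:", algo.config["exploration_config"])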
Versions / Dependencies
Ray 2.44.1
Python 3.10.16
Windows 11
Reproduction script
import ray
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.algorithms.algorithm import Algorithm
from pathlib import Path

CHECKPOINT_DIR = "my_dqn_checkpoints"
CONTINUE_TRAIN_ITER = 10


def continue_training(checkpoint_path):
    # Resume from an existing checkpoint if possible; otherwise build a fresh algorithm.
    try:
        algo = Algorithm.from_checkpoint(checkpoint_path)
        print(f"Loaded checkpoint from: {checkpoint_path}")
    except Exception as e:
        print(f"Error loading checkpoint: {e}, starting fresh training.")
        config = (
            DQNConfig()
            .environment("CartPole-v1")
            .env_runners(num_env_runners=2)
            .framework("torch")
            .training(
                replay_buffer_config={
                    "type": "PrioritizedEpisodeReplayBuffer",
                    "capacity": 60000,
                    "alpha": 0.5,
                    "beta": 0.5,
                },
                store_buffer_in_checkpoints=True,
            )
        )
        print(f"exploration_config: {config['exploration_config']}")
        algo = config.build_algo()
        print(f"exploration_config: {algo.config['exploration_config']}")

    # Train for another CONTINUE_TRAIN_ITER iterations from wherever the algorithm left off.
    for i in range(algo.iteration, algo.iteration + CONTINUE_TRAIN_ITER):
        result = algo.train()
        reward = result.get("env_runners", {}).get("episode_return_mean")
        if reward is None:
            reward = 0.0
        print(f"[Training] Iteration {i}: reward = {reward:.2f}, sampled_timesteps = {algo.local_replay_buffer.sampled_timesteps}")

    new_checkpoint = algo.save_to_path(checkpoint_path)
    print(f"\nContinued checkpoint saved to: {checkpoint_path}\n")
    return new_checkpoint


if __name__ == "__main__":
    checkpoint_dir = Path(CHECKPOINT_DIR).absolute()
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = str(checkpoint_dir)
    print(f"Checkpoint directory: {checkpoint}")
    new_checkpoint = continue_training(checkpoint)
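Running the script twice reproduces the issue: the first run finds no checkpoint in the directory, falls back to fresh training for 10 iterations, and saves; the second run loads that checkpoint and resumes, which produces the collapsed rewards shown above.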
Issue Severity
High: It blocks me from completing my task.