[RLlib] Significant drop in DQN training reward when resuming from checkpoint #53878

@WhizZest

Description

What happened + What you expected to happen

I ran DQN training on CartPole-v1 for 10 iterations and observed steadily increasing rewards, reaching 123.54 before saving a checkpoint and exiting. Everything looked stable:

[Training] Iteration 0: reward = 20.04, sampled_timesteps = 32
[Training] Iteration 1: reward = 23.34, sampled_timesteps = 16032
[Training] Iteration 2: reward = 30.50, sampled_timesteps = 32032
[Training] Iteration 3: reward = 32.42, sampled_timesteps = 48032
[Training] Iteration 4: reward = 40.64, sampled_timesteps = 64032
[Training] Iteration 5: reward = 57.04, sampled_timesteps = 80032
[Training] Iteration 6: reward = 69.62, sampled_timesteps = 96032
[Training] Iteration 7: reward = 91.88, sampled_timesteps = 112032
[Training] Iteration 8: reward = 110.36, sampled_timesteps = 128032
[Training] Iteration 9: reward = 123.54, sampled_timesteps = 144032

When I resumed training by loading the checkpoint and continued for another 10 iterations (iterations 10–19 in the log), the reward collapsed immediately: iteration 10 dropped to 22.78, despite the checkpoint having been saved at a reward of 123.54. Note that sampled_timesteps also restarts from 16000 rather than continuing from 144032:

[Training] Iteration 10: reward = 22.78, sampled_timesteps = 16000
[Training] Iteration 11: reward = 27.72, sampled_timesteps = 32000
[Training] Iteration 12: reward = 36.40, sampled_timesteps = 48000
[Training] Iteration 13: reward = 41.04, sampled_timesteps = 64000
[Training] Iteration 14: reward = 45.60, sampled_timesteps = 80000
[Training] Iteration 15: reward = 50.84, sampled_timesteps = 96000
[Training] Iteration 16: reward = 59.66, sampled_timesteps = 112000
[Training] Iteration 17: reward = 67.42, sampled_timesteps = 128000
[Training] Iteration 18: reward = 72.24, sampled_timesteps = 144000
[Training] Iteration 19: reward = 80.78, sampled_timesteps = 160000

If I instead train continuously from scratch for 20 iterations, rewards increase normally (iteration 11 reaches ~154):

[Training] Iteration 0: reward = 24.18, sampled_timesteps = 32
[Training] Iteration 1: reward = 20.68, sampled_timesteps = 16032
[Training] Iteration 2: reward = 28.98, sampled_timesteps = 32032
[Training] Iteration 3: reward = 36.06, sampled_timesteps = 48032
[Training] Iteration 4: reward = 46.12, sampled_timesteps = 64032
[Training] Iteration 5: reward = 56.12, sampled_timesteps = 80032
[Training] Iteration 6: reward = 75.36, sampled_timesteps = 96032
[Training] Iteration 7: reward = 90.38, sampled_timesteps = 112032
[Training] Iteration 8: reward = 104.32, sampled_timesteps = 128032
[Training] Iteration 9: reward = 125.52, sampled_timesteps = 144032
[Training] Iteration 10: reward = 142.24, sampled_timesteps = 160032
[Training] Iteration 11: reward = 154.98, sampled_timesteps = 176032
[Training] Iteration 12: reward = 170.42, sampled_timesteps = 192032
[Training] Iteration 13: reward = 191.34, sampled_timesteps = 208032
[Training] Iteration 14: reward = 202.32, sampled_timesteps = 224032
[Training] Iteration 15: reward = 218.48, sampled_timesteps = 240032
[Training] Iteration 16: reward = 224.14, sampled_timesteps = 256032
[Training] Iteration 17: reward = 221.78, sampled_timesteps = 272032
[Training] Iteration 18: reward = 220.42, sampled_timesteps = 288032
[Training] Iteration 19: reward = 211.40, sampled_timesteps = 304032

Additionally, I suspect the replay buffer isn't being correctly saved/restored. I tried setting store_buffer_in_checkpoints=True, but it appears to have no effect.
I suspect one or more of the following are being reset when loading from the checkpoint:

  1. Epsilon (exploration rate), possibly resetting back to 1.0
  2. Timestep counters, possibly resetting to 0, which affects epsilon decay
  3. Replay buffer, possibly lost upon restore

I haven't found a reliable way to monitor or verify any of these states after resuming.
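
For reference, here is a rough sketch of how I would try to probe these three states right after a restore (assuming the new API stack in Ray 2.44; the buffer accessor get_num_timesteps() and the metrics key num_env_steps_sampled_lifetime are my assumptions, not verified API):

from ray.rllib.algorithms.algorithm import Algorithm

algo = Algorithm.from_checkpoint("my_dqn_checkpoints")

# (3) Replay buffer: how much of it survived the restore?
buf = algo.local_replay_buffer
print("timesteps stored in buffer:", buf.get_num_timesteps())  # assumed accessor
print("timesteps sampled from buffer so far:", buf.sampled_timesteps)

# (2) Timestep counters: lifetime env steps reported after one train() call
# ("num_env_steps_sampled_lifetime" is my guess at the metrics key).
result = algo.train()
print(
    "lifetime env steps sampled:",
    result.get("env_runners", {}).get("num_env_steps_sampled_lifetime"),
)

# (1) Epsilon: as far as I can tell, the epsilon-greedy schedule is driven by
# the lifetime timestep counter, so if that counter resets to ~0 on restore,
# epsilon effectively jumps back to its initial value as well.

If the buffer size and the lifetime counter come back near zero right after the restore, that would confirm the suspected reset.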

Versions / Dependencies

Ray 2.44.1
Python 3.10.16
Windows 11

Reproduction script

from pathlib import Path

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.dqn import DQNConfig

CHECKPOINT_DIR = "my_dqn_checkpoints"
CONTINUE_TRAIN_ITER = 10

def continue_training(checkpoint_path):
    # Try to resume from an existing checkpoint; fall back to fresh training.
    try:
        algo = Algorithm.from_checkpoint(checkpoint_path)
        print(f"Loaded checkpoint from: {checkpoint_path}")
    except Exception as e:
        print(f"Error loading checkpoint: {e}, starting fresh training.")
        config = (
            DQNConfig()
            .environment("CartPole-v1")
            .env_runners(num_env_runners=2)
            .framework("torch")
            .training(
                replay_buffer_config={
                    "type": "PrioritizedEpisodeReplayBuffer",
                    "capacity": 60000,
                    "alpha": 0.5,
                    "beta": 0.5,
                },
                store_buffer_in_checkpoints=True,
            )
        )
        print(f"exploration_config: {config['exploration_config']}")
        algo = config.build_algo()

    print(f"exploration_config: {algo.config['exploration_config']}")
    # Continue counting from the iteration stored in the (restored) algorithm.
    for i in range(algo.iteration, algo.iteration + CONTINUE_TRAIN_ITER):
        result = algo.train()
        reward = result.get("env_runners", {}).get("episode_return_mean")
        if reward is None:
            reward = 0.0
        print(
            f"[Training] Iteration {i}: reward = {reward:.2f}, "
            f"sampled_timesteps = {algo.local_replay_buffer.sampled_timesteps}"
        )

    # Save (possibly overwriting) the checkpoint in the same directory.
    new_checkpoint = algo.save_to_path(checkpoint_path)
    print(f"\nContinued checkpoint saved to: {new_checkpoint}\n")

    return new_checkpoint

if __name__ == "__main__":
    checkpoint_dir = Path(CHECKPOINT_DIR).absolute()
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = str(checkpoint_dir)
    print(f"Checkpoint directory: {checkpoint}")

    new_checkpoint = continue_training(checkpoint)
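
To isolate the buffer question, a minimal round-trip check along these lines should make any loss visible directly (a sketch using the same config as above; get_num_timesteps() is again my assumption for the buffer's size accessor, and "buffer_roundtrip_test" is just a scratch directory):

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .training(
        replay_buffer_config={
            "type": "PrioritizedEpisodeReplayBuffer",
            "capacity": 60000,
            "alpha": 0.5,
            "beta": 0.5,
        },
        store_buffer_in_checkpoints=True,
    )
)
algo = config.build_algo()
for _ in range(2):
    algo.train()  # fill the buffer with some experience first

before = algo.local_replay_buffer.get_num_timesteps()  # size before saving
path = algo.save_to_path("buffer_roundtrip_test")

restored = Algorithm.from_checkpoint(path)
after = restored.local_replay_buffer.get_num_timesteps()  # size after restore
print(f"timesteps in buffer before save: {before}, after restore: {after}")
# If store_buffer_in_checkpoints took effect, 'after' should equal 'before';
# a value of 0 would confirm the buffer is dropped on restore.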

Issue Severity

High: It blocks me from completing my task.

Metadata

Assignees

No one assigned

Labels

P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), question (Just a question :)), rllib (RLlib related issues), stability, windows
