Atari fixes + benchmarks: memory, life-loss, per-game metric, README table #128

Merged: dnddnjs merged 4 commits into master from atari-ppo-tuning on May 24, 2026

dnddnjs commented May 17, 2026

Summary

Originally a PPO hyperparameter tuning PR. While running the longer 10M-frame schedule, a chain of issues surfaced; they are all fixed here. Also adds a README benchmarks section now that there are real numbers to report.

Replay buffer OOM (DQN). The buffer stored full (4, 84, 84) stacks per slot, so capacity 1M occupied ~28 GB and killed the laptop. Switched to single-frame storage with on-the-fly 4-stack reconstruction at sample time. Episode boundaries inside the stack are masked using stored done flags.
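A minimal sketch of the single-frame storage idea, not the code in this PR. The class name `FrameReplayBuffer`, the method names, and the choice to zero-fill frames that fall before an episode boundary are assumptions for illustration; actions, rewards, and next-state handling are left out to keep it short.

```python
import numpy as np


class FrameReplayBuffer:
    """Ring buffer of single frames; 4-frame stacks are rebuilt on demand."""

    def __init__(self, capacity, frame_shape=(84, 84), stack=4):
        self.capacity = capacity
        self.stack = stack
        self.frames = np.zeros((capacity, *frame_shape), dtype=np.uint8)
        self.dones = np.zeros(capacity, dtype=bool)  # True = last frame of an episode
        self.pos = 0   # next slot to write
        self.size = 0  # number of filled slots

    def add(self, frame, done):
        self.frames[self.pos] = frame
        self.dones[self.pos] = done
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def get_stack(self, idx):
        """Return the (stack, H, W) observation whose newest frame sits at idx."""
        out = np.zeros((self.stack, *self.frames.shape[1:]), dtype=np.uint8)
        out[-1] = self.frames[idx]
        oldest = self.pos if self.size == self.capacity else 0
        cur = idx
        for k in range(1, self.stack):
            if cur == oldest:
                break  # nothing older is stored
            prev = (cur - 1) % self.capacity
            if self.dones[prev]:
                break  # prev belongs to the previous episode: leave zeros
            out[-1 - k] = self.frames[prev]
            cur = prev
        return out

    def sample_stacks(self, batch_size, rng):
        """Sample random indices and rebuild one stack per index."""
        idxs = rng.integers(0, self.size, size=batch_size)
        return np.stack([self.get_stack(i) for i in idxs])


# Tiny demo with 2x2 frames; the episode ends on the frame filled with 2.
buf = FrameReplayBuffer(capacity=8, frame_shape=(2, 2))
for value, done in [(1, False), (2, True), (3, False), (4, False)]:
    buf.add(np.full((2, 2), value, dtype=np.uint8), done)

print(buf.get_stack(3)[:, 0, 0])  # [0 0 3 4]
```

The stack ending at the frame filled with 4 contains only frames from the current episode; the two slots that would have reached across the boundary stay zero.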

8 GB-friendly capacity. Even with the 4× memory cut, 1M × (84,84) is ~7 GB — borderline on an 8 GB unified-memory MacBook (swap starts, training output gets noisy). Default capacity is now 500k (~3.5 GB); bump back to 1M on machines with headroom.
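The figures above can be checked with a few lines of arithmetic. This assumes one byte per pixel (uint8 frames) and counts only the frame array, not actions, rewards, or done flags; `buffer_gb` is a hypothetical helper name.

```python
def buffer_gb(capacity, frames_per_slot, height=84, width=84, bytes_per_pixel=1):
    """Size of the frame storage in GB (10^9 bytes), assuming uint8 pixels."""
    return capacity * frames_per_slot * height * width * bytes_per_pixel / 1e9


print(round(buffer_gb(1_000_000, 4), 2))  # 28.22  (full 4-stacks per slot)
print(round(buffer_gb(1_000_000, 1), 2))  # 7.06   (single frames, 1M capacity)
print(round(buffer_gb(500_000, 1), 2))    # 3.53   (single frames, 500k capacity)
```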

Life loss triggered full game reset. terminal_on_life_loss=True combined with the main loop's env.reset() made every life loss restart the game — burning frames on noop_max=30 + FIRE and breaking long-horizon credit assignment. Added a LifeLossTerminalEnv wrapper that emits terminated=True on life loss but only resets the underlying env on real game-over. AtariPreprocessing's built-in flag is turned off so the wrapper owns the logic. Applies to both DQN and PPO via env.py.
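A rough sketch of how such a wrapper could behave, written without a Gymnasium dependency so it runs standalone. It is not the code in env.py: reading the life count from `info["lives"]`, the `resume_action` parameter, and dropping the reward of the resume step are all assumptions, and `ToyEnv` is a stand-in used only for the demo.

```python
class LifeLossTerminalEnv:
    """Reports terminated=True on life loss; really resets only on game-over."""

    def __init__(self, env, resume_action=0):
        self.env = env
        self.resume_action = resume_action  # hypothetical: action used to continue
        self.lives = 0
        self.game_over = True  # so the first reset() is a real one

    def reset(self, **kwargs):
        if self.game_over:
            obs, info = self.env.reset(**kwargs)
        else:
            # Life lost, game still running: take one step instead of resetting.
            # The reward from this step is discarded in this sketch.
            obs, _, terminated, truncated, info = self.env.step(self.resume_action)
            if terminated or truncated:
                obs, info = self.env.reset(**kwargs)
        self.lives = info.get("lives", 0)
        self.game_over = False
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        lives = info.get("lives", 0)
        life_lost = lives < self.lives
        self.lives = lives
        self.game_over = terminated or truncated
        info = {**info, "game_over": self.game_over}
        return obs, reward, terminated or life_lost, truncated, info


class ToyEnv:
    """Stand-in env for the demo: 2 lives, action 1 loses a life."""

    def __init__(self):
        self.resets = 0
        self.lives = 0

    def reset(self, **kwargs):
        self.resets += 1
        self.lives = 2
        return 0, {"lives": self.lives}

    def step(self, action):
        if action == 1:
            self.lives -= 1
        return 0, 1.0, self.lives == 0, False, {"lives": self.lives}


toy = ToyEnv()
env = LifeLossTerminalEnv(toy)
env.reset()                          # real reset
_, _, term1, _, info1 = env.step(1)  # first life lost
env.reset()                          # no real reset: game is still running
_, _, term2, _, info2 = env.step(1)  # last life lost
env.reset()                          # real reset
print(toy.resets)  # 2
```

The main loop can keep calling `reset()` after every terminal signal; only the wrapper decides whether the underlying game restarts.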

DQN hyperparameters re-aligned with modern v5 defaults. BATCH_SIZE 64 → 32, TARGET_UPDATE_EVERY 2500 train steps → 250 (≈ 1k env frames, hard update), EPSILON_END 0.1 → 0.01.
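For reference, the new values and the hard-update schedule can be sketched as below. `FRAMES_PER_TRAIN_STEP = 4` is an assumption inferred from "250 train steps ≈ 1k env frames", and `should_sync_target` and `hard_update` are hypothetical helper names; parameters are plain NumPy arrays here rather than a real network.

```python
import numpy as np

BATCH_SIZE = 32            # was 64
TARGET_UPDATE_EVERY = 250  # train steps; was 2500
EPSILON_END = 0.01         # was 0.1
FRAMES_PER_TRAIN_STEP = 4  # assumption, implied by "250 train steps ≈ 1k env frames"


def should_sync_target(train_step):
    """True on every TARGET_UPDATE_EVERY-th train step."""
    return train_step > 0 and train_step % TARGET_UPDATE_EVERY == 0


def hard_update(target_params, online_params):
    """Hard update: copy every online weight into the target network."""
    for name, value in online_params.items():
        target_params[name] = value.copy()


online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
if should_sync_target(250):
    hard_update(target, online)

print(TARGET_UPDATE_EVERY * FRAMES_PER_TRAIN_STEP)  # 1000
```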

Per-game return metric. Because life-loss now ends a logged episode, recent_mean_return reports per-life score. Added recent_mean_game_return that accumulates across all 5 lives and resets only on real game-over (signaled via info["game_over"] from LifeLossTerminalEnv). Logged to stdout and W&B in both DQN and PPO.
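The two metrics can be kept side by side with a small accumulator. This is a sketch under assumptions: the class name `ReturnTracker` and the 100-episode window are hypothetical, and only the `info["game_over"]` key comes from the description above.

```python
from collections import deque


class ReturnTracker:
    """Tracks per-life and per-game returns over a sliding window."""

    def __init__(self, window=100):  # window size is an assumption
        self.life_returns = deque(maxlen=window)
        self.game_returns = deque(maxlen=window)
        self._life = 0.0
        self._game = 0.0

    def update(self, reward, terminated, truncated, info):
        self._life += reward
        self._game += reward
        if terminated or truncated:
            # A logged episode ended (a life was lost or the game finished).
            self.life_returns.append(self._life)
            self._life = 0.0
        if info.get("game_over", False):
            # Real game-over: close out the score summed across all lives.
            self.game_returns.append(self._game)
            self._game = 0.0

    @property
    def recent_mean_return(self):
        r = self.life_returns
        return sum(r) / len(r) if r else 0.0

    @property
    def recent_mean_game_return(self):
        r = self.game_returns
        return sum(r) / len(r) if r else 0.0


tracker = ReturnTracker()
tracker.update(1.0, False, False, {"game_over": False})
tracker.update(2.0, True, False, {"game_over": False})  # life lost
tracker.update(3.0, True, False, {"game_over": True})   # last life lost
print(tracker.recent_mean_return, tracker.recent_mean_game_return)  # 3.0 6.0
```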

README benchmarks section. New "Benchmarks" block with a per-algorithm table (params, train time, final mean score, peak RAM, CPU/GPU, W&B report link). Hardware footprint is measured on a MacBook Pro 14" (M3, 8 GB, MPS); CPU/GPU usage is read off Activity Monitor for the python3.11 process. Scores live in publicly shared W&B Reports.

Misc.

  • moviepy dep added for a local-only eval/recording script (the script stays out of git because scripts/ is ignored).
  • .gitignore excludes scripts/, docs/, logs/ — all local-only working dirs.

Test plan

  • DQN 10M-frame run finishes within 8 GB RAM budget (5.27 GB peak)
  • DQN per-game mean reaches ~94 (was plateaued at ~12 per-life before)
  • PPO 10M-frame run with the new LifeLossTerminalEnv (rerun pending — previous run predates the fix)

dnddnjs added 2 commits May 18, 2026 06:59
…ames

Three of CleanRL's 'PPO 37 details' that were missing — flagged when the
5M and 10M Breakout runs both plateaued at per-game ~75 with entropy
stuck around 0.8 (policy wasn't sharpening, clip rarely activating):

- Linear LR anneal from 2.5e-4 -> 0 across the run; lets late updates
  fine-tune instead of bouncing.
- Value-function loss clipping around the old prediction (CLIP_COEF),
  matching the policy clipping range; stabilizes value targets.
- Advantage normalization moved inside the minibatch loop instead of
  once per batch.

Also bumps TOTAL_FRAMES 5M -> 10M to match the CleanRL Atari budget so
runs are directly comparable to their published curves. The learning rate
is now logged to wandb so the anneal is visible.

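The three details listed in this commit message can be sketched in NumPy as follows. This is an illustration, not the code in the commit: the function names are hypothetical, `clip_coef=0.1` in the demo is only an example value, and NumPy's `std` is the population standard deviation, which may differ slightly from what a framework computes by default.

```python
import numpy as np


def linear_lr(update, num_updates, base_lr=2.5e-4):
    """Learning rate for a 1-based update index, decaying linearly toward 0."""
    frac = 1.0 - (update - 1.0) / num_updates
    return frac * base_lr


def clipped_value_loss(new_values, old_values, returns, clip_coef):
    """Value loss with the new prediction clipped around the old one."""
    loss_unclipped = (new_values - returns) ** 2
    v_clipped = old_values + np.clip(new_values - old_values, -clip_coef, clip_coef)
    loss_clipped = (v_clipped - returns) ** 2
    # Take the worse of the two per sample, then average.
    return 0.5 * np.maximum(loss_unclipped, loss_clipped).mean()


def normalize_advantages(mb_advantages, eps=1e-8):
    """Normalize using the statistics of one minibatch, not the whole batch."""
    return (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + eps)


lr_first = linear_lr(1, 100)   # full base learning rate on the first update
lr_last = linear_lr(100, 100)  # base_lr / 100 on the last update
v_loss = clipped_value_loss(
    np.array([1.0]), np.array([0.0]), np.array([1.0]), clip_coef=0.1
)
mb_adv = normalize_advantages(np.array([1.0, 2.0, 3.0]))
```

With this schedule the rate reaches `base_lr / num_updates` on the final update rather than exactly zero.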
- ReplayBuffer stores single frames and stacks 4 at sample time (~28GB -> ~7GB).
- LifeLossTerminalEnv signals terminal on life loss but defers real reset to
  game-over, so noop_max + FIRE no longer fire every life and GAE/Q chains
  break only at the right boundary.
- DQN: BATCH_SIZE 64 -> 32, TARGET_UPDATE_EVERY 2500 -> 250 train steps
  (~1k env frames), EPSILON_END 0.1 -> 0.01.
- Log per-life and per-game returns separately (DQN and PPO).
dnddnjs changed the title from "PPO tuning: LR anneal, value clipping, per-minibatch adv norm" to "Atari fixes: DQN memory, life-loss episodes, per-game metric" on May 23, 2026
dnddnjs added 2 commits May 24, 2026 11:58
- README: add Atari to algorithms list, new Benchmarks section with hardware
  notes, per-algo row (params, train time, score, RAM, CPU/GPU, W&B report).
- DQN buffer 1M -> 500k (~3.5GB): the 1M-capacity buffer (~7GB) was swapping
  on 8GB unified memory.
- moviepy added for the local eval/recording script.
- .gitignore: exclude scripts/ and docs/ (local-only working dirs).
dnddnjs changed the title from "Atari fixes: DQN memory, life-loss episodes, per-game metric" to "Atari fixes + benchmarks: memory, life-loss, per-game metric, README table" on May 24, 2026
dnddnjs merged commit 54ffaeb into master May 24, 2026
dnddnjs deleted the atari-ppo-tuning branch May 24, 2026 03:03