Update to support Gymnasium #277

Closed

arjun-kg
Contributor

@arjun-kg arjun-kg commented Sep 25, 2022

Description

A draft PR updating CleanRL to support Gymnasium. Closes #263

This mostly involves updating the step and seed API. It tries to use the gymnasium branches of the dependent packages (SB3, etc.). After these are updated, I will verify the changes, check the tests, and get the PR ready for review.
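
For readers unfamiliar with the step API change, here is a minimal sketch of how an old-style 4-tuple step result maps onto the 5-tuple used by Gym 0.26 and Gymnasium. The helper name `to_new_step_api` is hypothetical and not part of CleanRL or Gym. It assumes time-outs were flagged through the `"TimeLimit.truncated"` info key, as older Gym's TimeLimit wrapper did.

```python
def to_new_step_api(step_result):
    """Convert (obs, reward, done, info) into
    (obs, reward, terminated, truncated, info).

    Hypothetical helper for illustration only.
    """
    obs, reward, done, info = step_result
    # Assumption: a time-out is marked by this info key, as in older Gym.
    truncated = bool(info.get("TimeLimit.truncated", False))
    # The episode truly terminated only if it ended for a reason other than the time limit.
    terminated = bool(done) and not truncated
    return obs, reward, terminated, truncated, info


# An episode that was cut off by the time limit.
old_result = ([0.0], 1.0, True, {"TimeLimit.truncated": True})
obs, reward, terminated, truncated, info = to_new_step_api(old_result)
```

The seed change is analogous: instead of calling `env.seed(seed)`, the seed is passed to `reset`, which now returns an `(obs, info)` pair.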

Costa's comment:

Thanks @arjun-kg for the PR. We look forward to supporting the next generation of gym.

It's important to identify the performance-impacting changes and non-performance-impacting changes:

In this PR for initial support for v0.26.1, let's aim to make only non-performance-impacting changes. With that said, here is a todo list:

  • Deprecate pybullet (since the new mujoco environments are being maintained again but pybullet is not)
  • Temporarily remove Isaac Gym Support
  • Re-run mujoco experiments (use the v4 environments instead of the current v2 environments)
  • Atari wrappers upstream fixes (Add Gym 0.26 support DLR-RM/stable-baselines3#780)
  • procgen

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo
  • I have updated the tests accordingly (if applicable).

@vwxyzjn
Owner

vwxyzjn commented Sep 27, 2022

Related to #271

Owner

@vwxyzjn vwxyzjn left a comment

Hey @arjun-kg thanks for preparing the PR. Looking forward to using the latest gym. I left some very preliminary comments.

One important thing to note during the refactor is to see if the change could result in a performance difference (not just a simple variable renaming). For example, the current PPO scripts did not handle the time out correctly, so handling time out correctly in this PR is a performance-impacting change.

We need to be careful with the performance-impacting changes because we would need to re-run the benchmarks on those changes to ensure there is no surprise regression in the performance.
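
To illustrate why time-out handling counts as performance-impacting, the sketch below contrasts a one-step bootstrapped target under the two conventions. The function `td_target` and its parameters are illustrative assumptions, not code from this PR.

```python
def td_target(reward, next_value, stop_bootstrap, gamma=0.99):
    """One-step target: bootstrap from next_value unless stop_bootstrap is set."""
    return reward + gamma * next_value * (1.0 - float(stop_bootstrap))


# A step that hit the time limit: truncated, but not terminated.
terminated, truncated = False, True

# Treating the time-out as terminal drops the value of the next state.
old_style = td_target(1.0, 10.0, terminated or truncated)
# Using only `terminated` keeps bootstrapping through the time-out.
new_style = td_target(1.0, 10.0, terminated)
```

The two targets differ whenever an episode ends by time-out, which is why such a change would call for re-running the benchmarks.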

cleanrl/ppo.py Outdated
@@ -213,18 +213,18 @@ def get_action_and_value(self, x, action=None):
writer.add_scalar("charts/episodic_length", item["episode"]["l"], global_step)
break
Owner

This part may need to be changed to

            if "episode" in info:
                for item in info["episode"]["r"]:
                    print(f"global_step={global_step}, episodic_return={item}")
                    writer.add_scalar("charts/episodic_return", item, global_step)
                    break
                for item in info["episode"]["l"]:
                    writer.add_scalar("charts/episodic_length", item, global_step)
                    break

To replicate the original behavior, you probably need something like (but hopefully better-looking than!):

if "episode" in info:
    first_idx = info["_episode"].nonzero()[0][0]
    r = info["episode"]["r"][first_idx]
    l = info["episode"]["l"][first_idx]
    print(f"global_step={global_step}, episodic_return={r}")
    writer.add_scalar("charts/episodic_return", r, global_step)
    writer.add_scalar("charts/episodic_length", l, global_step)

There's no guarantee that the first index in "episode" won't just be a zero, need the mask to specify which one.

Alternatively, it might be better to track a running average using the deques built into the RecordEpisodeStatistics wrapper, though that would likely result in different performance graphs.
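
The mask-based lookup above could be wrapped in a small helper, sketched here under the assumption that the vectorized info dict carries `"episode"` arrays plus a boolean `"_episode"` mask, as the snippet suggests. The name `first_finished_episode` is hypothetical. It iterates instead of calling `nonzero()` so that it works on plain lists as well as NumPy arrays.

```python
def first_finished_episode(info):
    """Return (episodic_return, episodic_length) for the first sub-environment
    whose episode finished this step, or None if no episode finished."""
    if "episode" not in info:
        return None
    # The mask says which entries of info["episode"] are valid; the others may be zeros.
    for idx, finished in enumerate(info["_episode"]):
        if finished:
            return info["episode"]["r"][idx], info["episode"]["l"][idx]
    return None


# Hypothetical info dict for three sub-environments; only the last two finished.
info = {
    "episode": {"r": [0.0, 12.5, 3.0], "l": [0, 40, 7]},
    "_episode": [False, True, True],
}
stats = first_finished_episode(info)
```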

@@ -159,12 +159,12 @@ def linear_schedule(start_e: float, end_e: float, duration: int, t: int):
envs.single_observation_space,
envs.single_action_space,
device,
-handle_timeout_termination=True,
+handle_timeout_termination=False,
Owner

Perfect!

cleanrl/ppo.py Outdated
with torch.no_grad():
next_value = agent.get_value(next_obs).reshape(1, -1)
if args.gae:
advantages = torch.zeros_like(rewards).to(device)
lastgaelam = 0
for t in reversed(range(args.num_steps)):
if t == args.num_steps - 1:
-nextnonterminal = 1.0 - next_done
+nextnonterminal = 1.0 - next_terminated
Owner

A note for myself: this is a change that could impact performance. We would need to re-run the benchmark here.
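
A minimal sketch of why this one-line change matters, using a single-environment, list-based version of generalized advantage estimation. The function `compute_gae` is an illustrative simplification written for this note, not the code in ppo.py. The final flag passed as `next_done` decides whether the last step bootstraps from `next_value`.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages for one rollout of a single environment.

    dones[t] marks whether the observation at step t started a new episode.
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - float(next_done)
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - float(dones[t + 1])
            nextvalues = values[t + 1]
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages


# Rollout whose last step was truncated (time-out) but not terminated.
next_terminated, next_truncated = False, True
with_done = compute_gae([1.0], [0.0], [False], 10.0,
                        next_terminated or next_truncated, gamma=0.5, gae_lambda=1.0)
with_terminated = compute_gae([1.0], [0.0], [False], 10.0,
                              next_terminated, gamma=0.5, gae_lambda=1.0)
```

The two calls give different advantages for the same rollout, which is the kind of difference a benchmark re-run would need to check.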

@vwxyzjn vwxyzjn mentioned this pull request Oct 3, 2022
4 tasks
@vwxyzjn
Owner

vwxyzjn commented Oct 3, 2022

Thanks @arjun-kg for the PR. We look forward to supporting the next generation of gym.

It's important to identify the performance-impacting changes and non-performance-impacting changes:

In this PR for initial support for v0.26.1, let's aim to make only non-performance-impacting changes. With that said, I have added a todo list in the PR description.

@vwxyzjn
Owner

vwxyzjn commented Oct 4, 2022

@arjun-kg I made a first pass of edits so that ppo.py and dqn.py pass CI. Could you try looking at ddpg_continuous_action.py, TD3, and DDPG?

Btw the plan is to have an announcement like the following on the main page, since I expect to encounter more issues.

image

@GaetanLepage

Hi!
Reading through the proposed changes, I don't quite understand the following:
Why is done replaced with terminated rather than terminated or truncated?
I don't see why the truncated return value is ignored.

@vwxyzjn
Owner

vwxyzjn commented Nov 12, 2022

@GaetanLepage yeah, we should do terminated or truncated for the moment until we properly deal with #198.

@arjun-kg I added some changes to ppo_continuous_action.py to make it work with DM control, powered by https://github.com/Farama-Foundation/Shimmy/blob/main/tests/test_dm_control.py

@vwxyzjn
Owner

vwxyzjn commented Nov 15, 2022

@arjun-kg we are thinking of probably supporting both gymnasium and gym simultaneously. See #318 (comment) as an example. This will give us a much smoother transition

@arjun-kg
Contributor Author

@vwxyzjn sounds good, will check it out. I'm a bit tied up this week. I'll continue work on this from next week if it's okay.

for idx, d in enumerate(dones):
    if d:
        real_next_obs[idx] = infos[idx]["terminal_observation"]
rb.add(obs, real_next_obs, actions, rewards, dones, infos)
Hi, I guess this line has been forgotten in the migration :-).

@pseudo-rnd-thoughts
Collaborator

We have just released Gymnasium v0.27.0, which should be backward compatible. Would it be possible to update this PR to v0.27 and check that nothing new breaks?

@arjun-kg
Contributor Author

@vwxyzjn SB3 now supports gymnasium on a branch, but I'm not sure whether parallel work is already under way to update cleanrl to gymnasium. Would you like me to update this PR to gymnasium with SB3 on the gymnasium branch?

@arjun-kg arjun-kg changed the title Update to support Gym v0.26.1 Update to support Gymnasium Mar 27, 2023
real_next_obs = next_obs.copy()
-for idx, d in enumerate(dones):
+for idx, d in enumerate(terminateds):
Contributor

@vcharraut vcharraut Mar 29, 2023

Should this use truncated instead of terminated here?

With truncated, the old and new implementations produce identical results under the same seed.

Contributor Author

Yes, this was a mistake, it should be truncated
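
A small sketch of the corrected loop, assuming a vectorized environment that auto-resets and exposes the last observation of a finished episode under `infos["final_observation"]`, as the Gymnasium versions discussed in this thread do. The helper name `real_next_observations` is hypothetical.

```python
import copy


def real_next_observations(next_obs, truncateds, infos):
    """Return a copy of next_obs where truncated sub-environments hold the
    final observation of the finished episode rather than the post-reset one."""
    # copy.copy works for both plain lists and NumPy arrays.
    real_next_obs = copy.copy(next_obs)
    for idx, truncated in enumerate(truncateds):
        if truncated:
            real_next_obs[idx] = infos["final_observation"][idx]
    return real_next_obs


# Two sub-environments; the second one was truncated and has already been reset.
next_obs = [[0.0, 0.0], [9.0, 9.0]]
infos = {"final_observation": [None, [5.0, 5.0]]}
fixed = real_next_observations(next_obs, [False, True], infos)
```

Replacing the observation only for truncated episodes keeps the bootstrapped value tied to the state the agent actually reached, since terminated episodes do not bootstrap at all.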

@@ -191,12 +190,12 @@ def forward(self, x):
writer.add_scalar("charts/episodic_length", info["episode"]["l"], global_step)
break
Contributor

@vcharraut vcharraut Mar 29, 2023

Assuming there are no parallel envs, this could work:

        if "final_info" in infos:
            info = infos["final_info"][0]
            print(f"global_step={global_step}, episodic_return={info['episode']['r']}")
            writer.add_scalar("charts/episodic_return", info["episode"]["r"], global_step)
            writer.add_scalar("charts/episodic_length", info["episode"]["l"], global_step)

But I have seen that there is a different solution in the DQN file
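
For the case with several parallel envs, a sketch along the following lines could drop the single-env assumption. It assumes `infos["final_info"]` holds one entry per sub-environment, with None for those that did not finish on this step. The helper name `finished_episode_stats` is hypothetical and this is not necessarily the approach taken in the DQN file.

```python
def finished_episode_stats(infos):
    """Collect (episodic_return, episodic_length) for every sub-environment
    that finished an episode on this step."""
    stats = []
    for info in infos.get("final_info", []):
        # Sub-environments that are still running have no final info.
        if info is None or "episode" not in info:
            continue
        stats.append((info["episode"]["r"], info["episode"]["l"]))
    return stats


# Hypothetical infos for three sub-environments; the first is still running.
infos = {
    "final_info": [
        None,
        {"episode": {"r": 10.0, "l": 5}},
        {"episode": {"r": 2.0, "l": 3}},
    ]
}
finished = finished_episode_stats(infos)
```

Each returned pair could then be passed to `writer.add_scalar` in the same way as in the single-env snippet.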

@@ -71,7 +70,7 @@ def thunk():
if capture_video:
if idx == 0:
env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
env.seed(seed)

env.action_space.seed(seed)
env.observation_space.seed(seed)
Contributor

I think env.observation_space.seed(seed) can be removed

@kvrban

kvrban commented Apr 30, 2023

Just tried the PR with:

diff --git a/cleanrl/dqn.py b/cleanrl/dqn.py
 import time
 from distutils.util import strtobool
 
-import gym
+import gymnasium as gym
 import numpy as np
 import torch
 import torch.nn as nn

with stable-baselines3==2.0.0a5 and gymnasium==0.28.1

When I run

python3 cleanrl/cleanrl/dqn.py

execution always stops after 'global_step=10009' with this error:

Traceback (most recent call last):
  File "/home/kris/dev/cleanRL-Gymnasium/cleanrl/cleanrl/dqn.py", line 195, in <module>
    data = rb.sample(args.batch_size)
  File "/home/kris/.local/lib/python3.9/site-packages/stable_baselines3/common/buffers.py", line 285, in sample
    return super().sample(batch_size=batch_size, env=env)
  File "/home/kris/.local/lib/python3.9/site-packages/stable_baselines3/common/buffers.py", line 110, in sample
    batch_inds = np.random.randint(0, upper_bound, size=batch_size)
  File "mtrand.pyx", line 765, in numpy.random.mtrand.RandomState.randint
  File "_bounded_integers.pyx", line 1247, in numpy.random._bounded_integers._rand_int64
ValueError: high <= 0

@kvrban

kvrban commented May 1, 2023

I think removing the following line was not intended:

rb.add(obs, real_next_obs, actions, rewards, terminateds, infos)

This could be the fix (fixed it for me):

diff --git a/cleanrl/dqn.py b/cleanrl/dqn.py
index 14864e7..4e73a6e 100644
--- a/cleanrl/dqn.py
+++ b/cleanrl/dqn.py
@@ -156,7 +156,7 @@ if __name__ == "__main__":
     start_time = time.time()
 
     # TRY NOT TO MODIFY: start the game
-    obs = envs.reset(seed=args.seed)
+    obs, _ = envs.reset(seed=args.seed)
     for global_step in range(args.total_timesteps):
         # ALGO LOGIC: put action logic here
         epsilon = linear_schedule(args.start_e, args.end_e, args.exploration_fraction * args.total_timesteps, global_step)
@@ -185,6 +185,7 @@ if __name__ == "__main__":
             for idx, d in enumerate(infos["_final_observation"]):
                 if d:
                     real_next_obs[idx] = infos["final_observation"][idx]
+        rb.add(obs, real_next_obs, actions, rewards, terminateds, infos)
 
         # TRY NOT TO MODIFY: CRUCIAL step easy to overlook
         obs = next_obs
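
The failure mode can be reproduced in miniature: if nothing is ever added to the buffer, the first sampling attempt draws from an empty index range. The class below is a hypothetical stand-in written for illustration, not SB3's ReplayBuffer, and it uses the standard library's `random.randrange`, which raises ValueError on an empty range much as `np.random.randint` does in the traceback above.

```python
import random


class TinyReplayBuffer:
    """Hypothetical minimal buffer, for illustration only."""

    def __init__(self):
        self.transitions = []

    def add(self, obs, next_obs, action, reward, terminated):
        self.transitions.append((obs, next_obs, action, reward, terminated))

    def sample(self, batch_size):
        upper_bound = len(self.transitions)
        # With an empty buffer, upper_bound is 0 and randrange raises ValueError.
        indices = [random.randrange(0, upper_bound) for _ in range(batch_size)]
        return [self.transitions[i] for i in indices]


rb = TinyReplayBuffer()
try:
    rb.sample(2)
    sampling_failed = False
except ValueError:
    sampling_failed = True

# Once transitions are added, sampling works again.
rb.add(0, 1, 0, 1.0, False)
batch = rb.sample(4)
```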

@pseudo-rnd-thoughts
Collaborator

@kvrban Thanks for the comment but I think the plan is to complete this PR as several smaller PRs, see #370 and #371.

@arjun-kg or @vwxyzjn Should this PR be closed to avoid confusion?

@vwxyzjn
Owner

vwxyzjn commented May 1, 2023

@pseudo-rnd-thoughts absolutely. Closing this PR now.

@vwxyzjn vwxyzjn closed this May 1, 2023

Successfully merging this pull request may close these issues.

Upgrade gym version to 0.26.1
8 participants