
[RLlib] Fix MB MPO #39654

Merged — 2 commits, Sep 15, 2023

Conversation

ArturNiederfahrenhorst (Contributor)

Why are these changes needed?

#38321 raises the issue that the MB MPO example does not work.
This PR fixes multiple issues within MBMPO that have gone undetected in the past:

  • When initializing the loss, we use a batch size of 32, which is incompatible with the MAML loss function.
  • gymnasium has changed over time: the bare Pendulum and CartPole envs that we wrap no longer have any logic that truncates or terminates episodes by default. The relevant wrappers are only added when the envs are created via gym.make(...). Our old wrapping technique therefore caused MBMPO's initial sample collection to run endlessly.
  • Later, when MBMPO uses a dynamics model to predict future observations, the dynamics model likewise lacks a truncation or termination mechanism. This PR adds one to the envs. I'm not sure how this was handled in the past.
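The gymnasium behavior described above — a bare env that never ends an episode on its own, with truncation only supplied by a wrapper that `gym.make(...)` adds — can be illustrated with a minimal standalone sketch. Both classes here are hypothetical stand-ins (no gymnasium dependency), mimicking the `TimeLimit`-style wrapping the PR relies on:

```python
class BareEnv:
    """Hypothetical stand-in for a bare env: never terminates or truncates."""

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        # A bare env returns terminated=False, truncated=False forever,
        # so an unwrapped rollout loop would collect endlessly.
        return self.state, 0.0, False, False


class TimeLimit:
    """Sketch of the episode-length wrapper that env creation normally adds:
    flag truncation once max_episode_steps have elapsed."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps

    def reset(self):
        self._elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated = self.env.step(action)
        self._elapsed += 1
        if self._elapsed >= self.max_episode_steps:
            truncated = True
        return obs, reward, terminated, truncated


env = TimeLimit(BareEnv(), max_episode_steps=5)
env.reset()
for _ in range(5):
    obs, reward, terminated, truncated = env.step(1.0)
# After 5 steps the wrapper sets truncated=True, ending the rollout.
```

Without the wrapper (or an equivalent mechanism inside the dynamics model's imagined rollouts), the collection loop has no stopping signal, which matches the endless-collection symptom the PR fixes.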

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

@sven1977 sven1977 left a comment


Nice fix! Thanks @ArturNiederfahrenhorst for digging into this.

@sven1977 sven1977 merged commit 586f1b5 into ray-project:master Sep 15, 2023
38 of 41 checks passed
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: Victor <vctr.y.m@example.com>
MrSVF commented Oct 18, 2023

Dear @ArturNiederfahrenhorst, thank you for your pull request! The mbmpo algorithm runs correctly now, but unfortunately it does not train properly. Could you analyze this issue #40400 and give your recommendation, why this is happening and how to correct this behavior?


3 participants