
[RLlib] Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers. #15815

Merged: 12 commits into ray-project:master on May 18, 2021

Conversation

@Bam4d (Contributor) commented May 14, 2021

Why are these changes needed?

There are a few previously reported memory leaks associated with multi-agent training. During MA training, a very slow memory leak eventually kills training across many different algorithms (reported against IMPALA and PPO specifically).

This PR also contains the tools I built and used to find the memory leak, along with a brief description in the comments of how they work.

The main memory leak fix concerns a mapping from (agent, episode_id) -> policy_id that kept accumulating entries, as described here:
https://discuss.ray.io/t/help-debugging-a-memory-leak-in-rllib/2100
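
Roughly, this class of leak and its fix look like the sketch below. This is an illustrative sketch only; `PolicyMappingCache`, `policy_for`, and `on_episode_end` are hypothetical names, not the actual RLlib internals touched by this PR.

```python
class PolicyMappingCache:
    """Illustrative sketch of the leak pattern (not the real RLlib code).

    An entry is added for every (agent_id, episode_id) pair, so without
    cleanup the dict grows forever as new episodes are generated.
    """

    def __init__(self, policy_mapping_fn):
        self.policy_mapping_fn = policy_mapping_fn
        self._cache = {}  # (agent_id, episode_id) -> policy_id

    def policy_for(self, agent_id, episode_id):
        key = (agent_id, episode_id)
        if key not in self._cache:
            self._cache[key] = self.policy_mapping_fn(agent_id)
        return self._cache[key]

    def on_episode_end(self, episode_id):
        # The fix: drop all cached entries for the finished episode so the
        # mapping does not keep entries for episodes that no longer exist.
        for key in [k for k in self._cache if k[1] == episode_id]:
            del self._cache[key]
```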

Related issue number

Supersedes #15783
Closes #9964

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Bam4d marked this pull request as draft on May 14, 2021 15:35
@Bam4d marked this pull request as ready for review on May 16, 2021 09:47
@Bam4d (Contributor, Author) commented May 16, 2021

@sven1977 I think this is the main memory leak, and I've struggled to find the cause of some of the others I'm seeing. This fix is good for the case in the dummy environment in the original ticket, but when using Griddly, the memory is still leaking (albeit slower). I will do some more investigating over the next few days and make another PR if I find anything else to fix.

Thanks

@Bam4d changed the title from "[WIP (do not merge)] Fixing Memory Leaks" to "Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers." on May 16, 2021
@Bam4d (Contributor, Author) commented May 16, 2021

Before 20M time steps: [memory usage screenshot]

After 500M+ time steps: [memory usage screenshot]

@Bam4d (Contributor, Author) commented May 17, 2021

> This fix is good for the case in the dummy environment in the original ticket, but when using Griddly, the memory is still leaking (albeit slower). I will do some more investigating over the next few days and make another PR if I find anything else to fix.

I've found the cause of the other memory leak I'm seeing, and it's a PyTorch issue rather than an RLlib or Griddly one!

@Bam4d (Contributor, Author) commented May 17, 2021

@sven1977 do you have any pointers on how to run these tests locally so I can find out why they are broken?

@sven1977 self-assigned this on May 18, 2021
@sven1977 (Contributor) left a review comment


This is really great, @Bam4d! Thanks for all the hard detective work and these additional Callback classes to help with other leaks in the future! :)

Review comment on rllib/agents/callbacks/memory_callbacks.py (outdated, resolved)
@sven1977 changed the title from "Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers." to "[RLlib] Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers." on May 18, 2021
@sven1977 merged commit 0be83d9 into ray-project:master on May 18, 2021
@mvindiola1 (Contributor) commented

@Bam4d,

Awesome fix, thank you. We have been dealing with a memory leak in RLlib on our HPC systems for some time. Hopefully this will fix it for us.

@Bam4d (Contributor, Author) commented May 18, 2021

@mvindiola1 no problem. I also had some HPC issues, hence my finding this one. I also encountered another HPC (cgroups+MKL) issue with NumPy version 1.19.2; upgrading to 1.20.1 helped me, as did this fix. Also, if you look at the MemoryTrackerCallbacks class, there are some tools that might help find general memory leaks.
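
For anyone hunting similar leaks, the core idea behind this kind of tooling is to snapshot heap allocations periodically and diff them; allocation sites that keep growing across snapshots are leak candidates. Below is a minimal standalone sketch using Python's tracemalloc. It is illustrative only: `MemorySnapshotTracker` is a hypothetical name, not the MemoryTrackerCallbacks class added in this PR, which wraps a similar idea into RLlib callbacks.

```python
import tracemalloc


class MemorySnapshotTracker:
    """Illustrative leak-hunting helper (not this PR's MemoryTrackerCallbacks)."""

    def __init__(self, top_n=10):
        # Record up to 25 stack frames per allocation so reports point at
        # useful call sites instead of just the allocating line.
        tracemalloc.start(25)
        self.top_n = top_n
        self._previous = tracemalloc.take_snapshot()

    def report(self):
        # Compare the current heap against the previous snapshot and print
        # the allocation sites whose memory grew the most since last time.
        current = tracemalloc.take_snapshot()
        for stat in current.compare_to(self._previous, "lineno")[: self.top_n]:
            print(stat)
        self._previous = current


# Usage sketch: construct the tracker once, then call report() every K
# training iterations; lines whose size keeps climbing are likely leaking.
```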

@Bam4d mentioned this pull request on May 19, 2021
Successfully merging this pull request may close these issues.

[rllib] Memory leak in environment worker in multi-agent setup