[RLlib] Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers. #15815
Conversation
@sven1977 I think this is the main memory leak, and I've struggled to find the cause of some of the others I'm seeing. This fix is good for the case of the dummy environment in the original ticket, but when using Griddly, the memory still leaks (albeit more slowly). I will do some more investigating over the next few days and make another PR if I find anything else to fix. Thanks
I've found the cause of the other memory leak I'm seeing, and it's a PyTorch issue rather than RLlib or Griddly!
@sven1977 do you have any pointers on how to run these tests locally so I can find out why they are broken?
This is really great @Bam4d ! Thanks for all the hard detective work and these additional Callback classes to help with other leaks in the future! :)
Awesome fix, thank you. We have been dealing with an HPC memory leak in RLlib for some time. Hopefully this will fix it for us.
@mvindiola1 no problem. I also had some HPC issues, hence my finding this one. I also encountered another HPC (cgroups+MKL) issue with NumPy version 1.19.2; upgrading to 1.20.1 helped me, as did this fix. Also if you look at the
Why are these changes needed?
There are a few memory leaks associated with multi-agent training that have been reported previously. Specifically, during multi-agent training, a very slow memory leak eventually kills training across many different algorithms (reported for IMPALA and PPO specifically).
This PR also contains the tools I built to find the memory leak, along with a brief description in the comments of how they work.
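As a rough illustration of this kind of tooling (a sketch only, not the actual callback classes added in this PR), a probe can snapshot allocations with Python's stdlib `tracemalloc` between rollouts and report the fastest-growing allocation sites, which is how an unbounded per-episode mapping shows up. The class and method names here are hypothetical.

```python
import tracemalloc

class MemoryLeakProbe:
    """Hypothetical helper: compare heap snapshots to spot growing allocation sites."""

    def __init__(self):
        tracemalloc.start()
        self.baseline = tracemalloc.take_snapshot()

    def top_growth(self, limit=3):
        # compare_to() sorts by the largest absolute size difference first,
        # so a leaking container's allocation site rises to the top.
        current = tracemalloc.take_snapshot()
        stats = current.compare_to(self.baseline, "lineno")
        return [(str(s.traceback), s.size_diff) for s in stats[:limit]]

probe = MemoryLeakProbe()

# Simulate the leak pattern: a cache keyed by (agent, episode_id) that is
# never cleaned up, so it grows with every new episode.
leaky = {}
for episode_id in range(10_000):
    leaky[("agent_0", episode_id)] = "policy_0"

growth = probe.top_growth()
```

Running a probe like this once per training iteration and diffing the reports quickly narrows a slow leak down to a single line of code.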
The main memory leak fix addresses a mapping from (agent, episode_id) -> policy_id that grows without bound, as described here:
https://discuss.ray.io/t/help-debugging-a-memory-leak-in-rllib/2100
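To make the bug pattern concrete, here is a minimal sketch of the leak and the shape of the fix. The names (`PolicyMapper`, `agent_to_policy`, `on_episode_end`) are placeholders for illustration, not RLlib's exact internals: because the cache is keyed by episode_id, every new episode adds entries that are never reused, so the fix is to drop a finished episode's entries when it ends.

```python
class PolicyMapper:
    """Illustrative (agent_id, episode_id) -> policy_id cache with cleanup."""

    def __init__(self, mapping_fn):
        self.mapping_fn = mapping_fn
        # Keyed by (agent_id, episode_id): without cleanup this grows
        # forever, since each episode gets a fresh episode_id.
        self.agent_to_policy = {}

    def policy_for(self, agent_id, episode_id):
        key = (agent_id, episode_id)
        if key not in self.agent_to_policy:
            self.agent_to_policy[key] = self.mapping_fn(agent_id)
        return self.agent_to_policy[key]

    def on_episode_end(self, episode_id):
        # The essence of the fix: discard cached entries for the finished
        # episode so memory stays bounded across long training runs.
        self.agent_to_policy = {
            k: v for k, v in self.agent_to_policy.items() if k[1] != episode_id
        }

mapper = PolicyMapper(lambda agent_id: "shared_policy")
for episode_id in range(100):
    for agent_id in ("agent_0", "agent_1"):
        mapper.policy_for(agent_id, episode_id)
    mapper.on_episode_end(episode_id)

remaining = len(mapper.agent_to_policy)  # 0 with cleanup; 200 without it
```

Without the `on_episode_end` cleanup, the mapping here would hold 200 stale entries after only 100 two-agent episodes, which matches the "very slow leak that eventually kills training" symptom described above.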
Related issue number
Supersedes #15783
Closes #9964
Checks
- I've run scripts/format.sh to lint the changes in this PR.