
[RLlib] Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers. #15815

Merged: 12 commits into ray-project:master on May 18, 2021

Conversation

@Bam4d (Contributor) commented May 14, 2021

Why are these changes needed?

There are a few previously reported memory leaks associated with multi-agent training. During MA training, a very slow memory leak eventually kills training across many different algorithms (reported against IMPALA and PPO specifically).

This PR also contains the tools I built and used to find the memory leak, along with a brief description in the comments of how they work.

The main memory leak fix concerns a mapping from (agent, episode_id) -> policy_id that kept accumulating entries, as described here:
https://discuss.ray.io/t/help-debugging-a-memory-leak-in-rllib/2100
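
Roughly, this class of leak and its fix look like the sketch below. This is an illustrative sketch only; `PolicyMappingCache`, `policy_for`, and `on_episode_end` are hypothetical names, not the actual RLlib internals touched by this PR.

```python
class PolicyMappingCache:
    """Illustrative sketch of the leak pattern (not the real RLlib code).

    An entry is added for every (agent_id, episode_id) pair, so without
    cleanup the dict grows forever as new episodes are generated.
    """

    def __init__(self, policy_mapping_fn):
        self.policy_mapping_fn = policy_mapping_fn
        self._cache = {}  # (agent_id, episode_id) -> policy_id

    def policy_for(self, agent_id, episode_id):
        key = (agent_id, episode_id)
        if key not in self._cache:
            self._cache[key] = self.policy_mapping_fn(agent_id)
        return self._cache[key]

    def on_episode_end(self, episode_id):
        # The fix: drop all cached entries for the finished episode so the
        # mapping does not keep entries for episodes that no longer exist.
        for key in [k for k in self._cache if k[1] == episode_id]:
            del self._cache[key]
```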

Related issue number

Supersedes #15783
Closes #9964

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Bam4d marked this pull request as draft on May 14, 2021 15:35
@Bam4d marked this pull request as ready for review on May 16, 2021 09:47
@Bam4d (Contributor, Author) commented May 16, 2021

@sven1977 I think this is the main memory leak, and I've struggled to find the cause of some of the others I'm seeing. This fix is good for the case in the dummy environment in the original ticket, but when using Griddly, the memory is still leaking (albeit slower). I will do some more investigating over the next few days and make another PR if I find anything else to fix.

Thanks

@Bam4d changed the title from "[WIP (do not merge)] Fixing Memory Leaks" to "Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers." on May 16, 2021
@Bam4d (Contributor, Author) commented May 16, 2021

Before 20M time steps: [memory usage screenshot]

After 500M+ time steps: [memory usage screenshot]

@Bam4d (Contributor, Author) commented May 17, 2021

> This fix is good for the case in the dummy environment in the original ticket, but when using Griddly, the memory is still leaking (albeit slower). I will do some more investigating over the next few days and make another PR if I find anything else to fix.

I've found the cause of the other memory leak I'm seeing, and it's a PyTorch issue rather than an RLlib or Griddly one!

@Bam4d (Contributor, Author) commented May 17, 2021

@sven1977 do you have any pointers on how to run these tests locally so I can find out why they are broken?

@sven1977 self-assigned this on May 18, 2021
@sven1977 (Contributor) left a review comment


This is really great, @Bam4d! Thanks for all the hard detective work and these additional Callback classes to help with other leaks in the future! :)

Review comment on rllib/agents/callbacks/memory_callbacks.py (outdated, resolved)
@sven1977 changed the title from "Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers." to "[RLlib] Fixing Memory Leak In Multi-Agent environments. Adding tooling for finding memory leaks in workers." on May 18, 2021
@sven1977 merged commit 0be83d9 into ray-project:master on May 18, 2021
@mvindiola1 (Contributor) commented

@Bam4d,

Awesome fix, thank you. We have been dealing with a memory leak in RLlib on our HPC systems for some time. Hopefully this will fix it for us.

@Bam4d (Contributor, Author) commented May 18, 2021

@mvindiola1 no problem. I also had some HPC issues, hence my finding this one. I also encountered another HPC (cgroups+MKL) issue with NumPy version 1.19.2; upgrading to 1.20.1 helped me, as did this fix. Also, if you look at the MemoryTrackerCallbacks class, there are some tools that might help find general memory leaks.
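
For anyone hunting similar leaks, the core idea behind this kind of tooling is to snapshot heap allocations periodically and diff them; allocation sites that keep growing across snapshots are leak candidates. Below is a minimal standalone sketch using Python's tracemalloc. It is illustrative only: `MemorySnapshotTracker` is a hypothetical name, not the MemoryTrackerCallbacks class added in this PR, which wraps a similar idea into RLlib callbacks.

```python
import tracemalloc


class MemorySnapshotTracker:
    """Illustrative leak-hunting helper (not this PR's MemoryTrackerCallbacks)."""

    def __init__(self, top_n=10):
        # Record up to 25 stack frames per allocation so reports point at
        # useful call sites instead of just the allocating line.
        tracemalloc.start(25)
        self.top_n = top_n
        self._previous = tracemalloc.take_snapshot()

    def report(self):
        # Compare the current heap against the previous snapshot and print
        # the allocation sites whose memory grew the most since last time.
        current = tracemalloc.take_snapshot()
        for stat in current.compare_to(self._previous, "lineno")[: self.top_n]:
            print(stat)
        self._previous = current


# Usage sketch: construct the tracker once, then call report() every K
# training iterations; lines whose size keeps climbing are likely leaking.
```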

@Bam4d mentioned this pull request on May 19, 2021
Successfully merging this pull request may close these issues.

[rllib] Memory leak in environment worker in multi-agent setup