
[RLlib] Memory leak finding toolset using tracemalloc + CI/nightly memory leak tests. #15412

Merged
merged 70 commits into from
Apr 12, 2022

Conversation

@sven1977 (Contributor) commented Apr 20, 2021

Adds memory leak CI-tests to RLlib.

  • Adds script and utilities: utils/tests/run_memory_leak_test.py and utils/debug/memory.py
  • Adds tuned_examples used for running these memory leak tests.
  • Uses tracemalloc to pinpoint exactly where an allocation that is likely never cleaned up/garbage collected happened (file + line + stack trace).
  • Simple tools for testing an entire Trainer (its sub-components: env, policy, rollout_worker, etc.).
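The core tracemalloc technique described above (snapshot before, run the workload repeatedly, snapshot after, diff by traceback) can be sketched roughly as follows. The function name and signature are illustrative, not RLlib's actual memory.py API; `step_fn` stands in for any repeatedly invoked workload such as one `RolloutWorker.sample()` call:

```python
import tracemalloc


def find_leak_candidates(step_fn, iterations=100, top_n=5):
    """Run step_fn repeatedly and rank allocation sites by memory growth.

    Sketch of the tracemalloc approach: allocations that keep growing
    across iterations and are never freed show up at the top of the diff,
    with file + line + stack trace attached.
    """
    tracemalloc.start(25)  # record up to 25 frames per allocation
    try:
        baseline = tracemalloc.take_snapshot()
        for _ in range(iterations):
            step_fn()
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Compare by full traceback so each entry carries its stack trace.
    top = snapshot.compare_to(baseline, "traceback")[:top_n]
    for stat in top:
        print(f"Increase total={stat.size_diff}B count={stat.count_diff}")
        for line in stat.traceback.format():
            print(line)
    return top
```

Allocations that merely churn (allocated and freed each iteration) mostly cancel out in the diff; only sites with net growth float to the top.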

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 sven1977 changed the title [RLlib] CI memory leak tests. [WIP RLlib] Memory leak finding toolset using tracemalloc + CI/nightly memory leak tests. Oct 7, 2021
@gjoliver (Member) left a comment

These new tests are still failing.
I wonder how stable these tests would be.
Like, what if we have a buffer that is larger than 600? There will be organic growth during the beginning period.
Worth considering making these manual tests? Just a question.

@@ -360,15 +360,6 @@ def from_importance_weights(
rhos = tf.math.exp(log_rhos)
if clip_rho_threshold is not None:
clipped_rhos = tf.minimum(clip_rho_threshold, rhos, name="clipped_rhos")

tf1.summary.histogram("clipped_rhos_1000", tf.minimum(1000.0, rhos))
Member:

?

@@ -237,11 +237,6 @@ def vector_step(self, actions):
obs_batch, rew_batch, done_batch, info_batch = [], [], [], []
for i in range(self.num_envs):
obs, r, done, info = self.envs[i].step(actions[i])
if not np.isscalar(r) or not np.isreal(r) or not np.isfinite(r):
Member:

?

@sven1977 (Contributor, Author):

I think that tf2 still has a leak, actually, and that's why its new memory leak tests are failing:

File "/ray/python/ray/rllib/policy/eager_tf_policy.py", line 177
    **kwargs,
File "/ray/python/ray/rllib/policy/eager_tf_policy.py", line 488
    self.global_timestep += tree.flatten(ret[0])[0].shape.as_list()[0]
Increase total=838600B
Slope=1400.0 B/detection
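The "Slope=… B/detection" figure in traces like the one above is simply the growth rate of total memory across repeated detections; a persistently positive slope suggests a leak rather than one-off warm-up allocation. A minimal least-squares sketch (the function name and any thresholding around it are illustrative, not RLlib's actual memory.py code):

```python
import statistics


def leak_slope(measurements):
    """Least-squares slope (bytes per detection) of a memory-total series.

    measurements[i] is the total tracked memory at detection i. A slope
    near zero means the series is flat; a large positive slope means
    memory keeps growing across detections.
    """
    n = len(measurements)
    xs = range(n)
    mean_x = (n - 1) / 2  # mean of 0..n-1
    mean_y = statistics.fmean(measurements)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, measurements))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Fitting a line over many detections, rather than comparing first and last values, makes the check less sensitive to the "organic growth during the beginning period" raised in the review above.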

@@ -46,27 +46,7 @@
--test_arg=--framework=tf2
rllib/...

- label: ":brain: RLlib: Learning discr. actions TF1-static-graph (from rllib/tuned_examples/*.yaml)"
Contributor (Author):

As we move more and more heavy testing in, we should also sometimes remove some tests: e.g., tf1.x should no longer be tested, imo.

Member:

Do we want to double-check with the bigger ML team on this?
We should have a coherent story about tf1 deprecation.

Contributor (Author):

Yes, we should do this. But note that these are only tf1.x tests, so the tf version here really is 1.x. We are NOT disabling framework="tf" (tf static graph) tests here. Sorry for the confusion:

framework="tf": static graph (could be for both versions: tf1.x, but also tf2.x, if tf.compat.v1 mode is used)
framework="tf2": NOT static graph AND tf2.x version
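To make the distinction concrete, a tiny sketch encoding the two values described above (the helper is hypothetical, purely for illustration; it is not an RLlib API):

```python
# Hypothetical helper summarizing the "framework" values discussed above.
FRAMEWORK_MODES = {
    "tf": "static graph (tf1.x, or tf2.x via tf.compat.v1 mode)",
    "tf2": "tf2.x, NOT static graph (eager execution)",
}


def describe_framework(framework: str) -> str:
    """Return a one-line description of what a framework setting means."""
    if framework not in FRAMEWORK_MODES:
        raise ValueError(f"unknown framework setting: {framework!r}")
    return FRAMEWORK_MODES[framework]
```

So removing the tf1.x test jobs leaves framework="tf" coverage intact on a tf2.x install, since the static-graph path still runs through tf.compat.v1.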

@sven1977 (Contributor, Author):

On the replay buffer question:
These additional memory tests only run RolloutWorker.sample() + get_metrics() on the rollout worker (neither of which fills any buffers) or Policy.learn_on_batch() on a dummy batch, so no buffer gets filled there either.

@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 11, 2022
@@ -167,7 +167,11 @@ def test_global_vars_update(self):
STEPS_SAMPLED_COUNTER, result["info"][STEPS_SAMPLED_COUNTER]
)
)
global_timesteps = policy.global_timestep
global_timesteps = (
policy.global_timestep
Member:

Just a nit; maybe we clean this up when we refactor Policy.
We probably shouldn't do these one-off if-statements everywhere. Instead, policy.global_timestep should be a getter, and we can do this for everyone.
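The getter idea can be sketched as a property that normalizes the backing value once, so call sites never need one-off type checks. This is a hypothetical sketch, not Policy's real internals; the `hasattr(ts, "numpy")` check stands in for "this is an eager tf tensor/variable":

```python
class Policy:
    """Minimal sketch: expose global_timestep as a normalizing getter."""

    def __init__(self):
        # Backing value; internally this may be a plain int (tf static
        # graph / torch) or an eager tf variable, depending on framework.
        self._global_timestep = 0

    @property
    def global_timestep(self) -> int:
        """Always return a plain Python int, whatever the backing type."""
        ts = self._global_timestep
        if hasattr(ts, "numpy"):  # eager tf tensors expose .numpy()
            ts = ts.numpy()
        return int(ts)
```

With this in place, the framework-specific branch lives in exactly one spot instead of being repeated at every call site and in tests.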

@@ -693,6 +698,7 @@ def get_initial_state(self):
@override(Policy)
def get_state(self):
state = super().get_state()
state["global_timestep"] = state["global_timestep"].numpy()
Member:

Wait, if we can call numpy() on the tensor here, why can't we do it above in test_rollout_worker.py?

lambda ph, v: feed_dict.__setitem__(ph, v),
placeholders,
train_batch[key],
)
del a
Member:

Is this a memory leak? Do we need to do this everywhere we use tree.map_structure?
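For context on why an explicit `del` can matter here: in CPython, an object is freed the moment its last reference disappears, and a local name bound to the result of a `tree.map_structure`-style call keeps that result alive until the frame exits. A minimal illustration (not RLlib code; `weakref` is used only to observe when the object is collected, and the list stands in for the returned structure):

```python
import weakref


class Payload:
    """Stand-in for a large object held inside a returned structure."""


def demonstrate_del():
    obj = Payload()
    observer = weakref.ref(obj)     # lets us check liveness without a ref
    result = [obj]                  # e.g. the tree.map_structure return value
    del obj                         # `result` still holds a reference
    alive_before = observer() is not None
    del result                      # the explicit `del`, as in the diff above
    alive_after = observer() is not None  # refcount hit zero -> freed now
    return alive_before, alive_after
```

So the `del` is not fixing a true leak; it just releases the memory promptly instead of at the end of the enclosing function, which matters when the retained structure is large and the function runs for a long time.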
