
[RLlib] Memory leak finding toolset using tracemalloc + CI/nightly memory leak tests. #15412

Merged
merged 70 commits into from
Apr 12, 2022

Conversation

@sven1977 (Contributor) commented Apr 20, 2021

Adds memory leak CI-tests to RLlib.

  • Adds script and utilities: utils/tests/run_memory_leak_test.py and utils/debug/memory.py
  • Adds tuned_examples used for running these memory leak tests.
  • Uses tracemalloc to pinpoint exactly where an allocation that is likely never cleaned up/garbage collected happened (file + line + stack trace).
  • Simple tools for testing an entire Trainer (its sub-components: env, policy, rollout_worker, etc.).
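The core tracemalloc technique described above (snapshot before, run the workload repeatedly, snapshot after, diff by traceback) can be sketched roughly as follows. The function name and signature are illustrative, not RLlib's actual memory.py API; `step_fn` stands in for any repeatedly invoked workload such as one `RolloutWorker.sample()` call:

```python
import tracemalloc


def find_leak_candidates(step_fn, iterations=100, top_n=5):
    """Run step_fn repeatedly and rank allocation sites by memory growth.

    Sketch of the tracemalloc approach: allocations that keep growing
    across iterations and are never freed show up at the top of the diff,
    with file + line + stack trace attached.
    """
    tracemalloc.start(25)  # record up to 25 frames per allocation
    try:
        baseline = tracemalloc.take_snapshot()
        for _ in range(iterations):
            step_fn()
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Compare by full traceback so each entry carries its stack trace.
    top = snapshot.compare_to(baseline, "traceback")[:top_n]
    for stat in top:
        print(f"Increase total={stat.size_diff}B count={stat.count_diff}")
        for line in stat.traceback.format():
            print(line)
    return top
```

Allocations that merely churn (allocated and freed each iteration) mostly cancel out in the diff; only sites with net growth float to the top.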

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 sven1977 changed the title [RLlib] CI memory leak tests. [WIP RLlib] Memory leak finding toolset using tracemalloc + CI/nightly memory leak tests. Oct 7, 2021
@gjoliver (Member) left a comment

These new tests are still failing.
I wonder how stable these tests would be.
Like, what if we have a buffer that is larger than 600? There will be organic growth during the beginning period.
Worth considering making these manual tests? Just a question.

@@ -360,15 +360,6 @@ def from_importance_weights(
rhos = tf.math.exp(log_rhos)
if clip_rho_threshold is not None:
clipped_rhos = tf.minimum(clip_rho_threshold, rhos, name="clipped_rhos")

tf1.summary.histogram("clipped_rhos_1000", tf.minimum(1000.0, rhos))
Member:

?

@@ -237,11 +237,6 @@ def vector_step(self, actions):
obs_batch, rew_batch, done_batch, info_batch = [], [], [], []
for i in range(self.num_envs):
obs, r, done, info = self.envs[i].step(actions[i])
if not np.isscalar(r) or not np.isreal(r) or not np.isfinite(r):
Member:

?

@sven1977 (Contributor, Author):

I think that tf2 still has a leak, actually, and that's why its new memory leak tests are failing:

File "/ray/python/ray/rllib/policy/eager_tf_policy.py", line 177
    **kwargs,
File "/ray/python/ray/rllib/policy/eager_tf_policy.py", line 488
    self.global_timestep += tree.flatten(ret[0])[0].shape.as_list()[0]
Increase total=838600B
Slope=1400.0 B/detection
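The "Slope=… B/detection" figure in traces like the one above is simply the growth rate of total memory across repeated detections; a persistently positive slope suggests a leak rather than one-off warm-up allocation. A minimal least-squares sketch (the function name and any thresholding around it are illustrative, not RLlib's actual memory.py code):

```python
import statistics


def leak_slope(measurements):
    """Least-squares slope (bytes per detection) of a memory-total series.

    measurements[i] is the total tracked memory at detection i. A slope
    near zero means the series is flat; a large positive slope means
    memory keeps growing across detections.
    """
    n = len(measurements)
    xs = range(n)
    mean_x = (n - 1) / 2  # mean of 0..n-1
    mean_y = statistics.fmean(measurements)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, measurements))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Fitting a line over many detections, rather than comparing first and last values, makes the check less sensitive to the "organic growth during the beginning period" raised in the review above.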

@@ -46,27 +46,7 @@
--test_arg=--framework=tf2
rllib/...

- label: ":brain: RLlib: Learning discr. actions TF1-static-graph (from rllib/tuned_examples/*.yaml)"
Contributor (Author):

As we move more and more heavy testing in, we should also sometimes remove some tests: e.g., tf1.x should no longer be tested, imo.

Member:

Do we want to double-check with the bigger ML team on this?
We should have a coherent story about tf1 deprecation.

Contributor (Author):

Yes, we should do this. But note that these are only tf1.x tests, so the tf version here really is 1.x. We are NOT disabling framework="tf" (tf static graph) tests here. Sorry for the confusion:

framework="tf": static graph (could be for both versions: tf1.x, but also tf2.x, if tf.compat.v1 mode is used)
framework="tf2": NOT static graph AND tf2.x version
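To make the distinction concrete, a tiny sketch encoding the two values described above (the helper is hypothetical, purely for illustration; it is not an RLlib API):

```python
# Hypothetical helper summarizing the "framework" values discussed above.
FRAMEWORK_MODES = {
    "tf": "static graph (tf1.x, or tf2.x via tf.compat.v1 mode)",
    "tf2": "tf2.x, NOT static graph (eager execution)",
}


def describe_framework(framework: str) -> str:
    """Return a one-line description of what a framework setting means."""
    if framework not in FRAMEWORK_MODES:
        raise ValueError(f"unknown framework setting: {framework!r}")
    return FRAMEWORK_MODES[framework]
```

So removing the tf1.x test jobs leaves framework="tf" coverage intact on a tf2.x install, since the static-graph path still runs through tf.compat.v1.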

@sven1977 (Contributor, Author):

On the replay buffer question:
These additional memory tests only run RolloutWorker.sample() + get_metrics() on the rollout worker (neither of which fills any buffers) or Policy.learn_on_batch() on a dummy batch, so no buffer gets filled there either.

@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 11, 2022
@@ -167,7 +167,11 @@ def test_global_vars_update(self):
STEPS_SAMPLED_COUNTER, result["info"][STEPS_SAMPLED_COUNTER]
)
)
global_timesteps = policy.global_timestep
global_timesteps = (
policy.global_timestep
Member:

Just a nit; maybe we clean this up when we refactor Policy.
We probably shouldn't do these one-off if-statements everywhere. Instead, policy.global_timestep should be a getter, and we can do this for everyone.
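The getter idea can be sketched as a property that normalizes the backing value once, so call sites never need one-off type checks. This is a hypothetical sketch, not Policy's real internals; the `hasattr(ts, "numpy")` check stands in for "this is an eager tf tensor/variable":

```python
class Policy:
    """Minimal sketch: expose global_timestep as a normalizing getter."""

    def __init__(self):
        # Backing value; internally this may be a plain int (tf static
        # graph / torch) or an eager tf variable, depending on framework.
        self._global_timestep = 0

    @property
    def global_timestep(self) -> int:
        """Always return a plain Python int, whatever the backing type."""
        ts = self._global_timestep
        if hasattr(ts, "numpy"):  # eager tf tensors expose .numpy()
            ts = ts.numpy()
        return int(ts)
```

With this in place, the framework-specific branch lives in exactly one spot instead of being repeated at every call site and in tests.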

@@ -693,6 +698,7 @@ def get_initial_state(self):
@override(Policy)
def get_state(self):
state = super().get_state()
state["global_timestep"] = state["global_timestep"].numpy()
Member:

Wait, if we can call numpy() on the tensor here, why can't we do it above in test_rollout_worker.py?

lambda ph, v: feed_dict.__setitem__(ph, v),
placeholders,
train_batch[key],
)
del a
Member:

Is this a memory leak? Do we need to do this everywhere we use tree.map_structure?
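For context on why an explicit `del` can matter here: in CPython, an object is freed the moment its last reference disappears, and a local name bound to the result of a `tree.map_structure`-style call keeps that result alive until the frame exits. A minimal illustration (not RLlib code; `weakref` is used only to observe when the object is collected, and the list stands in for the returned structure):

```python
import weakref


class Payload:
    """Stand-in for a large object held inside a returned structure."""


def demonstrate_del():
    obj = Payload()
    observer = weakref.ref(obj)     # lets us check liveness without a ref
    result = [obj]                  # e.g. the tree.map_structure return value
    del obj                         # `result` still holds a reference
    alive_before = observer() is not None
    del result                      # the explicit `del`, as in the diff above
    alive_after = observer() is not None  # refcount hit zero -> freed now
    return alive_before, alive_after
```

So the `del` is not fixing a true leak; it just releases the memory promptly instead of at the end of the enclosing function, which matters when the retained structure is large and the function runs for a long time.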
