[RLlib] Memory leak finding toolset using tracemalloc + CI/nightly memory leak tests. #15412
Conversation
…ry_leak_tests # Conflicts: # .buildkite/pipeline.yml # rllib/BUILD # rllib/examples/env/random_env.py
These new tests are still failing.
I wonder how stable these tests will be.
Like, what if we have a buffer that is larger than 600? There will be organic growth during the beginning period.
Worth considering making these manual tests? Just a question.
@@ -360,15 +360,6 @@ def from_importance_weights(
    rhos = tf.math.exp(log_rhos)
    if clip_rho_threshold is not None:
        clipped_rhos = tf.minimum(clip_rho_threshold, rhos, name="clipped_rhos")

        tf1.summary.histogram("clipped_rhos_1000", tf.minimum(1000.0, rhos))
?
@@ -237,11 +237,6 @@ def vector_step(self, actions):
    obs_batch, rew_batch, done_batch, info_batch = [], [], [], []
    for i in range(self.num_envs):
        obs, r, done, info = self.envs[i].step(actions[i])
        if not np.isscalar(r) or not np.isreal(r) or not np.isfinite(r):
?
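For context on the check being discussed in the hunk above, here is a standalone sketch of that reward-validation predicate (the helper name `validate_reward` is ours, not RLlib's):

```python
import numpy as np

def validate_reward(r) -> bool:
    """True only if r is a real, finite scalar -- the same three
    conditions guarded by the if-statement in vector_step()."""
    return bool(np.isscalar(r) and np.isreal(r) and np.isfinite(r))
```

For example, `validate_reward(1.0)` passes, while a NaN, an infinity, or a list-valued reward fails the check.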
I think that tf2 still has a leak, actually, and that's why its new memory leak tests are failing:
…ry_leak_tests # Conflicts: # .buildkite/pipeline.ml.yml
@@ -46,27 +46,7 @@
    --test_arg=--framework=tf2
    rllib/...

- label: ":brain: RLlib: Learning discr. actions TF1-static-graph (from rllib/tuned_examples/*.yaml)"
As we move more and more heavy testing in, we should also sometimes remove some tests. E.g. tf1.x should no longer be tested, imo.
Do we want to double-check with the bigger ML team on this?
We should have a coherent story about tf1 deprecation.
Yes, we should do this. But note that these are only tf1.x tests, so the tf version here really is 1.x. We are NOT disabling framework="tf" (tf static graph) tests here. Sorry for the confusion:
framework="tf": static graph (could be for both versions: tf1.x, but also tf2.x, if tf.compat.v1 mode is used)
framework="tf2": NOT static graph AND tf2.x version
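The distinction above can be summarized in code. The `"framework"` config string is RLlib's; the descriptions and the tiny helper below are purely illustrative:

```python
# Illustrative summary of the framework strings discussed above.
FRAMEWORKS = {
    "tf": "static graph; runs under tf1.x, or under tf2.x via tf.compat.v1",
    "tf2": "eager (non-static graph); requires a tf2.x install",
}

def is_static_graph(framework: str) -> bool:
    # Only framework="tf" builds a static graph; framework="tf2" is eager.
    return framework == "tf"
```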
On the replay buffer question:
@@ -167,7 +167,11 @@ def test_global_vars_update(self):
            STEPS_SAMPLED_COUNTER, result["info"][STEPS_SAMPLED_COUNTER]
        )
    )
    global_timesteps = policy.global_timestep
    global_timesteps = (
        policy.global_timestep
Just a nit; maybe we clean this up when we refactor Policy.
We probably shouldn't do these one-off if-statements everywhere. Instead, policy.global_timestep() should be a getter, and we can do this for everyone.
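The suggested getter could look something like this minimal sketch (class shape and attribute names are illustrative, not RLlib's actual Policy):

```python
class Policy:
    """Sketch: normalize global_timestep once, in a getter, instead of
    one-off if-statements at every call site."""

    def __init__(self):
        # Depending on the framework, this may be a plain int or a tensor.
        self._global_timestep = 0

    @property
    def global_timestep(self) -> int:
        ts = self._global_timestep
        # Unwrap framework tensors (e.g. eager tf) in exactly one place.
        if hasattr(ts, "numpy"):
            ts = int(ts.numpy())
        return ts
```

Call sites then just read `policy.global_timestep` and always get an int.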
@@ -693,6 +698,7 @@ def get_initial_state(self):
    @override(Policy)
    def get_state(self):
        state = super().get_state()
        state["global_timestep"] = state["global_timestep"].numpy()
wait, if we can call numpy() on the tensor here, why can't we do it above in test_rollout_worker.py?
    lambda ph, v: feed_dict.__setitem__(ph, v),
    placeholders,
    train_batch[key],
)
del a
Is this a memory leak? Do we need to do this everywhere we use tree.map_structure?
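The pattern under discussion, sketched with a flat-list stand-in for `tree.map_structure` (dm-tree itself may not be installed, so the stand-in keeps this self-contained; the variable names mirror the hunk above):

```python
def map_structure(func, *structures):
    """Flat-list stand-in for tree.map_structure: apply func elementwise
    across the structures and return the list of results."""
    return [func(*elems) for elems in zip(*structures)]

feed_dict = {}
placeholders = ["ph:0", "ph:1"]
values = [1.0, 2.0]

# map_structure is used here purely for its side effect on feed_dict,
# so its return value is a list of Nones. Binding it to `a` and deleting
# it immediately drops that reference, so it cannot linger in a
# long-lived scope -- the suspected leak the review comment asks about.
a = map_structure(lambda ph, v: feed_dict.__setitem__(ph, v), placeholders, values)
del a
```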
Adds memory leak CI-tests to RLlib.
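The core idea of a tracemalloc-based leak test can be sketched roughly as follows. This is our own simplified illustration, not RLlib's actual test harness; the function name, warmup scheme, and byte tolerance are all assumptions:

```python
import tracemalloc

def no_memory_growth(step_fn, warmup=10, iters=50, tolerance_bytes=64 * 1024):
    """Heuristic leak check: True if repeatedly calling step_fn does not
    grow traced memory beyond tolerance_bytes."""
    # Warm up first: caches and lazily built buffers cause organic growth
    # early on (see the replay-buffer concern in the review) and would
    # otherwise look like a leak.
    for _ in range(warmup):
        step_fn()
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(iters):
        step_fn()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Sum the per-line allocation deltas between the two snapshots.
    stats = after.compare_to(before, "lineno")
    growth = sum(s.size_diff for s in stats)
    return growth <= tolerance_bytes
```

A leaky step function (one that keeps appending to a long-lived structure) fails this check, while a step function whose allocations are freed each call passes.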
Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.