
[RLlib] Issue 30139: PolicyServerInput OOMs if incoming samples cannot be handled in time (e.g. when too many clients are connected). #31400

Merged

Conversation

@sven1977 (Contributor) commented Jan 3, 2023

Signed-off-by: sven1977 svenmika1977@gmail.com

Issue 30139: PolicyServerInput OOMs if incoming samples cannot be handled in time (e.g. when too many clients are connected).

Closes #30139

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
# Protect ourselves from having a bottleneck on the server (learning) side.
# Once the queue (deque) is full, we throw away 50% (oldest
# samples first) of the samples, warn, and continue.
self.samples_queue = deque(maxlen=max_sample_queue_size)
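The key property of `deque(maxlen=...)` being relied on here is that appending to a full deque silently evicts the oldest entry, which bounds memory. A minimal standalone sketch (not the PR's actual code, variable names illustrative only):

```python
from collections import deque

# A deque with maxlen evicts its oldest item when a new one is
# appended to a full queue, so memory stays bounded.
samples_queue = deque(maxlen=3)
for i in range(5):
    samples_queue.append(i)

print(list(samples_queue))  # -> [2, 3, 4]; the oldest items 0 and 1 were evicted
```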

Ah, it's good that we're bounding this now.

return self.samples_queue.get()
# Blocking wait until there is something in the deque.
while len(self.samples_queue) == 0:
    time.sleep(0.1)
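The consumer side can be sketched in isolation like this (a hedged illustration of the polling pattern above, with hypothetical names; the real `PolicyServerInput` code differs). As the reviewer notes below, `time.sleep` releases the GIL, so producer threads can run while the consumer waits:

```python
import threading
import time
from collections import deque

samples_queue = deque(maxlen=10)

def consume_one(queue, poll_interval=0.1):
    # Busy-wait with a short sleep: time.sleep releases the GIL,
    # letting producer threads append while we wait.
    while len(queue) == 0:
        time.sleep(poll_interval)
    return queue.popleft()

# A producer appends after a short delay; the consumer blocks until then.
threading.Timer(0.05, lambda: samples_queue.append("batch-0")).start()
print(consume_one(samples_queue))  # -> batch-0
```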

It's important to know that time.sleep is equivalent to yield in terms of letting go of control of the processor to other threads...


@avnishn (Member) left a comment


I have some concerns about whether making the queue max length infinity or 0 will cause errors when checking to see if the queue should be purged.

Could you please address them? Otherwise, LGTM.


Args:
    ioctx: IOContext provided by RLlib.
    address: Server addr (e.g., "localhost").
    port: Server port (e.g., 9900).
    max_queue_size: The maximum size for the sample queue. Once full, will

can this be infinity?

- samples_queue.put(batch)
+ samples_queue.append(batch)
  # Deque is full -> purge 50% (oldest samples)
  if len(samples_queue) == samples_queue.maxlen:
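The reviewer's concern below is what happens if the deque is unbounded (`maxlen=None`): the comparison `len(queue) == queue.maxlen` would then never be true, so the purge is simply skipped. A self-contained sketch of the purge-half pattern with an explicit guard (helper name hypothetical, not the PR's code):

```python
from collections import deque

def append_with_purge(queue, batch):
    # Guard for the unbounded case: maxlen is None means "never full",
    # so no purge is needed (and len(queue) == None is never True anyway).
    if queue.maxlen is not None and len(queue) == queue.maxlen:
        # Deque is full -> drop the oldest 50% of the samples, then warn.
        for _ in range(queue.maxlen // 2):
            queue.popleft()
    queue.append(batch)

q = deque(maxlen=4)
for i in range(6):
    append_with_purge(q, i)
print(list(q))  # -> [2, 3, 4, 5]: items 0 and 1 were purged at i=4
```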

hmm what happens here if someone passes infinity or 0 or none for queue maxlen


will this cause an error?

@MattiasDC (Contributor)

I would like this PR to be in for 2.3.0, is it possible to give this one a review? :)

@stale

stale bot commented Mar 11, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Mar 11, 2023
@MattiasDC (Contributor)

Removing stale label.

@stale stale bot removed the stale label Mar 11, 2023
@stale

stale bot commented Apr 11, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Apr 11, 2023
@MattiasDC (Contributor)

Removing stale label. @avnishn, could you have a look?

@stale stale bot removed the stale label Apr 11, 2023
@DenysAshikhin

Just wanted to add my own experience that implementing this in my own project fixed a memory leak I had with policy server/client - so it would be great if this could be pulled in!

@kouroshHakha (Contributor)

@sven1977 Can you check the PR to see if it can still get merged in?

@DenysAshikhin

@kouroshHakha @sven1977 any update on this? It would be greatly appreciated! (It would also quietly fix issues that others may not even realise they have when using policy_server/client.)

@kouroshHakha (Contributor)

test_impala failed; merging master in to see if the test will pass.

@kouroshHakha kouroshHakha merged commit 4e2d84f into ray-project:master May 24, 2023
@DenysAshikhin

Awesome to see this pulled in! (I can finally update my Ray once this hits the nightly releases.) Speaking of which, how can I tell once this is in a nightly release?

@kouroshHakha (Contributor)

@DenysAshikhin One way is to check whether your installed Ray's commit comes after this PR's merge commit in the history:

import ray
print(ray.__commit__)

@DenysAshikhin

Awesome, thanks!

scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
…ndled in time (e.g. when too many clients are connected). (ray-project#31400)

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ndled in time (e.g. when too many clients are connected). (ray-project#31400)

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Successfully merging this pull request may close these issues.

[RLlib] PolicyServerInput memory leak
5 participants