
[RLlib] Issue 30139: PolicyServerInput OOMs if incoming samples cannot be handled in time (e.g. when too many clients are connected). #31400

Merged

Conversation

@sven1977 (Contributor) commented Jan 3, 2023

Signed-off-by: sven1977 svenmika1977@gmail.com

Issue 30139: PolicyServerInput OOMs if incoming samples cannot be handled in time (e.g. when too many clients are connected).

Closes #30139

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
# Protect ourselves from having a bottleneck on the server (learning) side.
# Once the queue (deque) is full, we throw away 50% (oldest
# samples first) of the samples, warn, and continue.
self.samples_queue = deque(maxlen=max_sample_queue_size)
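The key property of `deque(maxlen=...)` being relied on here is that appending to a full deque silently evicts the oldest entry, which bounds memory. A minimal standalone sketch (not the PR's actual code, variable names illustrative only):

```python
from collections import deque

# A deque with maxlen evicts its oldest item when a new one is
# appended to a full queue, so memory stays bounded.
samples_queue = deque(maxlen=3)
for i in range(5):
    samples_queue.append(i)

print(list(samples_queue))  # -> [2, 3, 4]; the oldest items 0 and 1 were evicted
```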

Ah, it's good that we're bounding this now.

return self.samples_queue.get()
# Blocking wait until there is something in the deque.
while len(self.samples_queue) == 0:
    time.sleep(0.1)
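The consumer side can be sketched in isolation like this (a hedged illustration of the polling pattern above, with hypothetical names; the real `PolicyServerInput` code differs). As the reviewer notes below, `time.sleep` releases the GIL, so producer threads can run while the consumer waits:

```python
import threading
import time
from collections import deque

samples_queue = deque(maxlen=10)

def consume_one(queue, poll_interval=0.1):
    # Busy-wait with a short sleep: time.sleep releases the GIL,
    # letting producer threads append while we wait.
    while len(queue) == 0:
        time.sleep(poll_interval)
    return queue.popleft()

# A producer appends after a short delay; the consumer blocks until then.
threading.Timer(0.05, lambda: samples_queue.append("batch-0")).start()
print(consume_one(samples_queue))  # -> batch-0
```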

It's important to know that time.sleep is equivalent to yield in terms of letting go of control of the processor to other threads...


@avnishn (Member) left a comment


I have some concerns about whether making the queue max length infinity or 0 will cause errors when checking to see if the queue should be purged.

Could you please address them? Otherwise, LGTM.


Args:
    ioctx: IOContext provided by RLlib.
    address: Server addr (e.g., "localhost").
    port: Server port (e.g., 9900).
    max_queue_size: The maximum size for the sample queue. Once full, will

can this be infinity?

- samples_queue.put(batch)
+ samples_queue.append(batch)
  # Deque is full -> purge 50% (oldest samples)
  if len(samples_queue) == samples_queue.maxlen:
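The reviewer's concern below is what happens if the deque is unbounded (`maxlen=None`): the comparison `len(queue) == queue.maxlen` would then never be true, so the purge is simply skipped. A self-contained sketch of the purge-half pattern with an explicit guard (helper name hypothetical, not the PR's code):

```python
from collections import deque

def append_with_purge(queue, batch):
    # Guard for the unbounded case: maxlen is None means "never full",
    # so no purge is needed (and len(queue) == None is never True anyway).
    if queue.maxlen is not None and len(queue) == queue.maxlen:
        # Deque is full -> drop the oldest 50% of the samples, then warn.
        for _ in range(queue.maxlen // 2):
            queue.popleft()
    queue.append(batch)

q = deque(maxlen=4)
for i in range(6):
    append_with_purge(q, i)
print(list(q))  # -> [2, 3, 4, 5]: items 0 and 1 were purged at i=4
```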

hmm what happens here if someone passes infinity or 0 or none for queue maxlen


will this cause an error?

@MattiasDC (Contributor)

I would like this PR to be in for 2.3.0, is it possible to give this one a review? :)

@stale

stale bot commented Mar 11, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Mar 11, 2023
@MattiasDC (Contributor)

Removing stale label.

@stale stale bot removed the stale label Mar 11, 2023
@stale

stale bot commented Apr 11, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Apr 11, 2023
@MattiasDC (Contributor)

Removing stale label. @avnishn, could you have a look?

@stale stale bot removed the stale label Apr 11, 2023
@DenysAshikhin

Just wanted to add my own experience that implementing this in my own project fixed a memory leak I had with policy server/client - so it would be great if this could be pulled in!

@kouroshHakha (Contributor)

@sven1977 Can you check the PR to see if it can still get merged in?

@DenysAshikhin

@kouroshHakha @sven1977 any update on this? It would be greatly appreciated! (It would also quietly fix issues that others may not even realise they have when using policy_server/client.)

@kouroshHakha (Contributor)

test_impala failed; merging master in to see if the test will pass.

@kouroshHakha kouroshHakha merged commit 4e2d84f into ray-project:master May 24, 2023
@DenysAshikhin

Awesome to see this pulled in! (I can finally update my Ray once this hits the nightly releases.) Speaking of which, how can I tell once this is in a nightly release?

@kouroshHakha (Contributor)

@DenysAshikhin One way is to check whether your installed Ray's commit comes after this PR's merge commit in the history:

import ray
print(ray.__commit__)

@DenysAshikhin

Awesome, thanks!

scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
…ndled in time (e.g. when too many clients are connected). (ray-project#31400)

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ndled in time (e.g. when too many clients are connected). (ray-project#31400)

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Successfully merging this pull request may close these issues.

[RLlib] PolicyServerInput memory leak
5 participants