-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Issue 30139: PolicyServerInput OOMs if incoming samples cannot be handled in time (e.g. when too many clients are connected). #31400
Conversation
# Protect ourselves from having a bottleneck on the server (learning) side. | ||
# Once the queue (deque) is full, we throw away 50% (oldest | ||
# samples first) of the samples, warn, and continue. | ||
self.samples_queue = deque(maxlen=max_sample_queue_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah its good that we're bounding this now.
return self.samples_queue.get() | ||
# Blocking wait until there is something in the deque. | ||
while len(self.samples_queue) == 0: | ||
time.sleep(0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its important to know that time.sleep is equivalent to yield in terms of letting go of control of the processor to other threads...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about whether making the queue max length infinity or 0 will cause errors when checking to see if the queue should be purged.
Could you p pls address them? Otherwise, LGTM
|
||
Args: | ||
ioctx: IOContext provided by RLlib. | ||
address: Server addr (e.g., "localhost"). | ||
port: Server port (e.g., 9900). | ||
max_queue_size: The maximum size for the sample queue. Once full, will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be infinity?
samples_queue.put(batch) | ||
samples_queue.append(batch) | ||
# Deque is full -> purge 50% (oldest samples) | ||
if len(samples_queue) == samples_queue.maxlen: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm what happens here if someone passes infinity or 0 or none for queue maxlen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this cause an error?
I would like this PR to be in for 2.3.0, is it possible to give this one a review? :) |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
|
Removing stale label.. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
|
Removing stale label @avnishn , could you have a look? |
Just wanted to add my own experience that implementing this in my own project fixed a memory leak I had with policy server/client - so it would be great if this could be pulled in! |
@sven1977 Can you check the PR to see if it can still get merged in? |
@kouroshHakha @sven1977 any update for this? It would be greatly appreciated by me! (and probably ninja fix any issues others many not even realise using policy_server/client) |
test_impala failed, merging the master in to see if the test will pass. |
Awesome to see this pulled in! (I can finally update my ray once this hits nightle releases.) Speaking of which, how can I tell once this is in a nightly release? |
@DenysAshikhin One way is to look for the commit to see if it's passed this commit number in the history.
|
Awesome, thanks! |
…ndled in time (e.g. when too many clients are connected). (ray-project#31400) Signed-off-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…ndled in time (e.g. when too many clients are connected). (ray-project#31400) Signed-off-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Signed-off-by: sven1977 svenmika1977@gmail.com
Issue 30139: PolicyServerInput OOMs if incoming samples cannot be handled in time (e.g. when too many clients are connected).
Closes #30139
Why are these changes needed?
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.