improve exponential backoff when connecting to the redis #24150
a9ff0cc to 6315651
The CI failures are unrelated.
6315651 to 50d794a
python/ray/_private/services.py
Outdated
# the work node, it may occur that the redis started at 33s, the
# work node did not connect the redis until 64s, which in turn
# affected the delivery time of the Ray cluster. Therefore, we make
# a fixed retry interval (1s) when backoff times >= 10.
Nit, I think it's okay to just say something like "Make sure the retry interval doesn't increase too large, which will affect the delivery time of the Ray cluster" here. The concrete scenario can be explained in the PR description.
Due to the modification of the backoff algorithm, I changed the initial value of the parameter I'm not sure if it's necessary to change
)" This reverts commit edf058d.
delay *= 2
# Make sure the retry interval doesn't increase too large, which will
# affect the delivery time of the Ray cluster.
delay = 1000 if i >= 10 else delay * 2
It should be delay = 1 if i >= 10 else delay * 2, because sleep takes its argument in seconds, not milliseconds.
This is fixed in PR #24168.
could we just use delay = min(1, delay*2)?
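To make the difference between the three update rules in this thread concrete, here is a minimal sketch that prints the sleep schedule each one produces. The helper `sleep_schedule`, the initial delay of 0.001s, and the choice to update the delay after each sleep are assumptions for illustration; they are not taken from `services.py`.

```python
def sleep_schedule(update, initial_delay=0.001, attempts=14):
    """Return the delay (in seconds) slept after each failed attempt.

    `update(i, delay)` computes the next delay from the attempt index
    and the current delay. The initial delay is an assumed value.
    """
    schedule = []
    delay = initial_delay
    for i in range(attempts):
        schedule.append(delay)
        delay = update(i, delay)
    return schedule


def as_merged(i, delay):
    # Rule from the diff: 1000 is in seconds when passed to time.sleep.
    return 1000 if i >= 10 else delay * 2


def as_corrected(i, delay):
    # Rule proposed in the review comment: fixed 1s interval from i >= 10.
    return 1 if i >= 10 else delay * 2


def as_min_cap(i, delay):
    # Rule suggested in the last comment: never exceed 1s, whatever i is.
    return min(1, delay * 2)


if __name__ == "__main__":
    for name, rule in [("merged", as_merged),
                       ("corrected", as_corrected),
                       ("min cap", as_min_cap)]:
        print(name, [round(d, 3) for d in sleep_schedule(rule)])
```

Under these assumptions, the `min` form caps the interval by value rather than by retry count, so its result depends on the initial delay: if the initial delay were 1s or more, it would give a constant 1s interval with no backoff at all.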
Why are these changes needed?
When a Ray cluster is deployed on k8s and the head node has a large shape, Redis inside the head pod also takes a long time to become available. With pure exponential backoff on the worker node, Redis may start at 33s while the worker node does not connect to it until 64s, which in turn delays the delivery of the Ray cluster. Therefore, we switch to a fixed retry interval (1s) once the number of backoff retries is >= 10.
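The scenario above can be sketched by computing when each connection attempt happens under the two schedules. This is an illustrative model only: the function names, the initial delay of 0.001s, and the retry count of 50 are assumptions, and the fixed interval uses 1 (seconds) as proposed in the review thread.

```python
from itertools import accumulate


def attempt_times(num_retries, initial_delay, next_delay):
    """Return the elapsed time (seconds) at which each attempt starts.

    Attempt 0 starts at t=0; each later attempt starts after sleeping
    for the current delay. `next_delay(i, delay)` updates the delay.
    """
    delays = []
    delay = initial_delay
    for i in range(num_retries - 1):
        delays.append(delay)
        delay = next_delay(i, delay)
    return [0.0] + list(accumulate(delays))


def pure_exponential(i, delay):
    # Double the delay after every failed attempt.
    return delay * 2


def fixed_after_ten(i, delay):
    # Double up to the tenth retry, then retry every second.
    return 1 if i >= 10 else delay * 2


def first_attempt_at_or_after(times, ready_at):
    """First attempt time that is not earlier than `ready_at`, or None."""
    return next((t for t in times if t >= ready_at), None)


if __name__ == "__main__":
    redis_ready_at = 33  # seconds, from the scenario in the description
    for name, rule in [("pure exponential", pure_exponential),
                       ("fixed after ten", fixed_after_ten)]:
        times = attempt_times(50, 0.001, rule)
        print(name, first_attempt_at_or_after(times, redis_ready_at))
```

With these assumed values, pure doubling makes its first attempt after the 33s mark at roughly 65.5s, while the fixed-interval schedule retries at roughly 33.05s.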
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.