
improve exponential backoff when connecting to the redis #24150

Merged: 2 commits merged into ray-project:master from improve_redis_connection_backoff on Apr 25, 2022

Conversation

wumuzi520 (Contributor) commented on Apr 24, 2022

Why are these changes needed?

When deploying a Ray cluster on Kubernetes, if the head node is large, Redis inside the head pod can also take a long time to become ready. With plain exponential backoff on the worker node, it can happen that Redis becomes available at 33s but the worker node does not connect to it until 64s, which in turn delays the delivery of the Ray cluster. Therefore, we switch to a fixed retry interval (1s) once the number of backoff attempts reaches 10.
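
A minimal sketch of the capped-backoff idea described above (the helper name, retry count, and initial delay are illustrative assumptions, not the actual Ray code):

```python
import time

def connect_with_capped_backoff(connect, num_retries=60):
    """Retry `connect` with exponential backoff, but fall back to a fixed
    1-second interval after 10 attempts so the wait never grows too large."""
    delay = 0.001  # assumed initial delay, for illustration
    for i in range(num_retries):
        try:
            return connect()
        except ConnectionError:
            time.sleep(delay)
            # Double the delay for the first 10 attempts, then retry every 1s.
            delay = 1 if i >= 10 else delay * 2
    raise ConnectionError("Could not connect to Redis.")
```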

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wumuzi520 force-pushed the improve_redis_connection_backoff branch from a9ff0cc to 6315651 on April 24, 2022 10:18
wumuzi520 (Contributor, Author) commented:

The CI failures are unrelated.

@wumuzi520 requested a review from raulchen on April 25, 2022 02:53
@wumuzi520 assigned wumuzi520 and unassigned wumuzi520 on Apr 25, 2022
@wumuzi520 force-pushed the improve_redis_connection_backoff branch from 6315651 to 50d794a on April 25, 2022 03:10
# the work node, it may occur that the redis started at 33s, the
# work node did not connect the redis until 64s, which in turn
# affected the delivery time of the Ray cluster. Therefore, we make
# a fixed retry interval (1s) when backoff times >= 10.
Contributor commented:

Nit, I think it's okay to just say something like "Make sure the retry interval doesn't increase too large, which will affect the delivery time of the Ray cluster" here. The concrete scenario can be explained in the PR description.

@raulchen raulchen merged commit edf058d into ray-project:master Apr 25, 2022
@raulchen raulchen deleted the improve_redis_connection_backoff branch April 25, 2022 08:10
wumuzi520 (Contributor, Author) commented on Apr 25, 2022

Due to the modification of the backoff algorithm, I changed the initial value of the parameter START_REDIS_WAIT_RETRIES from 16 to 60, so that the timeout of wait_for_redis_to_start changes from 32s to 51s and the timeout of create_redis_client changes from (2x16)s to (2x60)s.

I'm not sure whether it's also necessary to change sleep(2) to sleep(1) inside create_redis_client, so that its timeout changes from 32s to 60s, which would be about the same as that of wait_for_redis_to_start (see #24168).

cc @raulchen @scv119
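
For reference, a rough back-of-the-envelope check of the 51s figure above, assuming the retry loop starts with a 1 ms delay (an assumption for illustration) and uses the corrected 1-second cap discussed below:

```python
# Approximate total wait across 60 retries with the capped backoff.
# Assumed 1 ms initial delay; the delay doubles for the first 10 attempts,
# then stays fixed at 1 second.
delay, total = 0.001, 0.0
for i in range(60):
    total += delay
    delay = 1 if i >= 10 else delay * 2
print(round(total, 1))  # roughly 51 seconds
```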

raulchen added a commit that referenced this pull request on Apr 25, 2022
@wumuzi520 mentioned this pull request on Apr 25, 2022
- delay *= 2
+ # Make sure the retry interval doesn't increase too large, which will
+ # affect the delivery time of the Ray cluster.
+ delay = 1000 if i >= 10 else delay * 2
wumuzi520 (Contributor, Author) commented:
It should be delay = 1 if i >= 10 else delay * 2, because the unit of sleep is seconds.

It is fixed in PR #24168.

Contributor commented:

could we just use delay = min(1, delay*2)?
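
As a quick comparison under a hypothetical 1 ms initial delay (not taken from this PR), the min-based variant converges to the same 1-second interval; it caps the delay by value rather than after a fixed number of attempts:

```python
delay_a = delay_b = 0.001  # assumed initial delay, for illustration only
for i in range(15):
    delay_a = 1 if i >= 10 else delay_a * 2   # merged: fixed 1s after 10 attempts
    delay_b = min(1, delay_b * 2)             # suggested: never exceed 1s
print(delay_a, delay_b)  # both settle at 1 second
```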
