Significantly reduce retry duration of service discovery #1541
The current 66s is quite long and means you don't see your (generally permanent) error quickly enough.
Thanks for creating this PR @jackkleeman. I like this PR more than #1540 since 66s is probably a bit too long. The one thing we should make sure of is that the exponential retry policy works with all the deployments (in terms of response time) we want Restate to work with (for example, twice the expected response time should be smaller than the pause between the second-to-last and last attempts). Do you have an idea how long it can take a cold Lambda to respond?
// Total duration roughly 1s
let retry_policy = RetryPolicy::exponential(Duration::from_millis(100), 2.0, Some(4), None);
How long does it take to spin up a cold Lambda?
This duration is unrelated to the timeout on the request to the Lambda; it's just the duration between retries. In the Lambda case, the first request will block on the cold start and then likely succeed. If it somehow fails transiently, an immediate retry will most likely not see a cold start and will succeed immediately. In no scenario would a very slow cold start lead us to breach this retry policy.
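As a rough check on the "roughly 1s" figure, here is a minimal sketch of the arithmetic. `backoff_pauses` and `total_pause` are hypothetical helpers, not part of Restate's `RetryPolicy`. The sketch assumes each pause is the previous one multiplied by the factor, with no jitter and no cap, and it makes no claim about whether `Some(4)` counts the first attempt.

```rust
use std::time::Duration;

/// Hypothetical helper, not Restate's API: the pauses an exponential policy
/// would insert between attempts, assuming no jitter and no upper bound.
fn backoff_pauses(initial: Duration, factor: f64, pauses: u32) -> Vec<Duration> {
    let initial_ms = initial.as_millis() as f64;
    (0..pauses)
        .map(|i| {
            // i-th pause = initial * factor^i, rounded to whole milliseconds.
            let ms = (initial_ms * factor.powi(i as i32)).round() as u64;
            Duration::from_millis(ms)
        })
        .collect()
}

/// Sum of all pauses; the time each request takes is deliberately not included.
fn total_pause(initial: Duration, factor: f64, pauses: u32) -> Duration {
    backoff_pauses(initial, factor, pauses).into_iter().sum()
}
```

Under these assumptions, `total_pause(Duration::from_millis(100), 2.0, 3)` comes to 700ms and the same call with 4 pauses comes to 1.5s, so either reading of `Some(4)` lands near the "roughly 1s" in the code comment. The time spent inside each request (for example, a Lambda cold start) is not part of this sum.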
True, it is the duration between failures. Thanks for the clarification.
LGTM. +1 for merging.