Add the ability to dictate custom retries #962
Conversation
span.WithField(ctx, "delay", ratelimitDelay.String())
val.plannedToStartWorkAt = val.plannedToStartWorkAt.Add(ratelimitDelay)
val.delayedViaRateLimit = &ratelimitDelay
span.WithField(ctx, "delay", actualDelay.String())
We should special case stringers in the tracing so it only does the conversion when we are actually collecting the trace.
How do you think we should special-case a nil duration? Or should nil always just be nil?
We could use -1 (or < 0) instead of nil to avoid all these pointer conversions.
There's already code (and tests for it) that allows you to pass a negative delay back from the rate limiter, resulting in the item being processed immediately. That would regress / change this behaviour.
I don't understand why we did that, since 0 should mean no delay. But it is what it is.
It's the way the Kubernetes rate limiters work, and we kept the contract the same.
Force-pushed from 87594c4 to 0eaede8
Force-pushed from 80f2c01 to 85800f6
//
// If the work item is to be retried, a delay duration may be specified. If the value is nil, it will fall
// back to the default behaviour of the queue, and use the rate limiter that's configured to determine when to start work.
type ShouldRetryFunc = queue.ShouldRetryFunc
What happens when the returned delay is a negative value?
Fixed.
Can you point me to what changed? I'm not noticing.
huh, no idea why it didn't update inline:
ShouldRetryFunc is a mechanism to provide a custom retry policy.
It is passed metadata about the work item when the handler returns an error, specifically:
- The key
- The number of attempts that this item has already had (and failed)
- The (potentially wrapped) error from the queue handler.
The return value is an error, and optionally an amount to delay the work.
If an error is returned, the work will be aborted, and the returned error is bubbled up. It can be the error that
was passed in, or that error can be wrapped. If the work item is to be retried, a delay duration may be specified. The delay is used to schedule when
the item should begin processing relative to now; it does not necessarily dictate when the item will start work.
Items are processed in the order they are scheduled. If the delay is nil, it will fall back to the default behaviour
of the queue, and use the rate limiter that's configured to determine when to start work.
I'm not seeing where we are calling out what a negative value means, only nil.
It doesn't call out the negative value explicitly, but it says that the ordering is based on now + delay. If the delay is negative, your task will be scheduled in the past, thus pushing it to the front of the queue. Do you think we need to be explicit about it?
I think it is worth mentioning.
Added:
// If the delay is negative, the item will be scheduled "earlier" than now. This will result in the item being executed
// earlier than other items in the FIFO work order.
Force-pushed from 79e61e5 to 27a8050
Force-pushed from 27a8050 to 88606b1
Looks like there's a test timeout.
Our current retry policy is naive and only does 20 retries. It is also based off of the rate limiter. If the user is somewhat aggressive in rate limiting, but they have a temporary outage on the API server, they may want to continue to delay. In fact, K8s has a built-in function to suggest delays: https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors#SuggestsClientDelay Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Force-pushed from 88606b1 to b259cb0
Huh, didn't happen again. CC: @cpuguy83
I literally ran: and no errors.
There's somehow a situation where `time="2021-04-14T20:50:38Z" level=warning msg="skipping sync of pod 'default/my-pod' in 'Succeeded' phase"` is happening, and this results in it not being put on the deletionQ. I don't see how this is possible.
LGTM