Add the ability to dictate custom retries #962

Merged
merged 2 commits into virtual-kubelet:master from expose-custom-retry
Apr 15, 2021

Conversation

@sargun (Contributor) commented Apr 9, 2021

Our current retry policy is naive and only does 20 retries. It is
also based on the rate limiter. If the user has configured fairly aggressive
rate limiting but the API server has a temporary outage, they may want to
keep delaying and retrying.

In fact, K8s has a built-in function to suggest delays:
https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors#SuggestsClientDelay

Signed-off-by: Sargun Dhillon <sargun@sargun.me>

@sargun sargun requested a review from cpuguy83 April 9, 2021 08:57
node/podcontroller.go (outdated, resolved)
internal/queue/queue.go (outdated, resolved)
span.WithField(ctx, "delay", ratelimitDelay.String())
val.plannedToStartWorkAt = val.plannedToStartWorkAt.Add(ratelimitDelay)
val.delayedViaRateLimit = &ratelimitDelay
span.WithField(ctx, "delay", actualDelay.String())
Contributor

We should special case stringers in the tracing so it only does the conversion when we are actually collecting the trace.
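
(For illustration, a minimal sketch of that deferral. The Span interface and its IsRecording method below are assumptions made for the example, not the project's actual tracing API.)

package tracing

import (
	"context"
	"fmt"
)

// Span is a stand-in for the project's tracing span; only the two methods
// used below are assumed here for the sake of the sketch.
type Span interface {
	IsRecording() bool
	WithField(ctx context.Context, key string, value interface{})
}

// withStringer records a fmt.Stringer field, but only pays for the
// String() conversion when the span is actually being collected.
func withStringer(ctx context.Context, span Span, key string, v fmt.Stringer) {
	if !span.IsRecording() {
		return
	}
	span.WithField(ctx, key, v.String())
}

With something like this, the call sites above would become withStringer(ctx, span, "delay", ratelimitDelay), so the conversion only happens for traces that are actually sampled.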

Contributor Author

How do you think we should special-case a nil duration? Or should nil always just be nil?

Contributor

We could say use -1 (or < 0) instead of nil to avoid all these pointer conversions.

Contributor Author

There's already code (and tests for it) that allows you to pass a negative delay back from the rate limiter and have the item processed immediately. That would regress / change this behaviour.

Contributor

I don't understand why we did that since 0 should mean no delay. But it is what it is.

Contributor Author

It's the way the Kubernetes rate limiters work, and we kept the contract the same.

@sargun sargun requested a review from cpuguy83 April 12, 2021 02:53
@sargun sargun force-pushed the expose-custom-retry branch 3 times, most recently from 80f2c01 to 85800f6 on April 13, 2021 06:03
//
// If the work item is to be retried, a delay duration may be specified. If the value is nil, it will fall
// back to the default behaviour of the queue, and use the rate limiter that's configured to determine when to start work.
type ShouldRetryFunc = queue.ShouldRetryFunc
Contributor

What happens when the returned delay is a negative value?

Contributor Author

Fixed.

Contributor

Can you point me to what changed? I'm not noticing.

Contributor Author

huh, no idea why it didn't update inline:

ShouldRetryFunc is a mechanism to have a custom retry policy.

It is passed metadata about the work item when the handler returns an error, namely:

  • The key
  • The number of attempts that this item has already had (and failed)
  • The (potentially wrapped) error from the queue handler.

The return value is an error, and optionally an amount of time to delay the work.
If an error is returned, the work will be aborted, and the returned error is bubbled up. It can be the error that
was passed in, or that error can be wrapped.

If the work item is to be retried, a delay duration may be specified. The delay is used to schedule when
the item should begin processing relative to now; it does not necessarily dictate when the item will start work.
Items are processed in the order they are scheduled. If the delay is nil, it will fall back to the default behaviour
of the queue, and use the rate limiter that's configured to determine when to start work.
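
(For illustration, a sketch of a custom retry policy along these lines, using the SuggestsClientDelay helper mentioned in the PR description. The parameter list, package name, and attempt cap below are assumptions based on the contract described above, not the exact queue.ShouldRetryFunc signature.)

package retrypolicy

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// shouldRetry aborts after a fixed number of attempts; otherwise it retries,
// preferring whatever delay the API server suggests (e.g. via Retry-After).
func shouldRetry(ctx context.Context, key string, attempts int, handlerErr error) (*time.Duration, error) {
	const maxAttempts = 50 // hypothetical cap for this sketch
	if attempts >= maxAttempts {
		// Returning an error aborts the work item; the (wrapped) handler error bubbles up.
		return nil, fmt.Errorf("giving up on %q after %d attempts: %w", key, attempts, handlerErr)
	}
	// If the API server suggested a client delay, honour it.
	if seconds, ok := apierrors.SuggestsClientDelay(handlerErr); ok {
		delay := time.Duration(seconds) * time.Second
		return &delay, nil
	}
	// nil delay: fall back to the queue's configured rate limiter.
	return nil, nil
}

The cap and delay handling here are only illustrative; a real policy can combine the attempt count, the error type, and the suggested delay however it likes.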

Contributor

I'm not seeing where we are calling out what a negative value means, only nil.

Contributor Author

It doesn't call out the negative value explicitly, but it says that the ordering is based on now + delay. If the delay is negative, your task will be scheduled in the past, thus pushing it to the front of the queue. Do you think we need to be explicit about it?

Contributor

I think it is worth mentioning.

Contributor Author

Added:

// If the delay is negative, the item will be scheduled "earlier" than now. This will result in the item being executed
// earlier than other items in the FIFO work order.

@sargun sargun force-pushed the expose-custom-retry branch 2 times, most recently from 79e61e5 to 27a8050 on April 13, 2021 21:40
@sargun sargun requested a review from cpuguy83 April 13, 2021 21:41
@cpuguy83 (Contributor)

Looks like there's a test timeout.

This starts watching for events prior to the start of the controller.
This smells like a bug in the fakeclient bits, but it seems to fix
the problem.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Our current retry policy is naive and only does 20 retries. It is
also based on the rate limiter. If the user has configured fairly aggressive
rate limiting but the API server has a temporary outage, they may want to
keep delaying and retrying.

In fact, K8s has a built-in function to suggest delays:
https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors#SuggestsClientDelay

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
@sargun (Contributor Author) commented Apr 14, 2021

Huh, didn't happen again.

CC: @cpuguy83

@sargun (Contributor Author) commented Apr 15, 2021

I literally ran:
go test -run=TestPodLifecycle/Async/testDanglingPodScenarioWithDeletionTimestamp/mockProvider -v -count=1000 -parallel=8 -race ./node 2>&1|tee test.log

and no errors.

@sargun (Contributor Author) commented Apr 15, 2021

There's somehow a situation where the warning time="2021-04-14T20:50:38Z" level=warning msg="skipping sync of pod "default/my-pod" in "Succeeded" phase" is logged, and this results in the pod not being put on the deletionQ.

I don't see how this is possible.

@cpuguy83 (Contributor) left a comment

LGTM

@cpuguy83 cpuguy83 merged commit bf3a764 into virtual-kubelet:master Apr 15, 2021
@sargun sargun deleted the expose-custom-retry branch April 15, 2021 23:33