Run one per-ns worker per namespace instead of namespace × component #3116
Conversation
LGTM
service/worker/pernamespaceworker.go (Outdated)
}

func (wm *perNamespaceWorkerManager) responsibleForNamespace(ns *namespace.Namespace, queueName string, num int) (int, error) {
	// This can result in fewer than the intended number of workers if num > 1, because
func (wm *perNamespaceWorkerManager) responsibleForNamespace(ns *namespace.Namespace) (int, error) {
maybe rename to getWorkerCount?
I used getWorkerMultiplicity, since that's what it returns: the number of those workers that land on the current node.
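For illustration, a self-contained sketch of how a multiplicity calculation like this could work: each of a namespace's desired worker slots is routed over a hash ring, and we count how many land on this node. The `Resolver`, `ringResolver`, and `getWorkerMultiplicity` names below are illustrative stand-ins, not the actual implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Resolver maps a routing key to the identity of the node that owns it.
// It is an illustrative stand-in for the membership ring.
type Resolver interface {
	Lookup(key string) (string, error)
}

// ringResolver picks an owner by hashing the key over a fixed host list.
type ringResolver struct{ hosts []string }

func (r ringResolver) Lookup(key string) (string, error) {
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.hosts[int(h.Sum32())%len(r.hosts)], nil
}

// getWorkerMultiplicity counts how many of a namespace's desired worker
// slots resolve to this node; slot i is routed by "namespaceID/i".
func getWorkerMultiplicity(res Resolver, self, namespaceID string, desired int) (int, error) {
	multiplicity := 0
	for i := 0; i < desired; i++ {
		owner, err := res.Lookup(fmt.Sprintf("%s/%d", namespaceID, i))
		if err != nil {
			return 0, err
		}
		if owner == self {
			multiplicity++
		}
	}
	return multiplicity, nil
}

func main() {
	res := ringResolver{hosts: []string{"hostA", "hostB", "hostC"}}
	n, _ := getWorkerMultiplicity(res, "hostA", "ns-1234", 4)
	fmt.Println("workers on this node:", n)
}
```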
policy.SetMaximumInterval(1 * time.Minute)
policy.SetExpirationInterval(backoff.NoInterval)
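These two lines cap the backoff at one minute and disable expiration, i.e. the worker retries indefinitely. A self-contained sketch of that behavior in plain Go (not the server's actual backoff package; intervals and names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryForever calls op until it succeeds, doubling the delay between
// attempts up to maxInterval. With no expiration it never gives up, so a
// transient failure recovers on its own instead of waiting for an
// external trigger such as a membership change.
func retryForever(op func() error, initial, maxInterval time.Duration) {
	delay := initial
	for attempt := 1; ; attempt++ {
		err := op()
		if err == nil {
			return
		}
		fmt.Printf("attempt %d failed: %v; retrying in %v\n", attempt, err, delay)
		time.Sleep(delay)
		delay *= 2
		if delay > maxInterval {
			delay = maxInterval
		}
	}
}

func main() {
	calls := 0
	retryForever(func() error {
		calls++
		if calls < 4 {
			return errors.New("transient failure") // succeeds on the 4th try
		}
		return nil
	}, 100*time.Millisecond, time.Minute)
	fmt.Println("succeeded after", calls, "attempts")
}
```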
If something goes wrong, will the worker fall into this retry loop forever? Are there metrics to detect this and logs for troubleshooting?
Should it stop retrying at some point? That seems worse: if there were some transient problem blocking the creation of a client and we gave up, but it was fixed after ten minutes, then we would have to wait for a membership change to try again.
I added some logging.
For metrics, maybe we'd want a gauge of how many are stuck retrying at once?
Keeping a slow retry is fine, but we want to be able to alert in case it keeps failing for a long time, and have logs to investigate.
I added a TODO for metrics as well, but I think we can merge this for now.
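A minimal sketch of the gauge idea mentioned above, assuming a simple atomic counter that a metrics reporter would export; none of these names are from the actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stuckRetrying counts workers currently inside their start/retry loop;
// a metrics reporter would export this as a gauge and alert on it staying
// high for too long. Names here are illustrative.
var stuckRetrying atomic.Int64

func startWorkerWithRetries(start func() error) {
	stuckRetrying.Add(1)
	defer stuckRetrying.Add(-1)
	for start() != nil {
		// back off between attempts (elided)
	}
}

func main() {
	calls := 0
	startWorkerWithRetries(func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("client not ready")
		}
		return nil
	})
	fmt.Println("stuck retrying:", stuckRetrying.Load()) // 0 once started
}
```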
What changed?
Rearrange per-namespace worker manager to always run all components in the same worker+client+task queue.
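As a rough sketch of the shape of this change using the Go SDK: one client, one worker, and one task queue per namespace, with every component registered on that single worker. The `Component` interface, the task queue name, and the wiring below are assumptions for illustration; the real server components and manager differ:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// Component stands in for a per-namespace worker component; illustrative.
type Component interface {
	Register(w worker.Registry)
}

// startNamespaceWorker creates one client, one worker, and one task queue
// for the namespace and registers every component on that single worker,
// rather than one worker per (namespace, component) pair.
func startNamespaceWorker(ns string, components []Component) (worker.Worker, error) {
	c, err := client.Dial(client.Options{Namespace: ns})
	if err != nil {
		return nil, err
	}
	w := worker.New(c, "per-ns-worker-tq", worker.Options{}) // queue name is illustrative
	for _, cmp := range components {
		cmp.Register(w) // all components share the worker and task queue
	}
	if err := w.Start(); err != nil {
		c.Close()
		return nil, err
	}
	return w, nil
}

func main() {
	if _, err := startNamespaceWorker("my-namespace", nil); err != nil {
		log.Fatal(err)
	}
}
```

Running all components on one worker means a single set of task queue pollers and one service client per namespace, which is where the load reduction described below comes from.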
Why?
Reduces load when we have more than one component.
How did you test it?
Updated unit tests.
Potential risks
Is hotfix candidate?