Run one per-ns worker per namespace instead of namespace × component #3116

Merged (6 commits) on Jul 27, 2022

Conversation

dnr (Member) commented Jul 18, 2022

What changed?
Rearrange the per-namespace worker manager to always run all components in the same worker, client, and task queue.

Why?
Reduces load when we have more than one component (a rough sketch of the idea follows at the end of this description).

How did you test it?
Updated the unit test.

Potential risks

Is hotfix candidate?
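
As a rough, hypothetical sketch of the idea (not the PR's actual code; the component interface, startPerNamespaceWorker, and the task queue name are placeholders), every per-namespace component now registers on one shared SDK worker backed by one client and one task queue:

```go
package perns

import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// component stands in for the server's per-namespace worker components
// (scheduler, batcher, ...); the interface is illustrative only.
type component interface {
	Register(w worker.Worker)
}

// startPerNamespaceWorker creates one client, one task queue, and one worker
// for the namespace, and lets every component register on it, instead of
// creating a separate client/worker/task queue per (namespace, component).
func startPerNamespaceWorker(nsName string, components []component) (worker.Worker, error) {
	c, err := client.Dial(client.Options{Namespace: nsName})
	if err != nil {
		return nil, err
	}
	w := worker.New(c, "per-ns-worker-tq", worker.Options{})
	for _, cmp := range components {
		cmp.Register(w) // each component adds its workflows/activities here
	}
	return w, w.Start()
}
```

With N components this reduces the per-namespace footprint from N clients, N workers, and N sets of task-queue pollers to one of each, which is where the load reduction comes from.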

dnr requested a review from yux0 July 18, 2022 23:28
yux0 (Contributor) commented Jul 19, 2022

LGTM

common/dynamicconfig/constants.go (outdated review thread, resolved)
service/worker/pernamespaceworker.go (outdated review thread, resolved)

```diff
 }

-func (wm *perNamespaceWorkerManager) responsibleForNamespace(ns *namespace.Namespace, queueName string, num int) (int, error) {
+// This can result in fewer than the intended number of workers if num > 1, because
+func (wm *perNamespaceWorkerManager) responsibleForNamespace(ns *namespace.Namespace) (int, error) {
```

Member:

maybe rename to getWorkerCount?

Member Author:

I used getWorkerMultiplicity since that's what it's returning: the number of those workers that land on the current node.
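
For illustration only (the lookup signature and parameter names below are assumptions, not the server's actual membership API), a multiplicity calculation along those lines could hash each (namespace, index) key to an owning node and count how many land on the current one:

```go
package perns

import "fmt"

// lookupFn resolves a routing key to the address of the node that owns it.
// It stands in for a membership/ring lookup; the signature is assumed here.
type lookupFn func(key string) (string, error)

// getWorkerMultiplicity returns how many of the desired workers for a
// namespace land on this node. With desired == 1 the result is 0 or 1; with
// desired > 1, several keys may hash to the same node, so the cluster-wide
// total can end up lower than desired.
func getWorkerMultiplicity(nsID string, desired int, self string, lookup lookupFn) (int, error) {
	multiplicity := 0
	for i := 0; i < desired; i++ {
		owner, err := lookup(fmt.Sprintf("%s/%d", nsID, i))
		if err != nil {
			return 0, err
		}
		if owner == self {
			multiplicity++
		}
	}
	return multiplicity, nil
}
```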

Comment on lines +315 to +327:

```go
policy.SetMaximumInterval(1 * time.Minute)
policy.SetExpirationInterval(backoff.NoInterval)
```

Member:

If something goes wrong, will the worker fall into this retry loop forever? Are there metrics to detect this and logs for troubleshooting?

Member Author:

Should it stop retrying at some point? That seems worse: if some transient problem blocked the creation of a client and we gave up, but it's fixed after ten minutes, then we'd have to wait for a membership change to try again.

I added some logging.

For metrics, maybe we'd want a gauge of how many workers are stuck retrying at once?

Member:

Keeping a slow retry is fine, but we want to be able to alert if it keeps failing for a long time, and to have logs to investigate.

Member Author:

I added a TODO for metrics as well, but I think we can merge this for now.
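
To make the discussion concrete, here is a hedged sketch of the retry shape being described, built around the policy lines quoted above. It assumes the server's common/backoff ThrottleRetryContext helper and log tags behave as their names suggest; retryForever and its parameters are placeholders, not the PR's actual code.

```go
package perns

import (
	"context"
	"time"

	"go.temporal.io/server/common/backoff"
	"go.temporal.io/server/common/log"
	"go.temporal.io/server/common/log/tag"
)

// retryForever retries op with a capped exponential backoff and no expiration,
// logging each failure so a namespace stuck in this loop shows up in the logs.
// Only context cancellation (e.g. shutdown or a membership change) stops it.
func retryForever(ctx context.Context, logger log.Logger, nsName string, op func(context.Context) error) error {
	policy := backoff.NewExponentialRetryPolicy(1 * time.Second)
	policy.SetMaximumInterval(1 * time.Minute)       // cap the wait between attempts
	policy.SetExpirationInterval(backoff.NoInterval) // never give up on our own

	wrapped := func(ctx context.Context) error {
		err := op(ctx)
		if err != nil {
			logger.Error("per-namespace worker start failed, will retry",
				tag.WorkflowNamespace(nsName), tag.Error(err))
		}
		return err
	}
	// Treat every error as retryable; a gauge of namespaces currently inside
	// this loop would be the metric discussed above (left as a TODO in the PR).
	return backoff.ThrottleRetryContext(ctx, wrapped, policy, func(error) bool { return true })
}
```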

dnr marked this pull request as ready for review July 21, 2022 20:38
dnr requested a review from a team as a code owner July 21, 2022 20:38
3 participants