Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Old workers expiration issue #1704

Closed
hackhowtofaq opened this issue Feb 10, 2020 · 7 comments
Closed

Old workers expiration issue #1704

hackhowtofaq opened this issue Feb 10, 2020 · 7 comments

Comments

@hackhowtofaq
Copy link

# Returns a list of workers that have sent a heartbeat in the past, but which

Since worker.to_s includes the pid, older workers with different pid but seconds_since_heartbeat > prune_interval are not expired.

Are you aware of this? Is this intended?

@josh-m-sharpe
Copy link
Contributor

I frequently notice after a deploy when old workers are killed and new ones started, the worker count in resque web is elevated inaccurately. Do you think this is related?

@hackhowtofaq
Copy link
Author

I've also noticed this. Yeap I think it has to do with the above issue.

@josh-m-sharpe
Copy link
Contributor

Seems like there's additional information here:

resque/lib/resque/worker.rb

Lines 605 to 610 in adb633a

# If the worker hasn't sent a heartbeat, remove it from the registry.
#
# If the worker hasn't ever sent a heartbeat, we won't remove it since
# the first heartbeat is sent before the worker is registred it means
# that this is a worker that doesn't support heartbeats, e.g., another
# client library or an older version of Resque. We won't touch these.

@iloveitaly
Copy link
Contributor

I've run into this as well. I wonder if we could add a configuration option to prune these workers automatically. If a user knows they aren't using an old version of resque or other client library, they could set the configuration option to auto-prune these workers.

If someone could whip up a PR I can help review and get it merged!

@jrochkind
Copy link
Contributor

jrochkind commented Jun 10, 2021

In my experience simply running Resque::Worker.all_workers_with_expired_heartbeats.each(&:unregister_worker) does succesfully prune these.

It's the more complicated logic in #prune_dead_workers that is maybe deciding they are not actually prune-able, perhaps because of what @hackhowtofaq explains. It seems like a bug to me, rather than something that should have a config variable, you don't have config variables for "run with this bug fixed", right?

It's just unclear to me what this more complicated logic is intended to do. Cause certainly one PR that would fix this problem would be just replacing #prune_dead_workers with the simple implementation Resque::Worker.all_workers_with_expired_heartbeats.each(&:unregister_worker) -- but what would that be losing, what is the more complicated logic meant to do?

If there aren't docs or tests about this, this is what makes it challenging to work on an under-maintained "legacy" project like this, there's nobody around with the context to understand what's up, so someone kind of has to spend some time trying to re-construct it archeologically to make changes safely.

Or we could just replace the implementation with ``Resque::Worker.all_workers_with_expired_heartbeats.each(&:unregister_worker)` and if no tests fail say that's good enough? Seems risky.

@iloveitaly
Copy link
Contributor

Pretty sure this is related to #1750 which has a PR associated with it. Hoping we can get that PR merged soon.

@iloveitaly
Copy link
Contributor

#1750 has been merged and should fix this issue! Re-open this issue if it's still a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants