
Cleanup zombie worker leftovers as part of StartedJobRegistry's cleanup() #1372

Merged (25 commits, Apr 20, 2021)

Conversation

@rauchy (Contributor) commented Nov 6, 2020

When workers die ungracefully, jobs without a timeout will remain in the StartedJobRegistry forever.
This PR cleans those up as part of the routine cleanup() call in StartedJobRegistry.

This should solve #1164

@codecov (bot) commented Nov 6, 2020

Codecov Report

Merging #1372 (acceaac) into master (dcbbd06) will decrease coverage by 0.09%.
The diff coverage is 94.11%.


@@            Coverage Diff             @@
##           master    #1372      +/-   ##
==========================================
- Coverage   95.08%   94.98%   -0.10%     
==========================================
  Files          44       44              
  Lines        6039     5939     -100     
==========================================
- Hits         5742     5641     -101     
- Misses        297      298       +1     
Impacted Files Coverage Δ
tests/test_worker.py 97.32% <ø> (-0.08%) ⬇️
rq/worker.py 88.67% <92.59%> (-0.07%) ⬇️
rq/job.py 97.80% <100.00%> (-0.11%) ⬇️
tests/test_job.py 100.00% <100.00%> (ø)
tests/test_decorator.py 89.60% <0.00%> (-1.12%) ⬇️
tests/__init__.py 81.39% <0.00%> (-0.83%) ⬇️
tests/fixtures.py 71.23% <0.00%> (-0.39%) ⬇️
rq/cli/cli.py 91.25% <0.00%> (-0.37%) ⬇️
rq/queue.py 93.47% <0.00%> (-0.24%) ⬇️
rq/scheduler.py 95.94% <0.00%> (-0.14%) ⬇️
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@selwin (Collaborator) commented Nov 8, 2020

Hey there, this is an interesting addition, but I think it needs a little bit of tweaking. This operation is potentially expensive to run on systems with a large number of workers running at the same time (hundreds or thousands), since it would verify that the workers of all jobs in StartedJobRegistry are still alive.

There are a few ways we can approach this.

Approach 1

  1. Add another method registry.get_job_ids_without_expiry() that returns jobs with infinite timeouts
  2. Verify that the workers are still alive

Approach 2

This PR adds job.last_heartbeat. As part of the heartbeat process, we can actually track when a job last reported for duty, find the ones that haven't reported as active in the last few minutes, and check whether their respective workers are still alive.

Note that I've also been working on adding job.worker_name in PR #1375, so we could use that here instead.
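
Roughly, Approach 2 could look like the sketch below. It assumes job.last_heartbeat from this PR and job.worker_name from #1375; the function name and the 90-second threshold are placeholders, not part of either PR.

# Sketch only: relies on job.last_heartbeat (this PR) and job.worker_name (#1375).
from datetime import datetime, timedelta

from rq import Worker
from rq.job import Job

STALE_AFTER = timedelta(seconds=90)  # placeholder threshold


def find_dead_jobs(registry, connection):
    """Jobs whose heartbeat went stale and whose worker is no longer alive."""
    alive_workers = {worker.name for worker in Worker.all(connection=connection)}
    dead = []
    for job_id in registry.get_job_ids():
        job = Job.fetch(job_id, connection=connection)
        heartbeat = job.last_heartbeat  # assumed attribute added by this PR
        stale = heartbeat is None or datetime.utcnow() - heartbeat > STALE_AFTER
        if stale and job.worker_name not in alive_workers:
            dead.append(job)
    return dead

The staleness filter means worker liveness only has to be checked for jobs that have actually stopped reporting.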

@rauchy (Contributor, author) commented Nov 8, 2020

Thanks @selwin. Approach 2 (+ job.worker_name) feels to me like the way to go. I just have a couple of questions:

  1. I'm assuming you meant "As part of the cleanup process" instead of "As part of the heartbeat process"?
  2. What would be a good interval to declare a job as "not actively worked on"? I'm thinking it's something like job_monitoring_interval + 60, but that one is worker configuration and kinda feels awkward to pass down to the job, don't you think?

@selwin (Collaborator) commented Nov 8, 2020

> I'm assuming you meant "As part of the cleanup process" instead of "As part of the heartbeat process"?

No, "as part of the heartbeat process" is correct, it's this line here. We store the timeout in a sorted set, and the cleanup process periodically check whether any job has "expired".

After we implement this, the cleanup process can rely on this sorted set as part of it's cleaning process (this is better than checking StartedJobRegistry for expired jobs). In most cases we'll be able to detect dead jobs sooner.

> What would be a good interval to declare a job as "not actively worked on"? I'm thinking it's something like job_monitoring_interval + 60, but that one is worker configuration and kinda feels awkward to pass down to the job, don't you think?

I think job_monitoring_interval + <some interval> is fine.

@rauchy (Contributor, author) commented Nov 8, 2020

Ok, we are talking about the same thing. I meant "look for out of date heartbeats as part of the cleanup process".

This does lead to my second question - how can we determine the right interval when running in the scope of the cleanup process? Passing down the monitoring interval from the worker to the cleanup method seems to be the easiest way, but makes the registry APIs uneven in terms of arguments.

@selwin (Collaborator) commented Nov 8, 2020

> This does lead to my second question - how can we determine the right interval when running in the scope of the cleanup process? Passing down the monitoring interval from the worker to the cleanup method seems to be the easiest way, but makes the registry APIs uneven in terms of arguments.

The expiry is set during the heartbeat process to now + 60 seconds. The cleanup process simply looks for heartbeats whose score is lower than the current timestamp and moves those jobs to FailedJobRegistry.
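
In Redis terms this is just a sorted set scored by expiry time. A bare-bones sketch of the mechanics (the key and function names are made up for illustration, not what rq actually uses):

import time

from redis import Redis

redis = Redis()
HEARTBEATS_KEY = 'rq:heartbeats'  # illustrative key name


def record_heartbeat(job_id, ttl=60):
    """Mark the job as alive until now + ttl seconds."""
    redis.zadd(HEARTBEATS_KEY, {job_id: time.time() + ttl})


def expired_job_ids():
    """Any score below the current timestamp means the job stopped reporting."""
    return [job_id.decode() for job_id in
            redis.zrangebyscore(HEARTBEATS_KEY, 0, time.time())]

The cleanup call would then move each of those job IDs to FailedJobRegistry.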

@rauchy (Contributor, author) commented Nov 9, 2020

Makes sense @selwin. Now using a sorted set for heartbeats per registry.

@selwin (Collaborator) commented Nov 14, 2020

We can actually simplify this PR by quite a bit.

StartedJobRegistry already maps a job to its expiration time here. We just need to:

  1. Change the ttl argument from the job timeout to the heartbeat timeout.
  2. Extend that ttl on every job.heartbeat.

I think those are pretty much the changes we need.
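
Conceptually, then, each heartbeat ends up doing something like the sketch below (not the merged diff; the hash field name is illustrative):

import time
from datetime import datetime


def heartbeat(job, started_job_registry, ttl, pipeline):
    # Record the last heartbeat on the job hash (field name is illustrative).
    pipeline.hset(job.key, 'last_heartbeat', datetime.utcnow().isoformat())
    # Extend the job's expiry in StartedJobRegistry by `ttl` seconds, so the
    # registry's cleanup() only reaps jobs that stopped heartbeating.
    pipeline.zadd(started_job_registry.key, {job.id: time.time() + ttl})

With that in place, StartedJobRegistry needs no new data structure; the existing score simply stops meaning "job timeout" and starts meaning "heartbeat deadline".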

@rauchy (Contributor, author) commented Nov 15, 2020

Oh, that makes total sense. I was working under the false assumption that the old non-heartbeat scores had to live side by side with the heartbeats, but heartbeats make a better indicator of a job's WIP status. Can you have another look?

rq/worker.py (outdated)

         self.kill_horse()
         self.wait_for_horse()
         break

     with self.connection.pipeline() as pipeline:
-        self.heartbeat(self.job_monitoring_interval + 60, pipeline=pipeline)
+        self.heartbeat(heartbeat_ttl, pipeline=pipeline)
         queue.started_job_registry.add(job, heartbeat_ttl, pipeline=pipeline)
Collaborator review comment:
Can we move this to job.heartbeat()? I think it makes more sense that way.

@selwin (Collaborator) commented Nov 21, 2020

Two more things that we need to take into account:

  1. The heartbeat should expire no later than when the job is supposed to time out (plus some buffer time). For example, if job.timeout is 65 seconds and job_monitoring_interval is 30 seconds (the job is checked every 30 seconds), the heartbeat TTLs that are sent should be:
  • 90 (30 + 60) at the beginning. 30 is the job monitoring interval, 60 is the buffer time.
  • 90 (30 + 60) at the 30th second.
  • 65 (5 + 60) at the 60th second. 5 is the remaining job execution time.
  2. SimpleWorker doesn't have a main worker thread that monitors the horse's execution, so we can't use the variable TTLs sent by the regular Worker class.

To solve the two issues, I think we should add a worker.get_heartbeat_ttl(job, elapsed_execution_time) that gets called whenever a job heartbeat is about to be sent. This way, we can:

  1. Easily test the job heartbeat TTL calculation.
  2. Let SimpleWorker override this method to always return TTLs as per the original implementation.
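
To make the numbers above concrete, here is a sketch of that proposal (class names are placeholders, and the merged implementation may differ):

from rq import SimpleWorker, Worker
from rq.defaults import DEFAULT_WORKER_TTL


class HeartbeatWorker(Worker):
    def get_heartbeat_ttl(self, job, elapsed_execution_time):
        # Monitoring interval plus a 60s buffer, capped so the heartbeat never
        # outlives the job itself: 90, 90, 65 for the example above.
        if job.timeout and job.timeout != -1:
            remaining = job.timeout - elapsed_execution_time
            return int(min(remaining, self.job_monitoring_interval)) + 60
        return self.job_monitoring_interval + 60


class HeartbeatSimpleWorker(SimpleWorker):
    def get_heartbeat_ttl(self, job, elapsed_execution_time):
        # SimpleWorker has no monitoring thread, so keep a fixed TTL,
        # as in the original implementation.
        return job.timeout or DEFAULT_WORKER_TTL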

rq/worker.py (outdated)
@@ -737,6 +737,10 @@ def fork_work_horse(self, job, queue):
         self._horse_pid = child_pid
         self.procline('Forked {0} at {1}'.format(child_pid, time.time()))

+    def get_heartbeat_ttl(self, job, elapsed_execution_time):
Collaborator review comment:
Instead of passing elapsed_execution_time around, I think it makes more sense for the worker to keep track of when job execution started. Can we make this change?

@selwin (Collaborator) commented Jan 26, 2021

@rauchy sorry it took so long, I missed this PR. The approach is already good, but I made a comment here. It would be good if we can get this addressed.

This is a PR I'd like to pull in.

@rauchy (Contributor, author) commented Mar 16, 2021

@selwin whoops, also sorry it took me so long 🙊

Is this what you had in mind regarding tracking worker execution time?


        with self.connection.pipeline() as pipeline:
            self.set_state(WorkerStatus.BUSY, pipeline=pipeline)
            self.set_current_job_id(job.id, pipeline=pipeline)
            self.set_current_job_working_time(0, pipeline=pipeline)
Collaborator review comment:
I think this needs to be moved to monitor_work_horse(), after the job has finished, so it gets set back to zero once job execution finishes.

Contributor (author) reply:

Good point. I added it to monitor_work_horse, but I also kept it here in case a work horse dies (due to an OSError for example) without resetting it. WDYT?

@@ -374,6 +375,11 @@ def _get_state(self):

     state = property(_get_state, _set_state)

+    def set_current_job_working_time(self, current_job_working_time, pipeline=None):
Collaborator review comment:
I initially thought that this could be a simple variable tracked by the worker itself. Thanks for persisting this in Redis; this is also useful information to have.

Contributor (author) reply:
👍
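
For reference, persisting that value on the worker's Redis hash could look roughly like this method sketch on the Worker class (not necessarily the merged code):

def set_current_job_working_time(self, current_job_working_time, pipeline=None):
    """Keep the elapsed execution time both on the instance and in Redis."""
    self.current_job_working_time = current_job_working_time
    connection = pipeline if pipeline is not None else self.connection
    connection.hset(self.key, 'current_job_working_time', current_job_working_time)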

@rauchy (Contributor, author) commented Apr 20, 2021

@selwin <GentlePing />

@selwin (Collaborator) commented Apr 20, 2021

Sorry I missed this. Thanks!

@michaelbrooks commented:
Turns out this change broke the public interface for Workers when it removed the heartbeat_ttl argument from Worker.prepare_job_execution. michaelbrooks/rq-win#11
