
Cleanup zombie worker leftovers as part of StartedJobRegistry's cleanup() #1372

Merged (25 commits, Apr 20, 2021)

Conversation

@rauchy (Contributor) commented Nov 6, 2020

When workers die ungracefully, jobs without a timeout will remain in the StartedJobRegistry forever.
This PR cleans those up as part of the routine cleanup() call in StartedJobRegistry.

This should solve #1164

@codecov (bot) commented Nov 6, 2020

Codecov Report

Merging #1372 (acceaac) into master (dcbbd06) will decrease coverage by 0.09%.
The diff coverage is 94.11%.


@@            Coverage Diff             @@
##           master    #1372      +/-   ##
==========================================
- Coverage   95.08%   94.98%   -0.10%     
==========================================
  Files          44       44              
  Lines        6039     5939     -100     
==========================================
- Hits         5742     5641     -101     
- Misses        297      298       +1     
Impacted Files Coverage Δ
tests/test_worker.py 97.32% <ø> (-0.08%) ⬇️
rq/worker.py 88.67% <92.59%> (-0.07%) ⬇️
rq/job.py 97.80% <100.00%> (-0.11%) ⬇️
tests/test_job.py 100.00% <100.00%> (ø)
tests/test_decorator.py 89.60% <0.00%> (-1.12%) ⬇️
tests/__init__.py 81.39% <0.00%> (-0.83%) ⬇️
tests/fixtures.py 71.23% <0.00%> (-0.39%) ⬇️
rq/cli/cli.py 91.25% <0.00%> (-0.37%) ⬇️
rq/queue.py 93.47% <0.00%> (-0.24%) ⬇️
rq/scheduler.py 95.94% <0.00%> (-0.14%) ⬇️
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@selwin (Collaborator) commented Nov 8, 2020

Hey there, this is an interesting addition, but I think it needs a little bit of tweaking. This operation is potentially expensive to run on systems with a large number of workers running at the same time (hundreds or thousands), since it would verify that the workers of all jobs in StartedJobRegistry are still alive.

There are a few ways we can approach this.

Approach 1

  1. Add another method registry.get_job_ids_without_expiry() that returns jobs with infinite timeouts
  2. Verify that the workers are still alive

Approach 2

This PR adds job.last_heartbeat. As part of the heartbeat process, we can actually track when a job last reported for duty, find the ones that haven't reported as active in the last few minutes, and check whether their respective workers are still alive.

Note that I've also been working on adding job.worker_name in PR #1375, so we could use that here instead.
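
Roughly, Approach 2 could look like the sketch below. It assumes job.last_heartbeat from this PR and job.worker_name from #1375; the function name and the 90-second threshold are placeholders, not part of either PR.

# Sketch only: relies on job.last_heartbeat (this PR) and job.worker_name (#1375).
from datetime import datetime, timedelta

from rq import Worker
from rq.job import Job

STALE_AFTER = timedelta(seconds=90)  # placeholder threshold


def find_dead_jobs(registry, connection):
    """Jobs whose heartbeat went stale and whose worker is no longer alive."""
    alive_workers = {worker.name for worker in Worker.all(connection=connection)}
    dead = []
    for job_id in registry.get_job_ids():
        job = Job.fetch(job_id, connection=connection)
        heartbeat = job.last_heartbeat  # assumed attribute added by this PR
        stale = heartbeat is None or datetime.utcnow() - heartbeat > STALE_AFTER
        if stale and job.worker_name not in alive_workers:
            dead.append(job)
    return dead

The staleness filter means worker liveness only has to be checked for jobs that have actually stopped reporting.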

@rauchy (Contributor, author) commented Nov 8, 2020

Thanks @selwin. Approach 2 (+ job.worker_name) feels to me like the way to go. I just have a couple of questions:

  1. I'm assuming you meant "As part of the cleanup process" instead of "As part of the heartbeat process"?
  2. What would be a good interval to declare a job as "not actively worked on"? I'm thinking it's something like job_monitoring_interval + 60, but that one is worker configuration and kinda feels awkward to pass down to the job, don't you think?

@selwin (Collaborator) commented Nov 8, 2020

> I'm assuming you meant "As part of the cleanup process" instead of "As part of the heartbeat process"?

No, "as part of the heartbeat process" is correct, it's this line here. We store the timeout in a sorted set, and the cleanup process periodically check whether any job has "expired".

After we implement this, the cleanup process can rely on this sorted set as part of it's cleaning process (this is better than checking StartedJobRegistry for expired jobs). In most cases we'll be able to detect dead jobs sooner.

> What would be a good interval to declare a job as "not actively worked on"? I'm thinking it's something like job_monitoring_interval + 60, but that one is worker configuration and kinda feels awkward to pass down to the job, don't you think?

I think job_monitoring_interval + <some interval> is fine.

@rauchy (Contributor, author) commented Nov 8, 2020

Ok, we are talking about the same thing. I meant "look for out of date heartbeats as part of the cleanup process".

This does lead to my second question - how can we determine the right interval when running in the scope of the cleanup process? Passing down the monitoring interval from the worker to the cleanup method seems to be the easiest way, but makes the registry APIs uneven in terms of arguments.

@selwin (Collaborator) commented Nov 8, 2020

> This does lead to my second question - how can we determine the right interval when running in the scope of the cleanup process? Passing down the monitoring interval from the worker to the cleanup method seems to be the easiest way, but makes the registry APIs uneven in terms of arguments.

The expiry is set during the heartbeat process to now + 60 seconds. The cleanup process simply looks for heartbeats whose score is lower than the current timestamp and moves those jobs to FailedJobRegistry.
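
In Redis terms this is just a sorted set scored by expiry time. A bare-bones sketch of the mechanics (the key and function names are made up for illustration, not what rq actually uses):

import time

from redis import Redis

redis = Redis()
HEARTBEATS_KEY = 'rq:heartbeats'  # illustrative key name


def record_heartbeat(job_id, ttl=60):
    """Mark the job as alive until now + ttl seconds."""
    redis.zadd(HEARTBEATS_KEY, {job_id: time.time() + ttl})


def expired_job_ids():
    """Any score below the current timestamp means the job stopped reporting."""
    return [job_id.decode() for job_id in
            redis.zrangebyscore(HEARTBEATS_KEY, 0, time.time())]

The cleanup call would then move each of those job IDs to FailedJobRegistry.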

@rauchy (Contributor, author) commented Nov 9, 2020

Makes sense @selwin. Now using a sorted set for heartbeats per registry.

@selwin (Collaborator) commented Nov 14, 2020

We can actually simplify this PR by quite a bit.

StartedJobRegistry already maps a job to its expiration time here. We just need to:

  1. Change the ttl argument from the job timeout to the heartbeat timeout.
  2. Extend that ttl on every job.heartbeat.

I think those are pretty much the changes we need.
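
Conceptually, then, each heartbeat ends up doing something like the sketch below (not the merged diff; the hash field name is illustrative):

import time
from datetime import datetime


def heartbeat(job, started_job_registry, ttl, pipeline):
    # Record the last heartbeat on the job hash (field name is illustrative).
    pipeline.hset(job.key, 'last_heartbeat', datetime.utcnow().isoformat())
    # Extend the job's expiry in StartedJobRegistry by `ttl` seconds, so the
    # registry's cleanup() only reaps jobs that stopped heartbeating.
    pipeline.zadd(started_job_registry.key, {job.id: time.time() + ttl})

With that in place, StartedJobRegistry needs no new data structure; the existing score simply stops meaning "job timeout" and starts meaning "heartbeat deadline".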

@rauchy (Contributor, author) commented Nov 15, 2020

Oh, that makes total sense. I was working under the false assumption that the old non-heartbeat scores had to live side by side with the heartbeats, but heartbeats make a better indicator of a job's WIP status. Can you have another look?

rq/worker.py (outdated)

         self.kill_horse()
         self.wait_for_horse()
         break

     with self.connection.pipeline() as pipeline:
-        self.heartbeat(self.job_monitoring_interval + 60, pipeline=pipeline)
+        self.heartbeat(heartbeat_ttl, pipeline=pipeline)
         queue.started_job_registry.add(job, heartbeat_ttl, pipeline=pipeline)
Collaborator review comment:
Can we move this to job.heartbeat()? I think it makes more sense that way.

@selwin (Collaborator) commented Nov 21, 2020

Two more things that we need to take into account:

  1. The heartbeat should expire no later than when the job is supposed to time out (plus some buffer time). For example, if job.timeout is 65 seconds and job_monitoring_interval is 30 seconds (the job is checked every 30 seconds), the heartbeat TTLs that are sent should be:
  • 90 (30 + 60) at the beginning. 30 is the job monitoring interval, 60 is the buffer time.
  • 90 (30 + 60) at the 30th second.
  • 65 (5 + 60) at the 60th second. 5 is the remaining job execution time.
  2. SimpleWorker doesn't have a main worker thread that monitors the horse's execution, so we can't use the variable TTLs sent by the regular Worker class.

To solve the two issues, I think we should add a worker.get_heartbeat_ttl(job, elapsed_execution_time) that gets called whenever a job heartbeat is about to be sent. This way, we can:

  1. Easily test the job heartbeat TTL calculation.
  2. Let SimpleWorker override this method to always return TTLs as per the original implementation.
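
To make the numbers above concrete, here is a sketch of that proposal (class names are placeholders, and the merged implementation may differ):

from rq import SimpleWorker, Worker
from rq.defaults import DEFAULT_WORKER_TTL


class HeartbeatWorker(Worker):
    def get_heartbeat_ttl(self, job, elapsed_execution_time):
        # Monitoring interval plus a 60s buffer, capped so the heartbeat never
        # outlives the job itself: 90, 90, 65 for the example above.
        if job.timeout and job.timeout != -1:
            remaining = job.timeout - elapsed_execution_time
            return int(min(remaining, self.job_monitoring_interval)) + 60
        return self.job_monitoring_interval + 60


class HeartbeatSimpleWorker(SimpleWorker):
    def get_heartbeat_ttl(self, job, elapsed_execution_time):
        # SimpleWorker has no monitoring thread, so keep a fixed TTL,
        # as in the original implementation.
        return job.timeout or DEFAULT_WORKER_TTL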

rq/worker.py (outdated)
@@ -737,6 +737,10 @@ def fork_work_horse(self, job, queue):
         self._horse_pid = child_pid
         self.procline('Forked {0} at {1}'.format(child_pid, time.time()))

+    def get_heartbeat_ttl(self, job, elapsed_execution_time):
Collaborator review comment:
Instead of passing elapsed_execution_time around, I think it makes more sense for the worker to keep track of when job execution started. Can we make this change?

@selwin (Collaborator) commented Jan 26, 2021

@rauchy sorry it took so long, I missed this PR. The approach is already good, but I made a comment here. It would be good if we can get this addressed.

This is a PR I'd like to pull in.

@rauchy (Contributor, author) commented Mar 16, 2021

@selwin whoops, also sorry it took me so long 🙊

Is this what you had in mind regarding tracking worker execution time?


        with self.connection.pipeline() as pipeline:
            self.set_state(WorkerStatus.BUSY, pipeline=pipeline)
            self.set_current_job_id(job.id, pipeline=pipeline)
            self.set_current_job_working_time(0, pipeline=pipeline)
Collaborator review comment:
I think this needs to be moved to monitor_work_horse(), after the job has finished, so it gets set back to zero once job execution finishes.

Contributor (author) reply:

Good point. I added it to monitor_work_horse, but I also kept it here in case a work horse dies (due to an OSError for example) without resetting it. WDYT?

@@ -374,6 +375,11 @@ def _get_state(self):

     state = property(_get_state, _set_state)

+    def set_current_job_working_time(self, current_job_working_time, pipeline=None):
Collaborator review comment:
I initially thought that this could be a simple variable tracked by the worker itself. Thanks for persisting this in Redis; this is also useful information to have.

Contributor (author) reply:
👍
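
For reference, persisting that value on the worker's Redis hash could look roughly like this method sketch on the Worker class (not necessarily the merged code):

def set_current_job_working_time(self, current_job_working_time, pipeline=None):
    """Keep the elapsed execution time both on the instance and in Redis."""
    self.current_job_working_time = current_job_working_time
    connection = pipeline if pipeline is not None else self.connection
    connection.hset(self.key, 'current_job_working_time', current_job_working_time)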

@rauchy (Contributor, author) commented Apr 20, 2021

@selwin <GentlePing />

@selwin (Collaborator) commented Apr 20, 2021

Sorry I missed this. Thanks!

@michaelbrooks commented:
Turns out this change broke the public interface for Workers when it removed the heartbeat_ttl argument from Worker.prepare_job_execution. michaelbrooks/rq-win#11
