feat: added job heartbeat to track whether job is actually executing #1349

theambient · 2020-09-28T09:18:38Z

A heartbeat may be needed in cases where the worker was hard-killed or the whole VM/Docker container was forcibly rebooted.

codecov · 2020-09-28T10:04:51Z

Codecov Report

Merging #1349 into master will decrease coverage by 0.07%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##           master    #1349      +/-   ##
==========================================
- Coverage   94.98%   94.90%   -0.08%     
==========================================
  Files          41       43       +2     
  Lines        5599     5728     +129     
==========================================
+ Hits         5318     5436     +118     
- Misses        281      292      +11

| Impacted Files | Coverage Δ | |
|---|---|---|
| rq/worker.py | `87.77% <66.66%> (+0.49%)` | ⬆️ |
| rq/job.py | `97.81% <100.00%> (+0.02%)` | ⬆️ |
| rq/utils.py | `92.96% <100.00%> (ø)` | |
| rq/version.py | `100.00% <100.00%> (ø)` | |
| tests/test_job.py | `99.67% <100.00%> (+<0.01%)` | ⬆️ |
| tests/test_worker.py | `97.34% <100.00%> (+0.03%)` | ⬆️ |
| tests/fixtures.py | `64.60% <0.00%> (-2.71%)` | ⬇️ |
| rq/registry.py | `96.95% <0.00%> (-1.83%)` | ⬇️ |
| rq/cli/helpers.py | `86.36% <0.00%> (ø)` | |
| rq/command.py | `100.00% <0.00%> (ø)` | |

... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6f153e...c39b7da. Read the comment docs.

selwin · 2020-10-18T09:22:33Z

rq/worker.py

@@ -729,6 +730,8 @@ def monitor_work_horse(self, job, queue):
                    self.wait_for_horse()
                    break

+                job.set_heartbeat(utcnow())


The calls to the job heartbeat and the worker heartbeat should be pipelined.
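
One way to read this suggestion is to queue both heartbeat writes on a single pipeline and flush them together, so they cost one round trip instead of two. The sketch below is only an illustration, not the code merged in this PR: `RecordingPipeline`, `send_heartbeats`, and the key names are hypothetical stand-ins, and only the `hset`/`execute` method names mirror the redis-py pipeline interface.

```python
from datetime import datetime, timezone


class RecordingPipeline:
    """Hypothetical stand-in for a redis-py pipeline: buffers commands until execute()."""

    def __init__(self):
        self.buffered = []
        self.round_trips = 0

    def hset(self, name, key, value):
        # Nothing is sent yet; the command is only queued.
        self.buffered.append(('HSET', name, key, value))
        return self

    def execute(self):
        # All queued commands are flushed in a single round trip.
        self.round_trips += 1
        sent, self.buffered = self.buffered, []
        return sent


def send_heartbeats(pipeline, worker_key, job_key, now):
    """Queue the worker and job heartbeat writes, then flush them together."""
    stamp = now.isoformat()
    pipeline.hset(worker_key, 'last_heartbeat', stamp)
    pipeline.hset(job_key, 'last_heartbeat', stamp)
    return pipeline.execute()


# Illustrative key names; the real keys are built by RQ itself.
pipe = RecordingPipeline()
commands = send_heartbeats(
    pipe,
    'rq:worker:w1',
    'rq:job:j1',
    datetime(2020, 10, 18, 9, 22, 33, tzinfo=timezone.utc),
)
```

With a real connection, the same function would presumably receive the object returned by `connection.pipeline()` instead of the recording stand-in.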

selwin · 2020-10-18T11:53:27Z

rq/job.py

@@ -351,6 +351,7 @@ def __init__(self, id=None, connection=None, serializer=None):
        # retry_intervals is a list of int e.g [60, 120, 240]
        self.retry_intervals = None
        self.redis_server_version = None
+        self.heartbeat = None


This should be renamed to last_heartbeat

job.last_heartbeat should also be added to the docs here: https://github.com/rq/rq/blob/master/docs/docs/jobs.md

selwin · 2020-10-18T12:08:46Z

rq/job.py

@@ -384,6 +385,21 @@ def set_id(self, value):
            raise TypeError('id must be a string, not {0}'.format(type(value)))
        self._id = value

+    def set_heartbeat(self, heartbeat, pipeline=None):


Can we also rename this to heartbeat() so that it's consistent with worker.heartbeat()?

selwin · 2020-10-19T23:46:21Z

rq/job.py

+    def get_heartbeat(self, refresh=True):
+        if refresh:
+            raw = self.connection.hget(self.key, 'heartbeat')
+            if raw:
+                self.last_heartbeat = str_to_date(raw)
+            else:
+                self.last_heartbeat = None
+
+        return self.last_heartbeat


I think we can remove this method; it's unnecessary. Once a job has been fetched from Redis, job.last_heartbeat is already available. If you want to refresh the job's metadata from Redis, call job.refresh().

selwin · 2020-10-19T23:46:48Z

rq/job.py

@@ -530,6 +547,7 @@ def to_dict(self, include_meta=True):
            'data': zlib.compress(self.data),
            'started_at': utcformat(self.started_at) if self.started_at else '',
            'ended_at': utcformat(self.ended_at) if self.ended_at else '',
+            'heartbeat': utcformat(self.last_heartbeat) if self.last_heartbeat else '',


Mind changing this to last_heartbeat to keep things consistent?

selwin · 2020-10-19T23:52:07Z

rq/worker.py

@@ -713,6 +713,7 @@ def monitor_work_horse(self, job, queue):

        ret_val = None
        job.started_at = utcnow()
+        job.heartbeat(job.started_at)


The first heartbeat should be located in prepare_job_execution() so it can be pipelined with job status changes etc.

This is no longer needed since the first heartbeat is already set in prepare_job_execution().

Don't forget to remove this line

thx, done, sorry for being messy

theambient · 2020-10-20T08:35:51Z

done

selwin · 2020-10-22T00:22:34Z

docs/docs/jobs.md

@@ -121,6 +121,7 @@ Some interesting job attributes include:
 * `job.started_at`
 * `job.ended_at`
 * `job.exc_info` stores exception information if job doesn't finish successfully.
+* `job.get_heartbeat()` returns last heartbeat of the job indicating last time the job was executing. Can be used to determine if a worker was killed forcibly and to mark the job as `failed`.


This needs to be updated to describe job.last_heartbeat: the latest timestamp, periodically updated while the job is executing. It can be used to determine whether the job is still active.
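
As a rough illustration of that use, a monitoring script could compare `job.last_heartbeat` with the current time and treat a long silence as a sign that the worker died. The helper name `is_heartbeat_stale` and the 90-second threshold are assumptions made for this sketch; neither is part of RQ.

```python
from datetime import datetime, timedelta, timezone


def is_heartbeat_stale(last_heartbeat, now, max_silence):
    """Return True when no heartbeat arrived within max_silence before now.

    Both datetimes must be of the same kind (both naive or both aware),
    otherwise the subtraction raises TypeError.
    """
    if last_heartbeat is None:
        # Assumption: the caller only checks jobs that have already started,
        # so a missing heartbeat is treated as stale.
        return True
    return now - last_heartbeat > max_silence


now = datetime(2020, 10, 22, 0, 30, 0, tzinfo=timezone.utc)
threshold = timedelta(seconds=90)  # hypothetical tolerance, tune per deployment

fresh_result = is_heartbeat_stale(now - timedelta(seconds=30), now, threshold)
stale_result = is_heartbeat_stale(now - timedelta(minutes=10), now, threshold)
```

In practice the first argument would come from `job.last_heartbeat` on a fetched job, and the threshold should comfortably exceed the interval at which the worker refreshes the heartbeat.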

selwin · 2020-10-26T13:42:11Z

Thanks!

Ruslan Mullakhmetov added 2 commits September 28, 2020 11:14

feat: added job heartbeat to track whether job is actually executing

10a2182

heartbeat might be needed in cases when worker was hardkilled or the whole VM/docker was forcibly rebooted.

fixed tests

5c5a07b

fixed test coverage issue

2da1da6

selwin reviewed Oct 18, 2020

Ruslan Mullakhmetov added 3 commits October 19, 2020 10:32

chore: renamed job.heartbeat stuff according to review feedback

5bad37f

chore: pipelined worker heartbeat and job heartbeat

7564d67

docs: documented job.heartbeat property

10382a2

selwin reviewed Oct 19, 2020

fixes after review

c79844a

selwin reviewed Oct 22, 2020

Ruslan Mullakhmetov added 2 commits October 26, 2020 10:43

docs: updated last_heartbeat description

7901c8a

chore: review

c39b7da

selwin merged commit ed264f0 into rq:master Oct 26, 2020

selwin mentioned this pull request Nov 8, 2020

Cleanup zombie worker leftovers as part of StartedJobRegistry's cleanup() #1372

Merged

theambient deleted the feature/job-heartbeat branch May 19, 2021 09:01
