Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added job heartbeat to track whether job is actually executing #1349

Merged
merged 9 commits into from Oct 26, 2020

Conversation

theambient
Copy link
Contributor

heartbeat might be needed in cases when worker was hardkilled or the whole VM/docker was forcibly rebooted.

Ruslan Mullakhmetov added 2 commits September 28, 2020 11:14
heartbeat might be needed in cases when worker was hardkilled or the whole VM/docker was forcibly rebooted.
@codecov
Copy link

codecov bot commented Sep 28, 2020

Codecov Report

Merging #1349 into master will decrease coverage by 0.07%.
The diff coverage is 90.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1349      +/-   ##
==========================================
- Coverage   94.98%   94.90%   -0.08%     
==========================================
  Files          41       43       +2     
  Lines        5599     5728     +129     
==========================================
+ Hits         5318     5436     +118     
- Misses        281      292      +11     
Impacted Files Coverage Δ
rq/worker.py 87.77% <66.66%> (+0.49%) ⬆️
rq/job.py 97.81% <100.00%> (+0.02%) ⬆️
rq/utils.py 92.96% <100.00%> (ø)
rq/version.py 100.00% <100.00%> (ø)
tests/test_job.py 99.67% <100.00%> (+<0.01%) ⬆️
tests/test_worker.py 97.34% <100.00%> (+0.03%) ⬆️
tests/fixtures.py 64.60% <0.00%> (-2.71%) ⬇️
rq/registry.py 96.95% <0.00%> (-1.83%) ⬇️
rq/cli/helpers.py 86.36% <0.00%> (ø)
rq/command.py 100.00% <0.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6f153e...c39b7da. Read the comment docs.

rq/worker.py Outdated
@@ -729,6 +730,8 @@ def monitor_work_horse(self, job, queue):
self.wait_for_horse()
break

job.set_heartbeat(utcnow())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to job and worker heartbeat should be pipelined.

rq/job.py Outdated
@@ -351,6 +351,7 @@ def __init__(self, id=None, connection=None, serializer=None):
# retry_intervals is a list of int e.g [60, 120, 240]
self.retry_intervals = None
self.redis_server_version = None
self.heartbeat = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be renamed to last_heartbeat

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

job.last_heartbeat should also be added to the docs here: https://github.com/rq/rq/blob/master/docs/docs/jobs.md

rq/job.py Outdated
@@ -384,6 +385,21 @@ def set_id(self, value):
raise TypeError('id must be a string, not {0}'.format(type(value)))
self._id = value

def set_heartbeat(self, heartbeat, pipeline=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also rename this to heartbeat() so that it's consistent with worker.heartbeat()?

rq/job.py Outdated
Comment on lines 393 to 401
def get_heartbeat(self, refresh=True):
if refresh:
raw = self.connection.hget(self.key, 'heartbeat')
if raw:
self.last_heartbeat = str_to_date(raw)
else:
self.last_heartbeat = None

return self.last_heartbeat
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this method, it's unnecessary. When we fetch job from Redis, you can already call job.last_heartbeat. If you want to refresh job metadata from, call job.refresh().

rq/job.py Outdated
@@ -530,6 +547,7 @@ def to_dict(self, include_meta=True):
'data': zlib.compress(self.data),
'started_at': utcformat(self.started_at) if self.started_at else '',
'ended_at': utcformat(self.ended_at) if self.ended_at else '',
'heartbeat': utcformat(self.last_heartbeat) if self.last_heartbeat else '',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind changing this to last_heartbeat to keep things consistent?

rq/worker.py Outdated
@@ -713,6 +713,7 @@ def monitor_work_horse(self, job, queue):

ret_val = None
job.started_at = utcnow()
job.heartbeat(job.started_at)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first heartbeat should be located in prepare_job_execution() so it can be pipelined with job status changes etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed since the first heartbeat is already set in prepare_job_execution().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to remove this line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, done, sorry for being messy

@theambient
Copy link
Contributor Author

done

@@ -121,6 +121,7 @@ Some interesting job attributes include:
* `job.started_at`
* `job.ended_at`
* `job.exc_info` stores exception information if job doesn't finish successfully.
* `job.get_heartbeat()` returns last heartbeat of the job indicating last time the job was executing. Can be used to determine if a worker was killed forcely and to mark the job as `failed`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated to job.last_heartbeat - the latest timestamp that's periodically updated when the job is executing. Can be used to determine if the job is still active.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@selwin selwin merged commit ed264f0 into rq:master Oct 26, 2020
@selwin
Copy link
Collaborator

selwin commented Oct 26, 2020

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants