Succeeded jobs mysteriously moved to FailedJobRegistry #1507

Closed
waldner opened this issue Jul 4, 2021 · 29 comments

Comments

@waldner
Contributor

waldner commented Jul 4, 2021

Once in a while, jobs that completed successfully get moved to the FailedJobRegistry.
The job terminates correctly, as shown in the logs:

16:44:39 default: Job OK (5a33fdd2-54bf-466a-9c6c-5aea8c37be76)
16:44:39 Result is kept for 600 seconds

But then after a while I see the job has been moved to FailedJobRegistry. Looking at the queues with rq-dashboard, I see this terse message:

Moved to FailedJobRegistry at 2021-07-04 16:46:12.479190

But nothing else (I don't even know where rq-dashboard gets that message from), and no other information in the logs. As I said, this happens only for a minority of jobs, but it does happen.
If it helps, it happened with both rq 1.8.1 and rq 1.9.0.
Could it be a failure in the rq<->redis communication, so that the successful termination of the job isn't properly written to Redis? Looking at the rq code, I see that the move happens in StartedJobRegistry.cleanup() in rq/registry.py. From what I understand, rq thinks the job is "expired" (based on its Redis score) and so moves it to the failed registry.
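
As a rough illustration of what I understand that expiry logic to be (this is not rq's actual code; the helper and key names here are made up for the sketch), the sorted-set scores act as per-job deadlines and anything past its deadline gets moved into the failed registry:

# Illustrative sketch only, not rq's implementation: members of the
# StartedJobRegistry sorted set are job ids, their scores are expiry
# timestamps, and any member whose score is older than "now" is treated
# as dead and moved to the failed registry's sorted set.
import time
from redis import Redis

def move_expired_to_failed(conn: Redis, started_key: str, failed_key: str) -> None:
    now = time.time()
    expired_job_ids = conn.zrangebyscore(started_key, 0, now)
    if not expired_job_ids:
        return
    pipe = conn.pipeline()
    for job_id in expired_job_ids:
        # A far-future score stands in for the failure TTL here.
        pipe.zadd(failed_key, {job_id: now + 365 * 24 * 3600})
    pipe.zremrangebyscore(started_key, 0, now)
    pipe.execute()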

@waldner changed the title from "Succeeded jobs mysteriously moved to FaildJobRegistry" to "Succeeded jobs mysteriously moved to FailedJobRegistry" on Jul 4, 2021
@adamsenn

We are also regularly seeing this issue. We are on rq 1.9, using it with:

  • Django 3.2
  • django-rq 2.4, and
  • django-rq-scheduler@79a6c7a26b482302f3ee73467b97572f0c31effa.

We have a similar setup in multiple other projects, but have only noticed this issue in 2 of them so far. Unfortunately, we have not been able to nail down a reproducible case yet. While it appears your failure occurred ~90 sec after the success, ours consistently fall in the 6-7 min range (3-4 min jobs with a 900 sec timeout, ttl = None), FWIW.

@adamsenn

adamsenn commented Aug 9, 2021

We are seeing this very frequently now in a new project we are working on. I strongly suspect the issue is somewhere in https://github.com/rq/rq/blob/master/rq/worker.py#L982-L1006. As @waldner said, the StartedJobRegistry's .cleanup() function seems to be finding our successfully finished jobs still inside it. Using django-rq, we have seen finished jobs linger in the StartedJobRegistry for a time, until they are eventually expired into the FailedJobRegistry:

[screenshot: finished jobs shown lingering in the StartedJobRegistry]

Apparently, worker.py line 996 successfully sets the job's status to JobStatus.FINISHED, but the started_job_registry.remove() call isn't successful. Perhaps that has something to do with the fact that we are using custom job ids?

https://github.com/rq/rq/blob/master/rq/registry.py#L73-L75

We've tried to track this further, but unfortunately we are not Redis experts and don't fully understand everything rq is doing there. Hopefully this is useful information!
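
For what it's worth, my (possibly naive) reading is that the removal in question boils down to a single sorted-set deletion, something like this simplified sketch (not the actual method linked above):

# Simplified sketch of what the linked remove() amounts to: a ZREM of the
# job id from the registry's sorted set. ZREM returns the number of members
# actually removed, so 1 means the id was present and is now gone, while 0
# means it was not found (or had already been removed).
from redis import Redis

def remove_from_registry(conn: Redis, registry_key: str, job_id: str) -> int:
    return conn.zrem(registry_key, job_id)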

@waldner
Contributor Author

waldner commented Aug 9, 2021

FWIW, I'm not using custom job IDs. Also, the problem shows up in a queue with a lot of job churn, which can peak at 180-200 jobs at a time in the finished job registry. Other, less active queues don't have the problem, HTH.

@selwin
Collaborator

selwin commented Aug 21, 2021

@waldner @adamsenn sorry for the late reply; I don't think it has anything to do with custom job IDs. I'll try to look into this, but it's not easy because I don't have jobs that are incorrectly moved to FailedJobRegistry.

Are you able to spot trends in the jobs that are incorrectly failed? For example, do they take a long time to finish, or something like that?

@selwin
Collaborator

selwin commented Aug 21, 2021

One more thing: are you both able to verify that these jobs complete successfully? If they do, I can narrow down the scope of my search.

@selwin
Collaborator

selwin commented Aug 21, 2021

I added some logging statements in #1544. Do you mind running RQ with debug logging turned on so we can see what's happening?

@waldner
Contributor Author

waldner commented Aug 21, 2021

For my part, I can confirm that the jobs are successful (or at least the logs say so), as I showed in my original message. I'll apply the debug patch and let you know what I see. Thanks!

@waldner
Contributor Author

waldner commented Aug 21, 2021

Ok, here is the log for one of those successful jobs which are moved to FailedJobRegistry:

19:19:53 default: myjob(foo=20107, bar='XXXX') (95995224-58fd-4e9a-af06-4cae9589e781)
... (removed) messages from the job doing its thing....
19:21:54 Handling successful execution of job 95995224-58fd-4e9a-af06-4cae9589e781
19:21:54 Setting job 95995224-58fd-4e9a-af06-4cae9589e781 status to finished
19:21:54 Removing job 95995224-58fd-4e9a-af06-4cae9589e781 from StartedJobRegistry
19:21:54 Finished handling successful execution of job 95995224-58fd-4e9a-af06-4cae9589e781
19:21:54 default: Job OK (95995224-58fd-4e9a-af06-4cae9589e781)

Yet it's moved to FailedJobRegistry later:

Moved to FailedJobRegistry at 2021-08-21 17:23:26.722741

I notice now that the timestamp above seems to be in UTC (it's really 19:23:26.722741). Yet the workers have their timezone set correctly, as the timestamps in the log messages show (and running date in a shell inside the worker container confirms it too). I don't know whether it's related, but it caught my attention.

@adamsenn

I can also confirm that the jobs definitely succeed first, as we have multiple indicators (including logs) all showing success.

@selwin
Collaborator

selwin commented Aug 22, 2021

Ok, this is super weird: from your logs we can clearly see that the job was removed from StartedJobRegistry, and yet it was Moved to FailedJobRegistry about 90 seconds later.

We know that the job has to be moved to FailedJobRegistry from StartedJobRegistry because the text Moved to FailedJobRegistry only appears in StartedJobRegistry.cleanup().

Something is definitely weird here. May I know what redis-server version each of you is running, @waldner @adamsenn?

@adamsenn

We are using AWS ElastiCache Redis 5.0.6 with version 3.5.3 of the Python redis client.

@waldner
Contributor Author

waldner commented Aug 22, 2021

redis 6.0.5-alpine in a Docker container, python-redis 3.5.3.

@selwin
Collaborator

selwin commented Aug 22, 2021

After reading the code multiple times, I think the most probable explanation is that the pipeline.execute() call here fails silently.

@waldner can you check if jobs that were erroneously added to FailedJobRegistry have job.ended_at value set? This value should be persisted to Redis in the same pipeline call.
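
If it helps, a quick way to check (just a sketch; the id below is the one from your log above, so substitute one of the affected jobs):

# Quick check sketch: fetch a job by id and inspect its timestamps.
from redis import Redis
from rq.job import Job

conn = Redis()
job = Job.fetch('95995224-58fd-4e9a-af06-4cae9589e781', connection=conn)
print(job.ended_at, job.started_at, job.created_at, job.enqueued_at)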

@waldner
Contributor Author

waldner commented Aug 22, 2021

Yes, they do indeed have ended_at set, along with started_at, created_at, and enqueued_at. Again, I see that in Redis these are all GMT times.

@selwin
Collaborator

selwin commented Aug 23, 2021

Again, I see that in redis these are all GMT times.

@waldner this is by design. Because the queue and workers may be hosted on different servers, it makes sense to store all timestamps internally in UTC. They can then be converted to local time when displayed to humans.
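
For example (a trivial sketch of that conversion, using the timestamp from your comment above):

# Converting one of the stored UTC timestamps to local time for display.
from datetime import datetime, timezone

utc_ts = datetime(2021, 8, 21, 17, 23, 26, 722741, tzinfo=timezone.utc)
print(utc_ts.astimezone())  # 2021-08-21 19:23:26.722741+02:00 if the local zone is CEST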

I added some more debug logging statements in this branch: https://github.com/rq/rq/tree/logging-additions. Can you run this commit and see what it shows us? This is the diff of the commit: c0dd5f7

It logs two things:

  1. The result of pipeline.execute() when the key is supposedly removed from Redis
  2. The StartedJobRegistry cleanup process, when it finds keys that have expired and moves them to FailedJobRegistry

@waldner
Contributor Author

waldner commented Aug 23, 2021

Ok, here goes:

13:33:10 default: myjob(foo=20107, bar='XXXX') (e1e813a3-bf6a-4585-8a01-50bb8841a188)
... (removed) messages from the job doing its thing....
13:34:10 Handling successful execution of job e1e813a3-bf6a-4585-8a01-50bb8841a188
13:34:10 Setting job e1e813a3-bf6a-4585-8a01-50bb8841a188 status to finished
13:34:10 Removing job e1e813a3-bf6a-4585-8a01-50bb8841a188 from StartedJobRegistry
13:34:10 Finished handling successful execution of job e1e813a3-bf6a-4585-8a01-50bb8841a188. Redis returns [0, 1, 113, 4325.734237, 0, 0, 1, True, False, False, 1]
13:34:10 default: Job OK (e1e813a3-bf6a-4585-8a01-50bb8841a188)
13:34:10 Result is kept for 600 seconds

Here are the relevant lines (Redis returns) for other jobs moved to FailedJobRegistry:

12:46:04 Finished handling successful execution of job 2c67d796-9169-4b6f-ac9e-0902b378bb4e. Redis returns [0, 1, 24, 1476.257723, 0, 0, 1, True, False, False, 1]
13:51:44 Finished handling successful execution of job f6a3beeb-2316-4899-a011-a66c56bb6783. Redis returns [0, 1, 72, 5391.487906, 0, 0, 1, True, False, False, 1]
13:57:57 Finished handling successful execution of job 05f104a5-1c91-4edf-9836-5ab04f53a302. Redis returns [0, 1, 65, 5769.783253, 0, 0, 1, True, False, False, 1]
13:59:44 Finished handling successful execution of job 34d2af4c-ab10-414a-a02c-a5b409aa9a90. Redis returns [0, 1, 129, 5852.962609, 0, 0, 1, True, False, False, 1]
13:57:03 Finished handling successful execution of job 45ab43f8-a9aa-4cac-8e78-56c461d017fd. Redis returns [0, 1, 73, 5709.280607, 0, 0, 1, True, False, False, 1]

And here's the same line for some other random jobs not moved to FailedJobRegistry, if it helps:

12:31:36 Finished handling successful execution of job 913a7401-2d2d-4b95-83c1-77d238b8e692. Redis returns [0, 1, 14, 614.955188, 1, 1, 0]
12:34:36 Finished handling successful execution of job 343b69ed-6948-425d-987d-6aa25f3a06d1. Redis returns [0, 1, 15, 794.284134, 1, 1, 0]
12:37:03 Finished handling successful execution of job 4236f401-60a7-4bb4-ab3b-ae931bcceab3. Redis returns [0, 1, 13, 941.835423, 0, 0, 1, True, False, False, 1]
12:37:47 Finished handling successful execution of job af14ca24-2c7a-4ebd-8957-7d2f4fdfdf45. Redis returns [0, 1, 15, 985.826952, 0, 0, 1, True, False, False, 1]
12:38:40 Finished handling successful execution of job 7a1ba6c9-b7be-4785-ad01-d1643536e55e. Redis returns [0, 1, 17, 1036.490723, 0, 0, 1, True, False, False, 1]
13:22:06 Finished handling successful execution of job 3cd5638f-c8aa-4228-a805-3017eeafca8e. Redis returns [0, 1, 96, 3609.265077, 0, 0, 1, True, False, False, 1]
13:46:37 Finished handling successful execution of job efaf0e84-adfa-40e6-a156-e166bd4ed5f9. Redis returns [0, 1, 128, 5065.963707, 0, 0, 1, True, False, False, 1]
13:47:05 Finished handling successful execution of job bdb434aa-21c3-4b16-b3fd-96fc266ee302. Redis returns [0, 1, 124, 5096.179208, 0, 0, 1, True, False, False, 1]
13:57:21 Finished handling successful execution of job 78318706-453d-40b1-874c-fe6d78be8bfc. Redis returns [0, 1, 127, 5710.579746, 0, 0, 1, True, False, False, 1]
13:57:44 Finished handling successful execution of job 82f668a5-1fc0-4760-aa72-3a9f7e541f49. Redis returns [0, 1, 128, 5732.900007, 1, 1, 0]

I see some instances return a different number of values from others, if that's relevant; in some cases the last 4 values are missing.

Strangely, I don't see any instance of the registry.py log messages (Moving job(s) to FailedJobRegistry and Cleaning up StartedJobRegistry for queue). Do I have to do anything special to enable them?
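
For reference, this is roughly what I would try on my side (assuming the new registry messages go through a logger under the "rq" namespace, which is just a guess on my part):

# Assumption: the new registry.py messages are emitted through a logger
# under the "rq" namespace. Raise that whole hierarchy to DEBUG.
import logging

logging.basicConfig(format='%(asctime)s %(name)s %(levelname)s %(message)s')
logging.getLogger('rq').setLevel(logging.DEBUG)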

@selwin
Collaborator

selwin commented Aug 23, 2021

Ok, let's take a look at this line:

12:46:04 Finished handling successful execution of job 2c67d796-9169-4b6f-ac9e-0902b378bb4e. Redis returns [0, 1, 24, 1476.257723, 0, 0, 1, True, False, False, 1]

The last result, 1, is the result of the ZREM call initiated by started_job_registry.remove(). 1 means Redis successfully removed one member from the sorted set, which means the key has been removed from StartedJobRegistry.

Since the key has already been removed from StartedJobRegistry, cleanup() shouldn't find it there to move to FailedJobRegistry. The fact that it is still there means it gets reintroduced later, which could be due to a race condition.
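
To make the shape of that return value concrete, here's a minimal illustration (not the worker's actual command sequence; the registry key name is a placeholder): pipeline.execute() returns one result per queued command, in order, so the trailing 1 above lines up with the ZREM.

# Minimal illustration with placeholder keys: execute() returns one result
# per queued command, in order, so a trailing 1 is the ZREM's count of
# removed members.
from redis import Redis

conn = Redis()
pipe = conn.pipeline()
pipe.hset('rq:job:example', 'status', 'finished')  # number of new hash fields (0 or 1)
pipe.expire('rq:job:example', 600)                 # True if the TTL was set
pipe.zrem('started-registry-key', 'example')       # 1 if the member was removed, 0 if absent
print(pipe.execute())                              # e.g. [0, True, 1]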

I added an extra commit to check whether my suspicion is correct. Could you try my latest patch? If it works, I'll work on a more permanent fix.

@waldner
Contributor Author

waldner commented Aug 23, 2021

So this has been running for a few hours now, and so far I see exactly zero successful jobs moved to FailedJobRegistry (earlier there would already have been some). I'll leave it running overnight and see. Great work! Thanks!

@selwin
Collaborator

selwin commented Aug 24, 2021 via email

@selwin
Collaborator

selwin commented Aug 24, 2021

@waldner this branch contains the permanent fix. Could you please test this out and see whether it works for you?

@waldner
Contributor Author

waldner commented Aug 24, 2021

Yes! It ran all night and no jobs were moved to FailedJobRegistry. I'd say the problem is fixed. I'll test the new branch and let you know.

@waldner
Contributor Author

waldner commented Aug 24, 2021

I'm sorry, but I have to report that the fix-started-job-registry-race-condition branch reintroduces the problem of successful jobs being moved to FailedJobRegistry. I've been running it for about two hours and have already seen 4 such jobs. This seems to happen at a lower rate than before, but it does happen.
Furthermore, job parallelism now seems to be greatly reduced for some reason. For this particular task I have 40 workers set aside, and I used to see all 40 running at the same time (judging by how many jobs were in the StartedJobRegistry). Now the number fluctuates, but it never gets past 30 or 32 (and is often lower, around 20-25).

@selwin
Collaborator

selwin commented Aug 24, 2021

Ok, in that case, can you try running this fix? I'm trying to do this without incurring an additional Redis command. If this also fails, we can fall back to the quick fix.
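
One way to avoid reintroducing a removed key without an extra round trip (sketch only, not necessarily the exact approach in the branch) is ZADD's XX flag, which only updates members that already exist:

# Illustration only: refreshing a job's expiry with ZADD ... XX will not
# resurrect a member that has already been ZREM'd by the success handler,
# because XX never inserts new members, it only updates existing ones.
import time
from redis import Redis

def refresh_started_ttl(conn: Redis, started_key: str, job_id: str, ttl: int) -> None:
    conn.zadd(started_key, {job_id: time.time() + ttl}, xx=True)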

@waldner
Contributor Author

waldner commented Aug 24, 2021

This is working! Parallelism is restored and no jobs are moved to FailedJobRegistry. Thanks! Will you include the fix in a release?

@selwin
Collaborator

selwin commented Aug 25, 2021 via email

@adamsenn

Great work @selwin and @waldner!

@waldner
Contributor Author

waldner commented Aug 25, 2021

@adamsenn I hope it solves the issue for you too!

@waldner
Contributor Author

waldner commented Sep 11, 2021

1.10.0 definitely solves this for me. Closing.

@waldner closed this as completed on Sep 11, 2021
igorcalabria added a commit to inloco/redash that referenced this issue Jul 25, 2022
The idea is to port the changes from the current version of rq's monitor_work_horse to the version introduced by redash. This is a classic example of why overriding a method you don't own is usually a bad idea: you miss out on upstream fixes. Another point is that I'm not even sure this custom method is necessary with the version of rq we're using. Maybe we should investigate it more.
@rfrancois

I just tried what is written in the documentation (https://python-rq.org/docs/#job-callbacks) with v1.11.1, and I get the same error you described.
