Succeeded jobs mysteriously moved to FailedJobRegistry #1507

Closed
waldner opened this issue Jul 4, 2021 · 29 comments

Comments

@waldner
Contributor

waldner commented Jul 4, 2021

Once in a while, jobs that completed successfully get moved to the FailedJobRegistry.
The job terminates correctly, as shown in the logs:

16:44:39 default: Job OK (5a33fdd2-54bf-466a-9c6c-5aea8c37be76)
16:44:39 Result is kept for 600 seconds

But then after a while I see the job has been moved to FailedJobRegistry. Looking at the queues with rq-dashboard, I see this terse message:

Moved to FailedJobRegistry at 2021-07-04 16:46:12.479190

But nothing else (I don't even know where rq-dashboard gets that message from), and no other information in the logs. As I said, this happens only for a minority of jobs, but it does happen.
If it helps, it happened with both rq 1.8.1 and rq 1.9.0.
Could it be a failure in the rq<->redis communication, so that the successful termination of the job isn't properly written to Redis? Looking at the rq code, I see that the move happens in StartedJobRegistry.cleanup() in rq/registry.py. From what I understand, rq thinks the job is "expired" (based on its Redis score) and so moves it to the failed registry.
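
As a rough illustration of what I understand that expiry logic to be (this is not rq's actual code; the helper and key names here are made up for the sketch), the sorted-set scores act as per-job deadlines and anything past its deadline gets moved into the failed registry:

# Illustrative sketch only, not rq's implementation: members of the
# StartedJobRegistry sorted set are job ids, their scores are expiry
# timestamps, and any member whose score is older than "now" is treated
# as dead and moved to the failed registry's sorted set.
import time
from redis import Redis

def move_expired_to_failed(conn: Redis, started_key: str, failed_key: str) -> None:
    now = time.time()
    expired_job_ids = conn.zrangebyscore(started_key, 0, now)
    if not expired_job_ids:
        return
    pipe = conn.pipeline()
    for job_id in expired_job_ids:
        # A far-future score stands in for the failure TTL here.
        pipe.zadd(failed_key, {job_id: now + 365 * 24 * 3600})
    pipe.zremrangebyscore(started_key, 0, now)
    pipe.execute()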

@waldner changed the title from "Succeeded jobs mysteriously moved to FaildJobRegistry" to "Succeeded jobs mysteriously moved to FailedJobRegistry" on Jul 4, 2021
@adamsenn

We are also regularly seeing this issue. We are on rq 1.9, using it with:

  • Django 3.2
  • django-rq 2.4, and
  • django-rq-scheduler@79a6c7a26b482302f3ee73467b97572f0c31effa.

We have a similar setup in multiple other projects, but have only noticed this issue in 2 of them so far. Unfortunately, we have not been able to nail down a reproducible case yet. While it appears your failure occurred ~90 sec after the success, ours consistently fall in the 6-7 min range (3-4 min jobs with a 900 sec timeout, ttl = None), FWIW.

@adamsenn

adamsenn commented Aug 9, 2021

We are seeing this very frequently now in a new project we are working on. I strongly suspect the issue is somewhere in https://github.com/rq/rq/blob/master/rq/worker.py#L982-L1006. As @waldner said, the StartedJobRegistry's .cleanup() function seems to be finding our successfully finished jobs still inside it. Using django-rq, we have seen finished jobs linger in the StartedJobRegistry for a time, until they are eventually expired into the FailedJobRegistry:

[screenshot: finished jobs shown lingering in the StartedJobRegistry]

Apparently, worker.py line 996 successfully sets the job's status to JobStatus.FINISHED, but the started_job_registry.remove() call isn't successful. Perhaps that has something to do with the fact that we are using custom job ids?

https://github.com/rq/rq/blob/master/rq/registry.py#L73-L75

We've tried to track this further, but unfortunately we are not Redis experts and don't fully understand everything rq is doing there. Hopefully this is useful information!
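
For what it's worth, my (possibly naive) reading is that the removal in question boils down to a single sorted-set deletion, something like this simplified sketch (not the actual method linked above):

# Simplified sketch of what the linked remove() amounts to: a ZREM of the
# job id from the registry's sorted set. ZREM returns the number of members
# actually removed, so 1 means the id was present and is now gone, while 0
# means it was not found (or had already been removed).
from redis import Redis

def remove_from_registry(conn: Redis, registry_key: str, job_id: str) -> int:
    return conn.zrem(registry_key, job_id)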

@waldner
Contributor Author

waldner commented Aug 9, 2021

FWIW, I'm not using custom job IDs. Also, the problem shows up in a queue with a lot of job churn, which can peak at 180-200 jobs at a time in the finished job registry. Other, less active queues don't have the problem, HTH.

@selwin
Collaborator

selwin commented Aug 21, 2021

@waldner @adamsenn sorry for the late reply; I don't think it has anything to do with custom job IDs. I'll try to look into this, but it's not easy because I don't have jobs that are incorrectly moved to FailedJobRegistry.

Are you able to spot trends in the jobs that are incorrectly failed? For example, do they take a long time to finish, or something like that?

@selwin
Collaborator

selwin commented Aug 21, 2021

One more thing: are you both able to verify that these jobs complete successfully? If they do, I can narrow down the scope of my search.

@selwin
Collaborator

selwin commented Aug 21, 2021

I added some logging statements in #1544. Do you mind running RQ with debug logging turned on so we can see what's happening?

@waldner
Contributor Author

waldner commented Aug 21, 2021

For my part, I can confirm that the jobs are successful (or at least the logs say so), as I showed in my original message. I'll apply the debug patch and let you know what I see. Thanks!

@waldner
Contributor Author

waldner commented Aug 21, 2021

Ok, here is the log for one of those successful jobs which are moved to FailedJobRegistry:

19:19:53 default: myjob(foo=20107, bar='XXXX') (95995224-58fd-4e9a-af06-4cae9589e781)
... (removed) messages from the job doing its thing....
19:21:54 Handling successful execution of job 95995224-58fd-4e9a-af06-4cae9589e781
19:21:54 Setting job 95995224-58fd-4e9a-af06-4cae9589e781 status to finished
19:21:54 Removing job 95995224-58fd-4e9a-af06-4cae9589e781 from StartedJobRegistry
19:21:54 Finished handling successful execution of job 95995224-58fd-4e9a-af06-4cae9589e781
19:21:54 default: Job OK (95995224-58fd-4e9a-af06-4cae9589e781)

Yet it's moved to FailedJobRegistry later:

Moved to FailedJobRegistry at 2021-08-21 17:23:26.722741

I notice now that the timestamp above seems to be in UTC (it's really 19:23:26.722741). Yet the workers have their timezone set correctly, as the timestamps in the log messages show (and running date in a shell inside the worker container confirms it too). I don't know whether it's related, but it caught my attention.

@adamsenn

I can also confirm that the jobs definitely succeed first, as we have multiple indicators (including logs) all showing success.

@selwin
Collaborator

selwin commented Aug 22, 2021

Ok, this is super weird: from your logs we can clearly see that the job was removed from StartedJobRegistry, and yet it was Moved to FailedJobRegistry about 90 seconds later.

We know that the job has to be moved to FailedJobRegistry from StartedJobRegistry because the text Moved to FailedJobRegistry only appears in StartedJobRegistry.cleanup().

Something is definitely weird here. May I know what redis-server version each of you is running, @waldner @adamsenn?

@adamsenn

We are using AWS ElastiCache Redis 5.0.6 with version 3.5.3 of the Python redis client.

@waldner
Contributor Author

waldner commented Aug 22, 2021

redis 6.0.5-alpine in a Docker container, python-redis 3.5.3.

@selwin
Collaborator

selwin commented Aug 22, 2021

After reading the code multiple times, I think the most probable explanation is that the pipeline.execute() call here fails silently.

@waldner can you check if jobs that were erroneously added to FailedJobRegistry have job.ended_at value set? This value should be persisted to Redis in the same pipeline call.
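
If it helps, a quick way to check (just a sketch; the id below is the one from your log above, so substitute one of the affected jobs):

# Quick check sketch: fetch a job by id and inspect its timestamps.
from redis import Redis
from rq.job import Job

conn = Redis()
job = Job.fetch('95995224-58fd-4e9a-af06-4cae9589e781', connection=conn)
print(job.ended_at, job.started_at, job.created_at, job.enqueued_at)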

@waldner
Contributor Author

waldner commented Aug 22, 2021

Yes, they do indeed have ended_at set, along with started_at, created_at, and enqueued_at. Again, I see that in Redis these are all GMT times.

@selwin
Collaborator

selwin commented Aug 23, 2021

Again, I see that in redis these are all GMT times.

@waldner this is by design. Because the queue and workers may be hosted on different servers, it makes sense to store all timestamps internally in UTC. They can then be converted to local time when displayed to humans.
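
For example (a trivial sketch of that conversion, using the timestamp from your comment above):

# Converting one of the stored UTC timestamps to local time for display.
from datetime import datetime, timezone

utc_ts = datetime(2021, 8, 21, 17, 23, 26, 722741, tzinfo=timezone.utc)
print(utc_ts.astimezone())  # 2021-08-21 19:23:26.722741+02:00 if the local zone is CEST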

I added some more debug logging statements in this branch: https://github.com/rq/rq/tree/logging-additions. Can you run this commit and see what it shows us? This is the diff of the commit: c0dd5f7

It logs two things:

  1. The result of pipeline.execute() when the key is supposedly removed from Redis
  2. The StartedJobRegistry cleanup process, when it finds keys that have expired and moves them to FailedJobRegistry

@waldner
Contributor Author

waldner commented Aug 23, 2021

Ok, here goes:

13:33:10 default: myjob(foo=20107, bar='XXXX') (e1e813a3-bf6a-4585-8a01-50bb8841a188)
... (removed) messages from the job doing its thing....
13:34:10 Handling successful execution of job e1e813a3-bf6a-4585-8a01-50bb8841a188
13:34:10 Setting job e1e813a3-bf6a-4585-8a01-50bb8841a188 status to finished
13:34:10 Removing job e1e813a3-bf6a-4585-8a01-50bb8841a188 from StartedJobRegistry
13:34:10 Finished handling successful execution of job e1e813a3-bf6a-4585-8a01-50bb8841a188. Redis returns [0, 1, 113, 4325.734237, 0, 0, 1, True, False, False, 1]
13:34:10 default: Job OK (e1e813a3-bf6a-4585-8a01-50bb8841a188)
13:34:10 Result is kept for 600 seconds

Here are the relevant lines (Redis returns) for other jobs moved to FailedJobRegistry:

12:46:04 Finished handling successful execution of job 2c67d796-9169-4b6f-ac9e-0902b378bb4e. Redis returns [0, 1, 24, 1476.257723, 0, 0, 1, True, False, False, 1]
13:51:44 Finished handling successful execution of job f6a3beeb-2316-4899-a011-a66c56bb6783. Redis returns [0, 1, 72, 5391.487906, 0, 0, 1, True, False, False, 1]
13:57:57 Finished handling successful execution of job 05f104a5-1c91-4edf-9836-5ab04f53a302. Redis returns [0, 1, 65, 5769.783253, 0, 0, 1, True, False, False, 1]
13:59:44 Finished handling successful execution of job 34d2af4c-ab10-414a-a02c-a5b409aa9a90. Redis returns [0, 1, 129, 5852.962609, 0, 0, 1, True, False, False, 1]
13:57:03 Finished handling successful execution of job 45ab43f8-a9aa-4cac-8e78-56c461d017fd. Redis returns [0, 1, 73, 5709.280607, 0, 0, 1, True, False, False, 1]

And here's the same line for some other random jobs not moved to FailedJobRegistry, if it helps:

12:31:36 Finished handling successful execution of job 913a7401-2d2d-4b95-83c1-77d238b8e692. Redis returns [0, 1, 14, 614.955188, 1, 1, 0]
12:34:36 Finished handling successful execution of job 343b69ed-6948-425d-987d-6aa25f3a06d1. Redis returns [0, 1, 15, 794.284134, 1, 1, 0]
12:37:03 Finished handling successful execution of job 4236f401-60a7-4bb4-ab3b-ae931bcceab3. Redis returns [0, 1, 13, 941.835423, 0, 0, 1, True, False, False, 1]
12:37:47 Finished handling successful execution of job af14ca24-2c7a-4ebd-8957-7d2f4fdfdf45. Redis returns [0, 1, 15, 985.826952, 0, 0, 1, True, False, False, 1]
12:38:40 Finished handling successful execution of job 7a1ba6c9-b7be-4785-ad01-d1643536e55e. Redis returns [0, 1, 17, 1036.490723, 0, 0, 1, True, False, False, 1]
13:22:06 Finished handling successful execution of job 3cd5638f-c8aa-4228-a805-3017eeafca8e. Redis returns [0, 1, 96, 3609.265077, 0, 0, 1, True, False, False, 1]
13:46:37 Finished handling successful execution of job efaf0e84-adfa-40e6-a156-e166bd4ed5f9. Redis returns [0, 1, 128, 5065.963707, 0, 0, 1, True, False, False, 1]
13:47:05 Finished handling successful execution of job bdb434aa-21c3-4b16-b3fd-96fc266ee302. Redis returns [0, 1, 124, 5096.179208, 0, 0, 1, True, False, False, 1]
13:57:21 Finished handling successful execution of job 78318706-453d-40b1-874c-fe6d78be8bfc. Redis returns [0, 1, 127, 5710.579746, 0, 0, 1, True, False, False, 1]
13:57:44 Finished handling successful execution of job 82f668a5-1fc0-4760-aa72-3a9f7e541f49. Redis returns [0, 1, 128, 5732.900007, 1, 1, 0]

I see some instances return a different number of values from others, if that's relevant; in some cases the last 4 values are missing.

Strangely, I don't see any instance of the registry.py log messages (Moving job(s) to FailedJobRegistry and Cleaning up StartedJobRegistry for queue). Do I have to do anything special to enable them?
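
For reference, this is roughly what I would try on my side (assuming the new registry messages go through a logger under the "rq" namespace, which is just a guess on my part):

# Assumption: the new registry.py messages are emitted through a logger
# under the "rq" namespace. Raise that whole hierarchy to DEBUG.
import logging

logging.basicConfig(format='%(asctime)s %(name)s %(levelname)s %(message)s')
logging.getLogger('rq').setLevel(logging.DEBUG)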

@selwin
Collaborator

selwin commented Aug 23, 2021

Ok, let's take a look at this line:

12:46:04 Finished handling successful execution of job 2c67d796-9169-4b6f-ac9e-0902b378bb4e. Redis returns [0, 1, 24, 1476.257723, 0, 0, 1, True, False, False, 1]

The last result, 1, is the result of the ZREM call initiated by started_job_registry.remove(). 1 means Redis successfully removed one member from the sorted set, which means the key has been removed from StartedJobRegistry.

Since the key has already been removed from StartedJobRegistry, cleanup() shouldn't find it there to move to FailedJobRegistry. The fact that it is still there means it gets reintroduced later, which could be due to a race condition.
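
To make the shape of that return value concrete, here's a minimal illustration (not the worker's actual command sequence; the registry key name is a placeholder): pipeline.execute() returns one result per queued command, in order, so the trailing 1 above lines up with the ZREM.

# Minimal illustration with placeholder keys: execute() returns one result
# per queued command, in order, so a trailing 1 is the ZREM's count of
# removed members.
from redis import Redis

conn = Redis()
pipe = conn.pipeline()
pipe.hset('rq:job:example', 'status', 'finished')  # number of new hash fields (0 or 1)
pipe.expire('rq:job:example', 600)                 # True if the TTL was set
pipe.zrem('started-registry-key', 'example')       # 1 if the member was removed, 0 if absent
print(pipe.execute())                              # e.g. [0, True, 1]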

I added an extra commit to check whether my suspicion is correct. Could you try my latest patch? If it works, I'll work on a more permanent fix.

@waldner
Contributor Author

waldner commented Aug 23, 2021

So this has been running for a few hours now, and so far I see exactly zero successful jobs moved to FailedJobRegistry (earlier there would already have been some). I'll leave it running overnight and see. Great work! Thanks!

@selwin
Collaborator

selwin commented Aug 24, 2021 via email

@selwin
Collaborator

selwin commented Aug 24, 2021

@waldner this branch contains the permanent fix. Could you please test this out and see whether it works for you?

@waldner
Contributor Author

waldner commented Aug 24, 2021

Yes! It ran all night and no jobs were moved to FailedJobRegistry. I'd say the problem is fixed. I'll test the new branch and let you know.

@waldner
Contributor Author

waldner commented Aug 24, 2021

I'm sorry, but I have to report that the fix-started-job-registry-race-condition branch reintroduces the problem of successful jobs being moved to FailedJobRegistry. I've been running it for about two hours and have already seen 4 such jobs. This seems to happen at a lower rate than before, but it does happen.
Furthermore, job parallelism now seems to be greatly reduced for some reason. For this particular task I have 40 workers set aside, and I used to see all 40 running at the same time (judging by how many jobs were in the StartedJobRegistry). Now the number fluctuates, but it never gets past 30 or 32 (and is often lower, around 20-25).

@selwin
Collaborator

selwin commented Aug 24, 2021

Ok, in that case, can you try running this fix? I'm trying to do this without incurring an additional Redis command. If this also fails, we can fall back to the quick fix.
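
One way to avoid reintroducing a removed key without an extra round trip (sketch only, not necessarily the exact approach in the branch) is ZADD's XX flag, which only updates members that already exist:

# Illustration only: refreshing a job's expiry with ZADD ... XX will not
# resurrect a member that has already been ZREM'd by the success handler,
# because XX never inserts new members, it only updates existing ones.
import time
from redis import Redis

def refresh_started_ttl(conn: Redis, started_key: str, job_id: str, ttl: int) -> None:
    conn.zadd(started_key, {job_id: time.time() + ttl}, xx=True)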

@waldner
Contributor Author

waldner commented Aug 24, 2021

This is working! Parallelism is restored and no jobs are moved to FailedJobRegistry. Thanks! Will you include the fix in a release?

@selwin
Collaborator

selwin commented Aug 25, 2021 via email

@adamsenn

Great work @selwin and @waldner!

@waldner
Contributor Author

waldner commented Aug 25, 2021

@adamsenn I hope it solves the issue for you too!

@waldner
Contributor Author

waldner commented Sep 11, 2021

1.10.0 definitely solves this for me. Closing.

@waldner closed this as completed on Sep 11, 2021
igorcalabria added a commit to inloco/redash that referenced this issue Jul 25, 2022
The idea is to port the changes from the current version of rq's monitor_work_horse to the version introduced by redash. This is a classic example of why overriding a method you don't own is usually a bad idea: you miss out on upstream fixes. Another point is that I'm not even sure this custom method is necessary with the version of rq we're using. Maybe we should investigate it more.
@rfrancois

I just tried what is written in the documentation (https://python-rq.org/docs/#job-callbacks) with v1.11.1, and I get the same error you described.
