Succeeded jobs mysteriously moved to FailedJobRegistry #1507
We are also seeing this issue regularly now. We are on rq 1.9, using it with
We have a similar setup in multiple other projects, but have only noticed this issue in two of them so far. Unfortunately, we have not been able to nail down a reproducible case yet. While your failure appears to have occurred ~90 sec after the success, ours seem to consistently fall in the 6-7 min range (3-4 min jobs with a 900 sec timeout, ttl = None), FWIW. |
We are seeing this very frequently now in a new project we are working on. I strongly suspect the issue is somewhere in https://github.com/rq/rq/blob/master/rq/worker.py#L982-L1006. As @waldner said, the StartedJobRegistry's cleanup() appears to be the culprit. Apparently, worker.py line 996 successfully sets the job's status to finished, but the cleanup at https://github.com/rq/rq/blob/master/rq/registry.py#L73-L75 later still finds the job in the StartedJobRegistry and moves it to the FailedJobRegistry. We've tried to track this further, but unfortunately we are not Redis experts and don't fully understand everything rq is doing there. Hopefully this is useful information! |
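For readers following along, here's a minimal sketch of what the success path around those lines does, paraphrased from worker.py with the registry key names written out for illustration; this is not the exact upstream code:

```python
from redis import Redis

def handle_job_success_sketch(r: Redis, job_id: str, result_ttl: int, now: float) -> None:
    # Hedged paraphrase of rq's success path; key names assumed for the
    # "default" queue. All three steps run in one pipeline.
    with r.pipeline() as pipe:
        # 1. Mark the job finished on its Redis hash.
        pipe.hset(f"rq:job:{job_id}", "status", "finished")
        # 2. Add it to FinishedJobRegistry, scored by when the result expires.
        pipe.zadd("rq:finished:default", {job_id: now + result_ttl})
        # 3. Remove it from StartedJobRegistry -- it is no longer running.
        pipe.zrem("rq:wip:default", job_id)
        pipe.execute()
```

If anything re-adds the job to the started registry after step 3, a later cleanup() will treat it as expired and fail it.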
FWIW, I'm not using custom job IDs. Also, I'm seeing this in a queue with a lot of job churn; it can peak at 180-200 jobs at a time in the finished job registry. Other, less active queues don't have the problem. HTH. |
@waldner @adamsenn sorry for the late reply. I don't think it has anything to do with custom job IDs. I'll try to look into this, but it's not easy because I don't have jobs that are incorrectly moved to FailedJobRegistry. Are you able to spot trends in the jobs that are incorrectly failed? For example, do they take a long time to finish? |
One more thing, are you both able to verify that these jobs are completed successfully? If they are, I can narrow down the scope of my search. |
I added some logging statements in #1544. Do you mind running RQ with debug logging turned on so we can see what's happening? |
For my part, I can confirm that the jobs are successful (or at least the logs say so), as I showed in my original message. I'll apply the debug patch and let you know what I see. Thanks! |
Ok, here is the log for one of those successful jobs which are moved to FailedJobRegistry:
Yet it's moved to FailedJobRegistry later:
I notice now that the timestamp above seems to be in UTC (it's really 19:23:26.722741). Yet the workers have their timezone set correctly, as the timestamps in the log messages show (and a simple …) |
I can also confirm that the jobs definitely succeed first as we have multiple indicators (including logs) that are all showing success. |
Ok, this is super weird. From your logs we can clearly see that the job was removed from StartedJobRegistry when it succeeded. We know that the job has to be moved to FailedJobRegistry by StartedJobRegistry.cleanup(), which should only happen if the job is somehow still in that registry. Something is definitely weird here. May I know what redis-server version both of you are running, @waldner @adamsenn? |
We are using AWS ElastiCache Redis 5.0.6 with Version 3.5.3 of the Python redis client. |
redis 6.0.5-alpine in a Docker container, python-redis 3.5.3. |
After reading the code multiple times, I think the most probable explanation is that the job somehow gets re-added to StartedJobRegistry after it has already been removed on success. @waldner can you check if the jobs that were erroneously added to FailedJobRegistry have their ended_at timestamp set? |
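If that suspicion is right, the failure sequence would look roughly like this at the Redis level (illustrative only; the key name and timings are assumptions):

```python
import time
from redis import Redis

r = Redis()
STARTED = "rq:wip:default"  # StartedJobRegistry key, assumed for illustration

# 1. Job succeeds: the worker removes it from StartedJobRegistry.
r.zrem(STARTED, "job-123")

# 2. A late heartbeat runs zadd with a fresh expiry score. Because the member
#    was already removed, ZADD silently *re-creates* it instead of updating
#    an existing entry.
r.zadd(STARTED, {"job-123": time.time() + 90})

# 3. Once that score passes, cleanup() assumes the job died and moves it to
#    FailedJobRegistry -- even though it actually finished.
```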
Yes, they indeed have the ended_at timestamp set (in UTC, as I noted above). |
@waldner this is by design. Because the queue and workers may be hosted on different servers, it makes sense to store all timestamps internally in UTC. They could then be converted to local timestamps when displayed to be read by humans. I added some more debug logging statements in this branch: https://github.com/rq/rq/tree/logging-additions. Can you run this commit and see what it shows us? This is the diff of the commit: c0dd5f7. It logs two things:
|
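As an aside, the store-in-UTC, display-in-local convention mentioned above boils down to something like this minimal sketch:

```python
from datetime import datetime, timezone

# Store timestamps in UTC so queues and workers on different hosts agree.
stored = datetime.now(timezone.utc)  # what would go into Redis
print(stored.isoformat())            # e.g. 2021-08-23T19:23:26.722741+00:00

# Convert to the local timezone only when displaying to humans.
local = stored.astimezone()          # uses the host's local timezone
print(local.isoformat())             # e.g. 2021-08-23T21:23:26.722741+02:00
```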
Ok, here goes:
Here are the relevant lines (…):
And here's the same line for some other random jobs not moved to FailedJobRegistry:
I see some instances return a different number of values from others, if that's relevant; in some cases the last 4 values are missing. Strangely, I don't see any instance of the … |
Ok, let's take a look at this line:
The last result is 1, which means the zadd created a new entry. This means that since the key had already been removed from StartedJobRegistry, the heartbeat re-adds it. I added an extra commit to check whether my suspicion is true. Could you try my latest patch? If this works, I'll work on a more permanent fix. |
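For anyone unfamiliar with the ZADD semantics underlying this: redis-py's zadd returns the number of *newly created* members, not the number updated, so a 1 for a job that should already be tracked is the smoking gun. A quick demonstration:

```python
from redis import Redis

r = Redis()
r.delete("rq:wip:default")  # key name assumed for illustration

# ZADD returns the number of new members added (updates return 0).
print(r.zadd("rq:wip:default", {"job-123": 100}))  # 1 -> created a new entry
print(r.zadd("rq:wip:default", {"job-123": 200}))  # 0 -> updated an existing one

# After a zrem, the next zadd re-creates the member and returns 1 again.
r.zrem("rq:wip:default", "job-123")
print(r.zadd("rq:wip:default", {"job-123": 300}))  # 1 -> key was re-created
```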
So this has been running for a few hours now, and so far I see exactly zero successful jobs moved to FailedJobRegistry (earlier there would already have been some). I'll leave it running overnight and see. Great work! Thanks! |
Great! If it stays that way, I'll work on a more permanent fix and may need your help to verify the fix again (hopefully just once). |
@waldner this branch contains the permanent fix. Could you please test this out and see whether it works for you? |
Yes! It ran all night and no job moved to FailedJobRegistry. I can say that the problem is fixed. I'll test the new branch and let you know. |
I'm sorry but I have to report that the permanent-fix branch seems to break parallelism: jobs now appear to run one at a time. |
Ok, in that case can you try running this fix? I'm trying to do this without incurring an additional Redis command. If this also fails, we can fall back to the quick fix. |
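One plausible way to do that without an extra round trip is ZADD's XX flag, which updates a member's score only if the member already exists. Whether the patch does exactly this is my assumption, but it illustrates the idea:

```python
import time
from redis import Redis

r = Redis()
STARTED = "rq:wip:default"  # key name assumed for illustration

# XX: only update members that already exist; never (re-)create them.
# A heartbeat written this way cannot resurrect a job that was already
# removed from StartedJobRegistry, and it costs no extra Redis command.
added = r.zadd(STARTED, {"job-123": time.time() + 90}, xx=True)
print(added)  # 0 -- nothing created; the score was updated only if present
```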
This is working! Parallelism restored and no jobs moved to FailedJobRegistry. Thanks! Will you include the fix in a release? |
Yes. Thanks for testing. |
@adamsenn I hope it solves the issue for you too! |
1.10.0 definitely solves this for me. Closing. |
The idea is to port the changes from the current version of rq's monitor_work_horse to the version introduced by redash. This is a classic example of why overriding a method you don't own is usually a bad idea: you miss out on upstream fixes. Another point is that I'm not even sure this custom method is still necessary with the version of rq we're using. Maybe we should investigate it more.
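To make the point concrete, a hypothetical subclass that wraps the upstream method instead of copy-pasting its body would have picked up the fix from this thread automatically:

```python
from rq.worker import Worker

class PatchedWorker(Worker):
    """Hypothetical override that wraps rather than replaces the upstream method."""

    def monitor_work_horse(self, *args, **kwargs):
        # Add behavior around the call instead of duplicating the body;
        # upstream bug fixes (like the one in this thread) then apply for free.
        self.log.debug("monitor_work_horse starting")
        return super().monitor_work_horse(*args, **kwargs)
```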
I just tried what is written in the documentation https://python-rq.org/docs/#job-callbacks with v1.11.1, and I get the same error you described. |
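For reference, the callback usage described in those docs looks like the following sketch (signatures as documented for recent rq versions; the error reported above suggests version compatibility matters):

```python
from redis import Redis
from rq import Queue

def report_success(job, connection, result, *args, **kwargs):
    # Runs after the job finishes successfully.
    print(f"{job.id} finished with {result!r}")

def report_failure(job, connection, type, value, traceback):
    # Runs if the job raises.
    print(f"{job.id} failed: {value}")

def add(x, y):
    return x + y

q = Queue(connection=Redis())
q.enqueue(add, 1, 2, on_success=report_success, on_failure=report_failure)
```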
Once in a while I get jobs that completed successfully moved to FailedJobRegistry.
The job terminates correctly, as shown in the logs:
But then after a while I see the job has been moved to FailedJobRegistry. Looking at the queues with rq-dashboard, I see this terse message:
But nothing else (I don't even know where rq-dashboard gets that message from). No other information in the logs. As I said, this happens only for a minority of jobs, but it does happen.
If it helps, it happened both with rq 1.8.1 and rq 1.9.0.
Could it be a failure in the rq<->redis communication, so that the successful termination of the job isn't properly written to redis? Looking at the rq code, I see that the move happens in StartedJobRegistry.cleanup() in rq/registry.py. From what I understand, rq thinks that the job is "expired" (based on its redis score) and so moves it to the failed registry.
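For completeness, that cleanup behavior amounts to roughly the sketch below, paraphrased from rq/registry.py; the key names and the 1-year default failure TTL are written out here as assumptions, not the exact upstream code:

```python
import time
from redis import Redis

def cleanup_sketch(r: Redis, queue: str = "default") -> None:
    # Job ids live in a sorted set scored by their expiry timestamp, so any
    # member with score <= now is presumed dead and moved to FailedJobRegistry.
    now = time.time()
    started, failed = f"rq:wip:{queue}", f"rq:failed:{queue}"
    for job_id in r.zrangebyscore(started, 0, now):
        with r.pipeline() as pipe:
            pipe.hset(f"rq:job:{job_id.decode()}", "status", "failed")
            pipe.zadd(failed, {job_id: now + 31536000})  # failure TTL (~1 year)
            pipe.zrem(started, job_id)
            pipe.execute()
```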