RQ Scheduler causes workers to sporadically hang after receiving jobs #306

Closed

DataKinds opened this issue Jan 23, 2024 · 1 comment

DataKinds commented Jan 23, 2024

This occurs on Python 3.11.4. Versions of installed libraries (pared down for relevance):

Package                   Version
------------------------- ------------ 
arrow                     1.3.0
crontab                   1.0.1
Cython                    3.0.4
Flask                     2.3.3
Flask-RQ2                 18.3
pyarrow                   13.0.0
redis                     5.0.1
Redis-Sentinel-Url        1.0.1
rq                        1.15.1
rq-dashboard              0.6.7
rq-scheduler              0.13.1

Redis server startup banner:

# oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
# Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=3999114, just started

About 1 in 10 jobs sent to an RQ worker while RQ scheduler is running will hang indefinitely. This began happening last Friday, after a pip install to update the venv's dependencies.

The job's main body is never reached: I have seen jobs whose first call is a logging statement never produce any log output after being claimed anomalously. It does not matter whether RQ scheduler was the originator of the job; both scheduled and manually enqueued jobs hang in this manner. The behavior disappears once RQ scheduler is shut down. The hung RQ worker still responds to signals and can still complete a warm shutdown. No logs are produced after the first logger line indicating the worker took the job. Here's an annotated example of one of those logs:

# Previous job successfully finishing
INFO:rq.worker:default: Job OK (92135fb3-a891-4bc5-97be-8f0e4801424c)
INFO:rq.worker:Result is kept for 3600 seconds
# Anomalous job begins
INFO:rq.worker:default: Uploading file (84db2e7a-c99b-463b-8e70-76cce10beb01) 
# Hangs here indefinitely, until a signal is sent
INFO:rq.worker:Worker 480298b5bce645e696cd8082c25510bd [PID 3260661]: warm shut down requested                                                                            
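
For reference, the hung jobs are shaped roughly like the sketch below; the function, argument, and message names are illustrative placeholders, not taken from our codebase. The point is that the very first statement is a logging call, and even that never runs.

import logging

logger = logging.getLogger(__name__)

def upload_file(path):
    # The first statement is a logging call; for a hung job even this line
    # never shows up in the worker's output, although the worker claimed the job.
    logger.info("uploading %s", path)
    ...  # the rest of the body is never reached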

No pertinent logging is reported by RQ scheduler, just the regular "Registering birth" and "Registering death" messages that appear whenever workers are spun up or killed.

I inspected the worker object as saved in the Redis server: it has a busy status with the correct job ID listed. The job objects seem to be mildly corrupted, as the job's status remains queued and its last_heartbeat corresponds to the second the RQ worker froze up.
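
For completeness, this is roughly how I inspected that state, using rq's own API (a sketch against rq 1.15.x; the job ID is the one from the hung job above, and the Redis connection details are assumed):

from redis import Redis
from rq import Worker
from rq.job import Job

conn = Redis()  # same Redis instance the workers connect to

for worker in Worker.all(connection=conn):
    # The frozen worker reports a 'busy' state with the hung job's ID attached.
    print(worker.name, worker.get_state(), worker.get_current_job_id())

job = Job.fetch("84db2e7a-c99b-463b-8e70-76cce10beb01", connection=conn)
# The hung job's status stays 'queued' and its last_heartbeat never advances
# past the second the worker froze.
print(job.get_status(), job.last_heartbeat)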

I note that there is precedent for RQ scheduler causing this behavior, even if I was not able to track down the root cause: https://stackoverflow.com/questions/53640418/django-rq-rqworker-freezes-indefinitely / #197

The config file fed into RQ is minimal and just spins up a copy of the parent Flask app in order to access an app context. Let me know if I need to post any more info or logs here. Thanks!
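
To give an idea of its shape, the config is roughly the following sketch, loaded with rq worker -c settings; the create_app factory name, Redis URL, and queue list are placeholders, not the real values:

# settings.py: minimal rq config module that also builds the Flask app and
# pushes an app context so that job code can rely on Flask extensions.
from myapp import create_app  # placeholder application factory

app = create_app()
app.app_context().push()  # jobs run by this worker see an app context

REDIS_URL = "redis://localhost:6379/0"  # placeholder
QUEUES = ["default"]                    # placeholder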

DataKinds (Author) commented:

We're seeing the same behavior crop up today, after having removed RQ scheduler from the stack last week. Closing this issue as it likely isn't related to RQ scheduler.

DataKinds closed this as not planned on Jan 29, 2024.