First of all, I want to say thanks for all of the great work on this awesome tool.
I'm working on a Heroku-hosted app and I've run into an issue that causes jobs to get stuck in the WIP queue. It happens when long-running tasks (e.g. emulated by a sleep(20)) are being processed at the moment the workers receive SIGTERM (e.g. when changing the number of dynos or redeploying).
After that point, the tasks seem to be left in the WIP queue (StartedJobRegistry) forever, whereas I would have expected them to eventually be moved to the failed queue or somewhere else. I'm not sure whether that's the expected behaviour.
In the above scenario, an exception seems to be occurring, which might be cutting things short (copying it here in case it's relevant):
Stopping all processes with SIGTERM
Mar 17 13:55:49 xxx: 12:55:48 system | SIGTERM received
Mar 17 13:55:49 xxx: 12:55:48 system | sending SIGTERM to rq_worker.2 (pid 7)
Mar 17 13:55:49 xxx: 12:55:48 system | sending SIGTERM to rq_worker.1 (pid 8)
Mar 17 13:55:49 xxx: Traceback (most recent call last):
Mar 17 13:55:49 xxx: File "/app/.heroku/python/bin/honcho", line 11, in <module>
Mar 17 13:55:49 xxx: sys.exit(main())
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/site-packages/honcho/command.py", line 266, in main
Mar 17 13:55:49 xxx: COMMANDS[args.command](args)
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/site-packages/honcho/command.py", line 213, in command_start
Mar 17 13:55:49 xxx: manager.loop()
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/site-packages/honcho/manager.py", line 100, in loop
Mar 17 13:55:49 xxx: msg = self.events.get(timeout=0.1)
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/multiprocessing/queues.py", line 131, in get
Mar 17 13:55:49 xxx: if timeout < 0 or not self._poll(timeout):
Mar 17 13:55:49 xxx: IOError: [Errno 4] Interrupted system call
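For context on that IOError: on Python 2.7, a signal delivered during a blocking system call (here, self._poll() inside Queue.get) raised EINTR as an exception unless the caller retried by hand, which honcho apparently doesn't. Since Python 3.5 (PEP 475) the interpreter retries interrupted calls automatically. A small stdlib-only sketch of the difference (Unix only, since it uses SIGALRM):

```python
import signal
import time

def handler(signum, frame):
    # A Python-level handler must be installed; otherwise SIGALRM's default
    # disposition would terminate the process instead of interrupting sleep.
    pass

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)                  # deliver SIGALRM after ~1 second

start = time.monotonic()
time.sleep(3)                    # interrupted after ~1s by the signal
elapsed = time.monotonic() - start

# On Python 2.7 the interrupted call would raise
# "IOError: [Errno 4] Interrupted system call" here; on Python 3.5+
# the sleep is transparently retried and the full ~3 seconds elapse.
```

So the honcho crash in the log is a symptom of the pre-PEP-475 EINTR behaviour, separate from the RQ registry question.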
Thanks
The text was updated successfully, but these errors were encountered:
@jtushman made a pull request implementing rq suspend and rq resume commands that let you wait for all in-flight tasks to finish before execution resumes, so all workers can quit gracefully. I think those commands aren't documented yet :(
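For reference, a hedged sketch of how those commands might fit into a deploy script ($REDIS_URL is a placeholder for your Redis connection string):

```shell
# Before scaling down or redeploying, stop workers from picking up new jobs:
rq suspend --url "$REDIS_URL"

# Or suspend only for a fixed window, in seconds:
rq suspend --url "$REDIS_URL" --duration 300

# Once the deploy is done, let workers consume jobs again:
rq resume --url "$REDIS_URL"
```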
Jobs in StartedJobRegistry are only moved to FailedQueue when StartedJobRegistry.cleanup() is called. The documentation in this area is also sorely lacking, and I think we should provide an easier way to trigger these periodic maintenance tasks.
For now, I've handled this by running a periodic job (via rq-scheduler), that runs the cleanup, but it'd be great to have this handled automagically somehow.