First of all, I want to say thanks for all of the great work on this awesome tool.
I'm working on a Heroku-hosted app and I've run into an issue that causes jobs to get stuck in the WIP queue. It happens when long-running tasks (e.g. emulated by a sleep(20)) are being processed at the moment the workers receive SIGTERM (e.g. when changing the number of dynos or redeploying).
After that point, the tasks seem to be left in the WIP queue (StartedJobRegistry) forever, whereas I would have expected them to eventually be moved to the failed queue or somewhere else. I'm not sure whether that's the expected behaviour.
In the above scenario, an exception seems to be occurring, which might be cutting things short (copying it here in case it's relevant):
Stopping all processes with SIGTERM
Mar 17 13:55:49 xxx: 12:55:48 system | SIGTERM received
Mar 17 13:55:49 xxx: 12:55:48 system | sending SIGTERM to rq_worker.2 (pid 7)
Mar 17 13:55:49 xxx: 12:55:48 system | sending SIGTERM to rq_worker.1 (pid 8)
Mar 17 13:55:49 xxx: Traceback (most recent call last):
Mar 17 13:55:49 xxx: File "/app/.heroku/python/bin/honcho", line 11, in <module>
Mar 17 13:55:49 xxx: sys.exit(main())
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/site-packages/honcho/command.py", line 266, in main
Mar 17 13:55:49 xxx: COMMANDS[args.command](args)
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/site-packages/honcho/command.py", line 213, in command_start
Mar 17 13:55:49 xxx: manager.loop()
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/site-packages/honcho/manager.py", line 100, in loop
Mar 17 13:55:49 xxx: msg = self.events.get(timeout=0.1)
Mar 17 13:55:49 xxx: File "/app/.heroku/python/lib/python2.7/multiprocessing/queues.py", line 131, in get
Mar 17 13:55:49 xxx: if timeout < 0 or not self._poll(timeout):
Mar 17 13:55:49 xxx: IOError: [Errno 4] Interrupted system call
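For context on that IOError: on Python 2.7, a signal delivered during a blocking system call (here, self._poll() inside Queue.get) raised EINTR as an exception unless the caller retried by hand, which honcho apparently doesn't. Since Python 3.5 (PEP 475) the interpreter retries interrupted calls automatically. A small stdlib-only sketch of the difference (Unix only, since it uses SIGALRM):

```python
import signal
import time

def handler(signum, frame):
    # A Python-level handler must be installed; otherwise SIGALRM's default
    # disposition would terminate the process instead of interrupting sleep.
    pass

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)                  # deliver SIGALRM after ~1 second

start = time.monotonic()
time.sleep(3)                    # interrupted after ~1s by the signal
elapsed = time.monotonic() - start

# On Python 2.7 the interrupted call would raise
# "IOError: [Errno 4] Interrupted system call" here; on Python 3.5+
# the sleep is transparently retried and the full ~3 seconds elapse.
```

So the honcho crash in the log is a symptom of the pre-PEP-475 EINTR behaviour, separate from the RQ registry question.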
Thanks
The text was updated successfully, but these errors were encountered:
@jtushman made a pull request implementing rq suspend and rq resume commands that let you wait for all in-flight tasks to finish before execution resumes, so all workers can quit gracefully. I think those commands aren't documented yet :(
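For reference, a hedged sketch of how those commands might fit into a deploy script ($REDIS_URL is a placeholder for your Redis connection string):

```shell
# Before scaling down or redeploying, stop workers from picking up new jobs:
rq suspend --url "$REDIS_URL"

# Or suspend only for a fixed window, in seconds:
rq suspend --url "$REDIS_URL" --duration 300

# Once the deploy is done, let workers consume jobs again:
rq resume --url "$REDIS_URL"
```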
Jobs in StartedJobRegistry are only moved to FailedQueue when StartedJobRegistry.cleanup() is called. The documentation in this area is also sorely lacking, and I think we should provide an easier way to trigger these periodic maintenance tasks.
For now, I've handled this by running a periodic job (via rq-scheduler), that runs the cleanup, but it'd be great to have this handled automagically somehow.