Force workers to quit when a job's heartbeat times out #66

Closed
myronmarston opened this Issue Jan 15, 2013 · 6 comments

Owner

myronmarston commented Jan 15, 2013

Currently, when a job's heartbeat interval times out, the job fails, but the worker may keep working on it.

It would be useful if the heartbeat timeout could also kill the worker, so that the worker is not stuck working on a job that has already failed.

Here's an idea of how we can do that:

  • Just before a worker process starts working on a job, have it spin up a thread that subscribes to a Redis pub/sub channel named after the worker process.
  • When Qless times out the heartbeat interval of a job, have it publish a "stop working on this job" message on the worker's channel.
  • The worker's listening thread can use Ruby's Thread#raise API to raise an exception in the main worker thread, interrupting the job and effectively killing the worker.
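A minimal sketch of the steps above, to make the idea concrete. To keep it runnable without a Redis server, an in-process `Queue` stands in for the pub/sub channel; the names `JobTimedOut` and `run_job_with_listener` are hypothetical and not part of Qless.

```ruby
# Hypothetical exception raised in the main thread when a
# "stop working on this job" message arrives.
class JobTimedOut < StandardError; end

# Runs the given block (the job) while a listener thread waits on `channel`.
# ASSUMPTION: `channel` is a Queue standing in for a Redis pub/sub
# subscription named after the worker process.
def run_job_with_listener(channel)
  main = Thread.current

  listener = Thread.new do
    message = channel.pop              # blocks until a message is published
    main.raise(JobTimedOut, message)   # Thread#raise interrupts the main thread
  end

  yield
  :completed
rescue JobTimedOut
  :interrupted
ensure
  # Stop listening once the job has finished or has been interrupted.
  # A real worker would also have to cope with a message that arrives
  # at the very moment the job completes.
  listener&.kill
end
```

Pushing a message onto the queue from another thread while the block is still running makes the call return `:interrupted`; if no message arrives, the block runs to the end and the call returns `:completed`.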

Thoughts from @proby, @dlecocq, @benkirzhner, @waltjones ?

Contributor

dlecocq commented Jan 15, 2013

I'd be down for some mechanism, to be sure. Pub/Sub definitely seems like a good candidate for this, and probably the most efficient mechanism. The alternative would probably be to have workers periodically check the time to live on jobs, but any chance to get away from polling, I'm all for.

Owner

myronmarston commented Jan 15, 2013

@dlecocq -- Would this kind of "listening thread" solution (or something similar) work for the qless clients in other languages as well?

Contributor

dlecocq commented Jan 15, 2013

I can't speak for Perl (though I imagine it would), but it certainly would in Python and Node.

Owner

myronmarston commented Jan 15, 2013

Cool. Not sure when I'll have a chance to take a stab at this, but I imagine we'll find it useful for the platform work we're doing.

BTW, @benkirzhner and I had a conversation yesterday about using pubsub vs blpop for the communication mechanism. We decided pubsub is better; one reason is that it sets us up to allow the qless server to communicate with the worker for other purposes. Not sure yet what other kinds of messages the server may want to send the worker, but it could prove useful in the future, so we should bear it in mind.
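One way a single channel could carry more than one kind of message is sketched below. Everything here is an assumption for illustration: the JSON payload, the `event` and `jid` fields, and the `WorkerMessageDispatcher` class are hypothetical and do not describe an existing Qless message format.

```ruby
require 'json'

# Hypothetical dispatcher: maps an "event" name to a handler block, so the
# listening thread can react to different kinds of server messages.
class WorkerMessageDispatcher
  def initialize
    @handlers = {}
  end

  # Registers a handler for one event name; returns self to allow chaining.
  def on(event, &handler)
    @handlers[event] = handler
    self
  end

  # Parses one raw JSON message and calls the matching handler.
  # Returns the handler's result, or nil when no handler is registered.
  def dispatch(raw)
    message = JSON.parse(raw)
    handler = @handlers[message['event']]
    handler && handler.call(message)
  end
end
```

With this shape, the "stop working on this job" message becomes one registered event among others, and new message kinds can be added later without changing how the worker listens.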

Contributor

dlecocq commented Jan 15, 2013

FWIW, I'm trying this out on the dan branch of qless-py and qless-core, though I'm going to squash some commits before sending out a pull request. Good news is, it's a dead simple change in qless-core, and seems to be working well.

Owner

myronmarston commented May 16, 2013

This has been done.
