Currently, when a job misses its heartbeat deadline, the job fails, but the worker may keep trying to work on it.
It would be useful if the heartbeat timeout could also kill the worker, so that the worker doesn't stay stuck working on a job that has already failed.
Here's an idea of how we can do that:
Thoughts, @proby, @dlecocq, @benkirzhner, @waltjones?
I'd be down for some mechanism, to be sure. Pub/Sub definitely seems like a good candidate for this, and probably the most efficient mechanism. The alternative would probably be to have workers periodically check the time to live on jobs, but any chance to get away from polling, I'm all for.
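For illustration, here is a rough sketch of what the worker side of the Pub/Sub approach might look like in Python. It is not how qless does it: the JSON payload fields (`event`, `worker`, `jid`) and the helper names `watch_for_timeouts` and `start_listener` are made up for the example. The only thing it assumes about the transport is that messages arrive as dicts shaped like the ones redis-py's `PubSub.listen()` yields.

```python
import json
import threading


def watch_for_timeouts(messages, worker_name, stop_event):
    """Set stop_event when a timeout notice for worker_name arrives.

    `messages` is any iterable of dicts with "type" and "data" keys,
    like the ones redis-py's PubSub.listen() yields. The payload fields
    ("event", "worker", "jid") are hypothetical, not a qless format.
    """
    for message in messages:
        if message.get("type") != "message":
            continue  # skip subscribe/unsubscribe confirmations
        data = message["data"]
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        payload = json.loads(data)
        if payload.get("event") == "timeout" and payload.get("worker") == worker_name:
            stop_event.set()  # tell the job code to give up
            return payload.get("jid")
    return None


def start_listener(messages, worker_name, stop_event):
    """Run watch_for_timeouts in a daemon thread so job code is not blocked."""
    thread = threading.Thread(
        target=watch_for_timeouts,
        args=(messages, worker_name, stop_event),
        daemon=True,
    )
    thread.start()
    return thread
```

With redis-py, `messages` could be the result of calling `listen()` on a `pubsub()` object that has subscribed to whatever channel the server publishes on. Note that this version is cooperative: the job code has to check `stop_event.is_set()` between units of work. Actually killing a stuck worker would need something stronger, such as running the job in a child process that the listener can terminate.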
@dlecocq -- Would this kind of "listening thread" solution (or something similar) work for the qless clients in other languages as well?
I can't speak for Perl (though I imagine it would), but it certainly would in Python and Node.
Cool. Not sure when I'll have a chance to take a stab at this, but I imagine we'll find it useful for the platform work we're doing.
BTW, @benkirzhner and I had a conversation yesterday about using pubsub vs blpop for the communication mechanism. We decided pubsub is better; one reason is that it sets us up to allow the qless server to communicate with the worker for other purposes. Not sure yet what other kinds of messages the server may want to send the worker, but it could prove useful in the future, so we should bear it in mind.
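To make the "other purposes" point concrete, here is a small sketch of how a worker could route different kinds of server messages to different handlers. Everything in it is hypothetical: the `WorkerMessageRouter` class, the `event` field, and the event names are only there to show the shape of the idea, not anything qless defines.

```python
import json


class WorkerMessageRouter:
    """Route JSON messages from the server to handlers keyed by event name.

    The "event" field and the event names are assumptions made for this
    example; they are not part of any qless message format.
    """

    def __init__(self):
        self._handlers = {}

    def on(self, event, handler):
        """Register a handler to call for messages with this event name."""
        self._handlers[event] = handler

    def dispatch(self, raw):
        """Decode one message and call its handler.

        Returns True if a handler ran, False if the event was unknown.
        """
        payload = json.loads(raw)
        handler = self._handlers.get(payload.get("event"))
        if handler is None:
            return False  # ignore message kinds this worker doesn't know
        handler(payload)
        return True
```

A worker would start with a single `timeout` handler, and new message kinds could be added later with another `on(...)` call. Unknown events are ignored, so older workers would keep running if the server starts publishing something new.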
FWIW, I'm trying this out on the dan branch of qless-py and qless-core, though I'm going to squash some commits before sending out a pull request. Good news is, it's a dead simple change in qless-core, and seems to be working well.
This has been done.