Auto-heartbeating? #104

pospischil · 2013-04-30T17:21:49Z

Hi all,

I have built a generic class to execute arbitrary methods -- I'd love to have this class automatically heartbeat as long as it's still around.

Using a thread to do this doesn't quite work as a) it causes a race condition with redis responses and b) the threads aren't guaranteed to run, even in Ruby 2.0.

I'm currently trying to fork a process for this, but that seems a bit excessive -- is there a best practice for this, or are you implementing something in each of the perform methods?

Thanks in advance (also, is there a mailing list I'm missing, or is this the best method for communication?)

myronmarston · 2013-04-30T17:47:51Z

The qless heart beat can be used in a few different ways:

- If your jobs are longer-running, and have an inner loop, you'll want to heartbeat the job in each iteration of the loop. I believe this is how @dlecocq has been doing it on his projects. The kind of work his jobs do is a good fit for this kind of loop structure.
- If your jobs do single, discrete, granular pieces of work (i.e. they have no loop) you'll probably want to configure Qless to have a longer heartbeat timeout, not bother to heartbeat from your jobs, and then just let jobs time out. That's what we've been doing on my projects: we set the heartbeat to 30 minutes, and if any job takes >= 30 minutes that's too long and an indication that something is wrong, so we want Qless to time out the job. In this arrangement, we're treating Qless's heartbeat feature as a job timeout.

Both approaches have their merits, depending on how you structure your jobs, how granular they are, etc.
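
For illustration, the first approach (heartbeating once per iteration of the job's own loop) might look roughly like the sketch below. `FakeJob` is a hypothetical stand-in so the example runs without Redis; it only assumes that a real qless job exposes `data` and `heartbeat`. `ChunkedWork` and the `'chunks'` key are made-up names.

```ruby
# Hypothetical stand-in for a qless job, so the sketch runs without Redis.
# It mimics only the two calls used below: `data` and `heartbeat`.
class FakeJob
  attr_reader :data, :heartbeats

  def initialize(data)
    @data = data
    @heartbeats = 0
  end

  # A real job would renew its lock in Redis here; this one just counts calls.
  def heartbeat
    @heartbeats += 1
  end
end

# Hypothetical job class: the work is a loop, so each iteration can heartbeat.
class ChunkedWork
  def self.perform(job)
    job.data['chunks'].each do |chunk|
      process(chunk)
      job.heartbeat # check in after every unit of work
    end
  end

  # Placeholder for the real per-chunk work.
  def self.process(chunk)
    chunk * 2
  end
end

job = FakeJob.new('chunks' => [1, 2, 3])
ChunkedWork.perform(job)
```

This only helps when each iteration finishes well within the heartbeat interval; a single long iteration would still let the job time out.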

As for a mailing list...there's not one yet, but it might be worth creating one.

/cc @dlecocq @jstorimer

pospischil · 2013-04-30T18:02:45Z

In our use case, we have a generic class that performs a range of jobs (some take less than a second while others can take well over 24 hours).

Unfortunately, even if we broke these long-running jobs into classes of their own, they don't have an inner-loop type mechanism where we could easily inject the heartbeat (we're often waiting for long-running R/C/Fortran code to finish computation).

We're really just concerned with whether or not that process is still running (we're looking to restart jobs if Amazon took down our spot instance or if the OOM killer took the process down). It sounded like this was similar to the use case described in the readme.

I'll play with the fork a bit more, and also look to see if we can get god to help with this (maybe it can do the heartbeating as long as the process is still running, which is really all we're concerned with). I'll report back with what we end up doing, in case it's helpful for others.

In the interim, if anyone has any other ideas/suggestions I'd love to hear them!

dlecocq · 2013-06-04T14:21:05Z

Sorry to be late to the party.

The original thinking about heartbeats was that 1) programmers would generally have a lot of programmatic insight / access into the work being done and 2) a running process isn't a sufficient condition for work being done. This is partly informed by one of the original purposes for qless, with a system that had some problematic code that would sometimes deadlock. Of course, it gets tricky when the work is one big outer loop in external code :-/

That's a shame about threads, because that's what I would have suggested at first. Are you by chance forking the process that's doing the computation, or are they native bindings that you're using? I thought I recalled waitpid accepting a timeout parameter but it doesn't appear to. In C++ land, you might set an alarm handler to do the timeout, but perhaps Ruby has a nicer utility to do that. Alternatively, a polling loop to check the pid state every few seconds might be possible, even if less attractive. You could probably even set up a SIGCHLD handler to complete (or fail) the job, and have the original worker process just have a loop that sleeps a little less than job.ttl before heartbeating.
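
A minimal sketch of the polling-loop idea, assuming the computation runs as a child process: the worker checks the child with a non-blocking `Process.wait2` and heartbeats while it is still alive. `FakeJob`, `run_with_heartbeat`, and the `interval` argument are hypothetical names; with a real job, the interval would be chosen to be a little less than `job.ttl`.

```ruby
# Hypothetical stand-in for a qless job; it only counts heartbeat calls.
class FakeJob
  attr_reader :heartbeats

  def initialize
    @heartbeats = 0
  end

  def heartbeat
    @heartbeats += 1
  end
end

# Spawn `command`, then alternate between checking whether the child has
# exited and heartbeating the job. Returns the child's Process::Status.
def run_with_heartbeat(job, command, interval)
  pid = Process.spawn(*command)
  loop do
    # With WNOHANG, wait2 returns nil while the child is still running.
    _, status = Process.wait2(pid, Process::WNOHANG)
    return status if status

    job.heartbeat
    sleep interval
  end
end

job = FakeJob.new
status = run_with_heartbeat(job, %w[sleep 1], 0.2)
# At this point the caller would complete or fail the job based on `status`.
```

The drawback is the one mentioned above: the loop wakes up on a timer rather than when the child actually exits, so completion is noticed up to one interval late.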

With respect to a mailing list, we actually just put one up yesterday: https://groups.google.com/forum/#!members/qless

b4hand · 2013-06-04T18:13:43Z

You can use a self-pipe to mimic the behavior of a timeout on waitpid:

http://stackoverflow.com/questions/282176/waitpid-equivalent-with-timeout
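
In Ruby, the self-pipe trick could be sketched roughly as follows: a `SIGCHLD` handler writes a byte to a pipe, and `IO.select` on the read end gives a wait that ends either when a child exits or when the timeout expires. `waitpid_with_timeout`, `SELF_READ`, and `SELF_WRITE` are hypothetical names, and this is an untuned sketch rather than anything qless provides.

```ruby
# Self-pipe: the SIGCHLD handler writes to this pipe so that IO.select
# can be woken up when a child exits.
SELF_READ, SELF_WRITE = IO.pipe

Signal.trap('CHLD') do
  begin
    SELF_WRITE.write_nonblock('.')
  rescue IO::WaitWritable, Errno::EPIPE
    # Pipe is full or closed: a wake-up is already pending, nothing to do.
  end
end

# Wait up to `timeout` seconds for `pid` to exit.
# Returns its Process::Status, or nil if it is still running at the deadline.
def waitpid_with_timeout(pid, timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    _, status = Process.wait2(pid, Process::WNOHANG)
    return status if status

    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return nil if remaining <= 0

    # Sleep until SIGCHLD wakes us or time runs out, then drain the pipe.
    if IO.select([SELF_READ], nil, nil, remaining)
      SELF_READ.read_nonblock(64, exception: false)
    end
  end
end

pid = Process.spawn('sleep', '1')
first = waitpid_with_timeout(pid, 0.2)   # child still running at this point
second = waitpid_with_timeout(pid, 5)    # child exits well before 5 seconds
```

A worker could call this in a loop with a timeout a little under `job.ttl`, heartbeating each time it returns `nil`.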

pospischil · 2013-06-05T00:24:12Z

Hey all,

Thanks for getting back to me. We ended up using god to monitor the process and do the heartbeating. Sounds like this is different from the designed-for use case (we're not worried about deadlocks).

Regardless, I'm going to write up what we ended up doing for this, along with some other work we did to migrate from DJ to qless. I'll update this with the notes when I get a chance to do so.

wr0ngway · 2014-12-03T14:30:17Z

@pospischil we have a thread set up for all our workers to auto-heartbeat, which seems to work well enough, but we're running into an issue where even though the thread logs the fact that it made a heartbeat call, qless still times it out. I think it might have something to do with the race condition you mentioned in the initial issue text, but I have no idea what that race condition actually is. Can you elaborate?

pospischil · 2014-12-03T16:07:02Z

@wr0ngway unfortunately it's been so long now I don't remember, and naturally I never got those notes written up.

Originally we had been using god to monitor the processes, and the god process itself did the heartbeating. We recently migrated over to a new architecture, and god was causing a lot of annoyances, so after sinking a few days into it we migrated over to monit, and then we have another process heartbeat the jobs by pids. Both approaches are basically the same and have worked perfectly for well over a year now, so I'd recommend going that route. Let me know if you have any other questions or if I can provide anything to help...

wr0ngway · 2014-12-03T16:43:56Z

No worries, thanks for the reply. Was just trying to decrease my debugging time :)
