Auto-heartbeating? #104

Open
pospischil opened this issue Apr 30, 2013 · 8 comments

Comments

@pospischil

Hi all,

I have built a generic class to execute arbitrary methods -- I'd love to have this class automatically heartbeat as long as it's still around.

Using a thread to do this doesn't quite work as a) it causes a race condition with redis responses and b) the threads aren't guaranteed to run, even in Ruby 2.0.

I'm currently trying to fork a process for this, but that seems a bit excessive -- is there a best practice for this, or are you implementing something in each of the perform methods?

Thanks in advance (also, is there a mailing list I'm missing, or is this the best method for communication?)

@myronmarston
Contributor

The qless heart beat can be used in a few different ways:

  • If your jobs are longer-running and have an inner loop, you'll want to heartbeat the job in each iteration of the loop. I believe this is how @dlecocq has been doing it on his projects. The kind of work his jobs do is a good fit for this kind of loop structure.
  • If your jobs do single, discrete, granular pieces of work (i.e. they have no loop), you'll probably want to configure Qless with a longer heartbeat timeout, not bother to heartbeat from your jobs, and just let jobs time out. That's what we've been doing on my projects: we set the heartbeat to 30 minutes, and if any job takes >= 30 minutes, that's too long and an indication that something is wrong, so we want Qless to time out the job. In this arrangement, we're treating Qless's heartbeat feature as a job timeout.

Both approaches have their merits, depending on how you structure your jobs, how granular they are, etc.
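
As a rough illustration of the first approach, here is a minimal sketch of a job that heartbeats once per loop iteration. `BatchJob`, `FakeJob`, and the `'items'` key are hypothetical names made up for this example; `FakeJob` only stands in for a real Qless job (which exposes `data` and `heartbeat`) so the sketch runs without Redis.

```ruby
# Stand-in for a Qless job so the sketch runs without Redis.
# Only `data` and `heartbeat` mirror the real job object; the counter
# exists purely so the behaviour can be observed.
class FakeJob
  attr_reader :data, :heartbeats

  def initialize(data)
    @data = data
    @heartbeats = 0
  end

  def heartbeat
    @heartbeats += 1
  end
end

# Hypothetical job class: works through the items one at a time and
# checks in after each one so the lock never goes stale mid-batch.
class BatchJob
  def self.perform(job)
    job.data['items'].map do |item|
      result = item * 2   # placeholder for the real unit of work
      job.heartbeat       # renew the lock before starting the next item
      result
    end
  end
end
```

Each iteration should take comfortably less than the configured heartbeat interval; if a single item can exceed it, this pattern alone is not enough.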

As for a mailing list...there's not one yet, but it might be worth creating one.

/cc @dlecocq @jstorimer

@pospischil
Author

In our use case, we have a generic class that performs a range of jobs (some take less than a second while others can take well over 24 hours).

Unfortunately, even if we broke these long-running jobs into classes of their own, they don't have an inner-loop type mechanism where we could easily inject the heartbeat (we're often waiting for long-running R/C/Fortran code to finish computation).

We're really just concerned with whether or not that process is still running (we're looking to restart jobs if Amazon took down our spot instance or if the OOM killer took the process down). It sounded like this was similar to the use case described in the readme.

I'll play with the fork a bit more, and also look to see if we can get god to help with this (maybe it can do the heartbeating as long as the process is still running, which is really all we're concerned with). I'll report back with what we end up doing, in case it's helpful for others.

In the interim, if anyone has any other ideas/suggestions I'd love to hear them!

@dlecocq
Contributor

dlecocq commented Jun 4, 2013

Sorry to be late to the party.

The original thinking about heartbeats was that 1) programmers would generally have a lot of programmatic insight / access into the work being done and 2) a running process isn't a sufficient condition for work being done. This is partly informed by one of the original purposes for qless, with a system that had some problematic code that would sometimes deadlock. Of course, it gets tricky when invoking a big outer loop in external code :-/

That's a shame about threads, because that's what I would have suggested at first. Are you by chance forking the process that's doing the computation, or are they native bindings that you're using? I thought I recalled waitpid accepting a timeout parameter but it doesn't appear to. In C++ land, you might set an alarm handler to do the timeout, but perhaps Ruby has a nicer utility to do that. Alternatively, a polling loop to check the pid state every few seconds might be possible, even if less attractive. You could probably even set up a SIGCHLD handler to complete (or fail) the job, and have the original worker process just have a loop that sleeps a little less than job.ttl before heartbeating.
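
The polling idea could look roughly like the sketch below, assuming the computation is launched as a child process. `run_with_heartbeat` and its parameters are hypothetical names for this example; `job` is any object that responds to `heartbeat`, and `heartbeat_every` should be chosen a little below the job's TTL.

```ruby
# Runs `command` (an argv array) in a child process and heartbeats `job`
# for as long as the child is alive. Returns the child's Process::Status.
def run_with_heartbeat(job, command, heartbeat_every:, poll_every: 1.0)
  pid = Process.spawn(*command)
  last_beat = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  loop do
    # With WNOHANG, wait2 returns nil right away if the child is still running.
    finished = Process.wait2(pid, Process::WNOHANG)
    return finished.last if finished

    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if now - last_beat >= heartbeat_every
      job.heartbeat
      last_beat = now
    end

    sleep poll_every
  end
end
```

The caller can then complete or fail the job based on the returned status. Note that this only shows the child process is still running, not that it is making progress.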

With respect to a mailing list, we actually just put one up yesterday: https://groups.google.com/forum/#!members/qless

@b4hand
Contributor

b4hand commented Jun 4, 2013

You can use a self-pipe to mimic the behavior of a timeout on waitpid:

http://stackoverflow.com/questions/282176/waitpid-equivalent-with-timeout
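
A minimal Ruby sketch of that self-pipe idea, under the assumption that the worker owns the `CHLD` signal handler. `ChildWaiter` is a hypothetical helper written for this example, not part of qless.

```ruby
# Self-pipe trick: the SIGCHLD handler writes a byte to a pipe, so
# IO.select wakes up as soon as a child exits instead of sleeping for
# the whole timeout.
class ChildWaiter
  def initialize
    @reader, @writer = IO.pipe
    # Installed before any child is spawned so no exit goes unnoticed.
    Signal.trap("CHLD") { @writer.write_nonblock(".", exception: false) }
  end

  # Waits up to `timeout` seconds for `pid` to exit.
  # Returns its Process::Status, or nil if it is still running.
  def wait(pid, timeout)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    loop do
      finished = Process.wait2(pid, Process::WNOHANG)
      return finished.last if finished

      remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
      return nil if remaining <= 0

      # Block until SIGCHLD arrives or the timeout expires, then drain the pipe.
      if IO.select([@reader], nil, nil, remaining)
        @reader.read_nonblock(64, exception: false)
      end
    end
  end
end
```

A worker could call `wait(pid, interval)` in a loop and heartbeat each time it returns `nil`, which gives the effect of a `waitpid` with a timeout.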

@pospischil
Author

Hey all,

Thanks for getting back to me. We ended up using god to monitor the process and do the heartbeating. It sounds like this differs from the use case heartbeats were designed for (we're not worried about deadlocks).

Regardless, I'm going to write up what we ended up doing for this, along with some other work we did to migrate from DJ to qless. I'll update this with the notes when I get a chance to do so.

@wr0ngway
Contributor

wr0ngway commented Dec 3, 2014

@pospischil we have a thread set up for all our workers to auto-heartbeat, which seems to work well enough, but we're running into an issue where qless still times the job out even though the thread logs that it made a heartbeat call. I think it might have something to do with the race condition you mentioned in the initial issue text, but I have no idea what that race condition actually is. Can you elaborate?

@pospischil
Author

@wr0ngway unfortunately it's been so long now I don't remember, and naturally I never got those notes written up.

Originally we had been using god to monitor the processes, and the god process itself did the heartbeating. We recently migrated over to a new architecture, and god was causing a lot of annoyances, so after sinking a few days into it we migrated over to monit, and then we have another process heartbeat the jobs by pids. Both approaches are basically the same and have worked perfectly for well over a year now, so I'd recommend going that route. Let me know if you have any other questions or if I can provide anything to help...
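
The thread does not show the actual setup, so the following is only a guess at the general shape of "heartbeat the jobs by pids". `process_alive?`, `heartbeat_live_jobs`, and the `jobs_by_pid` mapping are hypothetical names for this sketch; the job objects only need to respond to `heartbeat`.

```ruby
# True if a process with this pid exists. Signal 0 performs the
# existence and permission checks without delivering a signal.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false   # no such process
rescue Errno::EPERM
  true    # process exists but belongs to another user
end

# `jobs_by_pid` is a hypothetical mapping of worker pid => job object.
# Heartbeats every job whose process is still alive and returns the
# pids that were not found, so the caller can decide what to do next.
def heartbeat_live_jobs(jobs_by_pid)
  dead = []
  jobs_by_pid.each do |pid, job|
    if process_alive?(pid)
      job.heartbeat
    else
      dead << pid
    end
  end
  dead
end
```

Jobs whose process has disappeared simply stop receiving heartbeats and eventually time out. A pid check cannot tell a healthy process from a hung one, and pids can be reused, so this fits the "is the process still there" use case rather than deadlock detection.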

@wr0ngway
Contributor

wr0ngway commented Dec 3, 2014

No worries, thanks for the reply. Was just trying to decrease my debugging time :)

5 participants