Sometimes Loner "locks" a given job when it completed correctly #18

Closed
marcgg opened this Issue Jan 17, 2012 · 25 comments

6 participants

@marcgg

Hi!

First off, thanks for the great gem; I've been using it extensively on a large-scale application, but recently I've been seeing an issue.

It seems that sometimes, even though the job is performed correctly, loner will not remove the lock.
It appears sporadically and I don't see a pattern for it.

The thing is, I use resque to process hundreds of jobs per second, so I was wondering if you can see a point in loner where this could happen, for instance if there are too many accesses at once.

I kind of "fixed" the problem by adding a TTL, but some queues can't afford to be locked for more than a couple of minutes. I decided to remove loner from those kinds of queues, but I'd rather keep it for safety and get loner working perfectly.
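
Roughly what that workaround looks like (the class and numbers are illustrative, and I'm assuming the TTL attribute is @loner_ttl as in the loner README, so adjust to your version):

  class LikesFetcher
    # resque-loner's unique-job module
    include Resque::Plugins::UniqueJob
    @queue = :high_priority

    # Safety net: let the lock key expire on its own after two minutes.
    # The attribute name is assumed from the loner README; check your version.
    @loner_ttl = 120

    def self.perform(facebook_id)
      # ... fetch likes for facebook_id ...
    end
  end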

I'm using a custom redis_key as well, not sure if it could be related.

If you have any ideas about this, or if you've seen the same behavior somewhere, I'd be very interested.

Thanks

@marcgg

I should add that ever since I updated to the latest version, I've been seeing this issue much more often.

@jayniz

Hey @marcgg, thanks for the thanks, I appreciate it!

Hmm... We use it with the default redis_key method and see it every once in a while. I didn't notice a bump with the last update, but I also didn't keep a close eye on it, since it's not super critical for us.

I think it would be interesting to use a redis_key method (I should really have named it loner_key) that doesn't do any md5'ing but uses key-value pairs, like in URLs. That would help to make sure the method really creates only one value for each job with (semantically) the same parameters.

Would that be possible in your case?

E.g.

# Build one human-readable key per semantically identical payload,
# e.g. {"action" => "like", "user" => 42} => "action:like;user:42"
def redis_key(payload)
  payload.keys.sort.map do |key|
    "#{key}:#{payload[key]}"
  end.join(";")
end

Assuming your arguments don't contain arbitrary values that would make for invalid Redis keys, it would be interesting to look at the stale loner keys produced that way.

@marcgg

My redis key method:

  def self.redis_key(payload)
    # Round-trip through Resque's encode/decode so the payload looks like
    # it will after Redis (string keys instead of symbols)
    payload = decode(encode(payload))
    args = payload[:args] || payload["args"]
    # Jobs are unique per facebook_id, so that argument alone is the key
    args[ATTRIBUTES_ORDER.index("facebook_id")]
  end

ATTRIBUTES_ORDER is just a hash that helps us map which argument position holds the facebook_id.

@jayniz

Ah, so the jobs are unique by facebook id - did you take a look at the stale key to see if you notice anything suspicious?

@marcgg

The key stored in redis looks like this: staging:loners:queue:fetchers:likes:high_priority:job:319273011451268, where 319273011451268 is the facebook id I'm working on. It looks OK to me.

@jayniz

Okay, so this really doesn't leave any room for an error in the redis_key method, I think. That tells me it's probably not the first of the two scenarios I have in my head:

  1. Loner tries to remove another loner key than it initially created - so the loner key that was initially created never gets removed
  2. Loner doesn't reach the point where the key is removed

The latter is where I initially expected the problem, but 1. would probably be easier to track down. The next step in debugging would be to log the removal commands for loner keys somewhere, to compare which loner keys were created and which were removed (or attempted to be removed).
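
Something quick and dirty along these lines would do (only a sketch, not loner internals; which Redis commands loner actually issues and the "loners" substring are assumptions you may need to adjust):

  require 'resque'
  require 'logger'

  # Wrap the Redis connection Resque uses and log every command that
  # touches a loner key, so created keys can be diffed against removed ones.
  module LonerKeyLogger
    LOG = Logger.new('log/loner_keys.log') # example path

    %w[set setnx del].each do |cmd|
      define_method(cmd) do |*args|
        key = args.first.to_s
        LOG.info("#{cmd.upcase} #{key}") if key.include?('loners')
        super(*args)
      end
    end
  end

  # Resque.redis is the (namespaced) connection all loner calls go through
  Resque.redis.extend(LonerKeyLogger)

After it has run for a while, diff the SET/SETNX lines against the DEL lines and see which keys never got a DEL.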

@marcgg

Okay, I'll try logging loner actions in Redis and post the results here.
Thanks for your help!

@jayniz

@marcgg Awesome, let me know how it goes!

@marcgg

What could cause the second problem?
Could it be something internal to Loner?

I'm adding a lot of logging, but I don't see anything interesting just yet. Like I said, it's sporadic and hard to reproduce. I'm trying to rule out as many options as possible to pinpoint the problem.

At some point I had workers "freezing". I added https://github.com/jeffgran/resque-timeout to avoid too many problems. It happened very rarely, but maybe it's the same problem?

@jayniz

Yeah, the second problem is tricky. Resque-loner contains all kinds of tests to make sure that the loner keys are removed when a job throws an exception, etc. But I don't know what happens when a worker takes too long, freezes and/or gets killed from outside; there might be some glitch where a key never gets removed. Since I wrote resque-loner, new hooks have been introduced into resque, if I remember correctly.

So it's possible that resque-loner could become a "clean" plugin that doesn't extend/mess with resque internals but relies solely on hooks. In the README I described which hook I was missing back then. I don't have much time to look into it at the moment, though :-/
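
Just to sketch the shape of it (illustrative names only, not what loner does today):

  # A hook-only unique-job sketch: Resque calls any before_enqueue_* and
  # around_perform_* methods defined on modules extended into a job class.
  module UniqueJobViaHooks
    def unique_key(*args)
      "unique:#{name}:#{args.inspect}"
    end

    # Returning false aborts the enqueue, so an already-queued job
    # can't be queued a second time.
    def before_enqueue_unique(*args)
      Resque.redis.setnx(unique_key(*args), Time.now.to_i)
    end

    # Remove the key whether perform succeeds or raises.
    def around_perform_unique(*args)
      yield
    ensure
      Resque.redis.del(unique_key(*args))
    end
  end

  class FetchLikes
    extend UniqueJobViaHooks
    @queue = :high_priority

    def self.perform(facebook_id)
      # ...
    end
  end

The catch is the same as above: if the worker gets kill -9'd in the middle of perform, the ensure never runs and the key sticks around.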

@marcgg

Thanks, I'll look more into it and let you know what I've decided/found

@jayniz

Thanks a lot @marcgg, I'll leave this open until we've resolved it.

@jayniz

pinging señor @marcgg ;)

@marcgg

I kept having the same issue, with no real pattern.

I decided to try out resque-lock to see if it would change anything, and it didn't, so I figure it's more of a design flaw in either my system or resque when used with a very large volume of jobs (enqueuing thousands of them, constantly).

@bogdan

I'm not entirely sure, but this could be related to https://github.com/defunkt/resque/pull/479

@yankov

I've been having the same issue for a long time. Could it be related to the Redis version? Mine is 2.0.4, pretty old I guess. @marcgg, what's yours?

@marcgg

@yankov Mine is 2.2.1

@marcgg

@bogdan All my jobs have a queue specified, so I don't think the bug you reference is related to my problem.

@craigambrose

Having the issue here as well. It particularly seems to occur with some of my longest-running jobs, which I'm fairly sure had been hitting resource (RAM) limits and timing out.

@jayniz

@craigambrose What exactly do you mean by timing out?

@marcgg

I have a few workers timing out as well. In my case it's monit that is killing them because they have been running for way too long for various reasons (slow HTTP calls, internal API downtimes...).

@craigambrose

@jayniz, sorry, very imprecise language on my part. I'm talking about workers that monit is killing based on my configuration, for example for exceeding RAM limits.

@aerodynamik

Hi all!

After spending hours on this, I've finally tracked down the problem :) Please check the pull request for details.

@jayniz

Merged, gem built, 1.2.1 released, thank you sir! @aerodynamik

@jayniz closed this Apr 27, 2012
@bogdan

Great job, thank you.
