First off, thanks for the great gem. I've been using it extensively in a large-scale application, but recently I've been seeing an issue.
It seems that sometimes, even though the job is performed correctly, loner will not remove the lock.
It appears sporadically and I don't see a pattern for it.
The thing is, I use resque to process hundreds of jobs per second, so I was wondering if you can see a point in loner where this could happen, for instance if there are too many concurrent accesses.
I kind of "fixed" the problem by adding a TTL, but some queues can't afford to stay locked for more than a couple of minutes. I decided to remove loner from those queues, but I'd rather keep it as a safety net and get loner to work perfectly.
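The TTL workaround mentioned above amounts to pairing the lock's SETNX with an EXPIRE, so a stale lock eventually clears itself. A minimal sketch, assuming a generic redis client; the helper name and key format here are illustrative, not resque-loner's API:

```ruby
# Hedged sketch of the TTL workaround: give each loner key an expiry so
# that a stale lock eventually clears itself on its own.
def acquire_lock(redis, key, ttl_seconds)
  acquired = redis.setnx(key, Time.now.to_i)
  # Only the process that created the key sets the expiry, so a
  # concurrent enqueue can't refresh someone else's lock.
  redis.expire(key, ttl_seconds) if acquired
  acquired
end
```

The trade-off is exactly the one described above: if the TTL is short enough to unblock a busy queue, a slow-but-healthy job can lose its lock while still running.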
I'm using a custom redis_key as well, not sure if it could be related.
If you have any ideas regarding this or if you saw the same behavior somewhere it would interest me a lot.
I could even add that ever since I updated to the latest version I'm seeing this issue way more often.
Hey @marcgg, thanks for the thanks, I appreciate it!
Hmm... We use it with the default redis_key method and get it every once in a while. Didn't notice a bump with the last update, but also didn't keep a close eye on it, since it's not super critical for us.
I think it would be interesting to use a redis_key method (I should really have named it loner_key) that doesn't do any md5'ing but uses key/value pairs, like in URLs. That would help make sure the method really creates only one key for each job with the (semantically) same parameters.
Would that be possible in your case?
payload.keys.sort.map do |key|
Given you don't have arbitrary stuff in your arguments that makes for invalid redis keys, this would be interesting to have a look at stale loner keys like that.
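A sketch of what such a key/value-pair redis_key could look like, building on the `payload.keys.sort.map` fragment above. The payload shape and the "loners" prefix are assumptions for illustration, not resque-loner's actual implementation:

```ruby
# Sketch of a redis_key that skips md5 and builds a readable,
# URL-style key from the sorted argument pairs.
def redis_key(payload)
  args = payload['args'] || payload[:args] || {}
  # Sorting the keys ensures that two jobs with the same arguments in a
  # different order still produce the same loner key.
  pairs = args.keys.sort.map { |key| "#{key}=#{args[key]}" }
  "loners:#{payload['class'] || payload[:class]}:#{pairs.join('&')}"
end
```

With readable keys like these, a stale key in redis immediately tells you which job (and which arguments) failed to clean up.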
My redis key method:
payload = decode(encode(payload))
args = (payload[:args] || payload["args"])
ATTRIBUTES_ORDER is just a hash that helps us keep track of which argument is which, e.g. that the Xth argument is the facebook_id.
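For context, a hypothetical reconstruction of the method described above might look like this; the ATTRIBUTES_ORDER contents and the key format are guesses based on the snippets in this thread, not the actual code:

```ruby
# Hypothetical reconstruction: jobs take positional arguments, and
# ATTRIBUTES_ORDER maps an attribute name to its position so the key
# can embed the facebook_id.
ATTRIBUTES_ORDER = { facebook_id: 0 } # assumed mapping

def redis_key(payload)
  # Normalize symbol/string keys (the decode(encode(payload))
  # round-trip in the original serves the same purpose).
  payload = payload.transform_keys(&:to_s)
  args = payload['args'] || []
  "loners:queue:#{payload['queue']}:job:#{args[ATTRIBUTES_ORDER[:facebook_id]]}"
end
```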
Ah, so the jobs are unique by facebook id - did you take a look at the stale key to see if you notice anything suspicious?
The key stored in redis looks like this: staging:loners:queue:fetchers:likes:high_priority:job:319273011451268, where 319273011451268 is the facebook id I'm working on. It looks OK to me.
Okay, so this really doesn't leave any room for an error in the redis_key method, I think. That would tell me it is not the first of the following two scenarios I have in my head:
The latter is where I initially expected the problem, but 1. would probably be easier to track down. The next step in debugging would be to log the removal commands for loner keys somewhere, so we can compare which loner keys were created with which were removed (or which removals were attempted).
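The logging step suggested above could be sketched like this; LoggingRedis and the key check are illustrative, not part of resque-loner:

```ruby
require 'logger'

# Wrap the redis client so every loner-key SET and DEL is logged,
# making it possible to diff the keys that were created against the
# keys that were removed.
class LoggingRedis
  def initialize(redis, logger = Logger.new($stdout))
    @redis  = redis
    @logger = logger
  end

  def set(key, value)
    @logger.info("loner SET #{key}") if key.include?('loners')
    @redis.set(key, value)
  end

  def del(key)
    @logger.info("loner DEL #{key}") if key.include?('loners')
    @redis.del(key)
  end
end
```

Any key that shows up with a SET but no matching DEL in the log is a candidate for the stale-lock bug.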
Okay, I'll try logging loner actions in redis and post the results here.
Thanks for your help
@marcgg Awesome, let me know how it goes!
What could cause the second problem?
Could it be something internal to Loner?
I'm adding a lot of logging, but I don't see anything interesting just yet. Like I said, it's sporadic and hard to reproduce. I'm trying to rule out as many options as possible to pinpoint the problem.
At some point I had workers "freezing". I added https://github.com/jeffgran/resque-timeout in order to avoid too many problems. It happened very rarely, but maybe it's the same problem?
Yeah, the second problem is tricky. Resque-loner contains all kinds of tests to make sure that loner keys are removed when a job throws an exception, etc. But I don't know what happens when a worker takes too long and freezes and/or gets killed from the outside; there might be some glitch where a key remains un-removed. Since I wrote resque-loner, new hooks have been introduced into resque, if I remember correctly.
So it's possible that resque-loner could become a "clean" plugin that doesn't extend/mess with resque but relies solely on hooks. In the README I described which hook I was missing back then. I don't have much time to look into it at the moment, though :-/
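A hooks-only plugin along those lines could be sketched as follows. It uses Resque's plugin hook naming conventions (before_enqueue_*, around_perform_*), but the module itself is an illustration, not resque-loner's actual code, and the method and key names are assumptions:

```ruby
# Illustrative sketch of a uniqueness plugin built purely on Resque's
# hook conventions, with no patching of Resque internals.
module UniqueJob
  attr_writer :redis

  def redis
    @redis ||= Resque.redis # in tests, inject any client with setnx/del
  end

  def lock_key(*args)
    "loners:#{name}:#{args.join(':')}"
  end

  # Resque skips the enqueue when a before_enqueue hook returns false,
  # so SETNX doubles as the "is this job already queued?" check.
  def before_enqueue_lock(*args)
    redis.setnx(lock_key(*args), Time.now.to_i)
  end

  def around_perform_lock(*args)
    yield
  ensure
    # Release the lock even when perform raises. A worker killed with
    # SIGKILL never reaches this line, which matches the stale-key
    # scenario discussed in this thread.
    redis.del(lock_key(*args))
  end
end
```

Even with this design, the hard failure mode stays the same: nothing running inside the worker process can release the lock if the process is killed outright.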
Thanks, I'll look more into it and let you know what I've decided/found
Thanks a lot @marcgg, I'll leave this open until we've resolved it.
pinging señor @marcgg ;)
I kept having the same issue, with no real pattern.
I decided to try out resque-lock to see if it would change anything, and it didn't, so I figured it's more of a design flaw in either my system or resque when used with a very large amount of data (constantly enqueuing thousands of jobs).
I'm not entirely sure, but this could be related to https://github.com/defunkt/resque/pull/479
Having the same issue for a long time. Could it be related to the Redis version? Mine is 2.0.4, pretty old I guess. @marcgg what's yours?
@yankov Mine is 2.2.1
@bogdan All my jobs have a queue specified, so I don't think the bug you reference is related to my problem.
Having the issue here as well. It particularly seems to occur with some of my longest-running jobs, which I'm fairly sure had been hitting resource (RAM) limits and timing out.
@craigambrose What exactly do you mean by timing out?
I have a few workers timing out as well. In my situation it's monit that is killing them because they have been running for way too long for various reasons (slow HTTP calls, internal API downtimes...).
@jayniz, sorry, very imprecise language on my part. I'm talking about workers that monit is killing based on my configuration, e.g. for exceeding RAM limits.
After spending hours on this, the problem has finally been found :) Please check the pull request for details.
Merged, gem built, 1.2.1 released, thank you sir! @aerodynamik
Great job, thank you.