Reconnect to Redis after writing the pidfile. #1528

Merged
merged 1 commit into from Dec 13, 2016

Projects

None yet

4 participants

@fimmtiu
Contributor
fimmtiu commented Dec 12, 2016 edited

We've been having rare transient problems for a while now with Resque workers which escape from our monitoring & management scripts and live to a ripe old age, unbothered by redeploys or restarts, until someone notices "Hey, that worker is running old buggy code! What the hell?"

Our management script starts new Resque processes by spawning a process, waiting a while for that child's pidfile to show up, then continuing. If a pidfile doesn't show up after a suitable delay, it assumes that the child failed to start and starts a new one. (The Resque worker does a double-fork when it daemonizes, so the pidfile is pretty much our only sensible means of getting the child's PID.) This works well for the case where the child actually failed to start, but hoses us in the case where a child was started but didn't write a pidfile in a timely fashion.

After some investigation, it looks like the problem is due to this bit of code. Consider what happens if reconnecting to Redis takes, say, 30 seconds due to a transient AWS network issue or redis-sentinel being a gong show. We fork the new process, daemonize, try to reconnect to Redis, then sit there for 30 seconds doing nothing. Meanwhile, the manager thinks "Well, I guess that one failed to start", creates a new, successful Resque process, notes its PID, and continues. Now the current process, because of that 30-second delay between forking and writing the pidfile, is still running but the manager doesn't know about it.

This two-line patch just moves the self.reconnect call below the pidfile-writing. Transient network issues shouldn't cause pidfiles to be slow — once the child forks, it should publish its PID right away. If the reconnect fails and we end up with a stale pidfile in the PIDs directory, that's okay! It's easy to test to see whether or not it's stale. That's more predictable and less irritating than having a living child with no pidfile.

@coveralls

Coverage Status

Coverage remained the same at 83.148% when pulling baf9bac on fimmtiu:reconnect-after-writing-pidfile into 20d8850 on resque:master.

@coveralls
coveralls commented Dec 12, 2016 edited

Coverage Status

Coverage remained the same at 83.148% when pulling baf9bac on fimmtiu:reconnect-after-writing-pidfile into 20d8850 on resque:master.

@fimmtiu
Contributor
fimmtiu commented Dec 12, 2016

Note that there's a CI failure on this branch which has nothing to do with the contents of the pull request; looks like some sort of issue with the term-ansicolor gem being incompatible with Ruby 1.8.7. Looks like you've got a number of recent branches with similar failures.

@fw42
Member
fw42 commented Dec 12, 2016

This sounds reasonable to me. @casperisfine, thoughts?

@casperisfine

That change makes perfect sense. 👍

@fw42 fw42 merged commit 835a4a7 into resque:master Dec 13, 2016

1 check failed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
@fw42
Member
fw42 commented Dec 13, 2016

Thanks for your contribution!

@fimmtiu
Contributor
fimmtiu commented Dec 13, 2016

No problem! I quite appreciate your timely responses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment