We've been having rare transient problems for a while now with Resque workers which escape from our monitoring & management scripts and live to a ripe old age, unbothered by redeploys or restarts, until someone notices "Hey, that worker is running old buggy code! What the hell?"
Our management script starts new Resque processes by spawning a process, waiting a while for that child's pidfile to show up, then continuing. If a pidfile doesn't show up after a suitable delay, it assumes that the child failed to start and starts a new one. (The Resque worker does a double-fork when it daemonizes, so the pidfile is pretty much our only sensible means of getting the child's PID.) This works well for the case where the child actually failed to start, but hoses us in the case where a child was started but didn't write a pidfile in a timely fashion.
After some investigation, it looks like the problem is due to this bit of code. Consider what happens if reconnecting to Redis takes, say, 30 seconds due to a transient AWS network issue or redis-sentinel being a gong show. We fork the new process, daemonize, try to reconnect to Redis, then sit there for 30 seconds doing nothing. Meanwhile, the manager thinks "Well, I guess that one failed to start", creates a new, successful Resque process, notes its PID, and continues. Now the current process, because of that 30-second delay between forking and writing the pidfile, is still running but the manager doesn't know about it.
This two-line patch just moves the self.reconnect call below the pidfile-writing. Transient network issues shouldn't cause pidfiles to be slow — once the child forks, it should publish its PID right away. If the reconnect fails and we end up with a stale pidfile in the PIDs directory, that's okay! It's easy to test to see whether or not it's stale. That's more predictable and less irritating than having a living child with no pidfile.
Reconnect to Redis after writing the pidfile.
Coverage remained the same at 83.148% when pulling baf9bac on fimmtiu:reconnect-after-writing-pidfile into 20d8850 on resque:master.
Note that there's a CI failure on this branch which has nothing to do with the contents of the pull request; looks like some sort of issue with the term-ansicolor gem being incompatible with Ruby 1.8.7. Looks like you've got a number of recent branches with similar failures.
This sounds reasonable to me. @casperisfine, thoughts?
That change makes perfect sense. 👍
Thanks for your contribution!
No problem! I quite appreciate your timely responses.