Unresponsive leader and lost DOWNs #30
I ran into an issue where a globally registered name is not automatically cleaned up by gproc if the process that owns the name crashes while the leader is unresponsive. My guess is that this happens because of a lost DOWN notification. Steps to reproduce:
start two nodes dev1 and dev2 (dev1 is the leader):
start a process on dev2 that registers a global name:
within the 10-second timeout, suspend dev1 by pressing CTRL+Z
wait for dev1 to disappear from nodes:
the registration is still there:
gproc refuses to register the name again:
where/1 filters it out since it's a local pid:
Is this behavior a bug or a feature? Is there a good way to cope with it?
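For readers who want to try the scenario, the process started on dev2 could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code from the original report: the module name `gproc_down_repro`, the name `my_name`, the exit reason, and the helper `check/1` are all hypothetical, and it assumes gproc is running on both nodes with distributed (gproc_dist) support enabled. Only `gproc:reg/1` and `gproc:where/1` are actual gproc calls.

```erlang
%% Hypothetical module sketching the reproduction; not the reporter's code.
-module(gproc_down_repro).
-export([start/0, start/2, check/1]).

%% Assumed defaults: the name is made up, and the 10000 ms lifetime
%% mirrors the 10-second timeout mentioned in the steps.
start() ->
    start(my_name, 10000).

%% Run on dev2: spawn a process that registers a global name and then
%% exits abnormally after LifetimeMs milliseconds.
start(Name, LifetimeMs) ->
    spawn(fun() ->
        %% {n, g, Name} is a unique (n) name with global (g) scope.
        true = gproc:reg({n, g, Name}),
        receive
        after LifetimeMs ->
            exit(simulated_crash)
        end
    end).

%% Run on dev2 once dev1 has dropped out of nodes(): returns the list of
%% connected nodes together with what gproc reports for the name.
check(Name) ->
    {nodes(), gproc:where({n, g, Name})}.
```

Calling `gproc_down_repro:start()` on dev2, suspending dev1 before the process exits, and then calling `gproc_down_repro:check(my_name)` walks through the steps listed above.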
I have some ideas, but the trickiest part of the problem is that a netsplit occurs. Only a few of the gen_leader versions (e.g. garret-smith/gen_leader_revival) have some support for netsplits, and at least when I try this scenario with garret-smith's version, it doesn't seem to do the right thing.
However, a few things come to mind:
@msadkov There exists an application to detect network splits for mnesia and hibari. It would need some customisation for gproc. Nevertheless, it might be of help to you.
The application is here => https://github.com/hibari/partition-detector
The admin documentation is here => http://hibari.github.com/hibari-doc/hibari-sysadmin-guide.en.html#partition-detector
@uwiger @norton thank you for your replies! I'm aware of gen_leader/gproc not being able to handle net splits, so