detect when IPAM has been seeded by different peers #1499
This is an early detector for a multitude of other symptoms, some of which cause panics.

When creating the ring we record the paxos-supplied normalised list of peers used for seeding. This subsequently gets gossipped around as part of the Ring data structure, and checked in ring.Merge that it is the same for the recipient.

There are some subtleties due to the desire to maintain backward compatibility and deal with empty rings:

- The gossip gob decoder will ignore the Ring.Seeds field when an old weave receives a Ring from a new weave. So everything on the old weave will just continue to work as before.
- The gossip gob decoder will set the Ring.Seeds field to nil when a new weave receives a Ring from an old weave. We therefore omit the seed equivalence check when the received Seeds slice is empty.
- Our ring may not have any seeds, either because it is empty or was created from a ring received from an old weave. We therefore a) omit the seed equivalence check in that situation, and b) set our seeds to those received. Note that this allows us to learn about the seeds well after ring creation, a situation which may arise when receiving gossip from a mixture of old and new weaves.

Fixes #1463.
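The three seed-handling rules in the description can be sketched roughly as follows. This is an illustration only, not the code in this PR: the cut-down `Ring`, the `PeerName` type, the `checkAndAdoptSeeds` method and the error text are all assumptions, and the element-wise comparison assumes both lists are already normalised into the same order.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDifferentSeeds reuses the error name from this PR; the message is assumed.
var ErrDifferentSeeds = errors.New("IP allocation was seeded by different peers")

// PeerName is a hypothetical stand-in for the real peer identifier type.
type PeerName uint64

// Ring is a hypothetical, cut-down model holding only the Seeds field.
type Ring struct {
	Seeds []PeerName
}

// sameSeeds compares two seed lists element by element.
// It assumes both lists are normalised, so equal sets have equal order.
func sameSeeds(a, b []PeerName) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// checkAndAdoptSeeds (hypothetical name) applies the rules from the description.
func (r *Ring) checkAndAdoptSeeds(received []PeerName) error {
	switch {
	case len(received) == 0:
		// Sender is an old weave or has an empty ring: skip the check.
		return nil
	case len(r.Seeds) == 0:
		// We have no seeds yet: skip the check and adopt the received ones.
		r.Seeds = append([]PeerName(nil), received...)
		return nil
	case !sameSeeds(r.Seeds, received):
		return ErrDifferentSeeds
	}
	return nil
}

func main() {
	ours := &Ring{Seeds: []PeerName{1, 2, 3}}
	fmt.Println("from old weave:", ours.checkAndAdoptSeeds(nil))
	fmt.Println("same seeds:", ours.checkAndAdoptSeeds([]PeerName{1, 2, 3}))
	fmt.Println("different seeds:", ours.checkAndAdoptSeeds([]PeerName{4, 5, 6}))

	late := &Ring{} // e.g. created from a ring received from an old weave
	fmt.Println("late learner:", late.checkAndAdoptSeeds([]PeerName{1, 2, 3}), late.Seeds)
}
```

Copying the received slice before storing it keeps the recipient's seeds independent of the decoded gossip message.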
|
The only possible thing left to do is presenting a better error message...
With the seed and range check in place, can any of the other errors returned from
@tomwilkie appreciate review of this and thoughts on the above. |
|
This change looks straightforward enough and is definitely worth it. My biggest concern is that I'm not convinced it will fix all occurrences of #1463; or put another way, I'm not convinced the assertion "all rings seeded by same set of peers will never panic on merge, regardless of the combinations of independently allowed mutations" is true (I'm paraphrasing). Does that make sense? I'm not saying that the assertion can't be true, just that it is not clear to me right now that it is. |
I am not 100% sure either, so let's stick with non-panics on merge failures. |
So the idea was to not panic on merge failures (by implementing a bunch of preconditions to detect when the merge would fail). We missed this precondition. I'd prefer to add another precondition to catch this (as opposed to disabling all panics); if we disable the panics I worry we won't catch the next problem as quickly. |
|
The current code (on master) does not panic on merge failures. Only when the post-conditions fail. |
|
i.e. there are four types of checks we perform in
On master, 1+4 panic, 2+3 just cause the connection to drop. I was contemplating whether with the changes in this PR, 3 should become panics too, on the grounds that I believe the newly introduced pre-merge check now means that 3 can only arise due to bugs and data corruption. But I am not 100% sure about this, and neither are you, so my latest proposal is to continue with just dropping the connection on 3. |
|
The panic that this is fixing was (4), IIRC. By introducing this new check, we aim to avoid this panic by catching the case in (2), right? If so, and we are not convinced this new check will catch all occurrences of this problem, then it stands to reason that the checks at (4) could still panic? Given this, I'm asking that we add another check at (2) for this particular case (merging two rings would result in invalid free space reporting). Does that make sense? |
|
Well, I am quite convinced that the new check catches all user-induced occurrences of the problem. And even if it doesn't, I wouldn't mind the panics; at least that way we'll find out what we missed. |
Yes, but it does more than that, i.e. I believe it also eliminates all the conditions checked for in (3). |
There are three significant changes here:

- rewrite ring.ErrDifferentSeeds errors
- remove the pre-checking of ranges and instead rewrite ring.ErrDifferentRange errors
- don't pruneNicknames() and signal ringUpdated() when there was an error

We don't rewrite other errors since our current belief is that only ErrDifferentSeeds and ErrDifferentRange can arise from user error.
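A rough sketch of the shape of that change, under stated assumptions: only the names `ErrDifferentSeeds` and `ErrDifferentRange` come from this PR. The `allocator` type, `rewriteMergeError`, `update`, the `updates` counter (standing in for `pruneNicknames()` and `ringUpdated()`) and all message wording are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Error names from this PR; the messages are assumed.
var (
	ErrDifferentSeeds = errors.New("different seeds")
	ErrDifferentRange = errors.New("different range")
)

// rewriteMergeError (hypothetical) turns the two errors believed to stem
// from user error into friendlier messages and passes everything else through.
func rewriteMergeError(err error) error {
	switch {
	case errors.Is(err, ErrDifferentSeeds):
		return fmt.Errorf("peers were started with inconsistent seed sets (illustrative wording): %w", err)
	case errors.Is(err, ErrDifferentRange):
		return fmt.Errorf("peers were started with different allocation ranges (illustrative wording): %w", err)
	}
	return err
}

// allocator is a hypothetical stand-in for the IPAM allocator.
type allocator struct {
	updates int // counts runs of the post-merge bookkeeping
}

// update runs a merge and performs the follow-up work only on success.
func (a *allocator) update(merge func() error) error {
	if err := merge(); err != nil {
		// Merge failed: skip the bookkeeping entirely.
		return rewriteMergeError(err)
	}
	a.updates++ // stands in for pruneNicknames() and ringUpdated()
	return nil
}

func main() {
	a := &allocator{}
	fmt.Println(a.update(func() error { return ErrDifferentSeeds }), a.updates)
	fmt.Println(a.update(func() error { return nil }), a.updates)
}
```

Wrapping with `%w` keeps the original error reachable through `errors.Is`, so callers can still distinguish the two cases after the message has been rewritten.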
|
new error output: |
|
Code looks good, and it works (tested locally): I'd prefer if we dropped the "fixes #1463", as I'd like more time to think, but that shouldn't block the merging of this. |
detect when IPAM has been seeded by different peers
This is an early detector for a multitude of other symptoms, some of which cause panics.
When creating the ring we record the paxos-supplied normalised list of peers used for seeding. This subsequently gets gossipped around as part of the Ring data structure, and checked in ring.Merge that it is the same for the recipient.
There are some subtleties due to the desire to maintain backward compatibility and deal with empty rings
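The backward-compatibility point rests on how Go's `encoding/gob` treats struct fields that exist on only one side: fields the receiver lacks are ignored, and fields the sender lacks are left at their zero value. A small self-contained demonstration, where `oldRing`, `newRing` and `transcode` are hypothetical stand-ins rather than weave types:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// oldRing models a Ring as an old weave knows it (hypothetical fields).
type oldRing struct {
	Start, End uint32
}

// newRing adds the Seeds field, as a new weave would have it.
type newRing struct {
	Start, End uint32
	Seeds      []uint64
}

// transcode gob-encodes from and decodes the bytes into to (a pointer).
func transcode(from, to interface{}) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(from); err != nil {
		return err
	}
	return gob.NewDecoder(&buf).Decode(to)
}

func main() {
	// New sender, old receiver: the unknown Seeds field is ignored.
	var o oldRing
	err := transcode(newRing{Start: 1, End: 9, Seeds: []uint64{7, 8}}, &o)
	fmt.Println(err, o)

	// Old sender, new receiver: Seeds is absent from the stream and stays nil.
	var n newRing
	err = transcode(oldRing{Start: 1, End: 9}, &n)
	fmt.Println(err, n.Seeds == nil)
}
```

gob matches struct fields by name, which is why the two differently named types can exchange data here.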
Fixes #1463. Fixes #1178.