Skip to content
This repository was archived by the owner on Jun 20, 2024. It is now read-only.

detect when IPAM has been seeded by different peers#1499

Merged
tomwilkie merged 3 commits intomasterfrom
1463-ipam-seed-mismatch
Oct 9, 2015
Merged

detect when IPAM has been seeded by different peers#1499
tomwilkie merged 3 commits intomasterfrom
1463-ipam-seed-mismatch

Conversation

@rade
Copy link
Member

@rade rade commented Oct 2, 2015

This is an early detector for a multitude of other symptoms, some of which cause panics.

When creating the ring we record the paxos-supplied normalised list of peers used for seeding. This subsequently gets gossipped around as part of the Ring data structure, and checked in ring.Merge that it is the same for the recipient.

There are some subtleties due to the desire to maintain backward compatibility and deal with empty rings

  • The gossip gob decoder will ignore the Ring.Seeds field when an old weave receives a Ring from a new weave. So everything on the old weave will just continue to work as before.
  • The gossip gob decoder will set the Ring.Seeds field to nil when a new weave receives a Ring from an old weave. We therefore omit the seed equivalence check when the received Seeds slice is empty.
  • Our ring may not have any seeds - either because it is empty or was created from a ring received from an old weave. We therefore a) omit the seed equivalence check in that situation, and b) set our seeds to those received. Note that this allows us to learn about the seeds well after ring creation - a situation which may arise when receiving gossip from a mixture of old and new weaves.

Fixes #1463. Fixes #1178.

This is an early detector for a multitude of other symptoms, some of
which cause panics.

When creating the ring we record the paxos-supplied normalised list of
peers used for seeding. This subsequently gets gossipped around as
part of the Ring data structure, and checked in ring.Merge that it is
the same for the recipient.

There are some subtleties due to the desire to maintain backward
compatibility and deal with empty rings

- The gossip gob decoder will ignore the Ring.Seeds field when an old
  weave receives a Ring from a new weave. So everything on the old
  weave will just continue to work as before.

- The gossip gob decoder will set the Ring.Seeds field to nil when a
  new weave receives a Ring from an old weave. We therefore omit the
  seed equivalence check when the received Seeds slice is empty.

- Our ring may not have any seeds - either because it is empty or was
  created from a ring received from an old weave. We therefore a) omit
  the seed equivalence check in that situation, and b) set our seeds
  to those received. Note that this allows us to learn about the seeds
  well after ring creation - a situation which may arise when
  receiving gossip from a mixture of old and new weaves.

Fixes #1463.
@rade rade changed the title detect when IPAM has been seeded by different peers WIP - detect when IPAM has been seeded by different peers Oct 2, 2015
@rade
Copy link
Member Author

rade commented Oct 2, 2015

The only possible thing left to do is presenting a better error message...

ring.go knows too little about the context to produce intelligible errors. So my plan is to catch the ErrDifferentSeeds at the call site and return a better error. While at it, we might as well do the same for ErrDifferentRange, instead of the current pre-Merge check.

With the seed and range check in place, can any of the other errors returned from Ring.Merge actually arise except through buggy code / corrupted data? If so, perhaps we should panic?

@tomwilkie appreciate review of this and thoughts on the above.

@tomwilkie
Copy link
Contributor

This change looks straight forward enough and is definitely worth it. My biggest concern is that I'm not convinced it will fix all occurrences of #1463; or put another way, I'm not convinces the assertion "all rings seeded by same set of peers will never panic on merge, regardless of the combinations of independently allowed mutations" is true (I'm paraphrasing).

Does that make sense? I'm not saying that the assertion can't be true, just that is not clear to me right now that it is.

@rade rade assigned rade and unassigned tomwilkie Oct 5, 2015
@rade
Copy link
Member Author

rade commented Oct 5, 2015

I'm not saying that the assertion can't be true, just that is not clear to me right now that it is.

I am not 100% sure either, so let's stick with non-panics on merge failures.

@tomwilkie
Copy link
Contributor

I am not 100% sure either, so let's stick with non-panics on merge failures.

So the idea was to not panic on merge failures (by implementing a bunch of preconditions to detect when the merge would fail). We missed this precondition. I'd prefer to add another precondition to catch this (as opposed to disabling all panics) - if we disable the panics I worries we won't catch the next problem as quickly.

@rade
Copy link
Member Author

rade commented Oct 5, 2015

The current code (on master) does not panic on merge failures. Only when the post-conditions fail.

@rade
Copy link
Member Author

rade commented Oct 5, 2015

i.e. there are four types of checks we perform in ring.Merge:

  1. pre condition invariants
  2. pre-merge checks
  3. checks during merge, before we've done any state updates
  4. post condition invariants

On master, 1+4 panic, 2+3 just cause the connection to drop.

I was contemplating whether with the changes in this PR, 3 should become panics too, on the grounds that I believe the newly introduced pre-merge check now means that 3 can only arise due to bugs and data corruption. But I am not 100% sure about this, and neither are you, so my latest proposal is to continue with just dropping the connection on 3.

@tomwilkie
Copy link
Contributor

The panic that this is fixing was (4), IIRC. By introducing this new check, we aim to avoid this panic by catching the case in (2), right?

If so, and we are not convince this new check will catch all occurrences of this problem, then it stands the the checks at (4) could still panic? Given this, I'm asking that we add another check at (2) for this particular case (merging two rings would result in invalid free space reporting).

Does that make sense?

@rade
Copy link
Member Author

rade commented Oct 5, 2015

Well, I am quite convinced that the new check catches all user-induced occurrences of the problem. And even if it doesn't, I wouldn't mind the panics; at least that way we'll find out what we missed.

@rade
Copy link
Member Author

rade commented Oct 5, 2015

The panic that this is fixing was (4), IIRC.

Yes, but it does more than that, i.e. I believe it also eliminates all the conditions checked for in (3).

rade added 2 commits October 6, 2015 14:20
There are three significant changes here:

- rewrite ring.ErrDifferentSeeds errors
- remove the pre-checking of ranges and instead rewrite
  ring.ErrDifferentRange errors
- don't pruneNicknames() and signal ringUpdated() when there was an
  error

We don't rewrite other errors since our current belief is that only
ErrDifferentSeeds and ErrDifferentRange can arise from user error.
@rade rade assigned tomwilkie and unassigned rade Oct 6, 2015
@rade
Copy link
Member Author

rade commented Oct 6, 2015

new error output:

$ weave status connections
-> 192.168.48.11:6783    failed      Incompatible IP allocation ranges (received: 10.2.0.0/16, ours: 10.32.0.0/12), retry: 2015-10-06 12:47:52.865402414 +0000 UTC
...
$ weave status connections
-> 192.168.48.11:6783    failed      IP allocation was seeded by different peers (received: [d6:ab:08:3a:05:c4(host1)], ours: [5a:f9:90:8d:38:54(xps)]), retry: 2015-10-06 13:39:41.186260443 +0000 UTC

@rade rade changed the title WIP - detect when IPAM has been seeded by different peers detect when IPAM has been seeded by different peers Oct 6, 2015
@tomwilkie
Copy link
Contributor

Code looks good, and it works (tested locally):

vagrant@ubuntu-15:~/src/github.com/weaveworks/weave$ ./weave status connections
-> 172.16.0.3:6783       failed      IP allocation was seeded by different peers (received: [d6:4f:b2:81:01:3a(ubuntu-14)], ours: [66:ad:85:ac:89:02(ubuntu-15)]), retry: 2015-10-09 13:45:50.685351114 +0000 UTC

I'd prefer if we dropped the "fixes #1463", as I'd like more time to think, but that shouldn't block the merging of this.

tomwilkie added a commit that referenced this pull request Oct 9, 2015
detect when IPAM has been seeded by different peers
@tomwilkie tomwilkie merged commit 33085e3 into master Oct 9, 2015
@tomwilkie tomwilkie deleted the 1463-ipam-seed-mismatch branch October 9, 2015 13:46
@rade rade modified the milestone: 1.2.0 Oct 15, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants