
Handling concurrent updates to configmap and IP reclamation #3724

Merged: 4 commits into weaveworks:master on Oct 24, 2019

Conversation

@mmerrill3 (Contributor)

Handling concurrent updates to configmap and IP reclamation by multiple weave instances, to fix #3722.
Signed-off-by: mmerrill3 <michael.merrill@vonage.com>

mmerrill3 added a commit: "Handling concurrent updates to configmap and IP reclamation by multiple weave instances"
Signed-off-by: mmerrill3 <michael.merrill@vonage.com>
@bboreham (Contributor)

I've pushed your branch to the weaveworks/weave repo so the smoke tests will run.

@bboreham (Contributor)

Thanks for the PR.
I had an alternative idea: re-check the peer list in the annotation before rmpeer.
That way we will see that someone else has already removed the entry, and it does not rely on the sleep being "long enough", which is difficult to get right.
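Here is a minimal sketch of that re-check as a standalone Go snippet. Everything in it is a hypothetical stand-in for the kube-utils helpers (fetch plays the role of re-reading the annotation-backed peer list from the api-server, rmpeer the role of weave rmpeer); it illustrates the idea, not the PR's actual code.

    package main

    import "fmt"

    // peerList is a stand-in for the peer list stored in the ConfigMap annotation.
    type peerList struct{ peers map[string]bool }

    func (pl *peerList) contains(name string) bool { return pl.peers[name] }

    // reclaimIfStillListed re-fetches the peer list immediately before rmpeer,
    // so a peer that another instance has already reclaimed is skipped.
    func reclaimIfStillListed(fetch func() (*peerList, error), rmpeer func(string) error, peerName string) error {
        pl, err := fetch()
        if err != nil {
            return err
        }
        if !pl.contains(peerName) {
            // Another instance already removed this entry; nothing to do.
            fmt.Printf("peer %s already removed; skipping rmpeer\n", peerName)
            return nil
        }
        return rmpeer(peerName)
    }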

@mmerrill3 (Contributor, Author)

@bboreham, I agree that the sleep is not deterministic. The solution you've come up with makes sense, but I have one point. The determination that a peer is to be removed was already made within the reclaimRemovedPeers() function, which is called after the sleep. At that time, cml.getPeerList() is called, and the peer is within the annotation for the existing peer list. So, chances are, if we call cml.getPeerList() again shortly after, the annotation will probably still be there.
I think we should do what you suggest, but also wrap the new call to cml.getPeerList() and weave.RmPeer() inside a separate cml.LoopUpdate(), so we make sure we have the most recent configmap version.
Neither way is bulletproof, but your idea is simpler.
I can work on that...
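A sketch of what that LoopUpdate wrapping could look like, under the assumption that LoopUpdate retries its callback on optimistic-concurrency conflicts; errConflict and loopUpdate here are illustrative stand-ins, not the real cml API.

    package main

    import "errors"

    // errConflict stands in for the "resource version changed" error a
    // ConfigMap update returns when another writer got there first.
    var errConflict = errors.New("conflict: configmap resource version changed")

    // loopUpdate re-runs the whole read-check-act sequence until it either
    // succeeds or fails for a reason other than a write conflict, so each
    // attempt starts from the latest configmap version.
    func loopUpdate(f func() error) error {
        for {
            err := f()
            if !errors.Is(err, errConflict) {
                return err // success, or a non-conflict failure
            }
            // Conflict: someone else updated the ConfigMap; retry from scratch.
        }
    }

The callback passed to loopUpdate would re-read the peer list, re-check for the peer, call rmpeer, and update the list, so a concurrent update by another instance forces the whole decision to be re-made.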

mmerrill3 added 2 commits on October 22, 2019 at 14:33:

Revert "Handling concurrent updates to configmap and IP reclamation by multiple weave instances"
This reverts commit 5378157.
Signed-off-by: mmerrill3 <michael.merrill@vonage.com>
@mmerrill3 (Contributor, Author)

Hi @bboreham,
I pushed my changes. I don't have permission on the weaveworks/weave branch feature/fix-3722, so I can't push my changes there as well.

@bboreham (Contributor)

"make sure we have the most recent cml config map version"

Yes, that's what I meant: re-fetch from the api-server.

"Neither way is bulletproof"

Here's what I think the failure in #3722 looks like (time goes from top to bottom):

Node A                      Node B
write annotation "X:A"
rmpeer X
remove X from peer list
update peer list
remove annotation "X:A"
                            write annotation "X:B"
                            rmpeer X
inconsistent data

The reason I think the re-check is more bulletproof is that we get this instead:

Node A                      Node B
write annotation "X:A"
rmpeer X
remove X from peer list
update peer list
remove annotation "X:A"
                            write annotation "X:B"
                            re-fetch peer list
                            X no longer there - do nothing
                            remove annotation "X:B"

Can you show in a similar notation the sequence(s) where we still get an error?
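To make the second sequence concrete, here is a toy, single-threaded replay of it; the maps stand in for the ConfigMap's peer list and annotations, and the scripted interleaving is illustrative only.

    package main

    import "fmt"

    func main() {
        peers := map[string]bool{"X": true}
        annotations := map[string]string{}

        // Node A reclaims X end to end.
        annotations["X"] = "A" // write annotation "X:A"
        fmt.Println("A: rmpeer X")
        delete(peers, "X")       // remove X from peer list, update peer list
        delete(annotations, "X") // remove annotation "X:A"

        // Node B decided earlier, from a stale read, that X needs reclaiming.
        annotations["X"] = "B" // write annotation "X:B"
        if peers["X"] {        // re-fetch peer list just before acting
            fmt.Println("B: rmpeer X") // never reached: A already removed X
        } else {
            fmt.Println("B: X no longer there - do nothing")
        }
        delete(annotations, "X") // remove annotation "X:B"
    }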

@bboreham (Contributor) left a review:

Some nits

(Two review comments on prog/kube-utils/main.go, now outdated and resolved.)
prog/kube-utils/main.go (diff excerpt under review):

    if storedPeerList.contains(peerName) {
        storedPeerList.remove(peerName)
        if err := cml.UpdatePeerList(*storedPeerList); err != nil {
            return err
@bboreham (Contributor): Can you think of a sequence where this is false?

@mmerrill3 (Contributor, Author): If it's false, there's nothing to do from an annotation perspective, so maybe just log it and then proceed to remove the locking annotation.
But the remove will be a no-op if there's nothing in the list, so the contains() check is unnecessary, unless we want to catch an unanticipated edge case where the peer is really not in the list.
I'd prefer to keep the check, but log when the peer is not in the list so we have evidence that this edge case occurred.
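As a sketch, keeping the check purely as a logged safety net could look like this (map and helper names are hypothetical, not the PR's actual code); delete on a missing key is already a no-op, so the early return exists only to leave evidence if the edge case ever occurs.

    package main

    import "log"

    // removeFromStoredList drops peerName from the stored peer list, logging
    // rather than failing if the peer has already disappeared. update stands
    // in for writing the list back to the ConfigMap.
    func removeFromStoredList(peers map[string]bool, update func(map[string]bool) error, peerName string) error {
        if !peers[peerName] {
            log.Printf("peer %s not in stored peer list; nothing to remove", peerName)
            return nil // caller still proceeds to remove the locking annotation
        }
        delete(peers, peerName)
        return update(peers)
    }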

(A further review comment on prog/kube-utils/main.go, now outdated and resolved.)

mmerrill3 added a commit.
Signed-off-by: mmerrill3 <michael.merrill@vonage.com>
@mmerrill3 (Contributor, Author)

Just pushed changes from your feedback, @bboreham; thanks for that. I agree with your diagram of what happened.

@bboreham (Contributor) approved:

Looks good, thanks!

@bboreham bboreham merged commit 6a48da6 into weaveworks:master Oct 24, 2019
@bboreham bboreham added this to the 2.6 milestone Nov 4, 2019