
WIP: remove peers that have disappeared from kubernetes #3022

Closed
wants to merge 8 commits

Conversation

@bboreham (Member) commented Jun 19, 2017

Fixes #2797

Uses Kubernetes annotations to ensure that only one peer at a time runs rmpeer.
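
As a rough illustration of the annotation-as-lock idea (not this PR's actual code): the claim is written to a shared object and relies on Kubernetes' optimistic concurrency, so a conflicting update tells us another peer got there first. The ConfigMap name, annotation key, and function below are hypothetical, and the client-go signatures shown are the current context-taking ones.

package kubepeers

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryClaimPeerRemoval records on a shared ConfigMap that this peer ("me") is
// handling the removal of deadPeer. The Update carries the resourceVersion we
// read, so if another peer modified the ConfigMap first the call fails with a
// conflict and we back off: at most one claimant proceeds to run rmpeer.
func tryClaimPeerRemoval(ctx context.Context, cs kubernetes.Interface, deadPeer, me string) (bool, error) {
	cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(ctx, "weave-net-lock", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	key := "weave.works/claiming-" + deadPeer // hypothetical annotation key
	if owner, ok := cm.Annotations[key]; ok && owner != me {
		return false, nil // another peer already claimed this removal
	}
	cm.Annotations[key] = me
	_, err = cs.CoreV1().ConfigMaps("kube-system").Update(ctx, cm, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return false, nil // lost the race; someone else is handling deadPeer
	}
	return err == nil, err
}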

@bboreham self-assigned this on Jun 19, 2017

// This should be sufficiently rare that we don't care.

// Question: Should we check against Weave Net IPAM?
// i.e. If peer X owns any address space and is marked unreachable, we want to rmpeer X

@mikebryant (Collaborator) commented Jun 19, 2017

I think no:
On a temporary network-based node failure (e.g. on bare metal, a localised power failure affecting the relevant switch), we can get into a situation where:

  • peer X owns address space
  • peer X is unreachable

We don't want to rmpeer it, as it can come back and will still expect to connect back to the network. It may even still have pods running on those IPs.

@bboreham (Member, Author) commented Jun 19, 2017

That question was meant as a narrowing of step 6, i.e. after Kubernetes has decided it's gone, should we additionally see if Weave Net thinks it's still there. Not a widening.

@mikebryant (Collaborator) commented Jun 19, 2017

Oh, I see. I don't see how that would harm anything, then.
It might help alleviate some of the issues in my comment about 46442, if kubernetes thinks it's gone but weave can still talk to it.

func reclaimRemovedPeers(apl *peerList, nodes []nodeInfo) error {
// TODO
// Outline of function:
// 1. Compare peers stored in the peerList against all peers reported by k8s now.
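
A minimal sketch of how step 1's comparison might look, plugging into the types in the signature above; the peerInfo shape and the field names used here are assumptions, not taken from this PR:

// peersMissingFromKubernetes returns the entries stored in the peer list that
// no longer correspond to any node Kubernetes currently reports. Field names
// (apl.peers, p.name, n.name) are illustrative.
func peersMissingFromKubernetes(apl *peerList, nodes []nodeInfo) []peerInfo {
	current := make(map[string]struct{}, len(nodes))
	for _, n := range nodes {
		current[n.name] = struct{}{}
	}
	var missing []peerInfo
	for _, p := range apl.peers {
		if _, ok := current[p.name]; !ok {
			missing = append(missing, p) // in peerList but unknown to k8s now
		}
	}
	return missing
}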

@mikebryant (Collaborator) commented Jun 19, 2017

Be wary of kubernetes/kubernetes#46442

In several environments removed nodes may come back.
If weavedb persistence is enabled, this will mean those nodes can't join the cluster properly, as they will believe they own address space and the cluster will disagree.

@bboreham (Member, Author) commented Jun 19, 2017

Gah, that kills the whole idea.

We need to distinguish between "stopped but coming back" and "stopped and never coming back".

Possibly we could extend the use of the peerList annotation to say "If I'm not currently listed as a peer then delete any old IPAM persistence data"?
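
A minimal sketch of that idea, assuming the peer list can be read as a set of peer names and that IPAM persistence lives in a file such as /weavedb/weave-netdata.db; the function name and the shape of the arguments are hypothetical:

package kubepeers

import "os"

// resetIPAMDataIfRemoved deletes the persisted IPAM state when this peer is
// no longer present in the shared peer-list annotation, on the assumption
// that it was rmpeer'd while away and must start afresh. The path and the
// listedPeers shape are assumptions for illustration only.
func resetIPAMDataIfRemoved(myPeerName string, listedPeers map[string]bool) error {
	if listedPeers[myPeerName] {
		return nil // still listed: keep existing IPAM persistence
	}
	err := os.Remove("/weavedb/weave-netdata.db")
	if os.IsNotExist(err) {
		return nil // nothing persisted yet
	}
	return err
}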

@mikebryant (Collaborator) commented Jun 19, 2017

That would work for nodes that have been restarted (which is probably the most common case).

For one that was just off the network for a while, though... I guess it depends on the exact failure mode in the cloud provider. Not sure what would happen if there were still running pods.

@mikebryant (Collaborator) commented Jul 11, 2017

For info:
Our hacky rmpeers solution was hitting this quite a bit. So this is our current fix:

      initContainers:
      - name: reset-weavedb
        image: hub.docker.tech.lastmile.com/ocadotechnology/weave-net-reset-weavedb:0acd87fd3f7bde7d4cfe8b5009e0fafa2d0e9c18@sha256:05722937400c0e0449a80e9af8285323ac9ca7036c7f009e6c3c1e46d4bae855
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        volumeMounts:
        - name: weavedb
          mountPath: /weavedb

The source of that image is:

#!/bin/sh

set -eux

CURRENT_NODE_UUID=$(kubectl get node -o custom-columns=ID:.metadata.uid --no-headers ${HOSTNAME})

if [ ! -f /weavedb/NODE_UUID ]; then
  # Does not exist, so create it with the Node's current UUID
  echo -n ${CURRENT_NODE_UUID} > /weavedb/NODE_UUID
  exit 0;
fi;

if [ "${CURRENT_NODE_UUID}" == "$(cat /weavedb/NODE_UUID)" ]; then
  # It's still the same, so we're not likely to be out of sync
  exit 0;
fi;

# If we're here, it didn't match, so reset the persistent data on the assumption
# that we've been rmpeer'd
rm -f /weavedb/weavedata.db
rm -f /weavedb/weave-netdata.db

# And also set the NODE_UUID
echo -n ${CURRENT_NODE_UUID} > /weavedb/NODE_UUID

exit 0

We're basing this on the assumption that if our Node object has gone away and come back with a new UUID, our IPs may have been hoovered up by our rmpeers script, so we'll start over.

@bboreham (Member, Author) commented Jul 19, 2017

> For one that was just off the network for a while, though... I guess it depends on the exact failure mode in the cloud provider. Not sure what would happen if there were still running pods.

Suppose rmpeer X is done while X is alive but disconnected: X will then be unable to reconnect to any peer that has received that update. The pods on X will be attached to a bridge, but the bridge will be isolated from every other node.

However, I would expect the scheduler to evict pods on a node that is deleted from etcd, so after rejoining, kubelet would delete all those existing pods, the weave-kube pod would go through its restart cycle, and everything should be OK after that.

- apiVersion: rbac.authorization.k8s.io/v1beta1
  kind: Role
  metadata:
    name: weave-net2

@mikebryant (Collaborator) commented Jun 19, 2017

nit: Is there a reason the permissions aren't just added to the existing role?
If they should be separate, a more descriptive name would be nice (weave-net-peers?)

@bboreham (Member, Author) commented Jun 19, 2017

The existing one is a ClusterRole; this is a Role to narrow down the write permissions.

I agree with your point about naming; it was late...

// 8. If step 5 failed due to optimistic lock conflict, stop: someone else is handling X
// 9. Else there is an existing annotation at step 3; call its owner peer Y.
// 10. If Y is not known to k8s and marked unreachable, recurse to remove Y at 3
// 11. If we succeed to claim the annotation for Y, remove its annotation for X too

@mikebryant (Collaborator) commented Jun 19, 2017

On ordering, should it then be:

	// 7a:    Remove any annotations Z* that have contents X
	// 7b.    Remove X from peerList
	// 7c.    Remove annotation with key X

Instead of recursing, we could change the algorithm so that the loop around step 2, rather than going through each peer once, keeps looping until it gets the whole way through the list without doing any work (e.g. so that all peers we want to remove end up handled via step 8).
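
A small sketch of that fixed-point loop, with a hypothetical reclaimPeer callback standing in for steps 3 to 11 (it reports whether it actually removed anything):

// reclaimUntilStable repeatedly walks the candidate peers and stops only
// after a full pass in which nothing was removed, so a peer whose removal
// was blocked on one pass (for example by another dead peer's stale
// annotation) gets another chance on the next pass instead of being
// handled by recursion.
func reclaimUntilStable(peers []string, reclaimPeer func(string) (bool, error)) error {
	for {
		removedAny := false
		for _, p := range peers {
			removed, err := reclaimPeer(p)
			if err != nil {
				return err
			}
			if removed {
				removedAny = true
			}
		}
		if !removedAny {
			return nil // nothing left to do
		}
	}
}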

@bboreham (Member, Author) commented Jul 20, 2017

I updated in line with this suggestion.

@bboreham force-pushed the issues/2797-remove-dead-peers branch from 9e83787 to 040d7b7 on Jul 20, 2017

@bboreham (Member, Author) commented Jul 20, 2017

I have squashed some of the commits and written an implementation of the pseudo-code previously discussed.

Next steps: try it out, figure out how to test it properly.

@bboreham (Member, Author) commented Jul 20, 2017

Also still to do, as noted earlier:

> We need to distinguish between "stopped but coming back" and "stopped and never coming back".
>
> Possibly we could extend the use of the peerList annotation to say "If I'm not currently listed as a peer then delete any old IPAM persistence data"?

@caarlos0 (Contributor) commented Oct 19, 2017

> We need to distinguish between "stopped but coming back" and "stopped and never coming back".

I'm not very familiar with weave, but why? If a node disappears and we release its IPs etc., and it comes back later, can't we just "re-setup" it (with new IPs and all)?

@bboreham (Member, Author) commented Oct 19, 2017

Weave Net has no central point of control: its native data structure is a CRDT. The "release its IPs" step breaks the rules of the CRDT, so we need some extra mechanism to avoid going mad.
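
A toy model of the problem, not Weave Net's actual IPAM structure: each node's view converges by merging with its peers, so one node deleting another peer's record locally is simply undone by the next gossip exchange, which is why removal needs an explicit, separately agreed mechanism.

package main

import "fmt"

// view maps peer name -> version of that peer's ownership record.
type view map[string]int

// merge keeps the highest version seen for each peer; repeated merging from
// any direction converges every node to the same result.
func merge(a, b view) view {
	out := view{}
	for k, v := range a {
		out[k] = v
	}
	for k, v := range b {
		if v > out[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	nodeA := view{"X": 3, "A": 1}
	nodeB := view{"X": 3, "B": 2}

	delete(nodeA, "X") // node A "releases" X's entry locally

	nodeA = merge(nodeA, nodeB) // next gossip from B resurrects X
	fmt.Println(nodeA)          // map[A:1 B:2 X:3]
}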

@caarlos0 (Contributor) commented Oct 19, 2017

hmm, interesting. Thanks for explaining @bboreham :)

@bboreham (Member, Author) commented Nov 9, 2017

Replaced by #3149

@bboreham closed this on Nov 9, 2017

@bboreham added this to the n/a milestone on Nov 13, 2017
