
Flight Controller that Automatically Fixes Common Problems #246

Merged (1 commit) on Mar 26, 2018
Conversation

@BugRoger (Contributor) commented Feb 25, 2018

Flight Control

This controller takes care of Kluster health. It looks for obvious problems and tries to repair them. A rough sketch of the resulting reconciler surface follows the list below.

Currently Implemented

  • It deletes Nodes that didn't manage to register within 10m after
    initial creation. This is a workaround for DHCP/DVS (latency) issues.
    In effect, it will delete the incompletely spawned node, and launch
    control will ramp it back up.

  • It ensures tcp/udp/icmp rules exist in the security group defined during
    kluster creation. The rules explicitly allow all pod-to-pod
    communication. This is a workaround for Neutron missing the
    side-channel security group events.

  • It ensures each Node is a member of the security group defined in
    the kluster spec. Security groups that are missing for whatever
    reason are added back to the node.
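
For illustration, the surface of such a reconciler could look roughly like the following Go sketch. The names are illustrative only and not necessarily the ones used in this PR:

```go
// Hypothetical sketch of a flight reconciler; method names are illustrative.
type FlightReconciler interface {
	// Deletes instances that failed to register as nodes within 10 minutes.
	DeleteIncompletelySpawnedInstances() []string

	// Ensures the tcp/udp/icmp pod-to-pod rules exist in the kluster's security group.
	EnsureKubernikusRuleInSecurityGroup() (created bool, err error)

	// Re-attaches the security group from the kluster spec to instances missing it.
	EnsureInstanceSecurityGroupAssignment() []string
}
```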

1. Security Group (DVS) Update Latency

Updates to the security group are not instant. There is a non-trivial amount of latency involved. This leads to timeouts and edge cases, as the definition of tolerable latency differs. Most notably, this affects DHCP and Ignition. Default tolerances are in the range of 30s.

The latency of updates is related to general load on the region, "noisy" neighbors blocking the update loops, the number of ports in the same project, and Neutron and VCenter performance.

Symptoms

  • Nodes Running but never become healthy
  • No IPv4 Address visible on the VNC Console
  • No SSH Login Possible
  • Kubelet not running

These symptoms indicate that the node couldn't configure its network interface before the Ignition timeout. This effectively leaves the node broken.

Workarounds

  • Increase DHCP/Ignition Timeout
    This configuration needs to be baked into the image as an OEM customization. It also interacts with the DHCP client timeout which again requires a modification of the image. With frequent CoreOS updates this modification needs to be automatic and included in the image build pipeline.

  • Reboot the Instance
    This would be the preferred workaround. It gives the DVS agents additional time to configure the existing instance and retries the Ignition run. Unfortunately, it does not work: Ignition only runs on first boot, and a failed run still counts as the first boot. The node stays broken if it doesn't come up correctly the first time.

  • Delete the Instance
    This workaround is problematic. It will not succeed if the update latency is too high in general.

  • Use ConfigDrive
    It is possible to pass in metadata via config-drive. The initial configuration of a node is then independent of network connectivity. With this, a node is guaranteed to boot correctly. It will not be functional, but it will at least be in a recoverable state.

2. Security Group Update Event Missed

If an instance successfully downloads and starts the Kubelet, it registers itself with the APIServer and gets a PodCIDR range assigned. This triggers a reconfiguration of the Neutron router and adds a static route. The route points the PodCIDR to the node's IP address. This is required to satisfy the Kubernetes pod-to-pod communication requirements.

As the PodCIDR subnet is now actually routed via the Neutron router, it needs to be allowed in the security group.

This happens by updating the node's Neutron port and adding the required CIDR to allowed_address_pairs. This triggers an event that the port was updated. The DVS agents catch this update and add an additional rule to the security group.
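
For illustration, updating a node's port could look roughly like this with gophercloud. This is a sketch assuming a recent gophercloud API; the helper name and inputs are made up and not the code in this PR:

```go
package flightexample

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// addPodCIDRToPort appends the node's PodCIDR to the Neutron port's
// allowed_address_pairs. The resulting port-updated event is what the
// DVS agents are supposed to pick up.
func addPodCIDRToPort(network *gophercloud.ServiceClient, portID, podCIDR string) error {
	port, err := ports.Get(network, portID).Extract()
	if err != nil {
		return fmt.Errorf("getting port %s: %v", portID, err)
	}

	pairs := append(port.AllowedAddressPairs, ports.AddressPair{IPAddress: podCIDR})
	_, err = ports.Update(network, portID, ports.UpdateOpts{AllowedAddressPairs: &pairs}).Extract()
	return err
}
```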

Occasionally, this update is missed. Until a full reconciliation happens (usually triggered by a restart or update of the DVS agents), the following symptoms appear.

Symptoms

  • Sporadic Pod Communication
  • Sporadic Service Communication
  • Sporadic Load Balancer Communication
  • Sporadic DNS Problems in Pods
    Depending on which node is disconnected, pods can't reach the kube-dns service. DNS still works on the nodes themselves.
  • Load Balancer Pools Flapping

Workarounds

  • Add PodCIDRs to Security Group
    Instead of relying on the unreliable Oslo events, all possible PodCIDRs are added to the kluster's security group. By default this is 198.19.0.0/16. See the sketch after this list.

  • Trigger Security Group Sync
    Instead of waiting for a security group reconciliation, force an update by periodically adding a (random) change to the security group. If possible, this should only be triggered when a node condition indicates pod communication problems.
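
A rough sketch of the first workaround with gophercloud. Looking up the security group ID is omitted and the helper is hypothetical; the PR adds a similar EnsureKubernikusRuleInSecurityGroup helper, which is not necessarily identical to this:

```go
package flightexample

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/security/rules"
)

// ensurePodCIDRRule adds an ingress rule for the PodCIDR range to the
// kluster's security group. Calling it once each for "tcp", "udp" and
// "icmp" covers the pod-to-pod traffic described above.
func ensurePodCIDRRule(network *gophercloud.ServiceClient, secGroupID, protocol string) error {
	_, err := rules.Create(network, rules.CreateOpts{
		Direction:      rules.DirIngress,
		EtherType:      rules.EtherType4,
		SecGroupID:     secGroupID,
		Protocol:       rules.RuleProtocol(protocol),
		RemoteIPPrefix: "198.19.0.0/16", // default PodCIDR range
	}).Extract()
	return err
}
```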

3. Default Security Group Missing

When a new node is created via Nova a security group can be specified. The user can select this security group during Kluster creation. If nothing is selected the default security group is assumed.

For as yet unknown reasons, there's a chance that Nova configures the instance without this security group association. In effect, the instance is completely disconnected from the network.

Symptoms

  • Nodes Running but never become healthy
  • No IPv4 Address visible on the VNC Console
  • No SSH Login Possible
  • Kubelet not running

Workarounds

  • Reconcile Security Group Associations
    Periodically check that instances which haven't registered as nodes have the required security group attached. If not, set it again. See the sketch below.
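
A minimal sketch of such a reconciliation step, using the Nova security group extension in gophercloud. The helper name and wiring are assumptions; the SetSecurityGroup helper shown further down in this PR works along the same lines:

```go
package flightexample

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/secgroups"
)

// reattachSecurityGroup re-adds the kluster's security group to an instance
// that is missing it, e.g. one that never registered as a node.
func reattachSecurityGroup(compute *gophercloud.ServiceClient, serverID, groupName string) error {
	return secgroups.AddServer(compute, serverID, groupName).ExtractErr()
}
```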

4. ASR Route Duplicates

When a node is deleted, its route is removed in Neutron. The ASR agents get notified by an event and remove the route from the ASR device.

First of all, this requires that the state of the Neutron DB reflects reality. Updates to the routes are done by the RouteController in the Kubernetes OpenStack cloud provider. Before 1.10 there's a bug that causes these updates to be missed. In Kubernikus we fixed this for now by adding an additional RouteNanny.

When a Kluster is terminated forcefully, the RouteController might be destroyed before it manages to update the Neutron database. The reconciliation happens every 60 seconds. We counter this by gracefully deorbiting the Kluster, waiting for the updates from either the RouteController or the RouteNanny.

Unfortunately, route updates also get missed during normal operations, such as scaling node pools or terminating nodes. In that case the state in Neutron is correct, while the state on the ASR device still shows the deleted route. This should be fixable by triggering or manually running a re-sync. Unfortunately, that does not work.

The re-sync mechanism between Neutron and ASR is not perfect. Currently, routes that have been removed in Neutron are not removed during the sync; it only considers routes that have been added.

The only way to recover this situation is to manually delete and then sync the router using the asr1k_utils. Additional node conditions that check for this problem will facilitate alerting and manual intervention.

These duplicate routes are fatal because the IPAM module in the CNI plugin recycles PodCIDRs immediately. A new node will receive the PodCIDR of a previously existing node. The old node's routes are still configured on the ASR device and take precedence. That leaves the new node broken. For example, the state of the ASR routes after multiple node deletions:

		198.19.1.0/24 -> 10.180.0.3
		198.19.1.0/24 -> 10.180.0.4
		198.19.1.0/24 -> 10.180.0.5

The correct route is the last one, pointing to the latest instance. But the route in effect is the first one, pointing to 10.180.0.3, which doesn't exist anymore.

Symptoms

  • See (2). Sporadic Pod/Service/LB Communication Problems
  • Only the first cluster in each project works

Workarounds:

  • None Known

  • There's no known way to trigger a router asr1k_utils delete + asr1k_utils sync via OpenStack without actually deleting the whole Neutron router construct. If a side-channel could somehow be leveraged it would be possible to recover automatically.

5. ASR Missing Routes

Due to various effects it is possible that the ASR agents miss the event to add an additional route when a new node is created.

One especially fatal effect is the failover between the ACTIVE and STANDBY router. It seems to be a rather common defect (potentially even intended) that only the ACTIVE router receives sync events. Upon failover, routes are missing or reflect the state of a previous cluster.

Symptoms:

  • See (2)

Workarounds:

  • Manual Sync
  • Trigger Sync Automatically
    There's no direct interface to trigger a sync via the OpenStack API. It can be forced indirectly by an action that triggers a sync: attaching/detaching a FIP/interface, or adding/removing a route.

6. Neutron Static Route Limit

There's a static limit of 31 routes in Neutron.

In projects with frequent Kluster creates and deletes, the route limit can be
exceeded due to duplicate routes as described in (4).

Symptoms:

  • See (2). Pod/Service/LB communication problems
  • 409 Conflict Errors in RouteController and RouteNanny
  • Klusters have a max size of 31 Nodes

Workarounds:

  • None. Neutron needs to be reconfigured
  • Clean up duplicate routes

@BugRoger changed the title from "adds docs" to "[WIP] Flight Control" on Feb 25, 2018
@BugRoger (Contributor, Author):

@fwiesel @abattye Do you have additional insight? Maybe something we missed or misinterpreted? Tips and tricks?

@abattye commented Feb 25, 2018

I think the route issues are pretty much as you describe. We have an update from Cisco that might improve the route situation; I haven't applied or tested it yet because of the time constraints of the new driver. If this is critical we can try it out in a tame region to see if it improves things. I think there is some locking situation at the root of this.

@SchwarzM (Contributor):

[screenshot: 2018-03-12 at 09:21]

You're good, but not 2 million lines good, so I guess there is still vendor crap.

reconciler = &LoggingFlightReconciler{reconciler, f.Logger}
return reconciler, nil
}

func (d *flightReconcilerFactory) getNodes(kluster *v1.Kluster, client openstack_kluster.KlusterClient) ([]Node, error) {
nodes := []Node{}
func (d *flightReconcilerFactory) getInstances(kluster *v1.Kluster, client openstack_kluster.KlusterClient) ([]Instance, error) {
Contributor:
Maybe we should put the getInstances for the whole cluster in the client as well, and not just for a single pool.

return []string{}
deletedInstanceIDs := []string{}
timedOutInstances := f.getTimedOutInstances()
unregisteredInstances := f.getUnregisterdInstances()
Contributor:
unregistered ? or unregisterd ?

Contributor (Author):
fixed

return timedOut
}

func (f *flightReconciler) getUnregisterdInstances() []Instance {
Contributor:
see above

Contributor (Author):
fixed

// TODO: Implement Me
return nil
func (c *klusterClient) SetSecurityGroup(nodeID string) (err error) {
return secgroups.AddServer(c.ComputeClient, nodeID, c.Kluster.Spec.Openstack.SecurityGroupName).ExtractErr()
Contributor:
If I understood @edda right, there is no error given back if the endpoint is reachable, as the call is async.
Is there some way to ensure we notice if this goes wrong and we repetitively set the same secgroup again and again without backoff? Like a metric setsecgroups vs getsecgroups.

}

if found {
if err := f.Client.DeleteNode(unregistered.GetID()); err != nil {
Contributor:
If memory serves me right, deleting an instance in state "deleting" produces an error.
Either we check the status and back off, or we force it down with --force.

Contributor (Author):
This could indeed happen when an instance is stuck in deleting. I'm unsure we should be handling that. It sounds more like a job for a Nova nanny.

If an error occurs, this will try to delete it again in the next reconciliation loop. Also, I tried to write the funcs to be unaffected by errors and not break the global loop. Hence this only logs the error and continues. It should result in at most 1 log line every 5 minutes if an instance is stuck or a concurrent deletion happens.

@BugRoger (Contributor, Author):

Cleaned up the vendor directory again...

@BugRoger changed the title from "[WIP] Flight Control" to "Flight Control" on Mar 22, 2018
@BugRoger BugRoger force-pushed the flight branch 2 times, most recently from 20e8c7d to 3fa7368 Compare March 22, 2018 08:50
@BugRoger changed the title from "Flight Control" to "Flight Controller that Automatically Fixes Common Problems" on Mar 22, 2018
@BugRoger (Contributor, Author) commented Mar 22, 2018

Rebased, squashed, manually tested. This is ready for review.

@databus23 (Member) left a comment:
Awesome job. Especially for taking the time to write some proper tests!

Only some minor nits.

On a more general note, I think this controller does too much and mixes things that have nothing in common. I'm okay with that, as it's a grab bag of workarounds that we can hopefully remove, at least partially, in the future as things get more stable in the underlying platform.

IMO the security group reconciliation should be part of launchctl as part of a regular node reconciliation, since it is not only a workaround for a bug but also a feature: the security group gets synced when a user changes it in the spec. Maybe we can pull it out when we get to refactoring launchctl.

In general I'm not sure where the responsibilities of the launch and flight controllers begin and end in regard to nodes. Maybe we can clarify that at some point.

I'm not for holding this PR back because of any of the above comments, as I think it's important we have it merged ASAP to improve the overall stability.

LGTM!

func (c *klusterClient) SetSecurityGroup(nodeID string) (err error) {
err = secgroups.AddServer(c.ComputeClient, nodeID, c.Kluster.Spec.Openstack.SecurityGroupName).ExtractErr()
// ?????
if err.Error() == "EOF" {
Member:
I think a gophercloud update will fix this. See: gophercloud/gophercloud#815

Contributor (Author):
Yes, nice. Fixed it with an update.

func (c *klusterClient) EnsureKubernikusRuleInSecurityGroup() (created bool, err error) {
page, err := securitygroups.List(c.NetworkClient, securitygroups.ListOpts{Name: c.Kluster.Spec.Openstack.SecurityGroupName}).AllPages()
if err != nil {
return false, fmt.Errorf("SecurityGroup %v not found", c.Kluster.Spec.Openstack.SecurityGroupName)
Member:
Returning "Not found" might not be accurate. Could be a lot of other reasons for the list to fail.
I would suggest to return at least something like fmt.Errorf("SecurityGroup %v not found: %s", c.Kluster.Spec.Openstack.SecurityGroupName, err)

Contributor (Author):
agreed

"github.com/sapcc/kubernikus/pkg/controller/config"
)

// Flight Control
Member:
Awesome docs, but should this maybe also be copied to our markdown docs?

   * It deletes Nodes that didn't manage to register within 10m after
   inital creation. This is a workaround for DHCP/DVS (latency) issues.
   In effect it will delete the incompletely spawned node and launch
   control will ramp it back up.

   * It ensures tcp/udp/icmp rules exist in the security group defined during
   kluster creation. The rules explicitly allow all pod-to-pod
   communication. This is a workaround for Neutron missing the
   side-channel security group events.

   * It ensures each Nodes is member of the security group defined in
   the kluster spec. This ensures missing security groups due to whatever
   reason are again added to the node.