
Flight Controller that Automatically Fixes Common Problems #246

Merged (1 commit) on Mar 26, 2018
Conversation

@BugRoger (Contributor) commented Feb 25, 2018

Flight Control

This controller takes care of Kluster health. It looks for obvious problems and tries to repair them. A rough sketch of the resulting reconciler surface follows the list below.

Currently Implemented

  • It deletes Nodes that didn't manage to register within 10m after
    initial creation. This is a workaround for DHCP/DVS (latency) issues.
    In effect, it will delete the incompletely spawned node, and launch
    control will ramp it back up.

  • It ensures tcp/udp/icmp rules exist in the security group defined during
    kluster creation. The rules explicitly allow all pod-to-pod
    communication. This is a workaround for Neutron missing the
    side-channel security group events.

  • It ensures each Node is a member of the security group defined in
    the kluster spec. Security groups that are missing for whatever
    reason are added back to the node.
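
For illustration, the surface of such a reconciler could look roughly like the following Go sketch. The names are illustrative only and not necessarily the ones used in this PR:

```go
// Hypothetical sketch of a flight reconciler; method names are illustrative.
type FlightReconciler interface {
	// Deletes instances that failed to register as nodes within 10 minutes.
	DeleteIncompletelySpawnedInstances() []string

	// Ensures the tcp/udp/icmp pod-to-pod rules exist in the kluster's security group.
	EnsureKubernikusRuleInSecurityGroup() (created bool, err error)

	// Re-attaches the security group from the kluster spec to instances missing it.
	EnsureInstanceSecurityGroupAssignment() []string
}
```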

1. Security Group (DVS) Update Latency

Updates to the security group are not instant. There is a non-trivial amount of latency involved. This leads to timeouts and edge cases, as the definition of tolerable latency differs. Most notably, this affects DHCP and Ignition. Default tolerances are in the range of 30s.

The latency of updates is related to general load on the region, "noisy" neighbors blocking the update loops, the number of ports in the same project, and Neutron and VCenter performance.

Symptoms

  • Nodes Running but never become healthy
  • No IPv4 Address visible on the VNC Console
  • No SSH Login Possible
  • Kubelet not running

These symptoms indicate that the node couldn't configure its network interface before the Ignition timeout. This effectively leaves the node broken.

Workarounds

  • Increase DHCP/Ignition Timeout
    This configuration needs to be baked into the image as an OEM customization. It also interacts with the DHCP client timeout which again requires a modification of the image. With frequent CoreOS updates this modification needs to be automatic and included in the image build pipeline.

  • Reboot the Instance
    This would be the preferred workaround. It gives the DVS agents additional time to configure the existing instance and retries the Ignition run. Unfortunately, it does not work: Ignition only runs on first boot, and a failed run still counts as the first boot. The node stays broken if it doesn't come up correctly the first time.

  • Delete the Instance
    This workaround is problematic. It will not succeed if the update latency is too high in general.

  • Use ConfigDrive
    It is possible to pass in metadata via config-drive. The initial configuration of a node is then independent of network connectivity. With this, a node is guaranteed to boot correctly. It will not be functional, but it will at least be in a recoverable state.

2. Security Group Update Event Missed

If an instance successfully downloads and starts the Kubelet, it registers itself with the APIServer and gets a PodCIDR range assigned. This triggers a reconfiguration of the Neutron router and adds a static route. The route points the PodCIDR to the node's IP address. This is required to satisfy the Kubernetes pod-to-pod communication requirements.

As the PodCIDR subnet is now actually routed via the Neutron router, it needs to be allowed in the security group.

This happens by updating the node's Neutron port and adding the required CIDR to allowed_address_pairs. This triggers an event that the port was updated. The DVS agents catch this update and add an additional rule to the security group.
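
For illustration, updating a node's port could look roughly like this with gophercloud. This is a sketch assuming a recent gophercloud API; the helper name and inputs are made up and not the code in this PR:

```go
package flightexample

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// addPodCIDRToPort appends the node's PodCIDR to the Neutron port's
// allowed_address_pairs. The resulting port-updated event is what the
// DVS agents are supposed to pick up.
func addPodCIDRToPort(network *gophercloud.ServiceClient, portID, podCIDR string) error {
	port, err := ports.Get(network, portID).Extract()
	if err != nil {
		return fmt.Errorf("getting port %s: %v", portID, err)
	}

	pairs := append(port.AllowedAddressPairs, ports.AddressPair{IPAddress: podCIDR})
	_, err = ports.Update(network, portID, ports.UpdateOpts{AllowedAddressPairs: &pairs}).Extract()
	return err
}
```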

Occasionally, this update is missed. Until a full reconciliation happens (usually triggered by a restart or update of the DVS agents), the following symptoms appear.

Symptoms

  • Sporadic Pod Communication
  • Sporadic Service Communication
  • Sporadic Load Balancer Communication
  • Sporadic DNS Problems in Pods
    Depending on which node is disconnected, pods can't reach the kube-dns service. DNS still works on the nodes themselves.
  • Load Balancer Pools Flapping

Workarounds

  • Add PodCIDRs to Security Group
    Instead of relying on the unreliable Oslo events, all possible PodCIDRs are added to the kluster's security group. By default this is 198.19.0.0/16. See the sketch after this list.

  • Trigger Security Group Sync
    Instead of waiting for a security group reconciliation, force an update by periodically adding a (random) change to the security group. If possible, this should only be triggered when a node condition indicates pod communication problems.
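
A rough sketch of the first workaround with gophercloud. Looking up the security group ID is omitted and the helper is hypothetical; the PR adds a similar EnsureKubernikusRuleInSecurityGroup helper, which is not necessarily identical to this:

```go
package flightexample

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/security/rules"
)

// ensurePodCIDRRule adds an ingress rule for the PodCIDR range to the
// kluster's security group. Calling it once each for "tcp", "udp" and
// "icmp" covers the pod-to-pod traffic described above.
func ensurePodCIDRRule(network *gophercloud.ServiceClient, secGroupID, protocol string) error {
	_, err := rules.Create(network, rules.CreateOpts{
		Direction:      rules.DirIngress,
		EtherType:      rules.EtherType4,
		SecGroupID:     secGroupID,
		Protocol:       rules.RuleProtocol(protocol),
		RemoteIPPrefix: "198.19.0.0/16", // default PodCIDR range
	}).Extract()
	return err
}
```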

3. Default Security Group Missing

When a new node is created via Nova a security group can be specified. The user can select this security group during Kluster creation. If nothing is selected the default security group is assumed.

For as yet unknown reasons, there's a chance that Nova configures the instance without this security group association. In effect, the instance is completely disconnected from the network.

Symptoms

  • Nodes Running but never become healthy
  • No IPv4 Address visible on the VNC Console
  • No SSH Login Possible
  • Kubelet not running

Workarounds

  • Reconcile Security Group Associations
    Periodically check that instances which haven't registered as nodes have the required security group attached. If not, set it again. See the sketch below.
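
A minimal sketch of such a reconciliation step, using the Nova security group extension in gophercloud. The helper name and wiring are assumptions; the SetSecurityGroup helper shown further down in this PR works along the same lines:

```go
package flightexample

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/secgroups"
)

// reattachSecurityGroup re-adds the kluster's security group to an instance
// that is missing it, e.g. one that never registered as a node.
func reattachSecurityGroup(compute *gophercloud.ServiceClient, serverID, groupName string) error {
	return secgroups.AddServer(compute, serverID, groupName).ExtractErr()
}
```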

4. ASR Route Duplicates

When a node is deleted, its route is removed in Neutron. The ASR agents get notified by an event and remove the route from the ASR device.

First of all, this requires that the state of the Neutron DB reflects reality. Updates to the routes are done by the RouteController in the Kubernetes OpenStack cloud provider. Before 1.10 there's a bug that causes these updates to be missed. In Kubernikus we fixed this for now by adding an additional RouteNanny.

When a Kluster is terminated forcefully, the RouteController might be destroyed before it manages to update the Neutron database. The reconciliation happens every 60 seconds. We counter this by gracefully deorbiting the Kluster, waiting for the updates from either the RouteController or the RouteNanny.

Unfortunately, route updates also get missed during normal operations, such as scaling node pools or terminating nodes. In that case the state in Neutron is correct, while the state on the ASR device still shows the deleted route. This should be fixable by triggering or manually running a re-sync. Unfortunately, that does not work.

The re-sync mechanism between Neutron and ASR is not perfect. Currently, routes that have been removed in Neutron are not removed during the sync; it only considers routes that have been added.

The only way to recover this situation is to manually delete and then sync the router using the asr1k_utils. Additional node conditions that check for this problem will facilitate alerting and manual intervention.

These duplicate routes are fatal because the IPAM module in the CNI plugin recycles PodCIDRs immediately. A new node will receive the PodCIDR of a previously existing node. The old node's routes are still configured on the ASR device and take precedence. That leaves the new node broken. For example, the state of the ASR routes after multiple node deletions:

		198.19.1.0/24 -> 10.180.0.3
		198.19.1.0/24 -> 10.180.0.4
		198.19.1.0/24 -> 10.180.0.5

The correct route is the last one, pointing to the latest instance. But the route in effect is the first one, pointing to 10.180.0.3, which doesn't exist anymore.

Symptoms

  • See (2). Sporadic Pod/Service/LB Communication Problems
  • Only the first cluster in each project works

Workarounds:

  • None Known

  • There's no known way to trigger a router asr1k_utils delete + asr1k_utils sync via OpenStack without actually deleting the whole Neutron router construct. If a side-channel could somehow be leveraged it would be possible to recover automatically.

5. ASR Missing Routes

Due to various effects it is possible that the ASR agents miss the event to add an additional route when a new node is created.

One especially fatal effect is the failover between the ACTIVE and STANDBY router. It seems to be a rather common defect (potentially even intended) that only the ACTIVE router receives sync events. Upon failover, routes are missing or reflect the state of a previous cluster.

Symptoms:

  • See (2)

Workarounds:

  • Manual Sync
  • Trigger Sync Automatically
    There's no direct interface to trigger a sync via the OpenStack API. It can be forced indirectly by an action that triggers a sync: attaching/detaching a FIP/interface, or adding/removing a route.

6. Neutron Static Route Limit

There's a static limit of 31 routes in Neutron.

In projects with frequent Kluster creates and deletes, the route limit can be
exceeded due to duplicate routes as described in (4).

Symptoms:

  • See (2). Pod/Service/LB communication problems
  • 409 Conflict Errors in RouteController and RouteNanny
  • Klusters have a max size of 31 Nodes

Workarounds:

  • None. Neutron needs to be reconfigured
  • Clean up duplicate routes

@BugRoger changed the title from "adds docs" to "[WIP] Flight Control" on Feb 25, 2018
@BugRoger (Contributor, Author):

@fwiesel @abattye Do you have additional insight? Maybe something we missed or misinterpreted? Tips and tricks?

@abattye commented Feb 25, 2018

I think the route issues are pretty much as you describe. We have an update from Cisco that might improve the route situation; I haven't applied or tested it yet because of the time constraints of the new driver. If this is critical we can try it out in a tame region to see if it improves things. I think there is some locking situation at the root of this.

@SchwarzM (Contributor):

[screenshot: 2018-03-12 at 09:21]

You're good, but not 2 million lines good, so I guess there is still vendor crap.

reconciler = &LoggingFlightReconciler{reconciler, f.Logger}
return reconciler, nil
}

func (d *flightReconcilerFactory) getNodes(kluster *v1.Kluster, client openstack_kluster.KlusterClient) ([]Node, error) {
nodes := []Node{}
func (d *flightReconcilerFactory) getInstances(kluster *v1.Kluster, client openstack_kluster.KlusterClient) ([]Instance, error) {
Contributor:
Maybe we should put the getInstances for the whole cluster in the client as well, and not just for a single pool.

return []string{}
deletedInstanceIDs := []string{}
timedOutInstances := f.getTimedOutInstances()
unregisteredInstances := f.getUnregisterdInstances()
Contributor:
unregistered ? or unregisterd ?

Contributor (Author):
fixed

return timedOut
}

func (f *flightReconciler) getUnregisterdInstances() []Instance {
Contributor:
see above

Contributor (Author):
fixed

// TODO: Implement Me
return nil
func (c *klusterClient) SetSecurityGroup(nodeID string) (err error) {
return secgroups.AddServer(c.ComputeClient, nodeID, c.Kluster.Spec.Openstack.SecurityGroupName).ExtractErr()
Contributor:
If I understood @edda right, there is no error given back if the endpoint is reachable, as the call is async.
Is there some way to ensure we notice if this goes wrong and we repetitively set the same secgroup again and again without backoff? Like a metric setsecgroups vs getsecgroups.

}

if found {
if err := f.Client.DeleteNode(unregistered.GetID()); err != nil {
Contributor:
If memory serves me right, deleting an instance in state "deleting" produces an error.
Either we check the status and back off, or we force it down with --force.

Contributor (Author):
This could indeed happen when an instance is stuck in deleting. I'm unsure we should be handling that. It sounds more like a job for a Nova nanny.

If an error occurs, this will try to delete it again in the next reconciliation loop. Also, I tried to write the funcs to be unaffected by errors and not break the global loop. Hence this only logs the error and continues. It should result in at most 1 log line every 5 minutes if an instance is stuck or a concurrent deletion happens.

@BugRoger (Contributor, Author):

Cleaned up the vendor directory again...

@BugRoger changed the title from "[WIP] Flight Control" to "Flight Control" on Mar 22, 2018
@BugRoger BugRoger force-pushed the flight branch 2 times, most recently from 20e8c7d to 3fa7368 Compare March 22, 2018 08:50
@BugRoger changed the title from "Flight Control" to "Flight Controller that Automatically Fixes Common Problems" on Mar 22, 2018
@BugRoger (Contributor, Author) commented Mar 22, 2018

Rebased, squashed, manually tested. This is ready for review.

@databus23 (Member) left a comment:
Awesome job. Especially for taking the time to write some proper tests!

Only some minor nits.

On a more general note, I think this controller does too much and mixes things that have nothing in common. I'm okay with that, as it's a grab bag of workarounds that we can hopefully remove, at least partially, in the future as things get more stable in the underlying platform.

IMO the security group reconciliation should be part of launchctl as part of a regular node reconciliation, since it is not only a workaround for a bug but also a feature: the security group gets synced when a user changes it in the spec. Maybe we can pull it out when we get to refactoring launchctl.

In general I'm not sure where the responsibilities of the launch and flight controllers begin and end in regard to nodes. Maybe we can clarify that at some point.

I'm not for holding this PR back because of any of the above comments, as I think it's important we have it merged ASAP to improve the overall stability.

LGTM!

func (c *klusterClient) SetSecurityGroup(nodeID string) (err error) {
err = secgroups.AddServer(c.ComputeClient, nodeID, c.Kluster.Spec.Openstack.SecurityGroupName).ExtractErr()
// ?????
if err.Error() == "EOF" {
Member:
I think a gophercloud update will fix this. See: gophercloud/gophercloud#815

Contributor (Author):
Yes, nice. Fixed it with an update.

func (c *klusterClient) EnsureKubernikusRuleInSecurityGroup() (created bool, err error) {
page, err := securitygroups.List(c.NetworkClient, securitygroups.ListOpts{Name: c.Kluster.Spec.Openstack.SecurityGroupName}).AllPages()
if err != nil {
return false, fmt.Errorf("SecurityGroup %v not found", c.Kluster.Spec.Openstack.SecurityGroupName)
Member:
Returning "Not found" might not be accurate. Could be a lot of other reasons for the list to fail.
I would suggest to return at least something like fmt.Errorf("SecurityGroup %v not found: %s", c.Kluster.Spec.Openstack.SecurityGroupName, err)

Contributor (Author):
agreed

"github.com/sapcc/kubernikus/pkg/controller/config"
)

// Flight Control
Member:
Awesome docs, but should this maybe also be copied to our markdown docs?

   * It deletes Nodes that didn't manage to register within 10m after
   inital creation. This is a workaround for DHCP/DVS (latency) issues.
   In effect it will delete the incompletely spawned node and launch
   control will ramp it back up.

   * It ensures tcp/udp/icmp rules exist in the security group defined during
   kluster creation. The rules explicitly allow all pod-to-pod
   communication. This is a workaround for Neutron missing the
   side-channel security group events.

   * It ensures each Nodes is member of the security group defined in
   the kluster spec. This ensures missing security groups due to whatever
   reason are again added to the node.