Clouddriver in Spinnaker 1.6.1 is more fragile #2683

Closed
wheleph opened this issue Apr 20, 2018 · 12 comments

@wheleph

wheleph commented Apr 20, 2018

Cloud Provider

GKE (Kubernetes)

Environment

Spinnaker 1.6.1 running on GKE deployed with halyard 0.49.0

Description

In our Spinnaker setup we use many Kubernetes accounts, and some of them occasionally become outdated. This wasn’t a problem with Spinnaker 1.5.4 (Clouddriver 1.0.4-20180110144440): the Clouddriver health endpoint returned OK and Clouddriver stayed up and running, even though some accounts had outdated keys:

bash-4.3# curl http://localhost:7002/health
{"status":"UP"}

After upgrading to Spinnaker 1.6.0 (Clouddriver 2.0.0-20180221152902) and/or 1.6.1 (Clouddriver 2.1.0-20180319132609), we noticed that an error in even a single Kubernetes account causes the health endpoint to fail:

bash-4.4# curl -m 10 http://localhost:7002/health
{"error":"Internal Server Error","exception":"com.netflix.spinnaker.clouddriver.kubernetes.v1.deploy.exception.KubernetesOperationException","message":"Get Namespace kuba-test for account spinnaker-bolcom-stg-kuba-test-f65 failed: Unauthorized! Token may have expired! Please log-in again. Unauthorized","status":500,"timestamp":1521210238133}

This in turn makes Kubernetes consider Clouddriver unhealthy, and Spinnaker stops functioning.
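
For context, here is a minimal sketch of the fragile pattern (hypothetical names, not the actual Clouddriver code, which is linked further down in this thread): if a Spring Boot HealthIndicator lets an account exception escape from health(), the /health endpoint answers HTTP 500, and the kubelet liveness probe treats the pod as dead.

import java.util.List;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Hypothetical sketch, not the actual Clouddriver code.
class FragileKubernetesHealthIndicator implements HealthIndicator {

  // Assumed minimal account abstraction for illustration.
  interface KubernetesAccount {
    List<String> getDeclaredNamespaces(); // may throw, e.g. on an expired token
  }

  private final List<KubernetesAccount> accounts;

  FragileKubernetesHealthIndicator(List<KubernetesAccount> accounts) {
    this.accounts = accounts;
  }

  @Override
  public Health health() {
    for (KubernetesAccount account : accounts) {
      // One bad account -> uncaught exception -> HTTP 500 on /health.
      account.getDeclaredNamespaces();
    }
    return Health.up().build();
  }
}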

Additional information

More discussion: https://community.spinnaker.io/t/clouddriver-in-spinnaker-1-6-0-is-more-fragile/232

@ethanfrogers
Contributor

@wheleph spinnaker/clouddriver#2531 was merged last night which takes a similar approach for Docker registries so it's possible that this could be implemented for Kubernetes. I'm not sure if it would conflict with the proposal in #2604 but I don't think so.

@wheleph
Author

wheleph commented Apr 23, 2018

@ethanfrogers something like that would definitely help, but the previous version of Clouddriver became healthy even if some of the accounts were initially invalid, and we relied on that fact in our automation.

@mdirkse

mdirkse commented May 1, 2018

Alright, I've found the exact commit that introduces the behavior that breaks our deployment of Spinnaker: spinnaker/clouddriver@87c6921

Before this commit, if you launch Clouddriver with a faulty k8s account, it reports "UP" at /health after it has started. After the commit, it reports a 500 status with whatever error ails the k8s account.

Note that if clouddriver is configured in such a way that the k8s account info points to a non-existent file it will never come up healthy, before or after said commit.

On the current master the exception originates here: https://github.com/spinnaker/clouddriver/blob/master/clouddriver-kubernetes/src/main/groovy/com/netflix/spinnaker/clouddriver/kubernetes/v1/security/KubernetesV1Credentials.java#L191

The exception occurs because the health check hits this: https://github.com/spinnaker/clouddriver/blob/a681c7eab5d6945b2fe60ae071577f5bfb5092af/clouddriver-kubernetes/src/main/groovy/com/netflix/spinnaker/clouddriver/kubernetes/health/KubernetesHealthIndicator.groovy#L65
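
Roughly, the rethrow pattern looks like this (a paraphrased sketch under assumed names, not the literal source): client failures are wrapped and rethrown instead of swallowed, so they reach any caller of getDeclaredNamespaces(), including the health check.

import java.util.List;

// Paraphrased sketch of the rethrow in KubernetesV1Credentials; the client
// and exception wiring are assumptions for illustration.
class KubernetesV1CredentialsSketch {

  interface NamespaceClient { // assumed client abstraction
    List<String> getNamespaces();
  }

  private final NamespaceClient client;
  private final String accountName;

  KubernetesV1CredentialsSketch(NamespaceClient client, String accountName) {
    this.client = client;
    this.accountName = accountName;
  }

  List<String> getDeclaredNamespaces() {
    try {
      return client.getNamespaces();
    } catch (RuntimeException e) {
      // Mirrors the message shape seen in the 500 body above; the real code
      // throws KubernetesOperationException.
      throw new RuntimeException(
          "Get Namespace for account " + accountName + " failed: " + e.getMessage(), e);
    }
  }
}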

@ethanfrogers the active namespace check that results from your commit makes the very first health check fail, which means that Halyard will never report the Clouddriver deploy as successful because it never becomes healthy. Before your commit, things at least started out healthy, so Halyard could complete the deploy.
What your commit does, however, seems sensible, so the question is how to remedy this situation. In the discussion thread you say:

I’ll also add that the behavior is not intended. Clouddriver should be able to handle partial unavailability (network blip, etc) of Kubernetes.

Given this prerequisite, I guess it would make sense to have clouddriver report "UP" even though some k8s accounts might be (temporarily) broken. Would you agree?
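
For illustration, that tolerant behavior could look roughly like this (a hypothetical sketch; the account abstraction is assumed, not Clouddriver's actual API): broken accounts are surfaced as details, but the service itself stays UP.

import java.util.List;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Hypothetical sketch: account problems become details, never a 500.
class TolerantKubernetesHealthIndicator implements HealthIndicator {

  interface KubernetesAccount { // assumed minimal abstraction
    String getName();
    List<String> getDeclaredNamespaces(); // may throw for a broken account
  }

  private final List<KubernetesAccount> accounts;

  TolerantKubernetesHealthIndicator(List<KubernetesAccount> accounts) {
    this.accounts = accounts;
  }

  @Override
  public Health health() {
    Health.Builder builder = Health.up();
    for (KubernetesAccount account : accounts) {
      try {
        account.getDeclaredNamespaces();
        builder.withDetail(account.getName(), "healthy");
      } catch (RuntimeException e) {
        // Recorded per account, but does not flip the overall status.
        builder.withDetail(account.getName(), "unhealthy: " + e.getMessage());
      }
    }
    return builder.build();
  }
}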

@ethanfrogers
Contributor

@mdirkse great find! If I remember correctly, I made this change because we removed the hard dependency on clouddriver-docker from clouddriver-kubernetes, since V2 didn't need it. That caused issues at startup where image pull secrets weren't being created in each namespace listed under namespaces. It seems that I even saw the issue with health but never actually reverted, and I can't remember whether it was resolved or I found a workaround.

Perhaps we should implement something like the above for the V1 provider, which would prevent Clouddriver from being deployed with invalid credentials but would not kill Clouddriver if those credentials are invalidated in the future? Or maybe that's missing the point?

@mdirkse

mdirkse commented May 3, 2018

Well, my larger question was: when is Clouddriver not healthy? If we assume the requirement that Clouddriver should be able to handle unavailability of (some) k8s accounts, then it would stand to reason that it's still healthy even though an account may be non-functional. So then I'd also not expect a broken account to stop Clouddriver from being deployed (otherwise you could get into a situation where you can't do a Spinnaker upgrade because a network connection to a particular k8s cluster is temporarily down, which seems a little strange).

Let me know if I misunderstood your comment about Clouddriver health. If not, we could explore ways to keep k8s account health from impacting Clouddriver health.

@ethanfrogers
Contributor

@mdirkse you're right, that is where I was going with that. I guess we just need to define the semantics around Clouddriver health vs. account health with respect to Kubernetes. From my perspective, we have 2 options:

  1. Fix the regression such that you can deploy Clouddriver with unhealthy Kubernetes accounts. The health of an account shouldn't determine the health of Clouddriver at all. I believe Halyard verifies account health, so it may be a non-issue for those using it, but for those who aren't, it could lead to some added debugging overhead.
  2. Take the same approach as spinnaker/clouddriver@87c6921 and only mark Clouddriver as healthy once all accounts are healthy, BUT subsequent health issues in the accounts will be ignored (a rough sketch of this latch idea follows below).
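
A rough sketch of option 2's latch (hypothetical, with the same assumed account abstraction as above): health checks fail until every account has been verified once, after which later account errors no longer fail /health.

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Hypothetical sketch of option 2, not a real patch.
class LatchingKubernetesHealthIndicator implements HealthIndicator {

  interface KubernetesAccount { // assumed minimal abstraction
    List<String> getDeclaredNamespaces(); // throws while an account is bad
  }

  private final List<KubernetesAccount> accounts;
  private final AtomicBoolean everHealthy = new AtomicBoolean(false);

  LatchingKubernetesHealthIndicator(List<KubernetesAccount> accounts) {
    this.accounts = accounts;
  }

  @Override
  public Health health() {
    if (everHealthy.get()) {
      // Once all accounts have been verified, later failures are ignored.
      return Health.up().build();
    }
    try {
      for (KubernetesAccount account : accounts) {
        account.getDeclaredNamespaces();
      }
      everHealthy.set(true);
      return Health.up().build();
    } catch (RuntimeException e) {
      // DOWN (a well-formed body, not a 500) until the first full success.
      return Health.down(e).build();
    }
  }
}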

@lwander do you have any thoughts on this?

@wheleph
Author

wheleph commented May 3, 2018

Halyard can completely skip validation with the --no-validate option (e.g. hal deploy apply --no-validate)

@spinnakerbot

This issue hasn't been updated in 103 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@wheleph
Author

wheleph commented Sep 2, 2018

@spinnakerbot remove-label stale

spinnakerbot removed the stale label Sep 2, 2018
@lwander
Member

lwander commented Sep 4, 2018

I believe this was fixed for 1.8 and greater: spinnaker/clouddriver#2752

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@ethanfrogers
Contributor

Closing because 1.6 is deprecated. If the problem still exists in a newer version of Spinnaker, please submit a new issue.
