Clouddriver in Spinnaker 1.6.1 is more fragile #2683

Closed
wheleph opened this issue Apr 20, 2018 · 12 comments

@wheleph

wheleph commented Apr 20, 2018

Cloud Provider

GKE (Kubernetes)

Environment

Spinnaker 1.6.1 running on GKE deployed with halyard 0.49.0

Description

In our Spinnaker setup we use many Kubernetes accounts, and some of them occasionally become outdated. This wasn’t a problem with Spinnaker 1.5.4 (Clouddriver 1.0.4-20180110144440): the Clouddriver health endpoint returned OK and Clouddriver stayed up and running, even though some accounts had outdated keys:

bash-4.3# curl http://localhost:7002/health
{"status":"UP"}

After upgrading to Spinnaker 1.6.0 (Clouddriver 2.0.0-20180221152902) and/or 1.6.1 (Clouddriver 2.1.0-20180319132609), we noticed that an error in even a single Kubernetes account causes the health endpoint to fail:

bash-4.4# curl -m 10 http://localhost:7002/health
{"error":"Internal Server Error","exception":"com.netflix.spinnaker.clouddriver.kubernetes.v1.deploy.exception.KubernetesOperationException","message":"Get Namespace kuba-test for account spinnaker-bolcom-stg-kuba-test-f65 failed: Unauthorized! Token may have expired! Please log-in again. Unauthorized","status":500,"timestamp":1521210238133}

This in turn makes Kubernetes consider Clouddriver unhealthy, and Spinnaker stops functioning.
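
For context, here is a minimal sketch of the fragile pattern (hypothetical names, not the actual Clouddriver code, which is linked further down in this thread): if a Spring Boot HealthIndicator lets an account exception escape from health(), the /health endpoint answers HTTP 500, and the kubelet liveness probe treats the pod as dead.

import java.util.List;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Hypothetical sketch, not the actual Clouddriver code.
class FragileKubernetesHealthIndicator implements HealthIndicator {

  // Assumed minimal account abstraction for illustration.
  interface KubernetesAccount {
    List<String> getDeclaredNamespaces(); // may throw, e.g. on an expired token
  }

  private final List<KubernetesAccount> accounts;

  FragileKubernetesHealthIndicator(List<KubernetesAccount> accounts) {
    this.accounts = accounts;
  }

  @Override
  public Health health() {
    for (KubernetesAccount account : accounts) {
      // One bad account -> uncaught exception -> HTTP 500 on /health.
      account.getDeclaredNamespaces();
    }
    return Health.up().build();
  }
}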

Additional information

More discussion: https://community.spinnaker.io/t/clouddriver-in-spinnaker-1-6-0-is-more-fragile/232

@ethanfrogers
Contributor

@wheleph spinnaker/clouddriver#2531 was merged last night which takes a similar approach for Docker registries so it's possible that this could be implemented for Kubernetes. I'm not sure if it would conflict with the proposal in #2604 but I don't think so.

@wheleph
Author

wheleph commented Apr 23, 2018

@ethanfrogers something like that would definitely help, but the previous version of Clouddriver became healthy even if some of the accounts were initially invalid, and we relied on that fact in our automation.

@mdirkse

mdirkse commented May 1, 2018

Alright, I've found the exact commit that introduces the behavior that breaks our deployment of Spinnaker: spinnaker/clouddriver@87c6921

Before this commit, if you launch Clouddriver with a faulty k8s account, it reports "UP" at /health after it has started. After the commit, it reports a 500 status with whatever error ails the k8s account.

Note that if clouddriver is configured in such a way that the k8s account info points to a non-existent file it will never come up healthy, before or after said commit.

On the current master the exception originates here: https://github.com/spinnaker/clouddriver/blob/master/clouddriver-kubernetes/src/main/groovy/com/netflix/spinnaker/clouddriver/kubernetes/v1/security/KubernetesV1Credentials.java#L191

The exception occurs because the health check hits this: https://github.com/spinnaker/clouddriver/blob/a681c7eab5d6945b2fe60ae071577f5bfb5092af/clouddriver-kubernetes/src/main/groovy/com/netflix/spinnaker/clouddriver/kubernetes/health/KubernetesHealthIndicator.groovy#L65
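
Roughly, the rethrow pattern looks like this (a paraphrased sketch under assumed names, not the literal source): client failures are wrapped and rethrown instead of swallowed, so they reach any caller of getDeclaredNamespaces(), including the health check.

import java.util.List;

// Paraphrased sketch of the rethrow in KubernetesV1Credentials; the client
// and exception wiring are assumptions for illustration.
class KubernetesV1CredentialsSketch {

  interface NamespaceClient { // assumed client abstraction
    List<String> getNamespaces();
  }

  private final NamespaceClient client;
  private final String accountName;

  KubernetesV1CredentialsSketch(NamespaceClient client, String accountName) {
    this.client = client;
    this.accountName = accountName;
  }

  List<String> getDeclaredNamespaces() {
    try {
      return client.getNamespaces();
    } catch (RuntimeException e) {
      // Mirrors the message shape seen in the 500 body above; the real code
      // throws KubernetesOperationException.
      throw new RuntimeException(
          "Get Namespace for account " + accountName + " failed: " + e.getMessage(), e);
    }
  }
}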

@ethanfrogers the active namespace check that results from your commit makes the very first health check fail, which means that Halyard will never report the Clouddriver deploy as successful because it never becomes healthy. Before your commit, things at least started out healthy, so Halyard could complete the deploy.
What your commit does, however, seems sensible, so the question is how to remedy this situation. In the discussion thread you say:

I’ll also add that the behavior is not intended. Clouddriver should be able to handle partial unavailability (network blip, etc) of Kubernetes.

Given this prerequisite, I guess it would make sense to have clouddriver report "UP" even though some k8s accounts might be (temporarily) broken. Would you agree?
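
For illustration, that tolerant behavior could look roughly like this (a hypothetical sketch; the account abstraction is assumed, not Clouddriver's actual API): broken accounts are surfaced as details, but the service itself stays UP.

import java.util.List;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Hypothetical sketch: account problems become details, never a 500.
class TolerantKubernetesHealthIndicator implements HealthIndicator {

  interface KubernetesAccount { // assumed minimal abstraction
    String getName();
    List<String> getDeclaredNamespaces(); // may throw for a broken account
  }

  private final List<KubernetesAccount> accounts;

  TolerantKubernetesHealthIndicator(List<KubernetesAccount> accounts) {
    this.accounts = accounts;
  }

  @Override
  public Health health() {
    Health.Builder builder = Health.up();
    for (KubernetesAccount account : accounts) {
      try {
        account.getDeclaredNamespaces();
        builder.withDetail(account.getName(), "healthy");
      } catch (RuntimeException e) {
        // Recorded per account, but does not flip the overall status.
        builder.withDetail(account.getName(), "unhealthy: " + e.getMessage());
      }
    }
    return builder.build();
  }
}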

@ethanfrogers
Contributor

@mdirkse great find! If I remember correctly, I made this change because we removed the hard dependency on clouddriver-docker from clouddriver-kubernetes, since V2 didn't need it. That caused issues at startup where image pull secrets weren't being created in each namespace listed under namespaces. It seems that I even saw the issue with health but never actually reverted, and I can't remember whether it was resolved or I found a workaround.

Perhaps we should implement something like the above for the V1 provider, which would prevent Clouddriver from being deployed with invalid credentials but would not kill Clouddriver if those credentials are invalidated in the future? Or maybe that's missing the point?

@mdirkse

mdirkse commented May 3, 2018

Well, my larger question was: when is Clouddriver not healthy? If we assume the requirement that Clouddriver should be able to handle unavailability of (some) k8s accounts, then it would stand to reason that it's still healthy even though an account may be non-functional. So then I'd also not expect a broken account to stop Clouddriver from being deployed (otherwise you could get into a situation where you can't do a Spinnaker upgrade because a network connection to a particular k8s cluster is temporarily down, which seems a little strange).

Let me know if I misunderstood your comment about Clouddriver health. If not, we could explore ways to keep k8s account health from impacting Clouddriver health.

@ethanfrogers
Contributor

@mdirkse you're right, that is where I was going with that. I guess we just need to define the semantics around Clouddriver health vs. account health with respect to Kubernetes. From my perspective, we have 2 options:

  1. Fix the regression such that you can deploy Clouddriver with unhealthy Kubernetes accounts. The health of an account shouldn't determine the health of Clouddriver at all. I believe Halyard verifies account health, so it may be a non-issue for those using it, but for those who aren't, it could lead to some added debugging overhead.
  2. Take the same approach as spinnaker/clouddriver@87c6921 and only mark Clouddriver as healthy once all accounts are healthy, BUT subsequent health issues in the accounts will be ignored (a rough sketch of this latch idea follows below).
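
A rough sketch of option 2's latch (hypothetical, with the same assumed account abstraction as above): health checks fail until every account has been verified once, after which later account errors no longer fail /health.

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Hypothetical sketch of option 2, not a real patch.
class LatchingKubernetesHealthIndicator implements HealthIndicator {

  interface KubernetesAccount { // assumed minimal abstraction
    List<String> getDeclaredNamespaces(); // throws while an account is bad
  }

  private final List<KubernetesAccount> accounts;
  private final AtomicBoolean everHealthy = new AtomicBoolean(false);

  LatchingKubernetesHealthIndicator(List<KubernetesAccount> accounts) {
    this.accounts = accounts;
  }

  @Override
  public Health health() {
    if (everHealthy.get()) {
      // Once all accounts have been verified, later failures are ignored.
      return Health.up().build();
    }
    try {
      for (KubernetesAccount account : accounts) {
        account.getDeclaredNamespaces();
      }
      everHealthy.set(true);
      return Health.up().build();
    } catch (RuntimeException e) {
      // DOWN (a well-formed body, not a 500) until the first full success.
      return Health.down(e).build();
    }
  }
}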

@lwander do you have any thoughts on this?

@wheleph
Author

wheleph commented May 3, 2018

Halyard can completely skip validation with the --no-validate option (e.g. hal deploy apply --no-validate)

@spinnakerbot

This issue hasn't been updated in 103 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@wheleph
Author

wheleph commented Sep 2, 2018

@spinnakerbot remove-label stale

spinnakerbot removed the stale label Sep 2, 2018
@lwander
Member

lwander commented Sep 4, 2018

I believe this was fixed for 1.8 and greater: spinnaker/clouddriver#2752

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@ethanfrogers
Contributor

Closing because 1.6 is deprecated. If the problem still exists in a newer version of Spinnaker, please submit a new issue.
