
Spinnaker shows old deleted k8s clusters in application and clusters tab #4803

Closed
perek opened this issue Aug 27, 2019 · 10 comments
Labels: bug · provider/kubernetes-v2 (Manifest-based Kubernetes provider) · sig/kubernetes


perek commented Aug 27, 2019

Issue Summary:

Deleting old clusters from the clouddriver config fails to remove their assets in certain views/apis.

Cloud Provider(s):

K8s, GKE

Environment:

1.15.1 - GKE - MySQL CloudDriver

Feature Area:

Kubernetes CloudDriver, CloudDriver MySQL

Description:

We have noticed a bug when removing old cluster accounts from our clouddriver configuration. The old clusters live on even after redeploying with the config removed. They still show up in the Application and Clusters tabs, but clicking on an asset breaks the UI because the API returns a 404 for the asset (pod, deployment, etc.).

Steps to Reproduce:

  • Onboard a cluster with the MySQL cache enabled.
  • Delete the config for the cluster.
  • Deploy Spinnaker minus the cluster.
  • Pods from the removed cluster still show.


@maggieneterval maggieneterval added bug provider/kubernetes-v2 Manifest based Kubernetes provider sig/kubernetes labels Aug 27, 2019
@maggieneterval maggieneterval self-assigned this Aug 27, 2019

dodizzle commented Sep 4, 2019

I am seeing similar behavior. I had applications deployed to three k8s clusters that no longer exist; hal config provider kubernetes account list confirms this. The applications have no infrastructure, yet they cannot be deleted. I have tried deleting them both with spin app delete and in the UI; in both cases the application still appears, and there are no errors when I attempt the delete. When I fetch the application with spin app get, it still lists the k8s cluster names as accounts even though they no longer exist in the configuration.

@maggieneterval (Contributor) commented:

@perek thanks for reporting this issue!

I think the problem stems from the fact that, for "logical" cache items (for example, Spinnaker clusters), the Kubernetes V2 provider relies on a 10-minute time to live (configured here) after which Redis will expire the entry.

@robzienert does SQL-backed Clouddriver respect a cache item's configured ttlSeconds, or should cloud providers that support SQL backing not be relying on time to live? I'm curious how the AWS provider handles evicting cluster entries from the cache when an account is removed (and happy to refactor the Kubernetes V2 implementation accordingly).

@robzienert (Member) commented:

Hi! It looks like cache items' TTLs are not honored by SQL. We track the value, but there is no cleanup agent that actually removes stale items beyond on-demand cache items. This seems to be an oversight, since Netflix never removes accounts.

It'd be great to get help on this front, but if no one has capacity to add a cleanup agent, let me know and I'll swoop in and make it happen.

@spinnakerbot commented:

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@kuberkaul commented:

@spinnakerbot remove-label stale

@maggieneterval (Contributor) commented:

Hey all, sorry for the delayed follow-up on this one! @robzienert and I chatted offline, and the tentative plan is to implement an admin endpoint for ad-hoc cleanup of orphaned rows, plus a feature flag that lets you opt in to automatic execution of that cleanup logic on startup.

@kuberkaul commented:

Great news! Is there any tentative ETA?

@robzienert (Member) commented:

There is no ETA. I'm hoping to get something out for review before 2020, but where that lands for OSS releases, I'm unsure.

In the meantime, there is a workaround, and it's how Netflix manages zero-downtime database changes for Clouddriver.

Workaround

We do this by performing a red/black on Clouddriver. For this pattern to work, you must have broken Clouddriver up into at least two clusters:

  • clouddriver-main for caching agents only. They serve no API traffic.
  • clouddriver-main-api for API traffic. They do not participate in caching.

Clouddriver SQL supports the concept of table namespaces, meaning you can red/black the database schema and contents. Since Clouddriver's cache is largely ephemeral, switching table namespaces just means you're going to re-cache everything. The table namespace is defined via the sql.tableNamespace configuration property, which defaults to null.
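As a sketch, the same property can be set in Clouddriver's YAML config rather than via a system property (the key name comes from the paragraph above; the value "a" is just an example):

```yaml
# clouddriver-local.yml sketch: pin this instance to table namespace "a".
# sql.tableNamespace defaults to null (no namespace).
sql:
  tableNamespace: a
```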

In our startup script for Clouddriver, we set an explicit value for sql.tableNamespace based on the cluster name. For example, we cycle between an "a" cluster (clouddriver-main-a, clouddriver-main-api-a) and a "b" cluster (clouddriver-main-b, clouddriver-main-api-b).

# set sql table namespace
TABLE_NAMESPACE=" -Dsql.tableNamespace=a"
if [[ ${NETFLIX_CLUSTER} == *"-b" ]]; then
  TABLE_NAMESPACE=" -Dsql.tableNamespace=b"
fi

Our deployment pipeline has an isTableFlip parameter; when set to true, a Run Job stage evaluates which namespace is currently in use and sets the cluster names to flip to the other namespace.
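The flip itself can be sketched in shell. The cluster names match the example startup script; the hard-coded current namespace is a stand-in for whatever the Run Job stage actually queries:

```shell
# Given the namespace currently serving traffic, emit the cluster names
# for the other namespace so the next deploy targets the idle side.
current_ns="a"   # stand-in: in practice the Run Job stage discovers this

if [ "${current_ns}" = "a" ]; then
  next_ns="b"
else
  next_ns="a"
fi

echo "clouddriver-main-${next_ns}"
echo "clouddriver-main-api-${next_ns}"
```

Teardown of the old clusters then targets the clusters suffixed with the previous namespace.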

This pipeline always deploys our caching agents first, followed by a 5-minute wait stage before deploying any API clusters, to give the caching cluster time to populate the database. YMMV on how long the wait stage should be.

Example pipeline:

[Run Job: Set Cluster Names] -> [Evaluate Variables] -> [Deploy Caching Agents] -> [Wait] -> [Deploy API clusters] -> [Teardown Old Clusters]

Once everything has been flipped over, you can delete (or truncate) the old tables.


This workaround involves a fair amount of operational work if you're not already set up for it, but once set up, the process is a breeze and takes no manual effort beyond setting the isTableFlip pipeline parameter.

@robzienert (Member) commented:

Forgot to mention, the truncate admin endpoint on clouddriver: https://github.com/spinnaker/clouddriver/blob/master/cats/cats-sql/src/main/kotlin/com/netflix/spinnaker/cats/sql/controllers/CatsSqlAdminController.kt#L35
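A hypothetical dry-run wrapper around that endpoint might look like the following. The path (/admin/db/truncate/{namespace}) and port 7002 are assumptions read off the linked controller, so verify them against your Clouddriver version before sending anything:

```shell
# Hypothetical helper: prints the request it would send rather than
# sending it, so running this is a dry run. Uncomment curl to execute.
truncate_namespace() {
  ns="$1"
  echo "PUT http://localhost:7002/admin/db/truncate/${ns}"
  # curl -X PUT "http://localhost:7002/admin/db/truncate/${ns}"
}

truncate_namespace a
```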

robzienert added a commit to robzienert/clouddriver that referenced this issue Dec 24, 2019
Adds a background agent that regularly scans the database for cache records that are
related to caching agents that do not exist anymore, and deletes them. This addresses
spinnaker/spinnaker#4803
mergify bot added a commit to spinnaker/clouddriver that referenced this issue Jan 13, 2020
Adds a background agent that regularly scans the database for cache records that are
related to caching agents that do not exist anymore, and deletes them. This addresses
spinnaker/spinnaker#4803

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@maggieneterval (Contributor) commented:

@robzienert added a disabled-by-default cleanup agent that will take care of purging entries from deleted accounts. You can enable the agent starting in 1.18, which will be released next week, by setting sql.unknown-agent-cleanup-agent.enabled: true in your clouddriver-local.yaml. Thanks all!
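In clouddriver-local.yaml, that dotted property expands to the following nesting (the sql: grouping is implied by the key name above):

```yaml
# clouddriver-local.yaml: opt in to the SQL cleanup agent (Spinnaker 1.18+).
sql:
  unknown-agent-cleanup-agent:
    enabled: true
```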
