
Spinnaker shows old deleted k8s clusters in application and clusters tab #4803

Closed
perek opened this issue Aug 27, 2019 · 10 comments
Labels: bug · provider/kubernetes-v2 (Manifest-based Kubernetes provider) · sig/kubernetes


perek commented Aug 27, 2019

Issue Summary:

Deleting old clusters from the clouddriver config fails to remove their assets in certain views/apis.

Cloud Provider(s):

K8s, GKE

Environment:

1.15.1 - GKE - MySQL CloudDriver

Feature Area:

Kubernetes CloudDriver, CloudDriver MySQL

Description:

We have noticed a bug when removing old cluster accounts from our clouddriver configuration. The old clusters live on even after redeploying with the config removed. They still show up in the Application and Clusters tabs, but clicking on an asset breaks the UI because the API returns a 404 for the asset (pod, deployment, etc.).

Steps to Reproduce:

  • Onboard a cluster with the MySQL cache enabled.
  • Delete the config for the cluster.
  • Deploy Spinnaker minus the cluster.
  • Pods from the removed cluster still show.


@maggieneterval maggieneterval added bug provider/kubernetes-v2 Manifest based Kubernetes provider sig/kubernetes labels Aug 27, 2019
@maggieneterval maggieneterval self-assigned this Aug 27, 2019

dodizzle commented Sep 4, 2019

I am seeing similar behavior. I had applications deployed to three k8s clusters that no longer exist; hal config provider kubernetes account list confirms this. The applications have no infrastructure, yet they cannot be deleted. I have tried deleting them both with spin app delete and in the UI; in both cases the application still appears, and there are no errors when I attempt the delete. When I fetch the application with spin app get, it still lists the k8s cluster names as accounts even though they no longer exist in the configuration.

@maggieneterval (Contributor) commented:

@perek thanks for reporting this issue!

I think the problem stems from the fact that, for "logical" cache items (for example, Spinnaker clusters), the Kubernetes V2 provider relies on a 10-minute time to live (configured here) after which Redis will expire the entry.

@robzienert does SQL-backed Clouddriver respect a cache item's configured ttlSeconds, or should cloud providers that support SQL backing not be relying on time to live? I'm curious how the AWS provider handles evicting cluster entries from the cache when an account is removed (and happy to refactor the Kubernetes V2 implementation accordingly).

@robzienert (Member) commented:

Hi! It looks like cache items' TTLs are not honored by SQL. We track the value, but there is no cleanup agent that actually removes stale items beyond on-demand cache items. This seems to be an oversight, since Netflix never removes accounts.

It'd be great to get help on this front, but if no one has capacity to add a cleanup agent, let me know and I'll swoop in and make it happen.

@spinnakerbot commented:

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@kuberkaul commented:

@spinnakerbot remove-label stale

@maggieneterval (Contributor) commented:

Hey all, sorry for the delayed follow-up on this one! @robzienert and I chatted offline, and the tentative plan is to implement an admin endpoint for ad-hoc cleanup of orphaned rows, plus a feature flag that lets you opt in to automatic execution of that cleanup logic on startup.

@kuberkaul commented:

Great news! Is there any tentative ETA?

@robzienert (Member) commented:

There is no ETA. I'm hoping to get something out for review before 2020, but where that lands for OSS releases, I'm unsure.

In the meantime, there is a workaround, and it's how Netflix manages zero-downtime database changes for Clouddriver.

Workaround

We do this by performing a red/black on Clouddriver. For this pattern to work, you must have broken Clouddriver up into at least two clusters:

  • clouddriver-main for caching agents only. They serve no API traffic.
  • clouddriver-main-api for API traffic. They do not participate in caching.

Clouddriver SQL supports the concept of table namespaces, meaning you can red/black the database schema and contents. Since Clouddriver's cache is largely ephemeral, switching table namespaces just means you're going to re-cache everything. The table namespace is defined via the sql.tableNamespace configuration property, which defaults to null.
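As a sketch, the same property can be set in Clouddriver's YAML config rather than via a system property (the key name comes from the paragraph above; the value "a" is just an example):

```yaml
# clouddriver-local.yml sketch: pin this instance to table namespace "a".
# sql.tableNamespace defaults to null (no namespace).
sql:
  tableNamespace: a
```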

In our startup script for Clouddriver, we set an explicit value for sql.tableNamespace based on the cluster name. For example, we cycle between an "a" cluster (clouddriver-main-a, clouddriver-main-api-a) and a "b" cluster (clouddriver-main-b, clouddriver-main-api-b).

# set sql table namespace
TABLE_NAMESPACE=" -Dsql.tableNamespace=a"
if [[ ${NETFLIX_CLUSTER} == *"-b" ]]; then
  TABLE_NAMESPACE=" -Dsql.tableNamespace=b"
fi

Our deployment pipeline has an isTableFlip parameter; when set to true, a Run Job stage evaluates which namespace is currently in use and sets the cluster names to flip to the other namespace.
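The flip itself can be sketched in shell. The cluster names match the example startup script; the hard-coded current namespace is a stand-in for whatever the Run Job stage actually queries:

```shell
# Given the namespace currently serving traffic, emit the cluster names
# for the other namespace so the next deploy targets the idle side.
current_ns="a"   # stand-in: in practice the Run Job stage discovers this

if [ "${current_ns}" = "a" ]; then
  next_ns="b"
else
  next_ns="a"
fi

echo "clouddriver-main-${next_ns}"
echo "clouddriver-main-api-${next_ns}"
```

Teardown of the old clusters then targets the clusters suffixed with the previous namespace.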

This pipeline always deploys our caching agents first, followed by a 5-minute wait stage before deploying any API clusters, to give the caching cluster time to populate the database. YMMV on how long the wait stage should be.

Example pipeline:

[Run Job: Set Cluster Names] -> [Evaluate Variables] -> [Deploy Caching Agents] -> [Wait] -> [Deploy API clusters] -> [Teardown Old Clusters]

Once everything has been flipped over, you can delete (or truncate) the old tables.


This workaround involves a fair amount of operational work if you're not already set up for it, but once set up, the process is a breeze and takes no manual effort beyond setting the isTableFlip pipeline parameter.

@robzienert (Member) commented:

Forgot to mention, the truncate admin endpoint on clouddriver: https://github.com/spinnaker/clouddriver/blob/master/cats/cats-sql/src/main/kotlin/com/netflix/spinnaker/cats/sql/controllers/CatsSqlAdminController.kt#L35
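A hypothetical dry-run wrapper around that endpoint might look like the following. The path (/admin/db/truncate/{namespace}) and port 7002 are assumptions read off the linked controller, so verify them against your Clouddriver version before sending anything:

```shell
# Hypothetical helper: prints the request it would send rather than
# sending it, so running this is a dry run. Uncomment curl to execute.
truncate_namespace() {
  ns="$1"
  echo "PUT http://localhost:7002/admin/db/truncate/${ns}"
  # curl -X PUT "http://localhost:7002/admin/db/truncate/${ns}"
}

truncate_namespace a
```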

robzienert added a commit to robzienert/clouddriver that referenced this issue Dec 24, 2019
Adds a background agent that regularly scans the database for cache records that are
related to caching agents that do not exist anymore, and deletes them. This addresses
spinnaker/spinnaker#4803
mergify bot added a commit to spinnaker/clouddriver that referenced this issue Jan 13, 2020
Adds a background agent that regularly scans the database for cache records that are
related to caching agents that do not exist anymore, and deletes them. This addresses
spinnaker/spinnaker#4803

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@maggieneterval (Contributor) commented:

@robzienert added a disabled-by-default cleanup agent that will take care of purging entries from deleted accounts. You can enable the agent starting in 1.18, which will be released next week, by setting sql.unknown-agent-cleanup-agent.enabled: true in your clouddriver-local.yaml. Thanks all!
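In clouddriver-local.yaml, that dotted property expands to the following nesting (the sql: grouping is implied by the key name above):

```yaml
# clouddriver-local.yaml: opt in to the SQL cleanup agent (Spinnaker 1.18+).
sql:
  unknown-agent-cleanup-agent:
    enabled: true
```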
