
Clouddriver with Mysql unable to invalidate stale cache #5958

Closed
piyushGoyal2 opened this issue Aug 5, 2020 · 12 comments

@piyushGoyal2

Issue Summary:

Clouddriver with MySQL is unable to invalidate stale cache entries.

Cloud Provider(s):

AWS, EKS, Kubernetes

Environment:

Spinnaker 1.19.x on EKS; Clouddriver backed by Aurora MySQL 5.7.12

Feature Area:

Kubernetes CloudDriver, CloudDriver MySQL

Description:

My Clouddriver is backed by Aurora MySQL 5.7.12, and it seems the stale cache is never actually invalidated. On the infrastructure page, under the Clusters tab, even after a Kubernetes resource (deployment, replicaset, or pod) is deleted, whether through the Clusters tab, kubectl, or a Delete (Manifest) stage, it stays listed indefinitely. Checking the MySQL tables, the cats_pod table still contains the old pod entries with ttlseconds=-1.
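
For reference, a quick way to eyeball this (illustrative sketch only; cats_pod is the table from our setup described above, and last_updated is an assumed column name that may differ between Clouddriver versions):

-- Count cached pod entries; with this bug the count never decreases after deletes.
SELECT COUNT(*) FROM cats_pod;

-- Assumed column: last_updated (epoch millis). List the oldest entries to spot
-- rows that should have been evicted.
SELECT id, last_updated FROM cats_pod ORDER BY last_updated ASC LIMIT 10;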

Steps to Reproduce:

1. Deploy a Spinnaker-managed Kubernetes resource.
2. Delete it from the Clusters tab on the infrastructure page.
3. The resource is still listed on the Clusters tab.

Additional Details:

@ajordens
Contributor

ajordens commented Aug 6, 2020

Assigning to the Kubernetes SIG as it seems like it might be more related to how the cloud provider itself is caching.

@piyushGoyal2
Author

@ajordens - Thanks, that makes sense. We're waiting on the Kubernetes SIG to debug this and will work with them on it. This is currently looking like a blocker for us in moving Clouddriver to MySQL.

@karlskewes
Contributor

karlskewes commented Aug 6, 2020

We have both EC2 and Kubernetes accounts configured.
Only EC2 accounts appear to be targeted for cleanup.
We removed a couple of Kubernetes accounts weeks ago, but the items remain in the Infrastructure view. Clicking on them just shows a white side pane with no details.

Looking through our Clouddriver logs, we can only see cleanup running for EC2 accounts, per the logs below:

$ kubectl logs clouddriver-caching-54cbc7b8cd-455n8 | grep -i clean
2020-08-06 22:19:34.780  INFO 1 --- [ionAction-50992] .n.s.c.a.a.CleanupDetachedInstancesAgent : Looking for instances pending termination in <ec2-account-1>:ap-southeast-2
2020-08-06 22:19:34.829  INFO 1 --- [ionAction-50992] .n.s.c.a.a.CleanupDetachedInstancesAgent : Looking for instances pending termination in <ec2-account-2>:ap-southeast-2
2020-08-06 22:19:34.924  INFO 1 --- [ionAction-50992] .n.s.c.a.a.CleanupDetachedInstancesAgent : Looking for instances pending termination in <ec2-account-2>:ap-southeast-1
2020-08-06 22:19:35.066  INFO 1 --- [ionAction-50992] .n.s.c.a.a.CleanupDetachedInstancesAgent : Looking for instances pending termination in <ec2-account-2>:us-east-2
2020-08-06 22:19:35.297  INFO 1 --- [ionAction-50992] .n.s.c.a.a.CleanupDetachedInstancesAgent : Looking for instances pending termination in <ec2-account-3>:ap-southeast-2
2020-08-06 22:19:35.347  INFO 1 --- [ionAction-50992] .n.s.c.a.a.CleanupDetachedInstancesAgent : Looking for instances pending termination in <ec2-account-4>:ap-southeast-2
2020-08-06 22:21:15.773  INFO 1 --- [ionAction-50975] c.n.s.c.sql.SqlTaskCleanupAgent          : Cleaning up 3 completed tasks (82 states, 3 result objects)
2020-08-06 22:36:17.189  INFO 1 --- [ionAction-51001] c.n.s.c.sql.SqlTaskCleanupAgent          : Cleaning up 2 completed tasks (49 states, 2 result objects)
$ kubectl logs clouddriver-rw-68b9c8dd96-k497v | grep -i clean
2020-07-20 01:45:28.745  INFO 1 --- [           main] c.n.spinnaker.cats.sql.cache.SqlCache    : Configured for com.netflix.spinnaker.clouddriver.aws.provider.AwsCleanupProvider

Looking at the original issue #4803 and the corresponding PR spinnaker/clouddriver#4232 that added the cleanup agent, I see there is a flag that was noted in the PR:

This functionality defaults to disabled. It can be enabled with sql.unknown-agent-cleanup-agent.enabled=true

https://github.com/spinnaker/clouddriver/blob/master/cats/cats-sql/src/main/kotlin/com/netflix/spinnaker/config/SqlCacheConfiguration.kt#L168

Do we need to enable this for Kubernetes? Is it an 'unknown agent'?
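
If we do need it, I assume it would be set in a Clouddriver profile roughly like the sketch below. The property name comes from the PR comment above; the clouddriver-local.yml location assumes a standard Halyard install, so adjust for your deployment method.

# ~/.hal/default/profiles/clouddriver-local.yml (assumed location)
sql:
  unknown-agent-cleanup-agent:
    enabled: true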

ezimanyi self-assigned this Aug 7, 2020
@piyushGoyal2
Author

piyushGoyal2 commented Aug 24, 2020

@ezimanyi : Hi Eric, any updates on this? I'd love to hear your thoughts on it.

@karlskewes
Contributor

karlskewes commented Sep 10, 2020

Per RZ's guidance in Slack, we enabled sql.unknown-agent-cleanup-agent.enabled=true for clouddriver-caching, and all of the old Kubernetes application replicas (Spinnaker clusters) and accounts were cleaned up. 🎉

As always with databases, I suggest taking a snapshot first. Here's a validating query (I think).

mysql> SELECT COUNT(*) FROM cats_v1_clusters;
+----------+
| COUNT(*) |
+----------+
|     500 |  << was ~1000
+----------+
1 row in set (0.00 sec)

I'll work on a PR to the Clouddriver SQL docs, but ideally, as suggested, this could be enabled automatically when using SQL with Kubernetes.

@ezimanyi
Contributor

@robzienert : Any thoughts on whether it would be safe to set sql.unknown-agent-cleanup-agent.enabled=true by default? I remember discussing disabling it by default when you added it in spinnaker/clouddriver#4232, but I forget whether that was mostly a safety measure for the rollout of the change or whether it was intended to stay off by default permanently.

I think it should be enabled by default for Kubernetes users (as suggested by @kskewes above), but it feels like a strange coupling to have the default here depend on which cloud providers are enabled, so it might be worth just enabling it by default for everyone.

@piyushGoyal2
Author

@ezimanyi and @kskewes - I believe that, apart from cleaning up old accounts, the main reason for opening this issue was that when the cluster state changes, the infrastructure tab doesn't update in real time, leaving a stale view of the cluster.

Let me verify the configuration @kskewes suggested and I'll update the issue.

ezimanyi removed their assignment Sep 22, 2020
@ezimanyi
Contributor

@piyushGoyal2 : If you're seeing the cluster tab fail to update to account for changes, the root cause is likely that your caching cycles are not completing quickly enough. There is ongoing performance improvement work to make this less likely, and a discussion of workarounds on the closed issue #5611.

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@spinnakerbot

This issue is tagged as 'to-be-closed' and hasn't been updated in 45 days, so we are closing it. You can always reopen this issue if needed.
