Kubernetes V2 account Infrastructure is not stable #5611
Comments
We hit the same issue. We are using an external GCP-based Redis.
This is very likely happening because caching cycles are not completing within the 10-minute expiration timeout for logical cache items (server groups, load balancers, etc.). The way the Kubernetes V2 provider works is that it reads all of your deployed infrastructure and then creates two types of cache entries: logical entries (server groups, load balancers, etc.) and raw infrastructure entries.
When you load pages in the UI, it's generally the logical entries that are driving that UI (though if you drill down enough they'll pull in some info from the raw infrastructure). The logical entries expire automatically after 10 minutes, so if a caching cycle has not completed in the past 10 minutes, the infrastructure tab will become empty (or at least will lose items from the accounts that have not been cached in the past 10 minutes). The account filter will go away as well, since it is just based on what is on the infrastructure page; if there is no infrastructure, the account filter will also be empty. Some functionality (e.g. search) does also look up raw infrastructure entries, which is why you'll still see some evidence of the infrastructure in the search tab. I believe the reason we expire the logical keys after 10 minutes was as a shortcut to adding logic to ensure these are fully purged in all cases when the relevant items are deleted (though I'm not 100% sure of the motivation, as there's no context provided on spinnaker/clouddriver#2051). Potentially this expiry could be removed, but I'm a bit concerned about introducing other bugs and haven't deeply analyzed the code to see if it's safe. Ultimately the root cause here is that caching cycles are taking longer than 10 minutes, or are being scheduled less frequently than every 10 minutes.
Have you enabled the logging that records how long each caching agent is taking? If so, you can check whether some agents are taking >10 minutes. There were some pretty significant improvements to caching performance in the last year or so, so please ensure you're on at least Spinnaker 1.18 to see if that helps as well.
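As a hedged sketch of how to surface those agent timings, one option is a Spring Boot logging override in `clouddriver-local.yml`. The logger package below is an assumption on my part and may differ between clouddriver versions, so confirm it against the sources for your release:

```yaml
# clouddriver-local.yml — hypothetical sketch.
# Turn up logging for the cats caching framework so that per-agent
# execution times show up in the clouddriver logs.
# NOTE: the logger package name is an assumption; verify it against
# the clouddriver sources for your Spinnaker version.
logging:
  level:
    com.netflix.spinnaker.cats: DEBUG
```

With this in place you can grep the clouddriver logs for agent completion messages and compare their durations against the 10-minute expiry discussed above.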
Hi @ezimanyi , thanks a lot for the insights. We will look into it and report back.
Hi @ezimanyi , to provide more info: we have 25 Kubernetes accounts in Spinnaker, each with a considerable amount of resources to cache. Could that be the reason why caching is taking over 10 minutes? If so, is it possible for us to split/shard the workload between multiple replicas of clouddriver-caching? Thanks
@raghunandanbs : It of course depends on how big each cluster you're caching is, but that doesn't sound like an excessively large deployment to me; clouddriver should be able to keep up with this. Maybe you need to increase the memory/CPU resources available to clouddriver?
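If resource limits turn out to be the bottleneck, one place to raise them is Halyard's `customSizing` block. A minimal sketch, with purely illustrative values (not recommendations — size against your own metrics):

```yaml
# ~/.hal/config — hypothetical sketch of a custom sizing override
# that raises clouddriver's requests/limits. The numbers here are
# illustrative assumptions, not tuning advice.
deploymentEnvironment:
  customSizing:
    clouddriver:
      requests:
        memory: 4Gi
        cpu: "2"
      limits:
        memory: 8Gi
        cpu: "4"
```

After editing, `hal deploy apply` would roll the new sizing out to the cluster.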
Hi @ezimanyi , thank you.
I will update on the outcome.
@ezimanyi update: I made the changes mentioned above, and I see a slight improvement in the stability of the infrastructure listed. Are there any other steps I should take to get a stable view of clusters in the infrastructure tab? Thank you
@raghunandanbs : Thanks for the update! It's really hard to say exactly how to size/scale your cluster, to be honest; the real test is whether the caching is taking less than the 10 minutes noted above. One thing I will say is that I don't have a lot of experience running multiple clouddriver-caching replicas.
@raghunandanbs, it looks like you're heading in the right direction with resources.
Thanks a lot for your input @kskewes
Update: I've almost doubled the capacity of the caching instances, and infrastructure seems to load just fine. Here is the configuration for a Spinnaker installation that holds 21 Kubernetes accounts:
I have the same issue and it's very weird, even using Spinnaker 1.20.0 with only two Kubernetes accounts and an external Redis (Memorystore from GCP). Are there any updates?
This may not apply to everyone that's seen this issue, but for us, we're using a SQL DB as the backing store for clouddriver. When this was enabled, we found messages in the clouddriver logs showing the SQL Caching Cleanup Agent removing cache entries that belonged to caching agents it considered unknown.
One of the "unknown" caching agents in the list was for the account in which Spinnaker itself was deployed! So while the caching agents were running fast (less than 30 seconds) and fine, this cleanup agent was deleting valid entries, causing the temporary disappearance of the information from the infrastructure page. Disabling the SQL Caching Cleanup Agent fixed the issue. Obviously, we would like Spinnaker to automatically clean up cached entries for accounts that have been deleted, but not at the expense of a poor user experience. Perhaps this affects a wider audience than just those using a SQL database for clouddriver.
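For anyone wanting to try the same workaround, a hedged sketch of what disabling that agent might look like in `clouddriver-local.yml` follows. The property name is an assumption based on the agent's name and may differ between clouddriver versions, so verify it against your release before relying on it:

```yaml
# clouddriver-local.yml — hypothetical sketch.
# Disable the SQL cleanup agent that purges cache entries belonging
# to caching agents it does not recognize.
# NOTE: this property name is an assumption; confirm it in the
# clouddriver sources for your Spinnaker version.
sql:
  unknown-agent-cleanup-agent:
    enabled: false
```

The trade-off, as noted above, is that stale entries for genuinely deleted accounts will then linger until cleaned up by some other means.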
To anyone that stumbles upon this issue: we had the same exact problem, and after a few days of trying to figure it out, we were able to resolve it by increasing our resources.
EDIT: I wasn't sure how clouddriver would handle simply scaling its deployment up a few replicas, so I found a doc that recommends running clouddriver in High Availability mode. It seems that this is preferred over simply scaling the replica count.
Hi, we're facing this issue on a new Spinnaker deployment managing a few k8s clusters.
Would those 3 optimizations decrease the time clouddriver spends updating the cache? What we see right now is that the cycle to get the state is slow, but CPU is very low. Of course, at some point scaling will still be needed, but having some values tuned might reduce the number of replicas needed. Any feedback on those options would be welcome. Thanks.
@jfrabaute : The other two flags will decrease the work that the caching agent needs to do, though of course the benefit will depend on how much work it is currently doing on the omitted things.
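For reference, a hedged sketch of the kinds of per-account tuning options being discussed here, as they appear in a Kubernetes V2 account block. Which options exist varies by Spinnaker version, and the account name and values below are purely illustrative:

```yaml
# Sketch of a Kubernetes V2 account in clouddriver configuration.
# All values are illustrative assumptions, not recommendations.
kubernetes:
  accounts:
    - name: my-k8s-account        # illustrative account name
      cacheThreads: 4             # parallelism for this account's caching
      onlySpinnakerManaged: true  # only cache Spinnaker-managed resources
      omitKinds:                  # skip kinds you never deploy via Spinnaker
        - podPreset
        - controllerRevision
```

Narrowing what gets cached (via `onlySpinnakerManaged` or `omitKinds`) reduces the work per cycle, while `cacheThreads` trades CPU for shorter cycles.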
I would suggest joining the Spinnaker Slack.
Also, this made me realize this issue is still open, though my understanding is that the original problem has been solved; there is continued work on the performance of the caching agents that will keep helping here, but there's no outstanding work to be done directly on this issue.
I enabled debug logs, and the cycle finished in between 2 and 10 seconds every time. Here is an output example:
I don't exactly know what to look at next, and I'm also confused about some of the settings mentioned above. I've pinged the #sig-ops channel as well as some other channels, and I haven't gotten any answer so far. Sorry for commenting on this closed issue, but this is unfortunately blocking us from using Spinnaker for our services, and we've been stuck on this problem for several weeks now.
I have spent a considerable amount of time debugging this same issue. I have tried all the suggestions here, but still run into issues running with an ElastiCache Redis instance as the backend. I don't have that big of a deployment, and I even scaled it back to about 5 Kubernetes clusters with a few hundred pods. I am working on migrating from Redis to SQL to see if that resolves the issue. I have messed with cache threads, the number of clouddriver-caching pods, and throwing additional resources (both memory and CPU) at the problem. I have tweaked the -Xms and -Xmx Java options, and the redis.poll settings for clouddriver, all to no avail. This seems to still be a problem for others as well, and despite running nearly every microservice in debug, I have been unable to identify a cause.
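For anyone else experimenting with the `redis.poll` knobs mentioned above, a hedged sketch of what that tuning looks like in `clouddriver-local.yml`. The exact key names and defaults may differ by clouddriver version, so verify against your release:

```yaml
# clouddriver-local.yml — hypothetical sketch of agent-scheduling knobs.
# intervalSeconds: how often the scheduler polls for agents to run.
# timeoutSeconds: how long an agent may run before its lock is released.
# Values below are illustrative assumptions, not recommendations.
redis:
  poll:
    intervalSeconds: 30
    timeoutSeconds: 300
```

A too-short timeout can release an agent's lock mid-cycle, while a too-long interval can leave the cache stale relative to the 10-minute expiry discussed earlier.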
Same here! Is there any way to completely deactivate the k8s object caching in clouddriver?
At The Home Depot we rewrote the Kubernetes portion of Clouddriver in a manner that does not use constant poll-caching. You can read about it here: https://github.com/homedepot/go-clouddriver. We will be publishing an article on the Spinnaker blog soon about the performance benefits we've seen from this migration. A full technical document on installation is still a work in progress. |
Issue Summary: We are using distributed Spinnaker on Kubernetes for our deployments, and we are seeing an unstable Infrastructure page. The Kubernetes accounts are visible for some time, then become empty after a period of time, and then come back again.
Cloud Provider(s): Our Spinnaker is running in GKE, and we are managing Kubernetes clusters across all cloud providers (Manifest V2 accounts).
Environment: Distributed Spinnaker on Kubernetes, Spinnaker version 1.16.6
Description: Kubernetes accounts and infrastructure sometimes go missing from Spinnaker, then come back after some time. Screenshots are attached.
When accounts are not found
When accounts are visible
Can anyone please check this issue? Is it due to caching? We are using an external Redis server and have not enabled direct clouddriver caching, or is it some other issue?