
Kubernetes V2 account Infrastructure is not stable #5611

Closed
soorajpv opened this issue Mar 23, 2020 · 25 comments

Comments


soorajpv commented Mar 23, 2020

Issue Summary: We are using a Kubernetes-distributed Spinnaker for our deployments and are seeing an unstable Infrastructure page. The Kubernetes accounts are visible for a while, disappear after a period of time, and then come back again.

Cloud Provider(s): Our Spinnaker runs in GKE, and we manage Kubernetes clusters across all cloud providers (Kubernetes Manifest V2 accounts).

Environment: Distributed Spinnaker on Kubernetes, Spinnaker version 1.16.6

Description: Kubernetes accounts and infrastructure sometimes go missing from Spinnaker and then come back after some time. Screenshots are attached.

When accounts are not found: [Screenshot (7)]

When accounts are visible: [Screenshot (4)]

Can anyone please look into this issue? Is it due to caching? We are using an external Redis server and have not enabled direct Clouddriver caching. Or is it some other issue?

@rampatnaik

We also tried with Spinnaker version 1.17.6 (Distributed Spinnaker on Kubernetes).

Same issue. We are using an external GCP-based Redis.


corneredrat commented Mar 24, 2020

+1, we are observing this problem on and off: sometimes pod details appear under the Infrastructure tab, and when they are not listed there, they show up as standalone instances. We can still find them under Search.
We're running version 1.17.8.



JadCham commented Mar 26, 2020

We are also experiencing the same issue intermittently on Spinnaker 1.19.0. There are no errors in the Clouddriver logs at all, and monitoring looks OK as well.
All Kubernetes accounts are on AWS EKS 1.14.
We are using external Redis (AWS ElastiCache).

@ezimanyi ezimanyi self-assigned this Apr 1, 2020

ezimanyi commented Apr 2, 2020

This is very likely happening because caching cycles are not completing within the 10-minute expiration timeout for logical cache items (server groups, load balancers, etc.).

The way the Kubernetes V2 provider works is that it reads all of your deployed infrastructure, and then creates two types of cache entries:

  • Infrastructure entries that represent raw kubernetes objects in your cluster (services, replica sets, pods, etc.)
  • Logical entries that represent Spinnaker concepts. For example, each Service creates a load balancer entry, each ReplicaSet creates a cluster, and each Pod creates an instance.

When you load pages in the UI it's generally the logical entries that are driving that UI (though if you drill down enough they'll pull in some info from the raw infrastructure).

The logical entries expire automatically after 10 minutes, so if a caching cycle has not completed in the past 10 minutes, the infrastructure tab will become empty (or at least will lose items from the accounts that have not been cached in the past 10 minutes). The account filter will go away as well, as it is just based on what is on the infrastructure page: if there is no infrastructure, the account filter will also be empty. Some functionality (e.g., search) also looks up infrastructure entries directly, which is why you'll still see some evidence of the infrastructure in the search tab.

I believe the reason we expire the logical keys after 10 minutes was as a shortcut to avoid adding logic that ensures these entries are fully purged in all cases when the relevant items are deleted (though I'm not 100% sure of the motivation, as there's no context provided on spinnaker/clouddriver#2051). Potentially this expiry could be removed, but I'm a bit concerned about introducing other bugs and haven't analyzed the code deeply enough to know whether that's safe.

Ultimately the root cause here is that caching cycles are taking longer than 10 minutes, or are being scheduled less frequently than every 10 minutes. Have you set the redis.poll.intervalSeconds key in your clouddriver config to something long? Also, the initial description says that you have "not enabled direct Cloud driver caching"---does this mean you have caching entirely disabled? If that's the case, then I'd definitely expect the infrastructure tab to be unreliable, but maybe I'm misunderstanding.
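
For reference, a minimal sketch of where that key lives in a Clouddriver profile (e.g. a clouddriver-local.yml; the exact file depends on how you manage config, and 30 seconds is the default):

redis:
  poll:
    intervalSeconds: 30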

If you do have redis.poll.intervalSeconds set to something significantly less than 10 minutes (the default being 30 seconds), then perhaps this is a performance issue with caching? You can set the following in your clouddriver config:

logging:
  level:
    com.netflix.spinnaker.clouddriver.cache: DEBUG

to log how long each caching agent is taking, and check whether some are taking more than 10 minutes. There have also been some pretty significant improvements to caching performance in the last year or so, so please ensure you're on at least Spinnaker 1.18 to see if that helps as well.

@corneredrat

Hi @ezimanyi, thanks a lot for the insights; we will look into it and report back.

@corneredrat

Hi @ezimanyi, to provide more info: we have 25 Kubernetes accounts in Spinnaker, each with a considerable amount of resources to cache. Could that be the reason why caching is taking over 10 minutes?

If so, is it possible for us to split/shard the workload between multiple replicas of clouddriver-caching?

Thanks


ezimanyi commented Apr 3, 2020

@raghunandanbs: It of course depends on how big each cluster you're caching is, but that doesn't sound like an excessively large deployment to me; Clouddriver should be able to keep up with this. Maybe you need to increase the memory/CPU resources available to Clouddriver?


corneredrat commented Apr 6, 2020

Hi @ezimanyi , thank you.
I have taken the following steps:

  • increased the clouddriver-caching replicas to 4
  • set resource requests for clouddriver caching as follows:
    • cpu: 1000m
    • memory: 2Gi

I will update on the outcome.
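
For reference, a minimal sketch of how settings like these might be expressed via Halyard's customSizing block in the halconfig (the service key and exact field placement are assumptions; adjust for your setup):

deploymentConfigurations:
- name: default
  deploymentEnvironment:
    customSizing:
      spin-clouddriver-caching:
        replicas: 4
        requests:
          cpu: 1000m
          memory: 2Gi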

@corneredrat

@ezimanyi update: I made the changes mentioned above, and I see a slight improvement in the stability of the infrastructure listing. Are there any other steps I need to take to get a stable view of clusters in the Infrastructure tab?

Thank you


ezimanyi commented Apr 7, 2020

@raghunandanbs: Thanks for the update! It's really hard to say exactly how to size/scale your cluster, to be honest; the real test is whether caching is taking less than the 10 minutes noted above.

One thing I will say is that I don't have a lot of experience running multiple clouddriver-caching replicas. It is documented that this should work here, but I haven't personally done it. I might even suggest adding more resources for the caching replicas, as Clouddriver is pretty resource-intensive for large clusters. (I don't know the exact numbers; you might get some good suggestions from users in the #kubernetes channel on Spinnaker Slack, but I suspect most users with large deployments have at least twice those requests, if not more.)

@karlskewes

@raghunandanbs, looks like you're heading in the right direction with resources.
FWIW we're running double those cpu & memory requests with ~15 kubernetes accounts. Maybe check your Java memory settings as well - we use 80% of request (& limit) memory for Xmx.
We've also seen similar behaviour when one Kubernetes cluster API is non-responsive combined with a low number of clouddriver-caching replicas/caching threads.
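
A minimal sketch of one way to wire the Xmx setting, assuming Halyard service settings are used (the file path and env handling are assumptions; verify against your Halyard version):

# Halyard service settings override for clouddriver (path assumed):
# ~/.hal/default/service-settings/clouddriver.yml
env:
  JAVA_OPTS: "-Xms1638m -Xmx1638m"   # roughly 80% of a 2Gi request/limit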


corneredrat commented Apr 16, 2020

Thanks a lot for your inputs @kskewes

@corneredrat

Update: I've almost doubled the capacity of the caching instances, and the infrastructure seems to load just fine. Here is the configuration for a Spinnaker installation that holds 21 Kubernetes accounts:

      spin-clouddriver-caching:
        replicas: 8
        requests:
          cpu: 1500m
          memory: 3Gi

@nerddelphi

I have the same issue and it's very weird, even on Spinnaker 1.20.0 with only two Kubernetes accounts and an external Redis (Memorystore on GCP).

Are there any updates?


dmrogers7 commented May 14, 2020

This may not apply to everyone who has seen this issue, but in our case we're using a SQL DB as the backing store for Clouddriver, and the cause was the Clouddriver SQL caching cleanup agent.

When this was enabled, we found messages like these in the Clouddriver logs:

2020-05-14 10:30:28.045 INFO 1 --- [tionAction-7662] c.n.s.c.s.c.SqlUnknownAgentCleanupAgent : Found 49 records to cleanup from 'cats_v1_applications' for data type 'applications'. Reason: Data generated by unknown caching agents <list of caching agents>
2020-05-14 10:32:26.974 INFO 1 --- [tionAction-7313] c.n.s.c.s.c.SqlUnknownAgentCleanupAgent : Found 49 records to cleanup from 'cats_v1_applications' for data type 'applications'. Reason: Data generated by unknown caching agents <list of caching agents>
2020-05-14 10:34:27.675 INFO 1 --- [tionAction-4189] c.n.s.c.s.c.SqlUnknownAgentCleanupAgent : Found 50 records to cleanup from 'cats_v1_applications' for data type 'applications'. Reason: Data generated by unknown caching agents <list of caching agents>
2020-05-14 10:36:28.919 INFO 1 --- [tionAction-4246] c.n.s.c.s.c.SqlUnknownAgentCleanupAgent : Found 50 records to cleanup from 'cats_v1_applications' for data type 'applications'. Reason: Data generated by unknown caching agents <list of caching agents>
2020-05-14 10:38:28.443 INFO 1 --- [tionAction-6132] c.n.s.c.s.c.SqlUnknownAgentCleanupAgent : Found 49 records to cleanup from 'cats_v1_applications' for data type 'applications'. Reason: Data generated by unknown caching agents <list of caching agents>

One of the unknown caching agents in the list was for the account in which Spinnaker itself was deployed! So while the caching agents were running fast (less than 30 seconds) and fine, this cleanup agent was deleting valid entries, causing the temporary disappearance of the information from the infrastructure page.

Disabling the SQL Caching Cleanup Agent fixed the issue. Obviously, we would like to be able to have Spinnaker automatically clean up cached entries for accounts that have been deleted, but not at the expense of a poor user experience.

Perhaps this affects a wider audience than just those using a SQL database for clouddriver because the problem seems to be with getting the list of caching agents. We're running in Google and due to the warning in the Clouddriver SQL Setup Instructions, we have Agent Scheduling configured to use Redis.
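
For anyone who wants to try the same workaround, a sketch of the toggle in a Clouddriver profile; the property name is inferred from the agent's class name and may differ between Clouddriver versions, so treat it as an assumption:

# clouddriver-local.yml -- property name assumed; verify for your Clouddriver version
sql:
  unknown-agent-cleanup-agent:
    enabled: false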

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale


jdepp commented Jul 1, 2020

To anyone who stumbles upon this issue: we had the exact same problem, and after a few days of trying to figure it out, we were able to resolve it by increasing our spin-clouddriver Kubernetes deployment by a few replicas. Originally we deployed it with just one replica, which I believe was performing slowly due to being overutilized and wasn't able to complete its caching cycle within 10 minutes (@ezimanyi mentions the importance of this above).

EDIT: I wasn't sure how Clouddriver would handle simply scaling its deployment up by a few replicas, so I found this doc that recommends running Clouddriver in High Availability mode. It seems that this is preferred over simply scaling up the replicas of the spin-clouddriver deployment.
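
A sketch of the relevant halconfig section for enabling Clouddriver HA mode (structure assumed from the HA docs; verify against your Halyard version):

deploymentConfigurations:
- name: default
  deploymentEnvironment:
    haServices:
      clouddriver:
        enabled: true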

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@jfrabaute

Hi,

We're facing this issue on a new Spinnaker deployment managing a few Kubernetes clusters.
Increasing resources seems like an option, but I'm wondering what the impact of these few configs would be:

  • Increasing --cache-threads:
    --cache-threads: (Default: 1) Number of caching agents for this kubernetes account. Each agent handles a subset of the namespaces available to this account. By default, only 1 agent caches all kinds for all namespaces in the account.

  • Defining a list of omitNamespaces so Spinnaker does not monitor namespaces that have no Spinnaker-managed resources.

  • Defining a list of kinds so Spinnaker monitors only those kinds.

Would those 3 optimizations decrease the time spent by clouddriver to update the cache?

What we see right now is that the cycle to fetch state is slow while CPU usage is very low: the kubectl commands themselves are slow, so running all of them sequentially for every cluster, namespace, and kind takes a lot of time.
So adding replicas or more components might just waste CPU, as the current Clouddriver instance is not doing much so far.
Reducing the namespaces and kinds, as well as increasing cache-threads, should reduce the cycle time. Is that correct?
Any feedback on tuning the cache-threads value?

Of course, at some point, scaling will still be needed, but having some values tuned might reduce the number of replicas needed.

Any feedback would be welcome on those options.

Thanks.

@ezimanyi

@jfrabaute : The cacheThreads flag will only really help if you have a single account with many resources, as different accounts will automatically cache resources in parallel anyway. But if you do have a single account with a lot of namespaces, assigning more threads to that account will parallelize the work (up to the number of namespaces).

The other two flags will decrease the work that the caching agent needs to do, though of course the benefit will depend on how much work it is currently doing on the omitted things (see the config sketch after this list):

  • omitNamespaces will be most helpful if there are other namespaces with a lot of resources (if there are just empty namespaces you're omitting, it probably won't help much, but it won't hurt performance)
  • kinds/omitKinds: if there are kinds that you don't need to deploy with Spinnaker, then removing them will definitely reduce the amount of work (again, depending on the case), but you won't be able to deploy those kinds with Spinnaker.
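
A minimal sketch of how these options sit in a Kubernetes account definition in the Clouddriver/Halyard config (the account name and values are placeholders; exact field placement depends on how your config is generated):

kubernetes:
  accounts:
    - name: my-account        # hypothetical account name
      cacheThreads: 4         # parallelize caching across this account's namespaces
      omitNamespaces:
        - kube-system         # example: skip namespaces Spinnaker does not manage
      kinds:                  # cache (and deploy) only these kinds
        - deployment
        - replicaSet
        - service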

I would suggest joining the #sig-ops channel in Spinnaker slack or joining their meetings; the participants have successfully scaled Spinnaker to really large deployments and will definitely have good advice on how they tuned these parameters to get there.

@ezimanyi

Also, this made me realize this is still open though my understanding is that the original issue has been solved; there is continued work on the performance of the caching agents that will continue to help this, but there's no outstanding work to be done directly on this issue.

@jfrabaute

I enabled debug logs, and the cycle consistently finishes in between 2 and 10 seconds.
Still, I see "NO SERVER GROUPS FOUND IN THIS APPLICATION".

Here is an output example:

2020-08-25 23:40:03.371  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping artifact has 4 entries and 0 relationships
2020-08-25 23:40:03.371  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping pod has 1 entries and 1 relationships
2020-08-25 23:40:03.371  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping configMap has 1 entries and 0 relationships
2020-08-25 23:40:03.372  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping service has 2 entries and 4 relationships
2020-08-25 23:40:03.372  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping clusters has 3 entries and 6 relationships
2020-08-25 23:40:03.372  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping applications has 1 entries and 6 relationships
2020-08-25 23:40:03.372  INFO 1 --- [cutionAction-15] n.s.c.k.c.a.KubernetesCacheDataConverter : CLUSTERNAME/KubernetesCoreCachingAgent[1/1]: grouping deployment has 1 entries and 2 relationships

I don't know exactly what "grouping pod has 1 entries and 1 relationships" means, but it seems legit: I deployed one app, which has one pod.
But I don't see anything in the UI, so I'm now wondering if there might be a problem with the logical grouping/caching.
From Search I can see the pods, the replica sets, etc.
It's really the CLUSTERS panel that is empty.
It does not seem to be a caching problem, as the cycle completes in a few seconds.
I've checked Redis and there are entries in db 0. I don't really know how to make sense of those, though; I'll probably need to read the code here.

I'm also confused about the "not enabled direct Cloud driver caching" mentioned earlier in the thread. Does Clouddriver cache by default (I have it deployed in non-HA mode), or not?
Does it need to run in HA mode to work correctly?

I've pinged the #sig-ops channel as well as some other channels, and I didn't get any answer so far.

Sorry for commenting on this closed issue, but this is unfortunately blocking us from using Spinnaker for our services, and we've been stuck on this problem for several weeks now.

@mtaylor98

I have spent a considerable amount of time debugging this same issue. I have tried all the suggestions here, but still run into issues running with an ElastiCache Redis instance as the backend. I don't have that big of a deployment, and I even scaled it back to about 5 Kubernetes clusters with a few hundred pods. I am working on migrating from Redis to SQL to see if that resolves the issue. I have played with cache threads and the number of clouddriver-caching pods, and thrown additional resources (both memory and CPU) at the problem. I have tweaked the Xms and Xmx Java options and the redis.poll settings for Clouddriver, all to no avail. This still seems to be a problem for others as well, and despite running nearly every microservice in debug, I have been unable to identify a cause.


jfachal commented Mar 31, 2022

Same here! Is there any way to completely deactivate the caching of Kubernetes objects in Clouddriver?

@billiford

At The Home Depot we rewrote the Kubernetes portion of Clouddriver in a manner that does not use constant poll-caching. You can read about it here: https://github.com/homedepot/go-clouddriver.

We will be publishing an article on the Spinnaker blog soon about the performance benefits we've seen from this migration. A full technical document on installation is still a work in progress.
