Kiam Causing Latency Issues on Microservices #191
We ran into some issues with the Kiam Server (or Agent; it is currently unknown which) after significant uptime that led to issues with response times across the board. The Kiam Server needed a configuration change to set
We run a number of unique microservices each with their own IAM roles, we do a handful of deploys every day for each of these services as well.
We started seeing significant increases in latency as throughput increased. Initially these were attributed to network latency, then to CPU latency and kubelet scheduling for the shared pool. We were able to eliminate 60-70% of the response time on one service by increasing
While we were preparing to run some tests to further characterize this problem, the Kiam server went through the changes described above, which resolved the problems we were experiencing.
I will begin by showcasing the latency we were seeing, then the data that supports the memory issue.
This is one example of the change we saw in the latency after restarting the Kiam Servers and applying the configuration change.
As you can see, there is an approximately 99.9% decrease in response time, in addition to a 300-400% increase in the throughput of the service. Which is awesome!
So, I started diving into some of the metrics that the kiam servers, agents and clients report.
The first shows two particular metrics
The next shows the reported RSS (physical memory allocated) of the containers for Kiam Server and Kiam Agent over the course of the last 6 weeks, averaged across all instances in the Kubernetes cluster.
In case it is relevant, here is the number of instances running of each across the cluster over the last 3 months.
We saw a similar drop in both CPU utilization and Network Utilization on the EC2 Nodes hosting the Kiam Server Containers. I can pull those metrics if need be.
If you need me to run any tests or anything, let me know and I can coordinate those. It appears to be related to uptime, so there may need to be a simulation of load that we can run very quickly to help trigger the problem.
Based on the metrics I was seeing and my brief look at the code, I think there may be channels that are not being closed, and those channels are making more and more requests to the system.
Or the server eventually disregards the refresh interval and attempts refreshes far more often than it needs to.
I've taken a screenshot of our servers (there's a large drop when we deployed the new version)
Unfortunately we don't have the retention to look back much farther in time. It's pretty hard to pick out an obvious trend. The new version definitely shows a slow increase but it's not clear whether that's just as it settles to the load on our clusters or a gradual leak.
My suspicion is there probably is something: I'd suggest we add pprof to try to capture some dumps from running servers over longer periods of time and see if anything shows. At the same time we could add leaktest (#125) to some of the tests we do have and see if that catches any obvious goroutine leaks.
I've opened #192 which adds leaktest back. Most handlers and Kiam code under test don't reveal any goroutine leaks. There is a leak inside the cache library, but it involves only a single routine and shouldn't be related.
I think adding pprof is probably the fastest way to try to identify the cause, unless others have any suggestions/recommendations?
I found https://medium.com/dm03514-tech-blog/sre-debugging-simple-memory-leaks-in-go-e0a9e6d63d4d which has some nice tips.
I think the behaviour in #147 was associated with a poorly implemented Prometheus adapter. In v3 we switched to using the native Prometheus client library and it's resolved the issue (and you can achieve the same by not enabling Prometheus exposition in v2).
The cache library leaktest failure is:
```
--- FAIL: TestRequestsCredentialsFromGatewayWithEmptyCache (5.03s)
    <autogenerated>:1: leaktest: timed out checking goroutines
    <autogenerated>:1: leaktest: leaked goroutine: goroutine 43 [select]:
    github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache.(*janitor).Run(0xc0001625d0, 0xc00016e6c0)
        /Users/pingles/gopath/src/github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache/cache.go:1079 +0x151
    created by github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache.runJanitor
        /Users/pingles/gopath/src/github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache/cache.go:1099 +0x101
```
From looking through the code I don't think it's an issue: the test just can't cleanly shut down the routines the cache creates. In reality, though, there are only ever two instances of that cache, so it's unlikely to be creating lots of routines.
We're running v3 in our production clusters and will keep a watch on memory usage to see if we observe the same behaviour you've reported against 2.7. In particular, I think looking at the results of the following query:
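(The query itself wasn't captured in this extract. Assuming the standard Prometheus Go client is in play, the goroutine-count check would look something like the following; the metric name is the client library's standard `go_goroutines`, but the `job` label is a hypothetical placeholder that depends on local scrape configuration.)

```
# hypothetical reconstruction — label selector depends on your scrape config
go_goroutines{job="kiam-server"}
```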
If that's incrementing it would suggest that we're leaking. Admittedly we only have data for a few days but the kiam instances we operate don't show an increasing line:
I've got a patch that adds pprof (#193); hopefully that'll help get more detail if there is a memory/routine leak.
In case you were waiting for the v3.0 release, we've just tagged it. Upgrade notes are available at https://github.com/uswitch/kiam/blob/master/docs/UPGRADING.md.
I am pretty interested in a recent change we saw.
I think the memory leak, shown in the initial graph, and the CPU "leak" shown in this image are related to prometheus as described by #147 (comment). And so those, I believe, would be fixed by the upgrade to v3.0.
However, I am worried about a change we made. We converted the refresh-interval from 60m down to 15m to determine whether it really was a service bounce or the configuration change that caused the issue. We did this in a test environment and saw the following results:
As you can see from the image, there is a very large increase in the number of client getpodrole and server credentialscache_cachehit metrics, disproportionately so to the configuration change. I would have expected an approximate ~4x increase, but this appears to be in the ~2000x range.
I believe this is causing the issue, and given the very low number of pods running in our test environment, I think this gets exacerbated in large Kubernetes clusters.
Here is a more zoomed in version of the ramp we saw after the configuration change
Thanks for the information. Could you give us some numbers around the number of pods, roles, agents and servers you have please? Anything else about your setup may also be useful (CNI, networking setup etc). We'll try and take a look against our production data and see how it compares.
(1) Number of Pods Running for Kiam Server and Kiam Agent
Blue Line: Kiam Server, Purple Line: Kiam Agent. It hovers mostly around 3 and 20 respectively.
(2) Total Number of Pods Running
A majority of these are the
(3) Number of IAM Roles
No graph; 146 currently. This includes production-specific roles and some default AWS roles, so it's hard to narrow down. I can try to get a more precise number here, but assume at least ~57 are being used (1 per service).
(4) Networking Setup
We use calico for the networking. Pulled from the helm charts:
So Calico and Cross-Subnet
Things you didn't ask for that I thought might be helpful:
Note all these are from the same timeframe.
(1) Base Graph:
Kiam Client Get Pod Role Purple versus Kiam Server Credentials Cache Cache Hit Blue
(2) Number of Server Get Pod Roles
I wanted to bring this one in to show that the two metrics match up (client and server). Also, I was wondering if this
Kiam Client Is Allowed Assume Role Purple versus Kiam Server Get Pod Role Blue
(3) Network Traffic for Kiam Server
This is a bit obvious: more requests, more network traffic. I just wanted to demonstrate again that it lines up.
Docker Network Received Bytes Purple versus Docker Network Sent Bytes Blue
(5) Memory Usage
This demonstrates the memory usage, averaged across all containers for the instance type. You can see that Kiam Server on average goes down (potentially due to the prometheus issue and a restart), but Kiam Agent climbs with the increasing requests, which is strange, and the memory usage on the server doesn't appear to make the same erratic jumps.
Kiam Server Memory Usage Red versus Kiam Client Memory Usage Orange
(6) CPU Usage Breakdown Kiam Agent
This one is a little strange. On Kiam in particular, I see that large swaths of CPU time are not consumed by system or user. I know there are things outside those two groups that can consume CPU time, but I would think they should be limited. And given my knowledge of GC on the JVM, anytime I see
User Time Purple, System Time Red, Total Usage Blue
(7) Kiam Server CPU Usage
We see spikes here that correlate strongly with the spikes in number of requests. However, the same strange
User Time Blue Area, System Time Purple, Total Usage Blue Line
(8) EC2 Network Utilization
This one I can't quite seem to correlate strongly, and it makes almost no sense. This is not just across the test environment, but across all environments, grouped by availability zone.
Zone A Light Blue, Zone B Dark Blue, Zone D Purple
@msleevi thanks for continuing to dig into the data you've collected. I know you mentioned a few more bits on Slack, have you managed to form any new hypotheses?
We recently discovered this issue on our own, and after tracing some code in the AWS Java SDK, we found that the SDK refreshes instance profile tokens once they are within 15 minutes of expiration. KIAM creates tokens with 15 minutes of expiration. Therefore, every request triggers a token refresh.
Here are the relevant constants in the SDK:
To get around this, we've added these arguments to KIAM, and it seems to be fixing the issue.
I've filed an issue with the AWS Java SDK, which can be found here: