Kiam Causing Latency Issues on Microservices #191

Open
msleevi opened this Issue Nov 29, 2018 · 16 comments


msleevi commented Nov 29, 2018

Issue Description

We ran into some issues with the Kiam Server (or Agent; it is currently unknown which) after a significant uptime that led to issues with response times across the board. The Kiam Server needed a configuration change to set session-duration from 15m -> 60m, and as a result the Kiam Server needed a restart. After the restart, we saw a significant drop in response time (80 - 99% across the board). I believe this is due to a memory leak that I will demonstrate momentarily.

Version

v2.7

Background

We run a number of unique microservices, each with its own IAM role, and we do a handful of deploys every day for each of these services as well.

We started seeing significant increases in latency as throughput increased. Initially these were attributed to network latency, then CPU latency and kubelet scheduling for the shared pool. We were able to eliminate 60 - 70% of the response time on one service by increasing cpu.requests and scaling the pods horizontally. However, we still didn't understand the last 100 ms of latency on several services, and at times 2s of latency on other services.

While we were preparing to run some tests to further characterize this problem, the Kiam Server went through the changes described above, which resolved the problems we were experiencing.

I will begin by showcasing the latency we were seeing, then the data that supports the memory issue.

Services Latency

This is one example of the change we saw in the latency after restarting the Kiam Servers and applying the configuration change.

[screenshot: service response time before and after the Kiam Server restart]

As you can see, there is an approximately 99.9% decrease in response time, along with a 300 - 400% increase in the throughput of the service. Which is awesome!

Data

So, I started diving into some of the metrics that the kiam servers, agents and clients report.

The first shows two particular metrics: getpodrole on the Server and isallowedassumerole on the Client. You can see a significant drop at the upgrade time, similar to the latency cliff we saw.

[screenshot: getpodrole (Server) and isallowedassumerole (Client) metrics over time]

The next shows the reported RSS (physical memory allocated) of the containers for Kiam Server and Kiam Agent over the course of the last 6 weeks. It is averaged across all instances of each in the Kubernetes cluster.

[screenshot: average RSS of Kiam Server and Kiam Agent containers over the last 6 weeks]

In case it is relevant, here is the number of instances of each running across the cluster over the last 3 months.

[screenshot: number of Kiam Server and Kiam Agent instances over the last 3 months]

We saw a similar drop in both CPU utilization and Network Utilization on the EC2 Nodes hosting the Kiam Server Containers. I can pull those metrics if need be.

If you need me to run any tests or anything, let me know and I can coordinate those. It appears to be related to uptime, so there may need to be a load simulation we can run very quickly to help provoke the problem.

Summary

Based on the metrics I was seeing and my brief look at the code, I think there may be channels that are not being closed, and those channels are making more and more requests to the system.

Or the server eventually disregards the refresh interval and attempts refreshes far more often than it needs to.
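
For illustration, here is a minimal Go sketch (hypothetical code, not Kiam's actual implementation) of the kind of leak I mean: a refresher goroutine whose stop channel is never closed keeps ticking and issuing requests for as long as the process lives.

    package main

    import "time"

    // startRefresher is a hypothetical credential refresher: it issues a refresh
    // on every tick until stop is closed. If a caller discards the returned
    // channel (or forgets to close it), the goroutine and its ticker live
    // forever and keep adding requests to the system.
    func startRefresher(refresh func(), interval time.Duration) (stop chan struct{}) {
        stop = make(chan struct{})
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for {
                select {
                case <-ticker.C:
                    refresh()
                case <-stop:
                    return
                }
            }
        }()
        return stop
    }

    func main() {
        // Leaky usage: the stop channel is dropped, so this refresher can never
        // be told to exit; each leaked copy adds load that grows with uptime.
        _ = startRefresher(func() { /* request credentials here */ }, time.Minute)
        time.Sleep(5 * time.Minute)
    }

If something like this accumulated per pod or per role, the request rate would climb with uptime in the way the graphs above suggest.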

msleevi changed the title from "Kiam Memory Issues" to "Kiam Causing Latency Issues on Microservices" on Nov 29, 2018

pingles commented Nov 29, 2018

Yeah- having #108 in place would make this much easier! #125 may also help identify stuff.

I've taken a screenshot of our servers (there's a large drop when we deployed the new version)

[screenshot: kiam-server memory usage, with a large drop at the new-version deploy]

Unfortunately we don't have the retention to look back much farther in time. It's pretty hard to pick out an obvious trend. The new version definitely shows a slow increase but it's not clear whether that's just as it settles to the load on our clusters or a gradual leak.

My suspicion is there probably is something - I'd suggest we add pprof to try and capture some dumps from running servers over longer periods of time and see if anything shows. At the same time we could add leaktest (#125) to some of the tests we do have and see if that catches any obvious goroutine leaks.
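
For reference, exposing pprof from a Go process only needs the standard library; a minimal sketch (the listen address is arbitrary, and this isn't necessarily how Kiam will wire it up):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // Serve the pprof endpoints on a side port so heap and goroutine
        // profiles can be captured from a long-running server, e.g.
        //   go tool pprof http://localhost:6060/debug/pprof/heap
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }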

pingles commented Nov 29, 2018

I've opened #192 which adds leaktest back. Most handlers and Kiam code under test don't reveal any routine leaks. There is a leak inside the cache library, but it's related to a single routine and shouldn't be the cause of this issue.

I think adding pprof is probably the fastest way to try to identify the cause, unless others have any suggestions/recommendations?
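
For anyone following along, leaktest usage is just a deferred check at the top of a test; a sketch, with a hypothetical body standing in for whatever handler or cache is under test:

    package handlers_test

    import (
        "testing"
        "time"

        "github.com/fortytw2/leaktest"
    )

    func TestNoGoroutineLeak(t *testing.T) {
        // Snapshot the running goroutines; when the deferred func runs at the
        // end of the test it fails if extra goroutines are still alive after
        // the timeout.
        defer leaktest.CheckTimeout(t, 5*time.Second)()

        // ... exercise the handler or cache under test here ...
    }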

pingles commented Nov 29, 2018

msleevi commented Nov 29, 2018

@pingles Could the leak inside the cache library cause the timeout to be ignored potentially?

msleevi commented Nov 29, 2018

Is this related: #147 (comment)?

pingles referenced this issue on Nov 29, 2018: pprof HTTP #193 (merged)

pingles commented Nov 29, 2018

I think the behaviour in #147 was associated with a poorly implemented Prometheus adapter. In v3 we switched to using the native Prometheus client library and that resolved the issue (you can achieve the same by not enabling Prometheus exposition in v2).

The cache library leaktest failure is:

--- FAIL: TestRequestsCredentialsFromGatewayWithEmptyCache (5.03s)
    <autogenerated>:1: leaktest: timed out checking goroutines
    <autogenerated>:1: leaktest: leaked goroutine: goroutine 43 [select]:
        github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache.(*janitor).Run(0xc0001625d0, 0xc00016e6c0)
        	/Users/pingles/gopath/src/github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache/cache.go:1079 +0x151
        created by github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache.runJanitor
        	/Users/pingles/gopath/src/github.com/uswitch/kiam/vendor/github.com/patrickmn/go-cache/cache.go:1099 +0x101

From looking through the code I don't think it's an issue: the test simply can't cleanly shut down the routines the cache creates. In reality, though, there are only ever two instances of that cache, so it's unlikely to be creating lots of routines.
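
For context on where that janitor routine comes from, here's a sketch of how go-cache is typically constructed (the intervals are illustrative): a non-zero cleanup interval makes cache.New start a background janitor goroutine, which is the (*janitor).Run frame in the leaktest output above and only goes away when the cache itself is garbage collected.

    package main

    import (
        "time"

        "github.com/patrickmn/go-cache"
    )

    func main() {
        // The second argument is the cleanup interval; a non-zero value spawns
        // the janitor goroutine that ticks for the lifetime of the cache, so a
        // test that constructs a cache and returns will report it as leaked.
        c := cache.New(5*time.Minute, 10*time.Minute)
        c.Set("credentials", "value", cache.DefaultExpiration)
    }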

We're running v3 in our production clusters and will keep a watch on memory usage to see if we observe the same behaviour you've reported against 2.7. In particular, I think it's worth looking at the results of the following query:

avg(go_memstats_mallocs_total{app="kiam-server"} - go_memstats_frees_total{app="kiam-server"}) by (cluster)

If that's incrementing, it would suggest that we're leaking. Admittedly we only have data for a few days, but the kiam instances we operate don't show an increasing line:

[screenshot: live heap objects (mallocs minus frees) for the kiam-server instances, not increasing]

I've got a patch that adds pprof (#193) - hopefully that'll help get more detail if there is a memory/routine leak.
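
As an aside, the same live-object signal that query looks at (mallocs minus frees) can also be sampled in-process with the Go runtime; a rough sketch:

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        // Print the number of live heap objects (Mallocs - Frees, which is what
        // the Prometheus query above averages) once a minute; a value that only
        // ever climbs under steady load suggests a leak.
        for {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            fmt.Printf("live objects: %d\n", m.Mallocs-m.Frees)
            time.Sleep(time.Minute)
        }
    }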

pingles commented Nov 29, 2018

This is what it looks like for the most recent 24h period.

[screenshot: the same query over the most recent 24h period]

pingles commented Dec 6, 2018

In case you were waiting for the v3.0 release, we've just tagged it. Upgrade notes are available at https://github.com/uswitch/kiam/blob/master/docs/UPGRADING.md.

msleevi commented Dec 7, 2018

I am pretty interested in a recent change we saw.

I think the memory leak shown in the initial graph and the CPU "leak" shown in this image are related to Prometheus, as described by #147 (comment), so I believe those would be fixed by the upgrade to v3.0.

[screenshot: Kiam CPU usage over time, showing the CPU "leak"]

However, I am worried about a change we made. We changed the refresh-interval from 60m down to 15m to determine whether it really was the service bounce or the configuration change that was responsible. We did this in a test environment and saw the following results:

[screenshot: client getpodrole and server credentialscache_cachehit metrics after the refresh-interval change in the test environment]

As you can see from the image, there is a very large increase in the number of client getpodrole and server credentialscache_cachehit metrics, disproportionate to the configuration change. I would have expected an approximately 4x increase, but this appears to be in the ~2000x range.

I believe this is causing the issue, and given the very low number of pods running in our test environment, I think this gets exacerbated in large Kubernetes clusters.

Here is a more zoomed-in version of the ramp we saw after the configuration change:

[screenshot: zoomed-in view of the metric ramp after the configuration change]

pingles commented Dec 7, 2018

msleevi commented Dec 7, 2018

(1) Number of Pods Running for Kiam Server and Kiam Agent

Blue line: Kiam Server, purple line: Kiam Agent. They hover mostly around 3 and 20 respectively.

[screenshot]

(2) Total Number of Pods Running

A majority of these are pause-amd64 containers; by my estimate there are ~57 actual pods for the services running at the time presented.

[screenshot: total number of pods running]

(3) Number of IAM Roles

No graph; 146 currently. This includes production-specific roles and some default AWS roles, so it's hard to narrow down. I can try to get a more precise number here, but you can assume at least ~57 are being used (1 per service).

(4) Networking Setup

We use Calico for the networking. Pulled from the Helm charts:

networking:
    calico:
      crossSubnet: true

So Calico and Cross-Subnet

msleevi commented Dec 7, 2018

Things you didn't ask for that I thought might be helpful:

Note all these are from the same timeframe.

(1) Base Graph:

Kiam Client Get Pod Role (purple) versus Kiam Server Credentials Cache Cache Hit (blue)

[screenshot]

(2) Number of Server Get Pod Roles

I wanted to bring this one in to show that the two metrics (client and server) match up. Also, I was wondering whether isallowedassumerole on the Client(s) carries any significance.

Kiam Client Is Allowed Assume Role (purple) versus Kiam Server Get Pod Role (blue)

[screenshot]

(3) Network Traffic for Kiam Server

This is a bit obvious: more requests, more network traffic. I just wanted to demonstrate again that it lines up.

Docker Network Received Bytes (purple) versus Docker Network Sent Bytes (blue)

[screenshot]

(5) Memory Usage

This demonstrates the memory usage, averaged across all containers for each instance type. You can see that Kiam Server memory on average goes down (potentially due to the Prometheus issue and a restart), but Kiam Agent memory climbs with the increasing requests, which is strange; the memory usage on the server doesn't appear to make the same erratic jumps.

Kiam Server Memory Usage (red) versus Kiam Client Memory Usage (orange)

[screenshot]

(6) CPU Usage Breakdown Kiam Agent

This one is a little strange. On Kiam in particular I see that large swaths of CPU time are not consumed by system or user. I know there are things outside those two groups that can consume CPU time, but I would think they should be limited. Given my knowledge of GC on the JVM, anytime I see real > sys + user it indicates a constrained resource, and that appears to be consistent here.

User Time (purple), System Time (red), Total Usage (blue)

[screenshot]

(7) Kiam Server CPU Usage

We see spikes here that correlate strongly with the spikes in the number of requests. However, there is the same strange real > sys + user pattern. This is persistent across all Kiam instances everywhere.

User Time (blue area), System Time (purple), Total Usage (blue line)

[screenshot]

(8) EC2 Network Utilization

This one I can't quite correlate strongly, and it makes almost no sense. This is not across the test environment, but across all environments. It is grouped by availability zone.

Zone A (light blue), Zone B (dark blue), Zone D (purple)

[screenshot]

pingles commented Dec 13, 2018

@msleevi thanks for continuing to dig into the data you've collected. I know you mentioned a few more bits on Slack, have you managed to form any new hypotheses?

v3.0 is released which includes the optional pprof HTTP endpoint to aid sampling for both agents and server processes: that may help diagnose stuff further if you've not managed to find any new leads.

msleevi commented Dec 13, 2018

@pingles I have a handful of tests I need to run and some insights about the AWS SDK to gather, but if none of that pans out I will try v3.0.

msleevi commented Dec 19, 2018

@pingles No dice on the tests. It's going to be a while before Kiam gets upgraded though, so I am sitting in a holding pattern for now.

jutley commented Jan 23, 2019

We recently discovered this issue on our own, and after tracing some code in the AWS Java SDK, we found that the SDK refreshes instance profile tokens once they are within 15 minutes of expiration. KIAM creates tokens with 15 minutes of expiration. Therefore, every request triggers a token refresh.

Here are the relevant constants in the SDK:
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/auth/EC2CredentialsFetcher.java#L43-L53

In order to get around this issue, we've added these arguments to KIAM. This seems to fix the issue:

--session-duration=60m --session-refresh=16m
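
To make the interaction concrete, here is a minimal Go sketch of the refresh decision described above (the threshold constant and helper are illustrative, not the SDK's actual identifiers): with a 15-minute session and a roughly 15-minute client-side refresh threshold, every freshly issued credential is already inside the refresh window, whereas a 60-minute session stays well outside it.

    package main

    import (
        "fmt"
        "time"
    )

    // refreshThreshold mirrors the behaviour described above: the AWS Java SDK
    // refreshes instance-profile credentials once they are within ~15 minutes
    // of expiry. The name and value here are illustrative.
    const refreshThreshold = 15 * time.Minute

    // needsRefresh reports whether a credential expiring at expiry would be
    // refreshed by a client applying that threshold at time now.
    func needsRefresh(now, expiry time.Time) bool {
        return expiry.Sub(now) <= refreshThreshold
    }

    func main() {
        now := time.Now()

        // Kiam's default 15m session: fresh credentials are already inside the
        // refresh window, so every request triggers a refresh.
        fmt.Println(needsRefresh(now, now.Add(15*time.Minute))) // true

        // With --session-duration=60m (and --session-refresh=16m on the server
        // side), fresh credentials sit well outside the window and are reused.
        fmt.Println(needsRefresh(now, now.Add(60*time.Minute))) // false
    }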

I've filed an issue with the AWS Java SDK, which can be found here:
aws/aws-sdk-java#1893
