
Increase of resource utilization by hour #147

Closed
caiohaz opened this Issue Aug 22, 2018 · 28 comments

caiohaz commented Aug 22, 2018

I'm using kiam 2.8 with these resource limits:

agent:
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi

server:
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi

But I'm seeing resource utilization increase hour by hour; I've attached the chart below:

[screenshot: cpu-usage]

Is this normal, or do I have a problem somewhere?

Regards

camilb commented Aug 22, 2018

ref #72

pingles commented Aug 28, 2018

CPU activity would increase as the number of cached credentials increases, as pods start/stop, etc., so it'd be useful to include plots that also show general cluster activity.

caiohaz commented Aug 28, 2018

Hmm, that makes sense. Is it possible to limit this cache?

pingles commented Aug 28, 2018

Well, credentials are cached and stored for any roles that are annotated. CPU activity could also be higher if the server is having to retry lots of operations (e.g. you have annotations for roles that don't exist, or the IAM policies prevent the server from assuming them).
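(For readers hitting this later, a minimal sketch of how roles are requested with kiam annotations; the names and image below are illustrative, only the iam.amazonaws.com/* annotation keys come from the kiam docs:)

# Namespaces must whitelist which roles their pods may assume (a regex).
apiVersion: v1
kind: Namespace
metadata:
  name: example                                 # illustrative
  annotations:
    iam.amazonaws.com/permitted: ".*"
---
# A pod requests a role; the kiam server then caches credentials for it.
apiVersion: v1
kind: Pod
metadata:
  name: example-app                             # illustrative
  annotations:
    iam.amazonaws.com/role: reporting-reader    # illustrative role name
spec:
  containers:
  - name: app
    image: example/app:latest                   # illustrative image

If an annotated role doesn't exist or can't be assumed, the server keeps retrying, which is the extra CPU being described here.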

caiohaz commented Aug 28, 2018

Right, so in a "normal" environment this problem doesn't happen, correct? And if I do have that problem (e.g. annotations for roles that don't exist, or IAM policies that prevent the server from assuming them), can I see it in some log?

pingles commented Aug 28, 2018

The screenshot below is from one of our production clusters and tracks CPU millicores for the server process across a month: there are spikes but it doesn't increase forever.

[screenshot: 2018-08-28 at 15 49 10]

There may be a bug but without knowing what kind of activity you have elsewhere on your cluster (if any?) it's difficult to infer much just from the graph.

caiohaz commented Aug 28, 2018

Got it. What data can I send you so you can analyze this further?

caiohaz commented Aug 28, 2018

@pingles I attached the server and agent logs (logs.zip).

pingles commented Aug 29, 2018

@XDexter thanks.

Looking at the agent logs, it's not dealing with any metadata requests; the only paths are for /ping, which is used by the healthcheck.

The server logs all look reasonable too: it's requesting credentials for 4 different roles, using the default 15 minute expiry, so it's renewing every 5 minutes. Are you seeing CPU usage continue to climb?
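(A sketch of where that expiry comes from, for reference; treat the flag name and default as assumptions based on the kiam docs rather than something confirmed in this thread:)

# Illustrative kiam server container snippet. The credential lifetime is
# assumed to be controlled by --session-duration (default believed to be
# 15m); shorter lifetimes mean more frequent refreshes and more CPU.
containers:
- name: kiam-server
  image: quay.io/uswitch/kiam:v2.8      # illustrative tag; entrypoint omitted
  args:
  - --session-duration=15m              # assumed flag name, per the kiam docs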

caiohaz commented Aug 29, 2018

Hi @pingles

I attached another agent log file, this time with metadata requests (I think the node I got the previous logs from doesn't have any pods with the annotation). (agent2.log.zip)

I can see that some pods are restarting:

NAME                READY   STATUS    RESTARTS   AGE
kiam-agent-2b8dq    1/1     Running   6          2d
kiam-agent-xhm8p    1/1     Running   1          5d
kiam-server-cpjvd   1/1     Running   17         7d

The agent "kiam-agent-2b8dq" generated the logs that i attached now, and the agent "kiam-agent-xhm8p" generated the logs that i attached yesterday.

Yes, the CPU usage continues climbing:

[screenshot: screen2]

I'm using Helm to deploy kiam and I attached the Helm output with all the information about the deployed services. (helm.txt.zip)

Thank you for your attention.

Regards

pingles commented Aug 29, 2018

Interesting. Could you include the precise Prometheus metrics you're using in the Grafana plot, please? It'd also be useful to see the logs for the agent whose CPU continually climbs, and to know what's different about the nodes the two agents are running on (one had no activity while the other did?).

caiohaz commented Aug 29, 2018

Sure, I'm using this metric:

sum(irate(container_cpu_usage_seconds_total{pod_name=~"$pod",image!="",name=~"^k8s_.*",container_name!="POD",kubernetes_io_hostname=~"^$node$"}[1m])) by (pod_name, container_name)

Both agents have the CPU usage climbing and, yes, the difference is that one has activity (annotated pods) while the other host doesn't have any annotated pods.

Regards

pingles commented Aug 29, 2018

Interesting! I'll see if I can get a chance to debug further. @uswitch/cloud, any other suggestions?

caiohaz commented Aug 31, 2018

@pingles is there anything else I can do to help you solve this problem?

roffe commented Aug 31, 2018

@XDexter have you tried the latest kiam from HEAD?

I can see in my metrics that I had similar behaviour until I built and deployed the latest 3.0 source.

[screenshot: 2018-08-31 at 15 11 50]

roffe commented Aug 31, 2018

sorry for the bad screenshot

[screenshot: 2018-08-31 at 15 13 33]

caiohaz commented Aug 31, 2018

@roffe but the 3.0 version is an RC and not stable, right?

roffe commented Aug 31, 2018

@XDexter half of the stuff you run in Kubernetes is alpha, beta or RC; I have several components I build from HEAD in my cluster. Just be sure to test them in staging before they go to prod. In many cases this is the only way to get bug fixes fast enough without having to wait half a year. I know where you are coming from; I also had a hard time getting used to running beta or RC software in my prod env.

pingles commented Aug 31, 2018

I'm hoping 3.0-rc will be put into our clusters soon (the team has been working on other cluster upgrades that were more urgent). It's a release candidate: it works*, the tests pass and all the features are there. I don't like to mark it as released until we've been running it in all our clusters for a little while :) I also wanted to make sure we had upgrade notes in place (there were some TLS changes, in particular), so I didn't want to tag a final release until we had all that written up as well.

* as far as manual testing and the automated tests demonstrate

caiohaz commented Aug 31, 2018

@roffe I got it, no problem; I will clone the Helm chart locally and make the changes for the 3.0 version. @pingles I'm trying to start the 3.0 version but I get an error related to your comment:

{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2018-08-31T13:56:25Z"}

Do I need to change something?

Regards

pingles commented Aug 31, 2018

It's probably related to the TLS changes: https://github.com/uswitch/kiam/blob/master/docs/TLS.md#breaking-change-in-v30

#115 has some info on how to get better information from the underlying gRPC lib about what's going on when it tries to initiate a connection.
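(A minimal sketch of setting those gRPC env vars on the agent, assuming the standard grpc-go variables referenced in #115; the image tag and container layout are illustrative:)

# Illustrative kiam agent container snippet: raise grpc-go logging so the
# dial failure is visible in the agent logs.
containers:
- name: kiam-agent
  image: quay.io/uswitch/kiam:v3.0-rc1            # illustrative tag
  env:
  - name: GRPC_GO_LOG_SEVERITY_LEVEL              # grpc-go log severity
    value: "info"
  - name: GRPC_GO_LOG_VERBOSITY_LEVEL             # grpc-go log verbosity
    value: "99"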

caiohaz commented Aug 31, 2018

I got it. @roffe, how did you generate the TLS certs? Could you describe the process? I can't get kiam to start with the v3 version...

pingles commented Aug 31, 2018

@XDexter the linked TLS document describes it; the instructions are updated for v3.
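(As a rough sketch of what the v3 setup expects once certs are generated per that document; treat the Secret name, file paths and image tag as assumptions, with the --cert/--key/--ca flags taken from the kiam TLS/deployment docs:)

# Illustrative kiam server pod spec fragment: the generated cert, key and CA
# are stored in a Secret, mounted into the pod, and passed via the TLS flags.
spec:
  containers:
  - name: kiam-server
    image: quay.io/uswitch/kiam:v3.0-rc1          # illustrative tag
    args:
    - --cert=/etc/kiam/tls/server.pem             # generated per the TLS doc
    - --key=/etc/kiam/tls/server-key.pem
    - --ca=/etc/kiam/tls/ca.pem
    volumeMounts:
    - name: tls
      mountPath: /etc/kiam/tls
  volumes:
  - name: tls
    secret:
      secretName: kiam-server-tls                 # illustrative Secret name

The agent takes equivalent flags for its own cert, key and CA, and the names in the server certificate have to match the address the agent dials, which is where most v3 TLS failures tend to show up.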

caiohaz commented Aug 31, 2018

Hi @pingles, right, I did those steps... so I think the error isn't with the TLS certs...

pingles commented Sep 1, 2018

@XDexter please see #115 for more detail on how to get more information into the logs to help diagnose this further (it's the gRPC env vars you'll need to set).

caiohaz commented Sep 3, 2018

@pingles I have this error:

Readiness probe failed:
INFO: 2018/09/03 13:15:51 parsed scheme: "dns"
INFO: 2018/09/03 13:15:51 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 100.64.0.10:53: no such host.
INFO: 2018/09/03 13:15:51 grpc: failed dns TXT record lookup due to lookup localhost on 100.64.0.10:53: no such host.
INFO: 2018/09/03 13:15:51 ccResolverWrapper: got new service config:
WARNING: 2018/09/03 13:15:51 grpc: parseServiceConfig error unmarshaling due to unexpected end of JSON input
INFO: 2018/09/03 13:15:51 ccResolverWrapper: sending new addresses to cc: [{[::1]:443 0 } {127.0.0.1:443 0 }]
INFO: 2018/09/03 13:15:51 base.baseBalancer: got new resolved addresses: [{[::1]:443 0 } {127.0.0.1:443 0 }]
INFO: 2018/09/03 13:15:51 base.baseBalancer: handle SubConn state change: 0xc42038ea80, CONNECTING
INFO: 2018/09/03 13:15:51 base.baseBalancer: handle SubConn state change: 0xc42038eaa0, CONNECTING
time="2018-09-03T13:15:51Z" level=fatal msg="error creating server gateway: error dialing grpc server: context deadline exceeded"
Back-off restarting failed container

caiohaz commented Sep 5, 2018

Hello,

I increased the "--gateway-timeout-creation" parameter and the containers started successfully; the CPU utilization is stable too:

[screenshot: kiam-cpu2]
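(For anyone else hitting the same startup error, a minimal sketch of where that parameter goes; the 1s value, image tag and container layout are illustrative, only the flag name comes from the thread above:)

# Illustrative kiam agent container snippet: allow more time for the gateway
# to dial the kiam server before the process exits with the fatal error above.
containers:
- name: kiam-agent
  image: quay.io/uswitch/kiam:v3.0-rc1            # illustrative tag
  args:
  - --gateway-timeout-creation=1s                 # example of a longer timeout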

Thank you so much @pingles and @roffe!

Regards

sbuliarca commented Oct 24, 2018

If this is still of interest to anybody: it seems the issue appears when exposing the Prometheus metrics with the prometheus-listen-addr and prometheus-sync-interval flags. We dropped these flags with the v2.8 version, on both the agent and the server, and it is working fine now.
Here is a screenshot with the metrics:

[screenshot: 2018-10-24 at 3 19 20 pm]
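(For clarity, a sketch of the change being described; everything except the two flag names is illustrative:)

# Illustrative v2.8 args before the change (other flags omitted). Dropping the
# two Prometheus flags below, on both agent and server, is what stopped the
# climbing resource usage in this case.
args:
- --prometheus-listen-addr=0.0.0.0:9620           # removed; address is illustrative
- --prometheus-sync-interval=5s                   # removed; interval is illustrative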
