
High CPU usage #2673

Closed
tipruzs opened this issue Jan 8, 2018 · 37 comments
Labels
kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed), priority/P2 (need to be fixed in the future), status/5-frozen-due-to-age

Comments

@tipruzs

tipruzs commented Jan 8, 2018

Do you want to request a feature or report a bug?

bug

What did you do?

When using Traefik in front of dynamic web applications (e.g. Nextcloud, a speedtest), I see very high CPU usage, up to 100%, when data is transferred (e.g. large downloads or speedtests).

Note: Traefik and the web apps both run in Docker Swarm.

Update 1: I disabled compression on the entrypoint; CPU usage is still at 50-60%. Is this expected?

What did you expect to see?

No CPU spikes (i.e. behavior comparable to typical load balancers / reverse proxies).

What did you see instead?

CPU load rises up to 100%.

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.4.5
Codename:     roquefort
Go version:   go1.9.2
Built:        2017-12-06_10:16:48AM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

services:
  traefik:
    image: ${IMAGE}:${RELEASE}
    environment:
      ACME_EMAIL: ${ACME_EMAIL}
      ACME_DEFAULT_HOST: ${ACME_DEFAULT_HOST}
    command:
      - --configfile=/run/secrets/traefik_admin
      - --loglevel=WARN
      - --checknewversion=false
      - --insecureskipverify=true
      - --defaultentrypoints=http,https
      - "--entrypoints=Name:http Address::80 Compress:true Redirect.EntryPoint:https"
      - "--entryPoints=Name:https Address::443 TLS Compress:true"
      - --acme
      - --acme.email=${ACME_EMAIL}
      - --acme.domains=${ACME_DEFAULT_HOST}
      - --acme.onhostrule
      - --acme.entrypoint=https
      - --acme.storage=/data/acme.json
      - --docker
      - --docker.watch
      - --docker.exposedbydefault=false
      - --docker.swarmmode
      # - --debug
    networks:
     - traefik
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - traefik:/data
    healthcheck:
      test: ["CMD-SHELL", "traefik healthcheck --web"]
      interval: 30s
      timeout: 5s
      retries: 2
      start_period: 30s
    deploy:
      restart_policy:
        condition: any
        delay: 2s
      update_config:
        monitor: 120s
        failure_action: continue
      labels:
        - "traefik.enable=true"
        - "traefik.docker.network=org_traefik"
        - "traefik.port=8080"
        - "traefik.protocol=http"
        - "traefik.frontend.rule=Host:${ACME_DEFAULT_HOST};PathPrefixStrip:/traefik"
    secrets:
      - traefik_admin

If applicable, please paste the log output in debug mode (--debug switch)

(paste your output here)
@timoreimann
Contributor

Duplicate of #1064?

@emilevauge added the kind/bug/possible, contributor/waiting-for-feedback, and priority/P1 labels and removed the status/0-needs-triage label on Jan 10, 2018
@juliens
Member

juliens commented Jan 10, 2018

Can you try with 1.5.0-rc4?

@juliens
Member

juliens commented Jan 11, 2018

After some investigation, I isolated the piece of code that uses the CPU (by rewriting a simple reverse proxy). The culprit is io.Copy, and more precisely syscall.Write and syscall.Read.

[screenshot: perf output]

After some research, it seems this is a known problem; the solution could be to use syscall.Splice, but I don't think that is easily possible (@mvertes).
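For context, here is a minimal sketch of the copy path described above: an illustrative bare TCP proxy in the spirit of the rewritten test, not Traefik's actual code (the listen address and backend address are made up). Each io.Copy iteration pulls data into a user-space buffer with a read syscall and pushes it back out with a write syscall, which is where the time in the perf output goes. splice(2) would let the kernel move the bytes between the sockets directly, but Go's io.Copy did not take that path for TCP connections until later Go releases.

package main

import (
    "io"
    "log"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", ":8080") // hypothetical listen address
    if err != nil {
        log.Fatal(err)
    }
    for {
        client, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        go func(client net.Conn) {
            defer client.Close()
            backend, err := net.Dial("tcp", "127.0.0.1:3000") // hypothetical backend
            if err != nil {
                return
            }
            defer backend.Close()
            // Each direction is an io.Copy loop: read() into a user-space buffer,
            // then write() back out; these are the syscalls visible in the profile.
            go io.Copy(backend, client)
            io.Copy(client, backend)
        }(client)
    }
}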

@tipruzs
Author

tipruzs commented Jan 24, 2018

I retested with 1.5.0 and got similar CPU metrics.

@juliens added the priority/P2 label and removed the priority/P1 label on Feb 1, 2018
@juliens removed their assignment on Feb 1, 2018
@javydekoning

javydekoning commented Mar 2, 2018

After migrating from jwilder/nginx-proxy:latest to Traefik v1.5.3, I ran into the same issue. The CPU utilization spikes as soon as I open a page through Traefik.

What did you do?
I thought it could be related to HTTPS or DEBUG logging, but even with both disabled, the CPU spikes when opening a single connection.

What did you expect to see?
Slightly increased CPU.

What did you see instead?
A heavy spike. The graph below is the result of just opening a single connection to a Grafana or ruTorrent container over HTTP through Traefik:

[screenshot: CPU usage spike]

Traefik Version:

Version:      v1.5.3
Codename:     cancoillotte
Go version:   go1.9.4
Built:        2018-02-27_02:47:04PM
OS/Arch:      linux/amd64

docker-compose:

version: '3'
services:
  traefik:
    image: traefik
    container_name: traefik
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - "./traefik/traefik.toml:/traefik.toml"
      - "./traefik/acme.json:/acme.json"
    labels:
      - "traefik.port=8080"
      - "traefik.frontend.rule=Host:lab.<removed>"
      - "traefik.frontend.auth.basic=<removed>"

  grafana:
    container_name: grafana
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    labels:
      - "traefik.port=3000"
      - "traefik.frontend.rule=Host:grafana.<removed>"
    volumes:
      - ./grafana:/var/lib/grafana

traefik.toml:

defaultEntryPoints = ["http","https"]

[web]
  address = ":8080"

[entryPoints]
  [entryPoints.http]
    address = ":80"
    whitelistSourceRange = ["192.168.1.0/24","192.168.3.0/24"]
  [entryPoints.https]
    address = ":443"
    [entryPoints.https.tls]

[acme]
  email = <removed>
  storage = "acme.json"
  entryPoint = "https"
  onHostRule = true
  [acme.httpChallenge]
    entryPoint = "http"

[docker]
  endpoint = "unix:///var/run/docker.sock"
  domain = <removed>
  watch = true

If there is anything I can do to help debug, let me know.

@mo
Contributor

mo commented Mar 12, 2018

I benchmarked HAProxy vs. Traefik 1.5.3 this weekend and concluded that Traefik uses quite a bit more CPU.

I believe HAProxy avoids copying the data to user space (via IO splicing, so that the kernel moves data directly from the read buffer to the write buffer). Can someone confirm that Traefik simply does not have this feature yet? Or could there be some general "high CPU usage" bug in Traefik that is worth chasing down?

@timoreimann
Contributor

@juliens any further thoughts on how to proceed here?

When in doubt, we should ask users to collect CPU profiles via pprof and analyze the data. My own experience with benchmarking Traefik taught me that the performance behavior depends heavily on the configuration, i.e., which options are set and which providers are used.

@mo
Contributor

mo commented Mar 12, 2018

If someone posts some instructions I'd be happy to do some profiling.

@juliens
Member

juliens commented Mar 14, 2018

In fact, it could be the syscall.Read/Write problem, or it could be another problem with your configuration (as @timoreimann said).
We need to analyze a profile to be sure of the reason for the high CPU usage.

Here is how to get profiling data:

  1. Start your Traefik with --debug on the command line or debug=true in the TOML file
  2. Generate the profile files (while Traefik is under load):
curl -o profile.pprof <traefik_ip>:8080/debug/pprof/profile    # CPU (takes 30 seconds to return)
curl -o heap.pprof <traefik_ip>:8080/debug/pprof/heap    
curl -o block.pprof <traefik_ip>:8080/debug/pprof/block    
curl -o mutex.pprof <traefik_ip>:8080/debug/pprof/mutex
curl -o goroutine.pprof <traefik_ip>:8080/debug/pprof/goroutine
curl -o trace.pprof <traefik_ip>:8080/debug/pprof/trace?seconds=5    # the seconds parameter is needed
  3. Send us all the files you have generated

Thank you for your interest in Træfik.
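
For reference, the /debug/pprof endpoints used above are the standard Go net/http/pprof handlers. The sketch below shows how any Go service exposes them, under the assumption that Traefik's --debug mode wires up the equivalent on its API port; this is not Traefik's own code.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    "runtime"
)

func main() {
    // Block and mutex profiles stay empty unless sampling is enabled.
    runtime.SetBlockProfileRate(1)
    runtime.SetMutexProfileFraction(1)

    // Once this listener is up, the curl commands above (profile, heap,
    // block, mutex, goroutine, trace) work against :8080.
    log.Fatal(http.ListenAndServe(":8080", nil))
}

The collected files can usually be inspected locally with go tool pprof (for example, go tool pprof profile.pprof and then the top or web commands).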

@GaruGaru

GaruGaru commented Mar 14, 2018

I did the profiling on a test machine (a Scaleway VC1S cloud instance, 2 cores, Intel Atom C2750 @ 2.40GHz) with Traefik 1.5.3 running on Docker, using siege with 50 concurrent users. CPU usage was 44% on the node with the Traefik container and only 5% on the node serving the actual content.

profile.zip

My docker-compose

 traefik:
    image: traefik:1.5.3-alpine
    command:
        - "--logLevel=DEBUG"
        - "--api"
        - "--docker"
        - "--docker.swarmmode" 
        - "--docker.watch"
        - "--web"
        - "--web.metrics.statsd.address=<STATSD_HOST>"
        - "--entrypoints=Name:http Address::80 Redirect.EntryPoint:https"
        - "--entrypoints=Name:https Address::443 TLS"
        - "--defaultentrypoints=http,https"
        - "--acme"
        - "--acme.storage=acme.json"
        - "--acme.entryPoint=https"
        - "--acme.httpChallenge.entryPoint=http"
        - "--acme.OnHostRule=true" 
        - "--acme.email=<MAIL>"
        - "--docker"
        - "--docker.swarmmode"
        - "--docker.domain=<DOMAIN>"
        - "--docker.watch"
    networks:
      - proxy
    ports:
        - 80:80
        - 443:443
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

@mo
Contributor

mo commented Mar 14, 2018

I collected a profile as well. Mine is from the smallest droplet on DigitalOcean (1 CPU, 1 GB RAM). While profiling, I ran "wrk -c 6 -t 6 -d 999 ..." against it, causing a constant 30% CPU usage in Traefik v1.5.3 and 30% in my Node webapp.

pprof-mo-20180314.zip

/etc/traefik/traefik.toml:

defaultEntryPoints = ["http", "https"]

logLevel = "DEBUG"

[entryPoints]
  [entryPoints.http]
  address = ":80"
    [entryPoints.http.redirect]
    entryPoint = "https"
  [entryPoints.https]
  address = ":443"
    [entryPoints.https.tls]

[acme]
  entryPoint = "https"
#  caServer = "https://acme-staging.api.letsencrypt.org/directory"
  storage = "/etc/traefik/acme.json"
  email = "martin@minimum.se"
  onHostRule = true

[acme.httpChallenge]
  entryPoint = "http"

[traefikLog]
#  filePath = "/var/log/traefik.log"

[accessLog]
#  filePath = "/var/log/traefik-access.log"

[web]
  address = "127.0.0.1:8787"

[file]
  watch = true
  directory = "/etc/traefik/conf.d"

And I have one webapp in /etc/traefik/conf.d/foobar.toml

[frontends]
  [frontends.foobar-frontend]
    entryPoints = ["http", "https"]
    backend = "foobar-backend"
    passHostHeader = true

    [frontends.foobar-frontend.routes.default]
      rule = "Host:foobar.com"

[backends]
  [backends.foobar-backend]
    [backends.foobar-backend.servers]
      [backends.foobar-backend.servers.server0]
        url = "http://127.0.0.1:3000"

    [backends.foobar-backend.loadBalancer]
      method = "drr"
      [backends.foobar-backend.loadBalancer.stickiness]
        cookieName = "sticky-session"

[[acme.domains]]
   main = "foobar.com"
   sans = []

@amq

amq commented Mar 18, 2018

Similar here. Traefik consumes more CPU than the backend PHP app and almost as much as MySQL.

I have observed this behavior since I started using Traefik at 1.3.x; it seems to be even a bit worse in 1.5.4.

All static files are hosted by a CDN and the site is not using a full page cache.

When I attach strace, most of the output is:

futex() = 1
futex() = -1 EAGAIN (Resource temporarily unavailable)
read() = -1 EAGAIN (Resource temporarily unavailable)

Total times:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.23    1.343642         116     11631       830 futex
  1.58    0.021588           4      5459           epoll_wait
  0.06    0.000875           1      1704        38 write
  0.05    0.000672           0      1632           pselect6
  0.05    0.000662           0      3201      2328 read
  0.02    0.000306           3        95           getrandom
  0.00    0.000050           0       111           close
  0.00    0.000000           0        12           sched_yield
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1         1 connect
  0.00    0.000000           0        44           getsockname
  0.00    0.000000           0        45           setsockopt
  0.00    0.000000           0         1         1 restart_syscall
  0.00    0.000000           0       156           epoll_ctl
  0.00    0.000000           0        87        43 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    1.367795                 24180      3241 total

@juliens
Member

juliens commented Mar 19, 2018

@GaruGaru @mo Thank you for these profiles. Can you also add your Traefik configurations?

@GaruGaru

@juliens I have edited my previous post with my configuration.

@mo
Contributor

mo commented Mar 19, 2018

@juliens I have updated my comment above as well

@timoreimann
Contributor

Looking at @GaruGaru's profile, it seems like most CPU cycles (43.25% of sample time) are spent inside Go's math/big package for TLS communication/handshaking:

(pprof) top
Showing nodes accounting for 16.46s, 72.86% of 22.59s total
Dropped 576 nodes (cum <= 0.11s)
Showing top 10 nodes out of 153
      flat  flat%   sum%        cum   cum%
     9.77s 43.25% 43.25%      9.77s 43.25%  math/big.addMulVVW
     1.94s  8.59% 51.84%     12.43s 55.02%  math/big.nat.montgomery
     1.67s  7.39% 59.23%      1.74s  7.70%  syscall.Syscall
     1.03s  4.56% 63.79%      1.03s  4.56%  runtime.memmove
     0.58s  2.57% 66.36%      0.58s  2.57%  math/big.mulAddVWW
     0.41s  1.81% 68.17%      1.90s  8.41%  math/big.nat.divLarge
     0.34s  1.51% 69.68%      0.34s  1.51%  runtime.usleep
     0.27s  1.20% 70.87%      0.27s  1.20%  crypto/sha256.block
     0.26s  1.15% 72.02%      0.26s  1.15%  math/big.subVV
     0.19s  0.84% 72.86%      0.19s  0.84%  math/big.shlVU

As a graph:

[graph: call graph of GaruGaru's CPU profile]

It's hard for me to tell whether the 44% CPU consumption that @GaruGaru reported is reasonable or not. (By the way, is that 44% per core or for the entire machine?) It also depends on the request rate that those 50 concurrent users produced.

Is there any chance you could disable TLS to see if it makes a difference?

@mo
Contributor

mo commented Mar 20, 2018

@timoreimann I built 1.5.4 with your logging fix cherry-picked, changed the logLevel to INFO in the config, and re-profiled. I got 40% CPU in Node and 30% in Traefik (and nothing in systemd this time, because the log spam is gone).

pprof-mo-20180320-traefik-1.5.4-without-debug-logging.zip

@ulope

ulope commented Apr 5, 2018

Just to add an observation I recently made by accident:

I've been load-testing a certain backend application (it doesn't matter which one for this) and also noticed very high Traefik CPU load. However, switching the API client I've been using to HTTP keep-alive dropped the CPU load to almost nothing. So it appears that establishing new connections is related to the high load.
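
To make that comparison concrete, here is a small illustrative Go client along the lines of such a test (the URL and request count are made up; only the keep-alive toggle matters). With DisableKeepAlives set, the transport opens a fresh TCP connection, and over HTTPS a fresh TLS handshake, for every request, which is exactly the per-connection work that ends up as proxy CPU.

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func run(disableKeepAlives bool) {
    client := &http.Client{
        Timeout: 5 * time.Second,
        Transport: &http.Transport{
            DisableKeepAlives:   disableKeepAlives, // true: new connection (and TLS handshake) per request
            MaxIdleConnsPerHost: 10,
        },
    }
    for i := 0; i < 100; i++ {
        resp, err := client.Get("https://example.com/") // hypothetical service behind the proxy
        if err != nil {
            fmt.Println("request error:", err)
            return
        }
        io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
        resp.Body.Close()
    }
}

func main() {
    run(false) // keep-alive: connections are reused, little handshake overhead
    run(true)  // no keep-alive: 100 separate connections, much more CPU on both ends
}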

@GaruGaru

GaruGaru commented Apr 6, 2018

Nice find, @ulope! I repeated the same test with and without keep-alive and got the following results:

With keepalive

siege -r 100  -c 50  -R <(echo connection = keep-alive) "<service>"

Transactions:		        5000 hits
Availability:		      100.00 %
Elapsed time:		        3.23 secs
Data transferred:	        0.09 MB
Response time:		        0.03 secs
Transaction rate:	     1547.99 trans/sec
Throughput:		        0.03 MB/sec
Concurrency:		       48.73
Successful transactions:           0
Failed transactions:	           0
Longest transaction:	        0.44
Shortest transaction:	        0.00

Without keepalive

siege -r 100  -c 50  "<service>"

Transactions:		        5000 hits
Availability:		      100.00 %
Elapsed time:		       64.36 secs
Data transferred:	        0.09 MB
Response time:		        0.07 secs
Transaction rate:	       77.69 trans/sec
Throughput:		        0.00 MB/sec
Concurrency:		        5.08
Successful transactions:           0
Failed transactions:	           0
Longest transaction:	        0.56
Shortest transaction:	        0.00

@amq

amq commented Jun 3, 2018

Things have gotten worse in 1.6. I wish you would give performance more love soon, because it is not that far off until your space is overtaken by Envoy or some Nginx cloud-native advancements.

@cregev

cregev commented Jun 24, 2018

We have recently upgraded to Traefik version 1.6.4 and the CPU usage increased dramatically.
[screenshot: CPU usage graph]

@ldez
Member

ldez commented Jun 24, 2018

@cregev We need more information to analyze the reason for the high CPU usage.

Could you share your configuration and do some profiling?

Here is how to get profiling data:

  1. Start your Traefik with --debug on the command line or debug=true in the TOML file
  2. Generate the profile files (while Traefik is under load):
curl -o profile.pprof <traefik_ip>:8080/debug/pprof/profile    # CPU (takes 30 seconds to return)
curl -o heap.pprof <traefik_ip>:8080/debug/pprof/heap    
curl -o block.pprof <traefik_ip>:8080/debug/pprof/block    
curl -o mutex.pprof <traefik_ip>:8080/debug/pprof/mutex
curl -o goroutine.pprof <traefik_ip>:8080/debug/pprof/goroutine
curl -o trace.pprof <traefik_ip>:8080/debug/pprof/trace?seconds=5    # the seconds parameter is needed
  3. Send us all the files you have generated

Thank you for your interest in Træfik.

@cregev

cregev commented Jun 25, 2018

traefik_cpu_issue.zip
@ldez Attached are the pprof files.

logLevel = "INFO"
debug = true
[accessLog]
format = "json"
 [accessLog.filters]
   statusCodes = [ "500-520" ]
[api]
 entryPoint = "traefik"
 dashboard = true
 debug = true
[ping]
 # Name of the related entry point
 #
 # Optional
 # Default: "traefik"
 #
 entryPoint = "traefik"
[metrics]
 [metrics.datadog]
   address = "<Datadog  Address>"
   pushInterval = "10s"
[consulCatalog]
endpoint = "<Consul Local agent Address>"
constraints = ["tag==http_external_service"]
watch = true
domain = "< customized_domain of similarweb >"
prefix = "traefik"

@emilevauge
Member

@cregev Any chance to test with 1.6.3? We suspect a potential cause in 1.6.4. Thanks for your help :)

@Falx

Falx commented Sep 19, 2018

Is there any update on this topic? We also suffer greatly from this. We use Traefik as a Kubernetes ingress, and after a few days CPU io-wait spikes up to 90% in our monitoring tools and cripples the whole microservice platform. We tested 1.5.4 and 1.6.6 with similar results.

@kachkaev
Contributor

@Falx Do you have a memory limit set on your pod? The issue could be related to this. I once noticed high CPU usage after certain requests, and it simply turned out that there was a lot of pod swap IO due to the memory limits set by default in the Helm chart.

problem: #1908 (comment)

solution: #1908 (comment)

@Falx

Falx commented Sep 19, 2018

@Falx Do you have a memory limit set in your pod? The issue could be related to this. I once noticed high CPU usage after certain requests and it simply turned out that there was a lot of pod’s swap IO due to memory limits set my default in the helm chart.

problem: #1908 (comment)

solution: #1908 (comment)

Thank you very much for this information. We've updated our memoryLimit in the helm chart to 100Mi and will see in the coming days!

@kachkaev
Contributor

This memory limit in the Helm chart is a pretty painful default; your team is not the first one that has been hurt by it. I tried to lobby for a change in the chart values a while ago, but no luck to date.

Now, given that the official Helm chart is maintained by the Traefik team directly, feel free to propose a change in the defaults if you agree with me that this is reasonable. IMO the combined time spent on upgrading from chart v1 to v2 is far less than the effort needed to tackle this IO nightmare people randomly run into. It took me a while to figure out why Traefik started killing my CPU; I felt desperate during the investigation 😞

@Falx

Falx commented Sep 19, 2018

So what exactly is up with the gzip compression? Our API serves almost only JSON (albeit huge JSON messages, sometimes with over 20k entries). Besides that, we have some other microservices that host single-page applications with small image assets.

Would we benefit from turning gzip compression off? We have an nginx reverse proxy in front of the whole k8s cluster (and thus the Traefik ingress).

@kachkaev
Contributor

The issue is not specific to compression; it's about resource usage in general. When you have compression, HTTPS, high load, or any other resource-consuming factor, Traefik goes over the default 20MiB RAM limit and starts endlessly writing to and reading from swap.

IMO the problem is that the chart defaults are not picked correctly, and this causes CPU madness under pretty standard conditions. I was in favor of changing the defaults and thus upgrading the chart to v2, but this proposal has not been accepted.

The result is that new Traefik chart users fall into the same trap I did a year ago.

@Falx

Falx commented Sep 19, 2018

OK, I see. Right now we have increased the limit to 100Mi, but we will have to monitor whether this is enough for our application. I am happy, though, that at least now we have a logical explanation for this weird behavior we've been seeing.

@clausndk

clausndk commented Jan 22, 2019

profile.zip
I created a profile.

Our issue appeared when migrating from a self-implemented gateway to Traefik.
FYI: we are using Service Fabric as our hosting environment.

Our own implementation did not use as much CPU in general; after the switch, average VM CPU usage is around 50%, up from 10-20%.

Can you help us analyze this? We wanted to move to Traefik to get a standard product, but right now our own code is faster (which I really doubt!).

Info: Traefik 1.7.7

Thanks for any pointers!

@Kindratte

Quoting @timoreimann's analysis above: "Looking at @GaruGaru's profile, it seems like most CPU cycles (43.25% of sample time) are spent inside Go's math/big package for TLS communication/handshaking. [...] Is there any chance you could disable TLS to see if it makes a difference?"

You're right: the high CPU usage is due to slow handshakes with the default RSA-4096 certificate in Traefik; see the Go issue golang/go#20058. I changed the key type to ECDSA ("--acme.keytype=EC256") and CPU usage went back to normal.
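
As a rough illustration of why the key type matters (this is not a proper TLS benchmark, just the server-side private-key operation that each full handshake performs), here is a small Go comparison of one RSA-4096 signature against one ECDSA P-256 signature; the RSA path is the math/big work visible in the profiles above.

package main

import (
    "crypto"
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/rsa"
    "crypto/sha256"
    "fmt"
    "time"
)

func main() {
    digest := sha256.Sum256([]byte("handshake transcript")) // stand-in for the real signed data

    // RSA-4096: modular exponentiation over big integers (math/big.addMulVVW, montgomery).
    rsaKey, err := rsa.GenerateKey(rand.Reader, 4096)
    if err != nil {
        panic(err)
    }
    start := time.Now()
    if _, err := rsaKey.Sign(rand.Reader, digest[:], crypto.SHA256); err != nil {
        panic(err)
    }
    fmt.Println("RSA-4096 sign:   ", time.Since(start))

    // ECDSA P-256 (what --acme.keytype=EC256 selects): far cheaper per handshake.
    ecKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        panic(err)
    }
    start = time.Now()
    if _, err := ecKey.Sign(rand.Reader, digest[:], crypto.SHA256); err != nil {
        panic(err)
    }
    fmt.Println("ECDSA P-256 sign:", time.Since(start))
}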

@tboerger
Contributor

Does this issue still apply to Traefik 2.0?

@dduportal
Contributor

Hi @tboerger, the Traefik 2.0 GA version doesn't have this behaviour as far as we know (reference: #5294 was closed during the RC phase).

As this issue is old and has been hijacked by a lot of different reports, we are closing it as part of a triage cleanup.

Please feel free to open a new issue with a reproduction case if you are still seeing this behavior with the latest Traefik v2.0. Thanks!

@traefik locked and limited conversation to collaborators on Oct 31, 2019