Random 500 Error returned by Traefik #1926

karthimohan · 2017-08-07T22:31:24Z

Report a bug

What version of Traefik are you using (traefik version)?

v1.3.0

What did you do?

We are using Traefik as a load balancer using Marathon provider in our Production

Below is our setup

Edge Proxy - Haproxy
Internal proxy Layer - Traefik
We use Haproxy as our edge proxy due to complex rewrites/ redirects and the backends are configured configured with Traefik servers.
Also all our internal service to service communication happens via Traefik.

Here is our traffic flow

API Call --> Haproxy --> Traefik Servers --> Docker containers

Initially when we introduced Traefik in Prod, we just switched the inter service communication that used to happen via internal haproxy load balancers, we did not see any issues.
So we decided to move the Edge Proxy Traffic via Traefik instead of sending directly to app containers. We made this change only to send 10% of our traffic to observe before we switch all traffic.

So out of 3 Edge Haproxy Servers, only one server sends traffic via Traefik and other 2 servers send directly to app containers.

What did you expect to see?

No Change in Behavior

What did you see instead?

After we made this change, we started observing random 500 Errors returned by the Haproxy server sending Traffic via Traefik server. But we dont see that issue on Proxy servers for similar api calls. We dont the see those failed requests reaching the Backend app containers as we dont see any events in the logs during this time

Log Event from Haproxy

Aug 6 22:56:52 localhost.localdomain haproxy[118832]: ::ffff:12:23:234:12:35449 [06/Aug/2017:22:56:51.741] HTTPS~ appname/traefik-80 345/0/0/231/576 500 157 - - ---- 35/34/0/0/0 0/0 {api.abc.cloud||||okhttp/3.4.2} "GET /v1.1/abc/c45f21be84824984b34c8538bd8b519a/abc?offset=1000&count=100 HTTP/1.1"

Log Event from Traefik

101.110.24.120 - - [06/Aug/2017:22:56:52 -0700] "GET /api/abc/c45f21be84824984b34c8538bd8b519a/abc HTTP/1.1" 500 21 "" "okhttp/3.4.2" 44609802 "appname" "http://hostname:31020" 229ms

What is your environment & configuration (arguments, toml, provider, platform, ...)?

configuration

################################################################
# Global configuration
################################################################

# Timeout in seconds.
# Duration to give active requests a chance to finish during hot-reloads
#
# Optional
# Default: 10
#
graceTimeOut = 10

# Traefik logs file
# If not defined, logs to stdout
#
# Optional
#
traefikLogsFile = "/var/log/traefik/traefik.log"


# Access logs file
#
# Optional
#
accessLogsFile = "/var/log/traefik/traefik-access.log"

# Log level
#
# Optional
# Default: "ERROR"
#
logLevel = "DEBUG"

# Backends throttle duration: minimum duration between 2 events from providers
# before applying a new configuration. It avoids unnecessary reloads if multiples events
# are sent in a short amount of time.
#
# Optional
# Default: "2s"
#
ProvidersThrottleDuration = 1


# If non-zero, controls the maximum idle (keep-alive) to keep per-host.  If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
MaxIdleConnsPerHost = 200

# If set to true invalid SSL certificates are accepted for backends.
# Note: This disables detection of man-in-the-middle attacks so should only be used on secure backend networks.
# Optional
# Default: false
#
InsecureSkipVerify = false

# Entrypoints to be used by frontends that do not specify any entrypoint.
# Each frontend can specify its own entrypoints.
#
# Optional
# Default: ["http"]
#
defaultEntryPoints = ["http"]

################################################################
# Web configuration backend
################################################################

[web]
address = ":8080"

################################################################
# Mesos/Marathon configuration backend
################################################################

# Enable Marathon configuration backend
#
# Optional
#
#[marathon]
[marathon]


# Marathon server endpoint.
# You can also specify multiple endpoint for Marathon:
# endpoint := "http://10.241.1.71:8080,10.241.1.72:8080,10.241.1.73:8080"
#
# Required
#
endpoint = "http://101.211.1.11:8080,101.211.1.13:8080,101.211.1.13:8080"

# Enable watch Marathon changes
#
# Optional
#
watch = true

# Default domain used.
#
# Required
#
domain = "traefik.service.consul"

# Override default configuration template. For advanced users :)
#
# Optional
#
# filename = "marathon.tmpl"
filename = "/opt/traefik/conf/marathon.tmpl"

# Expose Marathon apps by default in traefik
#
# Optional
# Default: false
#
exposedByDefault = false


# Convert Marathon groups to subdomains
# Default behavior: /foo/bar/myapp => foo-bar-myapp.{defaultDomain}
# with groupsAsSubDomains enabled: /foo/bar/myapp => myapp.bar.foo.{defaultDomain}
#
# Optional
# Default: false
#
groupsAsSubDomains = false

# Enable Marathon basic authentication
#
# Optional
#
[marathon.basic]
httpBasicAuthUser = "xxx"
httpBasicPassword = "xxxxxxx"

# To enable more detailed statistics
[web.statistics]
   RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
   Buckets=[0.1,0.3,1.2,5.0]

# TLS client configuration. https://golang.org/pkg/crypto/tls/#Config
#
# Optional
#
# [marathon.TLS]
# InsecureSkipVerify = true


[retry]

We use Traefik Healthcheck and below is the Marathon Labels for one of the service where we see the failure

  "labels": {
    "consul": "servicename-5654",
    "external": "tag",
    "internal": "tag",
    "prod": "tag",
    "5654": "tag",
    "traefik.backend.healthcheck.path": "/healthcheck",
    "traefik.backend.healthcheck.interval": "10s",
    "traefik.frontend.rule": "PathPrefixStrip:/servicename",
    "traefik.backend": "-servicename",
    "traefik.portIndex": "0",
    "traefik.enable": "true"
  },

We are not able to identify a pattern. We were running Traefik serveres in C4.xlarge and changed the Instance Type to c4.2xlarge to check if any network bandwidth/ resource issue. Appreciate if your help in troubleshooting this issue.

The text was updated successfully, but these errors were encountered:

ldez · 2017-08-07T23:31:43Z

Thanks for your interest in Traefik !

Could you try with the latest version (1.3.5) ?

karthimohan · 2017-08-10T03:37:46Z

@idez I am testing version 1.3.5 in preprod. Will update this issue with my result in a day

karthimohan · 2017-08-14T20:05:20Z

@ldez tested with 1.3.5 and it broke all the websocket connection related to this change #1930. So i can't say for sure if the version 1.3.5 fixed my original reported issue.

I tried the patch mentioned in #1930 and the wbsocket issue is fixed. So I am going to apply the patch in rest of my preprod stack and test.

Do you have any timelines on when #1930 will be merged and released?

karthimohan · 2017-08-19T01:13:52Z

@ldez After applying the patch mentioned in #1930, i dont see the websocket issue but back to the old random failures on other http requests as i described in my bug.

ldez · 2017-08-19T12:01:13Z

could you provide more logs? traefik --debug

karthimohan · 2017-08-22T01:42:53Z

@ldez Are you looking for any specific events so i can filter out those events. Below the some of the log events for which we received 500 status code.

Haproxy
Aug 21 10:09:12 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:41450 [21/Aug/2017:10:08:12.613] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/abc/applications?offset=0 HTTP/1.1"
Aug 21 10:16:51 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:41818 [21/Aug/2017:10:15:51.368] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/abc/applications?offset=0 HTTP/1.1"
Aug 21 10:32:53 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:42757 [21/Aug/2017:10:31:53.376] HTTPS~ app/traefikserver-80 0/0/0/60004/60004 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/applications HTTP/1.1"
Aug 21 10:41:51 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:43500 [21/Aug/2017:10:40:51.598] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/abc/applications HTTP/1.1"
Aug 21 10:43:53 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:43625 [21/Aug/2017:10:42:53.851] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/api/3c19537140ed438f83ce423c4e6a4701/applications?offset=0&count=100 HTTP/1.1"
Aug 21 11:07:30 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:45088 [21/Aug/2017:11:07:18.893] HTTPS~ app/traefikserver-80 0/0/0/12044/12044 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud|||142|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.3} "PUT /v1.1/api/dt7e29b491d8fb4ad79ca9beccd7d21a96 HTTP/1.1"

Traefik-access.log
101.202.100.100 - - [21/Aug/2017:10:08:12 -0700] "GET /abc/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 153198 "appfrontendA" "http://hostname:31574" 60000ms
101.202.100.100 - - [21/Aug/2017:10:15:51 -0700] "GET /abc/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 153384 "appfrontendA" "http://hostname:31574" 60000ms
101.202.100.100 - - [21/Aug/2017:10:31:53 -0700] "GET /applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 153818 "appfrontendA" "http://hostname:31485" 60000ms
101.202.100.100 - - [21/Aug/2017:10:40:51 -0700] "GET /abc/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 154226 "appfrontendA" "http://hostname" 60000ms
101.202.100.100 - - [21/Aug/2017:10:42:53 -0700] "GET /api/3c19537140ed438f83ce423c4e6a4701/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 154308 "appfrontendA" "http://hostname:31574" 60000ms
101.202.100.100 - - [21/Aug/2017:11:07:18 -0700] "PUT api/dt7e29b491d8fb4ad79ca9beccd7d21a96 HTTP/1.1" 500 21 "https://URI/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36" 155003 "appfrontendA" "http://hostname:31485" 12042ms

Traefik Log
time="2017-08-21T10:09:12-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T10:16:51-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T10:32:53-07:00" level=warning msg="Error forwarding to http://hostname:31485, err: context canceled"
time="2017-08-21T10:41:51-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T10:43:53-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T11:07:30-07:00" level=warning msg="Error forwarding to http://hostname:31485, err: context canceled"

emilevauge · 2017-08-23T20:42:24Z

@karthimohan that's weird 🤔
Could you try commenting these sections ?

[web.statistics]
   RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
   Buckets=[0.1,0.3,1.2,5.0]

[retry]

karthimohan · 2017-08-25T00:33:41Z

@emilevauge Not sure how relevant is to remove those. But I just did the suggested change in my preprod and going to monitor it for a day and update here

warroyo · 2017-10-27T21:48:03Z

I am seeing the context canceled error as well. this is extremely noticeable when under heavy load.

i ran a, performance test with wrk

wrk -t10 -c100 -d20s --latency

here are the results:

10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   527.92ms  218.30ms   1.89s    65.72%
    Req/Sec    19.46     11.15    60.00     82.91%
  Latency Distribution
     50%  517.02ms
     75%  684.67ms
     90%  800.57ms
     99%    1.06s
  3023 requests in 20.09s, 3.21MB read
  Socket errors: connect 0, read 0, write 0, timeout 99
Requests/sec:    150.48
Transfer/sec:    163.84KB

notice the 99 timeouts these are all time="2017-10-27T21:44:46Z" level=warning msg="Error forwarding to http://10.42.24.15:80, err: context canceled"

if i increase the connections it gets much worse sometimes to the point all requests are failing

are there some settings that need to be tweaked for heavier load?

not sure if this could be related #1849

timoreimann · 2017-11-20T00:51:24Z

One reason why this can apparently happen is due to disappearing clients (see also golang/go#20071). Is there a chance that Traefik is running out of capacity or backends cannot keep up with the request rate causing clients to eventually lose patience and prematurely terminate the request to Traefik?

thuandt · 2017-12-29T15:33:02Z

I got same problem. It's too weird. Most of all backend behind is ok but one of them always return 500 with err: context canceled.

wrk -t10 -c100 -d20s -H "Authorization: Basic xxx==" https://xyz.abc/services/health_check                                                       
Running 20s test @ https://xyz.abc/services/health_check                    
  10 threads and 100 connections               
  Thread Stats   Avg      Stdev     Max   +/- Stdev                                            
    Latency   513.82ms  141.49ms   1.29s    73.54%                                             
    Req/Sec    20.66     12.13    80.00     82.34%                                             
  3696 requests in 20.05s, 1.59MB read         
  Non-2xx or 3xx responses: 3696               
Requests/sec:    184.38                        
Transfer/sec:     81.03KB

I tried with v1.5.0-rc3 and 1.4 stable but same results

mrnugget · 2018-01-11T09:29:35Z

We're experiencing the same issue: requests where the client disappears before a response could be sent are logged as errors.

The problem is that it's really easy to reproduce. You can just load a site in your browser that takes 1s+ to load and hit reload multiple times before the page is fully rendered to "reissue" the request. The client now disappeared for the first requests and traefik logs a 500.

I think it's debatable whether this should be reported as a status 500 error. IMHO it's not an error. But even if we classify it as an error, I don't think it should be reported as an "Internal Server Error".

Our alerting system (based on grafana + prometheus) goes nuts if users are impatient and hit reload multiple times.

ortz · 2018-03-14T15:12:33Z

hey, we're experiencing the same issue.
it seems that sometimes traefik returns a lot of 500 with the following message: "Error forwarding to http://IP:PORT, err: context canceled".
The servers weren't under unusual load and it's becoming OK after a couple of minutes.

any suggestions?

hsmade · 2018-07-31T07:27:06Z

I suggest to start traefik with the following env var: GODEBUG="http2debug=2". Then look for 500s and check if you see a RST_STREAM right before that. These are requests that are being reset by the (web)client.

coufalja · 2018-08-02T08:48:48Z

I think that issue lies in vulcand/oxy library. The default error handler does this:

func (e *StdHandler) ServeHTTP(w http.ResponseWriter, req *http.Request, err error) {
	statusCode := http.StatusInternalServerError
	if e, ok := err.(net.Error); ok {
		if e.Timeout() {
			statusCode = http.StatusGatewayTimeout
		} else {
			statusCode = http.StatusBadGateway
		}
	} else if err == io.EOF {
		statusCode = http.StatusBadGateway
	}
	w.WriteHeader(statusCode)
	w.Write([]byte(http.StatusText(statusCode)))
	log.Debugf("'%d %s' caused by: %v", statusCode, http.StatusText(statusCode), err)
}

so it for ContextCancelled error tries to write 500 SC which is then catched by metrics middleware in Traefik and reported as 500. The question is should it be fixed in Traefik by using custom ErrorHandler or in oxy? It drives our Monitoring crazy so I would like to help with PR I just want to know the preference where and how to fix it.

kristinn · 2018-08-07T09:25:41Z

I want to confirm @mrnugget's findings. I'm able to reproduce this error when running siege and cancelling it midway.

An example:

echo "GET https://example.com" | vegeta attack -duration=600s -rate=10 | tee results.bin | vegeta report -reporter=json | jq

I'm able to reproduce the issue if I CTRL+C midway through the stress test (the same effect as constantly refreshing a page in the browser before the client has received a response).

It doesn't seem to affect our users in any way, feels more like a "false alarm" by how Traefik is logging these errors.

coufalja · 2018-08-08T15:58:06Z

Submitted vulcand/oxy#155 with the fix to oxy, and confirmed that it no longer reports those false alarms using wrk.

traefiker · 2018-08-20T08:38:07Z

Closed by #3777.

ldez added area/provider/marathon status/0-needs-triage labels Aug 7, 2017

ldez added kind/bug/confirmed a confirmed bug (reproducible). area/websocket kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. and removed status/0-needs-triage area/websocket kind/bug/confirmed a confirmed bug (reproducible). labels Aug 19, 2017

warroyo mentioned this issue Oct 30, 2017

Context Canceled error with heavy load #2346

Closed

mmatur self-assigned this Jan 11, 2018

mmatur assigned juliens Feb 12, 2018

mmatur mentioned this issue Feb 26, 2018

Prometheus metrics counts a client cancel as a 500 #2932

Closed

mmatur added the status/0-needs-triage label Mar 14, 2018

nmengin added kind/enhancement a new or improved feature. priority/P2 need to be fixed in the future and removed status/0-needs-triage area/provider/marathon labels Mar 15, 2018

nmengin removed the kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. label Mar 15, 2018

coufalja mentioned this issue Aug 6, 2018

Fixed 500 reported for client disconnection vulcand/oxy#155

Merged

emilevauge closed this as completed in vulcand/oxy#155 Aug 16, 2018

coufalja mentioned this issue Aug 16, 2018

Updating oxy dependency which fixes the issue #1926 #3768

Closed

juliens mentioned this issue Aug 17, 2018

Update oxy dependency #3777

Merged

traefiker added this to the 1.6 milestone Aug 20, 2018

traefik locked and limited conversation to collaborators Sep 1, 2019

traefiker added the status/5-frozen-due-to-age label Sep 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Random 500 Error returned by Traefik #1926

Random 500 Error returned by Traefik #1926

karthimohan commented Aug 7, 2017 •

edited by ldez

ldez commented Aug 7, 2017

karthimohan commented Aug 10, 2017

karthimohan commented Aug 14, 2017

karthimohan commented Aug 19, 2017

ldez commented Aug 19, 2017

karthimohan commented Aug 22, 2017

emilevauge commented Aug 23, 2017 •

edited by ldez

karthimohan commented Aug 25, 2017

warroyo commented Oct 27, 2017 •

edited

timoreimann commented Nov 20, 2017 •

edited

thuandt commented Dec 29, 2017 •

edited

mrnugget commented Jan 11, 2018 •

edited

ortz commented Mar 14, 2018

hsmade commented Jul 31, 2018

coufalja commented Aug 2, 2018

kristinn commented Aug 7, 2018

coufalja commented Aug 8, 2018

traefiker commented Aug 20, 2018

Random 500 Error returned by Traefik #1926

Random 500 Error returned by Traefik #1926

Comments

karthimohan commented Aug 7, 2017 • edited by ldez

What version of Traefik are you using (traefik version)?

What did you do?

What did you expect to see?

What did you see instead?

What is your environment & configuration (arguments, toml, provider, platform, ...)?

ldez commented Aug 7, 2017

karthimohan commented Aug 10, 2017

karthimohan commented Aug 14, 2017

karthimohan commented Aug 19, 2017

ldez commented Aug 19, 2017

karthimohan commented Aug 22, 2017

emilevauge commented Aug 23, 2017 • edited by ldez

karthimohan commented Aug 25, 2017

warroyo commented Oct 27, 2017 • edited

timoreimann commented Nov 20, 2017 • edited

thuandt commented Dec 29, 2017 • edited

mrnugget commented Jan 11, 2018 • edited

ortz commented Mar 14, 2018

hsmade commented Jul 31, 2018

coufalja commented Aug 2, 2018

kristinn commented Aug 7, 2018

coufalja commented Aug 8, 2018

traefiker commented Aug 20, 2018

karthimohan commented Aug 7, 2017 •

edited by ldez

emilevauge commented Aug 23, 2017 •

edited by ldez

warroyo commented Oct 27, 2017 •

edited

timoreimann commented Nov 20, 2017 •

edited

thuandt commented Dec 29, 2017 •

edited

mrnugget commented Jan 11, 2018 •

edited