proxy backend -> EADDRNOTAVAIL -> probes failing -> all backends down -> site down #2622

Closed
nigoroll opened this issue Mar 20, 2018 · 8 comments

Comments

@nigoroll
Member

I've been bitten badly by a series of issues which I think we need to address. Each of them may not be that bad in isolation, but I think the complete picture is needed to understand their relevance.

Seen on ee1d34d, Linux 3.14.53+, Debian 8.10. Note that the connection pool code has changed since then, but I do not see any semantic differences.

setup

  • cache clustering/sharing is used in a two-varnish setup: requests which are not identified as passes are forwarded to the other varnish based on the backend selection of the shard director.
  • The other varnish is configured as a backend with proxy_header
  • The rate of miss+hitmiss is ~1k/s, so the expected rate of connections to the other varnish is ~0.5k/s
  • 1M max open files
  • <100k client connections

symptoms

  • connections to "the other varnish" are starting to fail (FetchError backend ...: fail)
  • backend probes are starting to fail, until almost all backends are down (FetchError no backend)
  • site goes down

lack of visibility

Analyzing this issue was complicated by the following facts:

  • FetchError does not give information about why the connection failed
  • probes do not log why a connection failed

assured facts

open question

I suspect a causal relation between the failing proxy connections and the fact that backend probes failed, but I cannot prove it:

The hypothesis is that failing backend requests use up file descriptors, for example via a delayed close() in the pool code, but I failed to come up with a plausible explanation.

So I am still looking for either support for the hypothesis, or an alternative hypothesis which would explain why backend probes are failing in the situation documented under assured facts.

TODOs

  • backend proxy connections are problematic as they are due to the TIME_WAIT issue, and, in particular, future scenarios like connecting to an ssl-onloader require connection reuse/pooling (see the sketch after this list).
  • our error messages should point to the exact root cause and should include the relevant errno
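
To make the failure mode in the title concrete: without connection reuse, every backend fetch performs a full connect()/close() cycle, the closing side accumulates sockets in TIME_WAIT, and once the ephemeral port range towards one endpoint is exhausted, connect(2) fails with EADDRNOTAVAIL. The standalone sketch below is not varnish code; the address and port are placeholders and it assumes something is listening there. It merely reproduces the pattern:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	int fd, i;

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(8080);			/* placeholder port */
	inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);	/* placeholder address */

	for (i = 0; i < 100000; i++) {
		fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0) {
			perror("socket");	/* fd exhaustion (EMFILE) would show up here */
			break;
		}
		if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
			/* once all ephemeral ports towards this endpoint sit
			 * in TIME_WAIT, this reports EADDRNOTAVAIL */
			fprintf(stderr, "connect #%d: %s\n", i, strerror(errno));
			close(fd);
			break;
		}
		close(fd);	/* active close -> the local socket enters TIME_WAIT */
	}
	return (0);
}

With Linux's default ephemeral port range of roughly 28k ports and a 60s TIME_WAIT, the ~0.5k connections/s from the setup above would exhaust the range towards a single endpoint in about a minute.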
@Dridi
Member

Dridi commented Mar 21, 2018

> The hypothesis is that failing backend requests use up file descriptors, for example via a delayed close() in the pool code, but I failed to come up with a plausible explanation.

No, there's simply no backend connection pooling when PROXY is involved. The reuse logic needs a lot more work if we want to support that, so currently there's no recycling at all between your two Varnish instances, whether requests fail or not.

Incidentally there's also no pooling at all for probes, by design.

@nigoroll
Member Author

Re @Dridi: both backend connections and probes use VTP.

@Dridi
Member

Dridi commented Mar 21, 2018

VTP or whatever we call it these days with unix domain sockets, I'll need to catch up at some point :)

Probes bypass the recycling code by always opening a new connection, and a backend proxy_header will always neuter recycling with a Connection: close in the bereq (and also request a fresh connection to write the PROXY bits).

edit: maybe not Connection: close, but at least guaranteed by the line you pointed to.

/* The backend connection is closed rather than recycled whenever the
 * response demands a close or the backend has a proxy_header configured;
 * the else branch (not shown) is where the connection can be recycled. */
if (bo->htc->doclose != SC_NULL || bp->proxy_header != 0) {
	VSLb(bo->vsl, SLT_BackendClose, "%d %s", *PFD_Fd(pfd),
	    bp->director->display_name);
	VTP_Close(&pfd);
	AZ(pfd);
	Lck_Lock(&bp->mtx);
} else {

@nigoroll
Member Author

@Dridi yes. As I said, I was suspecting a delayed file descriptor close issue, but could not find evidence. As the failing health check on the backend with too many TIME_WAIT connections coincided with other backends' health checks failing, the code shared by all backends (VTP) is a possible candidate for a causal relation, but right now that is still speculation.

@nigoroll
Member Author

nigoroll commented Mar 28, 2018

I suspect I have seen another incarnation of the suspected issue: it seems that backend issues caused varnish to significantly slow down the acceptor, with ~20k TCP connections in SYN_RECV.
Because in this setup the backends make connections which circle back to varnish (for other URLs), this seems to have caused a vicious circle / deadlock situation, which we could resolve by temporarily adding a return(synth(503)) to vcl_recv until SYN_RECV went back to normal. After switching back to the previous VCL, the situation normalized and the site came back.

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Apr 3, 2018
We already emitted Debug log records for accept failures, but for
all practical purposes this is useless for forensics.

These counters will point directly to the root cause for the most
common issues with sess_fail (accept failures). sess_fail is preserved
as an overall counter as it will most likely be already monitored for
many installations.

Ref	varnishcache#2622
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Apr 3, 2018
Previously, we had zero reporting on the cause of backend connection
errors, which made it close to impossible to diagnose such issues directly.

We now add statistics per connection pool and Debug VSLs.

Ref varnishcache#2622
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Apr 3, 2018
This is similar to the vca pace: Depending on the backend connection
error, it does not make sense to re-try in rapid succession, instead
not attempting the failed connection again for some time will save
resources both locally and remotely, where applicable and should thus
help improve the overall situation.

Should fix or at least mitigate varnishcache#2622
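
For illustration, the holddown described in this commit message can be pictured as a small per-endpoint backoff gate consulted before each connect() attempt. The sketch below is hypothetical: the names, fields and intervals are invented for this issue and it is not the actual varnish-cache implementation.

#include <stdio.h>
#include <time.h>

struct holddown {
	time_t	until;		/* no retry before this point in time */
	double	interval;	/* current backoff in seconds */
};

#define HOLDDOWN_MIN	1.0	/* invented defaults */
#define HOLDDOWN_MAX	60.0

static int
holddown_may_connect(const struct holddown *hd)
{
	return (time(NULL) >= hd->until);
}

static void
holddown_connect_failed(struct holddown *hd)
{
	/* exponential backoff, capped at HOLDDOWN_MAX */
	if (hd->interval < HOLDDOWN_MIN)
		hd->interval = HOLDDOWN_MIN;
	else if (hd->interval * 2 <= HOLDDOWN_MAX)
		hd->interval *= 2;
	hd->until = time(NULL) + (time_t)hd->interval;
}

static void
holddown_connect_ok(struct holddown *hd)
{
	hd->interval = 0;
	hd->until = 0;
}

int
main(void)
{
	struct holddown hd = { 0, 0.0 };

	/* pretend three connect() attempts to one endpoint fail in a row */
	holddown_connect_failed(&hd);
	holddown_connect_failed(&hd);
	holddown_connect_failed(&hd);
	printf("retry allowed now: %d, current backoff: %.0fs\n",
	    holddown_may_connect(&hd), hd.interval);
	holddown_connect_ok(&hd);	/* a later attempt succeeded */
	return (0);
}

Keyed per endpoint (ip + port), such a gate would avoid rapid retry storms while a backend endpoint is failing, along the lines the commit message describes.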
@nigoroll
Member Author

nigoroll commented Apr 4, 2018

I've had #2636 in production since last night.

  • I find the additional statistics really helpful
  • holddown does kick in, but not for fail_eaddrnotavail
  • but we still see the backend->frontend impact, and connections still pile up in SYN_RECV

So there is at least a second issue.

@nigoroll
Member Author

nigoroll commented Apr 4, 2018

The second issue seems to have been a lack of available threads.

@nigoroll
Member Author

nigoroll commented Apr 9, 2018

I've doc-fixed the backend proxy connection limitation.
The remaining issues from this ticket are mostly going to be addressed via #2636.

@nigoroll nigoroll closed this as completed Apr 9, 2018
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 5, 2018
Previously, we had zero stats on the cause of backend connection
errors, which made it close to impossible to diagnose such issues in
retrospect (only via log mining).

We now pass an optional backend vsc to vcp and record errors per
backend.

Open errors are really per vcp entry (ip + port or udc path), which
can be shared amongst backends (and even vcls), but we maintain the
counters per backend (and, consequently, per vcl) for simplicity.  It
should be noted though that errors for shared endpoints affect all
backends using them.

Ref varnishcache#2622
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 5, 2018
This is similar to the vca pace: Depending on the backend connection
error, it does not make sense to re-try in rapid succession, instead
not attempting the failed connection again for some time will save
resources both locally and remotely, where applicable, and should
thus help improve the overall situation.

Fixes varnishcache#2622