proxy backend -> EADDRNOTAVAIL -> probes failing -> all backends down -> site down #2622

Closed
nigoroll opened this issue Mar 20, 2018 · 8 comments

Comments

@nigoroll
Member

I've been bitten badly by a series of issues which I think we need to address. Each of them may not be that bad in isolation, but I think the complete picture is needed to understand their relevance.

Seen on ee1d34d, Linux 3.14.53+, Debian 8.10. Note that the connection pool code has changed since then, but I do not see any semantic differences.

setup

  • cache clustering/sharing is used in a two-varnish setup: requests which are not identified as passes are forwarded to the other varnish based on the backend selection of the shard director.
  • The other varnish is configured as a backend with proxy_header
  • The rate of miss+hitmiss is ~1k/s, so the expected rate of connections to the other varnish is ~0.5k/s
  • 1M max open files
  • <100k client connections

symptoms

  • connections to "the other varnish" are starting to fail (FetchError backend ...: fail)
  • backend probes are starting to fail, until almost all backends are down (FetchError no backend)
  • site goes down

lack of visibility

Analyzing this issue was complicated by the following facts:

  • FetchError does not give information about why the connection failed
  • probes do not log why a connection failed

assured facts

open question

I suspect a causal relation between the failing proxy connections and the fact that backend probes failed, but I cannot prove it:

The hypothesis is that failing backend requests use up file descriptors, for example via a delayed close() in the pool code, but I failed to come up with a plausible explanation.

So I am still looking for either support for the hypothesis, or an alternative hypothesis which would explain why backend probes are failing in the situation documented under assured facts.

TODOs

  • backend proxy connections are problematic as they are due to the TIME_WAIT issue, and, in particular, future scenarios like connecting to an ssl-onloader require connection reuse/pooling (see the sketch after this list).
  • our error messages should point to the exact root cause and should include the relevant errno
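
To make the failure mode in the title concrete: without connection reuse, every backend fetch performs a full connect()/close() cycle, the closing side accumulates sockets in TIME_WAIT, and once the ephemeral port range towards one endpoint is exhausted, connect(2) fails with EADDRNOTAVAIL. The standalone sketch below is not varnish code; the address and port are placeholders and it assumes something is listening there. It merely reproduces the pattern:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	int fd, i;

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(8080);			/* placeholder port */
	inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);	/* placeholder address */

	for (i = 0; i < 100000; i++) {
		fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0) {
			perror("socket");	/* fd exhaustion (EMFILE) would show up here */
			break;
		}
		if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
			/* once all ephemeral ports towards this endpoint sit
			 * in TIME_WAIT, this reports EADDRNOTAVAIL */
			fprintf(stderr, "connect #%d: %s\n", i, strerror(errno));
			close(fd);
			break;
		}
		close(fd);	/* active close -> the local socket enters TIME_WAIT */
	}
	return (0);
}

With Linux's default ephemeral port range of roughly 28k ports and a 60s TIME_WAIT, the ~0.5k connections/s from the setup above would exhaust the range towards a single endpoint in about a minute.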
@Dridi
Member

Dridi commented Mar 21, 2018

> The hypothesis is that failing backend requests use up file descriptors, for example via a delayed close() in the pool code, but I failed to come up with a plausible explanation.

No, there's simply no backend connection pooling when PROXY is involved. The reuse logic needs a lot more work if we want to support that, so currently there's no recycling at all between your two Varnish instances, whether requests fail or not.

Incidentally there's also no pooling at all for probes, by design.

@nigoroll
Member Author

Re @Dridi: both backend connections and probes use VTP.

@Dridi
Member

Dridi commented Mar 21, 2018

VTP or whatever we call it these days with unix domain sockets, I'll need to catch up at some point :)

Probes bypass the recycling code by always opening a new connection, and a backend proxy_header will always neuter recycling with a Connection: close in the bereq (and also request a fresh connection to write the PROXY bits).

edit: maybe not Connection: close, but at least guaranteed by the line you pointed to.

/* The backend connection is closed rather than recycled whenever the
 * response demands a close or the backend has a proxy_header configured;
 * the else branch (not shown) is where the connection can be recycled. */
if (bo->htc->doclose != SC_NULL || bp->proxy_header != 0) {
	VSLb(bo->vsl, SLT_BackendClose, "%d %s", *PFD_Fd(pfd),
	    bp->director->display_name);
	VTP_Close(&pfd);
	AZ(pfd);
	Lck_Lock(&bp->mtx);
} else {

@nigoroll
Member Author

@Dridi yes. As I said, I was suspecting a delayed file descriptor close issue, but could not find evidence. As the failing health check on the backend with too many TIME_WAIT connections coincided with other backends' health checks failing, the code shared by all backends (VTP) is a possible candidate for a causal relation, but right now that is still speculation.

@nigoroll
Member Author

nigoroll commented Mar 28, 2018

I suspect I have seen another incarnation of the suspected issue: it seems that backend issues caused varnish to significantly slow down the acceptor, with ~20k TCP connections in SYN_RECV.
Because in this setup the backends make connections which circle back to varnish (for other URLs), this seems to have caused a vicious circle / deadlock situation, which we could resolve by temporarily adding a return(synth(503)) to vcl_recv until SYN_RECV went back to normal. After switching back to the previous VCL, the situation normalized and the site came back.

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Apr 3, 2018
We already emitted Debug log records for accept failures, but for
all practical purposes this is useless for forensics.

These counters will point directly to the root cause for the most
common issues with sess_fail (accept failures). sess_fail is preserved
as an overall counter as it will most likely be already monitored for
many installations.

Ref	varnishcache#2622
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Apr 3, 2018
Previously, we had zero reporting on the cause of backend connection
errors, which made it close to impossible to diagnose such issues directly.

We now add statistics per connection pool and Debug VSLs.

Ref varnishcache#2622
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Apr 3, 2018
This is similar to the vca pace: Depending on the backend connection
error, it does not make sense to re-try in rapid succession, instead
not attempting the failed connection again for some time will save
resources both locally and remotely, where applicable and should thus
help improve the overall situation.

Should fix or at least mitigate varnishcache#2622
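
For illustration, the holddown described in this commit message can be pictured as a small per-endpoint backoff gate consulted before each connect() attempt. The sketch below is hypothetical: the names, fields and intervals are invented for this issue and it is not the actual varnish-cache implementation.

#include <stdio.h>
#include <time.h>

struct holddown {
	time_t	until;		/* no retry before this point in time */
	double	interval;	/* current backoff in seconds */
};

#define HOLDDOWN_MIN	1.0	/* invented defaults */
#define HOLDDOWN_MAX	60.0

static int
holddown_may_connect(const struct holddown *hd)
{
	return (time(NULL) >= hd->until);
}

static void
holddown_connect_failed(struct holddown *hd)
{
	/* exponential backoff, capped at HOLDDOWN_MAX */
	if (hd->interval < HOLDDOWN_MIN)
		hd->interval = HOLDDOWN_MIN;
	else if (hd->interval * 2 <= HOLDDOWN_MAX)
		hd->interval *= 2;
	hd->until = time(NULL) + (time_t)hd->interval;
}

static void
holddown_connect_ok(struct holddown *hd)
{
	hd->interval = 0;
	hd->until = 0;
}

int
main(void)
{
	struct holddown hd = { 0, 0.0 };

	/* pretend three connect() attempts to one endpoint fail in a row */
	holddown_connect_failed(&hd);
	holddown_connect_failed(&hd);
	holddown_connect_failed(&hd);
	printf("retry allowed now: %d, current backoff: %.0fs\n",
	    holddown_may_connect(&hd), hd.interval);
	holddown_connect_ok(&hd);	/* a later attempt succeeded */
	return (0);
}

Keyed per endpoint (ip + port), such a gate would avoid rapid retry storms while a backend endpoint is failing, along the lines the commit message describes.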
@nigoroll
Member Author

nigoroll commented Apr 4, 2018

I've had #2636 in production since last night.

  • I find the additional statistics really helpful
  • holddown does kick in, but not for fail_eaddrnotavail
  • but we still see the backend->frontend impact, and connections still pile up in SYN_RECV

So there is at least a second issue.

@nigoroll
Member Author

nigoroll commented Apr 4, 2018

The second issue seems to have been a lack of available threads.

@nigoroll
Member Author

nigoroll commented Apr 9, 2018

I've doc-fixed the backend proxy connection limitation.
The remaining issues from this ticket are mostly going to be addressed via #2636.

@nigoroll nigoroll closed this as completed Apr 9, 2018
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 5, 2018
Previously, we had zero stats on the cause of backend connection
errors, which made it close to impossible to diagnose such issues in
retrospect (only via log mining).

We now pass an optional backend vsc to vcp and record errors per
backend.

Open errors are really per vcp entry (ip + port or udc path), which
can be shared amongst backends (and even vcls), but we maintain the
counters per backend (and, consequently, per vcl) for simplicity.  It
should be noted though that errors for shared endpoints affect all
backends using them.

Ref varnishcache#2622
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 5, 2018
This is similar to the vca pace: Depending on the backend connection
error, it does not make sense to re-try in rapid succession, instead
not attempting the failed connection again for some time will save
resources both locally and remotely, where applicable, and should
thus help improve the overall situation.

Fixes varnishcache#2622