proxy backend -> EADDRNOTAVAIL -> probes failing -> all backends down -> site down #2622
Comments
No, there's simply no backend connection pooling when PROXY is involved. The reuse logic needs a lot more work if we want to support that, so currently there's no recycling at all between your two Varnish instances, whether requests fail or not. Incidentally, there's also no pooling at all for probes, by design.
Re @Dridi: both backend connections and probes use VTP
VTP, or whatever we call it these days with unix domain sockets; I'll need to catch up at some point :) Probes bypass the recycling code by always opening a new connection, and a backend proxy_header will always neuter recycling (edit: maybe not, see varnish-cache/bin/varnishd/cache/cache_backend.c lines 184 to 190 at 5e2b0d8).
@Dridi yes. As I said, I suspected a delayed file descriptor close issue, but could not find evidence. As the failing health check on the backend with too many TIME_WAIT connections coincided with other backends' health checks failing, the code shared by all backends (VTP) is a possible candidate for a causal relation, but right now that is still speculation.
I suspect I have seen another incarnation of the suspected issue: it seems that backend issues caused varnish to significantly slow down the acceptor, with ~20k tcp connections in
We already emitted Debug log records for accept failures, but for all practical purposes this is useless for forensics. These counters will point directly to the root cause of the most common issues with sess_fail (accept failures). sess_fail is preserved as an overall counter, as it is most likely already monitored in many installations. Ref varnishcache#2622
Previously, we had zero reporting on the cause of backend connection errors, which made it close to impossible to diagnose such issues directly. We now add statistics per connection pool and Debug VSLs. Ref varnishcache#2622
This is similar to the vca pace: depending on the backend connection error, it does not make sense to retry in rapid succession; instead, not attempting the failed connection again for some time saves resources both locally and, where applicable, remotely, and should thus help improve the overall situation. Should fix or at least mitigate varnishcache#2622
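To illustrate the idea, a minimal sketch of error-dependent backoff in plain C; the names (`conn_pace`, `pace_update`, `pace_wait`) and the chosen delays are hypothetical, not the actual varnishd implementation:

```c
#include <errno.h>
#include <time.h>

struct conn_pace {
	double delay;		/* current hold-off in seconds */
	double max_delay;	/* upper bound for the backoff */
};

/*
 * After a connect(2) attempt: back off harder for errors that a rapid
 * retry cannot fix (resource exhaustion), reset the pace on success.
 */
void
pace_update(struct conn_pace *p, int err)
{
	if (err == 0) {
		p->delay = 0.0;
		return;
	}
	switch (err) {
	case EADDRNOTAVAIL:	/* ephemeral ports exhausted */
	case EMFILE:		/* process out of file descriptors */
	case ENFILE:		/* system out of file descriptors */
		p->delay = p->delay > 0.0 ? p->delay * 2.0 : 1.0;
		break;
	default:		/* ECONNREFUSED, ETIMEDOUT, ... */
		p->delay = p->delay > 0.0 ? p->delay * 2.0 : 0.1;
		break;
	}
	if (p->delay > p->max_delay)
		p->delay = p->max_delay;
}

/* Before the next connection attempt: sleep out the current pace. */
void
pace_wait(const struct conn_pace *p)
{
	struct timespec ts;

	if (p->delay <= 0.0)
		return;
	ts.tv_sec = (time_t)p->delay;
	ts.tv_nsec = (long)((p->delay - (double)ts.tv_sec) * 1e9);
	nanosleep(&ts, NULL);
}
```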
got #2636 in production since last night
So there is at least a second issue
the second issue seems to have been a lack of available threads
I've docfixed the backend proxy connection limitation.
Previously, we had zero stats on the cause of backend connection errors, which made it close to impossible to diagnose such issues in retrospect (only via log mining). We now pass an optional backend vsc to vcp and record errors per backend. Open errors are really per vcp entry (ip + port or uds path), which can be shared amongst backends (and even vcls), but we maintain the counters per backend (and, consequently, per vcl) for simplicity. It should be noted, though, that errors for shared endpoints affect all backends using them. Ref varnishcache#2622
This is similar to the vca pace: depending on the backend connection error, it does not make sense to retry in rapid succession; instead, not attempting the failed connection again for some time saves resources both locally and, where applicable, remotely, and should thus help improve the overall situation. Fixes varnishcache#2622
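As an illustration of the per-pool/per-backend error accounting these commits describe, a hedged sketch in plain C; the struct and function names here are made up for the example, while the real code records VSC counters and Debug VSL records:

```c
#include <errno.h>

/* Hypothetical per-pool counters; varnishd keeps these in a VSC. */
struct pool_err_stats {
	unsigned long fail_eaddrnotavail;	/* local port exhaustion */
	unsigned long fail_econnrefused;	/* backend not listening */
	unsigned long fail_etimedout;		/* backend unreachable */
	unsigned long fail_other;
};

/*
 * Classify the errno from a failed connect(2) so that monitoring can
 * see *why* fetches fail, instead of a bare "FetchError ... fail".
 */
void
pool_count_err(struct pool_err_stats *st, int err)
{
	switch (err) {
	case EADDRNOTAVAIL:
		st->fail_eaddrnotavail++;
		break;
	case ECONNREFUSED:
		st->fail_econnrefused++;
		break;
	case ETIMEDOUT:
		st->fail_etimedout++;
		break;
	default:
		st->fail_other++;
		break;
	}
}
```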
I've been bitten badly by a series of issues which I think we need to address. Each of them may not be that bad in isolation, but I think the whole picture matters for understanding their impact.
Seen on ee1d34d, Linux 3.14.53+, Debian 8.10. Note that the connection pool code has changed since then, but I do not see semantic differences.
setup

two Varnish instances, with the first connecting to the second as a backend configured with proxy_header (PROXY protocol)

symptoms

- FetchError backend ...: fail
- FetchError no backend

lack of visibility
analyzing this issue was complicated by the fact that FetchError does not give information about why the connection failed

assured facts
- the connection to the proxy backend ends up in TIME_WAIT because it is closed by the client
- TIME_WAIT lasts for 120s and, with an ip_local_port_range of ~28k ports, connections will fail when a request rate of ~235/s is exceeded (28000 ports / 120s ≈ 233 connections/s)

open question
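To make the port-exhaustion mechanics concrete, a minimal standalone sketch in plain C (not Varnish code; the 127.0.0.1:8080 target is an assumption): the side that calls close() first owns the TIME_WAIT state, so a client that actively closes every connection burns one ephemeral port per request for ~120s, and eventually connect(2) fails with EADDRNOTAVAIL:

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int
main(void)
{
	struct sockaddr_in sa;

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(8080);	/* assumed backend port */
	inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);

	for (;;) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0)
			break;
		if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
			/* with ~28k ports / 120s this trips at ~235/s */
			fprintf(stderr, "connect: %s\n", strerror(errno));
			close(fd);
			break;
		}
		close(fd);	/* active close: socket enters TIME_WAIT */
	}
	return (0);
}
```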
I suspect a causal relation between the failing proxy connections and the fact that backend probes failed, but I cannot prove it:
The hypothesis is that failing backend requests use up file descriptors, for example via a delayed close() in the pool code, but I failed to come up with a plausible explanation.
So I am still looking either for support for the hypothesis or for an alternative hypothesis which would explain why backend probes fail in the situation documented under assured facts.
TODOs

- the TIME_WAIT issue and, in particular, future scenarios like connecting to an ssl-onloader require connection reuse/pooling
- the errno of failed backend connections should be surfaced (see lack of visibility above)