-
Notifications
You must be signed in to change notification settings - Fork 497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not blame cache_peer for CONNECT errors #1772
Do not blame cache_peer for CONNECT errors #1772
Conversation
ERROR: Connection to [such-and-such-cache_peer] failed TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT Squid does not alert an admin about (and decrease health level of) a cache_peer that responded with an error to a GET request. Just like GET responses from a cache_peer, CONNECT responses may (and often do!) reflect client or origin server failures. We should not penalize cache_peers (and alert admins) until we can distinguish these frequent client/origin failures from (relatively rare) cache_peer problems. This change absolves cache_peers of CONNECT problems, restoring parity with GETs and restoring v4 behavior changed (probably by accident) in v5. Also removed Http::StatusCode parameter from failure notification functions because it became essentially unused after the primary Http::Tunneler changes. Tunneler was the only source of status code information that (in some cases) used received HTTP response to compute that status code. All other cases extracted that status code from Squid-generated errors. Those errors were arguably never meant to supply status code information for "this failure is not our fault" decision, and they do not supply 4xx status codes driving that decision. ### Problem evolution 2019 commit f5e1794 effectively started blaming cache_peer for all FwdState CONNECT errors. That functionality change was probably accidental, likely influenced by the names of noteConnectFailure() and peerConnectFailed() functions that abbreviated "Connection", making the functions look as applicable to CONNECT failures. Prior to that commit, the functions were never used for CONNECT errors. After it, FwdState started calling peerConnectFailed() for all CONNECT failures. In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as well (by moving that FwdState-only error handling code into Tunneler). The same "accidental functionality change" speculations apply here. In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as folks deploying newer code started complaining about cache_peers getting blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We did not realize that the blaming code itself was an unwanted accident. Now we are getting complaints about cache_peers getting blamed for 502 and 503 CONNECT errors caused by, for example, domain names without IPs: As these CONNECT error responses are propagated from parent to child caches, every child cache in the chain logs ERRORs and every cache_peer in the chain gets its health counter decreased! ---- Inspired by SQUID-961-detail-cache-peer-conn-failures-bag51 commits a93f486 and 3f9a99d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments in this review annotate the diff and do not request any code changes.
{ | ||
assert(connection); | ||
NoteOutgoingConnectionFailure(connection->getPeer(), error ? error->httpStatus : Http::scNone); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This single call removal is the primary fix. Everything else is related cleanup.
@@ -76,7 +76,7 @@ Security::BlindPeerConnector::noteNegotiationDone(ErrorState *error) | |||
// based on TCP results, SSL results, or both. And the code is probably not | |||
// consistent in this aspect across tunnelling and forwarding modules. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, we are testing changes that address the above XXX. They are independent from this fix.
// TODO: Require callers to detail failures instead of using one (and often | ||
// misleading!) "connection failed" phrase for all of them. | ||
/// noteFailure() helper for handling failures attributed to this peer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, we are testing changes that address the above TODO. They are independent from this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting my objection to the existence of NoteOutgoingConnectionFailure()
and its matching NoteOutgoingConnectionSuccess()
. IMO these inline wrappers have no valid reason for existence and further encourage the bad practice of passing raw-pointers around.
That said, I am not going to stop this needed change going in over that style disagreement.
ERROR: Connection to [such-and-such-cache_peer] failed TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT Squid does not alert an admin about (and decrease health level of) a cache_peer that responded with an error to a GET request. Just like GET responses from a cache_peer, CONNECT responses may (and often do!) reflect client or origin server failures. We should not penalize cache_peers (and alert admins) until we can distinguish these frequent client/origin failures from (relatively rare) cache_peer problems. This change absolves cache_peers of CONNECT problems, restoring parity with GETs and restoring v4 behavior changed (probably by accident) in v5. Also removed Http::StatusCode parameter from failure notification functions because it became essentially unused after the primary Http::Tunneler changes. Tunneler was the only source of status code information that (in some cases) used received HTTP response to compute that status code. All other cases extracted that status code from Squid-generated errors. Those errors were arguably never meant to supply status code information for "this failure is not our fault" decision, and they do not supply 4xx status codes driving that decision. ### Problem evolution 2019 commit f5e1794 effectively started blaming cache_peer for all FwdState CONNECT errors. That functionality change was probably accidental, likely influenced by the names of noteConnectFailure() and peerConnectFailed() functions that abbreviated "Connection", making the functions look as applicable to CONNECT failures. Prior to that commit, the functions were never used for CONNECT errors. After it, FwdState started calling peerConnectFailed() for all CONNECT failures. In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as well (by moving that FwdState-only error handling code into Tunneler). The same "accidental functionality change" speculations apply here. In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as folks deploying newer code started complaining about cache_peers getting blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We did not realize that the blaming code itself was an unwanted accident. Now we are getting complaints about cache_peers getting blamed for 502 and 503 CONNECT errors caused by, for example, domain names without IPs: As these CONNECT error responses are propagated from parent to child caches, every child cache in the chain logs ERRORs and every cache_peer in the chain gets its health counter decreased!
FWIW, I cannot do anything about the above objection (beyond rejecting it as invalid) because it is based on two false assertions:
Thank you! |
ERROR: Connection to [such-and-such-cache_peer] failed TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT Squid does not alert an admin about (and decrease health level of) a cache_peer that responded with an error to a GET request. Just like GET responses from a cache_peer, CONNECT responses may (and often do!) reflect client or origin server failures. We should not penalize cache_peers (and alert admins) until we can distinguish these frequent client/origin failures from (relatively rare) cache_peer problems. This change absolves cache_peers of CONNECT problems, restoring parity with GETs and restoring v4 behavior changed (probably by accident) in v5. Also removed Http::StatusCode parameter from failure notification functions because it became essentially unused after the primary Http::Tunneler changes. Tunneler was the only source of status code information that (in some cases) used received HTTP response to compute that status code. All other cases extracted that status code from Squid-generated errors. Those errors were arguably never meant to supply status code information for "this failure is not our fault" decision, and they do not supply 4xx status codes driving that decision. ### Problem evolution 2019 commit f5e1794 effectively started blaming cache_peer for all FwdState CONNECT errors. That functionality change was probably accidental, likely influenced by the names of noteConnectFailure() and peerConnectFailed() functions that abbreviated "Connection", making the functions look as applicable to CONNECT failures. Prior to that commit, the functions were never used for CONNECT errors. After it, FwdState started calling peerConnectFailed() for all CONNECT failures. In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as well (by moving that FwdState-only error handling code into Tunneler). The same "accidental functionality change" speculations apply here. In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as folks deploying newer code started complaining about cache_peers getting blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We did not realize that the blaming code itself was an unwanted accident. Now we are getting complaints about cache_peers getting blamed for 502 and 503 CONNECT errors caused by, for example, domain names without IPs: As these CONNECT error responses are propagated from parent to child caches, every child cache in the chain logs ERRORs and every cache_peer in the chain gets its health counter decreased!
ERROR: Connection to [such-and-such-cache_peer] failed TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT Squid does not alert an admin about (and decrease health level of) a cache_peer that responded with an error to a GET request. Just like GET responses from a cache_peer, CONNECT responses may (and often do!) reflect client or origin server failures. We should not penalize cache_peers (and alert admins) until we can distinguish these frequent client/origin failures from (relatively rare) cache_peer problems. This change absolves cache_peers of CONNECT problems, restoring parity with GETs and restoring v4 behavior changed (probably by accident) in v5. Also removed Http::StatusCode parameter from failure notification functions because it became essentially unused after the primary Http::Tunneler changes. Tunneler was the only source of status code information that (in some cases) used received HTTP response to compute that status code. All other cases extracted that status code from Squid-generated errors. Those errors were arguably never meant to supply status code information for "this failure is not our fault" decision, and they do not supply 4xx status codes driving that decision. ### Problem evolution 2019 commit f5e1794 effectively started blaming cache_peer for all FwdState CONNECT errors. That functionality change was probably accidental, likely influenced by the names of noteConnectFailure() and peerConnectFailed() functions that abbreviated "Connection", making the functions look as applicable to CONNECT failures. Prior to that commit, the functions were never used for CONNECT errors. After it, FwdState started calling peerConnectFailed() for all CONNECT failures. In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as well (by moving that FwdState-only error handling code into Tunneler). The same "accidental functionality change" speculations apply here. In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as folks deploying newer code started complaining about cache_peers getting blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We did not realize that the blaming code itself was an unwanted accident. Now we are getting complaints about cache_peers getting blamed for 502 and 503 CONNECT errors caused by, for example, domain names without IPs: As these CONNECT error responses are propagated from parent to child caches, every child cache in the chain logs ERRORs and every cache_peer in the chain gets its health counter decreased!
ERROR: Connection to [such-and-such-cache_peer] failed TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT Squid does not alert an admin about (and decrease health level of) a cache_peer that responded with an error to a GET request. Just like GET responses from a cache_peer, CONNECT responses may (and often do!) reflect client or origin server failures. We should not penalize cache_peers (and alert admins) until we can distinguish these frequent client/origin failures from (relatively rare) cache_peer problems. This change absolves cache_peers of CONNECT problems, restoring parity with GETs and restoring v4 behavior changed (probably by accident) in v5. Also removed Http::StatusCode parameter from failure notification functions because it became essentially unused after the primary Http::Tunneler changes. Tunneler was the only source of status code information that (in some cases) used received HTTP response to compute that status code. All other cases extracted that status code from Squid-generated errors. Those errors were arguably never meant to supply status code information for "this failure is not our fault" decision, and they do not supply 4xx status codes driving that decision. ### Problem evolution 2019 commit f5e1794 effectively started blaming cache_peer for all FwdState CONNECT errors. That functionality change was probably accidental, likely influenced by the names of noteConnectFailure() and peerConnectFailed() functions that abbreviated "Connection", making the functions look as applicable to CONNECT failures. Prior to that commit, the functions were never used for CONNECT errors. After it, FwdState started calling peerConnectFailed() for all CONNECT failures. In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as well (by moving that FwdState-only error handling code into Tunneler). The same "accidental functionality change" speculations apply here. In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as folks deploying newer code started complaining about cache_peers getting blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We did not realize that the blaming code itself was an unwanted accident. Now we are getting complaints about cache_peers getting blamed for 502 and 503 CONNECT errors caused by, for example, domain names without IPs: As these CONNECT error responses are propagated from parent to child caches, every child cache in the chain logs ERRORs and every cache_peer in the chain gets its health counter decreased!
Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.
Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.
Problem evolution
2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.
In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.
In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.
Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!