Do not blame cache_peer for CONNECT errors #1772

rousskov · 2024-04-02T17:59:34Z

ERROR: Connection to [such-and-such-cache_peer] failed
TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.
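
The policy described above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical (`ToyPeer`, `Outcome`, `noteOutcome()`, and the health counter layout are illustration-only names, not Squid code), and the assumption that only transport-level failures still count against a peer is a simplification made for this example:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a cache_peer's health state; not Squid's CachePeer.
struct ToyPeer {
    std::string name;
    int healthLeft;        // failures remaining before the peer is considered dead
    int alertsLogged = 0;  // number of "Connection to ... failed" style alerts

    explicit ToyPeer(std::string aName, const int limit = 10):
        name(std::move(aName)), healthLeft(limit)
    {
        assert(limit > 0);
    }

    // Record one failure attributed to this peer.
    void noteFailure() {
        ++alertsLogged;
        if (healthLeft > 0)
            --healthLeft;
    }

    bool alive() const { return healthLeft > 0; }
};

// Assumed, simplified classification of what happened when using the peer.
enum class Outcome {
    transportFailure, // the connection to the peer could not be established
    errorResponse     // the peer answered a GET or CONNECT with an HTTP error
};

// An error response (GET or CONNECT alike) may reflect a client or origin
// problem, so it is not held against the peer in this sketch.
inline void noteOutcome(ToyPeer &peer, const Outcome outcome) {
    if (outcome == Outcome::transportFailure)
        peer.noteFailure();
}
```

With this model, a 503 response to `CONNECT nxdomain.test:443` maps to `Outcome::errorResponse` and leaves both `healthLeft` and `alertsLogged` untouched, matching how GET error responses are treated.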

Also removed the Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used the received HTTP response to
compute that status code. All other cases extracted that status code
from Squid-generated errors. Those errors were arguably never meant to
supply status code information for the "this failure is not our fault"
decision, and they do not supply the 4xx status codes driving that
decision.

Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of the noteConnectFailure()
and peerConnectFailed() functions that abbreviated "Connection", making
the functions look applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs.
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!

Inspired by SQUID-961-detail-cache-peer-conn-failures-bag51 commits a93f486 and 3f9a99d.

rousskov

The comments in this review annotate the diff and do not request any code changes.

rousskov · 2024-04-02T18:01:17Z

src/clients/HttpTunneler.cc

 {
    assert(connection);
-    NoteOutgoingConnectionFailure(connection->getPeer(), error ? error->httpStatus : Http::scNone);


This single call removal is the primary fix. Everything else is related cleanup.

rousskov · 2024-04-02T18:02:44Z

src/security/BlindPeerConnector.cc

@@ -76,7 +76,7 @@ Security::BlindPeerConnector::noteNegotiationDone(ErrorState *error)
        // based on TCP results, SSL results, or both. And the code is probably not
        // consistent in this aspect across tunnelling and forwarding modules.


FWIW, we are testing changes that address the above XXX. They are independent from this fix.

rousskov · 2024-04-02T18:03:22Z

src/CachePeer.cc

 // TODO: Require callers to detail failures instead of using one (and often
 // misleading!) "connection failed" phrase for all of them.
 /// noteFailure() helper for handling failures attributed to this peer


FWIW, we are testing changes that address the above TODO. They are independent from this fix.

yadij

Noting my objection to the existence of NoteOutgoingConnectionFailure() and its matching NoteOutgoingConnectionSuccess(). IMO these inline wrappers have no valid reason for existence and further encourage the bad practice of passing raw-pointers around.

That said, I am not going to stop this needed change going in over that style disagreement.

rousskov · 2024-04-03T13:46:42Z

> Noting my objection to the existence of NoteOutgoingConnectionFailure() and its matching NoteOutgoingConnectionSuccess(). IMO these inline wrappers have no valid reason for existence and further encourage the bad practice of passing raw-pointers around.

FWIW, I cannot do anything about the above objection (beyond rejecting it as invalid) because it is based on two false assertions:

1. "These [functions] have no valid reason for existence". In reality, these functions eliminate dangerous code [duplication]. If they are removed, the same checks would have to be done in three out of four remaining callers (and we will eventually forget or screw up those tricky checks as code gets refactored).
2. "passing raw-pointers around is bad practice". In reality, passing raw pointers around is usually the right/best solution when the caller needs to pass an optional heap-allocated object to a function that does not store that object. Moreover, wrong taboos like this make Squid development so much harder! We should utilize basic C++ concepts/techniques correctly instead of objecting to their use. This is so much more than a "style disagreement"! I have failed to find the right words to develop consensus about this, but I remain open to any discussions that may lead to such consensus.
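
To make both points concrete, here is a minimal sketch of the pattern being defended. It is not the actual Squid code: `SketchPeer` and `NoteFailureSketch()` are hypothetical names, and the single null check stands in for the "tricky checks" mentioned above, which this thread does not spell out.

```cpp
#include <cassert>

// Hypothetical peer type used only for this illustration.
struct SketchPeer {
    int failures = 0;
    void noteFailure() { ++failures; }
};

// The peer is optional: nullptr means the connection did not use a peer.
// The function only inspects the object and never stores the pointer, so a
// raw pointer states the contract directly, with no ownership implied.
inline void NoteFailureSketch(SketchPeer * const peer) {
    if (peer) // the check that each caller would otherwise have to repeat
        peer->noteFailure();
}

// Usage: direct and peered connections share a single code path.
inline void usageSketch() {
    SketchPeer parent;
    NoteFailureSketch(nullptr); // direct connection: nobody to blame
    NoteFailureSketch(&parent); // peered connection: one failure recorded
    assert(parent.failures == 1);
}
```

Keeping the check inside the wrapper means a caller cannot forget it, which is the duplication argument made in the first item.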

> That said, I am not going to stop this needed change going in

Thank you!

kinkie approved these changes Apr 3, 2024

yadij approved these changes Apr 3, 2024

yadij added the M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels label Apr 3, 2024

squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label Apr 3, 2024

squid-anubis closed this Apr 3, 2024