Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not blame cache_peer for CONNECT errors #1772

Conversation

rousskov
Copy link
Contributor

@rousskov rousskov commented Apr 2, 2024

ERROR: Connection to [such-and-such-cache_peer] failed
TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.

Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.

Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!

    ERROR: Connection to [such-and-such-cache_peer] failed
    TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.

Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.

### Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!

----

Inspired by SQUID-961-detail-cache-peer-conn-failures-bag51 commits
a93f486 and 3f9a99d.
Copy link
Contributor Author

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments in this review annotate the diff and do not request any code changes.

{
assert(connection);
NoteOutgoingConnectionFailure(connection->getPeer(), error ? error->httpStatus : Http::scNone);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This single call removal is the primary fix. Everything else is related cleanup.

@@ -76,7 +76,7 @@ Security::BlindPeerConnector::noteNegotiationDone(ErrorState *error)
// based on TCP results, SSL results, or both. And the code is probably not
// consistent in this aspect across tunnelling and forwarding modules.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we are testing changes that address the above XXX. They are independent from this fix.

Comment on lines 72 to 74
// TODO: Require callers to detail failures instead of using one (and often
// misleading!) "connection failed" phrase for all of them.
/// noteFailure() helper for handling failures attributed to this peer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we are testing changes that address the above TODO. They are independent from this fix.

Copy link
Contributor

@yadij yadij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting my objection to the existence of NoteOutgoingConnectionFailure() and its matching NoteOutgoingConnectionSuccess(). IMO these inline wrappers have no valid reason for existence and further encourage the bad practice of passing raw-pointers around.

That said, I am not going to stop this needed change going in over that style disagreement.

@yadij yadij added the M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels label Apr 3, 2024
squid-anubis pushed a commit that referenced this pull request Apr 3, 2024
    ERROR: Connection to [such-and-such-cache_peer] failed
    TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.

Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.

### Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!
@squid-anubis squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label Apr 3, 2024
@squid-anubis squid-anubis added M-merged https://github.com/measurement-factory/anubis#pull-request-labels and removed M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels labels Apr 3, 2024
@rousskov
Copy link
Contributor Author

rousskov commented Apr 3, 2024

Noting my objection to the existence of NoteOutgoingConnectionFailure() and its matching NoteOutgoingConnectionSuccess(). IMO these inline wrappers have no valid reason for existence and further encourage the bad practice of passing raw-pointers around.

FWIW, I cannot do anything about the above objection (beyond rejecting it as invalid) because it is based on two false assertions:

  • "These [functions] have not valid reason for existence". In reality, these function eliminate dangerous code [duplication]. If they are removed, the same checks would have to be done in three out of four remaining callers (and we will eventually forget or screw up those tricky checks as code gets refactored).

  • "passing raw-pointers around is bad practice". In reality, passing raw pointers around is usually the right/best solution when the caller needs to pass an optional heap-allocated object to the function that does not store that object. Moreover, wrong taboos like this make Squid development so much harder! We should utilize basic C++ concepts/techniques correctly instead of objecting their use. This is so much more than a "style disagreement"! I have failed to find the right words to develop consensus about this, but I remain open to any discussions that may lead to such consensus.

That said, I am not going to stop this needed change going in

Thank you!

kinkie pushed a commit to kinkie/squid that referenced this pull request Apr 9, 2024
    ERROR: Connection to [such-and-such-cache_peer] failed
    TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.

Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.

### Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!
kinkie pushed a commit to kinkie/squid that referenced this pull request Apr 9, 2024
    ERROR: Connection to [such-and-such-cache_peer] failed
    TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.

Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.

### Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!
kinkie pushed a commit to kinkie/squid that referenced this pull request Apr 14, 2024
    ERROR: Connection to [such-and-such-cache_peer] failed
    TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT

Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.

Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.

### Problem evolution

2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look as applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.

In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.

In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.

Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
M-merged https://github.com/measurement-factory/anubis#pull-request-labels
Projects
None yet
4 participants