Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Bug 5055: FATAL FwdState::noteDestinationsEnd exception: opening (#967)
* Bug 5055: FATAL FwdState::noteDestinationsEnd exception: opening The bug was caused by commit 25b0ce4. Other known symptoms are: assertion failed: store.cc:1793: "isEmpty()" assertion failed: FwdState.cc:501: "serverConnection() == conn" assertion failed: FwdState.cc:1037: "!opening()" This change has several overlapping parts. Unfortunately, merging individual parts is both difficult and likely to cause crashes. ## Part 1: Bug 5055. FwdState used to check serverConn to decide whether to open a connection to forward the request. Since commit 25b0ce4, a nil serverConn pointer no longer implies that a new connection should be opened: FwdState helper jobs may already be working on preparing an existing open connection (e.g., sending a CONNECT request or negotiating encryption). Bad serverConn checks in both FwdState::noteDestination() and FwdState::noteDestinationsEnd() methods led to extra connectStart() calls creating two conflicting concurrent helper jobs. To fix this, we replaced direct serverConn inspection with a usingDestination() call which also checks whether we are waiting for a helper job. Testing that fix exposed another set of bugs: The helper job pointers or in-job connections left stale/set after forwarding failures. The changes described below addressed (most of) those problems. ## Part 2: Connection establishing helper jobs and their callbacks A proper fix for Bug 5055 required answering a difficult question: When should a dying job call its callbacks? We only found one answer which required cooperation from the job creator and led to the following rules: * AsyncJob destructors must not call any callbacks. * AsyncJob::swanSong() is responsible for async-calling any remaining (i.e. set, uncalled, and uncancelled) callbacks. * AsyncJob::swanSong() is called (only) for started jobs. * AsyncJob destructing sequence should validate that no callbacks remain uncalled for started jobs. ... where an AsyncJob x is considered "started" if AsyncJob::Start(x) has returned without throwing. A new JobWait class helps job creators follow these rules while keeping track on in-progress helper jobs and killing no-longer-needed helpers. Also fixed very similar bugs in tunnel.cc code. ## Part 3: ConnOpener fixes 1. Many ConnOpener users are written to keep a ConnectionPointer to the destination given to ConnOpener. This means that their connection magically opens when ConnOpener successfully connects, before ConnOpener has a chance to notify the user about the changes. Having multiple concurrent connection owners is always dangerous, and the user cannot even have a close handler registered for its now-open connection. When something happens to ConnOpener or its answer, the user job may be in trouble. Now, ConnOpener callers no longer pass Connection objects they own, cloning them as needed. That adjustment required adjustment 2: 2. Refactored ConnOpener users to stop assuming that the answer contains a pointer to their connection object. After adjustment 1 above, it does not. HappyConnOpener relied on that assumption quite a bit so we had to refactor to use two custom callback methods instead of one with a complicated if-statement distinguishing prime from spare attempts. This refactoring is an overall improvement because it simplifies the code. Other ConnOpener users just needed to remove a few no longer valid paranoid assertions/Musts. 3. ConnOpener users were forced to remember to close params.conn when processing negative answers. Some, naturally, forgot, triggering warnings about orphaned connections (e.g., Ident and TcpLogger). ConnOpener now closes its open connection before sending a negative answer. 4. ConnOpener would trigger orphan connection warnings if the job ended after opening the connection but without supplying the connection to the requestor (e.g., because the requestor has gone away). Now, ConnOpener explicitly closes its open connection if it has not been sent to the requestor. Also fixed Comm::ConnOpener::cleanFd() debugging that was incorrectly saying that the method closes the temporary descriptor. Also fixed ConnOpener callback's syncWithComm(): The stale CommConnectCbParams override was testing unused (i.e. always negative) CommConnectCbParams::fd and was trying to cancel the callback that most (possibly all) recipients rely on: ConnOpener users expect a negative answer rather than no answer at all. Also, after comparing the needs of two old/existing and a temporary added ("clone everything") Connection cloning method callers, we decided there is no need to maintain three different methods. All existing callers should be fine with a single method because none of them suffers from "extra" copying of members that others need. Right now, the new cloneProfile() method copies everything except FD and a few special-purpose members (with documented reasons for not copying). Also added Comm::Connection::cloneDestinationDetails() debugging to simplify tracking dependencies between half-baked Connection objects carrying destination/flags/other metadata and open Connection objects created by ConnOpener using that metadata (which are later delivered to ConnOpener users and, in some cases, replace those half-baked connections mentioned earlier. Long-term, we need to find a better way to express these and other Connection states/stages than comments and debugging messages. ## Part 4: Comm::Connection closure callbacks We improved many closure callbacks to make sure (to the extent possible) that Connection and other objects are in sync with Comm. There are lots of small bugs, inconsistencies, and other problems in Connection closure handlers. It is not clear whether any of those problems could result in serious runtime errors or leaks. In theory, the rest of the code could neutralize their negative side effects. However, even in that case, it was just a matter of time before the next bug will bite us due to stale Connection::fd and such. These changes themselves carry elevated risk, but they get us closer to reliable code as far as Connection maintenance is concerned; otherwise, we will keep chasing their deadly side effects. Long-term, all these manual efforts to keep things in sync should become unnecessary with the introduction of appropriate Connection ownership APIs that automatically maintain the corresponding environments (TODO). ## Part 5: Other notable improvements in the adjusted code Improved Security::PeerConnector::serverConn and Http::Tunneler::connection management, especially when sending negative answers. When sending a negative answer, we would set answer().conn to an open connection, async-send that answer, and then hurry to close the connection using our pointer to the shared Connection object. If everything went according to the plan, the recipient would get a non-nil but closed Connection object. Now, a negative answer simply means no connection at all. Same for a tunneled answer. Refactored ICAP connection-establishing code to to delay Connection ownership until the ICAP connection is fully ready. This change addresses primary Connection ownership concerns (as they apply to this ICAP code) except orphaning of the temporary Connection object by helper job start exceptions (now an explicit XXX). For example, the transaction no longer shares a Connection object with ConnOpener and IcapPeerConnector jobs. Probably fixed a bug where PeerConnector::negotiate() assumed that a sslFinalized() does not return true after callBack(). It may (e.g., when CertValidationHelper::Submit() throws). Same for PeekingPeerConnector::checkForPeekAndSpliceMatched(). Fixed FwdState::advanceDestination() bug that did not save ERR_GATEWAY_FAILURE details and "lost" the address of that failed destination, making it unavailable to future retries (if any). Polished PeerPoolMgr, Ident, and Gopher code to be able to fix similar job callback and connection management issues. Polished AsyncJob::Start() API. Start() used to return a job pointer, but that was a bad idea: * It implies that a failed Start() will return a nil pointer, and that the caller should check the result. Neither is true. * It encourages callers to dereference the returned pointer to further adjust the job. That technically works (today) but violates the rules of communicating with an async job. The Start() method is the boundary after which the job is deemed asynchronous. Also removed old "and wait for..." post-Start() comments because the code itself became clear enough, and those comments were becoming increasingly stale (because they duplicated the callback names above them). Fix Tunneler and PeerConnector handling of last-resort callback requirements. Events like handleStopRequest() and callException() stop the job but should not be reported as a BUG (e.g., it would be up to the callException() to decide how to report the caught exception). There might (or there will) be other, similar cases where the job is stopped prematurely for some non-BUG reason beyond swanSong() knowledge. The existence of non-bug cases does not mean there could be no bugs worth reporting here, but until they can be identified more reliably than all these benign/irrelevant cases, reporting no BUGs is a (much) lesser evil. TODO: Revise AsyncJob::doneAll(). Many of its overrides are written to check for both positive (i.e. mission accomplished) and negative (i.e. mission cancelled or cannot be accomplished) conditions, but the latter is usually unnecessary, especially after we added handleStopRequest() API to properly support external job cancellation events. Many doneAll() overrides can probably be greatly simplified. ---- Cherry picked SQUID-568-premature-serverconn-use-v5 commit 22b5f78. * fixup: Cherry-picked SQUID-568-premature-serverconn-use PR-time fixes In git log order: * e64a6c1: Undone an out-of-scope change and added a missing 'auto' * aeaf83d: Fixed an 'unused parameter' error * f49d009: fixup: No explicit destructors with implicit copying methods * c30c37f: Removed an unnecessary explicit copy constructor * 012f5ec: Excluded Connection::rfc931 from cloning * 366c78a: ICAP: do not set connect_timeout on the established conn... This branch is now in sync with SQUID-568-premature-serverconn-use (S) commit e64a6c1 (except for official changes merged from master/v6 into S closer to the end of PR 877 work (i.e. S' merge commit 0a7432a). * Fix FATAL ServiceRep::putConnection exception: theBusyConns > 0 FATAL: check failed: theBusyConns > 0 exception location: ServiceRep.cc(163) putConnection Since master/v6 commit 2b6b1bc, a timeout on a ready-to-shovel Squid-service ICAP connection was decrementing theBusyConns level one extra time because Adaptation::Icap::Xaction::noteCommTimedout() started calling both noteConnectionFailed() and closeConnection(). Depending on the actual theBusyConns level, the extra decrement could result in FATAL errors later, when putConnection() was called (for a different ICAP transaction) with zero theBusyConns in an exception-unprotected context. Throughout these changes, Xaction still counts the above timeouts as a service failure. That is done by calling ServiceRep::noteFailure() from Xaction::callException(), including in timeout cases described above. ---- Cherry-picked master/v6 commit a8ac892.
- Loading branch information