Remove serialized HTTP headers from storeClientCopy() (#1335)

Do not send serialized HTTP response header bytes in storeClientCopy() answers. Ignore serialized header size when calling storeClientCopy(). This complex change adjusts storeClientCopy() API to addresses several related problems with storeClientCopy() and its callers. The sections below summarize storeClientCopy() changes and then move on to callers. Squid incorrectly assumed that serialized HTTP response headers are read from disk in a single storeRead() request. In reality, many situations lead to store_client::readBody() receiving partial HTTP headers, resulting in parseCharBuf() failure and a level-0 cache.log message: Could not parse headers from on disk object Inadequate handling of this failure resulted in a variety of problems. Squid now accumulates storeRead() results to parse larger headers and also handles parsing failures better, but we could not just stop there. With the storeRead() accumulation in place, it is no longer possible to send parsed serialized HTTP headers to storeClientCopy() callers because those callers do not provide enough buffer space to fit larger headers. Increasing caller buffer capacity does not work well because the actual size of the serialized header is unknown in advance and may be quite large. Always allocating large buffers "just in case" is bad for performance. Finally, larger buffers may jeopardize hard-to-find code that uses hard-coded 4KB buffers without using HTTP_REQBUF_SZ macro. Fortunately, storeClientCopy() callers either do not care about serialized HTTP response headers or should not care about them! The API forced callers to deal with serialized headers, but callers could (and some did) just use the parsed headers available in the corresponding MemObject. With this API change, storeClientCopy() callers no longer receive serialized headers and do not need to parse or skip them. Consequently, callers also do not need to account for response headers size when computing offsets for subsequent storeClientCopy() requests. Restricting storeClientCopy() API to HTTP _body_ bytes removed a lot of problematic caller code. Caller changes are summarized further below. A similar HTTP response header parsing problem existed in shared memory cache code. That code was actually aware that headers may span multiple cache slices but incorrectly assumed that httpMsgParseStep() accumulates input as needed (to make another parsing "step"). It does not. Large response headers cached in shared memory triggered a level-1 message: Corrupted mem-cached headers: e:... Fixed MemStore code now accumulates serialized HTTP response headers as needed to parse them, sharing high-level parsing code with store_client. Old clientReplyContext methods worked hard to skip received serialized HTTP headers. The code contained dangerous and often complex/unreadable manipulation of various raw offsets and buffer pointers, aggravated by the perceived need to save/restore those offsets across asynchronous checks (see below). That header skipping code is gone now. Several stale and misleading comments related to Store buffers management were also removed or updated. We replaced reqofs/reqsize with simpler/safer lastStreamBufferedBytes, while becoming more consistent with that "cached" info invalidation. We still need this info to resume HTTP body processing after asynchronous http_reply_access checks and cache hit validations, but we no longer save/restore this info for hit validation: No need to save/restore information about the buffer that hit validation does not use and must never touch! The API change also moved from-Store StoreIOBuffer usage closer to StoreIOBuffers manipulated by Clients Streams code. Buffers in both categories now contain just the body bytes, and both now treat zero length as EOF only _after_ processing the response headers. These changes improve overall code quality, but this code path and these changes still suffer from utterly unsafe legacy interfaces like StoreIOBuffer and clientStreamNode. We cannot rely on the compiler to check our work. The risk of these changes exposing/causing bugs is high. asHandleReply() expected WHOIS response body bytes where serialized HTTP headers were! The code also had multiple problems typical for manually written C parsers dealing with raw input buffers. Now replaced with a Tokenizer-based code. To skip received HTTP response headers, peerDigestHandleReply() helper functions called headersEnd() on the received buffer. Twice. We have now merged those two parsing helper functions into one (that just checks the already parsed headers). This merger preserved "304s must come with fetch->pd->cd" logic that was hidden/spread across those two functions. urnHandleReply() re-parsed received HTTP response headers. We left its HTTP body parsing code unchanged except for polishing NUL-termination. netdbExchangeHandleReply() re-parsed received HTTP response headers to find where they end (via headersEnd()). We improved handing of corner cases and replaced some "tricky bits" code, reusing the new Store::ParsingBuffer class. The net_db record parsing code is unchanged. Mgr::StoreToCommWriter::noteStoreCopied() is a very special case. It actually worked OK because, unlike all other storeClientCopy() callers, this code does not get serialized HTTP headers from Store: The code adding bytes to the corresponding StoreEntry does not write serialized HTTP headers at all. StoreToCommWriter is used to deliver kid-specific pieces of an HTTP body of an SMP cache manager response. The HTTP headers of that response are handled elsewhere. We left this code unchanged, but the existence of the special no-headers case does complicate storeClientCopy() API documentation, implementation, and understanding. Co-authored-by: Eduard Bagdasaryan <eduard.bagdasaryan@measurement-factory.com>
squid-cache · Oct 11, 2023 · a27bf4b · a27bf4b
1 parent 29805b0
commit a27bf4b
Show file tree

Hide file tree

Showing 24 changed files with 1,090 additions and 727 deletions.
diff --git a/src/HttpReply.cc b/src/HttpReply.cc
@@ -21,7 +21,9 @@
 #include "HttpReply.h"
 #include "HttpRequest.h"
 #include "MemBuf.h"
+#include "sbuf/Stream.h"
 #include "SquidConfig.h"
+#include "SquidMath.h"
 #include "Store.h"
 #include "StrList.h"
 
@@ -456,6 +458,38 @@ HttpReply::parseFirstLine(const char *blk_start, const char *blk_end)
     return sline.parse(protoPrefix, blk_start, blk_end);
 }
 
+size_t
+HttpReply::parseTerminatedPrefix(const char * const terminatedBuf, const size_t bufSize)
+{
+    auto error = Http::scNone;
+    const bool eof = false; // TODO: Remove after removing atEnd from HttpHeader::parse()
+    if (parse(terminatedBuf, bufSize, eof, &error)) {
+        debugs(58, 7, "success after accumulating " << bufSize << " bytes and parsing " << hdr_sz);
+        Assure(pstate == Http::Message::psParsed);
+        Assure(hdr_sz > 0);
+        Assure(!Less(bufSize, hdr_sz)); // cannot parse more bytes than we have
+        return hdr_sz; // success
+    }
+
+    Assure(pstate != Http::Message::psParsed);
+    hdr_sz = 0;
+
+    if (error) {
+        throw TextException(ToSBuf("failed to parse HTTP headers",
+                                   Debug::Extra, "parser error code: ", error,
+                                   Debug::Extra, "accumulated unparsed bytes: ", bufSize,
+                                   Debug::Extra, "reply_header_max_size: ", Config.maxReplyHeaderSize),
+                            Here());
+    }
+
+    debugs(58, 3, "need more bytes after accumulating " << bufSize << " out of " << Config.maxReplyHeaderSize);
+
+    // the parse() call above enforces Config.maxReplyHeaderSize limit
+    // XXX: Make this a strict comparison after fixing Http::Message::parse() enforcement
+    Assure(bufSize <= Config.maxReplyHeaderSize);
+    return 0; // parsed nothing, need more data
+}
+
 void
 HttpReply::configureContentLengthInterpreter(Http::ContentLengthInterpreter &interpreter)
 {

diff --git a/src/HttpReply.h b/src/HttpReply.h
@@ -125,6 +125,13 @@ class HttpReply: public Http::Message
     /// parses reply header using Parser
     bool parseHeader(Http1::Parser &hp);
 
+    /// Parses response status line and headers at the start of the given
+    /// NUL-terminated buffer of the given size. Respects reply_header_max_size.
+    /// Assures pstate becomes Http::Message::psParsed on (and only on) success.
+    /// \returns the number of bytes in a successfully parsed prefix (or zero)
+    /// \retval 0 implies that more data is needed to parse the response prefix
+    size_t parseTerminatedPrefix(const char *, size_t);
+
 private:
     /** initialize */
     void init();

diff --git a/src/MemObject.cc b/src/MemObject.cc
@@ -353,6 +353,12 @@ MemObject::policyLowestOffsetToKeep(bool swap) const
      */
     int64_t lowest_offset = lowestMemReaderOffset();
 
+    // XXX: Remove the last (Config.onoff.memory_cache_first-based) condition
+    // and update keepForLocalMemoryCache() accordingly. The caller wants to
+    // remove all local memory that is safe to remove. Honoring caching
+    // preferences is its responsibility. Our responsibility is safety. The
+    // situation was different when ff4b33f added that condition -- there was no
+    // keepInLocalMemory/keepForLocalMemoryCache() call guard back then.
     if (endOffset() < lowest_offset ||
             endOffset() - inmem_lo > (int64_t)Config.Store.maxInMemObjSize ||
             (swap && !Config.onoff.memory_cache_first))

diff --git a/src/MemObject.h b/src/MemObject.h
@@ -89,6 +89,15 @@ class MemObject
     bool appliedUpdates = false;
 
     void stat (MemBuf * mb) const;
+
+    /// The offset of the last memory-stored HTTP response byte plus one.
+    /// * HTTP response headers (if any) are stored at offset zero.
+    /// * HTTP response body byte[n] usually has offset (hdr_sz + n), where
+    ///   hdr_sz is the size of stored HTTP response headers (zero if none); and
+    ///   n is the corresponding byte offset in the whole resource body.
+    ///   However, some 206 (Partial Content) response bodies are stored (and
+    ///   retrieved) as regular 200 response bodies, disregarding offsets of
+    ///   their body parts. \sa HttpStateData::decideIfWeDoRanges().
     int64_t endOffset () const;
 
     /// sets baseReply().hdr_sz (i.e. written reply headers size) to endOffset()

diff --git a/src/MemStore.cc b/src/MemStore.cc
@@ -17,6 +17,8 @@
 #include "MemObject.h"
 #include "MemStore.h"
 #include "mime_header.h"
+#include "sbuf/SBuf.h"
+#include "sbuf/Stream.h"
 #include "SquidConfig.h"
 #include "SquidMath.h"
 #include "StoreStats.h"
@@ -311,19 +313,25 @@ MemStore::get(const cache_key *key)
     // create a brand new store entry and initialize it with stored info
     StoreEntry *e = new StoreEntry();
 
-    // XXX: We do not know the URLs yet, only the key, but we need to parse and
-    // store the response for the Root().find() callers to be happy because they
-    // expect IN_MEMORY entries to already have the response headers and body.
-    e->createMemObject();
-
-    anchorEntry(*e, index, *slot);
-
-    const bool copied = copyFromShm(*e, index, *slot);
-
-    if (copied)
-        return e;
+    try {
+        // XXX: We do not know the URLs yet, only the key, but we need to parse and
+        // store the response for the Root().find() callers to be happy because they
+        // expect IN_MEMORY entries to already have the response headers and body.
+        e->createMemObject();
+
+        anchorEntry(*e, index, *slot);
+
+        // TODO: make copyFromShm() throw on all failures, simplifying this code
+        if (copyFromShm(*e, index, *slot))
+            return e;
+        debugs(20, 3, "failed for " << *e);
+    } catch (...) {
+        // see store_client::parseHttpHeadersFromDisk() for problems this may log
+        debugs(20, DBG_IMPORTANT, "ERROR: Cannot load a cache hit from shared memory" <<
+               Debug::Extra << "exception: " << CurrentException <<
+               Debug::Extra << "cache_mem entry: " << *e);
+    }
 
-    debugs(20, 3, "failed for " << *e);
     map->freeEntry(index); // do not let others into the same trap
     destroyStoreEntry(static_cast<hash_link *>(e));
     return nullptr;
@@ -469,6 +477,8 @@ MemStore::copyFromShm(StoreEntry &e, const sfileno index, const Ipc::StoreMapAnc
     Ipc::StoreMapSliceId sid = anchor.start; // optimize: remember the last sid
     bool wasEof = anchor.complete() && sid < 0;
     int64_t sliceOffset = 0;
+
+    SBuf httpHeaderParsingBuffer;
     while (sid >= 0) {
         const Ipc::StoreMapSlice &slice = map->readableSlice(index, sid);
         // slice state may change during copying; take snapshots now
@@ -491,10 +501,18 @@ MemStore::copyFromShm(StoreEntry &e, const sfileno index, const Ipc::StoreMapAnc
             const StoreIOBuffer sliceBuf(wasSize - prefixSize,
                                          e.mem_obj->endOffset(),
                                          page + prefixSize);
-            if (!copyFromShmSlice(e, sliceBuf, wasEof))
-                return false;
+
+            copyFromShmSlice(e, sliceBuf);
             debugs(20, 8, "entry " << index << " copied slice " << sid <<
                    " from " << extra.page << '+' << prefixSize);
+
+            // parse headers if needed; they might span multiple slices!
+            auto &reply = e.mem().adjustableBaseReply();
+            if (reply.pstate != Http::Message::psParsed) {
+                httpHeaderParsingBuffer.append(sliceBuf.data, sliceBuf.length);
+                if (reply.parseTerminatedPrefix(httpHeaderParsingBuffer.c_str(), httpHeaderParsingBuffer.length()))
+                    httpHeaderParsingBuffer = SBuf(); // we do not need these bytes anymore
+            }
         }
         // else skip a [possibly incomplete] slice that we copied earlier
 
@@ -526,6 +544,9 @@ MemStore::copyFromShm(StoreEntry &e, const sfileno index, const Ipc::StoreMapAnc
     debugs(20, 5, "mem-loaded all " << e.mem_obj->endOffset() << '/' <<
            anchor.basics.swap_file_sz << " bytes of " << e);
 
+    if (e.mem().adjustableBaseReply().pstate != Http::Message::psParsed)
+        throw TextException(ToSBuf("truncated mem-cached headers; accumulated: ", httpHeaderParsingBuffer.length()), Here());
+
     // from StoreEntry::complete()
     e.mem_obj->object_sz = e.mem_obj->endOffset();
     e.store_status = STORE_OK;
@@ -541,40 +562,18 @@ MemStore::copyFromShm(StoreEntry &e, const sfileno index, const Ipc::StoreMapAnc
 }
 
 /// imports one shared memory slice into local memory
-bool
-MemStore::copyFromShmSlice(StoreEntry &e, const StoreIOBuffer &buf, bool eof)
+void
+MemStore::copyFromShmSlice(StoreEntry &e, const StoreIOBuffer &buf)
 {
     debugs(20, 7, "buf: " << buf.offset << " + " << buf.length);
 
-    // from store_client::readBody()
-    // parse headers if needed; they might span multiple slices!
-    const auto rep = &e.mem().adjustableBaseReply();
-    if (rep->pstate < Http::Message::psParsed) {
-        // XXX: have to copy because httpMsgParseStep() requires 0-termination
-        MemBuf mb;
-        mb.init(buf.length+1, buf.length+1);
-        mb.append(buf.data, buf.length);
-        mb.terminate();
-        const int result = rep->httpMsgParseStep(mb.buf, buf.length, eof);
-        if (result > 0) {
-            assert(rep->pstate == Http::Message::psParsed);
-        } else if (result < 0) {
-            debugs(20, DBG_IMPORTANT, "Corrupted mem-cached headers: " << e);
-            return false;
-        } else { // more slices are needed
-            assert(!eof);
-        }
-    }
-    debugs(20, 7, "rep pstate: " << rep->pstate);
-
     // local memory stores both headers and body so copy regardless of pstate
     const int64_t offBefore = e.mem_obj->endOffset();
     assert(e.mem_obj->data_hdr.write(buf)); // from MemObject::write()
     const int64_t offAfter = e.mem_obj->endOffset();
     // expect to write the entire buf because StoreEntry::write() never fails
     assert(offAfter >= 0 && offBefore <= offAfter &&
            static_cast<size_t>(offAfter - offBefore) == buf.length);
-    return true;
 }
 
 /// whether we should cache the entry

diff --git a/src/MemStore.h b/src/MemStore.h
@@ -80,7 +80,7 @@ class MemStore: public Store::Controlled, public Ipc::StoreMapCleaner
     void copyToShm(StoreEntry &e);
     void copyToShmSlice(StoreEntry &e, Ipc::StoreMapAnchor &anchor, Ipc::StoreMap::Slice &slice);
     bool copyFromShm(StoreEntry &e, const sfileno index, const Ipc::StoreMapAnchor &anchor);
-    bool copyFromShmSlice(StoreEntry &e, const StoreIOBuffer &buf, bool eof);
+    void copyFromShmSlice(StoreEntry &, const StoreIOBuffer &);
 
     void updateHeadersOrThrow(Ipc::StoreMapUpdate &update);
 

diff --git a/src/StoreClient.h b/src/StoreClient.h
@@ -13,10 +13,23 @@
 #include "base/AsyncCall.h"
 #include "base/forward.h"
 #include "dlink.h"
+#include "store/ParsingBuffer.h"
 #include "StoreIOBuffer.h"
 #include "StoreIOState.h"
 
-typedef void STCB(void *, StoreIOBuffer);   /* store callback */
+/// A storeClientCopy() callback function.
+///
+/// Upon storeClientCopy() success, StoreIOBuffer::flags.error is zero, and
+/// * HTTP response headers (if any) are available via MemObject::freshestReply();
+/// * HTTP response body bytes (if any) are available via StoreIOBuffer.
+///
+/// STCB callbacks may use response semantics to detect certain EOF conditions.
+/// Callbacks that expect HTTP headers may call store_client::atEof(). Similar
+/// to clientStreamCallback() callbacks, callbacks dedicated to receiving HTTP
+/// bodies may use zero StoreIOBuffer::length as an EOF condition.
+///
+/// Errors are indicated by setting StoreIOBuffer flags.error.
+using STCB = void (void *, StoreIOBuffer);
 
 class StoreEntry;
 class ACLFilledChecklist;
@@ -88,7 +101,13 @@ class store_client
 
     void dumpStats(MemBuf * output, int clientNumber) const;
 
-    int64_t cmp_offset;
+    // TODO: When STCB gets a dedicated Answer type, move this info there.
+    /// Whether the last successful storeClientCopy() answer was known to
+    /// contain the last body bytes of the HTTP response
+    /// \retval true requesting bytes at higher offsets is futile
+    /// \sa STCB
+    bool atEof() const { return atEof_; }
+
 #if STORE_CLIENT_LIST_DEBUG
 
     void *owner;
@@ -123,32 +142,51 @@ class store_client
     dlink_node node;
 
 private:
-    bool moreToSend() const;
+    bool moreToRead() const;
+    bool canReadFromMemory() const;
+    bool answeredOnce() const { return answers >= 1; }
+    bool sendingHttpHeaders() const;
+    int64_t nextHttpReadOffset() const;
 
     void fileRead();
     void scheduleDiskRead();
-    void scheduleMemRead();
+    void readFromMemory();
     void scheduleRead();
     bool startSwapin();
+    void handleBodyFromDisk();
+    void maybeWriteFromDiskToMemory(const StoreIOBuffer &);
+
+    bool parseHttpHeadersFromDisk();
+    bool tryParsingHttpHeaders();
+    void skipHttpHeadersFromDisk();
 
     void fail();
     void callback(ssize_t);
     void noteCopiedBytes(size_t);
-    void noteEof();
     void noteNews();
     void finishCallback();
     static void FinishCallback(store_client *);
 
     int type;
     bool object_ok;
 
+    /// \copydoc atEof()
+    bool atEof_;
+
     /// Storage and metadata associated with the current copy() request. Ought
     /// to be ignored when not answering a copy() request.
     StoreIOBuffer copyInto;
 
-    /// The number of bytes loaded from Store into copyInto while answering the
-    /// current copy() request. Ought to be ignored when not answering.
-    size_t copiedSize;
+    /// the total number of finishCallback() calls
+    uint64_t answers;
+
+    /// Accumulates raw bytes read from Store while answering the current copy()
+    /// request. Buffer contents depends on the source and parsing stage; it may
+    /// hold (parts of) swap metadata, HTTP response headers, and/or HTTP
+    /// response body bytes.
+    std::optional<Store::ParsingBuffer> parsingBuffer;
+
+    StoreIOBuffer lastDiskRead; ///< buffer used for the last storeRead() call
 
     /* Until we finish stuffing code into store_client */
 
@@ -173,7 +211,18 @@ class store_client
     } _callback;
 };
 
+/// Asynchronously read HTTP response headers and/or body bytes from Store.
+///
+/// The requested zero-based HTTP body offset is specified via the
+/// StoreIOBuffer::offset field. The first call (for a given store_client
+/// object) must specify zero offset.
+///
+/// The requested HTTP body portion size is specified via the
+/// StoreIOBuffer::length field. The function may return fewer body bytes.
+///
+/// See STCB for result delivery details.
 void storeClientCopy(store_client *, StoreEntry *, StoreIOBuffer, STCB *, void *);
+
 store_client* storeClientListAdd(StoreEntry * e, void *data);
 int storeUnregister(store_client * sc, StoreEntry * e, void *data);
 int storePendingNClients(const StoreEntry * e);

diff --git a/src/StoreIOBuffer.h b/src/StoreIOBuffer.h
@@ -43,6 +43,9 @@ class StoreIOBuffer
         return Range<int64_t>(offset, offset + length);
     }
 
+    /// convenience method for changing the offset of a being-configured buffer
+    StoreIOBuffer &positionAt(const int64_t newOffset) { offset = newOffset; return *this; }
+
     void dump() const {
         if (fwrite(data, length, 1, stderr)) {}
         if (fwrite("\n", 1, 1, stderr)) {}