Skip to content

2025.1.2.0-b60

@mdbridge mdbridge tagged this 09 Oct 14:39
Summary:
Original commit: bb053dcbc88216fcfe8e20b48274f979c2833fc4 / D46076
Only one conflict unrelated to my code:
```
<<<<<<< HEAD
    RETURN_NOT_OK(leaders.front()->tablet()->Flush(tablet::FlushMode::kSync));
    RETURN_NOT_OK(leaders.front()->RunLogGC());
=======
        RETURN_NOT_OK(
            VERIFY_RESULT(leaders.front()->shared_tablet())->Flush(tablet::FlushMode::kSync));
        RETURN_NOT_OK(leaders.front()->RunLogGC());
>>>>>>> bc8f624d12 ([#28124,22992] xCluster: limit use of LogReader
memory)
```
resolved in favor of 2025.1 code.

Currently xCluster can read a huge number of WALs into memory from disk simultaneously when xCluster replication is resumed if the checkpoint has fallen far enough behind that the relevant WAL records are no longer in memory.  This can cause TServer crashes due to running out of memory.

This diff fixes this by putting a limit on how much of this memory can be used simultaneously.

This builds on the previous diff (see https://github.com/yugabyte/yugabyte-db/issues/28623) that created a memory tracker for log reader memory.  When requested, the low-level log reader code obeys the limit set on this tracker and limits the number of WAL records it reads if tracker is lacking capacity.

I say when requested because many of the paths that use log reader memory are not (yet) limited.  As of this diff, only the xCluster paths are limited.  I have created another issue to limit CDC paths as well.  Not clear if/when we will limit the remaining paths, which include tablet bootstrapping and I think remote bootstrap.

To be clear, we track the log reader usage of all of these paths but only on the xCluster path do we actually obey the limit; the other paths can go past the limit.  The memory being used by those paths does count towards the limit so xCluster will be able to use less log reader memory if those paths are also using log reader memory.

Note that unlike many of the memory limits the TServer has, this is not reserved memory but rather oversubscribed memory -- if all the other memory compartments use all of their memory then we will hit the soft memory limit when this memory is used.

In practice, xCluster only uses log reader memory when it needs to catch up after being paused for a long time.  Until it replicates all the WAL records not in the existing block cache memory, it will use log reader memory.  Once it has caught up, it will stop using log reader memory.  This limit is thus for xCluster more a burst limiter that spreads out the burst of memory usage to avoid the first getting too high than an ongoing source of memory usage. See test plan for an example of this.

In terms of actual code, the code that actually consumes memory is:
```
~/code/yugabyte-db/src/yb/consensus/log_util.h:374:
  // Reads a log entry batch from the provided readable segment, which gets decoded
  // into 'entry_batch' and increments 'offset' by the batch's length.
  //
  // Returns status Busy if obey_memory_limit is set and there is insufficient memory to hold the
  // batch.
  Result<std::shared_ptr<LWLogEntryBatchPB>> ReadEntryBatch(
      bool obey_memory_limit, int64_t* offset, const EntryHeader& header);
```

Because we have to decide whether to unpack the WAL record before we know how much memory is going to be consumed, we need to estimate memory consumption ahead of time.  I have done some simple empirical experiments to produce a reasonable estimate.  I don't believe the "prevent xCluster from crashing" ability of this limit really depends too much on how accurate the estimate is so I haven't spent a lot of time refining the estimate.  We can always replace the current estimate with a more polished one later if it turns out to be necessary.

Further upstream is code that reads multiple batches to produce a list of operations:
```
~/code/yugabyte-db/src/yb/consensus/log_reader.h:98:
  // Reads all ReplicateMsgs from 'starting_at' to 'up_to' both inclusive.
  // The caller takes ownership of the returned ReplicateMsg objects.
  //
  // Will attempt to read no more than 'max_bytes_to_read', unless it is set to
  // LogReader::kNoSizeLimit.  If the size limit would prevent reading any operations at all, then
  // will read exactly one operation.  Exception: if obey_memory_limit is true then may fail to read
  // even one operation if doing so would exceed the log reader memory tracker limit; returns Busy
  // in that case.
  //
  // Requires that a LogIndex was passed into LogReader::Open().
  // Requires up_to operation index to be Raft-committed, otherwise might return NotFound error if
  // Raft operation at up_to index will be rewritten (due to term change).
  //
  // The parameters starting_op_segment_seq_num, modified_schema, schema_version are used to read
  // appropriate schema corresponding to the from_op_id in the segment header or from the segment
  // itself if there is a DDL log.
  Status ReadReplicatesInRange(
      const int64_t starting_at,
      const int64_t up_to,
      int64_t max_bytes_to_read,
      bool obey_memory_limit,
      consensus::ReplicateMsgs* replicates,
      int64_t* starting_op_segment_seq_num,
      CoarseTimePoint deadline = CoarseTimePoint::max()) const;
```

```
~/code/yugabyte-db/src/yb/consensus/log_cache.h:108:
  // Read operations from the log, following 'after_op_index'.
  //
  // If such an op exists in the log, an OK result will always include at least one operation unless
  // obey_memory_limit is set and doing so would have required reading past the log reader memory
  // limit.
  //
  // The result will be limited by the available memory if obey_memory_limit is set.  It will also
  // be limited such that the total ByteSize() of the returned ops is less than max_size_bytes,
  // unless that would result in an empty result, in which case one over-size op is allowed.
  //
  // The OpId that precedes the returned ops is returned in *preceding_op.  The index of this OpId
  // will match 'after_op_index'.
  //
  // If the ops being requested are not available in the log, this will synchronously read these ops
  // from disk.  Therefore, this function may take a substantial amount of time and should not be
  // called with important locks held, etc.
  Result<ReadOpsResult> ReadOps(
      int64_t after_op_index, size_t max_size_bytes, bool obey_memory_limit);
```

```
~/code/yugabyte-db/src/yb/consensus/consensus_queue.h:576:
  // Reads operations from the log cache in the range (after_index, to_index].
  //
  // If 'to_index' is 0, then all operations after 'after_index' will be included.
  //
  // May return status Busy if obey_memory_limit is true and reading even one operation would exceed
  // the log reader memory tracker limit.
  Result<ReadOpsResult> ReadFromLogCache(
      int64_t after_index, int64_t to_index, size_t max_batch_size, const std::string& peer_uuid,
      bool obey_memory_limit, const CoarseTimePoint deadline = CoarseTimePoint::max(),
      bool fetch_single_entry = false);

  // May return status Busy if obey_memory_limit is true and reading even one operation would exceed
  // the log reader memory tracker limit.
  Result<ReadOpsResult> ReadFromLogCacheForXRepl(
      int64_t last_op_id_index, int64_t to_index, bool obey_memory_limit,
      CoarseTimePoint deadline = CoarseTimePoint::max(), bool fetch_single_entry = false);
```

Note that semantics have changed here in the presence of obey_memory_limit: previously we were guaranteed that if there was an operation meeting the request, we would return at least one operation even if it was bigger than the batch size.  If obey_memory_limit is true then this guarantee no longer holds: we may return Busy if there is insufficient memory to read and even a single WAL record at the moment.

Finally, ReadReplicatedMessagesForXCluster converts any Busy status into no operations plus more messages exist:
```
  // If we are unable to read any WAL entries due to memory limits, returns no operations with
  // have_more_messages true.
  Result<XClusterReadOpsResult> ReadReplicatedMessagesForXCluster(
      const yb::OpId& last_op_id, const CoarseTimePoint deadline, bool fetch_single_entry);
```

The remaining xCluster code is unchanged; this means that if we got a Busy status then the poller will retry (with back off) because its call to GetChanges returned no changes.

Fixes #28124
Fixes #22992
Jira: DB-17767

Test Plan:
Testing ReadReplicatesInRange/log_reader.*:
```
ybd release --cxx-test log-test --gtest_filter '*.TestReadReplicatesWithOnlyPartialMemory'
ybd release --cxx-test log-test --gtest_filter '*.TestReadReplicatesWithInsufficientMemory'
```

Testing LogCache::ReadOps:
```
ybd release --cxx-test log_cache-test --gtest_filter '*.TestAdequateLogReaderMemory'

ybd release --cxx-test log_cache-test --gtest_filter '*.TestInsufficientLogReaderMemory'

ybd release --cxx-test log_cache-test --gtest_filter '*.TestPartialLogReaderMemory'
```

For calibrating, I created a local universe then ran:
```
bin/ysqlsh <<EOF
CREATE TABLE test_mdl (key int, x int, y int, PRIMARY KEY (key));
INSERT INTO test_mdl
SELECT x, x*2, x+3 FROM GENERATE_SERIES(1,1000) AS x;
CREATE TABLE test_mdl2 (key int, s text, z int, q int, PRIMARY KEY (key));
INSERT INTO test_mdl2
SELECT x, 'foobar', x+3, -x FROM GENERATE_SERIES(1,10000) AS x;
CREATE TABLE test_mdl3 (key int, s text, z int, q int, PRIMARY KEY (key)) WITH (colocation = 0);
INSERT INTO test_mdl3
SELECT x, 'foobar', x+3, -x FROM GENERATE_SERIES(1,10000) AS x;
EOF
```
I then stopped and restarted the universe (this causes tablet bootstrapping to read in the WAL records) and looked at the memory adjustment log messages to see how far off I was.

The value I'm currently using, 6, is within 15% for this test.

The actual QA tests aren't sensitive to this value because their WAL records are below the four KB minimum arena size.

I am running a QA pipeline, https://perf.dev.yugabyte.com/perfstudio?pipelineId=90, which runs out of WAL memory, crashing the TServer.

Charts from a recent run:
{F400787}

{F400788}

( there is a brief spike of ~256 MiB  when the resume happens but you can barely see anything happening on the overall memory chart.   without this diff, the TServer crashes due to running out of memory)

Reviewers: xCluster, hsunder

Reviewed By: hsunder

Subscribers: ybase, steve.varnau

Differential Revision: https://phorge.dev.yugabyte.com/D47306
Assets 2
Loading