GetRandomInternalKeysAppend

ZengJingtao edited this page Jan 30, 2023 · 2 revisions

In the document Level 1 Sub Compaction, we saw that an L0 + L1 -> L1 compaction enables subcompaction, but how is the work divided among subcompactions?

In RocksDB, subcompactions are divided by taking the minimum and maximum keys of each input SST file to form an ordered key set, and then assigning each subcompaction the same number of keys from that set.
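The division described above can be sketched as follows. This is illustrative code, not the actual RocksDB internals; `PickBoundaries` is a hypothetical name, and the input is the collected min/max keys of the input SSTs:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sketch of RocksDB-style subcompaction boundary selection: collect the
// smallest and largest key of every input SST, sort and deduplicate them,
// then pick boundaries so that each subcompaction covers the same number
// of collected keys. Note that a huge SST contributes only two sample
// points, just like a tiny one.
std::vector<std::string> PickBoundaries(std::vector<std::string> keys,
                                        size_t num_subcompactions) {
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
  std::vector<std::string> boundaries;
  if (num_subcompactions < 2 || keys.size() < num_subcompactions) {
    return boundaries;  // too few keys: fall back to a single subcompaction
  }
  // Boundary i sits at the i-th equal-count cut of the key set.
  for (size_t i = 1; i < num_subcompactions; i++) {
    boundaries.push_back(keys[i * keys.size() / num_subcompactions]);
  }
  return boundaries;
}
```

Because each file contributes the same number of sample points regardless of its size, equal key counts do not imply equal data volumes, which is exactly the skew problem described next.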

When the number of input SSTs is large, this mechanism works well. But when L0 SSTs are large (≈ write_buffer_size, e.g. 512M) and L1 SSTs are small (target_file_size_base, e.g. 32M), it is prone to data skew: the subcompaction sizes differ wildly, so the total compaction time is dominated by the largest subcompaction.

So, ToplingDB added a virtual function to TableReader:

```cpp
  virtual bool GetRandomInternalKeysAppend(
      size_t num, std::vector<std::string>* output) const {
    return false; // default: indicates sampling is not implemented
  }
```
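A minimal sketch of what an implementation of this interface might look like. `MockTableReader` and its `keys_` member are hypothetical stand-ins for illustration only; a real SST reader would sample from its index blocks rather than hold all keys in memory:

```cpp
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Illustrative stand-in for a TableReader holding an in-memory key index.
// It only demonstrates the contract: sample `num` keys and APPEND them.
class MockTableReader {
 public:
  explicit MockTableReader(std::vector<std::string> keys)
      : keys_(std::move(keys)) {}

  // Appends up to `num` keys sampled from this SST to *output.
  // Returns false when the reader cannot provide samples.
  bool GetRandomInternalKeysAppend(size_t num,
                                   std::vector<std::string>* output) const {
    if (keys_.empty()) return false;
    std::mt19937_64 rng(/*seed=*/0);  // fixed seed for reproducibility
    for (size_t i = 0; i < num; i++) {
      output->push_back(keys_[rng() % keys_.size()]);
    }
    return true;
  }

 private:
  std::vector<std::string> keys_;
};
```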

The Append suffix in the function name means that sampled keys are appended to output rather than overwriting it.

With this function, we can obtain many keys from each SST, so the subcompactions can be divided evenly!
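Putting the pieces together, the even division might look like the sketch below (`EvenBoundaries` is a hypothetical helper, not ToplingDB's actual code): gather the sampled keys from every input SST, merge them, and cut the merged sample into equal-count slices. Because the number of samples per SST can be made proportional to its size, each slice now approximates an equal share of the data, even when one L0 file dwarfs the L1 files:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Merge the key samples gathered from each input SST (the "Append" calls),
// then pick equal-count boundaries over the merged sample.
std::vector<std::string> EvenBoundaries(
    const std::vector<std::vector<std::string>>& samples_per_sst,
    size_t num_subcompactions) {
  std::vector<std::string> all;
  for (const auto& s : samples_per_sst) {
    all.insert(all.end(), s.begin(), s.end());  // the append step
  }
  std::sort(all.begin(), all.end());
  all.erase(std::unique(all.begin(), all.end()), all.end());
  std::vector<std::string> boundaries;
  if (all.size() < num_subcompactions) return boundaries;
  for (size_t i = 1; i < num_subcompactions; i++) {
    boundaries.push_back(all[i * all.size() / num_subcompactions]);
  }
  return boundaries;
}
```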

Configuring the L0 SST size to be much larger than the L1 SST size is the recommended usage for ToplingDB, because:

  1. Reduced read amplification (fewer L0 SSTs)
  2. Reduced write amplification (each L0 SST is a sorted run, all L1 SSTs together form one sorted run, and the sizes of different sorted runs are roughly equal)
  3. L0 + L1 -> L1 enables subcompaction, which is fully parallel
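A back-of-envelope check of this sizing, using the example values from above (`ApproxL1FilesPerL0` is a hypothetical helper, not a real API): one flushed L0 SST covers roughly write_buffer_size bytes, so an L0 + L1 -> L1 compaction rewrites it into about write_buffer_size / target_file_size_base small L1 files, each of which can later feed an independent L1 -> L2 compaction:

```cpp
#include <cassert>
#include <cstdint>

// Rough estimate of how many small L1 files one large L0 SST turns into.
// With the example values, 512M / 32M = 16 L1 files per L0 SST.
uint64_t ApproxL1FilesPerL0(uint64_t write_buffer_size,
                            uint64_t target_file_size_base) {
  return write_buffer_size / target_file_size_base;
}
```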

In ToplingDB, these mechanisms push L0 down to L1 as fast as possible. Because L1 files are small, there are many of them, so with distributed compaction the L1 -> L2 compactions can be fully parallelized (up to max_background_compactions); in any case they are offloaded to the Compact cluster and consume no computing resources on the DB node.

To reduce the load on the DB node as much as possible, neither L0 nor L1 SSTs are compressed, and they use ToplingFastTable (instead of BlockBasedTable); from L2 onward, ToplingZipTable + Distributed Compaction is used.
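In stock RocksDB terms, the "no compression on L0/L1" part of this layout can be expressed with the standard `compression_per_level` option; the per-level table factory split (ToplingFastTable vs ToplingZipTable) is ToplingDB-specific and is only sketched in comments here, not real stock-RocksDB API:

```cpp
// Config fragment (not a complete program; requires rocksdb headers):
//
// rocksdb::Options options;
// options.compression_per_level = {
//     rocksdb::kNoCompression,  // L0: keep DB-node CPU low
//     rocksdb::kNoCompression,  // L1: keep DB-node CPU low
//     rocksdb::kZSTD,           // L2+: compressed
// };
//
// In ToplingDB the table factory also differs by level (ToplingFastTable
// for L0/L1, ToplingZipTable from L2 on); that selection is configured
// through ToplingDB's own mechanisms, not shown here.
```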