
ToplingZipTable

1. Introduction

In ToplingZipTable, there are two core concepts: CO-Index and PA-Zip

  • CO-Index: Compressed Ordered Index, which maps a ByteArray key to an integer ID; this ID is then used to access the corresponding value in PA-Zip.
  • PA-Zip: Point Accessible Zip, which can be regarded as an abstract array whose core operation is random access by ID: the ID is used as a subscript into the abstract array to read one element, even though the elements are stored compressed.

CO-Index and PA-Zip together form a logical map<Key, Value>.

Here, Key is RocksDB's InternalKey, i.e. the {UserKey, Seq, OpType} triplet.

The most typical implementation of CO-Index in ToplingDB is NestLoudsTrie, and the most typical implementation of PA-Zip is DictZipBlobStore. Both are memory-compressed: their in-memory form is compressed, and all search and read operations run directly on that compressed form. The combined compression ratio of CO-Index + PA-Zip is very high, far better than BlockBasedTable + zstd.
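A read can thus be pictured as a two-step lookup: CO-Index maps the key to an ID, and PA-Zip fetches the record for that ID. The sketch below only illustrates this division of labor; the interface names (COIndex, PAZipStore, Lookup) are hypothetical and are not the actual ToplingDB classes.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical interfaces, for illustration only -- the real implementations
// are NestLoudsTrie (CO-Index) and DictZipBlobStore (PA-Zip) inside ToplingDB.
struct COIndex {
  // Searches the compressed ordered index and returns the integer ID of the
  // key, or nullopt if the key is absent. Operates on the compressed form.
  virtual std::optional<uint64_t> Find(std::string_view key) const = 0;
  virtual ~COIndex() = default;
};

struct PAZipStore {
  // Point access by ID: decompresses and returns exactly one record.
  virtual std::string Get(uint64_t id) const = 0;
  virtual ~PAZipStore() = default;
};

// The logical map<Key, Value>: key -> ID via CO-Index, ID -> value via PA-Zip.
std::optional<std::string> Lookup(const COIndex& index, const PAZipStore& store,
                                  std::string_view internal_key) {
  if (auto id = index.Find(internal_key))
    return store.Get(*id);
  return std::nullopt;  // key not present in this SST
}
```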

The compression cost of ToplingZipTable is relatively high (about twice that of zstd), so in ToplingDB it is mainly configured for the lower levels of the LSM through DispatcherTable, and compression is performed by distributed compaction.

2. Configuration

| configuration item | type | default | explanation |
|--------------------|------|---------|-------------|
| localTempDir | string | /tmp | Temporary files are created while building a ToplingZipTable SST; this specifies the directory for them. |
| enableStatistics | bool | true | Whether to collect performance statistics for the SST's Get operations. |
| keyPrefixLen | int | 0 | DB systems such as MyRocks use a fixed-length prefix (usually 4 bytes) to distinguish tables or indexes. Different tables or indexes usually have different data characteristics, so different compression schemes should be applied per prefix. |
| checksumLevel | int | 0 | 0: no checksum<br>1: checksum metadata only<br>2: checksum each piece of data separately<br>3: checksum the entire file |
| warmupLevel | enum | kIndex | How much to warm up (load into memory) when opening an SST:<br>kNone: no warm-up<br>kIndex: warm up the index<br>kValue: warm up the entire file (including value content) |
| debugLevel | int | 0 | Mainly for testing. |
| sampleRatio | float | 0.03 | Sampling ratio; the global compression of values requires sampling. |
| minPreadLen | int | 0 | When page faults are frequent, pread performs better than mmap because it issues fewer IOs and avoids the cost of creating PTEs. This parameter controls when pread is used (see the sketch after this table):<br>< 0: never use pread<br>== 0: always use pread<br>> 0: use pread when the value length is greater than this value |
| minPrefetchPages | int | 0 | When reading a value through mmap, if the value is large in the file (at least crossing a page boundary), prefetch this many pages at a time to reduce page faults during random access. 0 disables the feature, since MADV_POPULATE_READ calls also have overhead, which is unnecessary when page faults are rare. |
| builderMinLevel | int | 0 | The LSM level at which ToplingZipTable starts to be used: levels above it do not use ToplingZipTable, while this level and below do. With distributed compaction, if a distributed compact fails and falls back to a local compact, it consumes the DB node's scarce computing resources; in that case a cheaper TableFactory (such as SingleFastTable) should be used for the upper levels, which is the main purpose of this parameter. |
| indexType | string | Mixed_XL_256_32_FL | The default NestLoudsTrie type. NestLoudsTrie can use different Rank-Select implementations; this is mainly for testing, keep the default in normal use. |
| indexNestLevel | int | 3 | Maximum number of nesting levels of the NestLoudsTrie index. |
| indexNestScale | int | 8 | Each deeper NestLoudsTrie level is smaller than the one above it; nesting stops once a level shrinks to a fraction (controlled by this parameter) of the outermost level. In general, deeper nesting gives a higher compression ratio but a slower search, so this strikes a balance between the two. |
| indexCacheRatio | float | 0 | The Select operation underlying NestLoudsTrie can be accelerated with a cache; this is the cache ratio. A value below 0.01 generally speeds up search by about 10%; a larger cache brings little additional benefit. |
| indexTempLevel | int | 0 | When building a NestLoudsTrie, temporary files can be used to reduce memory usage: the more temporary files, the less memory. This was useful when ToplingZipTable preferred building very large indexes; now that it no longer does, this setting matters less. |
| indexMemAsResident | bool | false | Keep the index resident in memory. |
| indexMemAsHugePage | bool | false | Back the index with huge pages. |
| speedupNestTrieBuild | bool | true | Optimization for NestLoudsTrie construction; keep the default. |
| optimizeCpuL3Cache | bool | true | Global value compression uses a multi-threaded pipeline. The compression dictionary is large and is accessed very randomly; this option keeps the data of a single SST being compressed together as much as possible to improve CPU L3 cache utilization. Keep the default. |
| bytesPerBatch | int | 256K | When compressing values, each task in the pipeline is a batch; this is the upper limit of the total size of all values in a single batch. |
| recordsPerBatch | int | 500 | When compressing values, the upper limit on the number of values in a single batch. |
| entropyAlgo | enum | kNoEntropy | After global-dictionary compression, values can be compressed again with entropy coding. The extra compression gain is small but the decompression/read overhead is very high, so it is disabled by default and enabling it is not recommended. Other options: kHuffman, kFSE. |
| offsetArrayBlockUnits | int | 0 | Variable-length values are located through an offset array; a value's length is the difference between adjacent offsets. This array can be compressed with PForDelta, and this option sets the number of elements per PForDelta block. 0 means no compression; for compression, 128 is preferred, 64 is also allowed, and no other value is valid. |
| minDictZipValueSize | int | 30 | When the average value length is below this value, values are not compressed. |
| keyRankCacheRatio | float | 0 | Used to speed up ApproximateOffsetOf. 0 disables it; a non-zero value is the overall sampling ratio used to build the cache. |
| acceptCompressionRatio | float | 0.8 | Compressed size divided by original size for values; when the achieved ratio is worse than this (compression saves too little), compression is abandoned. |
| nltAcceptCompressionRatio | float | 0.4 | When the NestLoudsTrie compression ratio is worse than this, the NestLoudsTrie index is abandoned and another index type is used instead. |
| softZipWorkingMemLimit<br>hardZipWorkingMemLimit<br>smallTaskMemory | uint64 | 16G<br>32G<br>1.2G | Multiple concurrent compactions each need working memory, so a limit is required. When the expected memory usage exceeds the soft limit, only new tasks whose expected usage does not exceed smallTaskMemory are allowed to start; when the hard limit is reached, no new task is allowed to start. |
| fileWriterBufferSize | int | 128K | Write buffer size. |
| fixedLenIndexCacheLeafSize | int | 512 | For FixedLenKeyIndex, the leaf node size of its double-array query cache; the larger the leaf node, the smaller the cache. Keep the default. |
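For minPreadLen specifically, the three cases above amount to a small decision function. The sketch below only restates the documented semantics; the helper name and signature are hypothetical and not the actual ToplingZipTable code.

```cpp
#include <cstdint>

// Decide whether to read a value with pread() or through the mmap'ed file,
// following the documented minPreadLen semantics (hypothetical helper):
//   minPreadLen <  0 : never use pread
//   minPreadLen == 0 : always use pread
//   minPreadLen >  0 : use pread only when the value is longer than the threshold
bool ShouldUsePread(int minPreadLen, uint64_t valueLenInFile) {
  if (minPreadLen < 0)  return false;
  if (minPreadLen == 0) return true;
  return valueLenInFile > static_cast<uint64_t>(minPreadLen);
}
```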

2.1. Examples

ToplingZipTable is configured through SidePlugin; an example from a yaml configuration file is shown below:

```yaml
TableFactory:
  zip:
    class: ToplingZipTable
    params:
      localTempDir: "/dev/shm/tmp"
      indexType: Mixed_XL_256_32_FL
      indexNestLevel: 3
      indexNestScale: 8
      indexTempLevel: 0
      indexCacheRatio: 0
      warmupLevel: kIndex
      compressGlobalDict: false
      optimizeCpuL3Cache: true
      enableEntropyStore: false
      offsetArrayBlockUnits: 128
      sampleRatio: 0.01
      checksumLevel: 0
      entropyAlgo: kNoEntropy
      debugLevel: 0
      softZipWorkingMemLimit: 16G
      hardZipWorkingMemLimit: 32G
      smallTaskMemory: 1G
      minDictZipValueSize: 30
      keyPrefixLen: 0
      minPreadLen: 64
```

For a complete configuration, please refer to lcompact_enterprise.yaml.
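Once the yaml is written, the DB that uses this TableFactory is opened through the SidePlugin repo. The snippet below is only a rough sketch: the SidePluginRepo calls (ImportAutoFile, OpenDB, CloseAllDB) and the header path are assumptions here, so consult the SidePlugin wiki for the authoritative API.

```cpp
// Rough sketch only: the SidePluginRepo method names and the header path are
// assumptions; see the SidePlugin wiki for the exact API.
#include <topling/side_plugin_repo.h>
#include <rocksdb/db.h>

int main() {
  rocksdb::SidePluginRepo repo;
  repo.ImportAutoFile("db.yaml");  // yaml containing the TableFactory config above
  rocksdb::DB* db = nullptr;
  repo.OpenDB(&db);                // open the DB defined in the yaml
  // ... use db as a normal rocksdb::DB ...
  repo.CloseAllDB();               // close the DB and release plugin objects
  return 0;
}
```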

3. Tip: Configure multiple ToplingZipTables

In DispatcherTable, multiple ToplingZipTable factories can be configured with different compression options, for example:

```yaml
TableFactory:
  lightZip:
    class: ToplingZipTable
    params:
      localTempDir: "/dev/shm/tmp"
      indexNestLevel: 3
      indexNestScale: 8
      minDictZipValueSize: 10M
```

Setting minDictZipValueSize to a large value means that data whose average value length is less than 10M will not be compressed, which suits the upper levels of the LSM (such as levels 2 and 3). Skipping compression not only reduces CPU consumption during compaction but also greatly improves read performance: the decompression step is saved, and ZeroCopy can return the mmap memory where the value is stored in the SST directly to user code.
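On the read side, the standard way to benefit from this is the PinnableSlice overload of DB::Get, which lets the returned value reference SST memory instead of being copied into a std::string. Whether a copy is actually avoided depends on the table reader; the snippet below only shows the API shape.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/slice.h>
#include <iostream>

// Read through the PinnableSlice overload of Get so that, when the table
// reader supports it (as in the uncompressed ZeroCopy path described above),
// the value can reference the SST's mmap memory instead of being copied.
void ReadValue(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::PinnableSlice value;  // pins the underlying memory while in scope
  rocksdb::Status s =
      db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key, &value);
  if (s.ok()) {
    std::cout << "value size = " << value.size() << "\n";
  }
  // `value` releases its pin when it goes out of scope.
}
```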