CSPP MemTable

中文版 Chinese

CSPP MemTable supports only BytewiseComparator and ReverseBytewiseComparator.

In ToplingDB, CSPP MemTable is implemented as a SidePlugin. To use it, user code needs no changes; only the json/yaml conf files need to change.

When compiling ToplingDB, this GitHub repo is cloned automatically by ToplingDB's Makefile.

1. Configurations

cspp-memtable is configured as a json/yaml object as described in SidePlugin. The class name is cspp, with these params:

param name	type	default	description
mem_cap	uint64	2G	cspp needs to preallocate a single contiguous address space, which is only reserved address space, not allocated physical memory. Max is 16G
use_vm	bool	true	malloc/posix_memalign does not guarantee that the address space is only reserved and not allocated. If `use_vm` is true, `mmap` is used to get the address space, which ensures it is reserved but not allocated
use_hugepage	bool	false	If this param is true, you should set a large enough `vm.nr_hugepages` in Linux
vm_explicit_commit	bool	false	Windows `VirtualAlloc` commits (allocates) memory explicitly. Linux needs no explicit commit, but accessing the memory will SegFault when the system has no free memory. On Linux kernel 5.14+, `MADV_POPULATE_WRITE` can be used as an explicit commit, like `VirtualAlloc` commit on Windows
ref_log_format	enum	kShortLogRef	When DBOptions.memtable_as_log_index is true, this param selects how to reference the WAL log, as the names describe: `{kNoLogRef, kPlainLogRef, kShortLogRef}`
convert_to_sst	enum	kDontConvert	How to convert MemTable into SST when Flush is omitted, as the names describe: `{kDontConvert, kDumpMem, kFileMmap}`
sync_sst_file	bool	true	When convert_to_sst is `kFileMmap`, whether to issue fsync on the output SST file
token_use_idle	bool	true	Used to optimize the token ring; just keep the default
accurate_memsize	bool	false	Used only for unit tests; do not set to true in production
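
The difference between reserving address space and allocating physical memory, which `mem_cap` and `use_vm` rely on, can be illustrated outside ToplingDB. The following is a minimal sketch, assuming a Unix-like system and Python's standard `mmap` module; `RESERVE` and `reserve_and_touch` are hypothetical names, and this is not how CSPP itself is implemented.

```python
import mmap

# Stand-in for mem_cap; kept far below 2G so the sketch runs anywhere.
RESERVE = 256 * 1024 * 1024

def reserve_and_touch():
    """Map anonymous memory, then touch only the first page."""
    # An anonymous private mapping reserves address space; the kernel
    # provides physical pages lazily, when a page is first written.
    m = mmap.mmap(-1, RESERVE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        untouched = m[RESERVE - 1]   # untouched pages read as zero
        m[0:4] = b"cspp"             # first write makes one page resident
        return len(m), untouched, bytes(m[0:4])
    finally:
        m.close()

length, last_byte, head = reserve_and_touch()
print(length, last_byte, head)
```

Only the page that was written needs physical memory; the rest of the range stays a reservation.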

json and yaml samples:

"MemTableRepFactory": {
   "cspp": {
      "class": "cspp",
      "params": {
         "mem_cap": "2G",
         "use_vm": false,
         "token_use_idle": true
      }
   },
   "skiplist": {
      "class": "SkipList",
      "params": {
         "lookahead": 0
      }
   }
}

MemTableRepFactory:
  cspp:
    class: cspp
    params:
      mem_cap: 2G
      use_vm: false
      token_use_idle: true
  skiplist:
    class: SkipList
    params:
      lookahead: 0

ref it in json

ref it in yaml

2. Convert MemTable into SST directly

Converting a MemTable into an SST directly, instead of running MemTable Flush, brings great performance gains. Currently only CSPP MemTable supports this feature.

CSPP operations are implemented by reading and writing directly on a file mmap. The in-memory format is exactly the same as the in-file format, because we use integers instead of pointers to represent relations between objects.
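
To illustrate why integer links make the in-memory and in-file formats identical, here is a minimal sketch in Python. It builds a tiny linked list inside a byte buffer using offsets instead of pointers. The node layout and the names `NODE`, `append_node` and `walk` are hypothetical and unrelated to the real CSPP Trie layout.

```python
import struct

# Each node is (value, offset of the next node); offset 0 means "no next".
NODE = struct.Struct("<iI")

def append_node(buf: bytearray, value: int, next_off: int) -> int:
    """Append a node to buf and return the offset where it was stored."""
    off = len(buf)
    buf += NODE.pack(value, next_off)
    return off

def walk(buf, head: int) -> list:
    """Follow the integer links starting at offset head."""
    out = []
    off = head
    while off != 0:
        value, off = NODE.unpack_from(buf, off)
        out.append(value)
    return out

# Offset 0 is reserved as the null link, so start with a dummy header.
arena = bytearray(b"\0" * NODE.size)
head = 0
for v in (30, 20, 10):
    head = append_node(arena, v, head)

# A byte-for-byte copy stands in for the same bytes read back from a file:
# it can be traversed as is, with no address fix-up.
copy = bytes(arena)
print(walk(copy, head))
```

With raw pointers, the copy would contain addresses that are only valid in the original buffer; with offsets, any copy or mapping of the bytes is immediately usable.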

enum ref_log_format values:

kNoLogRef: do not use log ref
kPlainLogRef: the log ref includes the value length, max len of SSO is 11; when reading values, it just returns pointers to the value content in the log mmap, so value memory is not touched, which gains a little perf
kShortLogRef: the log ref does not include the value length, max len of SSO is 7; when reading values, it needs to read the varint encoded value len stored before the value content in the log mmap, which loses a little perf

When DBOptions.memtable_as_log_index is false, ref_log_format is ignored and behaves as kNoLogRef.

enum convert_to_sst values:

kDontConvert: Disable the feature. This is the default, and it also disables kPlainLogRef & kShortLogRef
kDumpMem: Do not use file mmap; write the MemTable's memory block to the SST file. This avoids the CPU overhead of conversion but keeps the memory overhead
kFileMmap: Create a file and a ReadWrite mmap on it at MemTable construction. Both CPU and memory overhead are reduced, and when DBOptions.memtable_as_log_index is true, MemTable Flush is omitted completely
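
Two of the rules stated above can be written down as small predicates. This is a minimal sketch that encodes only what the text says: `ref_log_format` behaves as kNoLogRef when `memtable_as_log_index` is false, and Flush is omitted completely only for kFileMmap with `memtable_as_log_index` set. The function names are hypothetical, and the interaction between kDontConvert and the log ref formats is deliberately not modeled.

```python
from enum import Enum

class RefLogFormat(Enum):
    kNoLogRef = 0
    kPlainLogRef = 1
    kShortLogRef = 2

class ConvertToSst(Enum):
    kDontConvert = 0
    kDumpMem = 1
    kFileMmap = 2

def effective_ref_log_format(memtable_as_log_index: bool,
                             ref_log_format: RefLogFormat) -> RefLogFormat:
    """ref_log_format is ignored unless memtable_as_log_index is true."""
    if not memtable_as_log_index:
        return RefLogFormat.kNoLogRef
    return ref_log_format

def flush_fully_omitted(memtable_as_log_index: bool,
                        convert_to_sst: ConvertToSst) -> bool:
    """Flush is omitted completely only with kFileMmap plus log index."""
    return memtable_as_log_index and convert_to_sst is ConvertToSst.kFileMmap

print(effective_ref_log_format(False, RefLogFormat.kShortLogRef))
print(flush_fully_omitted(True, ConvertToSst.kFileMmap))
```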

The mmap created in CSPPMemTab construction can be a file mmap. In this case the file is truncated to mem_cap. Mainstream filesystems (ext4, xfs, ...) support sparse files, so although the file size is mem_cap and virtual address space of mem_cap is reserved, neither disk/SSD space nor physical memory is needed yet.

Only when we write data to a memory address does the OS allocate physical memory (page cache). Only when the page cache has been dirty for a while (a configurable OS param) and is written back to the file is disk/SSD space allocated and written.

Use `ls -l -s -h` to show both the file space usage and the file size.
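
The sparse file behavior described above can be observed with a short experiment. This is a minimal sketch, assuming a Unix-like system and a filesystem that supports sparse files; `MEM_CAP` and `sparse_mmap_demo` are hypothetical names, and the size is kept small for the demo.

```python
import mmap
import os
import tempfile

MEM_CAP = 64 * 1024 * 1024  # stand-in for mem_cap

def sparse_mmap_demo(directory=None):
    """Return (apparent_size, allocated_bytes) after touching one page."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, MEM_CAP)          # sets the size, allocates no data
        with mmap.mmap(fd, MEM_CAP) as m:  # shared read-write file mapping
            m[0:5] = b"hello"              # first write dirties one page
            m.flush()                      # write the dirty page back
        st = os.fstat(fd)
        # st_blocks counts 512-byte units actually allocated on disk.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)

size, allocated = sparse_mmap_demo()
print(f"size={size} allocated={allocated}")
```

The reported size equals `MEM_CAP`, while the allocated bytes stay far smaller, which is the same difference that `ls -l -s -h` shows.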

When a CSPP MemTable is converted from Active to Immutable (marked as ReadOnly), the file is truncated to its real size. Converting it into an SST then only requires appending an SST File Footer to the file. A wrapper class implements the TableReader interface, wrapping a CSPP MemTable object as an SST, so it needs the SST's TableFactory:

    "cspp_memtab_sst": {
      "class": "CSPPMemTabTable",
      "params": { }
    }

In DispatcherTable, cspp_memtab_sst is put in readers as the SST TableFactory for creating the TableReader of CSPPMemTabTable SSTs:

NOTE: "CSPPMemTabTable": "cspp_memtab_sst",

    "dispatch": {
      "class": "DispatcherTable",
      "params": {
        "default": "light_dzip",
        "readers": {
          "VecAutoSortTable": "auto_sort",
          "CSPPMemTabTable": "cspp_memtab_sst",
          "BlockBasedTable": "bb",
          "SingleFastTable": "sng",
          "ToplingZipTable": "dzip"
        },
        "level_writers": ["sng", "sng", "dzip", "dzip", "dzip", "dzip", "dzip"]
      }
    }

DispatcherTable never creates CSPPMemTabTable SSTs; it only reads them.

Best practice

ColumnFamilyOptions::write_buffer_size should be large (such as 2G, with CSPPMemTab::mem_cap set to 3G)
ColumnFamilyOptions::max_bytes_for_level_base need not be configured; by default it is set to the same value as write_buffer_size
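
A sanity check for these two sizes could look like the sketch below. It assumes that the "2G with 3G" example implies `write_buffer_size` should stay below `mem_cap`, which is an inference and not a rule stated by ToplingDB, and it uses the 16G maximum from the params table. `parse_size` and `check_sizes` are hypothetical helpers, not part of SidePlugin.

```python
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
MAX_MEM_CAP = 16 << 30  # "Max is 16G" from the params table

def parse_size(text) -> int:
    """Parse sizes such as '2G' or '536870912' into bytes."""
    text = str(text).strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)

def check_sizes(write_buffer_size, mem_cap) -> list:
    """Return a list of human-readable problems; empty means OK."""
    wbs, cap = parse_size(write_buffer_size), parse_size(mem_cap)
    problems = []
    if cap > MAX_MEM_CAP:
        problems.append("mem_cap exceeds the 16G maximum")
    if wbs >= cap:
        # Assumption: headroom is wanted, as in the 2G vs 3G example.
        problems.append("write_buffer_size leaves no headroom below mem_cap")
    return problems

print(check_sizes("2G", "3G"))
```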

Performance gains from converting MemTable to SST

1. Reduce CPU overhead: In MemTable Flush, the MemTable needs to be scanned for key value pairs to feed to the SST. For small values this is CPU bound; for large values it is disk/SSD bandwidth bound. The convert operation needs no such computation.

With distributed compactions, the DB node does not need to run L2+ compactions; it only needs to execute MemTable Flush and L0 -> L1 compactions. Converting removes the MemTable Flush, about half of that work.

2. Reduce memory usage: MemTable Flush needs double memory space. If a long lived SuperVersion references the MemTable, this double memory consumption lasts a long time. If the SST is a BlockBasedTable, the BlockCache is yet another memory consumer.

If a CSPP MemTable is converted into an SST, then even when the SST and the MemTable are referenced at the same time, they are backed by the same physical memory: the PageCache of a file does not need multiple copies for multiple references.

3. Reduce IO bandwidth: less data is written to disk/SSD, so less bandwidth is needed.
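
The memory sharing claimed in point 2 follows from how shared file mappings work: two mappings of the same file are backed by the same page cache. The sketch below shows this with Python's `mmap` on a Unix-like system; the two views are only stand-ins for a MemTable reference and an SST reference, and `shared_view_demo` is a hypothetical name.

```python
import mmap
import os
import tempfile

def shared_view_demo() -> bytes:
    """Write through one mapping and read the bytes through another."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        # Stand-in for the MemTable's read-write mapping.
        memtable_view = mmap.mmap(fd, mmap.PAGESIZE)
        # Stand-in for a read-only SST reader over the same file.
        sst_view = mmap.mmap(fd, mmap.PAGESIZE, prot=mmap.PROT_READ)
        try:
            memtable_view[0:3] = b"key"
            # No flush or copy is needed: both views share the same pages.
            return bytes(sst_view[0:3])
        finally:
            sst_view.close()
            memtable_view.close()
    finally:
        os.close(fd)
        os.unlink(path)

print(shared_view_demo())
```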

About Crash Safe

To achieve high performance parallel reads and writes, CSPP uses Copy On Write. With little extra development this also gave us the Crash Safe feature: if the process crashes at any time, the CSPP Trie structure in the file mmap stays consistent, which provides the ACD of ACID.

Currently CSPP MemTable does not make use of the Crash Safe feature.

3. memtable_as_log_index

When DBOptions.memtable_as_log_index is true, the ToplingDB framework provides support for linking the MemTable with the WAL log. ToplingDB designed and implemented a new WAL format for this purpose; it is completely different from the RocksDB WAL format, and the two are not compatible with each other. Therefore memtable_as_log_index is a field of ImmutableDBOptions: before changing memtable_as_log_index, all WAL must have been Flush-ed (ConvertToSST also counts as Flush).

When convert_to_sst is kDontConvert, Flush is executed normally. With memtable_as_log_index set to true, this only reduces the MemTable's memory usage and the memcpy of values in insert operations; Flush still needs to scan the MemTable. Such a config is less useful and is not well tested.

When convert_to_sst is kDumpMem or kFileMmap, the Flush semantic is implemented by ConvertToSST. These are the valuable configurations. In ConvertToSST:

`LogRef:<Plain|Short>;blob_no:wal_no:cnt:bytes,...` is added into TableProperties.compression_options to identify the WAL files referenced by the MemTable. To let RocksDB's existing code manage the extended lifetime of such WAL files, we need to tell the LSM tree the relation between the WAL and the SST converted from the MemTable: we create a blob file as a hard link to the WAL and tell the LSM tree that this blob file (hard link) is referenced by the SST file. The LSM tree then records the relation between the SST file and the blob file (hard link). As a result, the feature is implemented with minimal effort.
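
For tooling that inspects SST properties, the `LogRef` string can be split into its fields. The following is a minimal sketch, assuming the string looks exactly like the pattern quoted above, with `Plain` or `Short` after `LogRef:` and comma-separated `blob_no:wal_no:cnt:bytes` groups. The sample string, `WalRef` and `parse_log_ref` are hypothetical; the real encoding may differ in details.

```python
from typing import NamedTuple

class WalRef(NamedTuple):
    blob_no: int
    wal_no: int
    cnt: int
    nbytes: int  # the "bytes" field of the pattern

def parse_log_ref(text: str):
    """Split 'LogRef:<kind>;a:b:c:d,...' into (kind, [WalRef, ...])."""
    head, _, body = text.partition(";")
    prefix, _, kind = head.partition(":")
    if prefix != "LogRef" or kind not in ("Plain", "Short"):
        raise ValueError(f"not a LogRef string: {text!r}")
    refs = [WalRef(*map(int, item.split(":")))
            for item in body.split(",") if item]
    return kind, refs

# Hypothetical sample value, made up for illustration only.
kind, refs = parse_log_ref("LogRef:Short;12:7:100:4096,13:8:5:200")
print(kind, refs)
```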

When this feature is used, the storage format of the MemTable also changes; the format is identified by TableProperties.compression_options.

SSO is used by KeyValueToLogRef: when ValueLen ≤ 11 (with kPlainLogRef), the value content is stored in KeyValueToLogRef itself. When all values of a WAL referenced by a MemTable are stored this way, the MemTable does not need to reference that WAL.

When ref_log_format is kShortLogRef, KeyValueToLogRef above is replaced with KV_ToShortLogRef and the max SSO len is 7.
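
The SSO rule can be summarized in a few lines. This is a minimal sketch that uses only the limits given above (11 for kPlainLogRef, 7 for kShortLogRef); `is_inline` and `wal_still_needed` are hypothetical helpers and do not mirror the layout of KeyValueToLogRef or KV_ToShortLogRef.

```python
# Max value length that can be stored inline, per ref_log_format.
MAX_SSO_LEN = {"kPlainLogRef": 11, "kShortLogRef": 7}

def is_inline(value: bytes, ref_log_format: str) -> bool:
    """True if the value fits in the log ref itself (SSO)."""
    return len(value) <= MAX_SSO_LEN[ref_log_format]

def wal_still_needed(values, ref_log_format: str) -> bool:
    """A WAL must stay referenced if any of its values is not inline."""
    return any(not is_inline(v, ref_log_format) for v in values)

print(is_inline(b"x" * 11, "kPlainLogRef"))
print(wal_still_needed([b"a", b"x" * 12], "kPlainLogRef"))
```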

NOTE: DBOptions.max_total_wal_size should be large enough. It is automatically sanitized to a reasonable value in most cases, but a large WriteBatch may still exceed the limit, producing an error like:
IO fenced off: memtable_as_log_index log::Writer::AddRecordv: write offset XXX : XXX, len XXX, exceeds mmap size XXX by XXX

4. memtablerep_bench

In ToplingDB, cspp is added to memtablerep_bench, which is derived from RocksDB. To benchmark skiplist and cspp, use a script like this:

sudo yum -y install git libaio-devel gcc-c++ gflags-devel zlib-devel bzip2-devel libcurl-devel liburing-devel
git clone https://github.com/topling/toplingdb
cd toplingdb
make DEBUG_LEVEL=0 memtablerep_bench -j`nproc`
export LD_LIBRARY_PATH=.:`find sideplugin -name lib_shared`:${LD_LIBRARY_PATH}
./memtablerep_bench -memtablerep=skiplist -huge_page_tlb_size=2097152 \
  -benchmarks=fillrandom,readrandom,readwrite \
  -write_buffer_size=536870912 -item_size=0 -num_operations=10000000
./memtablerep_bench -memtablerep='cspp:{"mem_cap":"16G"}' \
  -benchmarks=fillrandom,readrandom,readwrite \
  -write_buffer_size=536870912 -item_size=0 -num_operations=10000000

It will show that CSPP is 6x faster on write and 8x faster on read than SkipList on X86_64, and 10x and 11x faster on ARM.

NOTE: -item_size=0 sets the value size to 0, which eliminates the noise from memcpy of values
NOTE: the most valuable metrics are write us/op and read us/op
NOTE: memtablerep_bench only benchmarks the performance of MemTableRep, where the overhead of the function call chain is insignificant
- When used in a DB, the overhead of the function call chain is significant, so the speedup ratio is lower
NOTE: memtablerep_bench does not support multi-thread parallel writes; use db_bench for this purpose
- For example: db_bench -threads=10 -batch_size=100 -benchmarks=fillrandom