tidesdb/mwbench
==============================================================================
mwbench -- a stress mixed-workload bench for tidesdb
==============================================================================
OVERVIEW
--------
mwbench drives a tidesdb instance through a configurable write workload while
a pool of reader threads continuously probes it with point gets, iterator
seeks, and short range scans. Every metric is dimensioned along several
axes so a single run produces enough data to answer most "how does X behave
when Y changes" questions about an LSM engine:
- workload: sequential / uniform / zipfian / latest key patterns, both
for writes and for reads (reads can also inherit the write pattern)
- storage layout: 1..64 column families with per-CF config overrides,
optional unified memtable mode (single shared memtable + WAL)
- storage backend: local-only, object-store via the tidesdb fs connector
(no extra deps), or S3-compatible (rustfs / MinIO / AWS S3) via the
library's S3 connector
- role: primary or read-only replica polling MANIFEST + WAL from the
object store. one mwbench invocation can spawn a sibling replica
against the same bucket and produce side-by-side CSVs
- integrity: every value read back is byte-compared against a
deterministic recomputation from its key. tombstoned ids are recognised
in O(1) via the deletion pattern, not counted as misses
- amplification: write amp (syscall / block / disk), read amp
(syscall / block / internal sstables), space amp -- both per-sample
window and rolling 30 s. derived from /proc/self/io deltas + tidesdb
db stats + filesystem walk
- power: Intel/AMD RAPL energy counters (package / core / uncore / dram
when present), windowed watts, process-attributed energy via
cpu_share, J/GiB and J/M-reads efficiency. zero-config when
/sys/class/powercap is readable
- process: /proc/self/io, getrusage, peak RSS, page faults, CPU% with
the saturation ceiling annotated
- commit latency: per-commit nanos timing, p50/p95/p99 sampled by the
writers. surfaces mirror-mode fan-out cost at multi-CF setups
- delete-reclaim experiment: a configurable fraction of ingested keys
(default 5%, batched 10K per txn) is tombstoned at the tail, then
compacted. tombstone count + disk reclaim captured at four discrete
snapshots
Per-run output is a directory under --out-dir holding six CSVs (listed
under OUTPUT FILES).
ARCHITECTURE
------------
Threading model:
                                              .---------.
.---------------.   commits   .------------.  |         |
| writer threads| -----.----->|  tidesdb   |  |         |
'---------------'      |      |  column    |  |  disk   |
.---------------.      |      |  families  |  | (local  |
| reader threads| <----'----- |            |<-|  +/or   |
'---------------'   probes    '------------'  |  obj-   |
        |                          |          |  store) |
        | per-(op,cf) latency      | db_stats |         |
        v buckets                  v          '---------'
.---------------.          .---------------.
| sampler thread|          | rusage / proc |
| (1 s tick +   |          | io / RAPL     |
|  sample-      | ------>  | snapshots     |
|  interval s   |          '---------------'
|  aggregation) |
'---------------'
        |
        v
.---------------.
| samples.csv,  |
| amplification |
| .csv, ...     |
'---------------'
Per writer thread: claims a contiguous id range from a global atomic
(sequential) or draws ids from a per-thread PRNG (uniform / zipfian /
latest). Mirrors each id to every column family in a single transaction;
times the commit. Mirror writes scale the byte target by num_cfs so a 64
GiB target with --num-cfs 4 produces 64 GiB of total writes (= 16 GiB of
unique logical data written four times).
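The sequential claim and the default keyspace are easy to picture in code.
The following is a minimal sketch, not the code in main.c: claim_id_range
and default_keyspace are hypothetical names, and the only things taken from
this README are "contiguous range from a global atomic" and the --keyspace
default of target_bytes / value_size.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Claim the contiguous id range [base, base + count) from a cursor shared
 * by all sequential writers and return base. Relaxed ordering is enough
 * to hand out disjoint ranges; it orders nothing else. */
uint64_t claim_id_range(_Atomic uint64_t *cursor, uint64_t count)
{
    assert(cursor != NULL && count > 0);
    return atomic_fetch_add_explicit(cursor, count, memory_order_relaxed);
}

/* Default id pool for the non-sequential patterns, per the --keyspace
 * description: target_bytes / value_size. */
uint64_t default_keyspace(uint64_t target_bytes, uint64_t value_size)
{
    assert(value_size > 0);
    return target_bytes / value_size;
}
```

With the defaults (1 TiB target, 1024-byte values) default_keyspace gives
2^30 ids. A writer would call claim_id_range once per batch with
count = --batch-size.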
TDB_ERR_BUSY out of tidesdb_txn_begin / tidesdb_txn_put /
tidesdb_txn_commit is treated as backpressure, not as a fatal error: the
writer sleeps with exponential backoff (capped at 50 ms) and retries the
batch instead of exiting. For the sequential pattern the previously
claimed seq_base is preserved across retries, so the same ids are
rewritten and the reader's "all ids < max_written_key - safety are
committed somewhere" invariant holds. Any other error return still ends
the writer loop. The same backpressure semantics apply to the
delete-phase txn loop.
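The retry behaviour can be sketched as below. This is an illustration under
assumptions, not the bench's loop: the doubling factor, the callback types
(try_batch_fn, sleep_fn) and the 0 / 1 / negative return convention are all
hypothetical. Only the 200 us floor and 50 ms cap come from this README
(see TROUBLESHOOTING).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BACKOFF_MIN_US 200u    /* floor quoted in TROUBLESHOOTING */
#define BACKOFF_MAX_US 50000u  /* 50 ms cap */

/* Next sleep duration: start at the floor, double, clamp at the cap.
 * Doubling is an assumption; the README only says "exponential". */
uint32_t next_backoff_us(uint32_t prev_us)
{
    if (prev_us < BACKOFF_MIN_US)
        return BACKOFF_MIN_US;
    if (prev_us >= BACKOFF_MAX_US / 2)
        return BACKOFF_MAX_US;
    return prev_us * 2;
}

/* Hypothetical batch callback: 0 = committed, 1 = busy, < 0 = fatal. */
typedef int (*try_batch_fn)(void *ctx);
/* Hypothetical sleep hook, so the loop can be driven without sleeping. */
typedef void (*sleep_fn)(uint32_t usec, void *ctx);

/* Retry the SAME batch (same ids) until it commits or fails for a reason
 * other than "busy". Returns the final callback result. */
int commit_with_backoff(try_batch_fn try_batch, sleep_fn do_sleep, void *ctx)
{
    uint32_t wait_us = 0;

    assert(try_batch != NULL && do_sleep != NULL);
    for (;;) {
        int rc = try_batch(ctx);
        if (rc != 1)
            return rc;              /* committed, or a non-busy error */
        wait_us = next_backoff_us(wait_us);
        do_sleep(wait_us, ctx);
    }
}
```

Starting from zero, next_backoff_us yields 200, 400, 800, ... 25600 and
then stays at 50000.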
Per reader thread: holds one long-lived read-only tidesdb_txn_t and
recycles it through rollback + reset every MWB_READER_TXN_RESET_OPS ops
to release per-op buffers (tidesdb's READ_COMMITTED default refreshes
the snapshot per read, so a persistent handle still sees freshly-
committed writes). The point / seek / range op handlers each use their
own distinct tidesdb API (txn_get for point, iter_seek for seek, seek +
iter_next loop for range); only the underlying txn handle is shared.
Picks a CF uniformly per op, picks an id from the configured read
pattern, runs one of the three ops with equal probability, records
latency into a per-(op, cf) bucket. Tombstoned-id steering: when a
non-empty deleted set exists in the target CF, 30% of reads are aimed
at the steering subset so the sampler captures the tombstone-walk
latency.
Sampler thread: 1 s tick prints a one-line progress; every
--sample-interval-sec it drains the latency buckets across all readers,
computes pooled and per-CF percentiles, reads tidesdb_db_stats_t,
tidesdb_get_cache_stats, walks the data dir for du-style size,
reads /proc/self/io + getrusage + RAPL energy counters, computes
write / read / space amplification both windowed and rolling-30s, and
emits one row per CF into samples.csv plus one row per tick into
amplification.csv. Per-phase aggregators close out into summary.csv at
end of run.
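For readers post-processing raw latencies, the percentile step can be
sketched as follows. It assumes a plain array of samples and the
nearest-rank definition; the bench itself drains per-(op, cf) buckets and
may compute its percentiles differently. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct latency_summary {
    uint64_t p50, p95, p99;
};

/* qsort comparator for uint64_t, written to avoid subtraction overflow. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile over a sorted array; pct is a whole percent.
 * rank = ceil(pct * n / 100), computed in integers. */
uint64_t percentile_sorted(const uint64_t *sorted, size_t n, unsigned pct)
{
    assert(sorted != NULL && n > 0 && pct >= 1 && pct <= 100);
    size_t rank = (n * pct + 99) / 100;
    return sorted[rank - 1];
}

/* Sorts the samples in place, then reads the three ranks. */
struct latency_summary summarize_latencies(uint64_t *samples, size_t n)
{
    struct latency_summary s;

    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    s.p50 = percentile_sorted(samples, n, 50);
    s.p95 = percentile_sorted(samples, n, 95);
    s.p99 = percentile_sorted(samples, n, 99);
    return s;
}
```

Pooled percentiles would feed every reader's samples into one call; per-CF
percentiles would call it once per CF.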
CAPABILITY MATRIX
-----------------
This is the short list of dimensions the bench exposes. Every row is
controlled by one or more CLI flags (see OPTIONS below).
axis                     options / range
-----------------------  ---------------------------------------------
write pattern            sequential | uniform | zipfian | latest
read pattern             sequential | uniform | zipfian | latest |
                         match_write
num CFs                  1..64, each with per-CF config overrides
unified memtable         off | on (separate buffer sized via flag)
compression              none | snappy | lz4 | zstd | lz4_fast
sync mode                none | full | interval
default isolation        read_uncommitted | read_committed |
                         repeatable_read | snapshot | serializable
klog format              skiplist (default) | b+tree
bloom filter             on | off, with FPR knob
block indexes            on | off, with sample ratio + prefix len
storage backend          local | fs connector | S3 (rustfs/MinIO/AWS)
role                     primary | replica (read-only)
primary+replica          single-invocation orchestrator: fork+exec a
                         sibling in --replica-mode against the same
                         objstore bucket
delete experiment        fraction in [0, 1], batch size, target cf or
                         all-cfs
RAPL energy              auto-probed at startup; degrades gracefully
                         when /sys/class/powercap is locked down
read profiling           reserved (#ifdef TIDESDB_ENABLE_READ_PROFILING
                         hook; zeros otherwise)
BUILD
-----
mwbench is a single C file linked against tidesdb + pthreads + libm,
built with CMake.
cmake -S . -B build \
-DTIDESDB_ROOT=/path/to/tidesdb \
-DTIDESDB_BUILD=/path/to/tidesdb/build
cmake --build build -j
That produces ./build/mwbench. Every invocation below is written as
./mwbench for readability -- substitute ./build/mwbench or symlink:
ln -s build/mwbench mwbench
TIDESDB_ROOT and TIDESDB_BUILD default to paths baked into CMakeLists.txt;
override on the cmake line if your tidesdb checkout is elsewhere. The
build embeds an rpath pointing at the tidesdb build dir so libtidesdb.so
resolves at runtime without LD_LIBRARY_PATH.
Optional dependency for S3 mode:
The S3 connector lives in libtidesdb and is compiled in only when the
library itself was built with -DTIDESDB_WITH_S3=ON. Verify with:
grep TIDESDB_HAS_S3 /path/to/tidesdb/build/include/tidesdb/tidesdb_version.h
A non-empty match means S3 mode is available. mwbench picks it up
automatically -- no rebuild needed when the lib's S3 support toggles.
QUICK START
-----------
Smoke test (1 GiB target, short cooldowns, data dir wiped on exit):
./mwbench --quick
Medium overnight run (64 GiB target, 8 writers, 4 readers, generous cache):
./mwbench --target-gib 64 --write-threads 8 --read-threads 4 \
--block-cache $((16 * 1024 * 1024 * 1024)) \
--sample-interval-sec 15
RUNNING
-------
mwbench is configured via CLI flags. Defaults are tuned for a 1 TiB
ingest; override per scenario.
The data-dir is wiped on close by default (it's intended to be throwaway).
The out-dir is never overwritten -- each invocation creates a new
run_YYYYMMDD_HHMMSS subdir alongside any earlier runs.
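The run directory name is a plain timestamp format. A minimal sketch,
assuming UTC via gmtime (the bench may well use local time) and a
hypothetical helper name format_run_dir:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Write "<out_dir>/run_YYYYMMDD_HHMMSS" into buf.
 * Returns 0 on success, -1 on bad input or if buf is too small.
 * Uses gmtime(), which is not thread-safe; fine for a one-off at startup. */
int format_run_dir(const char *out_dir, time_t now, char *buf, size_t buflen)
{
    char stamp[32];

    assert(out_dir != NULL && buf != NULL);
    if (strlen(out_dir) == 0)
        return -1;

    const struct tm *tm_utc = gmtime(&now);
    if (tm_utc == NULL)
        return -1;
    if (strftime(stamp, sizeof stamp, "run_%Y%m%d_%H%M%S", tm_utc) == 0)
        return -1;

    int n = snprintf(buf, buflen, "%s/%s", out_dir, stamp);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Because the stamp has one-second resolution, two runs started within the
same second would map to the same name; how the bench handles that case is
not covered here.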
OPTIONS
-------
Workload size:
  --target-bytes N   total bytes to ingest           (default 1 TiB)
  --target-gib N     same, expressed in GiB
  --value-size N     per-value payload size, bytes   (default 1024)
  --batch-size N     keys per writer txn             (default 256)
Multi-column-family:
--num-cfs N number of column families to create 1
each commit mirrors the same key+value
into all N CFs (secondary-index style)
--cf-override IDX:KEY=VAL patch a single CF's config; repeatable.
IDX is the 0-based CF index, KEY is in
the apply_cf_kv whitelist below. later
overrides on the same key win.
Key patterns:
--write-pattern PAT sequential | uniform | zipfian | latest
(default sequential -- LSM best case)
--read-pattern PAT same set plus match_write (default).
match_write inherits whichever pattern
writers use.
--zipf-skew F zipfian theta in (0, 1) 0.99
--keyspace N id pool size for non-sequential
(default: target_bytes / value_size)
Concurrency:
  --write-threads N        parallel writer threads           (default 4)
  --read-threads N         parallel reader threads           (default 2)
  --flush-threads N        tidesdb flush worker count        (default 4)
  --compaction-threads N   tidesdb compaction worker count   (default 4)
Database-level engine config (tidesdb_config_t):
--block-cache N clock-cache size, bytes 1 GiB
--max-open-sstables N ceiling on open sst file handles 65536
--max-memory N global memory cap, bytes (0 = auto) 0
--log-level LVL debug | info | warn | error | fatal | none
--log-to-file write tidesdb log to a file in data-dir
--log-truncate-at N truncate log file at N bytes (0 = never)
Unified-memtable mode (all CFs share one memtable + WAL):
--unified-memtable enable unified mode
--unified-write-buffer N memtable rotation threshold, bytes 256 MiB
applies only when --unified-memtable is
set and no explicit value is given
--unified-skip-list-max-level N default 12
--unified-skip-list-probability F default 0.25
--unified-sync-mode MODE none | full | interval default none
--unified-sync-interval-us N microseconds between syncs 128000
Per-CF engine tuning (applied to every CF unless overridden):
--write-buffer N per-CF memtable size, bytes 64 MiB
--write-buffer-size N alias for --write-buffer (matches the
underlying tidesdb field name)
--compression NAME none | snappy | lz4 | zstd | lz4_fast
--sync-mode MODE none | full | interval
--sync-interval-us N sync interval in microseconds
--bloom-filter 0|1 enable bloom filters 1
--bloom-fpr F target false-positive rate 0.01
--block-indexes 0|1 enable block indexes 1
--index-sample-ratio N block-index sampling ratio 1
--block-index-prefix-len N prefix length in bytes 16
--klog-value-threshold N values >= N bytes go to .vlog 16 KiB
--level-size-ratio N LSM level capacity ratio 10
--min-levels N minimum number of disk levels 1
--dividing-level-offset N spooky dividing-level offset (0,1,2..) 1
--l1-file-count-trigger N L1 file-count compaction trigger
--l0-queue-stall-threshold N L0 queue depth at which writers stall
--tombstone-density-trigger F per-sst tombstone density that
escalates compaction priority [0,1]
--tombstone-density-min-entries N min entries before density counts
--use-btree 0|1 use B+tree klog format 0
--isolation-level NAME default txn isolation:
read_uncommitted | read_committed |
repeatable_read | snapshot | serializable
--skip-list-max-level N skip-list max level 12
--skip-list-probability F skip-list probability 0.25
--min-disk-space N min free disk space (bytes) 100 MiB
apply_cf_kv whitelist (keys valid in --cf-override):
write_buffer_size, level_size_ratio, min_levels, dividing_level_offset,
klog_value_threshold, compression, enable_bloom_filter, bloom_fpr,
enable_block_indexes, index_sample_ratio, block_index_prefix_len,
sync_mode, sync_interval_us, skip_list_max_level, skip_list_probability,
isolation_level, min_disk_space, l1_file_count_trigger,
l0_queue_stall_threshold, tombstone_density_trigger,
tombstone_density_min_entries, use_btree
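The IDX:KEY=VAL syntax can be parsed along these lines. This is a sketch
under assumptions: parse_cf_override, struct cf_override and the numeric
error codes are hypothetical, and CF_KEYS carries only a subset of the
whitelist for brevity. It does not reproduce apply_cf_kv itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Subset of the whitelist above, for brevity. */
static const char *const CF_KEYS[] = {
    "write_buffer_size", "compression", "enable_bloom_filter",
    "klog_value_threshold", "use_btree",
};

struct cf_override {
    int idx;        /* 0-based CF index */
    char key[48];
    char val[64];
};

/* Returns 0 on success, -1 malformed, -2 idx >= num_cfs, -3 unknown key. */
int parse_cf_override(const char *arg, int num_cfs, struct cf_override *out)
{
    assert(arg != NULL && out != NULL);

    const char *colon = strchr(arg, ':');
    if (colon == NULL || colon == arg)
        return -1;
    const char *eq = strchr(colon + 1, '=');
    if (eq == NULL || eq == colon + 1 || eq[1] == '\0')
        return -1;

    /* IDX must be a non-negative integer that ends exactly at the colon. */
    char *end = NULL;
    long idx = strtol(arg, &end, 10);
    if (end != colon || idx < 0)
        return -1;
    if (idx >= num_cfs)
        return -2;

    size_t klen = (size_t)(eq - (colon + 1));
    size_t vlen = strlen(eq + 1);
    if (klen >= sizeof out->key || vlen >= sizeof out->val)
        return -1;
    memcpy(out->key, colon + 1, klen);
    out->key[klen] = '\0';
    memcpy(out->val, eq + 1, vlen + 1);   /* includes the terminator */

    for (size_t i = 0; i < sizeof CF_KEYS / sizeof CF_KEYS[0]; i++) {
        if (strcmp(out->key, CF_KEYS[i]) == 0) {
            out->idx = (int)idx;
            return 0;
        }
    }
    return -3;
}
```

Applying the parsed overrides in command-line order gives the "later
overrides on the same key win" behaviour for free.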
Object-store mode:
--objstore-mode MODE none | fs | s3 default none
fs = local filesystem connector, no
external deps. stores objects as
files under --objstore-fs-path
s3 = S3-compatible. requires libtidesdb
built with -DTIDESDB_WITH_S3=ON
and a running endpoint (rustfs,
MinIO, AWS, ...)
--objstore-fs-path PATH fs-connector root (default <data-dir>_objstore)
S3 endpoint (only with --objstore-mode s3):
--objstore-endpoint HOST:PORT e.g. 127.0.0.1:9000
--objstore-bucket NAME bucket name (must exist; pre-create
via aws s3 mb or the rustfs console)
--objstore-prefix STR optional key prefix
--objstore-access-key K
--objstore-secret-key S
--objstore-region R default "us-east-1"
--objstore-use-ssl 0|1 default 0 (HTTP)
--objstore-path-style 0|1 default 1 (rustfs / MinIO style)
Object-store config (applies to either backend):
--objstore-cache-bytes N local cache cap, bytes (0 = no cap)
--objstore-cache-on-read 0|1
--objstore-cache-on-write 0|1
--objstore-uploads N parallel upload threads
--objstore-downloads N parallel download threads
--objstore-multipart-threshold N bytes; above this, multipart
--objstore-multipart-part N multipart chunk size
--objstore-sync-manifest 0|1 upload MANIFEST after each compact
--objstore-replicate-wal 0|1 upload closed WAL segments
--objstore-wal-upload-sync 0|1 block flush on WAL upload
--objstore-wal-sync-bytes N active-WAL sync threshold (bytes)
--objstore-wal-sync-on-commit 0|1 sync after every commit (RPO=0)
Primary / replica orchestration:
--replica-mode this process opens tidesdb read-only and
runs readers + sampler only. requires
--objstore-mode fs|s3 and a populated
bucket / fs root from a primary
--spawn-replica this process is a primary AND fork+execs
a sibling --replica-mode child against
the same objstore (data dirs / out dirs
scoped per process)
--replica-data-dir PATH local cache dir for the spawned child
(default <data-dir>_replica)
--replica-sync-us N replica MANIFEST poll interval (us)
--replica-replay-wal 0|1 replay WAL for near-real-time replica
reads (default on)
Delete-reclaim experiment:
--delete-fraction F fraction of ingested keys to delete 0.05
(5% by default). set to 0 to skip.
--delete-batch N keys per delete txn 10000
--delete-target-cf IDX|all delete scope. default 0 (cf_0 only).
"all" tombstones every CF in the run
--delete-compact-wait-sec N seconds in the compaction phase after
tidesdb_compact() (default 60)
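The size of the delete plan follows directly from these flags. A small
sketch with hypothetical helper names; the truncation toward zero when the
product is fractional is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Number of ids to tombstone: fraction of the ingested key count,
 * truncated toward zero (assumed rounding). */
uint64_t delete_plan_keys(uint64_t ingested_keys, double fraction)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    return (uint64_t)((double)ingested_keys * fraction);
}

/* Number of delete transactions needed at batch keys per txn
 * (the last txn may be short). */
uint64_t delete_plan_txns(uint64_t plan_keys, uint64_t batch)
{
    assert(batch > 0);
    return (plan_keys + batch - 1) / batch;
}
```

A 12,000,000-key plan at the default batch of 10000 comes to 1200 delete
transactions.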
Sampling and timing:
  --sample-interval-sec N   seconds between CSV rows           (default 10)
  --range-scan-len N        keys per range probe               (default 100)
  --cooldown-sec N          idle window after writers finish   (default 30)
Paths and lifecycle:
--data-dir PATH tidesdb data directory (default ./data)
--out-dir PATH parent of the run_* subdirs (default ./out)
--resume open existing data dir instead of fresh
(skips the wipe + wipe-confirmation entirely)
--keep-data do not wipe data dir on clean exit
--force, --yes, -y skip the y/N confirmation prompt before
wiping a non-empty data-dir. required for
non-interactive (no-tty) runs that need to
wipe -- otherwise the bench aborts safely
Directory handling:
Both --data-dir and --out-dir are auto-created (mkdir -p semantics) if
any parent directory is missing.
If --data-dir already exists and is non-empty, the bench:
- with --resume: leaves it alone and reopens the existing db
- with --force (or --yes / -y): wipes and recreates without asking
- with an interactive tty: prompts
"data-dir <path> is non-empty. wipe it and start fresh? [y/N]:"
anything except y/Y aborts the run (data dir untouched)
- non-interactive (piped / scripted) with neither flag: aborts with
a message explaining --force vs --resume
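The decision list above reduces to a small pure function. This is a sketch,
not the bench's code: the enum and function names are hypothetical, the
precedence of --resume over --force when both are passed simply follows the
order of the list, and accepting only a bare y or Y (optionally followed by
a newline) is an assumed reading of "anything except y/Y aborts".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dir_action {
    DIR_OPEN_FRESH,  /* missing or empty: create if needed and open */
    DIR_REOPEN,      /* --resume: leave contents alone */
    DIR_WIPE,        /* --force / --yes / -y: wipe without asking */
    DIR_PROMPT,      /* interactive tty: ask y/N first */
    DIR_ABORT        /* no tty and no consent: refuse */
};

/* Decide what to do with --data-dir before opening the db. */
enum dir_action decide_data_dir(bool nonempty, bool resume, bool force,
                                bool is_tty)
{
    if (!nonempty)
        return DIR_OPEN_FRESH;
    if (resume)
        return DIR_REOPEN;
    if (force)
        return DIR_WIPE;
    if (is_tty)
        return DIR_PROMPT;
    return DIR_ABORT;
}

/* True only for "y" or "Y", optionally followed by a newline. */
bool answer_is_yes(const char *line)
{
    if (line == NULL)
        return false;
    if (line[0] != 'y' && line[0] != 'Y')
        return false;
    return line[1] == '\0' || line[1] == '\n';
}
```

Keeping the decision separate from the actual wipe makes the non-interactive
abort path easy to test without touching the filesystem.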
--out-dir is never wiped. Each run lands in a new run_YYYYMMDD_HHMMSS
subdir so earlier runs are preserved.
--spawn-replica's child process always runs with --force injected so
the spawner never blocks on a prompt -- the replica's data dir is
just an objstore cache, not persistent data.
Preset:
--quick 1 GiB target, 2 s sample interval,
5 s cooldown, 10 s compact wait
KEY PATTERNS
------------
Writers and readers each choose ids under one of four distributions. The
choice changes the stress the workload puts on the LSM:
sequential monotonically increasing ids. LSM best case: L0 sstables
have disjoint key ranges so compaction is concat-only;
reads hit one sstable per level; bloom filter overhead
is minimal. baseline / control.
uniform uniform random over [0, keyspace). SSTs overlap heavily
so compaction is real merging work; reads exercise the
bloom filter false-positive path. closest to a real-world
random-access workload.
zipfian power-law skew with theta in (0,1). hot keys cluster,
stressing the block cache; long tail probes the bloom
false-positive cost. classic YCSB pattern.
latest exponentially weighted toward max_written_key (95% mass
in top 5%). time-series / write-recent-read-recent
workloads. exercises L0 + memtable for reads.
Reads default to read-pattern=match_write -- whichever distribution
writers use, readers mirror. Override to e.g. --write-pattern uniform
--read-pattern zipfian to study cache effectiveness against a flat
write distribution.
Integrity check semantics:
sequential -- ids 0..max_written_key are guaranteed written, so any
"not found" inside the safety window counts as data
loss. Seek that lands on a different key inside the
safe window is also data loss -- the bench counts it
as a miss rather than silently validating whatever
next-greater key the iterator surfaced. Mismatches
always count.
others -- no per-key tracking. "not found" is statistically
expected and suppressed; seek that lands on a
next-greater key is intended behaviour and the
returned key is the one validated. Only mismatches
count as corruption.
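The "recompute the value from the key and byte-compare" idea can be
sketched as below. The generator is an assumption: splitmix64 stands in for
whatever derivation main.c uses, and fill_value, verify_value and
in_safe_window are hypothetical names. The shape of the safety check
(id + safety < max_written_key) follows the invariant quoted in
ARCHITECTURE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* splitmix64 step: a stand-in generator, not necessarily the bench's. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill buf with len bytes that depend only on the id. */
void fill_value(uint64_t id, uint8_t *buf, size_t len)
{
    uint64_t state = id;
    size_t off = 0;

    while (off < len) {
        uint64_t word = splitmix64(&state);
        size_t n = len - off < sizeof word ? len - off : sizeof word;
        memcpy(buf + off, &word, n);
        off += n;
    }
}

/* Recompute the expected bytes and compare: 1 = match, 0 = mismatch. */
int verify_value(uint64_t id, const uint8_t *got, size_t len)
{
    uint8_t expect[4096];

    assert(got != NULL && len <= sizeof expect);
    fill_value(id, expect, len);
    return memcmp(expect, got, len) == 0;
}

/* Sequential pattern only: is a not-found on this id a real loss? */
int in_safe_window(uint64_t id, uint64_t max_written_key, uint64_t safety)
{
    return max_written_key > safety && id < max_written_key - safety;
}
```

Because the expected bytes are recomputed from the id, the reader needs no
shadow copy of the data set, which is what keeps the check cheap at 1 TiB
scale.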
OBJECT-STORE MODE
-----------------
tidesdb supports running the LSM atop an object store. mwbench exposes
this via --objstore-mode:
fs -- local filesystem connector. objects stored as files mirroring
the key path. always available, no extra deps. valuable as a
baseline that exercises the upload pipeline + local cache +
replica logic without network overhead.
s3 -- S3-compatible (rustfs / MinIO / AWS S3). requires libtidesdb
built with -DTIDESDB_WITH_S3=ON. adds real network + protocol
cost on top of fs-mode behavior.
In either mode, tidesdb forces unified-memtable on, opens a local cache
under <data-dir>, ships sstables + manifest + WAL to the objstore via a
background upload thread pool, and exposes telemetry consumed by the
bench: local_cache_bytes_used, upload_queue_depth, total_uploads,
total_upload_failures, last_uploaded_generation.
Primary / replica:
--replica-mode opens the bench in read-only replica mode: no
writers, no delete experiment. tidesdb polls
MANIFEST + WAL from the bucket and serves reads
from a local cache hydrated on miss.
--spawn-replica orchestrator: parent opens primary, fork+execs
a sibling --replica-mode child against the same
bucket. argv is rewritten so the child gets a
scoped --data-dir (default <data-dir>_replica),
a nested --out-dir (<out>/replica), and the
primary's --objstore-fs-path. parent waits for
child at end of run.
Output for a spawn-replica run:
out/run_YYYYMMDD_HHMMSS/ <- primary
samples.csv amplification.csv summary.csv ...
replica/
run_YYYYMMDD_HHMMSS/ <- replica child
samples.csv amplification.csv ...
METRICS
-------
What gets measured and where it lives:
per-tick time series (samples.csv, one row per (tick, CF)):
workload throughput bytes_written, keys_written, write_mibs
read throughput <op>_ops_s
read latency (pooled) <op>_p50_us, _p95_us, _p99_us, _misses
read latency (per-CF) cf_<op>_p50_us, _p95_us, _p99_us
integrity mismatches, deleted_count
disk + memory disk_bytes, data_size_bytes, memtable_bytes
(memtable_bytes = sum over every CF's active
+ non-flushed immutable + the unified memtable;
NOT a per-memtable figure)
LSM state sstable_count, immutable_count, open_sstables,
flush_qsize, compact_qsize, lvl[0..15]_ssts
(sstable_count = sum across every level of
every CF; immutable_count includes flushed-
but-not-yet-cleaned entries because the
immutable queue is swept in batches)
per-CF state cf_num_levels, cf_total_keys, cf_tombstones,
cf_tombstone_ratio, cf_max_density,
cf_max_density_level, cf_memtable_size,
cf_avg_key_size, cf_avg_value_size
(in --unified-memtable mode the per-CF
memtable / immutable counters read empty
because the active memtable and WAL are
shared -- look at mtab_tot / imm_tot in
the progress line for the aggregate)
block cache cache_hits, cache_misses, cache_hit_rate
/proc/self/io deltas proc_rchar_d, proc_wchar_d, proc_read_bytes_d,
proc_write_bytes_d, proc_syscr_d, proc_syscw_d,
proc_cancelled_write_bytes_d
process resources ru_maxrss_kb, ru_minflt, ru_majflt,
ru_inblock, ru_oublock, ru_utime_us_d,
ru_stime_us_d
RAPL energy (cumulative) energy_pkg_uj, energy_core_uj,
energy_uncore_uj, energy_dram_uj
RAPL power (windowed) power_pkg_w, power_core_w, power_uncore_w,
power_dram_w
attribution cpu_share, proc_pkg_energy_j_window,
proc_pkg_energy_j_cum,
energy_per_gib_written, energy_per_mread
write amplification wa_app, wa_io, wa_disk (window + 30s)
read amplification ra_app, ra_io (window + 30s)
space amplification sa_disk, sa_data
commit latency commit_p50_us, _p95_us, _p99_us,
commits_in_window
objstore objstore_enabled, objstore_local_cache_bytes,
objstore_local_cache_max,
objstore_local_cache_files,
objstore_upload_queue_depth,
objstore_total_uploads,
objstore_total_upload_failures,
objstore_last_uploaded_gen,
objstore_replica_mode
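The energy and power columns relate to each other roughly as sketched
below. Assumptions, stated loudly: the function names are hypothetical, at
most one counter wrap per window is assumed, and process-attributed energy
is taken to be package energy scaled by cpu_share in [0, 1]. The wrap point
corresponds to the max_energy_range_uj file that Linux powercap exposes
next to energy_uj.

```c
#include <assert.h>
#include <stdint.h>

/* Microjoules consumed between two reads of a wrapping counter.
 * Assumes the counter wrapped at most once between the reads. */
uint64_t rapl_delta_uj(uint64_t prev_uj, uint64_t cur_uj,
                       uint64_t max_range_uj)
{
    assert(prev_uj <= max_range_uj && cur_uj <= max_range_uj);
    if (cur_uj >= prev_uj)
        return cur_uj - prev_uj;
    return (max_range_uj - prev_uj) + cur_uj;
}

/* Windowed power in watts: joules over the window length. */
double window_watts(uint64_t delta_uj, double window_s)
{
    return window_s > 0.0 ? ((double)delta_uj / 1e6) / window_s : 0.0;
}

/* Energy attributed to this process for the window, in joules
 * (assumed model: package energy times cpu_share). */
double process_energy_j(uint64_t delta_uj, double cpu_share)
{
    assert(cpu_share >= 0.0 && cpu_share <= 1.0);
    return ((double)delta_uj / 1e6) * cpu_share;
}

/* Efficiency: joules per GiB written; 0 when nothing was written. */
double joules_per_gib(double joules, uint64_t bytes_written)
{
    if (bytes_written == 0)
        return 0.0;
    return joules / ((double)bytes_written / 1073741824.0);
}
```

The cumulative energy_*_uj columns are the raw counters; the power_*_w
columns correspond to window_watts applied to consecutive samples.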
amplification.csv (one row per tick, single collapsed view):
elapsed_s, phase,
user_bytes_window/total, wchar/write_bytes/rchar/read_bytes
(window + total), wa_app/io/disk (window + 30s),
ra_app/io (window + 30s), cpu_share,
disk_delta_bytes, disk_total_bytes, sa_disk, sa_data,
pkg_energy_window_uj, power_pkg_w, proc_pkg_energy_j_window,
energy_per_gib_written
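A windowed ratio and its rolling-30s counterpart can be computed as in the
sketch below. The mapping of column names to counters is an assumption
drawn from the OVERVIEW wording (wa_app from wchar, wa_io from write_bytes,
wa_disk from the on-disk delta, each divided by user bytes), and the
rolling figure is assumed to be sum-of-numerators over sum-of-denominators
across the retained windows. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROLL_MAX_SLOTS 32

/* Ratio with a guard for idle windows (no user bytes). */
double amp_ratio(uint64_t num, uint64_t den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Ring of the most recent windows, e.g. cap = 30 s / sample interval. */
struct rolling_amp {
    uint64_t num[ROLL_MAX_SLOTS];
    uint64_t den[ROLL_MAX_SLOTS];
    size_t cap;   /* windows retained */
    size_t len;   /* windows currently stored */
    size_t head;  /* next slot to overwrite */
};

void rolling_init(struct rolling_amp *r, size_t cap)
{
    assert(r != NULL && cap >= 1 && cap <= ROLL_MAX_SLOTS);
    memset(r, 0, sizeof *r);
    r->cap = cap;
}

/* Record one window, evicting the oldest once the ring is full. */
void rolling_push(struct rolling_amp *r, uint64_t num, uint64_t den)
{
    r->num[r->head] = num;
    r->den[r->head] = den;
    r->head = (r->head + 1) % r->cap;
    if (r->len < r->cap)
        r->len++;
}

/* Ratio over everything still in the ring. */
double rolling_value(const struct rolling_amp *r)
{
    uint64_t num = 0, den = 0;

    for (size_t i = 0; i < r->len; i++) {
        num += r->num[i];
        den += r->den[i];
    }
    return amp_ratio(num, den);
}
```

Summing before dividing keeps a single near-idle window from dominating the
rolling figure, which averaging per-window ratios would not.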
summary.csv (one row per phase, end of run):
phase, phase_name, duration_s, ingest_gib, mean_write_mibs,
n_samples, mean_wa_app/io/disk, mean_ra_app/io, mean_sa_disk/data,
mean_pkg_w/core_w/uncore_w/dram_w, mean_cpu_share,
max_point_p99_us/seek_p99_us/range_p99_us/commit_p99_us,
max_rss_mib, max_open_ssts, cum_misses, cum_mismatches
delete_experiment.csv (one row per snapshot, four snapshots per run):
phase {peak, post_delete, post_compact, post_compact_settle},
disk_bytes, data_bytes, sstable_count, immutable_count,
tombstones, tombstone_ratio, total_deleted
config.csv (one row per CF, at startup):
cf_index, cf_name, every tunable CF knob with enums in their
human-readable string form
run_meta.csv (key,value table, at startup):
environment: hostname, os_name/release/machine, cpu_model, ncpu,
total_memory_mib, rapl_enabled, rapl_domains
library: tidesdb_version, tidesdb_has_s3, compiler, build_date
disk: data_dir, data_dir_device, data_dir_filesystem,
data_dir_total_gib, data_dir_free_gib
workload: target_bytes, value_size, batch_size, write/read_threads,
num_cfs, write_pattern, read_pattern, zipf_skew,
keyspace, range_scan_len, sample_interval_sec,
cooldown_sec, delete_fraction, delete_batch,
delete_target_cf
engine: block_cache_size, flush/compaction_threads,
max_open_sstables, unified_memtable,
unified_write_buffer, log_level, cf_write_buffer,
cf_compression, cf_sync_mode, cf_klog_value_threshold,
cf_enable_bloom_filter, cf_bloom_fpr, cf_use_btree
objstore: role {primary,replica}, spawn_replica, objstore_mode,
objstore_fs_path, objstore_endpoint, objstore_bucket,
objstore_prefix, objstore_region, objstore_use_ssl,
objstore_path_style, every objstore_cfg field
PHASES
------
Each samples.csv row carries a phase tag. The run walks through these in order:
  0  ingest        writers active, readers probing live keys
  1  cooldown      writers stopped, db quiescing, readers continue
  2  delete        --delete-fraction of the ingested ids removed from
                   the target CF(s) in batches; readers can hit
                   tombstones from this point on
  3  post-delete   settle window, snapshot taken
  4  compaction    tidesdb_compact() driven on target CFs
  5  post-compact  final settle and snapshot
In --replica-mode the bench enters cooldown immediately (no writers, no
delete) and stays there for the configured wait windows so it can build
a parallel time-series of read performance over the primary's run.
OUTPUT FILES
------------
Per run, all under out/run_YYYYMMDD_HHMMSS/:
  samples.csv             time series, N rows per tick (one per CF)
  amplification.csv       time series, 1 row per tick (collapsed)
  summary.csv             per-phase aggregates at end of run
  delete_experiment.csv   four discrete snapshots
  config.csv              per-CF effective config
  run_meta.csv            environment + workload metadata
EXAMPLES
--------
Sequential 1 GiB smoke (defaults, just to verify the pipeline):
./mwbench --quick
64 GiB sequential, 8 writers, larger memtable, 16 GiB block cache:
./mwbench --target-gib 64 --write-threads 8 \
--write-buffer $((512 * 1024 * 1024)) \
--block-cache $((16 * 1024 * 1024 * 1024))
Primary + 3 secondary indexes, mixed engine configs across CFs:
./mwbench --target-gib 64 --num-cfs 4 \
--cf-override 0:use_btree=1 \
--cf-override 1:compression=zstd \
--cf-override 2:enable_bloom_filter=0 \
--cf-override 3:klog_value_threshold=4096
Uniform random workload (stresses compaction):
./mwbench --target-gib 32 --write-pattern uniform
Zipfian (hot keys, cache-stressing):
./mwbench --target-gib 32 --write-pattern zipfian --zipf-skew 0.99
Latest (time series workload):
./mwbench --target-gib 32 --write-pattern latest
Unified memtable comparison (same workload, single shared memtable):
./mwbench --target-gib 64 --num-cfs 4 --unified-memtable
./mwbench --target-gib 64 --num-cfs 4 # per-CF baseline
Full WAL sync durability:
./mwbench --target-gib 32 --sync-mode full
Object-store mode (fs connector, no installs, no external deps):
./mwbench --target-gib 8 --num-cfs 2 \
--objstore-mode fs
Object-store mode + primary/replica orchestrator:
./mwbench --target-gib 8 --num-cfs 2 \
--objstore-mode fs --spawn-replica
# parent writes; child runs --replica-mode against the same path.
# both produce CSVs
S3 mode against a local rustfs (see notes below):
./mwbench --target-gib 8 --num-cfs 2 --objstore-mode s3 \
--objstore-endpoint 127.0.0.1:9000 \
--objstore-bucket mwbench \
--objstore-access-key mwbench \
--objstore-secret-key mwbenchsecret \
--objstore-region us-east-1 \
--objstore-path-style 1 \
--objstore-use-ssl 0
Long compaction-watch run -- give the compactor 10 minutes after the
deletes so the reclaim curve is well sampled:
./mwbench --target-gib 128 \
--delete-compact-wait-sec 600 \
--sample-interval-sec 5
Keep the data dir for post-mortem (re-run with --force to wipe + restart
or --resume to open the leftover db in-place):
./mwbench --quick --keep-data --data-dir ./scratch/data
# then later:
./mwbench --quick --resume --data-dir ./scratch/data
# or wipe and restart:
./mwbench --quick --force --data-dir ./scratch/data
Scripted / CI runs (no tty, must opt into wiping):
./mwbench --target-gib 64 --force \
--data-dir /mnt/scratch/data --out-dir /mnt/scratch/out
RUSTFS NOTES
------------
rustfs is an S3-compatible object store. The bench was developed against
the 1.0.0-beta.4 musl build run bare-metal (no Docker).
Install:
RUSTFS_HOME=/path/to/rustfs # any persistent dir, prefer non-OS disk
mkdir -p "$RUSTFS_HOME"/{bin,data,logs}
curl -fL -o "$RUSTFS_HOME"/rustfs.zip \
https://dl.rustfs.com/artifacts/rustfs/release/rustfs-linux-x86_64-musl-latest.zip
unzip -o "$RUSTFS_HOME"/rustfs.zip -d "$RUSTFS_HOME"/bin
Start (binds 9000 = S3, 9001 = console; logs JSON to stderr):
nohup "$RUSTFS_HOME"/bin/rustfs server \
--address 127.0.0.1:9000 \
--console-address 127.0.0.1:9001 \
--access-key mwbench \
--secret-key mwbenchsecret \
--region us-east-1 \
"$RUSTFS_HOME"/data > "$RUSTFS_HOME"/logs/rustfs.log 2>&1 &
Pre-create the bench bucket via aws CLI (or the rustfs console at :9001):
AWS_ACCESS_KEY_ID=mwbench AWS_SECRET_ACCESS_KEY=mwbenchsecret \
aws --endpoint-url http://127.0.0.1:9000 s3 mb s3://mwbench
Stop:
pkill -f "rustfs/bin/rustfs server"
CONSOLE OUTPUT
--------------
Startup banner echoes the resolved config in a few lines: workload
(target/value/threads/num_cfs), engine (buffer / cache / pool sizes),
patterns (write/read/keyspace), delete config, role + objstore, host /
OS / CPU / memory, tidesdb version / compiler / build date, data-dir
disk. Per-CF override diffs print one line per CF.
In unified-memtable mode the banner shows wbuf_u (the unified write
buffer) and the unified sync mode instead of the per-CF values, because
tidesdb ignores the per-CF write_buffer_size / sync_mode when unified
mode is on. When --num-cfs > 1 is combined with --unified-memtable, the
banner also prints an operator note pointing out that per-CF memtable
stats will read empty -- the unified counters carry that data instead.
The unified_memtable banner field reflects the value tidesdb will
actually use, including the silent force-enable that happens whenever
--objstore-mode is fs or s3 (object-store mode requires unified mode).
During the run:
- one progress tick per second:
[t= 120s ingest ] 3.42 GiB W=291.5 MiB/s keys=...
ssts_tot=37 imm_tot=2 mtab_tot= 412.0 MiB fq=0 cq=0
- a fuller line every --sample-interval-sec:
t= 120.0s GiB= 3.42 W=291.5 MiB/s
pt(p50/p99)=18.3/92.4 sk(p99)=410.1 rg(p99)=1820.6
ssts_tot=37 imm_tot=2 mtab_tot= 412.0MiB miss=0/4096000 corrupt=0
pkg= 92.4W WA(io)= 2.13
Column meanings (from tidesdb_get_db_stats):
ssts_tot total SSTable files across every level of every CF,
not L0-only and not per-CF
imm_tot queue_size of every CF's immutable queue + the unified
immutable queue, including entries already flushed
but not yet swept by the batched cleanup
mtab_tot live skip-list bytes of every active + non-flushed
immutable memtable, summed across CFs and unified;
NOT one memtable's bytes
fq / cq the single db-wide flush / compaction work queues
Read these as global aggregates; pair them with --num-cfs and (in
unified mode) the operator note printed at startup.
- delete-phase progress every ~5% of the plan:
deleted 5000000/12000000 (41.7%) rate=215000 keys/s
- end-of-run per-phase summary table to stdout, plus the four CSV
paths and a summary CSV path.
TROUBLESHOOTING
---------------
"data-dir <path> is non-empty and stdin is not a tty.":
Scripted run hit a non-empty data-dir without consent. Either pass
--force / --yes / -y to wipe non-interactively, or --resume to open
the existing db in place. The bench is intentionally conservative
here -- it will not silently overwrite a populated dir.
"data-dir <path> is non-empty. wipe it and start fresh? [y/N]:":
Interactive prompt before wiping. Answer y/Y to wipe; anything else
(including just Enter) aborts the run cleanly. Pass --force on the
command line to skip the prompt.
"Failed to create database directory ...":
The parent path could not be created. mkdir -p semantics apply, so
this usually means a permission issue or a parent that points at a
non-existent mount.
mismatches > 0 in samples.csv:
Real data corruption. The column counts byte-compare failures on
values that were definitely written and not deleted. Reproduce with
the same config and file an issue.
unexpected misses in sequential workloads:
Indicates lost writes. The reader only probes ids inside a safety
window below max_written_key, so a miss is real. Two paths count:
- point get returns not-found
- seek returns a key whose id != the requested id (the bench
flags this as a miss rather than silently validating whatever
next-greater key the iterator surfaced)
In non-sequential workloads (uniform/zipfian/latest) misses are
statistically expected and suppressed -- only mismatches count.
writers stall under sustained TDB_ERR_BUSY:
The engine throttles writers (returns TDB_ERR_BUSY out of
txn_begin / txn_put / txn_commit) when flush or compaction cannot
drain fast enough. The bench treats this as backpressure: each
writer sleeps with exponential backoff (200 us minimum, 50 ms cap)
and retries the same batch instead of exiting, so a busy spike no
longer terminates the run early. If the throughput line stays at
~0 MiB/s for many seconds AND ssts_tot is not advancing, the
flush pool is wedged -- see the next entry.
reader latency dominated by L0:
Raise --flush-threads so memtables drain faster and --write-buffer so
each flush produces fewer, larger L0 sstables, or raise
--compaction-threads so L0->L1 doesn't back up. In unified mode bump
--unified-write-buffer instead. Watch ssts_tot / imm_tot
in the progress line: an imm_tot that plateaus while ssts_tot does
not advance is the symptom of a flush pool that is saturated (every
slot in the max_concurrent_flushes cap held by an in-flight write).
objstore upload failures > 0:
Check connectivity to the endpoint and credentials. For fs connector,
the failure usually means the target path was made unreadable mid-run.
replica child fails with "no CFs visible":
The primary hadn't uploaded a manifest within the 60 s poll window
-- usually means the primary's ingest is too short and finished
before its first flush. Bump --target-gib or pre-populate the bucket
from an earlier primary run.
S3 mode crashes immediately on tidesdb_open:
Likely tidesdb wasn't actually built with -DTIDESDB_WITH_S3=ON.
Check the version header has #define TIDESDB_HAS_S3. If it does
but the bench still crashes, suspect a partial rebuild -- do a
clean rebuild of libtidesdb.
--cf-override unknown key '...':
The parse-time error prints the apply_cf_kv whitelist verbatim.
--cf-override idx N >= num_cfs:
Caught at parse time, before tidesdb_open. Either bump --num-cfs or
drop the offending override.
disk fills up mid-run:
Lower --target-gib, move --data-dir to a bigger volume. In multi-CF
mirror mode the on-disk footprint is roughly num_cfs times a single-
CF run for the same target. In objstore mode disk_bytes counts only
the local cache, not the bucket -- check the bucket size separately.
FILES
-----
main.c the bench
CMakeLists.txt build definition (produces build/mwbench)
out/run_YYYYMMDD_HHMMSS/ per-run output:
config.csv
run_meta.csv
samples.csv
amplification.csv
summary.csv
delete_experiment.csv
replica/run_YYYYMMDD_HHMMSS/...
(only with --spawn-replica)
==============================================================================