GitHub - tidesdb/mwbench: Mixed stress workload for TidesDB


==============================================================================
                  mwbench -- a mixed-workload stress bench for tidesdb
==============================================================================


OVERVIEW
--------

mwbench drives a tidesdb instance through a configurable write workload while
a pool of reader threads continuously probes it with point gets, iterator
seeks, and short range scans. Every metric is dimensioned along several
axes so a single run produces enough data to answer most "how does X behave
when Y changes" questions about an LSM engine:

  - workload: sequential / uniform / zipfian / latest key patterns, both
    for writes and for reads (reads can also inherit the write pattern)
  - storage layout: 1..64 column families with per-CF config overrides,
    optional unified memtable mode (single shared memtable + WAL)
  - storage backend: local-only, object-store via the tidesdb fs connector
    (no extra deps), or S3-compatible (rustfs / MinIO / AWS S3) via the
    library's S3 connector
  - role: primary or read-only replica polling MANIFEST + WAL from the
    object store. one mwbench invocation can spawn a sibling replica
    against the same bucket and produce side-by-side CSVs
  - integrity: every value read back is byte-compared against a
    deterministic recomputation from its key. tombstoned ids are
    recognised in O(1) via the deletion pattern, not counted as misses
  - amplification: write amp (syscall / block / disk), read amp
    (syscall / block / internal sstables), space amp -- both per-sample
    window and rolling 30 s. derived from /proc/self/io deltas + tidesdb
    db stats + filesystem walk
  - power: Intel/AMD RAPL energy counters (package / core / uncore / dram
    when present), windowed watts, process-attributed energy via
    cpu_share, J/GiB and J/M-reads efficiency. zero-config when /sys/class
    /powercap is readable
  - process: /proc/self/io, getrusage, peak RSS, page faults, CPU% with
    the saturation ceiling annotated
  - commit latency: per-commit nanos timing, p50/p95/p99 sampled by the
    writers. surfaces mirror-mode fan-out cost at multi-CF setups
  - delete-reclaim experiment: a configurable fraction of ingested keys
    (default 5%, batched 10K per txn) is tombstoned at the tail, then
    compacted. tombstone count + disk reclaim captured at four discrete
    snapshots

Per-run output is a directory under --out-dir holding six CSVs.


ARCHITECTURE
------------

Threading model:
                                                      .---------.
       .---------------.   commits   .------------.   |         |
       | writer threads| -----.----->| tidesdb     |  |         |
       '---------------'      |      | column      |  |  disk   |
       .---------------.      |      | families    |  |  (local |
       | reader threads| <----'----- |             |<-|  +/or   |
       '---------------'  probes     '------------'   |  obj-   |
              |                         |             |  store) |
              |  per-(op,cf) latency    |  db_stats   |         |
              v  buckets                v             '---------'
       .---------------.         .---------------.
       | sampler thread|         | rusage / proc |
       | (1 s tick +   |         | io / RAPL     |
       | sample-       | ------> | snapshots     |
       | interval s    |         '---------------'
       | aggregation)  |
       '---------------'
              |
              v
       .---------------.
       | samples.csv,  |
       | amplification |
       | .csv, ...     |
       '---------------'

Per writer thread: claims a contiguous id range from a global atomic
(sequential) or draws ids from a per-thread PRNG (uniform / zipfian /
latest). Mirrors each id to every column family in a single transaction;
times the commit. Mirrored writes all count toward the byte target, so a
64 GiB target with --num-cfs 4 produces 64 GiB of total writes (16 GiB of
unique logical data written four times).
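
The arithmetic behind that example can be sketched as follows. This is a
minimal illustration, not code from main.c: mirror_budget_t and
mirror_budget() are hypothetical names, and the sketch assumes only value
payload bytes count toward the target (key bytes are ignored).

```c
#include <stdint.h>

/* Hypothetical helper: split a total byte target across mirrored CFs. */
typedef struct {
    uint64_t unique_bytes; /* logical data written once per CF */
    uint64_t unique_keys;  /* distinct ids needed to reach the target */
} mirror_budget_t;

static mirror_budget_t mirror_budget(uint64_t target_bytes, uint32_t num_cfs,
                                     uint32_t value_size)
{
    mirror_budget_t b = {0, 0};
    if (num_cfs == 0 || value_size == 0)
        return b; /* guard against division by zero */
    /* every id is written num_cfs times, so unique data is 1/num_cfs */
    b.unique_bytes = target_bytes / num_cfs;
    /* assumption: only the value payload counts toward the target */
    b.unique_keys = b.unique_bytes / value_size;
    return b;
}
```

With a 64 GiB target, 4 CFs and 1024-byte values this gives 16 GiB of
unique data spread over 16777216 distinct ids.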

TDB_ERR_BUSY from tidesdb_txn_begin, tidesdb_txn_put, or
tidesdb_txn_commit is treated as backpressure, not as a fatal error: the
writer sleeps with exponential backoff (capped at 50 ms) and retries the
batch instead of exiting. For the sequential pattern the previously
claimed seq_base is preserved across retries, so the same ids are
rewritten and the reader's "all ids < max_written_key - safety are
committed somewhere" invariant holds. Any other error return still breaks
the writer loop. The same backpressure semantics apply to the delete-phase
txn loop.
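
A capped exponential backoff of the kind described above could look like
the sketch below. Only the 50 ms cap comes from this README; the 100 us
starting delay, the doubling factor, and the name backoff_next_us() are
assumptions made for illustration.

```c
#include <stdint.h>

#define BACKOFF_CAP_US   50000u /* 50 ms cap, as stated above */
#define BACKOFF_START_US 100u   /* assumed starting delay, not from main.c */

/* Returns the delay to sleep before the next retry and doubles the stored
 * delay, saturating at the cap. Set *delay_us back to 0 after a batch
 * commits so the next busy episode starts from the short delay again. */
static uint32_t backoff_next_us(uint32_t *delay_us)
{
    uint32_t cur = (*delay_us == 0) ? BACKOFF_START_US : *delay_us;
    if (cur > BACKOFF_CAP_US)
        cur = BACKOFF_CAP_US;
    uint32_t next = cur * 2u; /* cur <= 50000, so this cannot overflow */
    *delay_us = (next > BACKOFF_CAP_US) ? BACKOFF_CAP_US : next;
    return cur;
}
```

A writer would sleep for the returned number of microseconds (for example
with nanosleep) and then retry the same batch with the same ids.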

Per reader thread: holds one long-lived read-only tidesdb_txn_t and
recycles it through rollback + reset every MWB_READER_TXN_RESET_OPS ops
to release per-op buffers (tidesdb's READ_COMMITTED default refreshes
the snapshot per read, so a persistent handle still sees freshly-
committed writes). The point / seek / range op handlers each use their
own distinct tidesdb API (txn_get for point, iter_seek for seek, seek +
iter_next loop for range); only the underlying txn handle is shared.
Picks a CF uniformly per op, picks an id from the configured read
pattern, runs one of the three ops with equal probability, records
latency into a per-(op, cf) bucket. Tombstoned-id steering: when a
non-empty deleted set exists in the target CF, 30% of reads are aimed
at the steering subset so the sampler captures the tombstone-walk
latency.

Sampler thread: a 1 s tick prints a one-line progress update; every
--sample-interval-sec it drains the latency buckets across all readers,
computes pooled and per-CF percentiles, reads tidesdb_db_stats_t,
tidesdb_get_cache_stats, walks the data dir for du-style size,
reads /proc/self/io + getrusage + RAPL energy counters, computes
write / read / space amplification both windowed and rolling-30s, and
emits one row per CF into samples.csv plus one row per tick into
amplification.csv. Per-phase aggregators close out into summary.csv at
end of run.
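
As an illustration of the percentile step, here is a nearest-rank
percentile over a drained array of latency samples. It is a stand-in, not
the bench's method: the bucket layout mwbench actually uses is not shown
in this README, and percentile_us() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile. Sorts samples in place. pct is a whole
 * percentage in 1..100 (50, 95, 99, ...). Returns 0 for an empty window. */
static uint64_t percentile_us(uint64_t *samples, size_t n, unsigned pct)
{
    if (n == 0)
        return 0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    /* rank = ceil(n * pct / 100), computed in integers to avoid
     * floating-point rounding at the bucket edges */
    size_t rank = (n * pct + 99) / 100;
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;
    return samples[rank - 1];
}
```

Pooled percentiles would run this over the samples of all readers merged
together; per-CF percentiles would run it over each CF's samples alone.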


CAPABILITY MATRIX
-----------------

This is the short list of dimensions the bench exposes. Every row is
controlled by one or more CLI flags (see OPTIONS below).

  axis                     options / range
  -----------------------  ---------------------------------------------
  write pattern            sequential | uniform | zipfian | latest
  read pattern             sequential | uniform | zipfian | latest |
                           match_write
  num CFs                  1..64, each with per-CF config overrides
  unified memtable         off | on (separate buffer sized via flag)
  compression              none | snappy | lz4 | zstd | lz4_fast
  sync mode                none | full | interval
  default isolation        read_uncommitted | read_committed |
                           repeatable_read | snapshot | serializable
  klog format              skiplist (default) | b+tree
  bloom filter             on | off, with FPR knob
  block indexes            on | off, with sample ratio + prefix len
  storage backend          local | fs connector | S3 (rustfs/MinIO/AWS)
  role                     primary | replica (read-only)
  primary+replica          single-process orchestrator: fork+exec a
                           sibling in --replica-mode against the same
                           objstore bucket
  delete experiment        fraction in [0, 1], batch size, target cf or
                           all-cfs
  RAPL energy              auto-probed at startup; degrades gracefully
                           when /sys/class/powercap is locked down
  read profiling           reserved (#ifdef TIDESDB_ENABLE_READ_PROFILING
                           hook; zeros otherwise)


BUILD
-----

mwbench is a single C file linked against tidesdb + pthreads + libm,
built with CMake. 

    cmake -S . -B build \
          -DTIDESDB_ROOT=/path/to/tidesdb \
          -DTIDESDB_BUILD=/path/to/tidesdb/build
    cmake --build build -j

That produces ./build/mwbench. Every invocation below is written as
./mwbench for readability -- substitute ./build/mwbench or symlink:

    ln -s build/mwbench mwbench

TIDESDB_ROOT and TIDESDB_BUILD default to paths baked into CMakeLists.txt;
override on the cmake line if your tidesdb checkout is elsewhere. The
build embeds an rpath pointing at the tidesdb build dir so libtidesdb.so
resolves at runtime without LD_LIBRARY_PATH.

Optional dependency for S3 mode:
    The S3 connector lives in libtidesdb and is compiled in only when the
    library itself was built with -DTIDESDB_WITH_S3=ON. Verify with:
        grep TIDESDB_HAS_S3 /path/to/tidesdb/build/include/tidesdb/tidesdb_version.h
    A non-empty match means S3 mode is available. mwbench picks it up
    automatically -- no rebuild needed when the lib's S3 support toggles.



QUICK START
-----------

Smoke test (1 GiB target, short cooldowns, all output cleaned up):

    ./mwbench --quick

Medium overnight run (64 GiB target, 8 writers, 4 readers, generous cache):

    ./mwbench --target-gib 64 --write-threads 8 --read-threads 4 \
              --block-cache $((16 * 1024 * 1024 * 1024)) \
              --sample-interval-sec 15



RUNNING
-------

mwbench is configured via CLI flags. Defaults are tuned for a 1 TiB
ingest; override per scenario. 

The data-dir is wiped on close by default (it's intended to be throwaway).
The out-dir is never overwritten -- each invocation creates a new
run_YYYYMMDD_HHMMSS subdir alongside any earlier runs.


OPTIONS
-------

Workload size:

    --target-bytes N            total bytes to ingest                 1 TiB
    --target-gib   N            same, expressed in GiB                  --
    --value-size   N            per-value payload size, bytes         1024
    --batch-size   N            keys per writer txn                    256

Multi-column-family:

    --num-cfs      N            number of column families to create     1
                                each commit mirrors the same key+value
                                into all N CFs (secondary-index style)
    --cf-override IDX:KEY=VAL   patch a single CF's config; repeatable.
                                IDX is the 0-based CF index, KEY is in
                                the apply_cf_kv whitelist below. later
                                overrides on the same key win.

Key patterns:

    --write-pattern PAT         sequential | uniform | zipfian | latest
                                (default sequential -- LSM best case)
    --read-pattern  PAT         same set plus match_write (default).
                                match_write inherits whichever pattern
                                writers use.
    --zipf-skew    F            zipfian theta in (0, 1)            0.99
    --keyspace     N            id pool size for non-sequential
                                (default: target_bytes / value_size)

Concurrency:

    --write-threads N           parallel writer threads                  4
    --read-threads  N           parallel reader threads                  2
    --flush-threads N           tidesdb flush worker count               4
    --compaction-threads N      tidesdb compaction worker count          4

Database-level engine config (tidesdb_config_t):

    --block-cache         N     clock-cache size, bytes              1 GiB
    --max-open-sstables   N     ceiling on open sst file handles    65536
    --max-memory          N     global memory cap, bytes (0 = auto)     0
    --log-level    LVL          debug | info | warn | error | fatal | none
    --log-to-file               write tidesdb log to a file in data-dir
    --log-truncate-at     N     truncate log file at N bytes (0 = never)

Unified-memtable mode (all CFs share one memtable + WAL):

    --unified-memtable          enable unified mode
    --unified-write-buffer    N memtable rotation threshold, bytes  256 MiB
                                the 256 MiB default applies only when
                                --unified-memtable is set and no explicit
                                value is given
    --unified-skip-list-max-level     N    default 12
    --unified-skip-list-probability   F    default 0.25
    --unified-sync-mode    MODE   none | full | interval     default none
    --unified-sync-interval-us  N microseconds between syncs   128000

Per-CF engine tuning (applied to every CF unless overridden):

    --write-buffer   N          per-CF memtable size, bytes         64 MiB
    --write-buffer-size N       alias for --write-buffer (matches the
                                underlying tidesdb field name)
    --compression  NAME         none | snappy | lz4 | zstd | lz4_fast
    --sync-mode    MODE         none | full | interval
    --sync-interval-us N        sync interval in microseconds
    --bloom-filter 0|1          enable bloom filters                     1
    --bloom-fpr      F          target false-positive rate            0.01
    --block-indexes 0|1         enable block indexes                     1
    --index-sample-ratio N      block-index sampling ratio               1
    --block-index-prefix-len N  prefix length in bytes                  16
    --klog-value-threshold N    values >= N bytes go to .vlog       16 KiB
    --level-size-ratio N        LSM level capacity ratio                10
    --min-levels     N          minimum number of disk levels            1
    --dividing-level-offset N   spooky dividing-level offset (0,1,2..)   1
    --l1-file-count-trigger N   L1 file-count compaction trigger
    --l0-queue-stall-threshold N L0 queue depth at which writers stall
    --tombstone-density-trigger F per-sst tombstone density that
                                  escalates compaction priority [0,1]
    --tombstone-density-min-entries N min entries before density counts
    --use-btree     0|1         use B+tree klog format                   0
    --isolation-level NAME      default txn isolation:
                                read_uncommitted | read_committed |
                                repeatable_read | snapshot | serializable
    --skip-list-max-level N     skip-list max level                     12
    --skip-list-probability F   skip-list probability                 0.25
    --min-disk-space N          min free disk space (bytes)        100 MiB

apply_cf_kv whitelist (keys valid in --cf-override):

    write_buffer_size, level_size_ratio, min_levels, dividing_level_offset,
    klog_value_threshold, compression, enable_bloom_filter, bloom_fpr,
    enable_block_indexes, index_sample_ratio, block_index_prefix_len,
    sync_mode, sync_interval_us, skip_list_max_level, skip_list_probability,
    isolation_level, min_disk_space, l1_file_count_trigger,
    l0_queue_stall_threshold, tombstone_density_trigger,
    tombstone_density_min_entries, use_btree
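
To make the IDX:KEY=VAL grammar concrete, the sketch below splits one
--cf-override argument and checks the key against the whitelist above.
It is an assumption about how such parsing could be done, not a copy of
apply_cf_kv; cf_override_t and parse_cf_override() are hypothetical names
and value validation is left out.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  cf_index;
    char key[64];
    char val[64];
} cf_override_t;

static const char *const k_whitelist[] = {
    "write_buffer_size", "level_size_ratio", "min_levels",
    "dividing_level_offset", "klog_value_threshold", "compression",
    "enable_bloom_filter", "bloom_fpr", "enable_block_indexes",
    "index_sample_ratio", "block_index_prefix_len", "sync_mode",
    "sync_interval_us", "skip_list_max_level", "skip_list_probability",
    "isolation_level", "min_disk_space", "l1_file_count_trigger",
    "l0_queue_stall_threshold", "tombstone_density_trigger",
    "tombstone_density_min_entries", "use_btree", NULL
};

/* Returns 0 on success, -1 if arg is malformed, the index is out of
 * range, or the key is not whitelisted. */
static int parse_cf_override(const char *arg, int num_cfs, cf_override_t *out)
{
    char *end = NULL;
    long idx = strtol(arg, &end, 10);
    if (end == arg || *end != ':')
        return -1; /* no digits, or no ':' after the index */
    if (idx < 0 || idx >= num_cfs)
        return -1; /* IDX is 0-based */
    const char *key = end + 1;
    const char *eq = strchr(key, '=');
    if (eq == NULL || eq == key || eq[1] == '\0')
        return -1; /* need a non-empty KEY and a non-empty VAL */
    size_t klen = (size_t)(eq - key), vlen = strlen(eq + 1);
    if (klen >= sizeof out->key || vlen >= sizeof out->val)
        return -1;
    memcpy(out->key, key, klen);
    out->key[klen] = '\0';
    memcpy(out->val, eq + 1, vlen + 1); /* copies the terminating NUL */
    for (size_t i = 0; k_whitelist[i] != NULL; i++) {
        if (strcmp(out->key, k_whitelist[i]) == 0) {
            out->cf_index = (int)idx;
            return 0;
        }
    }
    return -1;
}
```

Applying parsed overrides in command-line order gives the "later
overrides on the same key win" behaviour for free.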

Object-store mode:

    --objstore-mode    MODE     none | fs | s3       default none
                                fs  = local filesystem connector, no
                                      external deps. stores objects as
                                      files under --objstore-fs-path
                                s3  = S3-compatible. requires libtidesdb
                                      built with -DTIDESDB_WITH_S3=ON
                                      and a running endpoint (rustfs,
                                      MinIO, AWS, ...)
    --objstore-fs-path PATH     fs-connector root (default <data-dir>_objstore)

  S3 endpoint (only with --objstore-mode s3):

    --objstore-endpoint HOST:PORT   e.g. 127.0.0.1:9000
    --objstore-bucket   NAME        bucket name (must exist; pre-create
                                    via aws s3 mb or the rustfs console)
    --objstore-prefix   STR         optional key prefix
    --objstore-access-key  K
    --objstore-secret-key  S
    --objstore-region   R           default "us-east-1"
    --objstore-use-ssl   0|1        default 0 (HTTP)
    --objstore-path-style 0|1       default 1 (rustfs / MinIO style)

  Object-store config (applies to either backend):

    --objstore-cache-bytes      N   local cache cap, bytes (0 = no cap)
    --objstore-cache-on-read    0|1
    --objstore-cache-on-write   0|1
    --objstore-uploads          N   parallel upload threads
    --objstore-downloads        N   parallel download threads
    --objstore-multipart-threshold  N   bytes; above this, multipart
    --objstore-multipart-part   N   multipart chunk size
    --objstore-sync-manifest    0|1 upload MANIFEST after each compact
    --objstore-replicate-wal    0|1 upload closed WAL segments
    --objstore-wal-upload-sync  0|1 block flush on WAL upload
    --objstore-wal-sync-bytes   N   active-WAL sync threshold (bytes)
    --objstore-wal-sync-on-commit 0|1   sync after every commit (RPO=0)

Primary / replica orchestration:

    --replica-mode              this process opens tidesdb read-only and
                                runs readers + sampler only. requires
                                --objstore-mode fs|s3 and a populated
                                bucket / fs root from a primary
    --spawn-replica             this process is a primary AND fork+execs
                                a sibling --replica-mode child against
                                the same objstore (data dirs / out dirs
                                scoped per process)
    --replica-data-dir PATH     local cache dir for the spawned child
                                (default <data-dir>_replica)
    --replica-sync-us  N        replica MANIFEST poll interval (us)
    --replica-replay-wal 0|1    replay WAL for near-real-time replica
                                reads (default on)

Delete-reclaim experiment:

    --delete-fraction  F        fraction of ingested keys to delete  0.05
                                (5% by default). set to 0 to skip.
    --delete-batch     N        keys per delete txn                 10000
    --delete-target-cf IDX|all  delete scope. default 0 (cf_0 only).
                                "all" tombstones every CF in the run
    --delete-compact-wait-sec N seconds in the compaction phase after
                                tidesdb_compact() (default 60)

Sampling and timing:

    --sample-interval-sec N     seconds between CSV rows                10
    --range-scan-len     N      keys per range probe                   100
    --cooldown-sec       N      idle window after writers finish        30

Paths and lifecycle:

    --data-dir PATH             tidesdb data directory (default ./data)
    --out-dir  PATH             parent of the run_* subdirs (default ./out)
    --resume                    open existing data dir instead of fresh
                                (skips the wipe + wipe-confirmation entirely)
    --keep-data                 do not wipe data dir on clean exit
    --force, --yes, -y          skip the y/N confirmation prompt before
                                wiping a non-empty data-dir. required for
                                non-interactive (no-tty) runs that need to
                                wipe -- otherwise the bench aborts safely

  Directory handling:

    Both --data-dir and --out-dir are auto-created (mkdir -p semantics) if
    any parent directory is missing.

    If --data-dir already exists and is non-empty, the bench:
      - with --resume: leaves it alone and reopens the existing db
      - with --force (or --yes / -y): wipes and recreates without asking
      - with an interactive tty: prompts
            "data-dir <path> is non-empty. wipe it and start fresh? [y/N]:"
        anything except y/Y aborts the run (data dir untouched)
      - non-interactive (piped / scripted) with neither flag: aborts with
        a message explaining --force vs --resume

    --out-dir is never wiped. Each run lands in a new run_YYYYMMDD_HHMMSS
    subdir so earlier runs are preserved.

    --spawn-replica's child process always runs with --force injected so
    the spawner never blocks on a prompt -- the replica's data dir is
    just an objstore cache, not persistent data.
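
The data-dir rules above reduce to a small decision function. The sketch
below is illustrative only: the enum and decide_data_dir() are
hypothetical names, and giving --resume precedence over --force is an
assumption based on --resume skipping the wipe entirely.

```c
#include <stdbool.h>

typedef enum {
    DATA_DIR_FRESH,  /* missing or empty: create it and proceed */
    DATA_DIR_REOPEN, /* --resume: keep contents, reopen the db */
    DATA_DIR_WIPE,   /* --force / --yes / -y: wipe without asking */
    DATA_DIR_PROMPT, /* interactive tty: ask [y/N] */
    DATA_DIR_ABORT   /* non-interactive with neither flag */
} data_dir_action_t;

static data_dir_action_t decide_data_dir(bool non_empty, bool resume,
                                         bool force, bool is_tty)
{
    if (!non_empty)
        return DATA_DIR_FRESH;
    if (resume)
        return DATA_DIR_REOPEN; /* assumed to win over --force */
    if (force)
        return DATA_DIR_WIPE;
    if (is_tty)
        return DATA_DIR_PROMPT;
    return DATA_DIR_ABORT;
}
```

In a real program is_tty would typically come from isatty(STDIN_FILENO).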

Preset:

    --quick                     1 GiB target, 2 s sample interval,
                                5 s cooldown, 10 s compact wait


KEY PATTERNS
------------

Writers and readers each choose ids under one of four distributions. The
choice changes the stress the workload puts on the LSM:

    sequential  monotonically increasing ids. LSM best case: L0 sstables
                have disjoint key ranges so compaction is concat-only;
                reads hit one sstable per level; bloom filter overhead
                is minimal. baseline / control.

    uniform     uniform random over [0, keyspace). SSTs overlap heavily
                so compaction is real merging work; reads exercise the
                bloom filter false-positive path. closest to a real-world
                random-access workload.

    zipfian     power-law skew with theta in (0,1). hot keys cluster,
                stressing the block cache; long tail probes the bloom
                false-positive cost. classic YCSB pattern.

    latest      exponentially weighted toward max_written_key (95% mass
                in top 5%). time-series / write-recent-read-recent
                workloads. exercises L0 + memtable for reads.

Reads default to read-pattern=match_write -- whichever distribution
writers use, readers mirror. Override to e.g. --write-pattern uniform
--read-pattern zipfian to study cache effectiveness against a flat
write distribution.
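
A minimal sketch of two pieces of that selection: resolving match_write
to the writers' pattern, and drawing a uniform id from a per-thread PRNG.
The enum, the function names, and the choice of splitmix64 as generator
are assumptions; the README does not say which PRNG mwbench uses. The
plain modulo has a slight bias for keyspaces that are not a power of two,
which is negligible at 64 bits.

```c
#include <stdint.h>

typedef enum {
    PAT_SEQUENTIAL, PAT_UNIFORM, PAT_ZIPFIAN, PAT_LATEST, PAT_MATCH_WRITE
} pattern_t;

/* match_write inherits whatever the writers use; anything else is
 * taken as given. */
static pattern_t effective_read_pattern(pattern_t read_pat,
                                        pattern_t write_pat)
{
    return (read_pat == PAT_MATCH_WRITE) ? write_pat : read_pat;
}

/* splitmix64: small, fast generator; one state word per thread so
 * threads never contend on shared PRNG state. */
static uint64_t splitmix64_next(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform id in [0, keyspace). Returns 0 for an empty keyspace. */
static uint64_t uniform_id(uint64_t *state, uint64_t keyspace)
{
    if (keyspace == 0)
        return 0;
    return splitmix64_next(state) % keyspace;
}
```

Seeding each thread's state differently (for example from the thread
index) keeps the threads from drawing identical id sequences.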

Integrity check semantics:

    sequential -- ids 0..max_written_key are guaranteed written, so any
                  "not found" inside the safety window counts as data
                  loss. Seek that lands on a different key inside the
                  safe window is also data loss -- the bench counts it
                  as a miss rather than silently validating whatever
                  next-greater key the iterator surfaced. Mismatches
                  always count.

    others     -- no per-key tracking. "not found" is statistically
                  expected and suppressed; seek that lands on a
                  next-greater key is intended behaviour and the
                  returned key is the one validated. Only mismatches
                  count as corruption.
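
The byte-compare itself depends on values being recomputable from the
key. One way to get that property is sketched below; the generator (a
splitmix64-style mixer seeded with the id) and the function names are
assumptions, since the actual value derivation in main.c is not
described here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint64_t mix64(uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill buf with len bytes that depend only on id. Words are copied in
 * host byte order, so values are comparable on the same machine only. */
static void value_for_key(uint64_t id, uint8_t *buf, size_t len)
{
    uint64_t state = id;
    for (size_t off = 0; off < len; off += 8) {
        state += 0x9E3779B97F4A7C15ULL;
        uint64_t word = mix64(state);
        size_t n = (len - off < 8) ? (len - off) : 8;
        memcpy(buf + off, &word, n);
    }
}

/* Regenerate the expected bytes chunk by chunk and compare. Returns 1 on
 * a full match, 0 on the first differing chunk (a "mismatch"). */
static int value_matches(uint64_t id, const uint8_t *got, size_t len)
{
    uint64_t state = id;
    for (size_t off = 0; off < len; off += 8) {
        state += 0x9E3779B97F4A7C15ULL;
        uint64_t word = mix64(state);
        size_t n = (len - off < 8) ? (len - off) : 8;
        if (memcmp(got + off, &word, n) != 0)
            return 0;
    }
    return 1;
}
```

Because nothing is stored per key, the check needs no side table: the
reader only needs the id it asked for (or, for a seek under the
non-sequential patterns, the id of the key the iterator returned).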


OBJECT-STORE MODE
-----------------

tidesdb supports running the LSM atop an object store. mwbench exposes
this via --objstore-mode:

  fs   -- local filesystem connector. objects stored as files mirroring
          the key path. always available, no extra deps. valuable as a
          baseline that exercises the upload pipeline + local cache +
          replica logic without network overhead.

  s3   -- S3-compatible (rustfs / MinIO / AWS S3). requires libtidesdb
          built with -DTIDESDB_WITH_S3=ON. adds real network + protocol
          cost on top of fs-mode behavior.

In either mode, tidesdb forces unified-memtable on, opens a local cache
under <data-dir>, ships sstables + manifest + WAL to the objstore via a
background upload thread pool, and exposes telemetry consumed by the
bench: local_cache_bytes_used, upload_queue_depth, total_uploads,
total_upload_failures, last_uploaded_generation.

Primary / replica:

    --replica-mode      opens the bench in read-only replica mode: no
                        writers, no delete experiment. tidesdb polls
                        MANIFEST + WAL from the bucket and serves reads
                        from a local cache hydrated on miss.

    --spawn-replica     orchestrator: parent opens primary, fork+execs
                        a sibling --replica-mode child against the same
                        bucket. argv is rewritten so the child gets a
                        scoped --data-dir (default <data-dir>_replica),
                        a nested --out-dir (<out>/replica), and the
                        primary's --objstore-fs-path. parent waits for
                        child at end of run.

Output for a spawn-replica run:

    out/run_YYYYMMDD_HHMMSS/                     <- primary
        samples.csv  amplification.csv  summary.csv  ...
        replica/
            run_YYYYMMDD_HHMMSS/                 <- replica child
                samples.csv  amplification.csv  ...



METRICS
-------

What gets measured and where it lives:

  per-tick time series (samples.csv, one row per (tick, CF)):

    workload throughput        bytes_written, keys_written, write_mibs
    read throughput            <op>_ops_s
    read latency (pooled)      <op>_p50_us, _p95_us, _p99_us, _misses
    read latency (per-CF)      cf_<op>_p50_us, _p95_us, _p99_us
    integrity                  mismatches, deleted_count
    disk + memory              disk_bytes, data_size_bytes, memtable_bytes
                               (memtable_bytes = sum over every CF's active
                               + non-flushed immutable + the unified memtable;
                               NOT a per-memtable figure)
    LSM state                  sstable_count, immutable_count, open_sstables,
                               flush_qsize, compact_qsize, lvl[0..15]_ssts
                               (sstable_count = sum across every level of
                               every CF; immutable_count includes flushed-
                               but-not-yet-cleaned entries because the
                               immutable queue is swept in batches)
    per-CF state               cf_num_levels, cf_total_keys, cf_tombstones,
                               cf_tombstone_ratio, cf_max_density,
                               cf_max_density_level, cf_memtable_size,
                               cf_avg_key_size, cf_avg_value_size
                               (in --unified-memtable mode the per-CF
                               memtable / immutable counters read empty
                               because the active memtable and WAL are
                               shared -- look at mtab_tot / imm_tot in
                               the progress line for the aggregate)
    block cache                cache_hits, cache_misses, cache_hit_rate
    /proc/self/io deltas       proc_rchar_d, proc_wchar_d, proc_read_bytes_d,
                               proc_write_bytes_d, proc_syscr_d, proc_syscw_d,
                               proc_cancelled_write_bytes_d
    process resources          ru_maxrss_kb, ru_minflt, ru_majflt,
                               ru_inblock, ru_oublock, ru_utime_us_d,
                               ru_stime_us_d
    RAPL energy (cumulative)   energy_pkg_uj, energy_core_uj,
                               energy_uncore_uj, energy_dram_uj
    RAPL power (windowed)      power_pkg_w, power_core_w, power_uncore_w,
                               power_dram_w
    attribution                cpu_share, proc_pkg_energy_j_window,
                               proc_pkg_energy_j_cum,
                               energy_per_gib_written, energy_per_mread
    write amplification        wa_app, wa_io, wa_disk (window + 30s)
    read amplification         ra_app, ra_io          (window + 30s)
    space amplification        sa_disk, sa_data
    commit latency             commit_p50_us, _p95_us, _p99_us,
                               commits_in_window
    objstore                   objstore_enabled, objstore_local_cache_bytes,
                               objstore_local_cache_max,
                               objstore_local_cache_files,
                               objstore_upload_queue_depth,
                               objstore_total_uploads,
                               objstore_total_upload_failures,
                               objstore_last_uploaded_gen,
                               objstore_replica_mode

  amplification.csv (one row per tick, single collapsed view):

    elapsed_s, phase,
    user_bytes_window/total, wchar/write_bytes/rchar/read_bytes
    (window + total), wa_app/io/disk (window + 30s),
    ra_app/io (window + 30s), cpu_share,
    disk_delta_bytes, disk_total_bytes, sa_disk, sa_data,
    pkg_energy_window_uj, power_pkg_w, proc_pkg_energy_j_window,
    energy_per_gib_written
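
  How the three write-amplification columns could be derived is sketched
  below. The mapping is an assumption, not taken from main.c: wa_app is
  read as syscall-level bytes (wchar) per user byte, wa_io as block-layer
  bytes (write_bytes) per user byte, and wa_disk as data-dir growth per
  user byte. io_window_t and write_amp() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t user_bytes;  /* payload bytes committed in this window */
    uint64_t wchar;       /* /proc/self/io wchar delta */
    uint64_t write_bytes; /* /proc/self/io write_bytes delta */
    int64_t  disk_delta;  /* data-dir size change; negative on reclaim */
} io_window_t;

typedef struct {
    double wa_app, wa_io, wa_disk;
} write_amp_t;

static double safe_ratio(double num, double den)
{
    return (den > 0.0) ? num / den : 0.0; /* idle window reports 0 */
}

/* Amplification over the n most recent windows: sum first, divide once,
 * so a busy window weighs more than a nearly idle one. n = 1 gives the
 * per-window figure; n = ceil(30 / sample interval) approximates the
 * rolling 30 s figure. */
static write_amp_t write_amp(const io_window_t *w, size_t n)
{
    double user = 0.0, wchar = 0.0, wbytes = 0.0, disk = 0.0;
    for (size_t i = 0; i < n; i++) {
        user   += (double)w[i].user_bytes;
        wchar  += (double)w[i].wchar;
        wbytes += (double)w[i].write_bytes;
        disk   += (double)w[i].disk_delta;
    }
    write_amp_t wa = {
        safe_ratio(wchar, user),
        safe_ratio(wbytes, user),
        safe_ratio(disk, user)
    };
    return wa;
}
```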

  summary.csv (one row per phase, end of run):

    phase, phase_name, duration_s, ingest_gib, mean_write_mibs,
    n_samples, mean_wa_app/io/disk, mean_ra_app/io, mean_sa_disk/data,
    mean_pkg_w/core_w/uncore_w/dram_w, mean_cpu_share,
    max_point_p99_us/seek_p99_us/range_p99_us/commit_p99_us,
    max_rss_mib, max_open_ssts, cum_misses, cum_mismatches

  delete_experiment.csv (one row per snapshot, four snapshots per run):

    phase {peak, post_delete, post_compact, post_compact_settle},
    disk_bytes, data_bytes, sstable_count, immutable_count,
    tombstones, tombstone_ratio, total_deleted

  config.csv (one row per CF, at startup):

    cf_index, cf_name, every tunable CF knob with enums in their
    human-readable string form

  run_meta.csv (key,value table, at startup):

    environment: hostname, os_name/release/machine, cpu_model, ncpu,
                 total_memory_mib, rapl_enabled, rapl_domains
    library:     tidesdb_version, tidesdb_has_s3, compiler, build_date
    disk:        data_dir, data_dir_device, data_dir_filesystem,
                 data_dir_total_gib, data_dir_free_gib
    workload:    target_bytes, value_size, batch_size, write/read_threads,
                 num_cfs, write_pattern, read_pattern, zipf_skew,
                 keyspace, range_scan_len, sample_interval_sec,
                 cooldown_sec, delete_fraction, delete_batch,
                 delete_target_cf
    engine:      block_cache_size, flush/compaction_threads,
                 max_open_sstables, unified_memtable,
                 unified_write_buffer, log_level, cf_write_buffer,
                 cf_compression, cf_sync_mode, cf_klog_value_threshold,
                 cf_enable_bloom_filter, cf_bloom_fpr, cf_use_btree
    objstore:    role {primary,replica}, spawn_replica, objstore_mode,
                 objstore_fs_path, objstore_endpoint, objstore_bucket,
                 objstore_prefix, objstore_region, objstore_use_ssl,
                 objstore_path_style, every objstore_cfg field
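
For the RAPL columns listed under samples.csv, the relationship between
the cumulative energy_*_uj counters and the windowed power_*_w values can
be sketched as below. The powercap energy_uj counter wraps at
max_energy_range_uj, so the delta has to allow for one wrap. Treating
process energy as package energy times cpu_share, and the function names,
are assumptions rather than a description of main.c.

```c
#include <stdint.h>

/* Microjoules consumed between two reads of energy_uj, allowing for a
 * single counter wrap. Assumes prev <= max_range. */
static uint64_t rapl_delta_uj(uint64_t prev, uint64_t now,
                              uint64_t max_range)
{
    if (now >= prev)
        return now - prev;
    return (max_range - prev) + now;
}

/* Average watts over a window of dt_s seconds. */
static double rapl_watts(uint64_t delta_uj, double dt_s)
{
    return (dt_s > 0.0) ? ((double)delta_uj / 1e6) / dt_s : 0.0;
}

/* Assumed attribution: the process is charged its share of CPU time. */
static double process_energy_j(uint64_t delta_uj, double cpu_share)
{
    return ((double)delta_uj / 1e6) * cpu_share;
}

/* Joules per GiB written; 0 when nothing was written. */
static double joules_per_gib(double joules, uint64_t bytes)
{
    if (bytes == 0)
        return 0.0;
    return joules / ((double)bytes / (1024.0 * 1024.0 * 1024.0));
}
```

A window longer than the counter's wrap period would undercount; with
sample intervals in the seconds range this is normally not a concern for
the package domain.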


PHASES
------

Each samples.csv row carries a phase tag. The run walks through these in order:

    0  ingest         writers active, readers probing live keys
    1  cooldown       writers stopped, db quiescing, readers continue
    2  delete         delete_fraction * ingested ids removed from the
                      target CF(s) in batches; readers can hit tombstones
                      from this point on
    3  post-delete    settle window, snapshot taken
    4  compaction     tidesdb_compact() driven on target CFs
    5  post-compact   final settle and snapshot

In --replica-mode the bench enters cooldown immediately (no writers, no
delete) and stays there for the configured wait windows so it can build
a parallel time-series of read performance over the primary's run.


OUTPUT FILES
------------

Per run, all under out/run_YYYYMMDD_HHMMSS/:

    samples.csv             time series, N rows per tick (one per CF)
    amplification.csv       time series, 1 row per tick (collapsed)
    summary.csv             per-phase aggregates at end of run
    delete_experiment.csv   four discrete snapshots
    config.csv              per-CF effective config
    run_meta.csv            environment + workload metadata


EXAMPLES
--------

Sequential 1 GiB smoke (defaults, just to verify the pipeline):

    ./mwbench --quick

64 GiB sequential, 8 writers, larger memtable, 16 GiB block cache:

    ./mwbench --target-gib 64 --write-threads 8 \
              --write-buffer $((512 * 1024 * 1024)) \
              --block-cache  $((16 * 1024 * 1024 * 1024))

Primary + 3 secondary indexes, mixed engine configs across CFs:

    ./mwbench --target-gib 64 --num-cfs 4 \
              --cf-override 0:use_btree=1 \
              --cf-override 1:compression=zstd \
              --cf-override 2:enable_bloom_filter=0 \
              --cf-override 3:klog_value_threshold=4096

Uniform random workload (stresses compaction):

    ./mwbench --target-gib 32 --write-pattern uniform

Zipfian (hot keys, cache-stressing):

    ./mwbench --target-gib 32 --write-pattern zipfian --zipf-skew 0.99

Latest (time series workload):

    ./mwbench --target-gib 32 --write-pattern latest

Unified memtable comparison (same workload, single shared memtable):

    ./mwbench --target-gib 64 --num-cfs 4 --unified-memtable
    ./mwbench --target-gib 64 --num-cfs 4              # per-CF baseline

Full WAL sync durability:

    ./mwbench --target-gib 32 --sync-mode full

Object-store mode (fs connector, no installs, no external deps):

    ./mwbench --target-gib 8 --num-cfs 2 \
              --objstore-mode fs

Object-store mode + primary/replica orchestrator:

    ./mwbench --target-gib 8 --num-cfs 2 \
              --objstore-mode fs --spawn-replica
    # parent writes; child runs --replica-mode against the same path.
    # both produce CSVs

S3 mode against a local rustfs (see notes below):

    ./mwbench --target-gib 8 --num-cfs 2 --objstore-mode s3 \
              --objstore-endpoint 127.0.0.1:9000 \
              --objstore-bucket   mwbench \
              --objstore-access-key mwbench \
              --objstore-secret-key mwbenchsecret \
              --objstore-region   us-east-1 \
              --objstore-path-style 1 \
              --objstore-use-ssl 0

Long compaction-watch run -- give the compactor 10 minutes after the
deletes so the reclaim curve is well sampled:

    ./mwbench --target-gib 128 \
              --delete-compact-wait-sec 600 \
              --sample-interval-sec 5

Keep the data dir for post-mortem (re-run with --force to wipe + restart
or --resume to open the leftover db in-place):

    ./mwbench --quick --keep-data --data-dir ./scratch/data
    # then later:
    ./mwbench --quick --resume    --data-dir ./scratch/data
    # or wipe and restart:
    ./mwbench --quick --force     --data-dir ./scratch/data

Scripted / CI runs (no tty, must opt into wiping):

    ./mwbench --target-gib 64 --force \
              --data-dir /mnt/scratch/data --out-dir /mnt/scratch/out


RUSTFS NOTES
------------

rustfs is an S3-compatible object store. The bench was developed against
the 1.0.0-beta.4 musl build run bare-metal (no Docker).

Install:

    RUSTFS_HOME=/path/to/rustfs    # any persistent dir, prefer non-OS disk
    mkdir -p "$RUSTFS_HOME"/{bin,data,logs}
    curl -fL -o "$RUSTFS_HOME"/rustfs.zip \
        https://dl.rustfs.com/artifacts/rustfs/release/rustfs-linux-x86_64-musl-latest.zip
    unzip -o "$RUSTFS_HOME"/rustfs.zip -d "$RUSTFS_HOME"/bin

Start (binds 9000 = S3, 9001 = console; logs JSON to stderr):

    nohup "$RUSTFS_HOME"/bin/rustfs server \
        --address 127.0.0.1:9000 \
        --console-address 127.0.0.1:9001 \
        --access-key mwbench \
        --secret-key mwbenchsecret \
        --region us-east-1 \
        "$RUSTFS_HOME"/data > "$RUSTFS_HOME"/logs/rustfs.log 2>&1 &

Pre-create the bench bucket via aws CLI (or the rustfs console at :9001):

    AWS_ACCESS_KEY_ID=mwbench AWS_SECRET_ACCESS_KEY=mwbenchsecret \
        aws --endpoint-url http://127.0.0.1:9000 s3 mb s3://mwbench

Stop:

    pkill -f "rustfs/bin/rustfs server"


CONSOLE OUTPUT
--------------

Startup banner echoes the resolved config and environment: workload
(target/value/threads/num_cfs), engine (buffer / cache / pool sizes),
patterns (write/read/keyspace), delete config, role + objstore, host /
OS / CPU / memory, tidesdb version / compiler / build date, and data-dir
disk. Per-CF override diffs print one line per CF.

In unified-memtable mode the banner shows wbuf_u (the unified write
buffer) and the unified sync mode instead of the per-CF values, because
tidesdb ignores the per-CF write_buffer_size / sync_mode when unified
mode is on. When --num-cfs > 1 is combined with --unified-memtable, the
banner also prints an operator note pointing out that per-CF memtable
stats will read empty -- the unified counters carry that data instead.

The unified_memtable banner field reflects the value tidesdb will
actually use, including the silent force-enable that happens whenever
--objstore-mode is fs or s3 (object-store mode requires unified mode).

During the run:

  - one progress tick per second:
      [t=  120s ingest      ]  3.42 GiB  W=291.5 MiB/s  keys=...
      ssts_tot=37 imm_tot=2 mtab_tot= 412.0 MiB  fq=0 cq=0

  - a fuller line every --sample-interval-sec:
      t= 120.0s GiB= 3.42  W=291.5 MiB/s
      pt(p50/p99)=18.3/92.4  sk(p99)=410.1  rg(p99)=1820.6
      ssts_tot=37 imm_tot=2 mtab_tot= 412.0MiB  miss=0/4096000 corrupt=0
      pkg= 92.4W WA(io)= 2.13

  Column meanings (from tidesdb_get_db_stats):

      ssts_tot   total SSTable files across every level of every CF,
                 not L0-only and not per-CF
      imm_tot    queue_size of every CF's immutable queue + the unified
                 immutable queue, including entries already flushed
                 but not yet swept by the batched cleanup
      mtab_tot   live skip-list bytes of every active + non-flushed
                 immutable memtable, summed across CFs and unified;
                 NOT one memtable's bytes
      fq / cq    the single db-wide flush / compaction work queues

  Read these as global aggregates; pair them with --num-cfs and (in
  unified mode) the operator note printed at startup.

  - delete-phase progress every ~5% of the plan:
      deleted 5000000/12000000 (41.7%)  rate=215000 keys/s

  - end-of-run per-phase summary table to stdout, plus the four CSV
    paths and a summary CSV path.
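
For ad-hoc log scraping, the per-second progress tick shown above can be
split into numeric fields with a short regex. This is a sketch against the
sample line in this section only (joined onto one line here); the exact
spacing of real output is an assumption, and the fuller sample line uses
names such as pt(p50/p99) that this pattern deliberately does not match.

```python
import re

# name=value pairs, tolerating padding after '=' (as in "mtab_tot= 412.0").
_KV = re.compile(r"(\w+)=\s*(\d+(?:\.\d+)?)")


def parse_tick(line):
    """Return the numeric name=value fields of one progress tick."""
    return {name: float(value) for name, value in _KV.findall(line)}


# The sample tick from this section, joined onto a single line.
TICK = (
    "[t=  120s ingest      ]  3.42 GiB  W=291.5 MiB/s  keys=... "
    "ssts_tot=37 imm_tot=2 mtab_tot= 412.0 MiB  fq=0 cq=0"
)
fields = parse_tick(TICK)
print(fields["ssts_tot"], fields["imm_tot"], fields["W"])
```

Units (MiB, MiB/s) are dropped by this parser, so they have to be tracked
by field name.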


TROUBLESHOOTING
---------------

"data-dir <path> is non-empty and stdin is not a tty.":
    Scripted run hit a non-empty data-dir without consent. Either pass
    --force / --yes / -y to wipe non-interactively, or --resume to open
    the existing db in place. The bench is intentionally conservative
    here -- it will not silently overwrite a populated dir.

"data-dir <path> is non-empty. wipe it and start fresh? [y/N]:":
    Interactive prompt before wiping. Answer y/Y to wipe; anything else
    (including just Enter) aborts the run cleanly. Pass --force on the
    command line to skip the prompt.
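
    The two data-dir entries above amount to a small decision table. The
    sketch below restates it in Python for reference; it is not the code
    in main.c, and the precedence of --resume over --force when both are
    given is an assumption.

```python
def data_dir_action(non_empty, stdin_is_tty, force=False, resume=False,
                    answer=""):
    """Decide what to do with the data-dir before opening the db.

    Returns one of: "start", "resume", "wipe", "error", "abort".
    """
    if not non_empty:
        return "start"   # nothing to protect
    if resume:
        return "resume"  # open the existing db in place
    if force:
        return "wipe"    # --force / --yes / -y: consent given up front
    if not stdin_is_tty:
        return "error"   # scripted run without consent
    # Interactive prompt: only y/Y wipes, anything else aborts.
    return "wipe" if answer.strip() in ("y", "Y") else "abort"


print(data_dir_action(non_empty=True, stdin_is_tty=False))
print(data_dir_action(non_empty=True, stdin_is_tty=True, answer="y"))
```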

"Failed to create database directory ...":
    The parent path could not be created. mkdir -p semantics apply, so
    this usually means a permission issue or a parent that points at a
    non-existent mount.

mismatches > 0 in samples.csv:
    Real data corruption. The column counts byte-compare failures on
    values that were definitely written and not deleted. Reproduce with
    the same config and file an issue.

unexpected misses in sequential workloads:
    Indicates lost writes. The reader only probes ids within a safety
    window below max_written_key, so a miss is real. Two paths count:
      - point get returns not-found
      - seek returns a key whose id != the requested id (the bench
        flags this as a miss rather than silently validating whatever
        next-greater key the iterator surfaced)
    In non-sequential workloads (uniform/zipfian/latest) misses are
    statistically expected and suppressed -- only mismatches count.
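
    The miss rules above can be summarised as a predicate. This is an
    illustrative restatement with hypothetical names, not the bench's
    code; treating an exhausted iterator (no key returned) as a seek
    miss is an assumption.

```python
SEQUENTIAL = "sequential"


def is_miss(op, requested_id, returned_id):
    """returned_id is None when the engine found nothing."""
    if op == "get":
        return returned_id is None
    if op == "seek":
        # A seek that lands on the next-greater key is a miss, not a hit.
        return returned_id != requested_id
    raise ValueError(f"unknown op {op!r}")


def counted_miss(pattern, op, requested_id, returned_id):
    """Misses are only counted for the sequential pattern."""
    return pattern == SEQUENTIAL and is_miss(op, requested_id, returned_id)


print(counted_miss("sequential", "seek", 100, 101))  # True
print(counted_miss("uniform", "seek", 100, 101))     # False
```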

writers stall under sustained TDB_ERR_BUSY:
    The engine throttles writers (returns TDB_ERR_BUSY out of
    txn_begin / txn_put / txn_commit) when flush or compaction cannot
    drain fast enough. The bench treats this as backpressure: each
    writer sleeps with exponential backoff (200 us minimum, 50 ms cap)
    and retries the same batch instead of exiting, so a busy spike no
    longer terminates the run early. If the throughput line stays at
    ~0 MiB/s for many seconds AND ssts_tot is not advancing, the
    flush pool is wedged -- see the next entry.
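
    The backoff described above can be sketched as follows. The 200 us
    floor and 50 ms cap come from this entry; the doubling factor, the
    `Busy` exception standing in for TDB_ERR_BUSY, and the function
    names are assumptions for illustration.

```python
import time


class Busy(Exception):
    """Stand-in for a TDB_ERR_BUSY return code (hypothetical name)."""


def backoff_delays_us(attempts, min_us=200, cap_us=50_000):
    """Delays in microseconds for successive retries (doubling assumed)."""
    delays = []
    delay = min_us
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap_us)
    return delays


def commit_with_backoff(commit_batch, sleep=time.sleep,
                        min_us=200, cap_us=50_000):
    """Retry the same batch until it succeeds; return the retry count."""
    delay_us = min_us
    retries = 0
    while True:
        try:
            commit_batch()
            return retries
        except Busy:
            sleep(delay_us / 1_000_000)  # time.sleep takes seconds
            delay_us = min(delay_us * 2, cap_us)
            retries += 1


print(backoff_delays_us(10))
```

    Under the doubling assumption the cap is reached on the ninth retry,
    so a sustained busy period settles at one attempt every 50 ms per
    writer.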

reader latency dominated by L0:
    Bump --flush-threads and --write-buffer so memtables flush sooner,
    or raise --compaction-threads so L0->L1 doesn't back up. In unified
    mode bump --unified-write-buffer instead. Watch ssts_tot / imm_tot
    in the progress line: an imm_tot that plateaus while ssts_tot does
    not advance is the symptom of a flush pool that is saturated (every
    slot in the max_concurrent_flushes cap held by an in-flight write).
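
    When watching a long run unattended, the symptom above can be
    checked mechanically over recent progress ticks. The heuristic below
    is a sketch only: the 10-tick window and the requirement that
    imm_tot be non-zero and exactly constant are assumed thresholds, not
    values taken from the bench.

```python
def flush_pool_looks_saturated(ticks, window=10):
    """ticks: (ssts_tot, imm_tot) pairs, one per progress tick, oldest first.

    True when, over the last `window` ticks, ssts_tot did not advance
    while imm_tot sat on a non-zero plateau.
    """
    if len(ticks) < window:
        return False
    recent = ticks[-window:]
    ssts_values = {ssts for ssts, _ in recent}
    imm_values = {imm for _, imm in recent}
    return (
        len(ssts_values) == 1
        and len(imm_values) == 1
        and next(iter(imm_values)) > 0
    )


print(flush_pool_looks_saturated([(37, 4)] * 10))
```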

objstore upload failures > 0:
    Check connectivity to the endpoint and credentials. For the fs
    connector, the failure usually means the target path was made
    unreadable mid-run.

replica child fails with "no CFs visible":
    The primary hadn't uploaded a manifest within the 60 s poll window
    -- usually means the primary's ingest is too short and finished
    before its first flush. Bump --target-gib or pre-populate the bucket
    from an earlier primary run.

S3 mode crashes immediately on tidesdb_open:
    Likely tidesdb wasn't actually built with -DTIDESDB_WITH_S3=ON.
    Check the version header has #define TIDESDB_HAS_S3. If it does
    but the bench still crashes, suspect a partial rebuild -- do a
    clean rebuild of libtidesdb.

--cf-override unknown key '...':
    The parse-time error prints the apply_cf_kv whitelist verbatim.

--cf-override idx N >= num_cfs:
    Caught at parse time, before tidesdb_open. Either bump --num-cfs or
    drop the offending override.
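
    Both --cf-override errors above are parse-time checks on an
    IDX:key=value argument. The sketch below mirrors that validation in
    Python. The key set is only the four keys used in the EXAMPLES
    section; the authoritative whitelist is whatever apply_cf_kv prints,
    and the error wording here is approximate.

```python
# Illustrative subset only: the keys that appear in the EXAMPLES section.
EXAMPLE_KEYS = frozenset(
    {"use_btree", "compression", "enable_bloom_filter",
     "klog_value_threshold"}
)


def parse_cf_override(spec, num_cfs, allowed_keys=EXAMPLE_KEYS):
    """Parse 'IDX:key=value' and validate it before the db is opened."""
    idx_text, colon, kv = spec.partition(":")
    key, equals, value = kv.partition("=")
    if not (colon and equals and idx_text.isdigit() and key):
        raise ValueError(f"malformed --cf-override {spec!r}")
    idx = int(idx_text)
    if idx >= num_cfs:
        raise ValueError(f"--cf-override idx {idx} >= num_cfs")
    if key not in allowed_keys:
        raise ValueError(
            f"--cf-override unknown key {key!r}; "
            f"allowed: {', '.join(sorted(allowed_keys))}"
        )
    return idx, key, value


print(parse_cf_override("1:compression=zstd", num_cfs=4))
```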

disk fills up mid-run:
    Lower --target-gib or move --data-dir to a bigger volume. In multi-CF
    mirror mode the on-disk footprint is roughly num_cfs times a single-
    CF run for the same target. In objstore mode disk_bytes counts only
    the local cache, not the bucket -- check the bucket size separately.
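
    A rough pre-flight estimate can catch an undersized volume before
    the run starts. The sketch below assumes a single-CF run occupies
    about target_gib on disk, which ignores compression and space
    amplification, and then applies the num_cfs multiplier from this
    entry. Treat the result as an order-of-magnitude check.

```python
GIB = 1024 ** 3


def estimated_footprint_bytes(target_gib, num_cfs):
    """Rough local footprint: num_cfs times the single-CF target."""
    return target_gib * num_cfs * GIB


def fits(target_gib, num_cfs, free_bytes):
    """True if the rough estimate fits in the free space given."""
    return estimated_footprint_bytes(target_gib, num_cfs) <= free_bytes


# 64 GiB target mirrored across 4 CFs against a 200 GiB volume.
print(fits(64, 4, 200 * GIB))
```

    The free-space figure can come from shutil.disk_usage(path).free on
    the volume that holds --data-dir.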


FILES
-----

    main.c                   the bench
    CMakeLists.txt           build definition (produces build/mwbench)
    out/run_YYYYMMDD_HHMMSS/  per-run output:
                                 config.csv
                                 run_meta.csv
                                 samples.csv
                                 amplification.csv
                                 summary.csv
                                 delete_experiment.csv
                                 replica/run_YYYYMMDD_HHMMSS/...
                                                   (only with --spawn-replica)


==============================================================================