2.25.0.0-b317

@ttyusupov ttyusupov tagged this 16 Nov 08:46
Summary:
Currently the compaction queue (or priority thread pool) contains compaction objects that already include their compaction input files. Between the time a compaction task is created and the time it actually starts, the set of live SST files in RocksDB can change, so the compaction that eventually runs may no longer be the best one to run.

In the original RocksDB, the compaction queue stored pointers to ColumnFamilyData instead of compactions, and the decision about which files to include in a compaction was made at compaction start.

In yugabyte-db this logic was changed in D2700. It looks like this was done because of the split into small and large compaction queues: we needed to decide which queue to put an entry in.

This revision adds the `rocksdb_determine_compaction_input_at_start` flag, which switches RocksDB to put a pointer to ColumnFamilyData and the presumed compaction size (small/large) into the priority thread pool CompactionTask object. Once the compaction task actually starts, we pick the appropriate compaction and its input files.

Example: if a small compaction starts first and then a large compaction also starts for the same RocksDB instance, they will work on distinct sets of files, because the first compaction marks its input files as being compacted as soon as it starts. The difference depending on the `rocksdb_determine_compaction_input_at_start` flag value is the following:
1) With `rocksdb_determine_compaction_input_at_start=false`, input files for a compaction task are selected as soon as we put the task in the queue, and they are marked as being compacted at that same time.
2) With `rocksdb_determine_compaction_input_at_start=true`, input files for a compaction task are selected when the task is picked up from the queue and compaction execution starts. Only at that time are the input files marked as being compacted.
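
The difference between the two modes can be illustrated with a small toy model. This is only a sketch and not the code from this revision: `ToyDb`, `ToyCompactionTask`, `PickAndMarkInputs` and the integer file IDs are hypothetical names made up for the illustration, and the toy picker simply claims every unclaimed live file instead of choosing a small or large compaction.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-in for one RocksDB instance (hypothetical, not a YugabyteDB type).
struct ToyDb {
  std::set<int> live_files;       // IDs of live SST files
  std::set<int> being_compacted;  // IDs already claimed by some compaction

  // Claims every live file that no other compaction has claimed yet.
  // A real picker would choose a subset; taking everything keeps the toy short.
  std::vector<int> PickAndMarkInputs() {
    std::vector<int> picked;
    for (int id : live_files) {
      if (being_compacted.insert(id).second) {
        picked.push_back(id);
      }
    }
    return picked;
  }
};

// Toy compaction task as it would sit in the priority thread pool queue.
class ToyCompactionTask {
 public:
  // Constructing the task models putting it into the queue.
  ToyCompactionTask(ToyDb* db, bool determine_input_at_start)
      : db_(db), determine_input_at_start_(determine_input_at_start) {
    assert(db_ != nullptr);
    if (!determine_input_at_start_) {
      // Flag off: inputs are selected and marked at enqueue time.
      inputs_ = db_->PickAndMarkInputs();
    }
  }

  // Models a pool thread picking the task up and starting execution.
  const std::vector<int>& Start() {
    if (determine_input_at_start_) {
      // Flag on: inputs are selected and marked only now.
      inputs_ = db_->PickAndMarkInputs();
    }
    return inputs_;
  }

 private:
  ToyDb* db_;
  bool determine_input_at_start_;
  std::vector<int> inputs_;
};
```

In this model, a file that becomes live after the task is queued but before `Start()` is called is included only when the flag is on. A second task started later on the same `ToyDb` gets a disjoint set of files, because the first task has already marked its inputs as being compacted.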

Note: in this change the new behaviour is off by default (`rocksdb_determine_compaction_input_at_start=false`).
Jira: DB-13575

Test Plan:
- Jenkins
- Stress test on an RF=5 7-node cluster using 4 instances of `java -jar /tmp/tests/artifacts/stress-sample-app-tool/yb-stress-sample-apps-1.1.73.jar --workload CassandraBatchTimeseries --num_writes -1 --num_threads_write 5 --num_threads_read 1 --num_reads -1 --num_unique_keys 300000000000000 --table_ttl_seconds 21600 --default_cassandra_keyspace ... --row_ttl_seconds 21600 --uuid_marker  ...`

Reviewers: sergei, arybochkin, rthallam

Reviewed By: sergei, arybochkin, rthallam

Subscribers: rthallam, ybase

Tags: #jenkins-ready, #jenkins-trigger

Differential Revision: https://phorge.dev.yugabyte.com/D39500