[Alternator] Some sstables with large sizes left after TTL expiration, gc-grace-period and major compaction (tombstones are not deleted) #11915
node-5 state:
An additional related issue: after repairing all nodes except node-1, nodetool status showed somewhat less data than after repairing node-1.
After the node-1 repair it had:
node-16 state:
node-15 state:
A CQL query does return 0 for a
Please try the same query using BYPASS CACHE.
Actually, scratch that. Please run nodetool flush and nodetool compact to see if disk usage drops significantly.
@raphaelsc, after running nodetool flush and compact there's not much change (not getting significantly closer to zero sstables or partitions):
In a shorter, smaller test without nemesis the results did reach zero sstables and partitions. It had 4 write-stress loads like:
Test id:
BEFORE repair and major compaction (the relevant node is db-node-c6c93337-4 with IP 10.4.2.100):
Run a repair on node-4:
Check SSTABLE files after repair:
AFTER MAJOR COMPACTION:
AFTER GRACE PERIOD AND ANOTHER MAJOR COMPACTION:
The issue does reproduce in a smaller, 5-hour test with nemesis.
Before repair:
after repair:
after major compaction:
Installation details
Kernel Version: 5.15.0-1021-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
>>>>>>>
Logs:
Cc @nyh since this issue is about Alternator TTL, not CQL TTL.
This issue has a ton of text but I don't understand at all what the problem being reported here is...
First, when you do a "SELECT *" on the table, do you get zero rows, or not?
Second, if you see zero rows but largish sstables, and if there are unexpired tombstones, it's not surprising that we have largish sstables to contain them... It's not a bug, it's working as intended.
Third, if you see zero rows and the gc-grace-period has passed since the time the Alternator TTL deleted those rows, and you did a major compaction, you'd expect to see zero-size sstables. In one of the results above I see that you saw exactly that - zero data size, exactly like we expect, so no bug here:
What was surprising (for me) in that output, though, was:
I'm not a commitlog expert (@elcallio maybe you can comment), but why would large commit logs remain long after writing stopped (and in our case, after the old data was deleted)? Is this normal - e.g., old files are kept to be "recycled" - or might it indicate a bug? Again, if it's a bug, it's not an Alternator bug.
Fourth, in the original issue message (which I'm not sure is the same as the following runs you did...) you mentioned having a gc-grace-period of 2 hours, but doing a repair after a full day. This is theoretically wrong. If for some reason one of the nodes missed some deletion operations (I don't know why it would, though...), the repair would resurrect this data. This can explain non-zero sstables, but this explanation is only relevant if "SELECT *" returns some data. If it doesn't, then this explanation is irrelevant.
@yarongilor please clarify what you think the bug here is.
@nyh, let me summarise all the above results and point to an issue (which indeed is not necessarily an Alternator one):
@yarongilor does the short reproducer with nemesis reproduce the issue?
Two nemeses actually executed (the other nemeses either skipped or failed on the SCT side before running anything):
Ok, let's check each one of them separately to see which one is the root cause and why.
If I understood the points which @yarongilor demonstrated above, and from a personal chat with him, we have the following situation, which may indicate a problem not directly related to Alternator TTL but which still sounds like a serious bug:
This combination of three facts should have been impossible: a major compaction after gc_grace_seconds should have dropped all tombstones, and we shouldn't see any tombstones in any sstable! If we do see any, it suggests we have some sort of compaction or sstable-handling bug.
There's another clue which @yarongilor mentioned: this problem was only reproduced with the "resharding" nemesis. The resharding nemesis changes the number of CPUs on the node, and then changes it back to the original number. This leads me to the following wild guess (for which I don't have any evidence): maybe the bug is somehow related to resharding operations. Resharding is supposed to replace old per-shard sstables with new sstables belonging to a different list of shards. What if something in the back-and-forth resharding operation leaves some "orphan" sstables behind that don't belong to any of the current shards? If that's possible, then these sstables will not get compacted in the major compaction, and their tombstones will never be deleted.
@raphaelsc does this ring any bells? Is it possible that we leave "orphan sstables" after resharding that increases or decreases the number of cores? In general, if @yarongilor sees a problematic sstable that didn't get compacted properly, is there a way to check which shard "owns" that sstable? @yarongilor can you try looking for one of these problematic sstables' file names in the Scylla log, to see if there are any messages about compacting this sstable?
@roydahan, it is not reproduced by running either of these nemeses on its own.
So re-run with the combination of the two and check if it reproduces consistently.
Rerunning in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-alternator-dbg/10/ ==> the issue is reproduced similarly, using the original Sisyphus with the 2 nemeses:
Installation details
Kernel Version: 5.15.0-1023-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
>>>>>>>
Logs:
An automatic reproducer is now available as a Jenkins job. An example output:
@raphaelsc, already implemented - #11915 (comment)
@raphaelsc, is there anything else required from @yarongilor? @DoronArazii ^^
ping @raphaelsc
Major compaction semantics is that all data of a table will be compacted together, so the user can expect e.g. a recently introduced tombstone to be compacted with the data it shadows. Today, it can happen that data in the maintenance set won't be included in a major compaction until it is promoted into the main set by off-strategy compaction, so the user might be left wondering why major compaction is not having the expected effect.
To fix this, let's perform off-strategy compaction first, so data in the maintenance set will be made available to major compaction. A similar approach is taken for data in memtables, where a flush is performed before major compaction starts. The only exception is data in staging, which cannot be compacted until view building is done with it, to avoid inconsistency in view replicas. The serialization of reshape jobs in the compaction manager guarantees correctness if there's an ongoing off-strategy compaction on behalf of the table.
Fixes scylladb#11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
PR sent: #15792
Same commit message as above. Fixes scylladb#11915. Closes scylladb#15792
Same commit message as above. Fixes #11915. Closes #15792 (cherry picked from commit ea6c281)
Backport queued to 5.4. The 5.2 backport has conflicts, @raphaelsc please open a backport PR.
@denesb note this is a very recent commit and we should be wary of backporting things before they've had a chance to get tested. 5.4 is okay as it's undergoing testing anyway.
Right. I was going over issues which need to be backported to 5.4. I will keep in mind to delay the other backports.
Those mental notes... We must automate them... Even if it's via ugly labels: 'Candidate-For-Backport...' -> 'Ready-For-Backport' after 2-4 weeks, for example.
Re-visiting this, the code has soaked for more than a month now. @raphaelsc please prepare a backport PR against 5.2.
ping @raphaelsc, @denesb for backport.
Same commit message as above. Fixes scylladb#11915. Closes scylladb#15792 (cherry picked from commit ea6c281) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
PR sent: #17901
Same commit message as above. Fixes #11915. Closes #15792 (cherry picked from commit ea6c281) Closes #17901
Backported to 5.2.
Installation details
Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)
Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test id: 7da36ba4-479e-42fd-bc55-641409ff1c77
Test name: scylla-staging/yarongilor/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test config file(s):
Issue description
>>>>>>>
scenario:
some large sstables are:
nodetool status:
nodetool cfstats on node-1 shows
Number of partitions (estimate): 936282794
Any CQLSH query to any range failed with a timeout (see the example after this description):
<<<<<<<
$ hydra investigate show-monitor 7da36ba4-479e-42fd-bc55-641409ff1c77
$ hydra investigate show-logs 7da36ba4-479e-42fd-bc55-641409ff1c77
Logs:
The cluster's nodes and monitor are alive in:
Original logs of the SCT test:
Manually collected logs after test run ended and manual operations (repair + major compaction on all nodes):
Jenkins job URL