[Alternator] Some sstables with large sizes left after TTL expiration, gc-grace-period and major compaction (tombstones are not deleted) #11915

Closed
yarongilor opened this issue Nov 7, 2022 · 54 comments


yarongilor commented Nov 7, 2022

Installation details

Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-9 (34.248.19.247 | 10.4.0.151) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-8 (52.209.108.120 | 10.4.0.160) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-7 (34.245.201.183 | 10.4.3.188) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-6 (34.250.32.253 | 10.4.2.151) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5 (34.245.124.139 | 10.4.2.213) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-4 (54.195.144.153 | 10.4.0.41) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-3 (34.254.89.227 | 10.4.0.77) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-2 (54.155.84.133 | 10.4.3.211) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-16 (3.250.192.234 | 10.4.0.39) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-15 (3.251.81.51 | 10.4.0.27) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-14 (3.250.105.87 | 10.4.2.126) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-13 (54.194.207.154 | 10.4.0.92) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-12 (34.244.29.173 | 10.4.1.91) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-11 (54.194.213.46 | 10.4.1.151) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-10 (52.212.227.101 | 10.4.0.71) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1 (63.35.163.138 | 10.4.0.55) (shards: 14)

OS / Image: ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test id: 7da36ba4-479e-42fd-bc55-641409ff1c77
Test name: scylla-staging/yarongilor/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test config file(s):

Issue description

>>>>>>>
scenario:

  1. The test ran for 2 days with nemesis and Alternator TTL writes.
  2. TTL=12 hours, gc-grace-seconds=2 hours, TTL-scan interval=1 hour (a configuration sketch follows the quoted section below).
  3. The test load ended after 2 days.
  4. Then, after one more day, a repair and major compaction were executed on all nodes.
  5. Then, looking at the nodes, the number of sstables decreased but remained high: node-1 has 57 sstables, some of them 1GB in size.
  6. node-1 state is:
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
57

some large sstables are:

1001M -rw-r--r-- 1 scylla scylla 1001M Nov  6 16:14 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-127475-big-Data.db
1001M -rw-r--r-- 1 scylla scylla 1001M Nov  6 16:14 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-127467-big-Data.db
1001M -rw-r--r-- 1 scylla scylla 1001M Nov  6 16:14 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-127466-big-Data.db
1001M -rw-r--r-- 1 scylla scylla 1001M Nov  6 16:14 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-127464-big-Data.db
1001M -rw-r--r-- 1 scylla scylla 1001M Nov  6 16:14 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-127462-big-Data.db
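
For reference, the combined on-disk size of these files can be checked with a one-liner like the following (same path as above):

du -ch /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | tail -n1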

nodetool status:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.213  18.1 GB    256          ?       bdb02b1e-819a-452e-92a4-1b093438db8b  1a
UN  10.4.0.39   22.06 GB   256          ?       2654a908-baaf-4fb9-af5f-e12653552b01  1a
UN  10.4.0.55   55.67 GB   256          ?       91ed0563-c5a6-4705-a0de-633862bfba92  1a
UN  10.4.0.27   19.44 GB   256          ?       96c48dab-a1d7-46b4-9a96-beb77c14241f  1a

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ 

nodetool cfstats on node-1 shows Number of partitions (estimate): 936282794:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 72
----------------
Keyspace : alternator_usertable_no_lwt
        Read Count: 13269849
        Read Latency: 7.524749528046627E-4 ms
        Write Count: 600764593
        Write Latency: 7.30815039893671E-6 ms
        Pending Flushes: 0
                Table: usertable_no_lwt
                SSTable count: 55
                SSTables in each level: [55/4]
                Space used (live): 58741483520
                Space used (total): 58741483520
                Space used by snapshots (total): 0
                Off heap memory used (total): 552484908
                SSTable Compression Ratio: 0.554702
                Number of partitions (estimate): 936282794
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 406
                Local read count: 13269849
                Local read latency: 0.752 ms
                Local write count: 600764593
                Local write latency: 0.007 ms
                Pending flushes: 0
                Percent repaired: 0.0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 513346200
                Bloom filter off heap memory used: 517504780
                Index summary off heap memory used: 34980128
                Compression metadata off heap memory used: 0
                Compacted partition minimum bytes: 43
                Compacted partition maximum bytes: 124
                Compacted partition mean bytes: 60
                Average live cells per slice (last five minutes): 0.0
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): 0.0
                Maximum tombstones per slice (last five minutes): 0
                Dropped Mutations: 0

----------------

Any CQLSH query on any range failed with a timeout:

cqlsh> SELECT p from alternator_usertable_no_lwt.usertable_no_lwt WHERE p < 'user609602598667831698' and p > 'user609602598667830698' and c = 'YCSB_0' LIMIT 1 ALLOW FILTERING using timeout 10m;
OperationTimedOut: errors={'10.4.0.55': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.0.55

<<<<<<<
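
A minimal sketch of how a setup like the one in the scenario above is typically configured. The endpoint address and port are assumptions, and alternator_ttl_period_in_seconds is the scylla.yaml knob that, to my understanding, controls the TTL scan interval; the table/keyspace names and values are taken from the scenario itself:

# Enable Alternator TTL on the table; the per-item 12h expiration is written by the loader via the "ttl" attribute.
aws dynamodb update-time-to-live \
    --endpoint-url http://<alternator-node>:8080 \
    --table-name usertable_no_lwt \
    --time-to-live-specification "Enabled=true, AttributeName=ttl"

# 2-hour gc-grace on the backing CQL table:
cqlsh <node-ip> -e "ALTER TABLE alternator_usertable_no_lwt.usertable_no_lwt WITH gc_grace_seconds = 7200;"

# 1-hour TTL expiration-scan interval, set in scylla.yaml:
#   alternator_ttl_period_in_seconds: 3600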

  • Restore Monitor Stack command: $ hydra investigate show-monitor 7da36ba4-479e-42fd-bc55-641409ff1c77
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7da36ba4-479e-42fd-bc55-641409ff1c77

Logs:

The cluster's nodes and monitor are still alive:

| alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-15     | eu-west-1a | running | 7da36ba4-479e-42fd-bc55-641409ff1c77 | yarongilor | Sat Nov  5 09:00:41 2022 |
| alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-16     | eu-west-1a | running | 7da36ba4-479e-42fd-bc55-641409ff1c77 | yarongilor | Sat Nov  5 15:12:01 2022 |
| alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1      | eu-west-1a | running | 7da36ba4-479e-42fd-bc55-641409ff1c77 | yarongilor | Thu Nov  3 13:34:28 2022 |
| alternator-ttl-4-loaders-no-lwt-sis-monitor-node-7da36ba4-1 | eu-west-1a | running | 7da36ba4-479e-42fd-bc55-641409ff1c77 | yarongilor | Thu Nov  3 13:34:31 2022 |
| alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5      | eu-west-1a | running | 7da36ba4-479e-42fd-bc55-641409ff1c77 | yarongilor | Thu Nov  3 17:24:34 2022 |

Original logs of the SCT test:

18:59:45  +-----------------------------------------------------------------------------------------------------------------------------------------------+
18:59:45  |                                        Collected logs by test-id: 7da36ba4-479e-42fd-bc55-641409ff1c77                                        |
18:59:45  +-------------+---------------------------------------------------------------------------------------------------------------------------------+
18:59:45  | Cluster set | Link                                                                                                                            |
18:59:45  +-------------+---------------------------------------------------------------------------------------------------------------------------------+
18:59:45  | db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/7da36ba4-479e-42fd-bc55-641409ff1c77/20221105_163501/db-cluster-7da36ba4.tar.gz  |
18:59:45  | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7da36ba4-479e-42fd-bc55-641409ff1c77/20221105_163501/monitor-set-7da36ba4.tar.gz |
18:59:45  | loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/7da36ba4-479e-42fd-bc55-641409ff1c77/20221105_163501/loader-set-7da36ba4.tar.gz  |
18:59:45  | sct-runner  | https://cloudius-jenkins-test.s3.amazonaws.com/7da36ba4-479e-42fd-bc55-641409ff1c77/20221105_163501/sct-runner-7da36ba4.tar.gz  |
18:59:45  +-------------+---------------------------------------------------------------------------------------------------------------------------------+

Logs collected manually after the test run ended and after the manual operations (repair + major compaction on all nodes):

Jenkins job URL

@fgelcer added the area/alternator label Nov 7, 2022
@yarongilor (Author)

live nodes on AWS:
[screenshot]


yarongilor commented Nov 7, 2022

node-5 state:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db -hs
 117M -rw-r--r-- 1 scylla scylla  117M Nov  6 00:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-128542-big-Data.db
 1.1G -rw-r--r-- 1 scylla scylla  1.1G Nov  7 06:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-145526-big-Data.db
 1.1G -rw-r--r-- 1 scylla scylla  1.1G Nov  7 06:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-145541-big-Data.db
 722M -rw-r--r-- 1 scylla scylla  722M Nov  7 06:44 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-145864-big-Data.db
 943M -rw-r--r-- 1 scylla scylla  943M Nov  7 06:43 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-145882-big-Data.db
 709M -rw-r--r-- 1 scylla scylla  709M Nov  7 06:44 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-146053-big-Data.db
1006M -rw-r--r-- 1 scylla scylla 1006M Nov  7 06:45 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-146244-big-Data.db
 1.1G -rw-r--r-- 1 scylla scylla  1.1G Nov  7 06:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-146461-big-Data.db
1015M -rw-r--r-- 1 scylla scylla 1015M Nov  7 06:45 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-146490-big-Data.db
1009M -rw-r--r-- 1 scylla scylla 1009M Nov  7 06:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-146558-big-Data.db
 1.1G -rw-r--r-- 1 scylla scylla  1.1G Nov  7 06:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-147387-big-Data.db
 1.2G -rw-r--r-- 1 scylla scylla  1.2G Nov  7 06:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-147673-big-Data.db
 485M -rw-r--r-- 1 scylla scylla  485M Nov  7 07:34 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-150517-big-Data.db
 568M -rw-r--r-- 1 scylla scylla  568M Nov  7 07:34 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-151185-big-Data.db
 654M -rw-r--r-- 1 scylla scylla  654M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-151462-big-Data.db
 665M -rw-r--r-- 1 scylla scylla  665M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-151463-big-Data.db
 657M -rw-r--r-- 1 scylla scylla  657M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-151744-big-Data.db
 569M -rw-r--r-- 1 scylla scylla  569M Nov  7 07:34 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-151902-big-Data.db
 539M -rw-r--r-- 1 scylla scylla  539M Nov  7 07:34 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-151918-big-Data.db
 573M -rw-r--r-- 1 scylla scylla  573M Nov  7 07:34 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-152110-big-Data.db
 645M -rw-r--r-- 1 scylla scylla  645M Nov  7 07:34 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-152171-big-Data.db
 643M -rw-r--r-- 1 scylla scylla  643M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-152523-big-Data.db
 631M -rw-r--r-- 1 scylla scylla  631M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-152538-big-Data.db
 621M -rw-r--r-- 1 scylla scylla  621M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-152676-big-Data.db
 662M -rw-r--r-- 1 scylla scylla  662M Nov  7 07:35 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-153393-big-Data.db
 827M -rw-r--r-- 1 scylla scylla  827M Nov  7 07:36 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/me-153637-big-Data.db
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
26
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 72
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 10630737
	Read Latency: 4.8010396645124413E-4 ms
	Write Count: 591986672
	Write Latency: 7.22821002970148E-6 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 14
		SSTables in each level: [14/4]
		Space used (live): 18398912512
		Space used (total): 18398912512
		Space used by snapshots (total): 0
		Off heap memory used (total): 206641440
		SSTable Compression Ratio: 0.528944
		Number of partitions (estimate): 293853780
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 399
		Local read count: 10630737
		Local read latency: 0.480 ms
		Local write count: 591986672
		Local write latency: 0.007 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 194167184
		Bloom filter off heap memory used: 195035192
		Index summary off heap memory used: 11606248
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 51
		Compacted partition maximum bytes: 60
		Compacted partition mean bytes: 60
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------

An additional related issue is that after repairing all nodes except node-1, nodetool status reported noticeably smaller loads than after node-1 was repaired:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
96
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ nodetool compact
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
12
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 72
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 10630737
	Read Latency: 4.8010396645124413E-4 ms
	Write Count: 591986672
	Write Latency: 7.22821002970148E-6 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 0
		Space used (live): 0
		Space used (total): 0
		Space used by snapshots (total): 0
		Off heap memory used (total): 0
		SSTable Compression Ratio: 0.0
		Number of partitions (estimate): 0
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 399
		Local read count: 10630737
		Local read latency: 0.480 ms
		Local write count: 591986672
		Local write latency: 0.007 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 0
		Bloom filter off heap memory used: 0
		Index summary off heap memory used: 0
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 0
		Compacted partition maximum bytes: 0
		Compacted partition mean bytes: 0
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ df -h /var/lib/scylla/data/alternator_usertable_no_lwt
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        3.5T  137G  3.4T   4% /var/lib/scylla
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5:~$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.213  988.45 MB  256          ?       bdb02b1e-819a-452e-92a4-1b093438db8b  1a
UN  10.4.0.39   1010.19 MB  256          ?       2654a908-baaf-4fb9-af5f-e12653552b01  1a
UN  10.4.0.55   55.67 GB   256          ?       91ed0563-c5a6-4705-a0de-633862bfba92  1a
UN  10.4.0.27   964.69 MB  256          ?       96c48dab-a1d7-46b4-9a96-beb77c14241f  1a

Then, after node-1 was repaired, it had:

$ nodetool status

Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.213  18.1 GB    256          ?       bdb02b1e-819a-452e-92a4-1b093438db8b  1a
UN  10.4.0.39   22.06 GB   256          ?       2654a908-baaf-4fb9-af5f-e12653552b01  1a
UN  10.4.0.55   55.67 GB   256          ?       91ed0563-c5a6-4705-a0de-633862bfba92  1a
UN  10.4.0.27   19.44 GB   256          ?       96c48dab-a1d7-46b4-9a96-beb77c14241f  1a

@yarongilor (Author)

node-16 state:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-16:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 72
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 2999145
	Read Latency: 5.025175508353213E-4 ms
	Write Count: 535973115
	Write Latency: 6.43605043510438E-6 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 14
		SSTables in each level: [14/4]
		Space used (live): 22628718592
		Space used (total): 22628718592
		Space used by snapshots (total): 0
		Off heap memory used (total): 239912872
		SSTable Compression Ratio: 0.530966
		Number of partitions (estimate): 360759381
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 361
		Local read count: 2999145
		Local read latency: 0.503 ms
		Local write count: 535973115
		Local write latency: 0.006 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 225724144
		Bloom filter off heap memory used: 226623544
		Index summary off heap memory used: 13289328
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 51
		Compacted partition maximum bytes: 60
		Compacted partition mean bytes: 60
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-16:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
27

@yarongilor (Author)

node-15 state:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-15:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 72
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 9782108
	Read Latency: 3.1535881632057224E-4 ms
	Write Count: 601187371
	Write Latency: 6.921040262504118E-6 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 14
		SSTables in each level: [14/4]
		Space used (live): 19862441984
		Space used (total): 19862441984
		Space used by snapshots (total): 0
		Off heap memory used (total): 211460276
		SSTable Compression Ratio: 0.529678
		Number of partitions (estimate): 317113703
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 405
		Local read count: 9782108
		Local read latency: 0.315 ms
		Local write count: 601187371
		Local write latency: 0.007 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 198446144
		Bloom filter off heap memory used: 199360568
		Index summary off heap memory used: 12099708
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 51
		Compacted partition maximum bytes: 60
		Compacted partition mean bytes: 60
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-15:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
32

@yarongilor (Author)

A CQL count(*) query does return 0:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ cqlsh 10.4.0.55  -e "SELECT count(*) from alternator_usertable_no_lwt.usertable_no_lwt using timeout 10m" --request-timeout 300

 count
-------
     0

(1 rows)

@raphaelsc (Member)

A CQL count(*) query does return 0:

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ cqlsh 10.4.0.55  -e "SELECT count(*) from alternator_usertable_no_lwt.usertable_no_lwt using timeout 10m" --request-timeout 300

 count
-------
     0

(1 rows)

Please try the same query using BYPASS CACHE (see the sketch below).
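
For reference, the suggested variant would be something like this (BYPASS CACHE is a Scylla CQL extension that skips the row cache), reusing the timeout options from the query above:

cqlsh 10.4.0.55 -e "SELECT count(*) from alternator_usertable_no_lwt.usertable_no_lwt BYPASS CACHE using timeout 10m" --request-timeout 300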

@raphaelsc (Member)

Actually, scratch that. Please run nodetool flush and nodetool compact, to see if disk usage drops significantly.
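
That is, on each node, something like the following (both commands accept a keyspace argument, so the flush and major compaction can be limited to the affected keyspace):

nodetool flush alternator_usertable_no_lwt
nodetool compact alternator_usertable_no_lwt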


yarongilor commented Nov 8, 2022

@raphaelsc, after running nodetool flush and compact there's not much change (it doesn't get significantly closer to zero sstables or partitions):

scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 72
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 13269849
	Read Latency: 7.524749528046627E-4 ms
	Write Count: 600764593
	Write Latency: 7.30815039893671E-6 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 14
		SSTables in each level: [14/4]
		Space used (live): 545829888
		Space used (total): 545829888
		Space used by snapshots (total): 0
		Off heap memory used (total): 258365815
		SSTable Compression Ratio: 0.615267
		Number of partitions (estimate): 2715681
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.213  18.1 GB    256          ?       bdb02b1e-819a-452e-92a4-1b093438db8b  1a
UN  10.4.0.39   22.06 GB   256          ?       2654a908-baaf-4fb9-af5f-e12653552b01  1a
UN  10.4.0.55   1.45 GB    256          ?       91ed0563-c5a6-4705-a0de-633862bfba92  1a
UN  10.4.0.27   19.44 GB   256          ?       96c48dab-a1d7-46b4-9a96-beb77c14241f  1a
scyllaadm@alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-8d33ff105b8011eda2a272ed9ab6957f/*big-Data.db | wc -l
16


yarongilor commented Nov 8, 2022

In a shorter and smaller test without nemesis, the results did get to zero sstables and partitions.

It ran 4 write-stress commands like:

bin/ycsb load dynamodb -P workloads/workloadc -threads 13 -p recordcount=8589934401 \
    -p fieldcount=4 -p fieldlength=32 \
    -p insertstart=0 -p insertcount=21474836 -p table=usertable_no_lwt \
    -p dynamodb.ttlKey=ttl -p dynamodb.ttlDuration=5400

Test id: c6c93337-a887-4398-8e0f-3e7b38f13f65

BEFORE repair and major compaction:

(the relevant node is db-node-c6c93337-4 with IP 10.4.2.100)

--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.100  12.58 GB   256          ?       daf88116-57da-4a77-8bd0-8777a47daf65  1a
UN  10.4.0.228  12.53 GB   256          ?       3e5c3173-a3c8-40ee-855f-80c36953cc42  1a
UN  10.4.2.193  12.07 GB   256          ?       80c9287f-8a09-4730-9589-c84bc4054e9d  1a
UN  10.4.3.61   12.07 GB   256          ?       d55be98e-5450-4de1-bd9d-611000d4af58  1a
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/*big-Data.db | wc -l
42
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 11767643
	Read Latency: 2.937487141647652E-5 ms
	Write Count: 115039966
	Write Latency: 1.0820039706896296E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 42
		SSTables in each level: [42/4]
		Space used (live): 13508375552
		Space used (total): 13508375552
		Space used by snapshots (total): 0
		Off heap memory used (total): 9636790055
		SSTable Compression Ratio: 0.657857
		Number of partitions (estimate): 86106006
		Memtable cell count: 12342478
		Memtable data size: 5180155674
		Memtable off heap memory used: 9510846464
		Memtable switch count: 84
		Local read count: 11767643
		Local read latency: 0.029 ms
		Local write count: 115039966
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 107915936
		Bloom filter off heap memory used: 111935656
		Index summary off heap memory used: 14007935
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 258
		Compacted partition mean bytes: 174
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/*big-Data.db -hl
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-100-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-101-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-102-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-103-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-104-big-Data.db
-rw-r--r-- 1 scylla scylla 605M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-105-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-106-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-107-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:45 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-108-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:45 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-109-big-Data.db
-rw-r--r-- 1 scylla scylla 605M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-110-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-111-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-112-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-113-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-114-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-115-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-116-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-117-big-Data.db
-rw-r--r-- 1 scylla scylla  88M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-118-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-119-big-Data.db
-rw-r--r-- 1 scylla scylla  88M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-120-big-Data.db
-rw-r--r-- 1 scylla scylla  88M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-121-big-Data.db
-rw-r--r-- 1 scylla scylla  88M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-122-big-Data.db
-rw-r--r-- 1 scylla scylla  88M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-123-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-124-big-Data.db
-rw-r--r-- 1 scylla scylla  87M Nov  7 14:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-125-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-126-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-127-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-128-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-129-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-130-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-131-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-132-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-133-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-134-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-135-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-136-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-137-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-138-big-Data.db
-rw-r--r-- 1 scylla scylla  41M Nov  7 14:56 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-139-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-98-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:46 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-99-big-Data.db

Run a repair on node-4:

scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ nodetool repair 
[2022-11-07 15:41:29,838] Starting repair command #1, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
[2022-11-07 15:41:30,867] Repair session 1 
[2022-11-07 15:41:30,867] Repair session 1 finished
[2022-11-07 15:41:30,875] Starting repair command #2, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
[2022-11-07 15:41:31,978] Repair session 2 
[2022-11-07 15:41:31,979] Repair session 2 finished
[2022-11-07 15:41:33,090] Repair session 3 
[2022-11-07 15:41:33,091] Repair session 3 finished
[2022-11-07 15:41:33,107] Starting repair command #4, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-11-07 15:41:34,201] Repair session 4 
[2022-11-07 15:41:34,201] Repair session 4 finished
[2022-11-07 15:41:34,209] Starting repair command #5, repairing 1 ranges for keyspace alternator_usertable_no_lwt (parallelism=SEQUENTIAL, full=true)
[2022-11-07 15:42:50,325] Repair session 5 
[2022-11-07 15:42:50,326] Repair session 5 finished

Check SSTABLE files after repair:

scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/*big-Data.db -hl
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-100-big-Data.db
-rw-r--r-- 1 scylla scylla 110K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1000-big-Data.db
-rw-r--r-- 1 scylla scylla  12K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1001-big-Data.db
-rw-r--r-- 1 scylla scylla  55K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1002-big-Data.db
-rw-r--r-- 1 scylla scylla  41K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1003-big-Data.db
-rw-r--r-- 1 scylla scylla  15K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1004-big-Data.db
-rw-r--r-- 1 scylla scylla  12K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1005-big-Data.db
-rw-r--r-- 1 scylla scylla 5.4K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1006-big-Data.db
-rw-r--r-- 1 scylla scylla  41K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1007-big-Data.db
-rw-r--r-- 1 scylla scylla  73K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1008-big-Data.db
-rw-r--r-- 1 scylla scylla  36K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1009-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-101-big-Data.db
-rw-r--r-- 1 scylla scylla 4.7K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1010-big-Data.db
-rw-r--r-- 1 scylla scylla  21K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1011-big-Data.db
-rw-r--r-- 1 scylla scylla  19K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1012-big-Data.db
-rw-r--r-- 1 scylla scylla 5.7K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1013-big-Data.db
-rw-r--r-- 1 scylla scylla  27K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1014-big-Data.db
-rw-r--r-- 1 scylla scylla  36K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1015-big-Data.db
-rw-r--r-- 1 scylla scylla  15K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1016-big-Data.db
-rw-r--r-- 1 scylla scylla  11K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1017-big-Data.db
-rw-r--r-- 1 scylla scylla 9.5K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1018-big-Data.db
-rw-r--r-- 1 scylla scylla 6.9K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1019-big-Data.db
-rw-r--r-- 1 scylla scylla 606M Nov  7 12:47 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-102-big-Data.db
-rw-r--r-- 1 scylla scylla  43K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1020-big-Data.db
-rw-r--r-- 1 scylla scylla  12K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1021-big-Data.db
-rw-r--r-- 1 scylla scylla  27K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1022-big-Data.db
-rw-r--r-- 1 scylla scylla 120K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1023-big-Data.db
-rw-r--r-- 1 scylla scylla  34K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1024-big-Data.db
-rw-r--r-- 1 scylla scylla 7.1K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1025-big-Data.db
-rw-r--r-- 1 scylla scylla  49K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1026-big-Data.db
-rw-r--r-- 1 scylla scylla  65K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1027-big-Data.db
-rw-r--r-- 1 scylla scylla  20K Nov  7 15:41 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-1028-big-Data.db

AFTER REPAIR:
=======
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 11767643
	Read Latency: 2.937487141647652E-5 ms
	Write Count: 115039966
	Write Latency: 1.0820039706896296E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 5697
		SSTables in each level: [5697/4]
		Space used (live): 13783121920
		Space used (total): 13783121920
		Space used by snapshots (total): 0
		Off heap memory used (total): 9511952576
		SSTable Compression Ratio: 0.674864
		Number of partitions (estimate): 88100878
		Memtable cell count: 12342478
		Memtable data size: 5180155633
		Memtable off heap memory used: 9372565504
		Memtable switch count: 84
		Local read count: 11767643
		Local read latency: 0.029 ms
		Local write count: 115039966
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 108006416
		Bloom filter off heap memory used: 112003516
		Index summary off heap memory used: 27383556
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 258
		Compacted partition mean bytes: 172
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/*big-Data.db | wc -l
5697
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.100  12.69 GB   256          ?       daf88116-57da-4a77-8bd0-8777a47daf65  1a
UN  10.4.0.228  12.61 GB   256          ?       3e5c3173-a3c8-40ee-855f-80c36953cc42  1a
UN  10.4.2.193  12.14 GB   256          ?       80c9287f-8a09-4730-9589-c84bc4054e9d  1a
UN  10.4.3.61   12.14 GB   256          ?       d55be98e-5450-4de1-bd9d-611000d4af58  1a

scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ df -h /var/lib/scylla
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        3.5T  118G  3.4T   4% /var/lib/scylla

AFTER MAJOR COMPACTION:

--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.100  1.96 GB    256          ?       daf88116-57da-4a77-8bd0-8777a47daf65  1a
UN  10.4.0.228  12.61 GB   256          ?       3e5c3173-a3c8-40ee-855f-80c36953cc42  1a
UN  10.4.2.193  12.14 GB   256          ?       80c9287f-8a09-4730-9589-c84bc4054e9d  1a
UN  10.4.3.61   12.14 GB   256          ?       d55be98e-5450-4de1-bd9d-611000d4af58  1a

scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ df -h /var/lib/scylla
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        3.5T  107G  3.4T   4% /var/lib/scylla
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/*big-Data.db | wc -l
14
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 11767643
	Read Latency: 2.937487141647652E-5 ms
	Write Count: 115039966
	Write Latency: 1.0820039706896296E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 14
		SSTables in each level: [14/4]
		Space used (live): 2106404864
		Space used (total): 2106404864
		Space used by snapshots (total): 0
		Off heap memory used (total): 128150482
		SSTable Compression Ratio: 0.495797
		Number of partitions (estimate): 34019881
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 98
		Local read count: 11767643
		Local read latency: 0.029 ms
		Local write count: 115039966
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 125310464
		Bloom filter off heap memory used: 126615608
		Index summary off heap memory used: 1534874
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 60
		Compacted partition mean bytes: 60
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/*big-Data.db -sh
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5557-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5584-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5691-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5694-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5737-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5741-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5860-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5913-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5931-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5964-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-5966-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-6010-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-6068-big-Data.db
66M -rw-r--r-- 1 scylla scylla 66M Nov  7 15:50 /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/me-6093-big-Data.db
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ time(cqlsh 10.4.2.100 -e "SELECT count(*) from alternator_usertable_no_lwt.usertable_no_lwt using timeout 10m" --request-timeout 300)

 count
-------
     0

(1 rows)

real	0m29.534s

AFTER GRACE PERIOD AND ANOTHER MAJOR COMPACTION:

12K	/var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd
0	/var/lib/scylla/data/alternator_usertable_no_lwt
0	/var/lib/scylla/data
11G	/var/lib/scylla/commitlog
0	/var/lib/scylla/hints/0
0	/var/lib/scylla/hints/1
0	/var/lib/scylla/hints/10
0	/var/lib/scylla/hints/11
0	/var/lib/scylla/hints/12
0	/var/lib/scylla/hints/13
0	/var/lib/scylla/hints/2
0	/var/lib/scylla/hints/3
0	/var/lib/scylla/hints/4
0	/var/lib/scylla/hints/5
0	/var/lib/scylla/hints/6
0	/var/lib/scylla/hints/7
0	/var/lib/scylla/hints/8
0	/var/lib/scylla/hints/9
0	/var/lib/scylla/hints
0	/var/lib/scylla/view_hints/0
0	/var/lib/scylla/view_hints/1
0	/var/lib/scylla/view_hints/10
0	/var/lib/scylla/view_hints/11
0	/var/lib/scylla/view_hints/12
0	/var/lib/scylla/view_hints/13
0	/var/lib/scylla/view_hints/2
0	/var/lib/scylla/view_hints/3
0	/var/lib/scylla/view_hints/4
0	/var/lib/scylla/view_hints/5
0	/var/lib/scylla/view_hints/6
0	/var/lib/scylla/view_hints/7
0	/var/lib/scylla/view_hints/8
0	/var/lib/scylla/view_hints/9
0	/var/lib/scylla/view_hints
0	/var/lib/scylla/saved_caches
12K	/var/lib/scylla/logs
0	/var/lib/scylla
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 11767643
	Read Latency: 2.937487141647652E-5 ms
	Write Count: 115039966
	Write Latency: 1.0820039706896296E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 0
		Space used (live): 0
		Space used (total): 0
		Space used by snapshots (total): 0
		Off heap memory used (total): 0
		SSTable Compression Ratio: 0.0
		Number of partitions (estimate): 0
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 98
		Local read count: 11767643
		Local read latency: 0.029 ms
		Local write count: 115039966
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 0
		Bloom filter off heap memory used: 0
		Index summary off heap memory used: 0
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 0
		Compacted partition maximum bytes: 0
		Compacted partition mean bytes: 0
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.100  2.33 MB    256          ?       daf88116-57da-4a77-8bd0-8777a47daf65  1a
UN  10.4.0.228  12.61 GB   256          ?       3e5c3173-a3c8-40ee-855f-80c36953cc42  1a
UN  10.4.2.193  12.14 GB   256          ?       80c9287f-8a09-4730-9589-c84bc4054e9d  1a
UN  10.4.3.61   12.14 GB   256          ?       d55be98e-5450-4de1-bd9d-611000d4af58  1a
scyllaadm@3h-ttl-128k-data-alternat-db-node-c6c93337-4:~$ ll /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-37d04ed05e9511ed9c9611aa1d6cebdd/
total 12
drwxr-xr-x 5 scylla scylla 8192 Nov  8 07:22 ./
drwxr-xr-x 3 scylla scylla   63 Nov  7 12:10 ../
drwxr-xr-x 2 scylla scylla    6 Nov  8 07:22 pending_delete/
drwxr-xr-x 2 scylla scylla    6 Nov  7 12:10 staging/
drwxr-xr-x 2 scylla scylla    6 Nov  7 12:10 upload/


yarongilor commented Nov 9, 2022

The issue does reproduce in a smaller, 5-hour test with nemesis.
Node-4 (10.4.3.183) went through the following manual scenario after the SCT test ended.

before repair:

|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.3.183  5.8 GB     256          ?       30ec5f71-8e91-4e8f-aaa1-663f71d58562  1a
UN  10.4.3.96   5.24 GB    256          ?       23923798-2d26-40a7-86c7-12cb25cbae1f  1a
UN  10.4.2.63   5.79 GB    256          ?       6b3adb1d-f858-4d5c-b042-ff1851b2431d  1a
UN  10.4.3.186  7.2 GB     256          ?       ebd72f36-81af-436d-9f0e-3b8cc16676ec  1a

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
This EC2 instance is optimized for Scylla.

scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ time(cqlsh 10.4.3.183 -e "SELECT count(*) from alternator_usertable_no_lwt.usertable_no_lwt using timeout 15m" --request-timeout 900)

 count
-------
     0

(1 rows)

real	9m1.691s
user	0m1.321s
sys	0m0.141s
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt

Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 0
	Read Latency: NaN ms
	Write Count: 122992117
	Write Latency: 1.095821450085293E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 9
		SSTables in each level: [9/4]
		Space used (live): 6231398400
		Space used (total): 6231398400
		Space used by snapshots (total): 0
		Off heap memory used (total): 519790294
		SSTable Compression Ratio: 0.522272
		Number of partitions (estimate): 83855435
		Memtable cell count: 1060925
		Memtable data size: 445878382
		Memtable off heap memory used: 448004096
		Memtable switch count: 100
		Local read count: 0
		Local read latency: NaN ms
		Local write count: 122992117
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 66851008
		Bloom filter off heap memory used: 67371044
		Index summary off heap memory used: 4415154
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 124
		Compacted partition mean bytes: 86
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------

after repair:

scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool repair 
[2022-11-09 06:35:07,666] Repair session 249 
[2022-11-09 06:35:07,671] Repair session 249 finished
[2022-11-09 06:35:07,813] Starting repair command #250, repairing 1 ranges for keyspace alternator_usertable (parallelism=SEQUENTIAL, full=true)
[2022-11-09 06:35:10,929] Repair session 250 
[2022-11-09 06:35:10,931] Repair session 250 finished
[2022-11-09 06:35:10,964] Starting repair command #251, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-11-09 06:35:17,089] Repair session 251 
[2022-11-09 06:35:17,089] Repair session 251 finished
[2022-11-09 06:35:17,173] Starting repair command #252, repairing 1 ranges for keyspace alternator_usertable_no_lwt (parallelism=SEQUENTIAL, full=true)
[2022-11-09 07:02:17,204] Repair session 252 
[2022-11-09 07:02:17,204] Repair session 252 finished
[2022-11-09 07:02:17,222] Starting repair command #253, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
[2022-11-09 07:02:31,577] Repair session 253 
[2022-11-09 07:02:31,577] Repair session 253 finished
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.3.183  7.38 GB    256          ?       30ec5f71-8e91-4e8f-aaa1-663f71d58562  1a
UN  10.4.3.96   6.57 GB    256          ?       23923798-2d26-40a7-86c7-12cb25cbae1f  1a
UN  10.4.2.63   6.94 GB    256          ?       6b3adb1d-f858-4d5c-b042-ff1851b2431d  1a
UN  10.4.3.186  8.13 GB    256          ?       ebd72f36-81af-436d-9f0e-3b8cc16676ec  1a

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 0
	Read Latency: NaN ms
	Write Count: 122992117
	Write Latency: 1.095821450085293E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 1458
		SSTables in each level: [1458/4]
		Space used (live): 7924571136
		Space used (total): 7924571136
		Space used by snapshots (total): 0
		Off heap memory used (total): 523593073
		SSTable Compression Ratio: 0.514505
		Number of partitions (estimate): 108599552
		Memtable cell count: 1060925
		Memtable data size: 445878391
		Memtable off heap memory used: 447741952
		Memtable switch count: 100
		Local read count: 0
		Local read latency: NaN ms
		Local write count: 122992117
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 66874192
		Bloom filter off heap memory used: 67388432
		Index summary off heap memory used: 8462689
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 124
		Compacted partition mean bytes: 85
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------

after major compaction:

scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool compact
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 0
	Read Latency: NaN ms
	Write Count: 122992117
	Write Latency: 1.095821450085293E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 3
		SSTables in each level: [3]
		Space used (live): 1530227712
		Space used (total): 1530227712
		Space used by snapshots (total): 0
		Off heap memory used (total): 48718237
		SSTable Compression Ratio: 0.524236
		Number of partitions (estimate): 24744514
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 102
		Local read count: 0
		Local read latency: NaN ms
		Local write count: 122992117
		Local write latency: 0.011 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 47514504
		Bloom filter off heap memory used: 47710220
		Index summary off heap memory used: 1008017
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 124
		Compacted partition mean bytes: 60
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.3.183  1.43 GB    256          ?       30ec5f71-8e91-4e8f-aaa1-663f71d58562  1a
UN  10.4.3.96   6.22 GB    256          ?       23923798-2d26-40a7-86c7-12cb25cbae1f  1a
UN  10.4.2.63   6.68 GB    256          ?       6b3adb1d-f858-4d5c-b042-ff1851b2431d  1a
UN  10.4.3.186  7.96 GB    256          ?       ebd72f36-81af-436d-9f0e-3b8cc16676ec  1a
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ grep -c '"partition" : {' me-3074.json 
13333844
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ ll -h me-3074.json 
-rw-rw-r-- 1 scyllaadm scyllaadm 4.9G Nov  9 07:40 me-3074.json
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ grep -c 'marked_deleted' me-3074.json 
13333844
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ date
Wed Nov  9 07:51:54 UTC 2022
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ grep '"partition" : {' me-3074.json -B 10 -A 10 | head -n 50
[
  {
    "partition" : {
      "key" : [ "user8701641602035344077" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ "YCSB_0" ],
        "deletion_info" : { "marked_deleted" : "2022-11-08T14:09:36.360650Z", "local_delete_time" : "2022-11-08T14:09:36Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user7496580425906310589" ],
      "position" : 57
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 94,
        "clustering" : [ "YCSB_0" ],
        "deletion_info" : { "marked_deleted" : "2022-11-08T14:09:36.361486Z", "local_delete_time" : "2022-11-08T14:09:36Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user2691499473368432695" ],
      "position" : 114
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 151,
        "clustering" : [ "YCSB_0" ],
        "deletion_info" : { "marked_deleted" : "2022-11-08T14:09:36.362358Z", "local_delete_time" : "2022-11-08T14:09:36Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user8494020772433182794" ],
      "position" : 171
scyllaadm@36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4:~$ 

Installation details

Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 4 nodes (i3.large)

Scylla Nodes used in this run:

  • 36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-4 (3.251.89.87 | 10.4.3.183) (shards: 2)
  • 36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-3 (3.252.163.213 | 10.4.3.96) (shards: 2)
  • 36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-2 (52.211.222.58 | 10.4.2.63) (shards: 2)
  • 36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-1 (34.255.11.123 | 10.4.3.186) (shards: 2)

OS / Image: ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)

Test: longevity-alternator-dbg
Test id: d46a98ee-6981-4dd4-9970-2591068e3b32
Test name: scylla-staging/yarongilor/longevity-alternator-dbg
Test config file(s):

Issue description


  • Restore Monitor Stack command: $ hydra investigate show-monitor d46a98ee-6981-4dd4-9970-2591068e3b32
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d46a98ee-6981-4dd4-9970-2591068e3b32

Logs:

Jenkins job URL

@bhalevy
Member

bhalevy commented Nov 9, 2022

Cc @nyh since this issue is about Alternator TTL, not CQL TTL.

@nyh nyh removed their assignment Nov 17, 2022
@nyh
Contributor

nyh commented Nov 17, 2022

This issue has a ton of text, but I don't understand what the problem being reported here actually is...

First, when you do a "SELECT *" on the table, do you get zero rows or not?
If you don't, it's an Alternator TTL bug. If you do get zero rows, Alternator TTL is working fine, and this issue, if there even is one, isn't related to Alternator TTL. I'm unassigning this issue from myself until it's clear that it's an issue and that it has anything to do with Alternator TTL.

Second, if you see zero rows but largish sstables, and if there are unexpired tombstones, it's not surprising that we have largish sstables to contain them... It's not a bug, it's working as intended.

Third, if you see zero rows and the gc-grace-period has passed since the time the Alternator TTL deleted those rows, and you did a major compaction, you'd expect to see zero-size sstables. In one of the results above I see that you saw exactly that - zero data size, exactly like we expect, so no bug here:

0	/var/lib/scylla/data

What was surprising (for me) in that output, though, was:

11G	/var/lib/scylla/commitlog

I'm not a commitlog expert (@elcallio maybe you can comment), but why would large commit logs remain long after writing stopped (and, in our case, after the old data was deleted)? Is this normal - e.g., are old files kept to be "recycled" - or might it indicate a bug? Again, if it's a bug, it's not an Alternator bug.

Fourth, in the original issue message (which I'm not sure is the same as the following runs you did...) you mentioned having a gc-grace-period of 2 hours, but doing a repair after a full day. This is theoretically wrong. If for some reason one of the nodes missed some deletion operations (I don't know why it would, though...), the repair would resurrect this data. This can explain non-zero sstables, but this explanation is only relevant if "SELECT *" returns some data. If it doesn't, then this explanation is irrelevant.
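
For reference, the table's effective gc_grace_seconds can be checked directly from the schema tables; a minimal sketch, assuming the keyspace and table names from the runs above:

cqlsh -e "SELECT gc_grace_seconds FROM system_schema.tables WHERE keyspace_name = 'alternator_usertable_no_lwt' AND table_name = 'usertable_no_lwt'"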

@yarongilor please clarify what you think the bug here is.

@yarongilor yarongilor changed the title [Alternator] Large number of sstables with large sizes left after TTL expiration and major compaction [Alternator] Large number of sstables with large sizes left after TTL expiration and major compaction (tombstones are not deleted) Nov 17, 2022
@yarongilor
Author

yarongilor commented Nov 17, 2022

@nyh , let me summarize all the above results and point to the issue (which indeed is not necessarily an Alternator one):

  • The select * always returned zero rows - no issue here.
  • The issue is that tombstones are not deleted, and the sstables containing them are left behind when they are expected to be gone.
  • This is despite the fact that (1) all the data has expired, (2) the gc-grace-period has passed, and (3) a major compaction was run.
  • The scenario that reveals this issue is running the SCT longevity test with a "disruptive nemesis".
  • The previous logging-details comment shows that the issue doesn't happen when no nemesis ran. You probably just missed the first line of that comment: "In a shorter and smaller test without nemesis the results did get to zero sstables and partitions."
  • So, since the issue is about tombstones not being deleted on compaction, and hence no space reclamation, it might be good to ask @raphaelsc and @asias for advice (the verification flow is sketched below).
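
A minimal sketch of the verification flow used above (the sstable path and generation are illustrative placeholders):

nodetool repair
nodetool compact alternator_usertable_no_lwt
/usr/bin/sstabledump /var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-<uuid>/me-<gen>-big-Data.db > /tmp/dump.json
grep -c 'marked_deleted' /tmp/dump.json    # expected to reach 0 once gc-grace has passed and major compaction has run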

@roydahan

@yarongilor does the short reproducer with nemesis reproduce the issue?
If so, what is the list of the nemeses that ran in the test?

@yarongilor
Author

yarongilor commented Nov 20, 2022

@yarongilor does the short reproducer with nemesis reproduce the issue? If so, what is the list of the nemeses that ran in the test?

@roydahan

Only 2 nemeses were actually executed (the other nemeses either were skipped or failed on the SCT side before running anything):

disrupt_mgmt_repair_cli | 36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-2 | Succeeded | 2022-11-08 11:35:32 | 2022-11-08 14:20:31
-- | -- | -- | -- | --

disrupt_restart_with_resharding | 36m-ttl-5GB-dataset-alternat-db-node-d46a98ee-2 | Succeeded | 2022-11-08 10:23:17 | 2022-11-08 10:44:26
-- | -- | -- | -- | --

Argus job

@roydahan

Ok, let's check each one of them separately to see which one is the root cause and why.

@fgelcer fgelcer added the status/missing information label Nov 20, 2022
@nyh
Contributor

nyh commented Nov 22, 2022

If I understood the points which @yarongilor demonstrated above, and from a personal chat with him, we have the following situation, which may indicate a problem not directly related to Alternator TTL but which is still a serious-sounding bug:

  1. An sstable dump shows an sstable with deleted items (tombstones).
  2. Yet, more than gc_grace_seconds have passed since those deletions happened, and
  3. A major compaction was done after gc_grace_seconds had passed.

This combination of three facts should have been impossible: a major compaction after gc_grace_seconds should have dropped all tombstones, and we shouldn't be able to see any tombstones in any sstable! If we see any, it suggests we have some sort of compaction or sstable-handling bug.

There's another clue which @yarongilor mentioned: this problem was only reproduced with the "resharding" nemesis. The "resharding nemesis" changes the number of CPUs on the node, and then changes it back again to the original number. This extra clue leads me to the following wild guess (for which I don't have any evidence): maybe the bug is somehow related to resharding operations. Resharding is supposed to replace old per-shard sstables with new sstables owned by a different set of shards. What if something in the back-and-forth resharding operation causes some "orphan" sstables to remain that don't belong to any of the current shards? If that's possible, then these sstables will not belong to any extant shard, they will not get compacted in the major compaction, and their tombstones will never be deleted.

@raphaelsc does this ring any bells? Is it possible that we leave "orphan sstables" after resharding that increases or decreases the number of cores? In general, if @yarongilor sees a problematic sstable that didn't get compacted properly, is there a way to check which shard "owns" that sstable? @yarongilor can you try looking for one of these problematic sstables' file names in the Scylla log, to see if there are any messages about compacting that sstable? A sketch of such a check follows.
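
A sketch of the suggested log check (the sstable generation is illustrative, taken from the dumps in this thread; the journald unit name assumes a standard package install):

journalctl -u scylla-server | grep 'me-3074'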

@yarongilor
Author

Ok, let's check each one of them separately to see which one is the root cause and why.

@roydahan , it is not reproduced by running either of these nemeses by itself.
(tested in https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-alternator-dbg/8/)

@roydahan

Ok, let's check each one of them separately to see which one is the root cause and why.

@roydahan , it is not reproduced by running either of these nemeses by itself. (tested in https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-alternator-dbg/8/)

So re-run with the combination of the 2 and check if it's reproducing consistently.

@yarongilor
Author

yarongilor commented Nov 28, 2022

Ok, let's check each one of them separately to see which one is the root cause and why.

@roydahan , it is not reproduced by running either of these nemeses by itself. (tested in https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-alternator-dbg/8/)

So re-run with the combination of the 2 and check if it's reproducing consistently.

rerunning in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-alternator-dbg/10/

==>

The issue reproduces similarly, using the original Sisyphus runner with the 2 nemeses:

scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:~$ nodetool repair 
[2022-11-29 14:38:00,774] Starting repair command #8, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
[2022-11-29 14:38:10,890] Repair session 8 
[2022-11-29 14:38:10,898] Repair session 8 finished
[2022-11-29 14:38:10,967] Starting repair command #9, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
[2022-11-29 14:38:35,144] Repair session 9 
[2022-11-29 14:38:35,144] Repair session 9 finished
[2022-11-29 14:38:35,163] Starting repair command #10, repairing 1 ranges for keyspace alternator_usertable (parallelism=SEQUENTIAL, full=true)
[2022-11-29 14:38:47,319] Repair session 10 
[2022-11-29 14:38:47,366] Repair session 10 finished
[2022-11-29 14:38:47,702] Starting repair command #11, repairing 1 ranges for keyspace alternator_usertable_no_lwt (parallelism=SEQUENTIAL, full=true)
[2022-11-29 15:09:32,218] Repair session 11 
[2022-11-29 15:09:32,234] Repair session 11 finished
[2022-11-29 15:09:32,662] Starting repair command #12, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-11-29 15:10:00,912] Repair session 12 
[2022-11-29 15:10:00,915] Repair session 12 finished
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:~$ nodetool compact
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:~$ nodetool cfstats alternator_usertable_^C
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:~$ nodetool cfstats alternator_usertable_no_lwt.usertable_no_lwt
Total number of tables: 61
----------------
Keyspace : alternator_usertable_no_lwt
	Read Count: 0
	Read Latency: NaN ms
	Write Count: 131837580
	Write Latency: 1.235535421690841E-5 ms
	Pending Flushes: 0
		Table: usertable_no_lwt
		SSTable count: 2
		SSTables in each level: [2]
		Space used (live): 1500385280
		Space used (total): 1500385280
		Space used by snapshots (total): 0
		Off heap memory used (total): 34556180
		SSTable Compression Ratio: 0.517335
		Number of partitions (estimate): 24436178
		Memtable cell count: 469
		Memtable data size: 197368
		Memtable off heap memory used: 393216
		Memtable switch count: 102
		Local read count: 0
		Local read latency: NaN ms
		Local write count: 131837580
		Local write latency: 0.012 ms
		Pending flushes: 0
		Percent repaired: 0.0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.00000
		Bloom filter space used: 33067232
		Bloom filter off heap memory used: 33161224
		Index summary off heap memory used: 1001740
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 43
		Compacted partition maximum bytes: 124
		Compacted partition mean bytes: 60
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

----------------
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:~$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.2.23   7.96 GB    256          ?       02ae8812-5a72-47ec-912e-c21a6bbf93e2  1a
UN  10.4.0.188  1.4 GB     256          ?       f03dfb33-c3d4-4793-ac75-b51d46a6af56  1a
UN  10.4.0.126  7.14 GB    256          ?       1489259b-903a-41e6-91b4-0d20b0c05475  1a
UN  10.4.3.201  6.65 GB    256          ?       551ec478-88eb-415e-ab17-7ba1016c51af  1a
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-1927e3606fbd11edb01db21e74cd9776$ /usr/bin/sstabledump ./me-3550-big-Data.db > /tmp/me-3550-big-Data.json
WARN  15:56:47,952 Small commitlog volume detected at /commitlog; setting commitlog_total_space_in_mb to 7396.  You can override this in cassandra.yaml
WARN  15:56:47,996 Only 12.954GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots

scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-1927e3606fbd11edb01db21e74cd9776$ 
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-1927e3606fbd11edb01db21e74cd9776$ ll -h /tmp/me-3550-big-Data.json 
-rw-rw-r-- 1 scyllaadm scyllaadm 4.6G Nov 29 15:58 /tmp/me-3550-big-Data.json
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/var/lib/scylla/data/alternator_usertable_no_lwt/usertable_no_lwt-1927e3606fbd11edb01db21e74cd9776$ cd /tmp/
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/tmp$ grep -c 'marked_deleted' me-3550-big-Data.json 
12474194
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/tmp$ grep '"partition" : {' me-3550-big-Data.json -B 10 -A 10 | head -n 50
[
  {
    "partition" : {
      "key" : [ "user5809209300113301361" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ "YCSB_0" ],
        "deletion_info" : { "marked_deleted" : "2022-11-29T09:42:34.348288Z", "local_delete_time" : "2022-11-29T09:42:34Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user5892498033956723264" ],
      "position" : 57
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 94,
        "clustering" : [ "YCSB_0" ],
        "deletion_info" : { "marked_deleted" : "2022-11-29T09:42:34.355847Z", "local_delete_time" : "2022-11-29T09:42:34Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user8056586337637327428" ],
      "position" : 114
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 151,
        "clustering" : [ "YCSB_0" ],
        "deletion_info" : { "marked_deleted" : "2022-11-29T09:42:34.365157Z", "local_delete_time" : "2022-11-29T09:42:34Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "user6645358743114004973" ],
      "position" : 171
scyllaadm@36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1:/tmp$ 

Installation details

Kernel Version: 5.15.0-1023-aws
Scylla version (or git commit hash): 2022.2.0~rc5-20221121.feb292600fc4 with build-id 172f9538efb0893c97c86cdf05622925159f4fa2
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc5.0.20221121.feb292600fc4.tar.gz
Cluster size: 4 nodes (i3.large)

Scylla Nodes used in this run:

  • 36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-4 (3.251.97.178 | 10.4.0.126) (shards: 2)
  • 36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-3 (34.244.207.12 | 10.4.3.201) (shards: 2)
  • 36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-2 (34.249.200.17 | 10.4.2.23) (shards: 2)
  • 36m-ttl-resharding-mgmtrepair-alter-db-node-31e2cfca-1 (52.213.108.126 | 10.4.0.188) (shards: 2)

OS / Image: ami-0b77b476432e37d90 (aws: eu-west-1)

Test: longevity-alternator-dbg
Test id: 31e2cfca-a72d-4c02-81d6-c1aab02b0a38
Test name: scylla-staging/yarongilor/longevity-alternator-dbg
Test config file(s):

Issue description


  • Restore Monitor Stack command: $ hydra investigate show-monitor 31e2cfca-a72d-4c02-81d6-c1aab02b0a38
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 31e2cfca-a72d-4c02-81d6-c1aab02b0a38

Logs:

Jenkins job URL

@yarongilor yarongilor changed the title [Alternator] Large number of sstables with large sizes left after TTL expiration and major compaction (tombstones are not deleted) [Alternator] Some sstables with large sizes left after TTL expiration and major compaction (tombstones are not deleted) Dec 1, 2022
@yarongilor yarongilor changed the title [Alternator] Some sstables with large sizes left after TTL expiration and major compaction (tombstones are not deleted) [Alternator] Some sstables with large sizes left after TTL expiration, gc-grace-period and major compaction (tombstones are not deleted) Dec 1, 2022
@yarongilor
Author

An automatic reproducer is now available as a Jenkins job:
https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/alternator-ttl-count-sstables/

An example output:

< t:2022-12-07 08:53:31,094 f:loader_utils.py l:94   c:AlternatorTtlLongevityTest p:DEBUG > stress cmd: bin/ycsb load dynamodb -P workloads/workloadc -threads 10 -p recordcount=8589934401 -p fieldcount=2 -p fieldlength=8 -p insertstart=0 -p insertcount=12006000  -p table=usertable_no_lwt -p dynamodb.ttlKey=ttl -p dynamodb.ttlDuration=2160
< t:2022-12-07 08:53:43,897 f:loader_utils.py l:94   c:AlternatorTtlLongevityTest p:DEBUG > stress cmd: bin/ycsb load dynamodb -P workloads/workloadc -threads 10 -p recordcount=8589934401 -p fieldcount=2 -p fieldlength=8 -p insertstart=12006000 -p insertcount=12006000 -p table=usertable_no_lwt -p dynamodb.ttlKey=ttl -p dynamodb.ttlDuration=2160
< t:2022-12-07 09:54:16,128 f:longevity_alternator_ttl_test.py l:21   c:AlternatorTtlLongevityTest p:INFO  > Run a repair on nodes..
< t:2022-12-07 10:21:08,255 f:longevity_alternator_ttl_test.py l:25   c:AlternatorTtlLongevityTest p:INFO  > Run a major compaction on node..
< t:2022-12-07 10:51:31,267 f:longevity_alternator_ttl_test.py l:35   c:AlternatorTtlLongevityTest p:INFO  > Results after a repair and a major compactions: 1492 sstables, 29072676 partitions
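
For context, the ycsb load commands above write a TTL attribute named ttl on each item; expiry depends on DynamoDB-style TTL being enabled on the table. A minimal sketch of enabling it through the standard DynamoDB API (the endpoint URL is a placeholder for the Alternator address and port; the table and attribute names are taken from the commands above):

aws dynamodb update-time-to-live --table-name usertable_no_lwt --time-to-live-specification "Enabled=true, AttributeName=ttl" --endpoint-url http://<alternator-node>:<alternator-port>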

@yarongilor
Author

what about waiting for offstrategy completion before running major?

The following API was introduced so you don't have to wait minutes until off-strategy compaction is triggered:

         "path":"/storage_service/keyspace_offstrategy_compaction/{keyspace}",
         "operations":[
            {
               "method":"POST",
               "summary":"Perform offstrategy compaction, if needed, in a single keyspace",
               "type":"boolean",
               "nickname":"perform_keyspace_offstrategy_compaction",
               "produces":[
                  "application/json"
               ],
               "parameters":[
                  {
                     "name":"keyspace",
                     "description":"The keyspace to operate on",
                     "required":true,
                     "allowMultiple":false,
                     "type":"string",
                     "paramType":"path"
                  },
                  {
                     "name":"cf",
                     "description":"Comma-separated table names",
                     "required":false,
                     "allowMultiple":false,
                     "type":"string",
                     "paramType":"query"
                  }
               ]
            }
         ]
      },

@raphaelsc , already implemented - #11915 (comment)
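
For reference, a minimal sketch of invoking this API before a major compaction (127.0.0.1:10000 is assumed to be the node's local REST API endpoint, the Scylla default; the keyspace and table names are taken from the runs above):

curl -X POST "http://127.0.0.1:10000/storage_service/keyspace_offstrategy_compaction/alternator_usertable_no_lwt?cf=usertable_no_lwt"
nodetool compact alternator_usertable_no_lwt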

@fgelcer

fgelcer commented Mar 21, 2023

what about waiting for offstrategy completion before running major?
The following API was introduced so you don't have to wait minutes until off-strategy compaction is triggered (see the API definition quoted above).

@raphaelsc , already implemented - #11915 (comment)

@raphaelsc , is there anything else required from @yarongilor ?
It seems like the behavior is expected, but @yarongilor suggested adding a warning message when running nodetool compact while off-strategy compaction is running in the background... or anything else that would bring transparency to the user...

@DoronArazii ^^

@DoronArazii DoronArazii removed the status/pending qa reproduction label Mar 23, 2023
@DoronArazii DoronArazii modified the milestones: 5.3, 5.4 Aug 1, 2023
@mykaul
Contributor

mykaul commented Oct 22, 2023

ping @raphaelsc

raphaelsc added a commit to raphaelsc/scylla that referenced this issue Oct 22, 2023
Major compaction semantics are that all data of a table will be compacted
together, so the user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in a major compaction until it is promoted into the main set by off-strategy.
So the user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in the maintenance
set will be made available to major. A similar approach is taken for
data in memtables, where a flush is performed before major starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes scylladb#11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
@raphaelsc
Member

PR sent: #15792

@mykaul mykaul added the backport/5.2 and backport/5.4 labels Oct 23, 2023
enaydanov pushed a commit to enaydanov/scylladb that referenced this issue Oct 23, 2023
Major compaction semantics are that all data of a table will be compacted
together, so the user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in a major compaction until it is promoted into the main set by off-strategy.
So the user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in the maintenance
set will be made available to major. A similar approach is taken for
data in memtables, where a flush is performed before major starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes scylladb#11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb#15792
denesb pushed a commit that referenced this issue Oct 30, 2023
Major compaction semantics are that all data of a table will be compacted
together, so the user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in a major compaction until it is promoted into the main set by off-strategy.
So the user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in the maintenance
set will be made available to major. A similar approach is taken for
data in memtables, where a flush is performed before major starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes #11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15792

(cherry picked from commit ea6c281)
@denesb
Contributor

denesb commented Oct 30, 2023

Backport queued to 5.4.

The 5.2 backport has conflicts; @raphaelsc please open a backport PR.

@denesb denesb removed the backport/5.4 label Oct 30, 2023
@avikivity
Member

@denesb note this is a very recent commit and we should be wary of backporting things before they have had a chance to get tested

5.4 is okay as it's undergoing testing anyway.

@denesb
Contributor

denesb commented Oct 31, 2023

Right. I was going over the issues which need to be backported to 5.4. I will keep in mind to delay the other backports.

@mykaul
Contributor

mykaul commented Oct 31, 2023

Right. I was going over the issues which need to be backported to 5.4. I will keep in mind to delay the other backports.

Those mental notes... We must automate them... Even if it's via ugly labels. 'Candidate-For-Backport...' -> 'Ready-For-Backport' after 2-4 weeks, for example.

@denesb
Contributor

denesb commented Dec 18, 2023

Re-visiting this: the code has soaked for more than a month now. @raphaelsc please prepare a backport PR against 5.2.

@mykaul
Contributor

mykaul commented Mar 13, 2024

ping @raphaelsc , @denesb for backport.

raphaelsc added a commit to raphaelsc/scylla that referenced this issue Mar 19, 2024
Major compaction semantics are that all data of a table will be compacted
together, so the user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in a major compaction until it is promoted into the main set by off-strategy.
So the user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in the maintenance
set will be made available to major. A similar approach is taken for
data in memtables, where a flush is performed before major starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes scylladb#11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb#15792

(cherry picked from commit ea6c281)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
@raphaelsc
Member

PR sent: #17901

denesb pushed a commit that referenced this issue Mar 20, 2024
Major compaction semantics are that all data of a table will be compacted
together, so the user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in a major compaction until it is promoted into the main set by off-strategy.
So the user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in the maintenance
set will be made available to major. A similar approach is taken for
data in memtables, where a flush is performed before major starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes #11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15792

(cherry picked from commit ea6c281)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #17901
@denesb
Contributor

denesb commented Mar 20, 2024

Backported to 5.2.

@denesb denesb removed the Backport candidate and backport/5.2 labels Mar 20, 2024