[Alternator] Some sstables with large sizes left after TTL expiration, gc-grace-period and major compaction (tombstones are not deleted) #11915
node-5 state:
An additional related issue: after repairing all nodes except node-1, nodetool status showed somewhat less data than after repairing node-1.
After the node-1 repair it had:
node-16 state:
node-15 state:
A CQL query does return 0 for a
Please try the same query using BYPASS CACHE.
Actually, scratch that. Please run nodetool flush and nodetool compact to see if disk usage drops significantly.
@raphaelsc, after running nodetool flush and compact there's not much change (not getting significantly closer to zero sstables or partitions):
In a shorter, smaller test without nemesis the results did reach zero sstables and partitions. It had 4 write-stress loads like:
Test id:
BEFORE repair and major compaction (the relevant node is db-node-c6c93337-4 with IP 10.4.2.100):
Run a repair on node-4:
Check SSTABLE files after repair:
AFTER MAJOR COMPACTION:
AFTER GRACE PERIOD AND ANOTHER MAJOR COMPACTION:
The issue does reproduce in a smaller, 5-hour test with nemesis.
Before repair:
after repair:
after major compaction:
Installation details
Kernel Version: 5.15.0-1021-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
>>>>>>>
Logs:
Cc @nyh since this issue is about Alternator TTL, not CQL TTL.
This issue has a ton of text but I don't understand at all what the problem being reported here is...
First, when you do a "SELECT *" on the table, do you get zero rows, or not?
Second, if you see zero rows but largish sstables, and if there are unexpired tombstones, it's not surprising that we have largish sstables to contain them... It's not a bug, it's working as intended.
Third, if you see zero rows and the gc-grace-period has passed since the time the Alternator TTL deleted those rows, and you did a major compaction, you'd expect to see zero-size sstables. In one of the results above I see that you saw exactly that - zero data size, exactly like we expect, so no bug here:
What was surprising (for me) in that output, though, was:
I'm not a commitlog expert (@elcallio maybe you can comment), but why would large commit logs remain long after writing stopped (and in our case, after the old data was deleted)? Is this normal - e.g., old files are kept to be "recycled" - or might it indicate a bug? Again, if it's a bug, it's not an Alternator bug.
Fourth, in the original issue message (which I'm not sure is the same as the following runs you did...) you mentioned having a gc-grace-period of 2 hours, but doing a repair after a full day. This is theoretically wrong. If for some reason one of the nodes missed some deletion operations (I don't know why it would, though...), the repair would resurrect this data. This can explain non-zero sstables, but this explanation is only relevant if "SELECT *" returns some data. If it doesn't, then this explanation is irrelevant.
@yarongilor please clarify what you think the bug here is.
@nyh, let me summarise all the above results and point to an issue (which indeed is not necessarily an Alternator one):
@yarongilor does the short reproducer with nemesis reproduce the issue?
Two nemeses actually executed (the other nemeses either skipped or failed on the SCT side before running anything):
Ok, let's check each one of them separately to see which one is the root cause and why.
If I understood the points which @yarongilor demonstrated above, and from a personal chat with him, we have the following situation, which may indicate a problem not directly related to Alternator TTL but which still sounds like a serious bug:
This combination of three facts should have been impossible: a major compaction after gc_grace_seconds should have dropped all tombstones, and we shouldn't see any tombstones in any sstable! If we do see any, it suggests we have some sort of compaction or sstable-handling bug.
There's another clue which @yarongilor mentioned: this problem was only reproduced with the "resharding" nemesis. The resharding nemesis changes the number of CPUs on the node, and then changes it back to the original number. This leads me to the following wild guess (for which I don't have any evidence): maybe the bug is somehow related to resharding operations. Resharding is supposed to replace old per-shard sstables with new sstables belonging to a different list of shards. What if something in the back-and-forth resharding operation leaves some "orphan" sstables behind that don't belong to any of the current shards? If that's possible, then these sstables will not get compacted in the major compaction, and their tombstones will never be deleted.
@raphaelsc does this ring any bells? Is it possible that we leave "orphan sstables" after resharding that increases or decreases the number of cores? In general, if @yarongilor sees a problematic sstable that didn't get compacted properly, is there a way to check which shard "owns" that sstable? @yarongilor can you try looking for one of these problematic sstables' file names in the Scylla log, to see if there are any messages about compacting this sstable?
@roydahan, it is not reproduced by running either of these nemeses on its own.
So re-run with the combination of the two and check if it reproduces consistently.
Rerunning in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-alternator-dbg/10/ ==> the issue is reproduced similarly, using the original Sisyphus with the 2 nemeses:
Installation details
Kernel Version: 5.15.0-1023-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
>>>>>>>
Logs:
An automatic reproducer is now available as a Jenkins job. An example output:
@raphaelsc, already implemented - #11915 (comment)
@raphaelsc, is there anything else required from @yarongilor? @DoronArazii ^^
ping @raphaelsc
Major compaction semantics is that all data of a table will be compacted together, so the user can expect e.g. a recently introduced tombstone to be compacted with the data it shadows. Today, it can happen that data in the maintenance set won't be included in a major compaction until it is promoted into the main set by off-strategy compaction, so the user might be left wondering why major compaction is not having the expected effect.
To fix this, let's perform off-strategy compaction first, so data in the maintenance set will be made available to major compaction. A similar approach is taken for data in memtables, where a flush is performed before major compaction starts. The only exception is data in staging, which cannot be compacted until view building is done with it, to avoid inconsistency in view replicas. The serialization of reshape jobs in the compaction manager guarantees correctness if there's an ongoing off-strategy compaction on behalf of the table.
Fixes scylladb#11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
PR sent: #15792
Same commit message as above. Fixes scylladb#11915. Closes scylladb#15792
Same commit message as above. Fixes #11915. Closes #15792 (cherry picked from commit ea6c281)
Backport queued to 5.4. The 5.2 backport has conflicts, @raphaelsc please open a backport PR.
@denesb note this is a very recent commit and we should be wary of backporting things before they've had a chance to get tested. 5.4 is okay as it's undergoing testing anyway.
Right. I was going over issues which need to be backported to 5.4. I will keep in mind to delay the other backports.
Those mental notes... We must automate them... Even if it's via ugly labels: 'Candidate-For-Backport...' -> 'Ready-For-Backport' after 2-4 weeks, for example.
Re-visiting this, the code has soaked for more than a month now. @raphaelsc please prepare a backport PR against 5.2.
ping @raphaelsc, @denesb for backport.
Same commit message as above. Fixes scylladb#11915. Closes scylladb#15792 (cherry picked from commit ea6c281) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
PR sent: #17901
Same commit message as above. Fixes #11915. Closes #15792 (cherry picked from commit ea6c281) Closes #17901
Backported to 5.2.
Installation details
Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)
Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test id: 7da36ba4-479e-42fd-bc55-641409ff1c77
Test name: scylla-staging/yarongilor/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test config file(s):
Issue description
>>>>>>>
scenario:
some large sstables are:
nodetool status:
nodetool cfstats on node-1 shows
Number of partitions (estimate): 936282794
Any CQLSH query to any range failed with a timeout (see the example after this description):
<<<<<<<
$ hydra investigate show-monitor 7da36ba4-479e-42fd-bc55-641409ff1c77
$ hydra investigate show-logs 7da36ba4-479e-42fd-bc55-641409ff1c77
Logs:
The cluster's nodes and monitor are alive in:
Original logs of the SCT test:
Manually collected logs after test run ended and manual operations (repair + major compaction on all nodes):
Jenkins job URL