TWCS sometimes stops dropping expired SSTables #8038
Comments
@titer please share the output of the sstablemetadata tool on an sstable that you think should be gone already. |
when a fully expired sstable remains alive, it's likely that an overlapping (token and time wise) sstable is blocking its deletion. this could be due to lack of #4617, where memtable flush is able to split content according to the window setting. |
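For reference, here is a minimal sketch of that blocking rule, assuming the usual TWCS semantics (names and structure are illustrative, not the actual Scylla implementation): a fully expired sstable is only purgeable if no live, non-expired sstable overlaps it both in token range and in timestamp range.

```cpp
// Illustrative sketch only: the real check lives in Scylla's compaction code
// and differs in detail. An expired sstable may be dropped when every live,
// non-expired sstable that overlaps it in token range holds strictly newer
// data, i.e. the candidate's newest write predates the other's oldest write.
#include <cstdint>
#include <vector>

struct sstable_info {
    int64_t first_token, last_token;      // token range covered
    int64_t min_timestamp, max_timestamp; // write-time range
    bool fully_expired;                   // all cells past TTL + gc_grace
};

static bool token_overlap(const sstable_info& a, const sstable_info& b) {
    return a.first_token <= b.last_token && b.first_token <= a.last_token;
}

// Returns true if 'candidate' (already fully expired) may be purged.
bool can_purge(const sstable_info& candidate,
               const std::vector<sstable_info>& live_sstables) {
    for (const auto& other : live_sstables) {
        if (&other == &candidate || other.fully_expired) {
            continue;
        }
        if (token_overlap(candidate, other) &&
            other.min_timestamp <= candidate.max_timestamp) {
            return false; // overlaps token- and time-wise: a blocker
        }
    }
    return true;
}
```

Under such a rule, sstables that overlap only in token range but not in timestamp (as reported later in this thread) would not count as blockers.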
Here is the metadata for the SSTable mentioned in the report:
sstablemetadata ./mc-5450515-big-Data.db
SSTable: /var/lib/scylla/data/streamerlogs/access-b25989d0507c11e98abf000000000000/mc-5450515-big
It seems from scylladb/scylla-tools-java#154 that no tool is available so far to easily find overlapping sstables; could we still figure it out by analyzing the metadata output? Shouldn't the data still get dropped eventually once the overlapping sstable also expires, which in the worst case should be a day later with my settings? |
sstablemetadata output looks fine |
i wrote the following tool for this purpose: https://gist.githubusercontent.com/raphaelsc/c7dfeab7b1083c129c222762edd4e672/raw/4da9e7cb1b737dd648342f9cbc85efe419bf7338/expired_sstables_status.py use it like this: |
Nice, thank you
python expired_sstables_status.py --table /var/lib/scylla/data/streamerlogs/access-b25989d0507c11e98abf000000000000 --gc-grace-seconds 3600 --default-ttl 86400 |
apparently nothing blocks your expired ssts (which amount to ~20G) from being deleted. there are sstables that overlap in token range, but they don't overlap in timestamp, so they're not considered blockers and shouldn't prevent deletion. i'll investigate this soon. for the time being, try restarting the node if possible, as a compaction process will kick in on boot to make sure the sstables respect the TWCS invariant, allowing them to be efficiently deleted if expired. |
Alright, thanks for your help! |
how is it going with regards to space? i'd ask you to set compaction logging level to debug via nodetool. please run my script again and share output |
Afraid I was a bit optimistic and I had to restart that node yesterday - but I do see the issue again on several other nodes already, and the logging level was already set to trace. Here is the script output for one of those: disk usage has been increasing since Feb 6, with 282 GB of reclaimable space now -> 2021021201-scylla-sstable-status.txt And the full logs for that node, covering Feb 5 to 12 -> 2021021200-scylla-twcs-stuck.txt.gz |
On Fri, Feb 12, 2021 at 9:00 AM Eric Petit ***@***.***> wrote:
> how is it going with regards to space?
> Afraid I was a bit optimistic and I had to restart that node yesterday - but I do see the issue again on several other nodes already, and the logging level was already set to trace.
> Here is the script output for one of those: disk usage has been increasing since Feb 6, with 282 GB of reclaimable space now -> 2021021201-scylla-sstable-status.txt <http://eric.lapsus.org/z/2021021201-scylla-sstable-status.txt>

please share with me the output of lsof on scylla pid. the script tells us that there's nothing blocking those expired sstables from being purged, they're one compaction away from it. the strategy should pick them, but for some reason, it's not doing so. log can perhaps shed some light on why that's happening.
also, please share output of sstablemetadata on one of those expired ssts with no blockers like mc-5677051-big-Data.db

> And the full logs for that node, covering Feb 5 to 12 -> 2021021200-scylla-twcs-stuck.txt.gz <http://eric.lapsus.org/z/2021021200-scylla-twcs-stuck.txt.gz>

thanks.
|
i do see some deleted sstables which weren't released yet, and those are taken into account when checking if an expired sstable can be purged. i'll investigate now why this is possibly happening. |
please send the lsof output again in a few hours from that same node; it's possible that we're dealing with a leak here, where fds of deleted sstables remain open even though they're unused. |
@titer I'd advise you to set gc_grace_seconds to 0, if you are deleting your data only through TTL, with no explicit deletions. |
I think the problem is that the procedure that collects expired sstables for compaction unconditionally discards any sstable whose ancestor is not deleted yet because something is still referencing it; that can be a range scan or some streaming-based operation. I'd appreciate it if you could share the lsof output from the same node as before. |
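To make that failure mode concrete, here is a hypothetical sketch (all names invented; not the actual Scylla code) of a selection step that skips expired candidates while any of their ancestor generations still sits in the compacted-but-not-deleted set; if stale entries leak into that set, the candidates stay on disk indefinitely.

```cpp
// Hypothetical illustration of the behaviour described above.
#include <cstdint>
#include <unordered_set>
#include <vector>

using generation = int64_t;

struct sstable_desc {
    generation gen;
    std::vector<generation> ancestors; // sstables it was compacted from
};

std::vector<sstable_desc> get_purgeable(
        const std::vector<sstable_desc>& expired_candidates,
        const std::unordered_set<generation>& compacted_but_not_deleted) {
    std::vector<sstable_desc> purgeable;
    for (const auto& sst : expired_candidates) {
        bool blocked = false;
        for (auto anc : sst.ancestors) {
            if (compacted_but_not_deleted.count(anc)) {
                blocked = true; // ancestor still referenced (scan, streaming, or a leak)
                break;
            }
        }
        if (!blocked) {
            purgeable.push_back(sst);
        }
    }
    return purgeable;
}
```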
for the time being, you can work around this by restarting your node, if it's causing a significant issue. |
patch sent to mailing list to fix 4.2 branch |
…t_not_deleted during self-race

on_compaction_completion() updates _sstables_compacted_but_not_deleted through a temporary to avoid an exception causing a partial update:
1. copy _sstables_compacted_but_not_deleted to a temporary
2. update temporary
3. do dangerous stuff
4. move temporary to _sstables_compacted_but_not_deleted

This is racy when we have parallel compactions, since step 3 yields. We can have two invocations running in parallel, taking snapshots of the same _sstables_compacted_but_not_deleted in step 1, each modifying it in different ways, and only one of them winning the race and assigning in step 4. With the right timing we can end with extra sstables in _sstables_compacted_but_not_deleted.

Before a536988, this was a benign race (only resulting in deleted file space not being reclaimed until the service is shut down), but afterwards, extra sstable references result in the service refusing to shut down. This was observed in database_test in debug mode, where the race more or less reliably happens for system.truncated.

Fix by using a different method to protect _sstables_compacted_but_not_deleted. We unconditionally update it, and also unconditionally fix it up (on success or failure) using seastar::defer(). The fixup includes a call to rebuild_statistics() which must happen every time we touch the sstable list.

Ref #7331.
Fixes #8038.

BACKPORT NOTES:
- Turns out this race prevented deletion of expired sstables because the leaked deleted sstables would be accounted when checking if an expired sstable can be purged.
- Switch to unordered_set<>::count() as it's not supported by older compilers.

(cherry picked from commit a43d507)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>
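A simplified illustration of the pattern the fix describes, using seastar::defer(); the surrounding class, helpers, and bookkeeping below are invented for the sketch, and only the member name and the overall shape follow the commit message.

```cpp
// Sketch: update the shared set in place and let a deferred action undo the
// bookkeeping on success or failure, instead of publishing a temporary copy
// at the end (which races when parallel invocations yield in step 3).
#include <seastar/util/defer.hh>
#include <cstdint>
#include <unordered_set>
#include <vector>

class compaction_state_sketch {
    std::unordered_set<int64_t> _sstables_compacted_but_not_deleted;

    void rebuild_statistics() {}   // placeholder; must run whenever the set changes
    void do_dangerous_stuff() {}   // placeholder; may yield and may throw in reality

public:
    void on_compaction_completion(const std::vector<int64_t>& compacted_gens) {
        // 1. update the shared set directly, no temporary copy
        for (auto gen : compacted_gens) {
            _sstables_compacted_but_not_deleted.insert(gen);
        }
        // 2. on success *or* failure, remove exactly the entries this
        //    invocation added and rebuild statistics, so nothing can leak
        auto fixup = seastar::defer([&]() noexcept {
            for (auto gen : compacted_gens) {
                _sstables_compacted_but_not_deleted.erase(gen);
            }
            rebuild_statistics();
        });
        // 3. the step that may yield to other compactions
        do_dangerous_stuff();
    }
};
```

The point of the pattern is that each invocation owns exactly the entries it added and is guaranteed to take them out again, so the set is never left holding stale generations across a yield.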
Patch in 4.2-next |
Thanks @raphaelsc ! I will upgrade to 4.3 or 4.2.3 when it's out
Will do that too |
@raphaelsc @avikivity |
patch is already in 4.2 branch, so i am closing this... |
Only patches on master can close issues. |
Installation details
Scylla version: 4.2.1-0.20201108.4fb8ebccff-1
Cluster size: 17 nodes
OS: Debian 9
Hi,
I have a column family (streamerlogs.access in the logs linked below) set up with TWCS and gc_grace_seconds 3600, and all data is inserted into it with TTL 86400 (1 day), so it can really go away after 25 hours (24h TTL + 1h gc_grace).
It behaves as expected most of the time, but occasionally a node enters a condition where disk usage starts growing indefinitely, as old SSTables stop being cleaned up. I caught one such case with compaction tracing logs enabled: the condition was in fact restricted to one specific shard of that node, while others continued to drop old SSTables.
The full log (filtered on that shard) can be downloaded here: 2021020400-scylla-twcs-stuck.txt
What I see at first is Scylla dropping some expired data every hour, as expected:
But then on the next hour:
Only one of the two (5450551) is marked as "candidate" for deletion and the other is not, which is preventing any further deletion from happening for that shard's SSTables. My understanding from a quick look at the implementation (here) is that SSTable 5450515 is held because it must have some "undeleted ancestor", but that doesn't seem right - if I look directly into /var/lib/scylla/data, there is no SSTable older than that one. Might I be missing something?
On previous occurrences, bouncing the Scylla node worked around the issue: upon starting again, it immediately notices the expired SSTables and drops them.