Optimization for avoiding bloom filter during compaction is gone #14091
Fixed by 38b226f.
@raphaelsc - any idea why this issue is still open?
#13908 is the one fixing it, I reckon?
That's my fault. Closing it with 38b226f...
@raphaelsc , @avikivity - do we plan to backport this?
It's fixing a regression, so my understanding is yes, we need to backport it. Users relying on a default TTL or doing heavy deletions can be really affected by this.
@raphaelsc , @avikivity - please backport then.
@raphaelsc please prepare backports.
|
Commit 8c4b5e4 introduced an optimization that calculates the max purgeable timestamp only when a tombstone satisfies the grace period. Commit 'repair: Get rid of the gc_grace_seconds' inverted the order, probably under the assumption that fetching the grace period can be more expensive than calculating max purgeable, since repair-mode GC looks up history data in order to calculate gc_before.

This caused a significant regression in tombstone-heavy compactions, where most tombstones are still newer than the grace period: a compaction that used to take 5s now takes 35s, i.e. 7x slower. The reason is simple: the max purgeable calculation now happens for every single tombstone (once per key), even those that cannot be GC'ed yet, and each calculation has to iterate through (i.e. check the bloom filter of) every single sstable that doesn't participate in the compaction. A flame graph makes it very clear that the bloom filter is a heavy path without the optimization:

45.64% 45.64% sstable_compact sstable_compaction_test_g [.] utils::filter::bloom_filter::is_present

With the optimization resurrected, the problem is gone. This scenario can easily happen, e.g. after a deletion burst, with tombstones becoming GC'able only after they reach the upper tiers of the LSM tree.

Before this patch, the number of filter checks in a compaction can be estimated as:

(# of keys containing *any* tombstone) * (# of uncompacting sstable runs[1])

[1] It's the # of *runs*, as each key tends to overlap with only one fragment of each run.

After this patch, the estimate becomes:

(# of keys containing a GC'able tombstone) * (# of uncompacting runs)

With repair mode for tombstone GC, the assumption that retrieving gc_before is more expensive than calculating max purgeable is kept; we can revisit it later. But in the default mode, "timeout" (i.e. gc_grace_seconds), we still benefit from deferring the calculation until needed.

Cherry picked from commit 38b226f
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes scylladb#14091. Closes scylladb#13908
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
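The ordering the patch restores can be sketched as follows. This is a minimal illustration, not Scylla's actual internals: the names `tombstone`, `can_purge`, and `get_max_purgeable` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a tombstone; field names are illustrative.
struct tombstone {
    int64_t timestamp;      // write timestamp of the deletion
    int64_t deletion_time;  // wall-clock time of the deletion (seconds)
};

// The cheap grace-period test runs first; the expensive max-purgeable
// lookup (which probes the bloom filter of every sstable not taking
// part in the compaction) runs only for tombstones old enough to GC.
bool can_purge(const tombstone& t, int64_t gc_before,
               const std::function<int64_t()>& get_max_purgeable) {
    if (t.deletion_time >= gc_before) {
        return false;  // still within the grace period: no filter probes
    }
    return t.timestamp < get_max_purgeable();
}
```

The point of the ordering is that for workloads where most tombstones are newer than the grace period, `get_max_purgeable` (and hence every bloom-filter probe behind it) is never invoked at all.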
(Backport; commit message identical to the one above.) Cherry picked from commit 38b226f. Fixes #14091. Closes #13908. Closes #15744. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(Backport; commit message identical to the one above.) Cherry picked from commit 38b226f. Fixes #14091. Closes #13908. Closes #15745. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Backports queued, removing backport candidate labels.
Commit 8c4b5e4 introduced an optimization that calculates the max purgeable timestamp only when a tombstone satisfies the grace period.
Commit 'repair: Get rid of the gc_grace_seconds' removed it.
Compaction is much slower as a result when facing tombstone-heavy workloads.
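The cost of the regression can be modeled with the filter-check estimate from the fix's commit message. The numbers below are made up for illustration; only the formula comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// One max-purgeable calculation per eligible key, each probing the
// bloom filter of every sstable run not taking part in the compaction.
int64_t filter_checks(int64_t keys_with_eligible_tombstone,
                      int64_t uncompacting_runs) {
    return keys_with_eligible_tombstone * uncompacting_runs;
}
```

With hypothetical numbers — 1,000,000 keys holding any tombstone, of which only 10,000 are already GC'able, and 10 uncompacting runs — the regressed code performs 10,000,000 filter probes per compaction, while the restored optimization performs 100,000, a 100x reduction driven entirely by how many tombstones have passed the grace period.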