Compact memtables and cache during reads #2252
Comments
@denesb @slivne @eliransin @tgrabiec As a result, on a setup with a lot of RAM, users start seeing timeouts during reads from cache. We worked around this by forcing such reads to [...]. I wonder if fixing this today is a trivial thing after all the rework that has been done to the read path?
@slivne Agree with @vladzcloudius, we see it again and again.
The rework recently done to the read path has nothing to do with this. This will be a non-trivial task.
@amihayday @eliransin FYI
btw, I don't think we should compact during reads. Instead, we should compact during writes / memtable-to-cache merges. I think we should do this in the foreground. Yes, deleting a large partition can take some time if the entire partition is in memtable, but it also took a long time to insert that partition into memtable, so it's amortized.
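For illustration, here is a minimal, self-contained sketch of the compact-on-write idea described above: when a memtable partition carrying a partition tombstone is merged into the cache, shadowed cached rows are dropped in the foreground. All names and types here (`cached_partition`, `merge_from_memtable`, the single-timestamp row model) are hypothetical simplifications, not Scylla's actual data structures.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

struct row {
    int64_t write_timestamp;
};

struct cached_partition {
    int64_t partition_tombstone = INT64_MIN;   // deletes rows written at or before this timestamp
    std::map<int64_t, row> rows;               // clustering key -> row
};

struct memtable_partition {
    int64_t partition_tombstone = INT64_MIN;
    std::map<int64_t, row> rows;
};

// Foreground merge of a flushed memtable partition into its cached copy.
// The cost is proportional to the number of cached rows the tombstone
// shadows, which is what makes large deletions expensive on this path.
void merge_from_memtable(cached_partition& cached, const memtable_partition& mem) {
    cached.partition_tombstone = std::max(cached.partition_tombstone, mem.partition_tombstone);

    // Drop cached rows shadowed by the (possibly newer) tombstone.
    for (auto it = cached.rows.begin(); it != cached.rows.end();) {
        if (it->second.write_timestamp <= cached.partition_tombstone) {
            it = cached.rows.erase(it);
        } else {
            ++it;
        }
    }

    // Insert newer rows, unless they are themselves shadowed. (A real merge
    // would also reconcile per-cell timestamps; this model keeps one
    // timestamp per row.)
    for (const auto& [key, r] : mem.rows) {
        if (r.write_timestamp > cached.partition_tombstone) {
            cached.rows[key] = r;
        }
    }
}
```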
Makes sense - the cache/memtables are being polluted during writes. What we can do as a compromise (in order NOT to add to the write latency) is to mark the corresponding cache entries as "garbage" in the foreground but do the actual eviction of "garbage" cache entries in the background. If the "compactor" falls behind and we are not able to insert a new entry into the cache, we'll wait just like we do today. Or (probably better) start with the straightforward implementation like you suggest, then run some benchmarks to see how "bad" those added latencies are, and if they are measurable, consider what to do next.
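As a rough sketch of that compromise (hypothetical names, not Scylla code): the write path only marks shadowed entries as garbage, and a background step evicts them in bounded batches, so the foreground cost per shadowed entry stays O(1).

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

struct cache {
    struct entry {
        int64_t write_timestamp;
        bool garbage = false;
    };

    std::unordered_map<int64_t, entry> entries;   // key -> entry
    std::deque<int64_t> garbage_queue;            // keys waiting for eviction

    // Foreground (write path): O(1) per shadowed key, no bulk erase here.
    void mark_garbage(int64_t key) {
        auto it = entries.find(key);
        if (it != entries.end() && !it->second.garbage) {
            it->second.garbage = true;
            garbage_queue.push_back(key);
        }
    }

    // Background: evict up to `budget` marked entries per step. A real
    // implementation would yield between batches, and make writers wait
    // (as they do today) only if it falls behind and the cache is full.
    void evict_step(std::size_t budget) {
        while (budget > 0 && !garbage_queue.empty()) {
            --budget;
            int64_t key = garbage_queue.front();
            garbage_queue.pop_front();
            auto it = entries.find(key);
            if (it != entries.end() && it->second.garbage) {
                entries.erase(it);
            }
        }
    }
};
```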
Compaction on write may have to scan the whole cache in the worst case, and that can take seconds. It also doesn't solve the problem of compacting expired elements, which become compactable only some time after the write. If we compact on read, then those will go away (if they're hot) by means of compact-on-read, or by means of cache eviction if they're cold.
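A small sketch of why expiry favours compacting on read (again with hypothetical, simplified types): whether a TTL'd row is dead depends on the clock at read time, so a read can drop it in passing, whereas a compaction pass at write time could not have known the row would later expire.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <optional>

using clock_type = std::chrono::system_clock;

struct row {
    int value;
    std::optional<clock_type::time_point> expiry;   // set when the write had a TTL
};

using partition = std::map<int64_t, row>;            // clustering key -> row

// Read one row; if it has expired by now, erase it from the cached partition
// so the next read does not have to skip over it again.
std::optional<int> read_and_compact(partition& p, int64_t key, clock_type::time_point now) {
    auto it = p.find(key);
    if (it == p.end()) {
        return std::nullopt;
    }
    if (it->second.expiry && *it->second.expiry <= now) {
        p.erase(it);     // dead only because of the clock at read time
        return std::nullopt;
    }
    return it->second.value;
}
```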
I know that it's not the first time I'm bringing this up, but we may want to consider whether we really need that memtable-merge-into-cache-on-flush heuristic. If we didn't have it, that would "resolve" the compaction-on-writes issue, since the "issue in the memtable" would only be relevant until the next flush and everything in the cache would be a result of a previous read (which would be compacted). Those memtable-merge-into-cache events are big stall generators, and the efficiency of this heuristic is not very obvious (at least to me). The last time Tomek explained it to me, he mentioned that it was implemented with time series in mind; however, with time series we'd rather have all writes use 'BYPASS CACHE' (when/if implemented), because otherwise they would pollute the cache and cause the hot set to get evicted, since those writes are very similar to a 'full scan' but in the write path.
It would solve only part of the problem. Yes, if fewer rows make it into the cache, then there's less chance that we will have to compact them. But if your memtable contains a deletion of a range, we still have the problem that to delete the range we have to compact, or invalidate, a large part of the cache in the compact-on-write variant. With the compact-on-read variant, we only insert the tombstone and a later read will compact it, which may never happen before the data is evicted. We save work by deferring compaction to a later time (like with sstable compaction).
Another part of the problem which would still not be solved is the fact that data may expire after it was populated into the cache.
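To make the range-deletion trade-off concrete, here is an illustrative sketch with hypothetical names: eager application of a range tombstone pays for every cached row under the range at write time, while the compact-on-read variant records the tombstone in O(1) and lets a later read over that range erase the shadowed rows, work that is skipped entirely if those rows are evicted first.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct row { int64_t write_timestamp; };

struct range_tombstone {
    int64_t start_key, end_key;   // inclusive clustering-key range
    int64_t timestamp;            // deletes rows written at or before this
};

struct cached_partition {
    std::map<int64_t, row> rows;              // clustering key -> row
    std::vector<range_tombstone> tombstones;  // recorded but not yet applied to rows
};

// Compact-on-write: the cost is proportional to the number of cached rows
// inside the deleted range, potentially a large part of the cache.
void apply_range_delete_eagerly(cached_partition& p, const range_tombstone& rt) {
    auto it = p.rows.lower_bound(rt.start_key);
    while (it != p.rows.end() && it->first <= rt.end_key) {
        it = (it->second.write_timestamp <= rt.timestamp) ? p.rows.erase(it) : std::next(it);
    }
}

// Compact-on-read: the write just records the tombstone (O(1))...
void record_range_delete(cached_partition& p, const range_tombstone& rt) {
    p.tombstones.push_back(rt);
}

// ...and a later read over the range erases the shadowed rows it walks over.
// If the rows are evicted before any such read, the work never happens.
std::vector<int64_t> read_range_and_compact(cached_partition& p, int64_t start, int64_t end) {
    std::vector<int64_t> live_keys;
    for (auto it = p.rows.lower_bound(start); it != p.rows.end() && it->first <= end;) {
        bool dead = false;
        for (const auto& rt : p.tombstones) {
            if (it->first >= rt.start_key && it->first <= rt.end_key &&
                it->second.write_timestamp <= rt.timestamp) {
                dead = true;
                break;
            }
        }
        if (dead) {
            it = p.rows.erase(it);   // compacted as a side effect of the read
        } else {
            live_keys.push_back(it->first);
            ++it;
        }
    }
    return live_keys;
}
```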
I'm all for the compact-on-read approach. You are right that this would have to replace the memtable-merge-on-flush with invalidate-dirty-rows-on-memtable-flush, yes, but the actual cache eviction can then take place either in the background and/or during the next read event in the context of compact-on-read.
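A minimal sketch of that alternative (hypothetical names, not Scylla internals): on memtable flush, the cache entries touched by the memtable are simply invalidated, so the next read misses, goes to sstables, and repopulates the cache with data already compacted by the read path.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>

struct cached_row { int value; };

struct row_cache {
    std::unordered_map<int64_t, cached_row> rows;   // clustering key -> cached row

    // Called on memtable flush with the keys that memtable wrote. Cheap per
    // key: no merge into the cache and no foreground compaction; the next
    // read of an invalidated key misses and repopulates from sstables.
    void invalidate_dirty(const std::set<int64_t>& dirty_keys) {
        for (int64_t key : dirty_keys) {
            rows.erase(key);
        }
    }
};
```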
when read from cache compact and expire rows remove expired empty rows from cache Refs scylladb#2252 Fixes scylladb#6033
when read from cache compact and expire row tombstones remove expired empty rows from cache do not expire range tombstones in this patch Refs scylladb#2252, scylladb#6033
during read from cache compact and expire range tombstones remove expired empty rows from cache Refs scylladb#2252 Fixes scylladb#6033
I'm not sure why this issue is still open.
We still have certain aspects of this unaddressed. Range tombstones are still not compacted/dropped during reads. We only compact/drop dead rows.
They should be. #14463
Right, so what is missing then?
Maybe we should open a new specific issue for the things left. It is impossible to follow an issue with 60+ comments, especially if the discussion happened months (or years) before.
#16093, I guess
Currently, compaction is done on the fragment stream; memtables and cache are not modified by reads. This can pose a performance problem, because the same data may be compacted over and over again.
Consider a partition with a tombstone covering many rows. Readers will still scan through all those rows, every time. We could teach readers how to compact, and let them erase such rows so that subsequent reads don't need to walk over them.
Refs #652.
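As a self-contained illustration of the problem statement (hypothetical names, not Scylla internals): `read_stream_compacted` models today's behaviour, where the tombstone is applied only to the output stream and every read walks over the same dead rows again; `read_and_erase` models the proposed behaviour, where the reader erases shadowed rows so the skipping cost is paid once.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct row { int64_t write_timestamp; };

struct cached_partition {
    int64_t partition_tombstone = INT64_MIN;   // deletes rows written at or before this timestamp
    std::map<int64_t, row> rows;               // clustering key -> row
};

// Today's behaviour (modelled): compaction happens on the fragment stream
// only, the cached partition is untouched, so every read walks the dead rows.
std::vector<int64_t> read_stream_compacted(const cached_partition& p) {
    std::vector<int64_t> live;
    for (const auto& [key, r] : p.rows) {
        if (r.write_timestamp > p.partition_tombstone) {
            live.push_back(key);
        }
    }
    return live;
}

// Proposed behaviour (modelled): the reader erases shadowed rows as it goes,
// so the cost of skipping them is paid once instead of on every later read.
std::vector<int64_t> read_and_erase(cached_partition& p) {
    std::vector<int64_t> live;
    for (auto it = p.rows.begin(); it != p.rows.end();) {
        if (it->second.write_timestamp <= p.partition_tombstone) {
            it = p.rows.erase(it);
        } else {
            live.push_back(it->first);
            ++it;
        }
    }
    return live;
}
```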