
Compact memtables and cache during reads #2252

Open
tgrabiec opened this issue Apr 4, 2017 · 20 comments
Labels: area/range deletes · Field-Tier1 (urgent issues per Field Engineering request) · P2 High Priority · symptom/performance (issues causing performance problems)

Comments

tgrabiec commented Apr 4, 2017

Currently, compaction is done on the fragment stream; memtables and cache are not modified by reads. This can pose a performance problem, because the same data may be compacted over and over again.

Consider a partition with a tombstone covering many rows. Readers will still scan through all those rows, every time. We could teach readers how to compact, and let them erase such rows so that subsequent reads don't need to walk over them.
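To make the idea concrete, here is a hypothetical, heavily simplified model of compact-on-read (these are not Scylla's actual data structures; `CachedPartition` and its methods are illustrative only). A cached partition holds rows keyed by clustering key plus a partition-tombstone timestamp; a read drops rows shadowed by the tombstone and erases them from the cache, so the next read skips them:

```python
# Hypothetical sketch, not Scylla code: a cached partition with a
# partition-level tombstone. Reads compact away shadowed rows as a
# side effect, so later reads don't walk over them again.

class CachedPartition:
    def __init__(self, tombstone_ts=None):
        self.tombstone_ts = tombstone_ts  # partition-level delete, or None
        self.rows = {}                    # clustering key -> (write_ts, value)

    def write(self, ckey, ts, value):
        self.rows[ckey] = (ts, value)

    def delete_partition(self, ts):
        # Only record the tombstone; shadowed rows stay in place.
        # Without compact-on-read, every reader re-filters them.
        self.tombstone_ts = max(ts, self.tombstone_ts or ts)

    def read(self):
        """Yield live rows, erasing shadowed ones as a side effect."""
        live = []
        for ckey in list(self.rows):
            write_ts, value = self.rows[ckey]
            if self.tombstone_ts is not None and write_ts <= self.tombstone_ts:
                del self.rows[ckey]       # compact-on-read: erase shadowed row
            else:
                live.append((ckey, value))
        return live

p = CachedPartition()
for i in range(5):
    p.write(ckey=i, ts=10, value=f"v{i}")
p.delete_partition(ts=20)        # tombstone shadows all 5 rows
p.write(ckey=99, ts=30, value="new")

assert p.read() == [(99, "new")]  # first read still walks all 6 entries...
assert len(p.rows) == 1           # ...but erased the 5 shadowed ones
```

The first read after the delete pays the full scan cost, but it is paid once; subsequent reads see only the surviving row.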

Refs #652.


vladzcloudius commented Dec 4, 2020

@denesb @slivne @eliransin @tgrabiec
Please consider re-prioritizing this.
We see this hurting us in cases where there is a hot data subset that gets deletes and inserts and sits constantly in the cache.

As a result, on setups with a lot of RAM, users start seeing timeouts during reads from the cache.

We worked around this by forcing such reads to BYPASS CACHE, because nodetool flush wasn't helping due to #6033.

I wonder if fixing this today is a trivial thing after all the rework that has been done to the read path?
@denesb ?

avikivity (Member) commented

@slivne agree with @vladzcloudius, we see it again and again.


denesb commented Dec 14, 2020

> I wonder if fixing this today is a trivial thing after all the rework that has been done to the read path?
> @denesb ?

The rework recently done to the read path has nothing to do with this. This will be a non-trivial task.

vladzcloudius commented

@amihayday @eliransin FYI

avikivity (Member) commented

btw, I don't think we should compact during reads. Instead, we should compact during writes / memtable-to-cache merges. I think we should do this in the foreground. Yes, deleting a large partition can take some time if the entire partition is in memtable, but it also took a long time to insert that partition into memtable, so it's amortized.


vladzcloudius commented Dec 29, 2020

> btw, I don't think we should compact during reads. Instead, we should compact during writes / memtable-to-cache merges. I think we should do this in the foreground. Yes, deleting a large partition can take some time if the entire partition is in memtable, but it also took a long time to insert that partition into memtable, so it's amortized.

Makes sense: the cache and memtables are being polluted during writes.
However, what do you mean by "in the foreground"? Does it mean adding the latency of this operation to the latency of the corresponding CQL write? That's definitely going to be much easier to implement, yes, but we would increase write latencies for certain workloads.

What we can do as a compromise (in order NOT to add to the write latency) is to mark the corresponding cache entries as "garbage" in the foreground, but do the actual eviction of "garbage" cache entries in the background.

If the "compactor" falls behind and we are not able to insert a new entry into the cache, we'll wait, just like we do today.

Or (probably better) start with the straightforward implementation like you suggest, then run some benchmarks to see how "bad" the added latencies are. If the effect is measurable, consider what to do next.
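The proposed compromise can be sketched as follows. This is a hypothetical toy model, not Scylla code: the write path only marks cache entries shadowed by a new delete (cheap, bounded work), and a background "compactor" evicts marked entries under a per-invocation budget, so eviction cost never lands on a single write's latency. All names (`Cache`, `apply_delete`, `background_sweep`) are illustrative.

```python
# Hypothetical sketch: mark garbage in the foreground, evict in the
# background with a bounded budget per sweep.
from collections import deque

class Cache:
    def __init__(self):
        self.entries = {}          # key -> value
        self.garbage = deque()     # keys marked for eviction

    def apply_delete(self, keys):
        # Foreground: O(len(keys)) marking only; no cache scan or eviction
        # on the write path.
        for k in keys:
            if k in self.entries:
                self.garbage.append(k)

    def background_sweep(self, budget):
        # Background: evict up to `budget` marked entries per invocation.
        evicted = 0
        while self.garbage and evicted < budget:
            k = self.garbage.popleft()
            self.entries.pop(k, None)
            evicted += 1
        return evicted

c = Cache()
c.entries = {k: f"v{k}" for k in range(10)}
c.apply_delete(range(6))          # foreground: marks 6 entries
assert len(c.entries) == 10       # nothing evicted yet
assert c.background_sweep(budget=4) == 4
assert c.background_sweep(budget=4) == 2
assert len(c.entries) == 4
```

The "compactor falls behind" case mentioned above would correspond to the garbage queue growing faster than the sweep budget drains it, at which point inserts would have to wait.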

tgrabiec (Contributor, Author) commented

Compaction on write may have to scan the whole cache in the worst case, and that can take seconds.

It also doesn't solve the problem of compacting expired elements, which become compactable only some time after the write. If we compact on read, those will go away either by means of compact-on-read (if they're hot) or by means of cache eviction (if they're cold).
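The point about expired elements can be illustrated with a hypothetical toy TTL cache (again, not Scylla's actual structures; `TTLCache` and its methods are made up for illustration): at write time nothing is droppable yet, so compact-on-write cannot help; only a later read (or cold eviction) can reclaim the entry.

```python
# Hypothetical sketch: TTL'd entries become compactable only after the
# write, so a read-time pass is what reclaims them while they're hot.

class TTLCache:
    def __init__(self):
        self.rows = {}   # key -> (value, expiry_time)

    def write(self, key, value, now, ttl):
        # At write time the entry is live; there is nothing to compact yet.
        self.rows[key] = (value, now + ttl)

    def read(self, now):
        """Return live rows; drop expired ones as a side effect."""
        live = {}
        for key in list(self.rows):
            value, expiry = self.rows[key]
            if expiry <= now:
                del self.rows[key]   # compact-on-read reclaims expired entry
            else:
                live[key] = value
        return live

c = TTLCache()
c.write("a", 1, now=0, ttl=10)
c.write("b", 2, now=0, ttl=100)
assert c.read(now=5) == {"a": 1, "b": 2}   # nothing expired yet
assert c.read(now=50) == {"b": 2}          # "a" expired; dropped on read
assert "a" not in c.rows
```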


vladzcloudius commented Jan 25, 2021

> Compaction on write may have to scan the whole cache in the worst case and that can take seconds.
>
> It also doesn't solve the problem of compacting expired elements, which are compactable only some time after the write. If we compact on read, then those will go away (if they're hot) by means of compact-on-read, or by means of cache eviction if they're cold.

I know it's not the first time I'm bringing this up, but we may want to consider whether we really need the memtable-merge-into-cache-on-flush heuristic.

If we didn't have it, that would "resolve" the compaction-on-writes issue, since the "issue in the memtable" would only be relevant until the next flush, and everything in the cache would be the result of a previous read (which would be compacted).

Those memtable-merge-into-cache events are big stall generators, and the efficiency of this heuristic is not very obvious (at least to me). The last time Tomek explained it to me, he mentioned it was implemented with time series in mind. However, with time series we'd rather have all writes use BYPASS CACHE (when/if implemented), because otherwise they would pollute the cache and cause the hot set to get evicted, since such writes are very similar to a full scan, but on the write path.


tgrabiec commented Jan 25, 2021 via email

vladzcloudius commented

> It would solve only part of the problem. Yes, if fewer rows make it into the cache, there's less chance that we will have to compact them. But if your memtable contains a deletion of a range, we still have the problem that to delete the range we have to compact, or invalidate, a large part of the cache in the compact-on-write variant. With the compact-on-read variant, we only insert the tombstone and a later read will compact it.

I'm all for the compact-on-read approach.
My point was that if we get rid of memtable-merge-on-flush we could make some of these challenges go away.

You are right that this would have to replace memtable-merge-on-flush with invalidate-dirty-rows-on-memtable-flush, yes, but the actual cache eviction could then take place either in the background and/or during the next read, in the context of compact-on-read.

Which may never happen before the data is evicted. We save work by deferring compaction to a later time (as with sstable compaction). Another part of the problem that would still not be solved is that data may expire after it was populated into the cache.

alezzqz added a commit to soft-stech/scylladb that referenced this issue Mar 21, 2023
when read from cache compact and expire rows
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Apr 4, 2023
when read from cache compact and expire rows
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Apr 5, 2023
when read from cache compact and expire rows
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Apr 19, 2023
when read from cache compact and expire rows
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue May 22, 2023
when read from cache compact and expire rows
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
@mykaul mykaul modified the milestones: 5.3, 5.4 Jun 1, 2023
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jun 15, 2023
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch

Refs scylladb#2252, scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jun 20, 2023
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch

Refs scylladb#2252, scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jun 21, 2023
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch

Refs scylladb#2252, scylladb#6033
tgrabiec pushed a commit that referenced this issue Jun 26, 2023
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch

Refs #2252, #6033

Closes #12917
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jun 28, 2023
when read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jun 30, 2023
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jul 20, 2023
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
alezzqz added a commit to soft-stech/scylladb that referenced this issue Jul 31, 2023
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
@DoronArazii DoronArazii modified the milestones: 5.4, Backlog Aug 6, 2023
alezzqz added a commit to soft-stech/scylladb that referenced this issue Aug 28, 2023
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs scylladb#2252
Fixes scylladb#6033
tgrabiec pushed a commit that referenced this issue Aug 31, 2023
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs #2252
Fixes #6033

Closes #14463
denesb pushed a commit that referenced this issue Sep 1, 2023
during read from cache compact and expire range tombstones
remove expired empty rows from cache

Refs #2252
Fixes #6033

Closes #14463
@dani-tweig dani-tweig added P2 High Priority and removed Eng-2 labels Feb 4, 2024

mykaul commented Feb 11, 2024

I'm not sure why this issue is still open.


denesb commented Feb 12, 2024

We still have certain aspects of this unaddressed. Range-tombstones are still not compacted/dropped during reads. We only compact/drop dead rows.

michoecho commented

> Range-tombstones are still not compacted/dropped during reads.

They should be. #14463


denesb commented Feb 12, 2024

> > Range-tombstones are still not compacted/dropped during reads.
>
> They should be. #14463

Right, so what is missing then?


denesb commented Feb 12, 2024

Maybe we should open a new specific issue for the things left. It is impossible to follow an issue with 60+ comments, especially if the discussion happened months (or years) before.

michoecho commented

> Right, so what is missing then?

#16093, I guess
