Compact memtables and cache during reads #2252
Comments
@denesb @slivne @eliransin @tgrabiec As a result, on a setup with a lot of RAM, users start seeing timeouts during reads from cache. We worked around this by forcing such reads to [...]. I wonder if fixing this today is a trivial thing after all the rework that has been done to the read path?
@slivne Agree with @vladzcloudius, we see it again and again.
The rework recently done to the read path has nothing to do with this. This will be a non-trivial task.
@amihayday @eliransin FYI
btw, I don't think we should compact during reads. Instead, we should compact during writes / memtable-to-cache merges. I think we should do this in the foreground. Yes, deleting a large partition can take some time if the entire partition is in memtable, but it also took a long time to insert that partition into memtable, so it's amortized.
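For illustration, here is a minimal, self-contained sketch of the compact-on-write idea described above: when a memtable partition carrying a partition tombstone is merged into the cache, shadowed cached rows are dropped in the foreground. All names and types here (`cached_partition`, `merge_from_memtable`, the single-timestamp row model) are hypothetical simplifications, not Scylla's actual data structures.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

struct row {
    int64_t write_timestamp;
};

struct cached_partition {
    int64_t partition_tombstone = INT64_MIN;   // deletes rows written at or before this timestamp
    std::map<int64_t, row> rows;               // clustering key -> row
};

struct memtable_partition {
    int64_t partition_tombstone = INT64_MIN;
    std::map<int64_t, row> rows;
};

// Foreground merge of a flushed memtable partition into its cached copy.
// The cost is proportional to the number of cached rows the tombstone
// shadows, which is what makes large deletions expensive on this path.
void merge_from_memtable(cached_partition& cached, const memtable_partition& mem) {
    cached.partition_tombstone = std::max(cached.partition_tombstone, mem.partition_tombstone);

    // Drop cached rows shadowed by the (possibly newer) tombstone.
    for (auto it = cached.rows.begin(); it != cached.rows.end();) {
        if (it->second.write_timestamp <= cached.partition_tombstone) {
            it = cached.rows.erase(it);
        } else {
            ++it;
        }
    }

    // Insert newer rows, unless they are themselves shadowed. (A real merge
    // would also reconcile per-cell timestamps; this model keeps one
    // timestamp per row.)
    for (const auto& [key, r] : mem.rows) {
        if (r.write_timestamp > cached.partition_tombstone) {
            cached.rows[key] = r;
        }
    }
}
```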
Makes sense - the cache/memtables are being polluted during writes. What we can do as a compromise (in order NOT to add to the write latency) is to mark the corresponding cache entries as "garbage" in the foreground but do the actual eviction of "garbage" cache entries in the background. If the "compactor" falls behind and we are not able to insert a new entry into the cache, we'll wait just like we do today. Or (probably better) start with the straightforward implementation like you suggest, then run some benchmarks to see how "bad" those added latencies are, and if they are measurable, consider what to do next.
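As a rough sketch of that compromise (hypothetical names, not Scylla code): the write path only marks shadowed entries as garbage, and a background step evicts them in bounded batches, so the foreground cost per shadowed entry stays O(1).

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

struct cache {
    struct entry {
        int64_t write_timestamp;
        bool garbage = false;
    };

    std::unordered_map<int64_t, entry> entries;   // key -> entry
    std::deque<int64_t> garbage_queue;            // keys waiting for eviction

    // Foreground (write path): O(1) per shadowed key, no bulk erase here.
    void mark_garbage(int64_t key) {
        auto it = entries.find(key);
        if (it != entries.end() && !it->second.garbage) {
            it->second.garbage = true;
            garbage_queue.push_back(key);
        }
    }

    // Background: evict up to `budget` marked entries per step. A real
    // implementation would yield between batches, and make writers wait
    // (as they do today) only if it falls behind and the cache is full.
    void evict_step(std::size_t budget) {
        while (budget > 0 && !garbage_queue.empty()) {
            --budget;
            int64_t key = garbage_queue.front();
            garbage_queue.pop_front();
            auto it = entries.find(key);
            if (it != entries.end() && it->second.garbage) {
                entries.erase(it);
            }
        }
    }
};
```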
Compaction on write may have to scan the whole cache in the worst case, and that can take seconds. It also doesn't solve the problem of compacting expired elements, which become compactable only some time after the write. If we compact on read, then those will go away (if they're hot) by means of compact-on-read, or by means of cache eviction if they're cold.
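A small sketch of why expiry favours compacting on read (again with hypothetical, simplified types): whether a TTL'd row is dead depends on the clock at read time, so a read can drop it in passing, whereas a compaction pass at write time could not have known the row would later expire.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <optional>

using clock_type = std::chrono::system_clock;

struct row {
    int value;
    std::optional<clock_type::time_point> expiry;   // set when the write had a TTL
};

using partition = std::map<int64_t, row>;            // clustering key -> row

// Read one row; if it has expired by now, erase it from the cached partition
// so the next read does not have to skip over it again.
std::optional<int> read_and_compact(partition& p, int64_t key, clock_type::time_point now) {
    auto it = p.find(key);
    if (it == p.end()) {
        return std::nullopt;
    }
    if (it->second.expiry && *it->second.expiry <= now) {
        p.erase(it);     // dead only because of the clock at read time
        return std::nullopt;
    }
    return it->second.value;
}
```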
I know that it's not the first time I'm bringing this up, but we may want to consider whether we really need that memtable-merge-into-cache-on-flush heuristic. If we didn't have it, that would "resolve" the compaction-on-writes issue, since the "issue in the memtable" would only be relevant until the next flush and everything in the cache would be a result of a previous read (which would be compacted). Those memtable-merge-into-cache events are big stall generators, and the efficiency of this heuristic is not very obvious (at least to me). The last time Tomek explained it to me, he mentioned that it was implemented with time series in mind; however, with time series we'd rather have all writes use 'BYPASS CACHE' (when/if implemented), because otherwise they would pollute the cache and cause the hot set to get evicted, since those writes are very similar to a 'full scan' but in the write path.
It would solve only part of the problem. Yes, if fewer rows make it into the cache, then there's less chance that we will have to compact them. But if your memtable contains a deletion of a range, we still have the problem that to delete the range we have to compact, or invalidate, a large part of the cache in the compact-on-write variant. With the compact-on-read variant, we only insert the tombstone and a later read will compact it, which may never happen before the data is evicted. We save work by deferring compaction to a later time (like with sstable compaction).
Another part of the problem which would still not be solved is the fact that data may expire after it was populated into the cache.
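To make the range-deletion trade-off concrete, here is an illustrative sketch with hypothetical names: eager application of a range tombstone pays for every cached row under the range at write time, while the compact-on-read variant records the tombstone in O(1) and lets a later read over that range erase the shadowed rows, work that is skipped entirely if those rows are evicted first.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct row { int64_t write_timestamp; };

struct range_tombstone {
    int64_t start_key, end_key;   // inclusive clustering-key range
    int64_t timestamp;            // deletes rows written at or before this
};

struct cached_partition {
    std::map<int64_t, row> rows;              // clustering key -> row
    std::vector<range_tombstone> tombstones;  // recorded but not yet applied to rows
};

// Compact-on-write: the cost is proportional to the number of cached rows
// inside the deleted range, potentially a large part of the cache.
void apply_range_delete_eagerly(cached_partition& p, const range_tombstone& rt) {
    auto it = p.rows.lower_bound(rt.start_key);
    while (it != p.rows.end() && it->first <= rt.end_key) {
        it = (it->second.write_timestamp <= rt.timestamp) ? p.rows.erase(it) : std::next(it);
    }
}

// Compact-on-read: the write just records the tombstone (O(1))...
void record_range_delete(cached_partition& p, const range_tombstone& rt) {
    p.tombstones.push_back(rt);
}

// ...and a later read over the range erases the shadowed rows it walks over.
// If the rows are evicted before any such read, the work never happens.
std::vector<int64_t> read_range_and_compact(cached_partition& p, int64_t start, int64_t end) {
    std::vector<int64_t> live_keys;
    for (auto it = p.rows.lower_bound(start); it != p.rows.end() && it->first <= end;) {
        bool dead = false;
        for (const auto& rt : p.tombstones) {
            if (it->first >= rt.start_key && it->first <= rt.end_key &&
                it->second.write_timestamp <= rt.timestamp) {
                dead = true;
                break;
            }
        }
        if (dead) {
            it = p.rows.erase(it);   // compacted as a side effect of the read
        } else {
            live_keys.push_back(it->first);
            ++it;
        }
    }
    return live_keys;
}
```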
I'm all for the compact-on-read approach. You are right that this would have to replace the memtable-merge-on-flush with invalidate-dirty-rows-on-memtable-flush, yes, but the actual cache eviction can then take place either in the background and/or during the next read event in the context of compact-on-read.
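A minimal sketch of that alternative (hypothetical names, not Scylla internals): on memtable flush, the cache entries touched by the memtable are simply invalidated, so the next read misses, goes to sstables, and repopulates the cache with data already compacted by the read path.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>

struct cached_row { int value; };

struct row_cache {
    std::unordered_map<int64_t, cached_row> rows;   // clustering key -> cached row

    // Called on memtable flush with the keys that memtable wrote. Cheap per
    // key: no merge into the cache and no foreground compaction; the next
    // read of an invalidated key misses and repopulates from sstables.
    void invalidate_dirty(const std::set<int64_t>& dirty_keys) {
        for (int64_t key : dirty_keys) {
            rows.erase(key);
        }
    }
};
```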
when read from cache compact and expire rows remove expired empty rows from cache Refs scylladb#2252 Fixes scylladb#6033
when read from cache compact and expire row tombstones remove expired empty rows from cache do not expire range tombstones in this patch Refs scylladb#2252, scylladb#6033
during read from cache compact and expire range tombstones remove expired empty rows from cache Refs scylladb#2252 Fixes scylladb#6033
I'm not sure why this issue is still open.
We still have certain aspects of this unaddressed. Range tombstones are still not compacted/dropped during reads. We only compact/drop dead rows.
They should be. #14463
Right, so what is missing then?
Maybe we should open a new specific issue for the things left. It is impossible to follow an issue with 60+ comments, especially if the discussion happened months (or years) before.
#16093, I guess
Currently, compaction is done on the fragment stream; memtables and cache are not modified by reads. This can pose a performance problem, because the same data may be compacted over and over again.
Consider a partition with a tombstone covering many rows. Readers will still scan through all those rows, every time. We could teach readers how to compact, and let them erase such rows so that subsequent reads don't need to walk over them.
Refs #652.
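As a self-contained illustration of the problem statement (hypothetical names, not Scylla internals): `read_stream_compacted` models today's behaviour, where the tombstone is applied only to the output stream and every read walks over the same dead rows again; `read_and_erase` models the proposed behaviour, where the reader erases shadowed rows so the skipping cost is paid once.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct row { int64_t write_timestamp; };

struct cached_partition {
    int64_t partition_tombstone = INT64_MIN;   // deletes rows written at or before this timestamp
    std::map<int64_t, row> rows;               // clustering key -> row
};

// Today's behaviour (modelled): compaction happens on the fragment stream
// only, the cached partition is untouched, so every read walks the dead rows.
std::vector<int64_t> read_stream_compacted(const cached_partition& p) {
    std::vector<int64_t> live;
    for (const auto& [key, r] : p.rows) {
        if (r.write_timestamp > p.partition_tombstone) {
            live.push_back(key);
        }
    }
    return live;
}

// Proposed behaviour (modelled): the reader erases shadowed rows as it goes,
// so the cost of skipping them is paid once instead of on every later read.
std::vector<int64_t> read_and_erase(cached_partition& p) {
    std::vector<int64_t> live;
    for (auto it = p.rows.begin(); it != p.rows.end();) {
        if (it->second.write_timestamp <= p.partition_tombstone) {
            it = p.rows.erase(it);
        } else {
            live.push_back(it->first);
            ++it;
        }
    }
    return live;
}
```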