loading_cache: improve pollution resilience #8674

vladzcloudius · 2021-05-19T13:12:46Z

Installation details
HEAD: 8480839

Description
If the cache is polluted by used-once entries, it may evict entries that are meant to stay cached, because of the cache size restriction.

This is because the LRU algorithm we use to decide which items to evict is based only on the creation and last usage timestamps.

For instance, this happens when the used-once entries are pushed in a burst.

Proposal
Have an additional LRU list (ordered by creation time) that holds used-once entries.
When we need to evict by size, evict items from that new list first, and fall back to the current algorithm only when it is empty.
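
A minimal sketch of the proposed eviction order, assuming two plain `std::list` containers of string keys. The names `two_list_lru`, `used_once`, `reused`, `evict_one` and `demo_eviction_order` are hypothetical and are not taken from `loading_cache`; expiration and reload are left out.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Hypothetical illustration only; not the loading_cache implementation.
struct two_list_lru {
    std::list<std::string> used_once; // new list: entries not read since creation, oldest first
    std::list<std::string> reused;    // current list: entries read again, least recently used first

    // Pick a victim when the cache is over its size limit.
    // Used-once entries go first; the current LRU list is used only when
    // the used-once list is empty. Returns false if there is nothing to evict.
    bool evict_one(std::string& victim) {
        std::list<std::string>& victims = used_once.empty() ? reused : used_once;
        if (victims.empty()) {
            return false;
        }
        victim = std::move(victims.front());
        victims.pop_front();
        return true;
    }
};

// Usage: a burst of used-once keys is evicted before the reused key.
inline void demo_eviction_order() {
    two_list_lru cache;
    cache.reused.push_back("hot");
    cache.used_once.push_back("burst-1");
    cache.used_once.push_back("burst-2");

    std::string victim;
    assert(cache.evict_one(victim) && victim == "burst-1");
    assert(cache.evict_one(victim) && victim == "burst-2");
    assert(cache.evict_one(victim) && victim == "hot"); // fallback to the current list
    assert(!cache.evict_one(victim));
}
```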


vladzcloudius · 2021-05-19T13:13:16Z

@avikivity Please, comment.

avikivity · 2021-05-19T15:05:37Z

Looks good. Note that you can use a single set of links for both lists, since an item can be on only one list at a time.

sitano · 2021-05-21T15:31:46Z

What about distinguishing which generation an item belongs to?
@vladzcloudius @avikivity would you prefer to count the number of touches with a size_t, or just introduce a bool young that is true until the first touch?

vladzcloudius · 2021-05-21T16:02:07Z

> What about distinguishing which generation an item belongs to?
> @vladzcloudius @avikivity would you prefer to count the number of touches with a size_t, or just introduce a bool young that is true until the first touch?

@sitano
You don't need a count, since the semantics are boolean.

The heuristic is that a statement that has been used more than once is not "pollution".

The implementation is as follows:

On creation, the item goes into the new LRU list you need to introduce (and not into the current LRU list).
The next time it's "touched", it's removed from the new list and appended to the current LRU list.

The rest is as I've described before.
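
A minimal sketch of the two steps above, assuming `std::list` nodes and a hash index. The class `two_list_cache` and all of its members are hypothetical names used for illustration, not the `loading_cache` code. Because `std::list::splice` relinks the existing node instead of copying it, each entry carries one set of links and is on exactly one list at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical illustration only; not the loading_cache implementation.
class two_list_cache {
    struct entry {
        std::string key;
        bool used_once; // true until the first touch after creation
    };
    using list_type = std::list<entry>;

    list_type _new_lru;  // the new list: entries created but not touched since
    list_type _main_lru; // the current list: entries touched at least once
    std::unordered_map<std::string, list_type::iterator> _index;

public:
    // On creation the entry goes into the new list only.
    void insert(const std::string& key) {
        assert(_index.count(key) == 0);
        _new_lru.push_back(entry{key, true});
        _index[key] = std::prev(_new_lru.end());
    }

    // On touch the entry moves to the most recently used end of the current list.
    // splice() keeps the iterator stored in _index valid.
    bool touch(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return false;
        }
        list_type& from = it->second->used_once ? _new_lru : _main_lru;
        _main_lru.splice(_main_lru.end(), from, it->second);
        it->second->used_once = false;
        return true;
    }

    std::size_t new_size() const { return _new_lru.size(); }
    std::size_t main_size() const { return _main_lru.size(); }
};
```

With this layout a bool (or simply list membership) is enough to tell the two generations apart; no touch counter is stored.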

vladzcloudius · 2021-05-21T16:03:30Z

There are more complex cache eviction schemes, like the one used by the ARC cache.
However, we want to keep things simple for now.

…(LFRU) eviction policy

This patch implements a simple variation of LFRU eviction policy:
* We define 2 dynamic cache partitions whose total size should not exceed the maximum cache size.
* New cache entry is always added to the "new generation" partition.
* After a cache entry is read more than PartitionHitThreshold times it moves to the second cache partition.
* Both partitions' entries obey expiration and reload rules as before this patch.
* When cache entries need to be evicted due to a size restriction, the "new generation" partition's least recently used entries are evicted first.

Fixes scylladb#8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>

…(LFRU) eviction policy

This patch implements a simple variation of LFRU eviction policy:
* We define 2 dynamic cache sections whose total size should not exceed the maximum cache size.
* New cache entry is always added to the "unprivileged" section.
* After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
* Both sections' entries obey expiration and reload rules in the same way as before this patch.
* When cache entries need to be evicted due to a size restriction, the "unprivileged" section's least recently used entries are evicted first.

Note:
With a 2 sections cache it's not enough for a new entry to have the latest timestamp in order not to be evicted right after insertion: e.g. if all other entries are from the privileged section.

And obviously we want to allow new cache entries to be added to a cache.

Therefore we can no longer first add a new entry and then shrink the cache. Switching the order of these two operations resolves the issue.

Fixes scylladb#8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
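
The note in the commit message above about the order of adding and shrinking can be illustrated with a minimal sketch. It assumes two `std::deque` sections of string keys; `sectioned_cache`, `shrink_to`, `add_then_shrink` and `shrink_then_add` are hypothetical names and do not come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical illustration only; not the loading_cache implementation.
struct sectioned_cache {
    std::size_t max_size;
    std::deque<std::string> unprivileged; // least recently used first
    std::deque<std::string> privileged;   // least recently used first

    std::size_t size() const { return unprivileged.size() + privileged.size(); }

    // Evict until at most `target` entries remain, unprivileged LRU entries first.
    void shrink_to(std::size_t target) {
        while (size() > target) {
            std::deque<std::string>& victims = unprivileged.empty() ? privileged : unprivileged;
            victims.pop_front();
        }
    }

    // Old order: add, then shrink. If every other entry is privileged, the
    // only unprivileged candidate is the entry that was just added.
    void add_then_shrink(const std::string& key) {
        unprivileged.push_back(key);
        shrink_to(max_size);
    }

    // New order: make room for one entry first, then add it.
    void shrink_then_add(const std::string& key) {
        assert(max_size > 0);
        shrink_to(max_size - 1);
        unprivileged.push_back(key);
    }
};
```

In a full cache that holds only privileged entries, `add_then_shrink` drops the new entry immediately, while `shrink_then_add` evicts the least recently used privileged entry and keeps the new one.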

…tarov

This series introduces a new version of a loading_cache class.
The old implementation was susceptible to a "pollution" phenomenon, where a frequently used entry can get evicted by an intensive burst of "used once" entries pushed into the cache.

The new version is going to have privileged and unprivileged cache sections, and there's a new loading_cache template parameter - SectionHitThreshold. The new cache algorithm goes as follows:
* We define 2 dynamic cache sections whose total size should not exceed the maximum cache size.
* New cache entry is always added to the "unprivileged" section.
* After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
* Both sections' entries obey expiration and reload rules in the same way as before this patch.
* When cache entries need to be evicted due to a size restriction, the "unprivileged" section's least recently used entries are evicted first.

More details may be found in #8674.

In addition, during testing another issue was found in the authorized_prepared_statements_cache: #9590. There is a patch that fixes it as well.

Closes #9708

* github.com:scylladb/scylla:
  loading_cache: account unprivileged section evictions
  loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
  authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
  loading_cache::timestamped::lru_entry: refactoring
  loading_cache.hh: rearrange the code (no functional change)
  loading_cache: use std::pmr::polymorphic_allocator

vladzcloudius · 2022-05-26T20:58:00Z

The feature is broken by 3f2224a.

As a result of the regression, if the number of "old" queries is greater than half of the cache, a "pollution" workload evicts "old" entries instead of entries in the unprivileged section of the cache.

avikivity · 2022-05-29T14:41:22Z

It's not broken. It changes the size of the reservation of the new part from 1 item to half the capacity. As a result the old section drops from n-1 to half. It behaves the same way, with smaller reservation - it still resists pollution, just in a different way.

The previous reservation size of 1 item for the new part wasn't enough to execute a batch statement, since the members of the batch would compete with each other.

vladzcloudius · 2022-05-31T13:53:01Z

> It's not broken. It changes the size of the reservation of the new part from 1 item to half the capacity. As a result the old section drops from n-1 to half. It behaves the same way, with smaller reservation - it still resists pollution, just in a different way.
>
> The previous reservation size of 1 item for the new part wasn't enough to execute a batch statement, since the members of the batch would compete with each other.

@avikivity You are confused.
What you are describing is what the fix above SHOULD have done, but it's not what it actually did, hence the reopening.

3f2224a doesn't specify the sizes of the cache sections. It specifies a new eviction condition, and that is not the same thing.

As a result, when the number of "old" (good) prepared statements is greater than half of the cache, the polluting workload causes constant eviction from the privileged section (as I described here #8674 (comment)). More than that, this happens before possibly evictable entries are evicted from the non-privileged section.

A proposal for a solution that ACTUALLY does what you described above (and I described last week) can be found here: #10674 (comment)

avikivity · 2022-05-31T16:25:00Z

@avelanarius please check

vladzcloudius added the enhancement label May 19, 2021

vladzcloudius assigned sitano May 19, 2021

vladzcloudius mentioned this issue Jun 23, 2021

loading_cache.hh: items last read time is not updated in sync calls like loading_cache::find(...) #8920

Closed

vladzcloudius assigned vladzcloudius and unassigned sitano Jun 23, 2021

nyh mentioned this issue Sep 29, 2021

Loading cache improve eviction use policy #9398

Closed

vladzcloudius mentioned this issue Nov 24, 2021

Loading cache improve eviction use policy v2 #9591

Closed

vladzcloudius mentioned this issue Nov 29, 2021

Loading cache improve eviction use policy #9708

Merged

scylladb-promoter closed this as completed in 1a9c6d9 Dec 1, 2021

vladzcloudius reopened this May 26, 2022

avikivity closed this as completed May 29, 2022

vladzcloudius reopened this May 31, 2022

DoronArazii added this to the 5.0 milestone Jul 7, 2022

DoronArazii modified the milestones: 5.0, 5.x Oct 12, 2022
