Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loading_cache: improve pollution resilience #8674

Open
vladzcloudius opened this issue May 19, 2021 · 9 comments
Open

loading_cache: improve pollution resilience #8674

vladzcloudius opened this issue May 19, 2021 · 9 comments
Assignees
Milestone

Comments

@vladzcloudius
Copy link
Contributor

Installation details
HEAD: 8480839

Description
If cache is being polluted by used-once entries it may possibly evict entries that are meant to be cached due to cache size restriction.

This is because we use LRU algorithm based on the creation and last usage timestamp when we decide which items to evict.

For instance, if the used-once entries are being pushed in a burst.

Proposal
Have additional LRU list (by creation time) that would hold used-once entries.
When we need to evict by size - evict items from that new list first and only when it's empty fall back to the current algorithm.

@vladzcloudius
Copy link
Contributor Author

@avikivity Please, comment.

@avikivity
Copy link
Member

Looks good. Note you can use a single set of links for both lists, since an item can be on one list at a time.

@sitano
Copy link
Contributor

sitano commented May 21, 2021

what about distinguishing to what generation an item belongs?
@vladzcloudius @avikivity would you prefer to count the number of touches with size_t or just introduce a bool young which will be true for the first touch?

@vladzcloudius
Copy link
Contributor Author

vladzcloudius commented May 21, 2021

what about distinguishing to what generation an item belongs?
@vladzcloudius @avikivity would you prefer to count the number of touches with size_t or just introduce a bool young which will be true for the first touch?

@sitano
You don't need any count while the semantics is bool.

The heuristics is that if a statement has been used more than once this means that it's not a "pollution".

The implementation is as follows:

  1. On creation the item gets into the new LRU list you need to introduce (and not into the current LRU list).
  2. Next time it's "touched" it's removed from the new list and appended to the current LRU list.

The rest as I've described before.

@vladzcloudius
Copy link
Contributor Author

There are more complex cache eviction schemes, like the one used by ARC cache.
However we want to keep things simple for now.

vladzcloudius added a commit to vladzcloudius/scylla that referenced this issue Nov 6, 2021
…(LFRU) eviction policy

This patch implements a simple variation of LFRU eviction policy:
  * We define 2 dynamic cache partitions which total size should not exceed the maximum cache size.
  * New cache entry is always added to the "new generation" partition.
  * After a cache entry is read more than PartitionHitThreshold times it moves to the second cache partition.
  * Both partitions' entries obey expiration and reload rules as before this patch.
  * When cache entries need to be evicted due to a size restriction "new generation" partition
    least recently used entries are evicted first.

Fixes scylladb#8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
vladzcloudius added a commit to vladzcloudius/scylla that referenced this issue Nov 29, 2021
…(LFRU) eviction policy

This patch implements a simple variation of LFRU eviction policy:
  * We define 2 dynamic cache sections which total size should not exceed the maximum cache size.
  * New cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction "unprivileged" section's
    least recently used entries are evicted first.

Note:
With a 2 sections cache it's not enough for a new entry to have the latest timestamp
in order not be evicted right after insertion: e.g. if all all other entries
are from the privileged section.

And obviously we want to allow new cache entries to be added to a cache.

Therefore we can no longer first add a new entry and then shrink the cache.
Switching the order of these two operations resolves the culprit.

Fixes scylladb#8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
vladzcloudius added a commit to vladzcloudius/scylla that referenced this issue Nov 29, 2021
…(LFRU) eviction policy

This patch implements a simple variation of LFRU eviction policy:
  * We define 2 dynamic cache sections which total size should not exceed the maximum cache size.
  * New cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction "unprivileged" section's
    least recently used entries are evicted first.

Note:
With a 2 sections cache it's not enough for a new entry to have the latest timestamp
in order not be evicted right after insertion: e.g. if all all other entries
are from the privileged section.

And obviously we want to allow new cache entries to be added to a cache.

Therefore we can no longer first add a new entry and then shrink the cache.
Switching the order of these two operations resolves the culprit.

Fixes scylladb#8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
nyh added a commit that referenced this issue Dec 1, 2021
…tarov

This series introduces a new version of a loading_cache class.
The old implementation was susceptible to a "pollution" phenomena when frequently used entry can get evicted by an intensive burst of "used once" entries pushed into the cache.

The new version is going to have a privileged and unprivileged cache sections and there's a new loading_cache template parameter - SectionHitThreshold. The new cache algorithm goes as follows:
  * We define 2 dynamic cache sections which total size should not exceed the maximum cache size.
  * New cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times it moves to the second cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction "unprivileged" section's
    least recently used entries are evicted first.

More details may be found in #8674.

In addition, during a testing another issue was found in the authorized_prepared_statements_cache: #9590.
There is a patch that fixes it as well.

Closes #9708

* github.com:scylladb/scylla:
  loading_cache: account unprivileged section evictions
  loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
  authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
  loading_cache::timestamped::lru_entry: refactoring
  loading_cache.hh: rearrange the code (no functional change)
  loading_cache: use std::pmr::polymorphic_allocator
@vladzcloudius vladzcloudius reopened this May 26, 2022
@vladzcloudius
Copy link
Contributor Author

vladzcloudius commented May 26, 2022

The feature is broken by 3f2224a

As a result of a regression if the amount of "old" queries is greater than half of the cache a "pollution" workload is going to evict "old" entries instead of entries in the unprivileged section of the cache.

@avikivity
Copy link
Member

It's not broken. It changes the size of the reservation of the new part from 1 item to half the capacity. As a result the old section drops from n-1 to half. It behaves the same way, with smaller reservation - it still resists pollution, just in a different way.

The previous reservation size of 1 item for the new part wasn't enough to execute a batch statement, since the members of the batch would compete with each other.

@vladzcloudius
Copy link
Contributor Author

It's not broken. It changes the size of the reservation of the new part from 1 item to half the capacity. As a result the old section drops from n-1 to half. It behaves the same way, with smaller reservation - it still resists pollution, just in a different way.

The previous reservation size of 1 item for the new part wasn't enough to execute a batch statement, since the members of the batch would compete with each other.

@avikivity You are confused.
What you are describing is what the fix above SHOULD have done but it's not what it actually did - hence the reopening.

3f2224a, doesn't specify the sizes of the cache sections - it specifies the new condition of the eviction and this is not the same.

As a result when the amount of "old" (good) prepared statements is greater than half of the cache the polluting workload will cause a constant eviction from the privileged section (as I described here #8674 (comment)). More than that - this will happen before evicting a possibly evictable entries from the non-privileged section.

A proposal for a solution that ACTUALLY does what you described above (and I described last week) can be found here: #10674 (comment)

@vladzcloudius vladzcloudius reopened this May 31, 2022
@avikivity
Copy link
Member

@avelanarius please check

@DoronArazii DoronArazii added this to the 5.0 milestone Jul 7, 2022
@DoronArazii DoronArazii modified the milestones: 5.0, 5.x Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants