
loading_cache.hh: items last read time is not updated in sync calls like loading_cache::find(...) #8920

Closed

vladzcloudius opened this issue Jun 23, 2021 · 13 comments

Labels: Field-Tier2 (Important issues as per FieldEngineering request), type/bug

@vladzcloudius (Contributor)

Installation details
HEAD: 373fa3f

Description
Synchronous getters like loading_cache::find(...) don't update the last_read timestamp and don't move the entry in the LRU list.

As a result, entries are evicted only according to their creation/last-reload times when these APIs are used.

This impacts the prepared_statements_cache, for instance.
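
To make the mechanism concrete, here is a minimal, self-contained sketch (illustrative only, not the actual loading_cache.hh code) of the difference between a synchronous read that leaves the LRU metadata untouched and one that "touches" the entry on every read:

```cpp
#include <chrono>
#include <list>
#include <string>
#include <unordered_map>

struct entry {
    std::string value;
    std::chrono::steady_clock::time_point last_read; // never updated by find() below
};

class toy_cache {
    std::unordered_map<std::string, entry> _map;
    std::list<std::string> _lru; // front == most recently used
public:
    // The behavior this issue describes: a synchronous read that leaves
    // last_read and the LRU position untouched, so eviction can only go
    // by creation/reload time.
    const std::string* find(const std::string& key) const {
        auto it = _map.find(key);
        return it == _map.end() ? nullptr : &it->second.value;
    }

    // What a read should also do so that LRU eviction sees the access.
    const std::string* find_and_touch(const std::string& key) {
        auto it = _map.find(key);
        if (it == _map.end()) {
            return nullptr;
        }
        it->second.last_read = std::chrono::steady_clock::now();
        _lru.remove(key);     // O(n) here; a real cache would use an intrusive list
        _lru.push_front(key); // move to the MRU front
        return &it->second.value;
    }
};
```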

@vladzcloudius vladzcloudius self-assigned this Jun 23, 2021
@vladzcloudius vladzcloudius added the Field-Tier2 Important issues as per FieldEngineering request label Jun 23, 2021
@vladzcloudius (Contributor, Author)

Found while working on a fix for #8674

tgrabiec pushed a commit that referenced this issue Aug 16, 2021
…nd return value_ptr from find(...) instead

The iterator interface of loading_shared_values/loading_cache is
dangerous and fragile: an iterator doesn't "lock" the entry it points
to, so if there is a preemption point between acquiring a non-end()
iterator and dereferencing it, the corresponding cache entry may
already have been evicted (for whatever reason, e.g. cache size
constraints or expiration), and the dereference may end up being a
use-after-free. We have no protection against this in
value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface completely
and return value_ptr from find(...) instead - the same one we return
from the asynchronous loading_cache::get_ptr(...) APIs.

A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "locking" the returned
value so that it remains readable even if it has been evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the front of the MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210805193911.1935729-1-vladz@scylladb.com>
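
For illustration, a minimal standalone sketch of the value_ptr idea from the commit message above. The toy_cache name and API are hypothetical, but the key property matches the description: find(...) hands out a shared owner rather than an iterator, so the value stays readable even if the entry is evicted before the caller dereferences it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

class toy_cache {
    std::unordered_map<std::string, std::shared_ptr<std::string>> _map;
public:
    void insert(std::string key, std::string value) {
        _map[std::move(key)] = std::make_shared<std::string>(std::move(value));
    }

    // Instead of an iterator, hand out a shared owner: there is no window
    // in which eviction can invalidate the caller's handle.
    std::shared_ptr<const std::string> find(const std::string& key) const {
        auto it = _map.find(key);
        return it == _map.end() ? nullptr : it->second;
    }

    void evict(const std::string& key) { _map.erase(key); }
};

int main() {
    toy_cache cache;
    cache.insert("k", "v");
    auto vp = cache.find("k");
    cache.evict("k");         // stands in for eviction at a preemption point
    assert(vp && *vp == "v"); // still readable: vp kept the value alive
}
```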
@bhalevy (Member)

bhalevy commented Aug 17, 2021

test_loading_cache_loading_expiry_reset_on_sync_op added by 655aaf5 is flaky, at least in debug mode.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/3872/artifact/output_test_debug.txt

The following test(s) have failed: boost/loading_cache_test
Output of build/debug/test/boost/loading_cache_test --report_level=no --logger=HRF,test_suite:XML,test_suite,/jenkins/workspace/scylla-master/next/scylla/testlog/debug/xml/loading_cache_test.241.xunit.xml --catch_system_errors=no --color_output=false -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 --kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --collectd 0 --max-networking-io-control-blocks=100:
=== TEST.PY STARTING TEST #241 ===
export UBSAN_OPTIONS='halt_on_error=1:abort_on_error=1:suppressions=/jenkins/workspace/scylla-master/next/scylla/ubsan-suppressions.supp'
export ASAN_OPTIONS='disable_coredump=0:abort_on_error=1:detect_stack_use_after_return=1'
build/debug/test/boost/loading_cache_test --report_level=no --logger=HRF,test_suite:XML,test_suite,/jenkins/workspace/scylla-master/next/scylla/testlog/debug/xml/loading_cache_test.241.xunit.xml --catch_system_errors=no --color_output=false -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 --kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --collectd 0 --max-networking-io-control-blocks=100
=== TEST.PY TEST OUTPUT ===
Running 13 test cases...
Entering test module "test/boost/loading_cache_test.cc"
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_parallel_loading_same_key"
==2244==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
WARNING: debug mode. Not for benchmarking or production
random-seed=142376714
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_parallel_loading_same_key"; testing time: 363776us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_parallel_loading_different_keys"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_parallel_loading_different_keys"; testing time: 1003613us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_rehash"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_rehash"; testing time: 1424030us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_parallel_loading_explicit_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_parallel_loading_explicit_eviction"; testing time: 924905us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_same_key"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_same_key"; testing time: 94885us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_removing_key"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_removing_key"; testing time: 4902us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_different_keys"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_different_keys"; testing time: 927052us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_expiry_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_expiry_eviction"; testing time: 96067us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_expiry_reset_on_sync_op"
test/boost/loading_cache_test.cc(296): fatal error: in "test_loading_cache_loading_expiry_reset_on_sync_op": critical check vp != nullptr has failed
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_expiry_reset_on_sync_op"; testing time: 30398us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_move_item_to_mru_list_front_on_sync_op"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_move_item_to_mru_list_front_on_sync_op"; testing time: 16683us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_reloading"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_reloading"; testing time: 69879us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_max_size_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_max_size_eviction"; testing time: 1534913us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_reload_during_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_reload_during_eviction"; testing time: 1002084us
Leaving test module "test/boost/loading_cache_test.cc"; testing time: 7494090us
=== TEST.PY SUMMARY START ===
Test exited with code 201

=== TEST.PY SUMMARY END ===

It looks like it's sensitive to wall-clock timing.
I could reproduce this by running a large number (> 50) of instances of the test in parallel.
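
For context, a self-contained sketch (not the actual test source, which we only see via the log above) of why a wall-clock-based expiry check is load-sensitive: any scheduling delay longer than the expiry, injected between loading an entry and the synchronous read, leaves the reader with a null pointer - exactly the "vp != nullptr" failure in the log.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

int main() {
    using steady = std::chrono::steady_clock;
    const auto expiry = std::chrono::milliseconds(20);
    const auto loaded_at = steady::now();

    // Stand-in for the scheduling delay a heavily loaded machine (or a
    // debug build) can inject between loading an entry and reading it.
    std::this_thread::sleep_for(std::chrono::milliseconds(30));

    // The read arrives after the expiry, so the entry is already gone and
    // a "vp != nullptr"-style check fails - the flakiness reported above.
    const bool still_cached = (steady::now() - loaded_at) < expiry;
    assert(!still_cached);
}
```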

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 17, 2021 via email

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 17, 2021 via email

@bhalevy (Member)

bhalevy commented Aug 18, 2021

A "debug" testing is mostly about memory leaks, right?

But also about shuffling continuations so we don't rely on a particular order of async tasks that are fundamentally concurrent.

So, I'll add a check that would use a steady_clock that would validate that there wasn't a delay and if there was - it would repopulate the cache. It would possibly not check the actual feature in the "debug" even once (if we get delayed all 10 times) but it would definitely create a lot of allocations/deallocations. I think this should be good enough since the actual feature would be properly tested in the "release" mode.

Avi, Benny, WDYT?

My preference would be to base the test on a manual clock rather than workaround the lowres_clock specifics,
unless they could be generalized and be derived from the loading_cache_clock_type (but the clock's granularity is not exposed un any standard or even seastar-specific way).
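
A minimal sketch of the manual-clock approach suggested above: the test advances time explicitly, so scheduling delays can never expire an entry behind the test's back. (Seastar ships a manual clock of its own, seastar::manual_clock; this standalone toy only illustrates the idea.)

```cpp
#include <cassert>
#include <chrono>

struct manual_clock {
    using duration = std::chrono::milliseconds;
    using time_point = std::chrono::time_point<manual_clock, duration>;
    static inline time_point _now{};
    static time_point now() { return _now; }
    static void advance(duration d) { _now += d; }
};

int main() {
    const auto expiry = std::chrono::milliseconds(100);
    const auto loaded_at = manual_clock::now();

    // However long the test process is stalled, cache time stands still...
    assert(manual_clock::now() - loaded_at < expiry);

    // ...and expiry happens exactly when the test says so.
    manual_clock::advance(std::chrono::milliseconds(101));
    assert(manual_clock::now() - loaded_at >= expiry);
}
```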

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 18, 2021 via email

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 18, 2021 via email

tgrabiec pushed a commit that referenced this issue Aug 20, 2021
…nd return value_ptr from find(...) instead

The iterator interface of loading_shared_values/loading_cache is
dangerous and fragile: an iterator doesn't "lock" the entry it points
to, so if there is a preemption point between acquiring a non-end()
iterator and dereferencing it, the corresponding cache entry may
already have been evicted (for whatever reason, e.g. cache size
constraints or expiration), and the dereference may end up being a
use-after-free. We have no protection against this in
value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface completely
and return value_ptr from find(...) instead - the same one we return
from the asynchronous loading_cache::get_ptr(...) APIs.

A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "locking" the returned
value so that it remains readable even if it has been evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the front of the MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
@avikivity (Member)

@vladzcloudius I'm evaluating for a backport. Where is the impact? Do we always re-prepare a statement after entry_expiry (60 minutes)?

I guess re-preparing every 60 minutes isn't too bad, so safer not to backport.

@vladzcloudius (Contributor, Author)

> @vladzcloudius I'm evaluating for a backport. Where is the impact? Do we always re-prepare a statement after entry_expiry (60 minutes)?
>
> I guess re-preparing every 60 minutes isn't too bad, so safer not to backport.

I agree.
The impact is as you have described.

@avikivity (Member)

Not backporting per above.

@vladzcloudius (Contributor, Author)

vladzcloudius commented Nov 16, 2021

> Not backporting per above.

@avikivity
The problem is actually deeper than just the one this patch fixes.
This patch is one of a series that fixes several issues in our internal caches, which can be summed up as "problems in enforcing LRU eviction".

Those issues get exposed when we need to evict entries because there is no room for a new one.

In that case the entry that is supposed to be evicted is the least recently used one. Today it's going to be the least recently created one.

This patch fixes the first (fundamental) issue.
But even with this patch our prepared-statements cache would still not work as advertised above, because of #9590.

And I have a fix for all of these in my aging PR: #9591
And that PR relies on this patch.

Once all of this is merged, I think the patchwork deserves a backport if we think that "using more prepared statements than fit in the cache" or "resilience against internal cache pollution" is important enough to deserve one (I'm not 100% sure it is - this is a corner case we see rather rarely in the field).
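
To spell out the LRU point above, a minimal, self-contained sketch (illustrative only, not the Scylla implementation): when reads splice entries to the MRU front, a full cache evicts the least recently used entry; without that touch - the bug this issue describes - eviction degenerates into dropping the least recently created entry.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

class lru_cache {
    size_t _capacity;
    std::list<std::string> _mru; // front == most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> _pos;
public:
    explicit lru_cache(size_t capacity) : _capacity(capacity) {}

    void insert(const std::string& key) { // assumes key is new
        if (_pos.size() == _capacity) {
            _pos.erase(_mru.back()); // evict the LRU entry, not the oldest insert
            _mru.pop_back();
        }
        _mru.push_front(key);
        _pos[key] = _mru.begin();
    }

    bool find(const std::string& key) {
        auto it = _pos.find(key);
        if (it == _pos.end()) {
            return false;
        }
        _mru.splice(_mru.begin(), _mru, it->second); // the touch this issue is about
        return true;
    }
};

int main() {
    lru_cache cache(2);
    cache.insert("a");
    cache.insert("b");
    cache.find("a");   // touch "a": "b" becomes the LRU entry
    cache.insert("c"); // evicts "b", not the older-but-recently-read "a"
    assert(cache.find("a") && !cache.find("b") && cache.find("c"));
}
```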

@avikivity (Member)

If this is just preparing, we can consider a backport when the second part is committed and tested.

@vladzcloudius (Contributor, Author)

> If this is just preparing, we can consider a backport when the second part is committed and tested.

Totally agree.

avikivity pushed a commit that referenced this issue May 4, 2022
…nd return value_ptr from find(...) instead

The iterator interface of loading_shared_values/loading_cache is
dangerous and fragile: an iterator doesn't "lock" the entry it points
to, so if there is a preemption point between acquiring a non-end()
iterator and dereferencing it, the corresponding cache entry may
already have been evicted (for whatever reason, e.g. cache size
constraints or expiration), and the dereference may end up being a
use-after-free. We have no protection against this in
value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface completely
and return value_ptr from find(...) instead - the same one we return
from the asynchronous loading_cache::get_ptr(...) APIs.

A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "locking" the returned
value so that it remains readable even if it has been evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the front of the MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
(cherry picked from commit 7bd1bcd)

[avi: prerequisite to backporting #10117]
@DoronArazii DoronArazii added this to the 4.6 milestone May 7, 2023