
loading_cache.hh: items last read time is not updated in sync calls like loading_cache::find(...) #8920

Closed

vladzcloudius opened this issue Jun 23, 2021 · 13 comments

Labels: Field-Tier2 (Important issues as per FieldEngineering request), type/bug

@vladzcloudius (Contributor)

Installation details
HEAD: 373fa3f

Description
Synchronous getters like loading_cache::find(...) don't update the last_read timestamp and don't move the entry in the LRU list.

As a result, entries are evicted only according to their creation/last-reload times when these APIs are used.

This impacts the prepared_statements_cache, for instance.
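
To make the mechanism concrete, here is a minimal, self-contained sketch (illustrative only, not the actual loading_cache.hh code) of the difference between a synchronous read that leaves the LRU metadata untouched and one that "touches" the entry on every read:

```cpp
#include <chrono>
#include <list>
#include <string>
#include <unordered_map>

struct entry {
    std::string value;
    std::chrono::steady_clock::time_point last_read; // never updated by find() below
};

class toy_cache {
    std::unordered_map<std::string, entry> _map;
    std::list<std::string> _lru; // front == most recently used
public:
    // The behavior this issue describes: a synchronous read that leaves
    // last_read and the LRU position untouched, so eviction can only go
    // by creation/reload time.
    const std::string* find(const std::string& key) const {
        auto it = _map.find(key);
        return it == _map.end() ? nullptr : &it->second.value;
    }

    // What a read should also do so that LRU eviction sees the access.
    const std::string* find_and_touch(const std::string& key) {
        auto it = _map.find(key);
        if (it == _map.end()) {
            return nullptr;
        }
        it->second.last_read = std::chrono::steady_clock::now();
        _lru.remove(key);     // O(n) here; a real cache would use an intrusive list
        _lru.push_front(key); // move to the MRU front
        return &it->second.value;
    }
};
```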

@vladzcloudius vladzcloudius self-assigned this Jun 23, 2021
@vladzcloudius vladzcloudius added the Field-Tier2 Important issues as per FieldEngineering request label Jun 23, 2021
@vladzcloudius (Contributor, Author)

Found while working on a fix for #8674

tgrabiec pushed a commit that referenced this issue Aug 16, 2021
…nd return value_ptr from find(...) instead

The iterator interface of loading_shared_values/loading_cache is
dangerous and fragile: an iterator doesn't "lock" the entry it points
to, so if there is a preemption point between acquiring a non-end()
iterator and dereferencing it, the corresponding cache entry may
already have been evicted (for whatever reason, e.g. cache size
constraints or expiration), and the dereference may end up being a
use-after-free. We have no protection against this in
value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface completely
and return value_ptr from find(...) instead - the same one we return
from the asynchronous loading_cache::get_ptr(...) APIs.

A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "locking" the returned
value so that it remains readable even if it has been evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the front of the MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210805193911.1935729-1-vladz@scylladb.com>
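
For illustration, a minimal standalone sketch of the value_ptr idea from the commit message above. The toy_cache name and API are hypothetical, but the key property matches the description: find(...) hands out a shared owner rather than an iterator, so the value stays readable even if the entry is evicted before the caller dereferences it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

class toy_cache {
    std::unordered_map<std::string, std::shared_ptr<std::string>> _map;
public:
    void insert(std::string key, std::string value) {
        _map[std::move(key)] = std::make_shared<std::string>(std::move(value));
    }

    // Instead of an iterator, hand out a shared owner: there is no window
    // in which eviction can invalidate the caller's handle.
    std::shared_ptr<const std::string> find(const std::string& key) const {
        auto it = _map.find(key);
        return it == _map.end() ? nullptr : it->second;
    }

    void evict(const std::string& key) { _map.erase(key); }
};

int main() {
    toy_cache cache;
    cache.insert("k", "v");
    auto vp = cache.find("k");
    cache.evict("k");         // stands in for eviction at a preemption point
    assert(vp && *vp == "v"); // still readable: vp kept the value alive
}
```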
@bhalevy (Member)

bhalevy commented Aug 17, 2021

test_loading_cache_loading_expiry_reset_on_sync_op added by 655aaf5 is flaky, at least in debug mode.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/3872/artifact/output_test_debug.txt

The following test(s) have failed: boost/loading_cache_test
Output of build/debug/test/boost/loading_cache_test --report_level=no --logger=HRF,test_suite:XML,test_suite,/jenkins/workspace/scylla-master/next/scylla/testlog/debug/xml/loading_cache_test.241.xunit.xml --catch_system_errors=no --color_output=false -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 --kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --collectd 0 --max-networking-io-control-blocks=100:
=== TEST.PY STARTING TEST #241 ===
export UBSAN_OPTIONS='halt_on_error=1:abort_on_error=1:suppressions=/jenkins/workspace/scylla-master/next/scylla/ubsan-suppressions.supp'
export ASAN_OPTIONS='disable_coredump=0:abort_on_error=1:detect_stack_use_after_return=1'
build/debug/test/boost/loading_cache_test --report_level=no --logger=HRF,test_suite:XML,test_suite,/jenkins/workspace/scylla-master/next/scylla/testlog/debug/xml/loading_cache_test.241.xunit.xml --catch_system_errors=no --color_output=false -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 --kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --collectd 0 --max-networking-io-control-blocks=100
=== TEST.PY TEST OUTPUT ===
Running 13 test cases...
Entering test module "test/boost/loading_cache_test.cc"
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_parallel_loading_same_key"
==2244==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
WARNING: debug mode. Not for benchmarking or production
random-seed=142376714
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_parallel_loading_same_key"; testing time: 363776us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_parallel_loading_different_keys"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_parallel_loading_different_keys"; testing time: 1003613us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_rehash"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_rehash"; testing time: 1424030us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_shared_values_parallel_loading_explicit_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_shared_values_parallel_loading_explicit_eviction"; testing time: 924905us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_same_key"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_same_key"; testing time: 94885us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_removing_key"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_removing_key"; testing time: 4902us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_different_keys"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_different_keys"; testing time: 927052us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_expiry_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_expiry_eviction"; testing time: 96067us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_expiry_reset_on_sync_op"
test/boost/loading_cache_test.cc(296): fatal error: in "test_loading_cache_loading_expiry_reset_on_sync_op": critical check vp != nullptr has failed
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_expiry_reset_on_sync_op"; testing time: 30398us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_move_item_to_mru_list_front_on_sync_op"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_move_item_to_mru_list_front_on_sync_op"; testing time: 16683us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_loading_reloading"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_loading_reloading"; testing time: 69879us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_max_size_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_max_size_eviction"; testing time: 1534913us
test/boost/loading_cache_test.cc(0): Entering test case "test_loading_cache_reload_during_eviction"
test/boost/loading_cache_test.cc(0): Leaving test case "test_loading_cache_reload_during_eviction"; testing time: 1002084us
Leaving test module "test/boost/loading_cache_test.cc"; testing time: 7494090us
=== TEST.PY SUMMARY START ===
Test exited with code 201

=== TEST.PY SUMMARY END ===

It looks like it's sensitive to wall-clock timing.
I could reproduce this by running a large number (> 50) of instances of the test in parallel.
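
For context, a self-contained sketch (not the actual test source, which we only see via the log above) of why a wall-clock-based expiry check is load-sensitive: any scheduling delay longer than the expiry, injected between loading an entry and the synchronous read, leaves the reader with a null pointer - exactly the "vp != nullptr" failure in the log.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

int main() {
    using steady = std::chrono::steady_clock;
    const auto expiry = std::chrono::milliseconds(20);
    const auto loaded_at = steady::now();

    // Stand-in for the scheduling delay a heavily loaded machine (or a
    // debug build) can inject between loading an entry and reading it.
    std::this_thread::sleep_for(std::chrono::milliseconds(30));

    // The read arrives after the expiry, so the entry is already gone and
    // a "vp != nullptr"-style check fails - the flakiness reported above.
    const bool still_cached = (steady::now() - loaded_at) < expiry;
    assert(!still_cached);
}
```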

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 17, 2021 via email

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 17, 2021 via email

@bhalevy (Member)

bhalevy commented Aug 18, 2021

A "debug" testing is mostly about memory leaks, right?

But also about shuffling continuations so we don't rely on a particular order of async tasks that are fundamentally concurrent.

So, I'll add a check that would use a steady_clock that would validate that there wasn't a delay and if there was - it would repopulate the cache. It would possibly not check the actual feature in the "debug" even once (if we get delayed all 10 times) but it would definitely create a lot of allocations/deallocations. I think this should be good enough since the actual feature would be properly tested in the "release" mode.

Avi, Benny, WDYT?

My preference would be to base the test on a manual clock rather than workaround the lowres_clock specifics,
unless they could be generalized and be derived from the loading_cache_clock_type (but the clock's granularity is not exposed un any standard or even seastar-specific way).
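
A minimal sketch of the manual-clock approach suggested above: the test advances time explicitly, so scheduling delays can never expire an entry behind the test's back. (Seastar ships a manual clock of its own, seastar::manual_clock; this standalone toy only illustrates the idea.)

```cpp
#include <cassert>
#include <chrono>

struct manual_clock {
    using duration = std::chrono::milliseconds;
    using time_point = std::chrono::time_point<manual_clock, duration>;
    static inline time_point _now{};
    static time_point now() { return _now; }
    static void advance(duration d) { _now += d; }
};

int main() {
    const auto expiry = std::chrono::milliseconds(100);
    const auto loaded_at = manual_clock::now();

    // However long the test process is stalled, cache time stands still...
    assert(manual_clock::now() - loaded_at < expiry);

    // ...and expiry happens exactly when the test says so.
    manual_clock::advance(std::chrono::milliseconds(101));
    assert(manual_clock::now() - loaded_at >= expiry);
}
```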

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 18, 2021 via email

@vladzcloudius (Contributor, Author)

vladzcloudius commented Aug 18, 2021 via email

tgrabiec pushed a commit that referenced this issue Aug 20, 2021
…nd return value_ptr from find(...) instead

The iterator interface of loading_shared_values/loading_cache is
dangerous and fragile: an iterator doesn't "lock" the entry it points
to, so if there is a preemption point between acquiring a non-end()
iterator and dereferencing it, the corresponding cache entry may
already have been evicted (for whatever reason, e.g. cache size
constraints or expiration), and the dereference may end up being a
use-after-free. We have no protection against this in
value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface completely
and return value_ptr from find(...) instead - the same one we return
from the asynchronous loading_cache::get_ptr(...) APIs.

A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "locking" the returned
value so that it remains readable even if it has been evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the front of the MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
@avikivity (Member)

@vladzcloudius I'm evaluating for a backport. Where is the impact? Do we always re-prepare a statement after entry_expiry (60 minutes)?

I guess re-preparing every 60 minutes isn't too bad, so safer not to backport.

@vladzcloudius (Contributor, Author)

> @vladzcloudius I'm evaluating for a backport. Where is the impact? Do we always re-prepare a statement after entry_expiry (60 minutes)?
>
> I guess re-preparing every 60 minutes isn't too bad, so safer not to backport.

I agree.
The impact is as you have described.

@avikivity (Member)

Not backporting per above.

@vladzcloudius (Contributor, Author)

vladzcloudius commented Nov 16, 2021

> Not backporting per above.

@avikivity
The problem is actually deeper than just the one this patch fixes.
This patch is one of a series that fixes several issues in our internal caches, which can be summed up as "problems in enforcing LRU eviction".

Those issues get exposed when we need to evict entries because there is no room for a new one.

In that case the entry that is supposed to be evicted is the least recently used one. Today it's going to be the least recently created one.

This patch fixes the first (fundamental) issue.
But even with this patch our prepared-statements cache would still not work as advertised above, because of #9590.

And I have a fix for all of these in my aging PR: #9591
And that PR relies on this patch.

Once all of this is merged, I think the patchwork deserves a backport if we think that "using more prepared statements than fit in the cache" or "resilience against internal cache pollution" is important enough to deserve one (I'm not 100% sure it is - this is a corner case we see rather rarely in the field).
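
To spell out the LRU point above, a minimal, self-contained sketch (illustrative only, not the Scylla implementation): when reads splice entries to the MRU front, a full cache evicts the least recently used entry; without that touch - the bug this issue describes - eviction degenerates into dropping the least recently created entry.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

class lru_cache {
    size_t _capacity;
    std::list<std::string> _mru; // front == most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> _pos;
public:
    explicit lru_cache(size_t capacity) : _capacity(capacity) {}

    void insert(const std::string& key) { // assumes key is new
        if (_pos.size() == _capacity) {
            _pos.erase(_mru.back()); // evict the LRU entry, not the oldest insert
            _mru.pop_back();
        }
        _mru.push_front(key);
        _pos[key] = _mru.begin();
    }

    bool find(const std::string& key) {
        auto it = _pos.find(key);
        if (it == _pos.end()) {
            return false;
        }
        _mru.splice(_mru.begin(), _mru, it->second); // the touch this issue is about
        return true;
    }
};

int main() {
    lru_cache cache(2);
    cache.insert("a");
    cache.insert("b");
    cache.find("a");   // touch "a": "b" becomes the LRU entry
    cache.insert("c"); // evicts "b", not the older-but-recently-read "a"
    assert(cache.find("a") && !cache.find("b") && cache.find("c"));
}
```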

@avikivity (Member)

If this is just preparing, we can consider a backport when the second part is committed and tested.

@vladzcloudius (Contributor, Author)

> If this is just preparing, we can consider a backport when the second part is committed and tested.

Totally agree.

avikivity pushed a commit that referenced this issue May 4, 2022
…nd return value_ptr from find(...) instead

The iterator interface of loading_shared_values/loading_cache is
dangerous and fragile: an iterator doesn't "lock" the entry it points
to, so if there is a preemption point between acquiring a non-end()
iterator and dereferencing it, the corresponding cache entry may
already have been evicted (for whatever reason, e.g. cache size
constraints or expiration), and the dereference may end up being a
use-after-free. We have no protection against this in
value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface completely
and return value_ptr from find(...) instead - the same one we return
from the asynchronous loading_cache::get_ptr(...) APIs.

A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "locking" the returned
value so that it remains readable even if it has been evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the front of the MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
(cherry picked from commit 7bd1bcd)

[avi: prerequisite to backporting #10117]
@DoronArazii DoronArazii added this to the 4.6 milestone May 7, 2023