Merge 'misc_services: fix data race from bad usage of get_next_versio…

…n' from Piotr Dulikowski The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc. The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0. The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0. There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A: 1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted, 2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state, 3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state, 4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`, 5. Shard 1 resumes and stores version `v` in the global counter, 6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`. 7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4. Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by: 1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual, 2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value, 3. running a test which performs schema changes in a loop. The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway. Refs: #17493 Closes #17499 * github.com:scylladb/scylladb: version_generator: check that get_next_version is called on shard 0 misc_services: fix data race from bad usage of get_next_version
scylladb · Feb 25, 2024 · fd32e2e · fd32e2e
2 parents 59df479 + 54546e1
commit fd32e2e
Show file tree

Hide file tree

Showing 2 changed files with 17 additions and 2 deletions.
diff --git a/gms/version_generator.cc b/gms/version_generator.cc
@@ -8,6 +8,12 @@
  * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
  */
 
+#include <seastar/util/modules.hh>
+#include <seastar/core/shard_id.hh>
+#include <seastar/core/on_internal_error.hh>
+#include <seastar/core/print.hh>
+#include "log.hh"
+#include "seastarx.hh"
 #include "version_generator.hh"
 
 namespace gms {
@@ -16,8 +22,15 @@ namespace version_generator {
 // For us, we run the gossiper on a single CPU, and don't need to use atomics.
 static version_type version;
 
+static logging::logger logger("version_generator");
+
 version_type get_next_version() noexcept
 {
+    if (this_shard_id() != 0) [[unlikely]] {
+        on_fatal_internal_error(logger, format(
+                "{} can only be called on shard 0, but it was called on shard {}",
+                __FUNCTION__, this_shard_id()));
+    }
     return ++version;
 }
 

diff --git a/service/misc_services.cc b/service/misc_services.cc
@@ -204,8 +204,10 @@ future<lowres_clock::duration> cache_hitrate_calculator::recalculate_hitrates()
             llogger.debug("Send CACHE_HITRATES update max_diff={}, published_nr={}", _diff, _published_nr);
             ++_published_nr;
             _published_time = now;
-            return _gossiper.add_local_application_state(gms::application_state::CACHE_HITRATES,
-                    gms::versioned_value::cache_hitrates(_gstate)).then([recalculate_duration] {
+            return container().invoke_on(0, [&gstate = _gstate] (cache_hitrate_calculator& self) {
+                return self._gossiper.add_local_application_state(gms::application_state::CACHE_HITRATES,
+                        gms::versioned_value::cache_hitrates(gstate));
+            }).then([recalculate_duration] {
                 return recalculate_duration;
             });
         } else {