System.peers: enforce host_id #16376

bhalevy · 2023-12-12T07:20:25Z

HOST_ID has been written to system.peers pretty much since inception (see #16376 (comment) for details).

However, it is written to the table using an individual CQL query and so it is not set atomically with other columns.
If scylla crashes or even hits an exception before updating the host_id, then system.peers might be left in an inconsistent state, and in particular without a HOST_ID value.

This series makes sure that HOST_ID is written to system.peers and uses it to "seal" the record by upserting it in a single CQL BATCH query when adding the state for new nodes.

On the read side, skip rows that have no HOST_ID state in system.peers, assuming they are incomplete, i.e. scylla got an exception or crashed while writing them, so they can't be trusted.

With that change we can assume that endpoint state loaded from system.peers will always have a valid host_id.
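
The read-side rule described above can be sketched roughly as follows. This is only an illustration with hypothetical stand-in names (`peer_row`, `load_sealed_peers`), not the actual system_keyspace code: a row without a host_id is treated as partially written and is skipped, so everything that is loaded carries a host_id.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one row of system.peers; not Scylla's real type.
struct peer_row {
    std::string peer;                    // peer IP address
    std::optional<std::string> host_id;  // empty if the row was never "sealed"
    std::vector<long> tokens;
};

// Return only rows that carry a host_id, keyed by peer address.
// Rows without a host_id are assumed to be partially written and are skipped.
std::unordered_map<std::string, peer_row> load_sealed_peers(const std::vector<peer_row>& rows) {
    std::unordered_map<std::string, peer_row> out;
    for (const auto& row : rows) {
        if (!row.host_id) {
            continue;  // incomplete record: cannot be trusted
        }
        out.insert_or_assign(row.peer, row);
    }
    // Every loaded entry is guaranteed to have a host_id.
    for (const auto& [peer, row] : out) {
        assert(row.host_id.has_value());
    }
    return out;
}
```

With this shape, callers that consume the loaded map do not need their own "missing host_id" branch.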

Refs #15903

service/storage_service.cc

scylladb-promoter · 2023-12-12T09:37:13Z

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 2 hr 16 min
Builder: spider5.cloudius-systems.com

service/storage_service.cc

bhalevy · 2023-12-15T18:09:34Z

In v2 (de4a404):

batch update system.peers
- well, that required some pretty-significant refactoring
- but the end result is much cleaner IMO, so it's a net win in correctness, performance, and code quality.

scylladb-promoter · 2023-12-15T20:20:53Z

🔴 CI State: FAILURE

✅ - Build
✅ - dtest
❌ - Unit Tests

Failed Tests (6/23439):

Build Details:

Duration: 2 hr 13 min
Builder: spider1.cloudius-systems.com

bhalevy · 2023-12-17T06:32:31Z

In v3 (1636d77):

Fix scylla-gdb to try accessing the application state map using std::unordered_map first, and fall back to std::map

scylladb-promoter · 2023-12-17T08:56:47Z

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 2 hr 13 min
Builder: spider3.cloudius-systems.com

avikivity · 2023-12-17T17:35:47Z

Make sure that HOST_ID is written to system.peers and use it to "seal" the record by updating it last (along with TOKENS) when adding the state for new nodes.

On the read side, skip rows that have no HOST_ID state in system.peers, assuming they are incomplete, i.e. scylla got an exception or crashed while writing them, so they can't be trusted.

How does this work during upgrade?

With that change we can assume that endpoint state loaded from system.peers will always have a valid host_id.

Refs #15903

avikivity · 2023-12-17T17:43:00Z

cql3/query_processor.cc

 future<::shared_ptr<untyped_result_set>>
 query_processor::execute_with_params(
        statements::prepared_statement::checked_weak_ptr p,
        db::consistency_level cl,
        service::query_state& query_state,
-        const std::initializer_list<data_value>& values) {
+        const Range& values) {


I think this could be done more cleanly by having a data_value_list type that's a wrapper over std::vector<raw_value>. Give that two non-explicit constructors - from std::initializer_list<data_value> and your range. Then execute_with_params() passes data_value_list to make_internal_options. This limits the templating to a small area.

Sounds promising. I'll try that
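
A minimal sketch of the suggested wrapper, assuming simplified stand-ins for `data_value` and `raw_value` (the real Scylla types are richer, and `to_raw` and `count_bound_values` are hypothetical helpers). The point is that only the wrapper's constructor is templated, while the consumer stays an ordinary function:

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Scylla's data_value and cql3::raw_value.
struct data_value {
    std::string text;
    data_value(const char* s) : text(s) {}
    data_value(std::string s) : text(std::move(s)) {}
};

struct raw_value {
    std::string bytes;
};

// Hypothetical conversion helper.
inline raw_value to_raw(const data_value& v) {
    return raw_value{v.text};
}

// Wrapper over std::vector<raw_value> with two non-explicit constructors,
// so the template lives here rather than in every execute_* signature.
class data_value_list {
    std::vector<raw_value> _values;
public:
    data_value_list(std::initializer_list<data_value> values) {
        for (const auto& v : values) {
            _values.push_back(to_raw(v));
        }
    }

    // Accepts any type that has begin(), e.g. std::vector<data_value>.
    template <typename Range, typename = decltype(std::declval<const Range&>().begin())>
    data_value_list(const Range& values) {
        for (const auto& v : values) {
            _values.push_back(to_raw(v));
        }
    }

    const std::vector<raw_value>& values() const { return _values; }
};

// Non-template consumer, standing in for make_internal_options().
inline std::size_t count_bound_values(const data_value_list& values) {
    return values.values().size();
}
```

Both `count_bound_values({"a", "b"})` and `count_bound_values(some_vector)` convert implicitly at the call site.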

avikivity · 2023-12-17T17:44:46Z

types/types.hh

+
+class data_values_map {
+    std::unordered_map<std::string, data_value> _map;
+


Why std::string and not sstring?

Why a wrapper over a map? Can use the map type directly.

Why sstring? I thought we're trying to stick with the standard when there's no special reason to use sstring (like the uninitialized construct case)

Why a wrapper over a map? Can use the map type directly.

To make the update case simpler.
With bare map m["foo"] = "bar" doesn't work because data_value can't be constructed this way.

unordered_map::insert_or_assign()

If you're solving a common problem, look for a common solution.

Why sstring? I thought we're trying to stick with the standard when there's no special reason to use sstring (like the uninitialized construct case)

We're using sstring over std::string to avoid conversions when we do need sstring.

I thought the main argument for sstring over std::string is performance, since the former doesn't have to be thread safe. Isn't this the case?

Recent (C++11) std::string is also not using copy-on-write. But it's still slower than sstring (sstring doesn't have a separate capacity word).

So we use sstring where performance is important, but also generally to avoid conversions.

avikivity · 2023-12-17T17:45:12Z

types/types.hh

+            return true;
+        }
+        it->second = std::move(value);
+        return false;


Isn't this operator[]?

I wish, but the compiler is playing hard to get (or actually hard to set in this case :))

I guess it's because the value is not default constructible.

There's insert_or_assign() though.
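
For illustration, a small sketch of how `std::unordered_map::insert_or_assign()` covers this case. The `data_value` below is a hypothetical stand-in that is deliberately not default constructible, and `update` is assumed to return true on insert and false on overwrite, as the diff hunk above suggests:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for data_value: it has no default constructor,
// which is why map[key] = value fails to compile.
struct data_value {
    std::string text;
    explicit data_value(std::string s) : text(std::move(s)) {}
    data_value() = delete;
};

using data_values_map = std::unordered_map<std::string, data_value>;

// Returns true if a new entry was inserted, false if an existing one was overwritten.
inline bool update(data_values_map& m, const std::string& name, data_value value) {
    auto [it, inserted] = m.insert_or_assign(name, std::move(value));
    assert(it != m.end());
    return inserted;
}
```

Unlike `operator[]`, `insert_or_assign()` never default-constructs the mapped value, so the find-then-assign logic is not needed.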

avikivity · 2023-12-17T17:47:15Z

db/system_keyspace.cc

+
+    auto req = fmt::format("INSERT INTO system.{} ({}) VALUES ({})", PEERS, fmt::join(column_names, ","), values_ph_str);
+    slogger.debug("INSERT INTO system.{} ({}) VALUES ({})", PEERS, fmt::join(column_names, ","), fmt::join(data_values, ","));
+    co_await execute_cql_query(req, data_values).discard_result();


It's generally bad practice to generate CQL on the fly. CQL represents updates to structured data.

If you need to merge two updates, use BATCH.

That would involve batchlog, wouldn't it?
The whole point of this series is to update system.peers with a single commitlog write.
Would we lose that if we use a batch insert?

No, single partition batch doesn't touch batchlog. It's just a way to generate a mutation.

avikivity · 2023-12-17T17:48:09Z

service/storage_service.cc

+            insert_string("supported_features", value);
+            break;
+        default:
+            break;


Should use BATCH, or the ugly UNSET feature thing.

avikivity · 2023-12-17T17:49:15Z

service/storage_service.cc

+                info.update("rack", rs.rack);
+                info.update("host_id", id.uuid());
+                info.update("release_version", rs.release_version);
+                co_await _sys_ks.local().update_peer_info(ip, std::move(info));


Could be a single fixed CQL statement (apart from merging the table name).

true
We can pass CQL to update_peer_info rather than a column/value map.

Passing CQL strings is also not the cleanest.

avikivity · 2023-12-17T17:50:01Z

db/system_keyspace.cc

 future<std::unordered_map<gms::inet_address, std::unordered_set<dht::token>>> system_keyspace::load_tokens() {
-    sstring req = format("SELECT peer, tokens FROM system.{}", PEERS);
+    sstring req = format("SELECT peer, host_id, tokens FROM system.{}", PEERS);


How does this handle mixed clusters?

All supported versions already store host_id in system.peers. It shouldn't be a problem.

When did we start writing it?

4311662 called update_peer_info from storage_service::handle_state_normal back in 0.10.
Note the use of gossiper::uses_host_id which was true from inception (See 6e83e54 where I got rid of it)

The pretty-much initial implementation of prepare_to_join at efe067f already populated the HOST_ID in the endpoint state, and it is sent in gossip also basically from inception (eed2795).

e86d39f moved the call to update_peer_info a bit later in handle_state_normal, when the node becomes a normal token owner (in branch 4.6)
but it doesn't matter since we wrote the host_id to system.peers even before that.

ok, so no compatibility issues.

bhalevy · 2023-12-17T18:05:40Z

Make sure that HOST_ID is written to system.peers and use it to "seal" the record by updating it last (along with TOKENS) when adding the state for new nodes.
On the read side, skip rows that have no HOST_ID state in system.peers, assuming they are incomplete, i.e. scylla got an exception or crashed while writing them, so they can't be trusted.

How does this work during upgrade?

We're already populating the host_id in system.peers in the normal path. This series just enforces it.

avikivity · 2023-12-17T18:19:34Z

Make sure that HOST_ID is written to system.peers and use it to "seal" the record by updating it last (along with TOKENS) when adding the state for new nodes.
On the read side, skip rows that have no HOST_ID state in system.peers, assuming they are incomplete, i.e. scylla got an exception or crashed while writing them, so they can't be trusted.

How does this work during upgrade?

We're already populating the host_id in system.peers in the normal path. This series just enforces it.

Since which version?

bhalevy · 2023-12-17T19:25:55Z

Make sure that HOST_ID is written to system.peers and use it to "seal" the record by updating it last (along with TOKENS) when adding the state for new nodes.
On the read side, skip rows that have no HOST_ID state in system.peers, assuming they are incomplete, i.e. scylla got an exception or crashed while writing them, so they can't be trusted.

How does this work during upgrade?

We're already populating the host_id in system.peers in the normal path. This series just enforces it.

Since which version?

See #16376 (comment)
Should be 0.X

avikivity · 2023-12-17T19:39:11Z

Make sure that HOST_ID is written to system.peers and use it to "seal" the record by updating it last (along with TOKENS) when adding the state for new nodes.
On the read side, skip rows that have no HOST_ID state in system.peers, assuming they are incomplete, i.e. scylla got an exception or crashed while writing them, so they can't be trusted.

How does this work during upgrade?

We're already populating the host_id in system.peers in the normal path. This series just enforces it.

Since which version?

See #16376 (comment) Should be 0.X

Please note it in the cover letter.

When adding a peer via update_peer_info, insert all columns in a single query using system_keyspace::peer_info. This ensures that `host_id` is inserted along with all other app states, so we can rely on it when loading the peer info after restart. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

It is unused now after the previous patch to update_peer_info in one call. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Skip rows that have no host_id to make sure the node state we load always has a valid host_id. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

None of the subscribers is doing anything before_change. This is done before changing `on_change` in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Rather than calling on_change for each particular application_state, pass an endpoint_state::map_type with all changed states, to be processed as a batch. In particular, this allows storage_service::on_change to update_peer_info once for all changed states. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

It is no longer used after previous patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

They are no longer used. Instead, all callers now pass peer_info. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

State changes are processed as a batch and there is no reason to maintain them as an ordered map. Instead, use a std::unordered_map that is more efficient. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy · 2023-12-31T16:59:13Z

In v6 (cdd5605):

using data_value_or_unset = std::variant<data_value, unset_value>;
using data_value_list = std::initializer_list<data_value_or_unset>;

avikivity · 2023-12-31T17:35:50Z

cql3/query_processor.cc

+        std::visit(overloaded_functor {
+            [&] (const data_value& v) {
+                if (v.type() == bytes_type) {
+                    bound_values.values.emplace_back(cql3::raw_value::make_value(value_cast<bytes>(v)));


Note: this special case is probably a pessimization, since it forces linearization. It's not critical since I don't expect huge values here, but it would be nice to remove it, separately.

avikivity · 2023-12-31T17:38:09Z

types/types.hh

+    }
+};
+
+using data_value_list = std::initializer_list<data_value_or_unset>;


I'm impressed the compiler is able to implicitly convert through so many layers.

avikivity · 2023-12-31T17:39:17Z

types/types.hh

+            },
+            [&ctx] (const unset_value& u) {
+                return fmt::format_to(ctx.out(), "{}", u);
+            }


These two could have been merged ([&] (const auto& v) { ... }). But no matter.
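
A rough sketch of the merge being suggested, using hypothetical stand-in types and standard streams instead of fmt (an assumption made to keep the example dependency-free). When every alternative of the variant supports the same operation, one generic lambda can replace the per-type overloads:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <variant>

// Hypothetical stand-ins; Scylla's data_value and unset_value are richer types.
struct data_value { std::string text; };
struct unset_value {};

inline std::ostream& operator<<(std::ostream& os, const data_value& v) { return os << v.text; }
inline std::ostream& operator<<(std::ostream& os, const unset_value&) { return os << "unset"; }

using data_value_or_unset = std::variant<data_value, unset_value>;

// One generic lambda handles both alternatives because both are streamable.
inline std::string format_value(const data_value_or_unset& v) {
    return std::visit([] (const auto& x) {
        std::ostringstream os;
        os << x;
        return os.str();
    }, v);
}
```

Separate overloads are only needed when the alternatives require different handling.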

scylladb-promoter · 2023-12-31T18:38:58Z

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 1 hr 40 min
Builder: spider5.cloudius-systems.com

gusev-p · 2023-12-31T18:37:24Z

gms/endpoint_state.hh

@@ -19,7 +19,7 @@

 namespace gms {

-using application_state_map = std::map<application_state, versioned_value>;
+using application_state_map = std::unordered_map<application_state, versioned_value>;


I don't know if idl supports it.

It's a matter of just copy-pasting (or templatizing) this std::map serializer. There is already absl::btree_set support there.

Note: there is no compatibility issue here since std::map and std::unordered_map are serialised/deserialized in the same way. The same could be true for absl::flat_hash_map.
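
To illustrate why the container type need not affect the wire format, here is a simplified sketch. It assumes a hypothetical encoding (entry count followed by key/value pairs) and hypothetical helper names; Scylla's IDL serializer is more involved. The reader never relies on entry order, so data written from one map type can be read into the other:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// Simplified, hypothetical wire format: entry count followed by key/value pairs.
using buffer = std::vector<std::int32_t>;

// Works for any map-like container with int32 keys and values.
template <typename Map>
buffer serialize_map(const Map& m) {
    buffer out;
    out.push_back(static_cast<std::int32_t>(m.size()));
    for (const auto& [k, v] : m) {
        out.push_back(k);
        out.push_back(v);
    }
    return out;
}

// Rebuilds the requested container type; entry order on the wire is irrelevant.
template <typename Map>
Map deserialize_map(const buffer& in) {
    assert(!in.empty());
    Map m;
    const auto count = static_cast<std::size_t>(in[0]);
    assert(in.size() == 1 + 2 * count);
    for (std::size_t i = 0; i < count; ++i) {
        m.emplace(in[1 + 2 * i], in[2 + 2 * i]);
    }
    return m;
}
```

The entries may appear in a different order when written from an unordered container, but the decoded contents are the same.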

gusev-p · 2023-12-31T21:01:32Z

cql3/query_processor.cc

@@ -763,7 +763,7 @@ std::pair<std::reference_wrapper<struct query_processor::remote>, gate::holder>

 query_options query_processor::make_internal_options(


nit: commit message:

data_value_list is a wrapper around std::initializer_list<data_value>.

doesn't correspond to the code (in this commit, data_value_list is defined to be equal to std::initializer_list<data_value>)

gusev-p · 2023-12-31T21:05:03Z

cql3/query_processor.cc

@@ -771,27 +771,37 @@ query_options query_processor::make_internal_options(
                format("Invalid number of values. Expecting {:d} but got {:d}", p->bound_names.size(), values.size()));
    }


nit: commit message:

Add overloads for execute_internal and friends
accepting a vector of optional<data_value>.

doesn't correspond to the code

gusev-p · 2023-12-31T21:26:05Z

service/storage_service.cc

@@ -485,8 +486,9 @@ future<> storage_service::topology_state_load() {
                if (rs.ring.has_value()) {
                    if (!is_me(ip)) {
                        // Save ip -> id mapping in peers table because we need it on restart, but do not save tokens until owned
-                        co_await _sys_ks.local().update_tokens(ip, {});


we lost co_await _sys_ks.local().update_tokens(ip, {}); here. I'm not sure if it's important.

If no tokens are stored yet, then storing an empty set is equivalent (yet I think it's better to store no tokens than to store an empty set).
If there were tokens, then storing an empty set here is definitely wrong. The node should be decommissioned or removed to lose its tokens.

bhalevy · 2023-12-31T22:07:47Z

It's a matter of just copy-pasting (or templatizing) this std::map serializer. There is already absl::btree_set support there.

Note: there is no compatibility issue here since std::map and std::unordered_map are serialised/deserialized in the same way. The same could be true for absl::flat_hash_map

It's really out of scope for this PR.
I regret even including the change to unordered_map.
Let's drop it and then we can make the change to abseil in one go.
You're invited to craft a patch for it.

When loading endpoint_state from system.peers, pass the loaded nodes' dc/rack info from storage_service::join_token_ring to gossiper::add_saved_endpoint. Load the endpoint DC/RACK information to the endpoint_state, if available, so they can propagate to bootstrapping nodes via gossip, even if those nodes are DOWN after a full cluster-restart. Note that this change makes the host_id presence mandatory following scylladb#16376. The reason to do so is that the other states: tokens, dc, and rack are useless without the host_id. This change is backward compatible since the HOST_ID application state has been written to system.peers since inception in scylla and it would be missing only due to a potential exception in older versions that failed to write it. In this case, manual intervention is needed and the correct HOST_ID needs to be manually updated in system.peers. Refs scylladb#15787 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy requested a review from tgrabiec as a code owner December 12, 2023 07:20

bhalevy requested review from kbr-scylla and gusev-p December 12, 2023 07:20

bhalevy mentioned this pull request Dec 12, 2023

Set dc and rack in gossiper when loaded from system.peers and load the ignored nodes state for replace #15788

Closed

gusev-p reviewed Dec 12, 2023

gusev-p approved these changes Dec 12, 2023

kbr-scylla reviewed Dec 14, 2023

bhalevy mentioned this pull request Dec 15, 2023

system_keyspace update_peer_info and update_tokens called for the local endpoint #16425

Closed

bhalevy force-pushed the system-peers-enforce-host_id branch from 7965a1a to de4a404 Compare December 15, 2023 18:06

bhalevy requested review from gusev-p and kbr-scylla December 15, 2023 18:09

bhalevy force-pushed the system-peers-enforce-host_id branch from de4a404 to 1636d77 Compare December 17, 2023 06:31

avikivity reviewed Dec 17, 2023

bhalevy force-pushed the system-peers-enforce-host_id branch from 1636d77 to 0e20709 Compare December 17, 2023 22:59

bhalevy added 9 commits December 31, 2023 18:37

system_keyspace: drop update_tokens(endpoint, tokens) overload

74159bb

It is unused now after the previous patch to update_peer_info in one call. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

system_keyspace: load_tokens/peers/host_ids: enforce presence of host_id

7670f60

Skip rows that have no host_id to make sure the node state we load always has a valid host_id. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

everywhere: drop before_change subscription

1d07a59

None of the subscribers is doing anything before_change. This is done before changing `on_change` in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

storage_service: on_change: fixup indentation

13d395f

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

storage_service: drop do_update_system_peers_table

0e5a666

It is no longer used after previous patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

system_keyspace: update_peer_info: drop single-column overloads

c520fc2

They are no longer used. Instead, all callers now pass peer_info. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

gms: endpoint_state: change application_state_map to std::unordered_map

cdd5605

State changes are processed as a batch and there is no reason to maintain them as an ordered map. Instead, use a std::unordered_map that is more efficient. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy force-pushed the system-peers-enforce-host_id branch from a6cff77 to cdd5605 Compare December 31, 2023 16:57

avikivity reviewed Dec 31, 2023

avikivity approved these changes Dec 31, 2023

gusev-p reviewed Dec 31, 2023

scylladb-promoter merged commit 8ba0dec into scylladb:master Dec 31, 2023
4 checks passed

bhalevy mentioned this pull request Feb 6, 2024

Purge old ip on change #17162

Merged

