raft topology: ban left nodes from the cluster #13850

Merged
merged 14 commits into scylladb:master on Jun 22, 2023

Conversation

kbr-scylla
Contributor

@kbr-scylla kbr-scylla commented May 10, 2023

Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.

This works not only for nodes that restart, but also for nodes that were running behind a network partition when we removed them. Even when the partition resolves, the existing nodes will effectively put up a firewall against that node.

Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect, a pre-existing problem with decommission was fixed. Read the "introduce left_token_ring state" and "prepare decommission path for node banning" commits for details.

@kbr-scylla
Contributor Author

kbr-scylla commented May 10, 2023

In the meantime I'll work on some test infrastructure to automatically test this.
(I already checked manually that it works)

@kbr-scylla kbr-scylla requested a review from avikivity May 10, 2023 14:23
```
ms._host_connections.erase(start, end);

co_await parallel_for_each(conns, [] (shared_ptr<rpc::server_connection>& conn) {
    return conn->stop();
```
Contributor

Shouldn't we also drop outgoing connections?

Contributor Author

We could, but if some service is sending out RPCs to that node, get_rpc_client would just reestablish the connection.

I guess we could also ban outgoing connections like we do with incoming (and checking whether a host is banned in get_rpc_client or something).

Alternatively - and this is basically the approach I took - assume that once a node has left, we will not attempt to communicate with it (fingers crossed). This depends on the nice behavior of our services. So we don't have to put up the firewall in this direction, because we're not sending anything to that node anymore anyway.

Contributor

We could, but if some service is sending out RPCs to that node, get_rpc_client would just reestablish the connection.

I guess we could also ban outgoing connections like we do with incoming (and checking whether a host is banned in get_rpc_client or something).

Of course.

Alternatively - and this is basically the approach I took - assume that once a node has left, we will not attempt to communicate with it (fingers crossed). This depends on the nice behavior of our services. So we don't have to put up the firewall in this direction, because we're not sending anything to that node anymore anyway.

I want to reuse this for isolate_node as well.
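For context, here is a purely hypothetical sketch of the outgoing-direction check discussed above. This PR does not implement it; `is_host_banned` does exist later in the series, but the `get_rpc_client` wiring and the destination `host_id` lookup shown here are assumptions.

```
// Hypothetical only: ban in the outgoing direction, as discussed above.
// Not part of this PR (it is blocked on knowing the destination's host_id
// when sending; see the later discussion of #6403).
shared_ptr<messaging_service::rpc_protocol_client_wrapper>
messaging_service::get_rpc_client(messaging_verb verb, msg_addr id) {
    auto hid = host_id_for(id);   // assumed helper: msg_addr -> host_id, if known
    if (hid && is_host_banned(*hid)) {
        throw std::runtime_error("refusing to connect to a banned host");
    }
    // ... existing client lookup / connection establishment ...
}
```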


@xemul
Contributor

xemul commented May 10, 2023

General question -- is the seastar part really needed? There's messaging_service::find_and_remove_client (and stuff on top) that can be used to drop the connection without the help of seastar. Would it work here?

@kbr-scylla
Contributor Author

kbr-scylla commented May 11, 2023

General question -- is the seastar part really needed? There's messaging_service::find_and_remove_client (and stuff on top) that can be used to drop the connection without the help of seastar. Would it work here?

IIUC it only works for clients, i.e. outgoing connections/messages. The seastar part is needed for blocking incoming messages.

Also, dropping existing connections is not enough; we also need a way to prevent further communication.

@kostja
Contributor

kostja commented May 12, 2023

I am not an expert in seastar rpc, but provided we go with the approach suggested in the RFC, I would consider the following:

  • do we really need to be able to stop the connection from within reply() if rpc::drop_connection is thrown? Do we need drop-on-exception, isn't explicit drop enough?
  • stop() calls abort(), which calls shutdown_input() which only calls shutdown (SHUT_RD). Why only read part? Why not close the file descriptor altogether?
  • the feature needs unit tests on seastar side.

@kbr-scylla
Contributor Author

do we really need to be able to stop the connection from within reply() if rpc::drop_connection is thrown? Do we need drop-on-exception, isn't explicit drop enough?

Without the exception it would be awkward if a two-way RPC handler wants to drop a connection. The handler would stop the connection but then it would still have to construct some kind of response or throw another type of exception.

For this use case, though, we only close connections from one-way handlers, so maybe the exception is not necessary.
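A minimal sketch of the explicit-drop alternative, using the shape the v2 update below settles on (the handler reaches the `rpc::server` and connection id through `client_info`); the handler signature and the `is_host_banned` check are simplified assumptions:

```
// Sketch: a one-way handler drops its own connection explicitly instead of
// throwing a dedicated rpc::drop_connection exception. The server reference
// and connection id come from client_info, per the v2 notes below.
void handle_one_way_verb(rpc::client_info& cinfo, locator::host_id from) {
    if (is_host_banned(from)) {                       // assumed predicate
        cinfo.server.drop_connection(cinfo.conn_id);  // drop the incoming connection
        return;
    }
    // ... normal handling of the message ...
}
```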

stop() calls abort(), which calls shutdown_input() which only calls shutdown (SHUT_RD). Why only read part? Why not close the file descriptor altogether?

That's a question for @gleb-cloudius

@gleb-cloudius
Contributor

stop() calls abort(), which calls shutdown_input() which only calls shutdown (SHUT_RD). Why only read part? Why not close the file descriptor altogether?

That's a question for @gleb-cloudius

Shutting down the read side wakes up the reader fiber, which starts the shutdown process and notifies the sender fiber, which in turn shuts down the write side at the end and closes the descriptor.

@kbr-scylla
Contributor Author

v2:

seastar:

  • in rpc::client_info&, provide rpc::server& + connection_id instead of server::connection&
  • replace server::get_connection with server::drop_connection(connection_id)
  • remove rpc::drop_connection exception; the handler has access to the connection from other commits in this series, it can drop it directly
  • remove "rpc: move server::connection class outside server" commit - not needed with the new approach
  • add a simple test for dropping connection from handler

scylla:

  • use client_info::server and client_info::conn_id fields (instead of client_info::conn which was removed from the seastar patch)
  • use server::drop_connection(conn_id) instead of connection::drop() (we don't have access to the connection anymore)
  • instead of throwing rpc::drop_connection{} exception in handler, use server::drop_connection - the exception API was redundant
  • rename host_banned to is_host_banned
  • add test infrastructure + simple test for banning nodes

I didn't implement dropping/banning of outgoing connections, because it's blocked by #6403. When we send a message, in general, we don't have access to the host_id of the destination inside messaging_service.
AFAIK @bhalevy is working on extending msg_addr with host_ids. Then it will be possible.

Still, I think that banning incoming connections provides 99%-100% of the value that this PR is supposed to provide: it effectively isolates removed nodes, as the added test illustrates.

I propose that in the scope of this PR, we solve banning incoming connections, and leave banning outgoing connections to a follow-up, after we have #6403.

Unfortunately, there's a bigger problem.
Decommissioning doesn't work!

Well, it does to some extent: the node is removed from the token ring / topology... but it cannot remove itself from group 0, because by that time, it is banned from the cluster :D cc @kostja @gleb-cloudius
This is illustrated by test_topology_ops which is now failing.

We should discuss how to solve this. This IMO supports the theory that the topology coordinator should be responsible for group 0 reconfigurations.

@gleb-cloudius
Contributor

but it cannot remove itself from group 0, because by that time, it is banned from the cluster

But the coordinator already does that in case the node itself fails. So maybe we only need to silence the error?


@kbr-scylla
Contributor Author

But the coordinator already does that in case the node itself fails. So maybe we only need to silence the error?

The "error" is that the decommissioning node hangs on leave_group0().

@gleb-cloudius
Contributor

But the coordinator already does that in case the node itself fails. So maybe we only need to silence the error?

The "error" is that the decommissioning node hangs on leave_group0().

Then we need to do it with a timeout.

@kbr-scylla
Contributor Author

Then we need to do it with a timeout.

What's the point of having a call which will time out 90% of the time? We expect it to fail, so why do it in the first place?

@gleb-cloudius
Contributor

Then we need to do it with a timeout.

What's the point of having a call which will time out 90% of the time? We expect it to fail, so why do it in the first place?

True. We can just not do it at all.

@kbr-scylla
Contributor Author

I don't like the resulting UX.

nodetool decommission should give certain guarantees to the user when it finishes successfully. It should not regress. Currently we are guaranteed that once it finishes, the node is no longer part of topology or group 0 configuration. We should not break this promise.

So IMO the decommissioning node should get a confirmation from the topology coordinator that the node got removed entirely.

But when we ban left nodes, we make it impossible for the node to get this confirmation. Once it's removed, we're no longer communicating with it, but we still want to tell it that it's removed.

@gleb-cloudius
Contributor

I don't like the resulting UX.

nodetool decommission should give certain guarantees to the user when it finishes successfully. It should not regress. Currently we are guaranteed that once it finishes, the node is no longer part of topology or group 0 configuration. We should not break this promise.

When nodetool decommission completes there is no guarantee that the node is not part of group 0, and that is why you added a cleanup procedure to nodetool removenode. The only difference is that now the procedure is automatic.

The only other way to "fix" the UX is to issue nodetool decommission to another node and add a parameter providing the node ID to decommission. IOW, make it like nodetool removenode, but for live nodes.

@kbr-scylla
Contributor Author

When nodetool decommission completes there is no guarantee that the node is not part of group 0, and that is why you added a cleanup procedure to nodetool removenode.

The guarantee is there if it completes successfully.
The cleanup procedure is only needed if decommission fails.

The only other way to "fix" the UX is to issue nodetool decommission to another node and add a parameter providing the node ID to decommission. IOW, make it like nodetool removenode, but for live nodes.

Or ban the node only after it gets the confirmation.

@kbr-scylla
Contributor Author

@avikivity could you merge please?

@kbr-scylla
Contributor Author

@avikivity ping

@avikivity
Member

Please rebase, non-trivial conflicts.

@kbr-scylla
Contributor Author

Hm, probably with the fencing PR which was queued 20 minutes ago.

@kbr-scylla
Contributor Author

I'll rebase once #14285 is merged, which will simplify the "messaging_service: store the node's host ID" commit (we'll be able to make the host_id field non-optional).

@avikivity
Member

You're risking a maintainer missing the pings again! But I appreciate doing the right thing.

When a node first establishes a connection to another node, it always
sends a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.

Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.

This mapping will be used later when we ban nodes that we remove from
the cluster.
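A rough sketch of the receiving side described above; `_host_connections` is the multimap that appears in the range-diff later in the thread, while `connection_ref`'s contents and the handler signature are simplified assumptions:

```
// Sketch: on receiving CLIENT_ID, remember which incoming connection belongs
// to which host_id, so that banning (next commit) can find and drop it later.
struct connection_ref {       // only the name comes from the diff; members are assumed
    rpc::server& server;
    rpc::connection_id conn_id;
};

void messaging_service::on_client_id(rpc::client_info& cinfo, locator::host_id from) {
    _host_connections.emplace(from, connection_ref{cinfo.server, cinfo.conn_id});
}
```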
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
  be immediately dropped) when receiving the `CLIENT_ID` verb.
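A condensed sketch of the two effects described above. `_banned_hosts`, `is_host_banned` and `_host_connections` are names from this series (see the v2 notes and the range-diff below); `drop_connection`'s exact signature and the `CLIENT_ID` handler wiring are simplified:

```
// Sketch of ban_host: mark the host banned, then drop every connection
// currently mapped to it.
void messaging_service::ban_host(locator::host_id id) {
    _banned_hosts.insert(id);
    auto [start, end] = _host_connections.equal_range(id);
    for (auto it = start; it != end; ++it) {
        it->second.server.drop_connection(it->second.conn_id);
    }
    _host_connections.erase(start, end);
}

bool messaging_service::is_host_banned(locator::host_id id) const {
    return _banned_hosts.contains(id);
}

// In the CLIENT_ID handler, further connection attempts from a banned host are
// rejected by dropping the freshly opened connection immediately:
//   if (is_host_banned(from)) { cinfo.server.drop_connection(cinfo.conn_id); return; }
```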
Saves some redundant typing when passing `raft_topology_cmd` parameters,
so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
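In other words, a minimal sketch of the change (the struct layout shown is illustrative, with enum values taken from elsewhere in this thread):

```
// Sketch: a non-explicit converting constructor lets a bare command value be
// passed wherever a raft_topology_cmd is expected.
struct raft_topology_cmd {
    enum class command { barrier, stream_ranges, fence_old_reads /* , ... */ };
    command cmd;
    raft_topology_cmd(command c) noexcept : cmd(c) {}   // intentionally not `explicit`
};

// Callers can now pass raft_topology_cmd::command::fence_old_reads directly to a
// function taking a raft_topology_cmd parameter.
```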

We want the decommissioning node to wait before shutting down until
every node learns that it has left the token ring. Otherwise some nodes
may still try coordinating writes to that node after it has already shut
down, leading to unnecessary failures on the data path (e.g. for CL=ALL
writes).

Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.

After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds do we
allow the node to shut down.
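A condensed sketch of the coordinator's handling of the new state, pieced together from the range-diff later in this thread (error handling and committing the topology mutation are elided; across revisions of the series the barrier call changed from `barrier` to `global_token_metadata_barrier`):

```
// Sketch (coordinator side): in left_token_ring, first make sure every node
// has observed that the decommissioned node no longer owns tokens, then move
// the node to the `left` state.
case node_state::left_token_ring: {
    // Wait until other nodes observe the new token ring and stop sending
    // writes to this node.
    node = co_await global_token_metadata_barrier(std::move(node));

    topology_mutation_builder builder(node.guard.write_timestamp());
    builder.with_node(node.id)
           .set("node_state", node_state::left);
    // ... commit the mutation through the topology state machine ...
    break;
}
```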
Currently the decommissioned node waits until it observes that it was
moved to the `left` state, then proceeds to leave group 0 and shut down.

Unfortunately, this strategy won't work once we introduce banning nodes
that are in `left` state - there is no guarantee that the
decommissioning node will observe that it entered `left` state. The
replication of Raft commands races with the ban propagating through the
cluster.

We also can't make the node shut down as soon as it observes the
`left_token_ring` state - that would defeat the purpose of
`left_token_ring`, which is to allow all nodes to observe that the node
has left the token ring before it shuts down.

We could introduce yet another state between `left_token_ring` and
`left`, which the node waits for before shutting down; the coordinator
would request a barrier from the node before moving to `left` state.

The alternative - which we chose here - is to have the coordinator
explicitly tell the node to shut down while we're in `left_token_ring`
through a direct RPC. We introduce
`raft_topology_cmd::command::shutdown` and send it to the node while in
`left_token_ring` state, after we requested a cluster barrier.

We don't require the RPC to succeed; we need to allow it to fail to
preserve availability. This is because an earlier incarnation of the
coordinator may have requested the node to shut down already, so the
new coordinator will fail the RPC as the node is already dead. This also
improves availability in general - if the node dies while we're in
`left_token_ring`, we can proceed.

We don't lose safety from that, since we'll ban the node (later commit).
We only lose a bit of user experience if there's a failure at this
decommission step - the decommissioning node may hang, never receiving
the RPC (it will be necessary to shut it down manually).
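On the decommissioning node's side, a minimal sketch of waiting for that command, assuming the `_shutdown_request_promise` member added later in this series (the helper name here is hypothetical; the handler that fulfills the promise is visible in the range-diff below):

```
// Hypothetical helper: during decommission, instead of waiting to observe the
// `left` state (which a banned node may never see), wait for the coordinator's
// explicit shutdown command, which fulfills _shutdown_request_promise.
future<> storage_service::wait_for_decommission_shutdown_request() {
    _shutdown_request_promise.emplace();
    co_await _shutdown_request_promise->get_future();
}
```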

Another complication arising from banning the node is that it won't be
able to leave group 0 on its own; by the time it tries that, it may have
already been banned by the cluster (the coordinator moves the node to
`left` state after telling it to shut down). So we get rid of the
`leave_group0` step from `raft_decommission()` (which simplifies the
function too), putting a `remove_from_raft_config` inside the
coordinator code instead - after we told the node to shut down.
(Removing the node from the configuration is another reason why we need
to allow the above RPC to fail; the node won't be able to handle the
request once it's outside the configuration, because it handles all
coordinator requests by starting a read barrier.)

Finally, a complication arises when the coordinator is the
decommissioning node. The node would shut down in the middle of handling
the `left_token_ring` state, leading to harmless but awkward errors even
though there were no node/network failures (the original coordinator
would fail the `left_token_ring` state logic; a new coordinator would take
over and do it again, this time succeeding). We fix that by checking if
we're the decommissioning node at the beginning of `left_token_ring`
state handler, and if so, stepping down from leadership by becoming a
nonvoter first.
The "tell the node to shut down" RPC would fail every time in the
removenode path (since the node is dead), which is kind of awkward.

Besides, for removenode we don't really need the `left_token_ring`
state: we don't need to coordinate with the node - writes destined for
it are failing anyway (since it's dead), and we can ban the node
immediately.

Remove the node from group 0 while in the `write_both_read_new` transition
state (even when we implement abort, in this state it's too late to
abort; we're committed to removing the node, so it's fine to remove it
from group 0 at this point).

Pause one of the nodes and, once it's marked as DOWN, remove it from the
cluster.
cluster.

Check that it is not able to perform queries once it unpauses.

`server_sees_others` and similar functions periodically call
`get_alive_endpoints`. The period was `.1` seconds; increase it to `.5`
to reduce the log spam (I checked empirically that `.5` seconds is
usually how long it takes in dev mode on my laptop).

@kbr-scylla
Contributor Author

v6:

  • messaging_service now takes the host_id in its constructor instead of requiring a separate set_host_id call; the host_id inside is non-optional; I moved it into the messaging_service::config struct.
  • the above required a minor move inside main.cc, to start messaging_service after system_keyspace. The first commit does that.
  • rebase
  • use global_token_metadata_barrier (introduced in fencing PR that was merged in the meantime) instead of regular barrier in left_token_ring
  • split the addition of the get_cql() helper in manager_client.py into a separate commit

Range-diff:

```
 1:  b2d3da5622 <  -:  ---------- messaging_service: store the node's host ID
 -:  ---------- >  1:  7f3ad6bd25 main: move messaging_service init after system_keyspace init
 -:  ---------- >  2:  a78cc17bd4 messaging_service: don't use parameter defaults in constructor
 -:  ---------- >  3:  87f65d01b8 messaging_service: store the node's host ID
 2:  5b50d145d3 !  4:  95c726a8df messaging_service: exchange host IDs and map them to connections
    @@ message/messaging_service.cc: shared_ptr<messaging_service::rpc_protocol_client_
              find_and_remove_client(_clients[idx], id, [] (const auto&) { return true; });
          }
      
    -+    auto my_host_id = get_my_host_id();
    ++    auto my_host_id = _cfg.id;
          auto broadcast_address = utils::fb_utilities::get_broadcast_address();
          bool listen_to_bc = _cfg.listen_on_broadcast_address && _cfg.ip != broadcast_address;
          auto laddr = socket_address(listen_to_bc ? broadcast_address : _cfg.ip, 0);
    @@ message/messaging_service.cc: shared_ptr<messaging_service::rpc_protocol_client_
     
      ## message/messaging_service.hh ##
     @@ message/messaging_service.hh: class messaging_service : public seastar::async_sharded_service<messaging_servic
    +     std::vector<scheduling_info_for_connection_index> _scheduling_info_for_connection_index;
          std::vector<tenant_connection_index> _connection_index_for_tenant;
    -     std::optional<locator::host_id> _my_host_id;
      
     +    struct connection_ref;
     +    std::unordered_multimap<locator::host_id, connection_ref> _host_connections;
     +
    +     future<> shutdown_tls_server();
    +     future<> shutdown_nontls_server();
          future<> stop_tls_server();
    -     future<> stop_nontls_server();
    -     future<> stop_client();
 3:  fccd910be4 !  5:  8cf47d76a4 messaging_service: implement host banning
    @@ message/messaging_service.hh: class messaging_service : public seastar::async_sh
          std::unordered_multimap<locator::host_id, connection_ref> _host_connections;
     +    std::unordered_set<locator::host_id> _banned_hosts;
      
    -     future<> stop_tls_server();
    -     future<> stop_nontls_server();
    +     future<> shutdown_tls_server();
    +     future<> shutdown_nontls_server();
     @@ message/messaging_service.hh: class messaging_service : public seastar::async_sharded_service<messaging_servic
          future<table_schema_version> send_schema_check(msg_addr, abort_source&);
      
 4:  7fd7e9ff10 <  -:  ---------- raft topology: `raft_topology_cmd` implicit constructor
 -:  ---------- >  6:  c94c07804d raft topology: `raft_topology_cmd` implicit constructor
 5:  fb9a6c94fb !  7:  b8ddfd9ef9 raft topology: introduce `left_token_ring` state
    @@ service/storage_service.cc: future<> storage_service::topology_state_load(cdc::g
                      on_fatal_internal_error(slogger, ::format("Unexpected state {} for node {}", rs.state, id));
                  }
     @@ service/storage_service.cc: class topology_coordinator {
    -                     builder.del_transition_state()
    +                            .set_version(_topo_sm._topology.version + 1)
                                 .with_node(node.id)
                                 .del("tokens")
     -                           .set("node_state", node_state::left);
    @@ service/storage_service.cc: class topology_coordinator {
                      break;
     +            case node_state::left_token_ring: {
     +                // Wait until other nodes observe the new token ring and stop sending writes to this node.
    -+                // FIXME: change `barrier` to a new command which will wait until existing writes are drained.
    -+                bool exec_command_res;
    -+                std::tie(node, exec_command_res) = co_await exec_global_command(
    -+                        std::move(node), raft_topology_cmd::command::barrier, false);
    -+                if (!exec_command_res) {
    -+                    break;
    ++                {
    ++                    auto id = node.id;
    ++                    auto f = co_await coroutine::as_future(global_token_metadata_barrier(std::move(node)));
    ++                    if (f.failed()) {
    ++                        slogger.error("raft topology: node_state::left_token_ring (node: {}), "
    ++                                      "global_token_metadata_barrier failed, error {}",
    ++                                      id, f.get_exception());
    ++                        break;
    ++                    }
    ++                    node = std::move(f).get();
     +                }
     +
     +                topology_mutation_builder builder(node.guard.write_timestamp());
    @@ service/storage_service.cc: class topology_coordinator {
                  case node_state::decommissioning:
                  case node_state::removing:
     @@ service/storage_service.cc: future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(shar
    -                         result.status = raft_topology_cmd_result::command_status::success;
    -                     }
    -                     break;
    -+                    case node_state::left_token_ring:
    -                     case node_state::left:
    -                     case node_state::none:
    -                     case node_state::removing:
    +                     result.status = raft_topology_cmd_result::command_status::success;
    +                 }
    +                 break;
    ++                case node_state::left_token_ring:
    +                 case node_state::left:
    +                 case node_state::none:
    +                 case node_state::removing:
     
      ## service/topology_state_machine.cc ##
     @@ service/topology_state_machine.cc: static std::unordered_map<node_state, sstring> node_state_to_name_map = {
 6:  201f66a09c !  8:  977680773b raft topology: prepare decommission path for node banning
    @@ service/storage_service.cc: class topology_coordinator {
     +                }
     +
                      // Wait until other nodes observe the new token ring and stop sending writes to this node.
    -                 // FIXME: change `barrier` to a new command which will wait until existing writes are drained.
    -                 bool exec_command_res;
    +                 {
    +                     auto id = node.id;
     @@ service/storage_service.cc: class topology_coordinator {
    -                     break;
    +                     node = std::move(f).get();
                      }
      
     +                // Tell the node to shut down.
    @@ service/storage_service.cc: future<> storage_service::decommission() {
                          // There's nothing smarter we could do. We should not continue operating in this broken
                          // state (we're not a member of the token ring any more).
     @@ service/storage_service.cc: future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(shar
    -                     //co_await sleep_abortable(_db.local().get_config().read_request_timeout_in_ms() * std::chrono::milliseconds(1), _abort_source);
    -                     result.status = raft_topology_cmd_result::command_status::success;
    +                 result.status = raft_topology_cmd_result::command_status::success;
                      break;
    -+                case raft_topology_cmd::command::shutdown:
    -+                    if (_shutdown_request_promise) {
    -+                        std::exchange(_shutdown_request_promise, std::nullopt)->set_value();
    -+                    } else {
    -+                        slogger.warn("raft topology: got shutdown request while not decommissioning");
    -+                    }
    -+                break;
                  }
    -         } catch (...) {
    -             slogger.error("raft topology: raft_topology_cmd failed with: {}", std::current_exception());
    ++            case raft_topology_cmd::command::shutdown:
    ++                if (_shutdown_request_promise) {
    ++                    std::exchange(_shutdown_request_promise, std::nullopt)->set_value();
    ++                } else {
    ++                    slogger.warn("raft topology: got shutdown request while not decommissioning");
    ++                }
    ++                break;
    +         }
    +     } catch (...) {
    +         slogger.error("raft topology: raft_topology_cmd failed with: {}", std::current_exception());
     
      ## service/storage_service.hh ##
     @@ service/storage_service.hh: class storage_service : public service::migration_listener, public gms::i_endpoi
    +     std::optional<shared_future<>> _decomission_result;
          std::optional<shared_future<>> _rebuild_result;
          std::unordered_map<raft::server_id, std::optional<shared_future<>>> _remove_result;
    - 
     +    // During decommission, the node waits for the coordinator to tell it to shut down.
     +    std::optional<promise<>> _shutdown_request_promise;
    -+
    -     future<raft_topology_cmd_result> raft_topology_cmd_handler(sharded<db::system_distributed_keyspace>& sys_dist_ks, raft::term_t term, const raft_topology_cmd& cmd);
    - 
    -     future<> raft_bootstrap(raft::server&);
    +     struct {
    +         raft::term_t term{0};
    +         uint64_t last_index{0};
     
      ## service/topology_state_machine.cc ##
     @@ service/topology_state_machine.cc: std::ostream& operator<<(std::ostream& os, const raft_topology_cmd::command& cmd
    -         case raft_topology_cmd::command::fence_old_reads:
    -             os << "fence_old_reads";
    +         case raft_topology_cmd::command::fence:
    +             os << "fence";
                  break;
     +        case raft_topology_cmd::command::shutdown:
     +            os << "shutdown";
    @@ service/topology_state_machine.cc: std::ostream& operator<<(std::ostream& os, co
     
      ## service/topology_state_machine.hh ##
     @@ service/topology_state_machine.hh: struct raft_topology_cmd {
    -           barrier,         // request to wait for the latest topology
    -           stream_ranges,   // reqeust to stream data, return when streaming is
    -                            // done
    --          fence_old_reads  // wait for all reads started before to complete
    -+          fence_old_reads, // wait for all reads started before to complete
    -+          shutdown         // a decommissioning node should shut down
    +           barrier_and_drain,    // same + drain requests which use previous versions
    +           stream_ranges,        // reqeust to stream data, return when streaming is
    +                                 // done
    +-          fence                 // erect the fence against requests with stale versions
    ++          fence,                // erect the fence against requests with stale versions
    ++          shutdown,             // a decommissioning node should shut down
            };
            command cmd;
      
 7:  f7e56883bd !  9:  737c1b4ae6 raft topology: skip `left_token_ring` state during `removenode`
    @@ Commit message
     
      ## service/storage_service.cc ##
     @@ service/storage_service.cc: class topology_coordinator {
    -         co_return std::pair{retake_node(std::move(guard), node.id), res};
    +         co_return retake_node(std::move(guard), node.id);
          };
      
     +    future<> remove_from_group0(const raft::server_id& id) {
    @@ service/storage_service.cc: class topology_coordinator {
     +                    auto next_state = node.rs->state == node_state::decommissioning
     +                                        ? node_state::left_token_ring : node_state::left;
                          builder.del_transition_state()
    +                            .set_version(_topo_sm._topology.version + 1)
                                 .with_node(node.id)
                                 .del("tokens")
     -                           .set("node_state", node_state::left_token_ring);
 8:  2f53f68135 = 10:  63229e48e8 raft topology: ban left nodes
 9:  9084a493cd = 11:  e02249f0cd test: pylib: ScyllaCluster: server pause/unpause API
 -:  ---------- > 12:  ae92932240 test: pylib: manager_client: `get_cql()` helper
10:  fc5a4259d8 ! 13:  279a109ce0 test: add node banning test
    @@ Commit message
     
         Check that it is not able to perform queries once it unpauses.
     
    - ## test/pylib/manager_client.py ##
    -@@ test/pylib/manager_client.py: class ManagerClient():
    -             logger.debug("refresh driver node list")
    -             self.ccluster.control_connection.refresh_node_list_and_token_map()
    - 
    -+    def get_cql(self) -> CassandraSession:
    -+        assert self.cql
    -+        return self.cql
    -+
    -     async def before_test(self, test_case_name: str) -> None:
    -         """Before a test starts check if cluster needs cycling and update driver connection"""
    -         logger.debug("before_test for %s", test_case_name)
    -
      ## test/topology_experimental_raft/test_node_isolation.py (new) ##
     @@
     +#
11:  70bd636862 = 14:  b38dcba6ed test: pylib: increase checking period for `get_alive_endpoints`
```

@kbr-scylla kbr-scylla requested a review from avikivity June 20, 2023 11:33

@kbr-scylla
Contributor Author

CI state ABORTED - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1946/

[2023-06-20T12:11:35.012Z] ERROR: DTest failed: https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/2354/
[2023-06-20T12:11:38.994Z] Finished: ABORTED

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/2354/

hudson.AbortException: Spot termination

rekicking


@kbr-scylla
Contributor Author

@avikivity ready for merging again :)

@kbr-scylla
Contributor Author

@avikivity ping

@scylladb-promoter scylladb-promoter merged commit 8576502 into scylladb:master Jun 22, 2023