There is no way to wait until the UP/NORMAL states of other nodes are processed #12015
Comments
cc @nyh, you might find this interesting.
I think this is also important for a rolling upgrade in a K8s environment: you don't want to continue to the next K8s node drain before a Scylla node is truly up, communicating, and doing work.
Indeed. More generally, if you want to "automate the administrator", this cannot be done correctly unless we have a reliable way to query a node's status regarding how it views other nodes.
@kbr- thanks, this is interesting because it may explain a few, though not all, of the failures we saw when trying to eliminate the silly sleeps in dtest (https://github.com/scylladb/scylla-dtest/pull/2455).

While I agree with you that we could add all sorts of operations (REST API, nodetool, etc.) that check whether node X is aware of Y, and so on, I think there is a better solution. I think we need to uphold the requirement that: node X must not accept CQL requests until it is sure it can process them.

Here this means that either node X should not listen for CQL requests until it knows it can perform them (including topology-change requests), or, if X receives a request that it temporarily can't perform (because gossip hasn't finished), it can wait until it can, or forward the request to a different node if that's possible. This is, for example, what we do for nodes joining a cluster: they receive CQL requests, but if they receive a read request, the request is sent to replicas which hold the data, and the node doesn't attempt to (incorrectly) respond from its own data.

I think adding additional synchronization requests would be a mistake. If we add those, we'll need to start using them in test.py, in dtest, in Kubernetes, and who knows where else. We'll need to document this weird new request and explain it. Everything will be so much simpler if we just had the invariant: if a node accepts a CQL request, then it will work. That's it.

UPDATE: I got confused a bit above; we're talking about topology change requests, not CQL requests (we would have CQL requests for schema changes, not topology changes). So maybe we need to delay not CQL requests but some other type of requests (nodetool?). But I think my request that a node refuse to accept a request that it still can't perform is the best solution.
In K8s we are struggling without this invariant. Our tests are flaky because requests time out or we get errors even when our readiness check passes. Currently we qualify a node as being ready to serve requests based on whether the node itself reported UN.

We already proposed such an endpoint in #8275 but it doesn't have much traction. We haven't checked the suggested endpoint though; we are still using the above ones. We don't have anything for topology changes; we do it once all nodes report themselves (we don't cross-check) as UN. An endpoint for topology readiness checks would be beneficial to us.
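For illustration, a minimal sketch of this kind of readiness check (not the operator's actual probe): it asks the node's own REST API whether its gossip view considers every expected peer live. The endpoint paths and response shapes are assumptions to verify against the node's API documentation; and, as this thread explains, even a passing check like this does not prove the node has finished processing its peers' states.

```python
# Readiness-check sketch: query the local Scylla REST API (default
# port 10000) and exit 0 only if every expected peer is listed as
# live and none as down in this node's own gossip view.
import json
import sys
import urllib.request

API = "http://127.0.0.1:10000"

def get_json(path):
    # Endpoint paths are assumptions; check them on your Scylla version.
    with urllib.request.urlopen(API + path, timeout=5) as resp:
        return json.load(resp)

def ready(expected_peers):
    live = set(get_json("/gossiper/endpoint/live/"))
    down = set(get_json("/gossiper/endpoint/down/"))
    return expected_peers <= live and not (expected_peers & down)

if __name__ == "__main__":
    peers = set(sys.argv[1:])  # expected peer IPs passed as arguments
    sys.exit(0 if ready(peers) else 1)
```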
and some non-CQL requests too (topology change requests are done through REST API endpoints), but yes, this should probably be the way. BTW, Raft-based topology changes can solve this:
This will result in what @nyh is proposing.
Well, obviously you cannot have such a guarantee, but you can try to guarantee something like:
Agree with adding a REST endpoint to improve the test. Re having the CQL server (and Alternator) wait for stability before listening: that seems like a good idea, but we need to think about it carefully. Especially if we move administrative functions to CQL, we can end up losing the ability to fix a problem. Perhaps we need several CQL listeners (on Unix-domain sockets): a local administrative listener, and a local pre-boot listener that is started very early (like a serial console for servers).
This was also proposed in the context of maintenance mode (#5489): have a separate CQL port for the local administrator.
I don't get why you need the loop to check the STATUSes. In raft-based topology, STATUS is held in system.token_metadata and managed by group0, so handle_state_normal() is not called by the gossiper anymore, but by the group0 state machine. When a topology change starts, it alone has the right to change the STATUS. It is a logical error for STATUS to not be NORMAL for all nodes. The previous topology change should either complete or abort.
Fixed in #12540 (already merged).
Using the mechanism from #12540 does not guarantee that
Recently
A full solution could work like this:
@DoronArazii - we probably want to have #12540 backported to 5.2 and 5.1, even if it's not a complete solution.
@scylladb/scylla-maint please consider a backport.
Why is this still open, @kbr-scylla? What needs to be done?
ping @kbr-scylla
The problem described in the original post does not occur in raft topology mode: you can safely start a new topology operation as soon as the previous one has finished. Even better, you can queue up operations; the coordinator will handle them in a serialized fashion.

Almost all our test.py tests are running in raft topology mode now. The ones that don't still use their old workarounds. In any case, we no longer struggle with the problems described in the original issue.

Nevertheless, we should probably create a tool for administrators to communicate with the new topology coordinator to check committed node statuses and so on. It would be useful for observability, debugging and whatnot. But creating such a tool deserves a separate issue.

So, I'm closing this one.
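To illustrate what "no sleeps between operations" looks like in practice, here is a rough test.py-style sketch. The ManagerClient helper names (servers_add, server_add, decommission_node) are assumptions modeled on the framework's helpers and may differ from the actual API; the point is only that each operation can be issued as soon as the previous call returns, because the topology coordinator serializes them.

```python
# Sketch of a topology test under raft-based topology: operations are
# submitted back-to-back with no "sleep and pray" between them.
import pytest


@pytest.mark.asyncio
async def test_back_to_back_topology_ops(manager):
    # Bootstrap a small cluster (helper name is an assumption).
    servers = await manager.servers_add(3)

    # Immediately start the next operations; the coordinator queues
    # and serializes them, so no manual waiting is needed.
    await manager.decommission_node(servers[0].server_id)
    await manager.server_add()
    await manager.decommission_node(servers[1].server_id)
```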
Why is this still marked as a backport candidate? Is any work remaining?
Not that I know of. Removing the labels.
In test.py topology tests we often perform topology operations as soon as all nodes have finished bootstrapping or restarting. Such topology operations race with the gossiper, which may be in the middle of notifying different Scylla modules that some other node has entered NORMAL state or that it has restarted. This may disrupt the topology operations.

Examples of failures:

- storage_service::handle_state_normal is called. This function, among other things, may cause certain connections to be dropped and recreated anew. This races with and interrupts a decommission operation.
- stream_manager::on_restart was called, which closed all existing stream sessions to this peer, interrupting the removenode operation.

We could take the approach that dtests are taking: "sleep and pray" (pray that everything is ready once we finish sleeping). But:

Instead, there should be a way of waiting until node X notices that Y is UP and NORMAL, and X finishes all processing related to this fact (i.e. all gossiper notifications, storage_service::handle_state_normal, etc.). It seems that this may require introducing some additional state for each peer (to answer the question: "have I finished processing this peer?") and, for example, a REST API endpoint to query this state.
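For illustration, a test-side wait built on such an endpoint could look roughly like the sketch below. The endpoint path and its response format are hypothetical; they do not exist today, and adding something like them is exactly what this issue asks for.

```python
# Sketch: block until node X reports that it has finished processing
# peer Y as UP/NORMAL.  The "/storage_service/peer_processed" endpoint
# is hypothetical (proposed, not existing).
import time
import urllib.request


def wait_until_peer_processed(node_api: str, peer_ip: str, timeout: float = 60.0) -> None:
    deadline = time.monotonic() + timeout
    url = f"{node_api}/storage_service/peer_processed?peer={peer_ip}"
    while time.monotonic() < deadline:
        with urllib.request.urlopen(url, timeout=5) as resp:
            # Assumed response: the literal string "true" once all
            # gossiper notifications and handle_state_normal finished.
            if resp.read().strip() == b"true":
                return
        time.sleep(0.5)
    raise TimeoutError(f"{node_api} did not finish processing {peer_ip}")
```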
Note: our documentation says that one should check that all other nodes are UP and NORMAL before performing a topology operation, using nodetool status. This works in practice, because in practice the administrator is slower than Scylla's internal processing of other nodes' states. But in theory this is incorrect.

I checked how nodetool status deduces that a node is NORMAL: it checks whether the node is joining or leaving; if neither is true, it assumes the node is normal. This doesn't guarantee that storage_service::handle_state_normal and similar functions have finished. I'm not even sure whether it guarantees that the gossiper endpoint_state for this node actually holds the NORMAL status. Is it always true that ~joining && ~leaving implies normal? Even if it were true, this is not good enough for tests, which need a stronger guarantee (that all processing has finished).
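For illustration only, the deduction boils down to something like the snippet below (not the actual nodetool code). It shows why a passing check is a weak barrier: nothing in it says whether the node has finished reacting to the peer's state.

```python
# Sketch of the "~joining && ~leaving implies normal" deduction.
# Even if this returns True for every peer, handle_state_normal and
# other gossiper notifications may still be running on the node.
def assumed_normal(peer: str, joining_nodes: set[str], leaving_nodes: set[str]) -> bool:
    return peer not in joining_nodes and peer not in leaving_nodes
```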