
There is no way to wait until the UP/NORMAL states of other nodes are processed #12015

Closed
kbr- opened this issue Nov 17, 2022 · 21 comments · Fixed by #12540

kbr- (Contributor) commented Nov 17, 2022

In test.py topology tests we often perform topology operations as soon as all nodes have finished bootstrapping or restarting. Such topology operations race with the gossiper, which may still be in the middle of notifying different Scylla modules that some other node has entered the NORMAL state or that it has restarted. This may disrupt the topology operations.

Examples of failures:

We could take the approach that dtests are taking: "sleep and pray" (pray that everything is ready once we finish sleeping).
But:

  • sleeping may wait longer than necessary, while we want our tests to be as fast as possible - the test.py tests are part of regular development
  • sleeping does not guarantee that everything has been processed.

Instead, there should be a way of waiting until node X notices that Y is UP and NORMAL, and X finishes all processing related to this fact (i.e. all gossiper notifications, storage_service::handle_state_normal, etc.).

It seems that this may require introducing some additional state for each peer (to answer the question: "have I finished processing this peer?") and, for example, a REST API endpoint to query this state.
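
For illustration, a test-side helper could poll such an endpoint until the peer is fully processed. This is only a sketch: the endpoint path (/storage_service/peer_state_processed) and its response shape are hypothetical placeholders for whatever REST API ends up exposing the proposed per-peer state; only the default REST API port (10000) is real.

```python
import time

import requests


def wait_for_peer_processed(node_api: str, peer_ip: str, timeout: float = 60.0) -> None:
    """Poll node X's REST API until it reports that it has finished all processing
    related to peer Y being UP and NORMAL (gossiper notifications, handle_state_normal)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Hypothetical endpoint: returns {"processed": true} once the node has
        # finished handling the peer's NORMAL state.
        resp = requests.get(
            f"http://{node_api}:10000/storage_service/peer_state_processed",
            params={"peer": peer_ip}, timeout=5)
        if resp.ok and resp.json().get("processed"):
            return
        time.sleep(0.1)
    raise TimeoutError(f"{node_api} did not finish processing the state of {peer_ip}")
```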

Note: our documentation says that one should check that all other nodes are UP and NORMAL before performing a topology operation, using nodetool status. This works in practice, because the administrator is typically slower than Scylla's internal processing of other nodes' states. But in theory this is incorrect. I checked how nodetool status deduces that a node is NORMAL: it checks whether the node is joining or leaving; if neither is true, it assumes the node is normal. This doesn't guarantee that storage_service::handle_state_normal and similar functions have finished. I'm not even sure whether it guarantees that the gossiper endpoint_state for this node actually holds the NORMAL status. Is it always true that ~joining && ~leaving implies normal? Even if it were true, this is not good enough for tests, which need a stronger guarantee (that all processing has finished).

kbr- added the area/test label Nov 17, 2022
kbr- (Contributor, Author) commented Nov 17, 2022

cc @nyh you might find this interesting.

mykaul (Contributor) commented Nov 17, 2022

I think this is also important for a rolling upgrade in a K8s environment - you don't want to continue to the next K8s node drain before a Scylla node is truly up, communicating and doing work.

kbr- (Contributor, Author) commented Nov 17, 2022

Indeed. Or more generally, if you want to "automate the administrator", this cannot be done correctly unless we have a reliable way to query how a node views the status of other nodes.

nyh (Contributor) commented Nov 17, 2022

@kbr- thanks, this is interesting because it may explain a few, though not all, of the failures we saw when trying to eliminate the silly sleeps in dtest (https://github.com/scylladb/scylla-dtest/pull/2455).

While I agree with you that we could add all sorts of operations (REST API, nodetool, etc.) that check whether node X is aware of Y, and so on, I think there is a better solution: we need to uphold the requirement that:

Node X must not accept CQL requests until it is sure it can process them

Here it means that either node X should not listen for CQL requests until it knows it can perform them (including topology-change requests), or that if X receives a request it temporarily can't perform (because the gossip hasn't finished), it waits until it can - or sends the request to a different node if that's possible. This is, for example, what we do for nodes joining a cluster - they receive CQL requests, but if they receive a read request, it is sent to replicas which hold the data, and the node doesn't attempt to (incorrectly) respond from its own data.

I think adding additional synchronization requests would be a mistake. If we add those, we'll need to start using them in test.py, in dtest, in Kubernetes, and who knows where else. We'll need to document this weird new request and explain it. Everything will be so much simpler if we just had the invariant: If a node accepts a CQL request, then it will work. That's it.

UPDATE: I got a bit confused above - we're talking about topology change requests, not CQL requests (CQL requests would be used for schema changes, not topology changes). So maybe we need to delay not CQL requests but some other type of request (nodetool?). But I still think that having a node refuse to accept a request it can't yet perform is the best solution.

zimnx commented Nov 17, 2022

> Everything will be so much simpler if we just had the invariant: If a node accepts a CQL request, then it will work. That's it.

In K8s we are struggling without this invariant. Our tests are flaky because requests time out or return errors even when our readiness check passes.

Currently we qualify a node as ready to serve requests based on whether the node itself reports UN via GET /gossiper/endpoint/live/ and GET /storage_service/host_id, and whether the CQL port has been opened (GET /storage_service/native_transport).
If one of these conditions fails, we remove the node from the passthrough load balancer, and requests to it have nowhere to go.
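
A minimal sketch of such a readiness probe, using only the endpoints mentioned above (10000 is the default Scylla REST API port; the exact response shapes and the way they are combined here are assumptions, not the actual operator code):

```python
import requests


def node_ready(host: str, node_ip: str) -> bool:
    """Approximate readiness check: the node reports itself live, knows its host ID,
    and has opened the CQL (native transport) port."""
    base = f"http://{host}:10000"
    try:
        live = requests.get(f"{base}/gossiper/endpoint/live/", timeout=5).json()
        host_id = requests.get(f"{base}/storage_service/host_id", timeout=5).json()
        cql = requests.get(f"{base}/storage_service/native_transport", timeout=5).json()
    except (requests.RequestException, ValueError):
        return False
    # Assumed shapes: a list of live endpoints, a non-empty host-id mapping,
    # and a truthy value once the native transport is running.
    return node_ip in live and bool(host_id) and bool(cql)
```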

We already proposed such an endpoint in #8275, but it hasn't gotten much traction. We haven't checked the suggested endpoint though; we are still using the ones above.

We don't have anything for topology changes; we perform them once all nodes report themselves (we don't cross-check) as UN. An endpoint for topology readiness checks would be beneficial to us.

kbr- (Contributor, Author) commented Nov 17, 2022

> Node X must not accept CQL requests until it is sure it can process them

and some non-CQL requests too (topology change requests are done through REST API endpoints), but yes, this should probably be the way.

BTW. Raft-based topology changes can solve this:

  • we move the STATUS of peers into group 0
  • move STATUS processing into state_machine::apply (apply doesn't return until processing finishes)
  • before performing a topology operation, we call raft read_barrier, which ensures that we have the latest group 0 state
  • after calling read_barrier, verify that the STATUS of all peers is NORMAL, otherwise call read_barrier again, and so on (sketched below)

This will result in what @nyh is proposing.
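
Conceptually, the loop from the last step could look like the following pseudocode. This is not a real API: group0.read_barrier() and group0.peer_status() are hypothetical bindings standing in for the group 0 state machine internals.

```python
def wait_for_all_peers_normal(group0, peers) -> None:
    """Keep issuing read barriers until the locally applied group 0 state
    shows every peer's STATUS as NORMAL."""
    while True:
        group0.read_barrier()  # ensure we have applied the latest group 0 state
        if all(group0.peer_status(p) == "NORMAL" for p in peers):
            return  # safe to start the topology operation
        # some peer's STATUS change has not been applied yet; repeat the barrier
```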

kbr- (Contributor, Author) commented Nov 17, 2022

> If a node accepts a CQL request, then it will work

Well, obviously you cannot have such a guarantee, but you can try to guarantee something like:
If a node accepts a CQL request, and there is no subsequent failure, the request will work
(although it depends on how you define "failure"; e.g. is a temporary increase in network latency a failure or not?)

avikivity (Member) commented

Agree with adding a REST endpoint to improve the test. Re having the CQL server (and Alternator) wait for stability before listening, that seems like a good idea, but we need to think about it carefully. Especially if we move administrative functions to CQL, we could end up losing the ability to fix a problem.

Perhaps we need several CQL listeners (on unix-domain sockets): a local administrative listener, and a local pre-boot listener that is started very early (like a serial console for servers).

denesb (Contributor) commented Nov 18, 2022

> Agree with adding a REST endpoint to improve the test. Re having the CQL server (and Alternator) wait for stability before listening, that seems like a good idea, but we need to think about it carefully. Especially if we move administrative functions to CQL, we could end up losing the ability to fix a problem.
>
> Perhaps we need several CQL listeners (on unix-domain sockets): a local administrative listener, and a local pre-boot listener that is started very early (like a serial console for servers).

This was also proposed in the context of the maintenance mode (#5489): have a separate CQL port for the local administrator.

tgrabiec (Contributor) commented Jan 13, 2023

> BTW. Raft-based topology changes can solve this:
>
>   • we move the STATUS of peers into group 0
>   • move STATUS processing into state_machine::apply (apply doesn't return until processing finishes)
>   • before performing a topology operation, we call raft read_barrier, which ensures that we have the latest group 0 state
>   • after calling read_barrier, verify that the STATUS of all peers is NORMAL, otherwise call read_barrier again, and so on

I don't get why you need the loop to check the STATUSes. In raft-based topology, STATUS is held in system.token_metadata and managed by group 0, so handle_state_normal() is not called by the gossiper anymore but by the group 0 state machine. When a topology change starts, it alone has the right to change the STATUS. It is a logical error for STATUS to not be NORMAL for all nodes. The previous topology change should either complete or abort.

alecco (Contributor) commented May 4, 2023

Fixed in #12540 (already merged)

@alecco alecco closed this as completed May 4, 2023
kbr-scylla (Contributor) commented

Using the mechanism from #12540 does not guarantee that handle_state_normal has finished though.

Recently storage_service::wait_for_normal_state_handled_on_boot was introduced. It is used internally by Scylla on boot. However, since it works only on boot, it doesn't help for the case of another node restarting.

A full solution could work like this:

  • define some per-node value that is incremented on each restart and exchanged with other nodes. I think the gossiper generation_number is such a value, so I'll assume we use it below
  • in handle_state_normal, remember the greatest generation_number for each node that was handled by this function. We could extend the existing std::unordered_set<gms::inet_address> _normal_state_handled_on_boot by turning it into unordered_map<gms::inet_address, generation_type> for example.
  • expose a REST API that allows querying the data structure from previous point
  • in the test, after restarting a node, fetch its generation_number (there's already a REST API for that IIRC); then on another node, check whether it has handled the normal state for the restarted node with this or a greater number. If not, sleep, retry, and so on (see the sketch below).
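
A rough test-side sketch of the last step. The generation-number endpoint path and the handled-state endpoint (/storage_service/normal_state_handled) are assumptions - the latter is the hypothetical REST API proposed above - and the response shapes are guesses:

```python
import time

import requests


def wait_until_normal_handled(observer: str, restarted_ip: str, timeout: float = 60.0) -> None:
    """After restarting `restarted_ip`, wait until `observer` has run
    handle_state_normal for it with the new gossiper generation number."""
    # Fetch the restarted node's current generation number (endpoint path assumed).
    gen = requests.get(
        f"http://{restarted_ip}:10000/gossiper/generation_number/{restarted_ip}",
        timeout=5).json()
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Hypothetical endpoint exposing the proposed
        # unordered_map<gms::inet_address, generation_type> of handled peers.
        handled = requests.get(
            f"http://{observer}:10000/storage_service/normal_state_handled",
            timeout=5).json()  # assumed shape: {"<ip>": <generation>, ...}
        if handled.get(restarted_ip, -1) >= gen:
            return
        time.sleep(0.1)
    raise TimeoutError(f"{observer} did not handle NORMAL state of {restarted_ip} (generation {gen})")
```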

@kbr-scylla kbr-scylla reopened this May 4, 2023
@kostja kostja modified the milestones: 5.3, 5.4 May 12, 2023
@kostja kostja assigned kbr-scylla and unassigned alecco Jul 17, 2023
mykaul added the Backport candidate and backport/5.2 labels Aug 28, 2023
mykaul (Contributor) commented Aug 28, 2023

@DoronArazii - we probably want #12540 backported to 5.2 and 5.1, even if it's not a complete solution.

DoronArazii commented

@scylladb/scylla-maint please consider backport

kbr-scylla (Contributor) commented

#12540 and #13240 need to go together

kbr-scylla (Contributor) commented

Queued #12540 and #13240 to next-5.1 and next-5.2

DoronArazii commented

Why is it open, @kbr-scylla? What needs to be done?

mykaul (Contributor) commented Oct 17, 2023

> Why is it open, @kbr-scylla? What needs to be done?

ping @kbr-scylla

kbr-scylla (Contributor) commented

The problem described in the original post does not occur in raft topology mode: you can safely start a new topology operation as soon as the previous one has finished. Even better, you can queue up operations; the coordinator will handle them in a serialized fashion.

Almost all of our test.py tests run in raft-topology mode now. The ones that don't use their old workarounds. In any case, we no longer struggle with the problems described in the original issue.

Nevertheless, we should probably create a tool for administrators to communicate with the new topology coordinator to check committed node statuses and so on. It would be useful for observability, debugging and whatnot.

But creating such a tool deserves a separate issue.

So, I'm closing this one.

avikivity (Member) commented

Why is this still marked as a backport candidate? Is any work remaining?

kbr-scylla removed the Backport candidate, backport/5.2 and Requires-Backport-to-5.1 labels Nov 2, 2023
kbr-scylla (Contributor) commented

Not that I know of. Removing the labels.
