Prevent isolated/decommissioned nodes handle kafka API requests #7829
resolved
/ci-repeat 5
@mmaslankaprv @VadimPlh i have a feeling
@andrewhsu thank you, sorry for this
/ci-repeat 10 skip-units dt-repeat=32 tests/rptest/tests/isolated_decommissioned_node_test.py
/ci-repeat 10
i've created another draft PR #8369 that runs the
@VadimPlh @mmaslankaprv fyi the buildkite job went green (from draft PR #8369 (comment)): that used this PR's codebase (plus a dummy commit)
If a node knows the leader of the controller, the node is not isolated.
New setting node_isolation_heartbeat_timeout: how long a node waits after the last heartbeat request before considering itself isolated.
Metadata_cache now contains a bool flag that signals whether the node is isolated. The flag is updated by a new sharded service and cached inside the metadata cache.
Returns a list of all nodes in the cluster.
New service communicates with health_monitor and node_status_table to determine whether the node is isolated. For now we have 3 different signals about node communication:
* Health_monitor
* node status table
* raft0 has a leader
If none of them has been updated for a long time, the node may be isolated.
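The combination of the three signals can be sketched as follows. This is only an illustration in Python, not the C++ code from this PR: `IsolationSignals`, `is_isolated`, and the single shared `timeout_s` are hypothetical names and simplifications, and the 3 second default is taken from the release notes.

```python
from dataclasses import dataclass


@dataclass
class IsolationSignals:
    """Hypothetical snapshot of the three signals (times in seconds)."""
    last_health_monitor_update: float
    last_node_status_update: float
    raft0_has_leader: bool


def is_isolated(signals: IsolationSignals, now: float, timeout_s: float = 3.0) -> bool:
    # Knowing the raft0 (controller) leader is treated as proof of connectivity.
    if signals.raft0_has_leader:
        return False
    health_stale = now - signals.last_health_monitor_update > timeout_s
    status_stale = now - signals.last_node_status_update > timeout_s
    # The node is considered isolated only when every signal is stale.
    return health_stale and status_stale
```

Requiring all signals to be stale keeps the check conservative: a single fresh signal is enough to keep the node serving requests.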
If a node is isolated or decommissioned, it cannot handle Kafka requests from clients, so we need to signal the client to communicate with another broker. To do this, we exclude the isolated node from the brokers list and return -1 for controller_id; the client will then send a metadata request to another broker and communicate with it. We also cannot report the isolated node as a partition leader. To prevent the client from getting stuck, we add a fake leader to force the client to connect to another broker.
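A rough sketch of the metadata response rewrite described in this commit, in Python rather than the PR's C++. `MetadataResponse` and `build_metadata_response` are hypothetical names, and choosing the first remaining broker as the fake leader is an assumption made for illustration; the commit message does not say how the fake leader is chosen.

```python
from dataclasses import dataclass
from typing import Dict, List

NO_CONTROLLER = -1  # controller_id returned by an isolated node, per the commit message


@dataclass
class MetadataResponse:
    brokers: List[int]
    controller_id: int
    partition_leaders: Dict[str, int]


def build_metadata_response(
    self_id: int,
    brokers: List[int],
    controller_id: int,
    partition_leaders: Dict[str, int],
    isolated: bool,
) -> MetadataResponse:
    if not isolated:
        # Healthy node: report the cluster view unchanged.
        return MetadataResponse(list(brokers), controller_id, dict(partition_leaders))
    # Isolated node: hide itself from the brokers list.
    others = [b for b in brokers if b != self_id]
    # Assumption: use the first remaining broker as the fake leader.
    fake_leader = others[0] if others else -1
    leaders = {
        tp: (fake_leader if leader == self_id else leader)
        for tp, leader in partition_leaders.items()
    }
    return MetadataResponse(others, NO_CONTROLLER, leaders)
```

For example, node 1 in a three-broker cluster would, once isolated, advertise only brokers 2 and 3, no controller, and a leader other than itself for every partition it previously led.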
Ideas
The main idea of this PR:
We need to signal to clients that a node can no longer serve requests. The simplest way to do this is through the Kafka RPC. Clients send metadata requests to get information about the cluster, and we do not want isolated nodes to return wrong information, so we use the metadata response as the signal.
For now we have several types of pings for nodes.
So, to begin with, we can check three things: that we are not communicating with other nodes through Health_monitor, that we are not receiving append_entries requests from the controller leader, and that the node is not receiving raft heartbeats from outside. If all of these hold, we can decide that we are isolated and signal the client to reconnect to another node.
RFC
Ticket
Backports Required
UX Changes
Release Notes
Features
node_isolation_raft_timeout and node_isolation_heartbeat_timeout. Both are 3 seconds by default.