New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
messaging_service: gossiper verbs should be encrypted regardless of topology knowledge #11992
Milestone
Comments
cc @xemul |
kbr-
pushed a commit
to kbr-/scylla
that referenced
this issue
Nov 16, 2022
…for GOSSIP verbs The gossip verbs are used to learn about topology of other nodes. If inter-dc/rack encryption is enabled, the knowledge of topology is necessary to decide whether it's safe to send unencrypted messages to nodes (i.e., whether the destination lies in the same dc/rack). The logic in `messaging_service::get_rpc_client`, which decided whether a connection must be encrypted, was this (given that encryption is enabled): if the topology of the peer is known, and the peer is in the same dc/rack, don't encrypt. Otherwise encrypt. However, it may happen that node A knows node B's topology, but B doesn't know A's topology. A deduces that B is in the same DC and rack and tries sending B an unencrypted message. As the code currently stands, this would cause B to call `on_internal_error`. This is what I encountered when attempting to fix scylladb#11780. To guarantee that it's always possible to deliver gossiper verbs (even if one or both sides don't know each other's topology), and to simplify reasoning about the system in general, choose connection settings that are independent of the topology - for the connection used by gossiper verbs (other connections are still topology-dependent and use complex logic to handle the situation of unknown-and-later-known topology). This connection only contains 'rare' and 'cheap' verbs, so it's not a performance problem to always encrypt it (given that encryption is configured). And this is what already was happening in the past; it was at some point removed during topology knowledge management refactors. We just bring this logic back. Fixes scylladb#11992. Inspired by xemul/scylla#45d48f3d02fd48c6d186cc955cf83d747ac080b9.
@kbr-scylla looks like a real problem, but we've never seen it in the field. How come? Is it a regression? If so, from which version? |
I only remember that this area of code was being refactored&changed quite often on |
Not backporting per above. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
When a node creates a connection to another node in
messaging_service::get_rpc_client
it decides whether the connection should be encrypted by calculating amust_encrypt
boolean (scylladb/message/messaging_service.cc
Line 718 in d85f731
Similarly, when a node receives a connection attempt on unencrypted port and inter/DC-rack encryption is enabled, the node will verify whether the connection attempt comes from an address in the same DC-rack (
messaging_service::do_start_listen
;scylladb/message/messaging_service.cc
Line 309 in d85f731
master
branch, if the node does not have the topology of the sender's address, so it's not able to verify whether it's the same DC/rack, it will callon_internal_error
(scylladb/locator/token_metadata.cc
Line 1341 in d85f731
As explained in #11942 (comment): there is no guarantee that if a node X knows the topology of node Y, Y also knows the topology of X. So it may happen that X attempts unencrypted communication with Y, but Y will reject it. I obtained a situation where a node called
on_internal_error
(described in the previous paragraph) because it received unencrypted communication attempt from a node whose topology was yet unknown when trying to fix #11780, with #11942.We need to guarantee that Y can eventually learn the topology of X. But we cannot give such guarantee if Y will reject any communication attempts from X! Perhaps Y could learn the topology of X from other nodes. But can we guarantee that Y can receive communication from other node? What if it told other nodes its topology, and then there was a failure which caused it not to learn the topology of others (dropped messages, a crash + restart before the knowledge was persisted etc.)?
The system is hard to reason about in the current state, and it's hard to prove if any such guarantees are given.
At some point there was the following piece of code in the definition of
must_encrypt
inget_rpc_client
:i.e., when we create an RPC client due to a gossiper verb, we encrypt the communication (even though the nodes may live in the same DC/rack).
Later it was removed (7bdad47).
However, the check may be necessary for liveness: it's easy to see that it guarantees that every node can eventually learn every other node's topology through gossiping; nodes accept encrypted communication, they only reject unencrypted communication if they cannot verify that it's inter-DC/rack.
It's also not a performance problem to encrypt the RPC client used for gossiper verbs, because it doesn't contain any hot/data path verbs, only rare metadata verbs:
Finally, the check combined with #11942 will let me fix #11780.
The text was updated successfully, but these errors were encountered: