Re-resolve hostnames as fallback when control connection is broken #770

Merged: 10 commits from reresolve-dns into scylladb:main on Aug 28, 2023

Conversation


@wprzytula wprzytula commented Jul 26, 2023

Motivation

See the referenced issue (#756).

What's done

Main matter

  • fallback logic is introduced to read_metadata().
    This is the crucial step in boosting the driver's robustness in case of sudden IP changes in the cluster. If the control connection fails to fetch metadata, then, after all known peers fail to be queried, the fallback phase is entered: the addresses of all initially known nodes are resolved again and fetching metadata from them is attempted (a rough sketch of this flow follows after this list).
    Until now, if all nodes in the cluster changed their IPs at once (which is particularly easy in a single-node cluster), the driver could never contact the cluster again. Now, the driver reconnects the control connection with the next metadata fetch attempt (after at most 60 seconds, as configured in ClusterWorker), fetches the new topology, discovers all nodes' addresses and finally connects to them.
  • control connection reconnect attempts are more frequent.
    If the control connection is faulty and hence metadata fetch fails, it is advisable that further attempts to reconnect and fetch take place more frequently. The motivation is: if the control connection fails, it is possible that the node has changed its IP and hence we need to fetch new metadata ASAP to discover its new address. Therefore, the ClusterWorker's sleep time is changed from 60 seconds to 1 second once a metadata fetch fails, and is only reverted back to 60 seconds after a fetch succeeds.
  • control connection reconnect attempts begin immediately after the connection breaks.
    Until now, if the control connection node changed its IP, we would discover that node only after the next metadata fetch was issued, which would possibly happen only after 60 seconds (if the previous fetch succeeded). To make the rediscovery faster, immediate signalling that the control connection got broken is introduced, so that ClusterWorker instantly begins its every-1-second retry phase.
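As a rough illustration of the fallback phase from the first bullet (not the driver's actual code; `Peer`, `Metadata`, `try_fetch_metadata` and the hard-coded port 9042 are hypothetical stand-ins), the control flow could look roughly like this:

```rust
use std::net::SocketAddr;

use tokio::net::lookup_host;

// Hypothetical stand-ins for the driver's internal types.
struct Peer {
    address: SocketAddr,
}
struct Metadata; // placeholder for the fetched topology/schema
struct FetchError;

// Try every currently known peer first; only if all of them fail,
// re-resolve the initially provided hostnames and retry against the
// freshly resolved addresses.
async fn read_metadata_with_fallback(
    known_peers: &[Peer],
    initial_hostnames: &[String],
) -> Result<Metadata, FetchError> {
    // Phase 1: the usual path - query the peers learned from previous fetches.
    for peer in known_peers {
        if let Ok(metadata) = try_fetch_metadata(peer.address).await {
            return Ok(metadata);
        }
    }

    // Phase 2 (fallback): every known peer failed, so the whole cluster may
    // have changed its IPs. Re-resolve the hostnames given at session
    // creation and try whatever addresses they point to now.
    for hostname in initial_hostnames {
        let Ok(addresses) = lookup_host((hostname.as_str(), 9042)).await else {
            continue;
        };
        for address in addresses {
            if let Ok(metadata) = try_fetch_metadata(address).await {
                return Ok(metadata);
            }
        }
    }

    Err(FetchError)
}

// Placeholder: open a control connection to `address` and query system tables.
async fn try_fetch_metadata(address: SocketAddr) -> Result<Metadata, FetchError> {
    let _ = address;
    Err(FetchError)
}
```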

Bonus

  • ContactPoint is renamed to ResolvedContactPoint, to stress its difference from KnownNode (which is essentially an unresolved contact point); see the illustrative type sketch after this list.
  • various node-related structs (ResolvedContactPoint, KnownNode, CloudEndpoint) and hostname resolution-related routines are moved from cluster and session modules to the node module. This contributes to making the session module cleaner, as well as keeps those entities in the most conceptually related module.
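For illustration only, the intended distinction between the two types could be sketched as follows (field and variant names here are assumptions, not the crate's exact definitions):

```rust
use std::net::SocketAddr;

/// A contact point whose address has already been resolved.
pub struct ResolvedContactPoint {
    pub address: SocketAddr,
}

/// A node as provided by the user - possibly still an unresolved hostname.
pub enum KnownNode {
    Hostname(String),
    Address(SocketAddr),
}
```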

Evaluation

Manual tests were conducted. They were identical to the ones performed before merging the analogous enhancement to gocql. Both single- and multi-node clusters were tested.
The tests involved random connection breaks between arbitrary nodes and the driver, as well as changes of the IPs of all the nodes in the cluster at once.
The implemented solution proved very robust in the tests: immediately after losing the control connection, the reconnect-attempt phase begins. As soon as any node (either one known initially or one known from recently fetched metadata) becomes reachable, a control connection is opened and metadata is fetched successfully, so the whole cluster is discoverable. The driver's behaviour was stable, with no fluctuations over time.

Fixes: #756
Refs: #386

Pre-review checklist

  • [x] I have split my patch into logically separate commits.
  • [x] All commit messages clearly explain what they change and why.
  • [ ] I added relevant tests for new features and bug fixes.
  • [x] All commits compile, pass static checks and pass tests.
  • [x] PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • [x] I added appropriate Fixes: annotations to the PR description.

@wprzytula wprzytula requested review from piodul and cvybhu July 26, 2023 13:17
@wprzytula wprzytula force-pushed the reresolve-dns branch 2 times, most recently from fc4af5c to a624165 on July 26, 2023 19:00
@wprzytula (Collaborator, Author):

Rebased on main.

@wprzytula (Collaborator, Author):

Rebased on main.

The new name highlights the fact that the `ContactPoint` is necessarily a
resolved address. This is contrary to `KnownNode`, which can contain
an unresolved hostname.

Additionally, missing docstrings are added for `ContactPoint` and
`CloudEndpoint`, and the docstring for `KnownNode` is extended.

This not only makes `Session::connect()` cleaner, but also enables
using the same routine in the "fallback" introduced in the next commits.

The `node` module is the most suitable for the following structs:
- `KnownNode`
- `ResolvedContactPoint`
- `CloudEndpoint`

As hostname resolution is conceptually strictly connected to nodes,
the `node` module suits those routines better than the overloaded
`session` module.

The initially known nodes (i.e., those from before hostname resolution) are
kept so that they can be used to implement the "fallback" in case all known
peers change their IPs while at least some of them stay discoverable under
the same hostname.

The code of `read_metadata()` is refactored to prepare for the addition of
the fallback logic.
A new log is added at debug level for a successful metadata fetch.
Sadly, the new design requires cloning peers due to the borrow checker.
Luckily, this only happens if the first fetch attempt fails and is not
particularly expensive.
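A minimal sketch of the borrow-checker issue mentioned above (names and structure are illustrative, not the actual refactor):

```rust
use std::net::SocketAddr;

struct Metadata;
struct FetchError;

struct MetadataReader {
    known_peers: Vec<SocketAddr>,
}

impl MetadataReader {
    async fn read_metadata(&mut self) -> Result<Metadata, FetchError> {
        // The loop body calls a `&mut self` method, so it cannot hold a shared
        // borrow of `self.known_peers` at the same time. Cloning the (small)
        // address list up front side-steps the conflict; in the real code the
        // clone is only needed on the retry path.
        let peers = self.known_peers.clone();
        for peer in peers {
            if let Ok(metadata) = self.fetch_from(peer).await {
                return Ok(metadata);
            }
        }
        Err(FetchError)
    }

    // Placeholder for opening a control connection and querying metadata.
    async fn fetch_from(&mut self, _peer: SocketAddr) -> Result<Metadata, FetchError> {
        Err(FetchError)
    }
}
```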

This is the final step in boosting the driver's robustness in case of
sudden IP changes in the cluster. If the control connection
fails to fetch metadata, then, after all known peers fail to be queried,
the fallback phase is entered: the addresses of all initially known nodes
are resolved again and fetching metadata from them is attempted.

Until now, if all nodes in the cluster changed their IPs at once (which
is particularly easy in a single-node cluster), the driver could never
contact the cluster again. Now, the driver reconnects the control
connection with the next metadata fetch attempt (after at most 60
seconds, as configured in ClusterWorker), fetches the new topology,
discovers all nodes' addresses and finally connects to them.

The next commits will further shorten this downtime from 60 seconds to
as little as 1 second, by reacting immediately to the control connection
being broken.

If the control connection is faulty and hence the metadata fetch fails,
it is advisable that further attempts to reconnect and fetch take place
more frequently. The motivation is: if the control connection fails,
it is possible that the node has changed its IP, and hence we need to
fetch new metadata ASAP to discover its new address. Therefore,
the ClusterWorker's sleep time is changed from 60 seconds to 1 second
once a metadata fetch fails, and is reverted to 60 seconds only after
a fetch succeeds.

This is still not good enough: if all nodes change their IPs at once,
we will discover them only after the next metadata fetch is issued,
which may happen only after 60 seconds (if the previous fetch succeeded).
Hence, the next commit introduces immediate signalling that the control
connection is broken, so that ClusterWorker instantly begins its
every-1-second retry phase.
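A hedged sketch of the cadence change described above (constant names and the callback shape are assumptions, not the ClusterWorker's actual code):

```rust
use std::time::Duration;

// Sleep 60 s between refreshes while fetches succeed; drop to 1 s as soon as
// one fails, and return to 60 s only after a fetch succeeds again.
const HEALTHY_SLEEP: Duration = Duration::from_secs(60);
const DEGRADED_SLEEP: Duration = Duration::from_secs(1);

async fn cluster_worker_loop(mut fetch_metadata: impl FnMut() -> bool) {
    let mut sleep_time = HEALTHY_SLEEP;
    loop {
        tokio::time::sleep(sleep_time).await;
        sleep_time = if fetch_metadata() {
            HEALTHY_SLEEP
        } else {
            DEGRADED_SLEEP
        };
    }
}
```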
@wprzytula (Collaborator, Author):

@piodul Rebased on main.

Until now, if all nodes changed their IPs at once, we would discover
them only after the next metadata fetch was issued, which might happen
only after 60 seconds (if the previous fetch succeeded).
This commit introduces immediate signalling that the control connection
got broken, so that ClusterWorker instantly begins its
every-1-second retry phase.

In manual tests, this proved to be very robust: immediately after losing
the control connection, the reconnect-attempt phase begins. As soon as any
node (one known initially or from recently fetched metadata) becomes
reachable, a control connection is opened and metadata is fetched
successfully, so the whole cluster is discoverable.
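A hedged sketch of how such signalling could be wired with a Tokio channel (the channel type and names are assumptions; the driver's actual mechanism may differ):

```rust
use std::time::Duration;

use tokio::sync::mpsc;

// One iteration of the worker's wait: either the refresh timer fires, or the
// control connection reports that it just broke, waking the worker instantly.
async fn wait_for_next_refresh(
    sleep_time: Duration,
    control_connection_broken: &mut mpsc::Receiver<()>,
) {
    tokio::select! {
        _ = tokio::time::sleep(sleep_time) => {
            // Regular, timer-driven metadata refresh.
        }
        _ = control_connection_broken.recv() => {
            // The control connection broke: refresh immediately and switch
            // to the every-1-second retry cadence.
        }
    }
}
```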

This is a refactor that reduces the number of parameters needed for
`MetadataReader::new()`.
@piodul piodul merged commit afad74a into scylladb:main Aug 28, 2023
8 checks passed
@wprzytula wprzytula deleted the reresolve-dns branch August 28, 2023 08:36
Successfully merging this pull request may close: Re-resolve hostnames as fallback when all hosts are unreachable (#756)
3 participants