Re-resolve hostnames as fallback when control connection is broken #770

Merged: 10 commits from reresolve-dns into scylladb:main on Aug 28, 2023

Conversation


@wprzytula wprzytula commented Jul 26, 2023

Motivation

See the referenced issue (#756).

What's done

Main matter

  • fallback logic is introduced to read_metadata().
    This is the crucial step in boosting the driver's robustness in case of sudden IP changes in the cluster. If the control connection fails to fetch metadata, then, after all known peers fail to be queried, the fallback phase is entered: the addresses of all initially known nodes are resolved again and fetching metadata from them is attempted (a rough sketch of this flow follows after this list).
    Until now, if all nodes in the cluster changed their IPs at once (which is particularly easy in a single-node cluster), the driver could never contact the cluster again. Now, the driver reconnects the control connection with the next metadata fetch attempt (after at most 60 seconds, as configured in ClusterWorker), fetches the new topology, discovers all nodes' addresses and finally connects to them.
  • control connection reconnect attempts are more frequent.
    If the control connection is faulty and hence metadata fetch fails, it is advisable that further attempts to reconnect and fetch take place more frequently. The motivation is: if the control connection fails, it is possible that the node has changed its IP and hence we need to fetch new metadata ASAP to discover its new address. Therefore, the ClusterWorker's sleep time is changed from 60 seconds to 1 second once a metadata fetch fails, and is only reverted back to 60 seconds after a fetch succeeds.
  • control connection reconnect attempts begin immediately after the connection breaks.
    Until now, if the control connection node changed its IP, we would discover that node only after the next metadata fetch was issued, which would possibly happen only after 60 seconds (if the previous fetch succeeded). To make the rediscovery faster, immediate signalling that the control connection got broken is introduced, so that ClusterWorker instantly begins its every-1-second retry phase.
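As a rough illustration of the fallback phase from the first bullet (not the driver's actual code; `Peer`, `Metadata`, `try_fetch_metadata` and the hard-coded port 9042 are hypothetical stand-ins), the control flow could look roughly like this:

```rust
use std::net::SocketAddr;

use tokio::net::lookup_host;

// Hypothetical stand-ins for the driver's internal types.
struct Peer {
    address: SocketAddr,
}
struct Metadata; // placeholder for the fetched topology/schema
struct FetchError;

// Try every currently known peer first; only if all of them fail,
// re-resolve the initially provided hostnames and retry against the
// freshly resolved addresses.
async fn read_metadata_with_fallback(
    known_peers: &[Peer],
    initial_hostnames: &[String],
) -> Result<Metadata, FetchError> {
    // Phase 1: the usual path - query the peers learned from previous fetches.
    for peer in known_peers {
        if let Ok(metadata) = try_fetch_metadata(peer.address).await {
            return Ok(metadata);
        }
    }

    // Phase 2 (fallback): every known peer failed, so the whole cluster may
    // have changed its IPs. Re-resolve the hostnames given at session
    // creation and try whatever addresses they point to now.
    for hostname in initial_hostnames {
        let Ok(addresses) = lookup_host((hostname.as_str(), 9042)).await else {
            continue;
        };
        for address in addresses {
            if let Ok(metadata) = try_fetch_metadata(address).await {
                return Ok(metadata);
            }
        }
    }

    Err(FetchError)
}

// Placeholder: open a control connection to `address` and query system tables.
async fn try_fetch_metadata(address: SocketAddr) -> Result<Metadata, FetchError> {
    let _ = address;
    Err(FetchError)
}
```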

Bonus

  • ContactPoint is renamed to ResolvedContactPoint, to stress its difference from KnownNode (which is essentially an unresolved contact point); see the illustrative type sketch after this list.
  • various node-related structs (ResolvedContactPoint, KnownNode, CloudEndpoint) and hostname resolution-related routines are moved from cluster and session modules to the node module. This contributes to making the session module cleaner, as well as keeps those entities in the most conceptually related module.
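For illustration only, the intended distinction between the two types could be sketched as follows (field and variant names here are assumptions, not the crate's exact definitions):

```rust
use std::net::SocketAddr;

/// A contact point whose address has already been resolved.
pub struct ResolvedContactPoint {
    pub address: SocketAddr,
}

/// A node as provided by the user - possibly still an unresolved hostname.
pub enum KnownNode {
    Hostname(String),
    Address(SocketAddr),
}
```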

Evaluation

Manual tests were conducted. They were identical to the ones performed before merging the analogous enhancement to gocql. Both single- and multi-node clusters were tested.
The tests involved random connection breaks between arbitrary nodes and the driver, as well as changes of the IPs of all the nodes in the cluster at once.
The implemented solution proved very robust in the tests: immediately after losing the control connection, the reconnect-attempt phase begins. As soon as any node (either one known initially or one known from recently fetched metadata) becomes reachable, a control connection is opened and metadata is fetched successfully, so the whole cluster is discoverable. The driver's behaviour was stable, with no fluctuations over time.

Fixes: #756
Refs: #386

Pre-review checklist

  • [x] I have split my patch into logically separate commits.
  • [x] All commit messages clearly explain what they change and why.
  • [ ] I added relevant tests for new features and bug fixes.
  • [x] All commits compile, pass static checks and pass tests.
  • [x] PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • [x] I added appropriate Fixes: annotations to the PR description.

@wprzytula wprzytula requested review from piodul and cvybhu July 26, 2023 13:17
@wprzytula wprzytula force-pushed the reresolve-dns branch 2 times, most recently from fc4af5c to a624165 on July 26, 2023 19:00
@wprzytula (Collaborator, Author):

Rebased on main.

@wprzytula (Collaborator, Author):

Rebased on main.

The new name highlights the fact that the `ContactPoint` is necessarily a
resolved address. This is contrary to `KnownNode`, which can contain
an unresolved hostname.

Additionally, missing docstrings are added for `ContactPoint` and
`CloudEndpoint`, and the docstring for `KnownNode` is extended.

This not only makes `Session::connect()` cleaner, but also enables
using the same routine in the "fallback" introduced in the next commits.

The `node` module is the most suitable for the following structs:
- `KnownNode`
- `ResolvedContactPoint`
- `CloudEndpoint`

As hostname resolution is conceptually strictly connected to nodes,
the `node` module suits those routines better than the overloaded
`session` module.

The initially known nodes (i.e., those from before hostname resolution) are
kept so that they can be used to implement the "fallback" in case all known
peers change their IPs while at least some of them stay discoverable under
the same hostname.

The code of `read_metadata()` is refactored to prepare for the addition of
the fallback logic.
A new log is added at debug level for a successful metadata fetch.
Sadly, the new design requires cloning peers due to the borrow checker.
Luckily, this only happens if the first fetch attempt fails and is not
particularly expensive.
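A minimal sketch of the borrow-checker issue mentioned above (names and structure are illustrative, not the actual refactor):

```rust
use std::net::SocketAddr;

struct Metadata;
struct FetchError;

struct MetadataReader {
    known_peers: Vec<SocketAddr>,
}

impl MetadataReader {
    async fn read_metadata(&mut self) -> Result<Metadata, FetchError> {
        // The loop body calls a `&mut self` method, so it cannot hold a shared
        // borrow of `self.known_peers` at the same time. Cloning the (small)
        // address list up front side-steps the conflict; in the real code the
        // clone is only needed on the retry path.
        let peers = self.known_peers.clone();
        for peer in peers {
            if let Ok(metadata) = self.fetch_from(peer).await {
                return Ok(metadata);
            }
        }
        Err(FetchError)
    }

    // Placeholder for opening a control connection and querying metadata.
    async fn fetch_from(&mut self, _peer: SocketAddr) -> Result<Metadata, FetchError> {
        Err(FetchError)
    }
}
```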

This is the final step in boosting the driver's robustness in case of
sudden IP changes in the cluster. If the control connection
fails to fetch metadata, then, after all known peers fail to be queried,
the fallback phase is entered: the addresses of all initially known nodes
are resolved again and fetching metadata from them is attempted.

Until now, if all nodes in the cluster changed their IPs at once (which
is particularly easy in a single-node cluster), the driver could never
contact the cluster again. Now, the driver reconnects the control
connection with the next metadata fetch attempt (after at most 60
seconds, as configured in ClusterWorker), fetches the new topology,
discovers all nodes' addresses and finally connects to them.

The next commits will further shorten this downtime from 60 seconds to
as little as 1 second, by reacting immediately to the control connection
being broken.

If the control connection is faulty and hence the metadata fetch fails,
it is advisable that further attempts to reconnect and fetch take place
more frequently. The motivation is: if the control connection fails,
it is possible that the node has changed its IP, and hence we need to
fetch new metadata ASAP to discover its new address. Therefore,
the ClusterWorker's sleep time is changed from 60 seconds to 1 second
once a metadata fetch fails, and is reverted to 60 seconds only after
a fetch succeeds.

This is still not good enough: if all nodes change their IPs at once,
we will discover them only after the next metadata fetch is issued,
which may happen only after 60 seconds (if the previous fetch succeeded).
Hence, the next commit introduces immediate signalling that the control
connection is broken, so that ClusterWorker instantly begins its
every-1-second retry phase.
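A hedged sketch of the cadence change described above (constant names and the callback shape are assumptions, not the ClusterWorker's actual code):

```rust
use std::time::Duration;

// Sleep 60 s between refreshes while fetches succeed; drop to 1 s as soon as
// one fails, and return to 60 s only after a fetch succeeds again.
const HEALTHY_SLEEP: Duration = Duration::from_secs(60);
const DEGRADED_SLEEP: Duration = Duration::from_secs(1);

async fn cluster_worker_loop(mut fetch_metadata: impl FnMut() -> bool) {
    let mut sleep_time = HEALTHY_SLEEP;
    loop {
        tokio::time::sleep(sleep_time).await;
        sleep_time = if fetch_metadata() {
            HEALTHY_SLEEP
        } else {
            DEGRADED_SLEEP
        };
    }
}
```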
@wprzytula (Collaborator, Author):

@piodul Rebased on main.

Until now, if all nodes changed their IPs at once, we would discover
them only after the next metadata fetch was issued, which might happen
only after 60 seconds (if the previous fetch succeeded).
This commit introduces immediate signalling that the control connection
got broken, so that ClusterWorker instantly begins its
every-1-second retry phase.

In manual tests, this proved to be very robust: immediately after losing
the control connection, the reconnect-attempt phase begins. As soon as any
node (one known initially or from recently fetched metadata) becomes
reachable, a control connection is opened and metadata is fetched
successfully, so the whole cluster is discoverable.
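A hedged sketch of how such signalling could be wired with a Tokio channel (the channel type and names are assumptions; the driver's actual mechanism may differ):

```rust
use std::time::Duration;

use tokio::sync::mpsc;

// One iteration of the worker's wait: either the refresh timer fires, or the
// control connection reports that it just broke, waking the worker instantly.
async fn wait_for_next_refresh(
    sleep_time: Duration,
    control_connection_broken: &mut mpsc::Receiver<()>,
) {
    tokio::select! {
        _ = tokio::time::sleep(sleep_time) => {
            // Regular, timer-driven metadata refresh.
        }
        _ = control_connection_broken.recv() => {
            // The control connection broke: refresh immediately and switch
            // to the every-1-second retry cadence.
        }
    }
}
```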

This is a refactor that reduces the number of parameters needed for
`MetadataReader::new()`.
@piodul piodul merged commit afad74a into scylladb:main Aug 28, 2023
8 checks passed
@wprzytula wprzytula deleted the reresolve-dns branch August 28, 2023 08:36
Successfully merging this pull request may close: Re-resolve hostnames as fallback when all hosts are unreachable (#756)
3 participants