
Make max client connections configurable and refactor rpc::connection_cache #12906

Merged

ballard26 merged 10 commits into redpanda-data:dev on Sep 12, 2023

Conversation

@ballard26 (Contributor) commented on Aug 21, 2023:

The first change this PR makes is to make the number of client connections to each broker user configurable.

The second change this PR makes is to use a stateful connection allocation method that ensures the following:

  • Each shard should have an equal number of client connections.
  • Each client connection should have an equal number of shards using it.
  • On each node there should be max_connections to every other node in the cluster.
  • If a shard wants a connection for a (node, shard) and the current shard already has a connection for that (node, shard), it is selected (from @travisdowns; a rough sketch of this lookup is included after the list).
  • For any broker A, connections to broker A, when aggregated by shard ID across every broker in the cluster, should have an equal count per shard ID (from Shard aware connections v2 #8).
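As a rough illustration of the "prefer a local connection" rule above (hypothetical, simplified code; not the PR's actual API), the per-node bookkeeping could let a calling shard reuse its own connection when it has one and otherwise hop to an owning shard:

```cpp
#include <optional>
#include <unordered_map>
#include <unordered_set>

// Hypothetical, simplified illustration only; names do not match the PR.
using shard_id = unsigned;
using node_id = int;

struct connection_map {
    // Peer node -> shards that hold a client connection to that node.
    std::unordered_map<node_id, std::unordered_set<shard_id>> owners;

    // Pick the shard whose connection the calling shard should use.
    std::optional<shard_id> shard_for(node_id n, shard_id current) const {
        auto it = owners.find(n);
        if (it == owners.end() || it->second.empty()) {
            return std::nullopt;        // no connection to this node yet
        }
        if (it->second.contains(current)) {
            return current;             // prefer the local connection
        }
        return *it->second.begin();     // otherwise use a remote owner shard
    }
};
```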

Fixes #12912

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Adds an rpc_client_connections_per_shard cluster property that makes the number of clients a broker opens to a given peer user configurable.

@StephanDollberg (Member) left a comment:

Could you maybe comment on why you switched away from your original hash-ring approach to a simple shuffle-based one?

};

explicit backoff_policy(std::unique_ptr<impl> i)
: _impl(std::move(i)) {}

backoff_policy(backoff_policy&&) = default;

backoff_policy(const backoff_policy& o) {
Member:

Rule of 3/5 something something? I have forgotten when things get implicitly constructed these days, but it might be cleaner to just default/delete.

@ballard26 (Contributor, author):

Gotcha, will add a copy/move assignment operator.
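For context, a minimal sketch of the full set of special members the reviewer is alluding to (illustrative names and a clone-based copy, not the PR's actual code):

```cpp
#include <memory>

// Rule of five sketch: once a copy constructor exists, declare (default or
// define) all five special members explicitly. Illustrative only.
class backoff_policy {
public:
    struct impl {
        virtual ~impl() = default;
        virtual std::unique_ptr<impl> clone() const = 0;
    };

    explicit backoff_policy(std::unique_ptr<impl> i)
      : _impl(std::move(i)) {}

    backoff_policy(backoff_policy&&) = default;
    backoff_policy& operator=(backoff_policy&&) = default;

    backoff_policy(const backoff_policy& o)
      : _impl(o._impl ? o._impl->clone() : nullptr) {}
    backoff_policy& operator=(const backoff_policy& o) {
        if (this != &o) {
            _impl = o._impl ? o._impl->clone() : nullptr;
        }
        return *this;
    }

    ~backoff_policy() = default;

private:
    std::unique_ptr<impl> _impl;
};
```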

ss::future<>
remove_connection_location(ss::shard_id dest_shard, model::node_id node) {
return container().invoke_on(dest_shard, [node](auto& cache) {
auto conn_loc = cache._connection_map.find(node);
Member:

I think the find is redundant. You can just call erase, which will be a no-op if not found.

@ballard26 (Contributor, author):

Nice find, switching to just erase.
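The simplified form would look roughly like this (borrowing the snippet above; surrounding details elided):

```cpp
ss::future<>
remove_connection_location(ss::shard_id dest_shard, model::node_id node) {
    return container().invoke_on(dest_shard, [node](auto& cache) {
        // erase() is a no-op when the key is absent, so the preceding
        // find() can be dropped entirely.
        cache._connection_map.erase(node);
    });
}
```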

if (!_connection_map.contains(n)) {
return {};
}
return {_connection_map.at(n)};
Member:

Use find for this pattern? It avoids an extra hash-table lookup.

@ballard26 (Contributor, author):

Right you are, switching over to find.
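A sketch of the single-lookup version (the function name and value type are illustrative, following the snippet above):

```cpp
std::optional<ss::shard_id> find_shard(model::node_id n) const {
    // One find() instead of contains() followed by at(): a single hash lookup.
    if (auto it = _connection_map.find(n); it != _connection_map.end()) {
        return it->second;
    }
    return std::nullopt;
}
```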

@ballard26 (Contributor, author) commented:

> Could you maybe comment on why you switched away from your original hash-ring approach to a simple shuffle-based one?

A couple of reasons:

  • Even with the hash-ring approach I couldn't ensure that the number of connections per shard was equal. With a simple shuffle it's easily proven that connections per shard will be equal by construction (a rough sketch of the idea follows below).
  • Using a stateful connection allocation strategy allows for a lot of potential future enhancements. For example, in clusters with a lot of nodes it's possible for a given shard to only communicate with 2-3 of them. In those cases we may want the connection cache to allocate connections to those 2-3 nodes on that shard.
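A rough sketch of that construction (hypothetical, reduced to allocating the connections for a single peer node): shuffle the shard IDs once, then deal connections out in order, so per-shard counts differ by at most one.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Assign `connections_per_node` connections for one peer across
// `shard_count` shards. Counts per shard are equal (or differ by at most
// one) by construction. Illustrative only.
std::vector<unsigned>
allocate_connection_shards(unsigned shard_count, unsigned connections_per_node) {
    std::vector<unsigned> shards(shard_count);
    std::iota(shards.begin(), shards.end(), 0u);

    std::mt19937 rng{std::random_device{}()};
    std::shuffle(shards.begin(), shards.end(), rng);

    std::vector<unsigned> assignment;
    assignment.reserve(connections_per_node);
    for (unsigned c = 0; c < connections_per_node; ++c) {
        assignment.push_back(shards[c % shard_count]);
    }
    return assignment;
}
```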


absl::flat_hash_map<model::node_id, absl::flat_hash_set<ss::shard_id>>
_node_to_shards;
std::vector<std::pair<ss::shard_id, size_t>> _connections_per_shard;
Member:

Using a struct would be nice here as well for readability, as it avoids the std::get stuff.
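Something along these lines (field names are illustrative only):

```cpp
// A named struct in place of std::pair<ss::shard_id, size_t> lets call sites
// read e.shard / e.connection_count instead of std::get<0>(e) / std::get<1>(e).
struct shard_connection_count {
    ss::shard_id shard;
    size_t connection_count;
};

std::vector<shard_connection_count> _connections_per_shard;
```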

@ballard26 changed the title from "Draft: Make max client connections configurable and refactor rpc::connection_cache" to "Make max client connections configurable and refactor rpc::connection_cache" on Aug 30, 2023
@ballard26 marked this pull request as ready for review on August 30, 2023 at 16:01
@ballard26 merged commit 1ad588d into redpanda-data:dev on Sep 12, 2023
25 checks passed
@dotnwat (Member) left a comment:

cool stuff

_connections.erase(connection);
}

auto cert_creds = co_await maybe_build_reloadable_certificate_credentials(
@StephanDollberg (Member), Sep 15, 2023:

If we yield here then _connections no longer contains an entry for this connection, which other fibers might rely on. Is that a problem?

E.g.: connection_set::get straight out does:

    transport_ptr get(model::node_id n) const {
        return _connections.find(n)->second;
    }
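One possible way to harden that lookup (a sketch only, not the PR's eventual fix) would be to surface a missing entry to the caller rather than dereferencing a potentially-end iterator:

```cpp
transport_ptr get(model::node_id n) const {
    auto it = _connections.find(n);
    if (it == _connections.end()) {
        // The entry can be missing while another fiber rebuilds the
        // connection across a scheduling point; let the caller decide.
        return nullptr;
    }
    return it->second;
}
```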

}
auto holder = _gate.hold();

auto& alloc_strat = _coordinator_state->alloc_strat;
@StephanDollberg (Member), Sep 15, 2023:

Potential access to _coordinator_state->alloc_strat without owning the mutex?
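A sketch of the guarded access the reviewer is suggesting (assuming the coordinator state carries a mutex-like member, here hypothetically named `mtx` with a `get_units()` style API):

```cpp
auto holder = _gate.hold();

// Hypothetical: acquire the coordinator mutex before touching alloc_strat,
// and keep `units` alive for as long as the strategy is used.
auto units = co_await _coordinator_state->mtx.get_units();
auto& alloc_strat = _coordinator_state->alloc_strat;
// ... use alloc_strat while `units` is held ...
```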

@StephanDollberg (Member) commented:

I tried fixing the things I pointed out above, as per https://github.com/redpanda-data/redpanda/compare/stephan/connection-cache-fixes?expand=1, but it still seems to crash, so that's probably not the root cause.

Maybe it can still serve as inspiration.

@dotnwat (Member) commented on Sep 15, 2023:

I don't yet have any specific feedback, but generally the read_iobuf_exactly code that is reading from the input stream looks pretty solid to me. However, the input stream is (1) passed as a reference and (2) the resulting iobuf may contain temporary_buffers (via iobuf::fragment) that are shared with temporary_buffers owned by the input_stream. None of that is inherently a problem, but it does widen a bit the scope of potential issues I might look for.

rockwotj added a commit to rockwotj/redpanda that referenced this pull request Sep 19, 2023
…impl"

This reverts commit 1ad588d, reversing
changes made to 8088aeb.
Successfully merging this pull request may close these issues.

Improve connection distribution in clusters to reduce latency and reactor utilization