
transport: load balancing module refactor #612

Merged: 32 commits merged into main on Mar 17, 2023

Conversation

@havaker (Contributor) commented Dec 4, 2022

Description

This is the third attempt to refactor this module;)

Replica set calculation was moved to a separate module (transport::locator).

Replica set calculation is done by a module originally written by @cvybhu - transport::locator::replication_info. The ReplicationData struct lives in that module and provides a set of functions for calculating SimpleStrategy & NetworkTopologyStrategy replica lists.

To make load balancing fast, replica sets have to be precomputed. This is done by another of @cvybhu's modules - transport::locator::precomputed_replicas. precomputed_replicas::PrecomputedReplicas is a struct that precomputes replica lists for the given strategies and provides O(1) access to the desired replica slices.

locator::ReplicaLocator combines the functionality of the previously mentioned modules and provides a unified API for getting replica sets, whether precomputed or calculated on the fly. Representing replica sets with a custom ReplicaSet type allowed creating an API that supports optionally limiting replicas to a single data center (and allocation-free creation, access & iteration for precomputed replica sets).
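For illustration, a rough sketch of how such a locator API could be consumed (it assumes the crate's types are in scope; apart from the ReplicaLocator and ReplicaSet names taken from this description, the method name, parameters and Debug formatting below are assumptions, not the exact API):

fn print_local_replicas(locator: &ReplicaLocator, token: Token, strategy: &Strategy) {
    // Optionally restrict the replica set to a single data center; for strategies
    // with precomputed replicas, this lookup is expected to be allocation-free.
    let replicas: ReplicaSet<'_> = locator.replicas_for_token(token, strategy, Some("dc1"));
    for node in replicas {
        println!("replica: {:?}", node);
    }
}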

Things left for a follow-up:

  • A way to configure which strategies need replica set precomputation

LoadBalancingPolicy interface was changed.

plan was split into pick and fallback methods. This allows better optimization of the most common case, where only one node from the load balancing plan is needed. Changes required in the query execution code were minimized by providing a lazy chaining iterator, transport::load_balancing::plan::Plan. This iterator's first element is the node returned by the LoadBalancingPolicy::pick function; the next items come from the LoadBalancingPolicy::fallback iterator. The fallback method is called lazily - only when the second or a later element of the Plan iterator is needed.
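To illustrate the laziness, here is a self-contained toy model of the chaining idea (not the driver's actual Plan type: nodes are plain &str and duplicate filtering is omitted):

struct Plan<'a, F, I>
where
    F: FnOnce() -> I,
    I: Iterator<Item = &'a str>,
{
    picked: Option<&'a str>,  // result of pick, yielded first
    make_fallback: Option<F>, // invoked lazily, only if more nodes are needed
    fallback: Option<I>,
}

impl<'a, F, I> Iterator for Plan<'a, F, I>
where
    F: FnOnce() -> I,
    I: Iterator<Item = &'a str>,
{
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if let Some(node) = self.picked.take() {
            return Some(node); // most queries stop here; no fallback is ever built
        }
        if self.fallback.is_none() {
            // Build the fallback iterator only on the first request for a second node.
            let make = self.make_fallback.take()?;
            self.fallback = Some(make());
        }
        self.fallback.as_mut()?.next()
    }
}

fn main() {
    // Taking just the first element never constructs the fallback iterator.
    let plan = Plan {
        picked: Some("node-1"),
        make_fallback: Some(|| ["node-2", "node-3"].into_iter()),
        fallback: None,
    };
    assert_eq!(plan.take(1).collect::<Vec<_>>(), vec!["node-1"]);
}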

The following methods were added:

fn on_query_success(&self, query: &QueryInfo, latency: Duration, node: NodeRef<'_>);
fn on_query_error(&self, query: &QueryInfo, latency: Duration, node: NodeRef<'_>, error: &QueryError);

The motivation for adding them was to make it possible to contain the latency-aware logic entirely inside a load balancing policy.
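A rough sketch of what these callbacks enable (types are simplified: node addresses as strings instead of NodeRef, no QueryInfo/QueryError, and the averaging weight is arbitrary) - a policy could keep a per-node latency estimate entirely internally:

use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

#[derive(Default)]
struct LatencyTracker {
    // Exponentially weighted moving average of observed latency per node.
    averages: Mutex<HashMap<String, Duration>>,
}

impl LatencyTracker {
    fn on_query_success(&self, node: &str, latency: Duration) {
        let mut averages = self.averages.lock().unwrap();
        let avg = averages.entry(node.to_owned()).or_insert(latency);
        // Blend the new sample into the running average.
        *avg = (*avg * 7 + latency) / 8;
    }

    fn on_query_error(&self, node: &str, latency: Duration) {
        // A failed query still carries timing information; record it as well.
        self.on_query_success(node, latency);
    }
}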

Default policy was created.

The default policy supports token and data center awareness. It also has logic for data center failover.
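A configuration sketch (the builder method names below are assumptions based on this description and the commit messages, not a guaranteed API):

let policy = DefaultPolicy::builder()
    .prefer_datacenter("dc1".to_string()) // data center awareness
    .token_aware(true)                    // route to replicas when the token is known
    .permit_dc_failover(true)             // allow falling back to remote data centers
    .build();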

Things left for a follow-up:

  • LWT routing support

Latency-aware policy was merged into the default one.

Mechanisms used only by the latency-aware policy were removed (e.g. TimestampedAverage living in Node).

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I added appropriate Fixes: annotations to PR description.

Fixes: #408

@havaker requested review from cvybhu and piodul and removed the request for cvybhu on December 5, 2022 15:06
@wprzytula (Collaborator) commented:

I'm not familiar with the token ring, replica computation etc., but apart from that, this approach looks good.

pub fn new() -> Self {
    Self {}
}

impl LoadBalancingPolicy for DefaultPolicy {
    fn pick<'a>(&'a self, query: &'a QueryInfo, cluster: &'a ClusterData) -> NodeRef<'a> {
Review comment (Collaborator):

I've got a feeling that pick and fallback methods follow a similar pattern (control flow-wise) and we could try to reduce the complexity by deduplicating them. Basically, you could think of pick as a function that returns the first element of fallback without having to allocate memory for the FallbackPlan. Of course, if we implemented pick such that it calls fallback then the performance benefit would be gone - however, what if we removed the need to allocate the plan with a callback?

You could implement a function do_with_plan (didn't think too much about the name):

// FnOnce won't be sufficient to express the callback's type, because you don't
// have access to the iterator's type - you will most likely have to use a trait
trait PlanExtractor<'a> {
    type Output;
    fn extract<I>(self, plan: I) -> Self::Output
    where
        I: Iterator<Item = NodeRef<'a>> + Send + Sync + 'a;
}

fn do_with_plan<'a, E>(
    &'a self,
    query: &'a QueryInfo,
    cluster: &'a ClusterData,
    extractor: E,
) -> E::Output
where
    E: PlanExtractor<'a>,
{
    /* snip */

    let plan = maybe_replicas
        .chain(robined_local_nodes)
        .chain(maybe_remote_nodes)
        .unique();

    extractor.extract(plan)
}

I didn't really scrutinize the pick and fallback methods so I'm not sure what the performance impact of this approach will be. If computing the plan requires significantly more computations than the current implementation of pick (apart from avoiding the allocation), then I guess we can stay with the current approach.

It would be great if we could compare the performance of both approaches. Perhaps a benchmark would be in order? After all, one of the reasons for the refactor was to improve the performance.
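As a concrete sketch of this suggestion (using the trait as written above, with pick simplified to return an Option):

struct First;

impl<'a> PlanExtractor<'a> for First {
    type Output = Option<NodeRef<'a>>;

    // Consumes only the first element of the plan, so nothing is collected or allocated.
    fn extract<I>(self, mut plan: I) -> Option<NodeRef<'a>>
    where
        I: Iterator<Item = NodeRef<'a>> + Send + Sync + 'a,
    {
        plan.next()
    }
}

// pick would then delegate to the shared plan construction:
// fn pick<'a>(&'a self, query: &'a QueryInfo, cluster: &'a ClusterData) -> Option<NodeRef<'a>> {
//     self.do_with_plan(query, cluster, First)
// }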

@cvybhu (Contributor) left a comment:

Had an initial look, generally looks nice although I didn't dig into the juicy details yet.

@cvybhu (Contributor) left a comment:

Looks nice.
We got rid of the problematic policy chaining and the code is much clearer.

I like how simple planning has become; in my implementation I had a bunch of convoluted iterators that were a pain to write and read. This implementation is much more elegant.

The direction seems good, but there are still a few things that need to be solved before proceeding, like LWT routing, latency awareness and proper UP/DOWN event handling.


// Get a list of all local alive nodes, and apply a round robin to it
let local_nodes = self.preffered_node_set(cluster);
let robined_local_nodes = Self::round_robin_nodes(local_nodes, Self::is_alive);
Review comment (Contributor):

Round robin has the problem of overloads in case of node failure.
If the usual round robin order is A->B->C and A fails, B will take over all of A's requests. It would be better to try nodes in random order, then in case of A's failure the load that used to be handled by A will be shared equally among B and C.
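A minimal sketch of the suggested randomization (assuming the rand crate as a dependency; the node type is simplified to &str):

use rand::seq::SliceRandom;

// Instead of a fixed rotation where B always directly follows A (and therefore
// absorbs all of A's traffic when A is down), shuffle the candidates per query
// so that A's former load is spread evenly over the remaining nodes.
fn randomized_order<'a>(nodes: &[&'a str]) -> Vec<&'a str> {
    let mut shuffled = nodes.to_vec();
    shuffled.shuffle(&mut rand::thread_rng());
    shuffled
}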

Review comment (Collaborator):

As we have no quick idea how to mitigate this yet, let's put this problem off until a follow-up.

@wprzytula (Collaborator) commented:

I'm sad to see that we didn't really take advantage of grouping all default features in one Default Policy. Namely, we still pass the load balancing policy in an Arc, which imposes some overhead. Instead, I would propose:

struct LoadBalancing(Or);

enum Or {
    Default(DefaultLoadBalancingPolicy),
    Custom(Arc<dyn LoadBalancingPolicy>)
}

impl LoadBalancingPolicy for LoadBalancing {
   ... 
}
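A sketch of how the wrapper's delegation could look (only the pick method shown, reusing the names from the snippet above and the pick signature quoted earlier in this thread; this is an illustration, not the proposed implementation itself):

impl LoadBalancingPolicy for LoadBalancing {
    fn pick<'a>(&'a self, query: &'a QueryInfo, cluster: &'a ClusterData) -> NodeRef<'a> {
        match &self.0 {
            // Static dispatch for the built-in default policy...
            Or::Default(default) => default.pick(query, cluster),
            // ...dynamic dispatch only for user-provided policies.
            Or::Custom(custom) => custom.pick(query, cluster),
        }
    }
}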

@havaker (Author) commented Jan 19, 2023

Rebased on top of main (using @wprzytula's pull request havaker#8)

@havaker (Author) commented Jan 19, 2023

Removed scylla/src/transport/load_balancing/latency_aware.rs in the commit "transport: load_balancing: remove policies other than the default one".

@havaker (Author) commented Jan 22, 2023

Rebased on top of main, merged havaker#6 (with additional NodeRef import fixes)

@havaker (Author) commented Feb 10, 2023

I'm sad to see that we didn't really take advantage of grouping all default features in one Default Policy. Namely, we still pass the load balancing policy in an Arc, which imposes some overhead. Instead, I would propose:

struct LoadBalancing(Or);

enum Or {
    Default(DefaultLoadBalancingPolicy),
    Custom(Arc<dyn LoadBalancingPolicy>)
}

impl LoadBalancingPolicy for LoadBalancing {
   ... 
}

I don't think that the overhead you mentioned is noticeable.

@havaker (Author) commented Feb 13, 2023

v1:

  • Rebased on top of main
  • Prepared a commit series that is suitable for review

@havaker marked this pull request as ready for review on February 13, 2023 15:24
@cvybhu (Contributor) commented Feb 13, 2023

clippy check is failing. It looks like 9bc33b0 removed a clone() and we started doing slice.into_iter().

@piodul (Collaborator) commented Feb 14, 2023

Please update the PR description. It mentions that some things still need to be done - the PR was marked as ready so I guess those things are ready?

@havaker (Author) commented Feb 14, 2023

v2:

  • applied clippy suggestions
  • added doc comments to the new LoadBalancingPolicy interface

havaker and others added 13 commits March 17, 2023 13:25
`LoadBalancingPolicy::plan` was split into `pick` and `fallback` methods.
This allows better optimization of the most common case, where only one
node from the load balancing plan is needed.
Changes required in the query execution code were minimized by providing
a lazy chaining iterator, `transport::load_balancing::plan::Plan`. This
iterator's first element is the node returned by the
`LoadBalancingPolicy::pick` function; the next items come from the
`LoadBalancingPolicy::fallback` iterator. The fallback method is called
lazily - only when the second or a later element of the `Plan` iterator is needed.
Implemented token and datacenter awareness.
Added a builder for the default policy (so that adding new parameters to
the default policy won't break the API).

Default policy prefers to return nodes in the following order:
- Alive local replicas (if token is available & token awareness is
  enabled)
- Alive remote replicas (if datacenter failover is permitted & possible
  due to consistency constraints)
- Alive local nodes
- Alive remote nodes (if datacenter failover is permitted & possible due
  to consistency constraints)
- Enabled down nodes

If no preferred datacenter is specified, all nodes are treated as local
ones.

The `DefaultPolicy::pick` method does not allocate if the replica lists
for the given strategy were precomputed.

Co-authored-by: Wojciech Przytuła <wojciech.przytula@scylladb.com>
Two methods were added to the LoadBalancingPolicy trait:
fn on_query_success(&self, query: &QueryInfo, latency: Duration, node: NodeRef<'_>);
fn on_query_failure(&self, query: &QueryInfo, latency: Duration, node: NodeRef<'_>, error: &QueryError);

Their addition allows implementing a latency-aware policy.
This commit prepares for a cleaner introduction of latency awareness into
DefaultPolicy.
The experimental latency awareness module is added to DefaultPolicy.
Its behaviour is based on the previous LatencyAwarePolicy
implementation.

Notes on the performance of operations involving the pick predicate:
The pick predicate is boxed inside DefaultPolicy.
Fallback is performed efficiently - the predicate is only borrowed for
the eager computation of the fallback iterator.
@havaker (Author) commented Mar 17, 2023

v12:

  • reworded docs
  • rebased on top of main (to make clippy happy)

@cvybhu (Contributor) commented Mar 17, 2023

The new version of load balancing precomputes replica sets for each vnode, which could turn out to be quite a bit of work, so I measured (using lbbench) how many resources the precomputation consumes compared to the previous version, which didn't precompute:

Cluster: 3 datacenters, 8 nodes in each one

A few SimpleStrategy keyspaces, RF <= 8:

Before: 1.8MB of memory used, 420µs to compute ClusterData
After: 2.5MB of memory used, 5ms to compute ClusterData

A few big SimpleStrategy keyspaces, RF ~= 20:

Before: 1.9MB of memory used, 558.229µs to compute ClusterData
After: 3.7MB of memory used, 14ms to compute ClusterData

A few NetworkTopologyStrategy keyspaces, RF <= 8:

Before: 1.9MB of memory used, 454.11µs to compute ClusterData
After: 6.3MB of memory used, 25ms to compute ClusterData

All of the above:

Before: 2MB of memory used, 354.14µs to compute ClusterData
After: 7.8MB of memory used, 37ms to compute ClusterData


We use a bit more memory and compute power, but it's within reasonable limits.


One positive thing stemming from the new precomputation is that the number of allocations per request got reduced:
Before:

Inserts:
----------
allocs/req:                15.00
reallocs/req:               8.00
frees/req:                 15.00
bytes allocated/req:     2458.05
bytes reallocated/req:    269.06
bytes freed/req:         2456.80
(allocated - freed)/req:      1.25
----------
Sending 100000 selects, hold tight ..........
----------
Selects:
----------
allocs/req:                48.00
reallocs/req:               8.00
frees/req:                 48.00
bytes allocated/req:     5266.07
bytes reallocated/req:    209.00
bytes freed/req:         5266.00
(allocated - freed)/req:      0.07

After:

Inserts:
----------
allocs/req:                 6.01
reallocs/req:               6.00
frees/req:                  6.00
bytes allocated/req:      381.80
bytes reallocated/req:    173.05
bytes freed/req:          380.62
(allocated - freed)/req:      1.18
----------
Sending 100000 selects, hold tight ..........
----------
Selects:
----------
allocs/req:                39.00
reallocs/req:               6.00
frees/req:                 39.00
bytes allocated/req:     3190.15
bytes reallocated/req:    113.01
bytes freed/req:         3190.04
(allocated - freed)/req:      0.11
----------

Here are the exact results:
lb-results.zip

@havaker requested a review from cvybhu on March 17, 2023 14:21
@cvybhu (Contributor) left a comment:

LGTM 🚀

@mykaul (Contributor) commented Mar 17, 2023

Nice work!

Ten0 added a commit to Ten0/scylla-rust-driver that referenced this pull request Jun 3, 2023
Resolves scylladb#468

This is a follow-up on scylladb#508 and scylladb#658:
- To minimize CPU usage related to network operations when inserting a very large number of rows, it is beneficial to batch.
- To batch in the most efficient manner, these batches have to be shard-aware. Since scylladb#508, `batch` will pick the shard of the first statement to send the query to. However, it is left to the user to compose the batches in such a way that the target shard is the same for all the elements of the batch.
- This was made *possible* by scylladb#658, but it was still very boilerplate-ish. I was waiting for scylladb#612 to be merged (amazing work btw! 😃) to implement a more direct and factored API (as that would use it).
- This new ~`Session::first_shard_for_statement(self, &PreparedStatement, &SerializedValues) -> Option<(Node, Option<Shard>)>` makes shard-aware batching easy for users, by providing access to the first node and shard of the query plan.