
Add LOAD_BALANCING_POLICY_SLOW_AVOIDANCE functionality #168

Merged: 1 commit merged into scylladb:master on May 21, 2024

Conversation

@sylwiaszunejko (Collaborator) commented Apr 17, 2024

The Java driver has a feature that automatically avoids slow replicas using simple heuristics (https://github.com/scylladb/java-driver/blob/scylla-4.x/core/src/main/java/com/datastax/oss/driver/internal/core/loadbalancing/DefaultLoadBalancingPolicy.java#L104). This is one of the key features for achieving stable latency.

This PR adds an additional field in tokenAwareHostPolicy to control whether the feature is enabled and what the maximum in-flight threshold is.

If the feature is enabled, the driver sorts the replicas to first try those with less than the specified maximum of in-flight requests.

Fixes: #154
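
As a rough illustration of that sorting step (a minimal sketch under assumed names; gocql's real types differ, and host/reorderReplicas here are hypothetical):

```go
// Hypothetical sketch of the reordering heuristic described above: replicas
// currently under the in-flight threshold are tried first, and overloaded
// ones are pushed to the back while keeping the original order within groups.
package main

import "fmt"

// host stands in for the driver's internal host representation; inFlight is
// an assumed count of in-use streams on that host.
type host struct {
	addr     string
	inFlight int
}

const maxInFlightThreshold = 10 // this PR makes the threshold configurable

func reorderReplicas(replicas []host) []host {
	fast := make([]host, 0, len(replicas))
	slow := make([]host, 0, len(replicas))
	for _, h := range replicas {
		if h.inFlight < maxInFlightThreshold {
			fast = append(fast, h)
		} else {
			slow = append(slow, h)
		}
	}
	return append(fast, slow...)
}

func main() {
	replicas := []host{{"a", 3}, {"b", 42}, {"c", 1}}
	fmt.Println(reorderReplicas(replicas)) // [{a 3} {c 1} {b 42}]
}
```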

@sylwiaszunejko (Collaborator, Author)

I am not sure how to test this; I am open to any suggestions.

@roydahan (Collaborator)

Can you please provide a reference of this functionality in the Java driver?

@sylwiaszunejko (Collaborator, Author)

> Can you please provide a reference of this functionality in the Java driver?

I added a link to the code to the description.

@sylwiaszunejko self-assigned this on Apr 18, 2024
Review comment on policies.go (outdated):
@@ -424,6 +434,8 @@ type clusterMeta struct {
tokenRing *tokenRing
}

const MAX_IN_FLIGHT_THRESHOLD int = 10


Please make this configurable.
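
For context, a common Go way to satisfy this request is a functional option on the policy constructor; the sketch below shows the pattern with hypothetical names (it is not the merged gocql API):

```go
// A minimal functional-options sketch, assuming a tokenAwarePolicy-like
// struct; all names here are hypothetical stand-ins, not the merged code.
package main

import "fmt"

type tokenAwarePolicy struct {
	avoidSlowReplicas    bool
	maxInFlightThreshold int
}

// AvoidSlowReplicas enables the feature and sets the threshold.
func AvoidSlowReplicas(threshold int) func(*tokenAwarePolicy) {
	return func(p *tokenAwarePolicy) {
		p.avoidSlowReplicas = true
		p.maxInFlightThreshold = threshold
	}
}

func newTokenAwarePolicy(opts ...func(*tokenAwarePolicy)) *tokenAwarePolicy {
	p := &tokenAwarePolicy{maxInFlightThreshold: 10} // default from the diff above
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newTokenAwarePolicy(AvoidSlowReplicas(5))
	fmt.Printf("%+v\n", p) // &{avoidSlowReplicas:true maxInFlightThreshold:5}
}
```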

@avikivity (Member)

I'm suspicious of this policy; it can induce instability in the cluster.

  1. A node is slower than the others
  2. The policy moves work to other nodes
  3. Another node becomes slow due to work moved
  4. Slowness moves around the cluster as the policy keeps moving work around

Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.
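
To illustrate one form such dampening could take (purely illustrative; nothing like this is in the PR), a client could smooth the in-flight signal with an exponentially weighted moving average so a momentary spike does not immediately reclassify a node as slow:

```go
// EWMA dampening sketch (illustrative only): the avoidance decision uses a
// smoothed in-flight count rather than the instantaneous one, so short
// spikes decay instead of instantly diverting traffic.
package main

import "fmt"

type dampenedHost struct {
	smoothedInFlight float64
}

const alpha = 0.1 // low alpha = strong dampening; chosen arbitrarily here

// observe folds a new instantaneous in-flight sample into the average.
func (h *dampenedHost) observe(inFlight int) {
	h.smoothedInFlight = alpha*float64(inFlight) + (1-alpha)*h.smoothedInFlight
}

func (h *dampenedHost) isSlow(threshold float64) bool {
	return h.smoothedInFlight >= threshold
}

func main() {
	h := &dampenedHost{}
	for _, sample := range []int{2, 3, 50, 2, 3} { // one transient spike
		h.observe(sample)
	}
	// The spike alone does not trip the threshold.
	fmt.Printf("smoothed=%.1f slow=%v\n", h.smoothedInFlight, h.isSlow(10))
}
```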

@sylwiaszunejko (Collaborator, Author)

v2:

  • I made MAX_IN_FLIGHT_THRESHOLD configurable.
  • I realized that checking the number of requests in flight was not correct: I was checking the number of connections instead of the sum of in-use streams on every connection.

The process of testing this change is still in progress. I am trying to observe the behavior in use by setting up a 3-node cluster with one "slow" node (with some sleeps added), aiming to show that replicas on this host are unhealthy. However, so far I have not been able to observe a higher number of in-use streams.

@mykaul commented Apr 24, 2024

> If the feature is enabled, the driver sorts the replicas to first try those with less than the specified maximum of in-flight requests.

There was a nice addition of 'power of two choices' in one of our drivers (Java?); I think this might be a good candidate here as well.
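
For reference, 'power of two choices' picks two hosts at random and routes to the less loaded one, which bounds imbalance without comparing all hosts; a minimal sketch with assumed names:

```go
// Power-of-two-choices sketch (illustrative): pick two random candidates and
// send to whichever has fewer in-flight requests. Sampling is with
// replacement, as in the classic formulation of the technique.
package main

import (
	"fmt"
	"math/rand"
)

type candidate struct {
	addr     string
	inFlight int
}

func pickHost(hosts []candidate) candidate {
	a := hosts[rand.Intn(len(hosts))]
	b := hosts[rand.Intn(len(hosts))]
	if a.inFlight <= b.inFlight {
		return a
	}
	return b
}

func main() {
	hosts := []candidate{{"a", 3}, {"b", 42}, {"c", 1}}
	fmt.Println(pickHost(hosts))
}
```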

@sylwiaszunejko (Collaborator, Author)

> I'm suspicious of this policy; it can induce instability in the cluster.
>
> 1. A node is slower than the others
> 2. The policy moves work to other nodes
> 3. Another node becomes slow due to work moved
> 4. Slowness moves around the cluster as the policy keeps moving work around
>
> Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.

@avikivity I guess it is working fine for the java-driver, but at the same time this behavior is not tested there, so I am curious how often it is actually used and how often a node is slow. Especially because MAX_IN_FLIGHT_THRESHOLD is not configurable in the java-driver.

@avikivity (Member)

> > I'm suspicious of this policy; it can induce instability in the cluster.
> >
> > 1. A node is slower than the others
> > 2. The policy moves work to other nodes
> > 3. Another node becomes slow due to work moved
> > 4. Slowness moves around the cluster as the policy keeps moving work around
> >
> > Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.
>
> @avikivity I guess it is working fine for the java-driver, but at the same time this behavior is not tested there, so I am curious how often it is actually used and how often a node is slow. Especially because MAX_IN_FLIGHT_THRESHOLD is not configurable in the java-driver.

I don't have concrete evidence that it fails, just suspicions.

The policy originated[1] with Cassandra, where there is an actual source of node slowness: full garbage collection. We don't have this source, but we have others.

I heard that the "dynamic snitch" (which performs similar functionality for the coordinator->replica hop) performs poorly. See https://thelastpickle.com/blog/2019/01/30/new-cluster-recommendations.html (point 5).

MAX_IN_FLIGHT_THRESHOLD changes its meaning between shard-aware and shard-unaware modes, and is very workload dependent. In a cache-hit-intensive workload you'd see a small in-flight request count; in a cache-miss-intensive workload you'd see a high in-flight request count.

A node could have high latency because one of the replicas it is accessing is slow (mostly in small clusters, where a single replica is 1/3 of the replicas contacted by the coordinator; in large clusters the slow replica would be averaged out).

We could add it for parity with the Java driver, but we have to be careful about recommending it.

[1] I'm just guessing

Review comment on scylla.go (outdated):
func (p *scyllaConnPicker) InFlight() int {
result := 0
for i := range p.conns {
idx := int(atomic.AddUint64(&p.pos, 1))


idx := int(atomic.AddUint64(&p.pos, 1)) should be calculated once before the loop

@sylwiaszunejko (Collaborator, Author):

I changed that to just iterating through p.conns, for simplicity.
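
A sketch of what that simpler iteration might look like (the conn type and its in-use stream count are assumed stand-ins, not necessarily the merged code):

```go
// Hypothetical sketch of the simplified InFlight: iterate the connections in
// order and sum the in-use streams on each, instead of advancing p.pos.
package main

import "fmt"

// conn stands in for gocql's connection type; inUseStreams is an assumed
// count of streams currently serving requests on that connection.
type conn struct {
	inUseStreams int
}

type scyllaConnPicker struct {
	conns []*conn
}

func (p *scyllaConnPicker) InFlight() int {
	result := 0
	for _, c := range p.conns {
		if c != nil { // shard slots without a connection yet are nil
			result += c.inUseStreams
		}
	}
	return result
}

func main() {
	p := &scyllaConnPicker{conns: []*conn{{3}, nil, {7}}}
	fmt.Println(p.InFlight()) // 10
}
```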

Commit message:

The Java driver has a feature to automatically avoid slow replicas
using simple heuristics. This is one of the key features for
stable latency.

This commit adds an additional field in tokenAwareHostPolicy to control
whether the feature is enabled and what the maximum in-flight threshold is.

If the feature is enabled, the driver sorts the replicas to first try
those with less than the specified maximum of in-flight requests.

Fixes: scylladb#154
@sylwiaszunejko (Collaborator, Author)

I managed to observe the behavior in use by setting up a 3-node cluster with one "slow" node (with some sleeps added). Extra logs showed a larger number of in-use streams, and the slow node was considered unhealthy.

Now I will repeat the process to see the behavior in metrics.

@sylwiaszunejko (Collaborator, Author)

I used a 3-node cluster with one slow node (with some sleeps added).

It is not obvious how to observe the avoidance of slow replicas in metrics. If replica shuffling is disabled (the default behavior) or the query happens to be LWT, there is little to no difference between gocql with and without slow replica avoidance. The difference can only be seen if the slow node happens not to be first on the replica list.

On this graph we see the number of requests for different configurations: avoidSlowReplicas=false, MAX_IN_FLIGHT_THRESHOLD=0, MAX_IN_FLIGHT_THRESHOLD=10, MAX_IN_FLIGHT_THRESHOLD=5, and MAX_IN_FLIGHT_THRESHOLD=7 (the blue line is the slow node; the yellow one is first on the replica list most of the time and is as fast as the green one). The test runs 20 concurrent scenarios, each with 10 INSERT queries at RF=3 and 500 SELECT queries at CL=1.

Screenshot from 2024-05-08 12-49-16

If the driver does shuffle the replica list, enabling slow replica avoidance gives a positive outcome. We can see that the test execution is faster and more queries are directed to the fast nodes. (In this test there were 5 concurrent scenarios, each with 10 INSERT queries at RF=3 and 50 SELECT queries at CL=1.)

Screenshot from 2024-05-08 19-15-12

With a higher number of concurrent scenarios, the version without slow replica avoidance times out, while the version with it works just fine:

Screenshot from 2024-05-08 20-44-42

@sylwiaszunejko marked this pull request as ready for review on May 9, 2024, 11:40
@sylwiaszunejko (Collaborator, Author)

While testing the slow replica avoidance functionality, I realized that simple queries, e.g. SELECT pk, ck, v FROM keyspace1.table1 WHERE pk = x;, were incorrectly marked as LWT queries, and because of that there was little to no difference between gocql with and without slow replica avoidance. I submitted an issue (#174) and a PR to fix it (#173).

@sylwiaszunejko (Collaborator, Author)

@avikivity I added some metrics to show the impact on performance; what do you think about merging this?

@avikivity (Member)

> @avikivity I added some metrics to show the impact on performance; what do you think about merging this?

#168 (comment) is not an objection to merging, since it's adding functionality already in the Java driver (opt-in, I hope). I'll go over your measurements regardless.

@avikivity (Member)

btw - @michoecho this can feed into our discussion re slow shards.

@avikivity (Member)

Your measurement results are unsurprising: for sure, if one node really is slow, then avoiding it gives better results. My worry is that we'll misdetect a node as slow, thereby diverting requests to another node, which then becomes slow, starting a feedback loop in which we cause nodes to be slow.

If the detection threshold is sufficiently high, perhaps this doesn't happen.

@mrsinham

Hello there, just commenting to give a little context. We have had very slow nodes, especially during upgrades or hardware issues. With earlier versions of the driver, this slowed everything down, causing timeouts, read/write issues, and angry customers. This PR should definitely improve those situations.

@avelanarius merged commit 7f7905d into scylladb:master on May 21, 2024
1 check passed