
Add LOAD_BALANCING_POLICY_SLOW_AVOIDANCE functionality #168

Merged: 1 commit merged into scylladb:master on May 21, 2024

Conversation

@sylwiaszunejko (Collaborator) commented Apr 17, 2024

The Java driver has a feature that automatically avoids slow replicas using simple heuristics (https://github.com/scylladb/java-driver/blob/scylla-4.x/core/src/main/java/com/datastax/oss/driver/internal/core/loadbalancing/DefaultLoadBalancingPolicy.java#L104). This is one of the key features for achieving stable latency.

This PR adds an additional field in tokenAwareHostPolicy to control whether the feature is enabled and what the maximum in-flight threshold is.

If the feature is enabled, the driver sorts the replicas to first try those with less than the specified maximum of in-flight requests.

Fixes: #154
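
As a rough illustration of that sorting step (a minimal sketch under assumed names; gocql's real types differ, and host/reorderReplicas here are hypothetical):

```go
// Hypothetical sketch of the reordering heuristic described above: replicas
// currently under the in-flight threshold are tried first, and overloaded
// ones are pushed to the back while keeping the original order within groups.
package main

import "fmt"

// host stands in for the driver's internal host representation; inFlight is
// an assumed count of in-use streams on that host.
type host struct {
	addr     string
	inFlight int
}

const maxInFlightThreshold = 10 // this PR makes the threshold configurable

func reorderReplicas(replicas []host) []host {
	fast := make([]host, 0, len(replicas))
	slow := make([]host, 0, len(replicas))
	for _, h := range replicas {
		if h.inFlight < maxInFlightThreshold {
			fast = append(fast, h)
		} else {
			slow = append(slow, h)
		}
	}
	return append(fast, slow...)
}

func main() {
	replicas := []host{{"a", 3}, {"b", 42}, {"c", 1}}
	fmt.Println(reorderReplicas(replicas)) // [{a 3} {c 1} {b 42}]
}
```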

@sylwiaszunejko (Collaborator, Author)

I am not sure how to test this; I am open to any suggestions.

@roydahan (Collaborator)

Can you please provide a reference of this functionality in the Java driver?

@sylwiaszunejko (Collaborator, Author)

> Can you please provide a reference of this functionality in the Java driver?

I added a link to the code to the description.

@sylwiaszunejko self-assigned this on Apr 18, 2024
Review comment on policies.go (outdated):
@@ -424,6 +434,8 @@ type clusterMeta struct {
tokenRing *tokenRing
}

const MAX_IN_FLIGHT_THRESHOLD int = 10


Please make this configurable.
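
For context, a common Go way to satisfy this request is a functional option on the policy constructor; the sketch below shows the pattern with hypothetical names (it is not the merged gocql API):

```go
// A minimal functional-options sketch, assuming a tokenAwarePolicy-like
// struct; all names here are hypothetical stand-ins, not the merged code.
package main

import "fmt"

type tokenAwarePolicy struct {
	avoidSlowReplicas    bool
	maxInFlightThreshold int
}

// AvoidSlowReplicas enables the feature and sets the threshold.
func AvoidSlowReplicas(threshold int) func(*tokenAwarePolicy) {
	return func(p *tokenAwarePolicy) {
		p.avoidSlowReplicas = true
		p.maxInFlightThreshold = threshold
	}
}

func newTokenAwarePolicy(opts ...func(*tokenAwarePolicy)) *tokenAwarePolicy {
	p := &tokenAwarePolicy{maxInFlightThreshold: 10} // default from the diff above
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newTokenAwarePolicy(AvoidSlowReplicas(5))
	fmt.Printf("%+v\n", p) // &{avoidSlowReplicas:true maxInFlightThreshold:5}
}
```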

@avikivity (Member)

I'm suspicious of this policy; it can induce instability in the cluster.

  1. A node is slower than the others
  2. The policy moves work to other nodes
  3. Another node becomes slow due to work moved
  4. Slowness moves around the cluster as the policy keeps moving work around

Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.
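
To illustrate one form such dampening could take (purely illustrative; nothing like this is in the PR), a client could smooth the in-flight signal with an exponentially weighted moving average so a momentary spike does not immediately reclassify a node as slow:

```go
// EWMA dampening sketch (illustrative only): the avoidance decision uses a
// smoothed in-flight count rather than the instantaneous one, so short
// spikes decay instead of instantly diverting traffic.
package main

import "fmt"

type dampenedHost struct {
	smoothedInFlight float64
}

const alpha = 0.1 // low alpha = strong dampening; chosen arbitrarily here

// observe folds a new instantaneous in-flight sample into the average.
func (h *dampenedHost) observe(inFlight int) {
	h.smoothedInFlight = alpha*float64(inFlight) + (1-alpha)*h.smoothedInFlight
}

func (h *dampenedHost) isSlow(threshold float64) bool {
	return h.smoothedInFlight >= threshold
}

func main() {
	h := &dampenedHost{}
	for _, sample := range []int{2, 3, 50, 2, 3} { // one transient spike
		h.observe(sample)
	}
	// The spike alone does not trip the threshold.
	fmt.Printf("smoothed=%.1f slow=%v\n", h.smoothedInFlight, h.isSlow(10))
}
```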

@sylwiaszunejko (Collaborator, Author)

v2:

  • I made MAX_IN_FLIGHT_THRESHOLD configurable.
  • I realized that checking the number of requests in flight was not correct: I was checking the number of connections instead of the sum of in-use streams on every connection.

The process of testing this change is still in progress. I am trying to observe the behavior in use by setting up a 3-node cluster with one "slow" node (with some sleeps added), aiming to show that replicas on this host are unhealthy. However, so far I have not been able to observe a higher number of in-use streams.

@mykaul commented Apr 24, 2024

> If the feature is enabled, the driver sorts the replicas to first try those with less than the specified maximum of in-flight requests.

There was a nice addition of 'power of two choices' in one of our drivers (Java?); I think this might be a good candidate here as well.
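
For reference, 'power of two choices' picks two hosts at random and routes to the less loaded one, which bounds imbalance without comparing all hosts; a minimal sketch with assumed names:

```go
// Power-of-two-choices sketch (illustrative): pick two random candidates and
// send to whichever has fewer in-flight requests. Sampling is with
// replacement, as in the classic formulation of the technique.
package main

import (
	"fmt"
	"math/rand"
)

type candidate struct {
	addr     string
	inFlight int
}

func pickHost(hosts []candidate) candidate {
	a := hosts[rand.Intn(len(hosts))]
	b := hosts[rand.Intn(len(hosts))]
	if a.inFlight <= b.inFlight {
		return a
	}
	return b
}

func main() {
	hosts := []candidate{{"a", 3}, {"b", 42}, {"c", 1}}
	fmt.Println(pickHost(hosts))
}
```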

@sylwiaszunejko (Collaborator, Author)

> I'm suspicious of this policy; it can induce instability in the cluster.
>
> 1. A node is slower than the others
> 2. The policy moves work to other nodes
> 3. Another node becomes slow due to work moved
> 4. Slowness moves around the cluster as the policy keeps moving work around
>
> Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.

@avikivity I guess it is working fine for the java-driver, but at the same time this behavior is not tested there, so I am curious how often it is actually used and how often a node is slow. Especially because MAX_IN_FLIGHT_THRESHOLD is not configurable in the java-driver.

@avikivity (Member)

> > I'm suspicious of this policy; it can induce instability in the cluster.
> >
> > 1. A node is slower than the others
> > 2. The policy moves work to other nodes
> > 3. Another node becomes slow due to work moved
> > 4. Slowness moves around the cluster as the policy keeps moving work around
> >
> > Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.
>
> @avikivity I guess it is working fine for the java-driver, but at the same time this behavior is not tested there, so I am curious how often it is actually used and how often a node is slow. Especially because MAX_IN_FLIGHT_THRESHOLD is not configurable in the java-driver.

I don't have concrete evidence that it fails, just suspicions.

The policy originated[1] with Cassandra, where there is an actual source of node slowness: full garbage collection. We don't have this source, but we have others.

I heard that the "dynamic snitch" (which performs similar functionality for the coordinator->replica hop) performs poorly. See https://thelastpickle.com/blog/2019/01/30/new-cluster-recommendations.html (point 5).

MAX_IN_FLIGHT_THRESHOLD changes its meaning between shard-aware and shard-unaware modes, and is very workload dependent. In a cache-hit-intensive workload you'd see a small in-flight request count; in a cache-miss-intensive workload you'd see a high in-flight request count.

A node could have high latency because one of the replicas it is accessing is slow (mostly in small clusters, where a single replica is 1/3 of the replicas contacted by the coordinator; in large clusters the slow replica would be averaged out).

We could add it for parity with the Java driver, but we have to be careful about recommending it.

[1] I'm just guessing

Review comment on scylla.go (outdated):
func (p *scyllaConnPicker) InFlight() int {
result := 0
for i := range p.conns {
idx := int(atomic.AddUint64(&p.pos, 1))


idx := int(atomic.AddUint64(&p.pos, 1)) should be calculated once before the loop

@sylwiaszunejko (Collaborator, Author):

I changed that to just iterating through p.conns, for simplicity.
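
A sketch of what that simpler iteration might look like (the conn type and its in-use stream count are assumed stand-ins, not necessarily the merged code):

```go
// Hypothetical sketch of the simplified InFlight: iterate the connections in
// order and sum the in-use streams on each, instead of advancing p.pos.
package main

import "fmt"

// conn stands in for gocql's connection type; inUseStreams is an assumed
// count of streams currently serving requests on that connection.
type conn struct {
	inUseStreams int
}

type scyllaConnPicker struct {
	conns []*conn
}

func (p *scyllaConnPicker) InFlight() int {
	result := 0
	for _, c := range p.conns {
		if c != nil { // shard slots without a connection yet are nil
			result += c.inUseStreams
		}
	}
	return result
}

func main() {
	p := &scyllaConnPicker{conns: []*conn{{3}, nil, {7}}}
	fmt.Println(p.InFlight()) // 10
}
```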

Commit message:

The Java driver has a feature to automatically avoid slow replicas
using simple heuristics. This is one of the key features for
stable latency.

This commit adds an additional field in tokenAwareHostPolicy to control
whether the feature is enabled and what the maximum in-flight threshold is.

If the feature is enabled, the driver sorts the replicas to first try
those with less than the specified maximum of in-flight requests.

Fixes: scylladb#154
@sylwiaszunejko (Collaborator, Author)

I managed to observe the behavior in use by setting up a 3-node cluster with one "slow" node (with some sleeps added). Extra logs showed a larger number of in-use streams, and the slow node was considered unhealthy.

Now I will repeat the process to see the behavior in metrics.

@sylwiaszunejko (Collaborator, Author)

I used a 3-node cluster with one slow node (with some sleeps added).

It is not obvious how to observe the avoidance of slow replicas in metrics. If replica shuffling is disabled (the default behavior) or the query happens to be LWT, there is little to no difference between gocql with and without slow replica avoidance. The difference can only be seen if the slow node happens not to be first on the replica list.

On this graph we see the number of requests for different configurations: avoidSlowReplicas=false, MAX_IN_FLIGHT_THRESHOLD=0, MAX_IN_FLIGHT_THRESHOLD=10, MAX_IN_FLIGHT_THRESHOLD=5, and MAX_IN_FLIGHT_THRESHOLD=7 (the blue line is the slow node; the yellow one is first on the replica list most of the time and is as fast as the green one). The test runs 20 concurrent scenarios, each with 10 INSERT queries at RF=3 and 500 SELECT queries at CL=1.

Screenshot from 2024-05-08 12-49-16

If the driver does shuffle the replica list, enabling slow replica avoidance gives a positive outcome. We can see that the test execution is faster and more queries are directed to the fast nodes. (In this test there were 5 concurrent scenarios, each with 10 INSERT queries at RF=3 and 50 SELECT queries at CL=1.)

Screenshot from 2024-05-08 19-15-12

With a higher number of concurrent scenarios, the version without slow replica avoidance times out, while the version with it works just fine:

Screenshot from 2024-05-08 20-44-42

@sylwiaszunejko marked this pull request as ready for review on May 9, 2024, 11:40
@sylwiaszunejko (Collaborator, Author)

While testing the slow replica avoidance functionality, I realized that simple queries, e.g. SELECT pk, ck, v FROM keyspace1.table1 WHERE pk = x;, were incorrectly marked as LWT queries, and because of that there was little to no difference between gocql with and without slow replica avoidance. I submitted an issue (#174) and a PR to fix it (#173).

@sylwiaszunejko (Collaborator, Author)

@avikivity I added some metrics to show the impact on performance; what do you think about merging this?

@avikivity (Member)

> @avikivity I added some metrics to show the impact on performance; what do you think about merging this?

#168 (comment) is not an objection to merging, since it's adding functionality already in the Java driver (opt-in, I hope). I'll go over your measurements regardless.

@avikivity (Member)

btw - @michoecho this can feed into our discussion re slow shards.

@avikivity (Member)

Your measurement results are unsurprising: for sure, if one node really is slow, then avoiding it gives better results. My worry is that we'll misdetect a node as slow, thereby diverting requests to another node, which then becomes slow, starting a feedback loop in which we cause nodes to be slow.

If the detection threshold is sufficiently high, perhaps this doesn't happen.

@mrsinham

Hello there, just commenting to give a little context. We have had very slow nodes, especially during upgrades or hardware issues. With earlier versions of the driver, this slowed everything down, causing timeouts, read/write issues, and angry customers. This PR should definitely improve those situations.

@avelanarius merged commit 7f7905d into scylladb:master on May 21, 2024
1 check passed