[Tablets] Requests served imbalance after adding nodes to cluster #19107

Open
soyacz opened this issue Jun 5, 2024 · 77 comments
Labels: area/elastic cloud, area/tablets, P1 Urgent, symptom/performance (Issues causing performance problems), triage/master (Looking for assignee)

Comments

@soyacz
Contributor

soyacz commented Jun 5, 2024

Packages

Scylla version: 6.1.0~dev-20240528.519317dc5833 with build-id 75e8987548653166f5131039236650c1ead746f4

Kernel Version: 5.15.0-1062-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Test scenario covers scaling out a cluster from 3 nodes to 6, with the new nodes added in parallel.
While the nodes are being added, one node takes over most of the cluster load, while the requests served by the rest drop significantly.
Also, after the grow completes, requests served are still not balanced (before the grow, all nodes served requests equally).
The test uses cassandra-stress (c-s) with Java driver 3.11.5.2, which is tablet aware.

Impact

Degraded performance

How frequently does it reproduce?

Reproduces in all tablets elasticity tests (write, read, mixed)

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-6 (3.91.25.250 | 10.12.1.135) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-5 (44.204.102.74 | 10.12.0.9) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-4 (3.232.133.68 | 10.12.2.98) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-3 (107.23.70.58 | 10.12.0.47) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-2 (34.232.109.63 | 10.12.0.113) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-1 (3.239.252.252 | 10.12.2.208) (shards: 7)

OS / Image: ami-0a070c0d6ef92b552 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-grow-shrink
Test id: f417745e-0067-4479-95ee-24c9182267ce
Test name: scylla-staging/lukasz/scylla-master-perf-regression-latency-650gb-grow-shrink
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor f417745e-0067-4479-95ee-24c9182267ce
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs f417745e-0067-4479-95ee-24c9182267ce

Logs:

Date Log type Link
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240530_213037.tar.gz
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240531_001921.tar.gz
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240531_002918.tar.gz
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240531_003853.tar.gz
20240530_212207 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_212207/grafana-screenshot-overview-20240530_212207-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_212207 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_212207/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240530_212338-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_222727 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_222727/grafana-screenshot-overview-20240530_222727-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_222727 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_222727/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240530_222812-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_230010 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_230010/grafana-screenshot-overview-20240530_230032-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_230010 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_230010/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240530_230110-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_001123 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_001123/grafana-screenshot-overview-20240531_001144-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_001123 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_001123/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240531_001222-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_002134 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_002134/grafana-screenshot-overview-20240531_002134-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_002134 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_002134/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240531_002219-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_003108 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003108/grafana-screenshot-overview-20240531_003108-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_003108 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003108/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240531_003153-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_003920 db-cluster https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/db-cluster-f417745e.tar.gz
20240531_003920 loader-set https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/loader-set-f417745e.tar.gz
20240531_003920 monitor-set https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/monitor-set-f417745e.tar.gz
20240531_003920 sct https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/sct-f417745e.log.tar.gz
20240531_003920 event https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/sct-runner-events-f417745e.tar.gz

Jenkins job URL
Argus

soyacz added the symptom/performance and triage/master labels on Jun 5, 2024
@michoecho
Contributor

Seems like a driver-side load balancing issue. What's the driver's load balancing policy?

@bhalevy
Member

bhalevy commented Jun 5, 2024

Which metric(s) are imbalanced?
coordinator or replica side?
reads, writes or both?
what is the consistency level?
I agree with @michoecho that this could be a load-balancing issue on the driver side.

@michoecho
Contributor

michoecho commented Jun 5, 2024

Which metric(s) are imbalanced? coordinator or replica side? reads, writes or both? what is the consistency level? I agree with @michoecho that this could be a load-balancing issue on the driver side.

Coordinator work is unbalanced. Replica work is balanced. There are writes only. CL doesn't matter, since it's writes only.

The test starts with 3 nodes (7 shards on each node, 18 tablets replicated on each shard). Initially, work is perfectly balanced across coordinators.

Then, 3 nodes are bootstrapped in parallel. As soon as the bootstrap starts, the balance shatters — one of the 3 original nodes starts handling 80% of requests, while the other two handle 10% each. (I'm only showing the per-instance graph here, because shards within each instance are mostly symmetric, so the per-shard view isn't very interesting in this case).

The coordinators never return to balance after that, even after replica work is eventually balanced perfectly.

Also: shard awareness seems to break down thoroughly after the bootstrap (i.e. most requests are sent to the wrong shard), and it never recovers. However, node awareness works (i.e. all requests are sent to the right node). Edit: this part is wrong, see my later comment.

Hypothesis: tablet awareness load balancing in the java driver gets broken by tablet migrations.

image

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

Seems like a driver-side load balancing issue. What's the driver's load balancing policy?

I don't know the answer to this; the cassandra-stress default is used.

@bhalevy
Member

bhalevy commented Jun 6, 2024

Summoning @piodul

@piodul
Contributor

piodul commented Jun 6, 2024

Summoning @piodul

I have no idea about the java driver's implementation of support for tablets. I might be wrong, but AFAIK @Bouncheck implemented it and @Lorak-mmk was reviewing it, so they might have some ideas.

@Lorak-mmk
Contributor

Summoning @piodul

I have no idea about the java driver's implementation of support for tablets. I might be wrong, but AFAIK @Bouncheck implemented it and @Lorak-mmk was reviewing it, so they might have some ideas.

I'm reviewing the implementation in Java Driver 4.x (btw, why isn't c-s using 4.x?); I didn't really look at the 3.x implementation.

@piodul
Contributor

piodul commented Jun 6, 2024

cc: @avelanarius are there other people from your team who could take a look at it?

@Lorak-mmk
Contributor

Lorak-mmk commented Jun 6, 2024

@soyacz would it be hard to run similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13 which has tablet awareness.
This way we'd know if it's a bug in implementation of Java Driver 3.x, or something more universal (server bug / inherent problem with how tablet awareness works).

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

@soyacz would it be hard to run similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13 which has tablet awareness. This way we'd know if it's a bug in implementation of Java Driver 3.x, or something more universal (server bug / inherent problem with how tablet awareness works).

I can try, if throttling is supported should work without much effort (only prepare stage needs to be adjusted possibly to not overload cluster). But first would be good if we updated cql-stress to newest release with newest drivers, see scylladb/scylla-cluster-tests#7582

@Lorak-mmk
Contributor

@soyacz would it be hard to run similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13 which has tablet awareness. This way we'd know if it's a bug in implementation of Java Driver 3.x, or something more universal (server bug / inherent problem with how tablet awareness works).

I can try, if throttling is supported should work without much effort (only prepare stage needs to be adjusted possibly to not overload cluster). But first would be good if we updated cql-stress to newest release with newest drivers, see scylladb/scylla-cluster-tests#7582

I see in that issue that @fruch managed to build the new version. Is there anything blocking the update?

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

@soyacz would it be hard to run similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13 which has tablet awareness. This way we'd know if it's a bug in implementation of Java Driver 3.x, or something more universal (server bug / inherent problem with how tablet awareness works).

I can try, if throttling is supported should work without much effort (only prepare stage needs to be adjusted possibly to not overload cluster). But first would be good if we updated cql-stress to newest release with newest drivers, see scylladb/scylla-cluster-tests#7582

I see in that issue that @fruch managed to build the new version. Is there anything blocking the update?

trying it.

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

The test unfortunately failed before growing the cluster, due to an issue with parsing the cql-stress result. Issue created: scylladb/cql-stress#95

@soyacz
Contributor Author

soyacz commented Jun 7, 2024

I managed to execute the test with cql-stress as the stress tool (based on the Rust driver with tablet support) and it looks different, but still bad (still unbalanced). This is a write-only test:
image

Unfortunately, there is no email report with details. For more metrics: hydra investigate show-monitor 7634564e-35e4-4cff-bbbb-f8aab3dc4d03
logs:

| Date | Log type | Link |
+-----------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_081603.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_111725.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_111949.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_112232.tar.gz |
| 20240607_081402 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_081402/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_081433-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_091847 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_091847/grafana-screenshot-overview-20240607_091847-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_091847 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_091847/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_091953-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_093035 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_093035/grafana-screenshot-overview-20240607_093035-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_093035 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_093035/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_093141-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_103923 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_103923/grafana-screenshot-overview-20240607_103945-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_103923 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_103923/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_104051-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111449 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111449/grafana-screenshot-overview-20240607_111511-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111449 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111449/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_111617-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111735 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111735/grafana-screenshot-overview-20240607_111735-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111735 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111735/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_111841-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_112018 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112018/grafana-screenshot-overview-20240607_112018-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_112018 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112018/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_112124-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_112256 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/db-cluster-7634564e.tar.gz |
| 20240607_112256 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/loader-set-7634564e.tar.gz |
| 20240607_112256 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/monitor-set-7634564e.tar.gz |
| 20240607_112256 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/sct-7634564e.log.tar.gz |
| 20240607_112256 | event | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/sct-runner-events-7634564e.tar.gz |

@michoecho
Contributor

I managed to execute the test with cql-stress as the stress tool (based on the Rust driver with tablet support) and it looks different, but still bad (still unbalanced). This is a write-only test:

In this case the problem is different.

image

As you can see, this time coordinator work is directly proportional to replica work — which means that this time load balancing works.

This time, the imbalance doesn't come from bad load balancing on the client side, but from bad balancing of tablets on the server. cql-stress apparently creates two tables instead of one — standard1 and counter1, cassandra-stress creates only one of those, depending on the test — and their mix on each shard is arbitrary. So there are some shards with a majority of standard1 tablets and some with a majority of counter1 tablets, but only standard1 is used, so replica work is unbalanced.

Note that there is still a high rate of cross-shard ops. I've said earlier that "shard awareness seems to break down", but I've just realized that this isn't true — it's a server-side issue. Shards only communicate with their siblings on other nodes. With vnodes, a replica set for a given token range is always replicated on a set of sibling shards, so shards can send replica requests to their siblings directly, and there are no cross-shard ops. With tablets, there is no such property — on different nodes, the same tablet will be replicated on shards with different numbers, so cross-shard ops are unavoidable.
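To make the last point concrete, here is a toy, self-contained Rust illustration (not Scylla or driver code; the token-to-shard function, node names, and shard numbers are made up) of why vnodes let a coordinator shard talk only to its sibling shards while tablets force cross-shard hops:

// Toy model only: real Scylla uses its own token-to-shard function; the
// numbers and names below are illustrative.

// With vnodes, every node maps a token to a shard with the same function,
// so a coordinator on shard N always talks to shard N on the replica nodes.
fn vnode_shard_for_token(token: u64, shard_count: u64) -> u64 {
    token % shard_count // stand-in for the real token-to-shard mapping
}

// With tablets, each replica records its own (node, shard) pair, chosen
// independently per node, so the shard numbers generally differ.
struct TabletReplica {
    node: &'static str,
    shard: u64,
}

fn main() {
    let shards = 7;
    let token = 12345_u64;
    println!("vnodes: every replica handles this token on shard {}", vnode_shard_for_token(token, shards));

    let tablet_replicas = [
        TabletReplica { node: "node-1", shard: 2 },
        TabletReplica { node: "node-2", shard: 5 },
        TabletReplica { node: "node-3", shard: 0 },
    ];
    // A coordinator running on shard 2 of node-1 must forward to shard 5 on
    // node-2 and shard 0 on node-3: cross-shard hops are unavoidable.
    for r in &tablet_replicas {
        println!("tablet replica on {} lives on shard {}", r.node, r.shard);
    }
}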

@michoecho
Contributor

michoecho commented Jun 7, 2024

So, to sum up: with cql-stress, the results look (to me) as expected. So it would appear that the problem is with the java driver. (And the problem is probably just with load balancing. I was wrong earlier about requests being sent to non-replica shards).

@Lorak-mmk
Contributor

So, to sum up: with cql-stress, the results look (to me) as expected. So it would appear that the problem is with the java driver. (And the problem is probably just with load balancing. I was wrong earlier about requests being sent to non-replica shards).

In that case @Bouncheck will be the best person to investigate this

@roydahan

So, to sum up: with cql-stress, the results look (to me) as expected.

@michoecho regarding the cql-stress and counter table, something doesn't add up.
AFAIU from @muzarski, the counters table was created but there are no writes/reads to it.
Furthermore, the counters table isn't supported with tablets, so according to @avikivity it should have been rejected and not created at all.

@michoecho
Contributor

@michoecho regarding the cql-stress and counter table, something doesn't adds up.
AFAIU from @muzarski, the counters table was created but there are no writes/reads to it.

Correct. What about this doesn't add up? Tablet load balancer doesn't care about traffic, only the number of tablets.

Furthermore, counters table isn't supported for tablets so according to @avikivity it should have been rejected and not being created at all.

Then apparently the codebase didn't get the memo, because they aren't rejected.

@bhalevy
Member

bhalevy commented Jun 20, 2024

@michoecho regarding the cql-stress and counter table, something doesn't adds up.
AFAIU from @muzarski, the counters table was created but there are no writes/reads to it.

Correct. What about this doesn't add up? Tablet load balancer doesn't care about traffic, only the number of tablets.

Furthermore, counters table isn't supported for tablets so according to @avikivity it should have been rejected and not being created at all.

Then apparently the codebase didn't get the memo, because they aren't rejected.

issue number?

@michoecho
Contributor

Furthermore, counters table isn't supported for tablets so according to @avikivity it should have been rejected and not being created at all.

Then apparently the codebase didn't get the memo, because they aren't rejected.

issue number?

Are you asking whether there is an existing ticket for this, or are you asking me to create one?

Opened #19449.

@bhalevy
Member

bhalevy commented Jun 24, 2024

Furthermore, counters table isn't supported for tablets so according to @avikivity it should have been rejected and not being created at all.

Then apparently the codebase didn't get the memo, because they aren't rejected.

issue number?

Are you asking whether there is an existing ticket for this, or are you asking me to create one?

Either...

Opened #19449.

Thanks!

@fee-mendes
Member

I've been playing with cql-stress (I manually patched it myself to stop creating counter tables) and I get a similar issue as described here. While scaling or downscaling a cluster, I observe the driver/stressor emit the following log lines:

2024-06-20T22:33:46.136372Z  WARN scylla::transport::topology: Failed to establish control connection and fetch metadata on all known peers. Falling back to initial contact points.
2024-06-20T22:33:46.136430Z  WARN scylla::transport::topology: Failed to fetch metadata using current control connection control_connection_address="172.31.16.15:9042" error=Protocol Error: system.peers or system.local has invalid column type
2024-06-20T22:33:46.146289Z ERROR scylla::transport::topology: Could not fetch metadata error=Protocol Error: system.peers or system.local has invalid column type

What happens afterwards is that throughput basically tanks. A naive user will then try restarting the client to see if it "helps"; this actually worsens the situation, because now the driver will only connect to the initial contact point. For example:

image

If anyone would like more metrics I can easily reproduce it and make it available somewhere.

@Lorak-mmk
Contributor

I've been playing with cql-stress (I manually patched it myself to stop creating counter tables) and I get a similar issue as described here. While scaling or downscaling a cluster, I observe the driver/stressor emit the following log lines:

2024-06-20T22:33:46.136372Z  WARN scylla::transport::topology: Failed to establish control connection and fetch metadata on all known peers. Falling back to initial contact points.
2024-06-20T22:33:46.136430Z  WARN scylla::transport::topology: Failed to fetch metadata using current control connection control_connection_address="172.31.16.15:9042" error=Protocol Error: system.peers or system.local has invalid column type
2024-06-20T22:33:46.146289Z ERROR scylla::transport::topology: Could not fetch metadata error=Protocol Error: system.peers or system.local has invalid column type

What happens after is that throughput basically tanks down. A naive user will then try to restart the client to see if it "helps" with anything, this actually worsens the situation because now the driver will only connect to the initial contact point. For example:
image

If anyone would like more metrics I can easily reproduce it and make it available somewhere.

Could you describe how I can reproduce it myself? I'd like to debug this from the driver side.

@fee-mendes
Member

Could you describe how I can reproduce it myself? I'd like to debug this from the driver side.

I run the following load (starting from a small 3-node i4i.xlarge cluster):

  1. Ingest whatever:
cargo run --release --bin cql-stress-cassandra-stress -- write n=100M cl=local_quorum keysize=100 -col n=5 size='FIXED(200)' -mode cql3 -rate throttle=120000/s threads=8 -pop seq=1..100M -node 172.31.16.15
  2. Run a mixed workload from (1):
cargo run --release --bin cql-stress-cassandra-stress -- mixed duration=6h cl=local_quorum keysize=100 'ratio(read=8,write=2)' -col n=5 size='FIXED(200)' -mode cql3 -rate throttle=120000/s threads=32 -pop seq=1..1M -node 172.31.16.15
  3. Add 3 nodes to the cluster:
---
- name: Double cluster-size
  hosts: double_cluster
  become: True

  tasks: 
    - name: Start ScyllaDB Service
      ansible.builtin.systemd_service:
        name: scylla-server.service
        state: started

    - name: Waiting for CQL port readiness
      wait_for:
        port: 9042
        host: 127.0.0.1
        connect_timeout: 3
        delay: 3
        sleep: 10
        timeout: 1200
        state: present

Considering you added enough data in (1), you'll see the problem shortly after you run step 3 above, and throughput and clients won't recover until the full tablet migration is complete. As soon as you see the warning/error in the logs, restart the client: you will notice the driver will only route traffic to the contact point you specified on the command line.

@gleb-cloudius
Contributor

This apparently breaks the contract with drivers which only expect system.peers to contain entries for normal nodes.

Why don't we fix it on the driver side, to try again and get system.peers after some time? (though I am not sure how will Cassandra drivers behave?)

Anything else? (remember we need to commit to that so it can be safely implemented on all drivers and not expected to change any time soon).

Actually, there's still a race which may happen during the initial connection phase. If a client connects to the server exactly when system.peers contains invalid entries,

And with vnodes it may be in this state for a long time.

@mykaul
Contributor

mykaul commented Jun 26, 2024

Why don't we fix it on the driver side, to try again and get system.peers after some time?

@mykaul I think this is already happening. Drivers are refreshing topology metadata periodically.

They may not soon - scylladb/scylla-rust-driver#1008

And judging from the driver log output that @fee-mendes posted earlier -- #19107 (comment) -- the system.peers problem is indeed transient. Drivers recover eventually.

Which points to the hypothesis that something else is responsible for the imbalance. (If the imbalance stays?)

May well be - and perhaps we should split this issue - handle here the ongoing (?) imbalance and elsewhere the issue with system.peers.

@kbr-scylla
Contributor

If a client connects to the server exactly when system.peers contains invalid entries, it will backoff and just route queries to the initial list of contact points

Ugh. So the driver, if it sees one "invalid" system.peers entry, it abandons the entire system.peers fetch?

That sounds pretty drastic. It should handle the rows that it considers correct/full, and ignore just the invalid rows.

Then our change of adding partial rows corresponding to bootstrapping nodes would be transparent to the drivers. They would simply ignore these partial rows, and the end result would be equivalent to pre-6.0 state, where those rows don't exist in the first place.

We should check what other drivers do, e.g. Python driver.

@kbr-scylla
Contributor

The "has invalid column type" error in the Rust driver really sounds like we unintentionally fail the entire system.peers fetch because of one unexpected null.

OTOH, in the Python driver we can find code like this (thanks @patjed41 for digging this out):

for row in peers_result:
    if not self._is_valid_peer(row):
        continue

which means we only skip over the partial rows -- so the Python driver gives an equivalent result, as if the partial rows weren't there.

This could be the reason why we didn't notice that there's a problem in our tests -- in test.py and dtest we only use the Python driver...

@mykaul
Contributor

mykaul commented Jun 26, 2024

GoCQL code - https://github.com/scylladb/gocql/blob/2c5fba30d56bbc3b30c4049ef11db8d45d4fde3d/host_source.go#L645 (called from https://github.com/scylladb/gocql/blob/74675d1c5ba516724eb09732cac9a7abc2fb9936/host_source.go#L841 ) :

for _, row := range rows {
		// extract all available info about the peer
		host, err := r.session.hostInfoFromMap(row, &HostInfo{port: r.session.cfg.Port})
		if err != nil {
			return nil, err
		} else if !isValidPeer(host) {
			// If it's not a valid peer
			r.session.logger.Printf("Found invalid peer '%s' "+
				"Likely due to a gossip or snitch issue, this host will be ignored", host)
			continue
		}

		peers = append(peers, host)
	}

@Lorak-mmk
Contributor

Lorak-mmk commented Jun 26, 2024

If a client connects to the server exactly when system.peers contains invalid entries, it will backoff and just route queries to the initial list of contact points

Ugh. So the driver, if it sees one "invalid" system.peers entry, it abandons the entire system.peers fetch?

That sounds pretty drastic. It should handle the rows that considers correct/full, but ignore just the invalid rows.

OTOH, failing fast and visibly may have its benefits - it's possible we wouldn't have noticed the issue if the Rust Driver had skipped only one node.
We could add skipping of invalid rows to our drivers to handle the case where a driver has just connected and made its initial fetch during an invalid state.
The question is what Cassandra does (does it ever have a null rpc address, and do the DataStax drivers handle a null rpc address?), because we should be compatible with their drivers.
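For illustration, a minimal Rust sketch of the skipping described above (this is not the actual scylla-rust-driver code; PeerRow and its fields are hypothetical stand-ins for the relevant system.peers columns): partial rows are logged and dropped instead of failing the whole metadata fetch.

// Hedged sketch only; the real driver's row type and validation differ.
#[derive(Debug)]
struct PeerRow {
    rpc_address: Option<std::net::IpAddr>,
    host_id: Option<String>,
    tokens: Option<Vec<String>>,
}

// Keep only rows that carry everything needed to open connections to the
// peer; log and drop the rest (e.g. nodes that are still bootstrapping)
// instead of abandoning the entire fetch.
fn usable_peers(rows: Vec<PeerRow>) -> Vec<PeerRow> {
    rows.into_iter()
        .filter(|row| {
            let complete = row.rpc_address.is_some()
                && row.host_id.is_some()
                && row.tokens.is_some();
            if !complete {
                eprintln!("ignoring incomplete system.peers row: {:?}", row);
            }
            complete
        })
        .collect()
}

fn main() {
    let rows = vec![
        PeerRow {
            rpc_address: "10.0.0.2".parse().ok(),
            host_id: Some("host-a".to_string()),
            tokens: Some(vec!["123".to_string()]),
        },
        // A bootstrapping node: the row exists but key fields are still null.
        PeerRow { rpc_address: None, host_id: None, tokens: None },
    ];
    println!("usable peers: {}", usable_peers(rows).len());
}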

What is imo important is that the cql event is only sent after the new node is ready to fetch (and so contains rpc address and other fields) - otherwise the event is useless.

Then our change of adding partial rows corresponding to bootstrapping nodes would be transparent to the drivers. They would simply ignore these partial rows, and the end result would be equivalent to pre-6.0 state, where those rows don't exist in the first place.

We should check what other drivers do, e.g. Python driver.

@mykaul
Contributor

mykaul commented Jun 26, 2024

What is imo important is that the cql event is only sent after the new node is ready to fetch (and so contains rpc address and other fields) - otherwise the event is useless.

What if it is sent after 1 node was added, and another is pending? With 'parallel' bootstrap (is that the right terminology?) that might happen, no?

@gleb-cloudius
Contributor

If a client connects to the server exactly when system.peers contains invalid entries, it will backoff and just route queries to the initial list of contact points

Ugh. So the driver, if it sees one "invalid" system.peers entry, it abandons the entire system.peers fetch?
That sounds pretty drastic. It should handle the rows that considers correct/full, but ignore just the invalid rows.

OTOH failing fast and visibly may have it's benefits - it's possible we wouldn't have noticed the issue if Rust Driver skipped only one node.

There wouldn't be an issue if all drivers skipped incomplete rows.

What is imo important is that the cql event is only sent after the new node is ready to fetch (and so contains rpc address and other fields) - otherwise the event is useless.

If a CQL event is sent before the row for the node the event was sent for is complete (does the event have node info at all?), it is a Scylla bug.

@Lorak-mmk
Contributor

Lorak-mmk commented Jun 26, 2024

What is imo important is that the cql event is only sent after the new node is ready to fetch (and so contains rpc address and other fields) - otherwise the event is useless.

What if it is sent after 1 node was added, and another is pending? With 'parallel' bootstrap (is that the right terminology?) that might happen, no?

This would be fine - a second event would be sent after the second node finishes joining, and the driver could then fetch it.

If a client connects to the server exactly when system.peers contains invalid entries, it will backoff and just route queries to the initial list of contact points

Ugh. So the driver, if it sees one "invalid" system.peers entry, it abandons the entire system.peers fetch?
That sounds pretty drastic. It should handle the rows that considers correct/full, but ignore just the invalid rows.

OTOH failing fast and visibly may have it's benefits - it's possible we wouldn't have noticed the issue if Rust Driver skipped only one node.

There wouldn't be an issue if all drivers skipped incomplete rows.

Maybe it's just a problem in the Rust Driver, or maybe there are other drivers that also have it; I'm not sure.
Anyway, if Scylla didn't show invalid rows in system.peers, the problem wouldn't exist at all, no matter how the driver implemented it.
It is easier to change one place (Scylla) than N drivers, some of which are not even ours.

What is imo important is that the cql event is only sent after the new node is ready to fetch (and so contains rpc address and other fields) - otherwise the event is useless.

If a CQL event is sent before the row for the node the event was sent for is complete (does the event have node info at all?), it is a Scylla bug.

@gleb-cloudius
Contributor

What is imo important is that the cql event is only sent after the new node is ready to fetch (and so contains rpc address and other fields) - otherwise the event is useless.

What if it is sent after 1 node was added, and another is pending? With 'parallel' bootstrap (is that the right terminology?) that might happen, no?

This scenario is possible even without 'parallel' bootstrap. But it looks like the notification contains the information about the node it notifies about.

@gleb-cloudius
Contributor

Maybe it's just a problem in Rust Driver, or maybe there are some other driver that also have it, I'm not sure. Anyway, if Scylla doesn't show invalid rows in system.peers then the problem wouldn't exist at all, no matter how the driver implemented it. It is easier to change one place (Scylla) than N drivers, some of which are not even ours.

No. In fact it is not easier to change Scylla after the release, unless we find some other place to store this info in a backwards-compatible way (maybe we can use scylla_local for it). And how do you decide which row is invalid and which is not? Some rows may be missing some info that a driver needs to create a new host connection. Then it should skip doing so, not completely abandon everything and give up. We want to have tokenless nodes, which means the token field in the local and peers tables will be empty. Will such entries completely kill the Rust driver as well? We will not have a workaround for it in Scylla.

@kbr-scylla
Contributor

I opened #19507 -- let's continue discussing the system.peers issue there if necessary.

The current issue might turn out to be unrelated after all.

I suspect that it is tablets-specific. After all, we did implement a ton of tablets-specific load balancing code, didn't we? With the lazy fetching of the tablet replica mapping etc. (cc @sylwiaszunejko) -- I think it should be a major suspect in this continued imbalance issue.

@kbr-scylla
Contributor

So, I re-read the thread again from the beginning, this time carefully...

@michoecho already confirmed before that this problem is tablets specific.
#19107 (comment)
#19107 (comment)

IIUC there are two imbalance-related bugs:

  • one in java driver's implementation of load balancing for tablets (cc @Lorak-mmk)
  • and one in Scylla -- bad balancing of tablet replicas across nodes (cc @tgrabiec)

This entire system.peers story is just a huge, unrelated red herring, and we (I) keep spamming this issue about it; I don't know why.

Let's just wait for @soyacz's results from the non-tablets run to see whether the imbalance happens there too, and then we should have the full picture.

@avikivity
Member

Drivers are supposed to occasionally "forget" the tablet mapping in order to get a fresh one.

@Lorak-mmk
Contributor

Drivers are supposed to occasionally "forget" the tablet mapping in order to get a fresh one.

They don't forget the whole mapping. When they send a statement to the wrong node, they get back a payload with the correct tablet for that statement. The driver then removes from its local mapping any tablets that overlap with the newly received one, and inserts the newly received one.
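
A simplified sketch of that update in Rust (the data structures are assumed for the sketch, not the driver's actual ones): evict every cached tablet whose token range overlaps the tablet carried in the response payload, then insert the fresh one.

// Illustrative only; token ranges here are (first_token, last_token] and the
// types are invented for the sketch.
#[derive(Clone, Debug)]
struct Tablet {
    first_token: i64,             // exclusive start of the owned range
    last_token: i64,              // inclusive end of the owned range
    replicas: Vec<(String, u32)>, // (node, shard) pairs
}

fn overlaps(a: &Tablet, b: &Tablet) -> bool {
    a.first_token < b.last_token && b.first_token < a.last_token
}

// Called when a response to a wrongly-routed statement carries a tablet
// payload: evict stale overlapping entries, then cache the fresh one.
fn update_tablet_map(table_tablets: &mut Vec<Tablet>, incoming: Tablet) {
    table_tablets.retain(|t| !overlaps(t, &incoming));
    table_tablets.push(incoming);
    table_tablets.sort_by_key(|t| t.first_token);
}

fn main() {
    let mut tablets = vec![
        Tablet { first_token: -100, last_token: 0, replicas: vec![("node-1".into(), 3)] },
        Tablet { first_token: 0, last_token: 100, replicas: vec![("node-2".into(), 1)] },
    ];
    // The payload says the tablet covering (-50, 50] now lives on node-3, shard 5.
    let fresh = Tablet { first_token: -50, last_token: 50, replicas: vec![("node-3".into(), 5)] };
    update_tablet_map(&mut tablets, fresh);
    println!("{:?}", tablets);
}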

@tgrabiec
Contributor

tgrabiec commented Jun 27, 2024 via email

@mykaul
Contributor

mykaul commented Oct 6, 2024

@dimakr - can you please ensure there's nothing to do here in any of the drivers?

@dimakr

dimakr commented Oct 6, 2024

I assume the question is for @dkropachev

@mykaul
Contributor

mykaul commented Nov 4, 2024

I assume the question is for @dkropachev

@dkropachev ?

@Bouncheck

I'll prioritize investigating the java driver (3.x) side now

@dkropachev
Contributor

dkropachev commented Nov 4, 2024

@dimakr - can you please ensure there's nothing to do here in any of the drivers?

There is definitely a problem on the java-driver 3.x side with imbalanced load after nodes are added.
The Rust driver problem was fixed here: scylladb/scylla-rust-driver#1023
