Truncate cqlsh command fails to achieve consistency, expecting more nodes than are present in the cluster #11928
Comments
@asias, given https://github.com/scylladb/scylla-enterprise/blob/994a5f0fbb4c4ceb09a5a79b0bfbe67257842fdd/service/storage_proxy.cc#L5073
compounded with #11925 (comment), maybe we're hitting #11935. What do you think? Cc @xemul
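For context, here is a minimal, self-contained sketch of the kind of check that code path performs, assuming it behaves as described in the commit message further down (truncate refuses to run while any token owner looks unreachable). The names and types below (`cluster_view`, `nodes_to_truncate`) are illustrative stand-ins, not the actual Scylla classes:

```cpp
// Illustrative model only -- simplified stand-ins for the real Scylla code.
#include <cstdio>
#include <set>
#include <stdexcept>
#include <string>

using endpoint = std::string;

struct cluster_view {
    std::set<endpoint> live_token_owners;        // ring members believed alive
    std::set<endpoint> unreachable_token_owners; // ring members believed down
};

// Truncate must reach every node that owns part of the ring, so it refuses
// to run if any token owner is considered unreachable. If the two sets are
// derived from stale state (the suspicion in this thread), this check can
// fail even though every real node is up.
std::set<endpoint> nodes_to_truncate(const cluster_view& v) {
    if (!v.unreachable_token_owners.empty()) {
        throw std::runtime_error("Cannot truncate: some token owners are down");
    }
    return v.live_token_owners;
}

int main() {
    // A stale entry ("10.0.0.99") left over from a replaced node.
    cluster_view v{{"10.0.0.1", "10.0.0.2"}, {"10.0.0.99"}};
    try {
        nodes_to_truncate(v);
    } catch (const std::exception& e) {
        std::puts(e.what()); // truncate aborts because of the stale entry
    }
}
```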
I have the following patch from debugging #10296, the truncate issue. I suspected that _topology is out of sync. The patch seemed to help.
get_live_token_owners returns the nodes that are part of the ring and alive. get_unreachable_token_owners returns the nodes that are part of the ring and not alive. token_metadata::get_all_endpoints returns the nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring and calls is_alive to check whether each node is up or down, so that correctness does not depend on any derived information. This fixes a truncate issue in storage_proxy::truncate_blocking, which calls get_live_token_owners and get_unreachable_token_owners to decide which nodes to talk to for the truncate operation; the truncate failed because incorrect nodes were returned.
Fixes scylladb#10296
Fixes scylladb#11928
get_live_token_owners returns the nodes that are part of the ring and alive. get_unreachable_token_owners returns the nodes that are part of the ring and not alive. token_metadata::get_all_endpoints returns the nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring and calls is_alive to check whether each node is up or down, so that correctness does not depend on any derived information. This fixes a truncate issue in storage_proxy::truncate_blocking, which calls get_live_token_owners and get_unreachable_token_owners to decide which nodes to talk to for the truncate operation; the truncate failed because incorrect nodes were returned.
Fixes #10296
Fixes #11928
Closes #11952
(cherry picked from commit 16bd9ec)
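As a rough illustration of the approach the commit message describes (not the actual patch), both sets can be derived on demand from the authoritative ring membership plus a liveness check, so they can never drift out of sync with each other or with the ring. `gossiper_model` and `ring_members` below are hypothetical stand-ins for the real gossiper and token_metadata types:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using endpoint = std::string;

// Stand-in for the gossiper's liveness view (hypothetical, for illustration).
struct gossiper_model {
    std::set<endpoint> alive; // endpoints currently marked UP
    bool is_alive(const endpoint& ep) const { return alive.count(ep) != 0; }
};

// Stand-in for token_metadata::get_all_endpoints(): the single authoritative
// list of nodes that are part of the ring.
using ring_members = std::vector<endpoint>;

// Live token owners: ring members that the gossiper currently sees as up.
std::set<endpoint> get_live_token_owners(const ring_members& ring, const gossiper_model& g) {
    std::set<endpoint> live;
    for (const auto& ep : ring) {
        if (g.is_alive(ep)) {
            live.insert(ep);
        }
    }
    return live;
}

// Unreachable token owners: ring members that the gossiper sees as down.
std::set<endpoint> get_unreachable_token_owners(const ring_members& ring, const gossiper_model& g) {
    std::set<endpoint> down;
    for (const auto& ep : ring) {
        if (!g.is_alive(ep)) {
            down.insert(ep);
        }
    }
    return down;
}

int main() {
    ring_members ring{"10.0.0.1", "10.0.0.2", "10.0.0.3"};
    gossiper_model g{{"10.0.0.1", "10.0.0.2", "10.0.0.3"}}; // everyone is up
    // With every ring member alive, nothing is unreachable and truncate can
    // proceed, regardless of any previously cached (stale) sets.
    assert(get_live_token_owners(ring, g).size() == 3);
    assert(get_unreachable_token_owners(ring, g).empty());
}
```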
@asias, just seen something very similar in 2022.1.3.
Can you take a look?
Installation details
Kernel Version: 5.15.0-1020-gke
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
Yes, I think we need the backport.
@scylladb/scylla-maint, so this is a confirmation that at least 2022.1 needs this backport.
get_live_token_owners returns the nodes that are part of the ring and alive. get_unreachable_token_owners returns the nodes that are part of the ring and not alive. token_metadata::get_all_endpoints returns the nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring and calls is_alive to check whether each node is up or down, so that correctness does not depend on any derived information. This fixes a truncate issue in storage_proxy::truncate_blocking, which calls get_live_token_owners and get_unreachable_token_owners to decide which nodes to talk to for the truncate operation; the truncate failed because incorrect nodes were returned.
Fixes #10296
Fixes #11928
Closes #11952
(cherry picked from commit 16bd9ec)
Backported to 5.0.
Installation details
Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash):
2022.2.0~rc3-20221009.994a5f0fbb4c
with build-id ad2ce7ef5214afc9f921ccfa2ce00706376f77dc
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-aarch64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 6 nodes (im4gn.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-0dcd7f54f9f910597
(aws: eu-west-1)
Test:
longevity-50gb-3days-arm-test
Test id:
7f685b49-4c05-4d4d-987e-8287694d3a4e
Test name:
enterprise-2022.2/longevity/longevity-50gb-3days-arm-test
Test config file(s):
Issue description
During the Truncate nemesis, the cqlsh truncate command failed because it expected a larger number of nodes for ALL than there were in the cluster at the time. Nodes had been replaced before this error, and the cluster had been expanded and shrunk by one of the nemeses, as this was more than one day into the test run. Less than a second later we ran nodetool status to get a snapshot of the cluster's status, which lists only 6 nodes in the cluster:
$ hydra investigate show-monitor 7f685b49-4c05-4d4d-987e-8287694d3a4e
$ hydra investigate show-logs 7f685b49-4c05-4d4d-987e-8287694d3a4e
Logs:
Jenkins job URL