'Cannot perform truncate, some hosts are down' with hosts that have left the cluster #10296
Trying to look for gossip information. The sequence is:
Since this is a regression, I expect that a simple reproducer can be built.
The second truncate (after decommission+wait) is expected to fail.
Since it's gossip-related, assigning to Asias.
It looks like the truncate failed not just because of the decommissioned node with IP 3.165. It needs 7 nodes to be alive, but only 4 are alive.
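For context, the failure mode described above boils down to a pre-flight check: truncate refuses to run unless every token owner is reachable, so a single stale entry (for example, a decommissioned node still counted as an owner) fails the whole operation. Here is a minimal, self-contained sketch of that logic; the set types and the helper name are illustrative stand-ins of mine, not the actual storage_proxy::truncate_blocking code:

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Stand-in for a set of node endpoints, for illustration only.
using endpoint_set = std::set<std::string>;

// Sketch of the pre-flight check: truncate must be able to reach *all*
// token owners, so any owner missing from the live set aborts the operation.
void check_all_token_owners_alive(const endpoint_set& token_owners,
                                  const endpoint_set& live_nodes) {
    for (const auto& ep : token_owners) {
        if (live_nodes.count(ep) == 0) {
            throw std::runtime_error("Cannot perform truncate, some hosts are down");
        }
    }
    // ...otherwise proceed to send the truncate to every owner...
}
```

This is why a node that has left the cluster but is still accounted as a token owner makes every subsequent truncate fail, even when the rest of the cluster is healthy.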
I could not reproduce with the following dtest (scylla: c450508):
If nothing was fixed, I don't see a reason why it wouldn't reproduce.
@roydahan can you refer me to somewhere that explains how to do that? More importantly, how can one extract the sequence from a test? I would like to try to build a minimal reproducer for this eventually (if we can't find the problem immediately).
You copy this job ("Jenkins job URL") as a reproducer job in Jenkins and run it.
How do I do that? Is there a place to provide a seed in the job?
@asias after @roydahan / @k0machi answers the question above #10296 (comment)
This bug was reported on May 30. Has QA run this same test since then? Did it reproduce? I implemented a mini dtest reproducer following the procedure and did not reproduce.
Just to summarize things that were discussed verbally during the daily call, the course of action on this issue was agreed to be (in parallel):
@asias here is the gossip information from node 4, which had the problem - it got a message saying that not all nodes are available. Can we infer from the logs below what the state of gossip should have been at the point of failure?
I scratched the last comment since it was a bogus guess. Looking at the Prometheus metrics for node 4, it shows that no more
Installation details
Kernel Version: 5.15.0-1015-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
At
There are only 4 nodes in this scenario, and that stays true throughout the whole run. While there are nemeses that add and remove nodes, there was no "leftover" node from any of them at any point, both according to our logs and according to the "nodetool status" commands. It was not the only time we attempted to truncate a table in this run, and at
Logs:
@asias it seems that we sometimes keep accounting for nodes after they were decommissioned - maybe something related to quarantine, or to node restarts while there are nodes in quarantine?
I will take another look at the latest reproducer:
Even with https://argus.scylladb.com/workspace?state=WyI4ZGFmZGYyYS05NTY2LTRiYzktODU5OS02MWU0MGY1NDcwNjciLCJjYTFhYzg5NS03OGMwLTRhYmQtODBkNS0wYmE2ZGM3ODQ0YmYiXQ, it is pretty hard to reconstruct which operations were performed and what the status of the cluster was. @ShlomiBalalis can you confirm which nodes were in the cluster when the truncate operation was run? This is my guess:
For example: this does not mention which node was added after the decommission, or whether a node was added at all. It does not mention which node was added after the operations, or how the node was removed (decommission or removenode).
It basically does this, and I think this is another quirk of dpkg: it skips dependency checks and unpacks the packages. I think dpkg considers service files to be config files and doesn't override them with new ones, but when it goes to clean up the old versions, it removes the service file.
The run finished and did not fail at either of the truncates there. I'm going to re-run this job, this time using the sct master branch without any alterations to nemeses, and see if it reproduces.
@asias how would we know whether your patch fixes the issue or we just couldn't reproduce it?
The following is without the fix. Here is the branch: https://github.com/asias/scylla/tree/truncate_issue_5_1.qa.test2 Or the patch on top of 9deeeb4:
I have everything set up already, so I'll use the branch you've linked. Version string: I've already launched a test with the fix yesterday morning, but had to restart it after the weekend, unfortunately - something went wrong with the network. I'll queue up the job without the fix too.
@k0machi any updates? This is a blocker for 5.1.
Currently running a longevity test without the fix, with debug information; we should see tomorrow whether it reproduces the issue.
The issue wasn't reproduced. I've started another run while I'm investigating the differences between the runs that did have this issue and the ones that didn't.
@k0machi any news so far?
Two runs and it didn't reproduce. We're now going to try a test where only the truncate nemesis happens, over and over again.
@avikivity ping
Without the patch it reproduced for @KnifeyMoloko in #11928 (RC3).
get_live_token_owners returns the nodes that are part of the ring and alive. get_unreachable_token_owners returns the nodes that are part of the ring and not alive. token_metadata::get_all_endpoints returns the nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring, and to call is_alive to check whether each node is up or down, so that correctness does not depend on any derived information. This patch fixes a truncate issue in storage_proxy::truncate_blocking, which calls get_live_token_owners and get_unreachable_token_owners to decide which nodes to talk to for the truncate operation; the truncate failed because incorrect nodes were returned. Fixes scylladb#10296 Fixes scylladb#11928
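In sketch form, the fix derives both sets from the single authoritative list of ring members plus a liveness check, instead of relying on separately maintained live/unreachable sets that can go stale. A minimal illustration follows, assuming plain std::set endpoints and a classify_token_owners helper of my own naming; the real code uses token_metadata::get_all_endpoints and the gossiper's is_alive:

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Stand-in for a set of node endpoints, for illustration only.
using endpoint_set = std::set<std::string>;

// Illustrative helper: classify ring members as live or unreachable using a
// liveness predicate, so that both sets always derive from the authoritative
// ring membership instead of from separately maintained state that can drift.
std::pair<endpoint_set, endpoint_set>
classify_token_owners(const endpoint_set& ring_members,
                      const std::function<bool(const std::string&)>& is_alive) {
    endpoint_set live, unreachable;
    for (const auto& ep : ring_members) {
        (is_alive(ep) ? live : unreachable).insert(ep);
    }
    return {live, unreachable};
}
```

With this shape, the "live" and "unreachable" views can never disagree about who is in the ring, which is exactly the property the truncate pre-flight check needs.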
I sent a PR for the patch which @k0machi tested.
@avikivity I think we should backport @asias's fix ^^, please have a look.
@asias how far back should this be backported? Is only 5.1 affected? (If not, then it wasn't a regression.)
We only saw the issue with 5.1. Let's backport to 5.1 only.
get_live_token_owners returns the nodes that are part of the ring and alive. get_unreachable_token_owners returns the nodes that are part of the ring and not alive. token_metadata::get_all_endpoints returns the nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring, and to call is_alive to check whether each node is up or down, so that correctness does not depend on any derived information. This patch fixes a truncate issue in storage_proxy::truncate_blocking, which calls get_live_token_owners and get_unreachable_token_owners to decide which nodes to talk to for the truncate operation; the truncate failed because incorrect nodes were returned. Fixes #10296 Fixes #11928 Closes #11952 (cherry picked from commit 16bd9ec)
Backported to 5.1.
Installation details
Kernel Version: 5.13.0-1019-aws
Scylla version (or git commit hash): 5.1.dev-20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0d27db767997d0ec4 (aws: eu-west-1)
Test: longevity-twcs-48h-test
Test id: 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
Test name: longevity/longevity-twcs-48h-test
Test config file(s):
Issue description
During the disrupt_truncate_large_partition nemesis, a large partition is created and subsequently truncated on the same node. However, this node seems to think that not all nodes are currently available, and hence it is unable to perform the truncation, even though those missing nodes were decommissioned and left the cluster a long time ago.
Gossip information from that node
Node log
SCT
$ hydra investigate show-monitor 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
$ hydra investigate show-logs 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
Logs:
Jenkins job URL