
'Cannot perform truncate, some hosts are down' with hosts that have left the cluster #10296

Closed
k0machi opened this issue Mar 30, 2022 · 67 comments

@k0machi
Contributor

k0machi commented Mar 30, 2022

Installation details

Kernel Version: 5.13.0-1019-aws
Scylla version (or git commit hash): 5.1.dev-20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7
Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-2acb1d5c-1 (34.249.11.226 | 10.0.1.42) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-2 (54.154.214.184 | 10.0.1.237) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-3 (34.248.103.231 | 10.0.2.143) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-4 (3.250.227.110 | 10.0.0.15) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-5 (54.170.137.53 | 10.0.0.106) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-6 (54.216.174.11 | 10.0.2.196) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-7 (63.32.45.143 | 10.0.0.235) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-8 (54.195.204.64 | 10.0.3.165) (shards: 8)

OS / Image: ami-0d27db767997d0ec4 (aws: eu-west-1)

Test: longevity-twcs-48h-test
Test id: 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
Test name: longevity/longevity-twcs-48h-test
Test config file(s):

Issue description

During the disrupt_truncate_large_partition nemesis, a large partition is created and subsequently truncated on the same node. However, that node seems to think that not all nodes are currently available and is therefore unable to perform the truncation, even though the missing nodes were decommissioned and left the cluster a long time ago.
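
In effect, the nemesis ends up issuing a plain CQL TRUNCATE against the target node, along the lines of the command below (taken from the SCT log quoted later in this report; the keyspace and table are the ones the nemesis creates):

cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "TRUNCATE ks_truncate_large_partition.test_table" 10.0.0.15 9042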

Gossip information from that node

< t:2022-03-26 19:45:31,709 f:cluster.py      l:2623 c:sdcm.cluster_aws     p:DEBUG > Node longevity-twcs-48h-master-db-node-2acb1d5c-4 [3.250.227.110 | 10.0.0.15] (seed: False): Command '/usr/bin/nodetool  gossipinfo ' duration -> 1.4254664019972552 s
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2750 c:sdcm.cluster         p:DEBUG > get_gossip_info: Command exited with status 0.
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2750 c:sdcm.cluster         p:DEBUG > === stdout ===
<...>
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.0.106 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.1.237 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:31,711 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.2.196 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:31,711 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.3.165 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:33,706 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status=Up/Down
< t:2022-03-26 19:45:33,706 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
< t:2022-03-26 19:45:33,707 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
< t:2022-03-26 19:45:33,712 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.1.42   32.83 GB   256          ?       e298382a-4bbb-495a-817c-4ec2b47f1210  1a
< t:2022-03-26 19:45:33,717 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.0.235  25.23 GB   256          ?       737c6ca2-88db-45a2-aeb1-c1225826c948  1a
< t:2022-03-26 19:45:33,720 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.0.15   28.22 GB   256          ?       26449be3-7af3-4616-916a-adc541b1cbd1  1a
< t:2022-03-26 19:45:33,722 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.2.143  28.11 GB   256          ?       8dca9f03-a2a7-4c16-9bfa-9b6a93bfeb4e  1a

Node log

Mar 26 19:47:46 longevity-twcs-48h-master-db-node-2acb1d5c-4 scylla[68161]:  [shard 1] create_keyspace - Using SimpleStrategy in a multi-datacenter environment is not recommended.
Mar 26 19:49:05 longevity-twcs-48h-master-db-node-2acb1d5c-4 scylla[68161]:  [shard 4] storage_proxy - Cannot perform truncate, some hosts are down

SCT

< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "TRUNCATE ks_truncate_large_partition.test_table" 10.0.0.15 9042'
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Exit code: 2
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stdout:
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stderr:
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > <stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 7, alive 4)
  • Restore Monitor Stack command: $ hydra investigate show-monitor 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903

Logs:

Jenkins job URL

@bentsi bentsi added the triage/master Looking for assignee label Mar 31, 2022
@slivne
Contributor

slivne commented Apr 3, 2022

Trying to look for gossip information

Sequence is:

  1. Node 3.165 is up
  2. Ran repair with all nodes
  3. Node 3.165 decommissioned
  4. Truncate 20 minutes later fails
[shlomi@shlomi-scylladb-com i_10296]$ grep 2022-03-26T1[4-9]: truncate db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log | grep -v kernel | egrep  "gossip -| truncate|starting|DOWN"  | more
grep: truncate: No such file or directory
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T17:38:43+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now UP, status = UNKNOWN
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:35+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[2a40a2cc-84a2-4915-a011-e436ec0c5aaf]: starting user-requested repair for keyspace keyspace1, repair id 1, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:37+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[f9a6383e-bab0-44d0-94b1-9e28c7ecdf75]: starting user-requested repair for keyspace system_distributed_everywhere, repair id 2, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:37+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[782a98df-bf27-4c36-bed4-31853c8f5c36]: starting user-requested repair for keyspace ks_truncate_large_partition, repair id 3, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:39+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[2b8d9ee2-c8b6-4080-ab1d-933e98e38562]: starting user-requested repair for keyspace scylla_bench, repair id 4, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:05:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[e96e4873-5ff5-4bf7-94b6-6b91749941c2]: starting user-requested repair for keyspace system_auth, repair id 5, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:05:21+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[b12e6a66-70df-4482-b645-363dcafae8b6]: starting user-requested repair for keyspace system_traces, repair id 6, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:26:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.3.165 will be removed from gossip at [2022-03-29 19:26:38]: (expire = 1648581998973579413, now = 1648322800624051733, diff = 259198 seconds)
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:26:42+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now DOWN, status = LEFT
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:27:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.3.165 gossip quarantine over
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:49:05+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 4] storage_proxy - Cannot perform truncate, some hosts are down

@slivne
Contributor

slivne commented Apr 3, 2022

This is a regression.

I expect that a simple reproducer can be built.

  1. Create a 3 node cluster
  2. Create a table
  3. Add node
  4. Truncate
  5. Decommission node
  6. Wait for 3 minutes (so the decommissioned node is removed from gossip in other nodes state)
  7. Truncate

The second truncate (after the decommission and wait) is expected to fail.
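
A rough command-level sketch of those steps (an illustration only; NODE1/NODE4 stand for node addresses, and the keyspace/table names are made up rather than taken from any test):

# steps 1-2: on an existing 3-node cluster
cqlsh NODE1 -e "CREATE KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh NODE1 -e "CREATE TABLE ks.cf (pk int PRIMARY KEY, v text);"
# step 3: start a 4th node and let it join the ring
cqlsh NODE1 -e "TRUNCATE ks.cf;"       # step 4: expected to succeed
ssh NODE4 nodetool decommission        # step 5
sleep 180                              # step 6: give gossip time to drop the decommissioned node
cqlsh NODE1 -e "TRUNCATE ks.cf;"       # step 7: the truncate that is expected to reproduce the failure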

@slivne slivne added this to the 5.1 milestone Apr 3, 2022
@slivne
Contributor

slivne commented Apr 3, 2022

Since it's gossip-related, assigning to Asias.

@asias
Contributor

asias commented Apr 6, 2022

It looks like the truncate failed not just because of the decommissioned node with IP 3.165. It needs 7 nodes to be alive, but only 4 are alive.

< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > <stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 7, alive 4)

@asias
Contributor

asias commented Apr 6, 2022

I could not reproduce with the following dtest:

scylla: c450508

    def test_truncate_after_decommission(self):
        """
        Test truncate after decommission
        """
        cluster = self.cluster

        # Disable hinted handoff and set batch commit log so this doesn't
        # interfere with the test (this must be after the populate)
        cluster.set_configuration_options(
            values=self.default_config_options(), batch_commitlog=True)
        cluster.populate(3).start(wait_for_binary_proto=True, wait_other_notice=True)  
        node1, node2, node3 = cluster.nodelist()    

        session = self.patient_cql_connection(node1)
        create_ks(session, 'ks', 2)
        create_cf(session, 'cf', read_repair=0.0, columns={'c1': 'text', 'c2': 'text'})
        create_cf(session, 'cf2', read_repair=0.0, columns={'c1': 'text', 'c2': 'text'})

        insert_c1c2(session, keys=range(1000), consistency=ConsistencyLevel.ONE)

        logger.info("Truncate ks.cf2 started")
        truncate = SimpleStatement("TRUNCATE ks.cf2", consistency_level=ConsistencyLevel.ALL)
        session.execute(truncate)
        logger.info("Truncate ks.cf2 done")

        logger.info("Decommission node started")
        node2.decommission()
        node2.stop()
        logger.info("Decommission node done")

        logger.info("Sleep 300s started")
        time.sleep(300)
        logger.info("Sleep 300s done")

        logger.info("Truncate ks.cf started")
        truncate = SimpleStatement("TRUNCATE ks.cf", consistency_level=ConsistencyLevel.ALL)
        session.execute(truncate)
        logger.info("Truncate ks.cf done")

        logger.info("Sleep 300s started")
        time.sleep(300)
        logger.info("Sleep 300s done")

@slivne
Contributor

slivne commented Jul 17, 2022

@roydahan / @k0machi - does this reproduce? Asias did try to reproduce the issue and provided a test, but it doesn't reproduce the problem - can you please comment?

@roydahan

If nothing was fixed, I don't see a reason why it wouldn't reproduce.
Whoever wants to reproduce it needs to run the same longevity with the same sequence of nemeses.

@eliransin
Contributor

If nothing was fixed, I don't see a reason why it wouldn't reproduce. Whoever wants to reproduce it needs to run the same longevity with the same sequence of nemeses.

@roydahan can you refer me to somewhere that explains how to do that? More importantly, how can one extract the sequence from a test? I would like to try to build a minimal reproducer for this eventually (if we can't find the problem immediately).

@roydahan

You copy this job "Jenkins job URL" as a reproducer job in Jenkins and run it.
To find the minimal reproducer you need to suggest what could be the cause.
To find the exact sequence of nemeses, you may grep the log.
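
For example, a hedged illustration of such a grep (the log file name and the nemesis-name pattern are assumptions, not taken from this run):

# list the nemeses in the order they ran; nemesis method names start with "disrupt_"
grep -oE "disrupt_[a-z_]+" sct.log | uniq -c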

@eliransin
Contributor

Whoever wants to reproduce it needs to run the same longevity with the same sequence of nemeses.

How do I do that? Is there a place to provide a seed in the job?

@eliransin
Contributor

@asias after @roydahan / @k0machi answers the question above #10296 (comment)
Can you try and reproduce? We will then be able to add debug info and run again if necessary.

@asias
Contributor

asias commented Aug 1, 2022

This bug was reported on March 30. Did QA run this same test since then? Did it reproduce?

I implemented a mini dtest reproducer following the procedure and I did not reproduce it.

@eliransin
Contributor

Just to summarize things that were discussed verbally during the daily call:
@roydahan provided an answer - as long as you run the same SCT (same commit) you'll get the same sequence. He shared his intuition that it will not reproduce (it might be a race with some unknown cause, suspected to be of low probability).
Given that, we probably won't block the branching; however, @slivne is not willing to give up so easily 🙂 so we will keep treating this issue as a branching blocker at least until we have made a reasonable effort to
either solve it or prove it is rare enough.

The course of action on this issue was agreed to be (in parallel):

  1. Rerun the test a few times and see if it reproduces.
  2. Try to better inspect the logs to see if we can detect the logic failure that led to this situation.

@eliransin
Contributor

@asias here is the gossip information from node 4, which had the problem - it got a message saying that not all nodes are available:
Map from IP to node id:
{'10.0.1.42': '1', '10.0.1.237': '2', '10.0.2.143': '3', '10.0.0.15': '4', '10.0.0.106': '5', '10.0.2.196': '6', '10.0.0.235': '7', '10.0.3.165': '8'}

Can we infer from the logs below what should have been the state of gossip at the point of failure?

2022-03-25T22:18:02+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO | Scylla version 5.1.dev-0.20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7 starting ...
2022-03-25T22:18:03+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] init - Scylla version 5.1.dev-0.20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7 starting ...
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip shadow round started with nodes={10.0.2.143, 10.0.2.196, 10.0.1.42}
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip shadow round finisehd with nodes_talked={10.0.1.42, 10.0.2.143, 10.0.2.196}
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Feature check passed. Local node 10.0.0.15 features = {ALTERNATOR_STREAMS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, UDA, UNBOUNDED_RANGE_TOMBSTONES, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {ALTERNATOR_STREAMS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, UDA, UNBOUNDED_RANGE_TOMBSTONES, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - failure_detector_loop: Started main loop
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.1.237 is now DOWN, status = LEFT
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.0.106 is now DOWN, status = LEFT
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.1.42 is now UP, status = NORMAL
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.1.237 will be removed from gossip at [2022-03-28 15:16:17]: (expire = 1648480577354800891, now = 1648246698452132298, diff = 233878 seconds)
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.0.106 will be removed from gossip at [2022-03-28 08:22:35]: (expire = 1648455755274219062, now = 1648246698452132298, diff = 209056 seconds)
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now UP, status = NORMAL
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.196 is now UP, status = NORMAL
2022-03-25T22:18:27+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip not settled after 1 polls.
2022-03-25T22:18:30+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip settled after 1 extra polls; proceeding
2022-03-25T22:18:45+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - No gossip backlog; proceeding
2022-03-25T22:18:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] init - Scylla version 5.1.dev-0.20220317.c45050895403 initialization completed.
2022-03-25T22:19:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.0.106 gossip quarantine over
2022-03-25T22:19:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.1.237 gossip quarantine over
2022-03-25T22:40:16+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 ! WARNING |  [shard 1] gossip - failure_detector_loop: Send echo to node 10.0.2.143, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2022-03-25T22:40:16+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 1] gossip - failure_detector_loop: Mark node 10.0.2.143 as DOWN
2022-03-25T22:40:23+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now UP, status = NORMAL
2022-03-25T22:40:23+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now DOWN, status = NORMAL
2022-03-26T07:15:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.2.196 will be removed from gossip at [2022-03-29 07:15:47]: (expire = 1648538147679323676, now = 1648278947744548704, diff = 259199 seconds)
2022-03-26T07:15:48+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.196 is now DOWN, status = LEFT
2022-03-26T07:16:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.2.196 gossip quarantine over
2022-03-26T07:20:15+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.0.235 is now UP, status = UNKNOWN
2022-03-26T17:38:43+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now UP, status = UNKNOWN
2022-03-26T19:26:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.3.165 will be removed from gossip at [2022-03-29 19:26:38]: (expire = 1648581998973579413, now = 1648322800624051733, diff = 259198 seconds)
2022-03-26T19:26:42+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now DOWN, status = LEFT
2022-03-26T19:27:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.3.165 gossip quarantine over
2022-03-26T19:49:05+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 4] storage_proxy - Cannot perform truncate, some hosts are down
2022-03-26T20:24:36+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Got shutdown message from 10.0.2.143, received_generation=1648242988, local_generation=1648242988
2022-03-26T20:24:36+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now DOWN, status = shutdown
2022-03-26T20:37:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now UP, status = NORMAL
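
For reference, the same gossip and ring state can be inspected on a live node with standard nodetool commands (output formats vary by version, so treat this only as an illustration):

nodetool gossipinfo | egrep "^/|STATUS|generation"
nodetool status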

@eliransin
Contributor

eliransin commented Aug 3, 2022

One more guess I have is that the "unreachable endpoints" and "live_endpoints" collections can intersect, since the exception
implies that the number of endpoints should be unreachable + live; if, for example, all peer nodes are marked both as live and unreachable, it would generate the exact same exception.

@eliransin
Contributor

I scratched the last comment since it was a bogus guess: looking at the Prometheus metrics for node 4, they show that no more
than 3 or 4 nodes are considered for each category (live, unreachable) at the same time.
However, unless I am missing something, at that point I would expect the two metrics to sum to no more than 4 (the
number of nodes that are actually members of the cluster). So have nodes failed to be removed from the cluster?
@asias? (I will keep digging, but let me know if I am digging in the wrong direction.)

[screenshot: Prometheus metrics for node 4]

@ShlomiBalalis

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.0~dev-20220802.663f2e2a8f81 with build-id d6f523279872624d89a9a451f02d8276f0073cb5
Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-7d3e2623-9 (13.51.235.198 | 10.0.3.206) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-8 (16.171.54.129 | 10.0.1.93) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-7 (16.170.238.147 | 10.0.1.71) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-6 (13.50.5.151 | 10.0.2.77) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-5 (13.48.105.6 | 10.0.3.242) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-4 (13.53.111.243 | 10.0.3.225) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-3 (16.170.162.55 | 10.0.2.212) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-2 (16.16.64.67 | 10.0.2.79) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-12 (16.170.231.60 | 10.0.1.99) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-11 (13.51.238.217 | 10.0.1.85) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-10 (13.51.55.79 | 10.0.2.182) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-1 (13.49.46.223 | 10.0.2.218) (shards: 7)

OS / Image: ami-0da67a130b28d1222 (aws: eu-north-1)

Test: longevity-twcs-48h-test
Test id: 7d3e2623-427c-45a0-9d15-1b5033661a18
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):

Issue description

At 2022-08-06 04:37:57 the same issue occurred when trying to execute a truncate:

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "TRUNCATE ks_truncate.standard1" 10.0.2.77 9042'

Exit code: 2

Stdout:



Stderr:

<stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 5, alive 4)


Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3632, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1685, in disrupt_truncate
    self.target_node.run_cqlsh(cmd='TRUNCATE {}.{}'.format(keyspace_truncate, table), timeout=120)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2793, in run_cqlsh
    cqlsh_out = self.remoter.run(cmd, timeout=timeout + 30,  # we give 30 seconds to cqlsh timeout mechanism to work
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "TRUNCATE ks_truncate.standard1" 10.0.2.77 9042'

Exit code: 2

Stdout:



Stderr:

<stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 5, alive 4)

There are only 4 nodes in this scenario, and that stays true throughout the whole run. While there are nemeses that add and remove nodes, there was no "leftover" node from any of them at any point, both according to our logs and according to the "nodetool status" commands.

It was not the only time we attempted to truncate a table in this run, and at 2022-08-05 11:29:38.758 we truncated a different table with no issue:
< t:2022-08-05 11:33:09,051 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "TRUNCATE ks_truncate_large_partition.test_table" 10.0.3.225 9042" finished with status 0

  • Restore Monitor Stack command: $ hydra investigate show-monitor 7d3e2623-427c-45a0-9d15-1b5033661a18
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7d3e2623-427c-45a0-9d15-1b5033661a18

Logs:

Jenkins job URL

@eliransin
Contributor

@asias it seems that we keep counting some nodes after they were decommissioned (sometimes) - maybe something related to quarantine, or node restarts while there are nodes in quarantine?

@asias
Contributor

asias commented Aug 17, 2022

I will take another look at the latest reproducer:

#10296 (comment)

@asias
Contributor

asias commented Aug 17, 2022

(Quoting @ShlomiBalalis's installation details and issue description from the comment above. The quoted report also carried these links:)

* Test config file: longevity-twcs-48h.yaml - https://github.com/scylladb/scylla-cluster-tests/blob/5eb7e7241a351a35cd9c1756cd43cc329cb5a873/test-cases/longevity/longevity-twcs-48h.yaml

* Restore Monitor Stack command: `$ hydra investigate show-monitor 7d3e2623-427c-45a0-9d15-1b5033661a18`

* Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7d3e2623-427c-45a0-9d15-1b5033661a18

* Show all stored logs command: `$ hydra investigate show-logs 7d3e2623-427c-45a0-9d15-1b5033661a18`

Logs:

* **db-cluster-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/db-cluster-7d3e2623.tar.gz

* **monitor-set-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/monitor-set-7d3e2623.tar.gz

* **loader-set-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/loader-set-7d3e2623.tar.gz

* **sct-runner-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/sct-runner-7d3e2623.tar.gz

Jenkins job URL

Even with https://argus.scylladb.com/workspace?state=WyI4ZGFmZGYyYS05NTY2LTRiYzktODU5OS02MWU0MGY1NDcwNjciLCJjYTFhYzg5NS03OGMwLTRhYmQtODBkNS0wYmE2ZGM3ODQ0YmYiXQ, it is pretty hard to reconstruct which operations were performed and what the status of the cluster was.

@ShlomiBalalis Can you confirm which nodes were in the cluster when the truncate operation was run?

This is my guess:

nodes   1 2 3 4  in the cluster initially 
decom  3 and add 5    nodes after ops: 1 2 4 5
remove  2 and add 6   nodes after ops: 1 4 5 6
remove 4 and add  7   nodes after ops: 1 5 6 7
run truncate and failed

For example:

[screenshot]

This does not mention which node was added after the decommission, or whether a node was added at all.

[screenshot]

This does not mention which node was added after the operations, or how the node was removed (decommission or removenode).
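
One way to reconstruct such an add/remove sequence from the collected node logs, based on the gossip messages quoted earlier in this thread (the grep patterns are approximate):

egrep -hi "is now UP|is now DOWN|status = LEFT|removed from gossip|decommission" db-cluster-*/longevity-*-db-node-*/messages.log | sort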

@k0machi
Contributor Author

k0machi commented Oct 27, 2022

Something weird happens during the package install, I'm going to build an AMI instead with this tarball

I am not surprised that the update_db_packages function of SCT is not working properly.

It basically does this and I think this is another quirk of dpkg:

yes Y | dpkg --force-depends -i /tmp/scylla/scylla*

It skips dependency checks and unpacks the packages - I think dpkg considers service files as config files and doesn't overwrite them with the new ones, but when it goes to clean up the old versions, it removes the service file.
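
For comparison, a more conventional install sequence for locally built .deb packages would look roughly like this (a sketch assuming the packages sit in /tmp/scylla as in the command above; this is not what update_db_packages currently does):

sudo dpkg -i /tmp/scylla/scylla*.deb || true   # unpack; may leave dependencies unconfigured
sudo apt-get -f install -y                     # let apt resolve the missing dependencies
sudo systemctl daemon-reload                   # pick up any changed or replaced service units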

@k0machi
Contributor Author

k0machi commented Oct 30, 2022

Okay, I've recompiled Scylla and built ami-00a0e22699c50d9df on us-east-1 with the patch from Asias applied on top of 9deeeb4, and started the longevity again; we will see what happens in around 12 hours from now, I think.

Link to the run (Argus). The Scylla version is 5.1.0~rc3-0.20221026.9deeeb4db1cd.

The run finished and did not fail at either of the truncates there. I'm going to re-run this job, this time using the SCT master branch without any alterations to the nemeses, and see if it reproduces.

@roydahan

@asias how would we know if your patch fixed the issue or we just couldn't reproduce it?
Is it possible to have only the part of the patch that adds the debug info?
Will it help you then to understand the root cause?

@asias
Contributor

asias commented Oct 31, 2022

@asias how would we know if your patch fixed the issue or we just couldn't reproduce it? Is it possible to have only the part of the patch that adds the debug info? Will it help you then to understand the root cause?

The following is without the fix.

Here is the branch

https://github.com/asias/scylla/tree/truncate_issue_5_1.qa.test2

Or the patch on top of 9deeeb4:

diff --git a/gms/gossiper.cc b/gms/gossiper.cc
index 1535b44cec..deae8aecb2 100644
--- a/gms/gossiper.cc
+++ b/gms/gossiper.cc
@@ -2205,6 +2205,12 @@ void gossiper::dump_endpoint_state_map() {
         logger.info("endpoint={}, endpoint_state={}", x.first, x.second);
     }
     logger.info("=== endpoint_state_map dump ends ===");
+
+    logger.info("=== dump token_metadata starts ===");
+    for (auto& x : get_token_metadata_ptr()->get_token_to_endpoint()) {
+        logger.info("token owners: token={} -> endpoint={}", x.first, x.second);
+    }
+    logger.info("=== dump token_metadata ends ===");
 }
 
 bool gossiper::is_alive(inet_address ep) const {
diff --git a/service/storage_proxy.cc b/service/storage_proxy.cc
index 108adf18e2..40dce38b63 100644
--- a/service/storage_proxy.cc
+++ b/service/storage_proxy.cc
@@ -5056,33 +5056,33 @@ db::hints::manager& storage_proxy::hints_manager_for(db::write_type type) {
 }
 
 future<> storage_proxy::truncate_blocking(sstring keyspace, sstring cfname) {
-    slogger.debug("Starting a blocking truncate operation on keyspace {}, CF {}", keyspace, cfname);
-
-    if (!_gossiper.get_unreachable_token_owners().empty()) {
-        slogger.info("Cannot perform truncate, some hosts are down");
+    slogger.info("Starting a blocking truncate operation on keyspace {}, CF {}", keyspace, cfname);
+
+    auto live_nodes = _gossiper.get_live_token_owners();
+    auto down_nodes = _gossiper.get_unreachable_token_owners();
+    if (!down_nodes.empty()) {
+        slogger.warn("Cannot perform truncate, some hosts are down, ks={}, table={}, live_token_owners={}, unreachable_token_owners={}",
+                keyspace, cfname, live_nodes, down_nodes);
+        _gossiper.dump_endpoint_state_map();
         // Since the truncate operation is so aggressive and is typically only
         // invoked by an admin, for simplicity we require that all nodes are up
         // to perform the operation.
-        auto live_members = _gossiper.get_live_members().size();
-
         return make_exception_future<>(exceptions::unavailable_exception(db::consistency_level::ALL,
-                live_members + _gossiper.get_unreachable_members().size(),
-                live_members));
+                live_nodes.size() + down_nodes.size(), live_nodes.size()));
     }
 
-    auto all_endpoints = _gossiper.get_live_token_owners();
     auto& ms = _messaging;
     auto timeout = clock_type::now() + std::chrono::milliseconds(_db.local().get_config().truncate_request_timeout_in_ms());
 
-    slogger.trace("Enqueuing truncate messages to hosts {}", all_endpoints);
+    slogger.info("Enqueuing truncate messages to hosts {}", live_nodes);
 
-    return parallel_for_each(all_endpoints, [keyspace, cfname, &ms, timeout](auto ep) {
+    return parallel_for_each(live_nodes, [keyspace, cfname, &ms, timeout](auto ep) {
         return ser::storage_proxy_rpc_verbs::send_truncate(&ms, netw::messaging_service::msg_addr{ep, 0}, timeout, keyspace, cfname);
     }).handle_exception([cfname](auto ep) {
        try {
            std::rethrow_exception(ep);
        } catch (rpc::timeout_error& e) {
-           slogger.trace("Truncation of {} timed out: {}", cfname, e.what());
+           slogger.info("Truncation of {} timed out: {}", cfname, e.what());
            throw;
        } catch (...) {
            throw;

@roydahan

@asias, @k0machi is out, can you please create an AMI with this patch?
(There should be an easy jenkins job to allow this)

@asias
Contributor

asias commented Nov 1, 2022

@asias, @k0machi is out, can you please create an AMI with this patch? (There should be an easy jenkins job to allow this)

OK. Can you show me the jenkins link to build the AMI? Who is going to run the test when we have the AMI?

@k0machi
Contributor Author

k0machi commented Nov 1, 2022

@asias, @k0machi is out, can you please create an AMI with this patch? (There should be an easy jenkins job to allow this)

OK. Can you show me the jenkins link to build the AMI? Who is going to run the test when we have the AMI?

I have everything set up already, so I'll use the branch you've linked (truncate_issue_5_1.qa.test2); here is the new AMI: us-east-1: ami-0444217138b144e3d.

Version string: 5.1.0~rc3-0.20221101.5bd9b0846be3.

I had already launched a test with the fix yesterday morning, but unfortunately had to restart it after the weekend; something went wrong with the network. I'll queue up the job without the fix too.

@DoronArazii

@k0machi any updates?

This is a blocker for 5.1

@k0machi
Contributor Author

k0machi commented Nov 3, 2022

@k0machi any updates?

This is a blocker for 5.1

Currently running a longevity without the fix but with the debug information; we should see tomorrow whether it reproduces the issue.

@k0machi
Contributor Author

k0machi commented Nov 7, 2022

The issue wasn't reproduced. I've started another run while I investigate the differences between the runs that did have this issue and the ones that didn't.

@asias
Contributor

asias commented Nov 9, 2022

@k0machi any news so far?

@k0machi
Contributor Author

k0machi commented Nov 9, 2022

@k0machi any news so far?

Two runs and it didn't reproduce; we're now going to try a test where only the truncate nemesis happens over and over again.

@DoronArazii

@avikivity ping

@roydahan

Without the patch it reproduced for @KnifeyMoloko in #11928 (RC3).

asias added a commit to asias/scylla that referenced this issue Nov 11, 2022
get_live_token_owners returns the nodes that are part of the ring
and are live.

get_unreachable_token_owners returns the nodes that are part of the ring
and are not alive.

token_metadata::get_all_endpoints returns the nodes that are part of the
ring.

The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and to call is_alive to check if
a node is up or down, so that the correctness does not depend on
any derived information.

This patch fixes a truncate issue in storage_proxy::truncate_blocking,
which calls get_live_token_owners and get_unreachable_token_owners to
decide which nodes to talk to for the truncate operation. The truncate
failed because incorrect nodes were returned.

Fixes scylladb#10296
Fixes scylladb#11928
@asias
Contributor

asias commented Nov 11, 2022

I sent a PR for the patch which @k0machi tested.

asias added a commit to asias/scylla that referenced this issue Nov 15, 2022
(Same commit message as above.)
xemul pushed a commit that referenced this issue Nov 15, 2022
(Same commit message as above.)

Closes #11952
@DoronArazii

@avikivity I think we should backport @asias's fix ^^ , please have a look

@avikivity
Member

@asias how far back should this be backported? Is only 5.1 affected? (if not, then it wasn't a regression)

@asias
Contributor

asias commented Nov 17, 2022

@asias how far back should this be backported? Is only 5.1 affected? (if not, then it wasn't a regression)

We only saw the issue with 5.1. Let's backport to 5.1 only.

avikivity pushed a commit that referenced this issue Nov 17, 2022
(Same commit message as above.)

Closes #11952

(cherry picked from commit 16bd9ec)
@avikivity
Member

Backported to 5.1.

denesb pushed a commit that referenced this issue Jan 9, 2023
(Same commit message as above.)

Closes #11952

(cherry picked from commit 16bd9ec)