
'Cannot perform truncate, some hosts are down' with hosts that have left the cluster #10296

Closed
k0machi opened this issue Mar 30, 2022 · 67 comments

@k0machi
Contributor

k0machi commented Mar 30, 2022

Installation details

Kernel Version: 5.13.0-1019-aws
Scylla version (or git commit hash): 5.1.dev-20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7
Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-2acb1d5c-1 (34.249.11.226 | 10.0.1.42) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-2 (54.154.214.184 | 10.0.1.237) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-3 (34.248.103.231 | 10.0.2.143) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-4 (3.250.227.110 | 10.0.0.15) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-5 (54.170.137.53 | 10.0.0.106) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-6 (54.216.174.11 | 10.0.2.196) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-7 (63.32.45.143 | 10.0.0.235) (shards: 8)
  • longevity-twcs-48h-master-db-node-2acb1d5c-8 (54.195.204.64 | 10.0.3.165) (shards: 8)

OS / Image: ami-0d27db767997d0ec4 (aws: eu-west-1)

Test: longevity-twcs-48h-test
Test id: 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
Test name: longevity/longevity-twcs-48h-test
Test config file(s):

Issue description

During the disrupt_truncate_large_partition nemesis, a large partition is created and subsequently truncated on the same node. However, that node seems to think that not all nodes are currently available and is therefore unable to perform the truncation, even though the missing nodes were decommissioned and left the cluster a long time ago.
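
In effect, the nemesis ends up issuing a plain CQL TRUNCATE against the target node, along the lines of the command below (taken from the SCT log quoted later in this report; the keyspace and table are the ones the nemesis creates):

cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "TRUNCATE ks_truncate_large_partition.test_table" 10.0.0.15 9042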

Gossip information from that node

< t:2022-03-26 19:45:31,709 f:cluster.py      l:2623 c:sdcm.cluster_aws     p:DEBUG > Node longevity-twcs-48h-master-db-node-2acb1d5c-4 [3.250.227.110 | 10.0.0.15] (seed: False): Command '/usr/bin/nodetool  gossipinfo ' duration -> 1.4254664019972552 s
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2750 c:sdcm.cluster         p:DEBUG > get_gossip_info: Command exited with status 0.
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2750 c:sdcm.cluster         p:DEBUG > === stdout ===
<...>
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.0.106 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:31,710 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.1.237 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:31,711 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.2.196 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:31,711 f:cluster.py      l:2767 c:sdcm.cluster         p:DEBUG > Get gossip info. Node with IP 10.0.3.165 is not found in the cluster. Node status is: LEFT
< t:2022-03-26 19:45:33,706 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status=Up/Down
< t:2022-03-26 19:45:33,706 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
< t:2022-03-26 19:45:33,707 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
< t:2022-03-26 19:45:33,712 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.1.42   32.83 GB   256          ?       e298382a-4bbb-495a-817c-4ec2b47f1210  1a
< t:2022-03-26 19:45:33,717 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.0.235  25.23 GB   256          ?       737c6ca2-88db-45a2-aeb1-c1225826c948  1a
< t:2022-03-26 19:45:33,720 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.0.15   28.22 GB   256          ?       26449be3-7af3-4616-916a-adc541b1cbd1  1a
< t:2022-03-26 19:45:33,722 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.0.2.143  28.11 GB   256          ?       8dca9f03-a2a7-4c16-9bfa-9b6a93bfeb4e  1a

Node log

Mar 26 19:47:46 longevity-twcs-48h-master-db-node-2acb1d5c-4 scylla[68161]:  [shard 1] create_keyspace - Using SimpleStrategy in a multi-datacenter environment is not recommended.
Mar 26 19:49:05 longevity-twcs-48h-master-db-node-2acb1d5c-4 scylla[68161]:  [shard 4] storage_proxy - Cannot perform truncate, some hosts are down

SCT

< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "TRUNCATE ks_truncate_large_partition.test_table" 10.0.0.15 9042'
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Exit code: 2
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stdout:
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stderr:
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > <stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 7, alive 4)
  • Restore Monitor Stack command: $ hydra investigate show-monitor 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 2acb1d5c-3e77-4abd-93e7-ae1d3fa89903

Logs:

Jenkins job URL

@bentsi bentsi added the triage/master Looking for assignee label Mar 31, 2022
@slivne
Contributor

slivne commented Apr 3, 2022

Trying to look for gossip information

Sequence is:

  1. Node 3.165 is up
  2. Ran repair with all nodes
  3. Node 3.165 decommissioned
  4. Truncate 20 minutes later fails
[shlomi@shlomi-scylladb-com i_10296]$ grep 2022-03-26T1[4-9]: truncate db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log | grep -v kernel | egrep  "gossip -| truncate|starting|DOWN"  | more
grep: truncate: No such file or directory
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T17:38:43+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now UP, status = UNKNOWN
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:35+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[2a40a2cc-84a2-4915-a011-e436ec0c5aaf]: starting user-requested repair for keyspace keyspace1, repair id 1, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:37+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[f9a6383e-bab0-44d0-94b1-9e28c7ecdf75]: starting user-requested repair for keyspace system_distributed_everywhere, repair id 2, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:37+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[782a98df-bf27-4c36-bed4-31853c8f5c36]: starting user-requested repair for keyspace ks_truncate_large_partition, repair id 3, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T18:44:39+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[2b8d9ee2-c8b6-4080-ab1d-933e98e38562]: starting user-requested repair for keyspace scylla_bench, repair id 4, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:05:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[e96e4873-5ff5-4bf7-94b6-6b91749941c2]: starting user-requested repair for keyspace system_auth, repair id 5, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:05:21+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] repair - repair[b12e6a66-70df-4482-b645-363dcafae8b6]: starting user-requested repair for keyspace system_traces, repair id 6, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> true}}
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:26:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.3.165 will be removed from gossip at [2022-03-29 19:26:38]: (expire = 1648581998973579413, now = 1648322800624051733, diff = 259198 seconds)
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:26:42+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now DOWN, status = LEFT
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:27:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.3.165 gossip quarantine over
db-cluster-2acb1d5c/longevity-twcs-48h-master-db-node-2acb1d5c-4/messages.log:2022-03-26T19:49:05+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 4] storage_proxy - Cannot perform truncate, some hosts are down

@slivne
Contributor

slivne commented Apr 3, 2022

This is a regression.

I expect that a simple reproducer can be built.

  1. Create a 3 node cluster
  2. Create a table
  3. Add node
  4. Truncate
  5. Decommission node
  6. Wait for 3 minutes (so the decommissioned node is removed from gossip in other nodes state)
  7. Truncate

The second truncate (after the decommission and wait) is expected to fail.
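
A rough command-level sketch of those steps (an illustration only; NODE1/NODE4 stand for node addresses, and the keyspace/table names are made up rather than taken from any test):

# steps 1-2: on an existing 3-node cluster
cqlsh NODE1 -e "CREATE KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh NODE1 -e "CREATE TABLE ks.cf (pk int PRIMARY KEY, v text);"
# step 3: start a 4th node and let it join the ring
cqlsh NODE1 -e "TRUNCATE ks.cf;"       # step 4: expected to succeed
ssh NODE4 nodetool decommission        # step 5
sleep 180                              # step 6: give gossip time to drop the decommissioned node
cqlsh NODE1 -e "TRUNCATE ks.cf;"       # step 7: the truncate that is expected to reproduce the failure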

@slivne slivne added this to the 5.1 milestone Apr 3, 2022
@slivne
Contributor

slivne commented Apr 3, 2022

Since it's gossip-related, assigning to Asias.

@asias
Contributor

asias commented Apr 6, 2022

It looks like the truncate failed not just because of the decommissioned node with IP 3.165. It needs 7 nodes to be alive, but only 4 are alive.

< t:2022-03-26 19:49:06,101 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > <stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 7, alive 4)

@asias
Contributor

asias commented Apr 6, 2022

I could not reproduce with the following dtest:

scylla: c450508

    def test_truncate_after_decommission(self):
        """
        Test truncate after decommission
        """
        cluster = self.cluster

        # Disable hinted handoff and set batch commit log so this doesn't
        # interfere with the test (this must be after the populate)
        cluster.set_configuration_options(
            values=self.default_config_options(), batch_commitlog=True)
        cluster.populate(3).start(wait_for_binary_proto=True, wait_other_notice=True)  
        node1, node2, node3 = cluster.nodelist()    

        session = self.patient_cql_connection(node1)
        create_ks(session, 'ks', 2)
        create_cf(session, 'cf', read_repair=0.0, columns={'c1': 'text', 'c2': 'text'})
        create_cf(session, 'cf2', read_repair=0.0, columns={'c1': 'text', 'c2': 'text'})

        insert_c1c2(session, keys=range(1000), consistency=ConsistencyLevel.ONE)

        logger.info("Truncate ks.cf2 started")
        truncate = SimpleStatement("TRUNCATE ks.cf2", consistency_level=ConsistencyLevel.ALL)
        session.execute(truncate)
        logger.info("Truncate ks.cf2 done")

        logger.info("Decommission node started")
        node2.decommission()
        node2.stop()
        logger.info("Decommission node done")

        logger.info("Sleep 300s started")
        time.sleep(300)
        logger.info("Sleep 300s done")

        logger.info("Truncate ks.cf started")
        truncate = SimpleStatement("TRUNCATE ks.cf", consistency_level=ConsistencyLevel.ALL)
        session.execute(truncate)
        logger.info("Truncate ks.cf done")

        logger.info("Sleep 300s started")
        time.sleep(300)
        logger.info("Sleep 300s done")

@slivne
Contributor

slivne commented Jul 17, 2022

@roydahan / @k0machi - does this reproduce? Asias did try to reproduce the issue and provided a test, but it doesn't reproduce the problem - can you please comment?

@roydahan

If nothing was fixed, I don't see a reason why it wouldn't reproduce.
Whoever wants to reproduce it needs to run the same longevity with the same sequence of nemeses.

@eliransin
Contributor

If nothing was fixed, I don't see a reason why it wouldn't reproduce. Whoever wants to reproduce it needs to run the same longevity with the same sequence of nemeses.

@roydahan can you refer me to somewhere that explains how to do that? More importantly, how can one extract the sequence from a test? I would like to try to build a minimal reproducer for this eventually (if we can't find the problem immediately).

@roydahan

You copy this job "Jenkins job URL" as a reproducer job in Jenkins and run it.
To find the minimal reproducer you need to suggest what could be the cause.
To find the exact sequence of nemeses, you may grep the log.
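
For example, a hedged illustration of such a grep (the log file name and the nemesis-name pattern are assumptions, not taken from this run):

# list the nemeses in the order they ran; nemesis method names start with "disrupt_"
grep -oE "disrupt_[a-z_]+" sct.log | uniq -c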

@eliransin
Contributor

Whoever wants to reproduce it needs to run the same longevity with the same sequence of nemeses.

How do I do that? Is there a place to provide a seed in the job?

@eliransin
Contributor

@asias after @roydahan / @k0machi answers the question above #10296 (comment)
Can you try and reproduce? We will then be able to add debug info and run again if necessary.

@asias
Contributor

asias commented Aug 1, 2022

This bug was reported on March 30. Did QA run this same test since then? Did it reproduce?

I implemented a mini dtest reproducer following the procedure and I did not reproduce it.

@eliransin
Contributor

Just to summarize things that were discussed verbally during the daily call:
@roydahan provided an answer - as long as you run the same SCT (same commit) you'll get the same sequence. He shared his intuition that it will not reproduce (it might be a race with some unknown cause, suspected to be of low probability).
Given that, we probably won't block the branching; however, @slivne is not willing to give up so easily 🙂 so we will keep treating this issue as a branching blocker at least until we have made a reasonable effort to
either solve it or prove it is rare enough.

The course of action on this issue was agreed to be (in parallel):

  1. Rerun the test a few times and see if it reproduces.
  2. Try to better inspect the logs to see if we can detect the logic failure that led to this situation.

@eliransin
Contributor

@asias here is the gossip information from node 4, which had the problem - it got a message saying that not all nodes are available:
Map from IP to node id:
{'10.0.1.42': '1', '10.0.1.237': '2', '10.0.2.143': '3', '10.0.0.15': '4', '10.0.0.106': '5', '10.0.2.196': '6', '10.0.0.235': '7', '10.0.3.165': '8'}

Can we infer from the logs below what should have been the state of gossip at the point of failure?

2022-03-25T22:18:02+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO | Scylla version 5.1.dev-0.20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7 starting ...
2022-03-25T22:18:03+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] init - Scylla version 5.1.dev-0.20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7 starting ...
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip shadow round started with nodes={10.0.2.143, 10.0.2.196, 10.0.1.42}
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip shadow round finisehd with nodes_talked={10.0.1.42, 10.0.2.143, 10.0.2.196}
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Feature check passed. Local node 10.0.0.15 features = {ALTERNATOR_STREAMS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, UDA, UNBOUNDED_RANGE_TOMBSTONES, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {ALTERNATOR_STREAMS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, UDA, UNBOUNDED_RANGE_TOMBSTONES, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}
2022-03-25T22:18:17+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - failure_detector_loop: Started main loop
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.1.237 is now DOWN, status = LEFT
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.0.106 is now DOWN, status = LEFT
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.1.42 is now UP, status = NORMAL
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.1.237 will be removed from gossip at [2022-03-28 15:16:17]: (expire = 1648480577354800891, now = 1648246698452132298, diff = 233878 seconds)
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.0.106 will be removed from gossip at [2022-03-28 08:22:35]: (expire = 1648455755274219062, now = 1648246698452132298, diff = 209056 seconds)
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now UP, status = NORMAL
2022-03-25T22:18:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.196 is now UP, status = NORMAL
2022-03-25T22:18:27+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip not settled after 1 polls.
2022-03-25T22:18:30+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Gossip settled after 1 extra polls; proceeding
2022-03-25T22:18:45+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - No gossip backlog; proceeding
2022-03-25T22:18:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] init - Scylla version 5.1.dev-0.20220317.c45050895403 initialization completed.
2022-03-25T22:19:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.0.106 gossip quarantine over
2022-03-25T22:19:18+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.1.237 gossip quarantine over
2022-03-25T22:40:16+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 ! WARNING |  [shard 1] gossip - failure_detector_loop: Send echo to node 10.0.2.143, status = failed: seastar::rpc::timeout_error (rpc call timed out)
2022-03-25T22:40:16+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 1] gossip - failure_detector_loop: Mark node 10.0.2.143 as DOWN
2022-03-25T22:40:23+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now UP, status = NORMAL
2022-03-25T22:40:23+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now DOWN, status = NORMAL
2022-03-26T07:15:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.2.196 will be removed from gossip at [2022-03-29 07:15:47]: (expire = 1648538147679323676, now = 1648278947744548704, diff = 259199 seconds)
2022-03-26T07:15:48+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.196 is now DOWN, status = LEFT
2022-03-26T07:16:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.2.196 gossip quarantine over
2022-03-26T07:20:15+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.0.235 is now UP, status = UNKNOWN
2022-03-26T17:38:43+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now UP, status = UNKNOWN
2022-03-26T19:26:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Node 10.0.3.165 will be removed from gossip at [2022-03-29 19:26:38]: (expire = 1648581998973579413, now = 1648322800624051733, diff = 259198 seconds)
2022-03-26T19:26:42+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.3.165 is now DOWN, status = LEFT
2022-03-26T19:27:40+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - 60000 ms elapsed, 10.0.3.165 gossip quarantine over
2022-03-26T19:49:05+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 4] storage_proxy - Cannot perform truncate, some hosts are down
2022-03-26T20:24:36+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - Got shutdown message from 10.0.2.143, received_generation=1648242988, local_generation=1648242988
2022-03-26T20:24:36+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now DOWN, status = shutdown
2022-03-26T20:37:47+00:00 longevity-twcs-48h-master-db-node-2acb1d5c-4 !    INFO |  [shard 0] gossip - InetAddress 10.0.2.143 is now UP, status = NORMAL
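
For reference, the same gossip and ring state can be inspected on a live node with standard nodetool commands (output formats vary by version, so treat this only as an illustration):

nodetool gossipinfo | egrep "^/|STATUS|generation"
nodetool status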

@eliransin
Contributor

eliransin commented Aug 3, 2022

One more guess I have is that the "unreachable endpoints" and "live_endpoints" collections can intersect, since the exception
implies that the number of endpoints should be unreachable + live; if, for example, all peer nodes are marked both as live and unreachable, it would generate the exact same exception.

@eliransin
Contributor

I scratched the last comment since it was a bogus guess: looking at the Prometheus metrics for node 4, they show that no more
than 3 or 4 nodes are considered for each category (live, unreachable) at the same time.
However, unless I am missing something, at that point I would expect the two metrics to sum to no more than 4 (the
number of nodes that are actually members of the cluster). So have nodes failed to be removed from the cluster?
@asias? (I will keep digging, but let me know if I am digging in the wrong direction.)

[screenshot: Prometheus metrics for node 4]

@ShlomiBalalis

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.0~dev-20220802.663f2e2a8f81 with build-id d6f523279872624d89a9a451f02d8276f0073cb5
Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-7d3e2623-9 (13.51.235.198 | 10.0.3.206) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-8 (16.171.54.129 | 10.0.1.93) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-7 (16.170.238.147 | 10.0.1.71) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-6 (13.50.5.151 | 10.0.2.77) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-5 (13.48.105.6 | 10.0.3.242) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-4 (13.53.111.243 | 10.0.3.225) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-3 (16.170.162.55 | 10.0.2.212) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-2 (16.16.64.67 | 10.0.2.79) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-12 (16.170.231.60 | 10.0.1.99) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-11 (13.51.238.217 | 10.0.1.85) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-10 (13.51.55.79 | 10.0.2.182) (shards: 7)
  • longevity-twcs-48h-master-db-node-7d3e2623-1 (13.49.46.223 | 10.0.2.218) (shards: 7)

OS / Image: ami-0da67a130b28d1222 (aws: eu-north-1)

Test: longevity-twcs-48h-test
Test id: 7d3e2623-427c-45a0-9d15-1b5033661a18
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):

Issue description

At 2022-08-06 04:37:57 the same issue occurred when trying to execute a truncate:

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "TRUNCATE ks_truncate.standard1" 10.0.2.77 9042'

Exit code: 2

Stdout:



Stderr:

<stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 5, alive 4)


Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3632, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1685, in disrupt_truncate
    self.target_node.run_cqlsh(cmd='TRUNCATE {}.{}'.format(keyspace_truncate, table), timeout=120)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2793, in run_cqlsh
    cqlsh_out = self.remoter.run(cmd, timeout=timeout + 30,  # we give 30 seconds to cqlsh timeout mechanism to work
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "TRUNCATE ks_truncate.standard1" 10.0.2.77 9042'

Exit code: 2

Stdout:



Stderr:

<stdin>:1:TruncateError: Error during truncate: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 5, alive 4)

There are only 4 nodes in this scenario, and that stays true throughout the whole run. While there are nemeses that add and remove nodes, there was no "leftover" node from any of them at any point, both according to our logs and according to the "nodetool status" commands.

It was not the only time we attempted to truncate a table in this run, and at 2022-08-05 11:29:38.758 we truncated a different table with no issue:
< t:2022-08-05 11:33:09,051 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "TRUNCATE ks_truncate_large_partition.test_table" 10.0.3.225 9042" finished with status 0

  • Restore Monitor Stack command: $ hydra investigate show-monitor 7d3e2623-427c-45a0-9d15-1b5033661a18
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7d3e2623-427c-45a0-9d15-1b5033661a18

Logs:

Jenkins job URL

@eliransin
Contributor

@asias it seems that we keep counting some nodes after they were decommissioned (sometimes) - maybe something related to quarantine, or node restarts while there are nodes in quarantine?

@asias
Contributor

asias commented Aug 17, 2022

I will take another look at the latest reproducer:

#10296 (comment)

@asias
Contributor

asias commented Aug 17, 2022

(Quoting @ShlomiBalalis's installation details and issue description from the comment above. The quoted report also carried these links:)

* Test config file: longevity-twcs-48h.yaml - https://github.com/scylladb/scylla-cluster-tests/blob/5eb7e7241a351a35cd9c1756cd43cc329cb5a873/test-cases/longevity/longevity-twcs-48h.yaml

* Restore Monitor Stack command: `$ hydra investigate show-monitor 7d3e2623-427c-45a0-9d15-1b5033661a18`

* Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7d3e2623-427c-45a0-9d15-1b5033661a18

* Show all stored logs command: `$ hydra investigate show-logs 7d3e2623-427c-45a0-9d15-1b5033661a18`

Logs:

* **db-cluster-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/db-cluster-7d3e2623.tar.gz

* **monitor-set-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/monitor-set-7d3e2623.tar.gz

* **loader-set-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/loader-set-7d3e2623.tar.gz

* **sct-runner-7d3e2623.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7d3e2623-427c-45a0-9d15-1b5033661a18/20220807_034518/sct-runner-7d3e2623.tar.gz

Jenkins job URL

Even with https://argus.scylladb.com/workspace?state=WyI4ZGFmZGYyYS05NTY2LTRiYzktODU5OS02MWU0MGY1NDcwNjciLCJjYTFhYzg5NS03OGMwLTRhYmQtODBkNS0wYmE2ZGM3ODQ0YmYiXQ, it is pretty hard to reconstruct which operations were performed and what the status of the cluster was.

@ShlomiBalalis Can you confirm which nodes were in the cluster when the truncate operation was run?

This is my guess:

nodes   1 2 3 4  in the cluster initially 
decom  3 and add 5    nodes after ops: 1 2 4 5
remove  2 and add 6   nodes after ops: 1 4 5 6
remove 4 and add  7   nodes after ops: 1 5 6 7
run truncate and failed

For example:

[screenshot]

This does not mention which node was added after the decommission, or whether a node was added at all.

[screenshot]

This does not mention which node was added after the operations, or how the node was removed (decommission or removenode).
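
One way to reconstruct such an add/remove sequence from the collected node logs, based on the gossip messages quoted earlier in this thread (the grep patterns are approximate):

egrep -hi "is now UP|is now DOWN|status = LEFT|removed from gossip|decommission" db-cluster-*/longevity-*-db-node-*/messages.log | sort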

@k0machi
Contributor Author

k0machi commented Oct 27, 2022

Something weird happens during the package install, I'm going to build an AMI instead with this tarball

I am not surprised that the update_db_packages function of SCT is not working properly.

It basically does this and I think this is another quirk of dpkg:

yes Y | dpkg --force-depends -i /tmp/scylla/scylla*

It skips dependency checks and unpacks the packages - I think dpkg considers service files as config files and doesn't overwrite them with the new ones, but when it goes to clean up the old versions, it removes the service file.
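
For comparison, a more conventional install sequence for locally built .deb packages would look roughly like this (a sketch assuming the packages sit in /tmp/scylla as in the command above; this is not what update_db_packages currently does):

sudo dpkg -i /tmp/scylla/scylla*.deb || true   # unpack; may leave dependencies unconfigured
sudo apt-get -f install -y                     # let apt resolve the missing dependencies
sudo systemctl daemon-reload                   # pick up any changed or replaced service units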

@k0machi
Contributor Author

k0machi commented Oct 30, 2022

Okay, I've recompiled Scylla and built ami-00a0e22699c50d9df on us-east-1 with the patch from Asias applied on top of 9deeeb4, and started the longevity again; we will see what happens in around 12 hours from now, I think.

Link to the run (Argus). The Scylla version is 5.1.0~rc3-0.20221026.9deeeb4db1cd.

The run finished and did not fail at either of the truncates there. I'm going to re-run this job, this time using the SCT master branch without any alterations to the nemeses, and see if it reproduces.

@roydahan

@asias how would we know if your patch fixed the issue or we just couldn't reproduce it?
Is it possible to have only the part of the patch that adds the debug info?
Will it help you then to understand the root cause?

@asias
Contributor

asias commented Oct 31, 2022

@asias how would we know if your patch fixed the issue or we just couldn't reproduce it? Is it possible to have only the part of the patch that adds the debug info? Will it help you then to understand the root cause?

The following is without the fix.

Here is the branch

https://github.com/asias/scylla/tree/truncate_issue_5_1.qa.test2

Or the patch on top of 9deeeb4:

diff --git a/gms/gossiper.cc b/gms/gossiper.cc
index 1535b44cec..deae8aecb2 100644
--- a/gms/gossiper.cc
+++ b/gms/gossiper.cc
@@ -2205,6 +2205,12 @@ void gossiper::dump_endpoint_state_map() {
         logger.info("endpoint={}, endpoint_state={}", x.first, x.second);
     }
     logger.info("=== endpoint_state_map dump ends ===");
+
+    logger.info("=== dump token_metadata starts ===");
+    for (auto& x : get_token_metadata_ptr()->get_token_to_endpoint()) {
+        logger.info("token owners: token={} -> endpoint={}", x.first, x.second);
+    }
+    logger.info("=== dump token_metadata ends ===");
 }
 
 bool gossiper::is_alive(inet_address ep) const {
diff --git a/service/storage_proxy.cc b/service/storage_proxy.cc
index 108adf18e2..40dce38b63 100644
--- a/service/storage_proxy.cc
+++ b/service/storage_proxy.cc
@@ -5056,33 +5056,33 @@ db::hints::manager& storage_proxy::hints_manager_for(db::write_type type) {
 }
 
 future<> storage_proxy::truncate_blocking(sstring keyspace, sstring cfname) {
-    slogger.debug("Starting a blocking truncate operation on keyspace {}, CF {}", keyspace, cfname);
-
-    if (!_gossiper.get_unreachable_token_owners().empty()) {
-        slogger.info("Cannot perform truncate, some hosts are down");
+    slogger.info("Starting a blocking truncate operation on keyspace {}, CF {}", keyspace, cfname);
+
+    auto live_nodes = _gossiper.get_live_token_owners();
+    auto down_nodes = _gossiper.get_unreachable_token_owners();
+    if (!down_nodes.empty()) {
+        slogger.warn("Cannot perform truncate, some hosts are down, ks={}, table={}, live_token_owners={}, unreachable_token_owners={}",
+                keyspace, cfname, live_nodes, down_nodes);
+        _gossiper.dump_endpoint_state_map();
         // Since the truncate operation is so aggressive and is typically only
         // invoked by an admin, for simplicity we require that all nodes are up
         // to perform the operation.
-        auto live_members = _gossiper.get_live_members().size();
-
         return make_exception_future<>(exceptions::unavailable_exception(db::consistency_level::ALL,
-                live_members + _gossiper.get_unreachable_members().size(),
-                live_members));
+                live_nodes.size() + down_nodes.size(), live_nodes.size()));
     }
 
-    auto all_endpoints = _gossiper.get_live_token_owners();
     auto& ms = _messaging;
     auto timeout = clock_type::now() + std::chrono::milliseconds(_db.local().get_config().truncate_request_timeout_in_ms());
 
-    slogger.trace("Enqueuing truncate messages to hosts {}", all_endpoints);
+    slogger.info("Enqueuing truncate messages to hosts {}", live_nodes);
 
-    return parallel_for_each(all_endpoints, [keyspace, cfname, &ms, timeout](auto ep) {
+    return parallel_for_each(live_nodes, [keyspace, cfname, &ms, timeout](auto ep) {
         return ser::storage_proxy_rpc_verbs::send_truncate(&ms, netw::messaging_service::msg_addr{ep, 0}, timeout, keyspace, cfname);
     }).handle_exception([cfname](auto ep) {
        try {
            std::rethrow_exception(ep);
        } catch (rpc::timeout_error& e) {
-           slogger.trace("Truncation of {} timed out: {}", cfname, e.what());
+           slogger.info("Truncation of {} timed out: {}", cfname, e.what());
            throw;
        } catch (...) {
            throw;

@roydahan

@asias, @k0machi is out, can you please create an AMI with this patch?
(There should be an easy jenkins job to allow this)

@asias
Contributor

asias commented Nov 1, 2022

@asias, @k0machi is out, can you please create an AMI with this patch? (There should be an easy jenkins job to allow this)

OK. Can you show me the jenkins link to build the AMI? Who is going to run the test when we have the AMI?

@k0machi
Contributor Author

k0machi commented Nov 1, 2022

@asias, @k0machi is out, can you please create an AMI with this patch? (There should be an easy jenkins job to allow this)

OK. Can you show me the jenkins link to build the AMI? Who is going to run the test when we have the AMI?

I have everything set up already, so I'll use the branch you've linked (truncate_issue_5_1.qa.test2); here is the new AMI: us-east-1: ami-0444217138b144e3d.

Version string: 5.1.0~rc3-0.20221101.5bd9b0846be3.

I had already launched a test with the fix yesterday morning, but unfortunately had to restart it after the weekend; something went wrong with the network. I'll queue up the job without the fix too.

@DoronArazii

@k0machi any updates?

This is a blocker for 5.1

@k0machi
Contributor Author

k0machi commented Nov 3, 2022

@k0machi any updates?

This is a blocker for 5.1

Currently running a longevity without the fix but with the debug information; we should see tomorrow whether it reproduces the issue.

@k0machi
Contributor Author

k0machi commented Nov 7, 2022

The issue wasn't reproduced. I've started another run while I investigate the differences between the runs that did have this issue and the ones that didn't.

@asias
Contributor

asias commented Nov 9, 2022

@k0machi any news so far?

@k0machi
Contributor Author

k0machi commented Nov 9, 2022

@k0machi any news so far?

Two runs and it didn't reproduce; we're now going to try a test where only the truncate nemesis happens over and over again.

@DoronArazii

@avikivity ping

@roydahan

Without the patch it reproduced for @KnifeyMoloko in #11928 (RC3).

asias added a commit to asias/scylla that referenced this issue Nov 11, 2022
get_live_token_owners returns the nodes that are part of the ring
and are live.

get_unreachable_token_owners returns the nodes that are part of the ring
and are not alive.

token_metadata::get_all_endpoints returns the nodes that are part of the
ring.

The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and to call is_alive to check if
a node is up or down, so that the correctness does not depend on
any derived information.

This patch fixes a truncate issue in storage_proxy::truncate_blocking,
which calls get_live_token_owners and get_unreachable_token_owners to
decide which nodes to talk to for the truncate operation. The truncate
failed because incorrect nodes were returned.

Fixes scylladb#10296
Fixes scylladb#11928
@asias
Contributor

asias commented Nov 11, 2022

I sent a PR for the patch which @k0machi tested.

asias added a commit to asias/scylla that referenced this issue Nov 15, 2022
(Same commit message as above.)
xemul pushed a commit that referenced this issue Nov 15, 2022
(Same commit message as above.)

Closes #11952
@DoronArazii

@avikivity I think we should backport @asias's fix ^^ , please have a look

@avikivity
Member

@asias how far back should this be backported? Is only 5.1 affected? (if not, then it wasn't a regression)

@asias
Contributor

asias commented Nov 17, 2022

@asias how far back should this be backported? Is only 5.1 affected? (if not, then it wasn't a regression)

We only saw the issue with 5.1. Let's backport to 5.1 only.

avikivity pushed a commit that referenced this issue Nov 17, 2022
(Same commit message as above.)

Closes #11952

(cherry picked from commit 16bd9ec)
@avikivity
Member

Backported to 5.1.

denesb pushed a commit that referenced this issue Jan 9, 2023
(Same commit message as above.)

Closes #11952

(cherry picked from commit 16bd9ec)