Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After error: sdcm.remote.base.RetryableNetworkException next command on with same remoter and ssh session could be not executed on node #7427

Open
1 of 2 tasks
aleksbykov opened this issue May 13, 2024 · 3 comments
Assignees
Labels

Comments

@aleksbykov
Copy link
Contributor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Node3 was decommissioned with Decommission_seed_node nemesis and new node was added node11. But during cleanup
on node11 issue: scylladb/scylladb#14004 happened with network exception. In several nemesis, node11 was choosen for new nemesis disrupt_restart_with_resharding. First steps of nemesis are stop scylla update scylla.yaml and start scylla. And according to sct.log, command was executed succesfully on node11.

Sct reported that command was executed on node11:

< t:2024-05-12 05:23:06,708 f:cluster.py      l:635  c:sdcm.cluster_aws     p:DEBUG > Node longevity-parallel-topology-schema--db-node-2a3bb187-11 [34.244.68.23 | 10.4.10.153]: CPUSET on node longevity-parallel-topology-schema--db-node-2a3bb187-11: CPUSET="--cpuset 1-7 "
< t:2024-05-12 05:23:06,974 f:nemesis.py      l:1001 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restart node with resharding. New murmur3_partitioner_ignore_msb_bits value: 15
< t:2024-05-12 05:23:06,974 f:decorators.py   l:118  c:sdcm.utils.decorators p:DEBUG > BEGIN: stop_scylla <AWSNode>
< t:2024-05-12 05:23:06,974 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Running command "sudo systemctl stop scylla-server.service"...

< t:2024-05-12 05:23:55,490 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Command "sudo systemctl stop scylla-server.service" finished with status 0


< t:2024-05-12 05:24:00,261 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Running command "sudo systemctl start scylla-server.service"...

< t:2024-05-12 05:24:44,815 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.10.153>: Command "sudo systemctl start scylla-server.service" finished with status 0

But this command is not logged and scylla was not restarted :

2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 3:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] migration_manager - Gossiping my schema version 82c3b60c-101e-11ef-df50-1b54cd8053fe
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] schema_tables - Schema version changed to 82c3b60c-101e-11ef-df50-1b54cd8053fe
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] schema_tables - Altering scylla_bench.test id=0ef48140-0fef-11ef-a1d9-169281dba5bc version=83986280-101e-11ef-bbd9-8728581320a2
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 5:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 1:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 6:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 3:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 2:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 4:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] migration_manager - Gossiping my schema version 83987c20-101e-11ef-930b-cceb725bebf8
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] schema_tables - Schema version changed to 83987c20-101e-11ef-930b-cceb725bebf8
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:comp] compaction - [Compact keyspace1.standard1 7d2d5180-101e-11ef-b3fd-aed309b4fe72] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0ejw_47j8w2qjd05txmp7xu-big-Data.db:level=0]. 1GB to 1GB (~86% of original) in 10905ms = 135MB/s. ~1052672 total partitions merged to 905165.
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:comp] compaction - [Compact keyspace1.standard1 83ae8290-101e-11ef-b3fd-aed309b4fe72] Compacting [/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0ejw_47j8w2qjd05txmp7xu-big-Data.db:level=0:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0eeq_4jbmo2qjd05txmp7xu-big-Data.db:level=0:origin=cleanup]
2024-05-12T05:22:23.581+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 6:strm] compaction_manager - Starting off-strategy compaction for keyspace1.standard1 compaction_group=0, 469 candidates were found
2024-05-12T05:23:23.582+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 2:strm] compaction_manager - Starting off-strategy compaction for keyspace1.standard1 compaction_group=0, 486 candidates were found
2024-05-12T05:24:23.585+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 5:strm] compaction - [Reshape keyspace1.standard1 8a7ce940-101e-11ef-a596-aed409b4fe72] Reshaped 484 sstables to [/var/lib/scylla/data/keyspace1/standard1-fe44ce900fee11efa9cc2e9ed5cf889a/me-3gg2_0ekj_07xhs2inife2kgzt6a-big-Data.db:level=0]. 5MB to 2MB (~48% of original) in 153ms = 34MB/s. ~61952 total partitions merged to 1720.
2024-05-12T05:24:23.585+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 5:strm] compaction_manager - Done with off-strategy compaction for keyspace1.standard1 compaction_group=0
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 0:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 5:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 6:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 2:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements
2024-05-12T05:25:23.587+00:00 longevity-parallel-topology-schema--db-node-2a3bb187-11     !INFO | scylla[6772]:  [shard 3:strm] query_processor - Column definitions for scylla_bench.test changed, invalidating related prepared statements

The possible reason is node11 was lost connection due to issue: scylladb/scylladb#14004 during previous disruptive nemesis and after that ssh connection was not reconnected properly.

Steps to Reproduce

  1. Decommission seed with possible networkexception like in issue Losing connection during rebuild on a new provisioned node on a different region (sshd complains on Bad packet length) scylladb#14004
  2. Run disrupt_restart_with_resharding on new node after step1

Expected behavior: [What you expected to happen]
disrupt_restart_with_resharding should pass

Actual behavior: [What actually happened]
Failed with error: Exception: Resharding has not been started (murmur3_partitioner_ignore_msb_bits=15) Check the log for the details
because command for scylla restart was not exectued on node.

Impact

Nemesis disrupt_restart_with_resharding failed because scylla was not restart

How frequently does it reproduce?

rarely

Installation details

SCT Version: master:latest
Scylla version (or git commit hash): Scylla version 5.5.0~dev-0.20240510.28791aa2c1d3

Logs

@k0machi
Copy link
Contributor

k0machi commented May 13, 2024

The issue is due to the nemesis starting scylla and then immediately checking logs - there are currently two problems with that:

  • start_scylla doesn't expose verify_up_timeout parameter of start_scylla_server, which leads to lower timeout than expected
  • the log follower tries to check for log lines without any retries and timeouts, so if resharding starts slightly later (in this case it started 5 minutes after the nemesis failed, fyi the nemesis ran for a total of 2 minutes - which seems kinda low)

@roydahan
Copy link
Contributor

@fruch
Copy link
Contributor

fruch commented Jun 13, 2024

I saw the same issue (I think) here: https://argus.scylladb.com/test/8764d79e-d785-4b99-9c70-99e2f24d0a18/runs?additionalRuns[]=b3091312-ac66-4ca5-86ff-35e1fa8c8dec

@fruch who should own this?

I would expect whom found it during master duty to attend to it, and not waier it off.

@temichus temichus added the tier1 label Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants