test_ignore_dead_nodes_for_replace_option: Startup failed: std::runtime_error: Failed to parse node list #14594

Closed
bhalevy opened this issue Jul 10, 2023 · 3 comments
Labels: symptom/ci stability, tests/dtest

bhalevy commented Jul 10, 2023

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/300/artifact/logs-full.release.018/1688957655880_repair_based_node_operations_test.py%3A%3ATestRepairBasedNodeOperations%3A%3Atest_ignore_dead_nodes_for_replace_option/node8.log

Scylla version 5.4.0~dev-0.20230710.7a334c53af10 with build-id 6a14250adb69fd913763bb9e74b2ff5341bde958 starting ...
command used: "/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-_u677qp9/test/node8/bin/scylla --options-file /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-_u677qp9/test/node8/conf/scylla.yaml --log-to-stdout 1 --abort-on-seastar-bad-alloc --abort-on-lsa-bad-alloc 1 --abort-on-internal-error 1 --enable-repair-based-node-ops true --ignore-dead-nodes-for-replace 127.0.51.6,127.0.51.7 --api-address 127.0.51.8 --collectd-hostname 6faa22d09fda.node8 --smp 2 --memory 1024M --developer-mode true --default-log-level info --collectd 0 --overprovisioned --prometheus-address 127.0.51.8 --replace-node-first-boot dcac43c2-c0a2-44cb-84c1-eac5f8a00847 --unsafe-bypass-fsync 1 --kernel-page-cache 1 --commitlog-use-o-dsync 0 --max-networking-io-control-blocks 1000"
...
INFO  2023-07-10 02:54:02,356 [shard 0] storage_service - entering STARTING mode
INFO  2023-07-10 02:54:02,356 [shard 0] storage_service - Loading persisted ring state
INFO  2023-07-10 02:54:02,358 [shard 0] storage_service - initial_contact_nodes={127.0.51.7, 127.0.51.6, 127.0.51.5, 127.0.51.4, 127.0.51.3, 127.0.51.2, 127.0.51.1}, loaded_endpoints={}, loaded_peer_features=0
INFO  2023-07-10 02:54:02,358 [shard 0] storage_service - Gathering node replacement information for dcac43c2-c0a2-44cb-84c1-eac5f8a00847/0000:0000:0000:0000:0000:0000:0000:0000
INFO  2023-07-10 02:54:02,358 [shard 0] storage_service - Checking remote features with gossip
INFO  2023-07-10 02:54:02,358 [shard 0] gossip - Gossip shadow round started with nodes={127.0.51.7, 127.0.51.6, 127.0.51.5, 127.0.51.4, 127.0.51.3, 127.0.51.2, 127.0.51.1}
WARN  2023-07-10 02:54:02,359 [shard 0] gossip - Node 127.0.51.7 is down for get_endpoint_states verb
WARN  2023-07-10 02:54:02,359 [shard 0] gossip - Node 127.0.51.6 is down for get_endpoint_states verb
WARN  2023-07-10 02:54:02,359 [shard 0] gossip - Node 127.0.51.5 is down for get_endpoint_states verb
INFO  2023-07-10 02:54:02,360 [shard 0] gossip - Gossip shadow round finished with nodes_talked={127.0.51.3, 127.0.51.1, 127.0.51.2, 127.0.51.4}
INFO  2023-07-10 02:54:02,360 [shard 0] gossip - Feature check passed. Local node 127.0.51.8 features = {AGGREGATE_STORAGE_OPTIONS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COLLECTION_INDEXING, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, EMPTY_REPLICA_PAGES, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_COLLECTION_DETECTION, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_COMMITLOG, SCHEMA_TABLES_V3, SECONDARY_INDEXES_ON_STATIC_COLUMNS, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, TYPED_ERRORS_IN_READ_RPC, UDA, UDA_NATIVE_PARALLELIZED_AGGREGATION, UNBOUNDED_RANGE_TOMBSTONES, UUID_SSTABLE_IDENTIFIERS, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {AGGREGATE_STORAGE_OPTIONS, ALTERNATOR_TTL, CDC, CDC_GENERATIONS_V2, COLLECTION_INDEXING, COMPUTED_COLUMNS, CORRECT_COUNTER_ORDER, CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, CORRECT_STATIC_COMPACT_IN_MC, COUNTERS, DIGEST_FOR_NULL_VALUES, DIGEST_INSENSITIVE_TO_EXPIRY, DIGEST_MULTIPARTITION_READ, EMPTY_REPLICA_PAGES, HINTED_HANDOFF_SEPARATE_CONNECTION, INDEXES, LARGE_COLLECTION_DETECTION, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, LWT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, ME_SSTABLE_FORMAT, NONFROZEN_UDTS, PARALLELIZED_AGGREGATION, PER_TABLE_CACHING, PER_TABLE_PARTITIONERS, RANGE_SCAN_DATA_VARIANT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_COMMITLOG, SCHEMA_TABLES_V3, SECONDARY_INDEXES_ON_STATIC_COLUMNS, SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT, 
STREAM_WITH_RPC_STREAM, TOMBSTONE_GC_OPTIONS, TRUNCATION_TABLE, TYPED_ERRORS_IN_READ_RPC, UDA, UDA_NATIVE_PARALLELIZED_AGGREGATION, UNBOUNDED_RANGE_TOMBSTONES, UUID_SSTABLE_IDENTIFIERS, VIEW_VIRTUAL_COLUMNS, WRITE_FAILURE_REPLY, XXHASH}
INFO  2023-07-10 02:54:02,360 [shard 0] storage_service - Host 5fd93a5a-62ed-4ccd-803f-d7a2de8886cf/127.0.51.8 is replacing dcac43c2-c0a2-44cb-84c1-eac5f8a00847/127.0.51.5
INFO  2023-07-10 02:54:02,360 [shard 0] storage_service - Replacing a node with a different IP address, my address=127.0.51.8, node being replaced=127.0.51.5
INFO  2023-07-10 02:54:02,361 [shard 0] storage_service - Save advertised features list in the 'system.local' table
INFO  2023-07-10 02:54:02,364 [shard 0] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-07-10 02:54:02,365 [shard 0] storage_service - Starting up server gossip
INFO  2023-07-10 02:54:02,374 [shard 0] compaction - [Compact system.local 06a61260-1ecd-11ee-a292-f1901241ba36] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-_u677qp9/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-2-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-_u677qp9/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-4-big-Data.db:level=0:origin=memtable]
INFO  2023-07-10 02:54:02,375 [shard 0] gossip - failure_detector_loop: Started main loop
INFO  2023-07-10 02:54:02,375 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO  2023-07-10 02:54:02,388 [shard 0] compaction - [Compact system.local 06a61260-1ecd-11ee-a292-f1901241ba36] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-_u677qp9/test/node8/data/system/local-7ad54392bcdd35a684174e047860b377/mc-6-big-Data.db:level=0]. 13kB to 7kB (~55% of original) in 10ms = 753kB/s. ~256 total partitions merged to 1.
INFO  2023-07-10 02:54:03,602 [shard 0] storage_service - Set host_id=ba46fa94-5d07-44a4-a3af-6ebfc0930ece to be owned by node=127.0.51.1
WARN  2023-07-10 02:54:03,604 [shard 0] gossip - Fail to send EchoMessage to 127.0.51.7: seastar::rpc::closed_error (connection is closed)
WARN  2023-07-10 02:54:03,604 [shard 0] gossip - Fail to send EchoMessage to 127.0.51.5: seastar::rpc::closed_error (connection is closed)
WARN  2023-07-10 02:54:03,604 [shard 0] gossip - Fail to send EchoMessage to 127.0.51.6: seastar::rpc::closed_error (connection is closed)
INFO  2023-07-10 02:54:03,604 [shard 0] gossip - InetAddress 127.0.51.1 is now UP, status = NORMAL
INFO  2023-07-10 02:54:03,604 [shard 0] gossip - InetAddress 127.0.51.3 is now UP, status = NORMAL
INFO  2023-07-10 02:54:03,604 [shard 0] gossip - InetAddress 127.0.51.4 is now UP, status = NORMAL
INFO  2023-07-10 02:54:03,604 [shard 0] gossip - InetAddress 127.0.51.2 is now UP, status = NORMAL
INFO  2023-07-10 02:54:03,605 [shard 0] storage_service - Set host_id=11a2210c-e65a-4e4b-8e5c-03ef4dc8caeb to be owned by node=127.0.51.3
INFO  2023-07-10 02:54:03,607 [shard 0] storage_service - Set host_id=d4487fd5-f9a0-4c80-a008-fa47eb72dee8 to be owned by node=127.0.51.4
INFO  2023-07-10 02:54:03,609 [shard 0] storage_service - Node 127.0.51.5 state jump to normal
INFO  2023-07-10 02:54:03,609 [shard 0] storage_service - Set host_id=dcac43c2-c0a2-44cb-84c1-eac5f8a00847 to be owned by node=127.0.51.5
INFO  2023-07-10 02:54:03,611 [shard 0] storage_service - Set host_id=ce727778-a853-4e51-9555-3b91980a4f2a to be owned by node=127.0.51.6
INFO  2023-07-10 02:54:03,613 [shard 0] gossip - Live nodes seen in gossip: {127.0.51.1, 127.0.51.2, 127.0.51.3, 127.0.51.4, 127.0.51.8}
INFO  2023-07-10 02:54:03,613 [shard 0] init - Shutting down group 0 service
...
ERROR 2023-07-10 02:54:12,299 [shard 0] init - Startup failed: std::runtime_error (Failed to parse node list: {127.0.51.6, 127.0.51.7}: invalid node=127.0.51.7: std::runtime_error (Host inet address 127.0.51.7 not found in the cluster))

node7 is indeed shut down by the test before node8 is started.
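For triage, a minimal sketch that cross-checks the `--ignore-dead-nodes-for-replace` list against the `Set host_id=... to be owned by node=...` lines in a node log. The helper name and both regexes are hypothetical and only mirror the log lines quoted above. It also assumes that a missing host_id mapping is related to the "not found in the cluster" error, which this log does not prove, and since the excerpt is elided (`...`) a missing mapping is only a hint.

```python
import re
from typing import List

# Hypothetical patterns, modelled on the log lines quoted in this issue.
HOST_ID_RE = re.compile(r"Set host_id=(\S+) to be owned by node=(\S+)")
IGNORE_RE = re.compile(r"--ignore-dead-nodes-for-replace\s+(\S+)")


def ignored_nodes_without_host_id(log_text: str) -> List[str]:
    """Return ignored-node addresses that have no host_id mapping in the log."""
    match = IGNORE_RE.search(log_text)
    if match is None:
        return []  # the flag was not passed on the command line
    # Drop a trailing quote in case the flag is the last argument of the command.
    ignored = match.group(1).strip('"').split(",")
    mapped = {node for _host_id, node in HOST_ID_RE.findall(log_text)}
    return [node for node in ignored if node not in mapped]
```

Run against the node8.log excerpt above, this would report 127.0.51.7, the same address named in the startup error: the excerpt shows a mapping for 127.0.51.6 but none for 127.0.51.7.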

Scylla includes the fix for a previous, similar issue #14487: 96278a0

bhalevy added the tests/dtest, triage/master, and symptom/ci stability labels on Jul 10, 2023

bhalevy commented Jul 10, 2023

Maybe #14487 should simply be reopened with the above failure, since the fix is apparently not enough.


bhalevy commented Jul 10, 2023

@kbr-scylla https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/300/artifact/logs-full.release.010/1688957447447_update_cluster_layout_tests.py%3A%3ATestUpdateClusterLayout%3A%3Atest_decommission_after_changing_node_ip/node4.log also failed:

Scylla version 5.4.0~dev-0.20230710.7a334c53af10 with build-id 6a14250adb69fd913763bb9e74b2ff5341bde958 starting ...
...
ERROR 2023-07-10 02:50:44,192 [shard 0] init - Startup failed: std::runtime_error (Failed to mark node as alive in 30000 ms, nodes={127.0.88.2, 127.0.88.1, 127.0.88.3}, live_nodes={127.0.88.2, 127.0.88.1})

This looks like #14468, which is also supposed to be fixed by 96278a0, which in turn is contained in 7a334c5.


bhalevy commented Jul 10, 2023

Oops, my bad. 96278a0 was merged to master after 7a334c5, so the build under test (7a334c53af10) does not include the fix.
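Whether a build contains a given fix can be checked before filing with `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. A rough sketch: the function name, the `repo` default, and the injectable `run` parameter are illustrative assumptions, not part of any Scylla tooling.

```python
import subprocess


def is_ancestor(commit: str, descendant: str, repo: str = ".", run=subprocess.run) -> bool:
    """Return True if `commit` is reachable from `descendant` in the given repo."""
    result = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, descendant],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True   # the fix is contained in the build
    if result.returncode == 1:
        return False  # the fix is not contained in the build
    # Any other exit code means git itself failed (for example, an unknown commit).
    raise RuntimeError(f"git merge-base failed: {result.stderr.strip()}")
```

In a scylla checkout, `is_ancestor("96278a0", "7a334c5")` would be expected to return `False`, given that 96278a0 was merged after 7a334c5.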

bhalevy closed this as not planned on Jul 10, 2023
DoronArazii removed the triage/master label on Jul 10, 2023
DoronArazii added this to the 5.4 milestone on Aug 29, 2023