can't connect new node after dead node was removed #17404

Closed
oturek opened this issue Feb 19, 2024 · 16 comments

Comments

@oturek

oturek commented Feb 19, 2024

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): 5.2.9-0.20230920.5709d0043978-1
Cluster size: 4 nodes
cluster-log.txt
health-check.txt
node-log.txt

OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04

@mykaul
Contributor

mykaul commented Feb 20, 2024

The cluster log (where is it coming from?) is a bit shy in details:

Feb 19 15:35:51 scylla-bm-node2 scylla[5147]:  [shard  0] raft_group_registry - Raft server id c3af2f4e-b70a-4d14-8467-3f52d518963f cannot be translated to an IP address.
Feb 19 15:36:16 scylla-bm-node2 scylla[5147]:  [shard  2] gossip - failure_detector_loop: Send echo to node 10.10.26.15, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Feb 19 15:36:16 scylla-bm-node2 scylla[5147]:  [shard  2] rpc - client 10.10.26.15:7000: client connection dropped: connection is closed

Can you share more details? How did you perform the replace, what was the IP of the old node, the new node, etc.?

@mykaul
Contributor

mykaul commented Feb 20, 2024

This is interesting, from the node log:

Feb 19 14:35:53 scylla-bm-node3 scylla[20205]: scylla: schema.cc:368: schema::schema(schema::private_tag, const schema::raw_schema &, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]: Aborting on shard 0.
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]: Backtrace:
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x4ff41f8
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x5026632
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x3cb1f
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x8ce5b
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x3ca75
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x267fb
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x2671a
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x35655
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x1af7c9b
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x1afb064
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x3172f43
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x31d62bc
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x31d3488
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x31b7373
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x323cb2d
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x1201fba
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x5004a74
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x5005cf7
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x5005039
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x4fabaf5
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x4faac68
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x1148d34
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x114a840
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x114746a
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x2750f
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   /opt/scylladb/libreloc/libc.so.6+0x275c8
Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:   0x1145624

Decoded:

Feb 19 14:35:53 scylla-bm-node3 scylla[20205]: Backtrace:Feb 19 14:35:53 scylla-bm-node3 scylla[20205]:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:783
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:813
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:825
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:3864
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3840
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:3836
/data/scylla-s3-reloc.cache/by-build-id/686601fd1656c6724f7f042163b9285bf3efd582/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=81daba31ee66dbd63efdc4252a872949d874d136, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
__assert_fail_base.cold at ??:?
__GI___assert_fail at :?
schema at ./schema.cc:368
seastar::lw_shared_ptr<schema> seastar::lw_shared_ptr<schema>::make<schema::private_tag, schema::raw_schema&, std::optional<raw_view_info>&>(schema::private_tag&&, schema::raw_schema&, std::optional<raw_view_info>&) at ././seastar/include/seastar/core/shared_ptr.hh:274
 (inlined by) seastar::lw_shared_ptr<schema> seastar::make_lw_shared<schema, schema::private_tag, schema::raw_schema&, std::optional<raw_view_info>&>(schema::private_tag&&, schema::raw_schema&, std::optional<raw_view_info>&) at ././seastar/include/seastar/core/shared_ptr.hh:434
 (inlined by) schema_builder::build() at ./schema.cc:1325
db::schema_tables::create_table_from_mutations(db::schema_ctxt const&, schema_mutations, std::optional<utils::tagged_uuid<table_schema_version_tag> >) at ./db/schema_tables.cc:2986
operator() at ./db/schema_tables.cc:1290
 (inlined by) seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)>::direct_vtable_for<db::schema_tables::merge_tables_and_views(seastar::sharded<service::storage_proxy>&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&)::$_13>::call(seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)> const*, schema_mutations, db::schema_tables::schema_diff_side) at ././seastar/include/seastar/util/noncopyable_function.hh:124
seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)>::operator()(schema_mutations, db::schema_tables::schema_diff_side) const at ././seastar/include/seastar/util/noncopyable_function.hh:210
 (inlined by) db::schema_tables::diff_table_or_view(seastar::sharded<service::storage_proxy>&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)>) at ./db/schema_tables.cc:1271
db::schema_tables::merge_tables_and_views(seastar::sharded<service::storage_proxy>&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&, std::map<utils::tagged_uuid<table_id_tag>, schema_mutations, std::less<utils::tagged_uuid<table_id_tag> >, std::allocator<std::pair<utils::tagged_uuid<table_id_tag> const, schema_mutations> > >&&) at ./db/schema_tables.cc:1289
db::schema_tables::do_merge_schema(seastar::sharded<service::storage_proxy>&, std::vector<mutation, std::allocator<mutation> >, bool) at ./db/schema_tables.cc:1146
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/coroutine:244
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:120
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2509
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2946
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3115
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2998
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:157
scylla_main(int, char**) at ./main.cc:546
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:591
main at ./main.cc:1810
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?
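
Before a backtrace like the one above can be decoded, the raw frames have to be separated from the journal prefix. The following is a minimal sketch of that step, assuming the line layout shown in this log; the function name `extract_frames` and both regular expressions are illustrative and not part of any Scylla tooling.

```python
import re

# Journal prefix as it appears in the log above:
# "Feb 19 14:35:53 scylla-bm-node3 scylla[20205]: "
PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+scylla\[\d+\]:\s*")
# A frame is either a bare hex address or "library+0xoffset".
FRAME = re.compile(r"^(0x[0-9a-f]+|\S+\+0x[0-9a-f]+)$")


def extract_frames(lines):
    """Return the frames that follow a 'Backtrace:' marker (hypothetical helper)."""
    frames = []
    in_backtrace = False
    for line in lines:
        body = PREFIX.sub("", line).strip()
        if body == "Backtrace:":
            in_backtrace = True
            continue
        if in_backtrace:
            if FRAME.match(body):
                frames.append(body)
            else:
                # First non-frame line ends the backtrace section.
                in_backtrace = False
    return frames
```

The resulting list can then be passed, one frame per line, to whatever symbolizer matches the exact build that produced the log.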


@mykaul
Contributor

mykaul commented Feb 20, 2024

Which is identical to #16683 (comment) , I think.

@mykaul
Contributor

mykaul commented Feb 20, 2024

Can you see if it's still happening with a newer 5.2.x Scylla? Latest is https://forum.scylladb.com/t/release-scylladb-5-2-15/1306

@oturek
Author

oturek commented Feb 20, 2024

The cluster log (where is it coming from?) is a bit shy in details:

Feb 19 15:35:51 scylla-bm-node2 scylla[5147]:  [shard  0] raft_group_registry - Raft server id c3af2f4e-b70a-4d14-8467-3f52d518963f cannot be translated to an IP address.
Feb 19 15:36:16 scylla-bm-node2 scylla[5147]:  [shard  2] gossip - failure_detector_loop: Send echo to node 10.10.26.15, status = failed: seastar::rpc::timeout_error (rpc call timed out)
Feb 19 15:36:16 scylla-bm-node2 scylla[5147]:  [shard  2] rpc - client 10.10.26.15:7000: client connection dropped: connection is closed

Can you share more details? How did you perform the replace, what was the IP of the old node, the new node, etc.?

Hi,
Old IP: 10.10.26.13
How I removed the node:

  • nodetool drain
  • nodetool removenode node ID

Then I tried to add a new node with IP 10.10.26.13.
After hitting the same issue:

  • found ghost member id
  • nodetool removenode ghost member id

Then I tried to add the node with IP 10.10.26.15.

This log was taken from another node with: journalctl _COMM=scylla
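
When going through `journalctl _COMM=scylla` output like the cluster log quoted above, it can help to count which peers the failure detector could not reach. This is a rough sketch that assumes the `failure_detector_loop` message format seen in this thread; the helper name `failed_echo_peers` is hypothetical.

```python
import re
from collections import Counter

# Matches the gossip message quoted in this thread, e.g.
# "failure_detector_loop: Send echo to node 10.10.26.15, status = failed: ..."
ECHO_FAILED = re.compile(
    r"failure_detector_loop: Send echo to node (\d+\.\d+\.\d+\.\d+), status = failed"
)


def failed_echo_peers(lines):
    """Count failed echo attempts per peer IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ECHO_FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A peer that dominates the counts on every live node (here, the address of the node that failed to join) is a reasonable place to start looking at that node's own log.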

@oturek
Author

oturek commented Feb 21, 2024

see if it's still

Hi, the same problem with 5.2.15

@mykaul
Contributor

mykaul commented Feb 21, 2024

@kbr-scylla - can someone from your team look at this?
@oturek - can you please share more complete logs? Snippets sometimes lack critical information.

@oturek
Author

oturek commented Feb 21, 2024

@kbr-scylla - can someone from your team look at this? @oturek - can you please share more complete logs? Snippets sometimes lack critical information.

Sure, here are the syslog and journalctl output from the node I tried to connect:
scylla-node3-syslog.txt
journalctl-node3.txt
Log from an alive node:
alive-node-log.txt

@mykaul what else can I share?

@mykaul
Contributor

mykaul commented Feb 21, 2024

The alive node log is only 10-15 lines or so, I'm not sure it's sufficient. Let's see.

@oturek
Author

oturek commented Feb 21, 2024

Sorry, here is the log from 6am:
alive-node2-log.txt

@kbr-scylla
Contributor

kbr-scylla commented Feb 21, 2024

@oturek I see you're trying to add a new node running 5.2.15, but have you upgraded the existing cluster to 5.2.15?

Please provide a fresh health check report from all nodes in your cluster and also the output of nodetool status.

@oturek
Author

oturek commented Feb 21, 2024

@kbr-scylla no, the cluster wasn't upgraded to 5.2.15. (I tested beforehand on a test cluster whether a node running 5.2.15 can join a cluster running 5.2.1, and there was no problem.)
Health reports:
node1.zip
node3.zip
node4.zip
node2-part1.zip
node2-part2.zip

Sorry, I split node 2's scylla-logs.txt.gz into two parts (scylla-logs.txt.gz.part-aa and scylla-logs.txt.gz.part-ab) because the single file was too big to upload here.
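
For anyone downloading the two parts, they have to be concatenated in suffix order (`aa`, then `ab`) to get the original archive back. The sketch below shows one way to do that; it assumes the parts are plain byte-wise slices of the original file, and the helper name `join_parts` is made up for illustration.

```python
import shutil
from pathlib import Path


def join_parts(directory, prefix, output):
    """Concatenate files named prefix + suffix in sorted suffix order.

    Hypothetical helper: assumes each part is a raw slice of the original
    file, so that joining them byte for byte restores it.
    """
    parts = sorted(Path(directory).glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return [p.name for p in parts]
```

For the files named in this comment, the call would look like `join_parts(".", "scylla-logs.txt.gz.part-", "scylla-logs.txt.gz")`.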

@kbr-scylla
Contributor

@oturek in that case, please upgrade your cluster to 5.2.15 before trying to join the node again. It contains important fixes which should eliminate the crash you're seeing.
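
As a rough illustration of checking that every existing node is on the target release before a join, the sketch below compares version strings in the format reported in this issue (for example `5.2.9-0.20230920.5709d0043978-1`). The node names and the helper names `release_tuple` and `nodes_needing_upgrade` are hypothetical, and how the versions are collected from each node is left out.

```python
import re


def release_tuple(version):
    """Parse the leading 'major.minor.patch' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def nodes_needing_upgrade(node_versions, target):
    """Return the nodes whose release is older than the target (hypothetical helper)."""
    target_release = release_tuple(target)
    return sorted(
        node
        for node, version in node_versions.items()
        if release_tuple(version) < target_release
    )
```

Comparing integer tuples rather than strings matters here: as plain strings, "5.2.9" sorts after "5.2.15".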

@oturek
Author

oturek commented Feb 22, 2024

@oturek in that case please upgrade your cluster to 5.2.15 before trying to join the node again, it contains important fixes which should eliminate the crash you're seeing.

Thanks, @kbr-scylla. I'll be back with the result.

@mykaul
Contributor

mykaul commented Feb 28, 2024

@oturek - any news?

mykaul removed the triage/oss label Feb 28, 2024

@oturek
Author

oturek commented Feb 28, 2024

Hi, sorry for the delay. The node finally connected with version 5.2.9 (it was the last try) before upgrading to 5.2.15 :)
Thanks for the support, I'll close the issue.

oturek closed this as completed Feb 28, 2024