raft topology upgrade procedure stucked after upgrade from 5.4.4 #17822

Closed
aleksbykov opened this issue Mar 15, 2024 · 13 comments

@aleksbykov
Contributor

aleksbykov commented Mar 15, 2024

Scylla version: 5.5.0-dev-0.20240313.15bea069a911 with build-id babc816420ca865129a6fb6d6028c7da05ba478a
Cluster size: 3 nodes
Base version: Scylla version 5.4.4-0.20240228.58a1be93b212 with build-id f385c89e3d29e70e0a8beab1dfefcace5e909775
Upgrade to : 5.5.0-dev-0.20240313.15bea069a911
CI:
Draft PR with reproducer: https://github.com/scylladb/scylla-dtest/pull/4066

Test starts with 5.4. Then it runs rolling upgrade to 5.5. Verify data after upgrade. Then stop each node, update scylla.yaml with
experimental_features: consistent-topology-changes, start node. After all nodes are up,
the test sends a REST API POST to /storage_service/raft_topology/upgrade to start the raft topology upgrade procedure.
After that it polls the same endpoint with GET, /storage_service/raft_topology/upgrade, waiting for the status to become done.
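
The trigger-and-poll step described above could be sketched roughly as follows. This is a minimal illustration, not the dtest code: the endpoint path and the "done" status come from this report, while the function names, the base URL, the response body format, and any status values other than "done" are assumptions.

```python
import json
import time
import urllib.request

UPGRADE_PATH = "/storage_service/raft_topology/upgrade"  # endpoint named in this report


def http_call(base_url, method):
    """Send one request to the upgrade endpoint and return the decoded body."""
    req = urllib.request.Request(base_url + UPGRADE_PATH, method=method)
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read().decode("utf-8").strip()
    # Assumption: the body is a JSON value (e.g. a quoted string); otherwise use raw text.
    try:
        return json.loads(body) if body else None
    except json.JSONDecodeError:
        return body


def wait_for_upgrade(base_url, timeout_s=300.0, interval_s=1.0,
                     call=http_call, sleep=time.sleep, clock=time.monotonic):
    """POST once to start the upgrade, then poll GET until the status is "done".

    Raises TimeoutError carrying the last seen status if the deadline passes,
    which is how a hang like the one reported here would surface in a test.
    """
    call(base_url, "POST")
    deadline = clock() + timeout_s
    status = None
    while clock() < deadline:
        status = call(base_url, "GET")
        if status == "done":
            return status
        sleep(interval_s)
    raise TimeoutError(f"raft topology upgrade not done, last status: {status!r}")


# Hypothetical usage; the port is an assumption about the node's REST API address:
# wait_for_upgrade("http://127.0.62.3:10000", timeout_s=120)
```

Making `call`, `sleep`, and `clock` injectable keeps the sketch testable without a running cluster; a real test would more likely reuse the existing dtest REST helpers.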

But all nodes get stuck after:

INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token -2941654910418796807
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token -6320067988398039908
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token 6245740444337385762
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token 406493589704482632
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token -3868278053417635166
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token -6610197408461548696
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token -3214514548453325505
INFO  2024-03-15 18:36:26,403 [shard 0: gms] storage_service - handle_state_normal: endpoint=127.0.62.3 == current_owner=127.0.62.3 token 1453937152616926887
INFO  2024-03-15 18:36:26,412 [shard 0: gms] storage_service - handle_state_normal for 127.0.62.3/f1d8e9d5-6590-4e3b-a687-f53059ab8f0c finished
INFO  2024-03-15 18:36:26,419 [shard 0:comp] compaction - [Compact system.scylla_local 42038830-e2c0-11ee-838a-a35f39ac0e6c] Compacting [/home/abykov/.dtest/dtest-jx12h4us/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3geg_0w8q_2gqg0200hrxjv8owh8-big-Data.db:level=0:origin=memtable,/home/abykov/.dtest/dtest-jx12h4us/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3geg_0w8l_2f8fl200hrxjv8owh8-big-Data.db:level=0:origin=compaction]
INFO  2024-03-15 18:36:26,419 [shard 0: gms] features - Feature SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES is enabled
INFO  2024-03-15 18:36:26,420 [shard 0: gms] gossip - InetAddress f1d8e9d5-6590-4e3b-a687-f53059ab8f0c/127.0.62.3 is now UP, status = NORMAL
INFO  2024-03-15 18:36:26,421 [shard 0:comp] compaction - [Compact system.scylla_local 42038830-e2c0-11ee-838a-a35f39ac0e6c] Compacted 2 sstables to [/home/abykov/.dtest/dtest-jx12h4us/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3geg_0w8q_2ht0x200hrxjv8owh8-big-Data.db:level=0]. 12kB to 6441 bytes (~52% of original) in 2ms = 6MB/s. ~256 total partitions merged to 4.
INFO  2024-03-15 18:36:26,907 [shard 0:strm] raft_topology - requesting to start upgrade to topology on raft
INFO  2024-03-15 18:36:26,994 [shard 0:strm] raft_topology - upgrade to topology on raft is scheduled
INFO  2024-03-15 18:37:16,407 [shard 0:strm] cdc - Generation (2024/03/15 11:35:55, b55db3f5-f421-4a18-9613-085e72b0d692): streams description table already updated.

Logs:
dtest.logs.tar.gz

I aborted the scylla process at the moment it hung:

Aborting on shard 0.
Backtrace:
  0x5dba2f8
  0x5df0b41
  /home/abykov/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x3dbaf
  /home/abykov/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x10ac5c
  0x5e1a050
  0x5e15f62
  0x5e1668c
  0x5dcd50c
  0x5dcc578
  0x5d5b387
  0x5d5a53c
  0x139e159
  0x139fbb0
  0x139c7dc
  /home/abykov/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89
  /home/abykov/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a
  0x139a064

[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
/data/scylla-s3-reloc.cache/by-build-id/babc816420ca865129a6fb6d6028c7da05ba478a/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=70e92bb237883be3065a6afc9f0696aef2d068bf, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
syscall at ??:?
seastar::internal::io_pgetevents(unsigned long, long, long, seastar::internal::linux_abi::io_event*, timespec const*, __sigset_t const*, bool) at ./build/release/seastar/./seastar/src/core/linux-aio.cc:163
seastar::reactor_backend_aio::await_events(int, __sigset_t const*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:490
seastar::reactor_backend_aio::wait_and_process_events(__sigset_t const*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:557
seastar::reactor::sleep() at ./build/release/seastar/./seastar/src/core/reactor.cc:3406
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3372
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3210
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:673
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2147
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?

@aleksbykov
Contributor Author

@piodul can you take a look?

@kbr-scylla
Contributor

Test starts with 5.4. Then it runs rolling upgrade to 5.5. Verify data after upgrade. Then stop each node, update scylla.yaml with
experimental_features: consistent-topology-changes, start node.

@aleksbykov please update the test: don't start 5.5 without consistent-topology-changes first. Enable it at the same time you upgrade to 5.5, that is, enable in config before you restart in 5.5 version. That's because we plan to remove the flag. And please verify whether it still gets stuck the same way.

kbr-scylla added this to the 6.0 milestone Mar 15, 2024

@aleksbykov
Contributor Author

aleksbykov commented Mar 15, 2024

Test starts with 5.4. Then it runs rolling upgrade to 5.5. Verify data after upgrade. Then stop each node, update scylla.yaml with
experimental_features: consistent-topology-changes, start node.

@aleksbykov please update the test: don't start 5.5 without consistent-topology-changes first. Enable it at the same time you upgrade to 5.5, that is, enable in config before you restart in 5.5 version. That's because we plan to remove the flag. And please verify whether it still gets stuck the same way.

This way works: if consistent-topology-changes is added before the upgrade and the upgrade is run with it, then the upgrade procedure finishes and the nodes are running.

@kbr-scylla
Contributor

But you start 5.4 without consistent-topology-changes, right?

@aleksbykov
Contributor Author

aleksbykov commented Mar 15, 2024

Right. It started with experimental_features: []

@kbr-scylla
Contributor

Ok. In that case there is nothing to fix. We will not support this:

  • start 5.4 without consistent-topology-changes flag
  • then start 6.0 without consistent-topology-changes flag
  • then restart 6.0 with consistent-topology-changes flag
  • then trigger upgrade

because we will remove consistent-topology-changes experimental-feature from 6.0.

We will only support this:

  • start 5.4 without consistent-topology-changes flag
  • start 6.0 (with flag implicitly ON because it was removed)
  • then trigger upgrade

Your test should only check the second scenario.

If the second scenario works, we're good
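
As a rough restatement of the two lists above, the rule can be written as a small check over the sequence of node starts. This is only an illustrative sketch: the `NodeStart` type and `is_supported_path` function are hypothetical names, and "flag on" for 6.0 stands for the flag being implicitly enabled once it is removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStart:
    version: str            # e.g. "5.4" or "6.0"
    topology_flag_on: bool  # consistent-topology-changes effectively enabled


def is_supported_path(starts):
    """Return True if every start on 6.0 ran with the flag on.

    Encodes only the statement above: a 6.0 start without the flag
    (scenario 1) is unsupported, whatever happens afterwards.
    """
    return all(s.topology_flag_on for s in starts if s.version == "6.0")


# Scenario 1 (not supported): 6.0 is first started without the flag.
scenario_1 = [NodeStart("5.4", False), NodeStart("6.0", False), NodeStart("6.0", True)]
# Scenario 2 (supported): 6.0 starts with the flag implicitly on.
scenario_2 = [NodeStart("5.4", False), NodeStart("6.0", True)]
```

With these definitions, `is_supported_path(scenario_1)` is false and `is_supported_path(scenario_2)` is true; the upgrade trigger itself happens after the last start in both cases.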

@kbr-scylla
Contributor

So please update the test to start 6.0 (currently called 5.5 but we'll have to update) already with consistent-topology-changes ON.

@aleksbykov
Contributor Author

aleksbykov commented Mar 15, 2024

Yes, we are good with the second scenario.
Will consistent-topology-changes be:

  • set as a parameter in scylla.yaml and always on: consistent-topology-changes: true?
  • or, as with consistent-cluster-management (schema changes), enabled by default and not present in scylla.yaml?

If it is the first variant, we can end up in the same situation as in the first case: a customer sets it to false, runs the upgrade, and then enables topology later.

@kbr-scylla
Contributor

It will be removed from scylla.yaml in 6.0.
Issue to track: #17802

@aleksbykov
Contributor Author

If /storage_service/raft_topology/upgrade is not called, will the cluster continue to work without raft topology after the upgrade, even when it is enabled by default (6.0)?

@kbr-scylla
Contributor

Yes

@aleksbykov
Contributor Author

This issue can be closed, because the upgrade path used was not relevant.

@kbr-scylla
Contributor

We can close it once the dtest change is merged
