Stateful failover doesn't clean up its data after rs removing #1875

Closed
kluevandrew opened this issue Jul 28, 2022 · 2 comments · Fixed by #1881
Labels: bug (Something isn't working), cartridge

Comments

kluevandrew commented Jul 28, 2022

Description

When stateful failover is enabled and all instances of a non all-rw replicaset are expelled, it becomes impossible to add new instances to that replicaset.

Tarantool version:

bash-4.2$ tarantool --version
Tarantool 2.10.0-rc1-0-g7ed15e6
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr -DENABLE_BACKTRACE=ON
Compiler: /opt/rh/devtoolset-8/root/usr/bin/cc /opt/rh/devtoolset-8/root/usr/bin/c++
C_FLAGS:-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -fexceptions -funwind-tables -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
CXX_FLAGS:-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -fexceptions -funwind-tables -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

Cartridge version:

unix/:/var/run/tarantool/kv.router-0-7557fb969b-0.control> require('cartridge').VERSION
---
- 2.7.3
...

Steps to reproduce:

  1. Create a cluster with a topology like the following:
    • router-0 (roles: vshard-router, failover-coordinator; all-rw = true)
      • router-0-0
    • storage-0 (roles: vshard-storage; all-rw = false; weight=100)
      • storage-0-0
      • storage-0-1
    • storage-1 (roles: vshard-storage; all-rw = false; weight=100)
      • storage-1-0
      • storage-1-1
  2. Bootstrap cluster
  3. Enable stateful failover with the etcd2 state provider and ensure it works
  4. Find out the replicaset_uuid of storage-1 and remember it
  5. Set weight=0 for storage-1
  6. Wait until all buckets are moved to storage-0
  7. Disable storage-1-1
  8. Expel storage-1-1
  9. Disable storage-1-0
  10. Expel storage-1-0
  11. Ensure that replicaset storage-1 is no longer present in the topology
  12. Create two new unconfigured instances from scratch with the same advertise URIs as the expelled instances
  13. Try to join them with roles: vshard-storage; all-rw = false; weight = 100 and the replicaset_uuid found at step 4 (a minimal join sketch is shown right after this list)
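
For reference, step 13 can also be driven from any configured instance's console through the Cartridge Lua admin API. A minimal sketch, assuming the topology above; the advertise URIs and the replicaset UUID below are placeholders, not values from this cluster:

-- Sketch of step 13: joining two fresh instances into a replicaset that
-- reuses the replicaset_uuid remembered at step 4.
local cartridge = require('cartridge')

local res, err = cartridge.admin_edit_topology({
    replicasets = {{
        uuid = 'UUID_FOUND_AT_STEP_4',        -- placeholder
        roles = {'vshard-storage'},
        all_rw = false,
        weight = 100,
        join_servers = {
            {uri = 'storage-1-0:3301'},       -- placeholder advertise URIs
            {uri = 'storage-1-1:3301'},
        },
    }},
})
assert(res ~= nil, err)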

Actual result:

  • One of the instances falls into the "OperationError" state; the second gets stuck in the "BootstrappingBox" state.
  • Both instances are read-only and cannot apply the configuration
2022-07-27 13:13:19.195 [14] main/147/remote_control/10.244.0.57:44388 confapplier.lua:133 E> Instance entering failed state: ConfiguringRoles -> OperationError
ApplyConfigError: Can't modify data on a read-only instance - box.cfg.read_only is true
stack traceback:
        builtin/box/schema.lua:3037: in function 'create'
        /usr/share/tarantool/kv/app/roles/storage.lua:32: in function </usr/share/tarantool/kv/app/roles/storage.lua:4>
        [C]: in function 'xpcall'
        /usr/share/tarantool/kv/.rocks/share/tarantool/errors.lua:145: in function 'pcall'
        .../tarantool/kv/.rocks/share/tarantool/cartridge/roles.lua:365: in function 'apply_config'
        ...tool/kv/.rocks/share/tarantool/cartridge/confapplier.lua:282: in function <...tool/kv/.rocks/share/tarantool/cartridge/confapplier.lua:244>
        [C]: in function 'xpcall'
        /usr/share/tarantool/kv/.rocks/share/tarantool/errors.lua:145: in function </usr/share/tarantool/kv/.rocks/share/tarantool/errors.lua:139>
        [C]: in function 'pcall'
        ...l/kv/.rocks/share/tarantool/cartridge/remote-control.lua:72: in function 'fn'
        ...l/kv/.rocks/share/tarantool/cartridge/remote-control.lua:139: in function <...l/kv/.rocks/share/tarantool/cartridge/remote-control.lua:132>
2022-07-27 13:13:19.195 [14] main/141/remote_control/10.244.0.57:44388 utils.c:463 E> LuajitError: builtin/socket.lua:88: attempt to use closed socket
...
2022-07-27 13:15:02.419 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:03.421 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:04.423 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:05.426 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:06.431 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:07.435 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:08.437 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:09.440 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:10.443 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true
2022-07-27 13:15:11.446 [14] main/174/main box.cc:217 E> ER_READONLY: Can't modify data on a read-only instance - box.cfg.read_only is true

Expected result:

Both instances join the cluster without errors and start normally.

Important notices:

  • After expelling all instances from the replicaset (step 11), an entry about its leader is still present in etcd and in
    require('cartridge.vars').new('cartridge.roles.coordinator').client.session on the coordinator instance (see the inspection sketch after this list)
  • Clearing the /leaders entry in etcd and the coordinator's memory does nothing
  • The bug is not reproducible on all-rw replicasets
  • The bug is not reproducible with eventual or disabled failover
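
The stale record can be inspected directly on the coordinator. A minimal sketch, meant to be run in a console on the instance with the failover-coordinator role; the replicaset UUID is a placeholder:

-- Shows whether the coordinator still tracks the expelled replicaset.
-- Both lookups return non-nil values while the stale record is present.
local vars = require('cartridge.vars').new('cartridge.roles.coordinator')
local rs_uuid = 'UUID_OF_THE_EXPELLED_REPLICASET'  -- placeholder

print(vars.client.session.leaders[rs_uuid])
print(vars.client.session.ctx.decisions[rs_uuid])
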
@kluevandrew kluevandrew added the bug Something isn't working label Jul 28, 2022
@yngvar-antonsson yngvar-antonsson self-assigned this Jul 28, 2022
@kluevandrew (Author)

UPD: The same behavior is reproduced on Cartridge 2.7.5

@kluevandrew (Author)

A stable workaround: it should be run right after the last instance of the replicaset is expelled and can be executed on any running instance:

function on_last_instance_expelled(expelled_replicaset_uuid)
    local pool = require('cartridge.pool')
    local failover = require('cartridge.failover')

    -- Find the instance that currently runs the failover-coordinator role
    local coordinator, _ = failover.get_coordinator()
    local connection = pool.connect(coordinator.uri, {wait_connected = false})

    -- Drop the stale records for the expelled replicaset from the
    -- coordinator's state-provider session and push the change back
    return connection:eval([[
        local expelled_replicaset_uuid = ...
        local vars = require('cartridge.vars').new('cartridge.roles.coordinator')

        vars.client.session.ctx.decisions[expelled_replicaset_uuid] = nil
        vars.client.session.leaders[expelled_replicaset_uuid] = nil

        return vars.client.session:set_leaders({})
    ]], {expelled_replicaset_uuid})
end

on_last_instance_expelled("UUID_OF_REPLICASET_WHERE_LAST_INSTANCES_WAS_EXPELLED")
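
After running the workaround, the same session tables can be checked to confirm the stale records are gone. A verification sketch, run on the coordinator instance; the UUID is a placeholder:

-- Both lookups should return nil once the workaround has been applied.
local vars = require('cartridge.vars').new('cartridge.roles.coordinator')
local rs_uuid = 'UUID_OF_THE_EXPELLED_REPLICASET'  -- placeholder

assert(vars.client.session.leaders[rs_uuid] == nil)
assert(vars.client.session.ctx.decisions[rs_uuid] == nil)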

Special thanks to @yngvar-antonsson

@yngvar-antonsson yngvar-antonsson changed the title Stateful failover works incorrect Stateful failover doesn't clean up its data after rs removing Aug 3, 2022