Stateful failover #604
Conversation
force-pushed from b1f5c9c to 4e6c538
force-pushed from 4e6c538 to 9235376
force-pushed from 4a2746d to 4082451
force-pushed from 8340cfc to e882318
force-pushed from 553fe0c to e441dee
I suggest allowing the node to check its own status and push it into membership. This could be done with some kind of callback. It may also be necessary to check the node's replication status if the node is a storage.
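A minimal sketch of what such a callback could look like (the function name, the payload key, and the replication check are assumptions, not part of this patch), using the membership module's set_payload() API:

```lua
local membership = require('membership')

-- Hypothetical health-check callback: the instance evaluates its own state
-- and publishes it into membership so the coordinator can take it into account.
local function push_health_to_membership()
    local healthy = true

    -- For storage instances, also verify replication status
    -- (assumption: every upstream must be in the 'follow' state).
    for _, replica in pairs(box.info.replication or {}) do
        local upstream = replica.upstream
        if upstream ~= nil and upstream.status ~= 'follow' then
            healthy = false
            break
        end
    end

    membership.set_payload('healthy', healthy)
end
```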
force-pushed from 39d9209 to 5feabda
cartridge/topology.lua (Outdated)

    local parts = uri.parse(topology.failover.coordinator_uri)
    local parts = uri.parse(topology.failover.storage_uri)
Maybe use pool.format_uri, something like that:
    local _, err = pool.format_uri(uri)
    e_config:assert(
        not err,
        '%s.failover.storage_uri: %s',
        field, err.err
    )
force-pushed from 6692dca to 3c3f78c
This patch introduces a new failover mode: stateful.

There are three main concepts:

- The internal coordinator is a new role which makes decisions regarding leadership. Earlier it was a part of every instance's failover module, but now it is split away. There may be only one active coordinator in the cluster at a time; its uniqueness is ensured by the external storage, which manages the lock and saves appointments.
- The external storage (kingdom.lua) is a stand-alone Tarantool instance which provides the locking mechanism and keeps the decisions made by the coordinator.
- The failover module (the old one) operates on every instance in the cluster and gathers leadership information for other modules. It was refactored too, and now it can be described with 4 functions with clearly separated responsibilities (see the sketch below):
  - _get_appointments_* generates the leadership map by itself or polls it from the external storage, depending on the mode setting (disabled/eventual/stateful).
  - accept_appointments() just refreshes the cache and tracks whether anything changed.
  - failover_loop (a fiber) repeatedly gets new appointments and accepts them using the corresponding functions above.
  - cfg is called from confapplier.apply_config() (on restart or on committing a new clusterwide configuration). At first it gets appointments synchronously and then starts the failover loop.

I didn't forget about

Close #148
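A rough, self-contained sketch of how those four responsibilities might fit together (variable names, the local appointment generator, and the 1-second poll interval are illustrative assumptions, not the actual module code):

```lua
local fiber = require('fiber')

-- Cached leadership map: { [replicaset_uuid] = leader_uuid }
local vars = { leaders = {} }

-- Stand-in for one of the _get_appointments_* implementations: here it just
-- derives the map from a topology table instead of polling external storage.
local function get_appointments_local(topology)
    local appointments = {}
    for uuid, replicaset in pairs(topology.replicasets or {}) do
        appointments[uuid] = replicaset.master
    end
    return appointments
end

-- Refreshes the cache and reports whether anything changed.
local function accept_appointments(appointments)
    local changed = false
    for replicaset_uuid, leader_uuid in pairs(appointments) do
        if vars.leaders[replicaset_uuid] ~= leader_uuid then
            vars.leaders[replicaset_uuid] = leader_uuid
            changed = true
        end
    end
    return changed
end

-- Fiber body: repeatedly fetch new appointments and accept them.
local function failover_loop(get_appointments)
    while true do
        local ok, appointments = pcall(get_appointments)
        if ok and appointments ~= nil then
            accept_appointments(appointments)
        end
        fiber.sleep(1)
    end
end

-- Called on (re)configuration: fetch appointments synchronously first,
-- then keep them up to date in a background fiber.
local function cfg(get_appointments)
    accept_appointments(get_appointments())
    fiber.create(failover_loop, get_appointments)
end

-- Usage example with the local generator and a static topology:
local topology = { replicasets = { ['rs-1'] = { master = 'instance-1' } } }
cfg(function() return get_appointments_local(topology) end)
```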