box: disable split-brain detection until schema is upgraded
Our split-brain detection machinery relies among other things on all
nodes tracking the synchro queue confirmed lsn. This tracking was only
added together with the split-brain detection. Only the synchro queue
owner tracked the confirmed lsn before.

This means that after an upgrade all the replicas remember the latest
confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner
is treated as a split brain.

Let's fix this and only enable split-brain detection on the replica set
once the schema version is updated. Thanks to the synchro queue freeze
on restart, this can only happen after a new PROMOTE or DEMOTE entry is
written by one of the nodes, and thus the correct confirmed lsn
is propagated with this PROMOTE/DEMOTE to all the cluster members.

Closes tarantool#8996

NO_DOC=bugfix
NO_TEST=hard to test, involves multiple versions
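
The gating rule described in this message can be sketched in isolation. This is an illustration only, not code from the patch: `sketch_version_id`, `sketch_filter_action` and the `filter_action` enum are hypothetical names, and the bit packing of the version triple is an assumption made so that versions compare in release order. The only part taken from the commit is the rule itself: filtering is on exactly when the schema version is strictly greater than 2.10.1, and it is toggled only when a schema version change crosses that threshold.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a version_id()-style helper. The packing
 * (major, minor, patch into one integer) is an assumption; it only has
 * to preserve ordering between versions.
 */
uint32_t
sketch_version_id(unsigned major, unsigned minor, unsigned patch)
{
	return ((((uint32_t)major << 8) | minor) << 8) | patch;
}

/* What should happen to the synchro request filter on a version change. */
enum filter_action {
	FILTER_KEEP,	/* Threshold not crossed, leave the filter as is. */
	FILTER_ENABLE,	/* Upgraded past 2.10.1, start filtering. */
	FILTER_DISABLE,	/* Downgraded to 2.10.1 or older, stop filtering. */
};

/*
 * Decide the action from the old and new schema versions. The filter is
 * wanted only for versions strictly greater than 2.10.1.
 */
enum filter_action
sketch_filter_action(uint32_t old_version, uint32_t new_version)
{
	uint32_t threshold = sketch_version_id(2, 10, 1);
	bool was_on = old_version > threshold;
	bool is_on = new_version > threshold;
	if (is_on && !was_on)
		return FILTER_ENABLE;
	if (!is_on && was_on)
		return FILTER_DISABLE;
	return FILTER_KEEP;
}
```

Under these assumptions, a schema change from 2.10.1 to 2.11.0 yields `FILTER_ENABLE`, the reverse yields `FILTER_DISABLE`, and a change from 2.10.0 to 2.10.1 yields `FILTER_KEEP`, since both versions are at or below the threshold.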
sergepetrenko committed Sep 1, 2023
1 parent f58cc96 commit 8f38270
Showing 5 changed files with 58 additions and 2 deletions.
4 changes: 4 additions & 0 deletions changelogs/unreleased/gh-8996-spurious-spit-brain-detected.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fixed a false-positive split-brain in a replica set on the first
+  promotion after an upgrade from versions before 2.10.1 (gh-8996).
37 changes: 37 additions & 0 deletions src/box/alter.cc
@@ -39,6 +39,7 @@
 #include "coll_id_cache.h"
 #include "coll_id_def.h"
 #include "txn.h"
+#include "txn_limbo.h"
 #include "tuple.h"
 #include "tuple_constraint.h"
 #include "fiber.h" /* for gc_pool */
@@ -4078,10 +4079,46 @@ on_commit_replicaset_name(struct trigger *trigger, void * /* event */)
 	return 0;
 }
 
+static int
+on_commit_dd_version_start_filtering(va_list /* ap */)
+{
+	txn_limbo_filter_enable(&txn_limbo);
+	return 0;
+}
+
+static int
+on_commit_dd_version_stop_filtering(va_list /* ap */)
+{
+	txn_limbo_filter_disable(&txn_limbo);
+	return 0;
+}
+
+/**
+ * Update the cached schema version and enable version-dependent features, like
+ * split-brain detection.
+ */
 static int
 on_commit_dd_version(struct trigger *trigger, void * /* event */)
 {
+	uint32_t old_version_id = dd_version_id;
 	dd_version_id = (uint32_t)(uintptr_t)trigger->data;
+	if (recovery_state != FINISHED_RECOVERY)
+		return 0;
+	if (dd_version_id > version_id(2, 10, 1) &&
+	    old_version_id <= version_id(2, 10, 1)) {
+		struct fiber *fiber;
+		fiber = fiber_new_system("synchro_queue_filter_enabler",
+					 on_commit_dd_version_start_filtering);
+		if (fiber == NULL)
+			panic("Couldn't create a system fiber");
+	} else if (dd_version_id <= version_id(2, 10, 1) &&
+		   old_version_id > version_id(2, 10, 1)) {
+		struct fiber *fiber;
+		fiber = fiber_new_system("synchro_queue_filter_disabler",
+					 on_commit_dd_version_stop_filtering);
+		if (fiber == NULL)
+			panic("Couldn't create a system fiber");
+	}
 	return 0;
 }
 
7 changes: 5 additions & 2 deletions src/box/box.cc
@@ -5460,9 +5460,12 @@ box_cfg_xc(void)
 	/*
 	 * Enable split brain detection once node is fully recovered or
 	 * bootstrapped. No split brain could happen during bootstrap or local
-	 * recovery.
+	 * recovery. Only do so in an upgraded cluster. Unfortunately, schema
+	 * version 2.10.1 was used in 2.10.0 release, while split-brain
+	 * detection appeared in 2.10.1. So use the schema version after 2.10.1.
 	 */
-	txn_limbo_filter_enable(&txn_limbo);
+	if (dd_version_id > version_id(2, 10, 1))
+		txn_limbo_filter_enable(&txn_limbo);
 
 	title("running");
 	say_info("ready to accept requests");
8 changes: 8 additions & 0 deletions src/box/txn_limbo.c
@@ -1273,6 +1273,14 @@ txn_limbo_filter_enable(struct txn_limbo *limbo)
 	latch_unlock(&limbo->promote_latch);
 }
 
+void
+txn_limbo_filter_disable(struct txn_limbo *limbo)
+{
+	latch_lock(&limbo->promote_latch);
+	limbo->do_validate = false;
+	latch_unlock(&limbo->promote_latch);
+}
+
 void
 txn_limbo_init(void)
 {
4 changes: 4 additions & 0 deletions src/box/txn_limbo.h
@@ -435,6 +435,10 @@ txn_limbo_on_parameters_change(struct txn_limbo *limbo);
 void
 txn_limbo_filter_enable(struct txn_limbo *limbo);
 
+/** Stop filtering incoming synchro requests. */
+void
+txn_limbo_filter_disable(struct txn_limbo *limbo);
+
 /**
  * Freeze limbo. Prevent CONFIRMs and ROLLBACKs until limbo is unfrozen.
  */