Skip to content

Commit

Permalink
replication: check rs uuid on subscribe process
Browse files Browse the repository at this point in the history
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes so the master checks if the peer lives in the same
replicaset. Since it is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

Correct behaviour is to hard cut this peer off immediately,
without retries.

Closes #6094
Part of #5613
  • Loading branch information
Gerold103 committed Jun 1, 2021
1 parent 2a0a56c commit d5acaab
Show file tree
Hide file tree
Showing 7 changed files with 114 additions and 96 deletions.
6 changes: 6 additions & 0 deletions changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
## bugfix/replication

* Fixed an error when a replica, at attempt to subscribe to a foreign cluster
(with different replicaset UUID), didn't notice it is not possible, and
instead was stuck in an infinite retry loop printing an error about "too
early subscribe" (gh-6094).
17 changes: 15 additions & 2 deletions src/box/box.cc
Original file line number Diff line number Diff line change
Expand Up @@ -2704,17 +2704,30 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
tnt_raise(ClientError, ER_LOADING);

struct tt_uuid replica_uuid = uuid_nil;
struct tt_uuid peer_replicaset_uuid = uuid_nil;
struct vclock replica_clock;
uint32_t replica_version_id;
vclock_create(&replica_clock);
bool anon;
uint32_t id_filter;
xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock,
&replica_version_id, &anon, &id_filter);
xrow_decode_subscribe_xc(header, &peer_replicaset_uuid, &replica_uuid,
&replica_clock, &replica_version_id, &anon,
&id_filter);

/* Forbid connection to itself */
if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID))
tnt_raise(ClientError, ER_CONNECTION_TO_SELF);
/*
* The peer should have bootstrapped from somebody since it tries to
* subscribe already. If it belongs to a different replicaset, it won't
* be ever found here, and would try to reconnect thinking its replica
* ID wasn't replicated here yet. Prevent it right away.
*/
if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
tt_uuid_str(&REPLICASET_UUID),
tt_uuid_str(&peer_replicaset_uuid));
}

/*
* Do not allow non-anonymous followers for anonymous
Expand Down
68 changes: 0 additions & 68 deletions test/replication/gh-3704-misc-replica-checks-cluster-id.result

This file was deleted.

25 changes: 0 additions & 25 deletions test/replication/gh-3704-misc-replica-checks-cluster-id.test.lua

This file was deleted.

67 changes: 67 additions & 0 deletions test/replication/gh-6094-rs-uuid-mismatch.result
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
-- test-run result file version 2
test_run = require('test_run').new()
| ---
| ...

--
-- gh-6094: master instance didn't check if the subscribed instance has the same
-- replicaset UUID as its own. As a result, if the peer is from a different
-- replicaset, the master couldn't find its record in _cluster, and assumed it
-- simply needs to wait a bit more. This led to an infinite re-subscribe.
--
box.schema.user.grant('guest', 'super')
| ---
| ...

test_run:cmd('create server master2 with script="replication/master1.lua"')
| ---
| - true
| ...
test_run:cmd('start server master2')
| ---
| - true
| ...
test_run:switch('master2')
| ---
| - true
| ...
replication = test_run:eval('default', 'return box.cfg.listen')[1]
| ---
| ...
box.cfg{replication = {replication}}
| ---
| ...
assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
| ---
| - ER_REPLICASET_UUID_MISMATCH
| ...
info = box.info
| ---
| ...
repl_info = info.replication[1]
| ---
| ...
assert(not repl_info.downstream and not repl_info.upstream)
| ---
| - true
| ...
assert(info.status == 'orphan')
| ---
| - true
| ...

test_run:switch('default')
| ---
| - true
| ...
test_run:cmd('stop server master2')
| ---
| - true
| ...
test_run:cmd('delete server master2')
| ---
| - true
| ...
box.schema.user.revoke('guest', 'super')
| ---
| ...
25 changes: 25 additions & 0 deletions test/replication/gh-6094-rs-uuid-mismatch.test.lua
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
test_run = require('test_run').new()

--
-- gh-6094: master instance didn't check if the subscribed instance has the same
-- replicaset UUID as its own. As a result, if the peer is from a different
-- replicaset, the master couldn't find its record in _cluster, and assumed it
-- simply needs to wait a bit more. This led to an infinite re-subscribe.
--
box.schema.user.grant('guest', 'super')

test_run:cmd('create server master2 with script="replication/master1.lua"')
test_run:cmd('start server master2')
test_run:switch('master2')
replication = test_run:eval('default', 'return box.cfg.listen')[1]
box.cfg{replication = {replication}}
assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
info = box.info
repl_info = info.replication[1]
assert(not repl_info.downstream and not repl_info.upstream)
assert(info.status == 'orphan')

test_run:switch('default')
test_run:cmd('stop server master2')
test_run:cmd('delete server master2')
box.schema.user.revoke('guest', 'super')
2 changes: 1 addition & 1 deletion test/replication/suite.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,6 @@
"gh-3610-misc-assert-connecting-master-twice.test.lua": {},
"gh-3637-misc-error-on-replica-auth-fail.test.lua": {},
"gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua": {},
"gh-3704-misc-replica-checks-cluster-id.test.lua": {},
"gh-3711-misc-no-restart-on-same-configuration.test.lua": {},
"gh-3760-misc-return-on-quorum-0.test.lua": {},
"gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
Expand Down Expand Up @@ -47,6 +46,7 @@
"gh-5566-final-join-synchro.test.lua": {},
"gh-6032-promote-wal-write.test.lua": {},
"gh-6057-qsync-confirm-async-no-wal.test.lua": {},
"gh-6094-rs-uuid-mismatch.test.lua": {},
"*": {
"memtx": {"engine": "memtx"},
"vinyl": {"engine": "vinyl"}
Expand Down

0 comments on commit d5acaab

Please sign in to comment.