Skip to content

Commit

Permalink
replication: check rs uuid on subscribe process
Browse files Browse the repository at this point in the history
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes so the master checks if the peer lives in the same
replicaset. Since it is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

Correct behaviour is to hard cut this peer off immediately,
without retries.

Closes #6094
Part of #5613

(cherry picked from commit ea0b126)
  • Loading branch information
Gerold103 committed Jun 2, 2021
1 parent 1a76691 commit 39a009a
Show file tree
Hide file tree
Showing 7 changed files with 121 additions and 101 deletions.
6 changes: 6 additions & 0 deletions changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
## bugfix/replication

* Fixed an error when a replica, at attempt to subscribe to a foreign cluster
(with different replicaset UUID), didn't notice it is not possible, and
instead was stuck in an infinite retry loop printing an error about "too
early subscribe" (gh-6094).
29 changes: 22 additions & 7 deletions src/box/box.cc
Original file line number Diff line number Diff line change
Expand Up @@ -2704,17 +2704,30 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
tnt_raise(ClientError, ER_LOADING);

struct tt_uuid replica_uuid = uuid_nil;
struct tt_uuid peer_replicaset_uuid = uuid_nil;
struct vclock replica_clock;
uint32_t replica_version_id;
vclock_create(&replica_clock);
bool anon;
uint32_t id_filter;
xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock,
&replica_version_id, &anon, &id_filter);
xrow_decode_subscribe_xc(header, &peer_replicaset_uuid, &replica_uuid,
&replica_clock, &replica_version_id, &anon,
&id_filter);

/* Forbid connection to itself */
if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID))
tnt_raise(ClientError, ER_CONNECTION_TO_SELF);
/*
* The peer should have bootstrapped from somebody since it tries to
* subscribe already. If it belongs to a different replicaset, it won't
* be ever found here, and would try to reconnect thinking its replica
* ID wasn't replicated here yet. Prevent it right away.
*/
if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
tt_uuid_str(&REPLICASET_UUID),
tt_uuid_str(&peer_replicaset_uuid));
}

/*
* Do not allow non-anonymous followers for anonymous
Expand Down Expand Up @@ -2773,11 +2786,13 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
* the replica how many rows we have in stock for it,
* and identify ourselves with our own replica id.
*
* Tarantool > 2.1.1 master doesn't check that replica
* has the same cluster id. Instead it sends its cluster
* id to replica, and replica checks that its cluster id
* matches master's one. Older versions will just ignore
* the additional field.
* Master not only checks the replica has the same replicaset UUID, but
* also sends the UUID to the replica so both Tarantools could perform
* any checks they want depending on their version and supported
* features.
*
* Older versions not supporting replicaset UUID in the response will
* just ignore the additional field (these are < 2.1.1).
*/
struct xrow_header row;
xrow_encode_subscribe_response_xc(&row, &REPLICASET_UUID, &vclock);
Expand Down
68 changes: 0 additions & 68 deletions test/replication/gh-3704-misc-replica-checks-cluster-id.result

This file was deleted.

25 changes: 0 additions & 25 deletions test/replication/gh-3704-misc-replica-checks-cluster-id.test.lua

This file was deleted.

67 changes: 67 additions & 0 deletions test/replication/gh-6094-rs-uuid-mismatch.result
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
-- test-run result file version 2
test_run = require('test_run').new()
| ---
| ...

--
-- gh-6094: master instance didn't check if the subscribed instance has the same
-- replicaset UUID as its own. As a result, if the peer is from a different
-- replicaset, the master couldn't find its record in _cluster, and assumed it
-- simply needs to wait a bit more. This led to an infinite re-subscribe.
--
box.schema.user.grant('guest', 'super')
| ---
| ...

test_run:cmd('create server master2 with script="replication/master1.lua"')
| ---
| - true
| ...
test_run:cmd('start server master2')
| ---
| - true
| ...
test_run:switch('master2')
| ---
| - true
| ...
replication = test_run:eval('default', 'return box.cfg.listen')[1]
| ---
| ...
box.cfg{replication = {replication}}
| ---
| ...
assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
| ---
| - ER_REPLICASET_UUID_MISMATCH
| ...
info = box.info
| ---
| ...
repl_info = info.replication[1]
| ---
| ...
assert(not repl_info.downstream and not repl_info.upstream)
| ---
| - true
| ...
assert(info.status == 'orphan')
| ---
| - true
| ...

test_run:switch('default')
| ---
| - true
| ...
test_run:cmd('stop server master2')
| ---
| - true
| ...
test_run:cmd('delete server master2')
| ---
| - true
| ...
box.schema.user.revoke('guest', 'super')
| ---
| ...
25 changes: 25 additions & 0 deletions test/replication/gh-6094-rs-uuid-mismatch.test.lua
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
test_run = require('test_run').new()

--
-- gh-6094: master instance didn't check if the subscribed instance has the same
-- replicaset UUID as its own. As a result, if the peer is from a different
-- replicaset, the master couldn't find its record in _cluster, and assumed it
-- simply needs to wait a bit more. This led to an infinite re-subscribe.
--
box.schema.user.grant('guest', 'super')

test_run:cmd('create server master2 with script="replication/master1.lua"')
test_run:cmd('start server master2')
test_run:switch('master2')
replication = test_run:eval('default', 'return box.cfg.listen')[1]
box.cfg{replication = {replication}}
assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
info = box.info
repl_info = info.replication[1]
assert(not repl_info.downstream and not repl_info.upstream)
assert(info.status == 'orphan')

test_run:switch('default')
test_run:cmd('stop server master2')
test_run:cmd('delete server master2')
box.schema.user.revoke('guest', 'super')
2 changes: 1 addition & 1 deletion test/replication/suite.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,6 @@
"gh-3610-misc-assert-connecting-master-twice.test.lua": {},
"gh-3637-misc-error-on-replica-auth-fail.test.lua": {},
"gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua": {},
"gh-3704-misc-replica-checks-cluster-id.test.lua": {},
"gh-3711-misc-no-restart-on-same-configuration.test.lua": {},
"gh-3760-misc-return-on-quorum-0.test.lua": {},
"gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
Expand Down Expand Up @@ -47,6 +46,7 @@
"gh-5566-final-join-synchro.test.lua": {},
"gh-6032-promote-wal-write.test.lua": {},
"gh-6057-qsync-confirm-async-no-wal.test.lua": {},
"gh-6094-rs-uuid-mismatch.test.lua": {},
"*": {
"memtx": {"engine": "memtx"},
"vinyl": {"engine": "vinyl"}
Expand Down

0 comments on commit 39a009a

Please sign in to comment.