replication: allow to re-register with new UUID

Previously it was not allowed to change an instance UUID in
_cluster. When needed, it had to be done manually by deleting the
instance from _cluster and inserting it back with a new UUID, or
it was not done at all.

Re-UUID (like re-name) was reported to be used when people didn't
want to register new replica IDs. They wanted to rejoin lost
replicas from scratch while keeping the numeric ID. They could
handle the UUID either by setting it explicitly to the old value
on the new instance, or by doing the manual re-UUID described
above.

This commit is supposed to make things simpler. If a replica has a
name, then its re-join with another UUID is not an error. Its
record in _cluster is automatically updated to store the new UUID.

That is only possible if the instance with the old UUID is no
longer connected and is not listed in box.cfg.replication.

Closes #5029

@TarantoolBot document
Title: Instance rebootstrap with new UUID but same ID and name
If an instance has a non-empty instance name
(`box.cfg.instance_name`), then at rebootstrap it can keep the
name and its old numeric ID (space `_cluster['id']` field).

This might be needed if one doesn't want to pollute `_cluster`
with new rows and, for some reason, doesn't want to or can't just
drop the rows belonging to the dead replicas.

In order for this to work 1) the re-bootstrapping replica must
keep its old non-empty instance name, 2) the other instances
should not have any live connections to the old dead replica.
Ideally, the old replica should simply be deleted from
`box.cfg.replication` everywhere.

When that works, the old row in `_cluster` is automatically
updated with the new instance UUID.
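
The conditions above can be pictured with a small, self-contained Python model. This is only a sketch under stated assumptions, not Tarantool code: `_cluster` is modeled as a plain dict, and `Row`, `rejoin`, and `connected_uuids` are hypothetical names. Picking the smallest free ID for a brand-new instance is also an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Toy stand-in for a `_cluster` tuple (hypothetical model)."""
    id: int
    uuid: str
    name: str


def rejoin(cluster, name, new_uuid, connected_uuids):
    """Register `new_uuid` under `name`, reusing the old row when allowed.

    `cluster` maps numeric ID -> Row. `connected_uuids` is the set of UUIDs
    that other instances still hold live connections to (a simplification
    of the real connection check). Returns the resulting numeric ID.
    """
    old = next((r for r in cluster.values() if r.name == name), None)
    if old is None:
        # Unknown name: allocate a new row (smallest free ID is assumed).
        new_id = 1
        while new_id in cluster:
            new_id += 1
        cluster[new_id] = Row(new_id, new_uuid, name)
        return new_id
    if old.uuid == new_uuid:
        # Same name and UUID: an ordinary rejoin, nothing to update.
        return old.id
    if old.uuid in connected_uuids:
        # The old instance is still around, so the name is occupied.
        raise ValueError(f"name {name!r} is occupied by {old.uuid}")
    # The old instance is gone: keep the ID and name, swap the UUID.
    old.uuid = new_uuid
    return old.id


cluster = {
    1: Row(1, "uuid-master", "master-name"),
    2: Row(2, "uuid-old", "replica-name"),
}
# The old replica is dead and disconnected, so its row is reused.
kept_id = rejoin(cluster, "replica-name", "uuid-new", connected_uuids=set())
```

In this model the number of rows stays the same after the rejoin, which is the "do not pollute `_cluster`" property the request describes.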
Gerold103 committed Mar 10, 2023
1 parent c0c8ba8 commit 77d80b1
Showing 8 changed files with 166 additions and 18 deletions.
3 changes: 3 additions & 0 deletions changelogs/unreleased/global-names.md
@@ -15,3 +15,6 @@
unique in its replicaset. It is displayed in `box.info.name`. Names of the
other replicas in the same replicaset are visible in
`box.info.replication[id].name` (gh-5029).

* Instance during rebootstrap can change its UUID while keeping its numeric ID
if it has the same non-empty instance name (gh-5029).
83 changes: 76 additions & 7 deletions src/box/alter.cc
@@ -4161,6 +4161,16 @@ replica_def_new_from_tuple(struct tuple *tuple, struct region *region)
return def;
}

/** Add an instance on commit/rollback. */
static int
on_replace_cluster_add_replica(struct trigger *trigger, void * /* event */)
{
const struct replica_def *def = (typeof(def))trigger->data;
struct replica *r = replicaset_add(def->id, &def->uuid);
replica_set_name(r, &def->name);
return 0;
}

/** Set instance name on commit/rollback. */
static int
on_replace_cluster_set_name(struct trigger *trigger, void * /* event */)
@@ -4234,6 +4244,63 @@ on_replace_dd_cluster_set_name(struct replica *replica,
return 0;
}

/** Set instance UUID on _cluster update. */
static int
on_replace_dd_cluster_set_uuid(struct replica *replica,
const struct replica_def *new_def)
{
struct replica *old_replica = replica;
if (replica_has_connections(old_replica)) {
diag_set(ClientError, ER_UNSUPPORTED, "Replica",
"UUID update when the old replica is still here");
return -1;
}
struct replica *new_replica = replica_by_uuid(&new_def->uuid);
if (new_replica != NULL && new_replica->id != REPLICA_ID_NIL) {
diag_set(ClientError, ER_UNSUPPORTED, "Replica",
"UUID update when the new UUID is already registered");
return -1;
}
if (!tt_hostname_are_eq(&new_def->name, &replica->name)) {
diag_set(ClientError, ER_UNSUPPORTED, "Replica",
"UUID and name update together");
return -1;
}
struct trigger *on_rollback_drop_new = txn_alter_trigger_new(
on_replace_cluster_clear_id, NULL);
struct trigger *on_rollback_add_old = txn_alter_trigger_new(
on_replace_cluster_add_replica, NULL);
if (on_rollback_drop_new == NULL || on_rollback_add_old == NULL)
return -1;
size_t size;
struct replica_def *old_def = region_alloc_object(
&in_txn()->region, typeof(*old_def), &size);
if (old_def == NULL) {
diag_set(OutOfMemory, size, "region_alloc_object",
"old_def");
return -1;
}
memset(old_def, 0, sizeof(*old_def));
old_def->id = old_replica->id;
tt_hostname_set(&old_def->name, &old_replica->name);
old_def->uuid = old_replica->uuid;

replica_clear_id(old_replica);
if (replica_by_uuid(&old_def->uuid) != NULL)
panic("Replica with old UUID wasn't deleted");
if (new_replica == NULL)
new_replica = replicaset_add(new_def->id, &new_def->uuid);
else
replica_set_id(new_replica, new_def->id);
replica_set_name(new_replica, &old_def->name);
on_rollback_drop_new->data = new_replica;
on_rollback_add_old->data = old_def;
struct txn_stmt *stmt = txn_current_stmt(in_txn());
txn_stmt_on_rollback(stmt, on_rollback_drop_new);
txn_stmt_on_rollback(stmt, on_rollback_add_old);
return 0;
}

/** _cluster update - both old and new tuples are present. */
static int
on_replace_dd_cluster_update(const struct replica_def *old_def,
@@ -4243,14 +4310,16 @@ on_replace_dd_cluster_update(const struct replica_def *old_def,
struct replica *replica = replica_by_id(new_def->id);
if (replica == NULL)
panic("Found a _cluster tuple not having a replica");
-/*
-* Forbid changes of UUID for a registered instance: it requires an
-* extra effort to keep _cluster in sync with appliers and relays.
-*/
 if (!tt_uuid_is_equal(&new_def->uuid, &old_def->uuid)) {
-diag_set(ClientError, ER_UNSUPPORTED, "Space _cluster",
-"updates of instance uuid");
-return -1;
+if (tt_uuid_is_equal(&old_def->uuid, &INSTANCE_UUID)) {
+diag_set(ClientError, ER_UNSUPPORTED, "Replica",
+"own UUID update in _cluster");
+return -1;
+}
+if (on_replace_dd_cluster_set_uuid(replica, new_def) != 0)
+return -1;
+/* The replica was re-created. */
+replica = replica_by_id(new_def->id);
 }
return on_replace_dd_cluster_set_name(replica, &new_def->name);
}
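
The order of checks in this file can be summarized with a short Python sketch. It is an illustrative model only, not the real trigger code. `check_uuid_update` and its parameters are hypothetical, names are compared with plain string equality, and `NIL_ID` stands in for `REPLICA_ID_NIL`. The own-UUID check comes first, as in `on_replace_dd_cluster_update`, followed by the three checks from `on_replace_dd_cluster_set_uuid`. The messages are the ones passed to `diag_set` in the diff.

```python
NIL_ID = 0  # Stand-in for REPLICA_ID_NIL in this model.


def check_uuid_update(old_row, new_row, own_uuid, has_connections, id_by_uuid):
    """Return an error message, or None when the UUID update is allowed.

    old_row / new_row: dicts with "uuid" and "name" keys (toy tuples).
    own_uuid: UUID of the instance applying the change.
    has_connections: whether the old replica still has live connections.
    id_by_uuid: known UUIDs mapped to numeric IDs (NIL_ID if unregistered).
    """
    if new_row["uuid"] == old_row["uuid"]:
        # Not a UUID change at all; only the name logic would run.
        return None
    if old_row["uuid"] == own_uuid:
        return "own UUID update in _cluster"
    if has_connections:
        return "UUID update when the old replica is still here"
    if id_by_uuid.get(new_row["uuid"], NIL_ID) != NIL_ID:
        return "UUID update when the new UUID is already registered"
    if new_row["name"] != old_row["name"]:
        return "UUID and name update together"
    return None


old = {"uuid": "u-old", "name": "replica-name"}
new = {"uuid": "u-new", "name": "replica-name"}
allowed = check_uuid_update(old, new, "u-self", False, {})
```

A UUID that is known but still has a nil ID (for example, an anonymous peer in this model) does not block the update, which mirrors the `new_replica->id != REPLICA_ID_NIL` condition in the diff.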
12 changes: 10 additions & 2 deletions src/box/box.cc
@@ -3741,6 +3741,14 @@ box_register_instance(const struct tt_uuid *uuid,
diag_raise();
return;
}
struct replica *other = replica_by_name(name);
if (other != NULL && other != replica) {
if (boxk(IPROTO_UPDATE, BOX_CLUSTER_ID, "[%u][[%s%d%s]]",
(unsigned)other->id, "=", BOX_CLUSTER_FIELD_UUID,
tt_uuid_str(uuid)) != 0)
diag_raise();
return;
}
uint32_t replica_id;
if (replica_find_new_id(&replica_id) != 0)
diag_raise();
@@ -3991,13 +3999,13 @@ box_process_join(struct iostream *io, const struct xrow_header *header)
(replica != NULL &&
!tt_hostname_are_eq(&replica->name, &req.instance_name))) {
struct replica *other = replica_by_name(&req.instance_name);
-if (other != NULL && other != replica) {
+if (other != NULL && other != replica &&
+replica_has_connections(other)) {
tnt_raise(ClientError, ER_INSTANCE_NAME_DUPLICATE,
tt_hostname_for_log(&req.instance_name),
tt_uuid_str(&other->uuid));
}
}

/*
* Register the replica as a WAL consumer so that
* it can resume FINAL JOIN where INITIAL JOIN ends.
13 changes: 10 additions & 3 deletions src/box/replication.cc
@@ -245,9 +245,8 @@ replica_check_id(uint32_t replica_id)
static bool
replica_is_orphan(struct replica *replica)
{
-assert(replica->relay != NULL);
-return replica->id == REPLICA_ID_NIL && replica->applier == NULL &&
-relay_get_state(replica->relay) != RELAY_FOLLOW;
+return replica->id == REPLICA_ID_NIL &&
+!replica_has_connections(replica);
}

static int
@@ -454,6 +453,14 @@ replica_set_applier(struct replica *replica, struct applier *applier)
&replica->on_applier_state);
}

bool
replica_has_connections(const struct replica *replica)
{
assert(replica->relay != NULL);
return relay_get_state(replica->relay) == RELAY_FOLLOW ||
replica->applier != NULL;
}

/** A helper to track applier health on its state change. */
static void
replica_update_applier_health(struct replica *replica)
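
The rewrite of `replica_is_orphan` in terms of `replica_has_connections` is a De Morgan transformation, so it should not change behavior. A quick way to convince oneself is an exhaustive check over a toy state space. The sketch below is a model, not Tarantool code: the relay state strings and the function names are illustrative assumptions, and only the `follow` state matters to the logic.

```python
from itertools import product

NIL_ID = 0  # Stand-in for REPLICA_ID_NIL in this model.
RELAY_STATES = ("off", "follow", "stopped")  # Illustrative state names.


def has_connections(relay_state, applier):
    """Model of replica_has_connections(): following relay or any applier."""
    return relay_state == "follow" or applier is not None


def is_orphan_old(replica_id, relay_state, applier):
    """The condition before this commit."""
    return replica_id == NIL_ID and applier is None and relay_state != "follow"


def is_orphan_new(replica_id, relay_state, applier):
    """The condition after this commit."""
    return replica_id == NIL_ID and not has_connections(relay_state, applier)


# Enumerate every combination of ID, relay state, and applier presence.
mismatches = [
    case
    for case in product((NIL_ID, 1), RELAY_STATES, (None, object()))
    if is_orphan_old(*case) != is_orphan_new(*case)
]
```

An empty `mismatches` list means the two predicates agree on every modeled state.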
7 changes: 7 additions & 0 deletions src/box/replication.h
@@ -468,6 +468,13 @@ replica_clear_applier(struct replica *replica);
void
replica_set_applier(struct replica * replica, struct applier * applier);

/**
* See if the replica still has active connections or might be trying to make
* new ones.
*/
bool
replica_has_connections(const struct replica *replica);

/**
* Check if there are enough "healthy" connections, and fire the appropriate
* triggers. A replica connection is considered "healthy", when:
56 changes: 55 additions & 1 deletion test/replication-luatest/instance_name_test.lua
@@ -415,8 +415,14 @@ g.test_instance_name_subscribe_request_mismatch = function(lg)
end

g.test_instance_name_new_uuid = function(lg)
-lg.replica:stop()
+local old_uuid = lg.replica:get_instance_uuid()
+local old_id = lg.replica:get_instance_id()
--
-- Fail to change UUID while the old replica is still alive.
--
local box_cfg = table.copy(lg.replica.box_cfg)
local original_replica_cfg = table.copy(box_cfg)
-- Don't need fullmesh.
box_cfg.bootstrap_strategy = 'legacy'
box_cfg.replication = {server.build_listen_uri('master')}
t.assert_equals(box_cfg.instance_name, 'replica-name')
@@ -433,5 +439,53 @@ g.test_instance_name_new_uuid = function(lg)
'replica%-name, already occupied', 1024, {filename = logfile}))
end)
new_replica:drop()
--
-- The old replica is eliminated. Only a _cluster row remains. Can legally
-- change UUID now.
--
lg.master:exec(function()
box.cfg{replication = {}}
end)
lg.replica:drop()
lg.master:exec(function()
--
-- Fail to change UUID and name together.
--
-- This is not achievable via normal protocols. Have to do it manually
-- in _cluster.
local uuid = require('uuid')
local _cluster = box.space._cluster
local new_uuid = uuid.str()
local old_id
for _, t in _cluster:pairs() do
if t.name == 'replica-name' then
old_id = t.id
break
end
end
t.assert_not_equals(old_id, nil)
t.assert_error_msg_contains('UUID and name update together',
_cluster.update, _cluster, {old_id},
{{'=', 'uuid', new_uuid},
{'=', 'name', 'new-name'}})
--
-- Can't change own UUID.
--
t.assert_error_msg_contains('own UUID update in _cluster',
_cluster.update, _cluster, {box.info.id},
{{'=', 'uuid', new_uuid}})
end)
--
-- Normal re-UUID.
--
lg.replica = server:new({
alias = 'replica',
box_cfg = original_replica_cfg,
})
lg.replica:start()
t.assert_not_equals(old_uuid, lg.replica:get_instance_uuid())
t.assert_equals(old_id, lg.replica:get_instance_id())
lg.master:exec(function(replication)
box.cfg{replication = replication}
end, {lg.master.box_cfg.replication})
end
8 changes: 4 additions & 4 deletions test/replication-py/cluster.result
@@ -31,7 +31,7 @@ box.space._cluster:replace{1, require('uuid').NULL:str()}
...
box.space._cluster:replace{1, require('uuid').str()}
---
-- error: Space _cluster does not support updates of instance uuid
+- error: Replica does not support own UUID update in _cluster
...
box.space._cluster:update(1, {{'=', 4, 'test'}})
---
@@ -54,15 +54,15 @@ box.space._cluster:replace{5, '0d5bd431-7f3e-4695-a5c2-82de0a9cbc95'}
...
box.space._cluster:replace{5, 'a48a19a3-26c0-4f8c-a5b5-77377bab389b'}
---
-- error: Space _cluster does not support updates of instance uuid
+- [5, 'a48a19a3-26c0-4f8c-a5b5-77377bab389b']
...
box.space._cluster:update(5, {{'=', 3, 'test'}})
---
-- [5, '0d5bd431-7f3e-4695-a5c2-82de0a9cbc95', 'test']
+- [5, 'a48a19a3-26c0-4f8c-a5b5-77377bab389b', 'test']
...
box.space._cluster:delete(5)
---
-- [5, '0d5bd431-7f3e-4695-a5c2-82de0a9cbc95', 'test']
+- [5, 'a48a19a3-26c0-4f8c-a5b5-77377bab389b', 'test']
...
box.info.vclock[5] == nil
---
2 changes: 1 addition & 1 deletion test/replication-py/cluster.test.py
@@ -100,7 +100,7 @@ def check_join(msg):

# Replace with the same UUID is OK
server.admin("box.space._cluster:replace{{5, '{0}'}}".format(new_uuid))
-# Replace with a new UUID is not OK
+# Replace with a new UUID is also OK (if the instance is not connected)
new_uuid = "a48a19a3-26c0-4f8c-a5b5-77377bab389b"
server.admin("box.space._cluster:replace{{5, '{0}'}}".format(new_uuid))
# Update of tail is OK