Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backport 2.11] Improve downstream lag behavior #10025

Merged

Conversation

sergepetrenko
Copy link
Collaborator

box.info.replication[replica_id].downstream.lag shows how much time it took the latest transaction, acked by the given replica, to be replicated to the replica, be written to its WAL, and be acked by this instance.

It sounds simple, but sometimes the actual behaviour was misleading or hard to use.

The patchset improves the user experience. The main change is that the lag is actually updated on each acked txn. Before that it was only updated for acked txns created by this specific master.

See #9748.

It was stored in struct replica, now is in struct applier. The
motivation is that applier-specific data must be inside the
applier.

Also it makes the next commits look more logical. They are going
to change this timestamp when applier progresses through its state
machine. It looks strange when the applier is changing the replica
object. Replica is on an upper level in the hierarchy. It owns the
applier and the applier ideally mustn't know about struct replica
(hardly possible to achieve), or at least not change it (this is
feasible).

In scope of tarantool#9748

NO_DOC=internal
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 8e5d9f2)
Before the patch if the applier was reconnected, the master would
see downstream lag equal to the time since it replicated the last
txn to this applier.

This happened because applier between reconnects kept the txn
timestamp used for acks. On the master's side the relay was
recreated, received the ack, thought the applier just applied this
txn, and displayed this as a lag.

The test makes a master restart because this is the easiest way to
reproduce it. Most importantly, the applier shouldn't be
re-created, and relay should restart.

Part of tarantool#9748

NO_DOC=bugfix
NO_CHANGELOG=later

(cherry picked from commit dda4203)
To reduce the insane indentation level. And to isolate the further
changes in next commits more.

Part of tarantool#9748

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit d6f15a1)
From the code it isn't obvious, but relay->status_msg.vclock and
relay->last_recv_ack.vclock are both coming from the applier.
Status_msg is the previous ack, last_recv_ack is the latest ack.

They can never go down. And are not affected anyhow by the master
committing its own transactions. I.e. master can commit something,
relay->r->vclock (recovery cursor) will go up, and recovery vclock
might become incomparable with the last ACK vclock. But the prev
and last ACK vclocks are always comparable and always go up.

This invariant was broken though, because relay on restart didn't
nullify the current applier status (status_msg). It could break
if the replica would loose its xlog files or its ID would be
taken by another instance - then its vclock would go down, making
last_recv_ack.vclock < status_msg.vclock. But that is not right
and is fixed in this patch.

In scope of tarantool#9748

NO_DOC=bugfix
NO_TEST=test 5158 already covers it
NO_CHANGELOG=bugfix

(cherry picked from commit 71dbb47)
Not only for own txns, but also on the txns authored by other
instances.

Note that the lag isn't updated when the replica got new txns from
another master. The lag still only reflects the replication
between this relay and its specific applier.

The motivation is that otherwise the lag sometimes shows
irrelevant things, like that the replica is very outdated, while
it keeps replicating just fine. Only not txns of this specific
master, who might even turned into a replica itself already.

Closes tarantool#9748

NO_DOC=bugfix

(cherry picked from commit 39af9fb)
@sergepetrenko sergepetrenko self-assigned this May 17, 2024
@sergepetrenko sergepetrenko requested a review from a team as a code owner May 17, 2024 12:02
@coveralls
Copy link

Coverage Status

coverage: 85.799%. remained the same
when pulling 40c7a08 on sergepetrenko:downstream-lag-2.11
into 4db1299
on tarantool:release/2.11
.

@sergepetrenko sergepetrenko added the full-ci Enables all tests for a pull request label May 17, 2024
@sergepetrenko sergepetrenko merged commit 79a8e82 into tarantool:release/2.11 May 17, 2024
161 checks passed
@sergepetrenko sergepetrenko deleted the downstream-lag-2.11 branch May 22, 2024 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full-ci Enables all tests for a pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants