Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backport 2.11] Fix ACK vclock order assert in relay #10127

Merged
merged 5 commits into from
Jun 13, 2024

Conversation

Gerold103
Copy link
Collaborator

Backport of #10070.

The function takes the burden of explaining why this hack about
setting local component in a remote vclock is needed. It also
creates a new vclock, not alters an existing one. This is to
signify that the vclock is no longer what was received from a
remote host.

Otherwise it is too easy to actually mistreat this mutant vlock as
a remote vclock. That btw did happen and is fixed in following
commits.

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit b846396)
GC consumer creation and destroy seemed to only happen in box.cc
with one exception in relay_subscribe(). Lets move it out for
consistency. Now relay can only notify GC consumers, but can't
manage them.

That also makes it harder to misuse the GC by passing some wrong
vclock to it, similar to what was happening in #10047.

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit 4dc0c1e)
It wasn't clear which of them are inputs and which are outputs.
The patch explicitly marks the input vclocks as const. It makes
the code a bit easier to read inside of relay.cc knowing that
these vclocks shouldn't change.

Alongside "replica_clock" in subscribe is renamed to
"start_vclock". To make it consistent with relay_final_join(), and
to signify that technically it doesn't have to be a replica
vclock. It isn't really. Box.cc alters the replica's vclock before
giving it to relay, which means it is no longer "replica clock".

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit 5ebbed7)
Remote replica's vclock is given to master to send data starting
from that position. The master does that, but, in order to find
the relevant position in local WAL to start from, the master must
ignore the local rows. Consider them all already "sent". For that
the master replaces the remote vclock[0] with the local vclock[0].
That makes xlog cursor skip all the local rows.

The problem is that this vclock was taken by relay as is, like if
it was truly reported by the replica. It was even saved as the
"last received ACK". Which clearly isn't the case.

When a real ACK was received, it didn't contain anything in
vclock[0], and yet relay "saw" that the previous ACK has
vclock[0] > 0. That looked like the replica went backwards without
even closing connection, which isn't possible. That made the relay
crash from cringe (on assert).

The fix is not to save the local vclock[0] in the last received
ACK.

For GC and xlog cursor the hack is still needed. An option how to
make it easier was to set vclock[0] to INT64_MAX to just never
even bother with any local rows, but that didn't work. Some
assumptions in other places seem to depend on having a proper
local LSN in these places.

Closes #10047

NO_CHANGELOG=the bug wasn't released
NO_DOC=bugfix

(cherry picked from commit 1f75231)
@coveralls
Copy link

Coverage Status

coverage: 85.805% (+0.003%) from 85.802%
when pulling 8cd4c32 on gerold103/gh-10047-release-2.11
into 715abaa
on release/2.11
.

@Gerold103 Gerold103 added the full-ci Enables all tests for a pull request label Jun 10, 2024
@Gerold103 Gerold103 marked this pull request as ready for review June 10, 2024 21:04
@Gerold103 Gerold103 self-assigned this Jun 10, 2024
Commit 715abaa ("ci: fix RPM package builds on aarch64 runners")
has limited number of parallel jobs to 6 on these runners to fix the
OOM, but it turns out this isn't enough: almalinux_9_aarch64 workflow
fails constantly even with this setting. Let's try to reduce the amount
of jobs to 4.

NO_CHANGELOG=ci
NO_TEST=ci
NO_DOC=ci
@sergepetrenko sergepetrenko requested a review from a team as a code owner June 13, 2024 10:14
@coveralls
Copy link

Coverage Status

coverage: 85.82% (+0.02%) from 85.802%
when pulling 9ed4268 on gerold103/gh-10047-release-2.11
into 715abaa
on release/2.11
.

@sergepetrenko
Copy link
Collaborator

sergepetrenko commented Jun 13, 2024

@ylobankov, PTAL at the last commit (9ed4268)

@sergepetrenko sergepetrenko merged commit 74223a2 into release/2.11 Jun 13, 2024
95 checks passed
@sergepetrenko sergepetrenko deleted the gerold103/gh-10047-release-2.11 branch June 13, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full-ci Enables all tests for a pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants