[backport 2.11] Fix ACK vclock order assert in relay #10127
Merged
The function takes on the burden of explaining why the hack of setting the local component in a remote vclock is needed. It also creates a new vclock rather than altering an existing one. This is to signify that the vclock is no longer what was received from a remote host. Otherwise it is too easy to mistreat this mutant vclock as a remote vclock. That did in fact happen and is fixed in the following commits.

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit b846396)
GC consumer creation and destruction happened only in box.cc, with one exception in relay_subscribe(). Let's move it out for consistency. Now relay can only notify GC consumers, but can't manage them. That also makes it harder to misuse the GC by passing a wrong vclock to it, similar to what was happening in #10047.

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit 4dc0c1e)
It wasn't clear which of the vclocks are inputs and which are outputs. The patch explicitly marks the input vclocks as const. Knowing that these vclocks shouldn't change makes the code inside relay.cc a bit easier to read.

Alongside, "replica_clock" in subscribe is renamed to "start_vclock", to make it consistent with relay_final_join() and to signify that technically it doesn't have to be a replica vclock. It isn't really: box.cc alters the replica's vclock before giving it to relay, which means it is no longer the "replica clock".

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit 5ebbed7)
The remote replica's vclock is given to the master so that it sends data starting from that position. The master does that, but in order to find the relevant position in the local WAL to start from, the master must ignore the local rows and consider them all already "sent". For that the master replaces the remote vclock[0] with the local vclock[0]. That makes the xlog cursor skip all the local rows.

The problem is that this vclock was taken by relay as is, as if it was truly reported by the replica. It was even saved as the "last received ACK", which clearly isn't the case.

When a real ACK was received, it didn't contain anything in vclock[0], and yet relay "saw" that the previous ACK had vclock[0] > 0. That looked like the replica went backwards without even closing the connection, which isn't possible. That made the relay crash from cringe (on assert).

The fix is not to save the local vclock[0] in the last received ACK. For GC and the xlog cursor the hack is still needed.

One option to simplify this was to set vclock[0] to INT64_MAX and never bother with any local rows, but that didn't work. Some assumptions in other places seem to depend on having a proper local LSN there.

Closes #10047

NO_CHANGELOG=the bug wasn't released
NO_DOC=bugfix

(cherry picked from commit 1f75231)
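The failure described above can be reproduced in a toy model. The following is a sketch under stated assumptions, not the relay code: `ack_vclock`, `ack_vclock_is_ordered()` and `ack_state_update()` are hypothetical names, and the ordering check is assumed to be a plain component-wise comparison standing in for the assertion the relay performs on each ACK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified vclock: one LSN per replica ID. */
enum { ACK_VCLOCK_MAX = 32 };

struct ack_vclock {
	int64_t lsn[ACK_VCLOCK_MAX];
};

/* True if no component of `next` is behind the same component of `prev`. */
static bool
ack_vclock_is_ordered(const struct ack_vclock *prev,
		      const struct ack_vclock *next)
{
	for (int i = 0; i < ACK_VCLOCK_MAX; i++) {
		if (next->lsn[i] < prev->lsn[i])
			return false;
	}
	return true;
}

/*
 * Stand-in for the relay's handling of an incoming ACK: the new ACK
 * must not go backwards relative to the last one that was saved.
 */
static void
ack_state_update(struct ack_vclock *last_ack, const struct ack_vclock *ack)
{
	assert(ack_vclock_is_ordered(last_ack, ack));
	*last_ack = *ack;
}
```

If the last saved ACK is seeded with the vclock whose component 0 was replaced by the local LSN (say 7), a real ACK carrying 0 in component 0 fails the ordering check. Seeding it with the vclock as reported by the replica keeps the check satisfied, which is the shape of the fix described in the commit message.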
Commit 715abaa ("ci: fix RPM package builds on aarch64 runners") limited the number of parallel jobs to 6 on these runners to fix the OOM, but it turns out this isn't enough: the almalinux_9_aarch64 workflow fails constantly even with this setting. Let's try to reduce the number of jobs to 4.

NO_CHANGELOG=ci
NO_TEST=ci
NO_DOC=ci
sergepetrenko
approved these changes
Jun 13, 2024
@ylobankov, PTAL at the last commit (9ed4268)
ylobankov
approved these changes
Jun 13, 2024
Backport of #10070.