Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 5.4] mv: handle different ERMs for base and view table #18943

Closed
wants to merge 1 commit into from

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented May 28, 2024

When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice-versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to f2ff701, the update may have just been performed partially when we are taking the ERMs.

Until now, we assumed that the ERMs are synchronized while calling finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to not be the same, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash.

This is fixed in the following patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709

(cherry picked from commit ed95782)

Refs #18816

When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709
(cherry picked from commit ed95782)

# Conflicts:
#	db/view/view.cc
#	service/storage_service.cc
@mergify mergify bot requested a review from tgrabiec as a code owner May 28, 2024 17:07
@mergify mergify bot added the conflicts label May 28, 2024
@mergify mergify bot requested review from nyh and piodul as code owners May 28, 2024 17:07
Copy link
Author

mergify bot commented May 28, 2024

Cherry-pick of ed95782 has failed:

On branch mergify/copy/branch-5.4/pr-18816
Your branch is up to date with 'origin/branch-5.4'.

You are currently cherry-picking commit ed95782bf2.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   test/topology_custom/test_mv_topology_change.py

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   db/view/view.cc
	both modified:   service/storage_service.cc

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot marked this pull request as draft May 28, 2024 17:08
@wmitros
Copy link
Contributor

wmitros commented Jun 5, 2024

Looks like this backport is not needed after all. The crash was made possible in 9b656ec which uses get_datacenter which doesn't sanitize whether the endpoint we're getting the datacenter of is in its topology, and the crash was made more likely by f2ff701 which introduces a preemption point in the erm update code. As neither is in 5.4, I'm closing this PR

@wmitros wmitros closed this Jun 5, 2024
@mergify mergify bot deleted the mergify/copy/branch-5.4/pr-18816 branch June 5, 2024 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant