Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the local changesets in-place for client reset recovery #7161

Merged
merged 9 commits into from
Dec 5, 2023

Conversation

tgoyne
Copy link
Member

@tgoyne tgoyne commented Nov 22, 2023

Rather than discarding the old sync history and creating new commits, update the individual changesets which are being recovered in-place while keeping the original version numbers. This significantly simplifies recovery when pending subscriptions are involved, as they no longer need to be updated at all, and it lets us apply the client reset in a single write transaction, which improves notifications and avoids a lot of problems if another thread has the Realm open while the client reset is happening.

It also avoids causing problems for the server, which doesn't like overly large changesets, and sometimes ran into problems due to all of the pending local changesets being merged into one.


"Move RecoverLocalChangesetsHandler fully to the cpp file" is just moving code around with no changes, and the functional changes to client_reset_recovery.cpp are entirely in the final commit.

"Remove SubscriptionStore::get_mutable_by_version()" is adjusting the SubscriptionStore API to eliminate a problem I ran into while writing some tests: the MutableSubscriptionSet it returns doesn't actually support modifying the subscription set. This is actually the correct behavior, as existing subscription sets shouldn't be modified, but it was a misleading interface. I changed it to only expose update_state() rather than appearing to allow more than it did.

The client reset during async open tests started hitting some very unintended codepaths and started crashing. AFAICT the change in timing resulted in it triggering two client resets, as it relied on stopping sync in the middle of the server sending a batch having some very specific behavior that wasn't actually guaranteed. I updated it to use the client reset endpoint that all other tests use and to immediately disconnect rather than relying on the server disconnecting it. This revealed that the FLX test had failed to test the thing it was trying to test and that very specific scenario was broken.

@tgoyne tgoyne self-assigned this Nov 22, 2023
Copy link

coveralls-official bot commented Nov 22, 2023

Pull Request Test Coverage Report for Build thomas.goyne_131

  • 835 of 865 (96.53%) changed or added relevant lines in 17 files are covered.
  • 69 unchanged lines in 12 files lost coverage.
  • Overall coverage increased (+0.06%) to 91.748%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/realm/object-store/sync/sync_session.cpp 32 33 96.97%
test/object-store/sync/client_reset.cpp 53 56 94.64%
test/object-store/sync/flx_sync.cpp 101 104 97.12%
src/realm/sync/subscriptions.cpp 84 95 88.42%
src/realm/sync/noinst/client_reset_recovery.cpp 51 63 80.95%
Files with Coverage Reduction New Missed Lines %
src/realm/array_key.cpp 1 97.53%
src/realm/object-store/sync/sync_session.cpp 1 93.38%
src/realm/sync/noinst/client_impl_base.cpp 1 85.62%
test/object-store/sync/client_reset.cpp 1 98.46%
test/test_dictionary.cpp 1 99.85%
src/realm/util/file.cpp 2 81.73%
src/realm/sync/subscriptions.cpp 3 94.49%
test/object-store/sync/flx_sync.cpp 3 98.48%
src/realm/sync/instruction_applier.cpp 4 70.26%
src/realm/unicode.cpp 4 90.15%
Totals Coverage Status
Change from base Build 1878: 0.06%
Covered Lines: 232156
Relevant Lines: 253037

💛 - Coveralls

@tgoyne tgoyne force-pushed the tg/in-place-client-reset branch 7 times, most recently from 462f034 to 950ec59 Compare November 29, 2023 02:19
@tgoyne tgoyne force-pushed the tg/in-place-client-reset branch 2 times, most recently from 8b15358 to 6c31b99 Compare November 29, 2023 19:57
@tgoyne tgoyne marked this pull request as ready for review November 29, 2023 21:28
Copy link
Contributor

@ironage ironage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements across the board 👍
It is a lot of code changes in some dusty corners of the codebase, but I went through all the commits and it looks good to me. Thanks for taking this on, having the FLX recovery as a single commit is a huge win!

// By marking version 2 as complete version 1 will get superceded and removed.
CHECK_THROW(store->get_mutable_by_version(1), KeyNotFound);
// By marking version 2 as complete version 1 will get superseded and removed.
CHECK_EQUAL(store->get_by_version(1).state(), SubscriptionSet::State::Superseded);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1 is indeed superseded but not removed in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"superseded and removed" is a redundant phrase, as we mark things superseded by removing them. The comment is probably misleading, but the difference in the test is just due to get_by_version() and get_mutable_by_version() reporting a superseded version in different ways.

The exposed interface for this type is just a function, so we can slightly
improve build time/size by not exposing any of the implementation details.
The MutableSubscriptionSet returned by this didn't support any operations other
than `update_state()`, so pushing `update_state()` to `SubscriptionStore` and
removing `get_mutable_by_version()` makes it less error-prone.
Rather than discarding the old sync history and creating new commits, update
the individual changesets which are being recovered in-place while keeping the
original version numbers. This significantly simplifies recovery when pending
subscriptions are involved, as they no longer need to be updated at all, and it
lets us apply the client reset in a single write transaction, which improves
notifications and avoids a lot of problems if another thread has the Realm open
while the client reset is happening.

It also avoids causing problems for the server, which doesn't like overly large
changesets, and sometimes ran into problems due to all of the pending local
changesets being merged into one.
…eset on an async open

If the very first open of a flexible sync Realm triggered a client reset, the
configuration had an initial subscriptions callback, both before and after
reset callbacks, and the initial subscription callback began a read transaction
without ending it (which is normally going to be the case), opening the frozen
Realm for the after reset callback would trigger a BadVersion exception.

The commit to create the initial subscriptions resulted in the `before` Realm
being one version of out date, so once it was closed and the read lock released
that VersionID was no longer valid and could not be used to obtain new
transactions.
These tests were a giant race condition that only worked with fairly specific
timing from the server, and in practice didn't actually test the situation they
were intending to test.
If DOWNLOAD messages were received while there were unuploaded local changes
prior to the reset, the reciprocal history will be more up-to-date than the
original changesets and be more likely to recover correctly. If not, this has
exactly the old behavior.
@tgoyne
Copy link
Member Author

tgoyne commented Dec 5, 2023

I've switched it over to recovering using the reciprocal history, made it clear the reciprocal history afterwards, and added a test showing why each of those is needed.

@@ -973,4 +971,25 @@ int64_t SubscriptionStore::set_active_as_latest(Transaction& wt)
return version;
}

int64_t SubscriptionStore::mark_active_as_complete(Transaction& wt)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could make this a no-op if the state is already Complete (unless it's possible to set the state and not deliver the notifications)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the active set is already complete then m_pending_notifications should be empty barring someone waiting for a Superseded notification, so the only unnecessary thing we're doing in that scenario is acquiring the lock.

@@ -507,39 +507,16 @@ void wait_for_num_objects_in_atlas(std::shared_ptr<SyncUser> user, const AppSess
std::chrono::minutes(15), std::chrono::milliseconds(500));
}

void trigger_client_reset(const AppSession& app_session)
void trigger_client_reset(const AppSession& app_session, const SyncSession& sync_session)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to set development mode back and wait for initial sync to complete as before? (is the client reset endpoint doing all that?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client reset endpoint just directly triggers a client reset for a specific client file ident without disabling sync, so none of that is needed.

Copy link
Collaborator

@danieltabacaru danieltabacaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You could mention using the reciprocal history in the PR's description.

@tgoyne tgoyne merged commit a4c558c into master Dec 5, 2023
34 checks passed
@tgoyne tgoyne deleted the tg/in-place-client-reset branch December 5, 2023 19:16
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants