Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crash when opening FLX realm after client reset failure #6671

Merged
merged 6 commits into from
May 30, 2023

Conversation

danieltabacaru
Copy link
Collaborator

@danieltabacaru danieltabacaru commented May 25, 2023

What, How & Why?

If a client reset fails and the app crashes or exits before the realm is marked for deletion, then the next time the realm is opened it crashes because there is no active subscription set needed for the IDENT message.

This was caused by the fact that in this case the active subscription set was marked Error while the pending subscription sets Superseded. This was wrong because in case of a successful retry with recovery, any unsynced local changes led to compensating writes. The fix it to not change the state of the subscriptions sets.

Fixes #6494.

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • C-API, if public C++ API changed.

@danieltabacaru danieltabacaru marked this pull request as ready for review May 25, 2023 08:38
// versioned ones.
auto mut_sub = local_subs_store->get_active().make_mutable_copy();
// Mark the new subscription set Complete because there must always exist an active subscription set.
mut_sub.update_state(sync::SubscriptionSet::State::Complete);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to work fine for this corner case. But I'm a little concerned that we are papering over a real underlying problem here. The subscription is actually not complete, it is in an error state. Is there some invariant of the subscription store that has been violated by setting the active subscription to an error?

If I undo these changes and run the test you added, it looks like the actual problem is that https://github.com/realm/realm-core/blob/master/src/realm/sync/subscriptions.cpp#L727-L730 where an assumption has been made that there is an object with primary key 0. This is probably true in the initial state, but not later on. I wonder if it is possible to hit this in normal flow somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the assumption is that there is always going to be a subscription with pk==0. If that is true, we should not remove it during a superceed state change. What do you think of these changes instead? https://github.com/realm/realm-core/compare/dt/fix_client_reset_crash...js/flx-client-reset?expand=1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the assumption is that there is always going to be a subscription with pk==0 if there is no complete subscription. And that makes sense when opening a fresh realm. But once we have a complete subscription, we supercede and hence remove all the previous ones. I had the same idea to keep around the subscription with the primary key 0, but I think it's wrong. We always supercede subscriptions when one is marked Complete (https://github.com/realm/realm-core/blob/master/src/realm/sync/subscriptions.cpp#L393-L398). If one is marked Error, we should still have the last complete one. And that I think should also be the case here. The new subscription is just a copy of the active/complete one, so it should not be in Error state since there is no issue with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I undo these changes and run the test you added, it looks like the actual problem is that https://github.com/realm/realm-core/blob/master/src/realm/sync/subscriptions.cpp#L727-L730 where an assumption has been made that there is an object with primary key 0. This is probably true in the initial state, but not later on. I wonder if it is possible to hit this in normal flow somehow?

I cannot think of any other scenario.

Is there some invariant of the subscription store that has been violated by setting the active subscription to an error?

supercede_prior_to is called only when a subscription is marked complete, while supercede_all_except is the issue here (called only in this specific case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, what you say about removing the pk==0 subscription makes sense, ignore my suggested change for that.
I am just wary of setting the subscription to complete when that is not the actual state of things, I feel that this may create other edge case bugs. I've tried to come up with a scenario where that matters, but am having a hard time.

Now I am questioning if it is correct to be superceeding subscriptions in this way at all. It may be fine for discard local, but I think it may be wrong for recovery mode in a scenario where the server breaks the schema but then changes it back such that FLX queries start to work again. In that situation we wouldn't want to superceed subscriptions made offline because those should be recovered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered why we don't leave the subscriptions as they are..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superceeding the subscriptions will lead to compensating writes. We could superceed them for DiscardLocal and keep them for the recovery mode. I would keep them in all modes (in the worst case we'll create subscriptions for tables or objects that don't exist, but that's not an issue for the server afaik)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, and we should just delete https://github.com/realm/realm-core/blob/master/src/realm/object-store/sync/sync_session.cpp#L580-L589 and nuke the method supercede_all_except(). The comment says that the intent is to remove all later versioned (made offline) subscriptions, but that isn't necessary in the case of a client reset at this point in discard mode and it is plain wrong in recovery mode.
I think you can test the recovery scenario with something like this:

  1. Subscribe to a query A and successfully sync with the server.
  2. Go offline and make a subscription B and add an object that falls under that subscription.
  3. Break the server schema so that subscription A becomes an invalid query and terminate/restart the server.
  4. Client reset with Recovery. This should fail but not remove FLX query B
  5. Break the server schema again such that subscription A becomes valid again and terminate/restart the server.
  6. Reconnect the client, it should do a client reset with recovery successfully and subscription B is now active and the object added in step 2 is uploaded to the server.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test suggestion. I'll give it a try.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at https://github.com/realm/realm-core/blob/master/test/object-store/sync/flx_sync.cpp#L754-L800 for inspiration, but the outcome is not always deterministic. So, I added a new test based on the other test I added for this PR to test for this specific case.

config_local.sync_config->error_handler = nullptr;

// Attempt to open the realm again.
// This time the client reset succeesds and the offline subscription and writes are recovered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on "succeeds"

// Remove the folder preventing the completion of a client reset.
util::try_remove_dir_recursive(fresh_path);

auto config_copy = config_local;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is working, but I think you could strengthen this test by using make_client_reset_handler on this config and waiting for the reset to succeed instead of waiting for a generic upload and download.

@@ -901,35 +901,6 @@ void SubscriptionStore::supercede_prior_to(TransactionRef tr, int64_t version_id
remove_query.remove();
}

void SubscriptionStore::supercede_all_except(MutableSubscriptionSet& mut_sub) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Contributor

@ironage ironage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this one! 👍

Copy link
Contributor

@michael-wb michael-wb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I like the way it ended up with the subscription store remaining in the same state as before the client reset failure.

michael-wb

This comment was marked as duplicate.

@danieltabacaru danieltabacaru merged commit 6996885 into master May 30, 2023
27 checks passed
@danieltabacaru danieltabacaru deleted the dt/fix_client_reset_crash branch May 30, 2023 05:26
kraenhansen added a commit that referenced this pull request Jun 19, 2023
* Updated release notes

* Update catch2 v3.3.2 (#6605)

* Make core infer platform and cpu_arch, while bundle_id must be provided by SDK's (#6612)

* platform and cpu_arch are inferred by core, bundle_id must be provided by SDK's

* update changelog

* Return proper value for X86_64 arch

Co-authored-by: Christian Melchior <christian.melchior@mongodb.com>

* Get fine-grained platform for Apple devices

* Fix tests

* small fixes

* fix more tests

* Fix mistake in changelog

---------

Co-authored-by: Christian Melchior <christian.melchior@mongodb.com>

* use consistent rounding, following SERVER-75392 (#6477)

* fix entries that went to the wrong change version (#6632)

* Special-case main thread runloop scheduler

* Improve SectionedResults performance

SectionedResults used a std::map in a few places where the keys are a dense
range (i.e. always [0..map.size())) and so they can be std::vector instead. The
maps keyed on Mixed are now std::unordered_map.

Change notifications now report changes as a `std::vector<IndexSet>` rather
than `std::map<size_t, IndexSet>`. This is slower and uses more memory when the
only sections that changed are near the end of a SectionedResults with a large
number of sections, but is much faster if all sections changed or if the
sections which changed are early in the SectionedResults. Change notifications
now reuse buffers, which increases persistent memory usage slightly but
significant reduces allocations.

Change notifications for a single section now only compute the changes for that
section rather than computing the full changes and then filtering out the
changes for other sections.

* use static_assert rather than a old home rolled one

* fix warning of redefine of CHECK macro

* fix unused function warning

* silence warnings in bid128_to_string

* Introduce BPlusTree::for_all

* Prevent program from crashing when removing backlinks

* Fix broken snapshot of collection of objects

* Fix importing Results with deleted collection

The result should be an empty result, not the whole table.

* geospatial validation of polygons (#6607)

* geospatial validation of polygons

* Loop->Ring, added tests

* use std::unique

* changelog

* Benchmark for full-text search

* Allow to filter benchmark and run only list of specified names
* Add simple benchmark for fulltext search with index

* Filter out unresolved links in Dictionary::get_any()

* Add support for early exit in BPlusTree::for_all()

* Geospatial feedback (#6645)

* verify local results match a server query

* disallow geowithin on top level tables

* fix geo queries with ANY/ALL/NONE

* geospatial validation of points

* rename GeoCenterSphere -> GeoCircle

* review feedback

* better testing and fix any/all/none geospatial

* format

* Geospatial basic queries benchmarks (#6621)

* Add basic benchmarks for Geospatial type and queries

* Less copying in GeoWithinCompare

* Bring back caching of s2 region into Geospatial

* remove transaction overhead from measurements

* a couple small optimizations

* formatting

* simplify geospatial query evaluations

* changelog

---------

Co-authored-by: James Stone <james.stone@mongodb.com>

* Updated baas server tag for CI (#6650)

* Prepare release

* Updated release notes

* Access token refresh for websockets was not updating the location metadata (#6631)

* Always refresh metadata on app login
* Updated changelog
* Always update location when requested; fix c_api test
* Update test to properly evaluate websocket redirections; added one more test
* Updated changelog and fixed compile warning
* Added location checks back to test
* added mutex locking around location updated state and reworked requesting location update to use flag
* clang format and fix incorrect timeout value
* Reworked update location logic a bit and removed unused function
* Free mutex before calling completion on early exit in init_app_metadata

* maybe fix a race in a test (#6651)

* Use std::optional to store cached leaves in query nodes (#6653)

Our use of aligned_storage was basically a complicated manual version of this.
I was hoping this'd have binary size benefits, but it ended up making the
library 100 bytes larger instead. Nonetheless, it greatly simplifies things.

* Fix a few UBSan failures hit by tests

* Avoid performing unaligned reads in Array::get_chunk()

* Fix a lock order inversion in tests (#6666)

The cycle was DaemonThread::m_running_on_change_mutex =>
RealmCoordinator::m_realm_mutex  => SyncManager::m_mutex  =>
RealmCoordinator::s_coordinator_mutex  =>
DaemonThread::m_running_on_change_mutex, and it happened due to
DaemonThread::remove() being called inside RealmCoordinator::clear_cache()
while holding s_coordinator_mutex. Fortunately we don't actually need to be doing that.

As the cycle required RealmCoordinator::clear_all_caches(), this was only
applicable to tests.

* Allow geo coordinate numeric argument substitutions (#6663)

* allow geo coordinate numeric argument substitutions

* review feedback

* explicit cast to address warning

* Remove catch() clause to prevent truncating stack trace in AsyncOper::do_recycle_and_execute() (#6667)

* Fix an assertion failure if an async write callback ran during a write transaction (#6661)

Between when the callback after acquiring the write lock is scheduled and when
it's invoked a synchronous write transaction can be begun, and if it's not
ended before the next time the scheduler gets to run, the scheduled callback
will be invoked inside the write. When this happens we want to just do nothing.
Ending the synchronous write transaction will take care of rescheduling the
async write it preempted.

* core release 13.13.0

* Updated release notes

* Allocate arguments for lists (#6674)

* Small documentation and code fixes (#6672)

* Fix crash when opening FLX realm after client reset failure (#6671)

* Fix crash when opening FLX realm after client reset failure

* Update changelog

* Don't superceed pending subscriptions in case of a client reset failure

* Add test

* Changes after code review

* Support sorting based on values from a dictionary (#5311)

Co-authored-by: Sebastian Valle <svalle@gascaribe.com>
Co-authored-by: James Stone <james.stone@mongodb.com>

* Filter out external sources from Eclipse (#6682)

Indexer has a hard time dealing with Catch2

* Use cross-compilers instead of CentOS image (#6559)

* Use cross-compilers instead of CentOS image

* changelog

* fix bad merge

* refactor toolchain files

* clarify useToolchain exception circumstances

* Remap github URL to ssh to fix BAAS dependency using https:// (#6685)

* core v13.14.0

* Updated release notes

* Switch to building with Xcode 14 (#6647)

* better fix explanation in the changelog for list of args in the query parser (#6692)

* Remove constructor for GeoPoint and GeoPolygon (#6679)

Co-authored-by: Mathias Stearn <redbeard0531@gmail.com>

* Fix failing "sync: non-synced metadata table doesn't result in non-additive schema change" tests (#6697)

* Reporting correct error message on HTTP errors for Browser target

* User/Server API key provider becomes a single 'API key' provider (#6696)

* Allow frozen Realms to be opened with additive schema changes (#6693)

* allow frozen Realms to be opened with additive schema changes

* lint

* strengthen tests and comments

* Update src/realm/object-store/shared_realm.cpp

Co-authored-by: Thomas Goyne <thomas.goyne@mongodb.com>

---------

Co-authored-by: Thomas Goyne <thomas.goyne@mongodb.com>

* Reverted minimum swift version to fix failing CI tests (#6706)

* core release v13.15.0

* Updated release notes

* Fix client reset test with invalid query (#6711)

* Fix SessionWrapper use-after-free crash when tearing down sessions (#6676)

* Changed SessionWrapper pointer to bind_ptr; added session ident history
* Fix teardown if client is destroyed before session
* Session no longer holds bind_ptr to SessionWrapper; reverted some changes
* Fixed return and updated some comments
* Don't process errors if session is shutting down
* Added extra checks for session state
* Updates from review
* Updated some finalized checks
* Rolled back some changes
* Added output to ASSERTS and moved session history to unordered_set
* Remove session history entry on normal close
* Updated comment in sync tests

* Add [baas] and [local] tags to object store sync tests to identify the tests that rely on BAAS or not (#6710)

* Use Columns<Link> when property is Dictionary of links (#6705)

If a Dictionary property has links as value type, we can use Columns<Link> to handle
the links instead of the basic Columns<Dictionary>. This has the effect that when we
compare with a single value, we will optimize to use LinksToNode. So we need to make
LinksToNode handle the Dictionary case.

When we compare with a list of links, we must ensure that the list is converted to
a list obj ObjKeys - which is the type that Column<Link> evaluates to.

 Use LinksToNode for lists in QueryParser

* better changelog message for the fix related to queries with list of arguments (#6717)

* Fixes for Emscripten target (Passing header from fetch response. Using Config.path for inMemory Realm) (#6716)

* Fixes for Emscripten target: Passing header for fetch response. Passing the RealmConfig.path to be used for inMemory Realm, this is needed for registering SyncSession

Co-authored-by: Jørgen Edelbo <jorgen.edelbo@mongodb.com>

* release 13.15.1

* Updated spec.yml to remove User & Server prefix from ApiKey credentials

---------

Co-authored-by: James Stone <james.stone@mongodb.com>
Co-authored-by: realm-ci <ci@realm.io>
Co-authored-by: Kirill Burtsev <kirill.burtsev@mongodb.com>
Co-authored-by: Daniel Tabacaru <96778637+danieltabacaru@users.noreply.github.com>
Co-authored-by: Christian Melchior <christian.melchior@mongodb.com>
Co-authored-by: Thomas Goyne <tg@realm.io>
Co-authored-by: Thomas Goyne <thomas.goyne@mongodb.com>
Co-authored-by: Jørgen Edelbo <jorgen.edelbo@mongodb.com>
Co-authored-by: Michael Wilkerson-Barker <michael.wilkersonbarker@mongodb.com>
Co-authored-by: Nicola Cabiddu <nicola.cabiddu@mongodb.com>
Co-authored-by: Sebastian Valle <svalle@gascaribe.com>
Co-authored-by: Yavor Georgiev <fealebenpae@users.noreply.github.com>
Co-authored-by: Ferdinando Papale <4850119+papafe@users.noreply.github.com>
Co-authored-by: Mathias Stearn <redbeard0531@gmail.com>
Co-authored-by: Nabil Hachicha <nabil.hachicha@gmail.com>
Co-authored-by: Finn Schiermer Andersen <finn.schiermer.andersen@gmail.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync client may crash after failing to download a fresh FLX realm during client reset w/recovery
3 participants