Native libc crash on commit transaction in multithreaded environment #2224

Closed
lagmac opened this issue Feb 4, 2021 · 39 comments
Labels
More-information-needed More information is needed to progress. The issue will close automatically in 2 weeks. O-Community Reproduction-Required

@lagmac

lagmac commented Feb 4, 2021

As discussed in #2207, opened (and now closed) a few days ago by my colleague, a similar problem has now also occurred on a Samsung A40 device. The issue is a crash during a commit transaction, as shown in the crash report below.
My questions are:

  • Is it possible that the problem is internal to the library?
  • Is it possible that the problem is caused by an incorrect implementation on our part?

Goals

Successfully commit the transaction

Expected Results

The transaction commit must complete successfully

Actual Results

Native libc crash

--------- beginning of crash
01-18 09:01:49.253 917 917 F linker : CANNOT LINK EXECUTABLE "/system/bin/sec_diag_uart_log": library "libdiag_system.so" not found
01-18 15:37:01.267 9030 3102 F libc : Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x72616d69725024 in tid 3102 (Thread Pool Wor), pid 9030 (com.vimar.view)
01-18 15:37:01.634 3566 3566 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
01-18 15:37:01.634 3566 3566 F DEBUG : Build fingerprint: 'samsung/a70qeea/a70q:10/QP1A.190711.020/A705FNXXU5CTK4:user/release-keys'
01-18 15:37:01.634 3566 3566 F DEBUG : Revision: '14'
01-18 15:37:01.634 3566 3566 F DEBUG : ABI: 'arm64'
01-18 15:37:01.645 3566 3566 F DEBUG : Timestamp: 2021-01-18 15:37:01+0100
01-18 15:37:01.645 3566 3566 F DEBUG : pid: 9030, tid: 3102, name: Thread Pool Wor >>> com.vimar.view <<<
01-18 15:37:01.645 3566 3566 F DEBUG : uid: 10319
01-18 15:37:01.645 3566 3566 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x72616d69725024
01-18 15:37:01.645 3566 3566 F DEBUG : x0 00000079d5cd1800 x1 0000000000000029 x2 0000000000000000 x3 00000079f9441ed4
01-18 15:37:01.645 3566 3566 F DEBUG : x4 0000007a090efce0 x5 0000007a090efd00 x6 00000079d616ae80 x7 00000079c6682850
01-18 15:37:01.645 3566 3566 F DEBUG : x8 0000000000000029 x9 7972616d69725014 x10 0000000000000000 x11 0000000000000004
01-18 15:37:01.645 3566 3566 F DEBUG : x12 00000079cb4b0040 x13 0000000000002587 x14 00000079c0000000 x15 0000007b01d43560
01-18 15:37:01.645 3566 3566 F DEBUG : x16 00000079f9b20420 x17 0000007b01257ffc x18 0000000000000008 x19 00000079d6038048
01-18 15:37:01.645 3566 3566 F DEBUG : x20 0000000000000000 x21 00000079d5cd1d88 x22 00000079c57fc020 x23 00000079c57fc020
01-18 15:37:01.645 3566 3566 F DEBUG : x24 0000000000000000 x25 0000007a02a77b48 x26 0000007a02aca4f8 x27 0000000000000000
01-18 15:37:01.645 3566 3566 F DEBUG : x28 00000079c57fc020 x29 00000079c57fa840
01-18 15:37:01.645 3566 3566 F DEBUG : sp 00000079c57fa6f0 lr 00000079f979c1f8 pc 00000079f96d2e2c
01-18 15:37:01.918 3566 3566 F DEBUG :
01-18 15:37:01.918 3566 3566 F DEBUG : backtrace:
01-18 15:37:01.918 3566 3566 F DEBUG : #00 pc 00000000003e5e2c /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.918 3566 3566 F DEBUG : #1 pc 00000000004af1f4 /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.918 3566 3566 F DEBUG : #2 pc 0000000000457d20 /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.918 3566 3566 F DEBUG : #3 pc 000000000045ec20 /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.918 3566 3566 F DEBUG : #4 pc 00000000001c7948 /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.918 3566 3566 F DEBUG : #5 pc 00000000001a9e94 /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.919 3566 3566 F DEBUG : #6 pc 0000000000161820 /data/app/com.vimar.view-EpmnC9DopvgAw2qLIES2NA==/lib/arm64/librealm-wrappers.so (shared_realm_commit_transaction+44) (BuildId: 464c0f025e0b843293e1a31d06965fd887cc188c)
01-18 15:37:01.919 3566 3566 F DEBUG : #7 pc 000000000004b134 anonymous:79fc9f8000

Steps to Reproduce

Not reproducible; it happens in random contexts.

Version of Realm and Tooling

Realm 5.1.1 on Android/iOS
Xamarin.Forms

@nirinchev
Member

Hm... unfortunately, the stacktrace here doesn't contain a whole lot of meaningful information other than that the crash indeed happens when committing a transaction.

Can you share more information about your use case? Is this a release build - if so, are you using the AOT compiler for Android? How large is the Realm file and how much free space/memory is on the device at the time of the crash? You mention that the crash happens in a multithreaded environment - can you share some code samples that show how you open a Realm file, write to it, and close it - maybe there's some smoking gun there.

@lagmac
Author

lagmac commented Feb 5, 2021

Hi @nirinchev. Thanks for the quick response. I am gathering more information and plan to publish it shortly.

@lagmac
Author

lagmac commented Feb 8, 2021

Hi @nirinchev. First of all, my name is Gino (nicknames are nice, but the real name is perhaps better).

Below you can find the required information:

The app was built in release mode.
The basic flow of the app is as follows:

During its lifecycle, the app connects to a gateway and exchanges information with it.
Under certain conditions, the app sends a command for the gateway to execute.
Executing the command causes the gateway to restart.
When the gateway comes back up, the app searches for it and reconnects.
After reconnecting, the gateway sends its status information.
The app collects this information, translates it into Realm objects, and writes it to the DB.

Below is the code affected by this flow (I hope it's enough):

  1. The function called when the data has been received from the gateway:
private static async Task SaveSystemFunctions(
            SFDiscoveryResponse response,
            Gateway.GatewayTypes gatewayTypes,
            string sfCategory,
            bool deleteSfs = false)
        {
            await DataManager.Instance.UpdateSystemFunctions(
                    response,
                    gatewayTypes,
                    sfCategory,
                    Gateway.IsVideoGateway(gatewayTypes),
                    deleteSfs).ConfigureAwait(false);
        }
  2. Then we call the following function on DataManager:
public async Task UpdateSystemFunctions(
            SFDiscoveryResponse sfdr,
            Gateway.GatewayTypes gatewayType,
            string sfCategory,
            bool isVideoentryphoneSf = false,
            bool deleteSfs = false)
        {
            DiffAndRemove(sfdr, gatewayType, sfCategory, deleteSfs);
            **await PersistSourceItems(sfdr, isVideoentryphoneSf, gatewayType).ConfigureAwait(false);
            await RealmAsyncServiceProvider.Connections
                                           .AsyncContextExec(UpdateAggregatorsCount, true)
                                           .ConfigureAwait(false);**
            UpdateHueEntries();
        }
  3. Then we use this function to prepare the data to add to Realm:
private async Task PersistSourceItems(SFDiscoveryResponse sfdr, bool isVideoentryphoneSf, Gateway.GatewayTypes gatewayType)
        {
            var toAdd = new List<ISystemFunctionData>();

            var r = ByMeRealmProvider.GetRealm();
            {
                foreach (var result in sfdr.Result)
                {
                    var sourceSfs = result.Sf;
                    var idAmbient = result.IdAmbient;

                    if (sourceSfs.Count > 0)
                    {
                        foreach (var x in sourceSfs)
                        {
                            switch (x.Sftype)
                            {
                                case SystemFunction.Scene:
                                {
                                    var sceneryData = CreateSceneryDataObject(x, sfdr.Source, idAmbient);
                                    toAdd.Add(sceneryData);
                                    break;
                                }
                                case SystemFunction.SceneActivator:
                                {
                                    var sceneActivatorData = CreateSceneActivatorObject(x, sfdr.Source);
                                    toAdd.Add(sceneActivatorData);
                                    break;
                                }
                                default:
                                {
                                    if (isVideoentryphoneSf)
                                    {
                                        var oldIdAmbientUsers = Queries.GetAllVideoentryphone(r)
                                            .Where(s => s.Sftype != VG2FHelper.DigitalCamera)
                                            .ToList()
                                            .ToDictionary(s =>
                                                s.Idsf, s => s.IdAmbientUser);

                                        oldIdAmbientUsers.TryGetValue(x.Idsf, out var oldIdAmbientUser);
                                        var item = GetVideoEntryphoneSf(sfdr, x, idAmbient, gatewayType, oldIdAmbientUser);
                                        toAdd.Add(item);
                                    }
                                    else if (IsKnxSf(x))
                                    {
                                        var knx = GetKnxSf(x, idAmbient, sfdr.Source);
                                        toAdd.Add(knx);
                                    }
                                    else
                                    {
                                        var sfd = CreateSystemFunctionDataObject(x, idAmbient, sfdr.Source);
                                        toAdd.Add(sfd);

                                        if (x.Sstype == ClimaRealization.ClimaZone ||
                                            x.Sstype == ClimaRealization.ClimaControl)
                                        {
                                            var climaOnOff = GetClimaOnOfModalityData(x.Idsf, sfdr.Source, x.Sstype == ClimaRealization.ClimaControl);
                                            if (climaOnOff != null)
                                                r.Write(() => r.Add(climaOnOff));
                                        }
                                    }
                                    break;
                                }
                            }
                        }
                    }
                    else
                    {
                        var sfd = CreateSystemFunctionDataObject(result, sfdr.Source);
                        toAdd.Add(sfd);
                    }
                }
                r.TryDispose();

                if (toAdd.Count > 0)
                {
                    var toBePersisted = toAdd.OfType<RealmObject>().ToList();
                    **await RealmAsyncServiceProvider.Connections.PersistAsync(toBePersisted).ConfigureAwait(false);**
                }
            }
        }
  4. And at the end we write the data to the DB:
public async Task PersistAsync(List<RealmObject> toBePersisted)
        {
            var f = new Func<Realm, bool>(r =>
            {
                r.Write(() =>
                {
                    foreach (var x in toBePersisted)
                        **Realm.Add**(x, true);
                });
                return true;
            });
            await AsyncContextExec(f).ConfigureAwait(false);
        }

private RealmAsyncServiceProvider()
        {
            AcThread = new AsyncContextThread();
            AcThread.Factory.Run(() =>
            {
                **Realm = ByMeRealmProvider.GetRealm();**
                RealmInit.Set();
            });
        }

@nirinchev
Member

Hey Gino, thanks for the code samples - for the most part, things seem fine.

In PersistAsync - is it a typo that you have Realm.Add in the foreach loop? Shouldn't this be r.Add instead? Also can you share the implementation of AsyncContextExec.

@lagmac
Author

lagmac commented Feb 10, 2021

Hi @nirinchev, thanks for the reply. Below is the AsyncContextExec implementation.

public async Task<T> AsyncContextExec<T>(Func<Realm, T> toExec, bool isTransaction = false)
        {
            var f = new Func<Task<T>>(async () =>
            {
                RealmInit.WaitOne();

                Task<T> t = default;
                T ret = default;

                try
                {
                    if (toExec == null)
                    {
                        await Task.FromException(new ArgumentException("Execution object is null")).ConfigureAwait(false);
                    }
                    else
                    {
                        if (isTransaction)
                        {
                            t = AcThread.Factory.Run(() =>
                            {
                                T res = default;
                                Realm.Write(() => res = toExec.Invoke(Realm));
                                return res;
                            });
                        }
                        else
                        {
                            t = AcThread.Factory.Run(() => toExec.Invoke(Realm));
                        }
                        ret = await t.ConfigureAwait(false);
                    }
                }
                catch (Exception ex)
                {
                    if (t?.Exception != null)
                    {
                        LogBroker.Instance.TraceDebug("aggregate exception caught during async context execution!");
                        var e = t.Exception.Flatten().InnerExceptions;
                        foreach (var x in e)
                            Utils.TraceException(x);
                    }
                    else
                    {
                        Utils.TraceException(ex);
                    }
                }
                return ret;
            });

            return await Scheduler.Exec(f).ConfigureAwait(false);
        }

@fede-marsiglia

fede-marsiglia commented Feb 10, 2021

Hey Gino, thanks for the code samples - for the most part, things seem fine.

In PersistAsync - is it a typo that you have Realm.Add in the foreach loop? Shouldn't this be r.Add instead? Also can you share the implementation of AsyncContextExec.

Actually, it should be 'r' as you mention, my bad. It should not be a problem, though, as the 'r' reference is set to 'Realm' inside AsyncContextExec().
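
For clarity, the corrected loop would look roughly like this (simplified sketch: Realm.GetInstance() stands in for ByMeRealmProvider.GetRealm(), and our AsyncContextThread wiring is omitted):

using System.Collections.Generic;
using Realms;

static class PersistSketch
{
    // Corrected version of the loop discussed above: every Add goes through the
    // Realm instance that owns the write transaction, not the long-lived Realm field.
    public static void Persist(List<RealmObject> toBePersisted)
    {
        using var realm = Realm.GetInstance(); // stands in for ByMeRealmProvider.GetRealm()
        realm.Write(() =>
        {
            foreach (var obj in toBePersisted)
                realm.Add(obj, update: true); // was Realm.Add(x, true)
        });
    }
}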

@nirinchev
Member

Hm... the async code is a bit complicated to read - am I correct in assuming that the purpose of the code is to serialize all Realm access on a single thread?

If that's the case, then it's hard for me to see how this code can correlate with the stacktrace posted in #2207. The crash there happens when the app calls shared_realm_open, which is the native counterpart to Realm.GetInstance. Is RealmAsyncServiceProvider a singleton? Because that's the only place I see a call to something similar to Realm.GetInstance (assuming ByMeRealmProvider.GetRealm calls GetInstance internally). If you only open a Realm once for the lifetime of your app, then I don't see how that could cause the mutex crash. If RealmAsyncServiceProvider is not a singleton, then it's hard to understand why you need the class at all. Having multiple threads open the same Realm instance is inefficient and can lead to memory leaks if the Realm instances are not disposed when the thread exits.
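
To illustrate that last point, the pattern we'd generally expect on background threads is roughly the following (a sketch, assuming ByMeRealmProvider.GetRealm simply wraps Realm.GetInstance with your configuration):

using System.Threading.Tasks;
using Realms;

static class PerThreadRealmSketch
{
    // Each worker opens its own short-lived Realm, uses it and disposes it before
    // the thread finishes; instances are never shared between threads.
    public static Task RunWorkersAsync()
    {
        Task Worker() => Task.Run(() =>
        {
            using var realm = Realm.GetInstance(); // per-thread instance
            realm.Write(() =>
            {
                // ... add/update objects for this worker ...
            });
        }); // the instance is disposed before the pool thread is released

        return Task.WhenAll(Worker(), Worker());
    }
}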

@fede-marsiglia

fede-marsiglia commented Feb 11, 2021

Your assumption is right: we serialized Realm reads/writes via a singleton containing a private Realm belonging to an AsyncContextThread, believing the problem we were facing (the mutex crash) was due to parallel Realm writes on different Realm instances in concurrent threads. Even with that in place, though, the error persists. By error I mean the one outlined in the stacktrace below, where you can see a native crash during pthread_mutex_lock(), which most likely happens during execution of the code posted by @lagmac; it has the same thread description (Thread Pool Wor) as the trace this discussion started from.


02-09 13:43:13.684   126   126 I DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
02-09 13:43:13.684   126   126 I DEBUG   : Build fingerprint: 'Android/a93/a93:5.1/LMY47D/build02031045:user/test-keys'
02-09 13:43:13.684   126   126 I DEBUG   : Revision: '0'
02-09 13:43:13.684   126   126 I DEBUG   : ABI: 'arm'
02-09 13:43:13.685   126   126 I DEBUG   : pid: 2141, tid: 4563, name: Thread Pool Wor  >>> com.mtsbyme <<<
02-09 13:43:13.685   126   126 I DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x8
02-09 13:43:13.737   126   126 I DEBUG   :     r0 00000008  r1 00000081  r2 00000001  r3 00000000
02-09 13:43:13.737   126   126 I DEBUG   :     r4 00000008  r5 00000000  r6 40273dd4  r7 87bfd078
02-09 13:43:13.737   126   126 I DEBUG   :     r8 ffffffff  r9 ffffffff  sl 87bfd290  fp 00000000
02-09 13:43:13.738   126   126 I DEBUG   :     ip 6f39c170  sp 87bfd060  lr 6f2341b9  pc 4022ef3e  cpsr 40070030
02-09 13:43:13.739   126   126 I DEBUG   :
02-09 13:43:13.739   126   126 I DEBUG   : backtrace:
02-09 13:43:13.739   126   126 I DEBUG   :     #00 pc 00016f3e  /system/lib/libc.so (pthread_mutex_lock+7)
02-09 13:43:13.739   126   126 I DEBUG   :     #01 pc 004321b5  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so (std::__ndk1::recursive_mutex::lock()+4)
02-09 13:43:13.739   126   126 I DEBUG   :     #02 pc 002dc877  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so
02-09 13:43:13.739   126   126 I DEBUG   :     #03 pc 00104c2d  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so
02-09 13:43:13.739   126   126 I DEBUG   :     #04 pc 000f4221  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so
02-09 13:43:13.739   126   126 I DEBUG   :     #05 pc 000cd6b7  /data/app/com.mtsbyme-2/lib/arm/librealm-wrappers.so (shared_realm_open+350)
02-09 13:43:13.739   126   126 I DEBUG   :     #06 pc 0016c790  <unknown>

Am I right in assuming that, by doing read/write calls to the same Realm instance via an AsyncContextThread, the calls are serialized and safe to perform? Thanks in advance!

@nirinchev
Member

I'll keep looking at it, but in the meantime, I have an idea about something we could try - if you can reproduce that occasionally in your development environment, I can prepare a debug build of the native binaries for you to run with, which will give us a more meaningful stacktrace.

@fede-marsiglia

This would be perfect; we can replicate the issue easily.

@nirinchev
Member

@fede-marsiglia hm, now that I look at it - you seem to be using version 5.1.1 of the .NET SDK, which is somewhat old. Would you mind trying to upgrade to 10.1.0? If you're only using the local database, it should be a fairly seamless upgrade and should bring a bunch of bug fixes and performance improvements. While I don't believe it will fix the mutex issue, it will be much easier for me to produce a debug build of the 10.x line of releases and when we do identify the issue and fix it, the fix will be released as a 10.x release, so you'll need to upgrade anyway.

@fede-marsiglia

fede-marsiglia commented Feb 11, 2021

Just upgraded to 10.1.0, waiting for the debug build. Thanks again!

@nirinchev
Member

Here's a link to an archive containing the .nupkg packages with the debug binaries. I hope that running the app with these would give us a more meaningful stacktrace for the crash 🤞

@fede-marsiglia

Thank you @nirinchev, will let you know as soon as we reproduce the issue.

@fede-marsiglia

We just reproduced the issue, stacktrace below:

02-12 12:27:02.575 F/libc    ( 7818): Fatal signal 11 (SIGSEGV), code 1, fault addr 0x8 in tid 7904 (Thread Pool Wor)
02-12 12:27:02.680 I/DEBUG   (  125): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
02-12 12:27:02.680 I/DEBUG   (  125): Build fingerprint: 'Android/a93/a93:5.1/LMY47D/build05291704:user/test-keys'
02-12 12:27:02.680 I/DEBUG   (  125): Revision: '0'
02-12 12:27:02.680 I/DEBUG   (  125): ABI: 'arm'
02-12 12:27:02.680 I/DEBUG   (  125): pid: 7818, tid: 7904, name: Thread Pool Wor  >>> com.mtsbyme.debug <<<
02-12 12:27:02.680 I/DEBUG   (  125): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x8
02-12 12:27:04.382 I/DEBUG   (  125):     r0 00000008  r1 00000008  r2 00000008  r3 894cb1b8
02-12 12:27:04.382 I/DEBUG   (  125):     r4 00000008  r5 7809af34  r6 894cb61c  r7 894cb168
02-12 12:27:04.382 I/DEBUG   (  125):     r8 81f51320  r9 000016f3  sl 81f511e8  fp 00000000
02-12 12:27:04.382 I/DEBUG   (  125):     ip 835551cc  sp 894cb150  lr 83352ea1  pc 400cff3e  cpsr 40070030
02-12 12:27:04.382 I/DEBUG   (  125):
02-12 12:27:04.382 I/DEBUG   (  125): backtrace:
02-12 12:27:04.382 I/DEBUG   (  125):     #00 pc 00016f3e  /system/lib/libc.so (pthread_mutex_lock+7)
02-12 12:27:04.382 I/DEBUG   (  125):     #01 pc 0093ae9f  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (std::__ndk1::recursive_mutex::lock()+6)
02-12 12:27:04.382 I/DEBUG   (  125):     #02 pc 0069562d  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (std::__ndk1::lock_guard<std::__ndk1::recursive_mutex>::lock_guard(std::__ndk1::recursive_mutex&)+20)
02-12 12:27:04.383 I/DEBUG   (  125):     #03 pc 006979a1  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (realm::DB::get_version_id_of_latest_snapshot()+32)
02-12 12:27:04.383 I/DEBUG   (  125):     #04 pc 00695f1f  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (realm::DB::get_version_of_latest_snapshot()+30)
02-12 12:27:04.383 I/DEBUG   (  125):     #05 pc 002052d1  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (realm::Transaction::get_version_of_latest_snapshot()+16)
02-12 12:27:04.383 I/DEBUG   (  125):     #06 pc 00205189  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (realm::_impl::RealmCoordinator::advance_to_latest(realm::Realm&)+120)
02-12 12:27:04.383 I/DEBUG   (  125):     #07 pc 001befb1  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (realm::Realm::do_refresh()+172)
02-12 12:27:04.383 I/DEBUG   (  125):     #08 pc 001c1459  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (realm::Realm::refresh()+22)
02-12 12:27:04.383 I/DEBUG   (  125):     #09 pc 001026cd  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (shared_realm_open::$_0::operator()() const+420)
02-12 12:27:04.383 I/DEBUG   (  125):     #10 pc 000fd771  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (decltype ({parm#2}()) realm::handle_errors<shared_realm_open::$_0>(realm::NativeException::Marshallable&, shared_realm_open::$_0&&)+28)
02-12 12:27:04.383 I/DEBUG   (  125):     #11 pc 000fd72f  /data/app/com.mtsbyme.debug-1/lib/arm/librealm-wrappers.so (shared_realm_open+186)
02-12 12:27:04.383 I/DEBUG   (  125):     #12 pc 00005e80  <unknown>

@nirinchev
Member

Very helpful, thank you! I'll ping some folks from the Core database team to look at it and take over.

@fede-marsiglia

@nirinchev could you please share updates as soon as you have them? Unfortunately, it is mandatory that we have this issue fixed in order to release our app, and the deadline is short. Again, we really appreciate your prompt and careful support.

@nirinchev
Member

I transferred this to the Core team and they are looking into it, but we don't have a timeline for a fix yet.

@fede-marsiglia

Understandable. Can I ask whether the issue is 100% Realm-related, or whether there is a possibility of bad API usage on our part?

@nirinchev
Member

Okay, so it looks like the Realm instance you're getting is invalid. This should be impossible, but apparently we have a bug somewhere. So, a few more clarifying questions:

  1. Can you share a project that reproduces the issue? That would be the fastest path for us to debug and diagnose locally.
  2. If 1 is impossible, can you clarify a little bit the lifecycle of the Realm instances - i.e. are you disposing Realm instances after use, are you always getting a new instance or are you using the same instance for the lifecycle of your app?

@mattiascibien

@nirinchev this seems like a problem similar to the one I had, where caching gave me a Realm that should have been closed.

@fede-marsiglia

We weren't able to reproduce the problem in a separate test environment, and since it arises in a code section that gets triggered during communication between our app and our clients' gateways, we can't share a self-contained project that exhibits the same issue. As for the second request, I'll soon write a separate post with all the details you need.

@fede-marsiglia

fede-marsiglia commented Feb 16, 2021

Hi @nirinchev, here is a simplified, pseudo-code version of what we actually do when the issue arises. I focused only on the parts that make explicit use of Realm, so it should hopefully be easier for you to identify suspicious code patterns. I've also unfolded all the called methods so that, even if the result is a wall of text, following the code should be straightforward. Let me know if something is unclear and I'll dive deeper into implementation details. Thanks!

Update(realm)
{
	realm.Write(() => 
	{
		if (c1)
		{
			...

			foreach (var x in list1)
			{
				var type = GetSystemFunctionType(x);
				var items = realm.All<type>();
				foreach (var y in items)
					realm.Remove(y)
			}

			...
		}

		if (c2)
		{
			...	

			if (c3)
			{
				var item = realm.All<t1>().FirstOrDefault(c4);
				if (item != null)
					realm.Remove(item);
			}

			...
		}

		if (c5)
		{
			var item = realm.All<t2>().FirstOrDefault(c6);
			if (item != null && c7)
			{
				...

				var sfs = realm.All<t3>();

				...

				
				foreach (x in list2)
				{
					var item = realm.All<t3>().FirstOrDefault(c8);
					if (item != null)
						realm.Remove(x)
				}

				...	
			}
		}

		foreach (var x in list3)
		{
			...

			var type = GetSystemFunctionType(x);
			var items = realm.All<type>();
			foreach (var y in items)
				realm.Remove(y)

			...
		}

		foreach (var x in list4)
		{
			...

			var obj = CreateRealmObject(x);
			realm.Add(obj, update: true)

			...
		}

		foreach (var x in list5)
		{
			if (x is RealmObject obj)
			{
				if (c9)
				{	
					...

					obj.Property = (int )n; 

					...
				}
				else 
				{
					...

					var value = realm.All<t4>().Count() +
							realm.All<t5>().Count() +
							realm.All<t6>().Count() +
							realm.All<t7>().Count();

					x.Property = (int )value;

					...
				}
			}
		}
		
	}
}

This same method gets called every time we attach to a gateway, so, in case of multiple attaches at the same time, it could theoretically run in parallel with other copies of itself.

@nirinchev
Member

Hey Federico, I'm sorry - I must not have been very clear with my request. The code you posted seems just fine and is unlikely to be the source of the issue. The stacktrace points to an issue when opening a realm file (i.e. when calling GetInstance) - instead of obtaining a valid native instance, it looks like we're getting an invalid one. So if you do get to a point where you're using the Realm, it's unlikely that this will cause issues.

My best guess is that you're hitting some corner case with the caching mechanism where the cache thinks it has a valid Realm instance to return, but in reality it has already been disposed/invalidated. So can you look through your code at all the places where you're opening and closing a Realm file? Namely, it's interesting to know:

  1. Do you always dispose the Realm file after calling GetInstance?
  2. Do you have some caching on your end?
  3. Which threads do you call GetInstance/Dispose on?
  4. Are there Realm instances that are being garbage collected? (A rough way to check this is sketched after this list.)
  5. Does this happen consistently in the same place or various different places - if it's the same place, is there something special about it? E.g. if you're using AsyncThread only there and not elsewhere, it could be a smoking gun, etc. Anything different may be a clue as to what is causing this.
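
For questions 1 and 4, a crude but effective check is to count opens and disposes somewhere central and compare the two in your logs - something like this sketch (RealmCountersSketch is hypothetical, not from your code):

using System.Threading;
using Realms;

static class RealmCountersSketch
{
    // A growing gap between the two counters means some instances are being left
    // to the garbage collector instead of being disposed explicitly.
    private static int _opened;
    private static int _disposed;

    public static Realm Open()
    {
        Interlocked.Increment(ref _opened);
        return Realm.GetInstance();
    }

    public static void Close(Realm realm)
    {
        realm.Dispose();
        Interlocked.Increment(ref _disposed);
    }

    public static string Snapshot() =>
        $"opened={Volatile.Read(ref _opened)}, disposed={Volatile.Read(ref _disposed)}";
}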

Thanks again for your help tracking that down!

@fede-marsiglia

fede-marsiglia commented Feb 16, 2021

  1. We're quite sure every Realm instance gets disposed in the context in which we see the issue, but we will double-check.
  2. We had a caching mechanism on our end in the form described above (a single Realm instance belonging to an
    AsyncContextThread, managing all reads/writes only during gateway communication). Even before that measure took place, we
    had the same issue, albeit not as frequently. We reverted this mechanism anyway, as it did not resolve the problem, so now we
    follow the default Realm lifecycle (open => transaction => dispose, all on the same thread).
  3. A worker thread, following an event that gets triggered when one of the gateways is discovered.
  4. This is hard to tell; we will do some testing.
  5. The issue arises consistently in the same place; we have never seen it outside of the scenario described above. Interestingly,
    by limiting concurrent Realm writes to 5 via a semaphore (a sketch of this workaround follows this list), it seems to disappear.
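
For context, the semaphore workaround mentioned in point 5 is essentially the following (simplified sketch, not our exact code; the limit of 5 is arbitrary):

using System;
using System.Threading;
using System.Threading.Tasks;
using Realms;

static class WriteGateSketch
{
    // Allow at most 5 concurrent "open + write + dispose" operations at a time.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(5, 5);

    public static async Task WriteAsync(Action<Realm> write)
    {
        await Gate.WaitAsync().ConfigureAwait(false);
        try
        {
            // No awaits between GetInstance and Dispose.
            using var realm = Realm.GetInstance();
            realm.Write(() => write(realm));
        }
        finally
        {
            Gate.Release();
        }
    }
}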

@nirinchev
Member

Okay, so let me try to summarize some of your answers and see if I understand things correctly:

  1. You are opening and disposing of Realms on arbitrary background threads (e.g. Task.Run(() => ....)). Can you verify that there is no SynchronizationContext installed on the thread where you're opening the Realm file? This can be done by logging SynchronizationContext.Current just before calling Realm.GetInstance (see the sketch after this list).
  2. You don't do any asynchronous work between opening and closing a Realm file. This is fairly important because on TPL background threads, the continuations are dispatched on a random thread, meaning that an await between GetInstance and Dispose will prevent the Dispose from properly cleaning up native resources.
  3. If you do dispose of all Realms that are getting opened, then you shouldn't see Realms being garbage collected.
  4. You shouldn't need to limit concurrent realm writes, because it should be impossible to have more than 1 write at a time anyway. realm.Write() acquires a Realm-wide write lock, so calling it from multiple threads should have those threads wait until the transaction is committed/rolled back. It's feasible that you're hitting some corner case there, so if my understanding in points 1-3 is correct, then this is our best lead for uncovering the real issue.
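
To make points 1 and 2 concrete, the write path I have in mind looks roughly like this (just a sketch - substitute your own Realm configuration and logging):

using System;
using System.Threading;
using System.Threading.Tasks;
using Realms;

static class SyncContextCheckSketch
{
    public static Task PersistAsync(Action<Realm> write) => Task.Run(() =>
    {
        // Point 1: on a plain TPL worker thread this should log "null".
        Console.WriteLine($"SynchronizationContext: {SynchronizationContext.Current?.ToString() ?? "null"}");

        // Point 2: open, write and dispose on the same thread, with no awaits in between.
        using var realm = Realm.GetInstance();
        realm.Write(() => write(realm));
    });
}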

@nirinchev nirinchev self-assigned this Feb 16, 2021
@fede-marsiglia

fede-marsiglia commented Feb 16, 2021

  1. The stack trace leading to the problematic method is as follows:
	IpcClient.SaveSystemFunctions() 
	IpcClient.<>c__DisplayClass71_1.<SfDiscovery>b__8()
	[External code: 15 frames]
	IPConnectorBroker.HandleSfDiscovery()
	IPConnectorBroker.HandleResponse()
	IpcClient.<Attach>g__OnConnectionOnOnIncomingMessage|65_9() 
	WebSocketConnection.Connection_OnMessage() 

IpcClient.SaveSystemFunctions() is the method in which we do all the transactions needed to store the information coming from the gateway; it uses DataManager.Instance.UpdateSystemFunction() internally, which contains the pseudo-code shared in my previous post. I confirm that at that point the synchronization context is null.

[screenshot: Capture]

  2. No async code is executed; this is confirmed by the fact that DataManager.Instance.UpdateSystemFunction() is not awaitable, and no fire-and-forget tasks get called internally.

@nirinchev
Member

nirinchev commented Feb 16, 2021

Okay, that's great! One follow up question - what does TryDispose do and why are you using that instead of the using pattern? It can be a red herring, but the concern I have is that if an exception occurs in UpdateSystemFunction, the TryDispose method will not be executed, so the Realm instance will not be properly disposed. If you replaced that with:

using var realm = ByMeRealmProvider.GetRealm();

DataManager.Instance.UpdateSystemFunction(...)

the runtime will guarantee that realm.Dispose will be called regardless of whether UpdateSystemFunction succeeds or fails. It's a bit of a long shot, but if you don't have some very special logic in TryDispose, it might be worth giving it a try.
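
For completeness, the using declaration above expands to roughly the following try/finally, which is why Dispose is guaranteed to run even if UpdateSystemFunction throws:

var realm = ByMeRealmProvider.GetRealm();
try
{
    DataManager.Instance.UpdateSystemFunction(...);
}
finally
{
    realm?.Dispose(); // runs whether the call above succeeds or throws
}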

@fede-marsiglia

fede-marsiglia commented Feb 16, 2021

The purpose of TryDispose() is to have meaningful information when disposing of a Realm instance in case of an exception; the relevant part is as follows:

try
{
        if (obj is Realm re)
        {
            var disposed = false;
            try
            {
                if (!re.IsClosed)
                {
                    re.Dispose();
                    disposed = true;
                    Interlocked.Decrement(ref Utils.RealmCurrentInstances);
                }
            }
            catch (Exception)
            {
                MainThread.BeginInvokeOnMainThread(() =>
                {
                    try
                    {
                        if (!re.IsClosed)
                        {
                            re.Dispose();
                            disposed = true;
                            Interlocked.Decrement(ref Utils.RealmCurrentInstances);
                        }
                    }
                    catch (Exception ex)
                    {
                        Utils.TraceTryDisposeException(ex, callingMethod, callingFilePath, callingFileLineNumber);
                        disposed = false;
                    }
                });
            }
            return disposed;
        }
}
catch (Exception ex)
{
    Utils.TraceTryDisposeException(ex, callingMethod, callingFilePath, callingFileLineNumber);
}

We will certainly try the using pattern to see if it has a positive impact. Thanks again!

@nirinchev
Member

@fede-marsiglia sorry to bother you but checking in - did replacing that with a using statement make any difference or are you still seeing the problem?

@fede-marsiglia

Hi @nirinchev, unfortunately, due to other priorities, I wasn't able to do extensive testing with the tweak you suggested. I will surely take some time today to do that, will let you know what I find. Thanks!

@nirinchev
Member

No worries, take as much time as you need! I was checking in mostly to make sure we were on the same page and that you weren't waiting on us for something without me realizing it 😄

@fede-marsiglia

Hi @nirinchev, sorry for not showing up until now, but here I am. After setting up automated testing to trigger the described behavior on 2 different app versions, both with the suggested tweak and without it, we weren't able to reproduce the issue with either of them. At this point, it could be caused by some random variables dependent on the specific characteristics of our client's system. Will do more testing and let you know.

@fede-marsiglia

Hi @nirinchev, do you have any news regarding the issue? Thanks in advance.

@nirinchev
Member

Hi @fede-marsiglia we don't have any updates. In your previous message you said:

At this point, this could be caused by some random variables dependant upon the specific characteristics of our client's system. Will do more testing and let you know.

so I've been waiting for updates from your team.

@fede-marsiglia

Ok, I was referring specifically to this line

I transferred this to the Core team and they are looking into it, but we don't have a timeline for a fix yet.

Is there any news?

@nirinchev
Member

Oh, for that one, my next message summed up the results from their investigation:

Okay, so it looks like the Realm instance you're getting is invalid. This should be impossible, but apparently we have a bug somewhere.

Essentially, the Realm reference appears to be invalidated even though the Realm itself is still in use. Unfortunately, without a somewhat reliable way to repro, it's next to impossible to say what's causing it to be invalidated. I did a thorough review of the code that handles the Realm instance lifecycle and couldn't find any smoking guns there, so if the bug is in the Realm SDK, then it's certainly not obvious.

@fede-marsiglia

Ok, thanks again for your support, we'll try to find a consistent way to reproduce the issue

@nirinchev nirinchev added the More-information-needed More information is needed to progress. The issue will close automatically in 2 weeks. label Jun 8, 2021
@no-response

no-response bot commented Jun 22, 2021

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

@no-response no-response bot closed this as completed Jun 22, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 15, 2024