Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sendorfwd] Skip deadlocking libgbinder cleanup. Fixes JB#60298 #7

Merged
merged 4 commits into from Mar 20, 2023

Conversation

spiiroin
Copy link
Contributor

@spiiroin spiiroin commented Mar 7, 2023

Stopping sensorfwd systemd service takes 15 seconds. This happens because libgbinder cleanup gets deadlocked by some kind of orphaned transaction of unknown origin that never finishes - which then requires systemd to go through TERM-wait-timeout-KILL cycle.

As the part of at-exit cleanup we really care about is getting sensor hw turned off: Refactor hybrisManager instance logic so that cleanup code that stops sensors gets executed along with exit from mainloop instead of after returning from main(). Then skip all further cleanup code that would be executed after return from main() by using _exit().

@monich
Copy link
Member

monich commented Mar 7, 2023

That's typically a symptom of a reference (and hence memory) leak.

@spiiroin
Copy link
Contributor Author

spiiroin commented Mar 9, 2023

That's typically a symptom of a reference (and hence memory) leak.

While there seems to be some reference counting wobble with auto-release objects,
after considerable trial and error debug session, the root cause seems to be that
gbinder_client_cancel() just marks stuff as cancel-this-when-woken-up and as we
are already on exit path there are no more wakeups happening -> forever wait.

Or at least I managed to get the thing to unwind by adding dummy synchronous
transaction after the cancellation point here:
https://github.com/sailfishos/sensorfw/blob/master/core/hybrisadaptor.cpp#L441

@spiiroin
Copy link
Contributor Author

spiiroin commented Mar 9, 2023

@monich Can you take a peek? Is it really that gbinder_client_cancel() and would there be something bit more generic that could be done at libgbinder in case there are similar pending transactions elsewhere too?

@spiiroin spiiroin requested a review from mlehtima March 10, 2023 10:43
@monich
Copy link
Member

monich commented Mar 13, 2023

So far I haven't been able to reproduce the problem but running sensorfwd under valgrind gives me lots of these when I switch between the apps (which I guess causes connections to be established and terminated):

==24714== Thread 4:
==24714== Invalid read of size 8
==24714==    at 0x48F5DD0: swap<QListData::Data*> (move.h:193)
==24714==    by 0x48F5DD0: qSwap<QListData::Data*> (qglobal.h:869)
==24714==    by 0x48F5DD0: swap (qlist.h:151)
==24714==    by 0x48F5DD0: operator= (qlist.h:149)
==24714==    by 0x48F5DD0: clear (qlist.h:864)
==24714==    by 0x48F5DD0: AbstractSensorChannel::downsampleAndPropagate(TimedXyzData const&, QMap<int, QList<TimedXyzData> >&) (abstractsensor.cpp:156)
==24714==    by 0x8C68007: DataEmitter<TimedXyzData>::pushNewData() (dataemitter.h:71)
==24714==    by 0x8BF4197: wakeUpReaders (ringbuffer.h:217)
==24714==    by 0x8BF4197: RingBuffer<TimedXyzData>::write(unsigned int, TimedXyzData const*) (ringbuffer.h:235)
==24714==    by 0x8BA2497: propagate (source.h:103)
==24714==    by 0x8BA2497: BufferReader<TimedXyzData>::pushNewData() (bufferreader.h:73)
==24714==    by 0x8BF4197: wakeUpReaders (ringbuffer.h:217)
==24714==    by 0x8BF4197: RingBuffer<TimedXyzData>::write(unsigned int, TimedXyzData const*) (ringbuffer.h:235)
==24714==    by 0x8C20EC7: propagate (source.h:103)
==24714==    by 0x8C20EC7: CoordinateAlignFilter::filter(unsigned int, TimedXyzData const*) (coordinatealignfilter.cpp:44)
==24714==    by 0x8BA2497: propagate (source.h:103)
==24714==    by 0x8BA2497: BufferReader<TimedXyzData>::pushNewData() (bufferreader.h:73)
==24714==    by 0x8C434C7: wakeUpReaders (ringbuffer.h:217)
==24714==    by 0x8C434C7: HybrisAccelerometerAdaptor::processSample(sensors_event_t const&) (hybrisaccelerometeradaptor.cpp:76)
==24714==    by 0x84C920B: HybrisManager::processSample(sensors_event_t const&) (hybrisadaptor.cpp:802)
==24714==    by 0x84C93C3: HybrisManager::processEvents(sensors_event_t const*, int, int&, bool&) (hybrisadaptor.cpp:1256)
==24714==    by 0x84C96F7: HybrisManager::eventReaderThread(void*) (hybrisadaptor.cpp:1182)
==24714==    by 0x4A01A4B: start_thread (pthread_create.c:479)
==24714==    by 0x549289B: thread_start (clone.S:78)
==24714==  Address 0x8f52400 is 32 bytes inside a block of size 40 free'd
==24714==    at 0x486B2B8: free (vg_replace_malloc.c:872)
==24714==    by 0x4D77213: QMapDataBase::freeNodeAndRebalance(QMapNodeBase*) (qmap.cpp:287)
==24714==    by 0x8C67693: deleteNode (qmap.h:278)
==24714==    by 0x8C67693: remove (qmap.h:935)
==24714==    by 0x8C67693: AccelerometerSensorChannel::removeSession(int) (accelerometersensor.cpp:123)
==24714==    by 0x48F5437: AbstractSensorChannel::stop(int) (abstractsensor.cpp:74)
==24714==    by 0x4912527: AbstractSensorChannelAdaptor::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) (moc_abstractsensor_a.cpp:206)
==24714==    by 0x4912B43: AbstractSensorChannelAdaptor::qt_metacall(QMetaObject::Call, int, void**) (moc_abstractsensor_a.cpp:306)
==24714==    by 0x8C6F1B7: AccelerometerSensorChannelAdaptor::qt_metacall(QMetaObject::Call, int, void**) (moc_accelerometersensor_a.cpp:158)
==24714==    by 0x4BD654F: QDBusConnectionPrivate::deliverCall(QObject*, int, QDBusMessage const&, QVector<int> const&, int) (qdbusintegrator.cpp:991)
==24714==    by 0x4BDA16B: QDBusConnectionPrivate::activateCall(QObject*, int, QDBusMessage const&) [clone .part.240] (qdbusintegrator.cpp:901)
==24714==    by 0x4BDA713: activateCall (qdbusintegrator.cpp:843)
==24714==    by 0x4BDA713: activateObject (qdbusintegrator.cpp:1483)
==24714==    by 0x4BDA713: QDBusConnectionPrivate::activateObject(QDBusConnectionPrivate::ObjectTreeNode&, QDBusMessage const&, int) (qdbusintegrator.cpp:1433)
==24714==    by 0x4BDCA7B: QDBusActivateObjectEvent::placeMetaCall(QObject*) (qdbusintegrator.cpp:1603)
==24714==    by 0x4EF9C6B: QObject::event(QEvent*) (qobject.cpp:1256)
==24714==    by 0x4ECF2B3: doNotify (qcoreapplication.cpp:1090)
==24714==    by 0x4ECF2B3: notify (qcoreapplication.cpp:1076)
==24714==    by 0x4ECF2B3: QCoreApplication::notifyInternal2(QObject*, QEvent*) (qcoreapplication.cpp:1015)
==24714==    by 0x4ED16B7: sendEvent (qcoreapplication.h:225)
==24714==    by 0x4ED16B7: QCoreApplicationPrivate::sendPostedEvents(QObject*, int, QThreadData*) (qcoreapplication.cpp:1650)
==24714==    by 0x4F21647: postEventSourceDispatch(_GSource*, int (*)(void*), void*) (qeventdispatcher_glib.cpp:275)
==24714==    by 0x6107D33: g_main_dispatch (gmain.c:3417)
==24714==    by 0x6107D33: g_main_context_dispatch (gmain.c:4135)
==24714==    by 0x6107F9F: g_main_context_iterate.isra.23 (gmain.c:4211)
==24714==    by 0x6108033: g_main_context_iteration (gmain.c:4276)
==24714==    by 0x4F21AB3: QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) (qeventdispatcher_glib.cpp:425)
==24714==    by 0x4ECD607: QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) (qeventloop.cpp:206)
==24714==  Block was alloc'd at
==24714==    at 0x48687C0: malloc (vg_replace_malloc.c:381)
==24714==    by 0x4D774AB: qMapAllocate (qmap.cpp:310)
==24714==    by 0x4D774AB: QMapDataBase::createNode(int, int, QMapNodeBase*, bool) (qmap.cpp:323)
==24714==    by 0x48F5BDF: createNode (qmap.h:223)
==24714==    by 0x48F5BDF: insert (qmap.h:713)
==24714==    by 0x48F5BDF: operator[] (qmap.h:663)
==24714==    by 0x48F5BDF: AbstractSensorChannel::downsampleAndPropagate(TimedXyzData const&, QMap<int, QList<TimedXyzData> >&) (abstractsensor.cpp:121)
==24714==    by 0x8C68007: DataEmitter<TimedXyzData>::pushNewData() (dataemitter.h:71)
==24714==    by 0x8BF4197: wakeUpReaders (ringbuffer.h:217)
==24714==    by 0x8BF4197: RingBuffer<TimedXyzData>::write(unsigned int, TimedXyzData const*) (ringbuffer.h:235)
==24714==    by 0x8BA2497: propagate (source.h:103)
==24714==    by 0x8BA2497: BufferReader<TimedXyzData>::pushNewData() (bufferreader.h:73)
==24714==    by 0x8BF4197: wakeUpReaders (ringbuffer.h:217)
==24714==    by 0x8BF4197: RingBuffer<TimedXyzData>::write(unsigned int, TimedXyzData const*) (ringbuffer.h:235)
==24714==    by 0x8C20EC7: propagate (source.h:103)
==24714==    by 0x8C20EC7: CoordinateAlignFilter::filter(unsigned int, TimedXyzData const*) (coordinatealignfilter.cpp:44)
==24714==    by 0x8BA2497: propagate (source.h:103)
==24714==    by 0x8BA2497: BufferReader<TimedXyzData>::pushNewData() (bufferreader.h:73)
==24714==    by 0x8C434C7: wakeUpReaders (ringbuffer.h:217)
==24714==    by 0x8C434C7: HybrisAccelerometerAdaptor::processSample(sensors_event_t const&) (hybrisaccelerometeradaptor.cpp:76)
==24714==    by 0x84C920B: HybrisManager::processSample(sensors_event_t const&) (hybrisadaptor.cpp:802)
==24714==    by 0x84C93C3: HybrisManager::processEvents(sensors_event_t const*, int, int&, bool&) (hybrisadaptor.cpp:1256)
==24714==    by 0x84C96F7: HybrisManager::eventReaderThread(void*) (hybrisadaptor.cpp:1182)
==24714==    by 0x4A01A4B: start_thread (pthread_create.c:479)
==24714==    by 0x549289B: thread_start (clone.S:78)

I don't think it's related but who knows. In any case it's worth fixing IMO.

core/hybrisadaptor.cpp Outdated Show resolved Hide resolved
@spiiroin
Copy link
Contributor Author

So far I haven't been able to reproduce the problem

For hangup to occur: it needs to be something that uses andoid sensor api 1.x -> for example xa2.
Crashes on the other hand happened with sensor api 2.x -> for example xperia 10 iii.

but running sensorfwd under valgrind gives me lots of these when I switch between the apps (which I guess causes connections to be established and terminated):

There has been a long standing suspicion about thread safety of stuff starting from

==24714== by 0x84C920B: HybrisManager::processSample(sensors_event_t const&) (hybrisadaptor.cpp:802)

In all likelihood not directly contributing to the immediate problem, but I did (finally) create a separate ticket about that.

Stopping sensorfwd systemd service takes 15 seconds. This happens
because libgbinder cleanup gets deadlocked by some kind of orphaned
transaction of unknown origin that never finishes - which then
requires systemd to go through TERM-wait-timeout-KILL cycle.

As the part of at-exit cleanup we really care about is getting sensor
hw turned off: Refactor hybrisManager instance logic so that cleanup
code that stops sensors gets executed along with exit from mainloop
instead of after returning from main(). Then skip all further cleanup
code that would be executed after return from main() by using _exit().

Signed-off-by: Simo Piiroinen <simo.piiroinen@jolla.com>
By the looks of it, the deadlock at gbinder_ipc_exit() is caused
by delayed processing of gbinder_client_cancel() actions.

Use a dummy synchronous transaction to trigger cancellation and
thus unblock way to exit.

While at it, omit extra references for autorelease objects, add
potentially missing reply unref and streamline object cleanup.

Signed-off-by: Simo Piiroinen <simo.piiroinen@jolla.com>
Sensorfwd crashes on exit in devices using android sensors api 2.x.
The crash occurs within event reader thread.

The amount of events that are read in one go is already limited to
what can be fit into fixed size buffers. However event processing is
done as if all of the available events were read - which then leads
to problems when garbage data from beyond buffer end is processed
too.

Process only the number of events that were read instead of how many
would have been available.

Signed-off-by: Simo Piiroinen <simo.piiroinen@jolla.com>
Crashes occur during sensorfwd exit time cleanup. Backtraces point
towards objects related to cancellation of death notification already
being invalidated.

Documentation suggests that underlying glib signal connections can
end up being torn automatically, but it is a bit unclear where and
how that might happen.

Doing cancellation of death and other async notifications first seems
to help.

Signed-off-by: Simo Piiroinen <simo.piiroinen@jolla.com>
@spiiroin
Copy link
Contributor Author

Debug commits dropped. Review commits squashed.

@spiiroin spiiroin merged commit 278548e into sailfishos:master Mar 20, 2023
@spiiroin spiiroin deleted the jb60298_skip_atexit branch April 3, 2023 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants