
ble: gatt: db_hash_work runs for too long and makes serial communication fail #43811

Closed
closatt opened this issue Mar 15, 2022 · 7 comments · Fixed by #46131
Labels: area: Bluetooth Host · area: Bluetooth · bug (The issue is a bug, or the PR is fixing a bug) · priority: low (Low impact/importance bug)


closatt (Contributor) commented Mar 15, 2022

Environment
Zephyr version: 2.5
Target: NRF52840

Bug description
Just after registering a service, UART communication is blocked for more than 10 ms, causing a communication failure. In the image below, the moment when the bus is interrupted is circled in red. Because of the stall, the chip answers before the TX transfer has finished.
[image: capture of the UART bus, with the moment the bus is interrupted circled in red]

Some investigation showed that the execution of db_hash_work (gatt.c) takes several milliseconds and blocks all serial communication during that time, since the sysworkq runs at a very high priority:

  • On the SystemView capture below, we can see the sysworkq interrupting our LPWAN thread (which performs the UART communication) for 10 ms.
  • By printing the work pointers, I saw that db_hash_work was being executed right at the time of the bug. Adding a k_busy_wait(5000) in db_hash_process confirmed this: the interruption then grew to 15 ms.

[image: SystemView capture showing the sysworkq preempting the LPWAN thread for ~10 ms]
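For reference, the instrumentation boils down to something like this (a sketch only; the timing code and its exact placement inside gatt.c are illustrative, not an actual patch):

```c
/* Sketch: instrumenting db_hash_process() (subsys/bluetooth/host/gatt.c)
 * to confirm it is the long-running work item.
 */
static void db_hash_process(struct k_work *work)
{
	uint32_t start = k_cycle_get_32();

	/* Artificially inflating the work by 5 ms made the observed stall
	 * grow from ~10 ms to ~15 ms, confirming this work item is the culprit.
	 */
	k_busy_wait(5000);

	/* ... original hash computation over all registered attributes ... */

	printk("db_hash_process took %u us\n",
	       k_cyc_to_us_floor32(k_cycle_get_32() - start));
}
```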

The execution time of this work item depends on the number of registered services (13 services in this case).

The only simple temporary workaround we found (which is not thread safe) is to add a k_sleep() of several milliseconds after each call to bt_gatt_service_register().
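Something along these lines (a sketch; the 20 ms value is a guess that happens to cover the hash computation on our target, and nothing guarantees the work has actually completed by then, which is why this is not safe):

```c
#include <zephyr.h>
#include <bluetooth/gatt.h>

/* Fragile workaround: give the sysworkq time to run db_hash_work before
 * starting timing-sensitive UART traffic. The delay is target-dependent
 * and there is no guarantee the work has completed when it expires.
 */
int register_service_and_wait(struct bt_gatt_service *svc)
{
	int err = bt_gatt_service_register(svc);

	if (err == 0) {
		k_sleep(K_MSEC(20));
	}

	return err;
}
```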

To Reproduce

  1. Register a lot of services with bt_gatt_service_register.
  2. A bit less than 10 ms after registering the last service, start a long UART transfer.

This communication failure is not easy to reproduce, as it requires a particular timing. But even without reproducing the failure itself, the long duration of db_hash_work can easily be observed with Segger SystemView (or even with timestamped logs placed at the beginning and at the end of the db_hash_process function). It is enough to register a lot of services (for example 15 services with 5 characteristics each) with bt_gatt_service_register and capture the right moment, about 10 ms after the last registration (the stack defers the hash computation slightly after registration, which is why the window sits there).
The execution time may be completely different on a target other than the NRF52840.
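A minimal reproduction sketch (the UUID, service count, and read handler are placeholders; a real reproduction needs a distinct attribute array per registered service):

```c
#include <zephyr.h>
#include <bluetooth/bluetooth.h>
#include <bluetooth/gatt.h>

/* Placeholder 128-bit UUID; any vendor-specific UUID works. */
static struct bt_uuid_128 svc_uuid = BT_UUID_INIT_128(
	0x12, 0x34, 0x56, 0x78, 0x12, 0x34, 0x56, 0x78,
	0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0);

static ssize_t read_val(struct bt_conn *conn, const struct bt_gatt_attr *attr,
			void *buf, uint16_t len, uint16_t offset)
{
	static const uint8_t value;

	return bt_gatt_attr_read(conn, attr, buf, len, offset,
				 &value, sizeof(value));
}

/* One service with one characteristic; a real reproduction would define
 * ~15 of these, each with ~5 characteristics and its own attribute array.
 */
static struct bt_gatt_attr attrs[] = {
	BT_GATT_PRIMARY_SERVICE(&svc_uuid),
	BT_GATT_CHARACTERISTIC(&svc_uuid.uuid, BT_GATT_CHRC_READ,
			       BT_GATT_PERM_READ, read_val, NULL, NULL),
};

static struct bt_gatt_service svc = BT_GATT_SERVICE(attrs);

void reproduce(void)
{
	bt_gatt_service_register(&svc);
	/* ...register the remaining services here... */

	/* db_hash_work is deferred; start a long UART transfer now so that
	 * it collides with the hash computation a few ms later. */
}
```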

Expected behavior
If db_hash_process can be interrupted, it should perhaps be executed in a low-priority thread. If it cannot be interrupted, there should at least be a callback notifying the caller of bt_gatt_service_register that db_hash_process has finished and that serial communications are safe again.
There are certainly better solutions that I didn't think of.

closatt added the bug (The issue is a bug, or the PR is fixing a bug) label Mar 15, 2022
mbolivar-nordic added the priority: low (Low impact/importance bug) label Mar 15, 2022
carlescufi (Member) commented:

@thoh-ot I assigned this to you, following up on the work you did on GATT hashing recently, but feel free to reassign. By the way, have you seen this on the Oticon side?

@closatt can you please try to reproduce with the latest Zephyr? And, if possible, provide a basic sample that reproduces the problem.

carlescufi changed the title from "ble: gatt: db_hash_work lasts too long and make serial communication fail" to "ble: gatt: db_hash_work runs for too long and makes serial communication fail" Apr 6, 2022
carlescufi (Member) commented:

@thoh-ot will you be able to take a look at this? Otherwise we'll try to find someone else to assign this to.

thoh-ot (Collaborator) commented Apr 8, 2022

> @thoh-ot will you be able to take a look at this? Otherwise we'll try to find someone else to assign this to.

So I have been thinking a little bit about this. To me it looks like an architectural issue: the system work queue is being used for "long running" work while the rest of the system has other timing constraints to service.

As the system work queue executes work items in a cooperative thread with fairly high priority, I think a k_yield() in db_hash_process() is the only short-term solution to this problem.
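Roughly what that short-term mitigation could look like (purely illustrative; the chunking helpers are hypothetical names, not the actual gatt.c internals):

```c
/* Illustrative only: break the hash computation into chunks and yield
 * between them. Note that k_yield() only lets threads of equal or higher
 * priority run, so from a cooperative sysworkq thread this shortens the
 * worst-case stall for peers but does not unblock lower-priority
 * preemptible threads.
 */
static void db_hash_process(struct k_work *work)
{
	const struct bt_gatt_attr *attr;

	while ((attr = next_attr_chunk()) != NULL) { /* hypothetical iterator */
		hash_update_attr(attr);              /* hypothetical helper */
		k_yield();                           /* let other ready threads run */
	}

	db_hash_store();                             /* hypothetical finalize step */
}
```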

I won't have the capacity to make a long term viable solution.

cvinayak (Contributor) commented Apr 28, 2022

@jori-nordic could take a look into using a user work queue or pre-emptible threads to defer long-duration operations/functions, e.g. the BT ECC thread (hci_ecc.c).
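For context, a dedicated preemptible work queue in Zephyr looks roughly like this (a sketch with placeholder names and sizes; the actual fix is the one that landed in #46131):

```c
#include <zephyr.h>

#define LP_WQ_STACK_SIZE 1024
#define LP_WQ_PRIORITY   10 /* positive value => preemptible priority */

K_THREAD_STACK_DEFINE(lp_wq_stack, LP_WQ_STACK_SIZE);

static struct k_work_q lp_work_q;

void lp_wq_init(void)
{
	k_work_queue_init(&lp_work_q);
	k_work_queue_start(&lp_work_q, lp_wq_stack,
			   K_THREAD_STACK_SIZEOF(lp_wq_stack),
			   LP_WQ_PRIORITY, NULL);
}

/* Long-running items such as db_hash_work can then be submitted with
 * k_work_submit_to_queue(&lp_work_q, &work) instead of k_work_submit(&work),
 * so they no longer stall preemptible application threads.
 */
```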

jori-nordic (Contributor) commented:

@closatt could you try reproducing the bug with #46131 checked out?
You can just do git fetch origin pull/46131/head:pr-46131 && git checkout pr-46131 to get the latest version.

closatt (Contributor, author) commented Jun 20, 2022

@jori-nordic thanks for the patch, and sorry for the delay.
I rebased and tested your commit on Zephyr 2.7, since I have no software running on master.
I had trouble with SystemView always crashing at init, so I couldn't use it to test your patch.
Instead I checked with logs, with lots of services registered, whether a long UART communication from a low-priority thread was interrupted by the db_hash.work processing. Without your patch, the UART communication was interrupted by it; with your patch, the UART communication finished without interruption, before the db_hash.work processing was launched.
So it does solve the issue I mentioned.

jori-nordic (Contributor) commented:

@closatt good to hear. Something else needed attention the past week, but I'll try to merge that patch this week.

jori-nordic added a commit to jori-nordic/zephyr that referenced this issue Jun 29, 2022
Send long-running tasks to a dedicated low-priority workqueue.

This shouldn't increase memory usage since by doing this, we get rid of the
ECC processing thread.

This should fix issues like zephyrproject-rtos#43811, since the system workqueue runs at a
cooperative priority, and the new dedicated one runs at a pre-emptible
priority.

Fixes zephyrproject-rtos#43811

Signed-off-by: Jonathan Rico <jonathan.rico@nordicsemi.no>
carlescufi pushed a commit that referenced this issue Jun 30, 2022
Send long-running tasks to a dedicated low-priority workqueue.