
ble: gatt: db_hash_work runs for too long and makes serial communication fail #43811

Closed
closatt opened this issue Mar 15, 2022 · 7 comments · Fixed by #46131
Labels: area: Bluetooth Host · area: Bluetooth · bug (The issue is a bug, or the PR is fixing a bug) · priority: low (Low impact/importance bug)


closatt (Contributor) commented Mar 15, 2022

Environment
Zephyr version: 2.5
Target: NRF52840

Bug description
Just after registering a service, UART communication is blocked for more than 10 ms, causing a communication failure. In the image below, the moment when the bus is interrupted is circled in red. Because of the stall, the chip answers before the TX transfer has finished.
[image: capture of the UART bus, with the moment the bus is interrupted circled in red]

Some investigation showed that the execution of db_hash_work (gatt.c) takes several milliseconds and blocks all serial communication during that time, since the sysworkq runs at a very high priority:

  • On the SystemView capture below, we can see the sysworkq interrupting our LPWAN thread (which performs the UART communication) for 10 ms.
  • By printing the work pointers, I saw that db_hash_work was being executed right at the time of the bug. Adding a k_busy_wait(5000) in db_hash_process confirmed this: the interruption then grew to 15 ms.

[image: SystemView capture showing the sysworkq preempting the LPWAN thread for ~10 ms]
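For reference, the instrumentation boils down to something like this (a sketch only; the timing code and its exact placement inside gatt.c are illustrative, not an actual patch):

```c
/* Sketch: instrumenting db_hash_process() (subsys/bluetooth/host/gatt.c)
 * to confirm it is the long-running work item.
 */
static void db_hash_process(struct k_work *work)
{
	uint32_t start = k_cycle_get_32();

	/* Artificially inflating the work by 5 ms made the observed stall
	 * grow from ~10 ms to ~15 ms, confirming this work item is the culprit.
	 */
	k_busy_wait(5000);

	/* ... original hash computation over all registered attributes ... */

	printk("db_hash_process took %u us\n",
	       k_cyc_to_us_floor32(k_cycle_get_32() - start));
}
```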

The execution time of this work item depends on the number of registered services (13 services in this case).

The only simple temporary workaround we found (which is not thread safe) is to add a k_sleep() of several milliseconds after each call to bt_gatt_service_register().
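Something along these lines (a sketch; the 20 ms value is a guess that happens to cover the hash computation on our target, and nothing guarantees the work has actually completed by then, which is why this is not safe):

```c
#include <zephyr.h>
#include <bluetooth/gatt.h>

/* Fragile workaround: give the sysworkq time to run db_hash_work before
 * starting timing-sensitive UART traffic. The delay is target-dependent
 * and there is no guarantee the work has completed when it expires.
 */
int register_service_and_wait(struct bt_gatt_service *svc)
{
	int err = bt_gatt_service_register(svc);

	if (err == 0) {
		k_sleep(K_MSEC(20));
	}

	return err;
}
```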

To Reproduce

  1. Register a lot of services with bt_gatt_service_register.
  2. A bit less than 10 ms after registering the last service, start a long UART transfer.

This communication failure is not easy to reproduce, as it requires a particular timing. But even without reproducing the failure itself, the long duration of db_hash_work can easily be observed with Segger SystemView (or even with timestamped logs placed at the beginning and at the end of the db_hash_process function). It is enough to register a lot of services (for example 15 services with 5 characteristics each) with bt_gatt_service_register and capture the right moment, about 10 ms after the last registration (the stack defers the hash computation slightly after registration, which is why the window sits there).
The execution time may be completely different on a target other than the NRF52840.
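A minimal reproduction sketch (the UUID, service count, and read handler are placeholders; a real reproduction needs a distinct attribute array per registered service):

```c
#include <zephyr.h>
#include <bluetooth/bluetooth.h>
#include <bluetooth/gatt.h>

/* Placeholder 128-bit UUID; any vendor-specific UUID works. */
static struct bt_uuid_128 svc_uuid = BT_UUID_INIT_128(
	0x12, 0x34, 0x56, 0x78, 0x12, 0x34, 0x56, 0x78,
	0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0);

static ssize_t read_val(struct bt_conn *conn, const struct bt_gatt_attr *attr,
			void *buf, uint16_t len, uint16_t offset)
{
	static const uint8_t value;

	return bt_gatt_attr_read(conn, attr, buf, len, offset,
				 &value, sizeof(value));
}

/* One service with one characteristic; a real reproduction would define
 * ~15 of these, each with ~5 characteristics and its own attribute array.
 */
static struct bt_gatt_attr attrs[] = {
	BT_GATT_PRIMARY_SERVICE(&svc_uuid),
	BT_GATT_CHARACTERISTIC(&svc_uuid.uuid, BT_GATT_CHRC_READ,
			       BT_GATT_PERM_READ, read_val, NULL, NULL),
};

static struct bt_gatt_service svc = BT_GATT_SERVICE(attrs);

void reproduce(void)
{
	bt_gatt_service_register(&svc);
	/* ...register the remaining services here... */

	/* db_hash_work is deferred; start a long UART transfer now so that
	 * it collides with the hash computation a few ms later. */
}
```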

Expected behavior
If db_hash_process can be interrupted, it should perhaps be executed in a low-priority thread. If it cannot be interrupted, there should at least be a callback notifying the caller of bt_gatt_service_register that db_hash_process has finished and that serial communications are safe again.
There are certainly better solutions that I didn't think of.

closatt added the bug (The issue is a bug, or the PR is fixing a bug) label Mar 15, 2022
mbolivar-nordic added the priority: low (Low impact/importance bug) label Mar 15, 2022
carlescufi (Member) commented:

@thoh-ot I assigned this to you, following up on the work you did on GATT hashing recently, but feel free to reassign. By the way, have you seen this on the Oticon side?

@closatt can you please try to reproduce with the latest Zephyr? And, if possible, provide a basic sample that reproduces the problem.

carlescufi changed the title from "ble: gatt: db_hash_work lasts too long and make serial communication fail" to "ble: gatt: db_hash_work runs for too long and makes serial communication fail" Apr 6, 2022
carlescufi (Member) commented:

@thoh-ot will you be able to take a look at this? Otherwise we'll try to find someone else to assign this to.

thoh-ot (Collaborator) commented Apr 8, 2022

> @thoh-ot will you be able to take a look at this? Otherwise we'll try to find someone else to assign this to.

So I have been thinking a little bit about this. To me it looks like an architectural issue: the system work queue is being used for "long running" work while the rest of the system has other timing constraints to service.

As the system work queue executes work items in a cooperative thread with fairly high priority, I think a k_yield() in db_hash_process() is the only short-term solution to this problem.
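Roughly what that short-term mitigation could look like (purely illustrative; the chunking helpers are hypothetical names, not the actual gatt.c internals):

```c
/* Illustrative only: break the hash computation into chunks and yield
 * between them. Note that k_yield() only lets threads of equal or higher
 * priority run, so from a cooperative sysworkq thread this shortens the
 * worst-case stall for peers but does not unblock lower-priority
 * preemptible threads.
 */
static void db_hash_process(struct k_work *work)
{
	const struct bt_gatt_attr *attr;

	while ((attr = next_attr_chunk()) != NULL) { /* hypothetical iterator */
		hash_update_attr(attr);              /* hypothetical helper */
		k_yield();                           /* let other ready threads run */
	}

	db_hash_store();                             /* hypothetical finalize step */
}
```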

I won't have the capacity to make a long term viable solution.

cvinayak (Contributor) commented Apr 28, 2022

@jori-nordic could take a look into using a user work queue or pre-emptible threads to defer long-duration operations/functions, e.g. the BT ECC thread (hci_ecc.c).
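For context, a dedicated preemptible work queue in Zephyr looks roughly like this (a sketch with placeholder names and sizes; the actual fix is the one that landed in #46131):

```c
#include <zephyr.h>

#define LP_WQ_STACK_SIZE 1024
#define LP_WQ_PRIORITY   10 /* positive value => preemptible priority */

K_THREAD_STACK_DEFINE(lp_wq_stack, LP_WQ_STACK_SIZE);

static struct k_work_q lp_work_q;

void lp_wq_init(void)
{
	k_work_queue_init(&lp_work_q);
	k_work_queue_start(&lp_work_q, lp_wq_stack,
			   K_THREAD_STACK_SIZEOF(lp_wq_stack),
			   LP_WQ_PRIORITY, NULL);
}

/* Long-running items such as db_hash_work can then be submitted with
 * k_work_submit_to_queue(&lp_work_q, &work) instead of k_work_submit(&work),
 * so they no longer stall preemptible application threads.
 */
```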

jori-nordic (Contributor) commented:

@closatt could you try reproducing the bug with #46131 checked out?
You can just do git fetch origin pull/46131/head:pr-46131 && git checkout pr-46131 to get the latest version.

closatt (Contributor, author) commented Jun 20, 2022

@jori-nordic thanks for the patch, and sorry for the delay.
I rebased and tested your commit on Zephyr 2.7, since I have no software running on master.
I had trouble with SystemView always crashing at init, so I couldn't use it to test your patch.
Instead I checked with logs, with lots of services registered, whether a long UART communication from a low-priority thread was interrupted by the db_hash.work processing. Without your patch, the UART communication was interrupted by it; with your patch, the UART communication finished without interruption, before the db_hash.work processing was launched.
So it does solve the issue I mentioned.

jori-nordic (Contributor) commented:

@closatt good to hear. Something else needed attention the past week, but I'll try to merge that patch this week.

jori-nordic added a commit to jori-nordic/zephyr that referenced this issue Jun 29, 2022
Send long-running tasks to a dedicated low-priority workqueue.

This shouldn't increase memory usage since by doing this, we get rid of the
ECC processing thread.

This should fix issues like zephyrproject-rtos#43811, since the system workqueue runs at a
cooperative priority, and the new dedicated one runs at a pre-emptible
priority.

Fixes zephyrproject-rtos#43811

Signed-off-by: Jonathan Rico <jonathan.rico@nordicsemi.no>
carlescufi pushed a commit that referenced this issue Jun 30, 2022
Send long-running tasks to a dedicated low-priority workqueue.