Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth controller extended advertisement crashes in lll layer #48812

Closed
optical-o opened this issue Aug 9, 2022 · 9 comments
Closed

Bluetooth controller extended advertisement crashes in lll layer #48812

optical-o opened this issue Aug 9, 2022 · 9 comments
Assignees
Labels
area: Bluetooth Controller area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@optical-o
Copy link
Contributor

optical-o commented Aug 9, 2022

Zephyr version:
NCS 2.0.0 - Zephyr tag v3.0.99-ncs1

I'm running and Bluetooth host and controller combination of nRF9160(Host) and nRF52833(Controller) using the BT_LL_SW_SPLIT variant. The application is very complicated so i cannot provide a reproducible sample. I have managed to capture an stack trace using Memfault. The problem occurs rather random once in a few hours.

image

lll.c

Note that the trace is capture with disabled CONFIG_BT_ASSERT=n on the controller.
We had issues with advertisement raising radio tx not ready. In production we disabled the assert to remove unnecessary restarts with cost of some advertisements not getting transmitted.

I can try to catch some debug outputs using the BT_ASSERT, however the assert string is usually lost due to LOG_DEFFER. To catch the issue with memfault is also possibility with the BT_ASSERT but requires more work.

Afterwards i tried to optimize IRQ times by these settings(replaced Zero Latency Interrupts) and haven't tested the assert variant ever since:

CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
CONFIG_BT_CTLR_ULL_LOW_PRIO=2

Basic description of the application.
It broadcasts a large amount of advertisements sets
6 sets of advertisements using BT legacy(1 connectable).
6 sets of advertisements using BT long range (Coded phy).
Interval of all advertisements are 4HZ.

Application performs regular ADV_DATA updates for the advertisements with interval 1Hz-4Hz.

Communication between Controller and Host is implemented by custom layer. I can provide HCI traces using RTT BT debug if necessary. However, we are using same layer to implement other communication between the chips and we had no problems with it. It is basically H5(Confirmed messages, and retransmission) over multiplexed UART with Flow Control. The controller and host does not seem to generate any error logs.

Configuration of the host(BT part):

# Enable Bluetooth stack and libraries
CONFIG_BT=y
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_NO_DRIVER=y
CONFIG_BT_H4=n
CONFIG_BT_SMP=y
CONFIG_BT_KEYS_OVERWRITE_OLDEST=y
CONFIG_BT_HCI_VS=y
CONFIG_BT_HCI_VS_EXT=y
CONFIG_BT_WAIT_NOP=n
CONFIG_BT_DEVICE_NAME="Mini"
CONFIG_BT_DEVICE_APPEARANCE=666
CONFIG_BT_EXT_ADV=y
CONFIG_BT_EXT_ADV_MAX_ADV_SET=15
CONFIG_BT_USER_PHY_UPDATE=y
CONFIG_BT_MAX_CONN=4

# Allow for large Bluetooth data packets.
CONFIG_BT_BUF_ACL_RX_SIZE=255
CONFIG_BT_BUF_CMD_TX_SIZE=255
CONFIG_BT_L2CAP_TX_MTU=247
CONFIG_BT_L2CAP_TX_BUF_COUNT=10
CONFIG_BT_L2CAP_DYNAMIC_CHANNEL=y
CONFIG_BT_ATT_PREPARE_COUNT=2

Configuration of the controller(BT part):

#BT Driver
CONFIG_BT=y
CONFIG_BT_CTLR=y
CONFIG_BT_CTLR_TX_PWR_PLUS_8=y
CONFIG_NET_BUF=y
CONFIG_BT_HCI_RAW=y
CONFIG_BT_CENTRAL=n

CONFIG_BT_EXT_ADV=y
CONFIG_BT_CTLR_ADV_EXT=y
CONFIG_BT_CTLR_ADV_SET=15
CONFIG_BT_CTLR_ADV_DATA_LEN_MAX=100
CONFIG_BT_CTLR_LE_ENC=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_OPTIMIZE_FOR_SPEED=y
CONFIG_BT_CTLR_SCHED_ADVANCED=y

CONFIG_BT_MAX_CONN=4
CONFIG_BT_CTLR_TX_PWR_DYNAMIC_CONTROL=y
CONFIG_BT_CTLR_RX_BUFFERS=16
CONFIG_BT_CTLR_PHY_CODED=y

CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
CONFIG_BT_CTLR_ULL_LOW_PRIO=2
CONFIG_BT_CTLR_LOW_LAT=n
CONFIG_BT_ASSERT=n

CONFIG_BT_LL_SOFTDEVICE=n
CONFIG_BT_LL_SW_SPLIT=y
CONFIG_MPSL=n

#BT Minimizing
CONFIG_BT_CTLR_PHY_2M=n
CONFIG_BT_CTLR_PRIVACY=n
CONFIG_BT_CTLR_LE_PING=n
CONFIG_BT_CTLR_LE_ENC=n
CONFIG_BT_CTLR_CRYPTO=n

What other outputs might be helpful to identify the origin of the problem ?

@optical-o optical-o added the bug The issue is a bug, or the PR is fixing a bug label Aug 9, 2022
@carlescufi carlescufi added the priority: medium Medium impact/importance bug label Aug 9, 2022
@cvinayak
Copy link
Contributor

cvinayak commented Aug 9, 2022

Some observations:

  • Are using Zephyr Controller in production without BT design listing?
  • You are using proprietary host and controller interface, may be breaking the threading contract needed for context safety of controller data structures
  • You may have altered the design of LLL control flow of the controller by removing LL_ASSERTs
  • There is no details of where the assert is, line number would not match any upstream commit if you have modified the controller source code
  • There is no steps to reproduce or with a simple sample

Request:

  • Put back the LL_ASSERTs, report back on the ASSERTs hit and under what pre-conditions are the asserts hit
  • Please check and lower the UART IRQ Priority lower than the controller's lowest level.

Frankly, if the issue is not something that I cannot reproduce with simple steps at my end, it will be a lot of effort interacting or speculating.

@optical-o
Copy link
Contributor Author

optical-o commented Aug 9, 2022

Some observations:

  • Are using Zephyr Controller in production without BT design listing?

Do you refer to Bluetooth Qualification Process ?

  • You are using proprietary host and controller interface, may be breaking the threading contract needed for context safety of controller data structures

The driver implementation is closely following the H4 sample implementation, which uses UART.

It implements the same rx_thread for (controller -> host) communication. The data retrieved by this thread is copied into our layer for transmission. The only difference is that the data is then copied and passed to the our layer instead of UART irq flow.

When receiving the only difference is that the net_buf_put(tx_queue) is called from system_workqueue instead of uart isr.

For this purpose i can capture the HCI communication, which might verify that the controller is receiving the HCI commands correctly.

  • You may have altered the design of LLL control flow of the controller by removing LL_ASSERTs
  • There is no details of where the assert is, line number would not match any upstream commit if you have modified the controller source code

We have not modified the source code of the controller. It is as the link in the ncs-zephyr repository.

  • There is no steps to reproduce or with a simple sample

Request:

  • Put back the LL_ASSERTs, report back on the ASSERTs hit and under what pre-conditions are the asserts hit

I will test several devices with the BT_ASSERT enabled and will report back with the results.

  • Please check and lower the UART IRQ Priority lower than the controller's lowest level.

I will check the UART IRQ priority, however i assume that it is unchanged from default priority given by nrf_hal. The custom transport layer is using the same UART api as H4 or H5 sample. So i assume the priorities should be the same.

Frankly, if the issue is not something that I cannot reproduce with simple steps at my end, it will be a lot of effort interacting or speculating.

@cvinayak
Copy link
Contributor

cvinayak commented Aug 9, 2022

Do you refer to Bluetooth Qualification Process ?
Yes.

When receiving the only difference is that the net_buf_put(tx_queue) is called from system_workqueue instead of uart isr.

As long as all Controller's downstream command and data path, i.e. LL interface functions in ll.h are called from a single thread, and upstream reception events are from single thread. Both threads being co-operative, the controller;s memory will be safe.

We have not modified the source code of the controller. It is as the link in the ncs-zephyr repository.

Then why do you state this:

In production we disabled the assert to remove unnecessary restarts

The custom transport layer is using the same UART api as H4 or H5 sample. So i assume the priorities should be the same.

You have modified the Controller's IRQ priorities, they can now overlap with UART IRQ priority level.

@optical-o
Copy link
Contributor Author

Do you refer to Bluetooth Qualification Process ?
Yes.

When receiving the only difference is that the net_buf_put(tx_queue) is called from system_workqueue instead of uart isr.

As long as all Controller's downstream command and data path, i.e. LL interface functions in ll.h are called from a single thread, and upstream reception events are from single thread. Both threads being co-operative, the controller;s memory will be safe.

We have not modified the source code of the controller. It is as the link in the ncs-zephyr repository.

Then why do you state this:

In production we disabled the assert to remove unnecessary restarts

Disabled by Kconfig Option CONFIG_BT_ASSERT=n i thought this should not count as modification of the source files because it is available feature of zephyr. Is there some recommended configuration, of the KConfigs ?

The custom transport layer is using the same UART api as H4 or H5 sample. So i assume the priorities should be the same.

You have modified the Controller's IRQ priorities, they can now overlap with UART IRQ priority level.

I can test the same use case using default values for, however in previous version we used it:

CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
CONFIG_BT_CTLR_ULL_LOW_PRIO=2
CONFIG_BT_CTLR_LOW_LAT=n

If it might help to discover the origin of the issue will will retest using default priorities of the controller bt part.

@cvinayak
Copy link
Contributor

cvinayak commented Aug 9, 2022

Disabled by Kconfig Option CONFIG_BT_ASSERT=n i thought this should not count as modification of the source files because it is available feature of zephyr. Is there some recommended configuration, of the KConfigs ?

I will send a PR when I get time to not be able disabling of LL_ASSERT. Disabling LL assertions is not a solution for LL assertion failures.

I am unable to assist further, without details of the assertions and detailed steps to reproduce a crash using a simple sample. I do not rule out bugs in Controller though, you may try using the latest upstream Zephyr Project repository to if things are already fixed.

@optical-o
Copy link
Contributor Author

I have performed following test:
Enabled:

CONFIG_BT_ASSERT

Removed:

CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
CONFIG_BT_CTLR_ULL_LOW_PRIO=2
CONFIG_BT_CTLR_LOW_LAT=n

Running on ten devices overnight. None of them crashed.
@cvinayak You were correct with me changing the priorities lead to this state. I have experienced the instability first in zephyr 2.6 with default controller configuration. The stability back then improved by using ZLI afterwards i used CONFIG_BT_CTLR_PROFILE_ISR to lower the ISR time which lead me to modifing the priorities. Since then the application and the zephyr has changed significantly and on the way i forgot to check the default configuration after migrating to newer versions. The issue is probably fixed.

I have one question regarding the H4 controller setup. Why there is not a controller driver abstraction layer or module ? The controller has to be build from the H4 sample and modified if the user require any additional functionality from the controller chip. The ideal situation would by a clear main function. With enabled H4 controller driver and defined UART interface on which it should operate.

@cvinayak
Copy link
Contributor

Running on ten devices overnight. None of them crashed.

Glad to hear :-)

I have one question regarding the H4 controller setup. Why there is not a controller driver abstraction layer or module ? The controller has to be build from the H4 sample and modified if the user require any additional functionality from the controller chip. The ideal situation would by a clear main function. With enabled H4 controller driver and defined UART interface on which it should operate.

The H4 sample, I assume you are referring to hci_uart sample, is used for Bluetooth Conformance testing, and hence is built as an application usable with other Hosts stacks and testers. If additional functionality is to be added, there is HCI vendor specific commands and events that can be added/implemented by users as proprietary features without affecting the Bluetooth Conformance. Refer to: https://github.com/zephyrproject-rtos/zephyr/blob/95cae9b8705cb5f4043670b1e366bd5cb617f0d0/doc/connectivity/bluetooth/api/hci.txt

maybe I am misunderstanding, @jhedberg may be you have some comments on the H4 sample question

@optical-o
Copy link
Contributor Author

The H4 sample, I assume you are referring to hci_uart sample, is used for Bluetooth Conformance testing, and hence is built as an application usable with other Hosts stacks and testers. If additional functionality is to be added, there is HCI vendor specific commands and events that can be added/implemented by users as proprietary features without affecting the Bluetooth Conformance. Refer to: https://github.com/zephyrproject-rtos/zephyr/blob/95cae9b8705cb5f4043670b1e366bd5cb617f0d0/doc/connectivity/bluetooth/api/hci.txt

maybe I am misunderstanding, @jhedberg may be you have some comments on the H4 sample question

So the only possible option for a user to add additional functionality to the controller, without compromising the Bluetooth Conformance. Is to implement some simple logic, which can be interfaced using the vendor specific HCI commands ?

If i add additional thread to the controller dedicated for some processing from other UART interface. Will it compromise the Bluetooth Conformance ? How far can one go in adding logic to the controller without compromising the Bluetooth Conformance.

Is there some API, which can be used to easily implement vendor specific HCI commands, without directly modifying the hci.c hci_vendor_cmd_handle_common or hci_vendor_cmd_handle where the vendor commands are processed ?

I can see there is a lot of zephyr specific HCI commands available already inside the controller. Does the Host some how use these capabilities in CONFIG_BT_LL_SW_SPLIT configuration. For example 'Zephyr Fatal Error', which could be logged on the Host side, or even reacted to reboot the Host instead of crashing on HCI command timeout (This happens when the controller crashes, the host tries to perform HCI commands, which timeouts and raises ASSERT).

@carlescufi
Copy link
Member

Closing this as the bug itself is resolved, please open a discussion to continue the conversation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Controller area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

No branches or pull requests

4 participants