Skip to content

Conversation

@alwa-nordic
Copy link
Contributor

@alwa-nordic alwa-nordic commented Jan 23, 2024

bt_buf_get_cmd_complete is broken due to #64158, and fixing it would require changing its signature and put even more complexity into the HCI drivers, as it would require the drivers to perform an even deeper peek into the event in order to observe the opcode.

Instead of the above, this patch removes the use of bt_buf_get_cmd_complete and adds logic to allow the host to accept command complete events in normal event buffers.

The above means performing a copy into the destination buffer, which is the original command buffer. This is a small inefficiency for now, but we should strive to redesign the host into a streaming architecture as much as possible and handle events immediately instead of retaining buffers.

This fixes #64158: Like all command completed events, the completion event for BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS is now placed in normal event buffers. The the logic where the host discards this event is already present. Since it's discarded, it will not interfere with the logic around bt_dev.cmd_send.

The next commit will change the behavior of `bt_buf_get_evt` to no
longer use `bt_buf_get_cmd_complete`. We have to remove the test that
would prevent this.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
@alwa-nordic alwa-nordic changed the title Bluetooth: host: Remove use of bt_buf_get_cmd_complete Bluetooth: host: Remove bt_buf_get_cmd_complete Jan 24, 2024
@alwa-nordic alwa-nordic changed the title Bluetooth: host: Remove bt_buf_get_cmd_complete Bluetooth: Host: Remove bt_buf_get_cmd_complete Jan 24, 2024
`bt_buf_get_cmd_complete` is broken due to
zephyrproject-rtos#64158, and fixing it
would require changing its signature and put even more complexity into
the HCI drivers, as it would require the drivers to perform an even
deeper peek into the event in order to observe the opcode.

Instead of the above, this patch removes the use of
`bt_buf_get_cmd_complete` and adds logic to allow the host to accept
command complete events in normal event buffers.

The above means performing a copy into the destination buffer, which is
the original command buffer. This is a small inefficiency for now, but
we should strive to redesign the host into a streaming architecture as
much as possible and handle events immediately instead of retaining
buffers.

This fixes zephyrproject-rtos#64158:
Like all command completed events, the completion event for
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` is now placed in normal event
buffers. The the logic where the host discards this event is already
present. Since it's discarded, it will not interfere with the logic
around `bt_dev.cmd_send`.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
After the previous commit, this function no longer has any users.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
This is purely a syntactical refactor.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
@jhedberg
Copy link
Member

I'm not opposed to this change, but another reason (besides being an optimisation) for the reuse is that it completely eliminates one category of bugs, namely deadlocks caused by bt_hci_cmd_send_sync() blocking in a thread that would also be needed to free up buffers for the command complete event reception.

@jhedberg
Copy link
Member

I'm not opposed to this change, but another reason (besides being an optimisation) for the reuse is that it completely eliminates one category of bugs, namely deadlocks caused by bt_hci_cmd_send_sync() blocking in a thread that would also be needed to free up buffers for the command complete event reception.

and to continue on this thought: unless my memory fails me, the feature was originally inspired by wanting to get rid of fighting deadlocks like this. It was introduced in commit 50678b0 but unfortunately the commit message doesn't seem to go into much detail besides mentioning "reducing pressure on the RX buffer pool". The feature would have its 7th anniversary in a few days :)

Btw, might be worth to do some profiling to see what kind of impact this has on the performance of Bluetooth Mesh, since that actually depends on a continuous stream of HCI commands (as opposed to ACL packets) for the transmission of data. @PavelVPV FYI.

@alwa-nordic
Copy link
Contributor Author

The CI error is unrelated.

@alwa-nordic
Copy link
Contributor Author

I think the deadlocking problems the host has can really only be solved a principled way: We must stop blocking the event stream from the controller. All these priority levels and extra pools and stuff are just band-aids without technical soundness.

I mean all events must be treated as ISR. And I mean the event buffer must be made available for the next event immediately without any dependencies. If an event handler must keep data around, it must have it's own buffers (it can copy data into or give to the driver for DMA), or it can choose to drop the data. The handler cannot delay the event stream.

It's also not true that deadlocks caused by bt_hci_cmd_send_sync are eliminated, as far as I understand. The reason is that the event stream is serial at transport level. The command complete can be stuck behind other events. The host cannot peek into the future of the physical UART RX wire.

@PavelVPV
Copy link
Contributor

Btw, might be worth to do some profiling to see what kind of impact this has on the performance of Bluetooth Mesh, since that actually depends on a continuous stream of HCI commands (as opposed to ACL packets) for the transmission of data. @PavelVPV FYI.

I cherry-picked these 4 commits to the nrf connect sdk fork and ran our performance tests. The relay node performance is not changed. Throughput varies within the margin of error.

@jhedberg
Copy link
Member

It's also not true that deadlocks caused by bt_hci_cmd_send_sync are eliminated, as far as I understand. The reason is that the event stream is serial at transport level. The command complete can be stuck behind other events. The host cannot peek into the future of the physical UART RX wire.

That's a good point. And as a general design approach what you suggest sounds like the only safe bet. What mitigates the current situation a little bit is the concept of "discardable" events (e.g. advertising reports) but that doesn't cover everything that could be in the transport queue.

@carlescufi carlescufi merged commit 9426309 into zephyrproject-rtos:main Jan 31, 2024
jori-nordic added a commit to jori-nordic/zephyr that referenced this pull request Apr 30, 2024
Delete tests that did not end up bringing any value.

What ended up happening is busy-work to "make the test pass" without
understanding what's their original purpose.

Worse, the CI change-based testing is broken and doesn't pick them up,
even by PRs modifying the tests themeselves.
See zephyrproject-rtos#68008

Signed-off-by: Jonathan Rico <jonathan.rico@nordicsemi.no>
carlescufi pushed a commit that referenced this pull request Apr 30, 2024
Delete tests that did not end up bringing any value.

What ended up happening is busy-work to "make the test pass" without
understanding what's their original purpose.

Worse, the CI change-based testing is broken and doesn't pick them up,
even by PRs modifying the tests themeselves.
See #68008

Signed-off-by: Jonathan Rico <jonathan.rico@nordicsemi.no>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bluetooth: HCI command corruption on NUM_COMPLETED_PACKETS

6 participants