Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: HCI command corruption on NUM_COMPLETED_PACKETS #64158

Closed
JordanYates opened this issue Oct 20, 2023 · 12 comments · Fixed by #68008
Closed

Bluetooth: HCI command corruption on NUM_COMPLETED_PACKETS #64158

JordanYates opened this issue Oct 20, 2023 · 12 comments · Fixed by #68008
Assignees
Labels
area: Bluetooth HCI Bluetooth HCI Driver area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@JordanYates
Copy link
Collaborator

JordanYates commented Oct 20, 2023

Describe the bug

HCI command buffers can be corrupted by the host receiving a BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS (opcode 0x0c35) response at an inopportune time.

hci_core.c assigns bt_dev.sent_cmd before actually sending the command to the HCI backend:

bt_dev.sent_cmd = net_buf_ref(buf);
LOG_DBG("Sending command 0x%04x (buf %p) to driver", cmd(buf)->opcode, buf);
err = bt_send(buf);

bt_buf_get_cmd_complete, which is used to allocate a buffer for the command response, naively re-uses the command buffer and overwrites the packet type and length.

struct net_buf *bt_buf_get_cmd_complete(k_timeout_t timeout)
{
struct net_buf *buf;
buf = (struct net_buf *)atomic_ptr_clear((atomic_ptr_t *)&bt_dev.sent_cmd);
if (buf) {
bt_buf_set_type(buf, BT_BUF_EVT);
buf->len = 0U;
net_buf_reserve(buf, BT_BUF_RESERVE);
return buf;
}
return bt_buf_get_rx(BT_BUF_EVT, timeout);
}

BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS is a special command in that it ignores the normal flow control, and normally does not receive a response. From the BT spec:

Normally, no event is generated after the HCI_Host_Number_Of_Completed_Packets command has completed. However, if the HCI_Host_Number_Of_Completed_Packets command contains one or more invalid parameters, the Controller shall return an HCI_Command_Complete event with a failure status indicating the Invalid HCI Command Parameters error code.

If such a response is received, after the host has assigned bt_dev.sent_cmd, but before it is actually sent, the buffer allocation function will corrupt the command parameters, usually leading to the HCI backend rejecting the command as invalid.

Expected behavior
Receiving BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS should not corrupt pending HCI commands.

Impact
Arbitrary HCI command failures

@jhedberg
Copy link
Member

jhedberg commented Oct 24, 2023

(edited to have the right PR reference)

@JordanYates wasn't #64106 the fix for this? (just wondering why they weren't linked and why this issue is still open)

@MaureenHelm MaureenHelm added the priority: medium Medium impact/importance bug label Oct 24, 2023
@JordanYates
Copy link
Collaborator Author

@JordanYates wasn't #64106 the fix for this? (just wondering why they weren't linked and why this issue is still open)

No, that PR stops the response (after allocation) from incorrectly giving the ncmd_sem semaphore, and improves the debug output.

This issue is about the allocation process itself. NUM_COMPLETED_PACKETS response shouldn't be pushed into the bt_dev.sent_cmd buffer, because that command is not associated with the response. It can lead to corrupting the command before it is sent, as shown before. Even if it doesn't corrupt the command, the command is no longer sitting in bt_dev.sent_cmd for the actual response to use.

@jhedberg
Copy link
Member

jhedberg commented Oct 25, 2023

@JordanYates got it, thanks. The challenge is that we don't know, and don't require the HCI driver to know, which command is completing through a Command Status or Command Complete event. Instead we allow the drivers to initiate the allocation of a buffer as soon as they've received a complete HCI Event header (which does not yet contain the information on what command completed). If we did have this information we could simply make an exception of not using bt_dev.sent_cmd in the case of Host Num Completed Packets.

@jhedberg
Copy link
Member

Another thing I wanted to mention, is that is's absolutely essential that bt_dev.sent_cmd is set before the call to bt_send(). This is because in some cases, in particular with the native controller, the command might be fully processed while inside the bt_send() call and therefore require bt_buf_get_cmd_complete() to correctly work already then.

@jhedberg
Copy link
Member

@JordanYates based on my two comments above (and since you've clearly looked into and thought about this a lot), do you have any proposals for how to fix this?

@JordanYates
Copy link
Collaborator Author

Another thing I wanted to mention, is that is's absolutely essential that bt_dev.sent_cmd is set before the call to bt_send().

Yea, never suspected that moving it after would be acceptable (and wouldn't even fix the issue of consuming the commands response buffer)

@JordanYates based on my two comments above (and since you've clearly looked into and thought about this a lot), do you have any proposals for how to fix this?

So my current solution (which I in no way recommend) is to simply drop all NUM_COMPLETED_PACKETS packets at the HCI driver (https://github.com/csiro-wsn/zephyr/commit/03707d562c7a5c6402680aa7bff6182b3b8ac74a), as I know that the stack currently ignores them anyway. Liable to break in the future, and requires custom code in each HCI driver.

The path I went down before going with the hack was providing the opcode to bt_buf_get_evt, which would require changes to all HCI drivers (and deeper in the stack and tests). The problem is caused by a lack of knowledge in the allocation, so we aren't going to solve it without providing more.

I suspect the best path would be to special case the BT_HCI_EVT_CMD_COMPLETE event at the HCI driver layer, and call an updated version of bt_buf_get_cmd_complete (instead of bt_buf_get_evt) which also takes the opcode response (or a pointer to the HCI response which it can pull the opcode out of).

@jhedberg
Copy link
Member

Special-casing this command completion on the HCI driver-side has similar implications as digging out the opcode and passing it to the buffer allocator. I.e. either way we impose extra parsing requirements for every driver.

Another option is CONFIG_BT_HCI_ACL_FLOW_CONTROL=n (and possibly remove the whole feature), since I'm not convinced that this is bringing any concrete performance (or other) improvements. I'd be happy to hear counter evidence on this, however.

@JordanYates
Copy link
Collaborator Author

I.e. either way we impose extra parsing requirements for every driver.

Sure, but there aren't that many drivers, and it would allow for extra validation in the future (checking that the bt_dev.sent_cmd opcode actually matches the response?)

Another option is CONFIG_BT_HCI_ACL_FLOW_CONTROL=n

I have no comment on how useful it is overall, I saw it was enabled on all the samples, so I enabled it too.
It's used by the Nordic softdevice (nrfconnect/sdk-nrfxlib@5834aee), maybe they can comment? @cvinayak

@kruithofa
Copy link
Collaborator

@cvinayak will get a clarification from the SIG on BT spec 5.4, Vol.4 Part E section 7.3.40 re. the Events generated
@JordanYates the temporary work-around is to do as Johan suggests, disable the CONFIG_BT_HCI_ACL_FLOW_CONTROL

@JordanYates
Copy link
Collaborator Author

@JordanYates the temporary work-around is to do as Johan suggests, disable the CONFIG_BT_HCI_ACL_FLOW_CONTROL

I am unwilling to disable the flow control without an explanation of why it exists in the first place, I don't want the cure to be worse than the disease.

@cvinayak
Copy link
Contributor

cvinayak commented Nov 2, 2023

@JordanYates The controller to host flow control exists to aid the possibility to limit the statically allocated host rx buffers such that controller can throttle ACL data reception in sync with available rx buffers in the host.

Typically there is the transport flow control that can implicitly ensure there is no rx buffer overflow; but this has the disadvantage that under high rx throughput other HCI events that are received serialized in the transport will be delayed until all queued ACL data can be dequeued by the host.

If your application, host and controller execute on a single CPU then the above mentioned implicit flow control maintained by the buffer allocation and kernel semaphore/queues should ensure no rx buffer overflow occurs. And if under high rx throughput, delayed events are acceptable (example, Tx waiting for num of completed packets is delayed until all rx buffers enqueued inside the controller are received by the host) , you can safely disable use of controller to host flow control.

@alwa-nordic
Copy link
Collaborator

alwa-nordic commented Nov 6, 2023

FYI, the Nordic controller in NCS 2.5.0 has a workaround for this. The Controller will now never send Command Complete for Host Num Complete. Link to changelog.

@github-actions github-actions bot added the Stale label Jan 13, 2024
@zephyrproject-rtos zephyrproject-rtos deleted a comment from github-actions bot Jan 13, 2024
@cvinayak cvinayak removed the Stale label Jan 13, 2024
alwa-nordic added a commit to alwa-nordic/zephyr that referenced this issue Jan 23, 2024
`bt_buf_get_cmd_complete` is broken, and fixing it requires

Fixes: zephyrproject-rtos#64158

Like every command completed event,
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` are now placed in normal event
buffers. These packets will no longer confuse the Host.
alwa-nordic added a commit to alwa-nordic/zephyr that referenced this issue Jan 23, 2024
`bt_buf_get_cmd_complete` is broken, and fixing would put even more
complexity into the HCI drivers.

To fix the problem while decreasing complexity in the drivers, this
patch removes its use and makes the host accept command complete events
in normal event buffers.

This decision is based on a goal to simplify the drivers. This patch
also aligns well with the goal of getting rid of generic event buffers.

Fixes: zephyrproject-rtos#64158

Like every command completed event,
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` are now placed in normal event
buffers. These packets will no longer confuse the Host.
alwa-nordic added a commit to alwa-nordic/zephyr that referenced this issue Jan 24, 2024
`bt_buf_get_cmd_complete` is broken, and fixing would put even more
complexity into the HCI drivers.

To fix the problem while decreasing complexity in the drivers, this
patch removes its use and makes the host accept command complete events
in normal event buffers.

This decision is based on a goal to simplify the drivers. This patch
also aligns well with the goal of getting rid of generic event buffers.

Fixes: zephyrproject-rtos#64158

Like every command completed event,
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` are now placed in normal event
buffers. These packets will no longer confuse the Host.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
alwa-nordic added a commit to alwa-nordic/zephyr that referenced this issue Jan 29, 2024
`bt_buf_get_cmd_complete` is broken due to
zephyrproject-rtos#64158, and fixing it
would require changing its signature and put even more complexity into
the HCI drivers, as it would require the drivers to perform an even
deeper peek into the event in order to observe the opcode.

Instead of the above, this patch removes the use of
`bt_buf_get_cmd_complete` and adds logic to allow the host to accept
command complete events in normal event buffers.

The above means performing a copy into the destination buffer, which is
the original command buffer. This is a small inefficiency for now, but
we should strive to redesign the host into a streaming architecture as
much as possible and handle events immediately instead of retaining
buffers.

This fixes zephyrproject-rtos#64158:
Like all command completed events, the completion event for
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` is now placed in normal event
buffers. The the logic where the host discards this event is already
present. Since it's discarded, it will not interfere with the logic
around `bt_dev.cmd_send`.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
PavelVPV pushed a commit to PavelVPV/sdk-zephyr that referenced this issue Jan 30, 2024
`bt_buf_get_cmd_complete` is broken due to
zephyrproject-rtos/zephyr#64158, and fixing it
would require changing its signature and put even more complexity into
the HCI drivers, as it would require the drivers to perform an even
deeper peek into the event in order to observe the opcode.

Instead of the above, this patch removes the use of
`bt_buf_get_cmd_complete` and adds logic to allow the host to accept
command complete events in normal event buffers.

The above means performing a copy into the destination buffer, which is
the original command buffer. This is a small inefficiency for now, but
we should strive to redesign the host into a streaming architecture as
much as possible and handle events immediately instead of retaining
buffers.

This fixes zephyrproject-rtos/zephyr#64158:
Like all command completed events, the completion event for
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` is now placed in normal event
buffers. The the logic where the host discards this event is already
present. Since it's discarded, it will not interfere with the logic
around `bt_dev.cmd_send`.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
carlescufi pushed a commit that referenced this issue Jan 31, 2024
`bt_buf_get_cmd_complete` is broken due to
#64158, and fixing it
would require changing its signature and put even more complexity into
the HCI drivers, as it would require the drivers to perform an even
deeper peek into the event in order to observe the opcode.

Instead of the above, this patch removes the use of
`bt_buf_get_cmd_complete` and adds logic to allow the host to accept
command complete events in normal event buffers.

The above means performing a copy into the destination buffer, which is
the original command buffer. This is a small inefficiency for now, but
we should strive to redesign the host into a streaming architecture as
much as possible and handle events immediately instead of retaining
buffers.

This fixes #64158:
Like all command completed events, the completion event for
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` is now placed in normal event
buffers. The the logic where the host discards this event is already
present. Since it's discarded, it will not interfere with the logic
around `bt_dev.cmd_send`.

Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue Feb 8, 2024
`bt_buf_get_cmd_complete` is broken due to
zephyrproject-rtos/zephyr#64158, and fixing it
would require changing its signature and put even more complexity into
the HCI drivers, as it would require the drivers to perform an even
deeper peek into the event in order to observe the opcode.

Instead of the above, this patch removes the use of
`bt_buf_get_cmd_complete` and adds logic to allow the host to accept
command complete events in normal event buffers.

The above means performing a copy into the destination buffer, which is
the original command buffer. This is a small inefficiency for now, but
we should strive to redesign the host into a streaming architecture as
much as possible and handle events immediately instead of retaining
buffers.

This fixes zephyrproject-rtos/zephyr#64158:
Like all command completed events, the completion event for
`BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS` is now placed in normal event
buffers. The the logic where the host discards this event is already
present. Since it's discarded, it will not interfere with the logic
around `bt_dev.cmd_send`.

(cherry picked from commit 1cb83a8)

Original-Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
GitOrigin-RevId: 1cb83a8
Change-Id: I9c1641f1bde863f4a7e19334f2f9b12cf52e9ed2
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5280013
Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com>
Reviewed-by: Al Semjonovs <asemjonovs@google.com>
Commit-Queue: Al Semjonovs <asemjonovs@google.com>
Tested-by: Al Semjonovs <asemjonovs@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth HCI Bluetooth HCI Driver area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants