Bluetooth: Host: Fixed where bt_send returns an error but is actually… #74287

LingaoM · 2024-06-14T07:13:59Z

Enhancement

Clear the `sync` variable to make the code safer.
Add `rsp` to avoid a deep copy. (related: Bluetooth: Host: Remove bt_buf_get_cmd_complete #68008)

jhedberg · 2024-06-14T07:40:52Z

@LingaoM while the change looks reasonable, you need to explain (in the commit message) in more detail the sequence of events which will trigger it. Is this purely theoretical, or you actually saw it in practice? If the latter, what kind of build configuration & HW did you have?

Btw, there's a merge conflict, so you need to rebase.

LingaoM · 2024-06-14T07:54:16Z

@LingaoM while the change looks reasonable, you need to explain (in the commit message) in more detail the sequence of events which will trigger it. Is this purely theoretical, or you actually saw it in practice?

In fact, it is not a problem with Zephyr itself. The problem occurred when we ran the Zephyr protocol stack on Linux and tested it. However, we believe that the zephyr protocol stack uses a local variable and should be cleared, which is safer :).

The `sync` is a local variable in the stack space. Clearing this pointer explicitly before releasing it is a safer way. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>
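The concern in the commit message above can be sketched in plain C. This is only a model under stated assumptions: `struct cmd_state`, `send_sync`, `send_ok` and `send_fail` are hypothetical names for illustration, not the Zephyr host's actual types or functions. It shows a pointer to a stack-local completion flag being published in shared state and then cleared on every exit path, including the one where the send reports an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-command state; not the real Zephyr struct. */
struct cmd_state {
	int *sync; /* points at the sender's completion flag while pending */
};

/* Hypothetical transport hook: returns 0 on success, negative on failure. */
typedef int (*send_fn)(struct cmd_state *state);

static int send_sync(struct cmd_state *state, send_fn send)
{
	int done = 0; /* lives in this function's stack frame */
	int err;

	state->sync = &done;
	err = send(state);

	/* Clear the pointer on every exit path so nothing can write
	 * through it after this stack frame is gone.
	 */
	state->sync = NULL;

	if (err) {
		return err;
	}
	return done ? 0 : -1;
}

/* Model of a send that completes immediately. */
static int send_ok(struct cmd_state *state)
{
	*state->sync = 1;
	return 0;
}

/* Model of a send that reports an error. */
static int send_fail(struct cmd_state *state)
{
	(void)state;
	return -5;
}
```

In this model, `state->sync` is NULL after `send_sync()` returns, whether the send succeeded or failed, so a late completion has no stale stack address to write to.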

LingaoM · 2024-06-14T08:18:14Z

@alwa-nordic CC :).

subsys/bluetooth/host/hci_core.c

jori-nordic · 2024-06-14T08:45:02Z

@alwa-nordic isn't that what you were trying to avoid with 1cb83a8 ?
Specifically allowing the application to ref() event buffers.

Add `rsp` field to avoid deep-copy for every cmd. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

LingaoM · 2024-06-14T09:20:38Z

@alwa-nordic isn't that what you were trying to avoid with 1cb83a8 ? Specifically allowing the application to ref() event buffers.

#68008 (comment)

I think the deadlocking problems the host has can really only be solved in a principled way: We must stop blocking the event stream from the controller. All these priority levels and extra pools and stuff are just band-aids without technical soundness.

I mean all events must be treated as ISR. And I mean the event buffer must be made available for the next event immediately without any dependencies. If an event handler must keep data around, it must have its own buffers (it can copy data into or give to the driver for DMA), or it can choose to drop the data. The handler cannot delay the event stream.

I don't think my approach delays the event flow. If you look at the implementation of bt_recv, for events with evt_flags & BT_HCI_EVT_FLAG_RECV, buf is still held by bt_dev.rx_queue until BT RX processing handles it, so there is no difference between that path and the approach in this PR. I would even say this approach is safer than BT RX processing: in most cases only the protocol stack uses the bt_hci_cmd_send_sync API, so we can ensure that none of the existing call sites that pass a non-NULL rsp contain delaying code. By contrast, BT RX processing of event buffers involves many callbacks into the user layer, which increases uncertainty.

Thalley

So what this PR really does (besides the renames), is that it replaced a copy with another pointer.

I'm not convinced that this is a better approach as it doesn't seem to solve any issues, but does increase our RAM usage for a small performance gain.

The RAM usage is easily measurable, but do we save anything meaningful when not doing the copy?

Thalley · 2024-06-16T06:39:19Z

subsys/bluetooth/host/hci_core.c

Is cmd(buf)->rsp always non-NULL here?

subsys/bluetooth/host/hci_core.c

LingaoM · 2024-06-16T07:11:05Z

I'm not convinced that this is a better approach as it doesn't seem to solve any issues, but does increase our RAM usage for a small performance gain.

The RAM usage is easily measurable, but do we save anything meaningful when not doing the copy?

@Thalley I don't think this change will increase RAM usage by much. If you look at https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/host/hci_core.c#L121 , the default CONFIG_BT_BUF_CMD_TX_COUNT is only 2, so the increase is only 8 bytes of RAM on a 32-bit machine. After this change we can completely avoid the double-copy design, which I think is valuable.

BTW: From a coding-style point of view, the double copy is not a good idea. It does not cause a performance decrease, but it is odd.
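The 8-byte figure follows from simple arithmetic, sketched here under the assumption that the change adds exactly one pointer-sized field per command slot and that no extra padding is introduced. `extra_ram_bytes` is a hypothetical helper for illustration, not part of Zephyr.

```c
#include <assert.h>
#include <stddef.h>

/* Extra RAM from adding one pointer-sized field to each command slot.
 * Assumes one slot per command buffer and no added padding.
 */
static size_t extra_ram_bytes(size_t cmd_tx_count, size_t pointer_size)
{
	return cmd_tx_count * pointer_size;
}
```

With the default count of 2 and 4-byte pointers, `extra_ram_bytes(2, 4)` is 8; on a target with 8-byte pointers the same change would cost 16 bytes.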

Thalley · 2024-06-16T07:17:32Z

I'm not convinced that this is a better approach as it doesn't seem to solve any issues, but does increase our RAM usage for a small performance gain.
The RAM usage is easily measurable, but do we save anything meaningful when not doing the copy?

@Thalley I don't think this change will increase RAM usage by much. If you look at https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/host/hci_core.c#L121 , the default CONFIG_BT_BUF_CMD_TX_COUNT is only 2, so the increase is only 8 bytes of RAM on a 32-bit machine. After this change we can completely avoid the double-copy design, which I think is valuable.

BTW: From a coding-style point of view, the double copy is not a good idea. It does not cause a performance decrease, but it is odd.

I tend to agree, but also consider that the size of events and responses are usually so small that it doesn't really matter either :)

Not opposed to the change, but still unsure whether it's overall better.

Since `send_cmd` follows a request-response pattern, rename the buffers separately to make this clear. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

LingaoM · 2024-06-16T07:27:31Z

Not opposed to the change, but still unsure whether it's overall better.

It's true that this PR does not improve performance much, but the code does become more concise, at least I think so. :)

jori-nordic · 2024-06-17T06:15:33Z

most cases only the protocol stack will use the bt_hci_cmd_send_sync API, So we can completely ensure that there is no delayed code in all existing places that use this API

Most cases did indeed involve the host making the deadlock by using send_sync() in the wrong place at the wrong time. With the current design that is very synchronous, it's hard to avoid deadlocks in general.

Anyway, I was left scratching my head at the previous logic (what does if (evt_buf != buf) even mean?), so your PR would be an improvement in readability for me.
I'll trigger our internal HW testing pipeline to double-check there are no issues.

LingaoM · 2024-06-19T06:31:23Z

@jori-nordic BabbleSim Tests PASSED.

jhedberg · 2024-06-19T09:12:28Z

what does if (evt_buf != buf) even mean?

I think it's a sanity-check that the HCI driver used the appropriate buffer allocation method for the command complete event which should result in getting hold of the original command buffer, and if it didn't do that (used some generic allocator or even its own pool) then the code was trying to work around it.

alwa-nordic · 2024-06-19T11:13:54Z

@alwa-nordic isn't that what you were trying to avoid with 1cb83a8 ? Specifically allowing the application to ref() event buffers.

#68008 (comment)

I think the deadlocking problems the host has can really only be solved in a principled way: We must stop blocking the event stream from the controller. All these priority levels and extra pools and stuff are just band-aids without technical soundness.

I mean all events must be treated as ISR. And I mean the event buffer must be made available for the next event immediately without any dependencies. If an event handler must keep data around, it must have its own buffers (it can copy data into or give to the driver for DMA), or it can choose to drop the data. The handler cannot delay the event stream.

I don't think my approach delays the event flow. If you look at the implementation of bt_recv, for events with evt_flags & BT_HCI_EVT_FLAG_RECV, buf is still held by bt_dev.rx_queue until BT RX processing handles it, so there is no difference between that path and the approach in this PR. I would even say this approach is safer than BT RX processing: in most cases only the protocol stack uses the bt_hci_cmd_send_sync API, so we can ensure that none of the existing call sites that pass a non-NULL rsp contain delaying code. By contrast, BT RX processing of event buffers involves many callbacks into the user layer, which increases uncertainty.

The significant change in this PR is that applications get a reference to the buffer the Command Complete event was received into.

This is orthogonal to the primary reason for 1cb83a8, to remove the assumption that a Command Complete event is a response to the previously sent command.

More importantly, there exists a separate issue that blocking the HCI event stream makes it impossibly complicated to guarantee no deadlocks form. Giving the application a reference to an event buffer is a hazard in this respect. The hazard is equivalent to invoking an application callback from bt_recv_prio.

To remain safe, the application must give the buffer back before it can expect any synchronizing with the Bluetooth Host to complete, since the Host may potentially be blocked by the application. In terms of a callback, we would just say that the callback should be ISR-safe.
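The hazard can be illustrated with a toy fixed-size pool. This is a sketch under assumptions, not the Zephyr `net_buf` implementation: `struct evt_buf`, `evt_alloc` and `evt_unref` are hypothetical names, and the pool holds a single buffer so the stall is easy to see. While the caller still holds its reference, the next event has nowhere to go.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 1 /* a single event buffer, to make the stall obvious */

/* Hypothetical reference-counted event buffer. */
struct evt_buf {
	int refs; /* 0 means the buffer is free */
};

static struct evt_buf pool[POOL_SIZE];

/* Non-blocking allocation: returns NULL while every buffer is referenced. */
static struct evt_buf *evt_alloc(void)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (pool[i].refs == 0) {
			pool[i].refs = 1;
			return &pool[i];
		}
	}
	return NULL;
}

/* Drop one reference; the buffer becomes free again at zero. */
static void evt_unref(struct evt_buf *buf)
{
	assert(buf->refs > 0);
	buf->refs--;
}
```

In this model a second `evt_alloc()` returns NULL until the holder calls `evt_unref()`. A blocking allocator would instead wait at that point, which is where a deadlock could form if the holder is itself waiting on the host.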

Stalling due to a held reference is very non-intuitive for our users. It's even less intuitive than stalling due to control held in a callback, which is already a confusing topic. (Aside: The obvious version would be an event loop, and the application not getting any events when the application is not polling for events because it's handling the previous event.)

Due to the hazard outlined above, I am against allowing the application to get a reference to a stack-internal buffer in the common case. I would be OK with adding a second 'expert API' for those who got to go fast.

Then there is also the question of benchmarking this. Do you have any numbers from experiments that show a gain in speed or a real reduction in power use? I fear we are doing premature optimization.

LingaoM · 2024-06-19T12:06:18Z

I don't think so. Generally, bt_hci_cmd_send_sync is only called by the host stack, not by the application. Although the API is public, a user would only call it in some situations, such as sending a vendor command. Most callers are in the host stack, which is code we maintain, so we can ensure that.

LingaoM · 2024-06-20T06:58:18Z

subsys/bluetooth/host/hci_core.c

-		net_buf_reset(buf);
-		bt_buf_set_type(buf, BT_BUF_EVT);
-		net_buf_reserve(buf, BT_BUF_RESERVE);
-		net_buf_add_mem(buf, evt_buf->data, evt_buf->len);


@alwa-nordic Here we actually borrow the cmd buffer to carry the rsp data, but this presupposes that the cmd buffer is large enough to hold the rsp. That is why the current implementation sizes the buffer as the maximum of the two: https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/host/hci_core.c#L162 . After my PR, this constraint can be avoided.
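The size constraint described above can be sketched with a small model. Everything here is hypothetical and simplified for illustration (`struct buf`, `copy_rsp_into_cmd` and `ref_rsp` are not Zephyr APIs): copying the response into the command buffer only works when that buffer has enough capacity, whereas handing out a reference to the event buffer has no such requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical simplified buffer. */
struct buf {
	unsigned char data[8];
	size_t size; /* usable capacity in this model, at most sizeof(data) */
	size_t len;  /* bytes currently stored */
};

/* Copy style: reuse the command buffer to carry the response. */
static int copy_rsp_into_cmd(struct buf *cmd, const struct buf *evt)
{
	if (evt->len > cmd->size) {
		return -1; /* command buffer too small for this response */
	}
	memcpy(cmd->data, evt->data, evt->len);
	cmd->len = evt->len;
	return 0;
}

/* Reference style: the caller reads the event buffer directly. */
static const struct buf *ref_rsp(const struct buf *evt)
{
	return evt;
}
```

The trade-off discussed in this thread still applies to the reference style: the caller must release the event buffer promptly.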

LingaoM · 2024-06-20T07:15:57Z

I don't think my approach delays the event flow. If you look at the implementation of bt_recv, for events with evt_flags & BT_HCI_EVT_FLAG_RECV, buf is still held by bt_dev.rx_queue until BT RX processing handles it, so there is no difference between that path and the approach in this PR. I would even say this approach is safer than BT RX processing: in most cases only the protocol stack uses the bt_hci_cmd_send_sync API, so we can ensure that none of the existing call sites that pass a non-NULL rsp contain delaying code. By contrast, BT RX processing of event buffers involves many callbacks into the user layer, which increases uncertainty.

More importantly, there exists a separate issue that blocking the HCI event stream makes it impossibly complicated to guarantee no deadlocks form. Giving the application a reference to an event buffer is a hazard in this respect. The hazard is equivalent to invoking an application callback from bt_recv_prio.

To remain safe, the application must give the buffer back before it can expect any synchronizing with the Bluetooth Host to complete, since the Host may potentially be blocked by the application. In terms of a callback, we would just say that the callback should be ISR-safe.

Like all command completed events, the completion event for
BT_HCI_OP_HOST_NUM_COMPLETED_PACKETS is now placed in normal event
buffers.

To summarize:

After 1cb83a8, there is no difference between bt_recv_prio and bt_recv; both use the same normal net_buf pool.
If you look at the implementation of bt_recv, for events with evt_flags & BT_HCI_EVT_FLAG_RECV, buf is still held by bt_dev.rx_queue until BT RX processing handles it, and it comes from the same net_buf pool.
In most cases only the protocol stack uses the bt_hci_cmd_send_sync API, so we can ensure that none of the existing call sites that pass a non-NULL rsp contain delaying code.

BTW:

Stalling due to a held reference is very non-intuitive for our users. It's even less intuitive than stalling due to control held in a callback, which is already a confusing topic. (Aside: The obvious version would be an event loop, and the application not getting any events when the application is not polling for events because it's handling the previous event.)

Due to the hazard outlined above, I am against allowing the application to get a reference to a stack-internal buffer in the common case. I would be OK with adding a second 'expert API' for those who got to go fast.

I checked all the places where rsp is used in the host stack. There are 38 in total, and none of them blocks.

alwa-nordic · 2024-06-20T10:00:07Z

After 1cb83a8, there is no difference between bt_recv_prio and bt_recv; both use the same normal net_buf pool.

If you look at the implementation of bt_recv, for events with evt_flags & BT_HCI_EVT_FLAG_RECV, buf is still held by bt_dev.rx_queue until BT RX processing handles it, and it comes from the same net_buf pool.

From my perspective, you have identified a defect here. The Command Complete events can and should go in sync_evt_pool.

LingaoM · 2024-06-21T07:40:35Z

From my perspective, you have identified a defect here. The Command Complete events can and should go in sync_evt_pool.

#74645

zephyrbot added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels Jun 14, 2024

zephyrbot requested review from Thalley, alwa-nordic, hermabe, jhedberg, jori-nordic, sjanc and theob-pro June 14, 2024 07:14

zephyrbot assigned jori-nordic Jun 14, 2024

LingaoM force-pushed the fix_bt_send_failed branch from 9f2b545 to 4487985 Compare June 14, 2024 07:51

Bluetooth: Host: Fixed where bt_send returns an error but is actual succ

15f3699

The `sync` is a local variable in the stack space. Clearing this pointer explicitly before releasing it is a safer way. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

LingaoM force-pushed the fix_bt_send_failed branch 3 times, most recently from c92df8b to 7c2c4a0 Compare June 14, 2024 08:12

Thalley reviewed Jun 14, 2024

subsys/bluetooth/host/hci_core.c (outdated)

LingaoM force-pushed the fix_bt_send_failed branch from 7c2c4a0 to 6272e3c Compare June 14, 2024 08:28

Bluetooth: Host: Add rsp field to avoid deep-copy

271385e

Add `rsp` field to avoid deep-copy for every cmd. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

LingaoM force-pushed the fix_bt_send_failed branch from 6272e3c to c130755 Compare June 14, 2024 09:01

Thalley reviewed Jun 16, 2024

LingaoM force-pushed the fix_bt_send_failed branch from c130755 to a392245 Compare June 16, 2024 07:01

LingaoM force-pushed the fix_bt_send_failed branch 2 times, most recently from 7ff6d51 to 299c117 Compare June 16, 2024 07:17

blueooth: host: rename buf to send_cmd and evt_buf to buf

b2a66a8

Since `send_cmd` follows a request-response pattern, rename the buffers separately to make this clear. Signed-off-by: Lingao Meng <menglingao@xiaomi.com>

LingaoM force-pushed the fix_bt_send_failed branch from 299c117 to b2a66a8 Compare June 16, 2024 07:21

alwa-nordic added the Enhancement Changes/Updates/Additions to existing features label Jun 19, 2024

LingaoM commented Jun 20, 2024

LingaoM mentioned this pull request Jun 22, 2024

bluetooth: host: Split cmd complete & cmd status to separate pool #74645

Closed

alwa-nordic mentioned this pull request Jun 28, 2024

Bluetooth: Host: Don't give cmd buf as evt buf on send fail #74613

Closed

LingaoM closed this Jul 5, 2024


6 participants
