Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bluetooth: rfcomm: fix issue of sending buf invalid #73113

Merged

Conversation

lylezhu2012
Copy link
Contributor

@lylezhu2012 lylezhu2012 commented May 22, 2024

There is a change (commit no.: 93d0eac) that if the ref of sending buf is not 1, the error code -EINVAL will be returned from bt_conn_send_cb.

It causes the RFCOMM functionality cannot work properly.

Remove the ref operation from the buf to be sent to fix the issue.

Due to the sending buf cannot be referred by sending layer.
It is unsafe to use buf identification for a transmission, because the buf may have been newly transmitted when the
sent callback is triggered.

Now instead, when the send completion callback is received, the upper layer is notified that a transfer is completed.
If multiple bufs are sent at the same time, there is no guarantee which buf is completed when the sent callback triggered. Therefore, it is recommended that the caller transfers the next data block after the previous transfer is completed.

HFP AG: Due to the sent callback of RFCOMM is changed, the sending buf need to be primed waiting for the previous one to be completed. Add a worker for this purpose.

HFP Unit: Due to the parameter buf has been removed by rfcomm, update the prototype of channel sent callback hfp_hf_sent.

@jhedberg
Copy link
Member

@lylezhu2012 seems like the issue was a reference leak, i.e. bt_rfcomm_dlc_send() was always supposed to move the reference rather than taking a new one. I think ideally it should be taking a struct net_buf **buf as a parameter and setting the original pointer to NULL to make the reference move clear. Btw, this also relates to #72798.

At a bare minimum I think it'd be important to accurately describe the bug that you're fixing. Now it sounds like you're working around an issue introduced by commit 93d0eac, but that's actually not the case - the reference leak existed even before that commit.

Copy link
Member

@jhedberg jhedberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lylezhu2012 also, the documentation for `bt_rfcomm_dlc_send() is missing the information bout the API moving the reference. I think that's a potential source of future bugs, so please add the missing information to the documentation as well.

@lylezhu2012
Copy link
Contributor Author

@lylezhu2012 seems like the issue was a reference leak, i.e. bt_rfcomm_dlc_send() was always supposed to move the reference rather than taking a new one. I think ideally it should be taking a struct net_buf **buf as a parameter and setting the original pointer to NULL to make the reference move clear. Btw, this also relates to #72798.

At a bare minimum I think it'd be important to accurately describe the bug that you're fixing. Now it sounds like you're working around an issue introduced by commit 93d0eac, but that's actually not the case - the reference leak existed even before that commit.

Since conn has a TX pending queue, I want to leverage this queue, so all sending bufs are pending in the TX pending queue of the conn layer, so that the upper layer does not need to implement such a queue by itself. To achieve this, I ref the buf that needs to be sent in rfcomm. When the sending is completed, notify the upper layer which buf is sent and then unref this buf in RFCOMM layer.

It seems inappropriate to do so at the moment. I cache all the bufs that need to be sent by the upper layer of RFCOMM and send them one by one. At the same time the sending buf need to be primed waiting for the previous one to be completed.

@lylezhu2012
Copy link
Contributor Author

@lylezhu2012 also, the documentation for `bt_rfcomm_dlc_send() is missing the information bout the API moving the reference. I think that's a potential source of future bugs, so please add the missing information to the documentation as well.

OK. But I plan to revert it in this PR. Do not update the reference of buf.

@lylezhu2012 lylezhu2012 force-pushed the fix_issue_of_sending_buf_invalid branch from 0e44f90 to d8bf3ea Compare May 22, 2024 13:31
@lylezhu2012 lylezhu2012 requested a review from jhedberg May 22, 2024 13:32
@jori-nordic jori-nordic assigned jhedberg and unassigned jori-nordic May 23, 2024
@jori-nordic jori-nordic removed their request for review May 23, 2024 13:01
@lylezhu2012 lylezhu2012 force-pushed the fix_issue_of_sending_buf_invalid branch from d8bf3ea to 4eaa6b3 Compare May 28, 2024 11:24
@lylezhu2012 lylezhu2012 requested a review from jhedberg May 28, 2024 11:25
@Thalley Thalley removed their request for review May 29, 2024 15:47
@lylezhu2012
Copy link
Contributor Author

@jhedberg , @sjanc , @alwa-nordic , Could you please take a look at it.

@dleach02
Copy link
Member

dleach02 commented Jun 4, 2024

@jhedberg can you review to determine if your change request is satisfied.

There is a change (commit no.: 93d0eac) that if
the `ref` of sending buf is not 1, the error code
`-EINVAL` will be returned from bt_conn_send_cb.

It causes the RFCOMM functionality cannot work
properly.

Remove the ref operation from the buf to be sent
to fix the issue.

Signed-off-by: Lyle Zhu <lyle.zhu@nxp.com>
Due to the sending buf cannot be referred
by sending layer.
It is unsafe to use `buf` identification
for a transmission, because the buf may
have been newly transmitted when the
sent callback is triggered.

Now instead, when the send completion
callback is received, the upper layer
is notified that a transfer is completed.
If multiple bufs are sent at the same
time, there is no guarantee which buf
is completed when the sent callback
triggered. Therefore, it is recommended
that the caller transfers the next data
block after the previous transfer is
completed.

Signed-off-by: Lyle Zhu <lyle.zhu@nxp.com>
Due to the parameter `buf` has been removed by rfcomm,
update the prototype of channel sent callback hfp_hf_sent.

Signed-off-by: Lyle Zhu <lyle.zhu@nxp.com>
Due to the sent callback of RFCOMM is changed, the
sending buf need to be primed waiting for the
previous one to be completed. Add a worker for
this purpose.

Signed-off-by: Lyle Zhu <lyle.zhu@nxp.com>
When testing zephyrproject-rtos#72090,
there is an issue found.

The change in the previous commit is to put all data sending
operations into the work queue context, and lock the current
AG before sending data.

And in change of zephyrproject-rtos#72090, the HCI TX thread is removed. All
sending sequence are happened in work queue context.

There is a possible problem when AG creates a SCO connection
by calling the function bt_conn_create_sco. Before the
function bt_conn_create_sco is called, AG will be locked to
avoid creating repeated SCO connection.
And the execution of the function bt_conn_create_sco
depends on the work queue. Because the HCI command of
function bt_conn_create_sco is sent in work queue context.

In the normal case, there is not any issue.
But there is a case that when the function
bt_conn_create_sco is being executed, there is a pending AG
TX waiting to be executed.

Once the work queue starts executing the handler, the AG TX
handler is executed first. Since the lock has been acquired
by other threads, the AG TX handler cannot acquire the lock.
As a result, the SCO connection creation fails.

Remove the AG lock from SCO creating. Instead, use a flag
to mark whether a SCO connection is be created.

Signed-off-by: Lyle Zhu <lyle.zhu@nxp.com>
Give `err` a initialization value 0.

Signed-off-by: Lyle Zhu <lyle.zhu@nxp.com>
@lylezhu2012 lylezhu2012 force-pushed the fix_issue_of_sending_buf_invalid branch from 4eaa6b3 to b68f5d8 Compare June 11, 2024 09:54
@lylezhu2012
Copy link
Contributor Author

When testing #72090, there is an issue found.

The change in the previous commit is to put all data sending operations into the work queue context, and lock the current
AG before sending data.

And in change of #72090, the HCI TX thread is removed. All sending sequence are happened in work queue context.

There is a possible problem when AG creates a SCO connection by calling the function bt_conn_create_sco. Before the
function bt_conn_create_sco is called, AG will be locked to avoid creating repeated SCO connection.
And the execution of the function bt_conn_create_sco depends on the work queue. Because the HCI command of
function bt_conn_create_sco is sent in work queue context.

In the normal case, there is not any issue.

But there is a case that when the function bt_conn_create_sco is being executed, there is a pending AG TX waiting to be executed.

Once the work queue starts executing the handler, the AG TX handler is executed first. Since the lock has been acquired
by other threads, the AG TX handler cannot acquire the lock. As a result, the SCO connection creation fails.

Remove the AG lock from SCO creating. Instead, use a flag to mark whether a SCO connection is be created.

subsys/bluetooth/host/classic/hfp_ag.c Outdated Show resolved Hide resolved
@lylezhu2012
Copy link
Contributor Author

@dleach02 , @carlescufi , now there is only one reviewer approves the PR. Is there anyone else who can help review this PR? Currently, handsfree and handsfree_ag based on RFCOMM cannot work properly.

@lylezhu2012 lylezhu2012 reopened this Jun 12, 2024
@jhedberg jhedberg requested a review from dleach02 June 12, 2024 11:06
@nashif nashif merged commit 6458c5a into zephyrproject-rtos:main Jun 13, 2024
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants