Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: conn: move auto-init procedures to system workqueue #77703

Merged

Conversation

jori-nordic
Copy link
Contributor

@jori-nordic jori-nordic commented Aug 28, 2024

conn_auto_initiate() starts a bunch of controller procedures (read: HCI
commands) that are fired off right after connection establishment.

Right now, it's called from the RX context, which is the same context where
resources (cmd & acl buffers) are freed. This not ideal.

But the procedures are all async, so it should be fine to schedule this
function on the system workqueue, where we have less risk of deadlocks.

@jori-nordic jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 2c2a47f to 8bbe833 Compare August 28, 2024 15:23
@jori-nordic jori-nordic changed the title [wip] Bluetooth: conn: move auto-init procedures to syswq Bluetooth: conn: move auto-init procedures to system workqueue Aug 28, 2024
@jori-nordic jori-nordic marked this pull request as ready for review August 28, 2024 15:24
subsys/bluetooth/host/hci_core.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/hci_core.c Outdated Show resolved Hide resolved
@jori-nordic jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 8bbe833 to 958b700 Compare August 29, 2024 06:00
jhedberg
jhedberg previously approved these changes Aug 29, 2024
Copy link
Collaborator

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but nothing that actually needs to be fixed especially since this is from moved code.

subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
subsys/bluetooth/host/conn.c Outdated Show resolved Hide resolved
@jori-nordic
Copy link
Contributor Author

nothing that actually needs to be fixed

Will make another PR right after this one to address, scout's honor.

}

exit:
bt_conn_unref(conn);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can potentially destroy the currently running k_work object. Is that ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I have no idea if it's going to implode or not.

Copy link
Collaborator

@alwa-nordic alwa-nordic Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks in work.c like the work item is touched after the handler completes. Better safe than sorry. I suggest we make the work item global, we iterate over all connected conns in the handler, and have a flag that marks the work done on each conn.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at what the workqueue implementation does after calling the callback this may actually be a problem (it's still accessing parts of the work struct, like work->flags after it).

zephyr/kernel/work.c

Lines 688 to 703 in 40414f7

handler(work);
/* Mark the work item as no longer running and deal
* with any cancellation and flushing issued while it
* was running. Clear the BUSY flag and optionally
* yield to prevent starving other threads.
*/
key = k_spin_lock(&lock);
flag_clear(&work->flags, K_WORK_RUNNING_BIT);
if (flag_test(&work->flags, K_WORK_FLUSHING_BIT)) {
finalize_flush_locked(work);
}
if (flag_test(&work->flags, K_WORK_CANCELING_BIT)) {
finalize_cancel_locked(work);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is already an issue with deferred_work:

/* Release the reference we took for the very first
* state transition.
*/
bt_conn_unref(conn);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One solution could be to do "slow" freeing of connections, i.e. when the refcount drops to zero there's a separate (independent of any connection object) k_work that's responsible for really freeing the object.

Wouldn't that hinder advertisers to restart connectable advertising even further?
Today we suggest to use a k_work in the disconnected callback, but if the stack then also does a k_work before actually freeing the connection, then the application's k_work needs to be schedules after the stack's k_work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need to address the rootcause, which is the work item belonging to the struct bt_conn. I really don't like the deferred_work anyways. So refactoring deferred_work using @alwa-nordic 's proposition seems appropriate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need to address the rootcause, which is the work item belonging to the struct bt_conn. I really don't like the deferred_work anyways. So refactoring deferred_work using @alwa-nordic 's proposition seems appropriate.

Not really opposed to that either. It does sound a bit like we are starting to implement a tiny garbage collector if we have a work item that occasionally goes through all connection objects to finalize the free'ing :D But you are correct that free'ing an object that contains the work items that triggers the free is an issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to name it ZNGC Zis is Not a Garbage Collector 🚮🙊

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZNGC Zis is Not a Garbage Collector

Better make it "ZNGC: ZNGC is Not a Garbage Collector" for additional recursiveness

Copy link
Collaborator

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we need to fix further things in this PR

@jori-nordic jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from a95c8bb to 7013f8f Compare August 30, 2024 11:41
Comment on lines 1723 to 1726
if (conn->state != BT_CONN_CONNECTED) {
goto exit;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the if (conn->state != BT_CONN_CONNECTED) { on line 1697, do we need this check for each procedure? Or can e.g. bt_hci_read_remote_version actually disconnect the connection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk. I'm not taking any chances in case of pre-emption by e.g. the controller thread sending a disconnected event, which is high-prio.

Comment on lines +1747 to +1749
err = bt_hci_le_read_max_data_len(&tx_octets, &tx_time);
if (!err) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No log message if bt_hci_le_read_max_data_len fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just copy-pasting stuff around

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to have this PR be a single cherry-pickable commit. And have another one for logs and cosmetic changes, IS_ENABLED() etc..

{
ARG_UNUSED(unused);

bt_conn_foreach(BT_CONN_TYPE_ALL, perform_auto_initiated_procedures, NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bt_conn_foreach(BT_CONN_TYPE_ALL, perform_auto_initiated_procedures, NULL);
bt_conn_foreach(BT_CONN_TYPE_LE, perform_auto_initiated_procedures, NULL);

I guess we don't want to do these for ISO or classic connections


LOG_DBG("[%p] Running auto-initiated procedures", conn);

if (atomic_test_and_set_bit(conn->flags, BT_CONN_AUTO_INIT_PROCEDURES_DONE)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be atomic_test_and_clear_bit()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I think it might be correct. To me it'd just be more intuitive to have a flag set together with the reference, and cleared when you do unref()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set should be correct

@@ -62,6 +62,7 @@ enum {
BT_CONN_BR_NOBOND, /* SSP no bond pairing tracker */
BT_CONN_BR_PAIRING_INITIATOR, /* local host starts authentication */
BT_CONN_CLEANUP, /* Disconnected, pending cleanup */
BT_CONN_AUTO_INIT_PROCEDURES_DONE, /* Auto-initiated procedures have been done */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be more intuitive if you reversed this, i.e. call it AUTO_INIT_PROCEDURES_PENDING. That way the reference is tied to something more tangible (the flag being set).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this to distinguish between a freshly-memset connection and a connection that has had this procedure performed. I can swap if you really want to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think verifying the correctness of the reference counting becomes clearer if it was reversed. Normally reference counts are tied to actual pointer variables, but that's not the case here. If you can see a setting of the flag when you do ref() and a test_and_clear() when you do unref() it's IMO more obvious to what the reference count is tied, i.e. something like:

bt_conn_ref(conn);
set_bit();

and:

if (test_and_clear_bit()) {
	bt_conn_unref();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aiight, i'll change it.

@jori-nordic jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 7013f8f to e0fa3af Compare August 30, 2024 12:35
LOG_DBG("[%p] Successfully ran auto-initiated procedures", conn);

exit:
CHECKIF(!atomic_test_and_clear_bit(conn->flags, BT_CONN_AUTO_INIT_PROCEDURES_PENDING)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same rule applies for CHECKIF as for ASSERT, i.e. you should never put functionally significant actions inside it. E.g. if CONFIG_NO_RUNTIME_CHECKS is set then the test_and_clear_bit() would never get called, which is probably not what you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. forgot about that option.

@jori-nordic jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from e0fa3af to adf1895 Compare August 30, 2024 15:09
* connection flag. The reference will be given back the moment that
* flag is set.
*/
atomic_set_bit(bt_conn_ref(conn)->flags, BT_CONN_AUTO_INIT_PROCEDURES_PENDING);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking the ref here is unnecessary. bt_conn_foreach iterates over live connection objects and lends a reference to the loop body function.

In general, if a function gets a bt_conn in a parameter, the caller shall guarantee it lives until the function returns.

* connection flag. The reference will be given back the moment that
* flag is set.
*/
atomic_set_bit(bt_conn_ref(conn)->flags, BT_CONN_AUTO_INIT_PROCEDURES_PENDING);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is arguably redundant. I think it's functionally equivalent to conn.state == CONNECTED && !procedures_done.

@jori-nordic jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from adf1895 to 85b8ac3 Compare August 30, 2024 15:39
jhedberg
jhedberg previously approved these changes Aug 30, 2024
Comment on lines +1781 to +1785
static void schedule_auto_initiated_procedures(struct bt_conn *conn)
{
LOG_DBG("[%p] Scheduling auto-init procedures", conn);
k_work_submit(&procedures_on_connect);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static void schedule_auto_initiated_procedures(struct bt_conn *conn)
{
LOG_DBG("[%p] Scheduling auto-init procedures", conn);
k_work_submit(&procedures_on_connect);
}
static void schedule_auto_initiated_procedures(void)
{
LOG_DBG("Scheduling auto-init procedures");
k_work_submit(&procedures_on_connect);
}

Since the k_work is no longer in the conn object, suggest to remove it from the function. In the case of LOG_DBG not being enabled, it was a unused argument anyhow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still have it. As it is the start of an async operation and the log in perform_auto_initiated_procedures only notify us of the execution of that async operation.

If you really don't want it, feel free to NAK and I'll remove.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal and it may be useful :)

`conn_auto_initiate()` starts a bunch of controller procedures (read: HCI
commands) that are fired off right after connection establishment.

Right now, it's called from the RX context, which is the same context where
resources (cmd & acl buffers) are freed. This not ideal.

But the procedures are all async, so it should be fine to schedule this
function on the system workqueue, where we have less risk of deadlocks.

Signed-off-by: Jonathan Rico <jonathan.rico@nordicsemi.no>
@jori-nordic
Copy link
Contributor Author

rebased to fix conflict in hci_core.c

@nashif nashif merged commit 01354c0 into zephyrproject-rtos:main Sep 9, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants