Bluetooth: conn: move auto-init procedures to system workqueue #77703

jori-nordic · 2024-08-28T13:13:02Z

conn_auto_initiate() starts a bunch of controller procedures (read: HCI
commands) that are fired off right after connection establishment.

Right now, it's called from the RX context, which is the same context where
resources (cmd & acl buffers) are freed. This is not ideal.

But the procedures are all async, so it should be fine to schedule this
function on the system workqueue, where we have less risk of deadlocks.

subsys/bluetooth/host/hci_core.c

Thalley

A few comments, but nothing that actually needs to be fixed especially since this is from moved code.

subsys/bluetooth/host/conn.c

jori-nordic · 2024-08-29T10:38:59Z

nothing that actually needs to be fixed

Will make another PR right after this one to address, scout's honor.

alwa-nordic · 2024-08-29T15:01:11Z

subsys/bluetooth/host/conn.c

+	}
+
+exit:
+	bt_conn_unref(conn);


This can potentially destroy the currently running k_work object. Is that ok?

good point. I have no idea if it's going to implode or not.

It looks in work.c like the work item is touched after the handler completes. Better safe than sorry. I suggest we make the work item global, we iterate over all connected conns in the handler, and have a flag that marks the work done on each conn.
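A minimal sketch of that proposal in plain C, not Zephyr code: one handler that is independent of any connection walks a fixed pool and uses a per-connection marker so the procedures run once per connection. Every name here (`model_conn`, `model_conns`, `model_auto_init_handler`) is hypothetical and only models the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the Zephyr types. */
enum model_state { MODEL_DISCONNECTED, MODEL_CONNECTED };

struct model_conn {
	enum model_state state;
	bool procedures_done; /* per-connection "work done" marker */
	int procedures_run;   /* counts how often the procedures ran */
};

#define MODEL_MAX_CONN 3
static struct model_conn model_conns[MODEL_MAX_CONN];

/* One handler shared by all connections. The work item that would run it
 * lives outside any connection object, so releasing a connection cannot
 * release the work item.
 */
static void model_auto_init_handler(void)
{
	for (size_t i = 0; i < MODEL_MAX_CONN; i++) {
		struct model_conn *conn = &model_conns[i];

		if (conn->state != MODEL_CONNECTED || conn->procedures_done) {
			continue;
		}

		conn->procedures_run++;
		conn->procedures_done = true;
	}
}
```

Running the handler twice over the same pool leaves each connected entry with exactly one run, which is the property the flag is there to provide.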

Looking at what the workqueue implementation does after calling the callback this may actually be a problem (it's still accessing parts of the work struct, like work->flags after it).

zephyr/kernel/work.c

Lines 688 to 703 in 40414f7

	handler(work);

	/* Mark the work item as no longer running and deal
	 * with any cancellation and flushing issued while it
	 * was running. Clear the BUSY flag and optionally
	 * yield to prevent starving other threads.
	 */
	key = k_spin_lock(&lock);

	flag_clear(&work->flags, K_WORK_RUNNING_BIT);
	if (flag_test(&work->flags, K_WORK_FLUSHING_BIT)) {
		finalize_flush_locked(work);
	}
	if (flag_test(&work->flags, K_WORK_CANCELING_BIT)) {
		finalize_cancel_locked(work);
	}
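To make the hazard concrete, here is a small plain-C model, assuming an object that embeds its own work item and a handler that may drop the last reference. Nothing is really freed; a `freed` marker stands in for the memory being reused. All names (`model_work`, `model_obj`, `model_run_once`) are hypothetical and this is not the kernel's workqueue code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_work {
	unsigned int flags;
	void (*handler)(struct model_work *work);
};

/* Hypothetical object that embeds its own work item. */
struct model_obj {
	int ref;
	bool freed; /* stands in for the memory being released and reused */
	struct model_work work;
};

#define MODEL_RUNNING 0x1u

static void model_unref(struct model_obj *obj)
{
	if (--obj->ref == 0) {
		obj->freed = true;
	}
}

/* Handler that gives back a reference, like the exit path under review. */
static void model_handler(struct model_work *work)
{
	struct model_obj *obj =
		(struct model_obj *)((char *)work - offsetof(struct model_obj, work));

	model_unref(obj);
}

/* Mimics the queue loop: it still touches work->flags after the handler
 * returns. Returns true when that access happens on a released object.
 */
static bool model_run_once(struct model_obj *obj)
{
	struct model_work *work = &obj->work;

	work->flags |= MODEL_RUNNING;
	work->handler(work);

	bool touched_after_free = obj->freed;

	work->flags &= ~MODEL_RUNNING;
	return touched_after_free;
}
```

In this model the problem only appears when the handler holds the last reference; with another reference outstanding the post-handler access is harmless, which is why the bug can stay hidden.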

I wonder if this is already an issue with deferred_work:

zephyr/subsys/bluetooth/host/conn.c

Lines 2069 to 2072 in 40414f7

	/* Release the reference we took for the very first
	 * state transition.
	 */
	bt_conn_unref(conn);

One solution could be to do "slow" freeing of connections, i.e. when the refcount drops to zero there's a separate (independent of any connection object) k_work that's responsible for really freeing the object.

Wouldn't that hinder advertisers from restarting connectable advertising even further?
Today we suggest using a k_work in the disconnected callback, but if the stack then also does a k_work before actually freeing the connection, then the application's k_work needs to be scheduled after the stack's k_work.

We probably need to address the root cause, which is the work item belonging to the struct bt_conn. I really don't like the deferred_work anyway. So refactoring deferred_work using @alwa-nordic's proposition seems appropriate.

Not really opposed to that either. It does sound a bit like we are starting to implement a tiny garbage collector if we have a work item that occasionally goes through all connection objects to finalize the freeing :D But you are correct that freeing an object that contains the work item that triggers the free is an issue.

I suggest to name it ZNGC Zis is Not a Garbage Collector 🚮🙊

ZNGC Zis is Not a Garbage Collector

Better make it "ZNGC: ZNGC is Not a Garbage Collector" for additional recursiveness

Thalley

Sounds like we need to fix further things in this PR

subsys/bluetooth/host/conn.c

Thalley · 2024-08-30T11:43:55Z

subsys/bluetooth/host/conn.c

+		if (conn->state != BT_CONN_CONNECTED) {
+			goto exit;
+		}


With the if (conn->state != BT_CONN_CONNECTED) { on line 1697, do we need this check for each procedure? Or can e.g. bt_hci_read_remote_version actually disconnect the connection?

idk. I'm not taking any chances in case of pre-emption by e.g. the controller thread sending a disconnected event, which is high-prio.

Thalley · 2024-08-30T11:44:32Z

subsys/bluetooth/host/conn.c

+			err = bt_hci_le_read_max_data_len(&tx_octets, &tx_time);
+			if (!err) {


No log message if bt_hci_le_read_max_data_len fails?

just copy-pasting stuff around

I was planning to have this PR be a single cherry-pickable commit. And have another one for logs and cosmetic changes, IS_ENABLED() etc..

Thalley · 2024-08-30T11:45:06Z

subsys/bluetooth/host/conn.c

+{
+	ARG_UNUSED(unused);
+
+	bt_conn_foreach(BT_CONN_TYPE_ALL, perform_auto_initiated_procedures, NULL);


Suggested change

-	bt_conn_foreach(BT_CONN_TYPE_ALL, perform_auto_initiated_procedures, NULL);
+	bt_conn_foreach(BT_CONN_TYPE_LE, perform_auto_initiated_procedures, NULL);

I guess we don't want to do these for ISO or classic connections

jhedberg · 2024-08-30T11:45:56Z

subsys/bluetooth/host/conn.c

+
+	LOG_DBG("[%p] Running auto-initiated procedures", conn);
+
+	if (atomic_test_and_set_bit(conn->flags, BT_CONN_AUTO_INIT_PROCEDURES_DONE)) {


Shouldn't this be atomic_test_and_clear_bit()?

Nevermind, I think it might be correct. To me it'd just be more intuitive to have a flag set together with the reference, and cleared when you do unref()

set should be correct

jhedberg · 2024-08-30T11:49:01Z

subsys/bluetooth/host/conn_internal.h

@@ -62,6 +62,7 @@ enum {
 	BT_CONN_BR_NOBOND,                    /* SSP no bond pairing tracker */
 	BT_CONN_BR_PAIRING_INITIATOR,         /* local host starts authentication */
 	BT_CONN_CLEANUP,                      /* Disconnected, pending cleanup */
+	BT_CONN_AUTO_INIT_PROCEDURES_DONE,    /* Auto-initiated procedures have been done */


I think it might be more intuitive if you reversed this, i.e. call it AUTO_INIT_PROCEDURES_PENDING. That way the reference is tied to something more tangible (the flag being set).

I did this to distinguish between a freshly-memset connection and a connection that has had this procedure performed. I can swap if you really want to.

I think verifying the correctness of the reference counting would be clearer if it were reversed. Normally reference counts are tied to actual pointer variables, but that's not the case here. If you can see the flag being set when you do ref() and a test_and_clear() when you do unref(), it's IMO more obvious what the reference count is tied to, i.e. something like:

bt_conn_ref(conn); set_bit();

and:

if (test_and_clear_bit()) { bt_conn_unref(); }
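Spelled out with C11 atomics, that pairing could look like the sketch below. It is a model only: `model_conn`, `model_schedule` and `model_finish` are hypothetical names, and `atomic_exchange` is used here as a stand-in for a test-and-clear operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection object; not struct bt_conn. */
struct model_conn {
	atomic_int ref;
	atomic_bool procedures_pending;
};

/* Take the reference and set the flag together, so the reference is
 * visibly tied to the flag being set.
 */
static void model_schedule(struct model_conn *conn)
{
	atomic_fetch_add(&conn->ref, 1);
	atomic_store(&conn->procedures_pending, true);
}

/* Give the reference back only if the flag was still set.
 * atomic_exchange returns the previous value, i.e. test-and-clear.
 */
static void model_finish(struct model_conn *conn)
{
	if (atomic_exchange(&conn->procedures_pending, false)) {
		atomic_fetch_sub(&conn->ref, 1);
	}
}
```

Calling `model_finish()` a second time is a no-op in this model, because the flag has already been cleared and the reference is not dropped twice.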

aiight, i'll change it.

jhedberg · 2024-08-30T13:23:55Z

subsys/bluetooth/host/conn.c

+	LOG_DBG("[%p] Successfully ran auto-initiated procedures", conn);
+
+exit:
+	CHECKIF(!atomic_test_and_clear_bit(conn->flags, BT_CONN_AUTO_INIT_PROCEDURES_PENDING)) {


I think the same rule applies for CHECKIF as for ASSERT, i.e. you should never put functionally significant actions inside it. E.g. if CONFIG_NO_RUNTIME_CHECKS is set then the test_and_clear_bit() would never get called, which is probably not what you want.

right. forgot about that option.
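A small model of why this matters, in plain C. The two macros below are hypothetical stand-ins for a check macro with runtime checks enabled and disabled; they are written for this example and are not copied from Zephyr's headers. With the disabled variant the condition expression is never evaluated, so a test-and-clear placed inside it silently stops happening.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical models of a check macro in its two configurations. */
#define MODEL_CHECKIF_ENABLED(expr) if (expr)
#define MODEL_CHECKIF_DISABLED(...) if (0)

static bool model_pending;

static bool model_test_and_clear(void)
{
	bool was_set = model_pending;

	model_pending = false;
	return was_set;
}

/* Risky: the clear lives inside the check and is compiled out.
 * Returns the flag value left behind.
 */
static bool model_clear_inside_check(void)
{
	model_pending = true;
	MODEL_CHECKIF_DISABLED(!model_test_and_clear()) {
		return model_pending;
	}
	return model_pending;
}

/* Safer: do the clear unconditionally, then check only its result. */
static bool model_clear_outside_check(void)
{
	model_pending = true;

	bool was_set = model_test_and_clear();

	(void)was_set; /* unused once the check is compiled out */
	MODEL_CHECKIF_DISABLED(!was_set) {
		return model_pending;
	}
	return model_pending;
}
```

In the first function the flag stays set after the call; in the second it is cleared regardless of how the check macro is configured.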

alwa-nordic · 2024-08-30T15:33:16Z

subsys/bluetooth/host/conn.c

+	 * connection flag. The reference will be given back the moment that
+	 * flag is set.
+	 */
+	atomic_set_bit(bt_conn_ref(conn)->flags, BT_CONN_AUTO_INIT_PROCEDURES_PENDING);


Taking the ref here is unnecessary. bt_conn_foreach iterates over live connection objects and lends a reference to the loop body function.

In general, if a function gets a bt_conn in a parameter, the caller shall guarantee it lives until the function returns.
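As an illustration of that lending pattern, here is a plain-C model of an iterator that holds its own reference around each callback. It is a sketch under the assumption described above, with hypothetical names (`model_foreach`, `model_pool`, `model_record_ref`), not the actual `bt_conn_foreach` implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical connection slot; ref == 0 means the slot is unused. */
struct model_conn {
	int ref;
};

#define MODEL_MAX_CONN 2
static struct model_conn model_pool[MODEL_MAX_CONN];
static int model_min_ref_seen;

static void model_foreach(void (*func)(struct model_conn *conn))
{
	for (size_t i = 0; i < MODEL_MAX_CONN; i++) {
		struct model_conn *conn = &model_pool[i];

		if (conn->ref == 0) {
			continue; /* skip unused slots */
		}

		conn->ref++; /* the iterator's own reference, lent to func */
		func(conn);
		conn->ref--; /* given back once func has returned */
	}
}

/* Loop body that records the lowest refcount it observed. */
static void model_record_ref(struct model_conn *conn)
{
	if (conn->ref < model_min_ref_seen) {
		model_min_ref_seen = conn->ref;
	}
}
```

Inside the callback the count is always at least one higher than the caller's own, so the body can use the object without taking a further reference for the duration of the call.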

alwa-nordic · 2024-08-30T15:34:32Z

subsys/bluetooth/host/conn.c

+	 * connection flag. The reference will be given back the moment that
+	 * flag is set.
+	 */
+	atomic_set_bit(bt_conn_ref(conn)->flags, BT_CONN_AUTO_INIT_PROCEDURES_PENDING);


This flag is arguably redundant. I think it's functionally equivalent to conn.state == CONNECTED && !procedures_done.

Thalley · 2024-09-02T08:45:22Z

subsys/bluetooth/host/conn.c

+static void schedule_auto_initiated_procedures(struct bt_conn *conn)
+{
+	LOG_DBG("[%p] Scheduling auto-init procedures", conn);
+	k_work_submit(&procedures_on_connect);
+}


Suggested change

-static void schedule_auto_initiated_procedures(struct bt_conn *conn)
-{
-	LOG_DBG("[%p] Scheduling auto-init procedures", conn);
-	k_work_submit(&procedures_on_connect);
-}
+static void schedule_auto_initiated_procedures(void)
+{
+	LOG_DBG("Scheduling auto-init procedures");
+	k_work_submit(&procedures_on_connect);
+}

Since the k_work is no longer in the conn object, I suggest removing the conn parameter from the function. When LOG_DBG is not enabled, it was an unused argument anyhow.

I think we should still have it, as it is the start of an async operation, and the log in perform_auto_initiated_procedures only notifies us of the execution of that async operation.

If you really don't want it, feel free to NAK and I'll remove.

Not a huge deal and it may be useful :)

Commit message:

`conn_auto_initiate()` starts a bunch of controller procedures (read: HCI commands) that are fired off right after connection establishment.

Right now, it's called from the RX context, which is the same context where resources (cmd & acl buffers) are freed. This is not ideal.

But the procedures are all async, so it should be fine to schedule this function on the system workqueue, where we have less risk of deadlocks.

Signed-off-by: Jonathan Rico <jonathan.rico@nordicsemi.no>

jori-nordic · 2024-09-09T12:56:03Z

rebased to fix conflict in hci_core.c

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 2c2a47f to 8bbe833 on August 28, 2024 15:23

jori-nordic changed the title from "[wip] Bluetooth: conn: move auto-init procedures to syswq" to "Bluetooth: conn: move auto-init procedures to system workqueue" on Aug 28, 2024

jori-nordic marked this pull request as ready for review on August 28, 2024 15:24

zephyrbot added the "area: Bluetooth Host" and "area: Bluetooth" labels on Aug 28, 2024

zephyrbot requested review from alwa-nordic, hermabe, jhedberg, sjanc, Thalley and theob-pro on August 28, 2024 15:25

zephyrbot assigned jori-nordic and jhedberg on Aug 28, 2024

jori-nordic mentioned this pull request on Aug 28, 2024: bluetooth: Behavior change in host causing MESH to fail on sending messages #77241 (Open)

jhedberg reviewed on Aug 28, 2024 (2 outdated comments on subsys/bluetooth/host/hci_core.c)

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 8bbe833 to 958b700 on August 29, 2024 06:00

jhedberg previously approved these changes on Aug 29, 2024

jori-nordic requested a review from PavelVPV on August 29, 2024 07:36

Thalley approved these changes on Aug 29, 2024 (3 outdated comments on subsys/bluetooth/host/conn.c)

alwa-nordic reviewed on Aug 29, 2024

Thalley requested changes on Aug 30, 2024

jori-nordic dismissed jhedberg's stale review via a96f483 on August 30, 2024 10:58

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch 2 times, most recently from a96f483 to a95c8bb, on August 30, 2024 10:59

jori-nordic requested review from jhedberg, Thalley and alwa-nordic on August 30, 2024 11:00

jhedberg reviewed on Aug 30, 2024 (1 outdated comment on subsys/bluetooth/host/conn.c)

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from a95c8bb to 7013f8f on August 30, 2024 11:41

Thalley reviewed on Aug 30, 2024

jhedberg reviewed on Aug 30, 2024

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 7013f8f to e0fa3af on August 30, 2024 12:35

jhedberg reviewed on Aug 30, 2024

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from e0fa3af to adf1895 on August 30, 2024 15:09

alwa-nordic reviewed on Aug 30, 2024

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from adf1895 to 85b8ac3 on August 30, 2024 15:39

jhedberg previously approved these changes on Aug 30, 2024

Thalley reviewed on Sep 2, 2024

jori-nordic dismissed jhedberg's stale review via b975ee8 on September 9, 2024 12:55

jori-nordic force-pushed the move-conn_auto_initiate-to-syswq branch from 85b8ac3 to b975ee8 on September 9, 2024 12:55

jori-nordic requested a review from Thalley on September 9, 2024 12:56

jhedberg approved these changes on Sep 9, 2024

Thalley approved these changes on Sep 9, 2024

nashif merged commit 01354c0 into zephyrproject-rtos:main on Sep 9, 2024
26 checks passed
