
LwM2M lwm2m_rd_client_stop() not working when called during bootstrapping/registration #37170

Closed
MaikVermeulen opened this issue Jul 23, 2021 · 5 comments · Fixed by #37393
Labels: area: LWM2M · bug (The issue is a bug, or the PR is fixing a bug) · priority: low (Low impact/importance bug)

@MaikVermeulen (Contributor)

Describe the bug
When external logic calls lwm2m_rd_client_stop() while the client is still bootstrapping or registering, the call often (though not always) fails to stop it:

[00:00:20.579,833] net_lwm2m_rd_client: Bootstrap started with endpoint 'urn:cust:ORBITZ-89882390000226837758' with client lifetime 0
[00:00:20.580,169] Modem: Found free socket: 0
[00:00:20.580,200] Modem: Socket created. <ID, FD> : <0, 1>
[00:00:21.073,059] net_lwm2m_rd_client: Stop LWM2M Client: urn:cust:ORBITZ-89882390000226837758
[00:00:21.170,623] Modem: Processing modem command: Socket open
[00:00:21.170,715] Modem: Sending: AT+QIOPEN=1,0,"UDP","10.113.2.99",5684,0,0
[00:00:21.310,760] Modem: Received response: OK
[00:00:21.310,791] Modem: Wait for socket open URC now
[00:00:21.310,821] Modem: Setting timeout for socket 0
[00:00:21.410,980] Modem: Received URC: +QIOPEN: 0,0
[00:00:21.688,903] <err> net_lwm2m_engine: Failed to send packet, err 9
[00:00:21.688,934] <err> net_lwm2m_rd_client: Error sending LWM2M packet (err:-9).
[00:00:21.688,934] <err> net_lwm2m_rd_client: Bootstrap registration err: -9
[00:00:21.789,154] Modem: Processing modem command: Socket close
[00:00:21.789,154] Modem: Closing socket: 0
[00:00:21.789,245] Modem: Sending: AT+QICLOSE=0,0
[00:00:21.911,590] Modem: Received response: OK
[00:00:21.911,621] Modem: Socket 0 closed
[00:00:21.911,621] Modem: Freeing socket 0
[00:00:22.189,239] net_lwm2m_rd_client: Bootstrap started with endpoint 'urn:cust:ORBITZ-89882390000226837758' with client lifetime 0

To Reproduce
See above log, but in short:

  1. Use a Zephyr version akin to Nordic sdk-zephyr v2.4.99-ncs2 (I checked the more recent git history but don't see anything directly addressing this issue; we currently cannot update to the async socket version yet)
  2. Start RD Client
  3. Call lwm2m_rd_client_stop() while the RD client is still bootstrapping/registering (preferably ASAP)
  4. Observe that bootstrapping/registration restarts; I believe this happens via sm_do_network_error()

Expected behavior
The RD client should stop completely (and remain stopped until explicitly started again).

Impact
Data communication occurs where we neither expect nor want it, which also increases power consumption, among other effects.

@MaikVermeulen MaikVermeulen added the bug The issue is a bug, or the PR is fixing a bug label Jul 23, 2021
@carlescufi carlescufi added area: LWM2M priority: low Low impact/importance bug labels Jul 27, 2021
@rlubos (Contributor) commented Jul 28, 2021

I was not able to reproduce this with QEMU, but I can imagine the issue occurring on slower platforms (I'm not sure which modem this is, but judging by the timestamps, it takes its time processing send()).

What I think happens in this case is that when lwm2m_rd_client_stop() was called, the LwM2M thread was still processing sm_do_bootstrap_reg()/sm_do_register(). Since lwm2m_rd_client_stop() simply closes the LwM2M context (https://github.com/nrfconnect/sdk-zephyr/blob/v2.4.99-ncs2/subsys/net/lib/lwm2m/lwm2m_rd_client.c#L1052), the ongoing/pending send() fails, and the fault handling routine (https://github.com/nrfconnect/sdk-zephyr/blob/v2.4.99-ncs2/subsys/net/lib/lwm2m/lwm2m_rd_client.c#L837) overrides the ENGINE_IDLE state. This leads to a restart of the bootstrap/registration process.

I think the most reasonable way to handle this would be to add mutex protection to the LwM2M state machine. The LwM2M engine should lock the mutex while processing lwm2m_rd_client_service(), and the same mutex should protect the lwm2m_rd_client_start()/stop() API calls. With this solution, lwm2m_rd_client_stop() would simply wait for the state machine to finish processing, removing the risk of the actual state being overwritten.

Alternatively, we could add extra logic to the fault handling routines to check whether the "IDLE" state was set during processing, but I think the mutex approach is a much cleaner solution.
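A minimal sketch of what the mutex approach could look like, assuming a file-scope mutex inside lwm2m_rd_client.c (the mutex name and the abbreviated function bodies here are illustrative, not a final patch):

```c
#include <zephyr.h>
#include <net/lwm2m.h>

/* Hypothetical sketch: serialize the RD client state machine
 * against the public start/stop API with one file-scope mutex. */
static K_MUTEX_DEFINE(client_mutex);

static void lwm2m_rd_client_service(void)
{
	k_mutex_lock(&client_mutex, K_FOREVER);

	/* ... existing dispatch on the current engine state, e.g.
	 * sm_do_bootstrap_reg()/sm_do_register(), runs with the
	 * lock held ... */

	k_mutex_unlock(&client_mutex);
}

void lwm2m_rd_client_stop(struct lwm2m_ctx *client_ctx,
			  lwm2m_ctx_event_cb_t event_cb)
{
	/* Blocks until the service loop has finished processing the
	 * current state, so the IDLE state set here can no longer be
	 * overwritten by the fault handling routine. */
	k_mutex_lock(&client_mutex, K_FOREVER);

	/* ... close the context and set ENGINE_IDLE as before ... */

	k_mutex_unlock(&client_mutex);
}
```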

@MaikVermeulen (Contributor, Author)

Thanks @rlubos for the thorough evaluation and trying to reproduce it.
Your reasoning about the cause seems right. We have lowered the modem thread's priority to below that of the LwM2M engine, and that seems to be a working workaround.

The mutex solution indeed sounds nice. If you want, I can give it a shot.
I hope to find some time for it before I go on holiday in 2 weeks.

@rlubos (Contributor) commented Jul 29, 2021

The mutex solution indeed sounds nice. If you want, I can give it a shot.

That would be nice, thanks. It should be a simple change: adding a mutex lock/unlock at the beginning/end of the aforementioned functions should be enough (unless I'm missing something).

@MaikVermeulen (Contributor, Author) commented Jul 30, 2021

The mutex solution indeed sounds nice. If you want, I can give it a shot.

That would be nice, thanks. It should be a simple change: adding a mutex lock/unlock at the beginning/end of the aforementioned functions should be enough (unless I'm missing something).

Looked at it, and indeed it would be simple. There is even /* TODO: add locking? */ in some places 🙂.

TL;DR:
All mutations happen (indirectly) in socket_loop(), which calls all functions operating on the state machine, such as lwm2m_rd_client_service(); the only exceptions are, of course, lwm2m_rd_client_start() and lwm2m_rd_client_stop(). In summary: you did not miss anything 😃.

I did notice that while the state is only ever set via set_sm_state(), client.engine_state is read directly in quite a few places, whereas e.g. do_deregister_reply_cb() uses get_sm_state() instead. I think we have two options for where to place the locks:

  1. Like you said, in lwm2m_rd_client_service(), lwm2m_rd_client_start(), lwm2m_rd_client_stop().
  2. In set_sm_state() and get_sm_state().

Personally, I think option 2 is safest, as it doesn't lean on the assumption that most mutations are done by a single thread running socket_loop(). On the other hand, perhaps it's overkill, and we would be locking/unlocking a lot in places where we know it currently isn't a problem.
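For illustration, option 2 would confine the locking to the accessors, roughly like this (a hypothetical sketch reusing the same client_mutex as above; the real accessors do more than shown here):

```c
/* Option 2 sketch: make each individual state access atomic. */
static void set_sm_state(uint8_t sm_state)
{
	k_mutex_lock(&client_mutex, K_FOREVER);
	client.engine_state = sm_state;
	/* ... existing event mapping and notification logic ... */
	k_mutex_unlock(&client_mutex);
}

static uint8_t get_sm_state(void)
{
	k_mutex_lock(&client_mutex, K_FOREVER);
	uint8_t state = client.engine_state;

	k_mutex_unlock(&client_mutex);
	return state;
}
```

Note that this only makes the individual reads and writes atomic; the lock is released between state transitions.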

What do you think?

Background:
To be thorough, and to learn more about the Zephyr LwM2M stack, I checked all the places where client.engine_state, its setter set_sm_state(), and its getter get_sm_state() are used.
Access indeed happens in lwm2m_rd_client_service() (which calls the sm_do_* functions), lwm2m_rd_client_start(), and lwm2m_rd_client_stop(); lwm2m_rd_client_service() itself is called from socket_loop().

Additionally, the setter and getter are used in the *_reply_cb functions. These are called by coap_response_received(), which is called by lwm2m_udp_receive(), which in turn is called by socket_recv_message(), which is called by socket_loop().

@rlubos (Contributor) commented Jul 30, 2021

I'm afraid that adding a mutex lock in set_sm_state() will not be enough. We need to lock out the start()/stop() API for the entire time the current state is being processed; otherwise we'll still get the race condition with stop(). If we add the mutex to set/get_sm_state() only, it would remain unlocked most of the time, so we would still hit the issue.

It might be necessary, however, to also take the mutex in the *_reply_cb and *_timeout_cb functions, as you pointed out, to prevent rare cases where these calls are preempted by a public API call.
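That could look roughly as follows (again a sketch under the same client_mutex assumption; do_bootstrap_reply_cb() is one of the real callbacks in lwm2m_rd_client.c, but its body is elided here):

```c
#include <net/coap.h>

static int do_bootstrap_reply_cb(const struct coap_packet *response,
				 struct coap_reply *reply,
				 const struct sockaddr *from)
{
	int ret = 0;

	k_mutex_lock(&client_mutex, K_FOREVER);

	/* ... inspect the CoAP response code and advance the state
	 * machine via set_sm_state(), with the lock held so that a
	 * concurrent start()/stop() call cannot interleave ... */

	k_mutex_unlock(&client_mutex);
	return ret;
}
```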

nashif pushed a commit that referenced this issue Aug 3, 2021
A mutex is used to synchronize the start(), stop() and service()
functions of the RD client. Previously it could happen that while
service() was working on e.g. bootstrapping, a stop() call from
another thread would close the socket. The bootstrapping process
would then detect this as a network error and restart the process.

Fixes #37170.

Signed-off-by: Maik Vermeulen <maik.vermeulen@innotractor.com>