
LwM2M lwm2m_rd_client_stop() not working when called during bootstrapping/registration #37170

Closed
MaikVermeulen opened this issue Jul 23, 2021 · 5 comments · Fixed by #37393
Labels: area: LWM2M · bug (The issue is a bug, or the PR is fixing a bug) · priority: low (Low impact/importance bug)

@MaikVermeulen (Contributor)

Describe the bug
When external logic calls lwm2m_rd_client_stop() while the client is still bootstrapping or registering, the call often (though not always) fails to stop it:

[00:00:20.579,833] net_lwm2m_rd_client: Bootstrap started with endpoint 'urn:cust:ORBITZ-89882390000226837758' with client lifetime 0
[00:00:20.580,169] Modem: Found free socket: 0
[00:00:20.580,200] Modem: Socket created. <ID, FD> : <0, 1>
[00:00:21.073,059] net_lwm2m_rd_client: Stop LWM2M Client: urn:cust:ORBITZ-89882390000226837758
[00:00:21.170,623] Modem: Processing modem command: Socket open
[00:00:21.170,715] Modem: Sending: AT+QIOPEN=1,0,"UDP","10.113.2.99",5684,0,0
[00:00:21.310,760] Modem: Received response: OK
[00:00:21.310,791] Modem: Wait for socket open URC now
[00:00:21.310,821] Modem: Setting timeout for socket 0
[00:00:21.410,980] Modem: Received URC: +QIOPEN: 0,0
[00:00:21.688,903] <err> net_lwm2m_engine: Failed to send packet, err 9
[00:00:21.688,934] <err> net_lwm2m_rd_client: Error sending LWM2M packet (err:-9).
[00:00:21.688,934] <err> net_lwm2m_rd_client: Bootstrap registration err: -9
[00:00:21.789,154] Modem: Processing modem command: Socket close
[00:00:21.789,154] Modem: Closing socket: 0
[00:00:21.789,245] Modem: Sending: AT+QICLOSE=0,0
[00:00:21.911,590] Modem: Received response: OK
[00:00:21.911,621] Modem: Socket 0 closed
[00:00:21.911,621] Modem: Freeing socket 0
[00:00:22.189,239] net_lwm2m_rd_client: Bootstrap started with endpoint 'urn:cust:ORBITZ-89882390000226837758' with client lifetime 0

To Reproduce
See above log, but in short:

  1. Use a Zephyr version akin to Nordic sdk-zephyr v2.4.99-ncs2 (I checked the more recent git history but don't see anything directly addressing this issue; we currently cannot update to the async socket version yet)
  2. Start RD Client
  3. Call lwm2m_rd_client_stop() while the RD client is still bootstrapping/registering (preferably ASAP)
  4. Observe that bootstrapping/registration restarts; I believe this happens via sm_do_network_error()

Expected behavior
The RD client should stop completely (and remain stopped until explicitly started again).

Impact
Data communication occurs where we neither expect nor want it, which also increases power consumption, among other effects.

@MaikVermeulen MaikVermeulen added the bug The issue is a bug, or the PR is fixing a bug label Jul 23, 2021
@carlescufi carlescufi added area: LWM2M priority: low Low impact/importance bug labels Jul 27, 2021
@rlubos (Contributor) commented Jul 28, 2021

I was not able to reproduce this with QEMU, but I can imagine the issue occurring on slower platforms (I'm not sure which modem this is, but judging by the timestamps, it takes its time processing send()).

What I think happens in this case is that when lwm2m_rd_client_stop() was called, the LwM2M thread was still processing sm_do_bootstrap_reg()/sm_do_register(). Since lwm2m_rd_client_stop() simply closes the LwM2M context (https://github.com/nrfconnect/sdk-zephyr/blob/v2.4.99-ncs2/subsys/net/lib/lwm2m/lwm2m_rd_client.c#L1052), the ongoing/pending send() fails, and the fault handling routine (https://github.com/nrfconnect/sdk-zephyr/blob/v2.4.99-ncs2/subsys/net/lib/lwm2m/lwm2m_rd_client.c#L837) overrides the ENGINE_IDLE state. This leads to a restart of the bootstrap/registration process.

I think the most reasonable way to handle this would be to add mutex protection to the LwM2M state machine. The LwM2M engine should lock the mutex while processing lwm2m_rd_client_service(), and the same mutex should protect the lwm2m_rd_client_start()/stop() API calls. With this solution, lwm2m_rd_client_stop() would simply wait for the state machine to finish processing, removing the risk of the actual state being overwritten.

Alternatively, we could add extra logic to the fault handling routines to check whether the "IDLE" state was set during processing, but I think the mutex approach is a much cleaner solution.
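A minimal sketch of what the mutex approach could look like, assuming a file-scope mutex inside lwm2m_rd_client.c (the mutex name and the abbreviated function bodies here are illustrative, not a final patch):

```c
#include <zephyr.h>
#include <net/lwm2m.h>

/* Hypothetical sketch: serialize the RD client state machine
 * against the public start/stop API with one file-scope mutex. */
static K_MUTEX_DEFINE(client_mutex);

static void lwm2m_rd_client_service(void)
{
	k_mutex_lock(&client_mutex, K_FOREVER);

	/* ... existing dispatch on the current engine state, e.g.
	 * sm_do_bootstrap_reg()/sm_do_register(), runs with the
	 * lock held ... */

	k_mutex_unlock(&client_mutex);
}

void lwm2m_rd_client_stop(struct lwm2m_ctx *client_ctx,
			  lwm2m_ctx_event_cb_t event_cb)
{
	/* Blocks until the service loop has finished processing the
	 * current state, so the IDLE state set here can no longer be
	 * overwritten by the fault handling routine. */
	k_mutex_lock(&client_mutex, K_FOREVER);

	/* ... close the context and set ENGINE_IDLE as before ... */

	k_mutex_unlock(&client_mutex);
}
```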

@MaikVermeulen (Contributor, Author)

Thanks @rlubos for the thorough evaluation and trying to reproduce it.
Your reasoning about the cause seems right. We have lowered the modem thread's priority to below that of the LwM2M engine, and that seems to be a working workaround.

The mutex solution indeed sounds nice. If you want, I can give it a shot.
I hope to find some time for it before I go on holiday in 2 weeks.

@rlubos (Contributor) commented Jul 29, 2021

The mutex solution indeed sounds nice. If you want, I can give it a shot.

That would be nice, thanks. It should be a simple change: adding a mutex lock/unlock at the beginning/end of the aforementioned functions should be enough (unless I'm missing something).

@MaikVermeulen (Contributor, Author) commented Jul 30, 2021

The mutex solution indeed sounds nice. If you want, I can give it a shot.

That would be nice, thanks. It should be a simple change: adding a mutex lock/unlock at the beginning/end of the aforementioned functions should be enough (unless I'm missing something).

Looked at it, and indeed it would be simple. There is even /* TODO: add locking? */ in some places 🙂.

TL;DR:
All mutations happen (indirectly) in socket_loop(), which calls all functions operating on the state machine, such as lwm2m_rd_client_service(); the only exceptions are, of course, lwm2m_rd_client_start() and lwm2m_rd_client_stop(). In summary: you did not miss anything 😃.

I did notice that while the state is only ever set via set_sm_state(), client.engine_state is read directly in quite a few places, whereas e.g. do_deregister_reply_cb() uses get_sm_state() instead. I think we have two options for where to place the locks:

  1. Like you said, in lwm2m_rd_client_service(), lwm2m_rd_client_start(), lwm2m_rd_client_stop().
  2. In set_sm_state() and get_sm_state().

Personally, I think option 2 is safest, as it doesn't lean on the assumption that most mutations are done by a single thread running socket_loop(). On the other hand, perhaps it's overkill, and we would be locking/unlocking a lot in places where we know it currently isn't a problem.
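For illustration, option 2 would confine the locking to the accessors, roughly like this (a hypothetical sketch reusing the same client_mutex as above; the real accessors do more than shown here):

```c
/* Option 2 sketch: make each individual state access atomic. */
static void set_sm_state(uint8_t sm_state)
{
	k_mutex_lock(&client_mutex, K_FOREVER);
	client.engine_state = sm_state;
	/* ... existing event mapping and notification logic ... */
	k_mutex_unlock(&client_mutex);
}

static uint8_t get_sm_state(void)
{
	k_mutex_lock(&client_mutex, K_FOREVER);
	uint8_t state = client.engine_state;

	k_mutex_unlock(&client_mutex);
	return state;
}
```

Note that this only makes the individual reads and writes atomic; the lock is released between state transitions.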

What do you think?

Background:
To be thorough, and to learn more about the Zephyr LwM2M stack, I checked all the places where client.engine_state, its setter set_sm_state(), and its getter get_sm_state() are used.
Access indeed happens in lwm2m_rd_client_service() (which calls the sm_do_* functions), lwm2m_rd_client_start(), and lwm2m_rd_client_stop(); lwm2m_rd_client_service() itself is called from socket_loop().

Additionally, the setter and getter are used in the *_reply_cb functions. These are called by coap_response_received(), which is called by lwm2m_udp_receive(), which in turn is called by socket_recv_message(), which is called by socket_loop().

@rlubos (Contributor) commented Jul 30, 2021

I'm afraid that adding a mutex lock in set_sm_state() will not be enough. We need to lock out the start()/stop() API for the entire time the current state is being processed; otherwise we'll still get the race condition with stop(). If we add the mutex to set/get_sm_state() only, it would remain unlocked most of the time, so we would still hit the issue.

It might be necessary, however, to also take the mutex in the *_reply_cb and *_timeout_cb functions, as you pointed out, to prevent rare cases where these calls are preempted by a public API call.
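That could look roughly as follows (again a sketch under the same client_mutex assumption; do_bootstrap_reply_cb() is one of the real callbacks in lwm2m_rd_client.c, but its body is elided here):

```c
#include <net/coap.h>

static int do_bootstrap_reply_cb(const struct coap_packet *response,
				 struct coap_reply *reply,
				 const struct sockaddr *from)
{
	int ret = 0;

	k_mutex_lock(&client_mutex, K_FOREVER);

	/* ... inspect the CoAP response code and advance the state
	 * machine via set_sm_state(), with the lock held so that a
	 * concurrent start()/stop() call cannot interleave ... */

	k_mutex_unlock(&client_mutex);
	return ret;
}
```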

nashif pushed a commit that referenced this issue Aug 3, 2021
A mutex is used to synchronize the start(), stop() and service()
functions of the RD client. Previously it could happen that while
service() was working on e.g. bootstrapping, a stop() call from
another thread would close the socket. The bootstrapping process
would then detect this as a network error and restart the process.

Fixes #37170.

Signed-off-by: Maik Vermeulen <maik.vermeulen@innotractor.com>