LwM2M server fallback #65745

Merged

Collaborator

@SeppoTakalo SeppoTakalo commented Nov 24, 2023

This PR contains a set of changes that implement a fallback mechanism and various fixes to our LwM2M state machine.

  • Fallback to a secondary bootstrap server.
    • Each time we cannot connect to a bootstrap server, we try another one from the list.
  • Fallback to a secondary LwM2M server.
    • If we lose the connection to a server, we switch to another server, if more than one is configured.
    • The order is controlled by the priority field, if we use LwM2M 1.1.
  • Fallback to bootstrap.
    • If we lose the connection to all servers, or a server rejects our registration, we fall back to bootstrap.
    • This is controlled by a boolean option in the server object.
  • Implement the disable functionality of the server object.
  • Retry logic is now controlled in the NETWORK_ERROR state. All errors should lead to that state.
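
The fallback order described in this list can be illustrated with a small standalone sketch. This is not the engine code: the `model_` names are hypothetical, and the two boolean inputs are assumed stand-ins for "another enabled server exists" and the boolean bootstrap option in the server object.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of handling a lost connection or rejected registration. */
enum model_next_step {
	MODEL_TRY_NEXT_SERVER,    /* register to another configured server */
	MODEL_DO_BOOTSTRAP,       /* fall back to bootstrap */
	MODEL_STOP_NETWORK_ERROR, /* nothing left to try, engine stops */
};

/*
 * Decide what to do after the current server failed.
 * Assumption: a secondary server is preferred over bootstrap, and bootstrap
 * is only used when the (assumed) boolean option allows it.
 */
enum model_next_step model_on_server_failure(bool another_server_available,
					     bool bootstrap_on_failure)
{
	if (another_server_available) {
		return MODEL_TRY_NEXT_SERVER;
	}
	if (bootstrap_on_failure) {
		return MODEL_DO_BOOTSTRAP;
	}
	return MODEL_STOP_NETWORK_ERROR;
}
```

In this model, the engine only stops when no server candidate remains and bootstrap fallback is not enabled.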

Problems on current LwM2M engine

When I started to implement bootstrap testcases and fallback testcases from the LwM2M interoperability set, it became all too evident what the limitations of the current LwM2M engine are. These mainly stem from how RD_client is implemented.

  • Current implementation does not implement Disable functionality of server object.
  • Current implementation does not fallback to secondary server or bootstrap
  • Current implementation does not implement "priority" resource of server instance.
  • Current implementation gives LWM2M_RD_CLIENT_EVENT_NETWORK_ERROR if connection to server fails.
  • Engine is not able to keep running when server connection drops. We stop.
  • When engine is stopped, Notify messages don't work.
  • When engine is stopped, "Store Notifies when offline" cannot be implemented.
  • Misconfiguring a client ends up in busy-loop "Cannot find server/security" -> init -> "Cannot find...". This may cause deadlock in tickless mode.

Refactoring the Server Object

I implemented a few missing resources in the server object: priority and the disable functionality.
Disable works by keeping one k_timepoint_t per server instance. When that timepoint has expired, we know that the server is not disabled, so it is active.

The next obviously missing API was a function that goes through the list of servers and picks the next one based on their priority values and whether they are active (as in, not disabled). This works fine because Bootstrap servers, by the spec, do not have an LwM2M server object instance, so I don't need to keep checking that flag. I also noticed that the current RD client wrongly goes through the Security instances first and picks the first one from there. The correct order is to pick the server instance first, and then find its Security object instance.
So there is now a new API:

/**
 * @brief Select server instance.
 *
 * Find possible server instance considering values on server data.
 * Server candidates cannot be in disabled state and if priority values are set,
 * those are compared and lowest values are considered first.
 *
 * If @ref obj_inst_id is NULL, this can be used to check if there are any server available.
 *
 * @param[out] obj_inst_id where selected server instance ID is written. Can be NULL.
 * @return true if server instance was found, false otherwise.
 */
bool lwm2m_server_select(uint16_t *obj_inst_id);
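
As a rough illustration of the selection rule in the comment above (skip disabled servers, prefer the lowest priority value, allow a NULL output pointer), here is a standalone sketch. The `struct server_model` type and the `select_server_model()` function are hypothetical and are not part of the Zephyr API. The real function works on the engine's own server object instances.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one server object instance. */
struct server_model {
	uint16_t obj_inst_id; /* server object instance ID */
	uint8_t priority;     /* lower value is considered first */
	bool disabled;        /* true while the disable period is running */
};

/*
 * Pick the enabled server with the lowest priority value.
 * As in the documented contract, obj_inst_id may be NULL when the caller
 * only wants to know whether any server is available.
 */
bool select_server_model(const struct server_model *servers, size_t count,
			 uint16_t *obj_inst_id)
{
	const struct server_model *best = NULL;

	assert(servers != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (servers[i].disabled) {
			continue; /* disabled servers are never candidates */
		}
		if (best == NULL || servers[i].priority < best->priority) {
			best = &servers[i];
		}
	}

	if (best == NULL) {
		return false;
	}
	if (obj_inst_id != NULL) {
		*obj_inst_id = best->obj_inst_id;
	}
	return true;
}
```

With equal priority values, this sketch keeps the first candidate it finds. How the engine breaks such ties is not stated in this description.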

Also, if the connection to a server fails and we want to fall back to a secondary server, I thought that we could disable the current server for a short period, so that the previous API would find the next candidate. So this new API was added as well:

/**
 * @brief Disable server instance for a period of time.
 *
 * Timeout values can be calculated using kernel macros like K_SECONDS().
 * Values like K_FOREVER or K_NO_WAIT are also accepted.
 *
 * @param timeout Timeout value.
 * @return zero on success, negative error code otherwise.
 */
int lwm2m_server_disable(uint16_t obj_inst_id, k_timeout_t timeout);

A few other functions were added; please see lwm2m_obj_server.h.

Refactoring the RD-Client

If server registration fails, allow fallback to secondary server,
or fallback to bootstrap.
Also allow fallback to different bootstrap server.

Add API to tell RD client when a server has been disabled by
executable command.

Changes to RD state machine:

  • All retry logic should be handled in NETWORK_ERROR state.
  • New state SERVER_DISABLED.
  • Internally disable servers that reject registration
  • Temporarily disable server on network error.
  • Clean up all "disable timers" on start.
  • Select server first, then find security object for it.
  • State functions return void, error handling is done using states.
  • DISCONNECT event will only come when client is requested to stop.
  • NETWORK_ERROR will stop engine. This is generic error for all kinds
    of registration or network failures.
  • BOOTSTRAP_REG_FAILURE also stops engine. This is fatal, and we cannot
    recover.

Refactoring:

  • Server selection logic is inside server object.
  • sm_handle_timeout_state() does not require msg parameter. Unused.
  • When bootstrap fails, we should NOT back off to registration.
    This is a fatal error, and it stops the engine and informs application.

Clarification to the events

As a result of these changes, I clarified the documentation to match what the code does.
Now it should be clear what the application should do on each event. Please see the changes in lwm2m.rst and lwm2m_engine_state_machine.png.

[Image: lwm2m_engine_state_machine]

In short, events that stop the engine:

| Event | Why |
| --- | --- |
| DISCONNECT | Engine has stopped as a result of calling lwm2m_rd_client_stop() |
| NETWORK_ERROR | Engine has stopped because it cannot recover any server connection |
| BOOTSTRAP_REG_FAILURE | Engine has stopped as a result of bootstrap failing |

As seen from the diagram above, only those three events lead to the IDLE state. All other events are purely informational, and the client should recover on its own if there were connection failures.

One new event, SERVER_DISABLED, was added, along with one new state of the same name.
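
From the application's point of view, the split between "engine stopped" and "informational" events could be handled roughly as below. The enum is a hypothetical local mirror of the event names used in this description, not the real Zephyr event enum, and `MODEL_EVENT_OTHER` stands for any event not listed as stopping the engine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the RD client events discussed here. */
enum model_rd_event {
	MODEL_EVENT_DISCONNECT,
	MODEL_EVENT_NETWORK_ERROR,
	MODEL_EVENT_BOOTSTRAP_REG_FAILURE,
	MODEL_EVENT_SERVER_DISABLED,
	MODEL_EVENT_OTHER, /* any other informational event */
};

/* Return true for the three events after which the engine is in IDLE. */
bool model_event_stops_engine(enum model_rd_event event)
{
	switch (event) {
	case MODEL_EVENT_DISCONNECT:
	case MODEL_EVENT_NETWORK_ERROR:
	case MODEL_EVENT_BOOTSTRAP_REG_FAILURE:
		return true;
	default:
		/* Informational: the engine keeps running and recovers itself. */
		return false;
	}
}
```

An application event callback could use such a check to decide whether it has to restart the client itself or can simply log the event.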

@SeppoTakalo SeppoTakalo force-pushed the lwm2m_server_fallback branch 4 times, most recently from ddcbabf to fc7bf85 Compare November 30, 2023 19:01
Add API to find a security instance ID with given Short Server ID.

Signed-off-by: Seppo Takalo <seppo.takalo@nordicsemi.no>
React to disable executable, as well as add callback that allows
disabling server for a period of time.

Also add API that would find a next server candidate based on the
priority and server being not-disabled.

Move all server related functions into its own header.

Signed-off-by: Seppo Takalo <seppo.takalo@nordicsemi.no>
If server registration fails, allow fallback to secondary server,
or fallback to bootstrap.
Also allow fallback to different bootstrap server.

Add API to tell RD client when a server has been disabled by
executable command.

Changes to RD state machine:
* All retry logic should be handled in NETWORK_ERROR state.
* New state SERVER_DISABLED.
* Internally disable servers that reject registration
* Temporarily disable server on network error.
* Clean up all "disable timers" on start.
* Select server first, then find security object for it.
* State functions return void, error handling is done using states.
* DISCONNECT event will only come when client is requested to stop.
* NETWORK_ERROR will stop engine. This is generic error for all kinds
  of registration or network failures.
* BOOTSTRAP_REG_FAILURE also stops engine. This is fatal, and we cannot
  recover.

Refactoring:
* Server selection logic is inside server object.
* sm_handle_timeout_state() does not require msg parameter. Unused.
* When bootstrap fails, we should NOT back off to registration.
  This is a fatal error, and it stops the engine and informs application.

Signed-off-by: Seppo Takalo <seppo.takalo@nordicsemi.no>
@SeppoTakalo SeppoTakalo force-pushed the lwm2m_server_fallback branch 2 times, most recently from 014190c to 7c3929b Compare December 4, 2023 13:30
@SeppoTakalo SeppoTakalo changed the title RFC: LwM2M server fallback LwM2M server fallback Dec 4, 2023
@SeppoTakalo SeppoTakalo marked this pull request as ready for review December 4, 2023 13:56
rerickson1
rerickson1 previously approved these changes Dec 4, 2023
@SeppoTakalo
Collaborator Author

Changed one arrow on the image to make it less than 1000px wide.
Then ran the PNG through pngcrusher to get the size below 250 kB, which seems to be the limit for documentation.

@SeppoTakalo
Collaborator Author

Tests for the bootstrap are submitted in #66120.
One of the tests requires the fallback mechanism from this PR. One of them requires the "disabled" functionality added here as well.

@kartben
Collaborator

kartben commented Dec 4, 2023

> Changed one arrow on the image to make it less than 1000px wide.
> Then ran the PNG through pngcrusher to get the size below 250 kB, which seems to be the limit for documentation.

Not sure what tool you used but it would be fine (and even encouraged) to directly use the SVG version of your diagram.

@SeppoTakalo
Collaborator Author

> Not sure what tool you used but it would be fine (and even encouraged) to directly use the SVG version of your diagram.

The PNG is drawn using https://app.diagrams.net/ (previously draw.io).
The original XML image is inside the metadata, so you can re-open that PNG in the same tool, which runs in the browser.

In fallback refactoring to the LwM2M engine, some changes
to the server object are visible in hard-coded test
values.

Also, add Endpoint wrapper class that ensures the registration
state of the returned endpoint.

Signed-off-by: Seppo Takalo <seppo.takalo@nordicsemi.no>
Properly document the actions that application should
take on certain events.

This clarifies the events that indicate that the LwM2M
engine is stopped.

Add missing events to the state machine diagram and
apply color coding to states.

Signed-off-by: Seppo Takalo <seppo.takalo@nordicsemi.no>
@SeppoTakalo
Collaborator Author

> Not sure what tool you used but it would be fine (and even encouraged) to directly use the SVG version of your diagram.

Thanks for the suggestion. I was not even aware that we could use SVG directly. I have always drawn those diagrams in draw.io, which can produce SVG, but I just exported them as PNG.

I have now replaced the diagram with an SVG version that can be opened in https://app.diagrams.net/

Contributor

@rlubos rlubos left a comment

LGTM, nice work!

@MaureenHelm MaureenHelm merged commit 9102821 into zephyrproject-rtos:main Dec 5, 2023
19 checks passed
@SeppoTakalo SeppoTakalo deleted the lwm2m_server_fallback branch December 8, 2023 09:10
6 participants