Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Declare child devices again if deleted in cloud #1644

Merged
merged 6 commits into from
Dec 15, 2022

Conversation

cmosd
Copy link
Contributor

@cmosd cmosd commented Dec 12, 2022

Proposed changes

When deleting a child device from the cloud representation the supported operations logic would re-create this child device but fail to notice it is not available on the cloud any longer. Thus resulting in a child device name prefixed by "MQTT Device".

This fix, changes the logic of initialising supported operations on startup, to first requesting the cloud representation of child devices, computing a diff between cloud and local, initialising as new child devices anything that is available locally but is not seen on the cloud and then proceeds to add supported operations for the child devices.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@cmosd cmosd self-assigned this Dec 12, 2022
Copy link
Contributor

@PradeepKiruvale PradeepKiruvale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM,
When the request is sent to the cloud to get the child devices list from the cloud, how long will it wait to get the child devices list? Is there any timeout? What if it gets stuck and never receives the list from the cloud?

@cmosd cmosd force-pushed the bugfix/1569/register-deleted-child-devices branch 3 times, most recently from 689a51c to 5ff5ccc Compare December 12, 2022 16:48
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This work is going in the right direction. However, I have questions regarding the expected behavior. Previously the reference were the local device. Now, we compute a diff between the cloud view and the local view. It's not clear to me how this diff needs to be resolved. Notably, the current PR pushes local child-device info to the cloud if not declared in the cloud. But nothing is done in the reverse direction. What if a child device is declared in the cloud but not locally known? We also have to double check that nothing has been broken regarding the dynamic declaration of new child devices (when a child device starts to send telemetry data without any declaration).

crates/core/tedge_mapper/src/c8y/converter.rs Outdated Show resolved Hide resolved
crates/core/tedge_mapper/src/c8y/converter.rs Show resolved Hide resolved
crates/core/tedge_mapper/src/c8y/converter.rs Outdated Show resolved Hide resolved
crates/core/tedge_mapper/src/c8y/tests.rs Outdated Show resolved Hide resolved
crates/core/tedge_mapper/src/c8y/tests.rs Outdated Show resolved Hide resolved
crates/core/tedge_mapper/src/c8y/converter.rs Outdated Show resolved Hide resolved
// if there are any local child devices that are not included in the
// `cloud_child_devices` struct, we create them on the cloud, sending a 101
// message. Then proceed to declare their supported operations.
let difference: Vec<&String> = local_child_devices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the child devices that are declared cloud side but not locally? Do we have to create something on the local device? On the file-system? In the Converter.children: HashMap<String, Operations> cache?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it was decided not to update the local state. Just wanted to know if it was just deferred for later or we've decided not to sync state from the cloud for some reason? If it's the latter, curious to know why.

@albinsuresh
Copy link
Contributor

The proposed fix is definitely the right thing to do as such a reconciliation with cloud will be needed soon anyway. But, for the specific reported issue of the "device name getting messed up", a simpler fix was also possible: Simply send a 101,{child_id},{child_id},thin-edge.io-child for each know child device during the bootstrap phase of the mappers/plugins. These messages are idempotent and hence even if multiple plugins sends the same 101 message, the first one will succeed and the subsequent ones will just fail which can be ignored.

@PradeepKiruvale
Copy link
Contributor

The proposed fix is definitely the right thing to do as such a reconciliation with cloud will be needed soon anyway. But, for the specific reported issue of the "device name getting messed up", a simpler fix was also possible: Simply send a 101,{child_id},{child_id},thin-edge.io-child for each know child device during the bootstrap phase of the mappers/plugins. These messages are idempotent and hence even if multiple plugins sends the same 101 message, the first one will succeed and the subsequent ones will just fail which can be ignored.

I agree with this, sending 101 to check if the device is already created or not. If the device already exists then go ahead with the process.

@cmosd
Copy link
Contributor Author

cmosd commented Dec 13, 2022

Simply send a 101,{child_id},{child_id},thin-edge.io-child for each know child device during the bootstrap phase of the mappers/plugins

This was what @reubenmiller and I first considered but concluded that doing this does not scale to more than just a few child devices

@cmosd cmosd force-pushed the bugfix/1569/register-deleted-child-devices branch from 26842f0 to eaf3a1a Compare December 13, 2022 16:01
@PradeepKiruvale
Copy link
Contributor

Simply send a 101,{child_id},{child_id},thin-edge.io-child for each know child device during the bootstrap phase of the mappers/plugins

This was what @reubenmiller and I first considered but concluded that doing this does not scale to more than just a few child devices

@reubenmiller Agree that it will be a scalability issue. Is there any way to send multiple child device registration/creation messages as one message?, Kind of batch registering?

@albinsuresh
Copy link
Contributor

Simply send a 101,{child_id},{child_id},thin-edge.io-child for each know child device during the bootstrap phase of the mappers/plugins

This was what @reubenmiller and I first considered but concluded that doing this does not scale to more than just a few child devices

@reubenmiller Agree that it will be a scalability issue. Is there any way to send multiple child device registration/creation messages as one message?, Kind of batch registering?

Yes. You can send multiple SmartREST 101 commands across multiple lines in a single MQTT message as follows:

101,child-1,child-1,tedge-child
101,child-2,child-2,tedge-child

We can just blindly send these messages and ignore the "device already exists "responses as we're only interested in recreating any missing devices. So, I'm also curious to understand this scalability issue that you're highlighting.

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

@cmosd cmosd force-pushed the bugfix/1569/register-deleted-child-devices branch 2 times, most recently from afeb964 to 7e66f48 Compare December 14, 2022 18:40
@cmosd
Copy link
Contributor Author

cmosd commented Dec 14, 2022

Thanks for approving @didier-wenzek but I made more changes and I would like another review, it should not take too long.

The changes I made are the following:

  • added child device cache (9638859)
  • tested some corner cases for remote vs local + tested that cache is updated (198b8b2)
  • tested that "106" is parsed even if there are no devices in the cloud (minor test - 9026e8f)

I did not touch anything you reviewed before

@cmosd cmosd force-pushed the bugfix/1569/register-deleted-child-devices branch 2 times, most recently from 9026e8f to 0a3660c Compare December 14, 2022 21:16
"106" => register_child_device_supported_operations(config_dir, payload, children),
// Ignore any other child device incoming request as not yet supported
_ => {
info!("Ignored. Message not yet supported: {payload}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
info!("Ignored. Message not yet supported: {payload}");
debug!("Ignored. Message not yet supported: {payload}");

This might be better as a debug message otherwise I see the mapper logging this for every operation request sent to any child device.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug! or info!? I proposed info! because it's not convenient when you have to change the log level to discover that the messages sent to the device are ignored. But I agree that if this is a normal case to have these messages ignored by the mapper and processed by a plugin, then debug! is more appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we extend the mapper in future where it performs all the message mapping for all operations(including config mgmt,log mgmt etc), even for child devices, info or even warn logging would be the right thing to do as we would want to know when the device receives a message that the mapper can not handle. But today, since the mapper only maps software management operations for the parent device only, I feel that this log will be triggered most of the time.

/// │   └── c8y_LogfileRequest
/// └── child-3
/// └── c8y_LogfileRequest
fn make_n_child_devices_with_k_operations(n: u8, ttd: &TempTedgeDir) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test utility. You've gone to quite some lengths to ensure that randomness across test runs.

// if there are any local child devices that are not included in the
// `cloud_child_devices` struct, we create them on the cloud, sending a 101
// message. Then proceed to declare their supported operations.
let difference: Vec<&String> = local_child_devices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it was decided not to update the local state. Just wanted to know if it was just deferred for later or we've decided not to sync state from the cloud for some reason? If it's the latter, curious to know why.

@@ -1188,6 +1188,8 @@ async fn mapper_publishes_child_device_create_message() {

publish_a_fake_jwt_token(broker).await;

broker.publish("c8y/s/ds", "106,child-one").await.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the mapper init block is not waiting for the child device list response, is this really necessary?

@@ -1212,6 +1214,7 @@ async fn mapper_publishes_supported_operations_for_child_device() {
let (_tmp_dir, sm_mapper) = start_c8y_mapper(broker.port, &cfg_dir).await.unwrap();

publish_a_fake_jwt_token(broker).await;
broker.publish("c8y/s/ds", "106,child-one").await.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as the one above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albinsuresh , yes because now we don't send 114 on startup but only when we receive a 116 from the cloud. The logic is the following:

mapper sends 105
cloud replies with 106
mapper computes diff between cloud and local
mapper sends 101 for new child devices
mapper sends 114 for all child devices

@albinsuresh
Copy link
Contributor

LGTM. Have just added some minor comments/queries.

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm my approval, having already approved 9638859.

/// updated with operations. This child device will not be cached.
///
/// - Any child device that is present locally but not in the cloud will be created and
/// then supported operations will be published to the cloud and the device will be cached.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then supported operations will be published to the cloud

This part is not tested. Neither the messages returned by process_smartrest.

For that test to be perfect, I would add a check that there is a declaration message for each child-device known locally but not in the cloud.

Copy link
Contributor Author

@cmosd cmosd Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update this test 😄

@cmosd cmosd force-pushed the bugfix/1569/register-deleted-child-devices branch 2 times, most recently from 0cf41d6 to b600799 Compare December 15, 2022 14:31
alexandru solomes added 2 commits December 15, 2022 14:46
When deleting a child device from the cloud representation the supported
operations logic would re-create this child device but fail to notice it
is not available on the cloud any longer. Thus resulting in a child
device name prefixed by "MQTT Device".

This fix, changes the logic of initialising supported operations on
startup, to first requesting the cloud representation of child devices,
computing a diff between cloud and local, initialising as new child
devices anything that is available locally but is not seen on the cloud
and then proceeds to add supported operations for the child devices.

Signed-off-by: alexandru solomes <solo@softwareag.com>
Signed-off-by: alexandru solomes <solo@softwareag.com>
alexandru solomes and others added 4 commits December 15, 2022 14:46
Signed-off-by: alexandru solomes <solo@softwareag.com>
Signed-off-by: alexandru solomes <solo@softwareag.com>
Added a series of tests that checks a few assumptions. Namely:
- the source of true child devices is the parent device (not the cloud)
- local child devices are updated to the cloud
- remote child devices are ignored
- the child device cache is updated according to these rules

Signed-off-by: alexandru solomes <solo@softwareag.com>
Signed-off-by: alexandru solomes <solo@softwareag.com>
@cmosd cmosd force-pushed the bugfix/1569/register-deleted-child-devices branch from b600799 to 0901582 Compare December 15, 2022 14:46
@cmosd cmosd merged commit 06990a1 into main Dec 15, 2022
@cmosd cmosd deleted the bugfix/1569/register-deleted-child-devices branch December 15, 2022 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants