Adapt monitoring of daemons to new Service Monitoring Feature in 10.14 #1728

Merged

Conversation

PradeepKiruvale
Contributor

Proposed changes

These changes add support for monitoring thin-edge device services, as well as child device services, in c8y.

The tedge-mapper-c8y receives the health status from the thin-edge c8y services (tedge-agent, c8y-log-plugin, etc.) or from any child device service, translates it into a SmartREST (102) message, and sends it to c8y.

  • The thin-edge device services must publish their health status on the tedge/health/<service-name> topic
  • The child device services must publish their health status on the tedge/health/<child-id>/<service-name> topic

The health status is published to c8y only on a state change, i.e. on start (up) or stop (down).
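
As an illustration of the mapping described above, here is a minimal sketch (not the code from this PR). It assumes the SmartREST 102 field order is `102,<external-id>,<type>,<name>,<status>`; the `ServiceTopic` type, both function names, and the `<device>_<service>` external-id format are hypothetical.

```rust
/// Origin of a health message, derived from its MQTT topic.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct ServiceTopic {
    child_id: Option<String>,
    service_name: String,
}

/// Parse `tedge/health/<service-name>` or
/// `tedge/health/<child-id>/<service-name>`.
fn parse_health_topic(topic: &str) -> Option<ServiceTopic> {
    let rest = topic.strip_prefix("tedge/health/")?;
    let parts: Vec<&str> = rest.split('/').collect();
    match parts.as_slice() {
        [service] if !service.is_empty() => Some(ServiceTopic {
            child_id: None,
            service_name: service.to_string(),
        }),
        [child, service] if !child.is_empty() && !service.is_empty() => Some(ServiceTopic {
            child_id: Some(child.to_string()),
            service_name: service.to_string(),
        }),
        _ => None,
    }
}

/// Build a SmartREST 102 message.
/// Assumption: the external id is built from device, child and service names.
fn to_smartrest_102(
    device_id: &str,
    topic: &ServiceTopic,
    service_type: &str,
    status: &str,
) -> String {
    let external_id = match &topic.child_id {
        Some(child) => format!("{}_{}_{}", device_id, child, topic.service_name),
        None => format!("{}_{}", device_id, topic.service_name),
    };
    format!(
        "102,{},{},{},{}",
        external_id, service_type, topic.service_name, status
    )
}
```

With these assumptions, a message on `tedge/health/tedge-agent` with status `up` and type `service` on a device named `dev` would become `102,dev_tedge-agent,service,tedge-agent,up`.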

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#1353

Checklist

  • [x] I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Contributor

@didier-wenzek left a comment

The key points of the feature are here, but the code must be re-organized.

crates/core/tedge_api/src/health.rs (outdated, resolved)
crates/core/tedge_mapper/src/c8y/service_monitor.rs (outdated, resolved)
crates/core/tedge_mapper/src/c8y/service_monitor.rs (outdated, resolved)
let topic = message.topic.name.to_owned();
let payload: HashMap<String, Value> = serde_json::from_str(message.payload_str()?)?;

if payload.len() > 1 {
Contributor

What if the len is 0?

Contributor Author

Then no service monitor message is created. This can happen for a bridge health status message, for example a payload of 1 or 0 on tedge/health/mosquitto-c8y-bridge.

Contributor

Good point. An option could be to translate 1/0 as up/down. However, I'm not sure this is helpful, as the point is to transfer this state over the bridge. @reubenmiller what do you think about this case?

crates/core/tedge_mapper/src/core/mapper.rs (outdated, resolved)
crates/core/tedge_mapper/src/c8y/mapper.rs (outdated, resolved)
crates/core/tedge_mapper/src/c8y/mapper.rs (resolved)
crates/core/tedge_mapper/src/core/mapper.rs (outdated, resolved)
crates/core/tedge_mapper/src/c8y/converter.rs (outdated, resolved)
crates/core/tedge_mapper/src/c8y/service_monitor.rs (outdated, resolved)
github-actions bot commented Feb 17, 2023

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % |
|---|---|---|---|---|
| 137 | 0 | 5 | 137 | 100 |

Passed Tests

Name ⏱️ Duration Suite
Define Child device 1 ID 0.004 s C8Y Child Alarms Rpi
Normal case when the child device does not exist on c8y cloud 3.957 s C8Y Child Alarms Rpi
Normal case when the child device already exists 0.641 s C8Y Child Alarms Rpi
Reconciliation when the new alarm message arrives, restart the mapper 0.871 s C8Y Child Alarms Rpi
Reconciliation when the alarm that is cleared 5.299 s C8Y Child Alarms Rpi
Prerequisite Parent 17.948 s Child Conf Mgmt Plugin
Prerequisite Child 0.29 s Child Conf Mgmt Plugin
Child device bootstrapping 15.367 s Child Conf Mgmt Plugin
Snapshot from device 19.359 s Child Conf Mgmt Plugin
Child device config update 18.66 s Child Conf Mgmt Plugin
Configuration types should be detected on file change (without restarting service) 52.539 s Inotify Crate
Child devices support sending simple measurements 49.322 s Child Device Telemetry
Child devices support sending custom measurements 50.142 s Child Device Telemetry
Child devices support sending custom events 43.196 s Child Device Telemetry
Child devices support sending custom events overriding the type 37.49 s Child Device Telemetry
Child devices support sending custom alarms #1699 38.125 s Child Device Telemetry
Child devices support sending inventory data via c8y topic 24.665 s Child Device Telemetry
Main device support sending inventory data via c8y topic 22.416 s Child Device Telemetry
Successful firmware operation 63.928 s Firmware Operation
Install with empty firmware name 56.032 s Firmware Operation
Supports restarting the device 83.688 s Restart Device
Update tedge version from previous using Cumulocity 119.288 s Tedge Self Update
Successful shell command with output 3.632 s Shell Operation
Check Successful shell command with literal double quotes output 3.203 s Shell Operation
Execute multiline shell command 3.212 s Shell Operation
Failed shell command 3.052 s Shell Operation
Software list should be populated during startup 56.365 s Software
Install software via Cumulocity 76.509 s Software
Software list should only show currently installed software and not candidates 45.808 s Software
Stop tedge-agent service 0.395 s Log Path Config
Customize the log path 0.19 s Log Path Config
Initialize tedge-agent 0.189 s Log Path Config
Check created folders 0.106 s Log Path Config
Remove created custom folders 0.192 s Log Path Config
Install latest via script (from current branch) 31.696 s Install Tedge
Install specific version via script (from current branch) 20.198 s Install Tedge
Install latest tedge via script (from main branch) 25.73 s Install Tedge
Support starting and stopping services 41.853 s Service-Control
Supports a reconnect 49.634 s Test-Commands
Supports disconnect then connect 51.577 s Test-Commands
Update unknown setting 33.835 s Test-Commands
Update known setting 26.668 s Test-Commands
Stop c8y-configuration-plugin 0.223 s Health C8Y-Configuration-Plugin
Update the service file 0.262 s Health C8Y-Configuration-Plugin
Reload systemd files 0.733 s Health C8Y-Configuration-Plugin
Start c8y-configuration-plugin 0.325 s Health C8Y-Configuration-Plugin
Start watchdog service 10.379 s Health C8Y-Configuration-Plugin
Check PID of c8y-configuration-plugin 0.123 s Health C8Y-Configuration-Plugin
Kill the PID 0.164 s Health C8Y-Configuration-Plugin
Recheck PID of c8y-configuration-plugin 2.277 s Health C8Y-Configuration-Plugin
Compare PID change 0.001 s Health C8Y-Configuration-Plugin
Stop watchdog service 0.156 s Health C8Y-Configuration-Plugin
Remove entry from service file 0.2 s Health C8Y-Configuration-Plugin
Stop c8y-log-plugin 0.232 s Health C8Y-Log-Plugin
Update the service file 0.37 s Health C8Y-Log-Plugin
Reload systemd files 0.977 s Health C8Y-Log-Plugin
Start c8y-log-plugin 0.311 s Health C8Y-Log-Plugin
Start watchdog service 10.54 s Health C8Y-Log-Plugin
Check PID of c8y-log-plugin 0.195 s Health C8Y-Log-Plugin
Kill the PID 0.201 s Health C8Y-Log-Plugin
Recheck PID of c8y-log-plugin 2.153 s Health C8Y-Log-Plugin
Compare PID change 0.002 s Health C8Y-Log-Plugin
Stop watchdog service 0.08 s Health C8Y-Log-Plugin
Remove entry from service file 0.077 s Health C8Y-Log-Plugin
Stop tedge-mapper 0.346 s Health Tedge Mapper C8Y
Update the service file 0.286 s Health Tedge Mapper C8Y
Reload systemd files 0.992 s Health Tedge Mapper C8Y
Start tedge-mapper 0.394 s Health Tedge Mapper C8Y
Start watchdog service 10.488 s Health Tedge Mapper C8Y
Check PID of tedge-mapper 0.146 s Health Tedge Mapper C8Y
Kill the PID 0.135 s Health Tedge Mapper C8Y
Recheck PID of tedge-mapper 2.205 s Health Tedge Mapper C8Y
Compare PID change 0.003 s Health Tedge Mapper C8Y
Stop watchdog service 0.082 s Health Tedge Mapper C8Y
Remove entry from service file 0.089 s Health Tedge Mapper C8Y
Stop tedge-agent 0.099 s Health Tedge-Agent
Update the service file 0.064 s Health Tedge-Agent
Reload systemd files 0.252 s Health Tedge-Agent
Start tedge-agent 0.115 s Health Tedge-Agent
Start watchdog service 10.167 s Health Tedge-Agent
Check PID of tedge-mapper 0.195 s Health Tedge-Agent
Kill the PID 0.155 s Health Tedge-Agent
Recheck PID of tedge-agent 2.285 s Health Tedge-Agent
Compare PID change 0.001 s Health Tedge-Agent
Stop watchdog service 0.211 s Health Tedge-Agent
Remove entry from service file 0.226 s Health Tedge-Agent
Stop tedge-mapper-az 0.125 s Health Tedge-Mapper-Az
Update the service file 0.074 s Health Tedge-Mapper-Az
Reload systemd files 0.348 s Health Tedge-Mapper-Az
Start tedge-mapper-az 0.133 s Health Tedge-Mapper-Az
Start watchdog service 10.273 s Health Tedge-Mapper-Az
Check PID of tedge-mapper-az 0.091 s Health Tedge-Mapper-Az
Kill the PID 0.11 s Health Tedge-Mapper-Az
Recheck PID of tedge-agent 2.305 s Health Tedge-Mapper-Az
Compare PID change 0.001 s Health Tedge-Mapper-Az
Stop watchdog service 0.369 s Health Tedge-Mapper-Az
Remove entry from service file 0.405 s Health Tedge-Mapper-Az
Stop tedge-mapper-collectd 0.254 s Health Tedge-Mapper-Collectd
Update the service file 0.244 s Health Tedge-Mapper-Collectd
Reload systemd files 0.75 s Health Tedge-Mapper-Collectd
Start tedge-mapper-collectd 0.184 s Health Tedge-Mapper-Collectd
Start watchdog service 10.313 s Health Tedge-Mapper-Collectd
Check PID of tedge-mapper-collectd 0.375 s Health Tedge-Mapper-Collectd
Kill the PID 0.453 s Health Tedge-Mapper-Collectd
Recheck PID of tedge-mapper-collectd 0.357 s Health Tedge-Mapper-Collectd
Compare PID change 0.001 s Health Tedge-Mapper-Collectd
Stop watchdog service 0.222 s Health Tedge-Mapper-Collectd
Remove entry from service file 0.289 s Health Tedge-Mapper-Collectd
c8y-log-plugin health status 6.065 s MQTT health endpoints
c8y-configuration-plugin health status 6.281 s MQTT health endpoints
Wrong package name 0.101 s Improve Tedge Apt Plugin Error Messages
Wrong version 0.1 s Improve Tedge Apt Plugin Error Messages
Wrong type 0.188 s Improve Tedge Apt Plugin Error Messages
tedge_connect_test_positive 0.465 s Tedge Connect Test
tedge_connect_test_negative 1.113 s Tedge Connect Test
tedge_connect_test_sm_services 8.803 s Tedge Connect Test
tedge_disconnect_test_sm_services 1.802 s Tedge Connect Test
Install thin-edge.io 15.579 s Call Tedge
call tedge -V 0.103 s Call Tedge
call tedge -h 0.132 s Call Tedge
call tedge -h -V 0.113 s Call Tedge
call tedge help 0.101 s Call Tedge
tedge config list 0.135 s Call Tedge Config List
tedge config list --all 0.261 s Call Tedge Config List
set/unset device.type 0.799 s Call Tedge Config List
set/unset device.key.path 0.726 s Call Tedge Config List
set/unset device.cert.path 0.924 s Call Tedge Config List
set/unset c8y.root.cert.path 0.899 s Call Tedge Config List
set/unset c8y.smartrest.templates 0.778 s Call Tedge Config List
set/unset az.root.cert.path 0.564 s Call Tedge Config List
set/unset az.mapper.timestamp 0.479 s Call Tedge Config List
set/unset mqtt.bind_address 0.434 s Call Tedge Config List
set/unset mqtt.port 0.483 s Call Tedge Config List
set/unset tmp.path 0.65 s Call Tedge Config List
set/unset logs.path 0.519 s Call Tedge Config List
set/unset run.path 0.369 s Call Tedge Config List
Get Put Delete 4.139 s Http File Transfer Api

crates/core/tedge_mapper/src/c8y/converter.rs (outdated, resolved)
Comment on lines 14 to 15
// If not Bridge health status
if !service_name.contains("mosquitto-c8y-bridge") {
Contributor

This check is too specific: there are also the az and aws bridges.

Contributor Author

Yes, it applies to all the bridges. I changed the check to cover all of them.

Contributor

> Good point. An option could be to translate 1/0 as up/down. However, I'm not sure this is helpful, as the point is to transfer this state over the bridge. @reubenmiller what do you think about this case?

Yes, I would love to have a consistent health payload if possible, e.g. using up/down instead of 1/0.

Contributor Author

> > Good point. An option could be to translate 1/0 as up/down. However, I'm not sure this is helpful, as the point is to transfer this state over the bridge. @reubenmiller what do you think about this case?
>
> Yes, I would love to have a consistent health payload if possible, e.g. using up/down instead of 1/0.

I think he meant that for when we send it to the c8y cloud. But do we really need to send the bridge status to the c8y cloud? When the bridge goes down, the status can't be sent to the c8y cloud anyway.
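
For reference, the 1/0 to up/down translation discussed in this thread could look roughly like the sketch below. The function name is hypothetical, and the thread does not settle whether the mapper should do this at all.

```rust
/// Translate a bridge notification payload ("1" or "0") into the
/// up/down vocabulary used by the other health messages.
/// Returns None for any other payload so the caller can decide what to do.
fn normalize_bridge_status(payload: &str) -> Option<&'static str> {
    match payload.trim() {
        "1" => Some("up"),
        "0" => Some("down"),
        _ => None,
    }
}
```

Returning `None` for anything else keeps the decision about unexpected payloads with the caller instead of silently mapping them to `down`.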

Contributor

@didier-wenzek left a comment

The failing test has been fixed by #1730.

So I'm okay to merge this PR. Can you remove the useless check on the message size, though? Thanks.

crates/core/tedge_mapper/src/c8y/converter.rs (outdated, resolved)
Comment on lines 59 to 66
let payload_str = message
    .payload_str()
    .unwrap_or(r#""type":"thin-edge.io","status":"down""#);

let health_status = serde_json::from_str(payload_str).unwrap_or_else(|_| HealthStatus {
    service_type: "unknown".to_string(),
    status: "down".to_string(),
});
Contributor

To avoid providing the default values twice (and involuntarily introducing a difference), one can chain function calls.

Suggested change
- let payload_str = message
-     .payload_str()
-     .unwrap_or(r#""type":"thin-edge.io","status":"down""#);
- let health_status = serde_json::from_str(payload_str).unwrap_or_else(|_| HealthStatus {
-     service_type: "unknown".to_string(),
-     status: "down".to_string(),
- });
+ let health_status = message
+     .payload_str()
+     .and_then(serde_json::from_str)
+     .unwrap_or_else(|_| HealthStatus {
+         service_type: "unknown".to_string(),
+         status: "down".to_string(),
+     });

Contributor

It's okay not to use chaining. However, fix the default values: they must be the same on lines 61 and 64.

Contributor Author

fixed
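
One way to state the default values exactly once, sketched here with the standard library only, is a `Default` impl combined with `unwrap_or_default`. This is an illustration, not the code that was merged: `parse_health` is a hypothetical stand-in for `serde_json::from_str`, and it only handles compact JSON with string fields.

```rust
#[derive(Debug, PartialEq)]
struct HealthStatus {
    service_type: String,
    status: String,
}

/// The fallback values live in one place only.
impl Default for HealthStatus {
    fn default() -> Self {
        HealthStatus {
            service_type: "unknown".to_string(),
            status: "down".to_string(),
        }
    }
}

/// Hypothetical stand-in for a real JSON parser: extracts the
/// `"type"` and `"status"` string fields from a compact JSON object.
fn parse_health(payload: &str) -> Option<HealthStatus> {
    let field = |key: &str| -> Option<String> {
        let marker = format!("\"{}\":\"", key);
        let start = payload.find(&marker)? + marker.len();
        let end = start + payload[start..].find('"')?;
        Some(payload[start..end].to_string())
    };
    Some(HealthStatus {
        service_type: field("type")?,
        status: field("status")?,
    })
}

/// Falls back to the single default when the payload is not valid
/// UTF-8 or cannot be parsed.
fn health_or_default(payload: Result<&str, std::str::Utf8Error>) -> HealthStatus {
    payload.ok().and_then(parse_health).unwrap_or_default()
}
```

Converting the `Result` to an `Option` first sidesteps the mismatch between the two error types, so both failure paths end in the same default.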

@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 20, 2023 13:52 — with GitHub Actions Inactive
@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 20, 2023 15:50 — with GitHub Actions Inactive
@@ -1776,6 +1776,62 @@ async fn mapper_updating_the_inventory_fragments_from_file() {
sm_mapper.abort();
}

#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
Contributor

Does that test really require 2 threads?

Contributor Author

fixed

);
}

#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
Contributor

Does that test really require 2 threads?

Contributor Author

fixed

Comment on lines 6 to 7
Below table gives more information about the `device type`, `thin-edge` topic, `health status` message format
and mapping to the `cumulocity` cloud message.
Contributor

It's not obvious for a user to guess what he has to do. I would focus on the tedge/health topics, telling the user when his software/service/child-device has to send a health status and how. In a second step, I would explain what is done by thin-edge, or more precisely by the c8y mapper, and how this can be observed.

Contributor Author

Updated the doc.

@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 20, 2023 16:18 — with GitHub Actions Inactive
@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 21, 2023 02:35 — with GitHub Actions Inactive
@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 21, 2023 07:17 — with GitHub Actions Inactive

## Send the health status of a service to `tedge/health` topic.

Below table lists the MQTT topic to which the health status message should be sent, and the
Contributor

Suggested change
- Below table lists the MQTT topic to which the health status message should be sent, and the
+ The table below lists the MQTT topics to which the health status message should be sent, and the

Contributor Author

fixed

The above message says that the `tedge-mapper-c8y` is `up` and the `type` of the service is `thin-edge.io`.


For example, to monitor the health of a `docker` service that is running on an `external-sensor` child device from cumulocity,
Contributor

You don't need to repeat "For example" as you are still in the example from the previous statement.

Also we don't have to mention that the child device is connected to c8y.

Suggested change
- For example, to monitor the health of a `docker` service that is running on an `external-sensor` child device from cumulocity,
+ To monitor the health of a `docker` service that is running on an `external-sensor` child device,

Contributor Author

fixed
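
To make the topic layout in this example concrete, here is a small sketch of how a publisher could build the topic and payload for a main-device service and for a child-device service. Both helper names are hypothetical, and the payload builder does no JSON escaping, so it only suits simple values such as `up`, `down` or `service`.

```rust
/// Build the health topic for a service on the main device (no child id)
/// or on a child device.
fn health_topic(child_id: Option<&str>, service_name: &str) -> String {
    match child_id {
        Some(child) => format!("tedge/health/{}/{}", child, service_name),
        None => format!("tedge/health/{}", service_name),
    }
}

/// Build the health payload. No escaping is applied to the values.
fn health_payload(status: &str, service_type: &str) -> String {
    format!(r#"{{"status":"{}","type":"{}"}}"#, status, service_type)
}
```

For the `docker` service on the `external-sensor` child device, `health_topic(Some("external-sensor"), "docker")` gives `tedge/health/external-sensor/docker`.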

> Note: The `status` here can be `up or down` or any other string. For example, `unknown`.

For example, to monitor the health status of a `tedge-mapper-c8y service` that is running on a `thin-edge.io` device
from cumulocity, one has to send the below message.
Contributor

We don't need to mention the Cumulocity bit as it is assumed already.

Suggested change
- from cumulocity, one has to send the below message.
+ one has to send the below message.

Contributor Author

fixed

The `tedge-mapper-c8y` will translate the `health status` message that is received on `tedge/health/#`
topic to `Cumulocity` specific `service monitor` message and sends it to `Cumulocity` cloud.

The table below gives more information about the `cumulocity topic`, and the `translated service monitor` message for both `thin-edge` as well as for `child` device.
Contributor

You don't need to use backticks for the Cumulocity IoT product name.

Suggested change
- The table below gives more information about the `cumulocity topic`, and the `translated service monitor` message for both `thin-edge` as well as for `child` device.
+ The table below gives more information about the **Cumulocity IoT** topic and the translated service monitor message for both `thin-edge` as well as for `child` device.

Contributor Author

fixed

# How to monitor health of service from Cumulocity IoT

The health of a `thin-edge.io` service or any other `service` that is running on the `thin-edge.io` device
or on the `child` device can be monitored from the Cumulocity by sending the `health-status` message to `Cumulocity IoT`.
Contributor

General change (in multiple places): we should use the markdown **Cumulocity IoT** when referring to Cumulocity (i.e. in bold, and always with a capital letter for Cumulocity).

Suggested change
- or on the `child` device can be monitored from the Cumulocity by sending the `health-status` message to `Cumulocity IoT`.
+ or on the `child` device can be monitored from the **Cumulocity IoT** by sending the `health-status` message to **Cumulocity IoT**.

Contributor Author

fixed

from cumulocity, one has to send the below message.

```
tedge mqtt pub tedge/health/tedge-mapper-c8y `{"status":"up","type":"thin-edge.io"}` -q 2 -r
```
Contributor

It might make more sense to use the generic service value for the type field (also in the second example).

Suggested change
- tedge mqtt pub tedge/health/tedge-mapper-c8y `{"status":"up","type":"thin-edge.io"}` -q 2 -r
+ tedge mqtt pub tedge/health/tedge-mapper-c8y `{"status":"up","type":"service"}` -q 2 -r

Contributor Author

fixed

@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 21, 2023 14:26 — with GitHub Actions Inactive
Contributor

@didier-wenzek left a comment

Approved

Signed-off-by: Pradeep Kumar K J <pradeepkumar.kj@softwareag.com>
@PradeepKiruvale PradeepKiruvale temporarily deployed to Test Pull Request February 22, 2023 13:06 — with GitHub Actions Inactive
@PradeepKiruvale PradeepKiruvale merged commit 9adc261 into thin-edge:main Feb 22, 2023