benchmark: add monitor command #593
Conversation
Force-pushed from 24a9eed to a7fec64.
def _count(pods: List[Dict]) -> Dict[str, int]:
    statistics = defaultdict(lambda: 0)
    for pod in pods:
        phase = pod.get("status", {}).get("phase")
question: The `phase` value does not represent exactly what `kubectl get pods` displays in the STATUS column. It has `Pending`, `Running` and `Succeeded` values. In addition, it has a `Failed` value which, for some reason, appears for batch pods at the end even when workflows finished successfully.
I guess this happens because batch pods enter the `NotReady` state once the workflow engine successfully finishes and exits. You cannot easily get the `NotReady` state from the kubectl JSON output; you need to deduce it from the container statuses.
What do you think about the situation? Any ideas on how to get a more accurate status?
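One rough sketch of such a deduction (only an illustration, not the exact logic `kubectl` itself uses; the helper name `_deduce_status` is hypothetical) would be to prefer the per-container `state` reasons over the bare `phase`:

```python
from typing import Dict

def _deduce_status(pod: Dict) -> str:
    """Prefer per-container state reasons over the bare pod phase."""
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        state = container.get("state", {})
        if "terminated" in state:
            return state["terminated"].get("reason", "Terminated")  # e.g. "Completed", "Error"
        if "waiting" in state:
            return state["waiting"].get("reason", "Waiting")  # e.g. "CrashLoopBackOff"
    return status.get("phase", "Unknown")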
The `phase` values are described here. I think it's a pretty accurate metric for us to use. But this case of successfully finished workflows showing a `Failed` status could be confusing. I guess one way we could solve it is to include the `containerStatuses` in the final output (the downside is that it will make it much more verbose). Here is an example:
kubectl get pod -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state}{"\n"}{end}'
...
reana-server-69c494847b-fxldq Running {"running":{"startedAt":"2021-11-23T08:09:29Z"}} {"running":{"startedAt":"2021-11-23T08:09:29Z"}}
...
Then it would be visible that the `Failed` status happens because the containers terminated. But I'm not sure if we need it; it could be confusing to have too much output.
Example of the JSON output about the pod `status` for a single pod from the `kubectl get pods -o json` command:
"status": {
"conditions": [
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:07Z",
"status": "True",
"type": "Initialized"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:12Z",
"status": "True",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:12Z",
"status": "True",
"type": "ContainersReady"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:07Z",
"status": "True",
"type": "PodScheduled"
}
],
"containerStatuses": [
{
"containerID": "containerd://78d3916f2dfca74427f8b0cc38cc344a746a2594c8fd6f746133e7a10b0cfdbb",
"image": "docker.io/reanahub/reana-workflow-controller:latest",
"imageID": "sha256:75e1ea2066193703f73e1c71fce92969a7b9c47b8af3e307dc2d5ee812e2d40d",
"lastState": {},
"name": "job-status-consumer",
"ready": true,
"restartCount": 0,
"started": true,
"state": {
"running": {
"startedAt": "2021-11-11T16:15:10Z"
}
}
},
{
"containerID": "containerd://4b915673af6c64cbec168cc8512969507544365d8ba35b3900f836836d384a26",
"image": "docker.io/reanahub/reana-workflow-controller:latest",
"imageID": "sha256:75e1ea2066193703f73e1c71fce92969a7b9c47b8af3e307dc2d5ee812e2d40d",
"lastState": {},
"name": "rest-api",
"ready": true,
"restartCount": 0,
"started": true,
"state": {
"running": {
"startedAt": "2021-11-11T16:15:11Z"
}
}
}
],
"hostIP": "172.18.0.2",
"phase": "Running",
"podIP": "10.244.0.117",
"podIPs": [
{
"ip": "10.244.0.117"
}
],
"qosClass": "BestEffort",
"startTime": "2021-11-11T16:15:07Z"
}
As described above, `phase` can show pods as `Failed` while they are terminating, and it is not so straightforward to extract more precise info from the above JSON. Some checks could be done on `containerStatuses` or, maybe, on `conditions`.
The current goal of the monitoring command is to collect data that can later be visualized along with the execution progress plot. With the additional `containerStatuses` output it is not clear how to visualize it (there is a `phase`, but also one or more containers in `containerStatuses`). I would prefer to reduce it to one field (`Running: 5`, `Failed: 1`, etc.), but the question is how :)
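One illustrative reduction (just a sketch; the heuristic and the helper name `_effective_phase` are assumptions, not what this PR implements) would be to keep `phase` as the single field but reinterpret `Failed` pods whose containers actually completed:

```python
from typing import Dict

def _effective_phase(pod: Dict) -> str:
    """Collapse phase + containerStatuses into a single status field."""
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    if phase == "Failed":
        # Hypothetical heuristic: if some container terminated with exit
        # code 0 ("Completed"), count the pod as Succeeded, assuming the
        # other containers were only killed during the graceful pod shutdown.
        for container in status.get("containerStatuses", []):
            terminated = container.get("state", {}).get("terminated")
            if terminated and terminated.get("exitCode") == 0:
                return "Succeeded"
    return phase
```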
If I understood correctly, when the pod is terminated, `SIGTERM` is sent by k8s, and apparently `job-controller` takes too much time to exit, so k8s sends `SIGKILL` (exit code 137) and marks the pod as `Failed`.
Example of the JSON:
"containerStatuses": [
{
"containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
"image": "docker.io/reanahub/reana-job-controller:latest",
"imageID": "sha256:268ad55769e69f1c41d448fb7b54a95dc074cba5e2e1cfa6a2642d8484756551",
"lastState": {},
"name": "job-controller",
"ready": false,
"restartCount": 0,
"started": false,
"state": {
"terminated": {
"containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
"exitCode": 137,
"finishedAt": "2021-11-24T10:28:32Z",
"reason": "Error",
"startedAt": "2021-11-24T10:27:44Z"
}
}
},
{
"containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
"image": "docker.io/reanahub/reana-workflow-engine-serial:latest",
"imageID": "sha256:5756537cf4d2673d2725e0824f6f6883062c363229d6c81a500b19a1487b8641",
"lastState": {},
"name": "workflow-engine",
"ready": false,
"restartCount": 0,
"started": false,
"state": {
"terminated": {
"containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
"exitCode": 0,
"finishedAt": "2021-11-24T10:28:02Z",
"reason": "Completed",
"startedAt": "2021-11-24T10:27:44Z"
}
}
}
],
Check the timestamps in `containerStatuses`: `workflow-engine` finished at `10:28:02`, so around that time REANA asks k8s to delete the pod. Now take a look at the `job-controller` finish time, `10:28:32`: exactly 30 seconds after `workflow-engine`, and 30 seconds is the default time in k8s for the graceful shutdown of a pod (details).
The same happened in the other batch pods: `job-controller` does not terminate within 30 seconds when k8s asks it to. @tiborsimko
suggestion: I think the above case deserves a separate issue. I can leave the `phase` value in the metric for this PR. I assume that once we figure out why the exit code is `137` instead of `0`, the `phase` value will show the `Succeeded` status for batch pods.
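As a quick sanity check of the 30-second hypothesis, the gap between container termination times can be computed straight from the JSON above (a small throwaway sketch; `_finish_gap` is a hypothetical helper, not part of this PR):

```python
from datetime import datetime
from typing import Dict, List

def _finish_gap(container_statuses: List[Dict]) -> float:
    """Seconds between the first and the last container termination in a pod."""
    finished = [
        datetime.strptime(c["state"]["terminated"]["finishedAt"], "%Y-%m-%dT%H:%M:%SZ")
        for c in container_statuses
        if "terminated" in c.get("state", {})
    ]
    return (max(finished) - min(finished)).total_seconds()

# For the example above: 10:28:32 - 10:28:02 -> 30.0 seconds.
```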
Opened the issue to fix the `Failed` phase for batch pods. We can proceed forward with this PR and leave `phase` as it is.
Force-pushed from a7fec64 to ed0a1bc.
Works nicely!
There was one more metric described in the issue which was not implemented: the DB workflow status. I think it would also be interesting to check it and compare whether it is consistent with the information that k8s provides.
kubectl exec deployment/reana-db -- psql -U reana -c "SELECT status,COUNT(*) FROM __reana.workflow WHERE name LIKE '$workflow-%' GROUP BY status;"
reana/reana_benchmark/cli.py (Outdated)
@reana_benchmark.command(name="monitor")
@workflow_option
@click.option(
    "--sleep", "-s", help="Sleep between querying in seconds", type=int, default=30,
Minor: perhaps we could mention that the default sleep time is 30 seconds?
Regarding the one missing metric, I can add it. I didn't do it initially because, technically, workflow statuses should be listed by the …
Force-pushed from 3a38210 to 21e7937.
Works nicely!
I left a couple of small comments. The PR also needs a rebase.
reana/reana_benchmark/cli.py (Outdated)
@click.option(
    "--sleep",
    "-s",
    help="Sleep in seconds between collecting metrics, default 30",
Very minor: I guess to be consistent with the `reana-client` and `reana-dev` command formats we could use `Sleep in seconds between collecting metrics [default=30]` as the help description here, WDYT?
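(As a side note, and only as a sketch of an alternative: Click can also render the default automatically via `show_default`, though it formats it as "[default: 30]" rather than the "[default=30]" convention mentioned above.)

```python
@click.option(
    "--sleep",
    "-s",
    help="Sleep in seconds between collecting metrics.",
    type=int,
    default=30,
    show_default=True,  # Click appends "[default: 30]" to the help text
)
```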
Done.
reana/reana_benchmark/monitor.py (Outdated)
METRICS = [
    NumberOfDBConnectionsMetric(),
    WorkflowPodsMetric(),
    WorkflowStatuses(),
Minor: for consistency we could add `Metric` to the end. Also, I think it would be a bit clearer to name it `WorkflowDBStatusesMetric` and, accordingly, `name = "workflow_db_statuses"`. I think that would be more in line with `workflows_pods_status`.
Done.
Force-pushed from 21e7937 to 18192c1.
The only thing missing for this PR is a rebase ("This branch is out-of-date with the base branch").
Force-pushed from 18192c1 to 41fa8f9.
This PR introduces the `monitor` command to the benchmark. It will collect all defined metrics (check the code for details), sleep for a defined time in seconds, and repeat the same process until it is terminated manually. Results are saved to a JSON file to be used in the `analyze` command later. (A rough sketch of this loop is shown at the end of this description.)

Closes #574

How to test:

1. (optional) Launch some workflows: `reana-benchmark launch -w monitor-test -n 1-20` (not `helloworld` as they are too fast; `atlas-demo` is okay);
2. Right after the workflows are started, start the monitor command: `reana-benchmark monitor --sleep 5 -w monitor-test`;
3. When the workflows are finished (or around that time), terminate the command using Ctrl+C;
4. Check the JSON output in `monitor-test_monitored_results.json`; you will see some metrics (currently 2) and the timestamps when they were collected.

Note: `-w` in the `monitor` command is not tied to the actual workflows started before; it is added only as a convention and a way to map the monitored results in the `analyze` command later.
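For illustration, here is a minimal sketch of the collect-sleep-repeat loop described above (the metric objects with a `name` attribute and a `collect()` method are assumptions for the example, not the PR's exact interface):

```python
import json
import time
from datetime import datetime

def monitor(workflow: str, sleep: int, metrics) -> None:
    """Collect all metrics every `sleep` seconds until interrupted with Ctrl+C."""
    results = []
    try:
        while True:
            results.append(
                {
                    "timestamp": datetime.utcnow().isoformat(),
                    "metrics": {m.name: m.collect() for m in metrics},
                }
            )
            time.sleep(sleep)
    except KeyboardInterrupt:
        # Dump everything collected so far, e.g. monitor-test_monitored_results.json
        with open(f"{workflow}_monitored_results.json", "w") as f:
            json.dump(results, f, indent=2)
```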