benchmark: add monitor command #593
Conversation
Force-pushed from 24a9eed to a7fec64.
def _count(pods: List[Dict]) -> Dict[str, int]:
    statistics = defaultdict(lambda: 0)
    for pod in pods:
        phase = pod.get("status", {}).get("phase")
question: The `phase` value does not represent exactly what `kubectl get pods` displays in the STATUS column. It has `Pending`, `Running` and `Succeeded` values. In addition, it has a `Failed` value which, for some reason, appears for batch pods at the end even when workflows finished successfully.
I guess this happens because batch pods enter the `NotReady` state once the workflow engine successfully finishes and exits. You cannot easily get the `NotReady` state from the kubectl JSON output; you need to deduce it from the container statuses.
What do you think about the situation? Any ideas on how to get a more accurate status?
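One rough sketch of such a deduction (only an illustration, not the exact logic `kubectl` itself uses; the helper name `_deduce_status` is hypothetical) would be to prefer the per-container `state` reasons over the bare `phase`:

```python
from typing import Dict

def _deduce_status(pod: Dict) -> str:
    """Prefer per-container state reasons over the bare pod phase."""
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        state = container.get("state", {})
        if "terminated" in state:
            return state["terminated"].get("reason", "Terminated")  # e.g. "Completed", "Error"
        if "waiting" in state:
            return state["waiting"].get("reason", "Waiting")  # e.g. "CrashLoopBackOff"
    return status.get("phase", "Unknown")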
The `phase` values are described here. I think it's a pretty accurate metric for us to use. But this case of successfully finished workflows showing a `Failed` status could be confusing. I guess one way we could solve it is to include the `containerStatuses` in the final output (the downside is that it will make it much more verbose). Here is an example:
kubectl get pod -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state}{"\n"}{end}'
...
reana-server-69c494847b-fxldq Running {"running":{"startedAt":"2021-11-23T08:09:29Z"}} {"running":{"startedAt":"2021-11-23T08:09:29Z"}}
...
Then it would be visible that the `Failed` status happens because the containers terminated. But I'm not sure if we need it; it could be confusing to have too much output.
Example of the JSON output about the pod `status` for a single pod from the `kubectl get pods -o json` command:
"status": {
"conditions": [
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:07Z",
"status": "True",
"type": "Initialized"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:12Z",
"status": "True",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:12Z",
"status": "True",
"type": "ContainersReady"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2021-11-11T16:15:07Z",
"status": "True",
"type": "PodScheduled"
}
],
"containerStatuses": [
{
"containerID": "containerd://78d3916f2dfca74427f8b0cc38cc344a746a2594c8fd6f746133e7a10b0cfdbb",
"image": "docker.io/reanahub/reana-workflow-controller:latest",
"imageID": "sha256:75e1ea2066193703f73e1c71fce92969a7b9c47b8af3e307dc2d5ee812e2d40d",
"lastState": {},
"name": "job-status-consumer",
"ready": true,
"restartCount": 0,
"started": true,
"state": {
"running": {
"startedAt": "2021-11-11T16:15:10Z"
}
}
},
{
"containerID": "containerd://4b915673af6c64cbec168cc8512969507544365d8ba35b3900f836836d384a26",
"image": "docker.io/reanahub/reana-workflow-controller:latest",
"imageID": "sha256:75e1ea2066193703f73e1c71fce92969a7b9c47b8af3e307dc2d5ee812e2d40d",
"lastState": {},
"name": "rest-api",
"ready": true,
"restartCount": 0,
"started": true,
"state": {
"running": {
"startedAt": "2021-11-11T16:15:11Z"
}
}
}
],
"hostIP": "172.18.0.2",
"phase": "Running",
"podIP": "10.244.0.117",
"podIPs": [
{
"ip": "10.244.0.117"
}
],
"qosClass": "BestEffort",
"startTime": "2021-11-11T16:15:07Z"
}
As described above, `phase` can show pods as `Failed` while they are terminating, and it is not so straightforward to extract more precise info from the above JSON. Some checks could be done on `containerStatuses` or, maybe, on `conditions`.
The current goal of the monitoring command is to collect data that can later be visualized along with the execution progress plot. With the additional `containerStatuses` output it is not clear how to visualize it (there is a `phase`, but also one or more containers in `containerStatuses`). I would prefer to reduce it to one field (`Running: 5`, `Failed: 1`, etc.), but the question is how :)
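One illustrative reduction (just a sketch; the heuristic and the helper name `_effective_phase` are assumptions, not what this PR implements) would be to keep `phase` as the single field but reinterpret `Failed` pods whose containers actually completed:

```python
from typing import Dict

def _effective_phase(pod: Dict) -> str:
    """Collapse phase + containerStatuses into a single status field."""
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    if phase == "Failed":
        # Hypothetical heuristic: if some container terminated with exit
        # code 0 ("Completed"), count the pod as Succeeded, assuming the
        # other containers were only killed during the graceful pod shutdown.
        for container in status.get("containerStatuses", []):
            terminated = container.get("state", {}).get("terminated")
            if terminated and terminated.get("exitCode") == 0:
                return "Succeeded"
    return phase
```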
If I understood correctly, when the pod is terminated, `SIGTERM` is sent by k8s, and apparently `job-controller` takes too much time to exit, so k8s sends `SIGKILL` (exit code 137) and marks the pod as `Failed`.
Example of the JSON:
"containerStatuses": [
{
"containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
"image": "docker.io/reanahub/reana-job-controller:latest",
"imageID": "sha256:268ad55769e69f1c41d448fb7b54a95dc074cba5e2e1cfa6a2642d8484756551",
"lastState": {},
"name": "job-controller",
"ready": false,
"restartCount": 0,
"started": false,
"state": {
"terminated": {
"containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
"exitCode": 137,
"finishedAt": "2021-11-24T10:28:32Z",
"reason": "Error",
"startedAt": "2021-11-24T10:27:44Z"
}
}
},
{
"containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
"image": "docker.io/reanahub/reana-workflow-engine-serial:latest",
"imageID": "sha256:5756537cf4d2673d2725e0824f6f6883062c363229d6c81a500b19a1487b8641",
"lastState": {},
"name": "workflow-engine",
"ready": false,
"restartCount": 0,
"started": false,
"state": {
"terminated": {
"containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
"exitCode": 0,
"finishedAt": "2021-11-24T10:28:02Z",
"reason": "Completed",
"startedAt": "2021-11-24T10:27:44Z"
}
}
}
],
Check the timestamps in `containerStatuses`: `workflow-engine` finished at `10:28:02`, so around that time REANA asks k8s to delete the pod. Now take a look at the `job-controller` finish time, `10:28:32`: exactly 30 seconds after `workflow-engine`, and 30 seconds is the default time in k8s for the graceful shutdown of a pod (details).
The same happened in the other batch pods: `job-controller` does not terminate within 30 seconds when k8s asks it to. @tiborsimko
suggestion: I think the above case deserves a separate issue. I can leave the `phase` value in the metric for this PR. I assume that once we figure out why the exit code is `137` instead of `0`, the `phase` value will show the `Succeeded` status for batch pods.
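As a quick sanity check of the 30-second hypothesis, the gap between container termination times can be computed straight from the JSON above (a small throwaway sketch; `_finish_gap` is a hypothetical helper, not part of this PR):

```python
from datetime import datetime
from typing import Dict, List

def _finish_gap(container_statuses: List[Dict]) -> float:
    """Seconds between the first and the last container termination in a pod."""
    finished = [
        datetime.strptime(c["state"]["terminated"]["finishedAt"], "%Y-%m-%dT%H:%M:%SZ")
        for c in container_statuses
        if "terminated" in c.get("state", {})
    ]
    return (max(finished) - min(finished)).total_seconds()

# For the example above: 10:28:32 - 10:28:02 -> 30.0 seconds.
```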
Opened the issue to fix the `Failed` phase for batch pods. We can proceed forward with this PR and leave `phase` as it is.
Force-pushed from a7fec64 to ed0a1bc.
Works nicely!
There was one more metric described in the issue which was not implemented: the DB workflow status. I think it would also be interesting to check it and compare whether it is consistent with the information that k8s provides.
kubectl exec deployment/reana-db -- psql -U reana -c "SELECT status,COUNT(*) FROM __reana.workflow WHERE name LIKE '$workflow-%' GROUP BY status;"
reana/reana_benchmark/cli.py (Outdated)
@reana_benchmark.command(name="monitor")
@workflow_option
@click.option(
    "--sleep", "-s", help="Sleep between querying in seconds", type=int, default=30,
Minor: perhaps we could mention that the default sleep time is 30 seconds?
Regarding the one missing metric, I can add it. I didn't do it initially because, technically, workflow statuses should be listed by the …
Force-pushed from 3a38210 to 21e7937.
Works nicely!
I left a couple of small comments. The PR also needs a rebase.
reana/reana_benchmark/cli.py (Outdated)
@click.option(
    "--sleep",
    "-s",
    help="Sleep in seconds between collecting metrics, default 30",
Very minor: I guess to be consistent with the `reana-client` and `reana-dev` command formats we could use `Sleep in seconds between collecting metrics [default=30]` as the help description here, WDYT?
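(As a side note, and only as a sketch of an alternative: Click can also render the default automatically via `show_default`, though it formats it as "[default: 30]" rather than the "[default=30]" convention mentioned above.)

```python
@click.option(
    "--sleep",
    "-s",
    help="Sleep in seconds between collecting metrics.",
    type=int,
    default=30,
    show_default=True,  # Click appends "[default: 30]" to the help text
)
```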
Done.
reana/reana_benchmark/monitor.py (Outdated)
METRICS = [
    NumberOfDBConnectionsMetric(),
    WorkflowPodsMetric(),
    WorkflowStatuses(),
Minor: for consistency we could add `Metric` to the end. Also, I think it would be a bit clearer to name it `WorkflowDBStatusesMetric` and, accordingly, `name = "workflow_db_statuses"`. I think that would be more in line with `workflows_pods_status`.
Done.
Force-pushed from 21e7937 to 18192c1.
The only thing missing for this PR is a rebase ("This branch is out-of-date with the base branch").
Force-pushed from 18192c1 to 41fa8f9.
This PR introduces the `monitor` command to the benchmark. It will collect all defined metrics (check the code for details), sleep for a defined time in seconds, and repeat the same process until it is terminated manually. Results are saved to a JSON file to be used in the `analyze` command later. (A rough sketch of this loop is shown at the end of this description.)

Closes #574

How to test:

1. (optional) Launch some workflows: `reana-benchmark launch -w monitor-test -n 1-20` (not `helloworld` as they are too fast; `atlas-demo` is okay);
2. Right after the workflows are started, start the monitor command: `reana-benchmark monitor --sleep 5 -w monitor-test`;
3. When the workflows are finished (or around that time), terminate the command using Ctrl+C;
4. Check the JSON output in `monitor-test_monitored_results.json`; you will see some metrics (currently 2) and the timestamps when they were collected.

Note: `-w` in the `monitor` command is not tied to the actual workflows started before; it is added only as a convention and a way to map the monitored results in the `analyze` command later.
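For illustration, here is a minimal sketch of the collect-sleep-repeat loop described above (the metric objects with a `name` attribute and a `collect()` method are assumptions for the example, not the PR's exact interface):

```python
import json
import time
from datetime import datetime

def monitor(workflow: str, sleep: int, metrics) -> None:
    """Collect all metrics every `sleep` seconds until interrupted with Ctrl+C."""
    results = []
    try:
        while True:
            results.append(
                {
                    "timestamp": datetime.utcnow().isoformat(),
                    "metrics": {m.name: m.collect() for m in metrics},
                }
            )
            time.sleep(sleep)
    except KeyboardInterrupt:
        # Dump everything collected so far, e.g. monitor-test_monitored_results.json
        with open(f"{workflow}_monitored_results.json", "w") as f:
            json.dump(results, f, indent=2)
```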