benchmark: add monitor command #593

Merged: 1 commit merged into reanahub:master from the benchmark-monitor branch on Nov 26, 2021

Conversation

@VMois commented on Nov 19, 2021:

This PR introduces the monitor command to the benchmark. It collects all defined metrics (check the code for details), sleeps for a defined number of seconds, and repeats the same process until it is terminated manually. Results are saved to a JSON file to be used by the analyze command later.

closes #574

How to test:

  1. (optional) Launch some workflows with reana-benchmark launch -w monitor-test -n 1-20 (not helloworld, as those finish too fast; atlas-demo is okay);

  2. Right after the workflows are started, start the monitor command: reana-benchmark monitor --sleep 5 -w monitor-test;

  3. When workflows are finished (or around that time), terminate the command using Ctrl+C;

  4. Check the JSON output in monitor-test_monitored_results.json; you will see some metrics (currently 2) and the timestamps at which they were collected.

Note: -w in the monitor command is not tied to the actual workflows started before; it is added only as a convention and as a way to map monitored results to the analyze command later.
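For illustration, here is a minimal sketch of the collect-sleep-repeat loop described above, assuming a hypothetical placeholder metric and keeping only the output file name from the testing steps (this is not the PR's actual code):

import json
import time
from datetime import datetime, timezone


def collect_pod_phases() -> dict:
    # Hypothetical placeholder for a real metric, e.g. counting pods
    # per phase via "kubectl get pods -o json" (see the review below).
    return {"Running": 0}


METRICS = [collect_pod_phases]  # the PR defines more metrics than this


def monitor(workflow: str, sleep: int = 30) -> None:
    results = []
    try:
        while True:
            snapshot = {"collected_at": datetime.now(timezone.utc).isoformat()}
            for metric in METRICS:
                snapshot[metric.__name__] = metric()
            results.append(snapshot)
            time.sleep(sleep)
    except KeyboardInterrupt:
        # Ctrl+C stops collection; results go to the JSON file that the
        # analyze command consumes later.
        with open(f"{workflow}_monitored_results.json", "w") as out:
            json.dump(results, out, indent=2)


if __name__ == "__main__":
    monitor("monitor-test", sleep=5)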

@VMois force-pushed the benchmark-monitor branch 3 times, most recently from 24a9eed to a7fec64, on November 19, 2021 13:59
def _count(pods: List[Dict]) -> Dict[str, int]:
    statistics = defaultdict(lambda: 0)
    for pod in pods:
        phase = pod.get("status", {}).get("phase")
@VMois (Author):

question: The phase value does not represent exactly what kubectl get pods (the STATUS column) displays. It has Pending, Running and Succeeded statuses. In addition, it has a Failed status which, for some reason, appears for batch pods at the end even when workflows finished successfully.

I guess this happens because batch pods enter the NotReady state once the workflow engine has finished successfully and exited. You cannot easily get the NotReady state from the kubectl JSON output; you need to deduce it from the container statuses.

What do you think about this situation? Any ideas on how to get a more accurate status?

@audrium (Member):

The phase values are described here. I think it's a pretty accurate metric for us to use. But this case of a Failed status for successfully finished workflows could be confusing. I guess one way we could solve it is to include the containerStatuses in the final output (the downside is that it will make it much more verbose). Here is an example:

kubectl get pod -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state}{"\n"}{end}'
...
reana-server-69c494847b-fxldq   Running {"running":{"startedAt":"2021-11-23T08:09:29Z"}} {"running":{"startedAt":"2021-11-23T08:09:29Z"}}
...

Then it would be visible that the Failed status happens because the containers terminated. But I'm not sure if we need it; it could be confusing to have too much output.

@VMois (Author) commented on Nov 23, 2021:

Example of the pod status portion of the JSON output for a single pod from the kubectl get pods -o json command:

"status": {
      "conditions": [
          {
              "lastProbeTime": null,
              "lastTransitionTime": "2021-11-11T16:15:07Z",
              "status": "True",
              "type": "Initialized"
          },
          {
              "lastProbeTime": null,
              "lastTransitionTime": "2021-11-11T16:15:12Z",
              "status": "True",
              "type": "Ready"
          },
          {
              "lastProbeTime": null,
              "lastTransitionTime": "2021-11-11T16:15:12Z",
              "status": "True",
              "type": "ContainersReady"
          },
          {
              "lastProbeTime": null,
              "lastTransitionTime": "2021-11-11T16:15:07Z",
              "status": "True",
              "type": "PodScheduled"
          }
      ],
      "containerStatuses": [
          {
              "containerID": "containerd://78d3916f2dfca74427f8b0cc38cc344a746a2594c8fd6f746133e7a10b0cfdbb",
              "image": "docker.io/reanahub/reana-workflow-controller:latest",
              "imageID": "sha256:75e1ea2066193703f73e1c71fce92969a7b9c47b8af3e307dc2d5ee812e2d40d",
              "lastState": {},
              "name": "job-status-consumer",
              "ready": true,
              "restartCount": 0,
              "started": true,
              "state": {
                  "running": {
                      "startedAt": "2021-11-11T16:15:10Z"
                  }
              }
          },
          {
              "containerID": "containerd://4b915673af6c64cbec168cc8512969507544365d8ba35b3900f836836d384a26",
              "image": "docker.io/reanahub/reana-workflow-controller:latest",
              "imageID": "sha256:75e1ea2066193703f73e1c71fce92969a7b9c47b8af3e307dc2d5ee812e2d40d",
              "lastState": {},
              "name": "rest-api",
              "ready": true,
              "restartCount": 0,
              "started": true,
              "state": {
                  "running": {
                      "startedAt": "2021-11-11T16:15:11Z"
                  }
              }
          }
      ],
      "hostIP": "172.18.0.2",
      "phase": "Running",
      "podIP": "10.244.0.117",
      "podIPs": [
          {
              "ip": "10.244.0.117"
          }
      ],
      "qosClass": "BestEffort",
      "startTime": "2021-11-11T16:15:07Z"
}

As described above, phase can show pods as Failed while they are terminating, and it is not so straightforward to extract more precise info from the above JSON. Some checks could be done on containerStatuses or, maybe, on conditions.

@VMois (Author):

The current goal of the monitor command is to collect data that can later be visualized along with the execution progress plot. With the additional containerStatuses output it is not clear how to visualize it (there is a phase, but also one or more containers in containerStatuses). I would prefer to reduce it to one field (Running: 5, Failed: 1, etc.), but the question is how :)

@VMois (Author) commented on Nov 24, 2021:

If I understood correctly, when the pod is terminated, SIGTERM is sent by k8s; apparently, the job controller takes too much time to exit, so k8s then sends SIGKILL (exit code 137) and marks the pod as Failed.

Example of JSON:

"containerStatuses": [
      {
        "containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
        "image": "docker.io/reanahub/reana-job-controller:latest",
        "imageID": "sha256:268ad55769e69f1c41d448fb7b54a95dc074cba5e2e1cfa6a2642d8484756551",
        "lastState": {},
        "name": "job-controller",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": {
          "terminated": {
            "containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
            "exitCode": 137,
            "finishedAt": "2021-11-24T10:28:32Z",
            "reason": "Error",
            "startedAt": "2021-11-24T10:27:44Z"
          }
        }
      },
      {
        "containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
        "image": "docker.io/reanahub/reana-workflow-engine-serial:latest",
        "imageID": "sha256:5756537cf4d2673d2725e0824f6f6883062c363229d6c81a500b19a1487b8641",
        "lastState": {},
        "name": "workflow-engine",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": {
          "terminated": {
            "containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
            "exitCode": 0,
            "finishedAt": "2021-11-24T10:28:02Z",
            "reason": "Completed",
            "startedAt": "2021-11-24T10:27:44Z"
          }
        }
      }
    ],

Check the timestamps in containerStatuses. workflow-engine finished at 10:28:02, so around that time REANA asks k8s to delete the pod. Now take a look at the job-controller finish time: 10:28:32, exactly 30 seconds after workflow-engine. 30 seconds is the default time k8s allows for the graceful shutdown of a pod (details).

The same happened in the other batch pods: the job controller does not terminate within 30 seconds when k8s asks it to. @tiborsimko

suggestion: I think the above case deserves a separate issue. I can leave the phase value in the metric for this PR. I assume that once we figure out why the exit code is 137 instead of 0, the phase value will show the Succeeded status for batch pods.
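For future reference, here is a sketch of how phase and containerStatuses could be collapsed into a single, more forgiving status using the exit codes above (a heuristic only; as suggested, the PR keeps the plain phase value):

from typing import Dict


def derive_pod_status(pod: Dict) -> str:
    # Heuristic sketch: a pod reported as Failed is counted as Succeeded
    # when every container either completed with exit code 0 or was
    # SIGKILLed (exit code 137) while the pod was being deleted, which is
    # what happens to the batch pods discussed above.
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    if phase != "Failed":
        return phase
    exit_codes = []
    for container in status.get("containerStatuses", []):
        terminated = container.get("state", {}).get("terminated")
        if terminated is None:
            return phase  # a container is still running or waiting
        exit_codes.append(terminated.get("exitCode"))
    if exit_codes and all(code in (0, 137) for code in exit_codes):
        return "Succeeded"
    return phase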

@VMois (Author):

Opened the issue to fix the Failed phase for batch pods. We can proceed with this PR and leave the phase as it is.

@VMois changed the title from "WIP: benchmark: add monitor command" to "benchmark: add monitor command" on Nov 19, 2021
@VMois marked this pull request as ready for review on November 19, 2021 14:29
@audrium (Member) left a comment:

Works nicely!

There was one more metric described in the issue which was not implemented: the DB workflow status. I think it would also be interesting to collect it and compare whether it is consistent with the information which k8s provides.

    kubectl exec deployment/reana-db -- psql -U reana -c "SELECT status,COUNT(*) FROM __reana.workflow WHERE name LIKE '$workflow-%' GROUP BY status;"
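For illustration, a metric along these lines could shell out to the same query; this is a sketch only, with an assumed function name and output parsing, not the PR's actual interface:

import subprocess
from typing import Dict


def collect_db_workflow_statuses(workflow: str) -> Dict[str, int]:
    # Run the psql query suggested above inside the reana-db deployment
    # and parse the "status|count" rows into a dictionary.
    query = (
        "SELECT status,COUNT(*) FROM __reana.workflow "
        f"WHERE name LIKE '{workflow}-%' GROUP BY status;"
    )
    output = subprocess.check_output(
        ["kubectl", "exec", "deployment/reana-db", "--",
         "psql", "-U", "reana", "-t", "-A", "-F", "|", "-c", query],
        text=True,
    )
    statuses: Dict[str, int] = {}
    for line in output.splitlines():
        if "|" in line:
            status, count = line.split("|", 1)
            statuses[status.strip()] = int(count)
    return statuses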


@reana_benchmark.command(name="monitor")
@workflow_option
@click.option(
    "--sleep", "-s", help="Sleep between querying in seconds", type=int, default=30,
@audrium (Member):

Minor: perhaps we could mention that the default sleep time is 30 seconds?

reana/reana_benchmark/monitor.py (resolved review thread)
@VMois (Author) commented on Nov 23, 2021:

Regarding the one missing metric, I can add it. I didn't do it initially because, technically, workflow statuses should already be listed by the reana-client list command (which is used in collect already).

@audrium (Member) left a comment:

Works nicely!
I left a couple of small comments. The PR also needs a rebase.

@click.option(
    "--sleep",
    "-s",
    help="Sleep in seconds between collecting metrics, default 30",
@audrium (Member):

Very minor: I guess, to be consistent with the reana-client and reana-dev command format, we could use

Sleep in seconds between collecting metrics [default=30]

as the help description here, WDYT?
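Put together with the snippet above, the option could then look something like this (just an illustration of the suggested wording):

@click.option(
    "--sleep",
    "-s",
    help="Sleep in seconds between collecting metrics [default=30]",
    type=int,
    default=30,
)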

@VMois (Author):

Done.

METRICS = [
    NumberOfDBConnectionsMetric(),
    WorkflowPodsMetric(),
    WorkflowStatuses(),
@audrium (Member):

Minor: for consistency we could add Metric to the end. Also, I think it would be a bit clearer to name it WorkflowDBStatusesMetric and, accordingly, name = "workflow_db_statuses". I think it would be more in line with workflows_pods_status.

@VMois (Author):

Done

@audrium (Member) left a comment:

The only thing missing for this PR is a rebase ("This branch is out-of-date with the base branch").

@audrium merged commit 41fa8f9 into reanahub:master on Nov 26, 2021
Successfully merging this pull request may close these issues.

benchmark: new monitor command for DB and K8S statuses