benchmark: new monitor command for DB and K8S statuses #574

Closed
tiborsimko opened this issue Nov 1, 2021 · 3 comments · Fixed by #593
@tiborsimko

(stems from #541 (comment))

Current behaviour

Currently, while running the benchmarking script, one can monitor the DB status and the K8S status independently via a script like:

#!/bin/sh
# Print DB and K8S status snapshots every 30 seconds.
# $1 is the workflow name prefix used to select workflows in the DB.

workflow=$1

# Print a timestamped section header.
header () {
    echo "$(date +%Y-%m-%dT%H:%M:%S) $1:"
}

# Count pods matching $1 per status and draw a histogram in which the most
# frequent status gets 60 '#' characters.
pod_histogram () {
    kubectl get pods | grep "$1" | grep -v STATUS | awk '{print $3}' | sort | uniq -c | sort -rn | head -20 |
        awk '!max{max=$1} {r=""; i=60*$1/max; while(i-->0) r=r"#"; printf "%25s %5d %s\n", $2, $1, r}'
}

while true; do

    header "db pg_stat_activity count"
    kubectl exec deployment/reana-db -- psql -U reana -c "SELECT COUNT(*) FROM pg_stat_activity;"

    header "db workflow.status count"
    kubectl exec deployment/reana-db -- psql -U reana -c "SELECT status,COUNT(*) FROM __reana.workflow WHERE name LIKE '$workflow-%' GROUP BY status;"

    header "run-b pods count"
    pod_histogram run-b

    header "run-j pods count"
    pod_histogram run-j

    echo

    sleep 30

done

This gives the following output at one particular moment in time:

2021-10-31T20:48:35 db pg_stat_activity count:
 count
-------
   182
(1 row)

2021-10-31T20:48:35 db workflow.status count:
 status  | count
---------+-------
 running |   177
 queued  |    89
 pending |   134
(3 rows)

2021-10-31T20:48:36 run-b pods count:
                  Running   190 ############################################################
        ContainerCreating     3 #
2021-10-31T20:48:36 run-j pods count:
                  Running   198 ############################################################
                 Init:0/1    16 #####
          PodInitializing     7 ###

and, 30 seconds later:

2021-10-31T20:49:07 db pg_stat_activity count:
 count
-------
   212
(1 row)

2021-10-31T20:49:07 db workflow.status count:
 status  | count
---------+-------
 pending |   197
 running |   203
(2 rows)

2021-10-31T20:49:08 run-b pods count:
                  Running   217 ############################################################
2021-10-31T20:49:09 run-j pods count:
                  Running   314 ############################################################
                 Init:0/1    15 ###
          PodInitializing     7 ##
                  Pending     1 #

These time snapshots make it possible to monitor the number of DB connections, the DB statuses vs. K8S statuses, and the number of "Running" pods vs. "Pending" pods, and to see how fast the pods terminate, etc., giving a complementary picture of what's happening in the cluster.

The trouble is that this "side" monitoring is somewhat detached from the main output of the benchmark scripts. It would be advantageous to correlate this information with the workflow burn down plots.

Expected behaviour

We can introduce a new command, monitor --sleep 30, which would do the above automatically and collect the information either in the textual format above (MVP) or, even better, in a CSV format that would later allow plotting how the measured DB and K8S quantities evolve as a function of time.
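A minimal sketch of what such a sampling loop could look like, assuming a Python implementation. The names `monitor` and `collect`, and the `iterations` parameter, are hypothetical and only illustrate the idea; a real `collect` would query the DB and K8S instead of returning a fixed value.

```python
import csv
import io
import time
from datetime import datetime
from typing import Callable, Dict, Optional, TextIO


def monitor(
    collect: Callable[[], Dict[str, int]],
    out: TextIO,
    sleep: float = 30.0,
    iterations: Optional[int] = None,
) -> None:
    """Sample metrics every `sleep` seconds and write them to `out` as CSV rows.

    `collect` is a hypothetical callback returning one flat snapshot,
    e.g. {"db_connections_number": 182}. `iterations=None` loops forever.
    """
    writer = None
    done = 0
    while iterations is None or done < iterations:
        row = {"monitored_date": datetime.now().isoformat(timespec="seconds")}
        row.update(collect())
        if writer is None:
            # Write the header once, based on the keys of the first snapshot.
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        out.flush()
        done += 1
        if iterations is None or done < iterations:
            time.sleep(sleep)


# Example with a fake collector and no waiting between samples.
buffer = io.StringIO()
monitor(lambda: {"db_connections_number": 182}, buffer, sleep=0, iterations=2)
print(buffer.getvalue())
```

Passing the collector in as a callback keeps the timing loop separate from the DB and K8S queries, so the loop can be exercised without a cluster.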

For example, once #573 is implemented, we shall have a "real time arrow" representation of the workflow burn down in the cluster, and the DB and K8S info plots will nicely complement the overall picture of what's happening in the cluster.

They might give graphical insight into the "orange hill" and "blue spread" phenomena, such as the transition of workflow pods through the "Running -> NotReady -> Terminating" statuses.

@tiborsimko tiborsimko added this to Backlog in ATLAS-pMSSM-Basics Nov 1, 2021
@VMois VMois moved this from Backlog to Ready for work in ATLAS-pMSSM-Basics Nov 16, 2021
@VMois VMois moved this from Ready for work to In work in ATLAS-pMSSM-Basics Nov 18, 2021
@VMois VMois self-assigned this Nov 18, 2021

VMois commented Nov 18, 2021

I would prefer to store data in CSV format (or JSON) to be able to connect it with collected_results.csv later.

After some investigation, it looks like more thought is needed on how the monitored data should be structured and saved.

  1. For example, the number of DB connections is "easy" to structure in a CSV file:

monitored_date,db_connections_number
2021-11-16T10:24:12,15
2021-11-16T10:25:12,20

If we want to add workflow statuses, it gets a bit more complicated:

monitored_date,db_connections_number,status,count
2021-11-16T10:24:12,15,running,5
2021-11-16T10:24:12,15,pending,2
2021-11-16T10:25:12,20,running,6
2021-11-16T10:25:12,20,pending,1

If we want to add pod statuses, it gets even more complicated:

monitored_date,db_connections_number,status,count,type,type_count
2021-11-16T10:24:12,15,running,5,run-b,5
2021-11-16T10:24:12,15,running,5,run-j,2
...

Splitting the data into multiple CSV files can help, but it will introduce more complexity in analyze, which would have to merge them together.
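To illustrate the merging cost, here is a rough sketch that joins two per-metric CSV files on monitored_date. The split into exactly these two files and the function name `merge_by_date` are assumptions made for the example; the values are the ones from the CSV samples above.

```python
import csv
import io
from collections import defaultdict

# Two hypothetical per-metric files, both keyed by monitored_date.
DB_CONNECTIONS_CSV = """monitored_date,db_connections_number
2021-11-16T10:24:12,15
2021-11-16T10:25:12,20
"""

WORKFLOW_STATUSES_CSV = """monitored_date,status,count
2021-11-16T10:24:12,running,5
2021-11-16T10:24:12,pending,2
2021-11-16T10:25:12,running,6
2021-11-16T10:25:12,pending,1
"""


def merge_by_date(connections_csv: str, statuses_csv: str) -> dict:
    """Join the two files on monitored_date into one record per timestamp."""
    merged = defaultdict(
        lambda: {"db_connections_number": None, "workflow_statuses": {}}
    )
    for row in csv.DictReader(io.StringIO(connections_csv)):
        record = merged[row["monitored_date"]]
        record["db_connections_number"] = int(row["db_connections_number"])
    for row in csv.DictReader(io.StringIO(statuses_csv)):
        record = merged[row["monitored_date"]]
        record["workflow_statuses"][row["status"]] = int(row["count"])
    return dict(merged)


merged = merge_by_date(DB_CONNECTIONS_CSV, WORKFLOW_STATUSES_CSV)
print(merged["2021-11-16T10:24:12"])
```

Every additional metric file would need its own loop (or a generic join) of this kind, which is the extra complexity mentioned above.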

  2. Another idea is to use a single JSON file instead of multiple CSV files:

{
    "2021-11-16T10:24:12": {
        "db_connection_number": 15,
        "workflow_statuses": {
            "running": 5,
            "pending": 2
        }
    }
}

This is a more flexible approach. The file can also be extended with new metrics by just adding a new entry under the 2021-11-16T10:24:12 key. In the analyze command, I can just use the key (date) to plot metrics.

P.S. While writing up my findings, I realized that JSON looks like a good idea. Writing stuff down helps a lot :)

P.P.S. This whole problem of how to save data is a classic "structured vs. non-structured data" debate.
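A small sketch of how the timestamp-keyed layout could be filled and read back, assuming Python's standard json module. The helper names `add_snapshot` and `time_series` and the `pod_statuses` metric are hypothetical and only serve the illustration.

```python
import json


def add_snapshot(data: dict, timestamp: str, metrics: dict) -> None:
    """Store (or extend) the metrics recorded under one timestamp key."""
    data.setdefault(timestamp, {}).update(metrics)


def time_series(data: dict, metric: str) -> list:
    """Return (timestamp, value) pairs for one metric, sorted by timestamp."""
    return [(ts, entry[metric]) for ts, entry in sorted(data.items()) if metric in entry]


monitored = {}
add_snapshot(
    monitored,
    "2021-11-16T10:24:12",
    {"db_connection_number": 15, "workflow_statuses": {"running": 5, "pending": 2}},
)
# A new (hypothetical) metric is added under the same key without changing the layout.
add_snapshot(monitored, "2021-11-16T10:24:12", {"pod_statuses": {"run-b": {"Running": 5}}})
add_snapshot(monitored, "2021-11-16T10:25:12", {"db_connection_number": 20})

# Round-trip through JSON text, as a file on disk would be read back.
restored = json.loads(json.dumps(monitored, indent=4))
print(time_series(restored, "db_connection_number"))
```

Because ISO 8601 timestamps sort lexicographically, sorting the keys as strings is enough to get the samples in chronological order.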


VMois commented Nov 18, 2021

suggestion: The point of this issue is to develop a monitor command only. I will add another issue that will focus on how the analyze command will use monitored data and plot it alongside what we have already.


VMois commented Nov 18, 2021

Another thing: I will use subprocess to execute commands and parse the output. It may not be as effective as using an API (such as a Python K8S library), but it is simpler to start with. We can improve it later if needed.
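As a rough sketch of the subprocess approach, the helpers below run a command and parse its text output. All function names are hypothetical, and the sample pod names are made up for the example. Unlike the shell script's grep, which matches anywhere in the line, the filter here is applied to the pod NAME column only.

```python
import subprocess
from collections import Counter


def run(command: list) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def count_pod_statuses(kubectl_output: str, name_filter: str) -> Counter:
    """Count pods per STATUS (third column) among pods whose NAME contains name_filter."""
    counts = Counter()
    for line in kubectl_output.splitlines():
        columns = line.split()
        # Skip the header row and anything too short to carry a status.
        if len(columns) < 3 or columns[0] == "NAME":
            continue
        if name_filter in columns[0]:
            counts[columns[2]] += 1
    return counts


def parse_psql_count(psql_output: str) -> int:
    """Extract the number from a single-value psql result table."""
    for line in psql_output.splitlines():
        if line.strip().isdigit():
            return int(line.strip())
    raise ValueError("no numeric row found in psql output")


# Against a live cluster one would call, for example:
#   pods = run(["kubectl", "get", "pods"])
# Here canned output is used so the parsing can be tried without a cluster.
SAMPLE_PODS = """NAME READY STATUS RESTARTS AGE
reana-run-batch-aaa 2/2 Running 0 5m
reana-run-batch-bbb 0/2 ContainerCreating 0 1m
reana-run-job-ccc 1/1 Running 0 2m
"""
SAMPLE_PSQL = " count\n-------\n   182\n(1 row)\n"

print(count_pod_statuses(SAMPLE_PODS, "run-b"))
print(parse_psql_count(SAMPLE_PSQL))
```

Keeping the parsing functions separate from `run` means they can be tested on captured output, which softens the main drawback of parsing command-line text.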

VMois pushed a commit to VMois/reana that referenced this issue Nov 19, 2021
@VMois VMois moved this from In work to In review in ATLAS-pMSSM-Basics Nov 19, 2021
VMois pushed a commit to VMois/reana that referenced this issue Nov 23, 2021
VMois pushed a commit to VMois/reana that referenced this issue Nov 24, 2021
VMois pushed a commit to VMois/reana that referenced this issue Nov 26, 2021
ATLAS-pMSSM-Basics automation moved this from In review to Done Nov 26, 2021
mdonadoni added a commit to mdonadoni/reana that referenced this issue Mar 5, 2024