Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

devops: basic cluster monitoring #251

Closed
3 tasks done
diegodelemos opened this issue Jan 17, 2020 · 3 comments
Closed
3 tasks done

devops: basic cluster monitoring #251

diegodelemos opened this issue Jan 17, 2020 · 3 comments

Comments

@diegodelemos
Copy link
Member

diegodelemos commented Jan 17, 2020

Get ready for quick monitoring and resolution for failed workflows. For example, a snapshot from production cluster today:

$ kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
batch-serial-21a304f9-f955-4c70-98db-e04794279fde-d2v6s   1/2     Running   0          7d17h
batch-serial-f699e6f2-e26c-40e1-96c6-ae8105c7cd55-9sg9q   1/2     Running   0          3d
batch-yadage-632a68c9-00c2-43af-af12-263e2d0cdec2-cfrxt   1/2     Running   0          7d17h
batch-yadage-a72c3d98-9744-4d77-ac3f-0583d6586506-z5z9x   1/2     Running   0          7d17h
batch-yadage-f2e9f2a3-46df-4de8-bddf-8991bc8caa72-4vzlb   1/2     Running   0          3d
cache-f74dd5fd8-rrwcg                                     1/1     Running   0          9d
d92872ad-e85d-4b5c-aa31-ecb361e23d8b-xm9ps                0/1     Error     0          42h
db-6f8f8579c9-hv427                                       1/1     Running   0          9d
message-broker-789ff97759-hnjvc                           1/1     Running   0          9d
server-76d6b4d68c-s7h7c                                   2/2     Running   0          9d
ui-59884b5965-kw8nt                                       1/1     Running   0          9d
workflow-controller-7d75898f5d-knllj                      2/2     Running   0          9d

Note: Error pods not garbage-collected, workflows apparently still running, etc.

@diegodelemos diegodelemos added this to Ready for work in Awesome-Workshop-Basics Jan 17, 2020
@diegodelemos diegodelemos moved this from Ready for work to In work in Awesome-Workshop-Basics Jan 31, 2020
@diegodelemos diegodelemos self-assigned this Jan 31, 2020
@diegodelemos
Copy link
Member Author

For now, we will be able to use Grafana and the Kubernetes dashboard, documented in https://digital-repositories.web.cern.ch/digital-repositories/reana/operations#monitoring

Creation of new Grafana dashboards can be done later as it is a well-isolated task ⏳

@roksys @tiborsimko @mvidalgarcia also, as it is documented in the internal docs, we have a central user for Grafana... we could explore the possibility of having one user per person... but that could also be messy as this is not integrating with central egroups or users, WDYT?

@diegodelemos diegodelemos moved this from In work to Ready for work in Awesome-Workshop-Basics Feb 4, 2020
@mvidalgarcia
Copy link
Member

@roksys @tiborsimko @mvidalgarcia also, as it is documented in the internal docs, we have a central user for Grafana... we could explore the possibility of having one user per person... but that could also be messy as this is not integrating with central egroups or users, WDYT?

As long as we have the secrets on tbag or similar it's ok for me.

@diegodelemos diegodelemos removed this from Ready for work in Awesome-Workshop-Basics Feb 6, 2020
@diegodelemos
Copy link
Member Author

Closing as basic monitoring is up and running.

Awesome-Workshop-Morrows automation moved this from Backlog to Done Nov 2, 2020
mdonadoni added a commit to mdonadoni/reana that referenced this issue Mar 5, 2024
chore(reana-ui/master): release 0.9.4

build(reana-ui/package): update yarn.lock (reanahub#399)

build(reana-ui/package): require jsroot<7.6.0 (reanahub#399)

ci(reana-ui/commitlint): allow release commit style (reanahub#400)

docs(reana-ui/authors): complete list of contributors (reanahub#396)

ci(reana-ui/shellcheck): exclude node_modules from the analyzed paths (reanahub#387)

fix(reana-ui/progress): update failed workflows duration using finish time (reanahub#387)

feat(reana-ui/footer): link privacy notice to configured URL (reanahub#393)

refactor(reana-ui/docs): move from reST to Markdown (reanahub#391)

ci(reana-ui/commitlint): check for the presence of concrete PR number (reanahub#390)

ci(reana-ui/shellcheck): fix exit code propagation (reanahub#390)

fix(reana-ui/launcher): remove dollar sign in generated Markdown (reanahub#389)

ci(reana-ui/release-please): update version in package.json and Dockerfile (reanahub#385)

ci(reana-ui/release-please): switch to `simple` release strategy (reanahub#383)

fix(reana-ui/router): show 404 page for invalid URLs (reanahub#382)

ci(reana-ui/release-please): initial configuration (reanahub#380)

ci(reana-ui/commitlint): addition of commit message linter (reanahub#380)

chore(reana-message-broker/master): release 0.9.3

ci(reana-message-broker/commitlint): allow release commit style (reanahub#67)

docs(reana-message-broker/authors): complete list of contributors (reanahub#66)

refactor(reana-message-broker/docs): move from reST to Markdown (reanahub#65)

ci(reana-message-broker/commitlint): check for the presence of concrete PR number (reanahub#64)

ci(reana-message-broker/shellcheck): fix exit code propagation (reanahub#64)

ci(reana-message-broker/release-please): update version in Dockerfile (reanahub#63)

fix(reana-message-broker/startup): handle signals for graceful shutdown (reanahub#59)

ci(reana-message-broker/release-please): initial configuration (reanahub#60)

ci(reana-message-broker/commitlint): addition of commit message linter (reanahub#60)

chore(reana-server/master): release 0.9.3

build(reana-server/python): bump all required packages as of 2024-03-04 (reanahub#674)

build(reana-server/python): bump shared REANA packages as of 2024-03-04 (reanahub#674)

build(reana-server/python): bump shared modules (reanahub#676)

ci(reana-server/commitlint): allow release commit style (reanahub#675)

docs(reana-server/authors): complete list of contributors (reanahub#673)

ci(reana-server/pytest): move to PostgreSQL 14.10 (reanahub#672)

refactor(reana-server/docs): move from reST to Markdown (reanahub#671)

style(reana-server/black): format with black v24 (reanahub#670)

ci(reana-server/commitlint): check for the presence of concrete PR number (reanahub#669)

ci(reana-server/shellcheck): fix exit code propagation (reanahub#669)

ci(reana-server/release-please): update version in Dockerfile/OpenAPI specs (reanahub#668)

build(reana-server/docker): non-editable submodules in "latest" mode (reanahub#656)

build(reana-server/deps): pin invenio-userprofiles to 1.2.4 (reanahub#665)

ci(reana-server/release-please): initial configuration (reanahub#665)

ci(reana-server/commitlint): addition of commit message linter (reanahub#665)

chore(reana-workflow-controller/master): release 0.9.3

build(reana-workflow-controller/python): bump all required packages as of 2024-03-04 (reanahub#574)

build(reana-workflow-controller/python): bump shared REANA packages as of 2024-03-04 (reanahub#574)

feat(reana-workflow-controller/manager): increase termination period of run-batch pods (reanahub#572)

ci(reana-workflow-controller/commitlint): allow release commit style (reanahub#575)

feat(reana-workflow-controller/manager): pass custom env variables to job controller (reanahub#571)

feat(reana-workflow-controller/manager): pass custom env variables to workflow engines (reanahub#571)

docs(reana-workflow-controller/authors): complete list of contributors (reanahub#570)

ci(reana-workflow-controller/pytest): move to PostgreSQL 14.10 (reanahub#568)

fix(reana-workflow-controller/manager): use valid group name when calling `groupadd` (reanahub#566)

refactor(reana-workflow-controller/docs): move from reST to Markdown (reanahub#567)

fix(reana-workflow-controller/stop): store engine logs of stopped workflow (reanahub#563)

fix(reana-workflow-controller/manager): graceful shutdown of job-controller (reanahub#559)

feat(reana-workflow-controller/manager): call shutdown endpoint before workflow stop (reanahub#559)

refactor(reana-workflow-controller/consumer): do not update status of jobs (reanahub#559)

style(reana-workflow-controller/black): format with black v24 (reanahub#564)

ci(reana-workflow-controller/commitlint): check for the presence of concrete PR number (reanahub#562)

ci(reana-workflow-controller/shellcheck): fix exit code propagation (reanahub#562)

ci(reana-workflow-controller/release-please): update version in Dockerfile/OpenAPI specs (reanahub#558)

build(reana-workflow-controller/docker): non-editable submodules in "latest" mode (reanahub#551)

ci(reana-workflow-controller/release-please): initial configuration (reanahub#555)

ci(reana-workflow-controller/commitlint): addition of commit message linter (reanahub#555)

chore(reana-job-controller/master): release 0.9.3

build(reana-job-controller/python): bump all required packages as of 2024-03-04 (reanahub#442)

build(reana-job-controller/python): bump shared REANA packages as of 2024-03-04 (reanahub#442)

ci(reana-job-controller/commitlint): allow release commit style (reanahub#443)

build(reana-job-controller/certificates): update expired CERN Grid CA certificate (reanahub#440)

fix(reana-job-controller/database): limit the number of open database connections (reanahub#437)

docs(reana-job-controller/authors): complete list of contributors (reanahub#434)

perf(reana-job-controller/cache): avoid caching jobs when the cache is disabled (reanahub#435)

ci(reana-job-controller/pytest): move to PostgreSQL 14.10 (reanahub#429)

refactor(reana-job-controller/docs): move from reST to Markdown (reanahub#428)

ci(reana-job-controller/commitlint): check for the presence of concrete PR number (reanahub#425)

ci(reana-job-controller/shellcheck): fix exit code propagation (reanahub#425)

feat(reana-job-controller/shutdown): stop all running jobs before stopping workflow (reanahub#423)

refactor(reana-job-controller/monitor): move fetching of logs to job-manager (reanahub#423)

refactor(reana-job-controller/db): set job status also in the main database (reanahub#423)

refactor(reana-job-controller/monitor): centralise logs and status updates (reanahub#423)

style(reana-job-controller/black): format with black v24 (reanahub#426)

ci(reana-job-controller/release-please): update version in Dockerfile/OpenAPI specs (reanahub#421)

build(reana-job-controller/docker): non-editable submodules in "latest" mode (reanahub#416)

ci(reana-job-controller/release-please): initial configuration (reanahub#417)

ci(reana-job-controller/commitlint): addition of commit message linter (reanahub#417)

chore(reana-workflow-engine-cwl/master): release 0.9.3

build(reana-workflow-engine-cwl/python): bump all required packages as of 2024-03-04 (reanahub#267)

build(reana-workflow-engine-cwl/python): bump shared REANA packages as of 2024-03-04 (reanahub#267)

docs(reana-workflow-engine-cwl/conformance-tests): update CWL conformance test badges (reanahub#264)

ci(reana-workflow-engine-cwl/commitlint): allow release commit style (reanahub#268)

docs(reana-workflow-engine-cwl/authors): complete list of contributors (reanahub#266)

refactor(reana-workflow-engine-cwl/docs): move from reST to Markdown (reanahub#263)

fix(reana-workflow-engine-cwl/progress): handle stopped jobs (reanahub#260)

ci(reana-workflow-engine-cwl/commitlint): check for the presence of concrete PR number (reanahub#262)

ci(reana-workflow-engine-cwl/shellcheck): fix exit code propagation (reanahub#262)

build(reana-workflow-engine-cwl/docker): install correct extras of reana-commons submodule (reanahub#261)

ci(reana-workflow-engine-cwl/release-please): update version in Dockerfile (reanahub#259)

build(reana-workflow-engine-cwl/docker): non-editable submodules in "latest" mode (reanahub#255)

ci(reana-workflow-engine-cwl/release-please): initial configuration (reanahub#256)

ci(reana-workflow-engine-cwl/commitlint): addition of commit message linter (reanahub#256)

chore(reana-workflow-engine-serial/master): release 0.9.3

build(reana-workflow-engine-serial/python): bump all required packages as of 2024-03-04 (reanahub#200)

build(reana-workflow-engine-serial/python): bump shared REANA packages as of 2024-03-04 (reanahub#200)

ci(reana-workflow-engine-serial/commitlint): allow release commit style (reanahub#201)

docs(reana-workflow-engine-serial/authors): complete list of contributors (reanahub#199)

refactor(reana-workflow-engine-serial/docs): move from reST to Markdown (reanahub#198)

fix(reana-workflow-engine-serial/progress): handle stopped jobs (reanahub#195)

ci(reana-workflow-engine-serial/commitlint): check for the presence of concrete PR number (reanahub#197)

ci(reana-workflow-engine-serial/shellcheck): fix exit code propagation (reanahub#197)

build(reana-workflow-engine-serial/docker): install correct extras of reana-commons submodule (reanahub#196)

ci(reana-workflow-engine-serial/release-please): update version in Dockerfile (reanahub#194)

build(reana-workflow-engine-serial/docker): non-editable submodules in "latest" mode (reanahub#190)

ci(reana-workflow-engine-serial/release-please): initial configuration (reanahub#191)

ci(reana-workflow-engine-serial/commitlint): addition of commit message linter (reanahub#191)

chore(reana-workflow-engine-yadage/master): release 0.9.4

build(reana-workflow-engine-yadage/python): bump all required packages as of 2024-03-04 (reanahub#261)

build(reana-workflow-engine-yadage/python): bump shared REANA packages as of 2024-03-04 (reanahub#261)

ci(reana-workflow-engine-yadage/commitlint): allow release commit style (reanahub#262)

docs(reana-workflow-engine-yadage/authors): complete list of contributors (reanahub#260)

refactor(reana-workflow-engine-yadage/docs): move from reST to Markdown (reanahub#259)

fix(reana-workflow-engine-yadage/progress): correctly handle running and stopped jobs (reanahub#258)

ci(reana-workflow-engine-yadage/commitlint): check for the presence of concrete PR number (reanahub#257)

ci(reana-workflow-engine-yadage/shellcheck): fix exit code propagation (reanahub#257)

build(reana-workflow-engine-yadage/docker): install correct extras of reana-commons submodule (reanahub#256)

ci(reana-workflow-engine-yadage/release-please): update version in Dockerfile (reanahub#254)

build(reana-workflow-engine-yadage/docker): non-editable submodules in "latest" mode (reanahub#249)

ci(reana-workflow-engine-yadage/release-please): initial configuration (reanahub#251)

ci(reana-workflow-engine-yadage/commitlint): addition of commit message linter (reanahub#251)

chore(reana-workflow-engine-snakemake/master): release 0.9.3

build(reana-workflow-engine-snakemake/python): bump all required packages as of 2024-03-04 (reanahub#85)

build(reana-workflow-engine-snakemake/python): bump shared REANA packages as of 2024-03-04 (reanahub#85)

ci(reana-workflow-engine-snakemake/commitlint): allow release commit style (reanahub#86)

feat(reana-workflow-engine-snakemake/config): get max number of parallel jobs from env vars (reanahub#84)

feat(reana-workflow-engine-snakemake/executor): upgrade to Snakemake v7.32.4 (reanahub#81)

docs(reana-workflow-engine-snakemake/authors): complete list of contributors (reanahub#83)

refactor(reana-workflow-engine-snakemake/docs): move from reST to Markdown (reanahub#82)

fix(reana-workflow-engine-snakemake/progress): handle stopped jobs (reanahub#78)

ci(reana-workflow-engine-snakemake/commitlint): check for the presence of concrete PR number (reanahub#80)

ci(reana-workflow-engine-snakemake/shellcheck): fix exit code propagation (reanahub#80)

build(reana-workflow-engine-snakemake/docker): install correct extras of reana-commons submodule (reanahub#79)

ci(reana-workflow-engine-snakemake/release-please): update version in Dockerfile (reanahub#77)

build(reana-workflow-engine-snakemake/docker): non-editable submodules in "latest" mode (reanahub#73)

ci(reana-workflow-engine-snakemake/release-please): initial configuration (reanahub#74)

ci(reana-workflow-engine-snakemake/commitlint): addition of commit message linter (reanahub#74)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

2 participants