job-status-consumer: improve handling of "not alive" workflows #437

Open
VMois opened this issue Mar 28, 2022 · 2 comments · May be fixed by #445

VMois commented Mar 28, 2022

Some workflows that have a "not alive" status (finished, failed, deleted, stopped, etc.) in the DB can continue running on Kubernetes and even start reporting status to job-status-consumer.

Example of such a workflow from the job-status-consumer logs:

2022-03-24 14:20:16,312 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 1, "message": {"progress": {"total": {"total": 1, "job_ids": []}}}}
2022-03-24 14:20:16,315 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 1, "message": {"progress": {"running": {"total": 1, "job_ids": ["51060737-7e03-4255-a779-0e71e061e5f5"]}}}}
...
2022-03-24 14:20:16,860 | root | MainThread | WARNING | Event for not alive workflow 9b67170b-33ce-4dc3-8150-99e490afcade received:
{"workflow_uuid": "9b67170b-33ce-4dc3-8150-99e490afcade", "logs": "", "status": 2, "message": {"progress": {"finished": {"total": 1, "job_ids": ["51060737-7e03-4255-a779-0e71e061e5f5"]}}}}

Workflow 9b67170b-33ce-4dc3-8150-99e490afcade is reported as deleted in the DB, but, according to the logs above, it even finished its execution. A user reported that some of their workflows were stuck in "pending".
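For context, below is a minimal sketch of the kind of status check that could produce the warnings above. All names (the RunStatus values, ALIVE_STATUSES, the message handler) are illustrative assumptions, not the actual reana-workflow-controller code.

import json
import logging
from enum import Enum


class RunStatus(Enum):
    # Illustrative values only; the real RunStatus enum lives in reana-db.
    created = 0
    running = 1
    finished = 2
    failed = 3
    deleted = 4
    stopped = 5
    queued = 6
    pending = 7


# Workflows in these states are still expected to send job-status events.
ALIVE_STATUSES = {RunStatus.created, RunStatus.pending, RunStatus.queued, RunStatus.running}


def on_job_status_message(workflow, body):
    """Handle a single job-status message for the given workflow DB record."""
    if workflow.status not in ALIVE_STATUSES:
        # Issue #443: also log the concrete DB status, so it is clear whether
        # the workflow was deleted, stopped, finished or failed.
        logging.warning(
            "Event for not alive workflow %s (status: %s) received:\n%s",
            workflow.id_, workflow.status.name, json.dumps(body),
        )
        return
    # ... normal handling: update job statuses, progress, logs, etc. ...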

Looking into the code:

def delete_workflow(workflow, all_runs=False, workspace=False):
    """Delete workflow."""
    if workflow.status in [
        RunStatus.created,
        RunStatus.finished,
        RunStatus.stopped,
        RunStatus.deleted,
        RunStatus.failed,
        RunStatus.queued,
        RunStatus.pending,

In delete_workflow, it is possible to delete a "pending" workflow, or, more precisely, to mark the workflow as deleted in the DB. This probably means that the workflow was deleted between reaching the pending status and the actual start of the workflow-engine pod (and the sending of its first message to the job-status queue).

An open question:

  • Why did the workflow stay in the "pending" state for so long, and why did it then actually start running on Kubernetes?

How to reproduce

  1. Start the helloworld workflow: reana-client run -w test
  2. As soon as the workflow starts, wait 3-5 seconds (it should go to the pending state) and delete it: reana-client delete -w test
  3. Check reana-client list; it will show that the test workflow is deleted.
  4. Check kubectl get pods; you will find the batch pod in a NotReady state (and it will stay like this).
  5. Check kubectl logs deployment/reana-workflow-controller job-status-consumer; it will show that the workflow was not in an alive state but still continued to execute.

Next actions

  • set better logging for "not alive" workflows, so that it is clear from the logs in which state a workflow is when it is not alive (job-status-consumer: improve logging of "not alive" workflows #443)

  • if a workflow is deleted, it can still be running on Kubernetes; this somehow needs to be detected and fixed so that we do not have hanging workflows (a rough sketch of one possible approach follows below)
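A hypothetical sketch of the second item, assuming the batch job can be found by a workflow-UUID label; the label name and cleanup flow are assumptions, not the actual REANA implementation.

import logging

from kubernetes import client, config


def cleanup_not_alive_workflow(workflow_uuid, namespace="default"):
    """Best-effort removal of the batch job of a workflow that is no longer alive."""
    config.load_incluster_config()  # assumption: the consumer runs inside the cluster
    batch_api = client.BatchV1Api()
    # Assumption: the run-batch job carries a label with the workflow UUID.
    jobs = batch_api.list_namespaced_job(
        namespace, label_selector=f"reana-run-batch-workflow-uuid={workflow_uuid}"
    )
    for job in jobs.items:
        logging.warning("Deleting hanging batch job %s", job.metadata.name)
        batch_api.delete_namespaced_job(
            job.metadata.name,
            namespace,
            body=client.V1DeleteOptions(propagation_policy="Background"),
        )

Such a cleanup could be triggered from the job-status-consumer whenever an event arrives for a workflow whose DB status is not alive.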

@VMois VMois changed the title job-status-consumer: make it clear what status "not alive workflow" event has in logs, check and clean "not alive workflow" pods job-status-consumer: improve handling of "not alive" workflows Mar 28, 2022
@VMois VMois self-assigned this Mar 31, 2022
VMois pushed commits to VMois/reana-workflow-controller that referenced this issue Apr 4, 2022

VMois commented Apr 4, 2022

set better logging for "Not alive" workflow, so it is clear from the logs in which state workflow is if it is not alive

This is easy to do, and I have already opened a PR for it.

if the workflow is deleted it can still be running on Kubernetes, it somehow needs to be detected and fixed so we do not have hanging workflows.

This logic is trickier and will require more time; maybe it should even be handled differently. So this will be done in a separate PR.


VMois commented Apr 4, 2022

if the workflow is deleted it can still be running on Kubernetes, it somehow needs to be detected and fixed so we do not have hanging workflows.

One easy way of solving the problem is to forbid deleting a workflow in the pending state, because technically it has already started. But I am not sure how this might impact UX, as in the case above the workflow was in the pending state for a long time and was blocking other workflows. cc: @tiborsimko

Update: allowing workflows in the pending state to be deleted was requested by users for the case where a workflow is stuck for a long time, so we cannot revert this change.
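Given that, an alternative direction (purely a hypothetical sketch, not the actual fix from the linked PR) is to make the delete path itself tear down any already-scheduled batch job before flipping the DB status:

def delete_workflow_with_cleanup(workflow, delete_batch_job):
    """Mark the workflow as deleted and remove Kubernetes resources it may already have.

    ``delete_batch_job`` is a stand-in for a helper such as the
    ``cleanup_not_alive_workflow`` sketch above.
    """
    # Assumption: statuses are plain strings here; the real code compares RunStatus values.
    if workflow.status in ("pending", "queued", "running"):
        # The workflow-engine pod may already be scheduled or starting:
        # remove it first so nothing keeps running after the status change.
        delete_batch_job(workflow.id_)
    workflow.status = "deleted"
    # ... persist the status change and clean up the workspace if requested ...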

VMois pushed a commit to VMois/reana-workflow-controller that referenced this issue Apr 13, 2022
@VMois VMois linked a pull request Apr 13, 2022 that will close this issue
VMois pushed a commit to VMois/reana-workflow-controller that referenced this issue Apr 19, 2022
VMois pushed a commit to VMois/reana-workflow-controller that referenced this issue Apr 21, 2022