workflow_run_manager: workflow stop race condition #310

Merged

Conversation

diegodelemos
Member

  • Happens when a user stops a workflow before its first job is created.
    RWC stops only the existing jobs. However, because there is a grace
    period for stopping pods, the RJC sidecar keeps running, submits a new
    job, and reports its status, causing the workflow to “revive”
    (closes reanahub/reana-client#395).

@diegodelemos
Member Author

  • Quick solution: Force delete with grace period 0
    • Didn't work: deletion still takes longer than 0 seconds -> ❌
  • Better solution: Ignore status updates for workflows stopped in DB
    • stop_workflow performs an atomic operation: once a workflow and its alive jobs have been stopped, the workflow is guaranteed to be marked as stopped in the database.
    • Use that field in job-status-consumer to ignore status updates for stopped workflows (something similar is being done with GitLab, see here)
  • Best solution: Create the state machine for workflow status manipulation and embed logic there, see previous musings.
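
To illustrate the "better solution", here is a minimal sketch of a consumer-side check. It is not the actual job-status-consumer code: the `workflows` dict stands in for the database, and `on_status_message` and the status subset are hypothetical names chosen for the example.

```python
from enum import Enum


class WorkflowStatus(Enum):
    # Hypothetical subset of statuses, for illustration only.
    running = "running"
    finished = "finished"
    failed = "failed"
    stopped = "stopped"


# Stand-in for the database: workflow id -> current status.
workflows = {"wf-1": WorkflowStatus.stopped, "wf-2": WorkflowStatus.running}


def on_status_message(workflow_id, new_status):
    """Apply a status update unless the workflow is already stopped.

    Returns True if the update was applied, False if it was ignored.
    """
    if workflows[workflow_id] is WorkflowStatus.stopped:
        # A late message from a job submitted during the pod grace
        # period must not "revive" the workflow.
        return False
    workflows[workflow_id] = new_status
    return True
```

With this check, a late `running` report for `wf-1` is dropped, while `wf-2` can still move to `finished`.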

@diegodelemos diegodelemos force-pushed the rc-395/stop-command-misbehaviour branch from c87f38b to 1cb66c2 Compare April 9, 2020 13:49
@diegodelemos diegodelemos marked this pull request as ready for review April 9, 2020 13:49
@diegodelemos diegodelemos force-pushed the rc-395/stop-command-misbehaviour branch from 1cb66c2 to 8495199 Compare April 9, 2020 13:50
@diegodelemos
Member Author

I haven't fully addressed the state machine but I've recorded the knowledge summarised in reanahub/reana-client#192 (comment) (referenced in #149) inside a function in a separate module prepared for a possible future refactor for status transition management.
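
As a rough sketch of what such a status transition function could look like (the transition table below is an assumption made for the example, not the one recorded in the referenced comment, and `ALLOWED_TRANSITIONS` and `transition_error` are hypothetical names):

```python
from enum import Enum
from typing import Optional


class WorkflowStatus(Enum):
    # Hypothetical subset of statuses, for illustration only.
    created = 0
    queued = 1
    running = 2
    finished = 3
    failed = 4
    stopped = 5


# Assumed table: each status maps to the statuses it may move to.
# "stopped" has no outgoing transitions in this sketch.
ALLOWED_TRANSITIONS = {
    WorkflowStatus.created: {WorkflowStatus.queued, WorkflowStatus.stopped},
    WorkflowStatus.queued: {WorkflowStatus.running, WorkflowStatus.stopped},
    WorkflowStatus.running: {
        WorkflowStatus.finished,
        WorkflowStatus.failed,
        WorkflowStatus.stopped,
    },
    WorkflowStatus.finished: set(),
    WorkflowStatus.failed: set(),
    WorkflowStatus.stopped: set(),
}


def transition_error(workflow_id, current, new) -> Optional[str]:
    """Return None if the transition is allowed, else an error message."""
    if new in ALLOWED_TRANSITIONS[current]:
        return None
    return (
        f"Cannot transition workflow {workflow_id} "
        f"from status {current} to {new}."
    )
```

Keeping the table in one place means a future state machine refactor would only need to replace this module, and callers can log the returned message when a transition is rejected.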

@diegodelemos diegodelemos force-pushed the rc-395/stop-command-misbehaviour branch from 8495199 to 568a933 Compare April 9, 2020 13:54
* Happens when a user stops a workflow before its first job is created.
  RWC stops only the existing jobs. However, because there is a grace
  period for stopping pods, the RJC sidecar keeps running, submits a new
  job, and reports its status, causing the workflow to “revive”.
  To mitigate this, we decrease the grace period to 0 and we don't
  allow stopped workflows to change status
  (closes reanahub/reana-client#395).
@diegodelemos diegodelemos force-pushed the rc-395/stop-command-misbehaviour branch from 568a933 to 8ba38e9 Compare April 9, 2020 15:33
Member

@mvidalgarcia mvidalgarcia left a comment

Tested successfully on my machine 👍
Something to take into account for future improvements: several attempts to switch to a forbidden status are still made. See the job-status-consumer logs:

2020-04-10 08:55:57,554 | kombu.mixins | MainThread | INFO | Connected to amqp://test:**@reana-message-broker:5672//
2020-04-10 08:57:44,143 | root | MainThread | ERROR | Cannot transition workflow d4e80dbd-80b4-4725-a415-60708a8aa896 from status WorkflowStatus.stopped to WorkflowStatus.running.
2020-04-10 08:57:44,291 | root | MainThread | ERROR | Cannot transition workflow d4e80dbd-80b4-4725-a415-60708a8aa896 from status WorkflowStatus.stopped to WorkflowStatus.running.
2020-04-10 08:57:50,325 | root | MainThread | ERROR | Cannot transition workflow d4e80dbd-80b4-4725-a415-60708a8aa896 from status WorkflowStatus.stopped to WorkflowStatus.running.
2020-04-10 08:57:50,431 | root | MainThread | ERROR | Cannot transition workflow d4e80dbd-80b4-4725-a415-60708a8aa896 from status WorkflowStatus.stopped to WorkflowStatus.running.
2020-04-10 08:57:59,471 | root | MainThread | ERROR | Cannot transition workflow d4e80dbd-80b4-4725-a415-60708a8aa896 from status WorkflowStatus.stopped to WorkflowStatus.finished.

Development

Successfully merging this pull request may close these issues.

cli: stop command misbehaviour
2 participants