
Job step skipped for no apparent reason #3443

Closed
vinillum opened this issue May 29, 2018 · 15 comments


@vinillum

commented May 29, 2018

Describe the bug
We have a job that is executed on the nodes. It's a rather long job, but towards the end we received the following:

10:22:42 Step [27] did not run. start conditions: [], skip conditions: []

This is the first time I am seeing this. The job itself is not new; it has had hundreds of successful runs, and it doesn't have any conditions around it: the step should be executed no matter what.

My Rundeck detail

  • Rundeck version: 2.10.7
  • install type: rpm
  • OS Name/version: RedHat EL 6
  • DB Type/version: MySQL

To Reproduce
Not sure how to reproduce this - this is the first time I am seeing this, and I had quite a few successful runs afterwards.

Expected behavior
Step should have executed.

@vinillum

Author

commented May 29, 2018

Job steps: (screenshot)

Log output: (screenshot)

@hs-hub-world


commented May 30, 2018

Can this problem be reproduced? Otherwise it will be hard to troubleshoot. I would also suggest looking in the Rundeck server/service logs to see if anything shows up there. One other thing that comes to mind: make sure your jobs are not accidentally referencing themselves, because Rundeck will not throw a proper error message when a job references itself.

@puremourning


commented Jun 29, 2018

We have seen something similar when running the same Start Server script across many nodes. Sometimes the step is simply skipped on some nodes, with no explanation as to why. It is not consistently reproducible, which suggests a race condition internal to Rundeck. I have tried to catch it with execution debug mode, without success (yet).

This kind of thing is devastating for production use, however, and it is making me very nervous. Skipping steps in a workflow could lead to disaster. Fortunately, when it happens to the "Start Server" job, the next job in both our implementations is "Check server started successfully", so we just fail on the validation and a human has to recover production or restart the deployment.

But if it skipped "Perform backup", we could be in serious trouble.

Case in point, our regression tests just hit this with another job altogether:

(screenshot: rundeck-skip-steps-start-condition-empty)

The relevant section of the workflow is:

  • Pre-upgrade intra-week check
# Pre-upgrade intra-week check
- id: 390f1149-ae03-4bad-a0df-20d9e4ebb4c2
  uuid: 390f1149-ae03-4bad-a0df-20d9e4ebb4c2
  name: Pre-upgrade intra-week check
  description: Check the system is in a state to be upgraded while the system is up
  scheduleEnabled: true
  project: XTP
  group: Checks
  loglevel: INFO
  multipleExecutions: true
  options:
    - *oSystem
    - *oRegion
    - *oITPCODELINE
    - *oITPLABEL
    - *oITPMACODELINE
    - *oITPMALABEL
    - *oGPMCODELINE
    - *oGPMLABEL
    - *oXTPCODELINE
    - *oXTPLABEL
    - *oXTPMACODELINE
    - *oXTPMALABEL
    - *oPORTALCODELINE
    - *oPORTALLABEL
  sequence:
    keepgoing: true
    commands:
    - jobref:
        group: Checks
        name: Pre-upgrade sanity check
        args: -System ${option.System} -Region ${option.Region}
... <other steps>
  • Pre-upgrade sanity check
# Pre-upgrade sanity check
- id: d6ae06e7-4100-464d-81d3-b14d268dd57e
  uuid: d6ae06e7-4100-464d-81d3-b14d268dd57e
  name: Pre-upgrade sanity check
  description: Ensure access and box setup
  scheduleEnabled: true
  project: XTP
  group: Checks
  loglevel: INFO
  multipleExecutions: true
  options:
  - *oSystem
  - *oRegion
  sequence:
    keepgoing: true # Run everything even if _basic_ checks fail
    commands:
    - jobref:
        group: Checks/_PerNode
        name: Check account access
        nodefilters:
          filter: 'Fidessa:System: ${option.System}
                   Fidessa:Node: ${option.Region}'
    - jobref:
        group: Checks/_PerNode
        name: Check account setup
        nodefilters:
          filter: 'Fidessa:System: ${option.System}
                   Fidessa:Node: ${option.Region}
                   tags: fidserver'
  • Check account setup
# The server environments should be configured appropriately
- id: 0a88581d-0b96-481d-909c-600bcfd5be2e
  uuid: 0a88581d-0b96-481d-909c-600bcfd5be2e
  name: Check account setup
  description: Ensure the server accounts have appropriate configuration
  scheduleEnabled: false
  project: XTP
  group: Checks/_PerNode
  loglevel: INFO
  nodeFilterEditable: true
  orchestrator:
    configuration:
      maxThreads: '6'
    type: perhost-maxthreads
  nodefilters:
    dispatch:
      excludePrecedence: true
      keepgoing: true # Run on all nodes
      successOnEmptyNodeFilter: false
      threadcount: 24
    filter: tags:fidserver
  nodesSelectedByDefault: false
  multipleExecutions: true
  sequence:
    keepgoing: false
    strategy: node-first
    commands:
    - script: |
        ... some secret scripts :)

As you can see, it skipped the Check account setup step entirely. I can confirm that the nodes in question have tags: fidserver set on them (as evidenced by the later steps in the workflow).

What sort of steps can I perform to debug this? It's not consistently reproducible.

We do use a custom orchestrator plugin, but I'm 99% sure that it is not the cause and that it follows the spec to the letter, particularly as the OP doesn't use that orchestrator.

@puremourning


commented Jun 29, 2018

To answer related questions from @hs-hub-world :

Can this problem be reproduced? Because it will be hard to troubleshoot.

As described, it is intermittent.

Also I would suggest looking in the Rundeck server/service logs to see if anything shows up there.

Nothing obvious, but the log doesn't have timestamps and this happened on an overnight regression test. It's also full of the usual megabytes of exception trace noise. I can provide the entire log, but I think this is the relevant section (going by the exec id):

INFO  ExecutionService: updated scheduled Execution
ERROR ExecutionUtilService: Execution failed: 73306 in project XIT-TEST: [Workflow result: , step failures: {7=JobFailed: Job [Checks/_PerNode/Run installChecks] failed}, status: failed]              
INFO  ScheduledExecutionService: scheduling temp job: TEMP:admin:73307                              

One other thing that comes to mind: make sure your jobs are not accidentally referencing themselves, because Rundeck will not throw a proper error message when a job references itself.

The jobs don't reference themselves.

@puremourning


commented Jun 29, 2018

We just hit this again today; the step was skipped on a subset of nodes with no explanation. This particular case could have been a disaster if it were prod:

(screenshot: rundeck-skip-steps-start-condition-empty-2)

@ahonor

Contributor

commented Jun 29, 2018

@puremourning are you also on 2.10.7?

@puremourning


commented Jun 29, 2018

2.10.2

@gschueler

Member

commented Jun 29, 2018

Are you able to run with DEBUG logging on and see this happen?

@puremourning


commented Jun 29, 2018

I have been trying to :) I've had debug turned on for a host of overnight deploys.

However, I'm calling our regression jobs with rd run <stuff> --loglevel "verbose", but this doesn't seem to enable debug logging for the job that it runs. Is that a bug, or am I holding it wrong?

One thing that's tricky is that we don't always catch it, because it doesn't always cause the job to fail per se.

@gschueler

Member

commented Jun 29, 2018

(I think the rd run ... --loglevel verbose not working is a bug)

@puremourning


commented Aug 17, 2018

Well, what I can say is that this frequently happens when debug logging is not enabled. It happened today: it skipped a job-reference step on all nodes, printing the following against the "local" Rundeck node:

13:03:32 | ukfd-ltp2l | 1. Start Server pair | Step [1] did not run. start conditions: [], skip conditions: []
13:03:32 |            |                      | Step [1] did not run. start conditions: [], skip conditions: []

However, I have not observed it with debug logging enabled, which is somewhat frustrating.

Are the start conditions of [] correct, even for the first step? I would have expected a step with conditions of [] always to execute.

Is there any possibility this is related to a custom orchestrator? I'm pretty certain our orchestrator is working according to the documentation, but it's a data point nonetheless.

@puremourning


commented Sep 6, 2018

Happened again. FWIW we're currently running 2.10.7.

I'm going to have to try and repro this with Pro, because this issue could be catastrophic in our production environment.

gschueler added the bug label Sep 17, 2018

@gschueler

Member

commented Sep 18, 2018

I've been trying to reproduce this, with no luck.

One way you could increase the server log data is to enable DEBUG logging for the com.dtolabs.rundeck.core.rules.WorkflowEngine logger in log4j.properties.

Add this to server/exp/webapp/WEB-INF/classes/log4j.properties:

log4j.logger.com.dtolabs.rundeck.core.rules.WorkflowEngine=DEBUG

This will add a lot of output to the service log, but I believe it would capture whatever is causing this behavior.

Since the error is intermittent, I don't know whether that is feasible in your production environment, as the logging will be very verbose.

@puremourning


commented Jan 23, 2019

I have sent a repro with all the logging enabled.

@gschueler

Member

commented Feb 4, 2019

I've found a race condition that can leak state between threads in this case: the workflow definition uses the node-first strategy, together with node dispatch with threadcount > 1.

This has the potential to interfere with proper workflow execution on a node, as the workflow engine state from another node is being used 😬 And that could manifest as the "skipped steps" issue (or other random weirdness)
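To illustrate the class of bug (a sketch only — the class and method names here are hypothetical stand-ins, not Rundeck's actual internals): when one mutable builder is reused across dispatched nodes, each node's workflow picks up state left behind by the previous one; under concurrent dispatch the same sharing becomes a race. Constructing a fresh builder per dispatched node removes the shared state entirely:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a workflow engine builder.
class EngineBuilder {
    private final List<String> steps = new ArrayList<>();
    EngineBuilder addStep(String s) { steps.add(s); return this; }
    List<String> build() { return new ArrayList<>(steps); }
}

public class BuilderReuseDemo {
    public static void main(String[] args) {
        // Buggy pattern: one builder shared across dispatched nodes.
        // node2's workflow ends up containing node1's state as well.
        EngineBuilder shared = new EngineBuilder();
        for (String node : new String[]{"node1", "node2"}) {
            shared.addStep("step1@" + node);
        }
        System.out.println("shared builder steps: " + shared.build().size());

        // Fixed pattern: a fresh builder per dispatched node,
        // so no state can leak between nodes (or threads).
        for (String node : new String[]{"node1", "node2"}) {
            List<String> steps =
                new EngineBuilder().addStep("step1@" + node).build();
            System.out.println(node + " steps: " + steps.size());
        }
    }
}
```

The sequential loop keeps the demonstration deterministic; with a thread pool, the shared-builder variant would corrupt state unpredictably, which matches the intermittent nature of the reports above.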

A possible workaround: don't use the node-first strategy, or don't use multiple dispatch threads.
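In job-definition terms, the workaround amounts to either of the following edits (a sketch against the YAML definitions earlier in this thread; in the 2.x workflow model the alternative to node-first is the step-first strategy):

```
# Workaround 1: avoid the node-first strategy
sequence:
  keepgoing: false
  strategy: step-first    # instead of node-first

# Workaround 2: keep node-first, but dispatch to one node at a time
nodefilters:
  dispatch:
    keepgoing: true
    threadcount: 1        # instead of e.g. 24
```

Either change avoids the affected combination (node-first plus threadcount > 1) at the cost of slower dispatch.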

I am working on a fix.

gschueler added a commit that referenced this issue Feb 5, 2019

Fix #3443 ensure new builder used for workflow system
otherwise, the single builder might be used in multiple threads

gschueler added a commit that referenced this issue Feb 5, 2019

For #3443 create new executor for dispatched workflow
don't re-use the executor for each dispatched item

gschueler added this to the 3.0.14 milestone Feb 5, 2019

gschueler added a commit that referenced this issue Feb 5, 2019

Fix #3443 ensure new builder used for workflow system
otherwise, the single builder might be used in multiple threads

gschueler added a commit that referenced this issue Feb 6, 2019

Merge pull request #4462 from rundeck/issue/3443-2.11-backport
Fix #3443 ensure new builder used for workflow system