
Zombie scp processes and failed jobs #2651

Closed
pbenas opened this issue Jul 31, 2017 · 11 comments

pbenas commented Jul 31, 2017

Issue type: Bug report

My Rundeck detail

  • Rundeck version: 2.8.3
  • install type: war
  • OS Name/version: CentOS Linux release 7.3.1611 (Core)
  • DB Type/version: postgresql-9.2.18-1.el7.x86_64

Expected Behavior
No failed jobs, no zombie SCP processes.

Actual Behavior
Rundeck executes the scp, then fails the job. A couple of zombie scp processes remain:

tomcat   22856  0.0  0.0      0     0 ?        Z    12:36   0:00 [scp] <defunct> 
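For anyone unfamiliar with `<defunct>` entries: a zombie is a child process that has exited but has not yet been reaped with `wait()` by its parent. A minimal sketch (Linux-only, since it reads `/proc`; not Rundeck's actual code) that reproduces and then clears the state seen in the `ps` output above:

```python
import os
import time

def child_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux)."""
    with open(f"/proc/{pid}/stat") as f:
        # The state letter is the first field after the parenthesised comm.
        return f.read().rsplit(")", 1)[1].split()[0]

pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits immediately
time.sleep(0.2)          # give the child time to die
print(child_state(pid))  # 'Z' -> shown as <defunct>, because we haven't wait()ed
os.waitpid(pid, 0)       # reaping the child removes the zombie
```

So zombies accumulating means the parent (here, the Rundeck JVM) is launching children but never collecting their exit status.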

Digging through the logs, I found the following two errors:

Jul 31 13:22:10 rundeck server: INFO  stream: Failed: IOFailure: Cannot run program "/usr/bin/ssh": error=11, Resource temporarily unavailable severity=info
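A quick way to decode the `error=11` in that message (assuming a Linux host, where errno 11 is EAGAIN; the number differs on other platforms):

```python
import errno
import os

# "error=11" in the log is errno 11, i.e. EAGAIN on Linux: the fork()/clone()
# behind "Cannot run program" failed, typically because a per-user process or
# thread limit (RLIMIT_NPROC, kernel.threads-max, a cgroup pids limit) or
# memory accounting refused the new process.
print(errno.errorcode[11])   # 'EAGAIN'
print(os.strerror(11))       # 'Resource temporarily unavailable'
```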

and

Jul 31 13:14:24 rundeck server: ERROR WorkflowService: updateStateForStep([identifier:6/1, index:0, stepStateChange:StepStateChangeImpl{stepState=StepStateImpl{executionState=RUNNING, metadata=null, errorMessage='null'}, nodeName='na1-dss-v21n01', nodeState=true}, timestamp:Mon Jul 31 13:14:23 CEST 2017]): No such property: nodeState for class: java.util.Date severity=info
Jul 31 13:14:24 rundeck server: groovy.lang.MissingPropertyException: No such property: nodeState for class: java.util.Date severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.MutableWorkflowStateImpl$_updateStateForStep_closure7.doCall(MutableWorkflowStateImpl.groovy:221) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.MutableWorkflowStateImpl.updateStateForStep(MutableWorkflowStateImpl.groovy:220) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.MutableWorkflowStateImpl.descendUpdateStateForStep(MutableWorkflowStateImpl.groovy:323) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.MutableWorkflowStateImpl.updateStateForStep(MutableWorkflowStateImpl.groovy:140) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.ExceptionHandlingMutableWorkflowState.updateStateForStep(ExceptionHandlingMutableWorkflowState.groovy:36) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.MutableWorkflowStateListener.stepStateChanged(MutableWorkflowStateListener.groovy:40) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.core.execution.workflow.state.WorkflowExecutionStateListenerAdapter.notifyAllStepState(WorkflowExecutionStateListenerAdapter.java:72) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.core.execution.workflow.state.WorkflowExecutionStateListenerAdapter.beginExecuteNodeStep(WorkflowExecutionStateListenerAdapter.java:203) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.app.internal.workflow.MultiWorkflowExecutionListener.beginExecuteNodeStep(MultiWorkflowExecutionListener.groovy:104) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.core.execution.ExecutionServiceImpl.executeNodeStep(ExecutionServiceImpl.java:109) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.core.execution.dispatch.ParallelNodeDispatcher$ExecNodeStepCallable.call(ParallelNodeDispatcher.java:215) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.core.execution.dispatch.ParallelNodeDispatcher$ExecNodeStepCallable.call(ParallelNodeDispatcher.java:190) severity=info
Jul 31 13:14:24 rundeck server: at com.dtolabs.rundeck.core.cli.CallableWrapperTask.execute(CallableWrapperTask.java:52) severity=info
Jul 31 13:14:24 rundeck server: at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) severity=info
Jul 31 13:14:24 rundeck server: at org.apache.tools.ant.Task.perform(Task.java:348) severity=info
Jul 31 13:14:24 rundeck server: at org.apache.tools.ant.taskdefs.Parallel$TaskRunnable.run(Parallel.java:435) severity=info
Jul 31 13:14:24 rundeck server: at java.lang.Thread.run(Thread.java:745) severity=info

How to reproduce Behavior

  • configure file copier to scp
  • run a job containing a script on about 700 target nodes.
    Not reproducible when running over smaller sets of nodes. Our instance has 12 CPUs and 72G of RAM; short-term load peaks at small, single-digit values.

pbenas commented Jul 31, 2017

Looking at the release notes from 2.7.3 to 2.8.3, I found two commits relating to ssh. Not sure if they're relevant.


pbenas commented Jul 31, 2017

screenshot

gschueler (Member) commented

It looks like you are using the scp command-line tool. Are you using a plugin that wraps the scp/ssh CLI tools? Those tools are not used by the Java code you linked to, so I don't think that commit is related.


pbenas commented Aug 1, 2017

No, I'm not using a plugin, I'm using the script-copy FileCopier provider on project level config such as:

"service.FileCopier.default.provider": "script-copy",
"service.NodeExecutor.default.provider": "script-exec",
"plugin.script-copy.default.command": "/usr/bin/scp -q -o StrictHostKeyChecking=no ${file-copy.file} ${node.username}@${node.hostname}:${file-copy.destination}",
"plugin.script-copy.default.remote-filepath": "/tmp/${file-copy.filename}",
"plugin.script-exec.default.command": "/usr/bin/ssh -q -o StrictHostKeyChecking=no ${node.username}@${node.hostname} ${exec.command}"
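One hypothetical mitigation for the EAGAIN failures with a config like the above would be to point `plugin.script-copy.default.command` at a small wrapper that retries when the fork itself fails. This is a sketch, not part of Rundeck; the retry counts and delays are illustrative:

```python
import subprocess
import sys
import time

def run_with_retry(cmd, attempts=5, delay=1.0):
    """Run cmd, retrying when the spawn itself fails with EAGAIN.

    If the system is momentarily out of process slots, back off and try
    again instead of immediately failing the job step.
    """
    for i in range(attempts):
        try:
            return subprocess.run(cmd).returncode
        except BlockingIOError:          # fork() returned EAGAIN (errno 11)
            time.sleep(delay * (i + 1))  # linear backoff before retrying
    raise RuntimeError(f"could not spawn {cmd[0]} after {attempts} attempts")

if __name__ == "__main__":
    sys.exit(run_with_retry(sys.argv[1:]))
```

Used as e.g. `/usr/local/bin/scp-retry ... ${file-copy.file} ...` (a hypothetical path), this would absorb transient fork failures at the cost of slower dispatch.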


pbenas commented Aug 1, 2017

I continued testing this on 2.7.3 and it's also reproducible there, just less likely to happen.

@gschueler gschueler added the bug label Aug 10, 2017
gschueler (Member) commented

OK, this is possibly a bug in the script-copy/exec plugin.


samber commented Sep 12, 2017

Same issue here after we upgraded from 2.6.9-1 to 2.9.2-1-GA: we got 5k to 10k ssh zombies after 10 days of normal use (cron jobs, deployment jobs, ...).


pbenas commented Sep 13, 2017

I had time for some deeper testing across different Rundeck releases (2.7.3, through 2.8.*, to 2.9.3) and found that this is not a regression; I was just confused by a different issue in Rundeck 2.9: #2756

For our 12-CPU instance, the threshold for zombie creation remains the same: somewhere between 650 and 700 target nodes being accessed simultaneously (threadcount). The zombie creation is associated with the "Resource temporarily unavailable" fork error, no matter how far we push the limits on the maximum number of processes and open files. Most of the time, the zombies were spawned almost immediately after the job was started. My thought is that the fork may succeed while Rundeck thinks it did not (so it never learns about the child process), and that some reasonable throttling could help: delay the start of some processes by a second or two, so that so many forks don't happen within a short period of time.
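The throttling idea above can be sketched with a semaphore that caps how many forks are in flight at once; the cap of 64 and the no-op `true` command below are purely illustrative, and this is not Rundeck's dispatcher code:

```python
import subprocess
import threading

# Cap concurrent spawns so a burst of ~700 dispatch threads does not
# all call fork() at the same instant.
_spawn_gate = threading.BoundedSemaphore(64)   # illustrative limit

def throttled_run(cmd):
    with _spawn_gate:          # at most 64 forks in flight at once
        proc = subprocess.Popen(cmd)
    return proc.wait()         # wait() also reaps the child, so no zombie

# Example: dispatch many no-op "copies" from worker threads.
threads = [threading.Thread(target=throttled_run, args=(["true"],))
           for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note the gate only covers the `Popen` call itself, so it smooths out fork bursts without limiting how many copies run concurrently once started.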

gschueler (Member) commented

@pbenas do you have a minimum heap size setting, e.g. -Xms? On some Linux JVMs this has been a problem when -Xms is set too high: when the JVM needs to fork a process, the same amount of memory as -Xms is momentarily accounted for the forked child. If you are creating a lot of processes, you may be hitting memory limits even though the actual process that runs doesn't require much memory.
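Whether that momentary double accounting actually makes fork fail depends on the kernel's overcommit policy, not just on free RAM. A sketch (Linux-only, reads `/proc`) of the knobs worth checking when diagnosing this:

```python
# Whether a large-heap JVM can fork depends on vm.overcommit_memory,
# not only on free RAM. These /proc files exist on any Linux box.
def read(path):
    with open(path) as f:
        return f.read().strip()

mode = read("/proc/sys/vm/overcommit_memory")
# 0 = heuristic (default), 1 = always allow, 2 = strict accounting;
# under mode 2 a fork of a 4G-heap JVM can fail despite plenty of free RAM.
print("vm.overcommit_memory =", mode)

# Under strict accounting, CommitLimit vs Committed_AS shows the headroom.
meminfo = dict(line.split(":", 1) for line in read("/proc/meminfo").splitlines())
print("CommitLimit  =", meminfo["CommitLimit"].strip())
print("Committed_AS =", meminfo["Committed_AS"].strip())
```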


pbenas commented Sep 14, 2017

@gschueler Yes, we run Rundeck with -Xms2048M -Xmx4096M, but that does not seem to be the limit for us. I've watched memory usage during the startup of the job: it increases from about 3G to 5G temporarily, while there are still 70G free. Moreover, switching to -Xms512M makes no difference to zombie creation.


stale bot commented Apr 3, 2020

In an effort to focus on bugs and issues that impact currently supported versions of Rundeck, we have elected to notify GitHub issue creators if their issue is classified as stale and close the issue. An issue is identified as stale when there have been no new comments, responses or other activity within the last 12 months. If a closed issue is still present please feel free to open a new Issue against the current version and we will review it. If you are an enterprise customer, please contact your Rundeck Support to assist in your request.
Thank you, The Rundeck Team

@stale stale bot added the wontfix:stale label Apr 3, 2020
@stale stale bot closed this as completed Apr 5, 2020