lots of non-multiple execution jobs can block starting jobs #1305

Closed
jquick opened this Issue Jun 26, 2015 · 30 comments

Projects

None yet

4 participants

@jquick
jquick commented Jun 26, 2015

We are currently having an issue where some jobs that are set to be ran just never run. On the gui it looks correct and counts down accordingly and then say something like "6sec ago" but the jobs never starts. A 0/15 job can go a few hours without running. Manually running the jobs will get them working again but its not the best work around.

I have looked in all the logs in /var/log/rundeck/* but I don't see any errors or anything. We have about 400 jobs running and anywhere from 30-70 jobs running at a time. We are using a oracle back-end and have not had any other issues thus far.

Another thing we have noticed is when you click to manually start a job it can take upwards of 10 mins to respawn. I think this has to do with the quarts scheduler getting overloaded but I'm not sure.

Please let me know if there are other log files I can dig into or some other quarts tuning I can do. We are currently running on 2.4.2-1 and plan on moving to 2.5.1 soon.

Thanks

@gschueler
Contributor

@jquick it sounds like the quartz threadCount is limiting you. It may look like a job started, but it is waiting for a thread to be available from the scheduler.

see http://rundeck.org/2.4.2/administration/tuning-rundeck.html#quartz-job-threadcount

@jquick
jquick commented Jun 26, 2015

@gschueler Thanks for the input. In the past we have upped the threadCount and it is currently at 80. We thought of this but the jobs never run even if the current number of jobs is as low as 30.

We do find that it stops scheduling them at peak times where it likely could hit 80 jobs for awhile. I would assume they would just get queued and run when a spot opens up though.

@gschueler
Contributor

in the System Report view in the configuration tab, there should be a number indicating the Scheduler running thread count

@jquick
jquick commented Jun 26, 2015

Thanks I will monitor this more closely. We were looking at the scheduled_execution table where date_completed = null but I think the count in the System Report includes more. So we were likely hitting the cap without realizing it.

I have also noticed the thread count is much higher then the running count. Currently at 17 running jobs on the scheduler is taking 131 active threads. Does this mean our quartz threadCount should be more like 200?

@gschueler
Contributor

A job will spawn likely multiple threads, but those don't count against the schedulers max thread count. Eg if running on multiple nodes with a node dispatch threadcount

Greg

On Jun 26, 2015, at 4:56 PM, jquick notifications@github.com wrote:

Thanks I will monitor this more closely. We were looking at the scheduled_execution table where date_completed = null but I think the count in the System Report includes more. So we were likely hitting the cap without realizing it.

I have also noticed the thread count is much higher then the running count. Currently at 17 running jobs on the scheduler is taking 131 active threads. Does this mean our quartz threadCount should be more like 200?


Reply to this email directly or view it on GitHub.

@jquick
jquick commented Jun 29, 2015

That was the issue, scheduler was maxed. Is there any soft limit to how high we can go with the scheduler thread count? We are currently at 150 and still maxed.

@gschueler
Contributor

@jquick you should be able to set the max thread count as high as you want, the limit would be the memory and/or number of cores that the server has available. The default is a fixed size pool (SimpleThreadPool), meaning it creates that many threads which will sit idle in memory when not being used, but you can change it to a different threadpool implementation class, you can specify that as part of the quartz config.

@jquick
jquick commented Jul 2, 2015

Ok thanks, we have been playing with the quartz_job_threadcount for awhile and we currently have it at 450. This seems to keep up with the scheduler but we are still having major issues trying to manually start a job. Currently at only 112 current scheduler threads if I try to start a job via the gui it just spins forever at "starting execution". Sometimes it creates the job after about 10 min but most of the time it just never completes. Any other settings you can think of to boost?

@gschueler
Contributor

what kind of memory/cpu configuration do you have?

@jquick
jquick commented Jul 2, 2015

The VM has 4gb
The heap set to:
RDECK_JVM="$RDECK_JVM -Xmx2560m -Xms256m -XX:MaxPermSize=256m -server

We have never come close to oom issues on the box or inside the rundeck app looking at system profile.

CPUs: 2 Intel(R) Xeon(R) CPU L5520 @ 2.27GHz

Rundeck shows 2 processes and i've never seen it above:
loadAverage 0.25%

@gschueler
Contributor

Are there any error messages in the service.log?

@jquick
jquick commented Jul 2, 2015

Nope, just tried manually kicking a job off for the last 30 min. Just a bunch of:
INFO ExecutionService: updated scheduled Execution
INFO ExecutionUtilService: Execution successful: 1927101
...
Eventually ~30min later i see
INFO ScheduledExecutionService: scheduling immediate job run: TEMP:jquick:269:1927812


Something else I don't understand. We are currently at:

stats: scheduler
running 412
stats: threads
active 483

But under active we only have about 10 jobs currently running on the active tab. I think I remember seeing they they sit idle sometimes. We do have about 400 jobs all scheduled for a few times a day.

FWIW in the thread dump I do see about 400 of these:

quartzScheduler_Worker-189 id=203 state=BLOCKED
    - waiting to lock <0x5e927fe7> (a rundeck.services.ExecutionService)
     owned by quartzScheduler_Worker-380 id=394
    at rundeck.services.ExecutionService.createExecution(ExecutionService.groovy:1518)
    at rundeck.services.ExecutionService.createExecution(ExecutionService.groovy)
    at rundeck.services.ExecutionService$$FastClassByCGLIB$$ffb73873.invoke(<generated>)
    at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
    at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:622)
    at rundeck.services.ExecutionService$$EnhancerByCGLIB$$a26e78b0.createExecution(<generated>)
    at rundeck.services.ExecutionService$createExecution$1.call(Unknown Source)
    at rundeck.quartzjobs.ExecutionJob.initialize(ExecutionJob.groovy:240)
    at rundeck.quartzjobs.ExecutionJob$initialize.callCurrent(Unknown Source)
    at rundeck.quartzjobs.ExecutionJob.execute_internal(ExecutionJob.groovy:90)
    at sun.reflect.GeneratedMethodAccessor1301.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1082)
    at groovy.lang.ExpandoMetaClass.invokeMethod(ExpandoMetaClass.java:1106)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:906)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
    at groovy.lang.ExpandoMetaClass.invokeMethod(ExpandoMetaClass.java:1106)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:906)
    at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:66)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
    at rundeck.quartzjobs.ExecutionJob$_execute_closure1.doCall(ExecutionJob.groovy:69)
    at rundeck.quartzjobs.ExecutionJob$_execute_closure1.doCall(ExecutionJob.groovy)
    at sun.reflect.GeneratedMethodAccessor1299.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1082)
    at groovy.lang.ExpandoMetaClass.invokeMethod(ExpandoMetaClass.java:1106)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:906)
    at groovy.lang.Closure.call(Closure.java:412)
    at groovy.lang.Closure.call(Closure.java:406)
    at com.codahale.metrics.Timer.time(Timer.java:99)
    at com.codahale.metrics.Timer$time$0.call(Unknown Source)
    at rundeck.quartzjobs.ExecutionJob.execute(ExecutionJob.groovy:68)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:199)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:546)
@gschueler
Contributor

that could be a clue, are you able to send me the full threaddump? (via email if you prefer)

are the jobs that are running, or that you invoke manually set to allow multiple executions?

@jquick
jquick commented Jul 2, 2015

Sent via email. We don't run anything under multiple executions and everything only runs on a single external node.

@jquick
jquick commented Jul 2, 2015

Looking in the DB we actually do have about 10 jobs set to multiple executions but they only run about once a day and are currently not running.

@gschueler
Contributor

thanks, I will look at the threaddump. Based on your stack trace, you might try this workaround to see if it improves the situation: enable multiple executions for the other jobs if possible, and see if that allows invoking them without any hangups

@jquick
jquick commented Jul 6, 2015

I think that is helping. We can't set multiple executions on most of our jobs due to the frequency we run them (most are 0/15) or the fact that we just cant run them in parallel. But it looks like the ones I am setting multi on are helping. Il'l keep testing as we get more busy.

@jquick
jquick commented Jul 7, 2015

Setting a job to multiple executions does seem to make it start quickly when running it manually. I am using this to start manual jobs and just setting manual executions back to "no" after it starts. If I don't toggle, it still takes ~20 min to start.

@jquick
jquick commented Jul 16, 2015

Still capped with 450 blocking threads during low times. Let me know if I can provide any more data or do any more testing. Thanks for your help!

@gschueler gschueler added the bug label Jul 16, 2015
@gschueler gschueler added this to the 2.6.0 milestone Jul 16, 2015
@gschueler
Contributor

I believe the issue is the way that "multiple executions" setting is checked when you try to run the job, it is causing lock contention on a single object

@jquick
jquick commented Aug 3, 2015

@gschueler - Well good news. Upgrading to 2.5.2-1 seems to of fixed this issue. We have gone about a week without any threads getting blocked up.

@jquick
jquick commented Aug 3, 2015

Never mind. Showed up again today. Much less then normal though.

@jquick
jquick commented Aug 20, 2015

I don't think this has happened in the last few weeks. I believe the one time it locked up was when we were mass deleting jobs (my previous comment). There has been no locking on the object since updating to 2.5.2-1. Not sure if there is still a off case or not. The "Scheduler Running" count now matches jobs actually running.

@gschueler gschueler changed the title from Rundeck not starting scheduled jobs to lots of non-multiple execution jobs can block starting jobs Oct 7, 2015
@gschueler gschueler modified the milestone: 2.7.0, 2.6.0 Oct 7, 2015
@jquick
jquick commented Apr 15, 2016

@gschueler, I think this is still going on.

Currently we show 100 running in stats (which is what the quarts max threads is set too). But we never get above 25-30 actually running. Doing a thread dump its a little different then last time but basically the same.

Any other ideas for a work around? We are on version 2.6.1-1

Thanks!

quartzScheduler_Worker-25 id=39 state=BLOCKED
    - waiting to lock <0x07f18fcf> (a rundeck.services.ExecutionService)
     owned by quartzScheduler_Worker-24 id=38
    at rundeck.services.ExecutionService.createExecution(ExecutionService.groovy:1608)
    at rundeck.services.ExecutionService.createExecution(ExecutionService.groovy)
    at rundeck.services.ExecutionService$$FastClassBySpringCGLIB$$ffb73873.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:708)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
    at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:98)
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:262)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:644)
    at rundeck.services.ExecutionService$$EnhancerBySpringCGLIB$$f24c190b.createExecution(<generated>)
    at rundeck.services.ExecutionService$createExecution$15.call(Unknown Source)
    at rundeck.quartzjobs.ExecutionJob.initialize(ExecutionJob.groovy:265)
    at rundeck.quartzjobs.ExecutionJob$initialize$3.callCurrent(Unknown Source)
    at rundeck.quartzjobs.ExecutionJob.execute_internal(ExecutionJob.groovy:95)
    at sun.reflect.GeneratedMethodAccessor644.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
    at groovy.lang.ExpandoMetaClass.invokeMethod(ExpandoMetaClass.java:1110)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1016)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1152)
    at groovy.lang.ExpandoMetaClass.invokeMethod(ExpandoMetaClass.java:1110)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1016)
    at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:66)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
    at rundeck.quartzjobs.ExecutionJob$_execute_closure1.doCall(ExecutionJob.groovy:74)
    at rundeck.quartzjobs.ExecutionJob$_execute_closure1.doCall(ExecutionJob.groovy)
    at sun.reflect.GeneratedMethodAccessor643.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:324)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1207)
    at groovy.lang.ExpandoMetaClass.invokeMethod(ExpandoMetaClass.java:1110)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1016)
    at groovy.lang.Closure.call(Closure.java:423)
    at groovy.lang.Closure.call(Closure.java:417)
    at com.codahale.metrics.Timer.time(Timer.java:99)
    at com.codahale.metrics.Timer$time$17.call(Unknown Source)
    at rundeck.quartzjobs.ExecutionJob.execute(ExecutionJob.groovy:73)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
@jquick
jquick commented May 5, 2016

As a work around we had the job gently kill itself if it was already running and set multiple executions on for all of them. This has worked for now confirming that is the issue.

@leprince
leprince commented May 31, 2016 edited

Hi,
We are facing the same issue.
Our daily execution count went from 16000 to about 8000, and the Quartz Scheduler thread usage is almost all the time at 100% after upgrading from 2.6.4 to 2.6.5. (We actually run on version 2.6.7)
We have a total of 70 jobs running at different intervals, with:

running:100
threadPoolSize:100
stats: threads
active:142

but in the activity tab, we see less than 10 running tasks.
We have tested different values for the threadPoolSize config as per the rundeck tuning guide, but it did not help.

We can not allow "parallel/multiple" executions for our tasks so @jquick 's solution is not one for us.

@leprince

Hi,
After months of research, we found the solution to our problem.

Rundeck was blocking on MySQL queries.

The table workflow_workflow_step was missing indexes on workflow_commands_id and commands_idx.
Once added these indexes, the threadPoolSize went from 100 (full) to 2/100.
All jobs could be launched easily and the blocking threads went away.
We added more indexes on other tables and the GUI is faster than ever before!

Fyi, here are some stats from our database:
Number of total lines: 6 596 495
Size: 1.3G

Hope this helps you save some time :)

@jquick
jquick commented Jun 15, 2016

@leprince - Funny you should mention the GUI slowness. We are also having the same issue with so much data. Would you mind listing the indexes you created somewhere to speed it up? We would like to do the same.

@damienjoldersma

@jquick I ran these it seemed to help gui almost instantly:

ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id ( workflow_commands_id );
ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id ( commands_idx );
@leprince

We added quite a lot of indexes on the database to optimize different aspects of rundeck.

The ones that allowed to resolve the blocking full threadPool and easy some parts of the GUI where the ones mentioned by @damienjoldersma :

ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id ( workflow_commands_id ); ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id ( commands_idx );

@gschueler gschueler added a commit to gschueler/rundeck that referenced this issue Nov 30, 2016
@gschueler gschueler don't block all jobs when checking for multiple executions
fix #1305
44f6883
@gschueler gschueler self-assigned this Nov 30, 2016
@gschueler gschueler closed this in #2212 Nov 30, 2016
@gschueler gschueler removed the in progress label Nov 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment