
Jobs run concurrently with the same signature #46

Open
torleif opened this issue Jun 18, 2015 · 13 comments

@torleif
Contributor

torleif commented Jun 18, 2015

I have several jobs that I don't want to run at the same time. Even though each job returns a static signature from its getSignature() function, they still end up running concurrently.

I'm setting isComplete = true when a job completes, I'm incrementing currentStep as it works through its steps, and totalSteps is set correctly.

@nyeholt
Contributor

nyeholt commented Jun 18, 2015

The job signature is only used when you create a new job, and it only prevents creating one if there's already a 'new' job with that signature waiting in the queue - i.e. if a job with that signature is already running, it will still let you queue another one up.

Do you have multiple queue executors configured? Or is it that you only want the job to be added once the current one finishes? For the second case, I generally have the job create a new version of itself when it finishes.
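
For reference, a minimal sketch of that self-requeueing pattern (the class name and the md5-based signature are illustrative assumptions, not code from this issue; the base class and queueJob() call are from the queuedjobs module):

<?php
// Illustrative only: class name and signature scheme are assumptions.
class SyncExternalDataJob extends AbstractQueuedJob
{
    public function getTitle()
    {
        return 'Sync external data';
    }

    public function getSignature()
    {
        // A static signature stops a second copy of this job being queued
        // while one is still sitting in the queue as 'new'.
        return md5(get_class($this));
    }

    public function process()
    {
        // ... do the actual work ...

        $this->isComplete = true;

        // "Create a new version of itself when it finishes": queue the next
        // run only once this one has completed.
        singleton('QueuedJobService')->queueJob(new SyncExternalDataJob());
    }
}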

@torleif
Contributor Author

torleif commented Jun 18, 2015

I have it creating another job after it finishes. The queue executors (I assume) are the default ones described in the readme.md (our sites are running on CWP).

If you run the job queue task multiple times via the CLI, even while a job is already running, it will keep spawning additional instances of the same job.

The task should check whether the same job is already being executed. It's very misleading to see one job 'running' when there are often multiple instances of that job actually being run.

@nyeholt
Contributor

nyeholt commented Jun 18, 2015

Hmm, something sounds amiss. The way the executors are set up is to

  • call ProcessJobQueueTask::process
  • Look to see if the provided $queue parameter has any pending jobs via QueuedJobService::getNextPendingJob (doesn't run the job, just checks to see if there are any)
  • If it does, call QueuedJobService::processJobQueue
  • Retrieve the next job via processJobQueue
  • Execute that job
  • Call getNextPendingJob again, and if found, execute - looping this way until all jobs from that queue are complete

getNextPendingJob has an explicit check that looks for jobs executing on $queue (ie the jobs have a status of either INIT or RUNNING) - if it finds an already running job on that queue it returns false (and the previously mentioned job loop does not execute).
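
Roughly, that check amounts to something like the following (a simplified sketch of the behaviour just described, not the actual module code):

// Simplified sketch only - not the actual implementation.
function getNextPendingJob($type) {
    // If a job on this queue is already INIT or RUNNING, don't hand out another one.
    $running = QueuedJobDescriptor::get()->filter(array(
        'JobType'   => $type,
        'JobStatus' => array(QueuedJob::STATUS_INIT, QueuedJob::STATUS_RUN),
    ));
    if ($running->count()) {
        return false;
    }

    // Otherwise return the oldest queued job of this type.
    return QueuedJobDescriptor::get()->filter(array(
        'JobType'   => $type,
        'JobStatus' => QueuedJob::STATUS_NEW,
    ))->sort('ID', 'ASC')->first();
}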

Now, one thing that does stand out as a possible issue: if your job creates a new version of itself on completion and adds it to the queue, that new job will get picked up by the processJobQueue loop and executed immediately, and so on. Could that be what's happening? Or are you actually seeing two (or three / four / five) instances of the job executing concurrently?

@torleif
Contributor Author

torleif commented Jun 21, 2015

I wouldn't be able to elaborate on the internals, but it's fairly simple to test: create a new job that takes a while (say 30 seconds), then call php framework/cli-script.php dev/tasks/ProcessJobQueueTask a few times in different terminal tabs - they all start executing at the same time. QueuedJob::LARGE etc. seems to have no effect on this. The job queue seems to be incorrectly handing out jobs that are already running.

This is particularly frustrating for me as I have to deal with multiple APIs which can take a varied amount of time to complete. They're fairly large jobs that will start to have unpredictable behavior if they're running at the same time.

@nyeholt
Contributor

nyeholt commented Jun 22, 2015

Yep cool - will try and run up a reproduction of the issue.

@nyeholt
Contributor

nyeholt commented Jul 1, 2015

Just adding a note that I've been able to reproduce something that may be similar to your issue, but not consistently. I've set up 4 instances of the "DummyQueuedJob" via the queuedjobs admin, each with a different run duration - which actually means they have separate signatures. Starting the run from one CLI, it starts normally:

$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:33] Processing queue 2
[2015-07-01 13:52:33] Running Some test job for 150 seconds and others from 2

From another console, it behaves as expected:

$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:35] Processing queue 2
[2015-07-01 13:52:35] Job is still running on 2

But on the 4th or 5th execution, it picks up the same job:


$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:35] Processing queue 2
[2015-07-01 13:52:35] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:42] Processing queue 2
[2015-07-01 13:52:42] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:52] Processing queue 2
[2015-07-01 13:52:52] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:53] Processing queue 2
[2015-07-01 13:52:53] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:53] Processing queue 2
sendmail: option requires an argument -- 'f'
sendmail: option requires an argument -- 'f'
sendmail: fatal: usage: sendmail [options]
sendmail: fatal: marcus(1000): No recipient addresses found in message header
[2015-07-01 13:52:53] Running Some test job for 150 seconds and others from 2.

Now, the sendmail error in there gives me a bit of a hint as to what might be happening. I'm going to look into it further to see if my guess is correct, but I think the job health check is detecting the running job as stalled (correctly or not) and pausing it. The subsequent attempts to process the queue then pick the job up again, because it has been paused by the system and marked for restarting; what I'm not sure of just yet is whether it's being paused incorrectly, or whether that's a legitimate pause.

A couple of questions about your jobs:

  • Is your job's process() method re-entrant? What I mean by this is, does one call to process() perform a small part of the overall job, incrementing the job state each time, or is the whole job executed in this call?
  • If it is re-entrant, does your job increment its internal currentStep variable for each call of the process() loop?
  • Is your job's process() method executing code that could take 5 minutes or more?

My hunch is that your process() is doing the whole job in the one call - nothing wrong with doing things that way, but if the work inside process() keeps that ProcessJobQueue thread busy long enough without the job state being updated, I think the job health check marks it as 'dead', the job is paused so it can be restarted, and other jobs are free to start running again... while the original job is actually still sitting there processing away. In other words, because the job hasn't told the system that it's still running (by updating currentStep and returning from process()), the queue processing thinks it's dead and marks it as able to restart.
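
For contrast, here's a minimal sketch of a re-entrant job (hypothetical class and work list; it relies on the module's jobData-backed properties):

<?php
// Hypothetical example of a re-entrant job: each call to process() does one
// small unit of work and bumps currentStep so the health check can see progress.
class BatchImportJob extends AbstractQueuedJob
{
    public function getTitle()
    {
        return 'Import items in small steps';
    }

    public function setup()
    {
        $this->remainingItems = range(1, 500); // placeholder work list
        $this->totalSteps = count($this->remainingItems);
    }

    public function process()
    {
        $remaining = $this->remainingItems;
        if (!count($remaining)) {
            $this->isComplete = true;
            return;
        }

        $item = array_shift($remaining);
        // ... process this single $item ...

        // Persist progress back into the job state - this is what tells the
        // health check that the job is still alive.
        $this->remainingItems = $remaining;
        $this->currentStep++;
    }
}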

I'm just trying to put together another test class to confirm, and to figure out if there's a "nice" way to handle the situation.

@nyeholt
Contributor

nyeholt commented Jul 1, 2015

Okay, this looks to be related to commit ba94a35 - essentially, the first call to checkJobHealth (after the job has started running) will mark it as not having been updated, the subsequent call will pause it, and then it'll be immediately resumed (as a new instance of the job) because the pause/process logic triggers in the same thread. So if your process() method runs for more than just a couple of minutes, it'll get paused and restarted unceremoniously.

The issue is that QueuedJobService doesn't know anything about the processing queues (by design - the processors could be the dev/task based one, but could equally be gearman or similar), so the only way it can tell whether a job is still alive is by watching the job's LastProcessedStep (derived from the job's currentStep value) increment. As an interim fix for the current release, it might be simplest to introduce a time-based delay for job health checks, i.e. only check job health every X (5? 10?) minutes. This would at least mean a longer window before a job risks being paused/restarted. A more complete solution is a little more difficult and would probably involve introducing additional DB fields, which I'm not sure is the best way just yet. Would that time delay for health checks work for you? It'd be a configurable variable, defaulting to 2 minutes or something close to the current behaviour, but you could set it higher for your scenario.

nyeholt pushed a commit that referenced this issue Jul 1, 2015
An interim solution for issue #46

Job queue health check can now be configured to extend the duration before the
health check is run. Configure

QueuedJobService:
  health_check_mins: {int}

for the number of minutes to skip on checking. The check is done by modding
this value against the current time's minute, and if that returns 0, then the
health check is performed.
@nyeholt
Contributor

nyeholt commented Jul 1, 2015

Okay you can try https://github.com/silverstripe-australia/silverstripe-queuedjobs/tree/fix-health-check-time, and add some config to your project yml like

QueuedJobService:
  health_check_mins: 5

to set it to run the health check only every 5th minute.
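
For the record, the gating described in that commit message boils down to something like this (illustrative sketch only, not the exact code from the branch):

// Illustrative sketch of the minute-based gate, inside QueuedJobService.
$mins = (int) Config::inst()->get('QueuedJobService', 'health_check_mins');
if ($mins < 1 || ((int) date('i') % $mins) === 0) {
    $this->checkJobHealth();
}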

@torleif
Contributor Author

torleif commented Jul 1, 2015

Cool - I'll give this a try to see if it helps the situation. I'll also give you some code so you can reproduce the issue as well.

I'm hoping this will help, as we've got 6 or so jobs that have to be run manually because they aren't being executed by the queued jobs module.

@nyeholt
Contributor

nyeholt commented Sep 11, 2015

Better yet - you can simply make your job set $this->currentStep = -1 in the constructor. That way the health check doesn't kick in :)
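
In other words, something along these lines (hypothetical job class; only the currentStep = -1 line comes from the suggestion above, the rest is boilerplate for illustration):

<?php
class LongApiJob extends AbstractQueuedJob
{
    public function __construct()
    {
        // A negative step keeps the stalled-job health check from pausing
        // this job while a single long process() call is under way.
        $this->currentStep = -1;
        $this->totalSteps = 1;
    }

    public function getTitle()
    {
        return 'Long-running API job';
    }

    public function process()
    {
        // ... long-running work done in a single call ...
        $this->isComplete = true;
    }
}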

@zarocknz

Just a quick note that we hit this issue too and overcame it by setting $this->currentStep = -1 in the constructor. Thanks.

@torleif
Contributor Author

torleif commented May 13, 2016

It's a lazy hack, but we managed to avoid the issue by spacing out our jobs and throwing more CPU at it. health_check_mins: 5 didn't seem to do much.

@maxime-rainville
Collaborator

Need to validate whether this affects SS4.
