
Jobs run concurrently with the same signature #46

Open
torleif opened this issue Jun 18, 2015 · 13 comments

@torleif
Contributor

torleif commented Jun 18, 2015

I have several jobs that I don't want to run at the same time. Even though each job returns a static signature from its getSignature() function, they still end up running concurrently.

I'm setting isComplete = true when a job completes, I'm incrementing currentStep as it works through its steps, and totalSteps is set correctly.

@nyeholt
Contributor

nyeholt commented Jun 18, 2015

The job signature is only used when you create a new job, and it only prevents creating one if there's already a 'new' job with that signature waiting in the queue - i.e. if a job with that signature is already running, it will still let you queue another one up.

Do you have multiple queue executors configured? Or is it that you only want the job to be added once the current one finishes? For the second case, I generally have the job create a new version of itself when it finishes.
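
For reference, a minimal sketch of that self-requeueing pattern (the class name and the md5-based signature are illustrative assumptions, not code from this issue; the base class and queueJob() call are from the queuedjobs module):

<?php
// Illustrative only: class name and signature scheme are assumptions.
class SyncExternalDataJob extends AbstractQueuedJob
{
    public function getTitle()
    {
        return 'Sync external data';
    }

    public function getSignature()
    {
        // A static signature stops a second copy of this job being queued
        // while one is still sitting in the queue as 'new'.
        return md5(get_class($this));
    }

    public function process()
    {
        // ... do the actual work ...

        $this->isComplete = true;

        // "Create a new version of itself when it finishes": queue the next
        // run only once this one has completed.
        singleton('QueuedJobService')->queueJob(new SyncExternalDataJob());
    }
}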

@torleif
Contributor Author

torleif commented Jun 18, 2015

I have it creating another job after it finishes. The queue executors (I assume) are the default ones described in the readme.md (our sites are running on CWP).

If you run the job queue task multiple times via the CLI, even while a job is already running, it will keep spawning additional instances of the same job.

The task should check whether the same job is already being executed. It's very misleading to see one job 'running' when there are often multiple instances of that job actually being run.

@nyeholt
Contributor

nyeholt commented Jun 18, 2015

Hmm, something sounds amiss. The way the executors are set up is to

  • call ProcessJobQueueTask::process
  • Look to see if the provided $queue parameter has any pending jobs via QueuedJobService::getNextPendingJob (doesn't run the job, just checks to see if there are any)
  • If it does, call QueuedJobService::processJobQueue
  • Retrieve the next job via processJobQueue
  • Execute that job
  • Call getNextPendingJob again, and if found, execute - looping this way until all jobs from that queue are complete

getNextPendingJob has an explicit check that looks for jobs executing on $queue (ie the jobs have a status of either INIT or RUNNING) - if it finds an already running job on that queue it returns false (and the previously mentioned job loop does not execute).
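
Roughly, that check amounts to something like the following (a simplified sketch of the behaviour just described, not the actual module code):

// Simplified sketch only - not the actual implementation.
function getNextPendingJob($type) {
    // If a job on this queue is already INIT or RUNNING, don't hand out another one.
    $running = QueuedJobDescriptor::get()->filter(array(
        'JobType'   => $type,
        'JobStatus' => array(QueuedJob::STATUS_INIT, QueuedJob::STATUS_RUN),
    ));
    if ($running->count()) {
        return false;
    }

    // Otherwise return the oldest queued job of this type.
    return QueuedJobDescriptor::get()->filter(array(
        'JobType'   => $type,
        'JobStatus' => QueuedJob::STATUS_NEW,
    ))->sort('ID', 'ASC')->first();
}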

Now, one thing that does stand out as a possible issue: if your job creates a new version of itself on completion and adds it to the queue, that new job will get picked up by the processJobQueue loop and executed immediately, and so on. Could that be what's happening? Or are you actually seeing two (or three / four / five) instances of the job executing concurrently?

@torleif
Contributor Author

torleif commented Jun 21, 2015

I wouldn't be able to elaborate on the internals, but it's fairly simple to test: create a new job that takes a while (say 30 seconds), then call php framework/cli-script.php dev/tasks/ProcessJobQueueTask a few times in different terminal tabs - they all start executing at the same time. QueuedJob::LARGE etc. seems to have no effect on this. The job queue seems to be incorrectly handing out jobs that are already running.

This is particularly frustrating for me as I have to deal with multiple APIs which can take a varied amount of time to complete. They're fairly large jobs that will start to have unpredictable behavior if they're running at the same time.

@nyeholt
Contributor

nyeholt commented Jun 22, 2015

Yep cool - will try and run up a reproduction of the issue.

@nyeholt
Contributor

nyeholt commented Jul 1, 2015

Just adding a note that I've been able to reproduce something that may be similar to your issue, but not consistently. I've set up 4 instances of the "DummyQueuedJob" via the queuedjobs admin, each with a different run duration - which actually means they have separate signatures. Starting the run from one CLI, it starts normally:

$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:33] Processing queue 2
[2015-07-01 13:52:33] Running Some test job for 150 seconds and others from 2

From another console, it behaves as expected:

$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:35] Processing queue 2
[2015-07-01 13:52:35] Job is still running on 2

But on the 4th or 5th execution, it picks up the same job:


$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:35] Processing queue 2
[2015-07-01 13:52:35] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:42] Processing queue 2
[2015-07-01 13:52:42] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:52] Processing queue 2
[2015-07-01 13:52:52] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:53] Processing queue 2
[2015-07-01 13:52:53] Job is still running on 2
$ php framework/cli-script.php dev/tasks/ProcessJobQueueTask queue=2
Running Task ProcessJobQueueTask

[2015-07-01 13:52:53] Processing queue 2
sendmail: option requires an argument -- 'f'
sendmail: option requires an argument -- 'f'
sendmail: fatal: usage: sendmail [options]
sendmail: fatal: marcus(1000): No recipient addresses found in message header
[2015-07-01 13:52:53] Running Some test job for 150 seconds and others from 2.

Now, the sendmail error in there gives me a bit of a hint as to what might be happening. I'm going to look into it further to see if my guess is correct, but I think the job health check is detecting the running job as stalled (correctly or not) and pausing it. The subsequent attempts to process the queue then pick the job up again, because it has been paused by the system and marked for restarting; what I'm not sure of just yet is whether it's being paused incorrectly, or whether that's a legitimate pause.

A couple of questions about your jobs:

  • Is your job's process() method re-entrant? What I mean by this is, does one call to process() perform a small part of the overall job, incrementing the job state each time, or is the whole job executed in this call?
  • If it is re-entrant, does your job increment its internal currentStep variable for each call of the process() loop?
  • Is your job's process() method executing code that could take 5 minutes or more?

My hunch is that your process() is doing the whole job in the one call - nothing wrong with doing things that way, but if the work inside process() keeps that ProcessJobQueue thread busy long enough without the job state being updated, I think the job health check marks it as 'dead', the job is paused so it can be restarted, and other jobs are free to start running again... while the original job is actually still sitting there processing away. In other words, because the job hasn't told the system that it's still running (by updating currentStep and returning from process()), the queue processing thinks it's dead and marks it as able to restart.
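
For contrast, here's a minimal sketch of a re-entrant job (hypothetical class and work list; it relies on the module's jobData-backed properties):

<?php
// Hypothetical example of a re-entrant job: each call to process() does one
// small unit of work and bumps currentStep so the health check can see progress.
class BatchImportJob extends AbstractQueuedJob
{
    public function getTitle()
    {
        return 'Import items in small steps';
    }

    public function setup()
    {
        $this->remainingItems = range(1, 500); // placeholder work list
        $this->totalSteps = count($this->remainingItems);
    }

    public function process()
    {
        $remaining = $this->remainingItems;
        if (!count($remaining)) {
            $this->isComplete = true;
            return;
        }

        $item = array_shift($remaining);
        // ... process this single $item ...

        // Persist progress back into the job state - this is what tells the
        // health check that the job is still alive.
        $this->remainingItems = $remaining;
        $this->currentStep++;
    }
}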

I'm just trying to put together another test class to confirm, and to figure out if there's a "nice" way to handle the situation.

@nyeholt
Contributor

nyeholt commented Jul 1, 2015

Okay, this looks to be related to commit ba94a35 - essentially, the first call to checkJobHealth (after the job has started running) will mark it as not having been updated, the subsequent call will pause it, and then it'll be immediately resumed (as a new instance of the job) because the pause/process logic triggers in the same thread. So if your process() method runs for more than just a couple of minutes, it'll get paused and restarted unceremoniously.

The issue is that QueuedJobService doesn't know anything about the processing queues (by design - the processors could be the dev/task based one, but could equally be gearman or similar), so the only way it can tell whether a job is still alive is by watching the job's LastProcessedStep (derived from the job's currentStep value) increment. As an interim fix for the current release, it might be simplest to introduce a time-based delay for job health checks, i.e. only check job health every X (5? 10?) minutes. This would at least mean a longer window before a job risks being paused/restarted. A more complete solution is a little more difficult and would probably involve introducing additional DB fields, which I'm not sure is the best way just yet. Would that time delay for health checks work for you? It'd be a configurable variable, defaulting to 2 minutes or something close to the current behaviour, but you could set it higher for your scenario.

nyeholt pushed a commit that referenced this issue Jul 1, 2015
An interim solution for issue #46

Job queue health check can now be configured to extend the duration before the
health check is run. Configure

QueuedJobService:
  health_check_mins: {int}

for the number of minutes to skip on checking. The check is done by modding
this value against the current time's minute, and if that returns 0, then the
health check is performed.
@nyeholt
Contributor

nyeholt commented Jul 1, 2015

Okay you can try https://github.com/silverstripe-australia/silverstripe-queuedjobs/tree/fix-health-check-time, and add some config to your project yml like

QueuedJobService:
  health_check_mins: 5

to set it to run the health check only every 5th minute.
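
For the record, the gating described in that commit message boils down to something like this (illustrative sketch only, not the exact code from the branch):

// Illustrative sketch of the minute-based gate, inside QueuedJobService.
$mins = (int) Config::inst()->get('QueuedJobService', 'health_check_mins');
if ($mins < 1 || ((int) date('i') % $mins) === 0) {
    $this->checkJobHealth();
}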

@torleif
Contributor Author

torleif commented Jul 1, 2015

Cool - I'll give this a try to see if it helps the situation. I'll also give you some code so you can reproduce the issue as well.

I'm hoping this will help, as we've got 6 or so jobs that have to be run manually because they aren't being executed by the queued jobs module.

@nyeholt
Contributor

nyeholt commented Sep 11, 2015

Better yet - you can simply make your job set $this->currentStep = -1 in the constructor. That way the health check doesn't kick in :)
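
In other words, something along these lines (hypothetical job class; only the currentStep = -1 line comes from the suggestion above, the rest is boilerplate for illustration):

<?php
class LongApiJob extends AbstractQueuedJob
{
    public function __construct()
    {
        // A negative step keeps the stalled-job health check from pausing
        // this job while a single long process() call is under way.
        $this->currentStep = -1;
        $this->totalSteps = 1;
    }

    public function getTitle()
    {
        return 'Long-running API job';
    }

    public function process()
    {
        // ... long-running work done in a single call ...
        $this->isComplete = true;
    }
}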

@zarocknz

Just a quick note that we hit this issue too and overcame it by setting $this->currentStep = -1 in the constructor. Thanks.

@torleif
Contributor Author

torleif commented May 13, 2016

It's a lazy hack, but we managed to avoid the issue by spacing out our jobs and throwing more CPU at it. health_check_mins: 5 didn't seem to do much.

@maxime-rainville
Collaborator

Need to validate whether this affects SS4.
