Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rejected Job can not be restarted [BATCH-2675] #930

Closed
spring-issuemaster opened this issue Jan 29, 2018 · 14 comments
Closed

Rejected Job can not be restarted [BATCH-2675] #930

spring-issuemaster opened this issue Jan 29, 2018 · 14 comments

Comments

@spring-issuemaster
Copy link
Collaborator

@spring-issuemaster spring-issuemaster commented Jan 29, 2018

Heiko Does opened BATCH-2675 and commented

We had a Problem in one of our batch applications, because our queue capacity was too small.
So some of the jobs resulted in an TaskRejectedException and the job was marked as FAILED in the database. After we figured that out we wanted to restarted the failed jobs via JobOperator.restart method. This resulted in an JobExecutionAlreadyRunningException because the job had no end time and thishttps://github.com/spring-projects/spring-batch/blob/master/spring-batch-core/src/main/java/org/springframework/batch/core/repository/support/SimpleJobRepository.java#L119 check failed.

I my opinion this behaviour is not correct and i would suggest to add a end time when a task ist rejected (https://github.com/spring-projects/spring-batch/blob/master/spring-batch-core/src/main/java/org/springframework/batch/core/launch/support/SimpleJobLauncher.java#L167) so that the job could be restarted.


Affects: 3.0.9, 4.0.1, 4.1.0.RC1

Issue Links:

  • BATCH-2760 Failed JobExecution due to unavailable TaskExecutor leaves End Time unpopulated
    ("is duplicated by")

Referenced from: pull request #659, and commits b28238e, 13ca0e0, f8f8a02

Backported to: 4.0.2, 3.0.10

0 votes, 5 watchers

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 9, 2018

Dimitrios Liapis commented

Continuing the discussion here for the closed PR https://github.com/spring-projects/spring-batch/pull/643 

Michael Minella Mahmoud Ben Hassine thanks for the reviews.

In the duplicate ticket BATCH-2760 I've raised for this issue: I've identified the method that gives the false positive for this case ( SimpleJobExplorer#findRunningJobExecutions() ) mainly because the subsequent query is based on the Job Execution END_TIME column being null that also includes this erroneous cases.

Looking at the alternative scenario suggested by Michael Minella in the PR comment: we could amend the above query somehow to filter out these special erroneous cases. For instance we could amend the query to exclude entries that have STATUS as FAILED along with the existing 'END_TIME is null' predicate:

SELECT E.JOB_EXECUTION_ID, E.START_TIME, E.END_TIME, E.STATUS, E.EXIT_CODE, E.EXIT_MESSAGE, E.CREATE_TIME, E.LAST_UPDATED, E.VERSION, E.JOB_INSTANCE_ID, E.JOB_CONFIGURATION_LOCATION from %PREFIX%JOB_EXECUTION E, %PREFIX%JOB_INSTANCE I where E.JOB_INSTANCE_ID=I.JOB_INSTANCE_ID and I.JOB_NAME=? and E.END_TIME is NULL *and E.STATUS != 'FAILED'* order by E.JOB_EXECUTION_ID desc

We could also use the Job Execution EXIT_MESSAGE column where I think the 'TaskRejectedException' name is also logged on these erroneous cases though I'm not sure how performant including this predicate would be.

Or maybe an alternative solution all together?

What do you think??

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 10, 2018

Mahmoud Ben Hassine commented

I would not go for filtering based on the EXIT_MESSAGE containing the text "TaskRejectedException" as this would:

  • break if we change the exit message content (the exception name for instance)
  • perform poorly if the exit message is long enough (for queries with 'like %XXX%')

Excluding failed executions based on the status would solve the issue. However, the status "FAILED" is not the appropriate status for these rejected executions. Their status should be something like "REJECTED". The status "FAILED" implies that the execution was run and failed (which is not the case), and having an execution in "FAILED" status without start time and end time makes no sense to me (because we can't answer questions like: when did it start and when did it fail? How long did it take to fail?). That's why I suggested to set the end time in BATCH-2760). On the other hand, it makes sense for a "REJECTED" execution to not have a start/end time.

So IMO, the correct fix for this issue would be to set the status of rejected execution to "REJECTED" (a new status we need to create or may be use UNKNOWN ?) and filter out rejected executions in the query.

What do you think?

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 10, 2018

Dimitrios Liapis commented

Agreed on the EXIT_MESSAGE not being a good idea to go about this, just mentioned it as it's a way to narrow down on these particular failures.

Having a dedicated status (either existing "UNKOWN" or a new "REJECTED") sounds good idea to me and will keep things a bit cleaner.

Pending also on Michael Minella's opinion about this new approach, I can start putting together a new PR.

One question in case we decide to introduce a new dedicated status i.e. "REJECTED" to capture this: where exactly should we place it inside the BatchStatus enum since there is an implicit ordering? I would assume either directly after "UNKNOWN" or directly before "UNKNOWN".

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 16, 2018

Mahmoud Ben Hassine commented

Hi Dimitrios Liapis,

After discussing this issue with Michael, we will not create a new status "REJECTED" as this might introduce confusions about the orchestration part of batch jobs.

Finding running executions should be done by looking at executions having a start time but no end time yet. This can be written in the query as ... and E.START_TIME is not NULL and E.END_TIME is NULL ....

Since rejected tasks do not have a start time set (and no end time neither), they should be excluded from the "findRunningExecutions" query. Note that we don't need to filter on the status anymore.

Can you please open a PR with this change? We would like to include the fix in the upcoming 4.1.0.RELEASE.

Kind regards,
Mahmoud

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 16, 2018

Dimitrios Liapis commented

Hi Mahmoud Ben Hassine,

thanks for that.

I've started working on this today and while working on the tests, another thing popped up:

MapJobExecutionDao#findRunningJobExecutions() internally uses JobExecution#isRunning() where isRunning() method is defined as endTime == null.

Initially I've changed that to be startTime != null && endTime == null, to be in synch with the above fix on the query inside JdbcJobExecutionDao#findRunningJobExecutions()

This latter change opened up to some new test failures with a common pattern:

SimpleJobRepository#createJobExecution() is internally using the JobExecution#isRunning() method. Therefore, i.e. following the test MapJobRepositoryFactoryBeanTests#testCreateRepository(), two consecutive SimpleJobRepository#createJobExecution() calls will not detect a positive isRunning() anymore as the Start Time is not set.

Which got me questioning whether the JobExecution#isRunning() really is a check to find out if a JobExecution has been created rather than running (which implies that it has already started).

What are you thoughts?

Many thanks

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 17, 2018

Mahmoud Ben Hassine commented

Thank you for fixing the map based dao to be in sync with the jdbc one (y).

Which got me questioning whether the JobExecution#isRunning() really is a check to find out if a JobExecution has been created rather than running (which implies that it has already started).

We have createTime for creation time, so isRunning is really about running executions and not created ones. It is the test that should be updated:

repository.createJobExecution(job.getName(), jobParameters);

try {
	repository.createJobExecution(job.getName(), jobParameters);
	fail("Expected JobExecutionAlreadyRunningException");
}
catch (JobExecutionAlreadyRunningException e) {
	// expected
}

In this test, we create a job execution (note that we don't run it), and then try create another one and expect to have a running execution (but the first one was never run). If you look at the JobExecution#setStartTime method, it is used in AbstractJob#doExecute method. So we need to execute the execution to have its start time set (which makes sense to me). So the test should be updated with something like:

JobExecution jobExecution = repository.createJobExecution(job.getName(), jobParameters);
job.execute(jobExecution); // this would set the start time and update the execution in the job repository

try {
	repository.createJobExecution(job.getName(), jobParameters);
	fail("Expected JobExecutionAlreadyRunningException");
}
catch (JobExecutionAlreadyRunningException e) {
	// expected
}
@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 18, 2018

Dimitrios Liapis commented

Thanks, sounds good.

Looking at it now, AbstractJob#execute() sets the Start Time but also sets the End Time in a finally block.

I'm guessing if there isn't another way naturally just setting the Start Time alone without the End Time I'll try circumventing the End Time setting via mock-ing/stub-ing it somehow.

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 18, 2018

Mahmoud Ben Hassine commented

I see, there should be some time between the call to job.execute(jobExecution); and the second call to repository.createJobExecution(job.getName(), jobParameters); so the repository sees the execution as running (start time set but the end time not set yet). The simplest thing to do it is to simulate a running execution, something like:

@Test
public void testCreateRepository() throws Exception {
	tested.afterPropertiesSet();
	JobRepository repository = tested.getObject();
	Job job = new JobSupport("jobName");
	JobParameters jobParameters = new JobParameters();

	JobExecution jobExecution = repository.createJobExecution(job.getName(), jobParameters);

	// simulate a running execution
	jobExecution.setStartTime(new Date());
	repository.update(jobExecution);

	try {
		repository.createJobExecution(job.getName(), jobParameters);
		fail("Expected JobExecutionAlreadyRunningException");
	}
	catch (JobExecutionAlreadyRunningException e) {
		// expected
	}
}

Otherwise, we need to run a job (in a separate thread) and make it take some time, so that the job repository has a chance to see the execution as running when it tries to create a new one. For example:

@Test
public void testCreateRepository() throws Exception {
	ApplicationContext applicationContext = new AnnotationConfigApplicationContext(JobConfiguration.class);
	JobRepository jobRepository = applicationContext.getBean(JobRepository.class);
	Job job = applicationContext.getBean(Job.class);
	JobLauncher jobLauncher = applicationContext.getBean(JobLauncher.class);

	jobLauncher.run(job, new JobParameters());

	try {
		jobRepository.createJobExecution(job.getName(), new JobParameters());
		fail("Expected JobExecutionAlreadyRunningException");
	}
	catch (JobExecutionAlreadyRunningException e) {
		// expected
	}
}

@Configuration
@EnableBatchProcessing
static class JobConfiguration {

	private JobBuilderFactory jobBuilderFactory;
	private StepBuilderFactory stepBuilderFactory;

	public JobConfiguration(JobBuilderFactory jobBuilderFactory, StepBuilderFactory stepBuilderFactory) {
		this.jobBuilderFactory = jobBuilderFactory;
		this.stepBuilderFactory = stepBuilderFactory;
	}

	@Bean
	public JobLauncher jobLauncher(JobRepository jobRepository) {
		SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
		jobLauncher.setJobRepository(jobRepository);
		jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor());
		return jobLauncher;
	}

	@Bean
	public Job job() {
		return jobBuilderFactory.get("job")
				.start(stepBuilderFactory.get("step")
						.tasklet((contribution, chunkContext) -> {
							Thread.sleep(500);
							return RepeatStatus.FINISHED;
						})
						.build())
				.build();
	}
}

Both tests pass on my side. I let you choose how to update the test (or suggest another solution) and we can discuss it on the PR.

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 18, 2018

Dimitrios Liapis commented

Many thanks for the concrete examples - it made things very clear.

PR raised: #659

Could we also please consider this fix for 3.0.10.RELEASE too?

We have substantial infrastructure based on this version and we can't currently upgrade to version 4.

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 18, 2018

Michael Minella commented

Dimitrios Liapis We definitely can take a look at back porting this fix (no promises), but I'd be curious as to what is holding you back from upgrading to Spring Batch 4.

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 18, 2018

Dimitrios Liapis commented

Michael Minella It's really the Spring 5 dependency which we can't currently upgrade to (from almost latest 4).

That would be highly appreciated if we could back-port it and I would be more than happy to help on that (apply it independently via a PR to 3.0.x branch or any other way you prefer).

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Oct 26, 2018

Mahmoud Ben Hassine commented

I'm resolving this as the PR was merged to the master branch (for the upcoming 4.1.0 GA and) and back ported to 4.0.x (for 4.0.2).

@ Dimitrios Liapis you are welcome to help in back porting the fix to 3.0.x via a PR if you want (otherwise we will try to do it, but no promises as mentioned by Michael).

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Dec 14, 2018

Spazbob commented

Mahmoud Ben Hassine

This fix tripped us up a bit. We have built various dashboards that use SimpleJobExplorer's findRunningJobExecutions(). We have worker pool of 12 threads to run jobs, so if you submitted 16 jobs at once, prior to the fix all 16 would appear in the list, but now 4 of them will not start immediately (as they are waiting for an execution thread) and so they drop off the list.

 

It worked well for us when the "waiting" jobs appeared in the list, as it made it clear that they would be running shortly.

What is the recommended approach to get a list of waiting AND running jobs? I know we could use getJobExecutions() and then filter, but that would imply pulling back a significant amount of history which is inefficient.

@spring-issuemaster

This comment has been minimized.

Copy link
Collaborator Author

@spring-issuemaster spring-issuemaster commented Dec 19, 2018

Mahmoud Ben Hassine commented

@Spazbob Thank you for reporting this, your use case confirms that there was an issue. In your example, the 4 executions that are not started yet should not be returned by JobExplorer#findRunningJobExecutions because they are not effectively running.

What is the recommended approach to get a list of waiting AND running jobs? I know we could use getJobExecutions() and then filter, but that would imply pulling back a significant amount of history which is inefficient.

  • For running jobs, you can use findRunningJobExecutions
  • For waiting jobs, If you don't want to use getJobExecutions and filter the results, you can do the filtering in the database by sending a query that grabs only waiting executions (having status = STARTING). Another solution if you use a ThreadPoolTaskExecutor with your JobLauncher is to get the pending tasks from the task executor's queue with taskExecutor.getThreadPoolExecutor().getQueue().size() (in your example this would return 4). I've seen people using a custom JobLauncher that uses a rabbitmq queue to submit jobs and they monitor the queue size to get pending jobs. The same idea applies with the in-memory queue of a ThreadPoolTaskExecutor.

Does this help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.