Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SchedulerFactoryBean's setOverwriteExistingJobs does not reliably work in a cluster [SPR-14745] #19311

Closed
spring-projects-issues opened this issue Sep 24, 2016 · 12 comments
Labels
in: core Issues in core modules (aop, beans, core, context, expression) status: bulk-closed An outdated, unresolved issue that's closed in bulk as part of a cleaning process

Comments

@spring-projects-issues
Copy link
Collaborator

spring-projects-issues commented Sep 24, 2016

Alexandru-Constantin Bledea opened SPR-14745 and commented

This happens when I want to deploy a spring boot application on a cluster.

What I suspect the problem to be is SchedulerFactoryBean's setOverwriteExistingJobs not offering enough protection.

One node will be initializing the scheduler and it will decide to replace the trigger (breakpoint org.quartz.impl.jdbcjobstore.SimpleTriggerPersistenceDelegate#deleteExtendedTriggerProperties )

Right after it executes this method, the trigger won't be in the database any longer so when another node in the cluster will try to read it (org.quartz.impl.jdbcjobstore.JobStoreSupport#retrieveTrigger) it will fail with the exception below. Because of this exception, the whole application will fail to start (not just the scheduler).

Caused by: org.quartz.JobPersistenceException: Couldn't retrieve trigger: No record found for selection of Trigger with key:

I attached the logs (The exception can be found on the Server-1 node after the 4th restart)

For the whole project that demonstrates this issue go to https://github.com/apixandru/case-study/tree/master/spring-boot-quartz

The way that we configure the scheduler is here

@Bean
JobDetailFactoryBean jobFactoryBean() {
    JobDetailFactoryBean bean = new JobDetailFactoryBean();
    bean.setDurability(true);
    bean.setName("Sampler");
    bean.setJobClass(SampleJob.class);
    return bean;
}

@Bean
SimpleTriggerFactoryBean triggerFactoryBean(JobDetailFactoryBean jobFactoryBean) {
    SimpleTriggerFactoryBean bean = new SimpleTriggerFactoryBean();
    bean.setName("Sampler Trigger");
    bean.setRepeatInterval(20_000);
    bean.setJobDetail(jobFactoryBean.getObject());
    return bean;
}

@Bean
SchedulerFactoryBean schedulerFactoryBean(SimpleTriggerFactoryBean triggerFactoryBean, DataSource dataSource, Dependency dependency) {
    Properties props = new Properties();
    props.put("org.quartz.scheduler.instanceId", "AUTO");
    props.put("org.quartz.jobStore.isClustered", "true");

    SchedulerFactoryBean bean = new SchedulerFactoryBean();
    bean.setTriggers(triggerFactoryBean.getObject());
    bean.setSchedulerName("Demo Scheduler");
    bean.setSchedulerContextAsMap(Collections.singletonMap("dependency", dependency));
    bean.setOverwriteExistingJobs(true);
    bean.setDataSource(dataSource);
    bean.setQuartzProperties(props);

    return bean;
}

This happens a lot on our work servers but it's a lot harder to get locally (possibly due to the fact that the actual servers are dedicated and have a lot more power than my local machine?)

To get the bug on any machine, start one server in debug mode and put a breakpoint on SimpleTriggerPersistenceDelegate.deleteExtendedTriggerProperties and just after it executes, start the second server and you will get this exception

Anyway, I managed to get this error locally as well after about 40 redeploys to my local clustered weblogic server.


Affects: 4.3.3

Reference URL: http://stackoverflow.com/questions/39673572/spring-quartz-scheduler-race-condition

Attachments:

Issue Links:

1 votes, 3 watchers

@spring-projects-issues
Copy link
Collaborator Author

Juergen Hoeller commented

Isn't this an issue with Quartz's trigger persistence mechanism itself? Is there anything specific to Spring's Quartz integration here? Otherwise you'd have to report this to the Quartz project...

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

I'm actually not sure to be honest but I'm pretty sure that you're right now that I think about it.

The issue seemed to arise from the fact that overwriteExistingJobs is set to true. However, i just had another look into it that in turn eventually delegates to the scheduler's "replaceTrigger" which doesn't seem to do the whole replacing thing transactionally, there's a small window where the database is in an inconsistent state because of it.

I will try to see if i can get into this scenario without spring at all (under normal circumstances as defined by their docs), but it definitely looks plausible. If i get there I will raise the issue on the quartz side. (Thanks for the quick response.)

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

I've tinkered with vanilla quartz a bit and I can't get to a scenario where I would reschedule a job in the way that it's done with SchedulerFactoryBean.

Here's what i came up with

package com.apixandru.casestudy;

import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.impl.StdSchedulerFactory;
import org.quartz.impl.jdbcjobstore.JobStoreTX;

import java.util.Properties;

import static java.util.Collections.singleton;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.SimpleScheduleBuilder.simpleSchedule;
import static org.quartz.TriggerBuilder.newTrigger;
import static org.quartz.impl.StdSchedulerFactory.PROP_JOB_STORE_CLASS;

public class QuartzTest {

    public static void main(String[] args) throws SchedulerException, InterruptedException {

        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "QuartzTest Scheduler");
        props.put("org.quartz.scheduler.instanceId", "AUTO");

        props.put("org.quartz.jobStore.isClustered", "true");
        props.put("org.quartz.threadPool.threadCount", "2");

        props.put("org.quartz.dataSource.NAME.URL", "jdbc:oracle:thin:@localhost:49161:xe");
        props.put("org.quartz.dataSource.NAME.user", "system");
        props.put("org.quartz.dataSource.NAME.password", "oracle");
        props.put("org.quartz.dataSource.NAME.driver", "oracle.jdbc.OracleDriver");
        props.put("org.quartz.jobStore.dataSource", "NAME");

        props.put(PROP_JOB_STORE_CLASS, JobStoreTX.class.getName());

        Scheduler scheduler = new StdSchedulerFactory(props).getScheduler();

        scheduler.start();


        JobDetail job = newJob(HelloJob.class)
                .withIdentity("QuartzTest Job")
                .build();

        Trigger trigger = newTrigger()
                .withIdentity("QuartzTest Trigger")
                .startNow()
                .withSchedule(simpleSchedule()
                        .withIntervalInSeconds(20)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, singleton(trigger), true);

    }
}

This code never throws that exception because it manages the whole trigger replacing internally and it does that by updating the existing trigger rather than deleting it and reinserting it.

The rescheduleJob that is used in spring seems to do all that managing on the spring side (to figure out that the trigger already exists and so on).

@spring-projects-issues
Copy link
Collaborator Author

Juergen Hoeller commented

Alright, we'll try to reimplement our logic there to leave the entire replacement step up to Quartz itself...

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

I think that the fact that the job has to be durable in the spring-quartz integration is probably exactly because of this issue.

Since the trigger is deleted in replaceTrigger, the job would have been without any trigger and it would have been deleted which meant that replace trigger would probably have failed.

I'm facing another issue currently because of this, the jobs are still in the database even though they are no longer in the code (i'm pretty sure it's because they are durable) .If i just change the job detail factory bean's name, the trigger is not currently being replaced and linked to the new job. It still remains on the old job even though a new job is created in the database.

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

Hi Juergen Hoeller,

I see that this was pushed back to 5.x.
I understand that it's not on the priority list but maybe some small changes that would allow us to subclass SchedulerFactoryBean and change the default behavior would let us work around the issue until the official fix.

I made some (very) minor changes that allows me to work around the issue here: #1198
Would it be possible to have this in 4.3.x?

Regards,
Alexandru

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

I believe that allowing SchedulerFactoryBean to be subclassed and giving the option to override the behavior in the subclass is a good approach because there are other things that I expected to happen that aren't currently happening and I cannot insert hooks for my logic anywhere in the SchedulerFactoryBean.

Here are a couple of tests of what I would expect to happen but aren't currently.
https://github.com/apixandru/case-study/tree/master/spring-boot-quartz-expectations

Also, this pull request doesn't fix it either (however, it does come close if you set the durability flag to false - which wasn't possible with the old logic, and you set remove unused triggers to true), I suppose that remove unused triggers should come before rescheduling them to remove the old link between the old job and the new one
#1200

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

Just found out something really interesting. The problem is not the method that we are calling. The replaceJob does actually do this atomically and inside a transaction. The thing is that spring doesn't honor the transaction apparently. I'm assuming that if it doesn't honor the transaction here, it probably doesn't honor it in other places as well which is not that great.

quartz-scheduler/quartz#57

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

Ok, so the reason why it is not respecting the transaction is that the spring provided job store's transaction management is container managed. The problem with that is that we do not provide a transaction so it's done non transactionally. (Actually, the whole code that registers everything is executed in the afterPropertiesSet method outside of a transaction.)

Just to check if this is correct, I manually inserted a transaction where i knew it fails and it no longer failed. (the trigger was no longer removed eagerly, but after the commit)

// Register Triggers.
if (this.triggers != null) {
  for (final Trigger trigger : this.triggers) {
    TransactionTemplate tmpl = new TransactionTemplate(txManager);
    tmpl.execute(new TransactionCallbackWithoutResult() {
      @Override
      protected void doInTransactionWithoutResult(TransactionStatus status) {
        try {
            addTriggerToScheduler(trigger);
        } catch (SchedulerException e) {
            e.printStackTrace();
        }
    }});
  }

@spring-projects-issues
Copy link
Collaborator Author

Juergen Hoeller commented

If such wrapping in a transaction helped, wouldn't it also work to set the transaction manager through SchedulerFactoryBean's setTransactionManager method? That would wrap the entire registerJobsAndTriggers procedure in one large transaction... Or do you specifically need an individual transaction per trigger here?

@spring-projects-issues
Copy link
Collaborator Author

Alexandru-Constantin Bledea commented

I can confirm that setting the TransactionManager solves the concurrency issues without any additional changes.

@spring-projects-issues spring-projects-issues added status: waiting-for-triage An issue we've not yet triaged or decided on type: enhancement A general enhancement in: core Issues in core modules (aop, beans, core, context, expression) and removed type: enhancement A general enhancement labels Jan 11, 2019
@rstoyanchev rstoyanchev added status: bulk-closed An outdated, unresolved issue that's closed in bulk as part of a cleaning process and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Jan 11, 2019
@spring-projects-issues
Copy link
Collaborator Author

Bulk closing outdated, unresolved issues. Please, reopen if still relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: core Issues in core modules (aop, beans, core, context, expression) status: bulk-closed An outdated, unresolved issue that's closed in bulk as part of a cleaning process
Projects
None yet
Development

No branches or pull requests

2 participants