The jobs recovering (on scheduler startup) blocks simple trigger after failover situation for job which was executing during JVM crash #93

Closed
ghost opened this issue Dec 13, 2016 · 25 comments

@ghost

ghost commented Dec 13, 2016

http://www.quartz-scheduler.org/documentation/faq.html

What is Quartz?
...
Quartz is fault-tolerant ...

However, this fault tolerance breaks down in the case described below.

Reproduced with Quartz 2.2.1 and 2.2.3 (other versions were not checked).

Prerequisites

  • The job has a simple trigger that repeats execution at a fixed interval (e.g. one minute).
  • Concurrent execution is not allowed.
  • Recovery is not requested.

The JVM crashes during a job execution (or is stopped for maintenance during a job execution).

The downtime is much longer than the trigger's interval (e.g. > 2 minutes).

The relevant Quartz tables are TRIGGERS and FIRED_TRIGGERS. Their states directly after the JVM crash are:

TRIGGERS table

| TRIGGER_NAME | TRIGGER_GROUP | JOB_NAME | JOB_GROUP | NEXT_FIRE_TIME | PREV_FIRE_TIME | TRIGGER_STATE | TRIGGER_TYPE | START_TIME | MISFIRE_INSTR | SCHED_NAME |
|---|---|---|---|---|---|---|---|---|---|---|
| test | TestJob | test | TestJob | 1481618100000 | 1481618040000 | BLOCKED | SIMPLE | 1481555640000 | 0 | scheduler |

FIRED_TRIGGERS table

| ENTRY_ID | TRIGGER_NAME | TRIGGER_GROUP | INSTANCE_NAME | FIRED_TIME | STATE | JOB_NAME | JOB_GROUP | REQUESTS_RECOVERY | SCHED_TIME | IS_NONCONCURRENT | SCHED_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NON_CLUSTERED1481617513489 | test | TestJob | NON_CLUSTERED | 1481618049269 | EXECUTING | test | TestJob | 0 | 1481618040000 | 1 | scheduler |
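
The time columns above are epoch milliseconds, which makes the timeline hard to read at a glance. The following is a small sketch that converts the logged values to UTC timestamps; the class and method names (`QuartzTimes`, `toInstant`) are hypothetical helpers and not part of Quartz.

```java
import java.time.Instant;

class QuartzTimes {

    /** Converts an epoch-millisecond column value (e.g. NEXT_FIRE_TIME) to a UTC instant. */
    static Instant toInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        // Values copied from the TRIGGERS and FIRED_TRIGGERS rows shown above.
        System.out.println("START_TIME     " + toInstant(1481555640000L));
        System.out.println("PREV_FIRE_TIME " + toInstant(1481618040000L));
        System.out.println("FIRED_TIME     " + toInstant(1481618049269L));
        System.out.println("NEXT_FIRE_TIME " + toInstant(1481618100000L));
    }
}
```

Read this way, the job was scheduled for 2016-12-13T08:34:00Z, actually fired about nine seconds later, and was still marked EXECUTING when the JVM went down, with the next firing due at 08:35:00Z.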

The 1st scheduler start after the system crashed (or was stopped for maintenance)

Below are the SQL statements issued by the recoverJobs procedure on scheduler.start(), followed by the resulting states of the TRIGGERS and FIRED_TRIGGERS tables.

UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING'
  WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'ACQUIRED' OR TRIGGER_STATE = 'BLOCKED')
-> the trigger has been updated from BLOCKED to WAITING

UPDATE TRIGGERS SET TRIGGER_STATE = 'PAUSED' WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'PAUSED_BLOCKED' OR TRIGGER_STATE = 'PAUSED_BLOCKED')
// not important

SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < 1481618263782 AND TRIGGER_STATE = 'WAITING'
  ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
-> trigger has been selected as misfired in WAITING state

SELECT * FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM SIMPLE_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_NAME FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = '_$_ALL_GROUPS_PAUSED_$_'
SELECT * FROM JOB_DETAILS WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
// not very important, presumably collects information about the misfired trigger and its job

SELECT * FROM FIRED_TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
-> fired trigger has been selected for misfired trigger

UPDATE TRIGGERS SET JOB_NAME = 'test', JOB_GROUP = 'TestJob', DESCRIPTION = NULL, NEXT_FIRE_TIME = 1481618340000, PREV_FIRE_TIME = 1481618040000,
  TRIGGER_STATE = 'BLOCKED', -- !!!! IMPORTANT !!!!
  TRIGGER_TYPE = 'SIMPLE', START_TIME = 1481555640000, END_TIME = 0, CALENDAR_NAME = NULL, MISFIRE_INSTR = 0, PRIORITY = 5, JOB_DATA = '<byte[]>'
  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
-> (IMPORTANT) the trigger, which is both misfired and fired (because its job was executing when the JVM crashed/stopped),
    has been set to the BLOCKED state on scheduler start

UPDATE SIMPLE_TRIGGERS SET REPEAT_COUNT = -1, REPEAT_INTERVAL = 60000, TIMES_TRIGGERED = 1045  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
// not important

SELECT * FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND INSTANCE_NAME = 'NON_CLUSTERED' AND REQUESTS_RECOVERY = 1
// not important, presumably handles triggers that requested recovery

SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_STATE = 'COMPLETE'
// not important, assume select to remove stale triggers

DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
-> fired triggers are removed

TRIGGERS table

| TRIGGER_NAME | TRIGGER_GROUP | JOB_NAME | JOB_GROUP | NEXT_FIRE_TIME | PREV_FIRE_TIME | TRIGGER_STATE | TRIGGER_TYPE | START_TIME | MISFIRE_INSTR | SCHED_NAME |
|---|---|---|---|---|---|---|---|---|---|---|
| test | TestJob | test | TestJob | 1481618100000 | 1481618040000 | BLOCKED | SIMPLE | 1481555640000 | 0 | scheduler |

FIRED_TRIGGERS table

No rows

Problem

The job has a repeating trigger, but the trigger is left in the BLOCKED state. It will not fire and the job will not be executed, at least until the JVM is restarted again.
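
The transition seen in the log can be modelled in a few lines. This is a simplified in-memory sketch, not Quartz's actual code: `BlockedStateModel`, `FiredRow` and `stateAfterMisfireUpdate` are hypothetical names, and the rule it encodes (a WAITING trigger is written back as BLOCKED when a FIRED_TRIGGERS row exists for a job that disallows concurrent execution) is inferred from the SQL above, where the UPDATE sets BLOCKED right after the SELECT on FIRED_TRIGGERS returns the stale EXECUTING row.

```java
import java.util.Collections;
import java.util.List;

class BlockedStateModel {

    /** Hypothetical stand-in for one FIRED_TRIGGERS row; not a Quartz class. */
    static final class FiredRow {
        final String jobName;
        final String jobGroup;
        final boolean nonConcurrent; // mirrors the IS_NONCONCURRENT column

        FiredRow(String jobName, String jobGroup, boolean nonConcurrent) {
            this.jobName = jobName;
            this.jobGroup = jobGroup;
            this.nonConcurrent = nonConcurrent;
        }
    }

    /** Returns the state written back to TRIGGERS for a misfired trigger of the given job. */
    static String stateAfterMisfireUpdate(String currentState, String jobName,
                                          String jobGroup, List<FiredRow> firedTriggers) {
        if (!currentState.equals("WAITING")) {
            return currentState; // only the WAITING case appears in the log
        }
        for (FiredRow row : firedTriggers) {
            boolean sameJob = row.jobName.equals(jobName) && row.jobGroup.equals(jobGroup);
            if (sameJob && row.nonConcurrent) {
                // A leftover row from the crashed JVM is indistinguishable from a live execution.
                return "BLOCKED";
            }
        }
        return currentState;
    }

    public static void main(String[] args) {
        List<FiredRow> stale = Collections.singletonList(new FiredRow("test", "TestJob", true));
        // 1st start: the stale EXECUTING row is still present.
        System.out.println(stateAfterMisfireUpdate("WAITING", "test", "TestJob", stale));
        // 2nd start: FIRED_TRIGGERS was emptied at the end of the 1st start.
        System.out.println(stateAfterMisfireUpdate("WAITING", "test", "TestJob",
                Collections.<FiredRow>emptyList()));
    }
}
```

In this model the first call yields BLOCKED and the second yields WAITING. Since FIRED_TRIGGERS is only emptied after the misfire handling has run, nothing in the same startup moves the trigger back to WAITING.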

The 2nd scheduler start (just for test purposes)

Below are SQL statements for recoverJobs procedure on scheduler.start() and states of TRIGGERS and FIRED_TRIGGERS tables

UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING'
  WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'ACQUIRED' OR TRIGGER_STATE = 'BLOCKED')
-> the trigger has been updated from BLOCKED to WAITING

UPDATE TRIGGERS SET TRIGGER_STATE = 'PAUSED' WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'PAUSED_BLOCKED' OR TRIGGER_STATE = 'PAUSED_BLOCKED')
// not important

SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < 1481621819212 AND TRIGGER_STATE = 'WAITING'
  ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
-> trigger has been selected as misfired in WAITING state

SELECT * FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM SIMPLE_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_NAME FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = '_$_ALL_GROUPS_PAUSED_$_'
SELECT * FROM JOB_DETAILS WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
// not very important, presumably collects information about the misfired trigger and its job

SELECT * FROM FIRED_TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
-> no fired trigger is found for the misfired trigger (FIRED_TRIGGERS was emptied during the 1st start)

UPDATE TRIGGERS SET JOB_NAME = 'test', JOB_GROUP = 'TestJob', DESCRIPTION = NULL, NEXT_FIRE_TIME = 1481621880000, PREV_FIRE_TIME = 1481618040000,
  TRIGGER_STATE = 'WAITING', -- !!! OK without fired trigger !!!
  TRIGGER_TYPE = 'SIMPLE', START_TIME = 1481555640000, END_TIME = 0, CALENDAR_NAME = NULL, MISFIRE_INSTR = 0, PRIORITY = 5, JOB_DATA = '<byte[]>'
  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
-> now it is OK because there is no fired trigger

UPDATE SIMPLE_TRIGGERS SET REPEAT_COUNT = -1, REPEAT_INTERVAL = 60000, TIMES_TRIGGERED = 1104 WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND INSTANCE_NAME = 'NON_CLUSTERED' AND REQUESTS_RECOVERY = 1
SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_STATE = 'COMPLETE'
DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
// not important
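
The NEXT_FIRE_TIME and TIMES_TRIGGERED values written by both starts can be reproduced with simple arithmetic on the trigger's START_TIME and REPEAT_INTERVAL. The sketch below is only a consistency check on the logged numbers, not Quartz's implementation. It assumes the default misfire threshold of 60000 ms (`org.quartz.jobStore.misfireThreshold`), so that "now" is the timestamp in the misfire query plus 60000. The class and method names are hypothetical.

```java
class MisfireArithmetic {

    /** First point on the grid startTime + k * interval that lies strictly after now. */
    static long nextFireTimeAfter(long startTime, long interval, long now) {
        long completedIntervals = (now - startTime) / interval;
        return startTime + (completedIntervals + 1) * interval;
    }

    /** Number of intervals between the start time and the given fire time. */
    static long intervalsSinceStart(long startTime, long interval, long fireTime) {
        return (fireTime - startTime) / interval;
    }

    public static void main(String[] args) {
        long start = 1481555640000L;   // START_TIME
        long interval = 60000L;        // REPEAT_INTERVAL
        long threshold = 60000L;       // assumed default misfire threshold

        // Timestamps taken from the "NEXT_FIRE_TIME < ..." misfire queries.
        long now1 = 1481618263782L + threshold;
        long now2 = 1481621819212L + threshold;

        long next1 = nextFireTimeAfter(start, interval, now1);
        long next2 = nextFireTimeAfter(start, interval, now2);

        System.out.println(next1 + " " + intervalsSinceStart(start, interval, next1));
        System.out.println(next2 + " " + intervalsSinceStart(start, interval, next2));
    }
}
```

Under that assumption the results are 1481618340000 with 1045 intervals for the 1st start and 1481621880000 with 1104 intervals for the 2nd start, matching the logged UPDATE statements. The rescheduling itself is therefore correct in both runs; only TRIGGER_STATE differs.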

Result

The job with the repeating trigger is executed according to the trigger definition.

However, restarting the JVM twice is NOT an acceptable workaround.
Other jobs/triggers may end up in the same situation during the second restart.

Setting request recovery to true is not a workaround either. In our case we definitely do not need recovery, but the job must be executed according to the trigger interval after the JVM restarts.

A possible workaround (until this is fixed in Quartz) is:

  // on JVM startup, before the scheduler is started
  if (!scheduler.getMetaData().isJobStoreClustered()) {
    // delete all rows from FIRED_TRIGGERS
    // which do not request recovery
    DELETE FROM FIRED_TRIGGERS
      WHERE SCHED_NAME = scheduler.name AND REQUESTS_RECOVERY = 0
  }
  scheduler.start()
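
The workaround above could be written with plain JDBC roughly as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, the caller supplies the `Connection` and the table prefix configured for the job store (empty in the tables shown in this issue), and `REQUESTS_RECOVERY = 0` is copied from the pseudocode above, so it may need adjusting for databases that store the flag as a boolean or character column.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class StaleFiredTriggerCleanup {

    /** Builds the DELETE statement; tablePrefix must come from trusted configuration. */
    static String deleteSql(String tablePrefix) {
        return "DELETE FROM " + tablePrefix + "FIRED_TRIGGERS"
             + " WHERE SCHED_NAME = ? AND REQUESTS_RECOVERY = 0";
    }

    /**
     * Removes leftover FIRED_TRIGGERS rows that do not request recovery.
     * Intended to run once, before scheduler.start(), on a non-clustered job store.
     * Returns the number of rows deleted.
     */
    static int deleteStaleFiredTriggers(Connection conn, String tablePrefix,
                                        String schedulerName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(deleteSql(tablePrefix))) {
            ps.setString(1, schedulerName);
            return ps.executeUpdate();
        }
    }
}
```

As in the pseudocode, the call would be guarded by `!scheduler.getMetaData().isJobStoreClustered()` and followed by `scheduler.start()`.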

See pull request #94

@ahlinist

ahlinist commented Dec 13, 2016

We seem to be suffering from the same issue (v. 2.2.1). Frequently fired jobs end up in the BLOCKED state after the app is restarted while the jobs are running, and a second restart "recovers" them.

ghost pushed a commit to EugeneGoroschenyaOld/quartz that referenced this issue Dec 13, 2016
…rtup) blocks simple trigger after failover situation for job which was executing during JVM crash
@achernyakevich-jc

+1 It is a critical issue that blocks our project.

@matveinazaruk

👍 for resolving this issue ASAP.

@apyrkh

apyrkh commented Dec 16, 2016

That bug is really annoying; please apply the PR as soon as you can.

@viktarovich

I have the same problem

@JealousyM

JealousyM commented Dec 16, 2016

Our project has the same problem.

@ghost

ghost commented Dec 16, 2016

+1

@ghost
Author

ghost commented Dec 16, 2016

+1 for issue resolution ASAP

@isnigurova

+1

@feliks-the-cat

+1
It would be nice to fix it as soon as possible

@ghost
Author

ghost commented Dec 16, 2016

+1

@MariaMarusevich

+1 It's a critical issue for our project.

@AndreiSawicki

+1

@ghost
Author

ghost commented Dec 16, 2016

+1

@AlexandrShestak

+1 Fix it please.

@dzhitomirsky

Critical for our project too, please fix ASAP.

@mpritchin

+1

@JochenFrankeOC

+1 (Causes reliability issues in a production environment.)

@apechenko-sc

@eugene-goroschenya thanks for raising the issue. I have the same problem in my project. Fixing it would be a true gift for Christmas.

@ghost
Author

ghost commented Dec 19, 2016

Pull request #94 has been submitted and is awaiting review, merge, and release.

@jhouserizer
Contributor

@eugene-goroschenya thank you, please see comments on PR.

@wushp

wushp commented Dec 30, 2016

+1

@zemian
Contributor

zemian commented Jan 16, 2017

I have applied the PR to both quartz-2.2.x and master now. Thanks to everyone for helping out here!

@zemian zemian closed this as completed Jan 16, 2017
@egoroschenya-sc

Hi @jhouserizer and @zemian

Thanks for reviewing and merging the PR so quickly.

Are there any plans to release a quartz-2.2.4 bugfix version?

We are looking forward to the official Quartz version in which this bug is fixed.

@JochenFrankeOC

Dear @jhouserizer and @zemian,
I kindly ask for feedback on the question from @egoroschenya-sc regarding an official quartz-2.2.4 bugfix version containing the fix. It has been a little more than one year since the quartz 2.2.3 release.

The issue causes problems with reliability in our software in production environments.
Please let us know, at least, whether and when a release can be expected.
We need to decide in the short term whether to wait for it or to deal with the issue on our own in a different way.

egoroschenya-sc pushed a commit to OpusCapita/quartz that referenced this issue Mar 10, 2017
…rtup) blocks simple trigger after failover situation for job which was executing during JVM crash

(cherry picked from commit 1afb695)
@zemian zemian added this to the 2.3.0 milestone Jul 1, 2017