Is there a way to add retry policy for transient failures in database operations? #733

footcha · 2019-02-08T16:23:06Z

We are running our clustered job scheduler in Azure. It is common for a database connection to be interrupted from time to time, or for a SQL command to fail because of a transient network failure.
Is there a way to configure a retry policy so that, for example, when a trigger state update fails, a retry is attempted immediately?

I am looking for a way to avoid restarting the scheduler.

Thank you!

Quartz.JobPersistenceException: Couldn't update states of blocked triggers: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.Data.SqlClient.SqlException: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception: The wait operation timed out
   --- End of inner exception stack trace ---
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, String endMethod, Boolean isInternal)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.StdAdoDelegate.<UpdateTriggerStatesForJobFromOtherState>d__70.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<TriggerFired>d__237.MoveNext()
   --- End of inner exception stack trace ---
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<TriggerFired>d__237.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<>c__DisplayClass236_0.<<TriggersFired>b__0>d.MoveNext() [See nested exception: System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, String endMethod, Boolean isInternal)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.StdAdoDelegate.<UpdateTriggerStatesForJobFromOtherState>d__70.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<TriggerFired>d__237.MoveNext()

jvilimek · 2019-02-11T11:39:43Z

Hey! That's also our problem. When using Azure SQL there might be transient errors where a simple retry would help...

puddlewitt · 2019-03-08T00:15:32Z

I have been seeing similar issues using Azure SQL.

I have taken a look at StdAdoDelegate.cs and wondered if it was worth adding something like the following...

StdAdoDelegate.cs

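// Default: no exception is treated as transient.
public virtual bool IsTransient(Exception ex)
{
    return false;
}

try
{
    // For all cmd.XXX calls
    await cmd.ExecuteNonQueryAsync();
}
catch (Exception ex) when (IsTransient(ex))
{
    // Retry up to X times. Non-transient exceptions propagate as before.
}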

SqlServerDelegate.cs

public override bool IsTransient(Exception ex)
{
    // Some logic here.
    return true;
}
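
For readers who want to see the shape of the "retry up to X" idea above, here is a minimal sketch of a per-operation retry helper. It is not part of Quartz.NET: `TransientRetry`, `ExecuteAsync`, and the `isTransient` predicate are hypothetical names, and the attempt count and backoff delay are arbitrary assumptions.

```csharp
using System;
using System.Threading.Tasks;

public static class TransientRetry
{
    // Hypothetical helper (not a Quartz.NET API): runs `operation` and retries
    // it only when `isTransient` classifies the thrown exception as transient.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? initialDelay = null)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        // Assumed default delay; tune for the actual workload.
        TimeSpan delay = initialDelay ?? TimeSpan.FromMilliseconds(200);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            // The filter lets non-transient errors, and the failure of the
            // last allowed attempt, propagate to the caller unchanged.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delay).ConfigureAwait(false);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // exponential backoff
            }
        }
    }
}
```

A call could look like `await TransientRetry.ExecuteAsync(() => cmd.ExecuteNonQueryAsync(), IsTransient)`. How a SQL Server specific predicate decides what is transient (for example by inspecting `SqlException.Number`) is left open here, as in the proposal.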

pierluca · 2019-06-19T07:59:48Z

Hello!
I was wondering if you had found a solution for this?
I also find myself having to handle this situation.

puddlewitt · 2019-07-23T10:19:53Z

> Hello!
> I was wondering if you had found a solution for this?
> I also find myself having to handle this situation.

I run it in cluster mode even though only a single node exists. For my particular situation the timing of the event doesn't have to be perfect, just as long as it runs in a sensible time period.

lahma · 2019-11-07T06:35:47Z

I've been circling around these reports about transient errors, and there are already retries in place based on this logic here. The retry should be done for operations that could cause the scheduler to go into a bad state; other operations can usually be safely retried later on.

puddlewitt · 2019-11-08T10:58:43Z

I don't think TriggerFired is covered by IsTransient, because it doesn't call RollbackConnection. The exception also doesn't appear to be caught by ReleaseAcquiredTrigger, because the exception type doesn't match.

lahma · 2019-11-09T07:10:55Z

@puddlewitt you are indeed correct that the logic won't take action here. I've opened an issue on the Java side to discuss what the correct action is, as there is a logic fault as far as I understand. The retry logic should come from the JobStoreSupport level, where the transaction is retried on error.
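
To illustrate the difference between retrying a single command and retrying the whole transaction, here is a rough sketch of the latter. It does not reproduce the actual JobStoreSupport code: `TransactionRetry`, `ExecuteInTransactionAsync`, the `connectionFactory` and `isTransient` parameters, and the delay values are all hypothetical.

```csharp
using System;
using System.Data.Common;
using System.Threading.Tasks;

public static class TransactionRetry
{
    // Hypothetical helper (not Quartz.NET's JobStoreSupport): runs `work` in a
    // fresh connection and transaction, and repeats the whole unit of work
    // when a transient error occurs.
    public static async Task<T> ExecuteInTransactionAsync<T>(
        Func<DbConnection> connectionFactory,
        Func<DbConnection, DbTransaction, Task<T>> work,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (DbConnection connection = connectionFactory())
                {
                    await connection.OpenAsync().ConfigureAwait(false);
                    using (DbTransaction tx = connection.BeginTransaction())
                    {
                        T result = await work(connection, tx).ConfigureAwait(false);
                        tx.Commit();
                        return result;
                    }
                }
            }
            // Leaving the using blocks disposes the uncommitted transaction,
            // which providers typically treat as a rollback, so the next
            // attempt starts from a clean connection.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Assumed linear backoff between attempts.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt)).ConfigureAwait(false);
            }
        }
    }
}
```

Because the whole unit of work is repeated, `work` has to be safe to run more than once. A transient failure during `Commit` can leave it unclear whether the transaction was applied.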

lahma mentioned this issue Nov 9, 2019

JobStoreSupport.triggersFired not using txValidator compensation logic quartz-scheduler/quartz#533

Closed

lahma mentioned this issue Nov 10, 2019

Scheduler hangs or stop processing jobs without reporting any exception #800

Closed
