Is there a way to add retry policy for transient failures in database operations? #733

Open
footcha opened this issue Feb 8, 2019 · 7 comments

@footcha
Contributor

footcha commented Feb 8, 2019

We are running our clustered job scheduler in Azure. From time to time a database connection is interrupted, or a SQL command fails because of a transient network failure.
Is there a way to configure a retry policy so that, for example, when a trigger state update fails, a retry is applied immediately?

I am looking for a way to avoid restarting the scheduler.

Thank you!

Quartz.JobPersistenceException: Couldn't update states of blocked triggers: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.Data.SqlClient.SqlException: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception: The wait operation timed out
   --- End of inner exception stack trace ---
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, String endMethod, Boolean isInternal)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.StdAdoDelegate.<UpdateTriggerStatesForJobFromOtherState>d__70.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<TriggerFired>d__237.MoveNext()
   --- End of inner exception stack trace ---
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<TriggerFired>d__237.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<>c__DisplayClass236_0.<<TriggersFired>b__0>d.MoveNext() [See nested exception: System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, String endMethod, Boolean isInternal)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.StdAdoDelegate.<UpdateTriggerStatesForJobFromOtherState>d__70.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<TriggerFired>d__237.MoveNext()
@jvilimek

Hey! That's also our problem. When using Azure SQL there might be transient errors where a simple retry would help...

@puddlewitt
Contributor

I have been seeing similar issues using Azure SQL.

I have taken a look at StdAdoDelegate.cs and wondered if it was worth adding something like the following...

StdAdoDelegate.cs

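// Default: no exception is treated as transient.
public virtual bool IsTransient(Exception ex)
{
    return false;
}

// For all cmd.XXX calls
try
{
    await cmd.ExecuteNonQueryAsync();
}
// The filter lets non-transient exceptions propagate instead of being swallowed.
catch (Exception ex) when (IsTransient(ex))
{
    // Retry up to X times.
}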

SqlServerDelegate.cs

public override bool IsTransient(Exception ex)
{
    // Some logic here.
    return true;
}
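
To make the proposal above concrete, here is a minimal, self-contained sketch of a retry loop driven by a transient-error predicate. It is not Quartz.NET code: `TransientRetry`, `ExecuteAsync`, `maxAttempts` and `initialDelay` are hypothetical names chosen for illustration, and the backoff values are arbitrary assumptions.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of Quartz.NET: retries an async operation
// while a caller-supplied predicate classifies the failure as transient.
public static class TransientRetry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? initialDelay = null)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        // Assumed starting delay; doubled after every failed attempt.
        TimeSpan delay = initialDelay ?? TimeSpan.FromMilliseconds(200);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            // Non-transient errors, and the last failed attempt, propagate unchanged.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delay).ConfigureAwait(false);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

A call site could then look like `await TransientRetry.ExecuteAsync(() => cmd.ExecuteNonQueryAsync(), IsTransient);`, with `IsTransient` overridden per database delegate as suggested above. Retrying a single command like this is only safe when the command can be repeated without side effects; for statements that run inside a transaction, the whole transaction would probably need to be retried instead.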

@pierluca

pierluca commented Jun 19, 2019

Hello!
I was wondering if you had found a solution for this?
I also find myself having to handle this situation.

@puddlewitt
Contributor

> Hello!
> I was wondering if you had found a solution for this?
> I also find myself having to handle this situation.

I run it in cluster mode even though only a single node exists. For my particular situation the timing of the event doesn't have to be perfect, just as long as it runs in a sensible time period.

@lahma
Member

lahma commented Nov 7, 2019

I've been circling around these reports about transient errors, and there are already retries in place based on this logic here. The retry should be done for operations which could cause the scheduler to go into a bad state; other operations can usually be retried safely later on.

@puddlewitt
Contributor

I don't think TriggerFired is covered by IsTransient, because it doesn't call RollbackConnection. The exception also doesn't appear to be caught by ReleaseAcquiredTrigger, because the exception type doesn't match.

@lahma
Member

lahma commented Nov 9, 2019

@puddlewitt you are indeed correct that the logic won't take action here. I've opened an issue on the Java side to discuss what the correct action is, as there is a logic fault as far as I understand. The retry logic should come from the JobStoreSupport level, where the transaction is retried on error.
