Hangfire server crashed - and didn't restart #50

Closed
johnml1135 opened this issue Aug 7, 2023 · 15 comments

@johnml1135
Collaborator

Here is the error:

crit: Microsoft.Extensions.Hosting.Internal.Host[10]
      The HostOptions.BackgroundServiceExceptionBehavior is configured to StopHost. A BackgroundService has thrown an unhandled exception, and the IHost instance is stopping. To avoid this behavior, configure this to Ignore; however the BackgroundService will not be restarted.
      System.Net.Http.HttpRequestException: Resource temporarily unavailable (api.sil.hosted.allegro.ai:443)
       ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
         at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
         at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
         at SIL.Machine.AspNetCore.Services.ClearMLAuthenticationService.AuthorizeAsync(CancellationToken cancellationToken) in /app/src/SIL.Machine.AspNetCore/Services/ClearMLAuthenticationService.cs:line 81
         at SIL.Machine.AspNetCore.Services.ClearMLAuthenticationService.ExecuteAsync(CancellationToken stoppingToken) in /app/src/SIL.Machine.AspNetCore/Services/ClearMLAuthenticationService.cs:line 47
         at Microsoft.Extensions.Hosting.Internal.Host.TryExecuteBackgroundServiceAsync(BackgroundService backgroundService)

And then, there was a steady stream of failing health checks:

	
{
  "log": "\u001b[41m\u001b[30mfail\u001b[39m\u001b[22m\u001b[49m: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]\n      Health check Hangfire with status Unhealthy completed after 0.9272ms with message 'There are no Hangfire servers running.'\n",
  "stream": "stdout",
  "time": "2023-08-03T01:18:43.492804482Z",
  "type": "fail",
  "source": "Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService"
}

We should:

  1. Check the underlying issue
  2. Make sure that the service auto-restarts
  3. Set up an alert on the failing health check
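
As a starting point for the alert in item 3, here is a minimal sketch in Python (not part of this repository) of a filter that scans container log lines shaped like the JSON entry above and flags the Hangfire failure. The function name `is_hangfire_unhealthy` is hypothetical, and matching on the message text is an assumption; a real alert would more likely be configured in the logging or monitoring stack.

```python
import json

# Strings copied from the log entry above.
HANGFIRE_DOWN = "There are no Hangfire servers running."
HEALTH_SOURCE = "Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService"


def is_hangfire_unhealthy(line: str) -> bool:
    """Return True if a JSON log line reports the Hangfire health check as failing."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False  # not a JSON log line; ignore it
    if not isinstance(entry, dict):
        return False
    return (
        entry.get("source") == HEALTH_SOURCE
        and entry.get("type") == "fail"
        and HANGFIRE_DOWN in str(entry.get("log", ""))
    )
```

Counting matches over a time window, rather than alerting on a single line, would avoid paging on a one-off failed check.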
@johnml1135
Collaborator Author

Eli - look at the auto-restarting. I'll get the alerting to work.

@Enkidu93
Collaborator

I haven't found a way to auto-restart. One thing I could do is catch the exception to keep the Hangfire server from crashing in this particular situation (ClearML being down, apparently), but it wouldn't be a general solution. Thoughts? @johnml1135

@ddaspit
Contributor

ddaspit commented Aug 29, 2023

We should definitely handle the exception more gracefully. @johnml1135 Is there some way to restart the server in Kubernetes?

@johnml1135
Collaborator Author

johnml1135 commented Aug 30, 2023

@Enkidu93 - this may help for auto-restarting the service: https://docs.hangfire.io/en/latest/background-processing/dealing-with-exceptions.html

johnml1135 added a commit that referenced this issue Aug 30, 2023
@johnml1135
Collaborator Author

This is implementing the standard guidance.

@Enkidu93
Collaborator

Enkidu93 commented Aug 30, 2023

@johnml1135 Doesn't that just allow jobs to be retried? I thought the issue here was the Hangfire server itself crashing. The retry setting may still be helpful, and it might be worth configuring it further, say by specifying the DelaysInSeconds parameter, since a common cause of failures would be unavailable external services and retrying almost immediately wouldn't be reasonable. But I wonder whether it has really solved the issue. I know you've created a separate issue for handling the ClearML-related exception, but this seems like a more generic question: what do we do if the Hangfire job server crashes? Maybe I'm misunderstanding.

@johnml1135 johnml1135 reopened this Aug 30, 2023
@johnml1135
Collaborator Author

@ddaspit - I believe you are right - it is about retrying jobs, not restarting the server. Are there any dependencies between Hangfire and the rest of the code? Could it be a completely separate executable running either in the same container or in a different one?

johnml1135 added a commit that referenced this issue Aug 30, 2023
@johnml1135
Collaborator Author

The fix actually crashed the engine - I need to test these things out first. It is not needed. Let's instead go back to the original goal, which is to restart the server if it crashes.

@ddaspit
Contributor

ddaspit commented Aug 30, 2023

When the job server starts, it retrieves the access token for the ClearML API. If this throws, then it is a fatal error and the server crashes. We should restart the server if it crashes, but we can also better handle this particular exception. We could configure the HttpClient to retry using Polly.
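
The retry idea can be sketched independently of Polly. The following is a minimal Python illustration of retrying a startup call with exponential backoff. It is not the Polly API and not the project's code: `retry_with_backoff`, `max_attempts`, and `base_delay` are hypothetical names, and `ConnectionError` stands in for the `HttpRequestException` seen in the trace.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, waiting base_delay * 2**n between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Only transient connection errors are retried here; anything else still propagates immediately. Passing `sleep` as a parameter lets a test record the delays instead of actually waiting.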

@Enkidu93
Collaborator

What has yet to be done here, @johnml1135?

@johnml1135 johnml1135 assigned Enkidu93 and unassigned johnml1135 Sep 29, 2023
@johnml1135
Collaborator Author

So, you implemented Polly in #66, but I don't believe that restarting the Hangfire server has been resolved. Could you investigate further and see if there is a good resolution?

@Enkidu93
Collaborator

@johnml1135 Is it as simple as setting the Docker containers to restart on failure (which is very easy to do)? Or do you see a reason that won't work?

@johnml1135
Collaborator Author

johnml1135 commented Sep 29, 2023 via email

@Enkidu93
Collaborator

Enkidu93 commented Oct 4, 2023

@johnml1135 The decision here was to ignore this until it pops up again, correct? Just wanted to document that if so.

@johnml1135
Collaborator Author

Yes.
