Hangfire server crashed - and didn't restart #50
Comments
Eli - look at the auto-restarting. I'll get the alerting to work. |
I haven't found a way to auto-restart. One thing I could do is catch the exception to keep the Hangfire server from crashing in this particular situation (ClearML being down, apparently), but it wouldn't be a general solution. Thoughts? @johnml1135 |
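A minimal sketch of that non-general workaround, i.e., guarding the ClearML token fetch so the job server stays up when ClearML is down. The class name, endpoint, and token handling here are illustrative assumptions (and `ReadAsStringAsync(CancellationToken)` assumes .NET 5+), not the project's actual code:

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ClearMLAuthService
{
    private readonly HttpClient _http;
    private string? _token;

    public ClearMLAuthService(HttpClient http) => _http = http;

    // Returns null instead of throwing when ClearML is down, so the caller
    // (and the Hangfire server hosting it) can stay alive and try again later.
    public async Task<string?> TryGetTokenAsync(CancellationToken cancellationToken)
    {
        try
        {
            using HttpResponseMessage response =
                await _http.GetAsync("auth.login", cancellationToken);
            response.EnsureSuccessStatusCode();
            _token = await response.Content.ReadAsStringAsync(cancellationToken);
        }
        catch (HttpRequestException)
        {
            _token = null; // ClearML unreachable: degrade gracefully, don't crash
        }
        return _token;
    }
}
```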
We should definitely handle the exception more gracefully. @johnml1135 Is there some way to restart the server in Kubernetes? |
@Enkidu93 - this may help for auto-restarting the service: https://docs.hangfire.io/en/latest/background-processing/dealing-with-exceptions.html |
This is implementing the standard guidance. |
@johnml1135 Doesn't that just allow jobs to be retried, whereas this is a question of the Hangfire server crashing? That may be helpful (and it might be worth configuring further, say by specifying the DelaysInSeconds parameter, since a common cause of this would be unavailable external services, and it wouldn't be reasonable to retry so quickly), but I wonder if it has really solved the issue. I know you've created a separate issue for handling the ClearML-related exception, but this seems like a more generic issue: what do we do if the Hangfire job server crashes? Maybe I'm misunderstanding. |
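For reference, the standard guidance from the linked Hangfire docs looks roughly like this; the job class and the specific attempt/delay values are illustrative, and `DelaysInSeconds` requires Hangfire 1.7+:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Hangfire;

public class TrainingJob
{
    // Retry up to 5 times with growing delays instead of retrying immediately,
    // which matters when the root cause is an unavailable external service.
    [AutomaticRetry(Attempts = 5, DelaysInSeconds = new[] { 60, 300, 900, 1800, 3600 })]
    public async Task RunAsync(string engineId, CancellationToken cancellationToken)
    {
        // ... call ClearML, run the build, etc.
        await Task.CompletedTask;
    }
}
```

As the comment above notes, this only governs retries of individual jobs; it does not bring the job server itself back after a crash.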
@ddaspit - I believe you are right - it is about retrying jobs, not restarting the server. Are there any dependencies between Hangfire and the rest of the code? Could it be a completely separate executable running either on the same container or a different one? |
The fix actually crashed the engine - I need to test these things out first. It is not needed; let's instead try the original idea, which is to restart the server if it crashes. |
When the job server starts, it retrieves the access token for the ClearML API. If this throws, then it is a fatal error and the server crashes. We should restart the server if it crashes, but we can also better handle this particular exception. We could configure the HttpClient to retry using Polly. |
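A sketch of that Polly idea using the Microsoft.Extensions.Http.Polly integration; the named client, retry count, and backoff values are assumptions for illustration:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

public static class ClearMLServiceCollectionExtensions
{
    public static IServiceCollection AddClearMLHttpClient(this IServiceCollection services)
    {
        // Retry transient failures (5xx, 408, HttpRequestException) with
        // exponential backoff so a brief ClearML outage during the startup
        // token fetch is not fatal to the job server.
        services.AddHttpClient("ClearML")
            .AddPolicyHandler(HttpPolicyExtensions
                .HandleTransientHttpError()
                .WaitAndRetryAsync(7, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));
        return services;
    }
}
```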
What has yet to be done here, @johnml1135 ? |
So, you implemented Polly in #66, but I don't believe that the Hangfire server restarting has been resolved. Could you investigate further and see if there is a good resolution? |
@johnml1135 Is it as simple as setting the docker containers to restart on failure (which is very easy to do)? Or do you see a reason that won't work? |
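At the Docker level that would just be a restart policy; this compose snippet is a generic illustration (the service and image names are made up):

```yaml
services:
  machine-engine:
    image: machine-engine:latest
    # Restart the container if the process exits with a failure; Kubernetes
    # pods get similar behavior from the default restartPolicy: Always.
    restart: on-failure
```

The catch, as the next comment points out, is that this only fires if the whole process exits; an in-process Hangfire server can die while the host process keeps running.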
The Hangfire server is running as a subprocess of the Machine server. When the Hangfire server crashed, the Machine server kept running, and nothing proactive happened. My best guess would be to have the main .NET process ping the Hangfire server and, if it is down, restart it. |
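One possible shape of that supervisor, hosted in the main .NET process. This is a sketch under the assumption that the failure mode is the server faulting at or shortly after startup (as in the ClearML token case); detecting a silently wedged server would additionally need a real health probe:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Hangfire;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class HangfireSupervisor : BackgroundService
{
    private readonly ILogger<HangfireSupervisor> _logger;

    public HangfireSupervisor(ILogger<HangfireSupervisor> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            BackgroundJobServer? server = null;
            try
            {
                // Creating the server (which uses JobStorage.Current) is where
                // the fatal ClearML token error occurred.
                server = new BackgroundJobServer();
                await Task.Delay(Timeout.InfiniteTimeSpan, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                break; // normal host shutdown
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Hangfire server failed; restarting in 30 seconds");
                try { await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken); }
                catch (OperationCanceledException) { break; }
            }
            finally
            {
                server?.Dispose();
            }
        }
    }
}
```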
@johnml1135 The decision here was to ignore this until it pops up again, correct? Just wanted to document that if so. |
Yes. |
Here is the error: [error log screenshot not captured in this export]
And then, there was a steady stream of failing health checks: [health check log not captured]
We should: [proposed action list not captured]