Hangfire server crashed - and didn't restart #50
Comments
Eli - look at the auto-restarting. I'll get the alerting to work. |
I haven't found a way to auto-restart. One thing I could do is catch the exception to keep the Hangfire server from crashing in this particular situation (ClearML being down, apparently), but it wouldn't be a general solution. Thoughts? @johnml1135 |
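A minimal sketch of that non-general workaround, i.e., guarding the ClearML token fetch so the job server stays up when ClearML is down. The class name, endpoint, and token handling here are illustrative assumptions (and `ReadAsStringAsync(CancellationToken)` assumes .NET 5+), not the project's actual code:

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ClearMLAuthService
{
    private readonly HttpClient _http;
    private string? _token;

    public ClearMLAuthService(HttpClient http) => _http = http;

    // Returns null instead of throwing when ClearML is down, so the caller
    // (and the Hangfire server hosting it) can stay alive and try again later.
    public async Task<string?> TryGetTokenAsync(CancellationToken cancellationToken)
    {
        try
        {
            using HttpResponseMessage response =
                await _http.GetAsync("auth.login", cancellationToken);
            response.EnsureSuccessStatusCode();
            _token = await response.Content.ReadAsStringAsync(cancellationToken);
        }
        catch (HttpRequestException)
        {
            _token = null; // ClearML unreachable: degrade gracefully, don't crash
        }
        return _token;
    }
}
```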
We should definitely handle the exception more gracefully. @johnml1135 Is there some way to restart the server in Kubernetes? |
@Enkidu93 - this may help for auto-restarting the service: https://docs.hangfire.io/en/latest/background-processing/dealing-with-exceptions.html |
This is implementing the standard guidance. |
@johnml1135 Doesn't that just allow jobs to be retried, whereas this is a question of the Hangfire server crashing? That may be helpful (and it might be worth configuring further, say by specifying the DelaysInSeconds parameter, since a common cause of this would be unavailable external services, and it wouldn't be reasonable to retry so quickly), but I wonder if it has really solved the issue. I know you've created a separate issue for handling the ClearML-related exception, but this seems like a more generic issue: what do we do if the Hangfire job server crashes? Maybe I'm misunderstanding. |
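For reference, the standard guidance from the linked Hangfire docs looks roughly like this; the job class and the specific attempt/delay values are illustrative, and `DelaysInSeconds` requires Hangfire 1.7+:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Hangfire;

public class TrainingJob
{
    // Retry up to 5 times with growing delays instead of retrying immediately,
    // which matters when the root cause is an unavailable external service.
    [AutomaticRetry(Attempts = 5, DelaysInSeconds = new[] { 60, 300, 900, 1800, 3600 })]
    public async Task RunAsync(string engineId, CancellationToken cancellationToken)
    {
        // ... call ClearML, run the build, etc.
        await Task.CompletedTask;
    }
}
```

As the comment above notes, this only governs retries of individual jobs; it does not bring the job server itself back after a crash.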
@ddaspit - I believe you are right - it is about retrying jobs, not restarting the server. Are there any dependencies between Hangfire and the rest of the code? Could it be a completely separate executable running either on the same container or a different one? |
The fix actually crashed the engine - I need to test these things out first. It is not needed; let's instead try the original idea, which is to restart the server if it crashes. |
When the job server starts, it retrieves the access token for the ClearML API. If this throws, then it is a fatal error and the server crashes. We should restart the server if it crashes, but we can also better handle this particular exception. We could configure the HttpClient to retry using Polly. |
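A sketch of that Polly idea using the Microsoft.Extensions.Http.Polly integration; the named client, retry count, and backoff values are assumptions for illustration:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

public static class ClearMLServiceCollectionExtensions
{
    public static IServiceCollection AddClearMLHttpClient(this IServiceCollection services)
    {
        // Retry transient failures (5xx, 408, HttpRequestException) with
        // exponential backoff so a brief ClearML outage during the startup
        // token fetch is not fatal to the job server.
        services.AddHttpClient("ClearML")
            .AddPolicyHandler(HttpPolicyExtensions
                .HandleTransientHttpError()
                .WaitAndRetryAsync(7, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));
        return services;
    }
}
```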
What has yet to be done here, @johnml1135 ? |
So, you implemented Polly in #66, but I don't believe that the Hangfire server restarting has been resolved. Could you investigate further and see if there is a good resolution? |
@johnml1135 Is it as simple as setting the docker containers to restart on failure (which is very easy to do)? Or do you see a reason that won't work? |
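At the Docker level that would just be a restart policy; this compose snippet is a generic illustration (the service and image names are made up):

```yaml
services:
  machine-engine:
    image: machine-engine:latest
    # Restart the container if the process exits with a failure; Kubernetes
    # pods get similar behavior from the default restartPolicy: Always.
    restart: on-failure
```

The catch, as the next comment points out, is that this only fires if the whole process exits; an in-process Hangfire server can die while the host process keeps running.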
The Hangfire server is running as a subprocess of the Machine server. When the Hangfire server crashed, the Machine server kept running, and nothing proactive happened. My best guess would be to have the main .NET process ping the Hangfire server and, if it is down, restart it. |
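One possible shape of that supervisor, hosted in the main .NET process. This is a sketch under the assumption that the failure mode is the server faulting at or shortly after startup (as in the ClearML token case); detecting a silently wedged server would additionally need a real health probe:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Hangfire;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class HangfireSupervisor : BackgroundService
{
    private readonly ILogger<HangfireSupervisor> _logger;

    public HangfireSupervisor(ILogger<HangfireSupervisor> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            BackgroundJobServer? server = null;
            try
            {
                // Creating the server (which uses JobStorage.Current) is where
                // the fatal ClearML token error occurred.
                server = new BackgroundJobServer();
                await Task.Delay(Timeout.InfiniteTimeSpan, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                break; // normal host shutdown
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Hangfire server failed; restarting in 30 seconds");
                try { await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken); }
                catch (OperationCanceledException) { break; }
            }
            finally
            {
                server?.Dispose();
            }
        }
    }
}
```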
@johnml1135 The decision here was to ignore this until it pops up again, correct? Just wanted to document that if so. |
Yes. |
Here is the error: [error log screenshot not captured in this export]
And then, there was a steady stream of failing health checks: [health check log not captured]
We should: [proposed action list not captured]