Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash gracefully #12395

Merged
merged 6 commits into from Mar 15, 2024
Merged

Crash gracefully #12395

merged 6 commits into from Mar 15, 2024

Conversation

turbolay
Copy link
Collaborator

@turbolay turbolay commented Feb 8, 2024

Thanks @molnard for part of the code and @adamPetho for the help

This PR provides a way to crash the software when we reach an unrecoverable exception and display the crash reporter. The PR also uses this new function when the WalletFilterProcessor cannot work anymore

On Fluent.Desktop, the exception is logged twice when WalletFilterProcessor crashes, but I think it's ok because on the Daemon it will only be logged once.

adamPetho
adamPetho previously approved these changes Feb 8, 2024
Copy link
Collaborator

@adamPetho adamPetho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK, the crash reporter is invoked and Wasabi crashes gracefully.
(Crash reporter is broken atm, this PR didn't break it, nor its responsibility to fix it)

@turbolay
Copy link
Collaborator Author

turbolay commented Feb 8, 2024

I tested with #12392 and can confirm that it works as expected
EDIT: PR Is rebased, so you can simply test this by adding a line throw new Exception("test"); in WalletFilterProcessor.ExecuteAsync

Copy link
Collaborator

@yahiheb yahiheb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this, the crash reporter didn't show up.

2024-02-09 01:17:38.102 [20] INFO       HybridFeeProvider.OnAllFeeEstimateArrived (132) Accurate fee rates are acquired from WasabiSynchronizer ranging from target 2 blocks at 1 sat/vByte to target 2 blocks at 1 sat/vByte.
2024-02-09 01:17:38.107 [20] ERROR      WalletFilterProcessor.ExecuteAsync (189)        System.Exception: test
   at WalletWasabi.Wallets.WalletFilterProcessor.ExecuteAsync(CancellationToken cancellationToken) in WalletWasabi\Wallets\WalletFilterProcessor.cs:line 108
2024-02-09 01:17:38.108 [20] INFO       Wallet.Dispose (302)    StartAsync finished in 4 milliseconds.
2024-02-09 01:17:38.111 [20] ERROR      WalletLoadWorkflow.LoadWalletAsync (105)        System.Exception: test
   at WalletWasabi.Wallets.WalletFilterProcessor.ExecuteAsync(CancellationToken cancellationToken) in WalletWasabi\Wallets\WalletFilterProcessor.cs:line 108
   at WalletWasabi.Wallets.WalletFilterProcessor.StartAsync(CancellationToken cancellationToken) in WalletWasabi\Wallets\WalletFilterProcessor.cs:line 288
   at WalletWasabi.Wallets.Wallet.StartAsync(CancellationToken cancel) in WalletWasabi\Wallets\Wallet.cs:line 306
   at WalletWasabi.Wallets.WalletManager.StartWalletAsync(Wallet wallet) in WalletWasabi\Wallets\WalletManager.cs:line 243
   at WalletWasabi.Wallets.WalletManager.StartWalletAsync(Wallet wallet) in WalletWasabi\Wallets\WalletManager.cs:line 251
   at WalletWasabi.Fluent.Models.Wallets.WalletLoadWorkflow.<LoadWalletAsync>b__21_0() in WalletWasabi.Fluent\Models\Wallets\WalletLoadWorkflow.cs:line 97
   at WalletWasabi.Fluent.Models.Wallets.WalletLoadWorkflow.LoadWalletAsync(Boolean isBackendAvailable) in WalletWasabi.Fluent\Models\Wallets\WalletLoadWorkflow.cs:line 97
2024-02-09 01:17:38.194 [22] INFO       WasabiAppBuilder.TerminateApplicationAsync (139)        Wasabi GUI stopped gracefully (7f49a81b-4684-4aed-8eb2-ca0c9a2de20f).
2024-02-09 01:17:38.196 [22] WARNING    Global.DisposeAsync (402)       Process is exiting.
.
.
.
2024-02-09 01:17:38.305 [19] INFO       Global.DisposeAsync (523)       AllTransactionStore is disposed.
2024-02-09 01:17:38.305 [1] INFO        TerminateService.Terminate (184)        Wasabi stopped gracefully (7f49a81b-4684-4aed-8eb2-ca0c9a2de20f).
2024-02-09 01:17:38.322 [1] INFO        WasabiAppBuilder.BeforeStopping (105)   Wasabi GUI stopped gracefully (7f49a81b-4684-4aed-8eb2-ca0c9a2de20f).
2024-02-09 01:17:38.340 [1] CRITICAL    Program.Main (80)       System.Exception: test
   at WalletWasabi.Fluent.Desktop.Program.Main(String[] args) in WalletWasabi.Fluent.Desktop\Program.cs:line 67

@turbolay
Copy link
Collaborator Author

turbolay commented Feb 9, 2024

From what @yahiheb sent, the PR is working as expected, but it seems that the CrashReporter still doesn't work on master for windows?? It works on Mac now. Can you test again @adamPetho?

@adamPetho
Copy link
Collaborator

Can you test again?

Yep, still no Crash Reporter on Win11. Only a hanging WW process, which does nothing.

@turbolay turbolay marked this pull request as draft February 12, 2024 05:30
Copy link
Collaborator

@molnard molnard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a look after the release.

@molnard molnard self-requested a review February 12, 2024 08:52
@kiminuo
Copy link
Collaborator

kiminuo commented Feb 15, 2024

We have talked about this PR with Clement and this approach is really not good. Just to shortly say why:

  • Adding static is a red flag.
  • Child components depending on parent components is another red flag (here it's WalletFilterProcessor depending on TerminateService)

A better approach is IMO that one detects that WalletFilterProcessor's execution task failed and then this signal should be propagated and should lead to crash with crash reporter instance being shown.

It's IMO better to better to crash immeditely in case some unhandled exception was thrown. Otherwise, there is a risk that the graceful shutdown makes things worse.

This concept of checking of services for exceptions can be then used for all services without modifying the services themselves. That would be a win as well.

@molnard
Copy link
Collaborator

molnard commented Feb 15, 2024

It's IMO better to better to crash immeditely in case some unhandled exception was thrown. Otherwise, there is a risk that the graceful shutdown makes things worse.

Graceful shutdown means to me that I ask the components to stop. Crash immediately means to just stop the execution of the Threads / Tasks wherever they are. Is this what you mean by that?
If disposals are not called and loops are not canceled it will lead to an unknown state of the underlying resources - I am sure we cannot cover all the cases to test if it is safe and won't cause issues at the next start of the application.

First, we need to decide on this!

Adding static is a red flag.

We have multiple options, see this and below. Add the static and make it work - is something preventing us from that? I don't see anything that would be broken by that, moreover, it could fix what is broken now. In general, it is a red flag but red flags do not mean it is an issue by default, just something we need to look very closely and consider wisely. One more perspective on this, is TerminateService a program-wise service that everyone can use? Like Logging or HostedServices. If it is like that we might consider having a similar structure here.

This concept of checking of services for exceptions can be then used for all services without modifying the services themselves.

I can imagine a solution with an interface that is added to all classes that can cause crashes and gather them in TerminateService. Whenever there is a signal it will start the crash.

@kiminuo
Copy link
Collaborator

kiminuo commented Feb 16, 2024

It's IMO better to better to crash immeditely in case some unhandled exception was thrown. Otherwise, there is a risk that the graceful shutdown makes things worse.

Graceful shutdown means to me that I ask the components to stop.

Graceful shutdown is typically: A program is running fine and at some point user decides to turn it off and at that point everything is stopped and disposed. The important part is that the program is running "fine" (no service is terribly broken)

Crash immediately means to just stop the execution of the Threads / Tasks wherever they are. Is this what you mean by that?

By crash I mean that the program should stop all its threads immediately and no cleanup should be done whatsoever.

So I'm saying "if WalletFilterProcessor (WFP) throws an exception that is unhandled or effectively unhandled (e.g. just logging is not enough to handle an exception) then we should crash and show the crash reporter, not ask for a graceful shutdown"1. The reason why I think it makes good sense is: If there is an unhandled exception in the WFP (or any other background service really, WFP is a placeholder here), then there is a risk that during that graceful shutdown you make things worse (e.g. store corrupted data, or throw another exception at a different place that would obfuscate the original error, or some other service stops working because the other is dead, etc. etc).2

However, to clarify it, it's important to say when one should not crash. And that is whenever we can fully recover from an exceptional state and we actually implemented that handling (like if there are connectivity issues we can recover, we simply try again, etc. So this is covered and hence no crash is needed.)

Why should we care? Because if the app crashes, then we can fix it. If we don't crash and we just log the error, then the likelyhood of the issue being fixed is small (the so called it somehow works so why bother principle)3.

Footnotes

  1. An analogy here is "if sudden big fire erupts in a house, the owner leaves immediately and calls firefighters. She skips the normal house-leaving behavior: 1) turn off lights, 2) turn off TV 3) lock the door, etc."

  2. Note that if a (background) service throws an unhandled exception and stops working it means that we do not have robust enough implementation for that service and so we don't really know how bad the situation is. It might be actually OK or it might be catastrophic (an invalid program state). So it makes more sense to kill the program than to pretend that everything is fine.

  3. There is also a precedent, when we worked on CoinsRegistry it had multiple quite severe bugs (for example, undo function did not work properly). It become obvious only after we added more strict checks and began to throw exceptions that these bugs could be fixed.

@molnard
Copy link
Collaborator

molnard commented Feb 16, 2024

  • Agreement archived: we need to stop/crash the program for sure. The issue here is unrecoverable. Simply just throwing an exception is not enough because it won't stop the program or visible, it is a background task.

  • If there is an unhandled exception in the WFP (or any other background service really, WFP is a placeholder here), then there is a risk that during that graceful shutdown you make things worse (e.g. store corrupted data

    I think the risk of corrupting data is much higher if we just kill the application. For me, this seems obvious. In this specific case I fail to see how it can go bad if we do graceful, but I see numerous issues if we just crash. You mentioned this multiple times "things can get worse", can you provide me a real example what can break in this specific case?

  • We made numerous efforts to get rid of individual exit points in the application. It was a good move and an improvement - we have to keep it this way while figuring out the solution.

@kiminuo
Copy link
Collaborator

kiminuo commented Feb 16, 2024

  • Agreement archived: we need to stop/crash the program for sure. The issue here is unrecoverable. Simply just throwing an exception is not enough because it won't stop the program or visible, it is a background task.

Yes, the exception needs to lead to that action in general. I proposed "propagation of the exception", you did that stuff with the TerminateService.

  • I think the risk of corrupting data is much higher if we just kill the application. For me, this seems obvious. In this specific case I fail to see how it can go bad if we do graceful, but I see numerous issues if we just crash.

Could you describe them?

You mentioned this multiple times "things can get worse", can you provide me a real example what can break in this specific case?

So my concern are scenarios like:

  1. A WW instance runs
  2. A component breaks (e.g. we crash on a blockchain reorg) -> we gracefully shut down and we store some state that is neither "before reorg" or "after reorg"
  3. User starts another instance of WW and her balance is invalid because we won't re-process some blocks.

Something like this.

Anyway, the question is a philosofical one, not that much practical1. I'm saying "if a part of application crashes, attempt to terminate it immediately, not to make things worse" (a pessimistic approach). You are saying "if a part of application crashes, then it's still safe to terminate gracefully" (that's an optimistic approach). Note that there are still scenarios when you don't have it under your control -- when a machine battery dies.

I don't believe that there is an approach that would work always. It feels like all approaches are compomises.

edit: This is what I basically talk about https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html#guidelines-for-error-handling.

Footnotes

  1. Whatever example I come up with, you can say "it will be ok" (and be right) but then the code changes and nobody is checking how services can crash, what the relationships are, etc. So it means that if you are right now, you might not be right tomorrow.

@molnard
Copy link
Collaborator

molnard commented Feb 19, 2024

Could you describe them?

I assumed we use Environment.Exit but the consequences are anyway even if we use something else.

  • Immediate Termination: cleanup code (like finally blocks or destructors) may not necessarily run, depending on where and how Environment.Exit is called. This can lead to resources not being released properly or other cleanup tasks not being completed.

  • Skipping Finalization: Some finalizers (destructors) may not run if "crash" is used. The runtime tries to run finalizers if Environment.Exit is called, but if the exit is immediate and the finalizers take too long, they may not complete.

  • Thread Termination: Environment.Exit will terminate all threads in the application. If there are background operations or other threads running that are doing important work or require proper shutdown to maintain data integrity, using Environment.Exit can disrupt these processes.

  • Application State: If the application maintains state or has temporary data that needs to be saved or persisted, calling Environment.Exit without ensuring this data is saved can result in data loss.

  • External Resources and Commitments: If your application is holding external resources (like database connections, file handles, network connections, etc.), using Environment.Exit might not allow for these resources to be released or committed properly. This could lead to data corruption, lost transactions, or other issues with resources that were being managed by the application.

  • Use in Libraries: If you're developing a library that is used by other applications, calling Environment.Exit within a library is highly discouraged. It can unexpectedly terminate the host application, leading to a poor user experience and making it difficult for applications using your library to manage their lifecycle properly.

@molnard
Copy link
Collaborator

molnard commented Feb 19, 2024

  • Agreement archived: we need to stop/crash the program for sure. The issue here is unrecoverable. Simply just throwing an exception is not enough because it won't stop the program or visible, it is a background task.

Yes, the exception needs to lead to that action in general. I proposed "propagation of the exception", you did that stuff with the TerminateService.

  • I think the risk of corrupting data is much higher if we just kill the application. For me, this seems obvious. In this specific case I fail to see how it can go bad if we do graceful, but I see numerous issues if we just crash.

Could you describe them?

You mentioned this multiple times "things can get worse", can you provide me a real example what can break in this specific case?

So my concern are scenarios like:

  1. A WW instance runs
  2. A component breaks (e.g. we crash on a blockchain reorg) -> we gracefully shut down and we store some state that is neither "before reorg" or "after reorg"
  3. User starts another instance of WW and her balance is invalid because we won't re-process some blocks.

Something like this.

Anyway, the question is a philosofical one, not that much practical1. I'm saying "if a part of application crashes, attempt to terminate it immediately, not to make things worse" (a pessimistic approach). You are saying "if a part of application crashes, then it's still safe to terminate gracefully" (that's an optimistic approach). Note that there are still scenarios when you don't have it under your control -- when a machine battery dies.

I don't believe that there is an approach that would work always. It feels like all approaches are compomises.

edit: This is what I basically talk about https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html#guidelines-for-error-handling.

Footnotes

  1. Whatever example I come up with, you can say "it will be ok" (and be right) but then the code changes and nobody is checking how services can crash, what the relationships are, etc. So it means that if you are right now, you might not be right tomorrow.

Yes I agree, the cases have to be evaluated one by one. However, we could set up a rule of what is the default crash and what can be used only if there is a very strong reason - to help future decisions and minimize confusion.

@kiminuo
Copy link
Collaborator

kiminuo commented Feb 20, 2024

I'll try to summarize my thoughts a bit more:

I would add that my UnreachableExceptions are not really supposed to reach production. So the idea was that one throws UEs when the application hits a code path that is totally wrong and it should never happen1. So if the application then crashes, it's OK because it will force people to fix it. This is true for my approach and your approach. So that's good. Once the application is sufficiently tested and no UE is thrown, then the software can be released (nothing new, we test before each release; however, any instance of crash reporter being invoked is much harder to ignore than "it feels like something wrong happened but I'm not really sure")

Anyway, then there are two approaches what to do when such a big bug happens:

  • Graceful shutdown
    • Pro: state is not lost (IMO it's questionable pro though)
    • Con: By definition "if something weird happened and storing the application state after something weird happened can result in invalid state being stored"
  • Crash immediately
    • Pro: Some state can be lost (possibly some block will need to be scan again, etc.)
      • Personally, I don't see this as a smaller issue than storing something invalid later on (my reog example above)
    • Con: If a file is being written to, then there is a risk of partial write which would corrupt the file
      • Note that this is not something new though. If a machine is out of battery, then the very same thing can happen. So the underlying problem is there. And the problem would be fixed by using an SQLite database that guarantees that a transaction either happens or not.

Now regarding #12395 (comment), I'll try to address it one by one:

  • Use in Libraries: If you're developing a library that is used by other applications, calling Environment.Exit within a library is highly discouraged. It can unexpectedly terminate the host application, leading to a poor user experience and making it difficult for applications using your library to manage their lifecycle properly.

We would not put Environment.Exit in components. The idea was to throw UnreachableException and then on Global level, we would register what happens with background services when they fail with such exception. So the classes themselves would throw UEs and anyone who uses WW as a library would observe an UE being thrown, certainly not Environment.Exit.

  • Skipping Finalization: Some finalizers (destructors) may not run if "crash" is used. The runtime tries to run finalizers if Environment.Exit is called, but if the exit is immediate and the finalizers take too long, they may not complete.

I don't think we use finalizers really. But even if we do, then you are not guaranteed that they are called

  • Application State: If the application maintains state or has temporary data that needs to be saved or persisted, calling Environment.Exit without ensuring this data is saved can result in data loss.

This is true. However, we try to establish here if risk of losing some data is worse than risk of storing invalid data. This is not a trivial question. Your approach can be better, mine can be better, or it depends on a situation. So the question should be narrowed down to "what is more likely be better", I guess.

  • External Resources and Commitments: If your application is holding external resources (like database connections, file handles, network connections, etc.), using Environment.Exit might not allow for these resources to be released or committed properly. This could lead to data corruption, lost transactions, or other issues with resources that were being managed by the application.

So once the application terminates, file handles, network connections and other resources are released.

If an instance of bitcoind is started and a crash occurs, then it might not be stopped (btw: this is what happens in integration tests all the time).

Tor is either left to be running (by design), or there is an option to stop the process once WW terminates (AFAIK).

The next step: We can pick either my approach or your approach and just implement it and we can later on re-evaluate if the decision was good or not (in my mind, that should be a minimal change to change "graceful" -> "crash" or vice versa)


We should not really use files for storage purposes. History has IMO shown good enough that it has limitations. A database is more robust and allows to do more with the data.

Footnotes

  1. Meaning, whoever implemented that certainly did not put effort to mitigate that state and it might not be mitigable at all. In short, there is a big bug that must be fixed.

@lontivero
Copy link
Collaborator

The rule for unmanaged exceptions is to crash. Also, the reference to TerminateService looks ni good.

@molnard
Copy link
Collaborator

molnard commented Feb 20, 2024

This is true. However, we try to establish here if risk of losing some data is worse than risk of storing invalid data. This is not a trivial question. Your approach can be better, mine can be better, or it depends on a situation. So the question should be narrowed down to "what is more likely be better", I guess.

Do it as we do now and did in the past years. Signal for termination.

We can only crash if all the issues mentioned above like File writings are made crash-safe - not the way around. But I guess that's not worth it. I suggest starting with graceful and investigating the reorg stuff as well.

# Conflicts:
#	WalletWasabi.Fluent.Desktop/Program.cs
@turbolay turbolay marked this pull request as ready for review March 14, 2024 18:24
@turbolay
Copy link
Collaborator Author

Ok so well it has been a mess :D I'm pretty convinced myself this PR was not the way to go, but as I understood we chose to go with it. Maybe it will just be temporary?

Master is merged, IMO this PR is ready to go, but still a few questions @molnard :

  1. Should we implement Lucas' idea of using an event? It's basically just a style change over the function that I am using here.
  2. Should we crash in the same way we Force Kill instead of the same way we CTRL + C? This is an echo to something lucas said as well that we shouldn't crash gracefully. Not sure what the conclusion was on this.
  3. Should I implement the crash in the same services Kimi did in Handle crashes in background services (take 2) #12606, which makes sense even if probably not super required

Copy link
Collaborator

@molnard molnard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK

@molnard molnard merged commit f2b25ec into zkSNACKs:master Mar 15, 2024
7 of 8 checks passed
@molnard
Copy link
Collaborator

molnard commented Mar 15, 2024

Maybe it will just be temporary?

No, it is the final solution. This argument caused a major distraction in our team (typical minefield category). I will try to avoid getting back to this for a while by making more powerful/forceful decisions for the sake of team integrity.

  • Evaluate the impact on the business.
  • Take the simplest solution that works.
  • Forget about it and focus on "productive" things.

Should we implement Lucas' idea of using an event? It's basically just a style change over the function that I am using here.

It is OK.

Should we crash in the same way we Force Kill instead of the same way we CTRL + C?

No. I want graceful shutdowns.

Should I implement the crash in the same services Kimi did in

It is fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants