Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User defined error types #631

Closed
leighmcculloch opened this issue Sep 10, 2022 · 27 comments · Fixed by stellar/stellar-xdr#29, stellar/rs-stellar-xdr#162, stellar/rs-soroban-env#449 or #601
Closed
Assignees

Comments

@leighmcculloch
Copy link
Member

It should be possible for contracts to create their own status error codes that are clearly separated from all errors that the host could create. For this we need a new ScStatusType, something like ContractError should do. The codes within that type of error would be undefined by the XDR and any u32 value could be used. It would be up to contracts to define what they are.

@leighmcculloch leighmcculloch transferred this issue from stellar/rs-stellar-xdr Sep 10, 2022
@leighmcculloch
Copy link
Member Author

@graydon Does the following change align with how you intended a contract would use statuses?

stellar/stellar-xdr#29

cc @jonjove

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

I think I need a lot more context on this. It comes back to an issue we've discussed many times inconclusively: what should a contract do on error, and what should the host support a contract does on error? For example, a contract certainly should not be able to modify ledger state if it panics (implicit rollback) so it can't record the status on the ledger. Should a contract be able to emit an event on error in production? This seems like (a) a footgun because the events could be misinterpreted as indicating some kind of success (b) a waste of space and compute. Should a contract be able to emit an event on error during preflight? This would still waste space in the actual contract but not compute (it could no-op in prod). So where do contract status codes fit into this picture, and how do you imagine them being used?

Assuming that contract status codes do fit into this picture somewhere, the follow up question is why does this need to be a primitive type? Unless we're treating it specially in the host or downstream systems somehow, why shouldn't we just let contracts define their own error type? For example, this primitive doesn't easily extend to rust enum-style errors where some are codes, some contain additional data, etc.

@graydon
Copy link
Contributor

graydon commented Sep 12, 2022

We are treating it specially in the host. HostError carries a status code and it's how we plumb errors from user code or host functions back through the VM and into results we give users (also status codes) from functions like try_call.

@leighmcculloch
Copy link
Member Author

so it can't record the status on the ledger

I don't think function return values are stored on the ledger anyway. Is that the case? It would be recorded wherever function return values are, in the transaction result I imagine if the function result is recorded there.

events could be misinterpreted as indicating some kind of success (b) a waste of space and compute.

We have this today on classic Stellar. Failed transactions and operations show up in transaction histories. Applications seem to distinguish them okay and I don't know of any evidence of it being an area of common mistake. But I'm also not sure on what the correct answer should be if events should be emitted. If they were emitted, it would be important a consumer could see they relate to an aborted invocation.

why does this need to be a primitive type

@graydon could you answer this? I don't feel strongly. We could encourage user defined types as errors. Solidity supports custom errors – https://solidity-by-example.org/error/ – and as you say Rust has good support for this. Imx plenty of systems have proven this unnecessary and error codes often suffice. Take Zig's error system for example.

@graydon
Copy link
Contributor

graydon commented Sep 12, 2022

Yeah the status type was intended to have user errors tacked onto it (there's tons of bit-space in there) so long as they are integer codes. There is even potentially space for separately tagging different contracts' origin codes (eg. code 7 from contract A vs. code 7 from contract B) using the high 28 bits / type code. But we can revisit that later, IMO it overlaps a bit with a different topic I Have A Bit Of A Plan For.

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

We have this today on classic Stellar. Failed transactions and operations show up in transaction histories.

Not really. Failed operations don't emit any meta.

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

I've said on a number of occasions that the decision to expose detailed error codes was a big mistake from a maintainability perspective. If we do it again, we're just going to end up in that world all over again.

@leighmcculloch
Copy link
Member Author

leighmcculloch commented Sep 12, 2022

the decision to expose detailed error codes was a big mistake from a maintainability perspective

Could you expand on what the cost to maintainability has been? Do you mean maintainability of stellar-core?

Failed operations don't emit any meta.

Interesting. TIL.

Looking at a recent failed transaction on pubnet as a concrete example:
941f4274fc85e4abe70eb694792eb7159721f377108e65d6a6cfb12a448bf11c

Its meta is: AAAAAgAAAAAAAAAAAAAAAA== (ref)

TransactionMeta: [undefined]
  v2
    txChangesBefore: Array[0]
    operations: Array[0]
    txChangesAfter: Array[0]

Its result is: AAAAAAAAASz/////AAAAAwAAAAAAAAAA/////QAAAAAAAAAA/////QAAAAAAAAAA/////QAAAAA= (ref)

TransactionResult
  feeCharged: 300
  result: [txFailed]
    results: Array[3]
      [0]: [opInner]
        tr: [createAccount]
          createAccountResult: [createAccountLowReserve]
      [1]: [opInner]
        tr: [createAccount]
          createAccountResult: [createAccountLowReserve]
      [2]: [opInner]
        tr: [createAccount]
          createAccountResult: [createAccountLowReserve]
  ext: [undefined]

It looks like we still get result error codes back even though the meta isn't emitted.

@graydon @jonjove Will the status show up in the meta, or in the result ☝🏻?

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

Will the status show up in the meta, or in the result ☝🏻?

This question is what I was getting at with

It comes back to an issue we've discussed many times inconclusively: what should a contract do on error, and what should the host support a contract does on error? For example, a contract certainly should not be able to modify ledger state if it panics (implicit rollback) so it can't record the status on the ledger. Should a contract be able to emit an event on error in production? This seems like (a) a footgun because the events could be misinterpreted as indicating some kind of success (b) a waste of space and compute. Should a contract be able to emit an event on error during preflight? This would still waste space in the actual contract but not compute (it could no-op in prod). So where do contract status codes fit into this picture, and how do you imagine them being used?

There's no point building functionality for this until we know what we're doing with it. That's like trying to build the roof of a house (because you don't want the interior to get wet) before building the foundation.

Could you expand on what the cost to maintainability has been? Do you mean maintainability of stellar-core?

Yes, maintainability of stellar-core. I've written about this a number of times when discussing error handling from various perspectives, so I will do so again in this context. It means that the implementation of stellar-core needs to always maintain full binary compatibility of error codes. It doesn't matter if you can say "I can make this change because the new code fails if-and-only-if the old code fails" since the way it fails matters. For example, imagine that you have a data structure with the following properties (a) never contains more than 10 elements (b) all elements are u32. If I try to insert an 11th element, it will fail regardless of if the element is u32 or not. So suppose I always report "too many elements" but later I realize it would be cheaper to check if it's u32 first... I can't. Likewise if you want to switch libraries that provide the same data structure, you might not be able to unless I can figure out how to map errors from one to the other (in general, this can be impossible if it's not 1-to-1). If contracts can report information on error beyond "failure", then we will get locked into this world again.

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

Tangible example is here https://github.com/stellar/stellar-core/blob/364b2f768aa5261980dedfd48b5fcb85a0b357b9/src/transactions/ChangeTrustOpFrame.cpp#L167-L193 where we report the same error code in multiple blocks because we couldn't rearrange even though it would have been simpler to do so.

@graydon
Copy link
Contributor

graydon commented Sep 12, 2022

Hmm .. I'm not sure I understand. I think that contract-specified behaviour is not under the host's control at all, and the host is always obliged to maintain observably-identical behaviour regardless of how that behaviour manifests (as an emitted error code or as a modified ledger entry, say). What makes errors different here?

@graydon
Copy link
Contributor

graydon commented Sep 12, 2022

(I should also say: separate from contract-originating errors, which is the topic of this bug, @jayz22 also has on his plate an issue to do a pass over the work of the past few months and recategorize / reassign error codes that originate on the host and it would be great to get @jonjove's overall sense of the right granularity there, especially now that we have the "non-canonical behaviour / diagnostic-message side channel" for debugging, it may very well be that we want only a handful of canonical error codes. I very much want to learn from past experiences if they're considered mistakes!)

@leighmcculloch
Copy link
Member Author

@jonjove Are you proposing that contracts should only ever panic!/abort without passing any signal or error value back to the caller?

How do you see a developer understanding why a contract invocation failed? For the sake of answering this let's assume events are not emitted, since that seems like the most contentious thing in this topic, and therefore even if there were debug events they wouldn't be emitted either.

Or are you proposing that the value that gets passed back should be flexible, per your comment here?

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

@jonjove Are you proposing that contracts should only ever panic!/abort without passing any signal or error value back to the caller?

How do you see a developer understanding why a contract invocation failed? For the sake of answering this let's assume events are not emitted, since that seems like the most contentious thing in this topic, and therefore even if there were debug events they wouldn't be emitted either.

@leighmcculloch I'm not really proposing anything concretely. I recognize that there are challenges in every path. I'm saying we should consider a world where contracts "only ever panic!/abort without passing any signal". @MonsieurNicolas has voiced the opinion that error handling should be happening on the preflight path. I've suggested an approach in which error handling is provided by contract developers through an "install-debug-contract-over-other-contract" mechanism (can't find the issue where this came up, but still looking).

What I'm saying is we shouldn't be pigeon-holed into the "always report errors to users" option. We should recognize that it has numerous drawbacks particularly in terms of compute cost, storage cost, contract cost, and future maintainability. We should consider deeply the alternatives which likely have superior properties in those categories but will be different in UX. Note different in UX doesn't necessarily mean worse. For example, the "install-debug-contract-over-other-contract" mechanism is slightly annoying for contract developers but would probably provide the best possible experience to users since they could get super detailed error messages.

Or are you proposing that the value that gets passed back should be flexible, per your comment #631?

I'm saying that if we want to pass values back, we should consider the trade-offs in terms of what we're allowing to be reported. Error codes are not highly informative.

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

Given that we're discussing in this issue, I'm reopening it.

@jonjove jonjove reopened this Sep 12, 2022
@graydon
Copy link
Contributor

graydon commented Sep 12, 2022

I'm confused. I really do want to understand the issue here, but I'm struggling. Can you elaborate on some of these points?

  • What are the compute and storage costs to reporting a contract-originating error code to the transaction-invoker or try_call caller? Every call already produces a result value, and status is a value, so it's the same size..
  • What is "contract cost"? Do you mean like the contract code has more complexity in it because it has to maintain a set of error codes?
  • For error codes originating in contracts -- which is what this is about -- who has a future maintenance burden? Surely not us, the host is not going to interpret a contract-originating error code at all, just pass them through. And contracts don't have a compatibility requirement, they can break compat with themselves whenever they want, any calls to old code in old ledgers still replay with the old version of the code.

I recognize this is proximal to the topic of host-originating error code granularity, and I totally respect that we need to take some principled approach to that topic and make sure we're not painting ourselves into corners with too-fine-granularity errors. But I'm unsure that any of that applies to contract-originating error codes.

Concerning other UX: we do provide a logging method that adds debug events to the log, and that can be extracted by a user without perturbing the canonical output from a contract, so that was the avenue I was assuming users would want if they wanted higher-resolution errors than error codes. But I think it's reasonable to allow users to exit-with-a-code also; if we don't provide that they will just emulate it by emitting an error description into the contract-events stream and then executing a trap (divide by zero or whatever) to halt. That seems clunkier to me (and costlier -- events are much larger than status values, are likely to route into a whole publish/subscribe system).

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

By contract cost I meant the cost of storing the actual WASM to determine what error and then actually report it. By storage cost I meant the cost of recording those errors somewhere like in history archives or whatever mechanism ends up resulting from all of this. By compute cost I meant the cost of figuring out what error to report, although I acknowledge that in many cases this could actually be very minor.

we do provide a logging method that adds debug events to the log, and that can be extracted by a user without perturbing the canonical output from a contract

In stellar/rs-soroban-env#205 (comment), we agreed that "We will make it possible to disable debug information in release builds." so that doesn't seem like it's likely to be a reliable source of information in production. Producing the debug information is likely expensive just in terms of contract size, and the compute isn't free either.

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

One comment specifically related to this proposal instead of the more general "is this a good idea". Suppose I have contracts A and B, both of which use this error mechanism. They use the same set of values because they were developed independently. Contract A calls B, and B fails. How does a user distinguish what the real error code was (should I interpret it as an A error or a B error)? Assuming there's no way to distinguish, what stops a malicious contract from deliberately misleading users in this way?

@leighmcculloch
Copy link
Member Author

@leighmcculloch I'm not really proposing anything concretely. I'm saying we should consider a world where contracts "only ever panic!/abort without passing any signal".

So far I think we have four proposals in this thread:

  1. no errors – Provide no capability to abort with error information. The only options to a developer are to abort without error information, or to non-abort with an error and require the developer to unwind state manually if they're wanting to do this. User defined error types #631

    Providing nothing seems not great. Debugging is a painful activity, and we need to provide something.

  2. error codes – Provide something very simple, lightweight, i.e. error codes, which is this proposal. User defined error types #631.

  3. error values – Provide something more complex, where users can abort with any error value, not just a status. User defined error types #631

    This seems to up the complexity pretty quickly. But definitely doable I think, if we really wanted it, but effort is high, so maybe if we want this we should layer this ontop of (2). i.e. Start with (2), and add support for custom error types in the future.

  4. "install-debug-contract-over-other-contract" – Require developers to ship two versions of their contract and publish them somewhere, so that other developers can retrieve the debug version, deploy it and test with it. User defined error types #631. 6.

    This poses significant security challenges. e.g. How can you tell the debug version does the same thing as the non-debug version? This also poses significant usability challenges. e.g. Where are the debug versions hosted? This also seems like it could be an extension of (2), so doing (2) today doesn't prevent us from doing this in the future.

I think we need to provide something. Providing error codes seems like it would address the opaqueness issue to at least some degree, even if it is imperfect. And, providing error codes today doesn't prevent us from exploring other options like what you describe.

@jonjove
Copy link
Contributor

jonjove commented Sep 12, 2022

Require developers

I saw this in your previous comment and it struck me that we might not be talking about the same thing. We should really distinguish between debugging during development (who cares what it costs?) and debugging in production for users (who doesn't care what it costs?). Which are we talking about here?

For developers, they should use the debug log mechanism @graydon mentioned to help other developers debug. I don't see why they would use anything else? Even better would be to use the native code when it's available, which was a huge motivation for our initial design but seems to have dropped off the radar lately.

@graydon
Copy link
Contributor

graydon commented Sep 13, 2022

I agree we're mixing scenarios here. There are several separate ones with very separate costs and complexity.

  • Local development, native build linked to soroban-env-host. Should definitely use debug messages and local debugger. I'm not aware of this having dropped off anyone's radar, we're still very much supporting this as a primary development model.
  • Local development, wasm build running on your own core and RPC server, preflighting or testnet'ing, staring at debug logs. Should upload a contract with as much logging as you like. Codesize and storage don't matter, and you have access to debug logs.
  • Production use, trying to understand why a given contract invocation failed after the fact, either by looking at its txresult and/or txmeta-borne events. Some diagnostic information can make it through as events in the debug-event stream, but not all (it depends if you use the format-string-based logging API -- no-ops in production builds -- or the non-formatted value-based API which is preserved). And we can expect users to be optimizing for size, so they will probably compile-out a fair amount of debug-event-emitting stuff. I think this case is what this bug is about.

I think we're talking about splitting a fairly minor hair, but one that's still worth splitting right: in an optimized contract, running in production, on a failure path, should the user be allowed to provide a single u32 worth of hint to their caller (be it a contract or a txn-invoker) about what went wrong, or not. I argue that this costs as little as we can make anything cost -- it's the difference between the wasm sequences [i32.const N; ctrl.call] and [ctrl.unreachable] in the contract and no cost difference at all in the return value from the contract -- and IMO this is fairly harmless, potentially useful when diagnosing failures or surfacing diagnostics to users, and should be supported.

I agree you can spoof it. I agree it's not super detailed. But it's better than nothing, and the user's next-best option for providing such information in a production build is emitting a txmeta debug-event then trapping (which will require the combined wasm sequence [i32.const N; ctrl.call; ctrl.unreachable] and cost more in the host since debug-events are larger).

@leighmcculloch
Copy link
Member Author

I'm hoping error codes are a good option that are just always there, locally, or deployed on pubnet.

The diagnostics stuff @graydon recently added, like log_static_fmt_val, and the DebugEvents that get triggered in a try_call go a long way to help a developer when they are in that development phase for their own contract. But some of those things mostly disappear when the contract is built for deployment.

While I think we definitely have options for how to make for better error diagnostics during development, there's a UX setup cost to someone going and finding the native code of another contract, verifying (somehow) that it is the legit exact same code that is deployed, to understand what is happening.

I'm not opposed to us doing smart things like that. But I think for a very small price we can provide some error information that event exists on chain so that even if you are developing against an on-chain version of a contract, you aren't flying blind.

@leighmcculloch
Copy link
Member Author

I'm responding to the comments about integration testing for cross-contract development against Rust code vs WASM binaries in this thread, so as not to derail this conversation: https://discord.com/channels/897514728459468821/935999748840751125/1019050694055047228.

@jonjove
Copy link
Contributor

jonjove commented Sep 13, 2022

Go for the error codes, it's fine. I'm quite concerned about the status code collision though. Here's an example. I have a contract that can report the error "not enough balance". It calls another contract that reports the same status code, tricking the user into sending more money. The user didn't actually need to send more money and now needs to spend more time and effort getting the money back out. Perhaps we can do something to mitigate this?

I'm not aware of this having dropped off anyone's radar, we're still very much supporting this as a primary development model.

We've definitely been pulling away from this. Our original work was geared towards using x86-compiled contracts (as opposed to WASM) for all contracts that you have the source for during development--even if they aren't the contract you are writing but rather contracts you depend on. This is no longer the case: the very-slick client support that @leighmcculloch built only uses WASM. I'll follow up in the linked thread.

@leighmcculloch
Copy link
Member Author

leighmcculloch commented Sep 13, 2022

This is no longer the case: the very-slick client support that @leighmcculloch built only uses WASM. I'll follow up in the linked thread.

I think there's a misunderstanding. The client equally supports both WASM and natively registered contracts. The client is designed to provide a unified single interface to both. I've provided more details here: https://discord.com/channels/897514728459468821/1019050694055047228/1019270676353384508.

@leighmcculloch
Copy link
Member Author

leighmcculloch commented Sep 13, 2022

I'm quite concerned about the status code collision though. Here's an example. I have a contract that can report the error "not enough balance". It calls another contract that reports the same status code, tricking the user into sending more money.

This seems most relevant in the case that A calls B and B calls C where A is a try_call and B is a call.

@graydon In that case does a Status generated by C end up in A looking like it was generated by B? Maybe we could disambiguate between statuses returned from the callee vs statuses returned from some deeper component.

@graydon
Copy link
Contributor

graydon commented Sep 13, 2022

Yeah this particular pattern is, I think, fairly easy to fix. If C fails with an explicit status, the inner call (non try_call, just normal call) from B -> C will fail with a wasmi VM trap status. At present we do appear to unpack the inner host status and propagate it as the call status (to try to "preserve information") but it'd be easy to leave that as an opaque boundary, maybe add a special ScSubContractError type code to downgrade it to so the user can recover it if they want but are not confused about "came from B" vs. "came from one of B's callees".

@leighmcculloch leighmcculloch changed the title Add ContractError to ScStatusType Errors: Add ContractError to ScStatusType Sep 14, 2022
@leighmcculloch leighmcculloch transferred this issue from stellar/stellar-xdr Sep 17, 2022
@leighmcculloch leighmcculloch changed the title Errors: Add ContractError to ScStatusType Errors: Support user defined error types and efficient handling pattern Sep 17, 2022
@leighmcculloch leighmcculloch changed the title Errors: Support user defined error types and efficient handling pattern Errors: Support user defined error types Sep 18, 2022
@leighmcculloch leighmcculloch changed the title Errors: Support user defined error types User defined error types Sep 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment