New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User defined error types #631
User defined error types #631
Comments
I think I need a lot more context on this. It comes back to an issue we've discussed many times inconclusively: what should a contract do on error, and what should the host support a contract does on error? For example, a contract certainly should not be able to modify ledger state if it panics (implicit rollback) so it can't record the status on the ledger. Should a contract be able to emit an event on error in production? This seems like (a) a footgun because the events could be misinterpreted as indicating some kind of success (b) a waste of space and compute. Should a contract be able to emit an event on error during preflight? This would still waste space in the actual contract but not compute (it could no-op in prod). So where do contract status codes fit into this picture, and how do you imagine them being used? Assuming that contract status codes do fit into this picture somewhere, the follow up question is why does this need to be a primitive type? Unless we're treating it specially in the host or downstream systems somehow, why shouldn't we just let contracts define their own error type? For example, this primitive doesn't easily extend to rust enum-style errors where some are codes, some contain additional data, etc. |
We are treating it specially in the host. HostError carries a status code and it's how we plumb errors from user code or host functions back through the VM and into results we give users (also status codes) from functions like try_call. |
I don't think function return values are stored on the ledger anyway. Is that the case? It would be recorded wherever function return values are, in the transaction result I imagine if the function result is recorded there.
We have this today on classic Stellar. Failed transactions and operations show up in transaction histories. Applications seem to distinguish them okay and I don't know of any evidence of it being an area of common mistake. But I'm also not sure on what the correct answer should be if events should be emitted. If they were emitted, it would be important a consumer could see they relate to an aborted invocation.
@graydon could you answer this? I don't feel strongly. We could encourage user defined types as errors. Solidity supports custom errors – https://solidity-by-example.org/error/ – and as you say Rust has good support for this. Imx plenty of systems have proven this unnecessary and error codes often suffice. Take Zig's error system for example. |
Yeah the status type was intended to have user errors tacked onto it (there's tons of bit-space in there) so long as they are integer codes. There is even potentially space for separately tagging different contracts' origin codes (eg. code 7 from contract A vs. code 7 from contract B) using the high 28 bits / type code. But we can revisit that later, IMO it overlaps a bit with a different topic I Have A Bit Of A Plan For. |
Not really. Failed operations don't emit any meta. |
I've said on a number of occasions that the decision to expose detailed error codes was a big mistake from a maintainability perspective. If we do it again, we're just going to end up in that world all over again. |
Could you expand on what the cost to maintainability has been? Do you mean maintainability of stellar-core?
Interesting. TIL. Looking at a recent failed transaction on pubnet as a concrete example: Its meta is:
Its result is:
It looks like we still get result error codes back even though the meta isn't emitted. @graydon @jonjove Will the status show up in the meta, or in the result ☝🏻? |
This question is what I was getting at with
There's no point building functionality for this until we know what we're doing with it. That's like trying to build the roof of a house (because you don't want the interior to get wet) before building the foundation.
Yes, maintainability of stellar-core. I've written about this a number of times when discussing error handling from various perspectives, so I will do so again in this context. It means that the implementation of stellar-core needs to always maintain full binary compatibility of error codes. It doesn't matter if you can say "I can make this change because the new code fails if-and-only-if the old code fails" since the way it fails matters. For example, imagine that you have a data structure with the following properties (a) never contains more than 10 elements (b) all elements are u32. If I try to insert an 11th element, it will fail regardless of if the element is u32 or not. So suppose I always report "too many elements" but later I realize it would be cheaper to check if it's u32 first... I can't. Likewise if you want to switch libraries that provide the same data structure, you might not be able to unless I can figure out how to map errors from one to the other (in general, this can be impossible if it's not 1-to-1). If contracts can report information on error beyond "failure", then we will get locked into this world again. |
Tangible example is here https://github.com/stellar/stellar-core/blob/364b2f768aa5261980dedfd48b5fcb85a0b357b9/src/transactions/ChangeTrustOpFrame.cpp#L167-L193 where we report the same error code in multiple blocks because we couldn't rearrange even though it would have been simpler to do so. |
Hmm .. I'm not sure I understand. I think that contract-specified behaviour is not under the host's control at all, and the host is always obliged to maintain observably-identical behaviour regardless of how that behaviour manifests (as an emitted error code or as a modified ledger entry, say). What makes errors different here? |
(I should also say: separate from contract-originating errors, which is the topic of this bug, @jayz22 also has on his plate an issue to do a pass over the work of the past few months and recategorize / reassign error codes that originate on the host and it would be great to get @jonjove's overall sense of the right granularity there, especially now that we have the "non-canonical behaviour / diagnostic-message side channel" for debugging, it may very well be that we want only a handful of canonical error codes. I very much want to learn from past experiences if they're considered mistakes!) |
@jonjove Are you proposing that contracts should only ever panic!/abort without passing any signal or error value back to the caller? How do you see a developer understanding why a contract invocation failed? For the sake of answering this let's assume events are not emitted, since that seems like the most contentious thing in this topic, and therefore even if there were debug events they wouldn't be emitted either. Or are you proposing that the value that gets passed back should be flexible, per your comment here? |
@leighmcculloch I'm not really proposing anything concretely. I recognize that there are challenges in every path. I'm saying we should consider a world where contracts "only ever panic!/abort without passing any signal". @MonsieurNicolas has voiced the opinion that error handling should be happening on the preflight path. I've suggested an approach in which error handling is provided by contract developers through an "install-debug-contract-over-other-contract" mechanism (can't find the issue where this came up, but still looking). What I'm saying is we shouldn't be pigeon-holed into the "always report errors to users" option. We should recognize that it has numerous drawbacks particularly in terms of compute cost, storage cost, contract cost, and future maintainability. We should consider deeply the alternatives which likely have superior properties in those categories but will be different in UX. Note different in UX doesn't necessarily mean worse. For example, the "install-debug-contract-over-other-contract" mechanism is slightly annoying for contract developers but would probably provide the best possible experience to users since they could get super detailed error messages.
I'm saying that if we want to pass values back, we should consider the trade-offs in terms of what we're allowing to be reported. Error codes are not highly informative. |
Given that we're discussing in this issue, I'm reopening it. |
I'm confused. I really do want to understand the issue here, but I'm struggling. Can you elaborate on some of these points?
I recognize this is proximal to the topic of host-originating error code granularity, and I totally respect that we need to take some principled approach to that topic and make sure we're not painting ourselves into corners with too-fine-granularity errors. But I'm unsure that any of that applies to contract-originating error codes. Concerning other UX: we do provide a logging method that adds debug events to the log, and that can be extracted by a user without perturbing the canonical output from a contract, so that was the avenue I was assuming users would want if they wanted higher-resolution errors than error codes. But I think it's reasonable to allow users to exit-with-a-code also; if we don't provide that they will just emulate it by emitting an error description into the contract-events stream and then executing a trap (divide by zero or whatever) to halt. That seems clunkier to me (and costlier -- events are much larger than status values, are likely to route into a whole publish/subscribe system). |
By contract cost I meant the cost of storing the actual WASM to determine what error and then actually report it. By storage cost I meant the cost of recording those errors somewhere like in history archives or whatever mechanism ends up resulting from all of this. By compute cost I meant the cost of figuring out what error to report, although I acknowledge that in many cases this could actually be very minor.
In stellar/rs-soroban-env#205 (comment), we agreed that "We will make it possible to disable debug information in release builds." so that doesn't seem like it's likely to be a reliable source of information in production. Producing the debug information is likely expensive just in terms of contract size, and the compute isn't free either. |
One comment specifically related to this proposal instead of the more general "is this a good idea". Suppose I have contracts A and B, both of which use this error mechanism. They use the same set of values because they were developed independently. Contract A calls B, and B fails. How does a user distinguish what the real error code was (should I interpret it as an A error or a B error)? Assuming there's no way to distinguish, what stops a malicious contract from deliberately misleading users in this way? |
So far I think we have four proposals in this thread:
I think we need to provide something. Providing error codes seems like it would address the opaqueness issue to at least some degree, even if it is imperfect. And, providing error codes today doesn't prevent us from exploring other options like what you describe. |
I saw this in your previous comment and it struck me that we might not be talking about the same thing. We should really distinguish between debugging during development (who cares what it costs?) and debugging in production for users (who doesn't care what it costs?). Which are we talking about here? For developers, they should use the debug log mechanism @graydon mentioned to help other developers debug. I don't see why they would use anything else? Even better would be to use the native code when it's available, which was a huge motivation for our initial design but seems to have dropped off the radar lately. |
I agree we're mixing scenarios here. There are several separate ones with very separate costs and complexity.
I think we're talking about splitting a fairly minor hair, but one that's still worth splitting right: in an optimized contract, running in production, on a failure path, should the user be allowed to provide a single u32 worth of hint to their caller (be it a contract or a txn-invoker) about what went wrong, or not. I argue that this costs as little as we can make anything cost -- it's the difference between the wasm sequences I agree you can spoof it. I agree it's not super detailed. But it's better than nothing, and the user's next-best option for providing such information in a production build is emitting a txmeta debug-event then trapping (which will require the combined wasm sequence |
I'm hoping error codes are a good option that are just always there, locally, or deployed on pubnet. The diagnostics stuff @graydon recently added, like While I think we definitely have options for how to make for better error diagnostics during development, there's a UX setup cost to someone going and finding the native code of another contract, verifying (somehow) that it is the legit exact same code that is deployed, to understand what is happening. I'm not opposed to us doing smart things like that. But I think for a very small price we can provide some error information that event exists on chain so that even if you are developing against an on-chain version of a contract, you aren't flying blind. |
I'm responding to the comments about integration testing for cross-contract development against Rust code vs WASM binaries in this thread, so as not to derail this conversation: https://discord.com/channels/897514728459468821/935999748840751125/1019050694055047228. |
Go for the error codes, it's fine. I'm quite concerned about the status code collision though. Here's an example. I have a contract that can report the error "not enough balance". It calls another contract that reports the same status code, tricking the user into sending more money. The user didn't actually need to send more money and now needs to spend more time and effort getting the money back out. Perhaps we can do something to mitigate this?
We've definitely been pulling away from this. Our original work was geared towards using x86-compiled contracts (as opposed to WASM) for all contracts that you have the source for during development--even if they aren't the contract you are writing but rather contracts you depend on. This is no longer the case: the very-slick client support that @leighmcculloch built only uses WASM. I'll follow up in the linked thread. |
I think there's a misunderstanding. The client equally supports both WASM and natively registered contracts. The client is designed to provide a unified single interface to both. I've provided more details here: https://discord.com/channels/897514728459468821/1019050694055047228/1019270676353384508. |
This seems most relevant in the case that A calls B and B calls C where A is a try_call and B is a call. @graydon In that case does a Status generated by C end up in A looking like it was generated by B? Maybe we could disambiguate between statuses returned from the callee vs statuses returned from some deeper component. |
Yeah this particular pattern is, I think, fairly easy to fix. If C fails with an explicit status, the inner call (non try_call, just normal call) from B -> C will fail with a wasmi VM trap status. At present we do appear to unpack the inner host status and propagate it as the call status (to try to "preserve information") but it'd be easy to leave that as an opaque boundary, maybe add a special |
It should be possible for contracts to create their own status error codes that are clearly separated from all errors that the host could create. For this we need a new
ScStatusType
, something likeContractError
should do. The codes within that type of error would be undefined by the XDR and any u32 value could be used. It would be up to contracts to define what they are.The text was updated successfully, but these errors were encountered: