Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose error codes #61

Closed
snej opened this issue Aug 4, 2019 · 13 comments
Closed

Expose error codes #61

snej opened this issue Aug 4, 2019 · 13 comments

Comments

@snej
Copy link

snej commented Aug 4, 2019

The API does not expose the details of an error, e.g. error codes.

Many use cases need to know the exact reason for a connection failure, for recovery and logging and to inform the user. (The library I work on uses these codes to decide whether to retry a connection. And my employer’s support team is often found poring over error logs submitted by developers using our library, to help them figure out why the client won’t connect to the 🤬 server.)

Error codes are problematic because there are multiple namespaces (errno, gethostbyname status, OpenSSL, mbedTLS, Apple OSStatus, etc.) What I tend to do is define an enumeration covering all the namespaces, then make my error type a {namespace, code} tuple. But defining your own namespace with an enum of your own error codes works too...

(This is another task I’m willing & likely to work on, if I end up choosing uWebSockets for this work project.)

@ghost
Copy link

ghost commented Aug 4, 2019

#25 Related,

TCP layer doesn't have errors because out of 19 different return values, only one or two may actually happen. Those one or two different actual errors are essentially ECONNRESET; something you can catch by checking if you get a proper on_end callback before on_close.

So even if you would use the Linux syscalls directly, there never were a way to get "the exact reason". And there is no need to.

Sure, as bug 25 reports, we have issues with connect errors and yes they need to be improved. But even here there are only a handful of actual errors.

TLS doesn't have errors because TCP doesn't. You kind of have the same reasoning here: if you do not get on_end but only get on_close then_something_ went wrong either protocol error or reset.

Now, the Linux kernel can see protocol errors for TCP just like openssl can see protocol errors for TLS. Linux will just report all protocol errors as ECONNRESET or EPOLLHUP or any such generic failure.

You'll never get super detailed errors such as "TCP handskar failed".

Therefore this lib makes it I've of the design goals not to emit pointless errors nobody understands anyways. You either get on_end for a graceful non-error shutdown or you get only on_close when something went wrong.

Use strace to log and debug.

@ghost
Copy link

ghost commented Aug 4, 2019

Under key aspects on readme:

Minimal yet truly cross-platform, will not emit a billion different platform specific error codes.

@ghost
Copy link

ghost commented Aug 4, 2019

You can do what you want in the plug-in, if you want to log things do it. You could define a function like register_error_handler which takes a c function then just declare it in your program and use it. Nothing is stopping you from doing that, but the public API should not have detailed errors.

Maybe we can add an integer to on_close which may or may not hold extra info. Kind of like a reserved thing for use by those implementations who care

@snej snej mentioned this issue Aug 5, 2019
@snej
Copy link
Author

snej commented Aug 5, 2019

You have a point about error codes being pretty much irrelevant when an open socket terminates. But when connecting, there are several distinct error conditions we [and probably many apps / higher level libraries] care about:

  • DNS lookup failure (no DNS, or DNS not responding)
  • Hostname not known
  • Host not reachable
  • No response from host
  • Connection refused
  • Server TLS cert invalid
  • Client TLS cert rejected

I'm not saying we have different logic for each one of these, but we use these as heuristics for whether to retry immediately, or treat the client as "offline" and wait for a network interface to go up, or give up and report an error to the user. (Some of this complexity comes from running on mobile devices, which find themselves in changing and unstable network conditions.)

Use strace to log and debug.

That sort of thing works when developers are testing in-house, but most troubleshooting is forensic, based on logs from customer devices (iOS and Android mostly.) Also, to be honest, a lot of our enterprise customers employ developers who are not super knowledgeable at anything below the app-framework layers of the OS, so it's for the best if we can tell them to just turn on our logging, instead of having to explain what system-tracing is and how to run it. (You may roll your eyes, but our support engineers are very emphatic about this, especially after a few beers.)

@ghost
Copy link

ghost commented Aug 5, 2019

Connecting is in a pretty bad shape in this library - I've mainly focused on server side.

I agree connecting is a special case where you might want at least a few possible errors.

Problem is the API need to be symmetric and I do not want to introduce on_error as that makes less sense for server side.

What I've thought about is passing error to on_open. Why on_open? Why not on_close?

Because if you pass the error to on_close you cannot know if you were closed from open state or closed immediately, so you cannot know if you have run a proper on_open before.

So it could make sense to have on_open take an integer error or even uintptr_t or something like that. Then you basically have to check if it was opened without error or if you opened with an error.

Maybe it makes sense to rename on_open to on_try_open?

@snej
Copy link
Author

snej commented Aug 5, 2019

if you pass the error to on_close you cannot know if you were closed from open state or closed immediately, so you cannot know if you have run a proper on_open before.

Can't you [the client code] set a flag in on_open and then test it in on_close? Same advice you gave above about detecting disconnects (just with on_open instead of on_end.)

@snej
Copy link
Author

snej commented Aug 6, 2019

I've experimentally added the int error parameter to on_open. IMHO, on_open is looking pretty messy since it has some parameters only for server-side and another only for client-side, and a flag to tell which mode — to me this looks like a strong code smell for splitting it into two callbacks, one for client and one for server. But you've mentioned a necessity of keeping the API symmetrical, which I don't understand but I'll take it as a given.

@ghost
Copy link

ghost commented Aug 6, 2019

Both client and server get the IP and IP length. Is_client is kind of a helper, you could have two different contexts if you want to separate them

@ghost
Copy link

ghost commented Aug 6, 2019

Hmm maybe not

@ghost
Copy link

ghost commented Aug 16, 2019

Also closing due to inactivity

@ghost ghost closed this as completed Aug 16, 2019
@ghost
Copy link

ghost commented Aug 18, 2019

@snej I know you're not interested anymore but this is how I want to solve it:

  • Replace is_client with a reason (integer)
  • Reason is a bitfield of:
    • OUTBOUND for a connection
    • INBOUND for accept
    • ERROR for any error
    • TLS_ERROR for any TLS error at connection (can contain more bits set for more detail)
    • DNS_ERROR for any DNS failure (probably won't need any more bits)
    • TCP_ERROR for basically anything from the kernel such as rejected, etc

Timeout errors are up to you to track with us_socket_timeout calls

@ghost ghost reopened this Aug 18, 2019
@snej
Copy link
Author

snej commented Aug 20, 2019

Sorry I vanished — I decided that for our architecture we don't need to do multiplexed or async socket I/O, since this is a client that usually only opens one socket and will never open more than a handful. So I've gone with a simpler approach that just spawns a read thread and a write thread, instead of using more complex 3rd party libraries with async I/O.

@ghost
Copy link

ghost commented Jun 26, 2020

The API now has int code and void *reason to its on_close and close functions. This allows passing TCP/TLS error codes 👍

@ghost ghost closed this as completed Jun 26, 2020
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant