Traefik 3.0: StartTLS connection hanging if connection initiated when upstream unavailable #9929
Comments
I think the ideal situation would be...
@rtribotte any advice appreciated, also we are willing to contribute this addition to Traefik
Hello @sjmiller609, Thanks for your interest in Traefik and for reporting this! Could you please share the complete debug log of Traefik during the sequence you described? To give you more insights, your proposals make sense, but we suspect that the cause of your problem could be linked to the StartTLS feature itself, so we need to investigate.
@rtribotte Thank you!
Shell output
Above, on the second psql command, we delete the pod manually immediately before running psql. I let it hang for a while, then pressed Ctrl+C. We see the pod is 23s old and running when I list pods again. Then I tidied the logs like this to remove healthcheck noise
I don't think so. It looks like that is a feature of HTTP services, and we are using a TCP service, but I may be misunderstanding. In the case of
@sjmiller609, can you give me the helm chart version you use for the values file?
Thanks @sjmiller609
Hello @sjmiller609, Thanks for your feedback! Additionally, can you please share with us the
Also, here is a repository where we can reproduce the issue locally: https://github.com/sjmiller609/reproduce-issues/tree/main/traefik-psql-ingress
Hello @sjmiller609, The rawdata endpoint payload shows that the "customer-1-hippo-1-c539bbaafd097714bd0e@kubernetescrd" router is in error because the service is missing (because there was an error when building it). It seems that your issue is related to a configuration issue and the GitHub issue tracker is dedicated to bug and feature requests. For help on your configuration, please join our Community Forum and reach out to us on the Traefik v2 section. We will close this issue accordingly but feel free to re-open it if you think that we missed something.
There are Kubernetes Services with no available Pod for some duration during a Postgres restart. When we connect during this time, we find that the connections hang open instead of closing, or instead of queuing up and then connecting when the target is ready. We tried the allowEmptyServices setting by setting it on the CLI. We reproduced the issue again, and this is the API raw data
This no longer includes the error message from the previous output. However, the connection is still hanging open. @rtribotte Please let me know, and thank you for your help.
Hello @sjmiller609, Thanks for the feedback. I can reproduce the issue with the psql command in your script.
Thank you again, you have been a huge help. I confirm I see the same:
Traefik may benefit from a forced disconnect in the case of the default sslmode prefer when the backend is not running, for the same reason our product would benefit: as the service provider, you are not typically in control of the clients' behavior. It also seems to me that the connection ought not to hang open with sslmode prefer, or even with an invalid sslmode like "disable". Instead, it should just close in any case where it is not fully established (or ideally, for valid configurations, retry for a short duration before timing out and then close). If you can let us know how we can help, we would be very grateful. We want the default psql our customers will use to not hang if the workload is restarting. Also, as far as I know there is no other Kubernetes ingress that does SNI routing with StartTLS support, so it's really great that Traefik has this option. Thanks again.
@rtribotte thanks again! Please let us know if there is a technical limitation and/or if a PR would be accepted from our team. This change would be valuable enough for us to work on this.
Hello @sjmiller609, Sorry for the late reply! Thanks for the proposal. This would indeed be valuable to us, and we would welcome a PR that fixes this "hanging" behavior. What would the fix consist of? Can you please share with us what the approach would be?
@rtribotte thanks again so much for your feedback. Here is my analysis of what might work.
What is going on?
When upstream postgres is not running, this is the order of events:
For both sslmode=require and sslmode=prefer
Next, only when sslmode=prefer, the client tries again
What could we change in Traefik?
In Traefik, we peek to check whether the first message is a StartTLS message, and if so we return "S". I suggest we also peek for a postgres startup message, and just close the connection if it is one.
More information on peeking the message: https://www.postgresql.org/docs/current/protocol-message-formats.html
With the StartTLS request, it looks like this in hexadecimal, which matches the bytes here
In startup message, it looks like this
How to peek it?
Start TLS request
This is an int32 with the message length (8), followed by another int32 with the value 80877103.
In comparison, the startup message is an int32 for the message length (varies), followed by another int32 with the value 196608. In hex that is 0x30000, which is supposed to mean “protocol version 3”.
In my example, we see the start of the Startup Message looks like this
So I think we should basically just peek for
@rtribotte when you have the time, please let me know if this approach sounds OK to you.
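For reference, here is a minimal Go sketch of the peek described above. It is not Traefik's code: the names classifyFirstBytes, messageKind and the kind* constants are hypothetical, and the startup message length used in main (0x54) is an arbitrary example value. Only the two int32 values (80877103 and 196608) come from the protocol documentation linked above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Values taken from the PostgreSQL protocol message formats page.
const (
	sslRequestCode   = 80877103 // 0x04D2162F, second int32 of an SSLRequest
	protocolVersion3 = 196608   // 0x00030000, second int32 of a v3 StartupMessage
)

// messageKind is a hypothetical classification used only in this sketch.
type messageKind int

const (
	kindUnknown messageKind = iota
	kindSSLRequest
	kindStartupV3
)

// classifyFirstBytes looks at the first 8 peeked bytes of a client connection:
// a big-endian int32 length followed by a big-endian int32 code.
func classifyFirstBytes(hdr []byte) messageKind {
	if len(hdr) < 8 {
		return kindUnknown
	}
	length := binary.BigEndian.Uint32(hdr[0:4])
	code := binary.BigEndian.Uint32(hdr[4:8])
	switch {
	case length == 8 && code == sslRequestCode:
		return kindSSLRequest
	case length >= 8 && code == protocolVersion3:
		return kindStartupV3
	default:
		return kindUnknown
	}
}

func main() {
	ssl := []byte{0x00, 0x00, 0x00, 0x08, 0x04, 0xd2, 0x16, 0x2f}
	// The length 0x54 below is an arbitrary example, not a captured value.
	startup := []byte{0x00, 0x00, 0x00, 0x54, 0x00, 0x03, 0x00, 0x00}
	fmt.Println(classifyFirstBytes(ssl) == kindSSLRequest)
	fmt.Println(classifyFirstBytes(startup) == kindStartupV3)
}
```

With a classification like this, the proxy could answer "S" for kindSSLRequest as it does today, and decide separately what to do for kindStartupV3 (for example, close the connection when no backend is available).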
@sjmiller609 Sure! At a glance, we are not sure we grasp every subtlety of the approach, especially on the client side (what in this approach would make the client consider the connection closed?), and the potential side effects in Traefik itself. Anyway, we think that it is worth going further and we would gladly welcome a PR for this. We cannot commit to merging it, but we will certainly review it. Thanks!
@rtribotte Thanks again for considering to allow a change! Let's follow up in the PR? To answer your questions with what I suspect so far:
server-side initiated TCP disconnect
Maybe the current version of Traefik 3 has an issue with any TCP protocol that starts with the same 8 bytes as StartTLS, or with a protocol where the client sends fewer than 8 bytes to start and the bytes sent match the initial bytes of StartTLS. I believe I am hitting this issue in my draft PR when a client message is less than 5 bytes. In my case it is much more probable, because that covers any possible initial 4 bytes, whereas for StartTLS it is perhaps just improbable that any other protocol matches the initial bytes.
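To illustrate the short-message case, here is a small Go sketch, not taken from Traefik or from the draft PR. It assumes the peek is done with a bufio.Reader, whose Peek(n) blocks until n bytes arrive or the read fails, and bounds it with a read deadline. The helper names peekWithDeadline and shortClientDemo and the 200ms timeout are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// peekWithDeadline tries to peek n bytes, but gives up once timeout elapses.
// On failure it returns whatever bytes were buffered together with the error.
func peekWithDeadline(conn net.Conn, br *bufio.Reader, n int, timeout time.Duration) ([]byte, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return nil, err
	}
	// Clear the deadline afterwards so later reads are not affected.
	defer conn.SetReadDeadline(time.Time{})
	return br.Peek(n)
}

// shortClientDemo simulates a client that sends only 3 bytes and then waits.
// It returns how many bytes were peeked and the error from the peek.
func shortClientDemo() (int, error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	// The client writes fewer than 8 bytes and never sends more.
	go client.Write([]byte{0x00, 0x00, 0x00})

	br := bufio.NewReader(server)
	got, err := peekWithDeadline(server, br, 8, 200*time.Millisecond)
	return len(got), err
}

func main() {
	n, err := shortClientDemo()
	fmt.Println("bytes peeked:", n)
	fmt.Println("peek error:", err)
}
```

Without the deadline, the Peek(8) call in this demo would block until the client sends more data or closes the connection, which looks like the hang described above.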
Just another case where the connection can hang: if the client tries to use GSSAPI. I just had a situation where DBeaver was working nicely, but the command-line tools (pg_dump, pg_restore, psql) were failing. The problem is that those tools will first try to negotiate GSSAPI encryption (even if we set sslmode=verify-full). Failing command
The connection will just hang. The sent bytes in this case are
To work around this, we can explicitly disable GSSAPI with
Not sure what the correct Traefik behavior should be here
@rtribotte suggestion works!
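One possible behavior, sketched in Go under assumptions: if the first message is the GSSENCRequest from the PostgreSQL protocol documentation (an int32 length of 8 followed by an int32 with the value 80877104), a proxy could answer with the single byte "N" to decline GSSAPI encryption, after which the protocol allows the client to continue, for example with an SSLRequest. This is not what Traefik does as far as this thread shows; isGSSENCRequest and declineReply are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// gssEncRequestCode is the second int32 of a GSSENCRequest (0x04D21630),
// as listed in the PostgreSQL protocol message formats.
const gssEncRequestCode = 80877104

// isGSSENCRequest reports whether the first 8 peeked bytes are a GSSENCRequest.
func isGSSENCRequest(hdr []byte) bool {
	return len(hdr) >= 8 &&
		binary.BigEndian.Uint32(hdr[0:4]) == 8 &&
		binary.BigEndian.Uint32(hdr[4:8]) == gssEncRequestCode
}

// declineReply returns the one-byte answer a proxy could send to decline
// GSSAPI encryption, or nil when the bytes are not a GSSENCRequest.
func declineReply(hdr []byte) []byte {
	if isGSSENCRequest(hdr) {
		return []byte{'N'}
	}
	return nil
}

func main() {
	gss := []byte{0x00, 0x00, 0x00, 0x08, 0x04, 0xd2, 0x16, 0x30}
	ssl := []byte{0x00, 0x00, 0x00, 0x08, 0x04, 0xd2, 0x16, 0x2f}
	fmt.Printf("%q\n", declineReply(gss))
	fmt.Printf("%q\n", declineReply(ssl))
}
```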
Welcome!
What did you do?
Hello, I am using the pre-release Traefik version 3 for this feature: #9377
It is working great, and we have noticed one possible issue.
If we establish a connection to Traefik while the upstream server is not available, the connection will hang open. The connection keeps hanging and never fully connects to the upstream, even after the upstream becomes available.
What did you see instead?
Example order of events:
What version of Traefik are you using?
Version: 3.0.0-beta2
Codename: beaufort
Go version: go1.19.4
Built: 2022-12-07T16:32:34Z
OS/Arch: linux/amd64
What is your environment & configuration?
If applicable, please paste the log output in DEBUG level
No response