Problems: TCP connecter does not Round Robin Multiple DNS A records, TCP connecter does not check for DNS TTL expiration #2297
It connects only to the first address returned by getaddrinfo, see: https://github.com/zeromq/libzmq/blob/master/src/tcp_address.cpp#L545

In theory it could be implemented by changing the internal address structure in tcp_address.cpp/hpp to a list and making some adjustments (well, maybe a few) in tcp_connecter.cpp, but I am worried about the semantic implications. Having multiple connects, as the documentation explains, implies round-robin sends for most socket types. This works fine and is not confusing today, because the user has to call connect manually for each endpoint, so there's no surprise. But if this started to happen behind the scenes, without any intervention from the user, depending only on what the DNS returns, I can see it getting icky very quickly.

So if you would like to implement it and send a PR, by all means please do so and we'll merge it, but I think it should be behind a socket option and disabled by default.

For your use case, wouldn't it be doable to just do the DNS resolution in your application, and connect multiple times using the IPs rather than the hostname?
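The application-side workaround suggested above can be sketched as follows. This is illustrative code, not a libzmq API: `resolve_ips` and `connect_all` are hypothetical helper names, and the socket is assumed to behave like a pyzmq socket (anything with a `connect(endpoint)` method works).

```python
# Sketch of the suggested workaround: resolve the hostname in the
# application, then connect the socket to each returned IP explicitly.
# ZeroMQ round-robins sends across connected endpoints for DEALER/PUSH/REQ.
import socket

def resolve_ips(hostname, port):
    """Return the unique IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})

def connect_all(zmq_socket, hostname, port):
    """Connect one endpoint per resolved IP instead of one per hostname."""
    for ip in resolve_ips(hostname, port):
        zmq_socket.connect("tcp://%s:%d" % (ip, port))
```

With this, `connect_all(sock, "router.local", 8081)` gives the "connect to every A record" behaviour the reporter asks for below, at the cost of the application owning re-resolution.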
Thanks for the answer @bluca!
Round robin is the right semantics DNS-wise too, so it's tempting to just match it up, but the change in behaviour might be surprising, agreed that putting this behind a sock opt makes sense.
So, to work around this, I think I'd need a thread which polls the DNS after its TTL expires and re-resolves. If the DNS changes from the previous answer, I'd diff the old and new address sets and connect/disconnect accordingly. It seems like sticking a TTL timer (with some reasonable minimum) per hostname on the I/O thread event loop to re-resolve and perform these diffs for all the sockets would be much more elegant (and we'd respect DNS semantics!). Even without the "connect multiple times" socket option set, the TTL timer would be useful to react to DNS changes.

Does this sound reasonable? I'm not very familiar with the 0MQ codebase; I could give this a try with some mentoring if that's on the table :)
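The diff-and-reconcile step described above can be sketched like this. All names (`diff_endpoints`, `refresh`) are hypothetical, and the socket is only assumed to have pyzmq-style `connect`/`disconnect` methods; this is not libzmq code.

```python
# Sketch of re-resolving a hostname and reconciling a socket's endpoints
# against the new answer: connect to new IPs, disconnect from stale ones.
import socket

def diff_endpoints(old_ips, new_ips):
    """Compute which IPs appeared and which disappeared after a re-resolve."""
    added = sorted(set(new_ips) - set(old_ips))
    removed = sorted(set(old_ips) - set(new_ips))
    return added, removed

def refresh(zmq_socket, hostname, port, current_ips):
    """Re-resolve hostname and apply the diff; returns the new IP list."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    new_ips = sorted({info[4][0] for info in infos})
    added, removed = diff_endpoints(current_ips, new_ips)
    for ip in added:
        zmq_socket.connect("tcp://%s:%d" % (ip, port))
    for ip in removed:
        zmq_socket.disconnect("tcp://%s:%d" % (ip, port))
    return new_ips
```

Calling `refresh` from a timer firing at the record's TTL gives roughly the behaviour described: a rolling replacement of A records turns into a rolling replacement of endpoints.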
If you are using CZMQ with zloop for your sockets, you can add a zloop_timer () per socket with a callback, so it won't run in another thread. If you are using zpoller, you can add a zactor that just sleeps and writes back into its pipe, and read the zactor pipe from the poller; a bit more verbose code-wise, but the same result. Should be pretty easy to implement.
Sounds good. There is already support for timer events, so it should be trivial to add one to the tcp connecter class (see zmq::tcp_connecter_t::timer_event). The only thing is we want to keep behaviour changes to a minimum, to avoid tripping up users who are upgrading from one version to another. So the best thing would be to have one socket option to turn on the TTL expiry check, and another to do the multiple connects, with both disabled by default at the beginning. So a rough TODO list:

- add a socket option to enable the TTL expiry check, disabled by default
- add a socket option to enable connecting to all resolved addresses, disabled by default
Sure, I'm happy to help, thanks for tackling this. I'm also online on IRC during weekday working hours, GMT+00.
Ah, this is turning out to be quite the rabbit hole. If we're going to do periodic DNS lookups, it feels like they should be async too. I may just work around this in client code for now, but I still think this would be an absolute killer feature for something like a Kubernetes cluster: it could completely remove the need for all the protocol-unaware local LBs that k8s sticks in my cluster, and it'd make it super convenient to write 0MQ micro-services. Thanks for all the info though! I may circle back to this if I ever get the time; I'm surprised this hasn't been an issue for other people (I googled for a long time before filing this issue).
Does 0MQ connect to all IPs returned by a DNS lookup, or only to the first one?

I'm trying to connect a `DEALER` to multiple `ROUTER` backends. The `router.local` DNS resolves to the IP addresses of all my backends: `10.0.1.1`, `10.0.1.2`... It seems like if there is a connection failure, the hostname will be re-resolved and the next IP is picked up (which is great!), but I would have really liked the behaviour of doing `socket.connect("tcp://router.local:8081")` to be the same as `for address in lookup(hostname): socket.connect(address)`.

Is there some way to tell 0MQ: "stay connected to all the endpoints that this hostname resolves to, respecting TTL"? Our setup is headless services in Kubernetes: scaling up a service causes the DNS to resolve to new IPs, and updating a service triggers a rolling update (with a corresponding rolling replacement of addresses in the DNS A records).