This repository has been archived by the owner on Nov 17, 2020. It is now read-only.

Stream queue is sending an incomplete hostname to client, client fails to connect - maybe Kubernetes-specific #2

Closed
gerhard opened this issue Oct 29, 2020 · 2 comments


gerhard commented Oct 29, 2020

While preparing rabbitmq/tgir#18 with @mkuratczyk, we hit the following issue with the Java client:

17:41:41.722 [main] INFO  c.r.stream.perf.StreamPerfTest - Created stream stream1
java.net.UnknownHostException: stream-rabbitmq-server-0: Name or service not known
    at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
    at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)
    at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)
    at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
    at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
    at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
    at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
    at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
    at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
    at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:989)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:504)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:417)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:474)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
stream closed

And these are the logs from the broker:

rabbitmq 2020-10-29 17:41:41.223 [info] <0.1404.0> Transitioned from tcp_connected to tcp_connected
rabbitmq 2020-10-29 17:41:41.429 [info] <0.1404.0> Transitioned from tcp_connected to tcp_connected
rabbitmq 2020-10-29 17:41:41.523 [info] <0.1404.0> Transitioned from tcp_connected to authenticated
rabbitmq 2020-10-29 17:41:41.622 [info] <0.1404.0> Tuning response 1048576 60
rabbitmq 2020-10-29 17:41:41.623 [info] <0.1404.0> Transitioned from tuning to tuned
rabbitmq 2020-10-29 17:41:41.625 [info] <0.1404.0> Transitioned from tuned to opened
rabbitmq 2020-10-29 17:41:41.633 [notice] <0.1411.0> rabbit_stream_coordinator: candidate -> leader in term: 1 machine version: 0
rabbitmq 2020-10-29 17:41:41.633 [info] <0.943.0> ra: started cluster rabbit_stream_coordinator with 1 servers
rabbitmq 0 servers failed to start: []
rabbitmq Leader: {rabbit_stream_coordinator,'rabbit@stream-rabbitmq-server-0.stream-rabbitmq-headless.default'}
rabbitmq 2020-10-29 17:41:41.634 [info] <0.1417.0> osiris_log:init/1 max_segment_size: 500000000, retention [{max_bytes,20000000000}]
rabbitmq 2020-10-29 17:41:41.634 [info] <0.1417.0> osiris_writer:init/1: name: __stream1_1603993301627472257 last offset: -1 committed chunk id: -1
rabbitmq 2020-10-29 17:41:41.637 [info] <0.1404.0> Created cluster with leader <0.1417.0> and replicas []
rabbitmq 2020-10-29 17:41:41.727 [info] <0.1404.0> Received close command 0 <<"OK">>
rabbitmq 2020-10-29 17:41:41.728 [info] <0.1421.0> Closing all channels from connection '<<"10.42.0.76:42886 -> 10.42.0.75:5555">>' because it has been closed

The RabbitMQ node name is rabbit@stream-rabbitmq-server-0.stream-rabbitmq-headless.default, and its host portion is stream-rabbitmq-server-0.stream-rabbitmq-headless.default, so why does the Java client try to connect to stream-rabbitmq-server-0? Our assumption is that long names are not handled correctly by stream queues.

If it helps, either @mkuratczyk or I can show you how to reproduce this in a few minutes. This is the exact broker & stream-perf-test configuration that we were using when we hit this issue.



gerhard referenced this issue in rabbitmq/tgir Oct 29, 2020
Otherwise they will use the system hostname which is the first part of
the FQDN. More info:
https://github.com/rabbitmq/rabbitmq-website/blob/stream-queue/site/stream.md#advertised-host-port

re https://github.com/rabbitmq/rabbitmq-server/issues/2486

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>

michaelklishin transferred this issue from rabbitmq/rabbitmq-server Nov 2, 2020

@acogoluegnes
Contributor

The performance tool uses the URL given on the command line to create a "locator" connection, and then uses the metadata protocol command on this connection to get the topology of a given stream (where the members of this stream are located in the cluster). The stream plugin calls the inet:gethostname/0 function to determine the host name and returns it to the client. The client then uses this information to connect to the appropriate nodes (the leader node for a publisher, a replica node for a consumer).
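
The node selection described above can be sketched roughly as follows. This is an illustration only, not the Java client's actual code: the `StreamRouting`, `Broker` and `Topology` names are hypothetical, and falling back to the leader when a stream has no replicas is an assumption of the sketch.

```java
import java.util.List;

final class StreamRouting {
    private StreamRouting() {}

    // Hypothetical view of one broker entry in a metadata response.
    static final class Broker {
        final String host;
        final int port;

        Broker(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Hypothetical view of a stream's topology: one leader, zero or more replicas.
    static final class Topology {
        final Broker leader;
        final List<Broker> replicas;

        Topology(Broker leader, List<Broker> replicas) {
            this.leader = leader;
            this.replicas = replicas;
        }
    }

    // A publisher connects to the leader node.
    static Broker forPublisher(Topology topology) {
        return topology.leader;
    }

    // A consumer connects to a replica node. Falling back to the leader when
    // there are no replicas is an assumption made for this sketch.
    static Broker forConsumer(Topology topology) {
        return topology.replicas.isEmpty() ? topology.leader : topology.replicas.get(0);
    }
}
```

Whatever host string the broker returns is what the client hands to name resolution, so a name that only resolves on the broker's own host will fail on the client side with an UnknownHostException like the one in the stack trace.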

One can use the advertised_host configuration entry to tell the plugin which host name to return to clients.
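
As a rough illustration of that workaround, the fragment below advertises the full host portion of the node name from this issue. It assumes a RabbitMQ release in which the stream plugin reads the `stream.advertised_host` key from rabbitmq.conf; the key name and file format may differ between versions, so check the stream documentation for the release in use.

```ini
# rabbitmq.conf (sketch; key name assumed to match your RabbitMQ version)
# Tell the stream plugin to return this name to clients instead of the system hostname.
stream.advertised_host = stream-rabbitmq-server-0.stream-rabbitmq-headless.default
```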

So you're suggesting to use the Erlang long name instead of the host name?

Author

gerhard commented Nov 2, 2020

So you're suggesting to use the Erlang long name instead of the host name?

Yes.

A hostname does not resolve outside the host context, and is typically set to the first part of the FQDN. E.g. the hostname part of arnaud.rabbitmq.com is arnaud.
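
To make the distinction concrete, here is a small sketch of the two names involved. The `NodeNames` class and its methods are hypothetical helpers written for this illustration. They only do string manipulation, on the assumption that the system hostname is the first label of the FQDN, which is typical but not guaranteed.

```java
final class NodeNames {
    private NodeNames() {}

    // Host portion of an Erlang node name: everything after the first '@'.
    static String hostPortion(String nodeName) {
        int at = nodeName.indexOf('@');
        if (at < 0 || at == nodeName.length() - 1) {
            throw new IllegalArgumentException("not a node name: " + nodeName);
        }
        return nodeName.substring(at + 1);
    }

    // Short hostname: the first dot-separated label of a host name.
    // Assumes the common convention that the hostname is the first part of the FQDN.
    static String shortHostname(String host) {
        int dot = host.indexOf('.');
        return dot < 0 ? host : host.substring(0, dot);
    }
}
```

For the node in this issue, `hostPortion` yields stream-rabbitmq-server-0.stream-rabbitmq-headless.default, while `shortHostname` of that value yields stream-rabbitmq-server-0, which is the name the Java client failed to resolve.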

While the host portion in a RabbitMQ node name might not resolve outside the context of RabbitMQ nodes, this is less likely.

In the case of Kubernetes, the host portion is an actual DNS entry that can be resolved by any pod running inside the same cluster. For outside connectivity, we need to do a bit more work, but the basics are already in place.

In the case of Docker, the host portion gets resolved by Docker itself, e.g. https://github.com/rabbitmq/rabbitmq-prometheus/blob/bba82369f0b578ebe6c94b2f85f6822003e1cdcc/docker/docker-compose-overview.yml#L60

While less relevant, I am bringing this up for completeness. In the case of BOSH, we used to resolve the host portion explicitly, using Erlang inetrc config, e.g. https://github.com/pivotal-cf/cf-rabbitmq-release/blob/bd9b265108455353ec378d946c57f6e03ad58eca/jobs/rabbitmq-server/templates/pre-start.bash#L81-L100. That would only work on the RabbitMQ hosts. For clients, the advertised_host config would need to be set to the load balancer in the pre-provisioned case, or to the FQDN of the RabbitMQ node in the on-demand case.

If we default to the Erlang long name instead of hostname, I see us being one step closer to implementing client hinting. We talked about this as a way of enabling the broker to hint to clients a more optimal node to connect to. If we standardise on using Erlang long names everywhere (including streams), and we implement the functionality that allows services (K8S concept) to route traffic to the RabbitMQ node specified in the Host header, we wouldn't need to do anything else on the RabbitMQ side to get this working when users expose streams via a public IP. I can already see @Gsantomaggio and a few others getting excited about this in the context of rabbitmq/tgir#16 (comment)

2 participants