This repository has been archived by the owner on Nov 16, 2020. It is now read-only.

Kubernetes API requests in a pure IPv6 environment fail with an "nxdomain" #55

Closed
taijitao opened this issue Oct 16, 2019 · 27 comments · Fixed by rabbitmq/rabbitmq-peer-discovery-common#11

@taijitao

Hi,
I have a pure IPv6 Kubernetes cluster and I want to install the RabbitMQ Helm chart.
I followed the instructions in https://www.rabbitmq.com/networking.html#distribution-ipv6
My parameters (in the Helm chart):

   environment: |-
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc'  -proto_dist inet6_tcp"
      RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp "
  erl_inetrc: |-
    {inet6, true}.

The erl_inetrc file was created under /etc/rabbitmq,
and I found this error in the log:

2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-15 07:33:55.000 [debug] <0.238.0> GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/tazou/endpoints/zt4-crmq
2019-10-15 07:33:55.015 [debug] <0.238.0> Response: {error,{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}}
2019-10-15 07:33:55.015 [debug] <0.238.0> HTTP Error {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.015 [info] <0.238.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.016 [error] <0.237.0> CRASH REPORT Process <0.237.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167 in application_master:init/4 line 138
2019-10-15 07:33:55.016 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167

The inet module does return an IPv6 address:

[root]# kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet:gethostbyname("kubernetes.default.svc.cluster.local", inet6).'
{ok,{hostent,"kubernetes.default.svc.cluster.local",[],inet6,16,
             [{64769,43981,0,0,0,0,0,1}]}}
[root]#  kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet_res:resolve("kubernetes.default.svc.cluster.local", in, aaaa).'
{ok,{dns_rec,{dns_header,1,true,query,true,false,true,true,false,0},
             [{dns_query,"kubernetes.default.svc.cluster.local",aaaa,in}],
             [{dns_rr,"kubernetes.default.svc.cluster.local",aaaa,in,0,5,
                      {64769,43981,0,0,0,0,0,1},
                      undefined,[],false}],
             [],[]}}

nslookup returns an IPv6 address when type=aaaa,
and returns an error when type=a.

I don't know why httpc:request returns nxdomain.
Is this a bug or a configuration issue?

B.R,
Tao

@taijitao
Author

Does this plugin support an IPv6-only stack, or does it support a dual IPv4/IPv6 stack?

@michaelklishin
Member

This plugin issues requests to the Kubernetes API over HTTP[S]. It is entirely unaware of what IP version is used underneath. nxdomain, as I'm sure you know, means "no domain resolved". This plugin cannot be responsible for that.

For cases when proper hostname resolution configuration is not available, Erlang provides its own resolution configuration file which should be pointed at using the ERL_INETRC environment variable. You don't need it most of the time but sometimes it is indispensable.
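For reference, a minimal sketch of what that can look like (the path is illustrative, not required):

%% Set in the node's environment, e.g. ERL_INETRC=/etc/rabbitmq/erl_inetrc
%% The file contains plain Erlang terms, one per line, each ending with a period.
%% {inet6, true} makes the Erlang resolver perform IPv6 (AAAA) lookups.
{inet6, true}.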

@lukebakken
Contributor

Versions of the software from this rabbitmq-users discussion:

rabbitmq_3.7.18-1.el7
erlang_22.0.7-1.el7

I suspect this is due to the httpc library defaulting to inet: docs.

Note the default value for IpFamily.

@taijitao since you have access to an IPv6-only environment, I will create a custom build of this plugin for you to test.

@lukebakken
Contributor

lukebakken commented Oct 17, 2019

@taijitao - here is the custom plugin built from this branch:

rabbitmq_peer_discovery_k8s-3.7.20+rc.1.dirty.ez.zip

To install:

  • Copy to your RabbitMQ servers and remove the .zip extension.
  • Locate the existing rabbitmq_peer_discovery_k8s-3.7.18.ez file and rename it or move it out of the way.
  • Copy rabbitmq_peer_discovery_k8s-3.7.20+rc.1.dirty.ez to that location.
  • Restart RabbitMQ.

Please note that cluster formation only happens the first time RabbitMQ is started. If these nodes have been started before, you will have to reset them (rabbitmqctl reset) or delete their data directory.

@lukebakken
Contributor

@taijitao any chance to test this? ^^^^

@taijitao
Author

Yes, I'll test it.
Could you give me some explanation of what you changed in the custom build?

@michaelklishin
Member

michaelklishin commented Oct 22, 2019

@taijitao it configures (unconditionally, at the moment) the HTTP client's socket address family to IPv6.

@taijitao
Author

taijitao commented Oct 22, 2019

I have tested it and it worked.
The Erlang setting is: {inet6, true}.
The good news is:

2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery Kubernetes: setting IpFamily to inet6...
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery Kubernetes: setting IpFamily to inet6 response: ok
2019-10-22 06:10:28.934 [info] <0.274.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-22 06:10:29.016 [info] <0.274.0> All discovered existing cluster peers: rabbit@zt2-crmq-1, rabbit@zt2-crmq-0
2019-10-22 06:10:29.016 [info] <0.274.0> Peer nodes we can cluster with: rabbit@zt2-crmq-0
2019-10-22 06:10:29.032 [warning] <0.274.0> Could not auto-cluster with node rabbit@zt2-crmq-0: {badrpc,nodedown}


But it fails to form a cluster; I now have two separate nodes.
Docker processes:
bash-4.2$ ps -ef

UID        PID  PPID  C STIME TTY          TIME CMD
rabbitmq     1     0  0 06:09 ?        00:00:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server start
rabbitmq   197     1  0 06:09 ?        00:00:00 /usr/lib64/erlang/erts-10.4.4/bin/epmd -daemon
rabbitmq   383     1  1 06:09 ?        00:00:18 /usr/lib64/erlang/erts-10.4.4/bin/beam.smp -W w -A 64 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048
rabbitmq   551   383  0 06:10 ?        00:00:00 erl_child_setup 1048576
rabbitmq  1894   551  0 06:10 ?        00:00:00 inet_gethost 4
rabbitmq  1895  1894  0 06:10 ?        00:00:00 inet_gethost 4
rabbitmq  9563     0 35 06:26 ?        00:00:00 /usr/lib64/erlang/erts-10.4.4/bin/beam.smp -B -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -boot star
rabbitmq  9676  9563 34 06:26 ?        00:00:00 erl_child_setup 1048576
rabbitmq  9697     0  2 06:26 ?        00:00:00 bash
rabbitmq  9706  9697  0 06:26 ?        00:00:00 ps -ef

@michaelklishin
Member

According to the log, discovery via the Kubernetes API endpoint succeeded. However, the nodes could not contact and/or authenticate with each other. That is not a responsibility of this plugin. See the rabbit@zt2-crmq-0 logs for more clues. This part of the discussion is mailing list material.

@michaelklishin michaelklishin changed the title rabbitmq-peer-discovery-k8s can't work in pure ipv6 k8s Kubernetes API requests in a pure IPv6 environment fail with an "nxdomain" Oct 22, 2019
@michaelklishin
Member

httpc can only use one address family for its sockets. So we have a couple of options:

  • Add a configuration setting for this plugin that would switch it to inet6 (for IPv6)
  • Try to detect IPv6 availability, then switch

I personally would prefer the latter. @taijitao WDYT?
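For illustration, detection could look roughly like this (a sketch only; the helper name is hypothetical and not part of the plugin):

%% Naive detection sketch: ask the resolver for an IPv6 address for the host
%% and fall back to IPv4 if none is returned.
ip_family_for(Host) ->
    case inet:getaddr(Host, inet6) of
        {ok, _Addr} -> inet6;
        {error, _}  -> inet
    end.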

@Gsantomaggio
Member

Gsantomaggio commented Oct 22, 2019

Hi,
I have a Kubernetes cluster configured in pure IPv6 (with Kind).

I tried this patch because I need it here as well.
It seems to work correctly:

[vagrant@localhost k8s_statefulsets]$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP                NODE                 NOMINATED NODE   READINESS GATES
rabbitmq-0             1/1     Running   0          9m59s   fd00:10:244::27   kind-control-plane   <none>           <none>
rabbitmq-1             1/1     Running   0          8m43s   fd00:10:244::28   kind-control-plane   <none>           <none>
rabbitmq-2             1/1     Running   0          7m51s   fd00:10:244::29   kind-control-plane   <none>           <none>

and:

 kubectl describe service rabbitmq
Name:                     rabbitmq
Namespace:                default
Labels:                   app=rabbitmq
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"rabbitmq"},"name":"rabbitmq","namespace":"default"},"spe...
Selector:                 app=rabbitmq
Type:                     NodePort
IP:                       fd00:10:96::99a8
Port:                     http  15672/TCP
TargetPort:               15672/TCP
NodePort:                 http  31672/TCP
Endpoints:                [fd00:10:244::27]:15672,[fd00:10:244::28]:15672,[fd00:10:244::29]:15672
Port:                     amqp  5672/TCP
TargetPort:               5672/TCP
NodePort:                 amqp  30672/TCP
Endpoints:                [fd00:10:244::27]:5672,[fd00:10:244::28]:5672,[fd00:10:244::29]:5672
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

And the cluster status:

 rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local ...
Basics

Cluster name: rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local

Disk Nodes

rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local

Running Nodes

rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local

I noticed that for some reason the command check_port_connectivity does not work correctly in this stack:

 rabbitmq-diagnostics check_port_connectivity
Testing TCP connections to all active listeners on node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local ...
Error:
Connection to ports of the following listeners on node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local failed:
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 15672, protocol: http, purpose: HTTP API

@lukebakken
Contributor

@michaelklishin working on a PR to fix this in an "auto detect" fashion

@taijitao
Author

Thanks @lukebakken for your help.
It's better to auto-detect than to switch between different binary plugins.
The cluster now forms correctly based on your private build.

@michaelklishin
Member

Auto-detection has a tendency to fail in ways that are hard to understand. There will be no switching between binary plugins: if we can't get auto-detection to work reliably, we will add an option that lets the operator tell the plugin which address family to use.

@taijitao
Author

taijitao commented Oct 23, 2019

That's fine if an option is provided.
Would it go in erl_inetrc or in the plugin configuration?

lukebakken added a commit to rabbitmq/rabbitmq-peer-discovery-common that referenced this issue Oct 23, 2019
If the user configures `{inet6, true}` in `ERL_INETRC` file, then use it for all `httpc:` calls in peer discovery.

Fixes rabbitmq/rabbitmq-peer-discovery-k8s#55
@lukebakken
Contributor

lukebakken commented Oct 23, 2019

@taijitao @Gsantomaggio if you have time, I would really appreciate you testing the fix in rabbitmq/rabbitmq-peer-discovery-common#11

  • Revert your rabbitmq_peer_discovery_k8s-3.7.18.ez file to the original.
  • Locate your existing rabbitmq_peer_discovery_common*.ez file, and move it or rename it.
  • Install this file where that file was located, without the .zip extension:

rabbitmq_peer_discovery_common-3.7.20+rc.1.2.gb768f10.ez.zip

  • Ensure that you have {inet6, true} in your ERL_INETRC file.
  • Reset your cluster, and restart it.

The changes in rabbitmq/rabbitmq-peer-discovery-common#11 look for the presence of {inet6, true} in your inetrc file and will set the appropriate httpc option if found.
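Roughly speaking, the logic amounts to the following (a sketch of the approach only, not the actual plugin code; the function name is hypothetical):

%% Read the file pointed at by ERL_INETRC; if it contains {inet6, true},
%% switch httpc to IPv6 sockets before peer discovery issues any requests.
maybe_configure_inet6() ->
    case os:getenv("ERL_INETRC") of
        false ->
            ok;
        InetrcFile ->
            case file:consult(InetrcFile) of
                {ok, Terms} ->
                    case lists:member({inet6, true}, Terms) of
                        true  -> httpc:set_options([{ipfamily, inet6}]);
                        false -> ok
                    end;
                {error, _Reason} ->
                    ok
            end
    end.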

@hustlzp1981

@taijitao @lukebakken
Could you help take a look at my issue? Thanks a lot!
I have tried what you mentioned above, as well as other methods.
The RabbitMQ pod always fails with the error below in my IPv6 setup.
ERROR: epmd error for host osh-openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local: nxdomain (non-existing domain)

  1. I added the following in configmap-etc.yaml:
     environment: |-
       RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc' -proto_dist inet6_tcp"
       RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
     erl_inetrc: |-
       {inet6, true}.
  2. In my Armada manifest, the image pulled is rabbitmq: docker.io/rabbitmq:3.7.24

Thanks!
Zhipeng

@hustlzp1981

@lukebakken
Do I need your patch? Has it been merged into a release (3.7.24 or later)?
Thanks!
Zhipeng

@michaelklishin
Member

@hustlzp1981 have you seen the milestone on this PR and the 3.7.20 release notes?

@michaelklishin
Member

@hustlzp1981 this is not a support forum. Please post your questions to the mailing list.

nxdomain means that the hostname (osh-openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local) failed to resolve. This PR simply makes the HTTP client use IPv6 if it is configured via ERL_INETRC. There must be an AAAA DNS record in place or the client won't be able to resolve it.

@hustlzp1981

Thanks klishin!
Could you tell me which mailing list I should use?

@michaelklishin
Member

RabbitMQ has only one and it hasn't changed since 2014.

@Gsantomaggio
Member

nxdomain is a common problem in k8s; maybe we should update the documentation to add this document and this document, and include some RabbitMQ-specific examples.

@hustlzp1981

Thanks!
I have now fixed the nxdomain issue in my IPv6 k8s setup according to the guide above.
osh-openstack-rabbitmq-cluster-wait-9rw6p 1/1 Running 0 17m
osh-openstack-rabbitmq-rabbitmq-0 1/1 Running 0 17m

However, I still have another issue.
The osh-openstack-rabbitmq-cluster-wait pod uses rabbitmqadmin to connect to RabbitMQ,
but it always gets an error. It works in my IPv4 setup.
++ active_rabbit_nodes
2020-03-17T10:31:12.124589385Z stderr F ++ wc -w
2020-03-17T10:31:12.134367271Z stderr F ++ rabbitmqadmin_authed list nodes -f bash
2020-03-17T10:31:12.134427089Z stderr F ++ set +x
2020-03-17T10:31:12.179073378Z stderr F Traceback (most recent call last):
2020-03-17T10:31:12.179644557Z stderr F error: [Errno 111] Connection refused
2020-03-17T10:31:12.17964969Z stderr F *** Could not connect: [Errno 111] Connection refused

@michaelklishin
Member

Could not connect: [Errno 111] Connection refused is specific enough: a TCP connection (presumably to the HTTP API endpoint) was refused.

@michaelklishin
Member

This is not a Kubernetes support forum, so I will lock this.

@rabbitmq rabbitmq locked as resolved and limited conversation to collaborators Mar 18, 2020