This repository has been archived by the owner on Nov 16, 2020. It is now read-only.

Kubernetes API requests in a pure IPv6 environment fail with an "nxdomain" #55

Closed
taijitao opened this issue Oct 16, 2019 · 27 comments · Fixed by rabbitmq/rabbitmq-peer-discovery-common#11

@taijitao

Hi,
I have a pure IPv6 Kubernetes cluster and I want to install the RabbitMQ Helm chart.
I followed the instructions in https://www.rabbitmq.com/networking.html#distribution-ipv6
My parameters (in the Helm chart):

   environment: |-
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc'  -proto_dist inet6_tcp"
      RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp "
  erl_inetrc: |-
    {inet6, true}.

The erl_inetrc file was created under /etc/rabbitmq,
and I found this error in the log:

2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-15 07:33:55.000 [debug] <0.238.0> GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/tazou/endpoints/zt4-crmq
2019-10-15 07:33:55.015 [debug] <0.238.0> Response: {error,{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}}
2019-10-15 07:33:55.015 [debug] <0.238.0> HTTP Error {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.015 [info] <0.238.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.016 [error] <0.237.0> CRASH REPORT Process <0.237.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167 in application_master:init/4 line 138
2019-10-15 07:33:55.016 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167

The inet module does return an IPv6 address:

[root]# kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet:gethostbyname("kubernetes.default.svc.cluster.local", inet6).'
{ok,{hostent,"kubernetes.default.svc.cluster.local",[],inet6,16,
             [{64769,43981,0,0,0,0,0,1}]}}
[root]#  kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet_res:resolve("kubernetes.default.svc.cluster.local", in, aaaa).'
{ok,{dns_rec,{dns_header,1,true,query,true,false,true,true,false,0},
             [{dns_query,"kubernetes.default.svc.cluster.local",aaaa,in}],
             [{dns_rr,"kubernetes.default.svc.cluster.local",aaaa,in,0,5,
                      {64769,43981,0,0,0,0,0,1},
                      undefined,[],false}],
             [],[]}}

nslookup returns an IPv6 address when type=aaaa,
and returns an error when type=a.

I don't know why httpc:request returns nxdomain.
Is this a bug or a configuration issue?

B.R,
Tao

@taijitao
Author

Does this plugin support an IPv6-only stack, or does it support a dual IPv4/IPv6 stack?

@michaelklishin
Member

This plugin issues requests to the Kubernetes API over HTTP[S]. It is entirely unaware of what IP version is used underneath. nxdomain, as I'm sure you know, means "no domain resolved". This plugin cannot be responsible for that.

For cases when proper hostname resolution configuration is not available, Erlang provides its own resolution configuration file which should be pointed at using the ERL_INETRC environment variable. You don't need it most of the time but sometimes it is indispensable.
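For reference, a minimal sketch of what that can look like (the path is illustrative, not required):

%% Set in the node's environment, e.g. ERL_INETRC=/etc/rabbitmq/erl_inetrc
%% The file contains plain Erlang terms, one per line, each ending with a period.
%% {inet6, true} makes the Erlang resolver perform IPv6 (AAAA) lookups.
{inet6, true}.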

@lukebakken
Contributor

Versions of the software from this rabbitmq-users discussion:

rabbitmq_3.7.18-1.el7
erlang_22.0.7-1.el7

I suspect this is due to the httpc library defaulting to inet: docs.

Note the default value for IpFamily.

@taijitao since you have access to an IPv6-only environment, I will create a custom build of this plugin for you to test.

@lukebakken
Contributor

lukebakken commented Oct 17, 2019

@taijitao - here is the custom plugin built from this branch:

rabbitmq_peer_discovery_k8s-3.7.20+rc.1.dirty.ez.zip

To install:

  • Copy to your RabbitMQ servers and remove the .zip extension.
  • Locate the existing rabbitmq_peer_discovery_k8s-3.7.18.ez file and rename it or move it out of the way.
  • Copy rabbitmq_peer_discovery_k8s-3.7.20+rc.1.dirty.ez to that location.
  • Restart RabbitMQ.

Please note that cluster formation only happens the first time RabbitMQ is started. If these nodes have been started before, you will have to reset them (rabbitmqctl reset) or delete their data directory.

@lukebakken
Contributor

@taijitao any chance to test this? ^^^^

@taijitao
Author

Yes, I'll test it.
Could you give me some explanation of what you changed in the custom build?

@michaelklishin
Member

michaelklishin commented Oct 22, 2019

@taijitao it configures (unconditionally, at the moment) the HTTP client's socket address family to IPv6.

@taijitao
Author

taijitao commented Oct 22, 2019

I have tested it and it worked.
The Erlang setting is: {inet6, true}.
The good news is:

2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery Kubernetes: setting IpFamily to inet6...
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery Kubernetes: setting IpFamily to inet6 response: ok
2019-10-22 06:10:28.934 [info] <0.274.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-22 06:10:29.016 [info] <0.274.0> All discovered existing cluster peers: rabbit@zt2-crmq-1, rabbit@zt2-crmq-0
2019-10-22 06:10:29.016 [info] <0.274.0> Peer nodes we can cluster with: rabbit@zt2-crmq-0
2019-10-22 06:10:29.032 [warning] <0.274.0> Could not auto-cluster with node rabbit@zt2-crmq-0: {badrpc,nodedown}


But it fails to form a cluster; I now have two separate nodes.
Docker processes:
bash-4.2$ ps -ef

UID        PID  PPID  C STIME TTY          TIME CMD
rabbitmq     1     0  0 06:09 ?        00:00:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server start
rabbitmq   197     1  0 06:09 ?        00:00:00 /usr/lib64/erlang/erts-10.4.4/bin/epmd -daemon
rabbitmq   383     1  1 06:09 ?        00:00:18 /usr/lib64/erlang/erts-10.4.4/bin/beam.smp -W w -A 64 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048
rabbitmq   551   383  0 06:10 ?        00:00:00 erl_child_setup 1048576
rabbitmq  1894   551  0 06:10 ?        00:00:00 inet_gethost 4
rabbitmq  1895  1894  0 06:10 ?        00:00:00 inet_gethost 4
rabbitmq  9563     0 35 06:26 ?        00:00:00 /usr/lib64/erlang/erts-10.4.4/bin/beam.smp -B -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -boot star
rabbitmq  9676  9563 34 06:26 ?        00:00:00 erl_child_setup 1048576
rabbitmq  9697     0  2 06:26 ?        00:00:00 bash
rabbitmq  9706  9697  0 06:26 ?        00:00:00 ps -ef

@michaelklishin
Member

According to the log, discovery via the Kubernetes API endpoint succeeded. However, the nodes could not contact and/or authenticate with each other. That is not a responsibility of this plugin. See the rabbit@zt2-crmq-0 logs for more clues. This part of the discussion is mailing list material.

@michaelklishin michaelklishin changed the title rabbitmq-peer-discovery-k8s can't work in pure ipv6 k8s Kubernetes API requests in a pure IPv6 environment fail with an "nxdomain" Oct 22, 2019
@michaelklishin
Member

httpc can only use one address family for its sockets. So we have a couple of options:

  • Add a configuration setting for this plugin that would switch it to inet6 (for IPv6)
  • Try to detect IPv6 availability, then switch

I personally would prefer the latter. @taijitao WDYT?
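For illustration, detection could look roughly like this (a sketch only; the helper name is hypothetical and not part of the plugin):

%% Naive detection sketch: ask the resolver for an IPv6 address for the host
%% and fall back to IPv4 if none is returned.
ip_family_for(Host) ->
    case inet:getaddr(Host, inet6) of
        {ok, _Addr} -> inet6;
        {error, _}  -> inet
    end.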

@Gsantomaggio
Member

Gsantomaggio commented Oct 22, 2019

Hi,
I have a Kubernetes cluster configured in pure IPv6 (with Kind).

I tried this patch because I need it here as well.
It seems to work correctly:

[vagrant@localhost k8s_statefulsets]$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP                NODE                 NOMINATED NODE   READINESS GATES
rabbitmq-0             1/1     Running   0          9m59s   fd00:10:244::27   kind-control-plane   <none>           <none>
rabbitmq-1             1/1     Running   0          8m43s   fd00:10:244::28   kind-control-plane   <none>           <none>
rabbitmq-2             1/1     Running   0          7m51s   fd00:10:244::29   kind-control-plane   <none>           <none>

and:

 kubectl describe service rabbitmq
Name:                     rabbitmq
Namespace:                default
Labels:                   app=rabbitmq
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"rabbitmq"},"name":"rabbitmq","namespace":"default"},"spe...
Selector:                 app=rabbitmq
Type:                     NodePort
IP:                       fd00:10:96::99a8
Port:                     http  15672/TCP
TargetPort:               15672/TCP
NodePort:                 http  31672/TCP
Endpoints:                [fd00:10:244::27]:15672,[fd00:10:244::28]:15672,[fd00:10:244::29]:15672
Port:                     amqp  5672/TCP
TargetPort:               5672/TCP
NodePort:                 amqp  30672/TCP
Endpoints:                [fd00:10:244::27]:5672,[fd00:10:244::28]:5672,[fd00:10:244::29]:5672
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

And the cluster status:

 rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local ...
Basics

Cluster name: rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local

Disk Nodes

rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local

Running Nodes

rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local

I noticed that for some reason the command check_port_connectivity does not work correctly in this stack:

 rabbitmq-diagnostics check_port_connectivity
Testing TCP connections to all active listeners on node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local ...
Error:
Connection to ports of the following listeners on node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local failed:
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 15672, protocol: http, purpose: HTTP API

@lukebakken
Contributor

@michaelklishin working on a PR to fix this in an "auto detect" fashion

@taijitao
Author

Thanks @lukebakken for your help.
It's better to auto-detect than to switch between different binary plugins.
The cluster now forms correctly based on your private build.

@michaelklishin
Member

Auto-detection has a tendency to fail in ways that are hard to understand. There will be no switching between binary plugins: if we can't get auto-detection to work reliably, we will add an option that lets the operator tell the plugin which address family to use.

@taijitao
Author

taijitao commented Oct 23, 2019

That's fine if an option is provided.
Would it go in erl_inetrc or in the plugin configuration?

lukebakken added a commit to rabbitmq/rabbitmq-peer-discovery-common that referenced this issue Oct 23, 2019
If the user configures `{inet6, true}` in `ERL_INETRC` file, then use it for all `httpc:` calls in peer discovery.

Fixes rabbitmq/rabbitmq-peer-discovery-k8s#55
@lukebakken
Contributor

lukebakken commented Oct 23, 2019

@taijitao @Gsantomaggio if you have time, I would really appreciate you testing the fix in rabbitmq/rabbitmq-peer-discovery-common#11

  • Revert your rabbitmq_peer_discovery_k8s-3.7.18.ez file to the original.
  • Locate your existing rabbitmq_peer_discovery_common*.ez file, and move it or rename it.
  • Install this file where that file was located, without the .zip extension:

rabbitmq_peer_discovery_common-3.7.20+rc.1.2.gb768f10.ez.zip

  • Ensure that you have {inet6, true} in your ERL_INETRC file.
  • Reset your cluster, and restart it.

The changes in rabbitmq/rabbitmq-peer-discovery-common#11 look for the presence of {inet6, true} in your inetrc file and will set the appropriate httpc option if found.
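Roughly speaking, the logic amounts to the following (a sketch of the approach only, not the actual plugin code; the function name is hypothetical):

%% Read the file pointed at by ERL_INETRC; if it contains {inet6, true},
%% switch httpc to IPv6 sockets before peer discovery issues any requests.
maybe_configure_inet6() ->
    case os:getenv("ERL_INETRC") of
        false ->
            ok;
        InetrcFile ->
            case file:consult(InetrcFile) of
                {ok, Terms} ->
                    case lists:member({inet6, true}, Terms) of
                        true  -> httpc:set_options([{ipfamily, inet6}]);
                        false -> ok
                    end;
                {error, _Reason} ->
                    ok
            end
    end.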

@hustlzp1981

@taijitao @lukebakken
Could you help take a look at my issue? Thanks a lot!
I have tried what you mentioned above, as well as other methods.
The RabbitMQ pod always fails with the error below in my IPv6 setup.
ERROR: epmd error for host osh-openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local: nxdomain (non-existing domain)

  1. I added the following in configmap-etc.yaml:
     environment: |-
       RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc' -proto_dist inet6_tcp"
       RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
     erl_inetrc: |-
       {inet6, true}.
  2. In my Armada manifest, the image pulled is rabbitmq: docker.io/rabbitmq:3.7.24

Thanks!
Zhipeng

@hustlzp1981

@lukebakken
Do I need your patch? Has it been merged into a release (3.7.24 or later)?
Thanks!
Zhipeng

@michaelklishin
Member

@hustlzp1981 have you seen the milestone on this PR and the 3.7.20 release notes?

@michaelklishin
Member

@hustlzp1981 this is not a support forum. Please post your questions to the mailing list.

nxdomain means that the hostname (osh-openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local) failed to resolve. This PR simply makes the HTTP client use IPv6 if it is configured via ERL_INETRC. There must be an AAAA DNS record in place or the client won't be able to resolve it.

@hustlzp1981

Thanks klishin!
Could you tell me which mailing list I should use?

@michaelklishin
Member

RabbitMQ has only one and it hasn't changed since 2014.

@Gsantomaggio
Member

nxdomain is a common problem in k8s; maybe we should update the documentation to add this document and this document, and include some RabbitMQ-specific examples.

@hustlzp1981

Thanks!
I have now fixed the nxdomain issue in my IPv6 k8s setup according to the guide above.
osh-openstack-rabbitmq-cluster-wait-9rw6p 1/1 Running 0 17m
osh-openstack-rabbitmq-rabbitmq-0 1/1 Running 0 17m

However, I still have another issue.
The osh-openstack-rabbitmq-cluster-wait pod uses rabbitmqadmin to connect to RabbitMQ,
but it always gets an error. It works in my IPv4 setup.
++ active_rabbit_nodes
2020-03-17T10:31:12.124589385Z stderr F ++ wc -w
2020-03-17T10:31:12.134367271Z stderr F ++ rabbitmqadmin_authed list nodes -f bash
2020-03-17T10:31:12.134427089Z stderr F ++ set +x
2020-03-17T10:31:12.179073378Z stderr F Traceback (most recent call last):
2020-03-17T10:31:12.179644557Z stderr F error: [Errno 111] Connection refused
2020-03-17T10:31:12.17964969Z stderr F *** Could not connect: [Errno 111] Connection refused

@michaelklishin
Member

Could not connect: [Errno 111] Connection refused is specific enough: a TCP connection (presumably to the HTTP API endpoint) was refused.

@michaelklishin
Member

This is not a Kubernetes support forum, so I will lock this.

@rabbitmq rabbitmq locked as resolved and limited conversation to collaborators Mar 18, 2020