Rancher network agent stuck in restart loop - DNS lookup issue #4237

Closed
khanetor opened this issue Mar 31, 2016 · 14 comments
Labels
kind/question Issues that just require an answer. No code change needed

Comments

@khanetor

After upgrading to Rancher v1.0.0 and Rancher agent v0.8.1, there seems to be a bug in the Rancher agent that prevents it from starting, so no other services can start either.

The log is as follows:

Requesting system reboot
March 31, 2016 8:21:10 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:21:41 AM GMT+7 The system is going down NOW!
March 31, 2016 8:21:41 AM GMT+7 Sent SIGTERM to all processes
March 31, 2016 8:21:42 AM GMT+7 Sent SIGKILL to all processes
March 31, 2016 8:21:42 AM GMT+7 Requesting system reboot
March 31, 2016 8:21:55 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:22:26 AM GMT+7 The system is going down NOW!
March 31, 2016 8:22:26 AM GMT+7 Sent SIGTERM to all processes
March 31, 2016 8:22:27 AM GMT+7 Sent SIGKILL to all processes
March 31, 2016 8:22:27 AM GMT+7 Requesting system reboot
March 31, 2016 8:22:40 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:23:11 AM GMT+7 The system is going down NOW!
March 31, 2016 8:23:11 AM GMT+7 Sent SIGTERM to all processes
March 31, 2016 8:23:12 AM GMT+7 Sent SIGKILL to all processes
March 31, 2016 8:23:12 AM GMT+7 Requesting system reboot
March 31, 2016 8:23:25 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:23:56 AM GMT+7 The system is going down NOW!
March 31, 2016 8:23:56 AM GMT+7 Sent SIGTERM to all processes
March 31, 2016 8:23:57 AM GMT+7 Sent SIGKILL to all processes
March 31, 2016 8:23:57 AM GMT+7 Requesting system reboot
March 31, 2016 8:24:10 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:24:41 AM GMT+7 The system is going down NOW!
March 31, 2016 8:24:41 AM GMT+7 Sent SIGTERM to all processes
March 31, 2016 8:24:42 AM GMT+7 Sent SIGKILL to all processes
March 31, 2016 8:24:42 AM GMT+7 Requesting system reboot
March 31, 2016 8:24:56 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:25:27 AM GMT+7 The system is going down NOW!
March 31, 2016 8:25:27 AM GMT+7 Sent SIGTERM to all processes
March 31, 2016 8:25:28 AM GMT+7 Sent SIGKILL to all processes
March 31, 2016 8:25:28 AM GMT+7 Requesting system reboot
March 31, 2016 8:25:40 AM GMT+7 INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:26:11 AM GMT+7 The system is going down NOW!
March 31, 2016 8:26:11 AM GMT+7
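
The log repeats the same cycle: a "Downloading agent" line, then what appears to be a shutdown roughly half a minute later, then another download attempt. A minimal sketch for measuring the gap between download attempts is below. It assumes the timestamp format shown in this log; the function names `download_times` and `restart_intervals` are made up for illustration and are not part of Rancher.

```python
import re
from datetime import datetime

# Matches the timestamp prefix used in the log above, e.g.
# "March 31, 2016 8:21:10 AM GMT+7". The UTC offset is ignored because
# only differences between timestamps matter here.
STAMP = re.compile(r"^(\w+ \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) GMT[+-]\d+")


def download_times(log_text):
    """Return the timestamp of every 'Downloading agent' line."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if match and "Downloading agent" in line:
            times.append(datetime.strptime(match.group(1), "%B %d, %Y %I:%M:%S %p"))
    return times


def restart_intervals(log_text):
    """Seconds between consecutive download attempts."""
    times = download_times(log_text)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Three download lines copied from the log above.
sample = """\
March 31, 2016 8:21:10 AM GMT+7INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:21:55 AM GMT+7INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
March 31, 2016 8:22:40 AM GMT+7INFO: Downloading agent https://rancher.khanetor.com/v1/configcontent/configscripts
"""
print(restart_intervals(sample))  # [45, 45]
```

A steady interval like this suggests the container is failing at the same step every time rather than crashing at random, which fits a download that never completes.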

Useful Info
Versions: Rancher v1.0.0, Cattle v0.159.2, UI v0.100.3
Access: github admin
Route: container.labels
@deniseschannon

What version did you upgrade from?
What OS are you running?
Where are your server/hosts? (Digital Ocean, AWS, Baremetal?)

Anything else you can share about your setup?

deniseschannon added the kind/question label (Issues that just require an answer. No code change needed) on Mar 31, 2016
@khanetor
Author

The previous version of Rancher was 0.63.1.
My hosts are on DigitalOcean, running Ubuntu 14.04 and Docker 1.10.

@khanetor
Author

I am getting a little closer to the issue. From within the Rancher Network container (which is the only container that is partially running) I cannot ping anything, including rancher.com, facebook.com, google.com, and my own domain.

@khanetor
Author

Is it possible that somehow the new Rancher broke the DNS?

@khanetor
Author

So I spawned a fresh VPS with Docker on DigitalOcean. The host is perfectly healthy, and I can ping any popular domain from it, including my own Rancher server domain. I then manually added this host to Rancher and started spinning up containers, and the same issue occurs: from within the network agent, I CANNOT ping anything by domain name. Pinging by IP address works fine, though.

@khanetor
Author

I have created a new environment, and I am still seeing the same issue: I cannot ping anything by domain name from within the network agent (agent-instance) container. I am not sure whether the same issue applies to other containers, since I cannot start any of them.

I think this is a bug with DNS in rancher/agent-instance:0.8.1. Unfortunately, there is no way to pin a specific version of the Rancher network agent image, so I cannot revert to 0.8.0 for testing.

All my existing Rancher network agent containers (running 0.8.0) are still functioning normally (and I can ping domain names).

This is a serious blocking issue for me since I cannot deploy anything when I am very close to launching my product.

@ibuildthecloud
Contributor

Just to clarify, right now cross-host networking is working, but DNS is not? Can you try launching a container, manually editing /etc/resolv.conf, and setting the DNS server to 8.8.8.8? Then see if DNS works from your container. I'm trying to see whether DNS is broken in general or whether it's Rancher.
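
One way to run the same check without relying on ping is sketched below. It is only an illustration and assumes Python is available wherever it is run; `check_dns` is a hypothetical helper name. It uses the system resolver through `socket.gethostbyname`, so on a typical Linux setup it is governed by the same /etc/resolv.conf being discussed.

```python
import socket


def check_dns(hostnames, resolve=socket.gethostbyname):
    """Try to resolve each hostname; map name -> IPv4 string, or None on failure."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results
```

Calling `check_dns(["rancher.com", "google.com"])` before and after editing /etc/resolv.conf would show whether the nameserver change alone restores lookups. The `resolve` parameter exists only so the function can be exercised without network access.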

@khanetor
Author

khanetor commented Apr 1, 2016

Here is the original content of /etc/resolv.conf:

search rancher.internal
nameserver 2001:4860:4860::8844
nameserver 2001:4860:4860::8888
nameserver 8.8.8.8

I then manually edited the file to only:

nameserver 8.8.8.8

After that, I was able to ping my domain (and other popular domains). As a result, the container succeeded in downloading what it needs and is no longer stuck in an infinite restart cycle. My service containers were also deployed successfully.

I then restarted the network agent container to see whether the fix persists, but /etc/resolv.conf was reset and the error came back.

To answer your first question: yes, cross host networking is working, but DNS is not.

khanetor changed the title from "Rancher network agent stuck in restart loop" to "Rancher network agent stuck in restart loop - DNS lookup issue" on Apr 1, 2016

@khanetor
Author

khanetor commented Apr 1, 2016

For reference, here is the content of /etc/resolv.conf in rancher/agent-instance:0.8.0:

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 8.8.8.8

This suggests that one of the first three lines of /etc/resolv.conf is what breaks DNS in rancher/agent-instance:0.8.1.

@khanetor
Author

khanetor commented Apr 2, 2016

This issue is more serious than I thought.

I manually updated the nameserver entry, and the network agents can now ping domain names in the outside world, BUT they cannot ping any other service within Rancher. For example, if I have a service named db, then ping db returns an unknown-host error.

Are you actively investigating this? I think this is a bug rather than just a question.

@khanetor
Author

khanetor commented Apr 3, 2016

Rancher does not support IPv6 yet, but I have a habit of enabling IPv6 on my DigitalOcean VPSs, so the Rancher network agents also pick up the IPv6 entries in their resolv.conf file. The two IPv6 nameserver entries are what cause DNS to fail. I have simply stopped enabling IPv6 on my DigitalOcean VPSs, and Rancher DNS is working now.
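
For anyone who wants to spot or strip the offending entries programmatically, here is a minimal sketch that drops non-IPv4 nameserver lines from resolv.conf text. The function name `ipv4_only_resolv_conf` is hypothetical and this is not how Rancher generates the file. Unlike the manual edit earlier in the thread, it keeps the search line, and as noted above, changes made inside the network agent container are reset when it restarts.

```python
import ipaddress


def ipv4_only_resolv_conf(text):
    """Drop 'nameserver' lines whose address is not IPv4; keep everything else."""
    kept = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(fields[1])
            except ValueError:
                kept.append(line)  # leave unparseable entries untouched
                continue
            if addr.version != 4:
                continue  # skip IPv6 nameservers
        kept.append(line)
    return "\n".join(kept) + "\n"


# The resolv.conf content reported earlier in this thread.
original = """\
search rancher.internal
nameserver 2001:4860:4860::8844
nameserver 2001:4860:4860::8888
nameserver 8.8.8.8
"""
print(ipv4_only_resolv_conf(original), end="")
```

With the sample input, only the search line and the 8.8.8.8 entry remain.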

@deniseschannon

@nlhkh Glad to hear you figured it out.

Here's the request to support ipv6: #1403

@corentin59

Hi,

I have exactly the same problem, even with IPv6 disabled on my host:

nano /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
sudo sysctl -p

After a network restart, I can see that IPv6 is off.
I have removed all Docker containers, restarted Docker, and deleted and re-added the host in Rancher.
But the problem is still there.
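
As a sanity check on the settings above, the sketch below parses sysctl.conf-style text and reports whether all three `disable_ipv6` keys are set to 1. The helper names are hypothetical, and it only inspects the file contents, not the live kernel state. Note also that disabling the IPv6 stack does not by itself remove IPv6 nameserver lines from the host's /etc/resolv.conf, so that file may still be worth checking.

```python
REQUIRED_KEYS = (
    "net.ipv6.conf.all.disable_ipv6",
    "net.ipv6.conf.default.disable_ipv6",
    "net.ipv6.conf.lo.disable_ipv6",
)


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def ipv6_disabled(text):
    """True only if every required key is present and set to '1'."""
    settings = parse_sysctl(text)
    return all(settings.get(key) == "1" for key in REQUIRED_KEYS)


# The three settings quoted in the comment above.
conf = """\
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
"""
print(ipv6_disabled(conf))  # True
```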

Versions:

  • Rancher v1.0.0 with rancher/agent:v0.11.0
  • Docker 1.10.3
  • OS : Ubuntu 14.04.4 LTS
  • Amazon ec2 instance

The problem goes away if I roll back to Rancher 0.63.1.

@argent-smith

BTW I had the same thing without ipv6. Fixed by tuning my DNS/DHCP boxes and recreating all affected Rancher server/agent/agent-instance/agent-state containers.
