Stuck in trying to register forever #3179
Any updates? How do I fix this?
+1
Can you provide more details on the host that's failing? If other hosts are registering fine, then it must be something unique to this host. Is this host also running rancher server? How are you adding this host? What version of Rancher server are you running? How are you running it (locally, on a VM (GCE, AWS, DO), etc.)? Is this a standalone container or an HA setup?
Some more info. Host information:
- Host 1 / Master
- Host 2 / Slave
- Host 3 / Slave
I just want to confirm that you're running Rancher v0.30.0? I'm not sure that version supports Docker 1.9.1. Would you be willing to upgrade? We're at v0.51.0. It's strange that one host would be able to register and not the other...
@deniseschannon sorry for the late reply. We have now updated Rancher, but the error still persists and is the same as before, including the way it ends.
+1
Is there any way I can debug this? Pinging the rancher server from this host works just fine. Also curling the site works.
Same question here: we're stuck and not getting anywhere because the rancher agent doesn't want to connect :/
@lenovouser is the host showing up in the Rancher UI? If so, can you click its drop-down menu, click 'View in API' and share its API output here? Please scrub it of any sensitive information such as domains/IPs that you don't wish to share. Also, can you check the logs inside the rancher-agent container at /var/log/rancher/agent.log to see if there are any other errors there?
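For reference, assuming the container is named `rancher-agent` (the default), the check above can be done like this; the grep filter is just a convenience, not part of any official procedure:

```shell
# Tail the agent's own log file inside the running container.
# "rancher-agent" is the default container name; adjust if yours differs.
docker exec rancher-agent tail -n 200 /var/log/rancher/agent.log

# Show only lines that look like errors (case-insensitive):
docker exec rancher-agent sh -c \
  'grep -i error /var/log/rancher/agent.log | tail -n 50'
```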
@cjellick thanks for the reply. No, it is not showing up in the host view; that is our problem. The problem is in fact that it is not connecting, and we don't really know why. The logs are very big, so I had to shorten them.
Thanks @lenovouser. That "Host not registered yet" error you're seeing is actually a symptom, same as the host not showing up in the UI yet. There is either some connectivity problem between your host and rancher-server, or rancher-server is erroring out when trying to process the host. Would you be able to also send over the logs for the rancher-agent container? You can limit the output in case it is really long via `docker logs rancher-agent --tail 3000`. Also, in the Rancher UI, go to the admin tab, then processes, and let me know if there are any long-running processes there. Finally, have a look at the rancher-server logs and see if there are any recurring errors.
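One quick way to spot *recurring* errors in the server logs is to count duplicate error lines; the container name `rancher-server` here is a placeholder, substitute your own:

```shell
# Count the most frequent error-ish lines in the server log.
# Replace "rancher-server" with your actual server container name.
docker logs --tail 3000 rancher-server 2>&1 \
  | grep -iE 'error|exception' \
  | sort | uniq -c | sort -rn | head
```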
Okay, here are the container logs (I just created a fresh one to be sure):
This just repeats itself indefinitely. I am not really sure if it is a connectivity issue, because I was able to reach the server from this host. There also indeed is a long-running process; if I click on it, it takes forever to load and ends with a screen which finally seems to give some information, even though I still don't get why it is failing. And last: where can I find the rancher-server logs?
In case you wanted the full logs:
Any idea?
Hello? 😄 How can I fix / kill the reconnect process?
+1
Sorry for possibly spamming, but I'd also like to +1 the issue.
Hello? Is this a known bug? Or did we do something wrong? Can I fix it somehow?
We have now done a few things because we hoped it might be our fault:
Nothing helps. It is always the one host that doesn't want to connect, even if the server is running on that same host. It still may be our fault, I know. But is there at least any way I can find out why it is not connecting? Any logs containing more information than just the fact that it is not connecting, i.e. why it doesn't want to? Something which is also very weird is that it definitely is connecting somehow and then disconnecting, because this reconnect process keeps starting on the server.
@lenovouser this error shows up 6k times in your rancher-agent logs:
It appears about every 70 seconds, which is in line with the 60-second timeout we have on the websocket connection (plus 10 for the agent to restart and get to the point where it is trying to connect again). I'm trying to think of what scenario could cause one host to work perfectly fine and another to consistently fail. rancher-agent is attempting to make an outbound websocket connection to rancher-server. One typical cause of this not working is a proxy or other network device between the agent and server that does not like websocket connections. Is that possible? Assuming that your two agent nodes are "next to each other" and their traffic is going through all the same devices, that seems unlikely, but do you have any proxies, load balancers, or other known devices between the two? Can you exec into the rancher-server container and do a ping/curl from there?
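One way to see whether WebSocket upgrades survive the network path is to send the upgrade headers by hand with curl. The hostname and the `/v1/subscribe` endpoint below are assumptions based on this thread; a `101 Switching Protocols` response (or even an immediate 4xx rather than a hang) means any intermediaries are at least passing the upgrade headers through:

```shell
# Hand-rolled WebSocket handshake probe. Hostname and endpoint are
# examples taken from this thread; substitute your own. The key is a
# fixed sample value ("Hello, world!" base64-encoded), which is fine
# for a probe.
curl -skiN \
  -H 'Connection: Upgrade' \
  -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' \
  -H 'Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==' \
  'https://rancher.domain.tld/v1/subscribe' | head -n 1
```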
@lenovouser I had similar websocket issues, and enabling SSL for the Rancher server helped us: in the UI you would need to configure your Host Registration for SSL. This will make sure proxies or whatever is in the middle does not block websockets, and lets nginx/apache upgrade the protocol properly.
@lenovouser I just realised you seem to already have HTTPS set up: https://rancher.domain.tld/v1
@cjellick I can ping the host just fine.
I meant the other way around @lenovouser: exec into the rancher-agent container and curl from it to rancher-server. |
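A minimal sketch of that check, assuming the agent container is named `rancher-agent` and using the server URL that appears earlier in the thread:

```shell
#!/bin/sh
# Curl rancher-server from inside the rancher-agent container and
# interpret the HTTP status code. URL and container name are assumptions.
interpret() {
  # 000 means curl never got an HTTP response (DNS/route/TLS failure).
  if [ "$1" = "000" ]; then echo unreachable; else echo "reachable (HTTP $1)"; fi
}
code=$(docker exec rancher-agent \
  curl -sk -o /dev/null -w '%{http_code}' https://rancher.domain.tld/v1)
interpret "$code"
```

Anything other than "unreachable" at least rules out basic DNS/routing problems from the agent's point of view.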
@lenovouser I don't think you mentioned whether or not you have any LBs, proxies, or other network devices between rancher-server and this node. Also, what do your iptables look like on this node?
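For reference, dumping the rules on the failing node might look like this (nothing Rancher-specific, just plain iptables):

```shell
# On the failing node: list all rules with packet counters, then the
# NAT table, then look for explicit DROP/REJECT targets that could
# affect the agent's outbound connection to rancher-server.
sudo iptables -L -n -v
sudo iptables -t nat -L -n -v
sudo iptables -S | grep -E 'DROP|REJECT'
```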
@lenovouser also, are you in a cloud-like environment where you have the ability to add new hosts to this cluster relatively easily? You mentioned that another host you provisioned works just fine, so I'm just wondering if you could do that again to see whether the success or the failure is the anomaly.
Well, no. There is nothing really in between except for the NGINX in front of rancher-server.
And yes, failure is the anomaly. I just tried to manually add a DO test droplet and it worked. (The other hosts were added manually too, therefore I chose to do it with the droplet too.)
@lenovouser just want to confirm: this host is not the same node on which rancher-server is running, right? I believe you stated this previously, but wanted to confirm again explicitly. Next, are the host that is failing and the ones that are succeeding similar in that they are in the same cloud provider and are part of all the same security groups, private networks, etc.? Finally, you could try the nuclear option:
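The exact commands suggested here did not survive in this copy of the thread, but for Rancher 1.x-era agents a full cleanup typically looked something like the following; treat it as a destructive sketch and double-check the paths against your install before running it:

```shell
# DESTRUCTIVE sketch: removes every container on the host plus the
# agent's registration state. Paths are typical for Rancher 1.x and
# are not verified against this exact setup.
docker ps -aq | xargs -r docker rm -f -v
sudo rm -rf /var/lib/rancher/state
```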
@cjellick Yes, this is not the same node, and yes, they are at the same provider, same firewall and network groups, just different physical servers in different datacenters. Okay, I just did it and it is not connecting again, even though I want to mention that the reconnect process was still running on the master. I am going to purge the master now again and do a fresh install to see if that helps. EDIT: Okay, I just deleted the server and all agents. Then I ran:
Any updates?
"Okay, I just deleted the server and all agents. Then I ran |
Any updates?
That's all you should need to do to purge the droplet before it's added to the server. When you do a fresh install of the server, are you still using v0.51.0 or are you doing v0.56.1?
@deniseschannon I now did the fresh install and pulled from Docker Hub before that, so we are running the latest version. There is a completely different error now, which is very weird... It just spams the console with this when I run it:
This also happens on my verification DO droplet, but only once, and then it connects. On the problematic host it just keeps spamming that error.
Tried it again, even though all the files were purged and it even pulled the image fresh. So I absolutely do not get where the error is coming from.
Any updates?
It's still just the single problematic host, correct? We're working hard on finishing up the last couple of features that will be in GA before we'll have time to concentrate on fixing as many bugs as possible. We'll try to look at it as soon as possible.
Okay, cool. Yes, just the single host.
This is now fixed.
As stated above, I don't know what was causing this. Registered another host and it worked perfectly fine.
Logs: