Agent keeps failing and host state gets stuck at "Reconnecting" #2196
Comments
cc @fernandoneto, it would be awesome if you could follow up on this issue. This relates to the issue I was talking about this afternoon.
@cusspvz I've seen similar symptoms in the past on slightly older builds. For me, the server then re-connected in the GUI. Not sure if that's of any help.
@Rucknar could you please point me to the most stable build you've found so far?
We did some upgrades and NFO testing recently on 38/39 and they seemed to work fine without issues.
I have rancher-server data covered; my issue is being able to start using it in production. I will try a downgrade.
Do you know what's happening behind the logs? If so, could you please explain? I would be glad to help in some way.
Sorry, can't help there. Just thought I'd offer a fix which has helped me in the past.
Trying up with …
@cusspvz If you are looking for "stable" versions, please don't use anything tagged as "rc"; with the "rc" tag, they are still going through the QA process. There are some definite bugs in v0.40.0-rc1. @cusspvz @fernandoneto, can you tell me what you were doing before you saw this error? Were you upgrading from one version to another?
@deniseschannon I've created a server using the 0.40.0 release candidate with my … Until now, this doesn't seem to affect …
BTW, I'm loving my Rancher experience so much that I'm running multiple versions locally and on Azure, trying to spot the most stable one to start deploying it in production with our services. :D
When the host fails/reboots, are you sure that the IP is the same on the machine? That could cause the "Reconnecting" issue, as Rancher server wouldn't know that the IP of the machine has changed.
If you continue to see this issue, you could try to re-register the agent on the Reconnecting host and see if the IP changed.
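In concrete terms, a rough sketch of that check-and-re-register flow, assuming the standard registration command generated by the UI's Add Host screen (the interface name, image tag, server address, and token below are all placeholders; copy the real command from your own UI):

    # On the Reconnecting host: check whether the address Rancher knows is still there
    ip addr show eth0

    # Re-register the host; this is the generic form of the Add Host command
    sudo docker run -d --privileged \
      -v /var/run/docker.sock:/var/run/docker.sock \
      rancher/agent:v0.8.2 http://<server-ip>:8080/v1/scripts/<registration-token>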
Azure locks a Virtual IP Address for each cloud (you can think of a cloud as a SOHO router), meaning that even if you reboot your machine, the IP remains the same until you delete your cloud.
@Rucknar gave me that trick; I tried it, but it didn't work either.
@cusspvz What version are you running and are you still seeing this issue?
Oddly, I'm now actually seeing this with 0.40.0. We use an external DB with our Rancher setup, and I'm wondering if I've done one too many upgrades to RCs etc., so I'm going to rebuild the server/hosts from scratch. I'm currently just waiting on RancherOS 0.4.0 to be released before I begin.
@Rucknar I've already put a lot of effort into trying to find the pattern, on … The oddest thing I've seen was agents that logged their failed attempts to access …
Interesting, sounds similar. It seems the hosts go into a 'reconnecting' state when running large compose files in the environment. We're working with physical hosts here, and our Rancher server is running on a VM, all with RancherOS as the base. This can't be affecting many people, so there must be something about our setups; what kind of setup are you running?
Currently, all of them are running on top of Azure VMs. I think it might be related to …
It seems I can now gather some data on this, since all hosts got stuck in "Reconnecting".
On …
After …
Restarted …
Don't know if …
Going to test if the issue reported in #2207 occurs with …
Try removing the agent completely and reinstalling via the run command you get through Add Hosts. It looks like your agent can't update for whatever reason, likely different from my issue.
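For reference, a rough sketch of a full removal; the container names and state path below are the ones the agent bootstrap conventionally creates, but verify them on your own host before deleting anything:

    # Remove the agent containers (names assumed; check with `sudo docker ps -a`)
    sudo docker rm -f rancher-agent rancher-agent-state

    # Clear the agent's registration state (path assumed)
    sudo rm -rf /var/lib/rancher/state

    # Then re-run the exact registration command from the Add Hosts screen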
I've assigned @fernandoneto to take care of this issue with Rancher. Thanks @Rucknar
@ibuildthecloud do you have any updates regarding this?
@cusspvz, I run it from this branch.
From rancher server: …
From agent's API info: …
To reproduce: run vagrant up, try to deploy an app, and watch agents/server go into a reconnecting state after some time.
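One way to watch for that state change from the CLI, assuming the server's v1 API is reachable on the usual port 8080 (endpoint and port are assumptions about this setup):

    vagrant up

    # After deploying something, poll the host states; "reconnecting" shows up
    # in the state field of each host record
    watch -n 5 "curl -s http://localhost:8080/v1/hosts | grep -o '\"state\":[^,]*'"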
bash-3.2$ git status
Still broken :(
I'm having the same/similar issue, I believe... in my case the hosts still appear up, and over SSH inside the host I can get the Rancher API using 127.0.0.1 but not the "host IP"; it seems like something is causing it to lose it... maybe something vbox related :/ (I'm on the same vbox and vagrant versions... it's happening without changing nodes to 3)...
Just as a heads-up, I installed rancher-server and rancher-agent on a single CoreOS node and it seems to be working fine... I'm not sure if it's something in the Vagrant setup shipped with Rancher, or with RancherOS... one of the two for sure...
Still broken in v0.52.0-rc3
Yes, testing again, still broken... it works until you start (or try to start) any container or service from the catalog... even starting an Ubuntu image will cause the whole thing to collapse... Vagrant ssh still works, but the rancher-server and rancher-01 brought up by vagrant up are inaccessible... so still no go on my side too... Tried on 5 machines... tried removing everything (including vbox and vagrant, and starting from scratch... no go...)
@rokka-n I was able to get it working using another OS instead of RancherOS, so I'm unsure where the issues lie... I'm going to be working with CoreOS from now on, so I'll be trying my hand at getting things working well on that...
@RVN-BR Did you manage to make it completely automatic with Vagrant? Edit: running Rancher server on Ubuntu and using RancherOS as a host works just fine.
@rokka-n No... unfortunately not completely automatic... I made a "unit file" with the agent setup, which I deployed through fleetctl to all the CoreOS nodes, and ran Rancher server "by hand" on one of the nodes... it's not ideal, and definitely not "HA"... but it worked for my tests... I will try doing something a bit easier... I'd venture into a coreos+rancher repo, but Rancher is moving quite fast so I'm not sure it would be too useful :( If you'd like I can post my unit file... (I actually think I lost it, but it shouldn't take long to whip up something along the lines of the sketch below and share it with you...)
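A hypothetical fleet unit along those lines (not the original file; the server URL and registration token are placeholders). The oneshot type reflects that the rancher/agent bootstrap container registers the node and then starts the real agent container itself:

    [Unit]
    Description=Register this CoreOS node as a Rancher host
    After=docker.service
    Requires=docker.service

    [Service]
    Type=oneshot
    RemainAfterExit=true
    # The bootstrap container exits after spawning the real rancher-agent
    ExecStart=/usr/bin/docker run --rm --privileged \
        -v /var/run/docker.sock:/var/run/docker.sock \
        rancher/agent http://rancher-server.example.com:8080/v1/scripts/REGISTRATION_TOKEN

    [X-Fleet]
    # fleet schedules one copy on every machine in the cluster
    Global=true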
I think this issue could be related to high CPU usage on Rancher's server side, making rancher-server unable to respond to some ping requests and leading agents to lose contact. Could any of you please check your rancher-server CPU loads?
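A quick way to check, assuming the server runs as a single rancher/server container on the box (the container name is a placeholder):

    # One-shot CPU/memory snapshot of the server container
    sudo docker stats --no-stream <rancher-server-container>

    # And the overall host load, for comparison
    top -bn1 | head -n 15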
Hello @cusspvz I will check on your suspicions, but I'm pretty sure that isn't really the case... It's not really some ping requests being dropped... the IP becomes incommunicable... SSHing and connecting locally to the webserver/API/etc. all continue working... I'll make some more tests and try to report... What is the best way to obtain the relevant logs? Would it be docker logs of the rancher-server container from within the rancher machine, or stderr of the machine itself running the containers? I'm unsure...
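For what it's worth, the server writes its logs to the container's stdout, so docker logs from within the machine running rancher/server is a reasonable first stop; a sketch, assuming SSH access to that machine:

    # Find the rancher/server container ID
    sudo docker ps

    # Follow its logs, starting from the last 200 lines
    sudo docker logs -f --tail 200 <container-id>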
Hi @cusspvz & @rokka-n, no dice... I don't think it's CPU related... I followed these steps: …
Monitored CPU throughout on the servers and didn't see anything abnormal... all VMs have 4 GB RAM and the same CPU config.
@RVN-BR Yeah, I don't think it's related to CPU. It would be nice if the Rancher folks gave some priority to this. I dunno... the whole thing looks pretty much broken to me. Not sure how other people use it without being able to develop and run a full stack locally.
I'm trying to get it going on CoreOS... I'll post up here when I have a working setup... I'm really keen on getting this going too... at least there's two of us :)
I used this repo to launch agent hosts with RancherOS and then register them with the rancher-server running on Ubuntu. Didn't see any issues with connectivity, but there is some work required to make provisioning fully automatic.
I'll take a look for inspiration... don't really want to get into RancherOS... don't see any benefits over CoreOS... plus too much vendor lock-in for no real benefit? Anyway, I'm almost set up with a unit file that can run rancher-agents on a cluster of CoreOS machines... should work well if I get through some bumps... then a single unit (or single docker command) to run the server, and it should be "all set"...
We now have a working Rancher + CoreOS setup. Here's what I discovered in the process: …
This all works well for us, and I can deploy a VMware CoreOS template with the cloud-config and have a new host in the cluster in less than a minute. I have not managed to make Rancher's "add host" vSphere integration work, though, nor have I been able to make RancherOS work properly under VMware.
Hi @sshipway, I'd be interested in looking at what you have, if you don't mind sharing... I'm only doing mostly testing for now... I got Rancher server working on its own using RancherOS, and I am able to get a simple unit file to register the entire CoreOS cluster as agents... Seems to work for testing so far, but I'm going to need an HA solution if this will ever hit production... and I'd like to see this run on Ubuntu or CoreOS; I'm not really that keen on RancherOS for now... who knows, one day... for now it seems like more hassle than good...

Why is it that you are using the NFS storage? Is it just for the initial provisioning of the CoreOS node? And also, have you tried benchmarking a bare-metal CoreOS install on the server? We have some hosts of different sizes and currently use VMware too... we are contemplating whether to try bare metal with CoreOS or some other Docker hosting OS, with something like Rancher to arrange things around...

One thing I'd like to see in Rancher is a "birds-eye" view of hosts/containers, so we can see which hosts are getting more traffic, which containers are starting to strain, etc... and in future maybe some autoscaling based on this? But even as just a way of having a good view of the entire infrastructure... It would be bridging the gap from DC/OS and Kubernetes, I guess... but it just may be the way Rancher Labs gets its attention? :p
We're using NFS to provide shared persistent storage for containers; since we have 6 hosts and we don't know where a container will be started, we need some sort of shared filesystem. This way, /mnt/docker is the same on all hosts, so a container finds the same data wherever it lands. CoreOS works well with VMware because they provide a VMware image that you can just drop in. I've not tried bare metal, as we are pretty much entirely virtualised here, so it wasn't really an option - and VMware makes things much simpler for scaling and reprovisioning.

There is definitely a good idea in having a dynamic scheduling container that hooks into the Rancher API, monitors the balance over the hosts, and migrates instances between hosts to preserve the balance. However, there are a lot of problems with this, e.g. keeping data integrity and availability with single-instance containers, scheduling rules that prevent balancing, and the fact that you cannot force a new instance to start where you want - and containers that expose ports should maybe not be migrated... etc.
Here is the cloud-config YAML that we use with CoreOS to make a Rancher host. Note that this is sanitised for keys etc., and that we have DHCP on the subnet. The cloud-config.yaml is bundled up into an ISO which is loaded on the VMware farm, and the VMware template uses the CoreOS image plus this ISO mounted to the CD-ROM, the correct network connection, and an additional 16 GB disk which is used for /var/lib/docker.
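A minimal sketch of a cloud-config along those lines (not the actual file; the hostname, SSH key, server URL, registration token, NFS export, and the /dev/sdb device name are all placeholders):

    #cloud-config
    hostname: rancher-host
    ssh_authorized_keys:
      - ssh-rsa AAAA...placeholder... ops@example.com
    coreos:
      units:
        # Extra 16G disk for the Docker graph; formatted on first boot only
        - name: format-docker-disk.service
          command: start
          content: |
            [Unit]
            Description=Format the extra disk for /var/lib/docker
            ConditionFirstBoot=true
            [Service]
            Type=oneshot
            ExecStart=/usr/sbin/mkfs.ext4 -L docker /dev/sdb
        - name: var-lib-docker.mount
          command: start
          content: |
            [Unit]
            Before=docker.service
            [Mount]
            What=/dev/disk/by-label/docker
            Where=/var/lib/docker
            Type=ext4
        # Shared NFS export so /mnt/docker looks identical on every host
        - name: mnt-docker.mount
          command: start
          content: |
            [Mount]
            What=nfs-server.example.com:/export/docker
            Where=/mnt/docker
            Type=nfs
        # Register with the Rancher server once Docker is up
        - name: rancher-agent.service
          command: start
          content: |
            [Unit]
            After=docker.service var-lib-docker.mount
            Requires=docker.service
            [Service]
            Type=oneshot
            RemainAfterExit=true
            ExecStart=/usr/bin/docker run --rm --privileged \
              -v /var/run/docker.sock:/var/run/docker.sock \
              rancher/agent http://rancher.example.com:8080/v1/scripts/REGISTRATION_TOKEN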
Cool... gotcha on the NFS... We had a bad experience a long time ago with NFS and have tended to stay away from it, using other stuff like Gluster for a period, and then moving altogether away from having "centralized storage", except for some very specific needs which run on a limited number of hosts... I will probably look into something with Convoy or Gluster in the near future, though, as we look to containerize persistent applications such as DBs, etc... (but that will be an entire new chapter)... Thanks for sharing the file... The Rancher unit you are using is very similar to what I used; the only thing is I added a CATTLE_AGENT_IP env variable with the CoreOS node's IP, as I was getting the same IP reported in Rancher for all nodes.
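On CoreOS, one way to wire that up is through the COREOS_PRIVATE_IPV4 variable that CoreOS writes to /etc/environment; a sketch of the [Service] section of a unit like the one above, amended accordingly (server URL and token remain placeholders):

    [Service]
    Type=oneshot
    RemainAfterExit=true
    # CoreOS publishes the node's addresses in /etc/environment
    EnvironmentFile=/etc/environment
    ExecStart=/usr/bin/docker run --rm --privileged \
        -e CATTLE_AGENT_IP=${COREOS_PRIVATE_IPV4} \
        -v /var/run/docker.sock:/var/run/docker.sock \
        rancher/agent http://rancher-server.example.com:8080/v1/scripts/REGISTRATION_TOKEN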
@cusspvz @fernandoneto Are you still facing this issue? We've changed the networking in Rancher as of v0.56.0. Could you please re-test with the latest Rancher and open a new ticket if you are still facing the issue?
Agent keeps failing and the host state gets stuck at "Reconnecting".
When this occurs, Rancher fails all subsequent deploys.
Here are the failing rancher-agent logs: …