
Agent keeps failing and host state stucks at "Reconnecting" #2196

Closed
cusspvz opened this issue Sep 30, 2015 · 93 comments

Labels
area/agent: Issues that deal with the Rancher Agent
kind/bug: Issues that are defects reported by users or that we know have reached a real release
kind/question: Issues that just require an answer. No code change needed.

Comments

cusspvz commented Sep 30, 2015

Agent keeps failing and the host state gets stuck at "Reconnecting".
When this occurs, Rancher fails all subsequent deploys.

Here are the failing rancher-agent logs:

INFO: Sending host-routes applied 93-43fd6287135af077f5dffa7a2b5721a16cf12fe8b28a9e1bf522b35d19a80ea8

Traceback (most recent call last):
  File "/var/lib/cattle/pyagent/cattle/utils.py", line 280, in get_command_output
    return check_output(*args, **kw)
  File "/var/lib/cattle/pyagent/cattle/utils.py", line 337, in check_output
    raise e1
CalledProcessError: Command '['/var/lib/cattle/config.sh', u'host-iptables', u'host-routes']' returned non-zero exit status 28
2015-09-30 21:48:53,013 ERROR agent [140334277124176] [event.py:111] 08df7e0f-9a5a-4d5b-9871-69a3b1c40d06 : Unknown error
Traceback (most recent call last):
  File "/var/lib/cattle/pyagent/cattle/agent/event.py", line 94, in _worker_main
    resp = agent.execute(req)
  File "/var/lib/cattle/pyagent/cattle/agent/__init__.py", line 15, in execute
    return self._router.route(req)
  File "/var/lib/cattle/pyagent/cattle/plugins/core/event_router.py", line 13, in route
    resp = handler.execute(req)
  File "/var/lib/cattle/pyagent/cattle/agent/handler.py", line 34, in execute
    return method(req=req, **req.data.__dict__)
  File "/var/lib/cattle/pyagent/cattle/compute/__init__.py", line 67, in instance_remove
    action=lambda: self._do_instance_remove(instance, host, progress)
  File "/var/lib/cattle/pyagent/cattle/agent/handler.py", line 72, in _do
    action()
  File "/var/lib/cattle/pyagent/cattle/compute/__init__.py", line 67, in <lambda>
    action=lambda: self._do_instance_remove(instance, host, progress)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 813, in _do_instance_remove
    remove_container(client, container)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/util.py", line 71, in remove_container
    raise e
APIError: 500 Server Error: Internal Server Error ("Cannot destroy container 627e31b591fa3c7a94f8308baaa5a0f3aacab7d90ebc01c106bfcd85a611315b: Driver aufs failed to remove root filesystem 627e31b591fa3c7a94f8308baaa5a0f3aacab7d90ebc01c106bfcd85a611315b: rename /var/lib/docker/aufs/mnt/627e31b591fa3c7a94f8308baaa5a0f3aacab7d90ebc01c106bfcd85a611315b /var/lib/docker/aufs/mnt/627e31b591fa3c7a94f8308baaa5a0f3aacab7d90ebc01c106bfcd85a611315b-removing: device or resource busy")
2015-09-30 21:50:39,141 ERROR docker [140334277082960] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:50:54,612 ERROR docker [140334277082000] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:50:57,136 ERROR docker [140334277081840] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:50:59,138 ERROR docker [140334277126096] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:01,142 ERROR docker [140334277079920] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:03,139 ERROR docker [140334277079600] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:05,142 ERROR docker [140334277913968] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:07,145 ERROR docker [140334277125136] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:10,156 ERROR docker [140334277913968] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:12,153 ERROR docker [140334277126416] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:15,155 ERROR docker [140334277124816] [delegate.py:104] Can not call [c85d17a3-a134-4322-8823-54a185cd5eaf], container is not running
2015-09-30 21:51:17,124 ERROR agent [140334277913168] [event.py:201] Max of [10] dropped [ping] requests exceeded
I0930 21:51:18.745278 02094 manager.go:936] Exiting thread watching subcontainers
I0930 21:51:18.745579 02094 manager.go:313] Exiting global housekeeping thread
I0930 21:51:18.745658 02094 cadvisor.go:124] Exiting given signal: terminated

cusspvz commented Sep 30, 2015

cc @fernandoneto, it would be awesome if you could follow up on this issue. This relates to the issue I was talking about this afternoon.

Rucknar commented Oct 1, 2015

@cusspvz I've seen similar symptoms in the past on slightly older builds.
I remedied it by removing the Rancher agent from the host via the command line (not the agent-instance stuff, just the agent). Then I re-ran the script that Rancher gives you when adding an 'other' host type.

For me, the server then re-connected in the GUI. Not sure if that's of any help.
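
Roughly, that looks something like this on the host (the agent container name and the registration URL/token below are placeholders for your own setup):

# remove the existing agent container (leave the agent-instance stuff alone)
sudo docker rm -f rancher-agent

# re-run the registration command Rancher shows when adding an 'other'/custom host
sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent:v0.8.2 \
  http://<rancher-server>:8080/v1/scripts/<REGISTRATION_TOKEN>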

cusspvz commented Oct 1, 2015

@Rucknar could you please point me to the most stable build you've found so far?

Rucknar commented Oct 1, 2015

We did some upgrades and NFO testing recently on 38/39 and they seemed to work fine without issues.
If you map the database out to the host file system then it's really easy to upgrade the server versions without losing config.
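
For example, something like this (the host path and tag are illustrative, and I'm assuming the embedded MySQL data lives at /var/lib/mysql inside the rancher/server container):

# keep the embedded DB on the host filesystem so the server container is disposable
sudo docker run -d --restart=always -p 8080:8080 \
  -v /opt/rancher/mysql:/var/lib/mysql \
  rancher/server:v0.39.0

# upgrading is then just stopping this container and starting a newer tag
# against the same /opt/rancher/mysql directory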

cusspvz commented Oct 1, 2015

I have the rancher-server data covered; my issue is being able to start using it in production. I will try a downgrade.

cusspvz commented Oct 1, 2015

Do you know what's happening behind the logs? If so, could you please explain? I would be glad to help in some way.

Rucknar commented Oct 1, 2015

Sorry, can't help there. Just thought I'd offer a fix which has helped me in the past.

cusspvz commented Oct 1, 2015

Trying it with v0.40.0-rc1.

@deniseschannon

@cusspvz If you are looking for "stable" versions, please don't use anything tagged as "rc". With the "rc" tag, they are still going through the QA process. There are some definite bugs in v0.40.0-rc1.

@cusspvz @fernandoneto, can you tell me what you were doing before you saw this error? Were you upgrading from one version to another?

cusspvz commented Oct 1, 2015

@deniseschannon I've created a server using the 0.40.0 release candidate with my 0.39.0 data, just to check what was fixed and to help you detect issues as soon as possible. :)

So far, this doesn't seem to affect 0.40.0-rc1. On my 0.39.0 server I'm constantly swapping hosts since they don't recover after failures/reboots.

cusspvz commented Oct 1, 2015

BTW, I'm enjoying my Rancher experience so much that I'm running multiple versions locally and on Azure, trying to spot the most stable one to start deploying our services to production with. :D

@deniseschannon

When the host fails/reboots, are you sure that the IP is the same on the machine? That could cause the "Reconnecting" issue as Rancher server wouldn't know that the IP of the machine has changed.

@deniseschannon

If you continue to see this issue, you could try to re-register the agent on the Reconnecting host and see if the IP changed?

cusspvz commented Oct 2, 2015

When the host fails/reboots, are you sure that the IP is the same on the machine?

Azure locks a Virtual IP Address for each cloud (you can think of a cloud as a SOHO router), meaning that even if you reboot your machine, the IP remains the same until you delete your cloud.

If you continue to see this issue, you could try to re-register the agent on the Reconnecting host and see if the IP changed?

@Rucknar gave me that trick; I tried it, but it didn't work either.

@deniseschannon

@cusspvz What version are you running and are you still seeing this issue?

cusspvz commented Oct 7, 2015

Still happening. When I reported this, it was with 0.39.0. I've had servers running 0.40.0 since yesterday and it happened again:

(screenshot: 2015-10-07 01:08:01)

(screenshot: 2015-10-07 01:08:31)

Rucknar commented Oct 7, 2015

Oddly, I'm now actually seeing this with 0.40.0. We use an external DB with our Rancher setup, and I'm wondering if I've done one too many upgrades to RCs etc. I'm going to rebuild the server/hosts from scratch; currently just waiting on RancherOS 0.4.0 to be released before I begin.

cusspvz commented Oct 7, 2015

@Rucknar I've already put a lot of effort into trying to find a pattern. On 0.39.0 I tried rebuilding the entire server stack, thinking it was caused by the RC upgrades and subsequent downgrades. I did a fresh install of rancher-server with a new server cluster. I also tried your tip of restarting the agent containers; all of these led me to the same issue.

The oddest thing I've seen was agents logging failed attempts to reach the rancher-server API: the server was up, the related host could access it, and so could the agent container, but the agent's logs said the opposite.

Rucknar commented Oct 7, 2015

Interesting, sounds similar. It seems the hosts go into a 'reconnecting' state when running large compose files into the environment. We're working with physical hosts here and our Rancher server is running on a VM, all with RancherOS as the base. This can't be affecting many people, so there must be something about our setups; what kind of setup are you running?

cusspvz commented Oct 7, 2015

Currently, all of them are running on top of Azure VMs. I think it might be related to the agent's ability to connect to the server, maybe with curl or code that calls curl somehow, because I remember taking the URL the agent was failing to reach and fetching it with wget, which worked.
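
If it is curl-specific, something like this from the affected host (or from inside the agent container) should show the difference; the URL is a placeholder for whatever endpoint shows up in the agent logs, and it may need the agent's access/secret key as basic auth:

URL=http://<rancher-server>:8080/v1/configcontent/host-routes   # placeholder
curl -sv --max-time 10 "$URL" -o /dev/null ; echo "curl exit: $?"
wget -q -O /dev/null "$URL" ; echo "wget exit: $?"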

cusspvz commented Oct 8, 2015

It seems I can now gather some data around this, since all hosts got stuck in "Reconnecting".

cusspvz commented Oct 8, 2015

In the agent logs, this pattern keeps repeating:

INFO: Updating host-routes
INFO: Downloading http://**host_omitted**/v1//configcontent//host-routes current=host-routes-180-43fd6287135af077f5dffa7a2b5721a16cf12fe8b28a9e1bf522b35d19a80ea8

Traceback (most recent call last):
  File "/var/lib/cattle/pyagent/cattle/utils.py", line 280, in get_command_output
    return check_output(*args, **kw)
  File "/var/lib/cattle/pyagent/cattle/utils.py", line 337, in check_output
    raise e1
CalledProcessError: Command '['/var/lib/cattle/config.sh', u'host-routes']' returned non-zero exit status 28
2015-10-07 23:54:52,092 ERROR cattle [140119070288304] [utils.py:284] Failed to call (['/var/lib/cattle/config.sh', u'host-routes'],) {'stderr': -2, 'cwd': '/var/lib/cattle', 'env': {'AGENT_PARENT_PID': '29536', 'CATTLE_CONFIG_URL': 'http://**host_omitted**/v1', 'CATTLE_AGENT_PIDNS': 'host', 'CATTLE_URL': 'http://**host_omitted**/v1', 'SHLVL': '1', 'OLDPWD': '/tmp', 'HOSTNAME': 'jc-cd-3hj4', 'PWD': '/var/lib/cattle/pyagent', 'CATTLE_STORAGE_URL': 'http://**host_omitted**/v1', 'CATTLE_ACCESS_KEY': '3DD34BB19364186841C4', 'CATTLE_AGENT_LOG_FILE': '/var/log/rancher/agent.log', 'CATTLE_AGENT_IP': '**IP_omitted**', 'CATTLE_STATE_DIR': '/var/lib/rancher/state', 'CATTLE_SECRET_KEY': 'fWEQ1h3daEwUbwRUWvfC3WdtqNb42rYEeyTWXKc4', 'CATTLE_SYSTEMD': 'false', 'RANCHER_AGENT_IMAGE': 'rancher/agent:v0.8.2', 'HOME': '/root', 'CATTLE_CADVISOR_WRAPPER': 'cadvisor.sh', 'CATTLE_PHYSICAL_HOST_UUID': '344ed948-125a-432b-96da-64af83e987ac', 'CATTLE_HOME': '/var/lib/cattle', 'PATH': '/var/lib/cattle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'}}, exit [28], output :
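
For what it's worth, if config.sh is fetching these bundles with curl (the "Downloading http://..." lines suggest a plain HTTP download, but I haven't verified that), exit status 28 would match curl's "operation timed out" code, which is easy to confirm against a deliberately unreachable address:

# curl returns 28 when a transfer hits its time limit
curl --max-time 1 http://10.255.255.1/ ; echo "exit: $?"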

cusspvz commented Oct 8, 2015

After restarting the agent container, the agent logs keep repeating this pattern:

INFO: Updating configscripts
INFO: Downloading http://**host_omitted**/v1//configcontent//configscripts current=
INFO: Starting agent for 3DD34BB19364186841C4
INFO: Access Key: 3DD34BB19364186841C4
INFO: Config URL: http://**host_omitted**/v1
INFO: Storage URL: http://**host_omitted**/v1
INFO: API URL: http://**host_omitted**/v1
INFO: IP: **IP_omitted**
INFO: Port:
INFO: Required Image: rancher/agent:v0.8.2
INFO: Current Image: rancher/agent:v0.8.2
INFO: Using image rancher/agent:v0.8.2
INFO: Downloading agent http://**host_omitted**/v1/configcontent/configscripts

cusspvz commented Oct 8, 2015

Restarted the agent-instance container; it keeps looping:

INFO: Downloading agent http://**host_omitted**/v1/configcontent/configscripts

INFO: Updating configscripts
INFO: Downloading http://**host_omitted**/v1//configcontent//configscripts current=
INFO: Running /var/lib/cattle/download/configscripts/configscripts-1-f5763391fb7914dcd14001f29ffe28d1167a3bfcc0ee0ec05d2ca9c722103c02/apply.sh
INFO: Sending configscripts applied 1-f5763391fb7914dcd14001f29ffe28d1167a3bfcc0ee0ec05d2ca9c722103c02
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot

I don't know if a reboot inside the container is supposed to reboot the host instead; if it is, that isn't happening. The agent-instance just stops. Here's agent-instance as shown by docker ps -a:

d3123d8dcfaa        rancher/agent-instance:v0.4.1   "/etc/init.d/agent-i   6 days ago          Exited (129) About a minute ago                       dca6c43a-b46a-4b8c-8b52-17ccdcd1aeca

cusspvz commented Oct 8, 2015

Going to test whether the issue reported in #2207 occurs with 0.40.0.

Rucknar commented Oct 8, 2015

Try removing the agent completely and reinstalling via the run command you get through Add Hosts. It looks like your agent can't update for whatever reason; likely different from my issue.

cusspvz commented Oct 8, 2015

I've assigned @fernandoneto to take care of this issue with Rancher. Thanks @Rucknar

deniseschannon added the kind/bug label on Oct 9, 2015

cusspvz commented Dec 9, 2015

@ibuildthecloud do you have any updates regarding this?

rokka-n commented Dec 17, 2015

#2335

@cusspvz, I'm running it from this branch:

  • 8527c45 - (HEAD -> master, origin/master, origin/HEAD) Update README.md (2 days ago)

From rancher server:
[rancher@rancher ~]$ rancherctl --version
rancherctl version v0.4.1
[rancher@rancher ~]$ docker --version
Docker version 1.9.1-rc1, build 4663423

From agent's API info:

"osInfo": {
"kernelVersion": "4.2.3-rancher",
"operatingSystem": "RancherOS (containerized)",
"dockerVersion": "Docker version 1.9.1-rc1, build 4663423",
},

To reproduce:
Install VirtualBox (version 5.0.10 r104061) on a Mac, and install Vagrant 1.7.4.
Check out the rancher repo.
Change $number_of_nodes = 3 in the Vagrantfile.

Run vagrant up, try to deploy an app, and watch the agents/server go into a reconnecting state after some time.
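
Roughly, assuming the main rancher/rancher repo (adjust the clone URL and sed invocation for your platform):

git clone https://github.com/rancher/rancher.git && cd rancher
sed -i '' 's/\$number_of_nodes = .*/$number_of_nodes = 3/' Vagrantfile   # macOS sed; drop the '' on Linux
vagrant up
# then deploy an app from the UI and wait for the hosts to flip to "Reconnecting"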

rokka-n commented Dec 17, 2015

bash-3.2$ git status
On branch v0.50.1

Still broken :(

@roynasser

I'm having the same/similar issue, I believe... in my case the hosts still appear up, and from SSH inside the host I can reach the Rancher API using 127.0.0.1 but not the "host IP"; it seems like something is causing it to lose it... maybe something vbox related :/ (I'm on the same vbox version and vagrant version... it's happening without changing nodes to 3)...

@roynasser

Just as a heads-up, I installed rancher-server and rancher-agent on a single CoreOS node and it seems to be working fine...

I'm not sure if it's something in the Vagrant setup shipped with Rancher, or with RancherOS... one of the two for sure...

rokka-n commented Jan 5, 2016

Still broken in v0.52.0-rc3
Ay-ay-ay-yay :(

@roynasser

Yes, testing again, still broken... it works until you start (or try to start) any container or service from the catalog... even starting a ubuntu image will cause the whole thing to collapse... vagrant ssh still works, but the rancher-server and rancher-01 machines brought up by vagrant up are inaccessible... so still no go on my side too... Tried on 5 machines... tried removing everything (including vbox and vagrant) and starting from scratch... no go...

@roynasser

@rokka-n I was able to get it working using another OS instead of RancherOS, so I'm unsure where the issues lie...

I'm going to be working with CoreOS from now on, so I'll be trying my hand at getting things working well on that...

rokka-n commented Jan 6, 2016

@RVN-BR Did you manage to make it completely automatic with Vagrant?
Maybe the Vagrant config does something weird? You can see how the networking gets screwed up when services are provisioned by simply pinging the host IP from itself.

Edit: running the Rancher server on Ubuntu and using RancherOS as a host works just fine.
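
Something like this from inside the box shows it (the machine name and IP are placeholders; use whatever the Vagrantfile assigns and the IP Rancher reports for the host):

vagrant ssh rancher-01 -c 'ping -c 3 <host-ip-as-shown-in-the-rancher-ui>'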

@roynasser

@rokka-n No... unfortunately not completely automatic...

I made a "unit file" with the agent setup, which I deployed through fleetctl to all the CoreOS nodes, and ran the Rancher server "by hand" on one of the nodes... it's not ideal, and definitely not "HA"... but it worked for my tests... I will try doing something a bit easier...

I'd venture into a CoreOS+Rancher repo, but Rancher is moving quite fast so I'm not sure it would be too useful :(

If you'd like I can post my unit file... (I actually think I lost it, but it shouldn't take long to whip it up... and share it with you...)

cusspvz commented Jan 6, 2016

I think this issue could be related to high CPU usage on Rancher's server side, making rancher-server unable to respond to some ping requests and leading agents to lose contact. Could any of you please check your rancher-server CPU loads?
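
Something like this on the machine running rancher-server should be enough (the container name is whatever you gave it when starting the server):

# live CPU/memory for the server container
docker stats rancher-server

# or the overall load on the machine running it
uptime && top -bn1 | head -20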

@roynasser

Hello @cusspvz, I will check on your suspicions, but I'm pretty sure that isn't really the case... It's not really some ping requests being dropped... The IP becomes unreachable... SSHing in and connecting locally to the webserver/API/etc. all continues working...

I'll run some more tests and try to report back... What is the best way to obtain the relevant logs? Would it be docker logs of the rancher-server container from within the Rancher machine, or stderr of the machine itself running the containers? I'm unsure...

@roynasser

Hi @cusspvz & @rokka-n, no dice... I don't think it's CPU related...

I followed these steps:

  • Started a Rancher server from the git repo (server + 1 agent)...
  • Deployed a container manually on the rancher-01 server (agent)... As it initializes, it goes into eternal "reconnecting"...
  • Launched another VM, running CoreOS this time (using alpha, and the CATTLE_AGENT_IP setting)...
  • Launched the same ubuntu container on this new host... All is well apparently...

Monitored CPU throughout on the servers and didn't see anything abnormal... All VMs have 4 GB RAM and the same CPU config.

(CPU monitoring screenshot)

rokka-n commented Jan 6, 2016

@RVN-BR yeah, I don't think it's related to CPU.

It would be nice if the Rancher folks gave some priority to this.

I dunno... the whole thing looks pretty much broken to me. Not sure how other people use it without being able to develop and run a full stack locally.

@roynasser

I'm trying to get it going on CoreOS... I'll post here when I have a working setup... I'm really keen on getting this going too... at least there's 2 of us :)

rokka-n commented Jan 6, 2016

I used this repo to launch agent hosts with RancherOS and then register them to a rancher-server running on Ubuntu. I didn't see any issues with connectivity, but there is some work required to make provisioning fully automatic.

https://github.com/ailispaw/rancheros-iso-box

@roynasser

I'll take a look for inspiration... I don't really want to get into RancherOS... I don't see any benefits over CoreOS... plus too much vendor lock-in for not much benefit?

Anyway, I'm almost set up with a unit file that can run rancher-agents on a cluster of CoreOS machines... should work well if I get through some bumps... then a single unit (or a single docker command) to run the server and it should be "all set"...

sshipway commented Jan 6, 2016

We now have a working Rancher + CoreOS setup. Here's what I discovered in the process --

  • Using Rancher 0.51.0 currently
  • We have a standard VMWare CoreOS image that we clone from, and some ISO images holding the YAML cloud-config for each environment which starts up everything
  • CoreOS 877.1.0 (beta, Docker 1.9.1) works. 766.4.0 (Docker 1.7.1) will work with the agent, but Convoy fails and there is the hanging bug mentioned previously. I have tried with later betas, but other beta versions seem to have problems.
  • The CPU/Memory graphs started working at about Rancher 0.47? and now seem fine.
  • Our hosts are 4x3.3GHz CPU and 16GB memory, though I see no reason why they could not be smaller
  • I can let you have a copy of our cloud config if you want; it wipes /var/lib/docker on reboot (using a recreated btrfs on sdb), mounts central shared NFS storage, and auto registers the agent with the Rancher server, but needs to have the Environment registration URL coded into it (hence one cloud-config per environment).
  • CoreOS has a rather small /var/lib by default, so it is necessary to run a cleanup job, and have an extended filesystem (either on a separate disk as we do, or just grow /dev/sda)

This all works well for us, and I can deploy a VMware CoreOS template with the cloud-config and have a new host in the cluster in less than a minute. I have not managed to make the Rancher VMware "add host" vSphere integration work though, nor have I been able to make RancherOS work properly under VMware.

@roynasser

Hi @sshipway, I'd be interested in looking at what you have if you don't mind sharing...

I'm mostly just testing for now... I got the Rancher server working on its own using RancherOS, and I'm able to use a simple unit file to get the entire CoreOS cluster registered as agents... It seems to work for testing so far, but I'm going to need an HA solution if this will ever hit production... and I'd like to see this run on Ubuntu or CoreOS; I'm not really that keen on RancherOS for now... who knows, one day... for now it seems like more hassle than good...

Why is it that you are using the NFS storage? Is it just for the initial provisioning of the CoreOS node? Also, have you tried benchmarking a bare-metal CoreOS install for the server? We have some hosts of different sizes and currently use VMware too... we are contemplating whether to try bare metal with CoreOS or some other Docker hosting OS and something like Rancher to arrange things around...

One thing I'd like to see in Rancher is a "bird's-eye" view of hosts/containers, so we can see where particular hosts are getting more traffic / particular containers are starting to strain, etc... and in future maybe some autoscaling based on this? But even just as a way of having a good view of the entire infrastructure...

It would be bridging the gap to DC/OS and Kubernetes, I guess... but it just may be the way Rancher Labs gets its attention? :p

sshipway commented Jan 6, 2016

We're using NFS to provide shared persistent storage for containers; since we have 6 hosts, and we don't know where the container will be started, we need to have some sort of shared filesystem. This way, /mnt/docker is the same on all hosts, so a container with -v /mnt/docker/mystack:/data will start the same anywhere. Web stacks can mount the same path from all instances, persistent containers can start from anywhere. Nowadays, you can probably use convoy-nfs to the same effect, but I've not yet been able to make this work properly :(.
This is also handy for backups because the NFS server does the backups of the /mnt/docker filesystem centrally.
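
As a concrete sketch of the pattern (the image and path are just illustrative):

# any host can run this because /mnt/docker is the same NFS export everywhere
docker run -d --name mystack \
  -v /mnt/docker/mystack:/data \
  busybox sh -c 'while true; do date >> /data/heartbeat; sleep 60; done'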

CoreOS works well with VMware because they provide a VMware image that you can just drop in. I've not tried bare metal as we are pretty much entirely virtualised here, so it wasn't really an option, and VMware makes things much simpler for scaling and reprovisioning.

There is definitely a good idea in having a dynamic scheduling container that hooks into the Rancher API and monitors the balance over the hosts, migrating an instance between hosts to preserve the balance. However there are a lot of problems with this, eg keeping data integrity and availability with single-instance containers, scheduling rules that prevent balancing, and that you cannot force a new instance to start where you want - and containers that expose ports should maybe not be migrated... etc etc

sshipway commented Jan 6, 2016

Here is the cloud-config YAML that we use with CoreOS to make a Rancher host. Note that this is sanitised for keys etc. Note that we have DHCP on the subnet. The cloud-config.yaml is bundled up into an ISO which is loaded on the VMware farm, and the VMware template uses the CoreOS image plus this ISO mounted to the CDRom, correct network connection, and an additional 16G disk which is used for /var/lib/docker

#cloud-config

---
write-files:
  - path: /etc/conf.d/nfs
    permissions: '0644'
    content: |
      OPTS_RPC_MOUNTD=""
  - path: "/etc/resolv.conf"
    permissions: "0644"
    owner: "root"
    content: |
      nameserver 130.216.191.1
      nameserver 130.216.190.1
      domain container.auckland.ac.nz 
      search container.auckland.ac.nz auckland.ac.nz
  - path: /etc/systemd/timesyncd.conf
    content: |
      [Time]
      NTP=ntp1.pool.ntp.auckland.ac.nz ntp2.pool.ntp.auckland.ac.nz  ntp3.pool.ntp.auckland.ac.nz  ntp4.pool.ntp.auckland.ac.nz
hostname: cow
users:
  - name: rancher
    ssh-authorized-keys:
      - "ssh-rsa NotTheRealKeyAAAAB3NzaC1yc2EAAAABJQAAAQEAp19j4j+fmriejut8hvcy7j09fyug8590hjfo8rnnnnnnnnjut589067u9678nyhujgkfvjimpjhitophvjupfdklh;jklfjh;lkfdjhlkfd;jhlkfdjglk;hjgkld;/WIpR5Qi5pT24fdsayhr9j4fy56f4rjd98y4r7htr7498wf75j9dy7e9yfrythdjf74eodyt7hoth7oyt7fPRl9SPnlDfSbTghufoiwhncuoisgisuoghouchiureohguicoghuioghiurhgiueohgoichgoiurhg87967t563754ytoityriuo0ihvjug8teojf6w== rancher-20151102"
    passwd: "$6$NotTheRealPasswdfhre7y547tv5987ryt87e09jfu5t90ynu60kfu8tg9r0jtgv58trnyb8u5ft9pkut58p985geoyu86ofeku85gyyuf8pyug8pon"  
ssh_authorized_keys:
  - "ssh-rsa NotTheRealKeyAAAAB3NzaC1yc2EAAAABJQAAAQEAp19j4j+fmriejut8hvcy7j09fyug8590hjfo8rnnnnnnnnjut589067u9678nyhujgkfvjimpjhitophvjupfdklh;jklfjh;lkfdjhlkfd;jhlkfdjglk;hjgkld;/WIpR5Qi5pT24fdsayhr9j4fy56f4rjd98y4r7htr7498wf75j9dy7e9yfrythdjf74eodyt7hoth7oyt7fPRl9SPnlDfSbTghufoiwhncuoisgisuoghouchiureohguicoghuioghiurhgiueohgoichgoiurhg87967t563754ytoityriuo0ihvjug8teojf6w== rancher-20151102"
timezone: "Pacific/Auckland"  
coreos:
  units:
    - name: format-ephemeral.service
      command: start
      content: |
        [Unit]
        Description=Formats the ephemeral drive
        After=dev-sdb.device
        Requires=dev-sdb.device
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/sbin/wipefs -f /dev/sdb
        ExecStart=/usr/sbin/mkfs.btrfs -f /dev/sdb
    - name: var-lib-docker.mount
      command: start
      content: |
        [Unit]
        Description=Mount ephemeral to /var/lib/docker
        Requires=format-ephemeral.service
        After=format-ephemeral.service
        Before=docker.service
        [Mount]
        What=/dev/sdb
        Where=/var/lib/docker
        Type=btrfs
    - name: rpc-statd.service
      command: start
      enable: true
    - name: opt-docker.mount
      command: start
      content: |
        [Unit]
        Description=Mount persistent to /opt/docker
        Before=docker.service
        [Mount]
        What=stg.container.auckland.ac.nz:/mnt/docker/persist
        Where=/opt/docker
        Type=nfs
    - name: docker.service
      command: start
    - name: join-ranch.service
      command: start
      content: |
        [Unit]
        Description=Start the Rancher agent
        After=docker.service
        Requires=docker.service
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/bin/docker run -e CATTLE_HOST_LABELS='HERD=DEFAULT' -d --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:v0.8.2 https://rancher.container.auckland.ac.nz/v1/scripts/6598426574925745498564738938745:NotTheRealKeyIUdr3rrw53cr4W8
  update:
    group: alpha
    reboot-strategy: off

@roynasser

Cool... gotcha on the NFS... We had a bad experience a long time ago with NFS and have tended to stay away from it, using other things like Gluster for a period, and then moving away from "centralized storage" altogether except for some very specific needs which run on a limited number of hosts... I will probably look into something with Convoy or Gluster in the near future, though, as we look to containerize persistent applications such as DBs, etc... (but that will be an entirely new chapter)...

Thanks for sharing the file... The Rancher unit you are using is very similar to what I used; the only thing is I added a CATTLE_AGENT_IP env variable with the CoreOS node IP, as I was getting the same IP reported in Rancher for all nodes.
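
For reference, it amounts to one extra -e flag on the agent run command from the cloud-config above (server URL and token are placeholders):

docker run -d --privileged \
  -e CATTLE_AGENT_IP=<this-node-ip> \
  -e CATTLE_HOST_LABELS='HERD=DEFAULT' \
  -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent:v0.8.2 \
  https://<rancher-server>/v1/scripts/<REGISTRATION_TOKEN>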

@deniseschannon

@cusspvz @fernandoneto Are you still facing this issue? We've changed the networking in Rancher as of v0.56.0. Could you please re-test with the latest Rancher and open a new ticket if you are still facing the issue?
