
GlusterFS catalog version won't start anymore #3670

Closed
joostliketoast opened this issue Feb 22, 2016 · 29 comments

@joostliketoast

Version:
rancher v0.59.0
cattle v0.148.0
user interface v0.90.0
rancher compose v0.7.2

Steps:

  1. Create GlusterFS stack
  2. Create convoy-gluster stack
  3. Create some volumes
  4. Remove the volumes
  5. Remove the GlusterFS and convoy stacks
  6. Re-create GlusterFS

Results:
GlusterFS stays in a boot loop.

Expected:
A running GlusterFS stack.

EDIT:
Step 7: Create convoy-gluster

This results in the following error:


2/22/2016 2:26:48 PM Waiting for metadata.
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/2522/ns/mnt -F -- /var/lib/docker/aufs/mnt/a6ba41d189c1d3adbf3c8cbdb347ada8f0786910277c3184785a21ccc937441e/var/lib/rancher/convoy-agent/share-mnt --stage2 /var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30 -- /launch volume-agent-glusterfs-internal]"
2/22/2016 2:26:48 PM Waiting for metadata
2/22/2016 2:26:48 PM Registering convoy socket at /var/run/convoy-convoy-gluster.sock
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=info msg="Listening for health checks on 0.0.0.0:10241/healthcheck"
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=info msg="Got: root /var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30"
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=info msg="Got: drivers [glusterfs]"
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=info msg="Got: driver-opts [glusterfs.defaultvolumepool=web_vol glusterfs.servers=glusterfs]"
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=info msg="Launching convoy with args: [--socket=/host/var/run/convoy-convoy-gluster.sock daemon --root=/var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30 --drivers=glusterfs --driver-opts=glusterfs.defaultvolumepool=web_vol --driver-opts=glusterfs.servers=glusterfs]"
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=debug msg="Creating config at /var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30" pkg=daemon
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=debug msg= driver=glusterfs driver_opts=map[glusterfs.servers:glusterfs glusterfs.defaultvolumepool:web_vol] event=init pkg=daemon reason=prepare root="/var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30"
2/22/2016 2:26:48 PM time="2016-02-22T13:26:48Z" level=debug msg="Volume web_vol is being mounted it to /var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30/glusterfs/mounts/web_vol, with option [-t glusterfs]" pkg=util
2/22/2016 2:26:49 PM time="2016-02-22T13:26:49Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:50 PM time="2016-02-22T13:26:50Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:51 PM time="2016-02-22T13:26:51Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:52 PM time="2016-02-22T13:26:52Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:53 PM time="2016-02-22T13:26:53Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:54 PM time="2016-02-22T13:26:54Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:55 PM time="2016-02-22T13:26:55Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:56 PM time="2016-02-22T13:26:56Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:57 PM time="2016-02-22T13:26:57Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:58 PM time="2016-02-22T13:26:58Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:26:59 PM time="2016-02-22T13:26:59Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:27:00 PM time="2016-02-22T13:27:00Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:27:01 PM time="2016-02-22T13:27:01Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:27:02 PM time="2016-02-22T13:27:02Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:27:03 PM time="2016-02-22T13:27:03Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:27:04 PM time="2016-02-22T13:27:04Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
2/22/2016 2:27:04 PM time="2016-02-22T13:27:04Z" level=debug msg="Cleaning up environment..." pkg=daemon
2/22/2016 2:27:04 PM time="2016-02-22T13:27:04Z" level=error msg="Failed to execute: mount [-t glusterfs glusterfs:/web_vol /var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30/glusterfs/mounts/web_vol], output Mount failed. Please check the log file for more details.\n, error exit status 1"
2/22/2016 2:27:04 PM {
2/22/2016 2:27:04 PM     "Error": "Failed to execute: mount [-t glusterfs glusterfs:/web_vol /var/lib/rancher/convoy/convoy-gluster-fc47dbdb-4a9c-4475-84e1-da035f0ede30/glusterfs/mounts/web_vol], output Mount failed. Please check the log file for more details.\n, error exit status 1"
2/22/2016 2:27:04 PM }
2/22/2016 2:27:04 PM time="2016-02-22T13:27:04Z" level=info msg="convoy exited with error: exit status 1"
2/22/2016 2:27:04 PM time="2016-02-22T13:27:04Z" level=info msg=Exiting.
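
For reference, the failing step can be narrowed down by hand from the affected host. A hedged sketch, assuming the glusterfs client tools are installed on the host and using the volume and server names from the log above:

gluster --remote-host=glusterfs peer status          # is a gluster daemon reachable under the service name at all?
gluster --remote-host=glusterfs volume info web_vol  # does the volume exist on the gluster side?
mkdir -p /mnt/test && mount -t glusterfs glusterfs:/web_vol /mnt/test   # the same mount convoy attempts, run manually

If the volume info call already fails, the mount error in the log is just a symptom of the gluster pool never having come up.
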
@deniseschannon

We recently replaced the existing version of GlusterFS, which may have caused this issue.

When did you launch your original GlusterFS and your new one? Also, when did you launch convoy-gluster?

@joostliketoast
Author

I just tried it again: I created a new environment, set up GlusterFS and convoy-gluster, added some volumes, and then deleted the GlusterFS stack and then the convoy-gluster stack.

Then I created a new GlusterFS stack, which seems to end up running all right after restarting itself a couple of times, but convoy-gluster keeps saying:

2/25/2016 9:17:31 PM Waiting for metadata.
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/2197/ns/mnt -F -- /var/lib/docker/aufs/mnt/08c60e5598cc2d655b86472b0fea779c0f6d0f4e3c1fe67d0fda2f49685a3510/var/lib/rancher/convoy-agent/share-mnt --stage2 /var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad -- /launch volume-agent-glusterfs-internal]"
2/25/2016 9:17:31 PM Waiting for metadata
2/25/2016 9:17:31 PM Registering convoy socket at /var/run/convoy-convoy-gluster.sock
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="Listening for health checks on 0.0.0.0:10241/healthcheck"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="Got: driver-opts [glusterfs.defaultvolumepool=web_storage glusterfs.servers=glusterfs]"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="Got: root /var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="Got: drivers [glusterfs]"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="Launching convoy with args: [--socket=/host/var/run/convoy-convoy-gluster.sock daemon --driver-opts=glusterfs.defaultvolumepool=web_storage --driver-opts=glusterfs.servers=glusterfs --root=/var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad --drivers=glusterfs]"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=debug msg="Creating config at /var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad" pkg=daemon
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=debug msg= driver=glusterfs driver_opts=map[glusterfs.defaultvolumepool:web_storage glusterfs.servers:glusterfs] event=init pkg=daemon reason=prepare root="/var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=debug msg="Volume web_storage is being mounted it to /var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad/glusterfs/mounts/web_storage, with option [-t glusterfs]" pkg=util
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=debug msg="Cleaning up environment..." pkg=daemon
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=error msg="Failed to execute: mount [-t glusterfs glusterfs:/web_storage /var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad/glusterfs/mounts/web_storage], output Mount failed. Please check the log file for more details.\n, error exit status 1"
2/25/2016 9:17:31 PM {
2/25/2016 9:17:31 PM     "Error": "Failed to execute: mount [-t glusterfs glusterfs:/web_storage /var/lib/rancher/convoy/convoy-gluster-7709e40d-6754-4e97-ade2-741f7051a2ad/glusterfs/mounts/web_storage], output Mount failed. Please check the log file for more details.\n, error exit status 1"
2/25/2016 9:17:31 PM }
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg="convoy exited with error: exit status 1"
2/25/2016 9:17:31 PM time="2016-02-25T20:17:31Z" level=info msg=Exiting.

Edit:

This is the output from the convoy-gluster storage pool container:


2/25/2016 9:05:13 PM Waiting for metadata.
2/25/2016 9:05:13 PM time="2016-02-25T20:05:13Z" level=info msg="Listening for health checks on 0.0.0.0:10241/healthcheck"
2/25/2016 9:05:13 PM time="2016-02-25T20:05:13Z" level=info msg="Socket file: /host/var/run/convoy-convoy-gluster.sock"
2/25/2016 9:05:13 PM time="2016-02-25T20:05:13Z" level=info msg="Initializing event router" workerCount=10
2/25/2016 9:05:13 PM time="2016-02-25T20:05:13Z" level=info msg="Connection established"
2/25/2016 9:05:18 PM time="2016-02-25T20:05:18Z" level=debug msg="storagepool event [096eefea-1df7-4ce9-a2a2-6d219cd2a5e3 7e61b68b-7a83-4e67-9a27-572e08eac5b0 65c5cbce-71c6-4749-ba88-8860269f62af]"

And from the GlusterFS server:


2/25/2016 8:44:13 PM Waiting for all service containers to start...
2/25/2016 8:48:19 PM Containers are starting...
2/25/2016 8:48:19 PM Waiting for Gluster Daemons to come up
2/25/2016 8:49:02 PM gluster peer probe 10.42.195.55
2/25/2016 8:49:02 PM peer probe: success. Host 10.42.195.55 port 24007 already in peer list
2/25/2016 8:49:03 PM gluster peer probe 10.42.163.174
2/25/2016 8:49:03 PM peer probe: failed: Probe returned with Transport endpoint is not connected
2/25/2016 8:50:26 PM Waiting for all service containers to start...
2/25/2016 8:50:33 PM Containers are starting...
2/25/2016 8:50:33 PM Waiting for Gluster Daemons to come up
2/25/2016 8:51:06 PM gluster peer probe 10.42.195.55
2/25/2016 8:51:06 PM peer probe: success. Host 10.42.195.55 port 24007 already in peer list
2/25/2016 8:51:07 PM gluster peer probe 10.42.163.174
2/25/2016 8:51:07 PM peer probe: success.

Is any more information useful?

@tfiduccia

Rancher version - master
Glusterfs template version - 3.7.5-rancher1
Convoy glusterfs template version - 2.0

I followed the steps above. With the newest build and the latest glusterfs and convoy glusterfs templates, I was not able to reproduce. I did notice that if you don't wait a bit before recreating glusterfs, there are problems; even after the delete, it takes a few minutes for everything to clear out. Please wait for the next build and try again; this should be resolved. If it is not, please reopen and let me know.
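
Before recreating, it can also help to verify on each host that the old state has actually cleared out. A hedged sketch, using paths that appear elsewhere in this thread:

ls /var/lib/rancher/convoy/     # leftover convoy-gluster-<uuid> roots from the old stack
ls /etc/docker/plugins/         # stale convoy .spec plugin files, if any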

@joostliketoast
Author

I think my issue might be related to #2903, seeing as I got the same problem with the gluster convoy.

@deniseschannon

@joostliketoast Yes, if you deployed when we first launched these templates, you might be having these issues. The newer templates address these re-deployment scenarios.

@joostliketoast
Author

@deniseschannon I assumed I was using the latest templates when selecting from the catalog; is there any other way to force updating them? The issue is still here after recreating the two stacks.

@joostliketoast
Author

Also, my GlusterFS stack seems to have a problem getting the server containers up in time:

3/3/2016 12:34:07 PM Waiting for Gluster Daemons to come up
3/3/2016 12:34:37 PM pool list: failed

It only waits 30 seconds, and sometimes it takes longer than that to get a container up. This causes the stack to go into a reboot loop while trying to get all three running.
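
A more tolerant wait loop would retry for several minutes instead of giving up after 30 seconds. A hypothetical sketch, not the actual template script:

# Hypothetical: retry 'gluster pool list' for up to 5 minutes
for i in $(seq 1 30); do
    gluster pool list > /dev/null 2>&1 && break
    echo "Waiting for Gluster Daemons to come up"
    sleep 10
done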

@soundman666

@deniseschannon, @joostliketoast, I have the same problem.

@joostliketoast
Author

From the agent.log, in case it might be useful:

2016-03-03 12:20:05,411 INFO agent [139819748437360] [utils.py:430] Response: {"name": "reply.8410449701303523903", "transitioningProgress": null, "resourceType": null, "resourceId": null, "id": "21a593c3-1bac-4b90-ad17-e8d4b2ad3b29", "transitioningMessage": "Update failed", "time": 1457007605000, "previousNames": ["delegate.request"], "transitioning": "yes", "data": {"name": "config.update.reply", "transitioningProgress": null, "resourceType": "agent", "resourceId": "461", "id": "0588464c-7505-4b39-a598-49ed42dff6b8", "transitioningMessage": "Update failed", "time": 1457007605000, "previousNames": ["config.update"], "transitioning": "yes", "data": {"output": "Lock failed", "exitCode": 122}, "previousIds": ["34b1bbb3-92d9-45bb-8ca8-8a107f7cd81c"]}, "previousIds": ["a0aa1e95-b9f6-4bc6-94ac-0f0a9cdc5d5a"]} [0.00612998008728] seconds

@deniseschannon deniseschannon reopened this Mar 9, 2016
@deniseschannon

@cloudnautique Can you take a look?

@cloudnautique
Contributor

Out of curiosity, what size hosts are these? Are they located within a relatively close network?

@joostliketoast
Author

I think it's related to issue #3750; the screenshot there also has the specs of the hosts.

@pboos

pboos commented Mar 17, 2016

We are experiencing the same problem. It seems like GlusterFS is not starting up correctly (though it shows up as green). convoy-gluster fails at startup with the message further below.

We used the older versions of glusterfs before. We removed them completely and switched to the new version. Maybe that triggers the problem?

We are willing to try out anything you suggest and report back.

Info on the hosts: they are within the same data center (two of them even in the same rack), and they are 4-core machines with 64 GB RAM. Pretty powerful machines. The network is really fast as well; ping between the servers through the VPN is < 1 ms.

On glusterfs_glusterfs-server_1 through _3 we see the following (they show up green).

3/17/2016 1:33:00 PM Waiting for all service containers to start...
3/17/2016 1:33:01 PM Containers are starting...
3/17/2016 1:33:01 PM Waiting for Gluster Daemons to come up
3/17/2016 1:38:48 PM Waiting for all service containers to start...
3/17/2016 1:38:49 PM Containers are starting...
3/17/2016 1:38:49 PM Waiting for Gluster Daemons to come up
3/17/2016 1:55:36 PM gluster peer probe 10.42.162.226
3/17/2016 1:55:36 PM Connection failed. Please check if gluster daemon is operational.
3/17/2016 1:56:37 PM Waiting for all service containers to start...
3/17/2016 1:56:38 PM Containers are starting...
3/17/2016 1:56:38 PM Waiting for Gluster Daemons to come up

Error on the convoy-gluster container convoy-gluster_convoy-gluster_1:

3/17/2016 1:56:28 PM Waiting for metadata.
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/645/ns/mnt -F -- /var/lib/docker/aufs/mnt/d08f1b25cb1d7d119db93d942329f25599f5981b2e6ed65c5b4b7b27f48e424a/var/lib/rancher/convoy-agent/share-mnt --stage2 /var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54 -- /launch volume-agent-glusterfs-internal]"
3/17/2016 1:56:28 PM Waiting for metadata
3/17/2016 1:56:28 PM Registering convoy socket at /var/run/convoy-convoy-gluster.sock
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=info msg="Listening for health checks on 0.0.0.0:10241/healthcheck"
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=info msg="Got: root /var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54"
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=info msg="Got: drivers [glusterfs]"
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=info msg="Got: driver-opts [glusterfs.defaultvolumepool=integral_vol glusterfs.servers=glusterfs]"
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=info msg="Launching convoy with args: [--socket=/host/var/run/convoy-convoy-gluster.sock daemon --root=/var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54 --drivers=glusterfs --driver-opts=glusterfs.defaultvolumepool=integral_vol --driver-opts=glusterfs.servers=glusterfs]"
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=debug msg="Creating config at /var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54" pkg=daemon
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=debug msg= driver=glusterfs driver_opts=map[glusterfs.defaultvolumepool:integral_vol glusterfs.servers:glusterfs] event=init pkg=daemon reason=prepare root="/var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54"
3/17/2016 1:56:28 PM time="2016-03-17T12:56:28Z" level=debug msg="Volume integral_vol is being mounted it to /var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54/glusterfs/mounts/integral_vol, with option [-t glusterfs]" pkg=util
3/17/2016 1:56:29 PM time="2016-03-17T12:56:29Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:30 PM time="2016-03-17T12:56:30Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:31 PM time="2016-03-17T12:56:31Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:32 PM time="2016-03-17T12:56:32Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:33 PM time="2016-03-17T12:56:33Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:34 PM time="2016-03-17T12:56:34Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:35 PM time="2016-03-17T12:56:35Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:36 PM time="2016-03-17T12:56:36Z" level=error msg="Get http:///host/var/run/convoy-convoy-gluster.sock/v1/volumes/list: dial unix /host/var/run/convoy-convoy-gluster.sock: connection refused"
3/17/2016 1:56:37 PM time="2016-03-17T12:56:37Z" level=debug msg="Cleaning up environment..." pkg=daemon
3/17/2016 1:56:37 PM time="2016-03-17T12:56:37Z" level=error msg="Failed to execute: mount [-t glusterfs glusterfs:/integral_vol /var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54/glusterfs/mounts/integral_vol], output Mount failed. Please check the log file for more details.\n, error exit status 1"
3/17/2016 1:56:37 PM {
3/17/2016 1:56:37 PM     "Error": "Failed to execute: mount [-t glusterfs glusterfs:/integral_vol /var/lib/rancher/convoy/convoy-gluster-75e26d85-7e46-402b-a0ed-ce357900bc54/glusterfs/mounts/integral_vol], output Mount failed. Please check the log file for more details.\n, error exit status 1"
3/17/2016 1:56:37 PM }
3/17/2016 1:56:37 PM time="2016-03-17T12:56:37Z" level=info msg="convoy exited with error: exit status 1"
3/17/2016 1:56:37 PM time="2016-03-17T12:56:37Z" level=info msg=Exiting.

And some screenshots here

@deniseschannon deniseschannon added the kind/bug Issues that are defects reported by users or that we know have reached a real release label Mar 17, 2016
@cloudnautique
Contributor

What were the logs on glusterfs_glusterfs-server_glusterfs-volume-create_1?

@pboos

pboos commented Mar 17, 2016

Log of: glusterfs_glusterfs-server_glusterfs-volume-create_1

3/17/2016 5:21:35 PM Waiting for all containers to come up...
3/17/2016 5:22:14 PM Containers are coming up...

Delete + recreate of glusterfs

glusterfs_glusterfs-server_glusterfs-volume-create_1-3

We now deleted and recreated glusterfs. The log on create_1 is now:

3/17/2016 5:20:20 PM Waiting for all containers to come up...
3/17/2016 5:22:13 PM Containers are coming up...
3/17/2016 5:22:13 PM Waiting for pool...
3/17/2016 5:22:18 PM Waiting for pool...
3/17/2016 5:22:23 PM Waiting for pool...
3/17/2016 5:22:28 PM Waiting for pool...
3/17/2016 5:22:33 PM Waiting for pool...
3/17/2016 5:22:38 PM Waiting for pool...
3/17/2016 5:22:43 PM Waiting for pool...
3/17/2016 5:22:48 PM Waiting for pool...
3/17/2016 5:22:53 PM Waiting for pool...
3/17/2016 5:22:58 PM Waiting for peerprobes and gluster daemons to come on line
3/17/2016 5:23:33 PM Getting peer mount points...
3/17/2016 5:23:33 PM Volume integral_vol does not exist
3/17/2016 5:23:33 PM Creating volume integral_vol...
3/17/2016 5:23:33 PM volume create: integral_vol: success: please start the volume to access data
3/17/2016 5:23:38 PM Starting volume integral_vol...
3/17/2016 5:23:38 PM volume start: integral_vol: success

But create_2 and create_3 have been stuck at Containers are coming up... for 30 minutes now.

It seems like create_2 and create_3 never even reach Waiting for pool..., as if something did not get started correctly.

glusterfs_glusterfs-server_1-3

There is also a difference between glusterfs_glusterfs-server_1, _2, and _3. server_1 and server_2 seem to start up okay:

3/17/2016 5:19:58 PM Waiting for all service containers to start...
3/17/2016 5:22:12 PM Containers are starting...
3/17/2016 5:22:12 PM Waiting for Gluster Daemons to come up
3/17/2016 5:22:56 PM gluster peer probe 10.42.57.8
3/17/2016 5:22:56 PM peer probe: success.
3/17/2016 5:22:57 PM gluster peer probe 10.42.179.194
3/17/2016 5:22:57 PM peer probe: success.

but glusterfs_glusterfs-server_3:

3/17/2016 5:22:30 PM Waiting for all service containers to start...
3/17/2016 5:22:31 PM Containers are starting...
3/17/2016 5:22:31 PM Waiting for Gluster Daemons to come up

The log has looked like this for over 20 minutes already.

Anything else we can provide you with? Other log files somewhere in the containers, or whatever.

@cjellick

I looked into @deniseschannon's setup where she had this problem.

The volume never got created by GlusterFS. The create-volume container had this output:

3/17/2016 9:10:52 AM Waiting for all containers to come up...
3/17/2016 9:13:02 AM Containers are coming up...
3/17/2016 9:13:02 AM Waiting for pool...
3/17/2016 9:13:07 AM Waiting for pool...
3/17/2016 9:13:12 AM Waiting for pool...
3/17/2016 9:13:17 AM Waiting for pool...
3/17/2016 9:13:22 AM Waiting for pool...
3/17/2016 9:13:27 AM Waiting for pool...
3/17/2016 9:13:32 AM Waiting for pool...
3/17/2016 9:13:37 AM Waiting for pool...
3/17/2016 9:13:42 AM pool list: failed
3/17/2016 9:13:42 AM Waiting for pool...

and was stopped. I started that container back up; it created the gluster volume, and convoy-gluster started up properly.

I'm sure there is some fix that will work around this problem, but I would also suggest that we do #3356, so that the GlusterFS stack appears unhealthy if it fails to create the gluster volume.
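
For anyone stuck in the same state, the manual workaround amounts to restarting the stopped create-volume container and watching it finish. A hedged sketch, using the container name quoted earlier in this thread:

docker start glusterfs_glusterfs-server_glusterfs-volume-create_1
docker logs -f glusterfs_glusterfs-server_glusterfs-volume-create_1   # wait for 'volume start: ... success'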

@cjellick

Is the issue that when this happened (3/17/2016 9:13:42 AM pool list: failed), the create-volume container prematurely quit and did not restart?

@mholttech

I tried the steps @cjellick mentioned on my own last week with no success. I'll update everything to the latest codebase and try this again tomorrow.

@cloudnautique
Contributor

@pboos, the create_volume containers do a leader election. Only the first one, under normal circumstances, will create the volume. The others just exit, and we show them as having completed their task in the UI. They only ever run once.
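
A minimal sketch of this kind of leader election, assuming it keys off the container's create_index in Rancher's metadata service (the actual template script may differ):

# Only the first-created container goes on to create the volume; the rest just exit.
INDEX=$(curl -s http://rancher-metadata/latest/self/container/create_index)
if [ "$INDEX" != "1" ]; then
    exit 0
fi
# Leader continues: wait for the pool, then create and start the volume.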

@cloudnautique
Contributor

@joostliketoast, we pushed a new template up today. It has a revamped initialization process that we feel is more stable. Please give it a try.

@sangeethah
Contributor

With the latest gluster-fs version (3.7.9) and rancher-server version (v1.0.0-rc1):

  1. Create GlusterFS stack.
  2. Create convoy-gluster stack.
  3. Launch a service with volumes using volume-driver convoy-gluster.
  4. Delete the service.
  5. Remove and purge the volume.

The purged volume continues to be listed in the docker volume ls output. I also see this error when listing docker volumes: "list convoy-gluster: invalid character 'H' looking for beginning of value".

root@sangeemyrc1-10acre-2:/home/sangeethahariharan1# docker volume ls
list convoy-gluster: invalid character 'H' looking for beginning of value
DRIVER              VOLUME NAME
local               f9f5998d80e40031c19274069db0245a2eaabd2eccb3c066e9ec9afbd2f84a55
local               a25506dbda733a505fea83a2b26f4757f8b8aa85d7ec1f5f793fa90a2f257c66
local               fed7bf62e63f0e2755e822128618c27217e9b817b6b9f4167fd7794de9256b3a
local               1d80100d5a14bb6fc42eb4ab0ccee24270a0d0e1e6815e045a9ba02d068d497d
local               cd5b18f15269b4046b9ea7dda1dcc44d3430bfae2cadcf1407ae7c28ceb4d829
local               ee0e649506b3cc03e5bb5f00a109eb196ff74be90e6c64472c341f7572c3cf7e
convoy-gluster      test1
root@sangeemyrc1-10acre-2:/home/sangeethahariharan1# 

rancher/convoy-agent v0.3.0 is being used by the convoy-gluster instances.

This issue is tracked in #3671
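
A hedged cleanup sketch for the stale entry; it may well fail with the same driver error until #3671 is addressed:

docker volume rm test1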

@sangeethah
Contributor

Will need to retest this scenario once #3671 gets addressed

@sangeethah
Contributor

If I remove the existing gluster and convoy stacks and recreate the GlusterFS and convoy-gluster stacks, the GlusterFS stack comes up fine, but the instances in convoy-gluster are not able to start successfully and are in "stopped" state.

Following are the container logs of the convoy-gluster_convoy-gluster_1 instance:

3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/8715/ns/mnt -F -- /var/lib/docker/aufs/mnt/2521eb6c2cc4a543a06384696c2f38c481a3bf9b4606ea94ee40d0923860386d/var/lib/rancher/convoy-agent/share-mnt --stage2 /var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5 -- /launch volume-agent-glusterfs-internal]"
3/23/2016 1:54:40 PM Waiting for metadata
3/23/2016 1:54:40 PM Registering convoy socket at /var/run/convoy-convoy-gluster.sock
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="Listening for health checks on 0.0.0.0:10241/healthcheck"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="Got: drivers [glusterfs]"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="Got: driver-opts [glusterfs.defaultvolumepool=my_vol glusterfs.servers=glusterfs]"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="Got: root /var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="Launching convoy with args: [--socket=/host/var/run/convoy-convoy-gluster.sock daemon --drivers=glusterfs --driver-opts=glusterfs.defaultvolumepool=my_vol --driver-opts=glusterfs.servers=glusterfs --root=/var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5]"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=debug msg="Creating config at /var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5" pkg=daemon
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=debug msg= driver=glusterfs driver_opts=map[glusterfs.defaultvolumepool:my_vol glusterfs.servers:glusterfs] event=init pkg=daemon reason=prepare root="/var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=debug msg="Volume my_vol is being mounted it to /var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5/glusterfs/mounts/my_vol, with option [-t glusterfs]" pkg=util
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=debug msg="Cleaning up environment..." pkg=daemon
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=error msg="Failed to execute: mount [-t glusterfs glusterfs:/my_vol /var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5/glusterfs/mounts/my_vol], output Mount failed. Please check the log file for more details.\n, error exit status 1"
3/23/2016 1:54:40 PM {
3/23/2016 1:54:40 PM     "Error": "Failed to execute: mount [-t glusterfs glusterfs:/my_vol /var/lib/rancher/convoy/convoy-gluster-265a2c39-f391-48ef-898e-6bc6bd6162f5/glusterfs/mounts/my_vol], output Mount failed. Please check the log file for more details.\n, error exit status 1"
3/23/2016 1:54:40 PM }
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg="convoy exited with error: exit status 1"
3/23/2016 1:54:40 PM time="2016-03-23T20:54:40Z" level=info msg=Exiting.

@will-chan will-chan modified the milestones: Release 1.1, Release 1.0 Mar 23, 2016
@iangcarroll

Also running into this issue fairly frequently (same log as @sangeethah).

@guruvan

guruvan commented Apr 10, 2016

Well, I'm trying to track this down as well; I've clearly got GlusterFS running correctly.

  • started with preexisting environment, having attempted this previously, and having marked entries in the storage_pool and storage_pool_host_map tables purged or removed - storage pool appeared to be removed.
  • created convoy-gluster stack by grabbing .yml files from catalog + upload slightly modded via UI
  • storagepool service is stuck initializing - appears to not pass healthcheck(?)
  • Infrastructure/Storagepools shows all the hosts, but not volume
  • convoy-agent instances (global) would not remain in Running state, continued to stop/start
  • deleted the convoy-gluster stack
  • rebuilt the convoy-gluster stack
  • convoy-agent runs now on one host successfully
    • this host happens to be the rancher-server with a rancher-agent running on it
    • all other services appear to have correctly working intercontainer networking (we have several services in full production now)
    • the remaining hosts have no evidence of the convoy-convoy-gluster.sock that should be present
    • rm /etc/docker/plugins/*.spec && reboot host appears to have no effect

Convoy isn't creating the socket on these hosts. (but did on the one host?)

@guruvan

guruvan commented Apr 10, 2016

UPDATE: semi-working state

  • Removed and replaced the convoy-gluster stack, setting up the storagepool service instance to run on the host I had working. This finally finished initializing on this host (this host had produced a proper convoy-gluster socket).
  • Rebooted a glusterfs server that appears to be underpowered and was lagging
  • At this time (after a few rancher-initiated container restarts) the convoy-gluster containers ceased reporting errors, and appeared to be in the working state. Test data expected to be on the glusterfs mount was present on the host as expected in /var/lib/rancher/convoy/convoy-gluster-126a2d27-845f-44b3-96bd-6b303cf8f985/glusterfs/mounts/my_vol
  • One host refused to create a socket and run the convoy-gluster container - I deactivated this host
  • I spun up a fresh host (on AWS) and added to rancher via Custom host option
  • convoy-gluster deployed to this host and fired right up (looks like it restarted 1 or 2 times first, and then correctly launched)

Infrastructure/Storagepools shows "convoy-glusterfs", shows all hosts as green, and no volumes.
Adding a volume here results in a "Requested" state for the requested volume name (it never progresses).

Testing docker, I'm able to mount the glusterfs volume:
docker run -it --rm --volume my_vol:/data --volume-driver=convoy-gluster guruvan/bash touch /data/anewfile
This file is correctly replicated and available across hosts; however, it appears in a subdirectory of the original volume named after the volume:
/var/lib/rancher/convoy/convoy-gluster-126a2d27-845f-44b3-96bd-6b303cf8f985/glusterfs/mounts/my_vol/my_vol
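
The double nesting can be confirmed directly on the host, using the convoy root from the path above:

ls /var/lib/rancher/convoy/convoy-gluster-126a2d27-845f-44b3-96bd-6b303cf8f985/glusterfs/mounts/my_vol/
# -> my_vol/   (the volume contents live one level deeper than expected)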

I am as yet unable to get Rancher to launch a container with the same parameters.

From the rancher-server logs:
2016-04-10 22:14:05,542 ERROR [be5bb2bf-6580-4684-a8d1-2b2da3ada6fe:118955] [instance:1952] [instance.start->(InstanceStart)->instance.allocate] [] [torService-1715] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception io.cattle.platform.eventing.exception.EventExecutionException: Scheduling failed: volume [3027] have exactly these pool(s): [13]

Conclusion: a Rancher networking issue appears to be at the root of the original problem. The host above, which is now deactivated, also seems otherwise unable to participate in the Rancher-managed overlay network (no Rancher-managed containers on this host can ping any other containers on any other hosts, AFAICT so far).

  • this host has networking services as standalone containers that may interfere

Convoy appears to be working, while it appears likely that the Rancher DB is a mess at this point.

YMMV ;)

@cjellick

cjellick commented Oct 3, 2016

GlusterFS: I don't think this needs to be in 1.2.0.

@deniseschannon

Please note that we have removed GlusterFS and Convoy Gluster from the catalog. Users were expecting a robust tool as an alternative persistent storage for Docker volumes, but due to the lack of active maintenance, we cannot recommend this solution going forward.

Instead, we recommend and certify Convoy NFS, which is actively maintained by Rancher. As a user, you can get GlusterFS support directly from Red Hat and use it in Rancher using Rancher's NFS plugin.

Due to these changes regarding GlusterFS and Convoy Gluster, we will not be addressing this bug for 1.2.0.

@deniseschannon

We aren't able to actively help maintain GlusterFS in the catalog and will not be able to fix these issues.
