cinder: Volume detection & detach fails w Docker Swarm #913
Comments
Same issue for me.
Just to confirm, you have the following set in your
This would need to be set on the node where the service is getting restarted/rescheduled to. This allows the new node to steal the volume from the old host, as it can only be mounted in one place at a time. FWIW, I think you do have that set, because you are seeing the failure as "error detaching volume" -- meaning that it is actually trying to detach from the old host, and that call is failing. Looking at the code just now, I think I see the reason why. I'll have to try and recreate...
Yes, it was set; I also used ignoreusedcount: true (which can't be set from the plugin...). This problem caused me to switch from OpenStack to DigitalOcean, but it would be great if it could be fixed; I could always use some extra swarm nodes on OpenStack! Thanks.
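For reference, the two settings mentioned above live under the libStorage integration section of the REX-Ray config file. A minimal sketch, with key names as documented for REX-Ray 0.9+ (worth double-checking against your exact version):

```yaml
# Sketch only: let a new node preempt (steal) a volume attached elsewhere,
# and ignore the used count when unmounting. Verify key names for your version.
libstorage:
  integration:
    volume:
      operations:
        mount:
          preempt: true
        unmount:
          ignoreusedcount: true
```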
Same behavior for me with openstack.
On my side I have created the config.yml file using your tool http://rexrayconfig.codedellemc.com/; the config file is the following:
I have also tried to detach the volume using the rexray command:
it fails with the output:
According to the log above, I can see that the command fails because it is not able to remove the local device.
I have tried to execute the detach command from the host where the volume is attached, and it works fine. So to summarize this issue: rexray is not able to detach a volume mounted on a REMOTE host.
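For context, the remote check and detach attempt above can be expressed with the REX-Ray CLI roughly like this (the volume name is illustrative, not from the original report):

```sh
# List volumes and their attachment state as seen by libStorage
rexray volume ls
# Attempt to detach the volume by name from a host other than the one it is attached to
rexray volume detach mydata
```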
Hi @codenrhoden Regards
Is /dev/sda(1) the right one to unmount?
Yes, but on a remote server (the one to which the device is attached).
Hi @MatMaul, do you have the bandwidth to address this issue?
Hello, I might have found the issue: https://github.com/codedellemc/rexray/blob/e8d7a016221ac2ff82e243466cff2d5d214da75d/libstorage/drivers/storage/cinder/storage/cinder_storage.go#L643 The volumeID is used for detaching, but the attachmentID is needed for this method: I'm not sure about this since I'm not a Go developer, but I hope it will help.
You are completely right. If you want to try to fix it and send a PR, be my guest; otherwise let me know and I'll fix it.
Hello, in fact I'm not right at all :)
I'm wondering if this is also fixed by #935.
Same problem for me. When a node of my swarm cluster crashes for any reason, libStorage is not able to detach the volume from the dead node to attach it elsewhere. I get this error every time:
But if I detach the volume manually, it works. @codenrhoden I am using the latest unstable version of REX-Ray. I don't know if the fix #935 is included in this version.
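For anyone needing the manual workaround in the meantime, detaching from the dead instance can be done from the OpenStack side; the identifiers below are placeholders:

```sh
# Tell Nova to drop the attachment between the failed server and the volume,
# so swarm can reattach it on another node.
openstack server remove volume <dead-server-id> <volume-id>
```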
Hi @benoitm76, It's not the latest version. There was a big fix around idempotency in commit 97dbe46 around midnight earlier this morning.
A new version with a cinder fix has been released; please check if it helps.
Hello, I'm still getting the "error detaching volume" failure when a service is rescheduled onto a node where the volume is not yet attached. Here is my test case.
Start test service :
Shut down node test-cinder2. Then
Same for me.
I am using rexray in service mode on my docker swarm cluster. If I stop a node, I still encounter this error: Is it normal to have all my volumes in state
Hi @benoitm76,
That depends: are those volumes attached to some node? If so, then yes. Here is a full list of the volume status values and their descriptions:
Same for me, using the latest version of the rexray/cinder docker plugin on each node.
I would love to see this working in swarm mode with a working rexray/cinder driver. Whenever a node that is running a task of a service (e.g. Postgres) gets shut down or paused, rexray is not able to detach the volume and attach it to a different node. As a consequence, the docker swarm service is stopped if your service is set up with just one replica. Please help, as it is really quite frustrating. Using the command
If you inspect this task using {
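A minimal way to reproduce the scenario described above; the service name, image, and volume name are made up, and the driver name may be rexray or rexray/cinder depending on how the plugin was installed:

```sh
# Create a rexray-backed volume and a single-replica service that uses it
docker volume create -d rexray pgdata
docker service create --name pgtest --replicas 1 \
  --mount type=volume,source=pgdata,target=/var/lib/postgresql/data,volume-driver=rexray \
  postgres:9.6
# Power off the node running the task, then watch where (and whether) it gets rescheduled
docker service ps pgtest
```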
I am unable to make it work in the first place. swarm init on one node and swarm join on 2 other nodes. Rexray 0.10.2 is launched on all nodes, as root, with a working config file. Ideas?
Hi @MatMaul, wouldn't using a constraint solve your problem?
@harrycain72 it gives me the same result as when I pause the master :( I tried with docker plugin install instead of manually running rexray, and it is even worse: in the log I get connection refused on /run/docker/plugins/rexray.sock, and even a simple I think I'll need a step-by-step guide, with exact versions of Docker/OS.
The same problem occurs when I use docker swarm in legacy mode. yvao9990 is the swarm node that runs the container using the rexray volume. Steps to reproduce:
- on yvao9990: halt -f
- on MASTER VM:
From the OpenStack dashboard, I see the volume is still attached to the failing node yvao9990. Docker version 17.07.0-ce, build 8784753
Hi. At the beginning, both nodes see the rexray volume as available:
Openstack: the rex volume has no attachment. Node 1 - openstack ID=4686db80-9835-4506-b212-f9aa12d2e7ae
-> the container is still running. Node 2 - openstack ID=0d75f444-60fc-4655-9621-741c9fc5d21e
Logs on both nodes are available. I would expect the preempt option in the rexray env to allow the volume to be detached from node 1 and then attached to node 2. It seems that node 2 tries to detach it from itself. Rexray should detach the attachment between node 1 and the volume.
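One way to confirm from the OpenStack side which server the attachment actually points at (the volume name here is illustrative):

```sh
# Show the Cinder view of the volume's status and attachments
openstack volume show rexvol -c status -c attachments
```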
OK thanks, I completely missed @fabio-barile's statement. Simple enough to reproduce easily; investigating.
Problem spotted: the detach takes the ID of the server in addition to the volume ID, and we always take the ID of the server where we run the detach. @akutz what is the expected behavior? How is PREEMPT handled? I don't have the option specified in my config and it looks like it calls VolumeDetach anyway. When this happens on a remote server, should I only detach if opts.Force is true, or every time?
@akutz another question: it looks like rexray lists the volumes before calling detach. Is it possible to access the result of that list in VolumeDetach? If not, perhaps we should store it in the context; I would like to avoid making a third API call to retrieve information already provided by the list (the server ID where the volume is attached).
Yeah, I see that Detach is always using the IID to do the detach, e.g. https://github.com/codedellemc/rexray/blob/3dce0e4a20cee6ce8878a95734315b8565d5d3aa/libstorage/drivers/storage/cinder/storage/cinder_storage.go#L643-L646 Instead, detach needs to use the ID of the node the volume is attached to. REX-Ray builds off of the paradigm that a block volume can only be attached to one node at a time, so detach methods generally don't care who the caller is. Depending on the cloud/storage provider's API, either it detaches the volume and this happens everywhere automatically (EBS does this) or it has to cycle through all the places where it is attached and detaches each one by one. GCE does this: https://github.com/codedellemc/rexray/blob/3dce0e4a20cee6ce8878a95734315b8565d5d3aa/libstorage/drivers/storage/gcepd/storage/gce_storage.go#L1050-L1065 The expected behavior is that
So, the main thing that needs to change here for Cinder is that Detach needs to query the volume to see where it is attached, and detach it from any and all nodes. I'm not sure what the The attach logic in Cinder is also a bit weird re: It's only querying the state of the volume if force is specified. Most other drivers always query it, then change their behavior based on
Not sure what "list" you are talking about? Are you talking about a list in the libStorage client? If so, this does happen and can be run on any node, not necessarily where the libStorage server is running, so no it can't be shared. If you are talking about the I know that's a lot to throw at you, but it's mostly explanation. The fix here is probably pretty straightforward.
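A rough sketch of that fix, assuming gophercloud's blockstorage v2 and compute volumeattach packages; this is not the actual REX-Ray patch, only the shape of it: query the volume for its attachments and detach it from every server it is attached to, instead of from the caller's instance.

```go
package cinderfix

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v2/volumes"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/volumeattach"
)

// detachEverywhere asks Cinder where the volume is attached and removes
// each attachment via Nova, regardless of which node runs this code.
func detachEverywhere(compute, blockStorage *gophercloud.ServiceClient, volumeID string) error {
	vol, err := volumes.Get(blockStorage, volumeID).Extract()
	if err != nil {
		return err
	}
	for _, att := range vol.Attachments {
		// att.ServerID is the instance the volume is really attached to,
		// which may not be the local node. Nova identifies the attachment
		// by the volume ID.
		if err := volumeattach.Delete(compute, att.ServerID, volumeID).ExtractErr(); err != nil {
			return err
		}
	}
	return nil
}
```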
@codenrhoden Thanks a lot for fixing this issue. Testing on openstack && docker swarm is looking very good! Greetings from Munich.
@harrycain72 so glad to hear!
Thanks for fixing that @codenrhoden :)
Summary
When I create stateful services in swarm mode and they fail, or a node goes down and the service gets rescheduled, the stateful services keep hitting cinder errors. Recovery becomes a manual process: remove each of these services, manually unmount the volumes with the cinder CLI or openstack panel, and redeploy the stacks. If there is a sudden failure, this effectively defeats the purpose of swarm orchestration...
Bug Reports
The Cinder driver fails to detach or detect a volume in docker swarm mode. I currently have the settings recommended here.
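For completeness, the per-node plugin install looked roughly like this; the endpoint and credential values are placeholders, and the exact variable names should be checked against the rexray/cinder plugin documentation:

```sh
docker plugin install rexray/cinder \
  CINDER_AUTHURL=https://keystone.example.com:5000/v3 \
  CINDER_USERNAME=demo \
  CINDER_PASSWORD=secret \
  CINDER_TENANTID=<tenant-id> \
  REXRAY_PREEMPT=true
```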
Version
Expected Behavior
The failed services should restart with the appropriate volumes mounted without any errors.
Actual Behavior
Getting either:
VolumeDriver.Mount: {"Error":"error detaching volume"}
or
VolumeDriver.Mount: {"Error":"open /dev/vdc: no such file or directory"}