AWS EBS stuck attachments DoS libStorage server #773
Comments
@Nomon Thanks for the detailed write-up. I will see if I can recreate this by purposefully corrupting the filesystem on an EBS volume. If not, I'll take you up on the offer to mount via IAM profile. :) From your config file, it looks like you are using the default filesystem, which is ext4, so I'll stick with that as well.
@Nomon After looking into this a bit, I believe I have identified the issue within the EBS driver. Separately, I will probably file a second issue to track a fix we can put in at the libStorage framework level to keep misbehaving drivers from triggering this outcome. Thanks again for the write-up. For what it's worth, I was never able to get one of my EBS volumes to fail to attach, even after corrupting the ext4 filesystem superblocks.
Some updates on the issue: from what I have gathered, an unclean detach (without unmounting first inside the VM) renders the device name (the Xen virtual device name in AWS HVM) unusable for the VM's lifetime (actually a reboot resets the state, so not strictly the VM's lifetime). There is no guest-VM workaround for it, and AWS is "working on one". Even though the guest OS does not see the device, trying to attach any volume to the same /dev/xvdXXX after an unclean detach leaves the volume in the PENDING attachment state forever; this is 100% reproducible. Kubernetes has accumulated several issues and merges working around this EBS behavior over a long period of time: kubernetes/kubernetes#32630.

The consensus seems to be that the best current "workaround" is to use the full 52-name EBS device range (/dev/xvd{b,c}[a-z], see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html) for allocating device names, and the name allocator should not select the first available device on the VM from that list. Instead it should walk the list, so the first mount goes to /dev/xvdba and the second to /dev/xvdbb even if xvdba is no longer attached to the VM, and once it reaches /dev/xvdcz it should start over from /dev/xvdba, and so forth. Using the full range of devices, setting preempt to false, and changing the allocator to walk the names instead of probing the first available should greatly mitigate the issue.

I see that there is already new timeout code in place; if an attach timeout also blacklisted the instance_id:device_name pair, I think the issue would be mitigated about as well as it can be from the guest VM. In theory the VM could then keep attaching volumes until all names are blacklisted, which should not happen under normal operation. A sketch of such an allocator follows below.

Currently we run an embedded server on each node to work around pending attachments blocking all attachments, but this only reduces the failure domain of the attachments from the cluster to a single node, and when scheduling even moderately sized workloads it easily leads to AWS API rate limiting with all the independent agents hitting the API. A sequential device name allocator would let a failed node make progress in both embedded and server-client setups instead of retrying the same broken device name over and over.
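A minimal sketch (in Go, since that is what libStorage is written in) of such a sequential, blacklisting allocator. The names `Allocator`, `Next`, `Release`, and `Blacklist` are hypothetical illustrations, not the actual libStorage API:

```go
package main

import (
	"fmt"
	"sync"
)

// deviceNames enumerates the full /dev/xvd[b-c][a-z] range, 52 names total.
func deviceNames() []string {
	var names []string
	for _, major := range []rune{'b', 'c'} {
		for minor := 'a'; minor <= 'z'; minor++ {
			names = append(names, fmt.Sprintf("/dev/xvd%c%c", major, minor))
		}
	}
	return names
}

// Allocator walks the name list round-robin instead of returning the first
// free name, so a name left unusable by an unclean detach is not retried
// until the cursor wraps all the way around.
type Allocator struct {
	mu     sync.Mutex
	names  []string
	cursor int
	inUse  map[string]bool // names currently attached
	bad    map[string]bool // names blacklisted after an attach timeout
}

func NewAllocator() *Allocator {
	return &Allocator{names: deviceNames(), inUse: map[string]bool{}, bad: map[string]bool{}}
}

// Next returns the next usable device name, or an error if every name is
// in use or blacklisted.
func (a *Allocator) Next() (string, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for i := 0; i < len(a.names); i++ {
		name := a.names[a.cursor]
		a.cursor = (a.cursor + 1) % len(a.names)
		if !a.inUse[name] && !a.bad[name] {
			a.inUse[name] = true
			return name, nil
		}
	}
	return "", fmt.Errorf("no usable device names left")
}

// Release frees a name after a clean detach so it can be reused later.
func (a *Allocator) Release(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.inUse, name)
}

// Blacklist marks a name unusable for the instance lifetime, e.g. after an
// attach request to it timed out in the attaching state.
func (a *Allocator) Blacklist(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.inUse, name)
	a.bad[name] = true
}

func main() {
	a := NewAllocator()
	first, _ := a.Next() // -> /dev/xvdba
	a.Blacklist(first)   // simulate an attach that timed out
	second, _ := a.Next() // -> /dev/xvdbb, not /dev/xvdba again
	fmt.Println(first, second)
}
```

Walking the list rather than probing the first free name means a ghosted /dev/xvdba is only revisited after the cursor wraps, and the blacklist removes it from rotation entirely.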
With 0.9.0 the situation is much, much better: the server-client setup no longer gets stuck and the server keeps making progress. Once in a while a node will still end up in a state where some devices are unusable, but now it is isolated to that node. I would still like to see the whole /dev/xvd{b,c}[a-z] space used for EBS volumes instead of the 10 devices (/dev/xvd{g,h,i,j,k,l,m,n,o,p}) currently used.
Hi @Nomon, the device name logic (and the limitation to 10 volumes) comes directly from Amazon itself: for EBS, both paravirtual and HVM instances restrict the recommended device names to a narrow range.
The recommended range is what it says: a recommendation. It is not large enough to attach the maximum of 40 volumes that AWS promises can be attached to an HVM instance, and more than that is possible on a best-effort basis. For example, Kubernetes uses the whole available range at /dev/xvd[b-c][a-z] in an implementation that minimizes device node re-use to work around the issue of attachments hanging forever on some device nodes; a retried attachment will receive a different device (the least-used device node on the instance).
@akutz I have the same issue: I could not attach more than 1 volume with Kubernetes. Can you suggest how to fix it?
@miry Can you say more about why only a single volume can be attached? Are the previous device names already allocated?
@clintkitson I am trying to set up REX-Ray with Kubernetes, following https://rexray.readthedocs.io/en/stable/user-guide/schedulers/#kubernetes. After some debugging I found this in the logs: Device /dev/xvdg is busy, status code 400. I checked the current device list and there is no such disk. So I tried another experiment: on a new node I set up rexray and kubelet. I ran the same experiment without Kubernetes and it was fine; I could mount more than 4 disks. It could be related to the fact that I created 3 PVs with FlexVolume and Kubernetes somehow reserved the disks, but that is a long shot.
@clintkitson It is the Xen device ghosting issue I mentioned earlier (Apr 14) in this thread: if a device is detached uncleanly from a Xen virtual device node, that node becomes unusable for the instance's lifetime even though the instance cannot see any device attached to it, and the device allocator will keep returning the same device. The PR referenced in this issue from villet mostly mitigates the problem by making the allocator pick device names in random order, so a retried attachment that hit a broken node will move on to a new one; see the sketch below.
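For contrast with the sequential walker sketched earlier, a minimal sketch of random-order selection; `pickRandom` is a hypothetical helper for illustration, not the code from the referenced PR:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickRandom returns a random unused, non-blacklisted device name, so a
// retried attachment is unlikely to land on the same ghosted device node.
func pickRandom(names []string, inUse, bad map[string]bool) (string, bool) {
	candidates := make([]string, 0, len(names))
	for _, n := range names {
		if !inUse[n] && !bad[n] {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}
	return candidates[rand.Intn(len(candidates))], true
}

func main() {
	names := []string{"/dev/xvdba", "/dev/xvdbb", "/dev/xvdbc"}
	bad := map[string]bool{"/dev/xvdba": true} // ghosted by an unclean detach
	if name, ok := pickRandom(names, map[string]bool{}, bad); ok {
		fmt.Println("attaching to", name)
	}
}
```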
@Nomon Ok, thank you for the clarification.
@miry Do you have the
Summary
When the Docker driver on host A tries to attach an EBS volume with a corrupted filesystem (corrupted superblock), the attachment never transitions from the attaching to the attached state. This causes VolumeDriver.Mount on that one host to return error 500 when the attachment request to the libStorage server times out (1 minute). The libStorage server responds with the running task info along with a 408. From this moment on, every other EBS operation on the REX-Ray server times out and responds with a 408 and a new task ID; each call effectively does nothing more than create a task on the server, queued for an execution that never happens (because the server ranges over an unbuffered channel to execute tasks in sequence).
I have run into this issue several times; the corrupted volume prevents even functional volumes from attaching to other hosts. The /tasks API on the server returns all the tasks that keep piling up.
Next time this happens I will get a dump of the goroutines to see where they are blocking, but I am fairly confident it is due to the serial execution of tasks and the EBS driver's waitVolumeComplete, which blocks forever because the volume in EBS is stuck in the attaching state. There are multiple cases where volumes on AWS get stuck in the attaching state; the AWS docs recommend rebooting the instance as the initial fix....
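A minimal sketch of the suspected failure mode, under the assumption stated above that tasks are executed in sequence by a single goroutine ranging over an unbuffered channel; `runTasks` and `waitVolumeAttached` are illustrative stand-ins, not the actual libStorage or EBS driver code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runTasks executes tasks one at a time; the tasks channel is unbuffered,
// so the whole pipeline stops as soon as one task never returns.
func runTasks(tasks <-chan func()) {
	for task := range tasks {
		task() // if this blocks forever, no later task ever runs
	}
}

// waitVolumeAttached polls the volume state. Without a deadline it spins
// forever on a volume stuck in "attaching"; with a context deadline the
// task at least returns and unblocks the queue.
func waitVolumeAttached(ctx context.Context, state func() string) error {
	for {
		if state() == "attached" {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for volume to attach")
		case <-time.After(time.Second):
		}
	}
}

func main() {
	tasks := make(chan func()) // unbuffered: senders block until the worker is free
	go runTasks(tasks)

	stuck := func() string { return "attaching" } // simulates the stuck EBS volume

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	tasks <- func() { fmt.Println(waitVolumeAttached(ctx, stuck)) }

	// Without the deadline above, this send would block forever: every
	// later operation just queues behind the stuck attach.
	done := make(chan struct{})
	tasks <- func() { fmt.Println("next task finally ran"); close(done) }
	<-done
}
```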
Version
Expected Behavior
A failing mount on a single host should not prevent other hosts from using different volumes.
Steps To Reproduce
I have a few corrupted EBS volumes that trigger this consistently. If you are unable to reproduce it, let me know and I will create an IAM policy allowing you to mount the volumes in question to your AWS account.
Here are the relevant kernel ring buffer entries from the host running the Docker integration, using the libStorage server to do the attachment:
Here is the log of the rexray-server timing out the attachment:
Logs from the node with the docker integration driver:
State of the volume in AWS (notice that it has been attaching for the past 3.5 hours or so):
Server config:
Agent config:
I did not upload all the logs because I am not sure whether they contain anything confidential; if you need the full logs for further analysis, I can sanitize and share the rest.