
[aws/ebs] NVMe support #1104

Open · jippi opened this issue Nov 7, 2017 · 45 comments

jippi commented Nov 7, 2017

Summary

When using EBS on a c5 / m5 instance (which uses NVMe), the volumes are mounted under a different name than on non-NVMe instances.

PR

Bug Reports

EC2 c5.* and other NVMe instances mount volumes under a different /dev path than non-NVMe instances.

The path is /dev/nvme*n* instead of /dev/sd* or /dev/xvd* - apparently NVMe volumes ignore the requested device name completely and are always assigned names in a ${prev}+1 (attach-order) scheme.

More information can be found here and here

I'm not in any way an expert in EC2 / NVMe or how these things should work; I'm just putting it out here that our REX-Ray usage as a Docker plugin did not work on c5 due to the path difference.
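
For illustration only - a minimal Go sketch (not REX-Ray code; the device patterns are assumptions) that makes the path difference visible by globbing /dev for the classic names versus the NVMe names:

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    // On a Nitro/NVMe instance only the nvme pattern typically matches for
    // attached EBS volumes; the requested /dev/sdX or /dev/xvdX name is absent.
    for _, pattern := range []string{"/dev/xvd*", "/dev/sd*", "/dev/nvme*n*"} {
        matches, err := filepath.Glob(pattern)
        if err != nil {
            continue
        }
        fmt.Printf("%-14s -> %v\n", pattern, matches)
    }
}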

Version

Any version

clintkitson commented Nov 7, 2017

Very interesting @jippi.

Thank you for the write-up and the reference links for implementation. From what I can see so far, this involves:

  1. discovery of a valid toolset for introspecting device and volume ID relationships (a rough sketch follows this list)
  2. updating the EBS container to include such tooling, paths, and testing across AMIs/typical environments
  3. new awareness of whether devices are showing up as NVMe or typical EBS
  4. new discovery steps for NVMe devices
  5. skipping the device-picking logic
  6. a pre-detach command to ensure a flush, since a forced detach is always requested with NVMe
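
As a rough sketch of what steps 1 and 4 could look like, assuming the EBS volume ID is exposed as the NVMe controller serial in sysfs ("vol" followed by the hex ID, without the dash), as the AWS docs linked above describe. The sysfs paths and serial format here are assumptions, not REX-Ray code:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// nvmeEBSVolumes maps device paths like /dev/nvme1n1 to volume IDs like
// vol-0123456789abcdef0 by reading each NVMe controller's serial from sysfs.
func nvmeEBSVolumes() (map[string]string, error) {
    ctrls, err := filepath.Glob("/sys/class/nvme/nvme*")
    if err != nil {
        return nil, err
    }
    out := map[string]string{}
    for _, c := range ctrls {
        raw, err := os.ReadFile(filepath.Join(c, "serial"))
        if err != nil {
            continue
        }
        serial := strings.TrimSpace(string(raw))
        if !strings.HasPrefix(serial, "vol") {
            continue // instance-store NVMe, not an EBS volume
        }
        volID := "vol-" + strings.TrimPrefix(serial, "vol")
        dev := "/dev/" + filepath.Base(c) + "n1" // first namespace
        out[dev] = volID
    }
    return out, nil
}

func main() {
    m, err := nvmeEBSVolumes()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for dev, vol := range m {
        fmt.Printf("%s -> %s\n", dev, vol)
    }
}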

clintkitson added the ebs label on Nov 7, 2017

jippi commented Nov 7, 2017

Regarding the first point, it seems to be documented in nvme-ebs-volumes - at least at a high level.

clintkitson commented Nov 7, 2017

Agreed, that's why I called it out as an update that is needed for the EBS driver =)

jippi changed the title from "[aws/ebs] nvm support" to "[aws/ebs] NVMe support" on Nov 7, 2017

djannot commented Jan 17, 2018

+1

clintkitson commented Jan 18, 2018

We are currently working through the plans for the upcoming REX-Ray releases, inclusive of a path forward to native CSI drivers. We want to combine this phase with enhanced and additional functionality that goes beyond what currently exists in REX-Ray or the K8s EBS plugin.

matthewmrichter commented Feb 3, 2018

+1

jippi commented Feb 27, 2018

What's the status on this, @clintkitson? Do you have a guess as to when 5th-gen AWS instances will work as expected?

clintkitson commented Feb 27, 2018

Sounds amazing =)

Pinging @arun-gupta re AWS and CSI where this functionality would make the most sense.

arun-gupta commented Feb 27, 2018

I've asked @aaithal to look into this

aaithal commented Mar 8, 2018

@clintkitson your comment makes sense to me. The driver being responsible for discovery and subsequent pass-through sounds good as well. You might end up taking a dependency on ebsnvme-id, which should be OK, I'm guessing.

jippi commented Mar 9, 2018

So, timeline-wise, in which quarter of 2018 might 5th-gen AWS support land?

clintkitson commented Mar 12, 2018

Unfortunately there is no commitment to this functionality as of yet. Looking forward to working with AWS's team around a CSI driver for EBS to bring this forward. @aaithal

jippi commented Apr 18, 2018

@clintkitson that's a frustrating answer; it basically forces all REX-Ray users either to look elsewhere for volume management on AWS or to stick to previous-generation instances :( Any way to help get this prioritised higher?

dispalt commented May 21, 2018

Does that mean this is completely dependent on a new implementation around CSI before it would be fixed in REX-Ray, or is that still up in the air?

jippi commented May 21, 2018

I've dug into the code; I think it would be really simple to add to REX-Ray as it is today. The only issue is that the device names in /proc/partitions do not match the expected volume.

If it just looked for /proc/<expected device name> it would actually work - everything else does not need any kind of special handling.

clintkitson commented May 21, 2018

Yes, for the project we are planning on the support landing in CSI. We'd be happy to look at a contribution that gets this functioning in the meantime.

jippi commented May 22, 2018

@clintkitson would it be acceptable to just check whether /dev/<device name> exists for NVMe?
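
Something like the following hypothetical sketch (not the actual REX-Ray code; /dev/xvdf is just an example) is what that check could amount to, assuming udev has been set up to create the expected name as a symlink to the real NVMe device:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    requested := "/dev/xvdf" // example: the device name requested at attach time

    // Rather than matching the kernel name against a device regex, just test
    // whether the requested attach path is present.
    if _, err := os.Stat(requested); err != nil {
        fmt.Println("device not present:", err)
        os.Exit(1)
    }

    // If udev created the requested name as a symlink, resolve it to the
    // real NVMe device the kernel assigned (e.g. /dev/nvme1n1).
    resolved, err := filepath.EvalSymlinks(requested)
    if err != nil {
        fmt.Println("could not resolve:", err)
        os.Exit(1)
    }
    fmt.Printf("%s -> %s\n", requested, resolved)
}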

JustinVenus commented Jun 7, 2018

I have a dirty workaround to support EBS on M5/C5 instances. It requires some code changes, udev rules, and nvme-cli. Here is the gist of what I did to make this work for me: https://gist.github.com/JustinVenus/7d26ff5e885dc0ed34951ce20e5e79e0 ... I could make a proper pull request if it is wanted.

jippi commented Jun 7, 2018

@JustinVenus looks good! Can you submit the changes as a PR? It's very close to what I had planned too - I think all the changes are extremely reasonable.

JustinVenus commented Jun 7, 2018

I opened a PR for this issue at #1233.

nunofernandes commented Aug 17, 2018

This (seatgeek@c01de50) worked for me...

gai00 commented Aug 22, 2018

Waiting for the Docker plugin to be updated for this.

raravena80 commented Aug 23, 2018

👍

jamesdh commented Aug 30, 2018

Just an FYI that this applies to all C5, C5d, i3.metal, M5, M5d, R5, R5d, T3, and z1d instance types.

T3 is the new lower-end general-purpose tier. This issue pretty much kills REX-Ray for any modern EC2 instance.

clintkitson commented Aug 30, 2018

Thanks for the clarification @jamesdh

CameronGo commented Sep 19, 2018

Does this PR close out this issue? #1252

jippi commented Sep 19, 2018

A new release has to be cut, but otherwise, yes.

jippi commented Sep 27, 2018

@akutz @clintkitson when do you plan to cut a release? :)

clintkitson commented Sep 28, 2018

@jippi Did you get a chance to test the latest unstable release?

jippi commented Sep 28, 2018

@clintkitson no, but I currently run a fork with my PR on top of master in prod - which works great :)

dovreshef commented Jan 6, 2019

Hi,

I'm also running into this. Any plans to release soon?

digovc commented Jan 7, 2019

I'm using the master version on T3 instances with the PR from @jippi and it really works.

Please release a new version from master to resolve this problem with the new EBS NVMe block devices.

clintkitson commented Jan 7, 2019

If we can verify that the unstable release binaries work as expected for AWS, then we can proceed with a release. Can someone sign up to verify that? Thank you!

matthewmrichter commented Jan 7, 2019

Isn't that what @digovc said he did?

clintkitson commented Jan 7, 2019

Apologies for not being clearer. Below is the link to the unstable binaries that were built from the merge of this fix. We are looking for testers of this build.

https://dl.bintray.com/rexray/rexray/unstable/0.11.3+36/

dovreshef commented Jan 8, 2019

@clintkitson
I don't think that version includes the fix. I have installed that version with:

curl -sSL https://rexray.io/install | sh -s -- unstable 0.11.3+36

I still get the same error with it, and in the rexray service logs I see:

Jan 08 09:25:17 ip-10-200-2-147.eu-central-1.compute.internal rexray[3566]: time="2019-01-08T09:25:17Z" level=warning msg="device does not match" deviceName=nvme0n1 deviceRX=^xvd[f-p]$ host="unix:///var/run/rexray/015457837.sock" instanceID="ebs=i-XXXXXX,availabilityZone=eu-central-1c&region=eu-central-1" integrationDriver=linux osDriver=linux server=cedar-face-ly service=ebs storageDriver=libstorage time=1546939517619 txCR=1546939517 txID=3e34070a-af62-4dbd-4b3b-811463a4e039
digovc commented Jan 8, 2019

I built the Docker plugin from the master branch (https://rexray.readthedocs.io/en/stable/dev-guide/build-reference/#docker-plug-ins) and put it to work in a swarm with a group of T3 instances. With the official stable plugin I had the same error described above, but with this version all the instances work fine with EBS volumes.

dovreshef commented Jan 8, 2019

Found the problem.
First, I can confirm that the latest version is working.
The problem was that the nvme utility was not installed. I suppose this should probably be better documented; I only figured it out by working through the code backwards from the error.
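
For anyone documenting this, a trivial example of the kind of preflight check that would have surfaced the missing dependency (an illustration only, not how REX-Ray itself handles it):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    // Look for the nvme-cli binary on PATH before relying on the EBS driver
    // on NVMe-based instances.
    path, err := exec.LookPath("nvme")
    if err != nil {
        fmt.Fprintln(os.Stderr, "nvme-cli not found in PATH; install it before using the EBS driver on NVMe-based instances")
        os.Exit(1)
    }
    fmt.Println("found nvme at", path)
}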

clintkitson commented Jan 8, 2019

@dovreshef Can you make a suggestion for the documentation updates or, better yet, open a PR including them? Thanks for the feedback.

edthorne commented Feb 11, 2019

I think there's a subtle issue lingering here. I'm frequently seeing the following output from the Docker daemon:

Feb 11 19:37:16 ip-172-31-53-17 dockerd: time="2019-02-11T19:37:16Z" level=error msg="time=\"2019-02-11T19:37:16Z\" level=warning msg=\"device does not match\" deviceName=name deviceRX=^xvd[f-p]$ host=\"unix:///var/run/rexray/239479381.sock\" instanceID=\"ebs=i-XXXXXXXXXXXXXXXXX,availabilityZone=us-east-1a&region=us-east-1\" integrationDriver=linux osDriver=linux server=bevel-minnow-ie service=ebs storageDriver=libstorage time=1549913836132 txCR=1549913836 txID=9bf36cdd-8437-4ec8-7178-7ab401ea062f " plugin=0e1b330cbb644b81c26a9cce14c76ede21f347829c53d53d53832027e69fcc53
Feb 11 19:37:16 ip-172-31-53-17 dockerd: time="2019-02-11T19:37:16Z" level=error msg="time=\"2019-02-11T19:37:16Z\" level=warning msg=\"device does not match\" deviceName=nvme1n1 deviceRX=^xvd[f-p]$ host=\"unix:///var/run/rexray/239479381.sock\" instanceID=\"ebs=i-XXXXXXXXXXXXXXXXX,availabilityZone=us-east-1a&region=us-east-1\" integrationDriver=linux osDriver=linux server=bevel-minnow-ie service=ebs storageDriver=libstorage time=1549913836133 txCR=1549913836 txID=9bf36cdd-8437-4ec8-7178-7ab401ea062f " plugin=0e1b330cbb644b81c26a9cce14c76ede21f347829c53d53d53832027e69fcc53
Feb 11 19:37:16 ip-172-31-53-17 dockerd: time="2019-02-11T19:37:16Z" level=error msg="time=\"2019-02-11T19:37:16Z\" level=warning msg=\"device does not match\" deviceName=nvme0n1 deviceRX=^xvd[f-p]$ host=\"unix:///var/run/rexray/239479381.sock\" instanceID=\"ebs=i-XXXXXXXXXXXXXXXXX,availabilityZone=us-east-1a&region=us-east-1\" integrationDriver=linux osDriver=linux server=bevel-minnow-ie service=ebs storageDriver=libstorage time=1549913836133 txCR=1549913836 txID=9bf36cdd-8437-4ec8-7178-7ab401ea062f " plugin=0e1b330cbb644b81c26a9cce14c76ede21f347829c53d53d53832027e69fcc53
Feb 11 19:37:16 ip-172-31-53-17 dockerd: time="2019-02-11T19:37:16Z" level=error msg="time=\"2019-02-11T19:37:16Z\" level=warning msg=\"device does not match\" deviceName=nvme0n1p1 deviceRX=^xvd[f-p]$ host=\"unix:///var/run/rexray/239479381.sock\" instanceID=\"ebs=i-XXXXXXXXXXXXXXXXX,availabilityZone=us-east-1a&region=us-east-1\" integrationDriver=linux osDriver=linux server=bevel-minnow-ie service=ebs storageDriver=libstorage time=1549913836134 txCR=1549913836 txID=9bf36cdd-8437-4ec8-7178-7ab401ea062f " plugin=0e1b330cbb644b81c26a9cce14c76ede21f347829c53d53d53832027e69fcc53

Docker believes them to be errors even though rexray appears to report them as warnings.

The mounting of EBS volumes appears to be happening normally and things are working. Here's the relevant version info for our swarms.

"DockerVersion": "18.06.1-ce"
"Name": "rexray/ebs:latest"
"PluginReference": "docker.io/rexray/ebs:0.11.4"
gmembre-zenika commented Feb 15, 2019

+1, same problem here: still seeing the error and I cannot mount EBS volumes :'(

gmembre-zenika commented Mar 7, 2019

I misread the logs: the driver complains about the root devices of the VM, but mounting volumes with the plugin works like a charm. Thanks!

darkl0rd commented Apr 12, 2019

This is still not working for me.

I have just installed rexray/ebs:0.11.4 on a t3.2xlarge:

$ docker plugin ls
b71744536bc8        rexray/ebs:0.11.4   REX-Ray for Amazon EBS   true

When I try to run a container using an EBS volume:

$ docker run --tty \
             --interactive \
             --rm \
             --volume-driver rexray/ebs:0.11.4 \
             --volume my_test_volume:/test \
             alpine /bin/sh

docker: Error response from daemon: error while mounting volume '': VolumeDriver.Mount: docker-legacy: Mount: my_test_volume: failed: no device name returned.

To (re-)confirm that this is indeed 0.11.4 that is running:

$ cat /var/lib/docker/plugins/b71744536bc8704c6b89cb70e9c13fb534616593df4dbec604c91a14ab098c22/rootfs/var/log/rexray/rexray.log

REX-Ray
-------
Binary: /usr/bin/rexray
Flavor: client+agent+controller
SemVer: 0.11.4
OsArch: Linux-x86_64
Commit: e7414eaa971b27977d2283f2882825393493179d
Formed: Tue, 15 Jan 2019 16:03:57 UTC

MonsieurPaulLeBoulanger commented Apr 13, 2019

@darkl0rd: did you add the required udev rule and associated handler to make it work?
It's well hidden in the REX-Ray documentation:
https://rexray.readthedocs.io/en/v0.11.4/user-guide/storage-providers/aws/#nvme-support

You may find more information about this issue here: #1252 (thanks @jippi)

darkl0rd commented Apr 13, 2019

@MonsieurPaulLeBoulanger In all honesty, I did read that - however, I figured that when using the Docker volume plugin it would be unnecessary?
