
VMD driver bindings not reset after none override #2926

Closed · tanabarr opened this issue Feb 28, 2023 · 6 comments

Labels: Sighting, Waiting on Submitter (currently waiting on input from submitter)

tanabarr commented Feb 28, 2023

Sighting report

If I run setup.sh with DRIVER_OVERRIDE=none (as recommended to speed up binding on VMD systems), all the VMD domains get unbound and are therefore inaccessible from the OS. To get them back I run setup.sh reset with PCI_ALLOWED='0000:2a:00.5 0000:62:00.5 0000:a1:00.5 0000:dd:00.5', but only half of the devices become visible again, even though every VMD domain is reported as already returned to the kernel driver:

0000:dd:00.5 (8086 28c0): Already using the vmd driver
0000:2a:00.5 (8086 28c0): Already using the vmd driver
0000:a1:00.5 (8086 28c0): Already using the vmd driver
0000:62:00.5 (8086 28c0): Already using the vmd driver

spdk/scripts/setup.sh status shows that the backing devices have not been bound back to the kernel, which explains why they are not visible in the OS under /dev/nvme*:

Type     BDF             Vendor Device NUMA    Driver           Device     Block devices
<snip I/OAT entries>
VMD      0000:2a:00.5    8086   28c0   0       vmd              -          -
VMD      0000:62:00.5    8086   28c0   0       vmd              -          -
<snip I/OAT entries>
VMD      0000:a1:00.5    8086   28c0   1       vmd              -          -
VMD      0000:dd:00.5    8086   28c0   1       vmd              -          -
NVMe     10000:01:00.0   144d   a824   1       -                -          -
NVMe     10000:03:00.0   144d   a824   1       -                -          -
NVMe     10000:05:00.0   144d   a824   1       -                -          -
NVMe     10000:07:00.0   144d   a824   1       -                -          -
NVMe     10003:81:00.0   144d   a824   0       -                -          -
NVMe     10003:83:00.0   144d   a824   0       -                -          -
NVMe     10003:85:00.0   144d   a824   0       -                -          -
NVMe     10003:87:00.0   144d   a824   0       -                -          -
NVMe     10004:81:00.0   144d   a824   0       nvme             nvme0      nvme0n1
NVMe     10004:83:00.0   144d   a824   0       nvme             nvme1      nvme1n1
NVMe     10004:85:00.0   144d   a824   0       nvme             nvme2      nvme2n1
NVMe     10004:87:00.0   144d   a824   0       nvme             nvme3      nvme3n1
NVMe     10005:01:00.0   144d   a824   1       nvme             nvme4      nvme4n1
NVMe     10005:03:00.0   144d   a824   1       nvme             nvme5      nvme5n1
NVMe     10005:05:00.0   144d   a824   1       nvme             nvme6      nvme6n1
NVMe     10005:07:00.0   144d   a824   1       nvme             nvme7      nvme7n1

I think the workaround in my case is to make sure we populate PCI_ALLOWED whenever we set DRIVER_OVERRIDE=none.

Possible Solution

A workaround is to change step 1 of the reproduction steps below to:

  1. sudo DRIVER_OVERRIDE=none PCI_ALLOWED="0000:2a:00.5 0000:62:00.5" scripts/setup.sh

However, this workaround defeats the purpose of running the preparatory DRIVER_OVERRIDE=none pass as an interim step to improve VMD driver binding times on high-capacity systems: specifying VMD controller addresses in that command slows setup down dramatically.
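
For reference, the VMD controller addresses needed for PCI_ALLOWED can be listed up front with lspci, filtering on the vendor/device ID reported in the status output (8086:28c0 on this system); a quick sketch:

  # -D prints full domain:bus:device.function addresses;
  # -d filters by vendor:device ID (the VMD controller here is 8086:28c0)
  lspci -D -d 8086:28c0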

Steps to Reproduce

  1. sudo DRIVER_OVERRIDE=none scripts/setup.sh
  2. sudo PCI_ALLOWED="0000:2a:00.5 0000:62:00.5" scripts/setup.sh
  3. sudo PCI_ALLOWED="0000:2a:00.5 0000:62:00.5 0000:a1:00.5 0000:dd:00.5" scripts/setup.sh reset
  4. sudo scripts/setup.sh status

Devices can be returned to the kernel using the following sequence when in this state:

  1. sudo DRIVER_OVERRIDE=none scripts/setup.sh
  2. sudo PCI_ALLOWED="0000:2a:00.5 0000:62:00.5 0000:a1:00.5 0000:dd:00.5" scripts/setup.sh
  3. sudo PCI_ALLOWED="0000:2a:00.5 0000:62:00.5 0000:a1:00.5 0000:dd:00.5" scripts/setup.sh reset
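
As a quick sanity check that everything came back:

  sudo scripts/setup.sh status   # every NVMe endpoint should now show the nvme driver
  ls /dev/nvme*                  # and the block devices should be visible again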

Context (Environment including OS version, SPDK version, etc.)

Rocky Linux 8.6, SPDK 22.01.2, IceLake

@tanabarr commented:

wolf-218_dmesg_nodriver.log
wolf-218_spdk-status_nodriver.log
Logs from dmesg and spdk/scripts/setup.sh status after the commands listed in "Steps to Reproduce" are run.

ksztyber commented Mar 3, 2023

Hi @tanabarr, I think that behavior is expected.
Doing:

DRIVER_OVERRIDE=none scripts/setup.sh

unbinds the nvme driver from all drives (across all VMD domains).
Then,

PCI_ALLOWED="0000:2a:00.5 0000:62:00.5" scripts/setup.sh

binds the vfio-pci driver to VMD domains 0000:2a:00.5 and 0000:62:00.5. This causes the drives behind those domains (10004:* and 10005:*) to be removed from the system.
Finally, doing:

PCI_ALLOWED="0000:2a:00.5 0000:62:00.5 0000:a1:00.5 0000:dd:00.5" scripts/setup.sh reset

tries to bind the vmd driver to the specified domains. 0000:2a:00.5 and 0000:62:00.5 will be rebound (vfio-pci -> vmd), which also causes a rescan of the devices behind those domains. This means new PCI devices are found (10004:*, 10005:*), the kernel sees that they're NVMe devices, and it binds them to the nvme driver. 0000:a1:00.5 and 0000:dd:00.5, on the other hand, were already bound to the vmd driver, so they're no-ops. And since you didn't specify any of the drives behind those domains in PCI_ALLOWED, they're left untouched.
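
To illustrate the mechanics, a minimal sysfs sketch of what the reset effectively does for a domain still bound to vfio-pci (addresses taken from this report; assumes the vmd kernel driver is loaded):

  # release the domain from vfio-pci, then hand it back to vmd
  echo 0000:2a:00.5 | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
  echo 0000:2a:00.5 | sudo tee /sys/bus/pci/drivers/vmd/bind
  # binding vmd rescans the domain, so the 10004:*/10005:* NVMe
  # endpoints reappear and the kernel claims them with the nvme driver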

You can either specify the drives behind 0000:a1:00.5 and 0000:dd:00.5 in PCI_ALLOWED when doing setup.sh reset:

PCI_ALLOWED="0000:2a:00.5 0000:62:00.5 0000:a1:00.5 0000:dd:00.5 10000:01:00.0 10000:03:00.0 ..." scripts/setup.sh reset

or do an extra scripts/setup.sh reset without setting PCI_ALLOWED.
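
The second option is just one extra, unfiltered pass; something like:

  scripts/setup.sh reset   # no PCI_ALLOWED: rebinds every remaining device to its kernel driver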

@tomzawadzki added the Waiting on Submitter label Mar 7, 2023
@jimharris commented:

@tanabarr, can we close this issue based on feedback from @ksztyber?

@tomzawadzki commented:

[Bug scrub] @tanabarr Any feedback on the above?

@tanabarr commented:

Thanks @ksztyber and @tomzawadzki, I will update our usage as per this suggestion: "or do an extra scripts/setup.sh reset without setting PCI_ALLOWED".

DAOS tools that wrap the setup script apply a PCI_ALLOWED filter based on the PCI list in the DAOS config files, which enables a subset of the host SSDs to be selected for use. As a result, the setup script gets called with the VMD domains in the allow list in both setup and reset modes. Because the override call is made during the setup workflow to speed up binding times on high-capacity drives, we end up in the described scenario: when only a subset of domains is specified in the DAOS config file, not all drives are visible in /dev/nvme* after reset is called (which confuses people).

The intent behind applying the allow list when running the setup script is to enable other applications on the same host to use SPDK with SSDs not in use by DAOS (and only bind a subset to userspace drivers).

The solution you have suggested, calling reset without an allow list, is an acceptable compromise. While this conceptually disturbs the intended ability to "play nicely with other applications that might be using system SSDs", in reality, when devices need to be bound back for use via the kernel, omitting the allow list is unlikely to cause any unintended consequences on DAOS storage servers.
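
A sketch of the adjusted reset flow (the DAOS_PCI_LIST variable here is hypothetical, standing in for the allow list built from the DAOS config):

  # filtered reset, as before, for the devices DAOS manages
  sudo PCI_ALLOWED="$DAOS_PCI_LIST" scripts/setup.sh reset
  # extra unfiltered reset to catch NVMe devices left dangling
  # by the earlier DRIVER_OVERRIDE=none setup call
  sudo scripts/setup.sh reset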

I think this ticket can be marked as resolved.

@tanabarr commented:

DAOS update ticket reference: https://daosio.atlassian.net/browse/DAOS-13521

tanabarr added a commit to daos-stack/daos that referenced this issue Jun 14, 2023
To avoid undesired consequences related to
spdk/spdk#2926, perform an extra SPDK reset
(without PCI_ALLOWED) at the end of the daos_server nvme reset command
when VMD is enabled. This will reset any dangling NVMe devices left
unbound after the DRIVER_OVERRIDE=none setup call was used in
daos_server nvme prepare.

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
phender pushed a commit to daos-stack/daos that referenced this issue Jun 21, 2023 (…#12397; same commit message as above)
tanabarr added a commit to daos-stack/daos that referenced this issue Jun 22, 2023 (…#12397; same commit message as above)
mjmac pushed a commit to daos-stack/daos that referenced this issue Jun 23, 2023 (…#12397) (#12482; same commit message as above)