Skip to content

Commit

Permalink
PCI/PM: Extend D3hot delay for NVIDIA HDA controllers
Browse files Browse the repository at this point in the history
[ Upstream commit a5a6dd2 ]

Assignment of NVIDIA Ampere-based GPUs have seen a regression since the
below referenced commit, where the reduced D3hot transition delay appears
to introduce a small window where a D3hot->D0 transition followed by a bus
reset can wedge the device.  The entire device is subsequently unavailable,
returning -1 on config space read and is unrecoverable without a host
reset.

This has been observed with RTX A2000 and A5000 GPU and audio functions
assigned to a Windows VM, where shutdown of the VM places the devices in
D3hot prior to vfio-pci performing a bus reset when userspace releases the
devices.  The issue has roughly a 2-3% chance of occurring per shutdown.

Restoring the HDA controller d3hot_delay to the effective value before the
below commit has been shown to resolve the issue.  NVIDIA confirms this
change should be safe for all of their HDA controllers.

Fixes: 3e34796 ("PCI/PM: Reduce D3hot delay with usleep_range()")
Link: https://lore.kernel.org/r/20230413194042.605768-1-alex.williamson@redhat.com
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Tarun Gupta <targupta@nvidia.com>
Cc: Abhishek Sahu <abhsahu@nvidia.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
  • Loading branch information
awilliam authored and gregkh committed May 11, 2023
1 parent 1bdb4a5 commit 4039b58
Showing 1 changed file with 13 additions and 0 deletions.
13 changes: 13 additions & 0 deletions drivers/pci/quirks.c
Original file line number Diff line number Diff line change
Expand Up @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);

/*
* NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
* reset is performed too soon after transition to D0, extend d3hot_delay
* to previous effective default for all NVIDIA HDA controllers.
*/
static void quirk_nvidia_hda_pm(struct pci_dev *dev)
{
quirk_d3hot_delay(dev, 20);
}
DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
quirk_nvidia_hda_pm);

/*
* Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
* https://bugzilla.kernel.org/show_bug.cgi?id=205587
Expand Down

0 comments on commit 4039b58

Please sign in to comment.