
Segfaults on RDNA1 / gfx1010 (RX 5700XT) with ROCm on built nightly torch 2.1.0 #106728

Closed
cl0ck-byte opened this issue Aug 7, 2023 · 11 comments
Labels: module: rocm (AMD GPU support for Pytorch)


cl0ck-byte commented Aug 7, 2023

🐛 Describe the bug

Since the workaround of forcing the environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0 (i.e. pretending to be an RDNA2 card) no longer works on RDNA1 with torch >= 2.0.0 (it results in segfaults), I've tried my luck compiling a torch wheel myself with the gfx1010 target and numpy support.

Running the following example code under this torch build:

import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.get_device_name(torch.cuda.current_device())

tensor = torch.randn(2, 2)
res = tensor.to(0)
print(res)

results in a segmentation fault:

Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(torch.cuda.current_device())
'AMD Radeon RX 5700 XT'
>>> 
>>> tensor = torch.randn(2, 2)
>>> res = tensor.to(0)
>>> print(res)
Segmentation fault (core dumped)

Output from running coredumpctl debug on latest dump:
https://gist.github.com/cl0ck-byte/5a4e24f1a67fd588fde06d28da2e5765

Output from running dmesg -kuT:

[mon aug 7 22:48:52 2023] python[8910]: segfault at 70 ip 00007f575bed2364 sp 00007ffdffd6fc80 error 4 in libamdhip64.so.5.6.50600[7f575be20000+375000] likely on CPU 3 (core 4, socket 0)
[mon aug 7 22:48:52 2023] Code: e9 89 3f f5 ff 90 f3 0f 1e fa 41 54 55 53 48 83 ec 20 64 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 85 f6 0f 88 e4 00 00 00 <48> 8b 47 70 48 2b 47 68 48 63 ee 48 89 fb 48 c1 f8 03 48 39 c5 0f

Cannot upload torch wheel build due to size.

Versions

PyTorch version: 2.1.0a0+git8b8f576
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.6.31061-8c743ae5d

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon RX 5700 XT
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.6.31061
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   43 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 5 3600X 6-Core Processor
CPU family:                      23
Model:                           113
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     4408,5928
CPU min MHz:                     2200,0000
BogoMIPS:                        7600.16
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization:                  AMD-V
L1d cache:                       192 KiB (6 instances)
L1i cache:                       192 KiB (6 instances)
L2 cache:                        3 MiB (6 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==2.1.0a0+git8b8f576
[conda] numpy                     1.25.2                   pypi_0    pypi
[conda] torch                     2.1.0a0+git8b8f576          pypi_0    pypi

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot added the module: rocm (AMD GPU support for Pytorch) label on Aug 7, 2023

hongxiayang commented Aug 8, 2023

@cl0ck-byte Can you run the following commands and send the output:

grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
rocminfo
uname -a
apt --installed list | grep dkms
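
For convenience, here is a minimal Python sketch that collects the first of those outputs, reading the same KFD sysfs files as the grep above (it assumes the standard /sys/class/kfd layout exposed by the amdkfd driver):

import glob

# Print the "flags" entry of io_links/0/properties for every KFD topology
# node; equivalent to the grep command above.
for path in sorted(glob.glob("/sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties")):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                print(f"{path}: {line.strip()}")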


cl0ck-byte commented Aug 8, 2023

Output from grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties:

/sys/class/kfd/kfd/topology/nodes/0/io_links/0/properties:flags 3
/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties:flags 1

rocminfo:
https://gist.github.com/cl0ck-byte/52a5a91b93a50a485cff64d37901b7f4

@hongxiayang

hongxiayang commented:

@cl0ck-byte Thanks for the output. For your reference, the link below shows the latest list of AMD officially supported GPUs. https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus


cl0ck-byte commented Aug 8, 2023

> @cl0ck-byte Thanks for the output. For your reference, the link below shows the latest list of AMD officially supported GPUs. https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus

From what I understand, nothing will be done about this issue since RDNA1 is not officially supported?

Edit: if someone ever wants to pick this up, here's GEF (a gdb fork) output with debug symbols (torch was compiled without the debug flag), although the colored syntax highlighting is missing:
https://gist.github.com/cl0ck-byte/cdb70bb9252bb6677d405bd005229b3c

Seems like it's something related to drivers/ROCm/whatever else, and that's way beyond my capabilities.

And if someone really wants to get torch working on RDNA1: downgrade it to the latest pre-2.0.0 version and use the same workaround as before, i.e. forcing the environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0. Unless something else comes up in the meantime, there's no way around it. Thanks, AMD!
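
For reference, a minimal sketch of that workaround, assuming a torch wheel built against ROCm 5.2 or earlier is installed (e.g. a 1.13.x release or the ROCm 5.2 nightlies linked later in this thread); the override just has to be set before torch initializes the HIP runtime, so exporting it in the shell before launching Python works just as well:

import os

# Spoof an RDNA2 (gfx1030) target for the gfx1010 card. Must be set before
# torch/HIP is initialized; equivalent to `export HSA_OVERRIDE_GFX_VERSION=10.3.0`.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

import torch

t = torch.randn(2, 2).to(0)  # the operation that segfaults on torch >= 2.0 builds
print(t)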

Also, a quick edit: maybe DirectML will work instead? Be sure to check it out.

cl0ck-byte closed this as not planned on Aug 8, 2023
cl0ck-byte commented:

> @cl0ck-byte I believe the issue was introduced sometime between ROCm 5.2 and 5.3. The last Torch 2.0.0 snapshot built against 5.2 works just fine, while snapshots from the same period built against 5.3 show the same symptoms.

Interesting. Should I try building that snapshot myself with the gfx1010 target, if that's possible? @ddvarpdd


DGdev91 commented Oct 5, 2023

I can confirm that pytorch 2 is indeed working on gfx1010 if compiled using ROCm 5.2 and "export HSA_OVERRIDE_GFX_VERSION=10.3.0".
I'm currently successfully running automatic1111's WebUI for Stable Diffusion using the nightly build posted by @ddvarpdd.

Actually, the web archive isn't really needed here. The official PyTorch repo still has it.

So, for torch, torchvision and torchaudio the links are:
https://download.pytorch.org/whl/nightly/rocm5.2/torch-2.0.0.dev20230209%2Brocm5.2-cp310-cp310-linux_x86_64.whl
https://download.pytorch.org/whl/nightly/rocm5.2/torchvision-0.15.0.dev20230209%2Brocm5.2-cp310-cp310-linux_x86_64.whl
https://download.pytorch.org/whl/nightly/rocm5.2/torchaudio-2.0.0.dev20230209%2Brocm5.2-cp310-cp310-linux_x86_64.whl

Anyway, I think the problem here is in ROCm 5.3 and newer.

cl0ck-byte commented:

Someone compiled wheels for torch 2.1.0 on ROCm 5.2, in case anybody wants to use them:

#103973 (comment)


kontorol commented Dec 5, 2023


Running grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties

returns "flags 13" on my RDNA1 / gfx1010 (RX 5700 XT)!

Can this configuration be supported?


DGdev91 commented Dec 5, 2023


> grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
>
> Returns "flags 13" on RDNA1 / gfx1010 (RX 5700 XT)!
>
> Can this configuration be supported?

That's a different issue and has nothing to do with the RX 5700 XT itself; it's about your configuration, which doesn't support PCI atomics, required by newer ROCm versions. There can be multiple causes. Maybe the motherboard or the CPU is too old. There was also someone here on GitHub who had the video card mounted in the lower PCI slot, but his motherboard supported PCI atomics only on the first one.
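
For anyone curious what those values mean, here is a rough decoding sketch; the bit layout is an assumption on my part, taken from the CRAT_IOLINK_FLAGS_* definitions in the upstream amdkfd driver, and is not confirmed anywhere in this thread:

# Hypothetical decoder for the io_link "flags" value, assuming the
# CRAT_IOLINK_FLAGS_* bit layout from the amdkfd driver (kfd_crat.h).
FLAG_BITS = {
    1 << 0: "ENABLED",
    1 << 1: "NON_COHERENT",
    1 << 2: "NO_ATOMICS_32_BIT",
    1 << 3: "NO_ATOMICS_64_BIT",
    1 << 4: "NO_PEER_TO_PEER_DMA",
}

def decode(flags: int) -> str:
    return " | ".join(name for bit, name in FLAG_BITS.items() if flags & bit) or "NONE"

print(decode(3))   # value reported earlier in this issue
print(decode(13))  # value reported for the system without PCI atomics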


kontorol commented Dec 5, 2023


Thank you for the guidance. Yes, unfortunately both the CPU and motherboard are very old!

TriedAllDay commented:
> Someone compiled wheels for torch 2.1.0 on rocm 5.2 if somebody wants to use them
>
> #103973 (comment)

Does this include torchaudio/torchvision? I only see torch 2.1.0.
