
Segfaults on RDNA1 / gfx1010 (RX 5700XT) with ROCm on built nightly torch 2.1.0 #106728

Closed
cl0ck-byte opened this issue Aug 7, 2023 · 11 comments
Labels: module: rocm (AMD GPU support for Pytorch)


cl0ck-byte commented Aug 7, 2023

🐛 Describe the bug

Since the workaround of forcing the environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0 (i.e. pretending to be an RDNA2 card) no longer works on RDNA1 with torch >= 2.0.0 (it results in segfaults), I've tried my luck compiling a torch wheel myself with the gfx1010 target and numpy support.

Running the following example code under this torch build:

import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.get_device_name(torch.cuda.current_device())

tensor = torch.randn(2, 2)
res = tensor.to(0)
print(res)

results in a segmentation fault:

Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(torch.cuda.current_device())
'AMD Radeon RX 5700 XT'
>>> 
>>> tensor = torch.randn(2, 2)
>>> res = tensor.to(0)
>>> print(res)
Segmentation fault (core dumped)

Output from running coredumpctl debug on latest dump:
https://gist.github.com/cl0ck-byte/5a4e24f1a67fd588fde06d28da2e5765

Output from running dmesg -kuT:

[mon aug 7 22:48:52 2023] python[8910]: segfault at 70 ip 00007f575bed2364 sp 00007ffdffd6fc80 error 4 in libamdhip64.so.5.6.50600[7f575be20000+375000] likely on CPU 3 (core 4, socket 0)
[mon aug 7 22:48:52 2023] Code: e9 89 3f f5 ff 90 f3 0f 1e fa 41 54 55 53 48 83 ec 20 64 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 85 f6 0f 88 e4 00 00 00 <48> 8b 47 70 48 2b 47 68 48 63 ee 48 89 fb 48 c1 f8 03 48 39 c5 0f

Cannot upload torch wheel build due to size.

Versions

PyTorch version: 2.1.0a0+git8b8f576
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.6.31061-8c743ae5d

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon RX 5700 XT
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.6.31061
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   43 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 5 3600X 6-Core Processor
CPU family:                      23
Model:                           113
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     4408,5928
CPU min MHz:                     2200,0000
BogoMIPS:                        7600.16
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization:                  AMD-V
L1d cache:                       192 KiB (6 instances)
L1i cache:                       192 KiB (6 instances)
L2 cache:                        3 MiB (6 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==2.1.0a0+git8b8f576
[conda] numpy                     1.25.2                   pypi_0    pypi
[conda] torch                     2.1.0a0+git8b8f576          pypi_0    pypi

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot added the module: rocm (AMD GPU support for Pytorch) label on Aug 7, 2023

hongxiayang commented Aug 8, 2023

@cl0ck-byte Can you run the following commands and send the output:

grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
rocminfo
uname -a
apt --installed list | grep dkms
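
For convenience, here is a minimal Python sketch that collects the first of those outputs, reading the same KFD sysfs files as the grep above (it assumes the standard /sys/class/kfd layout exposed by the amdkfd driver):

import glob

# Print the "flags" entry of io_links/0/properties for every KFD topology
# node; equivalent to the grep command above.
for path in sorted(glob.glob("/sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties")):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                print(f"{path}: {line.strip()}")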


cl0ck-byte commented Aug 8, 2023

Output from grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties:

/sys/class/kfd/kfd/topology/nodes/0/io_links/0/properties:flags 3
/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties:flags 1

rocminfo:
https://gist.github.com/cl0ck-byte/52a5a91b93a50a485cff64d37901b7f4

@hongxiayang

hongxiayang commented:

@cl0ck-byte Thanks for the output. For your reference, the link below shows the latest list of AMD officially supported GPUs. https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus


cl0ck-byte commented Aug 8, 2023

> @cl0ck-byte Thanks for the output. For your reference, the link below shows the latest list of AMD officially supported GPUs. https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus

From what I understand, nothing will be done about this issue since RDNA1 is not officially supported?

Edit: if someone ever wants to pick this up, here's GEF (a gdb fork) output with debug symbols (torch was compiled without the debug flag), although the colored syntax highlighting is missing:
https://gist.github.com/cl0ck-byte/cdb70bb9252bb6677d405bd005229b3c

Seems like it's something related to drivers/ROCm/whatever else, and that's way beyond my capabilities.

And if someone really wants to get torch working on RDNA1: downgrade it to the latest pre-2.0.0 version and use the same workaround as before, i.e. forcing the environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0. Unless something else comes up in the meantime, there's no way around it. Thanks, AMD!
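
For reference, a minimal sketch of that workaround, assuming a torch wheel built against ROCm 5.2 or earlier is installed (e.g. a 1.13.x release or the ROCm 5.2 nightlies linked later in this thread); the override just has to be set before torch initializes the HIP runtime, so exporting it in the shell before launching Python works just as well:

import os

# Spoof an RDNA2 (gfx1030) target for the gfx1010 card. Must be set before
# torch/HIP is initialized; equivalent to `export HSA_OVERRIDE_GFX_VERSION=10.3.0`.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

import torch

t = torch.randn(2, 2).to(0)  # the operation that segfaults on torch >= 2.0 builds
print(t)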

Also, a quick edit: maybe DirectML will work instead? Be sure to check it out.

cl0ck-byte closed this as not planned on Aug 8, 2023
cl0ck-byte commented:

> @cl0ck-byte I believe the issue was introduced sometime between ROCm 5.2 and 5.3. The last Torch 2.0.0 snapshot built against 5.2 works just fine, while snapshots from the same period built against 5.3 show the same symptoms.

Interesting. Should I try building that snapshot myself with the gfx1010 target, if that's possible? @ddvarpdd


DGdev91 commented Oct 5, 2023

I can confirm that pytorch 2 is indeed working on gfx1010 if compiled using ROCm 5.2 and "export HSA_OVERRIDE_GFX_VERSION=10.3.0".
I'm currently successfully running automatic1111's WebUI for Stable Diffusion using the nightly build posted by @ddvarpdd.

Actually, the web archive isn't really needed here. The official PyTorch repo still has it.

So, for torch, torchvision and torchaudio the links are:
https://download.pytorch.org/whl/nightly/rocm5.2/torch-2.0.0.dev20230209%2Brocm5.2-cp310-cp310-linux_x86_64.whl
https://download.pytorch.org/whl/nightly/rocm5.2/torchvision-0.15.0.dev20230209%2Brocm5.2-cp310-cp310-linux_x86_64.whl
https://download.pytorch.org/whl/nightly/rocm5.2/torchaudio-2.0.0.dev20230209%2Brocm5.2-cp310-cp310-linux_x86_64.whl

Anyway, I think the problem here is in ROCm 5.3 and newer.

cl0ck-byte commented:

Someone compiled wheels for torch 2.1.0 on ROCm 5.2, in case anybody wants to use them:

#103973 (comment)


kontorol commented Dec 5, 2023


Running grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties

returns "flags 13" on my RDNA1 / gfx1010 (RX 5700 XT)!

Can this configuration be supported?


DGdev91 commented Dec 5, 2023


> grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
>
> Returns "flags 13" on RDNA1 / gfx1010 (RX 5700 XT)!
>
> Can this configuration be supported?

That's a different issue and has nothing to do with the RX 5700 XT itself; it's about your configuration, which doesn't support PCI atomics, required by newer ROCm versions. There can be multiple causes. Maybe the motherboard or the CPU is too old. There was also someone here on GitHub who had the video card mounted in the lower PCI slot, but his motherboard supported PCI atomics only on the first one.
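
For anyone curious what those values mean, here is a rough decoding sketch; the bit layout is an assumption on my part, taken from the CRAT_IOLINK_FLAGS_* definitions in the upstream amdkfd driver, and is not confirmed anywhere in this thread:

# Hypothetical decoder for the io_link "flags" value, assuming the
# CRAT_IOLINK_FLAGS_* bit layout from the amdkfd driver (kfd_crat.h).
FLAG_BITS = {
    1 << 0: "ENABLED",
    1 << 1: "NON_COHERENT",
    1 << 2: "NO_ATOMICS_32_BIT",
    1 << 3: "NO_ATOMICS_64_BIT",
    1 << 4: "NO_PEER_TO_PEER_DMA",
}

def decode(flags: int) -> str:
    return " | ".join(name for bit, name in FLAG_BITS.items() if flags & bit) or "NONE"

print(decode(3))   # value reported earlier in this issue
print(decode(13))  # value reported for the system without PCI atomics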


kontorol commented Dec 5, 2023


Thank you for the guidance. Yes, unfortunately both the CPU and motherboard are very old!

TriedAllDay commented:
> Someone compiled wheels for torch 2.1.0 on rocm 5.2 if somebody wants to use them
>
> #103973 (comment)

Does this include torchaudio/torchvision? I only see torch 2.1.0.
