[Pytorch] AMD GPUs benchmarks #328
For comparison, these are TensorFlow benchmarks: http://blog.gpueater.com/en/2018/04/23/00011_tech_cifar10_bench_on_tf13/
Thanks for the report! We are working on performance optimizations at the moment. If you are interested in helping, could you …
How long is "long" on an RX 580? Some hours? A day? A week?
Typically a few hours for a single network. It depends on how many unique convolution configs are missing from our internal database.
I ran this command with this benchmark: a whole lot of tests were running. I waited overnight for it to finish. When I came back this morning the task was frozen, with radeontop showing every metric at 100% utilization (which I doubt). I looked for any result, but none was produced. I'll try another, maybe lighter, benchmark.
Not sure if it's relevant for you, but I can't detect my GPU now. This is the log from when I killed the first benchmark after it froze. Working on it.

This is problematic as well.
A couple of questions: … Thanks!
(I'm working with @hyperfraise.)
@skylt Thanks, this is helpful. There is no … The instability looks like a kernel driver issue on …
I noticed that sometimes hcc is launched before training begins; the compilation step seems to disappear after the first script run. You can find the results of the test I ran prior to tuning MIOpen below. The numbers are based on 90 observations (batches), with the first 10 discarded.
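The discard-the-first-iterations pattern described above can be sketched generically; the workload function and the `benchmark` helper here are my own stand-ins, not part of the project's scripts:

```python
import statistics
import time

def benchmark(step, n_iters=90, n_warmup=10):
    """Time step() n_iters times, discarding the first n_warmup runs.

    The first calls may include one-off costs (e.g. kernel compilation),
    so they are excluded from the reported statistics.
    """
    times = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        step()
        times.append(time.perf_counter() - t0)
    kept = times[n_warmup:]
    return {"mean_s": statistics.mean(kept), "stdev_s": statistics.stdev(kept)}

# Example with a dummy workload standing in for one training batch:
result = benchmark(lambda: sum(i * i for i in range(10_000)))
```

Discarding warm-up iterations matters especially here, since the thread notes that kernels are compiled on first use and cached afterwards.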
Concerning your questions: we do compile some kernels the first time we run them and subsequently fetch them from cache, so you are correct that the second invocation of these kernels is what you want to time. I think the performance you observe makes sense. I'd expect resnet18 to improve as you tune MIOpen, and in general I've observed that larger batch sizes help; judging from your memory consumption numbers, you could increase the batch size for resnet18 on the 580 by 2x (maybe even 4x). Please attach your performance database here once you are done so that I can make sure these configs get tuned in a future MIOpen release.
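The tuning step mentioned above is typically driven through MIOpen's environment variables; a minimal sketch, where the database path and script name are hypothetical and the exact variable values should be checked against your MIOpen version's documentation:

```shell
# Ask MIOpen to exhaustively search for the best convolution kernels and
# record the winners in a user-writable performance database.
export MIOPEN_FIND_ENFORCE=SEARCH_DB_UPDATE   # check numeric/string form in the MIOpen docs
export MIOPEN_USER_DB_PATH=/tmp/miopen-perfdb # hypothetical path

# Run the workload once to populate the database (slow), then rerun
# without MIOPEN_FIND_ENFORCE to benchmark with the tuned kernels.
python benchmark.py                            # hypothetical script name
```

The resulting user database is the "performance database" the maintainer asks to have attached to the issue.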
I wonder how you got such a nice result.
Single GPU. I don't know where the differences could come from; I didn't do anything in particular.
I used the benchmark script in …
@Delaunay this is very interesting data, thanks for posting it here! Which PyTorch commit did you use for your test? Your fp16 data looks correct for gfx803. @geekboood could you post a link to the benchmark you were running? I'd be interested in having a look at it. Thanks!
The commit I used was …
Thanks, very good! It may be interesting to rerun this benchmark after …
@iotamudelta The ResNet one is already in the link above.
Hi. Thanks for this work, guys.
I was curious whether you had been able to benchmark the framework on AMD GPUs. I've successfully built PyTorch with ROCm support following your instructions, but the benchmarks I got don't seem right. I'm testing with a Radeon RX 580, which should have roughly half the performance of a 1080 Ti, yet I'm seeing more like a 9-10x drop in performance on convolutions. The TensorFlow benchmarks already show that the gap shouldn't be that wide.
Is this expected for the moment?
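Before timing anything, it's worth sanity-checking that the ROCm build actually sees the GPU. A minimal sketch: ROCm builds of PyTorch expose HIP devices through the `torch.cuda` API, and the `gpu_report` helper name is my own:

```python
def gpu_report():
    """Return a small dict describing whether PyTorch can see a GPU."""
    report = {"torch_installed": False, "available": None, "device_name": None}
    try:
        import torch  # ROCm builds still use the torch.cuda namespace
    except ImportError:
        return report  # PyTorch not installed in this environment
    report["torch_installed"] = True
    report["available"] = torch.cuda.is_available()
    if report["available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report

if __name__ == "__main__":
    print(gpu_report())
```

If `available` comes back False (as in the frozen-GPU report earlier in the thread), the problem is at the driver level and no amount of benchmark tuning will help.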