Fix template kernels with ROCm #33

Merged
scarrazza merged 4 commits into multiqubitgpu from fixrocm on Oct 26, 2021

Conversation

mlazzarin
Contributor

@mlazzarin mlazzarin commented Oct 25, 2021

In this PR I implemented the workaround for the template kernels with ROCm, as suggested in cupy/cupy#5436.

In particular, I replaced <complex<double>> with <thrust::complex<double> > and <complex<float>> with <thrust::complex<float> >. I also replaced complex with thrust::complex in the __device__ functions of gates.cu.cc for consistency.
Then, I removed the duplicated file with the ROCm kernels, which is now redundant.

I ran the tests of this repository and they pass. However, the tests in the qibo repository fail, and they also fail on the main branch. I will open a separate issue for this.
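
A minimal sketch of how the kernel names described above could be generated, assuming the instantiation strings are built in Python and then passed to CuPy through the `name_expressions` argument of `cupy.RawModule`. The kernel name `apply_gate_kernel` and both helper functions are hypothetical and are not taken from backends.py.

```python
# Map a NumPy-style complex dtype name to the C++ scalar type that goes
# inside thrust::complex<...>.
CPP_SCALAR = {"complex64": "float", "complex128": "double"}


def name_expression(kernel, dtype):
    """Return the template instantiation string for `kernel` at `dtype`."""
    scalar = CPP_SCALAR[dtype]
    # Explicit thrust:: prefix, and a space between the two closing angle
    # brackets: this is the spelling adopted in this PR.
    return f"{kernel}<thrust::complex<{scalar}> >"


def all_name_expressions(kernels):
    """Return one instantiation string per (kernel, dtype) pair."""
    return [name_expression(k, d) for k in kernels for d in CPP_SCALAR]


# "apply_gate_kernel" is a hypothetical kernel name used for illustration.
expressions = all_name_expressions(["apply_gate_kernel"])
for expression in expressions:
    print(expression)
```

Keeping the spelling in a single helper means a later change to the workaround would only touch one place.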

@mlazzarin
Contributor Author

Given that the issues reported in #34 seem independent of this PR, I will address them in another PR and mark this one as ready for review.

@mlazzarin mlazzarin marked this pull request as ready for review October 25, 2021 12:27
@scarrazza
Member

@mlazzarin do you really need to add the thrust:: prefix?

@mlazzarin
Contributor Author

@scarrazza It seems to be required in backends.py but not in the __device__ functions of gates.cu.cc. Let me know what you prefer.

@scarrazza
Member

> @scarrazza It seems to be required in backends.py

Do you have an example? I would prefer to keep the kernels as light as possible.

@mlazzarin
Contributor Author

I implemented the workaround suggested in this comment: cupy/cupy#5436 (comment).

In particular, in the first commit of this PR (e2a83eb) I replaced <complex<double>> with <complex<double> >, and likewise for float. On my configuration this does not work and raises the following error: cupy_backends.cuda.api.driver.CUDADriverError: hipErrorNotFound: hipErrorNotFound.

In the second commit (9f4c059) I replaced <complex<double> > with <thrust::complex<double> >, and likewise for float, and the error no longer appears.

Actually, I have no idea why this works; I just followed the discussion in that thread.
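
The two commits amount to two string rewrites of a kernel name expression, sketched below. Both helper functions and the kernel name `apply_gate_kernel` are hypothetical illustrations; they only manipulate strings and do not explain why the second spelling resolves under ROCm.

```python
import re


def space_closing_brackets(expr):
    """First commit: write adjacent closing brackets '>>' as '> >'."""
    # Loop so that deeper nesting such as '>>>' is also fully separated.
    while ">>" in expr:
        expr = expr.replace(">>", "> >")
    return expr


def qualify_complex(expr):
    """Second commit: prefix a bare 'complex<' with 'thrust::'."""
    # The negative lookbehind skips names that are already qualified
    # (preceded by ':') or that merely end in 'complex' (preceded by a
    # word character).
    return re.sub(r"(?<![\w:])complex<", "thrust::complex<", expr)


original = "apply_gate_kernel<complex<double>>"  # hypothetical kernel name
after_first_commit = space_closing_brackets(original)
after_second_commit = qualify_complex(after_first_commit)
print(after_first_commit)
print(after_second_commit)
```

Applying `qualify_complex` twice leaves the string unchanged, so the rewrite is safe to run on names that are already qualified.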

@scarrazza
Member

OK, thanks. I am fine with the current implementation if the performance on NVIDIA does not change.

@mlazzarin
Contributor Author

Ok, I'll double-check the performance.

@mlazzarin
Contributor Author

mlazzarin commented Oct 26, 2021

I ran some benchmarks (EDIT: on an NVIDIA GPU). I also included the main branch, so that we can see the impact of the multiqubitgpu branch on performance as well.

The simulation times are quite similar across the three branches.
The dry run overhead is similar between fixrocm and multiqubitgpu (roughly 0.71 in the tables below), but larger than that of the main branch (roughly 0.51).
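
For readers unfamiliar with the term, here is a minimal sketch of how a dry run overhead could be measured. It assumes the overhead is the time of the first call minus the mean time of the following calls; that definition, the function name, and the `clock` parameter are assumptions and are not taken from the benchmark script used here.

```python
import time


def dry_run_overhead(run, repeats=3, clock=time.perf_counter):
    """Return (simulation_time, overhead) for a zero-argument callable.

    Assumed definition: overhead = first-call time - mean later-call time.
    """
    start = clock()
    run()  # first call: may include compilation and other one-off setup
    dry_run_time = clock() - start

    timings = []
    for _ in range(repeats):
        start = clock()
        run()
        timings.append(clock() - start)
    simulation_time = sum(timings) / len(timings)
    return simulation_time, dry_run_time - simulation_time


# Toy CPU workload standing in for a circuit execution. For GPU code,
# `run` would also need to synchronize the device before returning,
# otherwise only the kernel launch is timed.
simulation_time, overhead = dry_run_overhead(lambda: sum(range(100_000)))
print(f"simulation time: {simulation_time:.5f}  overhead: {overhead:.5f}")
```

Passing the clock as a parameter keeps the helper easy to check with a fake timer.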

qft - simulation times
nqubits main multiqubitgpu fixrocm
3 0.00042 0.00043 0.00042
4 0.00073 0.00073 0.00070
5 0.00104 0.00102 0.00107
6 0.00147 0.00148 0.00146
7 0.00198 0.00196 0.00195
8 0.00248 0.00254 0.00244
9 0.00318 0.00304 0.00317
10 0.00387 0.00388 0.00387
11 0.00470 0.00479 0.00476
12 0.00565 0.00549 0.00566
13 0.00643 0.00662 0.00663
14 0.00752 0.00767 0.00770
15 0.00872 0.00847 0.00861
16 0.00982 0.00986 0.01018
17 0.01162 0.01116 0.01109
18 0.01270 0.01264 0.01281
19 0.01441 0.01477 0.01498
20 0.01810 0.01758 0.01833
21 0.02530 0.02483 0.02517
22 0.03671 0.03655 0.03686
23 0.05898 0.05888 0.05853
24 0.10572 0.10547 0.10608
25 0.20290 0.20228 0.20338
26 0.40923 0.40984 0.41000
27 0.84414 0.84438 0.84496
28 1.77014 1.76874 1.76983
29 3.71794 3.71742 3.71838
30 7.85706 7.85693 7.85759
variational - simulation times
nqubits main multiqubitgpu fixrocm
3 0.00042 0.00043 0.00042
4 0.00073 0.00073 0.00070
5 0.00104 0.00102 0.00107
6 0.00147 0.00148 0.00146
7 0.00198 0.00196 0.00195
8 0.00248 0.00254 0.00244
9 0.00318 0.00304 0.00317
10 0.00387 0.00388 0.00387
11 0.00470 0.00479 0.00476
12 0.00565 0.00549 0.00566
13 0.00643 0.00662 0.00663
14 0.00752 0.00767 0.00770
15 0.00872 0.00847 0.00861
16 0.00982 0.00986 0.01018
17 0.01162 0.01116 0.01109
18 0.01270 0.01264 0.01281
19 0.01441 0.01477 0.01498
20 0.01810 0.01758 0.01833
21 0.02530 0.02483 0.02517
22 0.03671 0.03655 0.03686
23 0.05898 0.05888 0.05853
24 0.10572 0.10547 0.10608
25 0.20290 0.20228 0.20338
26 0.40923 0.40984 0.41000
27 0.84414 0.84438 0.84496
28 1.77014 1.76874 1.76983
29 3.71794 3.71742 3.71838
30 7.85706 7.85693 7.85759
bv - simulation times
nqubits main multiqubitgpu fixrocm
3 0.00038 0.00040 0.00039
4 0.00052 0.00052 0.00052
5 0.00065 0.00064 0.00065
6 0.00077 0.00078 0.00079
7 0.00089 0.00093 0.00090
8 0.00100 0.00104 0.00103
9 0.00116 0.00116 0.00117
10 0.00133 0.00133 0.00132
11 0.00150 0.00153 0.00153
12 0.00172 0.00165 0.00168
13 0.00176 0.00179 0.00181
14 0.00198 0.00197 0.00196
15 0.00210 0.00211 0.00206
16 0.00228 0.00228 0.00222
17 0.00239 0.00241 0.00243
18 0.00282 0.00278 0.00272
19 0.00353 0.00366 0.00362
20 0.00517 0.00513 0.00512
21 0.00799 0.00796 0.00804
22 0.01379 0.01379 0.01376
23 0.02577 0.02590 0.02578
24 0.05074 0.05071 0.05063
25 0.10251 0.10255 0.10256
26 0.21009 0.21012 0.21009
27 0.43246 0.43246 0.43247
28 0.89208 0.89218 0.89226
29 1.84328 1.84265 1.84303
30 3.80665 3.80711 3.80684
supremacy - simulation times
nqubits main multiqubitgpu fixrocm
3 0.00050 0.00050 0.00050
4 0.00059 0.00062 0.00061
5 0.00072 0.00072 0.00071
6 0.00091 0.00086 0.00087
7 0.00102 0.00103 0.00101
8 0.00110 0.00111 0.00116
9 0.00126 0.00125 0.00127
10 0.00145 0.00148 0.00148
11 0.00170 0.00172 0.00172
12 0.00184 0.00181 0.00181
13 0.00196 0.00193 0.00193
14 0.00222 0.00215 0.00215
15 0.00222 0.00226 0.00227
16 0.00242 0.00247 0.00251
17 0.00259 0.00249 0.00261
18 0.00307 0.00309 0.00312
19 0.00401 0.00413 0.00406
20 0.00567 0.00570 0.00563
21 0.00899 0.00902 0.00904
22 0.01580 0.01592 0.01578
23 0.02963 0.02968 0.02962
24 0.05829 0.05838 0.05837
25 0.11844 0.11843 0.11843
26 0.24500 0.24480 0.24487
27 0.50053 0.50056 0.50070
28 1.03131 1.03162 1.03103
29 2.13265 2.13260 2.13229
30 4.43731 4.43699 4.43743
qv - simulation times
nqubits main multiqubitgpu fixrocm
3 0.00055 0.00056 0.00057
4 0.00109 0.00104 0.00104
5 0.00103 0.00103 0.00105
6 0.00153 0.00155 0.00152
7 0.00154 0.00153 0.00155
8 0.00203 0.00204 0.00204
9 0.00205 0.00212 0.00210
10 0.00262 0.00264 0.00259
11 0.00266 0.00282 0.00278
12 0.00326 0.00328 0.00335
13 0.00331 0.00331 0.00323
14 0.00392 0.00385 0.00390
15 0.00386 0.00385 0.00388
16 0.00429 0.00446 0.00440
17 0.00442 0.00430 0.00437
18 0.00554 0.00560 0.00552
19 0.00673 0.00666 0.00678
20 0.00978 0.00991 0.00994
21 0.01471 0.01472 0.01484
22 0.02682 0.02685 0.02690
23 0.04789 0.04786 0.04779
24 0.09674 0.09671 0.09690
25 0.18977 0.18961 0.18981
26 0.39928 0.39922 0.39943
27 0.80149 0.80118 0.80150
28 1.71569 1.71599 1.71572
29 3.38466 3.38512 3.38455
30 7.31601 7.31566 7.31591
qft - dry run overhead (delta)
nqubits main multiqubitgpu fixrocm
3 0.50600 1.13538 1.13443
4 0.51782 0.70583 0.71342
5 0.50966 0.71327 0.70894
6 0.51338 0.71801 0.71131
7 0.51297 0.71498 0.71341
8 0.51247 0.71702 0.70893
9 0.51243 0.70881 0.70851
10 0.51233 0.71408 0.71520
11 0.50800 0.71673 0.71505
12 0.50860 0.71547 0.71151
13 0.51156 0.71476 0.71485
14 0.51206 0.71056 0.71341
15 0.51082 0.71272 0.71272
16 0.51419 0.71710 0.71600
17 0.51712 0.71543 0.71881
18 0.51536 0.71398 0.71490
19 0.51353 0.72074 0.71850
20 0.51847 0.71789 0.72098
21 0.52013 0.71812 0.72064
22 0.51935 0.72013 0.72437
23 0.52238 0.72093 0.72061
24 0.51966 0.72555 0.72160
25 0.52140 0.72868 0.72414
26 0.51978 0.72428 0.72281
27 0.51812 0.72199 0.71927
28 0.51509 0.71866 0.71964
29 0.51091 0.70912 0.71425
30 0.48910 0.69470 0.69538
variational - dry run overhead (delta)
nqubits main multiqubitgpu fixrocm
3 0.50932 0.70942 0.70957
4 0.51115 0.71280 0.71235
5 0.51016 0.71519 0.71482
6 0.51240 0.71522 0.70514
7 0.50688 0.70741 0.71343
8 0.50666 0.71094 0.71117
9 0.50773 0.70722 0.70982
10 0.50776 0.71402 0.70825
11 0.51665 0.70771 0.71423
12 0.51323 0.71545 0.71442
13 0.50632 0.71600 0.71012
14 0.51356 0.71467 0.71145
15 0.50701 0.71459 0.71528
16 0.51191 0.72116 0.71520
17 0.51211 0.71485 0.71230
18 0.50663 0.71494 0.71543
19 0.51929 0.71472 0.71117
20 0.51281 0.71722 0.71047
21 0.51599 0.71583 0.70926
22 0.51343 0.71671 0.71165
23 0.51521 0.70842 0.70766
24 0.50832 0.71208 0.71228
25 0.50583 0.71475 0.70992
26 0.51356 0.71521 0.71451
27 0.51062 0.70510 0.70490
28 0.50514 0.70421 0.70642
29 0.48779 0.69141 0.69917
30 0.47347 0.67801 0.67424
bv - dry run overhead (delta)
nqubits main multiqubitgpu fixrocm
3 0.50389 0.71101 0.71201
4 0.50578 0.70876 0.70934
5 0.50564 0.71011 0.70903
6 0.50741 0.70865 0.70792
7 0.51030 0.70771 0.71134
8 0.50559 0.71036 0.70916
9 0.51019 0.70944 0.71090
10 0.51035 0.71359 0.71303
11 0.50659 0.70928 0.70955
12 0.50894 0.71110 0.71203
13 0.50907 0.70998 0.71149
14 0.50620 0.71470 0.70897
15 0.50793 0.71113 0.71075
16 0.50990 0.71100 0.70926
17 0.50641 0.71272 0.71628
18 0.50742 0.70893 0.71348
19 0.50990 0.71450 0.71516
20 0.51708 0.71440 0.71574
21 0.51125 0.71372 0.71442
22 0.50879 0.71461 0.71580
23 0.50713 0.70903 0.71162
24 0.50907 0.70802 0.70865
25 0.50843 0.71258 0.70754
26 0.50646 0.70859 0.70724
27 0.50258 0.70430 0.70739
28 0.50465 0.70688 0.70589
29 0.50231 0.70292 0.70423
30 0.49307 0.69398 0.69374
supremacy - dry run overhead (delta)
nqubits main multiqubitgpu fixrocm
3 0.50895 0.70971 0.71043
4 0.50803 0.71247 0.71221
5 0.50867 0.71171 0.71373
6 0.50887 0.71013 0.71086
7 0.50891 0.71199 0.70991
8 0.51028 0.70930 0.70998
9 0.50841 0.71422 0.71056
10 0.50698 0.71189 0.71167
11 0.50867 0.71360 0.71045
12 0.50967 0.70944 0.71022
13 0.50931 0.71132 0.71015
14 0.50990 0.71007 0.71157
15 0.50775 0.71130 0.71036
16 0.50903 0.71232 0.71380
17 0.50891 0.70984 0.71116
18 0.50955 0.71213 0.71130
19 0.51059 0.71230 0.71120
20 0.50953 0.71041 0.71246
21 0.50877 0.71270 0.71184
22 0.50994 0.71221 0.71200
23 0.51276 0.71392 0.71194
24 0.51112 0.71274 0.71493
25 0.50926 0.71004 0.71070
26 0.50770 0.70891 0.71020
27 0.50554 0.71058 0.70449
28 0.50003 0.70185 0.70343
29 0.49402 0.69845 0.69420
30 0.47374 0.67465 0.67695
qv - dry run overhead (delta)
nqubits main multiqubitgpu fixrocm
3 0.51344 0.71274 0.70779
4 0.51029 0.71115 0.71321
5 0.50925 0.70837 0.71013
6 0.50921 0.71082 0.71210
7 0.51268 0.71174 0.71035
8 0.51095 0.71131 0.71224
9 0.51031 0.71100 0.71399
10 0.51105 0.71263 0.71151
11 0.51093 0.71219 0.71526
12 0.51140 0.71173 0.71430
13 0.51256 0.71747 0.71260
14 0.51218 0.71464 0.71713
15 0.51327 0.71229 0.71348
16 0.51469 0.71368 0.71219
17 0.51575 0.71638 0.71581
18 0.51401 0.71165 0.71535
19 0.51752 0.71422 0.71572
20 0.51340 0.71548 0.71394
21 0.51353 0.71906 0.71586
22 0.51371 0.71465 0.71638
23 0.51654 0.71913 0.71523
24 0.51528 0.71613 0.71690
25 0.51305 0.71642 0.71747
26 0.51215 0.71330 0.71667
27 0.51107 0.71343 0.71095
28 0.50792 0.70824 0.70956
29 0.49898 0.70139 0.69964
30 0.47820 0.68516 0.68069
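
As a quick check of the "quite similar" claim, the sketch below computes the relative difference between fixrocm and multiqubitgpu at 30 qubits, using simulation times copied from the tables above. The helper function and the choice of the 30-qubit rows are illustrative only.

```python
# (multiqubitgpu, fixrocm) simulation times at nqubits = 30, copied from
# the qft, bv, supremacy and qv tables above.
SIM_TIMES_30 = {
    "qft": (7.85693, 7.85759),
    "bv": (3.80711, 3.80684),
    "supremacy": (4.43699, 4.43743),
    "qv": (7.31566, 7.31591),
}


def relative_difference(reference, candidate):
    """Return (candidate - reference) / reference."""
    return (candidate - reference) / reference


relative = {
    name: relative_difference(reference, candidate)
    for name, (reference, candidate) in SIM_TIMES_30.items()
}
for name, value in relative.items():
    print(f"{name}: {value:+.2e}")

# Largest deviation in absolute value across the four circuits.
worst = max(abs(value) for value in relative.values())
print(f"largest relative difference: {worst:.2e}")
```

For these rows every relative difference is below 0.1% in absolute value.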

@scarrazza
Member

@mlazzarin thanks. All these numbers refer to the Radeon VII, correct?

@mlazzarin
Contributor Author

Sorry, I didn't mention it. These numbers refer to an NVIDIA GPU; the goal was to check whether the performance on NVIDIA changes.

@scarrazza
Member

OK, good, so NVIDIA performance is unaffected compared to the multiqubitgpu branch.

@mlazzarin
Contributor Author

Yes, exactly.

@codecov

codecov bot commented Oct 26, 2021

Codecov Report

Merging #33 (590e80c) into multiqubitgpu (402fd9c) will not change coverage.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           multiqubitgpu       #33   +/-   ##
===============================================
  Coverage         100.00%   100.00%           
===============================================
  Files                  9         9           
  Lines                758       758           
===============================================
  Hits                 758       758           
Flag Coverage Δ
unittests 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/qibojit/custom_operators/backends.py 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 402fd9c...590e80c.

@mlazzarin
Contributor Author

@scarrazza shall we merge this?

@scarrazza scarrazza merged commit 991eed8 into multiqubitgpu Oct 26, 2021
@scarrazza scarrazza deleted the fixrocm branch February 11, 2022 17:43