
Multi-qubit kernel for GPU #23

Merged
merged 26 commits into multistr from multiqubitgpu on Nov 5, 2021
Conversation

stavros11
Member

Closes #22. As we discussed, this is missing the proper buffer implementation.

Currently I use a buffer that has the same size as the full state vector as a placeholder, to make sure that the rest of the code works okay. Ideally this vector should have size (nthreads * nsubstates,) and we need to index it properly so that each thread uses a different set of values. Alternatively, we could try initializing a vector of size (nsubstates,) within the kernel, but I searched a little bit and I am not sure if this is good practice in CUDA.
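For illustration, a minimal sketch (not the actual qibojit kernel; the `indices` table and the parameter names here are hypothetical simplifications, and the real kernels compute indices on the fly from the qubit masks) of how a preallocated global buffer of size (nthreads * nsubstates,) could be indexed so that each thread works on its own slice:

```cpp
#include <thrust/complex.h>

// Sketch only: "indices" is a hypothetical precomputed index table;
// nstates here is the number of independent gate applications.
__global__ void apply_multiqubit_sketch(thrust::complex<double>* state,
                                        thrust::complex<double>* buffer,      // size: nthreads * nsubstates
                                        const thrust::complex<double>* gate,  // (2^t x 2^t) matrix
                                        const long* indices,
                                        int nsubstates, long nstates) {
  const long tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nstates) return;
  // Each thread uses its own, non-overlapping slice of the global buffer.
  thrust::complex<double>* buf = buffer + tid * nsubstates;

  // Gather the 2^t amplitudes touched by this gate application.
  for (int i = 0; i < nsubstates; i++)
    buf[i] = state[indices[tid * nsubstates + i]];

  // Dense matrix-vector product with the gate matrix, written back in place.
  for (int i = 0; i < nsubstates; i++) {
    thrust::complex<double> acc(0.0, 0.0);
    for (int j = 0; j < nsubstates; j++)
      acc += gate[i * nsubstates + j] * buf[j];
    state[indices[tid * nsubstates + i]] = acc;
  }
}
```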

@mlazzarin
Contributor

Concerning the buffer implementation, I see three options:

  • Duplicate the state vector
  • Dynamically allocate local arrays of size (nsubstates,) in the kernel
  • Allocate arrays of size (nsubstates*block_size,) in the shared memory of each block (such memory is on-chip so it's very fast but also limited in size).

I'll look into the three options: the first is the simplest but wastes memory, the second adds allocation overhead and may hit memory limits (although it may be useful in hard-coded kernels), and the third may also run into memory limits (a sketch of the shared-memory variant is below).
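For reference, a rough sketch of how the shared-memory option could look (hypothetical names; the gather and matrix-vector parts are the same as in a global-buffer kernel). The dynamic shared allocation must fit blockDim.x * nsubstates complex values within the per-block shared-memory limit:

```cpp
#include <thrust/complex.h>

// Sketch only: per-thread buffer slices carved out of the block's dynamic
// shared memory.  Launch with the shared size as the third <<<>>> parameter:
//   kernel<<<nblocks, block_size,
//            block_size * nsubstates * sizeof(thrust::complex<double>)>>>(...)
__global__ void apply_multiqubit_shared_sketch(thrust::complex<double>* state,
                                               const thrust::complex<double>* gate,
                                               const long* indices,  // hypothetical index table
                                               int nsubstates, long nstates) {
  extern __shared__ unsigned char smem[];
  thrust::complex<double>* shared_buffer =
      reinterpret_cast<thrust::complex<double>*>(smem);
  const long tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nstates) return;
  // Each thread takes its own slice of the block's shared buffer.
  thrust::complex<double>* buf = shared_buffer + threadIdx.x * nsubstates;
  for (int i = 0; i < nsubstates; i++)
    buf[i] = state[indices[tid * nsubstates + i]];
  for (int i = 0; i < nsubstates; i++) {
    thrust::complex<double> acc(0.0, 0.0);
    for (int j = 0; j < nsubstates; j++)
      acc += gate[i * nsubstates + j] * buf[j];
    state[indices[tid * nsubstates + i]] = acc;
  }
}
```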

@scarrazza
Member

@mlazzarin sounds reasonable; in particular, if you could tabulate performance and memory for each approach that would be great.

@codecov

codecov bot commented Oct 12, 2021

Codecov Report

Merging #23 (3183df2) into multistr (a703c0b) will not change coverage.
The diff coverage is 100.00%.


```
@@            Coverage Diff             @@
##           multistr       #23   +/-   ##
==========================================
  Coverage    100.00%   100.00%
==========================================
  Files             9         9
  Lines           983       983
==========================================
  Hits            983       983
```
| Flag | Coverage Δ |
| --- | --- |
| unittests | 100.00% <100.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| src/qibojit/custom_operators/backends.py | 100.00% <ø> (ø) |
| src/qibojit/tests/test_gates.py | 100.00% <100.00%> (ø) |


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mlazzarin
Contributor

mlazzarin commented Oct 12, 2021

To sum up, we can proceed with different options:

  1. Duplicate the state vector and use it as a buffer: the option currently implemented in this PR. Simple to write and maintain, but we lose one qubit of maximum circuit size.
  2. Dynamically allocate the buffer inside the kernels: there is overhead for the allocation, and it raises memory issues when too much memory is allocated inside the kernel (unless we raise the CUDA heap size limit; see the sketch below).
  3. Implement hard-coded kernels, each with its own local buffer: it works only for three-qubit gates and raises memory issues beyond that (at least in my configuration). We may keep the three-qubit implementation.
  4. Allocate the buffer in shared memory within each block of threads: shared memory is fast because it is on-chip, but it is limited to something like 40-60 KB, and we need to allocate nthreads_per_block * 2^ntargets * 16 B, so the number of threads per block must be reduced a lot.
  5. Allocate the buffers in global memory and use one per thread: similar to duplicating the state vector, we need a buffer of size nthreads * 2^ntargets. If we use nthreads = nstates this is a problem, but we may implement the batch computation of [WIP] Kernels that update the state vector in batches #25 (which seems a bit slower, but not much) and reduce the number of threads to something like 5000-10000 to solve the memory issues. We may implement the batch computation only for n-qubit gates with n > 3.

EDIT: To me it seems like a combination of options 1, 3 and 5 is the best one. I'll provide some preliminary benchmarks soon.
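Regarding option 2, a minimal illustrative sketch of device-side allocation and of raising the CUDA heap limit mentioned above; the kernel name and `nsubstates` handling are placeholders:

```cpp
#include <cuda_runtime.h>
#include <thrust/complex.h>

// Option 2 sketch: each thread allocates its own buffer with device-side
// malloc.  The allocation comes from a device heap that is small by default
// (on the order of 8 MB), so it usually has to be raised before launching.
__global__ void kernel_with_device_malloc(int nsubstates) {
  thrust::complex<double>* buf = static_cast<thrust::complex<double>*>(
      malloc(nsubstates * sizeof(thrust::complex<double>)));
  if (buf == nullptr) return;  // allocation fails if the heap is too small
  // ... gather amplitudes into buf and apply the gate ...
  free(buf);
}

int main() {
  // Raise the device heap limit (here to 1 GB) before launching the kernel.
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t(1) << 30);
  kernel_with_device_malloc<<<1024, 256>>>(1 << 6);  // e.g. 6 targets
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```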

@scarrazza
Member

@mlazzarin thanks for these suggestions and tests. I also suspect that 1, 3 and 5 are acceptable solutions. Most likely 1 is the easiest to maintain.

@mlazzarin
Contributor

I've performed a simple benchmark on my laptop, to have an approximate idea of the performance. I'll provide more precise benchmarks later on.

26 qubits

| ntargets | simulation_times_mean | simulation_times_mean 2 | simulation_times_mean 3+5 | simulation_times_mean 3+5 (diff. params) | simulation_times_mean 4 |
| --- | --- | --- | --- | --- | --- |
| 3.0 | 0.501093864440918 | 3.7012813091278076 | 0.47490787506103516 | 0.4949676990509033 | 0.47093701362609863 |
| 4.0 | 0.5783326625823975 | 2.1077096462249756 | 0.8769850730895996 | 0.6424994468688965 | 0.5471789836883545 |
| 5.0 | 0.7357909679412842 | 1.689469814300537 | 1.580549955368042 | 0.8523507118225098 | 0.7668726444244385 |
| 6.0 | 1.0668115615844727 | 3.3204479217529297 | 4.937393426895142 | 1.3008100986480713 | 1.548600673675537 |
| 7.0 | 1.7232403755187988 | 12.982528448104858 | 9.024909019470215 | 2.1566641330718994 | nan |
| 8.0 | 2.96152925491333 | 27.363354682922363 | 17.62743639945984 | 3.904561996459961 | nan |
| 9.0 | 5.6248955726623535 | 54.49666380882263 | 34.962459325790405 | 7.134665489196777 | nan |
| 10.0 | 11.253756523132324 | nan | 69.19841933250427 | 14.528671741485596 | nan |

27 qubits (method 1 requires too much memory)

| ntargets | simulation_times_mean 2 | simulation_times_mean 3+5 | simulation_times_mean 3+5 (diff. params) | simulation_times_mean 4 |
| --- | --- | --- | --- | --- |
| 3.0 | 7.343163013458252 | 0.963958740234375 | 0.995358943939209 | 0.9702143669128418 |
| 4.0 | 4.401804208755493 | 1.754577875137329 | 1.347081184387207 | 1.1217937469482422 |
| 5.0 | 3.5388331413269043 | 3.071789503097534 | 1.8756260871887207 | 1.561262607574463 |
| 6.0 | 8.072781085968018 | 8.924863338470459 | 2.5406107902526855 | 3.124509811401367 |
| 7.0 | 25.89661717414856 | 17.724601984024048 | 4.496196985244751 | nan |
| 8.0 | 51.02249789237976 | 34.91356921195984 | 7.6825950145721436 | nan |
| 9.0 | 104.09557867050171 | 70.08783483505249 | 14.059113502502441 | nan |
| 10.0 | nan | 140.3752624988556 | 26.748580932617188 | nan |

Concerning three-qubit gates, the hard-coded kernel is the best.
Excluding that, option 1 is always the fastest, at the expense of losing one qubit of maximum circuit size.
The other options have their own drawbacks: option 5 is slower and depends on the configuration of nblocks and nthreads, option 2 is very slow, and option 4 works only up to 6 targets on my laptop.

If we also consider maintainability, I think the best choice may be option 1 plus the hard-coded three-qubit kernel. If we want to complicate the code, we could also implement one of the other options and fall back to it in case the memory overhead of option 1 prevents the execution of the circuit.

@mlazzarin
Contributor

I've performed additional benchmarks on a more powerful machine.

29 qubits: simulation times

| ntargets | Simulation time CPU | Simulation time CPU Qiskit | Simulation time GPU 1 | Simulation time GPU 2 | Simulation time GPU 3 | Simulation time GPU 4 | Simulation time GPU 5 | Simulation time GPU Qiskit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 4.37162 | 4.41568 | 0.87840 | 12.97101 | 0.83416 | 0.85116 | 0.87027 | 5.16430 |
| 4 | 4.48415 | 4.54203 | 0.98535 | 7.02323 | 0.92451 | 0.96455 | 1.00448 | 5.16169 |
| 5 | 5.48248 | 4.81841 | 1.19414 | 4.33198 | 1.10545 | 1.54272 | 1.27370 | 5.15485 |
| 6 | 5.86040 | 5.36233 | 1.63869 | 6.02275 | nan | 2.28142 | 3.12753 | 5.14882 |
| 7 | 6.09441 | 6.27977 | 2.52201 | 10.72576 | nan | nan | 5.70711 | 5.71034 |
| 8 | 7.32556 | 10.95151 | 4.29162 | nan | nan | nan | 11.78047 | 7.19121 |
| 9 | 10.10311 | 18.89987 | 8.24116 | nan | nan | nan | 24.46231 | 10.23238 |
| 10 | 15.76016 | 65.62365 | 17.28352 | nan | nan | nan | 47.32037 | 16.72979 |
| 11 | 31.98329 | 430.08204 | 37.82665 | nan | nan | nan | 94.30802 | not performed |
| 12 | 110.54395 | 1236.13424 | 72.91512 | nan | nan | nan | 192.27257 | not performed |

30 qubits: simulation times

| ntargets | Simulation time CPU | Simulation time CPU Qiskit | Simulation time GPU 1 | Simulation time GPU 2 | Simulation time GPU 3 | Simulation time GPU 4 | Simulation time GPU 5 | Simulation time GPU Qiskit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 8.98304 | 9.15258 | 1.79254 | 26.01708 | 1.71656 | 1.73892 | 1.78816 | 10.27811 |
| 4 | 9.24200 | 9.34022 | 2.01931 | 14.17244 | 1.89794 | 1.92429 | 2.05220 | 10.29524 |
| 5 | 11.26570 | 9.90266 | 2.43439 | 8.67897 | 2.26075 | 2.58267 | 2.59685 | 10.31337 |
| 6 | 12.43468 | 11.00892 | 3.31824 | 11.46821 | nan | 4.52018 | 6.30734 | 10.32503 |
| 7 | 12.49654 | 12.84493 | 5.09171 | 21.56444 | nan | nan | 11.31291 | 11.50012 |
| 8 | 14.95679 | 22.73134 | 8.53706 | nan | nan | nan | 23.57340 | 14.40164 |
| 9 | 20.52620 | 39.12058 | 16.06461 | nan | nan | nan | 48.98731 | 20.48101 |
| 10 | 31.42744 | 130.09857 | 32.26213 | nan | nan | nan | 94.83861 | 32.93433 |
| 11 | 63.51460 | 832.37259 | 67.39161 | nan | nan | nan | 188.75589 | not performed |
| 12 | 201.67062 | 2415.66124 | 147.86115 | nan | nan | nan | 384.64494 | not performed |

18 qubits: dry run time - simulation time

| ntargets | delta CPU | delta CPU Qiskit | delta GPU 1 | delta GPU 2 | delta GPU 3 | delta GPU 4 | delta GPU 5 | delta GPU Qiskit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 0.13864 | 0.02050 | 0.66376 | 0.55524 | 0.85127 | 0.78859 | 0.84690 | 0.00309 |
| 4 | 0.16576 | 0.02005 | 0.66619 | 0.55557 | 0.85902 | 0.78326 | 0.86043 | 0.00299 |
| 5 | 0.13019 | 0.01383 | 0.66636 | 0.55317 | 0.86737 | 0.77655 | 0.86716 | 0.00244 |
| 6 | 0.14925 | 0.12211 | 0.65860 | 0.55048 | nan | 0.66763 | 0.85594 | 0.00265 |
| 7 | 0.13032 | 0.13500 | 0.66649 | 0.54820 | nan | nan | 0.84973 | 0.00364 |
| 8 | 0.13304 | 0.11849 | 0.66336 | 0.55268 | nan | nan | 0.85293 | 0.01142 |
| 9 | 0.14281 | 0.07636 | 0.68302 | 0.55383 | nan | nan | 0.86233 | 0.05565 |
| 10 | 0.14997 | 0.07241 | 0.65187 | 0.55876 | nan | nan | 0.85332 | 0.03711 |
| 11 | 0.13002 | 0.04130 | 0.70330 | 0.56513 | nan | nan | 0.88719 | not performed |
| 12 | 0.14162 | -0.06923 | 0.74065 | 0.59264 | nan | nan | 0.92795 | not performed |

I've managed to use hard-coded kernels up to 5 targets (option 3) by reducing the number of threads per block (just for those specific calls). It is the fastest option, even though the overhead of the dry run increases a bit.

Note that Qiskit seems very slow with more than 10 targets, and its GPU results look a bit strange. Maybe there's something wrong with my implementation, I don't know. I had to skip the Qiskit GPU results for n > 10 due to time limits.

Also, the results with option 2 are a bit weird; I'll double-check them tomorrow.

Given these results, I'd say that the best option is to use hard-coded kernels for n <= 5 (option 3) and duplicate the state vector for n > 5 (option 1). Moreover, in this way we don't need to specify a hard-coded number of blocks (while we still need to hard-code the number of threads).
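For context, a sketch of what a hard-coded three-qubit kernel looks like in spirit (simplified, with a hypothetical precomputed `indices` table): since nsubstates = 2^3 = 8 is known at compile time, the buffer is a fixed-size local array and no external buffer is needed.

```cpp
#include <thrust/complex.h>

// Sketch of the hard-coded path: the 2^3 = 8 amplitudes fit in a fixed-size
// local array, so no global or shared buffer is required.
__global__ void apply_three_qubit_sketch(thrust::complex<double>* state,
                                         const thrust::complex<double>* gate,  // 8x8 matrix
                                         const long* indices,  // hypothetical index table
                                         long nstates) {
  const long tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nstates) return;
  thrust::complex<double> buf[8];  // compile-time size -> registers/local memory
  for (int i = 0; i < 8; i++)
    buf[i] = state[indices[8 * tid + i]];
  for (int i = 0; i < 8; i++) {
    thrust::complex<double> acc(0.0, 0.0);
    for (int j = 0; j < 8; j++)
      acc += gate[8 * i + j] * buf[j];
    state[indices[8 * tid + i]] = acc;
  }
}
```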

Comment on lines 8 to 26
```cpp
// Multiplies two complex numbers
__device__ thrust::complex<double> cmult(thrust::complex<double> a, thrust::complex<double> b) {
  return thrust::complex<double>(a.real() * b.real() - a.imag() * b.imag(),
                                 a.real() * b.imag() + a.imag() * b.real());
}

__device__ thrust::complex<float> cmult(thrust::complex<float> a, thrust::complex<float> b) {
  return thrust::complex<float>(a.real() * b.real() - a.imag() * b.imag(),
                                a.real() * b.imag() + a.imag() * b.real());
}

// Adds two complex numbers
__device__ thrust::complex<double> cadd(thrust::complex<double> a, thrust::complex<double> b) {
  return thrust::complex<double>(a.real() + b.real(), a.imag() + b.imag());
}

__device__ thrust::complex<float> cadd(thrust::complex<float> a, thrust::complex<float> b) {
  return thrust::complex<float>(a.real() + b.real(), a.imag() + b.imag());
}
```
Contributor

@scarrazza is there a reason why we didn't use templates for these functions?

Member

Not really, I believe we can use templates. Could you please give it a try?
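For reference, a minimal sketch of what the templated helpers could look like (illustrative only, not necessarily the exact code that was committed):

```cpp
#include <thrust/complex.h>

// A single templated helper replaces the separate float/double overloads.
template <typename T>
__device__ thrust::complex<T> cmult(thrust::complex<T> a, thrust::complex<T> b) {
  return thrust::complex<T>(a.real() * b.real() - a.imag() * b.imag(),
                            a.real() * b.imag() + a.imag() * b.real());
}

template <typename T>
__device__ thrust::complex<T> cadd(thrust::complex<T> a, thrust::complex<T> b) {
  return thrust::complex<T>(a.real() + b.real(), a.imag() + b.imag());
}
```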

Contributor

It seems to work well.

Member

Btw, maybe these functions could be replaced with some thrust implementation, right?

Contributor

@mlazzarin mlazzarin Oct 26, 2021

Yes, I think so, maybe it will also reduce the compilation time. I will try.
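For illustration: thrust::complex already overloads the arithmetic operators, so the helpers could in principle be dropped in favour of plain `*` and `+` (a sketch, with a hypothetical function name, not the exact change made in this PR):

```cpp
#include <thrust/complex.h>

// What a kernel computes as cadd(cmult(g0, a0), cmult(g1, a1)) can be
// written directly with thrust's built-in operators.
template <typename T>
__device__ thrust::complex<T> apply_pair(thrust::complex<T> g0, thrust::complex<T> g1,
                                         thrust::complex<T> a0, thrust::complex<T> a1) {
  return g0 * a0 + g1 * a1;  // operator* / operator+ from thrust/complex.h
}
```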

Contributor

Ok, (nearly) all tests are fine (both those of this repository and those of qibo, including the examples) for both NVIDIA and AMD GPUs. Some tests fail with the latter, but I'll take care of them in another issue (#34).

Tomorrow I'll double-check my implementation and see if the dry run overhead changed.

Contributor

I compared the performance between commit 991eed8 and the latest commit, on an NVIDIA GPU.

Simulation times are not impacted by the changes. The dry run overhead is maybe a bit higher, but the difference is very small, and now the kernels are more readable.

qft - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00044 0.00043
4 0.00072 0.00072
5 0.00103 0.00106
6 0.00148 0.00151
7 0.00193 0.00187
8 0.00251 0.00249
9 0.00318 0.00317
10 0.00388 0.00382
11 0.00477 0.00471
12 0.00582 0.00575
13 0.00701 0.00632
14 0.00776 0.00751
15 0.00861 0.00851
16 0.00985 0.00958
17 0.01142 0.01130
18 0.01270 0.01226
19 0.01449 0.01429
20 0.01860 0.01853
21 0.02566 0.02442
22 0.03653 0.03638
23 0.05881 0.05849
24 0.10524 0.10577
25 0.20298 0.20303
26 0.40956 0.40973
27 0.84477 0.84448
28 1.77029 1.76949
29 3.71816 3.71843
30 7.85646 7.85749
variational - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00041 0.00041
4 0.00059 0.00058
5 0.00067 0.00067
6 0.00089 0.00091
7 0.00092 0.00091
8 0.00111 0.00110
9 0.00120 0.00118
10 0.00136 0.00142
11 0.00156 0.00154
12 0.00170 0.00179
13 0.00182 0.00188
14 0.00211 0.00204
15 0.00214 0.00214
16 0.00234 0.00230
17 0.00245 0.00236
18 0.00289 0.00291
19 0.00357 0.00357
20 0.00489 0.00495
21 0.00748 0.00762
22 0.01309 0.01300
23 0.02367 0.02376
24 0.04685 0.04684
25 0.09353 0.09345
26 0.19289 0.19316
27 0.39304 0.39327
28 0.81734 0.81717
29 1.67333 1.67299
30 3.48082 3.48066
bv - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00040 0.00038
4 0.00053 0.00052
5 0.00065 0.00066
6 0.00078 0.00078
7 0.00092 0.00091
8 0.00105 0.00105
9 0.00117 0.00118
10 0.00131 0.00134
11 0.00148 0.00156
12 0.00169 0.00171
13 0.00184 0.00180
14 0.00189 0.00201
15 0.00209 0.00214
16 0.00224 0.00227
17 0.00237 0.00234
18 0.00265 0.00274
19 0.00364 0.00365
20 0.00514 0.00503
21 0.00801 0.00805
22 0.01389 0.01380
23 0.02588 0.02580
24 0.05067 0.05071
25 0.10253 0.10257
26 0.21017 0.20995
27 0.43234 0.43248
28 0.89246 0.89234
29 1.84317 1.84307
30 3.80694 3.80670
supremacy - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00049 0.00047
4 0.00061 0.00060
5 0.00071 0.00070
6 0.00090 0.00089
7 0.00103 0.00099
8 0.00113 0.00113
9 0.00128 0.00127
10 0.00146 0.00146
11 0.00173 0.00171
12 0.00181 0.00187
13 0.00196 0.00198
14 0.00219 0.00212
15 0.00233 0.00231
16 0.00239 0.00255
17 0.00259 0.00256
18 0.00308 0.00312
19 0.00403 0.00409
20 0.00563 0.00567
21 0.00894 0.00895
22 0.01585 0.01591
23 0.02969 0.02972
24 0.05832 0.05837
25 0.11840 0.11834
26 0.24483 0.24496
27 0.50071 0.50065
28 1.03147 1.03117
29 2.13266 2.13229
30 4.43612 4.43687
qv - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00055 0.00055
4 0.00105 0.00105
5 0.00103 0.00102
6 0.00155 0.00160
7 0.00150 0.00157
8 0.00203 0.00199
9 0.00204 0.00203
10 0.00263 0.00260
11 0.00277 0.00283
12 0.00331 0.00324
13 0.00328 0.00332
14 0.00386 0.00389
15 0.00396 0.00381
16 0.00443 0.00444
17 0.00440 0.00440
18 0.00553 0.00552
19 0.00677 0.00679
20 0.00976 0.00990
21 0.01480 0.01482
22 0.02683 0.02672
23 0.04771 0.04792
24 0.09675 0.09656
25 0.18975 0.18979
26 0.39937 0.39940
27 0.80150 0.80146
28 1.71590 1.71595
29 3.38466 3.38493
30 7.31564 7.31548
qft - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 1.13092 1.13790
4 0.71047 0.71884
5 0.71031 0.71928
6 0.71481 0.71649
7 0.70863 0.72005
8 0.70899 0.71804
9 0.71400 0.71668
10 0.71852 0.71835
11 0.71023 0.72112
12 0.70969 0.71989
13 0.71568 0.71456
14 0.71226 0.71509
15 0.71486 0.72075
16 0.71409 0.71657
17 0.71541 0.72047
18 0.71804 0.72197
19 0.72080 0.72394
20 0.72220 0.72338
21 0.72299 0.72673
22 0.72623 0.72765
23 0.72267 0.72840
24 0.72529 0.73085
25 0.72182 0.72296
26 0.72701 0.72603
27 0.72602 0.73006
28 0.71864 0.72659
29 0.70956 0.71191
30 0.69071 0.70341
variational - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.71415 0.72038
4 0.70769 0.71649
5 0.70937 0.71812
6 0.70958 0.71555
7 0.70727 0.71559
8 0.70720 0.71885
9 0.70969 0.71751
10 0.71365 0.71747
11 0.71844 0.71616
12 0.71788 0.71990
13 0.71002 0.71771
14 0.71480 0.71783
15 0.71852 0.71724
16 0.71529 0.71917
17 0.71499 0.71951
18 0.71080 0.71586
19 0.71067 0.72309
20 0.71775 0.71293
21 0.71109 0.72117
22 0.71793 0.71935
23 0.71640 0.72139
24 0.71106 0.71933
25 0.71344 0.71769
26 0.70721 0.72180
27 0.71279 0.71130
28 0.70703 0.70995
29 0.69703 0.70370
30 0.67889 0.68067
bv - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.70693 0.70872
4 0.70554 0.71479
5 0.70762 0.71364
6 0.70775 0.71065
7 0.71237 0.71555
8 0.71490 0.71482
9 0.70930 0.71308
10 0.71173 0.71508
11 0.70676 0.71636
12 0.71702 0.71578
13 0.71345 0.71600
14 0.70832 0.71245
15 0.72123 0.71359
16 0.71067 0.71403
17 0.71310 0.71767
18 0.71147 0.71437
19 0.71051 0.71625
20 0.71215 0.71688
21 0.71719 0.71342
22 0.71227 0.71678
23 0.71176 0.71218
24 0.70776 0.71637
25 0.70673 0.71207
26 0.71570 0.71079
27 0.71087 0.71212
28 0.70449 0.71039
29 0.70249 0.70672
30 0.69658 0.70001
supremacy - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.70985 0.71530
4 0.71563 0.71162
5 0.70983 0.71305
6 0.71057 0.71416
7 0.71140 0.71252
8 0.70858 0.71158
9 0.71183 0.71507
10 0.71016 0.71395
11 0.71208 0.71326
12 0.71051 0.71362
13 0.71302 0.71593
14 0.71557 0.71449
15 0.71754 0.71594
16 0.71387 0.71398
17 0.71395 0.71546
18 0.71328 0.71520
19 0.71274 0.72063
20 0.71494 0.71321
21 0.71122 0.71630
22 0.71197 0.71504
23 0.71151 0.71616
24 0.71333 0.71774
25 0.71069 0.71447
26 0.70870 0.71649
27 0.70629 0.71073
28 0.70280 0.70524
29 0.69261 0.69725
30 0.67662 0.67756
qv - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.71142 0.71313
4 0.71076 0.71464
5 0.71482 0.71656
6 0.71078 0.71453
7 0.70992 0.71436
8 0.71111 0.71599
9 0.71156 0.71491
10 0.71524 0.71790
11 0.71223 0.71627
12 0.71253 0.71641
13 0.71605 0.71792
14 0.71840 0.71634
15 0.71330 0.72055
16 0.71573 0.71746
17 0.71763 0.71917
18 0.71394 0.71920
19 0.71437 0.71903
20 0.71500 0.72045
21 0.71883 0.72240
22 0.71666 0.72083
23 0.71968 0.72134
24 0.71736 0.72497
25 0.71923 0.72301
26 0.71585 0.71849
27 0.71657 0.71572
28 0.71002 0.71283
29 0.69706 0.70242
30 0.68073 0.68579

@mlazzarin mlazzarin marked this pull request as ready for review October 27, 2021 06:20
@mlazzarin
Contributor

Ready for review. Please double-check the kernels I wrote, as I didn't derive them from scratch but worked by analogy with the numba implementation, and I'm still confused about some function parameters and some indices.

@mlazzarin
Contributor

It seems like I cannot ask @stavros11 for the review since he's the author, so I'll do it with this comment.

Member Author

@stavros11 stavros11 left a comment

Thank you for implementing this. I have not done any performance tests as I believe this is covered in the above comments, but tests pass on my CUDA GPU (cannot check for AMD) and the kernels look ok to me.

Regarding the cadd, cmult replacement, the code is much cleaner now so if there is no big performance change, which seems to be the case from your latest benchmark, I agree with the new approach.

> It seems like I cannot ask @stavros11 for the review since he's the author, so I'll do it with this comment.

Apparently I also cannot approve the PR since I am the author, but it looks good to me. Just a few minor comments/possible simplifications below.

```python
# len(targets): 4 -> 512 threads per block
# len(targets): 5 -> 256 threads per block
nblocks, block_size = self.calculate_blocks(nstates,
                                            block_size=2**(13 - len(targets)))
```
Member Author

Just to make sure I understand this correctly: this block size is the maximum possible (so that we get the best performance) that does not lead to memory issues, right?
And is this result independent of the GPU model? No need to do any tests regarding this, I am mainly asking out of curiosity.

Contributor

Yes, it's the maximum possible (considering only powers of two). I don't know if it's independent of the GPU model; I performed the tests on my laptop.

Contributor

Ok, I confirm that it is not independent of the GPU model, as this configuration doesn't work with an NVIDIA Tesla V100 (I need to halve the block size). I need to find a smarter way to manage this before merging this PR.

Contributor

I think that I've found a smarter way in #37.
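Not necessarily what #37 does, but for illustration, plain CUDA exposes both hard device limits and an occupancy helper that can suggest a launchable block size for a given kernel instead of hard-coding 2**(13 - len(targets)); the kernel below is a placeholder:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void multiqubit_kernel_placeholder() {}  // stands in for the real kernel

int main() {
  // Hard device limit on threads per block (device 0).
  int max_threads = 0;
  cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, 0);

  // Occupancy-based suggestion: the largest block size the kernel can
  // actually be launched with on this device (registers etc. accounted for).
  int min_grid_size = 0, block_size = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                     multiqubit_kernel_placeholder, 0, 0);
  printf("device max threads/block: %d, suggested block size: %d\n",
         max_threads, block_size);
  return 0;
}
```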

@mlazzarin
Contributor

@stavros11 I addressed your comments. I also added some tests with 22 qubits for the multi-qubit gates; the CI now takes longer, so let me know if that's ok.

@mlazzarin
Contributor

I implemented a simple change in the multi-qubit kernel for n > 5, with the goal of optimizing the performance.
I performed some benchmarks; they show a small performance increase of around 1%-4%, while the results for the dry run overhead are too noisy to show a trend.

29 qubits

| ntargets | Simulation time before | Simulation time after |
| --- | --- | --- |
| 6 | 1.61395 | 1.57153 |
| 7 | 2.47546 | 2.37732 |
| 8 | 4.25149 | 4.09117 |
| 9 | 8.18834 | 8.06142 |
| 10 | 17.16740 | 16.94156 |

30 qubits

| ntargets | Simulation time before | Simulation time after |
| --- | --- | --- |
| 6 | 3.30619 | 3.22333 |
| 7 | 5.04858 | 4.93918 |
| 8 | 8.51509 | 8.38226 |
| 9 | 15.98977 | 15.75742 |
| 10 | 32.21293 | 31.64226 |

29 qubits - dry run overhead

| ntargets | delta before | delta after |
| --- | --- | --- |
| 6 | 0.84389 | 0.84887 |
| 7 | 0.84079 | 0.85194 |
| 8 | 0.82255 | 0.77665 |
| 9 | 0.77009 | 0.75778 |
| 10 | 0.70218 | 0.61692 |

30 qubits - dry run overhead

| ntargets | delta before | delta after |
| --- | --- | --- |
| 6 | 0.85006 | 0.85752 |
| 7 | 0.84390 | 0.85973 |
| 8 | 0.86111 | 0.83283 |
| 9 | 0.82162 | 0.82221 |
| 10 | 0.79891 | 0.86906 |

@stavros11
Member Author

> I implemented a simple change in the multi-qubit kernel for n > 5, with the goal of optimizing the performance. I performed some benchmarks; they show a small performance increase of around 1%-4%, while the results for the dry run overhead are too noisy to show a trend.

Thanks for this change, it makes sense. I am not sure you can do much to reduce the noise in the dry run benchmark: it only happens once per script, so you cannot do more repetitions as for the simulation times. You could only repeat the script execution multiple times and average, but I don't think it's necessary to go into such detail.

@scarrazza
Member

@mlazzarin @stavros11 could you please fix the conflicts so we can merge? Thanks.

@mlazzarin
Contributor

@stavros11 thank you for fixing the conflicts.

@stavros11
Member Author

@mlazzarin @scarrazza before merging this, please make sure that tests pass in your configurations with the GPUs enabled and disabled, particularly for AMD GPUs since I do not have access to those and the CI does not test on GPU. This looks good to me; my only concerns are the following:

  • The multi-qubit kernels increase the dry run/compilation time by around 0.3 s, which is small but non-negligible and happens regardless of whether the user actually uses these kernels or not.
  • The tests for 22 qubits take some time, at least on my machine, but since they don't break the CI it is not a big issue.

Just mentioning these to keep them in mind; I am not sure if we should/can do something at the current stage.

@mlazzarin
Contributor

> @mlazzarin @scarrazza before merging this, please make sure that tests pass in your configurations with the GPUs enabled and disabled, particularly for AMD GPUs since I do not have access to those and the CI does not test on GPU.

Ok, I'm on it.

> The multi-qubit kernels increase the dry run/compilation time by around 0.3 s, which is small but non-negligible and happens regardless of whether the user actually uses these kernels or not.

We may solve this issue by redesigning the way we load the kernels, e.g. by replacing cp.RawModule with many cp.RawKernel objects and compiling each kernel only when it is needed.

> The tests for 22 qubits take some time, at least on my machine, but since they don't break the CI it is not a big issue.

The tests for 22 qubits are required to catch some resource issues with GPUs. Given that the CI doesn't use a GPU, is it possible to implement a skipping mechanism that triggers only when a GPU is not available?

@mlazzarin
Contributor

Ok, I tried with both Intel/AMD CPUs and NVIDIA/AMD GPUs and the qibojit tests are fine.

@scarrazza
Member

@stavros11 thanks for the summary, @mlazzarin thanks for testing. Let me suggest to merge and release a 0.0.4rc0; then next week we can discuss this compilation issue with their developers.

@scarrazza scarrazza merged commit 4ff7885 into multistr Nov 5, 2021
@stavros11 stavros11 deleted the multiqubitgpu branch January 17, 2022 07:07