
Multi-qubit kernel for GPU #23

Merged
merged 26 commits into multistr from multiqubitgpu on Nov 5, 2021
Conversation

stavros11
Member

Closes #22. As we discussed, this is missing the proper buffer implementation.

Currently I use a buffer that has the same size as the full state vector as a placeholder, to make sure that the rest of the code works okay. Ideally this vector should have size (nthreads * nsubstates,) and we need to index it properly so that each thread uses a different set of values. Alternatively, we could try initializing a vector of size (nsubstates,) within the kernel, but I searched a little bit and I am not sure if this is good practice in CUDA.
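For illustration, a minimal sketch (not the actual qibojit kernel; the `indices` table and the parameter names here are hypothetical simplifications, and the real kernels compute indices on the fly from the qubit masks) of how a preallocated global buffer of size (nthreads * nsubstates,) could be indexed so that each thread works on its own slice:

```cpp
#include <thrust/complex.h>

// Sketch only: "indices" is a hypothetical precomputed index table;
// nstates here is the number of independent gate applications.
__global__ void apply_multiqubit_sketch(thrust::complex<double>* state,
                                        thrust::complex<double>* buffer,      // size: nthreads * nsubstates
                                        const thrust::complex<double>* gate,  // (2^t x 2^t) matrix
                                        const long* indices,
                                        int nsubstates, long nstates) {
  const long tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nstates) return;
  // Each thread uses its own, non-overlapping slice of the global buffer.
  thrust::complex<double>* buf = buffer + tid * nsubstates;

  // Gather the 2^t amplitudes touched by this gate application.
  for (int i = 0; i < nsubstates; i++)
    buf[i] = state[indices[tid * nsubstates + i]];

  // Dense matrix-vector product with the gate matrix, written back in place.
  for (int i = 0; i < nsubstates; i++) {
    thrust::complex<double> acc(0.0, 0.0);
    for (int j = 0; j < nsubstates; j++)
      acc += gate[i * nsubstates + j] * buf[j];
    state[indices[tid * nsubstates + i]] = acc;
  }
}
```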

@mlazzarin
Contributor

Concerning the buffer implementation, I see three options:

  • Duplicate the state vector
  • Dynamically allocate local arrays of size (nsubstates,) in the kernel
  • Allocate arrays of size (nsubstates*block_size,) in the shared memory of each block (such memory is on-chip so it's very fast but also limited in size).

I'll look into the three options: the first is the simplest but wastes memory, the second adds allocation overhead and may hit memory limits (although it may be useful in hard-coded kernels), and the third may also run into memory limits (a sketch of the shared-memory variant is below).
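For reference, a rough sketch of how the shared-memory option could look (hypothetical names; the gather and matrix-vector parts are the same as in a global-buffer kernel). The dynamic shared allocation must fit blockDim.x * nsubstates complex values within the per-block shared-memory limit:

```cpp
#include <thrust/complex.h>

// Sketch only: per-thread buffer slices carved out of the block's dynamic
// shared memory.  Launch with the shared size as the third <<<>>> parameter:
//   kernel<<<nblocks, block_size,
//            block_size * nsubstates * sizeof(thrust::complex<double>)>>>(...)
__global__ void apply_multiqubit_shared_sketch(thrust::complex<double>* state,
                                               const thrust::complex<double>* gate,
                                               const long* indices,  // hypothetical index table
                                               int nsubstates, long nstates) {
  extern __shared__ unsigned char smem[];
  thrust::complex<double>* shared_buffer =
      reinterpret_cast<thrust::complex<double>*>(smem);
  const long tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nstates) return;
  // Each thread takes its own slice of the block's shared buffer.
  thrust::complex<double>* buf = shared_buffer + threadIdx.x * nsubstates;
  for (int i = 0; i < nsubstates; i++)
    buf[i] = state[indices[tid * nsubstates + i]];
  for (int i = 0; i < nsubstates; i++) {
    thrust::complex<double> acc(0.0, 0.0);
    for (int j = 0; j < nsubstates; j++)
      acc += gate[i * nsubstates + j] * buf[j];
    state[indices[tid * nsubstates + i]] = acc;
  }
}
```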

@scarrazza
Member

@mlazzarin sounds reasonable; in particular, if you could tabulate performance and memory for each approach that would be great.

@codecov

codecov bot commented Oct 12, 2021

Codecov Report

Merging #23 (3183df2) into multistr (a703c0b) will not change coverage.
The diff coverage is 100.00%.


```
@@            Coverage Diff             @@
##           multistr       #23   +/-   ##
==========================================
  Coverage    100.00%   100.00%
==========================================
  Files             9         9
  Lines           983       983
==========================================
  Hits            983       983
```
| Flag | Coverage Δ |
| --- | --- |
| unittests | 100.00% <100.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| src/qibojit/custom_operators/backends.py | 100.00% <ø> (ø) |
| src/qibojit/tests/test_gates.py | 100.00% <100.00%> (ø) |


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mlazzarin
Contributor

mlazzarin commented Oct 12, 2021

To sum up, we can proceed with different options:

  1. Duplicate the state vector and use it as a buffer: the option currently implemented in this PR. Simple to write and maintain, but we lose one qubit of maximum circuit size.
  2. Dynamically allocate the buffer inside the kernels: there is overhead for the allocation, and it raises memory issues when too much memory is allocated inside the kernel (unless we raise the CUDA heap size limit; see the sketch below).
  3. Implement hard-coded kernels, each with its own local buffer: it works only for three-qubit gates and raises memory issues beyond that (at least in my configuration). We may keep the three-qubit implementation.
  4. Allocate the buffer in shared memory within each block of threads: shared memory is fast because it is on-chip, but it is limited to something like 40-60 KB, and we need to allocate nthreads_per_block * 2^ntargets * 16 B, so the number of threads per block must be reduced a lot.
  5. Allocate the buffers in global memory and use one per thread: similar to duplicating the state vector, we need a buffer of size nthreads * 2^ntargets. If we use nthreads = nstates this is a problem, but we may implement the batch computation of [WIP] Kernels that update the state vector in batches #25 (which seems a bit slower, but not much) and reduce the number of threads to something like 5000-10000 to solve the memory issues. We may implement the batch computation only for n-qubit gates with n > 3.

EDIT: To me it seems like a combination of options 1, 3 and 5 is the best one. I'll provide some preliminary benchmarks soon.
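Regarding option 2, a minimal illustrative sketch of device-side allocation and of raising the CUDA heap limit mentioned above; the kernel name and `nsubstates` handling are placeholders:

```cpp
#include <cuda_runtime.h>
#include <thrust/complex.h>

// Option 2 sketch: each thread allocates its own buffer with device-side
// malloc.  The allocation comes from a device heap that is small by default
// (on the order of 8 MB), so it usually has to be raised before launching.
__global__ void kernel_with_device_malloc(int nsubstates) {
  thrust::complex<double>* buf = static_cast<thrust::complex<double>*>(
      malloc(nsubstates * sizeof(thrust::complex<double>)));
  if (buf == nullptr) return;  // allocation fails if the heap is too small
  // ... gather amplitudes into buf and apply the gate ...
  free(buf);
}

int main() {
  // Raise the device heap limit (here to 1 GB) before launching the kernel.
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t(1) << 30);
  kernel_with_device_malloc<<<1024, 256>>>(1 << 6);  // e.g. 6 targets
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```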

@scarrazza
Member

@mlazzarin thanks for these suggestions and tests. I also suspect that 1, 3 and 5 are acceptable solutions. Most likely 1 is the easiest to maintain.

@mlazzarin
Contributor

I've performed a simple benchmark on my laptop, to have an approximate idea of the performance. I'll provide more precise benchmarks later on.

26 qubits

| ntargets | simulation_times_mean | simulation_times_mean 2 | simulation_times_mean 3+5 | simulation_times_mean 3+5 (diff. params) | simulation_times_mean 4 |
| --- | --- | --- | --- | --- | --- |
| 3.0 | 0.501093864440918 | 3.7012813091278076 | 0.47490787506103516 | 0.4949676990509033 | 0.47093701362609863 |
| 4.0 | 0.5783326625823975 | 2.1077096462249756 | 0.8769850730895996 | 0.6424994468688965 | 0.5471789836883545 |
| 5.0 | 0.7357909679412842 | 1.689469814300537 | 1.580549955368042 | 0.8523507118225098 | 0.7668726444244385 |
| 6.0 | 1.0668115615844727 | 3.3204479217529297 | 4.937393426895142 | 1.3008100986480713 | 1.548600673675537 |
| 7.0 | 1.7232403755187988 | 12.982528448104858 | 9.024909019470215 | 2.1566641330718994 | nan |
| 8.0 | 2.96152925491333 | 27.363354682922363 | 17.62743639945984 | 3.904561996459961 | nan |
| 9.0 | 5.6248955726623535 | 54.49666380882263 | 34.962459325790405 | 7.134665489196777 | nan |
| 10.0 | 11.253756523132324 | nan | 69.19841933250427 | 14.528671741485596 | nan |

27 qubits (method 1 requires too much memory)

| ntargets | simulation_times_mean 2 | simulation_times_mean 3+5 | simulation_times_mean 3+5 (diff. params) | simulation_times_mean 4 |
| --- | --- | --- | --- | --- |
| 3.0 | 7.343163013458252 | 0.963958740234375 | 0.995358943939209 | 0.9702143669128418 |
| 4.0 | 4.401804208755493 | 1.754577875137329 | 1.347081184387207 | 1.1217937469482422 |
| 5.0 | 3.5388331413269043 | 3.071789503097534 | 1.8756260871887207 | 1.561262607574463 |
| 6.0 | 8.072781085968018 | 8.924863338470459 | 2.5406107902526855 | 3.124509811401367 |
| 7.0 | 25.89661717414856 | 17.724601984024048 | 4.496196985244751 | nan |
| 8.0 | 51.02249789237976 | 34.91356921195984 | 7.6825950145721436 | nan |
| 9.0 | 104.09557867050171 | 70.08783483505249 | 14.059113502502441 | nan |
| 10.0 | nan | 140.3752624988556 | 26.748580932617188 | nan |

Concerning three-qubit gates, the hard-coded kernel is the best.
Excluding that, option 1 is always the fastest, at the expense of losing one qubit of maximum circuit size.
The other options have their own drawbacks: option 5 is slower and depends on the configuration of nblocks and nthreads, option 2 is very slow, and option 4 works only up to 6 targets on my laptop.

If we also consider maintainability, I think the best choice may be option 1 plus the hard-coded three-qubit kernel. If we want to complicate the code, we could also implement one of the other options and fall back to it in case the memory overhead of option 1 prevents the execution of the circuit.

@mlazzarin
Contributor

I've performed additional benchmarks on a more powerful machine.

29 qubits: simulation times

| ntargets | Simulation time CPU | Simulation time CPU Qiskit | Simulation time GPU 1 | Simulation time GPU 2 | Simulation time GPU 3 | Simulation time GPU 4 | Simulation time GPU 5 | Simulation time GPU Qiskit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 4.37162 | 4.41568 | 0.87840 | 12.97101 | 0.83416 | 0.85116 | 0.87027 | 5.16430 |
| 4 | 4.48415 | 4.54203 | 0.98535 | 7.02323 | 0.92451 | 0.96455 | 1.00448 | 5.16169 |
| 5 | 5.48248 | 4.81841 | 1.19414 | 4.33198 | 1.10545 | 1.54272 | 1.27370 | 5.15485 |
| 6 | 5.86040 | 5.36233 | 1.63869 | 6.02275 | nan | 2.28142 | 3.12753 | 5.14882 |
| 7 | 6.09441 | 6.27977 | 2.52201 | 10.72576 | nan | nan | 5.70711 | 5.71034 |
| 8 | 7.32556 | 10.95151 | 4.29162 | nan | nan | nan | 11.78047 | 7.19121 |
| 9 | 10.10311 | 18.89987 | 8.24116 | nan | nan | nan | 24.46231 | 10.23238 |
| 10 | 15.76016 | 65.62365 | 17.28352 | nan | nan | nan | 47.32037 | 16.72979 |
| 11 | 31.98329 | 430.08204 | 37.82665 | nan | nan | nan | 94.30802 | not performed |
| 12 | 110.54395 | 1236.13424 | 72.91512 | nan | nan | nan | 192.27257 | not performed |

30 qubits: simulation times

| ntargets | Simulation time CPU | Simulation time CPU Qiskit | Simulation time GPU 1 | Simulation time GPU 2 | Simulation time GPU 3 | Simulation time GPU 4 | Simulation time GPU 5 | Simulation time GPU Qiskit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 8.98304 | 9.15258 | 1.79254 | 26.01708 | 1.71656 | 1.73892 | 1.78816 | 10.27811 |
| 4 | 9.24200 | 9.34022 | 2.01931 | 14.17244 | 1.89794 | 1.92429 | 2.05220 | 10.29524 |
| 5 | 11.26570 | 9.90266 | 2.43439 | 8.67897 | 2.26075 | 2.58267 | 2.59685 | 10.31337 |
| 6 | 12.43468 | 11.00892 | 3.31824 | 11.46821 | nan | 4.52018 | 6.30734 | 10.32503 |
| 7 | 12.49654 | 12.84493 | 5.09171 | 21.56444 | nan | nan | 11.31291 | 11.50012 |
| 8 | 14.95679 | 22.73134 | 8.53706 | nan | nan | nan | 23.57340 | 14.40164 |
| 9 | 20.52620 | 39.12058 | 16.06461 | nan | nan | nan | 48.98731 | 20.48101 |
| 10 | 31.42744 | 130.09857 | 32.26213 | nan | nan | nan | 94.83861 | 32.93433 |
| 11 | 63.51460 | 832.37259 | 67.39161 | nan | nan | nan | 188.75589 | not performed |
| 12 | 201.67062 | 2415.66124 | 147.86115 | nan | nan | nan | 384.64494 | not performed |

18 qubits: dry run time - simulation time

| ntargets | delta CPU | delta CPU Qiskit | delta GPU 1 | delta GPU 2 | delta GPU 3 | delta GPU 4 | delta GPU 5 | delta GPU Qiskit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 0.13864 | 0.02050 | 0.66376 | 0.55524 | 0.85127 | 0.78859 | 0.84690 | 0.00309 |
| 4 | 0.16576 | 0.02005 | 0.66619 | 0.55557 | 0.85902 | 0.78326 | 0.86043 | 0.00299 |
| 5 | 0.13019 | 0.01383 | 0.66636 | 0.55317 | 0.86737 | 0.77655 | 0.86716 | 0.00244 |
| 6 | 0.14925 | 0.12211 | 0.65860 | 0.55048 | nan | 0.66763 | 0.85594 | 0.00265 |
| 7 | 0.13032 | 0.13500 | 0.66649 | 0.54820 | nan | nan | 0.84973 | 0.00364 |
| 8 | 0.13304 | 0.11849 | 0.66336 | 0.55268 | nan | nan | 0.85293 | 0.01142 |
| 9 | 0.14281 | 0.07636 | 0.68302 | 0.55383 | nan | nan | 0.86233 | 0.05565 |
| 10 | 0.14997 | 0.07241 | 0.65187 | 0.55876 | nan | nan | 0.85332 | 0.03711 |
| 11 | 0.13002 | 0.04130 | 0.70330 | 0.56513 | nan | nan | 0.88719 | not performed |
| 12 | 0.14162 | -0.06923 | 0.74065 | 0.59264 | nan | nan | 0.92795 | not performed |

I've managed to use hard-coded kernels up to 5 targets (option 3) by reducing the number of threads per block (just for those specific calls). It is the fastest option, even though the overhead of the dry run increases a bit.

Note that Qiskit seems very slow with more than 10 targets, and its GPU results look a bit strange. Maybe there's something wrong with my implementation, I don't know. I had to skip the Qiskit GPU results for n > 10 due to time limits.

Also, the results with option 2 are a bit weird; I'll double-check them tomorrow.

Given these results, I'd say that the best option is to use hard-coded kernels for n <= 5 (option 3) and duplicate the state vector for n > 5 (option 1). Moreover, in this way we don't need to specify a hard-coded number of blocks (while we still need to hard-code the number of threads).
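For context, a sketch of what a hard-coded three-qubit kernel looks like in spirit (simplified, with a hypothetical precomputed `indices` table): since nsubstates = 2^3 = 8 is known at compile time, the buffer is a fixed-size local array and no external buffer is needed.

```cpp
#include <thrust/complex.h>

// Sketch of the hard-coded path: the 2^3 = 8 amplitudes fit in a fixed-size
// local array, so no global or shared buffer is required.
__global__ void apply_three_qubit_sketch(thrust::complex<double>* state,
                                         const thrust::complex<double>* gate,  // 8x8 matrix
                                         const long* indices,  // hypothetical index table
                                         long nstates) {
  const long tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nstates) return;
  thrust::complex<double> buf[8];  // compile-time size -> registers/local memory
  for (int i = 0; i < 8; i++)
    buf[i] = state[indices[8 * tid + i]];
  for (int i = 0; i < 8; i++) {
    thrust::complex<double> acc(0.0, 0.0);
    for (int j = 0; j < 8; j++)
      acc += gate[8 * i + j] * buf[j];
    state[indices[8 * tid + i]] = acc;
  }
}
```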

Comment on lines 8 to 26
```cpp
// Multiplies two complex numbers
__device__ thrust::complex<double> cmult(thrust::complex<double> a, thrust::complex<double> b) {
  return thrust::complex<double>(a.real() * b.real() - a.imag() * b.imag(),
                                 a.real() * b.imag() + a.imag() * b.real());
}

__device__ thrust::complex<float> cmult(thrust::complex<float> a, thrust::complex<float> b) {
  return thrust::complex<float>(a.real() * b.real() - a.imag() * b.imag(),
                                a.real() * b.imag() + a.imag() * b.real());
}

// Adds two complex numbers
__device__ thrust::complex<double> cadd(thrust::complex<double> a, thrust::complex<double> b) {
  return thrust::complex<double>(a.real() + b.real(), a.imag() + b.imag());
}

__device__ thrust::complex<float> cadd(thrust::complex<float> a, thrust::complex<float> b) {
  return thrust::complex<float>(a.real() + b.real(), a.imag() + b.imag());
}
```
Contributor

@scarrazza is there a reason why we didn't use templates for these functions?

Member

Not really, I believe we can use templates. Could you please give it a try?
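For reference, a minimal sketch of what the templated helpers could look like (illustrative only, not necessarily the exact code that was committed):

```cpp
#include <thrust/complex.h>

// A single templated helper replaces the separate float/double overloads.
template <typename T>
__device__ thrust::complex<T> cmult(thrust::complex<T> a, thrust::complex<T> b) {
  return thrust::complex<T>(a.real() * b.real() - a.imag() * b.imag(),
                            a.real() * b.imag() + a.imag() * b.real());
}

template <typename T>
__device__ thrust::complex<T> cadd(thrust::complex<T> a, thrust::complex<T> b) {
  return thrust::complex<T>(a.real() + b.real(), a.imag() + b.imag());
}
```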

Contributor

It seems to work well.

Member

Btw, maybe these functions could be replaced with some thrust implementation, right?

Contributor

@mlazzarin mlazzarin Oct 26, 2021

Yes, I think so, maybe it will also reduce the compilation time. I will try.
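For illustration: thrust::complex already overloads the arithmetic operators, so the helpers could in principle be dropped in favour of plain `*` and `+` (a sketch, with a hypothetical function name, not the exact change made in this PR):

```cpp
#include <thrust/complex.h>

// What a kernel computes as cadd(cmult(g0, a0), cmult(g1, a1)) can be
// written directly with thrust's built-in operators.
template <typename T>
__device__ thrust::complex<T> apply_pair(thrust::complex<T> g0, thrust::complex<T> g1,
                                         thrust::complex<T> a0, thrust::complex<T> a1) {
  return g0 * a0 + g1 * a1;  // operator* / operator+ from thrust/complex.h
}
```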

Contributor

Ok, (nearly) all tests are fine (both those of this repository and those of qibo, including the examples) for both NVIDIA and AMD GPUs. Some tests fail with the latter, but I'll take care of them in another issue (#34).

Tomorrow I'll double-check my implementation and see if the dry run overhead changed.

Contributor

I compared the performance between commit 991eed8 and the latest commit, on an NVIDIA GPU.

Simulation times are not impacted by the changes. The dry run overhead is maybe a bit higher, but the difference is very small, and now the kernels are more readable.

qft - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00044 0.00043
4 0.00072 0.00072
5 0.00103 0.00106
6 0.00148 0.00151
7 0.00193 0.00187
8 0.00251 0.00249
9 0.00318 0.00317
10 0.00388 0.00382
11 0.00477 0.00471
12 0.00582 0.00575
13 0.00701 0.00632
14 0.00776 0.00751
15 0.00861 0.00851
16 0.00985 0.00958
17 0.01142 0.01130
18 0.01270 0.01226
19 0.01449 0.01429
20 0.01860 0.01853
21 0.02566 0.02442
22 0.03653 0.03638
23 0.05881 0.05849
24 0.10524 0.10577
25 0.20298 0.20303
26 0.40956 0.40973
27 0.84477 0.84448
28 1.77029 1.76949
29 3.71816 3.71843
30 7.85646 7.85749
variational - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00041 0.00041
4 0.00059 0.00058
5 0.00067 0.00067
6 0.00089 0.00091
7 0.00092 0.00091
8 0.00111 0.00110
9 0.00120 0.00118
10 0.00136 0.00142
11 0.00156 0.00154
12 0.00170 0.00179
13 0.00182 0.00188
14 0.00211 0.00204
15 0.00214 0.00214
16 0.00234 0.00230
17 0.00245 0.00236
18 0.00289 0.00291
19 0.00357 0.00357
20 0.00489 0.00495
21 0.00748 0.00762
22 0.01309 0.01300
23 0.02367 0.02376
24 0.04685 0.04684
25 0.09353 0.09345
26 0.19289 0.19316
27 0.39304 0.39327
28 0.81734 0.81717
29 1.67333 1.67299
30 3.48082 3.48066
bv - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00040 0.00038
4 0.00053 0.00052
5 0.00065 0.00066
6 0.00078 0.00078
7 0.00092 0.00091
8 0.00105 0.00105
9 0.00117 0.00118
10 0.00131 0.00134
11 0.00148 0.00156
12 0.00169 0.00171
13 0.00184 0.00180
14 0.00189 0.00201
15 0.00209 0.00214
16 0.00224 0.00227
17 0.00237 0.00234
18 0.00265 0.00274
19 0.00364 0.00365
20 0.00514 0.00503
21 0.00801 0.00805
22 0.01389 0.01380
23 0.02588 0.02580
24 0.05067 0.05071
25 0.10253 0.10257
26 0.21017 0.20995
27 0.43234 0.43248
28 0.89246 0.89234
29 1.84317 1.84307
30 3.80694 3.80670
supremacy - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00049 0.00047
4 0.00061 0.00060
5 0.00071 0.00070
6 0.00090 0.00089
7 0.00103 0.00099
8 0.00113 0.00113
9 0.00128 0.00127
10 0.00146 0.00146
11 0.00173 0.00171
12 0.00181 0.00187
13 0.00196 0.00198
14 0.00219 0.00212
15 0.00233 0.00231
16 0.00239 0.00255
17 0.00259 0.00256
18 0.00308 0.00312
19 0.00403 0.00409
20 0.00563 0.00567
21 0.00894 0.00895
22 0.01585 0.01591
23 0.02969 0.02972
24 0.05832 0.05837
25 0.11840 0.11834
26 0.24483 0.24496
27 0.50071 0.50065
28 1.03147 1.03117
29 2.13266 2.13229
30 4.43612 4.43687
qv - simulation times
nqubits Simulation time 991eed8 Simulation time multiqubitgpu
3 0.00055 0.00055
4 0.00105 0.00105
5 0.00103 0.00102
6 0.00155 0.00160
7 0.00150 0.00157
8 0.00203 0.00199
9 0.00204 0.00203
10 0.00263 0.00260
11 0.00277 0.00283
12 0.00331 0.00324
13 0.00328 0.00332
14 0.00386 0.00389
15 0.00396 0.00381
16 0.00443 0.00444
17 0.00440 0.00440
18 0.00553 0.00552
19 0.00677 0.00679
20 0.00976 0.00990
21 0.01480 0.01482
22 0.02683 0.02672
23 0.04771 0.04792
24 0.09675 0.09656
25 0.18975 0.18979
26 0.39937 0.39940
27 0.80150 0.80146
28 1.71590 1.71595
29 3.38466 3.38493
30 7.31564 7.31548
qft - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 1.13092 1.13790
4 0.71047 0.71884
5 0.71031 0.71928
6 0.71481 0.71649
7 0.70863 0.72005
8 0.70899 0.71804
9 0.71400 0.71668
10 0.71852 0.71835
11 0.71023 0.72112
12 0.70969 0.71989
13 0.71568 0.71456
14 0.71226 0.71509
15 0.71486 0.72075
16 0.71409 0.71657
17 0.71541 0.72047
18 0.71804 0.72197
19 0.72080 0.72394
20 0.72220 0.72338
21 0.72299 0.72673
22 0.72623 0.72765
23 0.72267 0.72840
24 0.72529 0.73085
25 0.72182 0.72296
26 0.72701 0.72603
27 0.72602 0.73006
28 0.71864 0.72659
29 0.70956 0.71191
30 0.69071 0.70341
variational - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.71415 0.72038
4 0.70769 0.71649
5 0.70937 0.71812
6 0.70958 0.71555
7 0.70727 0.71559
8 0.70720 0.71885
9 0.70969 0.71751
10 0.71365 0.71747
11 0.71844 0.71616
12 0.71788 0.71990
13 0.71002 0.71771
14 0.71480 0.71783
15 0.71852 0.71724
16 0.71529 0.71917
17 0.71499 0.71951
18 0.71080 0.71586
19 0.71067 0.72309
20 0.71775 0.71293
21 0.71109 0.72117
22 0.71793 0.71935
23 0.71640 0.72139
24 0.71106 0.71933
25 0.71344 0.71769
26 0.70721 0.72180
27 0.71279 0.71130
28 0.70703 0.70995
29 0.69703 0.70370
30 0.67889 0.68067
bv - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.70693 0.70872
4 0.70554 0.71479
5 0.70762 0.71364
6 0.70775 0.71065
7 0.71237 0.71555
8 0.71490 0.71482
9 0.70930 0.71308
10 0.71173 0.71508
11 0.70676 0.71636
12 0.71702 0.71578
13 0.71345 0.71600
14 0.70832 0.71245
15 0.72123 0.71359
16 0.71067 0.71403
17 0.71310 0.71767
18 0.71147 0.71437
19 0.71051 0.71625
20 0.71215 0.71688
21 0.71719 0.71342
22 0.71227 0.71678
23 0.71176 0.71218
24 0.70776 0.71637
25 0.70673 0.71207
26 0.71570 0.71079
27 0.71087 0.71212
28 0.70449 0.71039
29 0.70249 0.70672
30 0.69658 0.70001
supremacy - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.70985 0.71530
4 0.71563 0.71162
5 0.70983 0.71305
6 0.71057 0.71416
7 0.71140 0.71252
8 0.70858 0.71158
9 0.71183 0.71507
10 0.71016 0.71395
11 0.71208 0.71326
12 0.71051 0.71362
13 0.71302 0.71593
14 0.71557 0.71449
15 0.71754 0.71594
16 0.71387 0.71398
17 0.71395 0.71546
18 0.71328 0.71520
19 0.71274 0.72063
20 0.71494 0.71321
21 0.71122 0.71630
22 0.71197 0.71504
23 0.71151 0.71616
24 0.71333 0.71774
25 0.71069 0.71447
26 0.70870 0.71649
27 0.70629 0.71073
28 0.70280 0.70524
29 0.69261 0.69725
30 0.67662 0.67756
qv - dry run overhead
nqubits delta 991eed8 delta multiqubitgpu
3 0.71142 0.71313
4 0.71076 0.71464
5 0.71482 0.71656
6 0.71078 0.71453
7 0.70992 0.71436
8 0.71111 0.71599
9 0.71156 0.71491
10 0.71524 0.71790
11 0.71223 0.71627
12 0.71253 0.71641
13 0.71605 0.71792
14 0.71840 0.71634
15 0.71330 0.72055
16 0.71573 0.71746
17 0.71763 0.71917
18 0.71394 0.71920
19 0.71437 0.71903
20 0.71500 0.72045
21 0.71883 0.72240
22 0.71666 0.72083
23 0.71968 0.72134
24 0.71736 0.72497
25 0.71923 0.72301
26 0.71585 0.71849
27 0.71657 0.71572
28 0.71002 0.71283
29 0.69706 0.70242
30 0.68073 0.68579

@mlazzarin mlazzarin marked this pull request as ready for review October 27, 2021 06:20
@mlazzarin
Contributor

Ready for review. Please double-check the kernels I wrote, as I didn't derive them from scratch but worked by analogy with the numba implementation, and I'm still confused about some function parameters and some indices.

@mlazzarin
Contributor

It seems like I cannot ask @stavros11 for the review since he's the author, so I'll do it with this comment.

Member Author

@stavros11 stavros11 left a comment

Thank you for implementing this. I have not done any performance tests as I believe this is covered in the above comments, but tests pass on my CUDA GPU (cannot check for AMD) and the kernels look ok to me.

Regarding the cadd, cmult replacement, the code is much cleaner now so if there is no big performance change, which seems to be the case from your latest benchmark, I agree with the new approach.

> It seems like I cannot ask @stavros11 for the review since he's the author, so I'll do it with this comment.

Apparently I also cannot approve the PR since I am the author, but it looks good to me. Just a few minor comments/possible simplifications below.

```python
# len(targets): 4 -> 512 threads per block
# len(targets): 5 -> 256 threads per block
nblocks, block_size = self.calculate_blocks(nstates,
                                            block_size=2**(13 - len(targets)))
```
Member Author

Just to make sure I understand this correctly: this block size is the maximum possible (so that we get the best performance) that does not lead to memory issues, right?
And is this result independent of the GPU model? No need to do any tests regarding this, I am mainly asking out of curiosity.

Contributor

Yes, it's the maximum possible (considering only powers of two). I don't know if it's independent of the GPU model; I performed the tests on my laptop.

Contributor

Ok, I confirm that it is not independent of the GPU model, as this configuration doesn't work with an NVIDIA Tesla V100 (I need to halve the block size). I need to find a smarter way to manage this before merging this PR.

Contributor

I think that I've found a smarter way in #37.
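Not necessarily what #37 does, but for illustration, plain CUDA exposes both hard device limits and an occupancy helper that can suggest a launchable block size for a given kernel instead of hard-coding 2**(13 - len(targets)); the kernel below is a placeholder:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void multiqubit_kernel_placeholder() {}  // stands in for the real kernel

int main() {
  // Hard device limit on threads per block (device 0).
  int max_threads = 0;
  cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, 0);

  // Occupancy-based suggestion: the largest block size the kernel can
  // actually be launched with on this device (registers etc. accounted for).
  int min_grid_size = 0, block_size = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                     multiqubit_kernel_placeholder, 0, 0);
  printf("device max threads/block: %d, suggested block size: %d\n",
         max_threads, block_size);
  return 0;
}
```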

@mlazzarin
Contributor

@stavros11 I addressed your comments. I also added some tests with 22 qubits for the multi-qubit gates; the CI now takes longer, so let me know if that's ok.

@mlazzarin
Contributor

I implemented a simple change in the multi-qubit kernel for n > 5, with the goal of optimizing the performance.
I performed some benchmarks; they show a small performance increase of around 1%-4%, while the results for the dry run overhead are too noisy to show a trend.

29 qubits

| ntargets | Simulation time before | Simulation time after |
| --- | --- | --- |
| 6 | 1.61395 | 1.57153 |
| 7 | 2.47546 | 2.37732 |
| 8 | 4.25149 | 4.09117 |
| 9 | 8.18834 | 8.06142 |
| 10 | 17.16740 | 16.94156 |

30 qubits

| ntargets | Simulation time before | Simulation time after |
| --- | --- | --- |
| 6 | 3.30619 | 3.22333 |
| 7 | 5.04858 | 4.93918 |
| 8 | 8.51509 | 8.38226 |
| 9 | 15.98977 | 15.75742 |
| 10 | 32.21293 | 31.64226 |

29 qubits - dry run overhead

| ntargets | delta before | delta after |
| --- | --- | --- |
| 6 | 0.84389 | 0.84887 |
| 7 | 0.84079 | 0.85194 |
| 8 | 0.82255 | 0.77665 |
| 9 | 0.77009 | 0.75778 |
| 10 | 0.70218 | 0.61692 |

30 qubits - dry run overhead

| ntargets | delta before | delta after |
| --- | --- | --- |
| 6 | 0.85006 | 0.85752 |
| 7 | 0.84390 | 0.85973 |
| 8 | 0.86111 | 0.83283 |
| 9 | 0.82162 | 0.82221 |
| 10 | 0.79891 | 0.86906 |

@stavros11
Member Author

> I implemented a simple change in the multi-qubit kernel for n > 5, with the goal of optimizing the performance. I performed some benchmarks; they show a small performance increase of around 1%-4%, while the results for the dry run overhead are too noisy to show a trend.

Thanks for this change, it makes sense. I am not sure you can do much to reduce the noise in the dry run benchmark: it only happens once per script, so you cannot do more repetitions as for the simulation times. You could only repeat the script execution multiple times and average, but I don't think it's necessary to go into such detail.

@scarrazza
Member

@mlazzarin @stavros11 could you please fix the conflicts so we can merge? Thanks.

@mlazzarin
Contributor

@stavros11 thank you for fixing the conflicts.

@stavros11
Member Author

@mlazzarin @scarrazza before merging this, please make sure that tests pass in your configurations with the GPUs enabled and disabled, particularly for AMD GPUs since I do not have access to those and the CI does not test on GPU. This looks good to me; my only concerns are the following:

  • The multi-qubit kernels increase the dry run/compilation time by around 0.3 s, which is small but non-negligible and happens regardless of whether the user actually uses these kernels or not.
  • The tests for 22 qubits take some time, at least on my machine, but since they don't break the CI it is not a big issue.

Just mentioning these to keep them in mind; I am not sure if we should/can do something at the current stage.

@mlazzarin
Contributor

> @mlazzarin @scarrazza before merging this, please make sure that tests pass in your configurations with the GPUs enabled and disabled, particularly for AMD GPUs since I do not have access to those and the CI does not test on GPU.

Ok, I'm on it.

> The multi-qubit kernels increase the dry run/compilation time by around 0.3 s, which is small but non-negligible and happens regardless of whether the user actually uses these kernels or not.

We may solve this issue by redesigning the way we load the kernels, e.g. by replacing cp.RawModule with many cp.RawKernel objects and compiling each kernel only when it is needed.

> The tests for 22 qubits take some time, at least on my machine, but since they don't break the CI it is not a big issue.

The tests for 22 qubits are required to catch some resource issues with GPUs. Given that the CI doesn't use a GPU, is it possible to implement a skipping mechanism that triggers only when a GPU is not available?

@mlazzarin
Contributor

Ok, I tried with both Intel/AMD CPUs and NVIDIA/AMD GPUs and the qibojit tests are fine.

@scarrazza
Member

@stavros11 thanks for the summary, @mlazzarin thanks for testing. Let me suggest to merge and release a 0.0.4rc0; then next week we can discuss this compilation issue with their developers.

@scarrazza scarrazza merged commit 4ff7885 into multistr Nov 5, 2021
@stavros11 stavros11 deleted the multiqubitgpu branch January 17, 2022 07:07