Multiqubit ops CPU performance #51

stavros11 · 2021-11-27T09:30:45Z

As we saw during our discussion about qiboteam/qibo#505, we are observing some performance issues while incorporating the multiqubit ops in qibo, particularly in comparison to qiskit. Here are some benchmarks on CPU for circuits of the following type:

q0 : ─U───────────────────────────────
q1 : ─U─U─────────────────────────────
q2 : ─U─U─U───────────────────────────
q3 : ─U─U─U─U─────────────────────────
q4 : ─U─U─U─U─U───────────────────────
q5 : ───U─U─U─U─U─────────────────────
q6 : ─────U─U─U─U─U───────────────────
q7 : ───────U─U─U─U─U─────────────────
q8 : ─────────U─U─U─U─U───────────────
q9 : ───────────U─U─U─U─U─────────────
q10: ─────────────U─U─U─U─U───────────
q11: ───────────────U─U─U─U─U─────────
q12: ─────────────────U─U─U─U─U───────
q13: ───────────────────U─U─U─U─U─────
q14: ─────────────────────U─U─U─U─U───
q15: ───────────────────────U─U─U─U─U─
q16: ─────────────────────────U─U─U─U─
q17: ───────────────────────────U─U─U─
q18: ─────────────────────────────U─U─
q19: ───────────────────────────────U─

where U is a multiqubit (here five-qubit) unitary:

[Plot: multiqubit - qibo/qiskit - simulation time - double]

[Plot: multiqubit - qibo/qiskit - simulation time - single]
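For reference, the circuit layout in the diagram above can be sketched in plain NumPy. This is only an illustration, not the qibo, qibojit, or qiskit kernels being benchmarked: the helper names are hypothetical, the use of random unitaries is an assumption (the post does not say how U was generated), and qubit 0 is assumed to be the leading tensor axis.

```python
import numpy as np


def sliding_window_targets(nqubits, ntargets):
    """Target tuples (i, ..., i + ntargets - 1) for every window start i."""
    return [tuple(range(i, i + ntargets)) for i in range(nqubits - ntargets + 1)]


def random_unitary(ntargets, rng):
    """Random 2^ntargets x 2^ntargets unitary from the QR factor of a complex Gaussian matrix."""
    dim = 2 ** ntargets
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q


def apply_gate(state, matrix, targets, nqubits):
    """Apply a dense gate to the target qubits of a state vector.

    Assumption: qubit 0 is the leading (most significant) tensor axis.
    """
    k = len(targets)
    psi = state.reshape((2,) * nqubits)
    gate = matrix.reshape((2,) * (2 * k))
    # Contract the gate's input axes (the last k) with the target axes of the state.
    psi = np.tensordot(gate, psi, axes=(list(range(k, 2 * k)), list(targets)))
    # tensordot puts the gate's output axes first; move them back to the target positions.
    psi = np.moveaxis(psi, list(range(k)), list(targets))
    return psi.reshape(-1)


def run(nqubits=8, ntargets=5, seed=0):
    """Apply one random ntargets-qubit unitary per sliding window, starting from |0...0>."""
    rng = np.random.default_rng(seed)
    state = np.zeros(2 ** nqubits, dtype=np.complex128)
    state[0] = 1.0
    for targets in sliding_window_targets(nqubits, ntargets):
        state = apply_gate(state, random_unitary(ntargets, rng), targets, nqubits)
    return state
```

With nqubits=20 and ntargets=5 this yields 16 gates, the first on qubits 0-4 and the last on qubits 15-19, as in the diagram. Each dense gate application touches all 2^nqubits amplitudes and multiplies by a 2^ntargets x 2^ntargets matrix, so the cost per gate grows with ntargets.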

Since previous benchmarks in this repository compared calling the custom operators directly against qiskit, I made an additional comparison of qibo (with qibojit) vs calling qibojit directly:

[Plot: multiqubit - qibo/qibojit - simulation time - double]

[Plot: multiqubit - qibo/qibojit - dry run time - double]

[Plot: multiqubit - qibo/qibojit - simulation time - single]

[Plot: multiqubit - qibo/qibojit - dry run time - single]

From these we see that the qibo overhead is minimal and not enough to explain the difference with qiskit, so the discrepancy most likely comes from the qibojit side. Also, although the absolute qibo/qiskit ratios differ between single and double precision, the behavior is qualitatively the same. What is interesting is that qiskit appears much faster for 4 <= ntargets <= 6 and nqubits > 20, but becomes much slower for ntargets > 6. Here are some absolute times (no ratio) that clearly show this:

nqubits=23 - simulation times - double

ntargets	qibo (sec)	qibojit (sec)	qiskit (sec)
3	0.09326	0.06830	0.03834
4	0.17647	0.16432	0.05503
5	0.25915	0.24949	0.10277
6	0.45729	0.48950	0.18796
7	0.65344	0.78138	0.74864
8	0.98115	1.19669	2.51820
9	1.54533	1.62834	5.98414
10	3.47243	3.38084	42.43832

nqubits=24 - simulation times - double

ntargets	qibo (sec)	qibojit (sec)	qiskit (sec)
3	0.15120	0.21074	0.08805
4	0.35044	0.30915	0.12071
5	0.58445	0.56411	0.20928
6	1.09115	1.02924	0.34992
7	1.16172	1.22445	1.23091
8	2.01835	1.91831	5.66647
9	3.32717	3.47910	11.75414
10	7.11032	6.52692	73.70552

nqubits=25 - simulation times - double

ntargets	qibo (sec)	qibojit (sec)	qiskit (sec)
3	0.31643	0.36482	0.24589
4	0.59639	0.58661	0.26426
5	1.27955	1.26141	0.38248
6	2.18827	2.08555	0.71935
7	2.47914	2.32125	2.56939
8	4.05328	4.07807	11.53120
9	6.88069	6.86786	22.85148
10	14.79330	13.21936	133.96834

Qibo's execution time increases as expected with ntargets, while qiskit's makes a jump at ntargets=7. It looks like they have a very good implementation for ntargets < 7 (perhaps based on some decomposition?) and a very bad one for more targets. I think @mlazzarin observed something similar in the past, right?
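To make the crossover explicit, here is a small sketch that computes the qiskit/qibo time ratio from the three tables above. The numbers are copied from the tables; the function names are just for illustration.

```python
# Simulation times in seconds (double precision), copied from the tables above.
# Layout: nqubits -> ntargets -> (qibo, qibojit, qiskit)
TIMES = {
    23: {
        3: (0.09326, 0.06830, 0.03834),
        4: (0.17647, 0.16432, 0.05503),
        5: (0.25915, 0.24949, 0.10277),
        6: (0.45729, 0.48950, 0.18796),
        7: (0.65344, 0.78138, 0.74864),
        8: (0.98115, 1.19669, 2.51820),
        9: (1.54533, 1.62834, 5.98414),
        10: (3.47243, 3.38084, 42.43832),
    },
    24: {
        3: (0.15120, 0.21074, 0.08805),
        4: (0.35044, 0.30915, 0.12071),
        5: (0.58445, 0.56411, 0.20928),
        6: (1.09115, 1.02924, 0.34992),
        7: (1.16172, 1.22445, 1.23091),
        8: (2.01835, 1.91831, 5.66647),
        9: (3.32717, 3.47910, 11.75414),
        10: (7.11032, 6.52692, 73.70552),
    },
    25: {
        3: (0.31643, 0.36482, 0.24589),
        4: (0.59639, 0.58661, 0.26426),
        5: (1.27955, 1.26141, 0.38248),
        6: (2.18827, 2.08555, 0.71935),
        7: (2.47914, 2.32125, 2.56939),
        8: (4.05328, 4.07807, 11.53120),
        9: (6.88069, 6.86786, 22.85148),
        10: (14.79330, 13.21936, 133.96834),
    },
}


def qiskit_over_qibo(nqubits):
    """Ratio qiskit time / qibo time per ntargets (above 1 means qiskit is slower)."""
    return {nt: t[2] / t[0] for nt, t in TIMES[nqubits].items()}


def crossover(nqubits):
    """Smallest ntargets at which qiskit is slower than qibo, or None if it never is."""
    for nt, ratio in sorted(qiskit_over_qibo(nqubits).items()):
        if ratio > 1:
            return nt
    return None


if __name__ == "__main__":
    for n in sorted(TIMES):
        ratios = ", ".join(f"{nt}: {r:.2f}" for nt, r in sorted(qiskit_over_qibo(n).items()))
        print(f"nqubits={n} crossover at ntargets={crossover(n)} ({ratios})")
```

For all three qubit counts the ratio is below 1 up to ntargets=6 and above 1 from ntargets=7 onwards, although at ntargets=7 the two libraries are still within about 15% of each other.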

For all these benchmarks the number of threads was set using from multiprocessing import cpu_count, with all libraries using half of the total threads, and I tested that the final wavefunctions agree.
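A minimal sketch of this setup is shown below. How the thread count was actually passed to each library is not stated in this post, so exporting OMP_NUM_THREADS and NUMBA_NUM_THREADS is an assumption, and the helper names and the tolerance are hypothetical.

```python
import os
from multiprocessing import cpu_count

import numpy as np


def half_of_threads():
    """Half of the logical CPUs reported by the OS, but at least one."""
    return max(1, cpu_count() // 2)


def limit_threads(nthreads):
    """Export common thread-count environment variables.

    Assumption: the libraries under test honour these variables. They are
    typically read once, so this should run before those libraries are imported.
    """
    for var in ("OMP_NUM_THREADS", "NUMBA_NUM_THREADS"):
        os.environ[var] = str(nthreads)


def states_agree(state_a, state_b, atol=1e-6):
    """Check that two final state vectors match element-wise within atol."""
    a = np.asarray(state_a).ravel()
    b = np.asarray(state_b).ravel()
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))


if __name__ == "__main__":
    nthreads = half_of_threads()
    limit_threads(nthreads)
    print(f"using {nthreads} of {cpu_count()} threads")
```

When comparing states across libraries, the qubit ordering convention may differ, so one of the two vectors may need its qubit axes reversed before calling states_agree.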

mlazzarin · 2021-11-28T06:16:53Z

Thank you for these tests.

> It looks like they have a very good implementation for ntargets < 7 (perhaps based on some decomposition?) and a very bad one for more targets. I think @mlazzarin observed something similar in the past, right?

Yes, concerning the GPU implementation I think that qiskit uses a standard matrix multiplication for ntargets <= 10, then it uses the LU decomposition.
https://github.com/Qiskit/qiskit-aer/blob/1be928b3abb4156c1434f3999a3fc2bd8293f3f3/src/simulators/statevector/qubitvector_thrust.hpp#L1906-L1946

Regarding CPU, my guess is that they implemented functions with AVX2 instructions up to ntargets = 6, then for ntargets > 6 they fall back to their implementation without AVX2 instructions.
https://github.com/Qiskit/qiskit-aer/blob/1be928b3abb4156c1434f3999a3fc2bd8293f3f3/src/simulators/statevector/transformer_avx2.hpp#L76-L91
https://github.com/Qiskit/qiskit-aer/blob/1be928b3abb4156c1434f3999a3fc2bd8293f3f3/src/simulators/statevector/qv_avx2.cpp#L1070-L1099
It would be interesting to compile Qiskit Aer without AVX2 support and see what happens.

@stavros11 can you share the code that you are using to do these benchmarks? I'd like to make some experiments.

scarrazza · 2022-07-18T09:30:09Z

@stavros11 can we close this issue?

stavros11 · 2022-07-18T11:16:50Z

I am not sure if the numbers from the first post are still valid because a long time has passed, but I do not think we did anything to solve this issue, so it probably still exists. I also have not done the AVX test that Marco proposed.

scarrazza · 2022-11-10T21:05:57Z

Closing, the results are too obsolete.

This was referenced Nov 28, 2021

Single precision vs double precision performance #52

Closed

Multiqubit ops GPU performance #53

Closed

stavros11 mentioned this issue Apr 21, 2022

Multi-qubit fusion qiboteam/qibo#577

Merged

scarrazza closed this as completed Nov 10, 2022
