Multiqubit ops CPU performance #51

Closed
stavros11 opened this issue Nov 27, 2021 · 4 comments

@stavros11
Member

As we saw during our discussion about qiboteam/qibo#505, we are observing some performance issues while incorporating the multiqubit ops in qibo, particularly in comparison to qiskit. Here are some benchmarks on CPU for circuits of the following type:

q0 : ─U───────────────────────────────
q1 : ─U─U─────────────────────────────
q2 : ─U─U─U───────────────────────────
q3 : ─U─U─U─U─────────────────────────
q4 : ─U─U─U─U─U───────────────────────
q5 : ───U─U─U─U─U─────────────────────
q6 : ─────U─U─U─U─U───────────────────
q7 : ───────U─U─U─U─U─────────────────
q8 : ─────────U─U─U─U─U───────────────
q9 : ───────────U─U─U─U─U─────────────
q10: ─────────────U─U─U─U─U───────────
q11: ───────────────U─U─U─U─U─────────
q12: ─────────────────U─U─U─U─U───────
q13: ───────────────────U─U─U─U─U─────
q14: ─────────────────────U─U─U─U─U───
q15: ───────────────────────U─U─U─U─U─
q16: ─────────────────────────U─U─U─U─
q17: ───────────────────────────U─U─U─
q18: ─────────────────────────────U─U─
q19: ───────────────────────────────U─

where U is a multiqubit (here five-qubit) unitary:

[Figure: multiqubit - qibo/qiskit - simulation time - double]

[Figure: multiqubit - qibo/qiskit - simulation time - single]
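For readers who want to reproduce the circuit layout without either library, here is a minimal NumPy sketch of the sliding-window circuit drawn above. It is not the benchmark code used in this issue (that code is not shown here). The helper names (`random_unitary`, `apply_multiqubit_gate`, `sliding_window_targets`, `run_circuit`) are hypothetical, and using a fresh random unitary for every gate and treating qubit 0 as the most significant bit are assumptions.

```python
import numpy as np


def random_unitary(ntargets, rng):
    """Random 2**ntargets x 2**ntargets unitary (assumption: any unitary will do)."""
    dim = 2 ** ntargets
    z = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    q, _ = np.linalg.qr(z)  # the Q factor of a QR decomposition is unitary
    return q


def apply_multiqubit_gate(state, matrix, targets, nqubits):
    """Apply a k-qubit matrix to `targets`; qubit 0 is the most significant bit."""
    k = len(targets)
    psi = state.reshape((2,) * nqubits)
    gate = matrix.reshape((2,) * (2 * k))  # axes: k outputs, then k inputs
    # Contract the gate's input axes with the state's target axes.
    psi = np.tensordot(gate, psi, axes=(list(range(k, 2 * k)), list(targets)))
    # tensordot places the gate's output axes first; move them back in place.
    psi = np.moveaxis(psi, list(range(k)), list(targets))
    return psi.reshape(-1)


def sliding_window_targets(nqubits, ntargets):
    """Target tuples (0..k-1), (1..k), ... as in the circuit diagram."""
    return [tuple(range(s, s + ntargets)) for s in range(nqubits - ntargets + 1)]


def run_circuit(nqubits, ntargets, seed=0):
    """Start from |0...0> and apply one gate per window position."""
    rng = np.random.default_rng(seed)
    state = np.zeros(2 ** nqubits, dtype=np.complex128)
    state[0] = 1.0
    for targets in sliding_window_targets(nqubits, ntargets):
        gate = random_unitary(ntargets, rng)
        state = apply_multiqubit_gate(state, gate, targets, nqubits)
    return state
```

For `nqubits=20` and `ntargets=5` this gives 16 gates, matching the 16 columns of the diagram. The tensordot approach is only meant to make the operation explicit; it says nothing about how qibojit or qiskit implement their kernels.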

Since in previous benchmarks on this repository we were comparing calling the custom operators directly vs qiskit, I made an additional comparison of qibo (with qibojit) vs qibojit:

[Figure: multiqubit - qibo/qibojit - simulation time - double]

[Figure: multiqubit - qibo/qibojit - dry run time - double]

[Figure: multiqubit - qibo/qibojit - simulation time - single]

[Figure: multiqubit - qibo/qibojit - dry run time - single]

From these we see that the qibo overhead is minimal and not enough to explain the difference with qiskit, so the discrepancy most likely comes from the qibojit side. Also, although the absolute qibo/qiskit ratios differ between single and double precision, the behavior is qualitatively the same. What is interesting is that qiskit appears much faster for 4 <= ntargets <= 6 and nqubits > 20, but becomes much slower for ntargets > 6. Here are some absolute times (not ratios) that clearly show this:
nqubits=23 - simulation times - double

| ntargets | qibo (sec) | qibojit (sec) | qiskit (sec) |
|---|---|---|---|
| 3 | 0.09326 | 0.06830 | 0.03834 |
| 4 | 0.17647 | 0.16432 | 0.05503 |
| 5 | 0.25915 | 0.24949 | 0.10277 |
| 6 | 0.45729 | 0.48950 | 0.18796 |
| 7 | 0.65344 | 0.78138 | 0.74864 |
| 8 | 0.98115 | 1.19669 | 2.51820 |
| 9 | 1.54533 | 1.62834 | 5.98414 |
| 10 | 3.47243 | 3.38084 | 42.43832 |

nqubits=24 - simulation times - double

| ntargets | qibo (sec) | qibojit (sec) | qiskit (sec) |
|---|---|---|---|
| 3 | 0.15120 | 0.21074 | 0.08805 |
| 4 | 0.35044 | 0.30915 | 0.12071 |
| 5 | 0.58445 | 0.56411 | 0.20928 |
| 6 | 1.09115 | 1.02924 | 0.34992 |
| 7 | 1.16172 | 1.22445 | 1.23091 |
| 8 | 2.01835 | 1.91831 | 5.66647 |
| 9 | 3.32717 | 3.47910 | 11.75414 |
| 10 | 7.11032 | 6.52692 | 73.70552 |

nqubits=25 - simulation times - double

| ntargets | qibo (sec) | qibojit (sec) | qiskit (sec) |
|---|---|---|---|
| 3 | 0.31643 | 0.36482 | 0.24589 |
| 4 | 0.59639 | 0.58661 | 0.26426 |
| 5 | 1.27955 | 1.26141 | 0.38248 |
| 6 | 2.18827 | 2.08555 | 0.71935 |
| 7 | 2.47914 | 2.32125 | 2.56939 |
| 8 | 4.05328 | 4.07807 | 11.53120 |
| 9 | 6.88069 | 6.86786 | 22.85148 |
| 10 | 14.79330 | 13.21936 | 133.96834 |

Qibo's simulation time increases as expected with ntargets, while qiskit's makes a jump at ntargets=7. It looks like they have a very good implementation for ntargets < 7 (perhaps based on some decomposition?) and a very bad one for more targets. I think @mlazzarin observed something similar in the past, right?
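To make the jump easier to see, here is a small sketch that takes the nqubits=23 double-precision times from the table above and computes the qiskit/qibo ratio and the step-to-step growth of each column. The function names are hypothetical; the numbers are copied from the table and nothing else is measured here.

```python
# Times in seconds for nqubits=23, double precision, copied from the table above.
# Each entry maps ntargets -> (qibo, qibojit, qiskit).
TIMES_23 = {
    3: (0.09326, 0.06830, 0.03834),
    4: (0.17647, 0.16432, 0.05503),
    5: (0.25915, 0.24949, 0.10277),
    6: (0.45729, 0.48950, 0.18796),
    7: (0.65344, 0.78138, 0.74864),
    8: (0.98115, 1.19669, 2.51820),
    9: (1.54533, 1.62834, 5.98414),
    10: (3.47243, 3.38084, 42.43832),
}
QIBO, QIBOJIT, QISKIT = 0, 1, 2  # column indices


def qiskit_over_qibo(times):
    """Ratio qiskit / qibo per ntargets; values above 1 mean qiskit is slower."""
    return {k: row[QISKIT] / row[QIBO] for k, row in times.items()}


def first_ntargets_where_qiskit_slower(times):
    """Smallest ntargets at which qiskit takes longer than qibo, or None."""
    for k in sorted(times):
        if times[k][QISKIT] > times[k][QIBO]:
            return k
    return None


def growth_factors(times, column):
    """Time at ntargets=k divided by time at ntargets=k-1 for one column."""
    keys = sorted(times)
    return {k: times[k][column] / times[p][column] for p, k in zip(keys, keys[1:])}


if __name__ == "__main__":
    print("crossover at ntargets =", first_ntargets_where_qiskit_slower(TIMES_23))
    for k, r in qiskit_over_qibo(TIMES_23).items():
        print(f"ntargets={k:2d}  qiskit/qibo={r:.2f}")
    print("qiskit growth:", growth_factors(TIMES_23, QISKIT))
    print("qibo growth:  ", growth_factors(TIMES_23, QIBO))
```

With the qibo column as reference, the crossover lands at ntargets=7, which is where the qiskit time grows by roughly a factor of four in a single step.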

For all these benchmarks the number of threads was set using `from multiprocessing import cpu_count`, with all libraries using half of the total threads, and I tested that the final wavefunctions agree.
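A sketch of the two setup steps described above: picking half of the available threads, and checking that two final state vectors agree. The exact check used for these benchmarks is not shown in this issue, so the helper names, the tolerance, and the qubit-reordering step are all assumptions. The reordering helper is included because Qiskit uses little-endian qubit ordering, so a state exported from it may need its qubit axes reversed before comparing against a library that treats qubit 0 as the most significant bit.

```python
from multiprocessing import cpu_count

import numpy as np


def half_of_available_threads():
    """Half of the logical CPU count, but never less than one thread."""
    return max(1, cpu_count() // 2)


def reverse_qubit_order(state, nqubits):
    """Reverse the qubit axes of a state vector (big-endian <-> little-endian)."""
    psi = np.asarray(state).reshape((2,) * nqubits)
    return psi.transpose(tuple(reversed(range(nqubits)))).reshape(-1)


def states_agree(a, b, atol=1e-6):
    """Elementwise comparison; atol is an assumed tolerance, loosen for single precision."""
    return bool(np.allclose(a, b, atol=atol))


if __name__ == "__main__":
    nthreads = half_of_available_threads()
    print("threads to use:", nthreads)
    # How each library consumes this value is library-specific and not shown here.
    big_endian = np.array([0, 1, 0, 0], dtype=complex)  # |01> with qubit 0 first
    little_endian = reverse_qubit_order(big_endian, 2)
    print(states_agree(reverse_qubit_order(little_endian, 2), big_endian))
```

Note that `np.allclose` does not ignore a global phase; if the libraries could differ by one, compare the absolute value of the overlap instead.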

@mlazzarin
Contributor

Thank you for these tests.

> It looks like they have a very good implementation for ntargets < 7 (perhaps based on some decomposition?) and a very bad one for more targets. I think @mlazzarin observed something similar in the past, right?

Yes, concerning the GPU implementation I think that qiskit uses a standard matrix multiplication for ntargets <= 10, then it uses the LU decomposition.
https://github.com/Qiskit/qiskit-aer/blob/1be928b3abb4156c1434f3999a3fc2bd8293f3f3/src/simulators/statevector/qubitvector_thrust.hpp#L1906-L1946

Regarding CPU, my guess is that they implemented functions with AVX2 instructions up to ntargets = 6, and for ntargets > 6 they fall back to their implementation without AVX2 instructions.
https://github.com/Qiskit/qiskit-aer/blob/1be928b3abb4156c1434f3999a3fc2bd8293f3f3/src/simulators/statevector/transformer_avx2.hpp#L76-L91
https://github.com/Qiskit/qiskit-aer/blob/1be928b3abb4156c1434f3999a3fc2bd8293f3f3/src/simulators/statevector/qv_avx2.cpp#L1070-L1099
It would be interesting to compile Qiskit Aer without AVX2 support and see what happens.

@stavros11 can you share the code that you are using to do these benchmarks? I'd like to make some experiments.

@scarrazza
Member

@stavros11 can we close this issue?

@stavros11
Member Author

I am not sure if the numbers from the first post are still valid because a long time has passed, but I do not think we did anything to solve this issue, so it probably still exists. I also have not done the AVX test that Marco is proposing.

@scarrazza
Member

Closing, the results are too obsolete.

3 participants