Skip to content

Commit

Permalink
Proofreading performance draft
Browse files Browse the repository at this point in the history
  • Loading branch information
WrathfulSpatula committed Jan 23, 2020
1 parent e8751fc commit a7975f3
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions docs/performance.rst
Original file line number Diff line number Diff line change
Expand Up @@ -76,17 +76,17 @@ Method

This performance document is meant to be a simple, to-the-point, and preliminary digest of these results. We plan to submit a formal academic report for peer review of these results, in full detail, in February of 2020. These were prepared with the generous financial support of the Unitary Fund. However, we offer that our benchmark code is public, largely self-explanatory, and easily reproducible, while we prepare that report. Hence, we release these partial preliminary results now.

100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. Three tests were performed: the quantum Fourier transform, ("QFT"), random circuits constructed from a universal gate set, and an idealized approximation of Google's Sycamore chip benchmark, as per [Sycamore]_. The benchmarking code is available at `https://github.com/vm6502q/simulator-benchmarks<https://github.com/vm6502q/simulator-benchmarks>`_.
100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. Three tests were performed: the quantum Fourier transform, ("QFT"), random circuits constructed from a universal gate set, and an idealized approximation of Google's Sycamore chip benchmark, as per [Sycamore]_. The benchmarking code is available at `https://github.com/vm6502q/simulator-benchmarks <https://github.com/vm6502q/simulator-benchmarks>`_.

CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. Among AWS virtual machine instances, we sought to find those systems with the lowest possible cost to run the benchmarks for their respective execution times, at or below for the 28 qubit mark. An AWS g3s.xlarge running Ubuntu Server 18.04LTS was selected for GPU benchmarks. An AWS c5.4xlarge running Ubuntu Server 18.04LTS was selected for CPU benchmarks, including FFTW3 for comparison on the QFT test. Benchmarks were collected from December 27, 2019 through January 22, 2020. These results were combined with single gate, N-width gate and Grover's search benchmarks for Qrack, collected overnight from December 19th, 2018 into the morning of December 20th. (The potential difference since December 2018 in these particular Qrack tests reused from then should be insignificant. We took care to try to report fair tests, within cost limitations, but please let us know if you find anything that appears misrepresentative.)

The average time of each set of 100 was recorded and graphed. Grover's search to invert a black box subroutine, or "oracle," was similarly implemented for trials between 5 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover's algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm's period, being :math:`floor\left[\frac{\pi}{4asin^2\left(1/\sqrt{2^N}\right)}\right]` iterations for :math:`N` qubits.

Comparitive benchmarks included Cirq, QCGPU, and Qiskit, and Qrack's "QUnit" layer on top of "QFusion" and "QEngineOCL." (Default simulators were used, for other candidates, if more than one option was available.) QFT benchmarks could be implemented in a straightforward manner on all simulators, and were run as such. Qrack appears to be the only candidate considered for which inputs into the QFT can (drastically) affect its execution time, with permutation basis states being run in much shorter time, for example, hence only Qrack required a more general random input, whereas all other simulators including FFTW3 were started in the |0> state. For a sufficiently representatively general test, Qrack instead used registers of single separable qubits intialized with uniformly randomly distributed probability between |0> and |1>, and uniformly randomly distributed phase offset between those states. FFTW3 was included in this test, in "in-place" mode on complex single floats, since all other DFT tests are effectively "in-place" method, here.
Comparative benchmarks included Cirq, QCGPU, and Qiskit, and Qrack's "QUnit" layer on top of "QFusion" and "QEngineOCL." (Default simulators were used, for other candidates, if more than one option was available.) QFT benchmarks could be implemented in a straightforward manner on all simulators, and were run as such. Qrack appears to be the only candidate considered for which inputs into the QFT can (drastically) affect its execution time, with permutation basis states being run in much shorter time, for example, hence only Qrack required a more general random input, whereas all other simulators including FFTW3 were started in the |0> state. For a sufficiently representatively general test, Qrack instead used registers of single separable qubits intialized with uniformly randomly distributed probability between |0> and |1>, and uniformly randomly distributed phase offset between those states. FFTW3 was included in this test, in "in-place" mode on complex single floats, since all other DFT tests are effectively "in-place" method, here.

Random universal circuits carried out layers of single qubit gates on every qubit in the width of the test, followed by layers randomly selected couplings of (2-qubit) CNOT, CZ, and SWAP, or (3-qubit) CCNOT, eliminating each selected bit for the layer. 20 layers of 1-qubit-plus-multi-qubit iterations were carried out, for each qubit width, for the benchmarks presented here.

For the Sycamore circuit idealization, the quoted 2-qubit "typical" gate of "iSWAP" with "1/6th CZ" was used in all cases. Sycamore circuits were carried out similarly to random universal circuits and the method of the [Sycamore]_ paper, interleaving 1-qubit followed by 2-qubit layers, to depth of 20 layers each. Whereas as that original source appears to have randomly fixed its target circuit ahead of any trials, and then carried the same pre-selected circuit out repeatedly for the required number of trials, all benchmarks in the case of this report generated their circuits per-iteration on-the-fly, per the selection criteria as read from the text of [Sycamore]_. With the 2-qubit gate idealization already mentioned, Qrack easily implemented the original Sycamore circuit exactly. By nature of the simulation methods used in each other candidate, atomic "convenience method" 1-qubit and 2-qubit gate definitions could potentially easily be added to every every candidate for this test, hence we thought it most representative to make largely performance-irrelevant substitutions of 1-qubit and 2-qubit gates for those candidates which did not already define sufficient API convenience methods, without nonrepresentatively complicated gate decompositions. We strongly encourage the reader the inspect and independently execute the simple benchmarking code which was already linked in the beginning of this "Method" section, for total specific detail.
For the Sycamore circuit idealization, the quoted 2-qubit "typical" gate of "iSWAP" with "1/6th CZ" was used in all cases. Sycamore circuits were carried out similarly to random universal circuits and the method of the [Sycamore]_ paper, interleaving 1-qubit followed by 2-qubit layers, to depth of 20 layers each. Whereas as that original source appears to have randomly fixed its target circuit ahead of any trials, and then carried the same pre-selected circuit out repeatedly for the required number of trials, all benchmarks in the case of this report generated their circuits per-iteration on-the-fly, per the selection criteria as read from the text of [Sycamore]_. With the 2-qubit gate idealization already mentioned, Qrack easily implemented the original Sycamore circuit exactly. By nature of the simulation methods used in each other candidate, atomic "convenience method" 1-qubit and 2-qubit gate definitions could potentially easily be added to every every candidate for this test, hence we thought it most representative to make largely performance-irrelevant substitutions of 1-qubit and 2-qubit gates for those candidates which did not already define sufficient API convenience methods, without nonrepresentatively complicated gate decompositions. We strongly encourage the reader to inspect and independently execute the simple benchmarking code which was already linked in the beginning of this "Method" section, for total specific detail.

Qrack QEngine type heap usage was established as very closely matching theoretical expections, in earlier benchmarks, and this has not fundamentally changed. QUnit type heap usage varies greatly dependent on use case, though not in significant excess of QEngine types. No representative RAM benchmarks have been established for QUnit types, yet. QEngine Heap profiling was carried out with Valgrind Massif. Heap sampling was limited but ultimately sufficient to show statistical confidence.

Expand Down Expand Up @@ -117,7 +117,7 @@ Grover's algorithm is a relatively ideal test case, in that it allows a modicum

[Broda2016]_ discusses how Grover's might be adapted in practicality to actually "search an unstructured database," or search an unstructured lookup table, and Qrack is also capable of applying Grover's search to a lookup table with its IndexedLDA, IndexedADC, and IndexedSBC methods. Benchmarks are not given for this arguably more practical application of the algorithm, because few other quantum computer simulator libraries implement it, yet.

The "quantum" (or "discrete") Fourier transform (QFT/DFT) is a realistic and important test case for its direct application in day-to-day industrial computing, as well as for being a common processing step in many quantum algorithms.
The "quantum" (or "discrete") Fourier transform (QFT/DFT) is a realistic and important test case for its direct application in day-to-day industrial computing applications, as well as for being a common processing step in many quantum algorithms.

.. image:: performance/qft.png

Expand All @@ -127,11 +127,11 @@ Similarly, on random universal circuits, defined above and in the benchmark repo

.. image:: performance/random_universal.png

Qrack's QUnit makes a fundamental on an idealization of the Sycamore circuit, which we strongly encourage the reader to analyze and reproduce with the provided public benchmark code.
Qrack's QUnit makes a fundamental improvement on an idealization of the Sycamore circuit, which we strongly encourage the reader to analyze and reproduce with the provided public benchmark code.

.. image:: performance/sycamore.png

QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems effectively by Schmidt decomposition. Further, Qrack will avoid applying phase effects which make no difference to the expectation values of any Hermitian operators, (no difference to "physical observables,") when optimization can be achieved this way. For each bit whose representation is separated this way, we recover a factor of close to or exactly 1/2 the subsystem RAM and gate execution time. Under the domain constraints, QUnit outperforms all other simulators analyzed, one-way at high qubit counts.
QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems by Schmidt decomposition. Further, Qrack will avoid applying phase effects that make no difference to the expectation values of any Hermitian operators, (no difference to "physical observables,") when optimization can be achieved this way. For each bit whose representation is separated this way, we recover a factor of close to or exactly 1/2 the subsystem RAM and gate execution time. Under the domain constraints, QUnit outperforms all other simulators analyzed.

Discussion
**********
Expand All @@ -140,7 +140,7 @@ Up to a consistent deviation at low qubit counts, speed and RAM usage for Schrö

Qrack is written for scalable work distribution in the OpenCL kernels. QEngineOCL will distribute work among an arbitrarily small number of processing elements and max work item size, smaller than state vector size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, such as CPUs like the ARM7, on which Qrack has been regularly unit tested.

Qrack gives the option to normalize its state vector while flooring noise-level amplitudes at on-the-fly opportunities, to optimize while correcting for float rounding error, incurring overhead costs but potentially benefiting the stability of the simulation over very long strings of gate applications. QEngineOCL was designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available. QEngineOCL can also dispatch a queue of gates completely asynchronously, without blocking the main execution thread. Runtime options and design features to support a broad range of platforms do add to Qrack's execution overhead, but these make Qrack the best all-around quantum computer simulator for personal and heterogeneous hardware.
Qrack gives the option to normalize its state vector while flooring noise-level amplitudes at on-the-fly opportunities, to optimize while correcting for float rounding error. QEngineOCL was designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available. QEngineOCL can also dispatch a queue of gates completely asynchronously, without blocking the main execution thread. Runtime options and design features to support a broad range of platforms do add to Qrack's execution overhead, but these make Qrack the best all-around quantum computer simulator for personal and heterogeneous hardware.

Further Work
************
Expand Down

0 comments on commit a7975f3

Please sign in to comment.