Revert "Updating performance"

This reverts commit db0dd91.
vm6502q · Apr 6, 2020 · c18584e
1 parent db0dd91

Showing 6 changed files with 12 additions and 11 deletions.
diff --git a/docs/performance.rst b/docs/performance.rst
@@ -74,11 +74,11 @@ Disclaimers
 Method
 ******
 
-This performance document is meant to be a simple, to-the-point, and preliminary digest of these results. We plan to submit a formal academic report for peer review of these results, in full detail, as soon as we collect sufficient feedback on the preprint. (The originally planned date of submission was in February of 2020, but it seems that COVID-19 has hindered our ability to seek preliminary feedback.) These results were prepared with the generous financial support of the Unitary Fund. However, we offer that our benchmark code is public, largely self-explanatory, and easily reproducible, while we prepare that report. Hence, we release these partial preliminary results now.
+This performance document is meant to be a simple, to-the-point, and preliminary digest of these results. We plan to submit a formal academic report for peer review of these results, in full detail, in February of 2020. These were prepared with the generous financial support of the Unitary Fund. However, we offer that our benchmark code is public, largely self-explanatory, and easily reproducible, while we prepare that report. Hence, we release these partial preliminary results now.
 
 100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. Three tests were performed: the quantum Fourier transform ("QFT"), random circuits constructed from a universal gate set, and an idealized approximation of Google's Sycamore chip benchmark, as per [Sycamore]_. The benchmarking code is available at `https://github.com/vm6502q/simulator-benchmarks <https://github.com/vm6502q/simulator-benchmarks>`_.
 
-CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. Among AWS virtual machine instances, we sought to find those systems with the lowest possible cost to run the benchmarks for their respective execution times, at or below for the 28 qubit mark. An AWS g3s.xlarge running Ubuntu Server 18.04LTS was selected for GPU benchmarks. An AWS c5.4xlarge running Ubuntu Server 18.04LTS was selected for CPU benchmarks, including FFTW3 for comparison on the QFT test. Benchmarks were collected from December 27, 2019 through January 24, 2020. Given delays in soliciting peer opinion, while development of Qrack continued, the Qrack benchmarks were updated on April 5, 2020. These results were combined with single gate, N-width gate and Grover's search benchmarks for Qrack, collected overnight from December 19th, 2018 into the morning of December 20th. (The potential difference since December 2018 in these particular Qrack tests reused from then should be insignificant. We took care to try to report fair tests, within cost limitations, but please let us know if you find anything that appears misrepresentative.)
+CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. Among AWS virtual machine instances, we sought to find those systems with the lowest possible cost to run the benchmarks for their respective execution times, at or below for the 28 qubit mark. An AWS g3s.xlarge running Ubuntu Server 18.04LTS was selected for GPU benchmarks. An AWS c5.4xlarge running Ubuntu Server 18.04LTS was selected for CPU benchmarks, including FFTW3 for comparison on the QFT test. Benchmarks were collected from December 27, 2019 through January 24, 2020. These results were combined with single gate, N-width gate and Grover's search benchmarks for Qrack, collected overnight from December 19th, 2018 into the morning of December 20th. (The potential difference since December 2018 in these particular Qrack tests reused from then should be insignificant. We took care to try to report fair tests, within cost limitations, but please let us know if you find anything that appears misrepresentative.)
 
 The average time of each set of 100 was recorded and graphed. Grover's search to invert a black box subroutine, or "oracle," was similarly implemented for trials between 5 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover's algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm's period, being :math:`\left\lfloor\frac{\pi}{4\arcsin\left(1/\sqrt{2^N}\right)}\right\rfloor` iterations for :math:`N` qubits.
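
As a rough illustration of the iteration count described above, the following Python sketch evaluates the standard Grover count for a single marked item, assuming the arcsine enters to the first power, i.e. floor(pi / (4 * asin(1 / sqrt(2^N)))). The function names are hypothetical and are not taken from the Qrack benchmark code.

```python
import math


def grover_iterations(num_qubits: int) -> int:
    """Standard Grover iteration count for one marked item among 2**num_qubits.

    Assumption: the count is floor(pi / (4 * theta)), where
    theta = asin(1 / sqrt(2**num_qubits)) is the rotation half-angle.
    """
    theta = math.asin(1.0 / math.sqrt(2.0 ** num_qubits))
    return math.floor(math.pi / (4.0 * theta))


def success_probability(num_qubits: int, iterations: int) -> float:
    """Probability of measuring the marked item after `iterations` Grover steps."""
    theta = math.asin(1.0 / math.sqrt(2.0 ** num_qubits))
    return math.sin((2 * iterations + 1) * theta) ** 2


if __name__ == "__main__":
    # Qubit range used for the Grover's search trials in the text.
    for n in range(5, 21):
        k = grover_iterations(n)
        print(n, k, success_probability(n, k))
```

Under this assumption the count grows roughly as the square root of the search space, which is why the trials remain tractable up to 20 qubits.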
 
@@ -121,30 +121,31 @@ The "quantum" (or "discrete") Fourier transform (QFT/DFT) is a realistic and imp
 
 .. image:: performance/qft.png
 
-.. image:: performance/qft_optimization.png
+Recall that QCGPU and Qrack are GPU-implementations run on AWS g3s.xlarge instances, whereas all other candidates are run on AWS c5.4xlarge instances. Under these considerations, by the 28 qubit level, Qrack out-performs all other candidates except FFTW3. (Recall, also, that Qrack uses a representatively "hard" initialization on this test, as described above, whereas permutation basis eigenstate inputs, for example, are much more quickly executed.) Though we are comparing CPU to GPU, CPU-based FFTW3 is clearly the best suited for low numbers of qubits, in general.
 
-Recall that QCGPU and Qrack are GPU-implementations run on AWS g3s.xlarge instances, whereas all other candidates are run on AWS c5.4xlarge instances. Under these considerations, by the 28 qubit level, Qrack out-performs all other candidates except FFTW3. (Recall, also, that Qrack uses a representatively "hard" initialization on this test, as described above, whereas permutation basis eigenstate inputs, for example, are much more quickly executed.) Though we are comparing CPU to GPU, CPU-based FFTW3 is clearly the best suited for low numbers of qubits, in general. However, Qrack is the only candidate tested which exhibits even better special case performance on the QFT, as for random permutation basis eigenstate initialization, or initialization via permutation basis eigenstates with random "H" gates applied, before QFT.
+Similarly, on random universal circuits, defined above and in the benchmark repository, Qrack and QCGPU are closely matched (for cost) on AWS systems.
 
-Similarly, on random universal circuits, defined above and in the benchmark repository, Qrack leads over all other candidates considered by the 21 qubit mark and up. GPU-based QCGPU leads on the test system for 20 qubits and below, and CPU-based Cirq leads for 8 qubits and fewer.
-
-.. image:: performance/random_circuit.png
+.. image:: performance/random_universal.png
 
 Qrack's QUnit makes a fundamental improvement on an idealization of the Sycamore circuit, which we strongly encourage the reader to analyze and reproduce with the provided public benchmark code.
 
 .. image:: performance/sycamore.png
 
+QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems by Schmidt decomposition. Further, Qrack will avoid applying phase effects that make no difference to the expectation values of any Hermitian operators (that is, no difference to "physical observables") when optimization can be achieved this way. For each bit whose representation is separated this way, we recover a factor of close to or exactly 1/2 the subsystem RAM and gate execution time. Under the domain constraints, QUnit outperforms all other simulators analyzed.
 
 Discussion
 **********
 
-Up to a consistent deviation at low qubit counts, speed and RAM usage for Schrödinger method "QEngine" types is well predicted by theoretical complexity considerations of the gates, up to about a factor of 2 on heap usage for duplication of the state vector, with additional 1/2 the size of state vector allocated by QEngineOCL for an auxiliary normalization buffer.
+Up to a consistent deviation at low qubit counts, speed and RAM usage for Schrödinger method "QEngine" types is well predicted by theoretical complexity considerations of the gates, up to about a factor of 2 on heap usage for duplication of the state vector, with additional 1/2 the size of state vector allocated by QEngineOCL for an auxiliary normalization buffer. For the additional overhead between Qrack's optimized QUnit and QCGPU in the comparative benchmarks, the difference might come down to Qrack's support for a more general API and set of compatible systems.
+
+Qrack is written for scalable work distribution in the OpenCL kernels. QEngineOCL will distribute work among an arbitrarily small number of processing elements and max work item size, smaller than state vector size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, such as CPUs like the ARM7, on which Qrack has been regularly unit tested.
 
-Qrack::QUnit succeeds as a novel and fundamentally improved quantum simulation algorithm, over the naive Schrödinger algorithm. Primarily, QUnit does this by representing its state vector in terms of decomposed subsystems, as well as buffering and commuting H gates and singly-controlled gates. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems by Schmidt decomposition. Further, Qrack will avoid applying phase effects that make no difference to the expectation values of any Hermitian operators, (no difference to "physical observables"). For each bit whose representation is separated this way, we recover a factor of close to or exactly 1/2 the subsystem RAM and gate execution time.
+Qrack gives the option to normalize its state vector while flooring noise-level amplitudes at on-the-fly opportunities, to optimize while correcting for float rounding error. QEngineOCL was designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available. QEngineOCL can also dispatch a queue of gates completely asynchronously, without blocking the main execution thread. Runtime options and design features to support a broad range of platforms do add to Qrack's execution overhead, but these make Qrack the best all-around quantum computer simulator for personal and heterogeneous hardware.
 
 Further Work
 ************
 
-A formal report of the above and additional benchmark results, in much greater detail and specificity, is planned to be submitted for publication as soon as sufficient preliminary peer opinion can be collected on the preprint, in early to mid 2020, thanks to the generous support of the Unitary Fund.
+A formalized report of the above and additional benchmark results, in much greater detail and specificity, is planned to be submitted for publication in February, 2020, thanks to the generous support of the Unitary Fund.
 
 Qrack previously contained two experimental multiprocessor types, "QEngineOCLMulti" based on the algorithms developed in Intel's [QHiPSTER]_, and the simpler QUnitMulti type, which dispatches different separable subsystems to different processors. These failed to outperform the single processor QEngineOCL. However, as Qrack has added optional support as a simulator for ProjectQ, we have effectively gained access to the quantum network simulator "SimulaQron" by SoftwareQuTech. At least one Qrack user is experimenting with scaling deployments of containers loaded with Qrack, ProjectQ, and SimulaQron as an effective solution for multiprocessor and cluster operations, and the Qrack team is looking at this and related approaches for this purpose. An asynchronous quantum P2P model, for effective multiprocessor support, should hopefully reduce inter-device communication overhead bottlenecks.
 
@@ -155,7 +156,7 @@ We will maintain systematic comparisons to published benchmarks of quantum compu
 Conclusion
 **********
 
-Per [Pednault2017]_, and many other attendant and synergistic optimizations engineered specifically in Qrack's QUnit, explicitly separated subsystems of qubits in QUnit have a significant RAM and speed edge in many cases over the Schrödinger algorithm of most popular quantum computer simulators. Qrack gives very efficient performance on a single node past 32 qubits, up to the limit of maximal entanglement.
+Per [Pednault2017]_, and many other attendant and synergistic optimizations engineered specifically in Qrack's QUnit, explicitly separated subsystems of qubits in QUnit have a significant RAM and speed edge in many cases over the "Schrödinger algorithm" of most popular quantum computer simulators. Qrack gives very efficient performance on a single node past 32 qubits, up to the limit of maximal entanglement.
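
To make the RAM edge of separated subsystems concrete, here is a minimal arithmetic sketch in Python. It assumes 8 bytes per complex amplitude (two single-precision floats), which is an illustrative choice and not necessarily Qrack's build configuration. The function names are hypothetical and do not correspond to any Qrack API.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 8) -> int:
    """Bytes needed for a full state vector of 2**num_qubits amplitudes.

    Assumption: 8 bytes per amplitude (single-precision complex).
    """
    return (1 << num_qubits) * bytes_per_amplitude


def separated_bytes(subsystem_sizes, bytes_per_amplitude: int = 8) -> int:
    """Bytes needed when each separable subsystem keeps its own state vector."""
    return sum(state_vector_bytes(n, bytes_per_amplitude) for n in subsystem_sizes)


if __name__ == "__main__":
    full = state_vector_bytes(28)          # one 28-qubit state vector
    split = separated_bytes([27, 1])       # one qubit separated out
    print(full, split, split / full)
```

Separating a single qubit from a 28-qubit register roughly halves the footprint in this model, and each further separated qubit halves the largest remaining subsystem again. Once the qubits become maximally entangled, the cost returns to that of the full state vector.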
 
 Citations
 *********

diff --git a/docs/performance/qft.png b/docs/performance/qft.png
diff --git a/docs/performance/qft_optimization.png b/docs/performance/qft_optimization.png
diff --git a/docs/performance/random_circuit.png b/docs/performance/random_circuit.png
diff --git a/docs/performance/random_universal.png b/docs/performance/random_universal.png
diff --git a/docs/performance/sycamore.png b/docs/performance/sycamore.png