Typo fixes
WrathfulSpatula committed Oct 21, 2019
1 parent 9b60366 commit fd6474f
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions docs/performance.rst
@@ -120,31 +120,31 @@ Grover's algorithm is a relatively ideal test case, in that it allows a modicum

[Broda2016]_ discusses how Grover's might be adapted in practicality to actually "search an unstructured database," or search an unstructured lookup table, and Qrack is also capable of applying Grover's search to a lookup table with its IndexedLDA, IndexedADC, and IndexedSBC methods. Benchmarks are not given for this arguably more practical application of the algorithm, because few other quantum computer simulator libraries implement it, yet.

- The "quantum" (or "discrete") Fourier transform (QFT/DFT) is another realistic test case. Other simulators were also tested on their QFT execution time, including QCGPU, ProjectQ, and FFTW3, (which is not explicitly quantum computer simulator software). For ease of comparison, in consideration of realistic use cases, the only Qrack options include are the (Schrödinger method) QEngineOCL and the one thought to be best for general use, "QUnit" (Schmidt-decomposition) layered on "QFusion" ("gate fusion") layered on a collection of "QEngineOCL" instances (OpenCL Schrödinger method base engines).
+ The "quantum" (or "discrete") Fourier transform (QFT/DFT) is another realistic test case. Other simulators were also tested on their QFT execution time, including QCGPU, ProjectQ, and FFTW3, (which is not explicitly quantum computer simulator software). For ease of comparison, in consideration of realistic use cases, the only Qrack options include are the (Schrödinger method) QEngineOCL and the one thought to be best for general use, "QUnit" (Schmidt decomposition) layered on "QFusion" ("gate fusion") layered on a collection of "QEngineOCL" instances (OpenCL Schrödinger method base engines).

.. image:: performance/qft.png

Both QEngineOCL and QUnit generally outperform the default simulator "backend" for ProjectQ. However, Qrack has also been wrapped as an optional backend for ProjectQ, and benchmarks for this layering of both projects will follow.

Though we are comparing CPU to GPU, CPU-based FFTW3 is clearly the best suited for low numbers of qubits, in general. After FFTW3, QEngineOCL outperforms QCGPU for low qubit widths. Both OpenCL simulators follow a trend that appears to reach a knee of faster exponential growth. The "knee" comes at a lower number of qubits for QEngineOCL than for QCGPU, at about 19 or 18 qubits, versus 25 or 24 qubits for QCGPU.

- QUnit was analyzed based both on its QFT time alone, and on half its round-trip time for the QFT followed by its inverse. The distribution of its times was log normal for the random input state distributions selected, so the times given are "exponent-mean-log," 2 raised to the mean of the log base 2 of the trial times, for which the mean closely matches the median and the standard deviations are consistently very small. QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems effectively by Schmidt decomposition. Further, it proactively attempts to avoid entanglement of representation in controlled and arithmetic gates. For each bit whose representation is separated this way, we recove a factor of close to or exactly 1/2 the subsystem RAM and gate execution time. Under the domain constraints, QUnit outperforms all other simulators analyzed, one-way at high qubit counts.
+ QUnit was analyzed based both on its QFT time alone, and on half its round-trip time for the QFT followed by its inverse. The distribution of its times was log normal for the random input state distributions selected, so the times given are "exponent-mean-log," 2 raised to the mean of the log base 2 of the trial times, for which the mean closely matches the median and the standard deviations are consistently very small. QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems effectively by Schmidt decomposition. Further, it proactively attempts to avoid entanglement of representation in controlled and arithmetic gates. For each bit whose representation is separated this way, we recover a factor of close to or exactly 1/2 the subsystem RAM and gate execution time. Under the domain constraints, QUnit outperforms all other simulators analyzed, one-way at high qubit counts.
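
For readers who want to reproduce the "exponent-mean-log" statistic described above, the following is a minimal sketch, not code taken from the Qrack benchmark suite. The function name ``exponent_mean_log`` and the sample timings are hypothetical; the calculation itself (2 raised to the mean of the base-2 logarithms, which is the geometric mean) follows the definition given in the text.

```python
import math


def exponent_mean_log(times):
    """Return 2 raised to the mean of log2 of the trial times.

    This is the geometric mean of the samples, which suits
    log-normally distributed timings.
    """
    if not times:
        raise ValueError("need at least one trial time")
    mean_log2 = sum(math.log2(t) for t in times) / len(times)
    return 2.0 ** mean_log2


# Hypothetical trial times in seconds, chosen only for illustration.
trial_times = [1.0, 2.0, 4.0]
print(exponent_mean_log(trial_times))
```

Unlike an arithmetic mean, this statistic is not dominated by a few slow outlier trials, which is why it tracks the median closely for log-normal data.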

- Since QUnit does not necessarily require the exponentially scaling resources of the "Schrödinger method," the authors thought to test it informally at much higher qubit counts. Qrack was initially designed in expectation that it might scale past 32 qubits, and 64 qubits was consciously chosen as the maximum potential allocation limit per single coherent instance, with easy ways to expand should 128 integral types become commonly available as a standard. A recently resolved issue opened by a third party on the Qrack Github repo prompted the authors to test and ensure support above the 32 qubit level, giving us confidence that addressing is now reliable up to 63 qubits. (Our tests do point to instability specifically in the 64th bit, commonly reserved for numerical sign in the 64 bit integral types Qrack relies on for qubit addressing.) One-way QFT benchmarks as in the graph above were left to run overnight starting October 17th, 2019 on the same Alienware 17 as other reported benchmarks, and no instability was detected from 4 to 63 qubits, for 100 random trials apiece. Execution time appears to plateau around the mid-30 qubit range; for quantitative point of reference, we report that 100 random trials of the 63 qubit QFT on an AWS p3.2xlarge completed on average (log-normal) in 19.0 seconds, the fastest taking 13.5 seconds and the slowest taking 28.3 seconds. (Remember that the domain was limited to states constructed from random Hadamard gates on bits of a random permutation.)
+ Since QUnit does not necessarily require the exponentially scaling resources of the "Schrödinger method," the authors thought to test it informally at much higher qubit counts. Qrack was initially designed in expectation that it might scale past 32 qubits, and 64 qubits was consciously chosen as the maximum potential allocation limit per single coherent instance, with easy ways to expand should 128 integral types become commonly available as a standard. A recently resolved issue opened by a third party on the Qrack Github repository prompted the authors to test and ensure support above the 32 qubit level, giving us confidence that addressing is now reliable up to 63 qubits. (Our tests do point to instability specifically in the 64th bit, commonly reserved for numerical sign in the 64 bit integral types Qrack relies on for qubit addressing.) One-way QFT benchmarks as in the graph above were left to run overnight starting October 17th, 2019 on the same Alienware 17 as other reported benchmarks, and no instability was detected from 4 to 63 qubits, for 100 random trials apiece. Execution time appears to plateau around the mid-30 qubit range; for quantitative point of reference, we report that 100 random trials of the 63 qubit QFT on an AWS p3.2xlarge completed on average (log-normal) in 19.0 seconds, the fastest taking 13.5 seconds and the slowest taking 28.3 seconds. (Remember that the domain was limited to states constructed from random Hadamard gates on bits of a random permutation.)
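
The 63-qubit addressing limit mentioned above can be illustrated with plain integer arithmetic. This is only a sketch of the general reasoning about signed 64-bit indices, assuming basis states are indexed from 0 to 2^n - 1; the helper names are hypothetical and nothing here is taken from Qrack's source.

```python
INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer


def max_basis_index(qubits):
    """Largest basis-state index for a register of the given width."""
    return (1 << qubits) - 1


def fits_signed_64(qubits):
    """True if every basis-state index fits in a signed 64-bit integer."""
    return max_basis_index(qubits) <= INT64_MAX


print(fits_signed_64(63))  # all 63-qubit indices avoid the sign bit
print(fits_signed_64(64))  # 64 qubits would need the sign bit as data
```

With 64 qubits, the top half of the index range would have the sign bit set, which is consistent with the instability the authors report in the 64th bit.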

Discussion
**********

Up to a consistent deviation at low qubit counts, speed and RAM usage for Schrödinger method "QEngine" types is well predicted by theoretical complexity considerations of the gates, up to about a factor of 2 on heap usage for duplication of the state vector, with additional 1/2 the size of state vector allocated by QEngineOCL for an auxiliary normalization buffer. For the additional overhead in the comparative QFT benchmarks, the difference between QCGPU and QEngineOCL might come down to Qrack's support for a more general API and set of compatible systems.
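
As a rough illustration of the heap estimate in the paragraph above, the sketch below adds up a state vector, one duplicate of it, and a half-size auxiliary buffer. The 8 bytes per amplitude (single-precision complex) is an assumption made for this example, as is the choice to sum the three allocations; the function names are hypothetical and the figures are not measurements.

```python
def state_vector_bytes(qubits, bytes_per_amplitude=8):
    """Bytes for 2**qubits amplitudes (8 bytes each is an assumption)."""
    return (1 << qubits) * bytes_per_amplitude


def peak_heap_estimate(qubits, bytes_per_amplitude=8):
    """State vector + one duplicate + a half-size normalization buffer."""
    base = state_vector_bytes(qubits, bytes_per_amplitude)
    return 2 * base + base // 2


for n in (20, 25, 30):
    mib = peak_heap_estimate(n) / 2**20
    print(f"{n} qubits: about {mib:.0f} MiB")
```

Each added qubit doubles every term, which is the exponential scaling the Schrödinger method implies.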

- Qrack is written for scalable work distribution in the OpenCL kernels. QEngineOCL will distribute work among an arbitrarily small number of processing elements and max work item size than state vector size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, such as CPUs like the ARM7, on which Qrack has been regularly unit tested.
+ Qrack is written for scalable work distribution in the OpenCL kernels. QEngineOCL will distribute work among an arbitrarily small number of processing elements and max work item size, smaller than state vector size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, such as CPUs like the ARM7, on which Qrack has been regularly unit tested.
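
The idea of giving many amplitude transformations to a few work items can be sketched with a generic strided loop. This is a plain Python model of the pattern, not Qrack's OpenCL kernel code; ``strided_indices`` and ``apply_in_place`` are hypothetical names, and the outer loop stands in for work items that a real device would run in parallel.

```python
def strided_indices(work_item_id, work_item_count, total_items):
    """Indices one work item would visit when striding by the item count."""
    return range(work_item_id, total_items, work_item_count)


def apply_in_place(amplitudes, transform, work_item_count):
    """Apply transform to every amplitude using a fixed number of work items."""
    total = len(amplitudes)
    for work_item_id in range(work_item_count):  # parallel on a real device
        for i in strided_indices(work_item_id, work_item_count, total):
            amplitudes[i] = transform(amplitudes[i])
    return amplitudes


# 16 amplitudes handled by only 4 work items: each loops 4 times.
state = [complex(k, 0) for k in range(16)]
apply_in_place(state, lambda a: -a, work_item_count=4)
print(state[:4])
```

The inner loop is the "additional looping overhead" the text refers to: it costs some time per work item, but it removes any requirement that the device support as many work items as there are amplitudes.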

- Qrack gives the option to normalize its state vector at on-the-fly opportunities, to correct for float rounding error, incurring overhead costs but benefiting the accuracy of the simulation over very long strings of gate applications. (Normalization was off in all benchmarks, but "host code" must switch between these options.) QEngineOCL was designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available. QEngine types can dispatch a queue of gates completely asynchronously, without blocking the main execution thread. Runtime options and design features to support a broad range of platforms do add to Qrack's execution overhead, but these make Qrack the best all-around quantum computer simulator for personal and heterogeneous hardware.
+ Qrack gives the option to normalize its state vector at on-the-fly opportunities, to correct for float rounding error, incurring overhead costs but benefiting the accuracy of the simulation over very long strings of gate applications. (Normalization was off in all benchmarks, but "host code" must switch between these options.) QEngineOCL was designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available. QEngineOCL can also dispatch a queue of gates completely asynchronously, without blocking the main execution thread. Runtime options and design features to support a broad range of platforms do add to Qrack's execution overhead, but these make Qrack the best all-around quantum computer simulator for personal and heterogeneous hardware.
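
To make the normalization step concrete, here is a minimal sketch of rescaling a state vector whose norm has drifted from 1 through rounding error. It shows the general technique only; ``normalize`` is a hypothetical helper, the drifted amplitudes are made up for illustration, and the sketch does not reflect how or when Qrack schedules this work.

```python
import math


def normalize(amplitudes):
    """Rescale amplitudes so their squared magnitudes sum to 1."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in amplitudes))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in amplitudes]


# Hypothetical state whose norm has drifted slightly away from 1.
drifted = [complex(0.6, 0.0), complex(0.0, 0.8000001)]
fixed = normalize(drifted)
print(sum(abs(a) ** 2 for a in fixed))
```

The extra pass over every amplitude is the overhead cost the paragraph mentions, which is why it is left as a runtime option.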

Further Work
************

- Qrack previously contained two experimental multiprocessor types, "QEngineOCLMulti" based on the algorithms developed in Intel's [QHiPSTER]_, and the simpler QUnitMulti type, which dispatches different separable subsystems to different processors. These failed to outperform the single processor QEngineOCL. However, as Qrack has added optional support as a simulator for ProjectQ, we have effectively gained access to the quantum network simulator "SimulaQron" by SoftwareQuTech. At least one Qrack user is experimenting with scaling deployments of containers loaded with Qrack, ProjectQ, and SimulaQron as effective solution for multiprocessor and cluster operations, and the Qrack Team is looking at this and related approaches for this purpose. An asynchronous quantum P2P model, for effective multiprocessor support, should hopefully reduce inter-device communication overhead bottlenecks.
+ Qrack previously contained two experimental multiprocessor types, "QEngineOCLMulti" based on the algorithms developed in Intel's [QHiPSTER]_, and the simpler QUnitMulti type, which dispatches different separable subsystems to different processors. These failed to outperform the single processor QEngineOCL. However, as Qrack has added optional support as a simulator for ProjectQ, we have effectively gained access to the quantum network simulator "SimulaQron" by SoftwareQuTech. At least one Qrack user is experimenting with scaling deployments of containers loaded with Qrack, ProjectQ, and SimulaQron as an effective solution for multiprocessor and cluster operations, and the Qrack team is looking at this and related approaches for this purpose. An asynchronous quantum P2P model, for effective multiprocessor support, should hopefully reduce inter-device communication overhead bottlenecks.

Qrack seems to be nearly a strict superset of Gottesman-Knill "classically efficient" stabilizer simulators. Qrack supports freely invoking gates outside of a stabilizer's efficient Clifford algebra, as well as fundamental algorithmic improvements outside of any Clifford algebra. It is a high and immediate priority for the authors to handle the one or few remaining cases of CNOT gate optimization needed for a classically efficient strict superset of a Clifford algebra.
