Benchmarks 2019-05-27

vm6502q · May 27, 2019 · 750e826
1 parent 1d4c25d
commit 750e826

Showing 7 changed files with 8 additions and 8 deletions.
diff --git a/docs/performance.rst b/docs/performance.rst
@@ -75,13 +75,13 @@ Disclaimers
 Method
 ******
 
-100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. The benchmarking code is available at `https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp <https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp>`_, in which qubit count can be set and tests can be commented out to reduce to the relevant subset.
+100 timed trials of single and parallel gates were run for each qubit count between 5 and 28 qubits. The benchmarking code is available at `https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp <https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp>`_, in which qubit count can be set and tests can be commented out to reduce to the relevant subset. (Specifically, the branch "benchmarks_2019-05-27" was used, branched off of "qunitmulti," pending the latter branch's merge into master.)
 
-CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. An AWS p3.2xlarge running Ubuntu Server 18.04LTS was used for GPU benchmarks. An Alienware 17 R5 with an Intel(R) Core(TM) i7-8750H running Ubuntu 18.04LTS was used for Qrack CPU benchmarks. Updated Qrack GPU benchmarks were collected May 19th, 2019. These results were combined with an earlier set of test results collected starting on the night of December 19th, 2018 into the morning of December 20th, and additional quantum Fourier transform ("QFT") benchmarks were collected on the night of January 5th, 2019 into the morning of January 6th. To reduce cost, code revisions on Github were compared between December 19th, 2018 and May 19th 2019, for QCGPU and ProjectQ, and lower qubit counts were run to determine that there is no significant change in the set of benchmarks from December and January. (If this is in error, we took care to try to report fair tests, within cost limitations, but please let us know.)
+CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. An AWS p3.2xlarge running Ubuntu Server 18.04LTS was used for GPU benchmarks. An Alienware 17 R5 with an Intel(R) Core(TM) i7-8750H running Ubuntu 18.04LTS was used for Qrack CPU benchmarks and FFTW3. Updated Qrack GPU benchmarks, and benchmarks for FFTW3, were collected May 27th, 2019. These results were combined with an earlier set of test results collected starting on the night of December 19th, 2018 into the morning of December 20th, and additional quantum Fourier transform ("QFT"/"DFT") benchmarks were collected on the night of January 5th, 2019 into the morning of January 6th. To reduce cost, code revisions on Github were compared between December 19th, 2018 and May 19th 2019, for QCGPU and ProjectQ, and lower qubit counts were run to determine that there is no significant change in the set of benchmarks from December and January. (If this is in error, we took care to try to report fair tests, within cost limitations, but please let us know.)
 
-The average and quartile boundary values of each set of 100 were recorded and graphed. Grover's search to invert a black box subroutine, or "oracle," was similarly implemented for trials between 4 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover's algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm's period, being :math:`\left\lfloor\frac{\pi}{4\arcsin\left(1/\sqrt{2^N}\right)}\right\rfloor` iterations for :math:`N` qubits.
+The average and quartile boundary values of each set of 100 were recorded and graphed. Grover's search to invert a black box subroutine, or "oracle," was similarly implemented for trials between 5 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover's algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm's period, being :math:`\left\lfloor\frac{\pi}{4\arcsin\left(1/\sqrt{2^N}\right)}\right\rfloor` iterations for :math:`N` qubits.
 
-A quantum Fourier transform was run for 100 trials between 4 and 28 qubits on the GPU engine types. ProjectQ and QCGPU were also benchmarked for this test, on an AWS p3.2xlarge running Ubuntu Server 18.04LTS, with the benchmarking code provided by QCGPU, accessed from its Github repository on 12/1/18, in Python 3. The benchmarking script available from QCGPU iterates at random between the engine options, but it was slightly modified to alternate between trials of ProjectQ and QCGPU, to get exactly 100 samples apiece per qubit count.
+A quantum Fourier transform was run for 100 trials between 5 and 28 qubits on the GPU engine types. ProjectQ, QCGPU, and FFTW3 were also benchmarked for this test. ProjectQ and QCGPU were tested on an AWS p3.2xlarge running Ubuntu Server 18.04LTS, with the benchmarking code provided by QCGPU, accessed from its Github repository on 12/1/18, in Python 3. The benchmarking script available from QCGPU iterates at random between the engine options, but it was slightly modified to alternate between trials of ProjectQ and QCGPU, to get exactly 100 samples apiece per qubit count. FFTW3 was tested on the aforementioned Alienware 17 R5, in "in-place" mode, since all other DFT tests here are effectively "in-place."
 
 QUnit was analyzed based on half its round trip time, for the application of the QFT followed by its inverse. This is a representative test of Qrack::QUnit performance, because QFT of a permutation basis eigenstate is generally much faster than the inverse operation applied to the result. QUnit explicitly separates the representation of qubit subsystems when it can, such that performance differs greatly between entirely separable and entirely entangled cases, hence we need to show various representative cases. Separability considerations do not affect simulators that always use a fully entangled representation, as all other simulators tested here do (though ProjectQ includes a compilation layer on top of its default simulator).
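
The Grover iteration count described in the Method text above can be sketched as follows. This is a minimal illustration, not the code in ``test/benchmarks.cpp``: the function names are hypothetical, and it assumes a single marked item among :math:`2^N` entries, with the arcsine taken unsquared as in the standard analysis of Grover's algorithm.

```python
import math


def grover_iterations(n_qubits: int) -> int:
    """Iteration count floor(pi / (4 * asin(1 / sqrt(2**N)))).

    Assumes exactly one marked item among 2**N entries (hypothetical helper,
    not part of Qrack).
    """
    theta = math.asin(1.0 / math.sqrt(2.0 ** n_qubits))
    return math.floor(math.pi / (4.0 * theta))


def success_probability(n_qubits: int, iterations: int) -> float:
    """Probability of measuring the marked item after `iterations` rounds."""
    theta = math.asin(1.0 / math.sqrt(2.0 ** n_qubits))
    return math.sin((2 * iterations + 1) * theta) ** 2


if __name__ == "__main__":
    # Qubit range taken from the Grover benchmarks described above.
    for n in range(5, 21):
        k = grover_iterations(n)
        print(n, k, round(success_probability(n, k), 6))
```

The iteration count grows roughly as :math:`\sqrt{2^N}`, so each additional pair of qubits about doubles the number of oracle calls per trial.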
 
@@ -114,24 +114,24 @@ Grover's algorithm is a relatively ideal test case, in that it allows a modicum
 
 [Broda2016]_ discusses how Grover's might be adapted in practicality to actually "search an unstructured database," or search an unstructured lookup table, and Qrack is also capable of applying Grover's search to a lookup table with its IndexedLDA, IndexedADC, and IndexedSBC methods. Benchmarks are not given for this arguably more practical application of the algorithm, because few other quantum computer simulator libraries implement it, yet.
 
-The quantum Fourier transform (QFT) is another realistic test case. Other simulators were also tested on the QFT. QFT operations were directly "chained," starting from the |0> permutation state. Qrack::QUnit was able to recover full (or virtually full) separability of qubits at every other step of 100 iterations, oscillating between modes of the "entangled" and "separable" QUnit median trends shown in the graph.
+The "quantum" (or "discrete") Fourier transform (QFT/DFT) is another realistic test case. Other software was also tested on the QFT, including QCGPU, ProjectQ, and FFTW3 (which is not explicitly quantum simulation software).
 
-QEngineCPU took approximately 100 seconds per 1 trial (of 100) for 22 qubits and approximately 200 seconds for a 23 qubit QFT, and testing the QEngineCPU type therefore became prohibitive, for the full range of qubits between 4 and 28. To avoid confusion in the graph, and since QEngineCPU might therefore be impractical for large QFTs, we leave both it and its QUnit/QFusion variant off the graph.
+QEngineCPU took approximately 100 seconds per 1 trial (of 100) for 22 qubits and approximately 200 seconds for a 23 qubit QFT, and testing the QEngineCPU type therefore became prohibitive, for the full range of qubits between 5 and 28. To avoid confusion in the graph, and since QEngineCPU might therefore be impractical for large QFTs, we leave both it and its QUnit/QFusion variant off the graph.
 
 .. image:: performance/qft.png
 
 QEngineOCL generally outperforms the default simulator "backend" for ProjectQ. However, Qrack has also been wrapped as an optional backend for ProjectQ, and benchmarks for this layering of both projects will follow.
 
 For lower numbers of qubits, QEngineOCL outperforms QCGPU. Both simulators follow a trend that appears to reach a knee of faster exponential growth. The "knee" comes at a lower number of qubits for QEngineOCL than for QCGPU, at about 19 or 18 qubits, versus 25 or 24 qubits for QCGPU.
 
-QUnit was analyzed based on half its round trip time, for the application of the QFT followed by its inverse. The distribution of its times was log normal for the random input state distributions selected, so the times given are "exponent-mean-log," 2 raised to the mean of the log base 2 of the trial times, for which the mean closely matches the median and the standard deviations are consistently very small. QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. After an operation that should disentangle subsystems, QUnit can optionally try to separate the representations of independent subsystems, recovering a factor of 1/2, for each separated bit, of subsystem RAM and gate application time for further gates. QUnit outperforms all other simulators analyzed, when transforming a permutation basis eigenstate, without attempting to recover additional separability after the inverse transform. It fails to outperform QCGPU on half the round trip, if we attempt to separate QUnit's subsystems on the return trip, but further gate application will usually receive a large boost from this attempt at subsystem separation. A linear superposition of permutation states probably represents a realistic and fairly general set of inputs or outputs for the QFT; QUnit times are very close to QEngineOCL times for a random linear superposition of permutation basis input states.
+QUnit was analyzed based on half its round trip time, for the application of the QFT followed by its inverse. The distribution of its times was log normal for the random input state distributions selected, so the times given are "exponent-mean-log," 2 raised to the mean of the log base 2 of the trial times, for which the mean closely matches the median and the standard deviations are consistently very small. QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. After an operation that should disentangle subsystems, QUnit will attempt to separate the representations of independent subsystems, in the course of further controlled gates, recovering a factor of 1/2 for each separated bit of subsystem RAM and gate application time for further gates. QUnit outperforms all other simulators analyzed, when transforming a permutation basis eigenstate. A linear superposition of permutation states probably represents a realistic and fairly general set of inputs or outputs for the QFT; QUnit times outperform QEngineOCL times and come within a factor of 2, at the high end, for a random linear superposition of permutation basis input states.
 
 Discussion
 **********
 
 Up to a consistent deviation at low qubit counts, speed and RAM usage are well predicted by theoretical complexity considerations of the gates, up to about a factor of 2 on heap usage for duplication of the state vector, with an additional 1/2 the size of the state vector allocated by QEngineOCL for an auxiliary normalization buffer.
 
-In the comparative QFT benchmarks, the difference between QCGPU and Qrack in the "knee" in the base engine might be partially due to scalable work distribution in the OpenCL kernels. QEngineOCL is written to distribute work among an arbitrarily small number of processing elements and max work item size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits, which might or might not prove prohibitive in addressing the largest possible amount of general RAM on typical GPUs. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, potentially such as CPUs. Additionally, Qrack normalizes its state vector at on-the-fly opportunities, to correct for float rounding error, incurring overhead costs but benefiting the accuracy of the simulation over very long strings of gate applications. QEngineOCL was also designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available, which might add host overhead.
+For the additional overhead in the comparative QFT benchmarks, the difference between QCGPU and Qrack might come down to support for a much more general API and set of compatible systems, for Qrack. For example, Qrack is written for scalable work distribution in the OpenCL kernels. QEngineOCL is written to distribute work among an arbitrarily small number of processing elements and max work item size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits, which might or might not prove prohibitive in addressing the largest possible amount of general RAM on typical GPUs. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, potentially such as CPUs. Additionally, Qrack gives the option to normalize its state vector at on-the-fly opportunities, to correct for float rounding error, incurring overhead costs but benefiting the accuracy of the simulation over very long strings of gate applications. (Normalization was off in all benchmarks, but "host code" must switch between these options.) QEngineOCL was also designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available, which might add host overhead. Further, depending on build options, Qrack can be built to be compatible with purely 32-bit systems.
 
 Further Work
 ************

diff --git a/docs/performance/cnot_all.png b/docs/performance/cnot_all.png
diff --git a/docs/performance/cnot_single.png b/docs/performance/cnot_single.png
diff --git a/docs/performance/grovers.png b/docs/performance/grovers.png
diff --git a/docs/performance/qft.png b/docs/performance/qft.png
diff --git a/docs/performance/x_all.png b/docs/performance/x_all.png
diff --git a/docs/performance/x_single.png b/docs/performance/x_single.png
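
The summary statistics named in ``docs/performance.rst`` (the average, the quartile boundaries, and the "exponent-mean-log" used for QUnit times) can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and the quartile interpolation method is an assumption, since the document does not say which one was used to produce the graphs.

```python
import math
import statistics


def exponent_mean_log(times):
    """2 raised to the mean of log2 of the trial times (the geometric mean)."""
    if not times or any(t <= 0 for t in times):
        raise ValueError("trial times must be a non-empty list of positive values")
    logs = [math.log2(t) for t in times]
    return 2.0 ** (sum(logs) / len(logs))


def summarize(times):
    """Return the average, quartile boundaries, and exponent-mean-log.

    The "inclusive" quantile method is an assumption made for this sketch.
    """
    q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    return {
        "mean": statistics.mean(times),
        "quartiles": (q1, median, q3),
        "exponent_mean_log": exponent_mean_log(times),
    }


if __name__ == "__main__":
    # Made-up trial times in seconds, for illustration only.
    print(summarize([1.0, 2.0, 4.0, 8.0, 16.0]))
```

For log-normally distributed times, the exponent-mean-log is less sensitive to a few slow outlier trials than the arithmetic mean, which is consistent with the document's remark that it closely matches the median.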