2019-05-19 benchmarks

vm6502q · May 20, 2019 · 1d4c25d
1 parent a3cd19e
commit 1d4c25d

Showing 9 changed files with 15 additions and 13 deletions.
diff --git a/docs/api/qinterface.rst b/docs/api/qinterface.rst
@@ -51,11 +51,11 @@ State Manipulation Methods
 
 .. doxygenfunction:: Qrack::QInterface::SetQuantumState
 
-.. doxygenfunction:: Qrack::QInterface::Cohere(QInterfacePtr)
-.. doxygenfunction:: Qrack::QInterface::Cohere(std::vector<QInterfacePtr>)
+.. doxygenfunction:: Qrack::QInterface::Compose(QInterfacePtr)
+.. doxygenfunction:: Qrack::QInterface::Compose(std::vector<QInterfacePtr>)
 
-.. doxygenfunction:: Qrack::QInterface::Decohere
-.. doxygenfunction:: Qrack::QInterface::TryDecohere
+.. doxygenfunction:: Qrack::QInterface::Decompose
+.. doxygenfunction:: Qrack::QInterface::TryDecompose
 
 .. doxygenfunction:: Qrack::QInterface::Dispose
 
@@ -67,6 +67,8 @@ State Manipulation Methods
 
 .. doxygenfunction:: Qrack::QInterface::ProbMask
 
+.. doxygenfunction:: Qrack::QInterface::GetProbs
+
 .. doxygenfunction:: Qrack::QInterface::Swap(bitLenInt, bitLenInt)
 
 .. doxygenfunction:: Qrack::QInterface::Swap(bitLenInt, bitLenInt, bitLenInt)

diff --git a/docs/index.rst b/docs/index.rst
@@ -61,7 +61,6 @@ Daniel Strano would like to specifically note that Benn Bollay is almost entirel
     api/qenginecpu
     api/qengineocl
     api/qunit
-    api/qunitmulti
     api/6502
 
 .. The #http:// is a hack to get around Sphinx's re parser for links,
@@ -82,6 +81,5 @@ Daniel Strano would like to specifically note that Benn Bollay is almost entirel
     QEngineCPU </en/latest/_static/doxygen/classQrack_1_1QEngineCPU.html#http://>
     QEngineOCL </en/latest/_static/doxygen/classQrack_1_1QEngineOCL.html#http://>
     QUnit </en/latest/_static/doxygen/classQrack_1_1QUnit.html#http://>
-    QUnitMulti </en/latest/_static/doxygen/classQrack_1_1QUnitMulti.html#http://>
     Complex16Simd </en/latest/_static/doxygen/structQrack_1_1Complex16Simd.html#http://>
 
diff --git a/docs/performance.rst b/docs/performance.rst
@@ -75,7 +75,11 @@ Disclaimers
 Method
 ******
 
-100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. The benchmarking code is available at `https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp <https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp>`_, in which qubit count can be set and tests can be commented out to reduce to the relevant subset. CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. An AWS p3.2xlarge running Ubuntu Server 18.04LTS was used for GPU benchmarks. An Alienware 17 R5 with an Intel(R) Core(TM) i7-8750H running Ubuntu 18.04LTS was used for Qrack CPU benchmarks. These test results were collected starting on the night of December 19th, 2018 into the morning of December 20th. Additional quantum Fourier transform ("QFT") benchmarks were collected on the night of January 5th, 2019 into the morning of January 6th. The average and quartile boundary values of each set of 100 were recorded and graphed. Grover's search to invert a black box subroutine, or "oracle," was similarly implemented for trials between 4 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover's algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm's period, being :math:`floor\left[\frac{\pi}{4asin\left(1/\sqrt{2^N}\right)}\right]` iterations for :math:`N` qubits.
+100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. The benchmarking code is available at `https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp <https://github.com/vm6502q/qrack/blob/master/test/benchmarks.cpp>`_, in which qubit count can be set and tests can be commented out to reduce to the relevant subset.
+
+CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. An AWS p3.2xlarge running Ubuntu Server 18.04LTS was used for GPU benchmarks. An Alienware 17 R5 with an Intel(R) Core(TM) i7-8750H running Ubuntu 18.04LTS was used for Qrack CPU benchmarks. Updated Qrack GPU benchmarks were collected May 19th, 2019. These results were combined with an earlier set of test results collected starting on the night of December 19th, 2018 into the morning of December 20th, and additional quantum Fourier transform ("QFT") benchmarks were collected on the night of January 5th, 2019 into the morning of January 6th. To reduce cost, code revisions on Github were compared between December 19th, 2018 and May 19th, 2019, for QCGPU and ProjectQ, and lower qubit counts were run to determine that there is no significant change in the set of benchmarks from December and January. (If this is in error, we took care to try to report fair tests, within cost limitations, but please let us know.)
+
+The average and quartile boundary values of each set of 100 were recorded and graphed. Grover's search to invert a black box subroutine, or "oracle," was similarly implemented for trials between 4 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover's algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm's period, being :math:`floor\left[\frac{\pi}{4asin\left(1/\sqrt{2^N}\right)}\right]` iterations for :math:`N` qubits.
 
 A quantum Fourier transform was run for 100 trials between 4 and 28 qubits on the GPU engine types. ProjectQ and QCGPU were also benchmarked for this test, on an AWS p3.2xlarge running Ubuntu Server 18.04LTS, with the benchmarking code provided by QCGPU, accessed from its Github repository on 12/1/18, in Python 3. The benchmarking script available from QCGPU iterates at random between the engine options, but it was slightly modified to alternate between trials of ProjectQ and QCGPU, to get exactly 100 samples apiece per qubit count.
 
@@ -118,7 +122,7 @@ QEngineCPU took approximately 100 seconds per 1 trial (of 100) for 22 qubits and
 
 QEngineOCL generally outperforms the default simulator "backend" for ProjectQ. However, Qrack has also been wrapped as an optional backend for ProjectQ, and benchmarks for this layering of both projects will follow.
 
-For lower numbers of qubits, QEngineOCL outperforms QCGPU. Both simulators follow a trend that appears to reach a knee of faster exponential growth. The "knee" comes at a lower number of qubits for QEngineOCL than for QCGPU, at about 18 qubits, versus 25 or 24 qubits for QCGPU.
+For lower numbers of qubits, QEngineOCL outperforms QCGPU. Both simulators follow a trend that appears to reach a knee of faster exponential growth. The "knee" comes at a lower number of qubits for QEngineOCL than for QCGPU, at about 19 or 18 qubits, versus 25 or 24 qubits for QCGPU.
 
 QUnit was analyzed based on half its round trip time, for the application of the QFT followed by its inverse. The distribution of its times was log normal for the random input state distributions selected, so the times given are "exponent-mean-log," 2 raised to the mean of the log base 2 of the trial times, for which the mean closely matches the median and the standard deviations are consistently very small. QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. After an operation that should disentangle subsystems, QUnit can optionally try to separate the representations of independent subsystems, recovering a factor of 1/2, for each separated bit, of subsystem RAM and gate application time for further gates. QUnit outperforms all other simulators analyzed, when transforming a permutation basis eigenstate, without attempting to recover additional separability after the inverse transform. It fails to outperform QCGPU on half the round trip, if we attempt to separate QUnit's subsystems on the return trip, but further gate application will usually receive a large boost from this attempt at subsystem separation. A linear superposition of permutation states probably represents a realistic and fairly general set of inputs or outputs for the QFT; QUnit times are very close to QEngineOCL times for a random linear superposition of permutation basis input states.
 
@@ -132,13 +136,11 @@ In the comparative QFT benchmarks, the difference between QCGPU and Qrack in the
 Further Work
 ************
 
-Qrack contains an experimental multiprocessor type, previously "QEngineOCLMulti" based on the algorithms developed in Intel's [QHiPSTER]_, currently replaced in favor of the simpler QUnitMulti type, which dispatches different separable subsystems to different processors. Current and previous generation multiprocessor types fail to outperform the single processor QEngineOCL. We include it in the current release to help the open source community realize a practical multiprocessor implementation in the context of Qrack.
-
-Qrack has been successfully run on multiple processors at once, and even on clusters, but not with practical performance for real application; a good next step is to redesign the multiprocessor engine type(s) to actually outperform the single device engine. Also, CPU "software" implementation parallelism relies on certain potentially expensive standard library functionality, like lambda expressions and parallel "futures," and might still be optimized. Further, there is still opportunity for better explicit qubit subsystem separation in QUnit.
+Qrack previously contained two experimental multiprocessor types, "QEngineOCLMulti" based on the algorithms developed in Intel's [QHiPSTER]_, and the simpler QUnitMulti type, which dispatches different separable subsystems to different processors. These failed to outperform the single processor QEngineOCL. However, as Qrack has added optional support as a simulator for ProjectQ, we have effectively gained access to the quantum network simulator "SimulaQron" by SoftwareQuTech. At least one Qrack user is experimenting with scaling deployments of containers loaded with Qrack, ProjectQ, and SimulaQron as an effective solution for multiprocessor and cluster operations, and the Qrack Team is looking at this and related approaches for this purpose. An asynchronous quantum P2P model, for effective multiprocessor support, should hopefully reduce inter-device communication overhead bottlenecks.
 
-With a new generation of "VPU" processors coming in 2019, (for visual inference,) it might be possible to co-opt VPU capabilities for inference of raw state vector features, such as Schmidt separability, to improve the performance of QUnit. The authors of Qrack have just started looking at this hardware for this purpose.
+With a new generation of "VPU" processors coming in 2019, (for visual inference,) it might be possible to co-opt VPU capabilities for inference of raw state vector features, such as Schmidt separability, to improve the performance of QUnit. The authors of Qrack have started looking at this hardware for this purpose.
 
-We will maintain systematic comparisons to published benchmarks of quantum computer simulation standard libraries, as they arise. Comparative benchmarks will be established for Grover's search, in early 2019.
+We will maintain systematic comparisons to published benchmarks of quantum computer simulation standard libraries, as they arise.
 
 Conclusion
 **********

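Note on the Grover's search iteration count in docs/performance.rst: the Method text describes iterating an optimal number of times for :math:`N` qubits. The sketch below is not taken from the Qrack benchmark code. It assumes the standard textbook estimate with the arcsine unsquared, floor(pi / (4 * asin(1 / sqrt(2^N)))), and the function names are hypothetical. It also evaluates the usual success probability sin^2((2k + 1) * theta) after k iterations as a sanity check.

```python
import math


def grover_iterations(num_qubits: int) -> int:
    """Textbook estimate of the optimal Grover iteration count for a
    single marked item among 2**num_qubits (assumed form, not Qrack code)."""
    theta = math.asin(1.0 / math.sqrt(2 ** num_qubits))
    return math.floor(math.pi / (4.0 * theta))


def success_probability(num_qubits: int, iterations: int) -> float:
    """Probability of measuring the marked item after `iterations` rounds."""
    theta = math.asin(1.0 / math.sqrt(2 ** num_qubits))
    return math.sin((2 * iterations + 1) * theta) ** 2


if __name__ == "__main__":
    # Qubit range used for the Grover's search trials in the Method text.
    for n in range(4, 21):
        k = grover_iterations(n)
        print(n, k, round(success_probability(n, k), 4))
```

With the squared arcsine in the denominator, the count would grow roughly as 2^N rather than sqrt(2^N), which is why the unsquared form is assumed here.
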
diff --git a/docs/performance/cnot_all.png b/docs/performance/cnot_all.png
diff --git a/docs/performance/cnot_single.png b/docs/performance/cnot_single.png
diff --git a/docs/performance/grovers.png b/docs/performance/grovers.png
diff --git a/docs/performance/qft.png b/docs/performance/qft.png
diff --git a/docs/performance/x_all.png b/docs/performance/x_all.png
diff --git a/docs/performance/x_single.png b/docs/performance/x_single.png
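
Note on the QUnit timing statistic in docs/performance.rst: the context text reports "exponent-mean-log" times, described as 2 raised to the mean of the log base 2 of the trial times. A minimal sketch of that statistic follows, assuming a plain list of positive trial times in seconds. The function name is hypothetical and this is not the script used to produce the graphs.

```python
import math


def exponent_mean_log(times):
    """Return 2 raised to the mean of log2 of the trial times.

    This equals the geometric mean, which suits log-normally
    distributed timings such as those described for QUnit.
    """
    if not times:
        raise ValueError("at least one trial time is required")
    if any(t <= 0 for t in times):
        raise ValueError("trial times must be positive")
    mean_log2 = sum(math.log2(t) for t in times) / len(times)
    return 2.0 ** mean_log2


if __name__ == "__main__":
    # Hypothetical trial times in seconds, for illustration only.
    print(exponent_mean_log([1.0, 4.0]))
```
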