Solutions for the leetcode problem Count submatrices with Top-Left Element and Sum Less Than k in Rust and C++ are compared in performance using the Criterion benchmark and the Google benchmark framework.
Obviously one needs Rust available installed by the recommended rustup method.
Install cmake and google-benchmark by HomeBrew:
brew install cmake google-benchmark
Note that if you want to work with a differenct c compiler than clang delivered with Xcode, you need to compile a version for that other compiler from source. The homebrew version is for clang.
cmake should be available by your Linux distribution package manager.
On Linux, getting access to google-benchmark is a bit more complicated, as the various distributions have that library available under different names and in different qualities, if at all. E.g. on Ubuntu 23.10 one can get a Debug version of the benchmark only, which somewhat render that package useless. So it is better to install google-benchmark from source.
Source rep for google-benchmark is here. Follow the instructions to build and install it in a release version. You will need cmake as well here.
There is partially support to run this on Windows, however, the cmake part is broken on my (arm64) Windows machine, with cxx/corrosion not figuring out which architecture to build and google benchmark not available as a packege. cargo bench will work, but running cmake won't. So one is able to run the Criterion benchmark, but not the google benchmark. Feel free to fix that.
The repository contains a Criterion benchmark to measure the runtime of several functions, which all do the same on the same input (calculation numbers of submatrices):
- mine/cs
consumes an 500x500 Vec<Vec<i32>>
field with random integers and a sufficiently
large submatrix sum limit to finde a substantial amount of submatrices.
- mine/cs1000x1000
same algorithm consuming a 1000x1000 field.
- mine/ref
same implementation of the algorithm, same input, but input is passed as reference. Appears to be considerably faster.
- mine/fs
a variant of the algorithm producing a list of submatrices containing their sum and the x, y position of the lower right end corner of the submatrix.
examples/main contains an example program visualising the output as a png image file. Run it by cargo run --example main --release
.
- leetcode/cpp
the fastest c++ solution, called from rust with rust collection types
- leetcode/cpp_std
the fastest c++ solution, called from rust with c++ standard collection types, this is as fast as mine/ref
- leetcode/cs
the fastest solution for computing submatrices I found on leetcode, consumes its input
- leetcode/cs_unchecked
an unsafe version of leetcode/cs that uses unchecked indexed vector access, much faster than leetcode/cs
- leetcode/cs_raw_ptr
unsafe as leetcode/cs_unchecked, but using raw ptr access. Surprisingly no gain in performance compared to leetcode/cs.
All functions are measured with exactly the same input. Run the benchmarks with
cargo bench
Individual benchmark can be run in isolation by giving cargo bench their name.
There is a c++ variant mimicing the Rust solution and a benchmark using the Google benchmark library. Build and run it with cmake (but make sure to run the benchmark as a release build).
With the help of cxx and corrosion there is now a direct comparison of the rust and the c++ code. Using an efficient c++ solution, speed of c++ and rust is similar.
In developing this repo, it was found that Criterion produces unreliable
results if the functions tested have a runtime in the range of a few µs or lower.
To get reliable results consistent on all platforms one needs to increase the
dataset input size (assuming that runtime depends on it) to take the execution
time of one function call near or into the range of about 100 µs or higher. This
may as well depend on the machine you are using for testing. But no implementation
has a built-in limit for input dataset size and should work with all sizes, as
long as no i32
sum overflow happens.
It was found as well, that the functions being tested by Criterion require the batch input size
to be set for LargeInput
, otherwise the measured runtimes for the various vector consuming
functions will be inaccurate and won't match what is measured in production.
Further lessons learned: calling c++ from Rust with rust collection types is much slower than calling c++ with its own std collection types. But CxxVector may not contain CxxVectors as elements, thus the c++ representation of a grid has to be treated like a opqaue type.
patryk27 on Reddit for helping me figuring out misleading performance measurements on different platforms.
dacozai on leetcode for the contribution of Rust algorithm versions using indices.
Some unknown leetcode user for the fastest c++ solution. I got it from leetcode when asking for a sample of a fast c++ solution. Unfortunately, leetcode does not bother to tell the author. Be aware that the license of this repo may not apply to the c++ solution.
The mine/ref solution and why it is the fastest is explained in detail on leetcode.