Skip to content

Latest commit

 

History

History
143 lines (94 loc) · 5.63 KB

README.md

File metadata and controls

143 lines (94 loc) · 5.63 KB

Demo for rust/c++ interop and benchmarking using cxx and corrosion

Solutions for the leetcode problem Count submatrices with Top-Left Element and Sum Less Than k in Rust and C++ are compared in performance using the Criterion benchmark and the Google benchmark framework.

Installation

Obviously one needs Rust available installed by the recommended rustup method.

macOS

Install cmake and google-benchmark by HomeBrew:

brew install cmake google-benchmark

Note that if you want to work with a differenct c compiler than clang delivered with Xcode, you need to compile a version for that other compiler from source. The homebrew version is for clang.

Linux

cmake should be available by your Linux distribution package manager.

On Linux, getting access to google-benchmark is a bit more complicated, as the various distributions have that library available under different names and in different qualities, if at all. E.g. on Ubuntu 23.10 one can get a Debug version of the benchmark only, which somewhat render that package useless. So it is better to install google-benchmark from source.

Source rep for google-benchmark is here. Follow the instructions to build and install it in a release version. You will need cmake as well here.

Windows

There is partially support to run this on Windows, however, the cmake part is broken on my (arm64) Windows machine, with cxx/corrosion not figuring out which architecture to build and google benchmark not available as a packege. cargo bench will work, but running cmake won't. So one is able to run the Criterion benchmark, but not the google benchmark. Feel free to fix that.

Measurement

The repository contains a Criterion benchmark to measure the runtime of several functions, which all do the same on the same input (calculation numbers of submatrices):

  • mine/cs

consumes an 500x500 Vec<Vec<i32>> field with random integers and a sufficiently large submatrix sum limit to finde a substantial amount of submatrices.

  • mine/cs1000x1000

same algorithm consuming a 1000x1000 field.

  • mine/ref

same implementation of the algorithm, same input, but input is passed as reference. Appears to be considerably faster.

  • mine/fs

a variant of the algorithm producing a list of submatrices containing their sum and the x, y position of the lower right end corner of the submatrix.

examples/main contains an example program visualising the output as a png image file. Run it by cargo run --example main --release.

  • leetcode/cpp

the fastest c++ solution, called from rust with rust collection types

  • leetcode/cpp_std

the fastest c++ solution, called from rust with c++ standard collection types, this is as fast as mine/ref

  • leetcode/cs

the fastest solution for computing submatrices I found on leetcode, consumes its input

  • leetcode/cs_unchecked

an unsafe version of leetcode/cs that uses unchecked indexed vector access, much faster than leetcode/cs

  • leetcode/cs_raw_ptr

unsafe as leetcode/cs_unchecked, but using raw ptr access. Surprisingly no gain in performance compared to leetcode/cs.

All functions are measured with exactly the same input. Run the benchmarks with

cargo bench

Individual benchmark can be run in isolation by giving cargo bench their name.

C++ benchmark

There is a c++ variant mimicing the Rust solution and a benchmark using the Google benchmark library. Build and run it with cmake (but make sure to run the benchmark as a release build).

With the help of cxx and corrosion there is now a direct comparison of the rust and the c++ code. Using an efficient c++ solution, speed of c++ and rust is similar.

Note

In developing this repo, it was found that Criterion produces unreliable results if the functions tested have a runtime in the range of a few µs or lower. To get reliable results consistent on all platforms one needs to increase the dataset input size (assuming that runtime depends on it) to take the execution time of one function call near or into the range of about 100 µs or higher. This may as well depend on the machine you are using for testing. But no implementation has a built-in limit for input dataset size and should work with all sizes, as long as no i32 sum overflow happens.

It was found as well, that the functions being tested by Criterion require the batch input size to be set for LargeInput, otherwise the measured runtimes for the various vector consuming functions will be inaccurate and won't match what is measured in production.

Further lessons learned: calling c++ from Rust with rust collection types is much slower than calling c++ with its own std collection types. But CxxVector may not contain CxxVectors as elements, thus the c++ representation of a grid has to be treated like a opqaue type.

Acknowledgements

patryk27 on Reddit for helping me figuring out misleading performance measurements on different platforms.

dacozai on leetcode for the contribution of Rust algorithm versions using indices.

Some unknown leetcode user for the fastest c++ solution. I got it from leetcode when asking for a sample of a fast c++ solution. Unfortunately, leetcode does not bother to tell the author. Be aware that the license of this repo may not apply to the c++ solution.

Further reading

The mine/ref solution and why it is the fastest is explained in detail on leetcode.