Updated benchmark
- Benchmark now performed on a recent Linux distro to facilitate
  reproducibility

- Added Cython statistics
wjakob committed Apr 21, 2023
1 parent de7a29a commit 834cf36
Showing 9 changed files with 2,828 additions and 1,572 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -24,9 +24,11 @@ compile in a shorter amount of time, produce smaller binaries, and have better
runtime performance.

More concretely,
[benchmarks](https://nanobind.readthedocs.io/en/latest/benchmark.html) show
**~2-3× faster** compile time, **~3× smaller** binaries, and up to **~8×
lower** runtime overheads compared to pybind11.
[benchmarks](https://nanobind.readthedocs.io/en/latest/benchmark.html) show up
to **~4× faster** compile time, **~5× smaller** binaries, and **~8× lower**
runtime overheads compared to pybind11. nanobind also outperforms Cython in
important metrics (**3-12×** binary size reduction, **1.6-4×** compilation time
reduction, similar runtime performance).

## Documentation

85 changes: 55 additions & 30 deletions docs/benchmark.rst
@@ -3,28 +3,40 @@
Benchmarks
==========

.. note::
.. note::

**TL;DR**: nanobind bindings compile **~2-3× faster**, produce **~3×
smaller** binaries, and have up to **~8× lower** runtime overheads compared
to pybind11.
**TL;DR**: nanobind bindings compile up to **~4× faster** and produce **~5×
smaller** binaries with **~8× lower** runtime overheads compared to
pybind11.

The following experiments analyze the performance of a large
function-heavy (``func``) and class-heavy (``class``) binding microbenchmark
compiled using `Boost.Python <https://github.com/boostorg/python>`_,
`pybind11 <https://github.com/pybind/pybind11>`_, and nanobind in both
``debug`` and size-optimized (``opt``) modes. A comparison with `cppyy
<https://cppyy.readthedocs.io/en/latest/>`_, which uses dynamic compilation,
is also shown later. Details on the experimental setup can be found
:ref:`below <benchmark_details>`.
nanobind also outperforms Cython in important metrics (**3-12×** binary size
reduction, **1.6-4×** compilation time reduction, similar runtime performance).

The following experiments analyze the performance of a large function-heavy
(``func``) and class-heavy (``class``) binding microbenchmark compiled using
`Boost.Python <https://github.com/boostorg/python>`__, `Cython
<https://cython.org>`__, `pybind11 <https://github.com/pybind/pybind11>`__, and nanobind. The
``pybind11 + smart_holder`` results below refer to a `special branch
<https://github.com/pybind/pybind11/tree/smart_holder>`__ that addresses
long-standing issues related to holder types in pybind11.

Each experiment is shown twice: light gray ``[debug]`` columns provide data for
a debug build, and ``[opt]`` shows a size-optimized build that is representative
of a deployment scenario. The former is included to show that nanobind
performance is also good during a typical development workflow.

A comparison with `cppyy <https://cppyy.readthedocs.io/en/latest/>`_, which
uses dynamic compilation, is also shown later. Details on the experimental
setup can be found :ref:`below <benchmark_details>`.

Compilation time
----------------

The first plot contrasts the compilation time, where “*number* ×”
annotations denote the amount of time spent relative to nanobind. As
shown below, nanobind achieves a consistent ~\ **2-3× improvement**
compared to pybind11.
shown below, nanobind achieves a ~\ **2.3-4.0× improvement**
compared to pybind11 and a **1.6-4.4× improvement** compared to Cython.
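
The "*number* ×" annotations are ratios against the nanobind measurement. A minimal sketch of that calculation follows, assuming the measurements are available as a plain mapping from tool name to seconds. The helper names ``relative_factors`` and ``format_factor`` are hypothetical, and the numbers are placeholders for illustration, not benchmark results.

```python
def relative_factors(measurements, baseline="nanobind"):
    """Return each entry's value divided by the baseline's value."""
    base = measurements[baseline]
    if base <= 0:
        raise ValueError("baseline measurement must be positive")
    return {name: value / base for name, value in measurements.items()}


def format_factor(factor):
    """Format a ratio in the style of the plot annotations."""
    return f"{factor:.1f}×"


# Placeholder numbers for illustration only; these are NOT measured results.
example_seconds = {"nanobind": 10.0, "other_tool": 25.0}
factors = relative_factors(example_seconds)
print({name: format_factor(f) for name, f in factors.items()})
```

The baseline itself always maps to a factor of 1.0, so a value above 1 means the tool took more time than nanobind.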


.. image:: images/times.svg
:width: 800
@@ -33,9 +45,11 @@ compared to pybind11.
Binary size
-----------

nanobind also greatly reduces the binary size of the compiled bindings.
There is a roughly **3× improvement** compared to pybind11 and a **8-9×
improvement** compared to Boost.Python (both with size optimizations).
The extremely large size of generated binaries has been a persistent problem of
many prior binding libraries. nanobind significantly improves this metric in
size-optimized builds. There is a ~\ **11× improvement** compared to
Boost.Python, a **3-5× improvement** compared to pybind11, and a **3-12×
improvement** compared to Cython.

.. image:: images/sizes.svg
:width: 800
@@ -45,10 +59,12 @@ Performance
-----------

The last experiment compares the runtime performance overheads by calling a
bound function many times in a loop. Here, it is also interesting to compare
against `cppyy <https://cppyy.readthedocs.io/en/latest/>`__ (gray bar) and a
pure Python implementation that runs bytecode without binding overheads
(hatched red bar).
bound function many times in a loop. Here, it is also interesting to
additionally compare against `cppyy
<https://cppyy.readthedocs.io/en/latest/>`__ (green bar) and a pure Python
implementation that runs bytecode without binding overheads (hatched gray bar).
The ``smart_holder`` branch of pybind11 is not explicitly listed since its
runtime performance matches the base version.

.. image:: images/perf.svg
:width: 850
@@ -60,17 +76,22 @@ functions benchmarked here don’t perform CPU-intensive work, so this
mainly measures the overheads of performing a function call,
boxing/unboxing arguments and return values, etc.

The difference to pybind11 is **significant**: a ~\ **2× improvement**
The difference to pybind11 is **significant**: a ~\ **2.4× improvement**
for simple functions, and an **~8× improvement** when classes are being
passed around. Complexities in pybind11 related to overload
resolution, multiple inheritance, and holders are the main reasons for
this difference. Those features were either simplified or completely
removed in nanobind.

Finally, there is a **~1.4× improvement** in both experiments compared
to cppyy (please ignore the two ``[debug]`` columns—I did not feel
comfortable adjusting the JIT compilation flags; all cppyy bindings
are therefore optimized.)
The runtime performance of Cython and nanobind is similar (Cython leads in one
experiment and trails in another one). Cython generates specialized binding
code for every function and class, which is highly redundant (long compile
times, large binaries) but can also be beneficial for performance.

Finally, there is a **~1.6-2× improvement** in both experiments compared to
cppyy (please ignore the two ``[debug]`` columns—I did not feel comfortable
adjusting the JIT compilation flags; all cppyy bindings are therefore
optimized.)

Discussion
----------
@@ -140,10 +161,14 @@ up in ``struct``\ s with bindings.
.def(py::init<uint16_t, int64_t, int32_t, uint64_t, uint32_t, float>())
.def("sum", &Struct50::sum);
Each benchmark is compiled in debug mode (``debug``) and with optimizations
(``opt``) that minimize size (i.e., ``-Os``) and run on Python 3.9.10.
Compilation is done by AppleClang using consistent flags for all three binding
tools.
The code to generate the plots shown above is available `here
<https://github.com/wjakob/nanobind/blob/master/docs/microbenchmark.ipynb>`_.

Each test was compiled in debug mode (``debug``) and with optimizations
(``opt``) that minimize size (i.e., ``-Os``). Benchmarking was performed on an
AMD Ryzen 9 7950X workstation running Ubuntu 22.04.2 LTS. CPU boost was
disabled, and all core clock frequencies were pinned. Reported timings are the
median of five runs. Compilation used clang++ 15.0.7 with consistent flags for
all experiments. The relevant software versions are Python 3.10.6, cppyy
1.12.13, Cython 0.29.28, and nanobind 1.1.1.
16 changes: 16 additions & 0 deletions docs/cppyy.h
@@ -0,0 +1,16 @@
float test_0000(uint16_t a, int32_t b, uint32_t c, int64_t d, uint64_t e,
float f) {
return a + b + c + d + e + f;
}

struct Struct0 {
uint16_t a;
int32_t b;
uint32_t c;
int64_t d;
uint64_t e;
float f;
Struct0(uint16_t a, int32_t b, uint32_t c, int64_t d, uint64_t e, float f)
: a(a), b(b), c(c), d(d), e(e), f(f) {}
float sum() const { return a + b + c + d + e + f; }
};
