# pyzmq performance crossover

Sample plots of the performance data generated by running the `python collect.py` script.

PyZMQ's zero-copy implementation has a nontrivial overhead, due to the requirement to notify Python garbage collection when libzmq is done with a message from its IO thread. Performance optimizations over time, along with application and machine circumstances, shift the crossover point below which zero-copy costs more than it saves.

pyzmq 17 introduces `zmq.COPY_THRESHOLD`, a performance-tuning threshold:
messages smaller than this threshold are always copied, even if sent with `copy=False`.
Based on these experiments,
the default value for `zmq.COPY_THRESHOLD` in pyzmq 17.0 is 64kB,
which seems to be a common crossover point.

In general, it is recommended to only use zero-copy for 'large' messages (at least 10s-100s of kB) because the bookkeeping overhead is significantly greater than small `memcpy` calls.
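The size-based rule can be sketched as an explicit wrapper around `send`. This is only an illustration of the idea, not pyzmq's internal implementation: `send_frame`, `DEFAULT_THRESHOLD`, and the `RecordingSocket` stand-in are hypothetical names, and the handling of a frame exactly at the threshold is an assumption. The only pyzmq API relied on is the `copy` keyword of `Socket.send`.

```python
DEFAULT_THRESHOLD = 64 * 1024  # 64 kB, the default described above


class RecordingSocket:
    """Hypothetical stand-in for a zmq.Socket that records send() arguments."""

    def __init__(self):
        self.calls = []

    def send(self, data, flags=0, copy=True, track=False):
        # A real socket would hand the frame to libzmq here.
        self.calls.append((len(data), copy))


def send_frame(sock, data, threshold=DEFAULT_THRESHOLD):
    """Send data, copying small frames and using zero-copy for large ones."""
    # Assumption: a frame exactly at the threshold is sent zero-copy.
    copy = len(data) < threshold
    sock.send(data, copy=copy)
    return copy


sock = RecordingSocket()
send_frame(sock, b"x" * 100)           # small frame: copied
send_frame(sock, b"x" * (512 * 1024))  # large frame: zero-copy
print(sock.calls)
```

With a real socket, the same function could be called with a `zmq.Socket` in place of the stand-in. Since pyzmq 17 applies a threshold of this kind itself, a wrapper like this is mainly useful for reasoning about the behavior.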

In [1]:
import pickle

import altair as alt
import pandas as pd


def crossover(data, column, ylabel="msgs/sec"):
    """Plot the crossover for copy=True|False"""
    return (
        alt.Chart(data)
        .mark_point()
        .encode(
            color="copy",
            x=alt.X("size", title="size (B)").scale(type="log"),
            y=alt.Y(column, title=ylabel).scale(type="log"),
        )
    )


def relative(data, column, yscale="linear"):
    """Plot a normalized value showing relative performance"""
    copy_mean = data[data["copy"]].groupby("size")[column].mean()
    no_copy = data[~data["copy"]]
    reference = copy_mean[no_copy["size"]]
    return (
        alt.Chart(
            pd.DataFrame(
                {
                    "size": no_copy["size"],
                    "no-copy speedup": no_copy[column] / reference.array,
                }
            )
        )
        .mark_point()
        .encode(
            x=alt.X("size", title="size (B)").scale(type="log"),
            y=alt.Y("no-copy speedup", title="").scale(type=yscale),
        )
    )

## Throughput

Throughput tests measure sending messages on a PUSH-PULL pair as fast as possible. These numbers count the time from first `recv` to the last.
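A minimal sketch of how such a measurement might be taken on the receiving side is shown below. It is an assumption about the shape of the test, not a copy of `collect.py`: `measure_throughput` is a hypothetical helper, `recv` is any callable that blocks until one message arrives (for example a PULL socket's `recv` method), and the injectable `clock` exists only to make the function easy to check.

```python
import time


def measure_throughput(recv, count, clock=time.perf_counter):
    """Receive `count` messages and return msgs/sec, timed from first recv to last."""
    recv()  # the first message marks the start of the timed window
    start = clock()
    for _ in range(count - 1):
        recv()
    elapsed = clock() - start
    return count / elapsed


# Usage with a fake receiver and a fake clock that reports 2 seconds elapsed.
ticks = iter([0.0, 2.0])
received = []
rate = measure_throughput(lambda: received.append(1), 10, clock=lambda: next(ticks))
print(f"{rate:.1f} msgs/sec")
```

Starting the clock only after the first message arrives keeps connection setup out of the measured interval.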

In [2]:
with open("thr.pickle", "rb") as f:
    thr = pickle.load(f)

In [3]:
thr.head()

| | size | count | copy | poll | transport | sends | throughput |
|---|---|---|---|---|---|---|---|
| 0 | 100 | 655360 | True | False | ipc | 1618701.0 | 412531.609473 |
| 1 | 100 | 1310720 | True | False | ipc | 1636403.0 | 415309.020795 |
| 2 | 100 | 262144 | False | False | ipc | 178122.5 | 178051.290985 |
| 3 | 100 | 524288 | False | False | ipc | 181795.7 | 181758.942987 |
| 4 | 215 | 524288 | True | False | ipc | 1599087.0 | 408223.469186 |


Plot the throughput performance vs msg size for copy/no-copy.
This should show us a crossover point where zero-copy starts to outperform copying.

In [4]:
chart = crossover(thr, "throughput")
chart.title = "Throughput"
chart

Compare the maximum throughput for small messages:

In [5]:
zero_copy_max = thr.where(~thr["copy"]).throughput.max()
copy_max = thr.where(thr["copy"]).throughput.max()
print(f"zero-copy max msgs/sec: ~{zero_copy_max:.1e}")
print(f"     copy max msgs/sec: ~{copy_max:.1e}")

zero-copy max msgs/sec: ~2.1e+05
     copy max msgs/sec: ~4.2e+05


So that's a ~2x penalty when sending 100B messages.
It's still ~200k msgs/sec, which isn't catastrophic,
but if you want to send small messages as fast as possible,
you can get closer to 250-500k msgs/sec if you skip the zero-copy logic.

We can see the relative gains of zero-copy by plotting zero-copy performance
normalized to message-copying performance:

In [6]:
chart = relative(thr, "throughput")
chart.title = "Zero-copy Throughput (relative)"
chart


So that's a ~2x penalty for using zero-copy on 100B messages
and a ~2x win for using zero-copy in ~500kB messages.
The crossover where the cost balances the benefit is in the vicinity of 64kB.
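The normalization behind the relative plot can be checked by hand with plain Python. The sketch below uses only the four 100 B rows visible in the `thr.head()` output, so it says nothing about larger sizes; `no_copy_speedup` is a hypothetical helper that mirrors the idea of `relative` (divide each zero-copy measurement by the mean copying measurement at the same size) without pandas. A ratio below 1 means zero-copy is slower.

```python
from statistics import mean

# (size in bytes, copy flag, throughput in msgs/sec) taken from thr.head()
rows = [
    (100, True, 412531.609473),
    (100, True, 415309.020795),
    (100, False, 178051.290985),
    (100, False, 181758.942987),
]


def no_copy_speedup(rows):
    """Return (size, ratio) pairs: zero-copy throughput / mean copy throughput."""
    sizes = {size for size, copy, _ in rows if copy}
    copy_mean = {
        size: mean(t for s, c, t in rows if c and s == size) for size in sizes
    }
    return [(s, t / copy_mean[s]) for s, c, t in rows if not c]


speedups = no_copy_speedup(rows)
for size, ratio in speedups:
    print(f"{size} B: {ratio:.2f}x")
```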

This is why pyzmq 17 introduces the `zmq.COPY_THRESHOLD` behavior,
which sets a bound so that `copy=False` can always be used,
and the zero-copy machinery will only be triggered for frames that are larger than this threshold.
The default for `zmq.COPY_THRESHOLD` in pyzmq 17.0 is 64kB,
based on these experiments.

### Send-only throughput

So far, we've only been measuring the time it takes to actually deliver all of those messages (total application throughput).

One of the big wins for zero-copy in pyzmq is that the local `send` action is much less expensive for large messages because there is no `memcpy` in the handoff to zmq.
Plotting only the time it takes to *send* messages shows a much bigger win,
but similar crossover point.

In [None]:
chart = crossover(thr, "sends")
chart.title = "Messages sent/sec"
chart

Scaled plot, showing ratio of zero-copy to copy throughput performance:

In [None]:
chart = relative(thr, "sends", yscale="log")
chart.title = "Zero-copy sends/sec (relative speedup)"
chart

The `socket.send` calls for ~1MB messages are ~20x faster with zero-copy than with copying,
but it's also ~10x *slower* for very small messages.

Putting that in perspective, the penalty for zero-copy is ~10 µs per send:

In [None]:
copy_small = 1e6 / thr[thr["copy"] & (thr["size"] == thr["size"].min())]["sends"].mean()
nocopy = 1e6 / thr[~thr["copy"]]["sends"]
penalty = nocopy - copy_small
print(f"Small copying send  : {copy_small:.2f}µs")
print(f"Small zero-copy send: {nocopy.mean():.2f}µs ± {nocopy.std():.2f}µs")
print(f"Penalty             : [{penalty.min():.2f}µs - {penalty.max():.2f}µs]")

That is a big deal for small sends that take only 2µs, but negligible for 1MB sends, where the `memcpy` can take almost a millisecond:

In [None]:
copy_big = 1e6 / thr[thr["copy"] & (thr["size"] == thr["size"].max())]["sends"].mean()
print(f"Big copying send ({thr['size'].max() / 1e6:.0f} MB): {copy_big:.2f}µs")
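The cells above turn a rate in sends/sec into a time per send by dividing 1e6 by the rate, which gives microseconds. The conversion and the resulting penalty can be sketched as two small functions; `us_per_send` and `zero_copy_penalty_us` are hypothetical names, and the rates in the usage lines are round illustrative numbers, not measurements from this notebook.

```python
def us_per_send(sends_per_sec):
    """Convert a send rate (sends/sec) into microseconds per send."""
    return 1e6 / sends_per_sec


def zero_copy_penalty_us(copy_rate, nocopy_rate):
    """Extra microseconds per send when zero-copy is slower than copying."""
    return us_per_send(nocopy_rate) - us_per_send(copy_rate)


# Illustrative rates only: 500k copying sends/sec vs 100k zero-copy sends/sec.
print(f"copy send     : {us_per_send(500_000):.2f} µs")
print(f"zero-copy send: {us_per_send(100_000):.2f} µs")
print(f"penalty       : {zero_copy_penalty_us(500_000, 100_000):.2f} µs")
```

Expressing the overhead as an absolute time per send, rather than a ratio, makes it easier to compare against the cost of the `memcpy` it replaces.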

## Latency

Latency tests measure REQ-REP request/reply cycles, waiting for a reply before sending the next request.
This more directly measures the cost of sending and receiving a single message,
removing any instance of queuing up multiple sends in the background.

This differs from the throughput test, where many messages are in flight at once.
This is significant because much of the performance cost of zero-copy is in
contention between the garbage collection thread and the main thread.
If garbage collection events fire when the main thread is idle waiting for a message,
this adds almost no extra cost.
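A request/reply latency loop of this kind might look like the sketch below. It is an assumption about the general shape of the test rather than the code in `collect.py`: `measure_latency` is a hypothetical helper, `roundtrip` is any callable that sends one request and blocks until the reply arrives, and the result is reported per roundtrip (halve it for a one-way estimate).

```python
import time


def measure_latency(roundtrip, count, clock=time.perf_counter):
    """Run `count` sequential roundtrips and return the mean time of each in µs."""
    start = clock()
    for _ in range(count):
        roundtrip()  # each call waits for its reply before the next request
    elapsed = clock() - start
    return elapsed / count * 1e6


# Usage with a no-op roundtrip and a fake clock reporting 1 second elapsed.
ticks = iter([0.0, 1.0])
latency = measure_latency(lambda: None, 4, clock=lambda: next(ticks))
print(f"{latency:.0f} µs per roundtrip")
```

With pyzmq, `roundtrip` could be a small function that calls `send(msg, copy=copy)` followed by `recv()` on a REQ socket. Because only one message is in flight at a time, nothing is queued in the background during the loop.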

In [None]:
with open("lat.pickle", "rb") as f:
    lat = pickle.load(f)

In [None]:
chart = crossover(lat, "latency", ylabel="µs")
chart.title = "Latency (µs)"
chart

In [None]:
chart = relative(lat, "latency")
chart.title = "Relative increase in latency zero-copy / copy"
chart

For the latency test, we see that there is much lower overhead to the zero-copy machinery when there are few messages in flight.
This is expected, because much of the performance cost comes from thread contention when the gc thread is working hard to keep up with the freeing of messages that zmq is done with.

The result is a much lower penalty for zero-copy of small messages.